In [1]:
!pip install transformers
!pip install datasets[s3]
!pip install sagemaker wandb --upgrade

Collecting s3fs (from datasets[s3])
  Downloading s3fs-2024.10.0-py3-none-any.whl.metadata (1.7 kB)
INFO: pip is looking at multiple versions of s3fs to determine which version is compatible with other requirements. This could take a while.
  Using cached s3fs-2024.9.0-py3-none-any.whl.metadata (1.6 kB)
  Using cached s3fs-2024.6.1-py3-none-any.whl.metadata (1.6 kB)
  Using cached s3fs-2024.6.0-py3-none-any.whl.metadata (1.6 kB)
  Using cached s3fs-2024.5.0-py3-none-any.whl.metadata (1.6 kB)
  Using cached s3fs-2024.3.1-py3-none-any.whl.metadata (1.6 kB)
  Using cached s3fs-2024.3.0-py3-none-any.whl.metadata (1.6 kB)
  Using cached s3fs-2024.2.0-py3-none-any.whl.metadata (1.6 kB)
INFO: pip is still looking at multiple versions of s3fs to determine which version is compatible with other requirements. This could take a while.
  Using cached s3fs-2023.12.2-py3-none-any.whl.metadata (1.6 kB)
  Using cached s3fs-2023.12.1-py3-none-any.whl.metadata (1.6 kB)
  Using cached s3fs-2023.10.0-py3-

In [2]:
!huggingface-cli login --token $HF_TOKEN

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /home/sagemaker-user/.cache/huggingface/token
Login successful


If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker.


In [3]:
import sagemaker
import boto3
sess = sagemaker.Session()
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
sagemaker role arn: arn:aws:iam::024863509636:role/service-role/AmazonSageMaker-ExecutionRole-20210323T080862
sagemaker bucket: sagemaker-us-east-1-024863509636
sagemaker session region: us-east-1


## Load and prepare the dataset

We will use the Jeopardy dataset for training. Each record includes the air date, category, question, answer, dollar value, round, and show number. Here is an example:

```python

{
    'category': 'TALK TV', 
    'air_date': '2002-11-13', 
    'question': '\'She\'s the talk show host mentioned in the Offspring song "Pretty Fly for a White Guy"\'', 
    'value': 2000, 
    'answer': 'Ricki Lake', 
    'round': 'Double Jeopardy!', 
    'show_number': 4188
}

```

To load the dataset, we use the `load_dataset()` method from the 🤗 Datasets library.

In [4]:
from datasets import load_dataset
from random import randrange

# Load dataset from the hub
dataset = load_dataset("jeopardy", split="train")

print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])

Downloading builder script:   0%|          | 0.00/4.19k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/6.62k [00:00<?, ?B/s]

dataset size: 216930
{'category': 'LAKES & RIVERS', 'air_date': '2005-10-13', 'question': "'This river irrigates millions of acres of land in Egypt & Sudan'", 'value': 400, 'answer': 'the Nile', 'round': 'Double Jeopardy!', 'show_number': 4849}


To instruction-tune our model, we need to convert our structured examples into a collection of tasks described via instructions. We define a formatting function, `format_jeopardy`, that takes a sample and returns a string in our instruction format.

In [5]:
print(dataset[randrange(len(dataset))])

{'category': 'ARE YOU WELL RED?', 'air_date': '2010-06-16', 'question': '\'"The Custom-House" is an introductory section to this Hawthorne classic\'', 'value': 400, 'answer': 'The Scarlet Letter', 'round': 'Double Jeopardy!', 'show_number': 5943}


In [6]:
def format_jeopardy(sample):
    instruction = f"### Instruction\n{sample['category'].lower()}\n{sample['question'][1:-1]}\n{sample['value']}\n{sample['round'].lower().strip('!')}"
    response = f"### Question\n{sample['answer']}"
    # join all the parts together
    prompt = "\n\n".join([i for i in [instruction, response]])
    return prompt

Let's test our formatting function on a random example.

In [7]:
print(format_jeopardy(dataset[randrange(len(dataset))]))

### Instruction
all "four" you
Slang for a home run in baseball
400
double jeopardy

### Question
Four-bagger
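
To make the string handling inside `format_jeopardy` easier to follow, here is a minimal, self-contained sketch that applies the same steps to the example record shown earlier. The variable names (`question_raw`, `round_raw`, and so on) are ours and exist only for illustration. The raw question text is wrapped in single quotes, which the `[1:-1]` slice removes, and `strip('!')` removes exclamation marks from both ends of the round name.

```python
# Fields copied from the example record shown earlier in this notebook.
category_raw = "TALK TV"
question_raw = "'She's the talk show host mentioned in the Offspring song \"Pretty Fly for a White Guy\"'"
value = 2000
round_raw = "Double Jeopardy!"
answer = "Ricki Lake"

# Same transformations as format_jeopardy, one step at a time.
category = category_raw.lower()             # lowercase the category
question = question_raw[1:-1]               # drop the wrapping single quotes
round_name = round_raw.lower().strip("!")   # lowercase, remove "!" at either end

instruction = f"### Instruction\n{category}\n{question}\n{value}\n{round_name}"
response = f"### Question\n{answer}"
prompt = "\n\n".join([instruction, response])

print(prompt)
```

Note that the slice assumes every question is wrapped in exactly one pair of quotes. A record without them would lose its first and last characters.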


In addition to formatting our samples, we also want to pack multiple samples into one sequence to make training more efficient.

In [8]:
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B" # base model to fine-tune
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


We define some helper functions to pack our samples into sequences of a given length and then tokenize them.

In [9]:
from random import randint
from itertools import chain
from functools import partial

# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_jeopardy(sample)}{tokenizer.eos_token}"
    return sample

# apply prompt template per sample
dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))
# print random sample (randint is inclusive on both ends, so subtract 1 to stay in range)
print(dataset[randint(0, len(dataset) - 1)]["text"])

# empty list to save remainder from batches to use in next batch
remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}

def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # largest multiple of chunk_length that fits in the batch
    # (0 if the batch is shorter than one chunk, so everything goes to the remainder)
    batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result


# tokenize and chunk dataset
lm_dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(dataset.features)
).map(
    partial(chunk, chunk_length=2048),
    batched=True,
)

# Print total number of samples
print(f"Total number of samples: {len(lm_dataset)}")

Map:   0%|          | 0/216930 [00:00<?, ? examples/s]

### Instruction
"court" briefs
Marsupial term for a self-appointed tribunal that parodies existing principles of law
400
jeopardy

### Question
a kangaroo court<|end_of_text|>


Map:   0%|          | 0/216930 [00:00<?, ? examples/s]

Map:   0%|          | 0/216930 [00:00<?, ? examples/s]

Total number of samples: 4620
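
The packing logic is easier to see on toy data. The following is a minimal sketch, not the notebook's `chunk` function: it works on plain lists of integers instead of tokenizer output, and it passes the leftover tokens explicitly instead of using a global variable. The name `pack` and the tiny `chunk_length=4` are assumptions chosen for illustration.

```python
from itertools import chain

def pack(batch, chunk_length, remainder=None):
    """Concatenate token lists, cut fixed-size chunks, and return (chunks, leftover)."""
    # Prepend whatever was left over from the previous batch.
    tokens = list(remainder or []) + list(chain.from_iterable(batch))
    # Largest multiple of chunk_length that fits; 0 if the batch is too short.
    usable = (len(tokens) // chunk_length) * chunk_length
    chunks = [tokens[i : i + chunk_length] for i in range(0, usable, chunk_length)]
    # Tokens that did not fill a whole chunk are carried to the next call.
    return chunks, tokens[usable:]

batch_1 = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]   # three "tokenized samples"
batch_2 = [[10, 11], [12]]

chunks_1, leftover_1 = pack(batch_1, chunk_length=4)
chunks_2, leftover_2 = pack(batch_2, chunk_length=4, remainder=leftover_1)

print(chunks_1, leftover_1)
print(chunks_2, leftover_2)
```

Sample boundaries are ignored on purpose: a chunk may start or end in the middle of a sample, which is why each sample is terminated with the EOS token before packing. With the numbers reported above, 4620 packed sequences of 2048 tokens amount to about 9.46 million tokens, or roughly 44 tokens per formatted example.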


After processing the dataset, we use the [FileSystem integration](https://huggingface.co/docs/datasets/filesystems) of 🤗 Datasets to upload it to S3. We use `sess.default_bucket()`; adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later when we start the training job.

In [10]:
# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/processed/llama/jeopardy_answers/train'
lm_dataset.save_to_disk(training_input_path)

print("uploaded data to:")
print(f"training dataset to: {training_input_path}")

Saving the dataset (0/1 shards):   0%|          | 0/4620 [00:00<?, ? examples/s]

uploaded data to:
training dataset to: s3://sagemaker-us-east-1-024863509636/processed/llama/jeopardy_answers/train


## Fine-Tune LLaMA 3.2 1B with QLoRA on Amazon SageMaker

We are going to use the method from the paper "[QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)" by Tim Dettmers et al. QLoRA is a technique to reduce the memory footprint of large language models during finetuning, without sacrificing performance. The TL;DR of how QLoRA works:

* Quantize the pretrained model to 4 bits and freeze it.
* Attach small, trainable adapter layers (LoRA).
* Finetune only the adapter layers, while using the frozen quantized model for context.
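
To give a feel for why training only the adapters is cheap, here is a rough back-of-the-envelope sketch. LoRA replaces the update to a weight matrix of shape `d_out x d_in` with two low-rank factors of shapes `d_out x rank` and `rank x d_in`. The layer size and rank below are illustrative assumptions, not values read from the model config or from `run_clm.py`.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

# Illustrative values only (assumptions, not taken from the training script).
d_out = d_in = 2048   # size of one square projection matrix
rank = 8              # hypothetical LoRA rank

full = d_out * d_in
lora = lora_trainable_params(d_out, d_in, rank)

print(f"full weight matrix: {full:,} parameters")
print(f"LoRA adapter:       {lora:,} parameters ({lora / full:.2%} of full)")
```

Under these assumptions the adapter holds well under 1% of the parameters of the matrix it adapts, while the frozen base weights are stored in 4-bit precision.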

We prepared a [run_clm.py](./scripts/run_clm.py) script, which implements QLoRA using PEFT to train our model. The script also merges the LoRA weights into the model weights after training, so you can use the result as a normal model without any additional code. The model will be temporarily offloaded to disk if it is too large to fit into memory.

To create a SageMaker training job we need a `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks and manages the infrastructure.
SageMaker starts and manages all the required EC2 instances for us, provides the correct Hugging Face container, uploads the provided scripts, and downloads the data from our S3 bucket into the container at `/opt/ml/input/data`. Then it starts the training job.

### Hardware requirements

We also ran several experiments to determine which instance type can be used for the different model sizes. The following table shows the results: the model, the instance type, the max batch size, and the context length.

| Model        | Instance Type       | Max Batch Size                    | Context Length |
|--------------|---------------------|-----------------------------------|----------------|
| Llama 7B     | `(ml.)g5.4xlarge`   | `3`                               | `2048`         |
| Llama 13B    | `(ml.)g5.4xlarge`   | `2`                               | `2048`         |
| Llama 70B    | `(ml.)p4d.24xlarge` | `1++` (need to test more configs) | `2048`         |
| Llama-3.2-1B | `(ml.)g5.4xlarge`   | `3`                               | `2048`         |


> You can also use `g5.2xlarge` instead of the `g5.4xlarge` instance type, but then it is not possible to use the `merge_weights` parameter, because merging the LoRA weights into the model weights requires the model to fit into memory. You could instead save the adapter weights and merge them after training using [merge_adapter_weights.py](./scripts/merge_adapter_weights.py).

_Note: We plan to extend this list in the future. Feel free to contribute your setup!_

In [11]:
import time
from sagemaker.huggingface import HuggingFace
from huggingface_hub import HfFolder

# define Training Job Name 
job_name = f'huggingface-qlora-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

# hyperparameters, which are passed into the training job
hyperparameters ={
  'model_id': model_id,                             # pre-trained model
  'dataset_path': '/opt/ml/input/data/training',    # path where sagemaker will save training dataset
  'epochs': 3,                                      # number of training epochs
  'per_device_train_batch_size': 3,                 # batch size for training
  'lr': 2e-4,                                       # learning rate used during training
  'hf_token': HfFolder.get_token(),                 # huggingface token to access llama 3.2
  'merge_weights': True,                            # whether to merge LoRA into the model (needs more memory)
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_clm.py',      # train script
    source_dir           = 'scripts',         # directory which includes all the files needed for training
    instance_type        = 'ml.g5.4xlarge',   # instances type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # IAM role used in training job to access AWS resources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.28',            # the transformers version used in the training job
    pytorch_version      = '2.0',             # the pytorch version used in the training job
    py_version           = 'py310',           # the python version used in the training job
    hyperparameters      =  hyperparameters,  # the hyperparameters passed to the training job
    environment          = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache" }, # set env variable to cache models in /tmp
)
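
SageMaker passes each entry of `hyperparameters` to the entry-point script as a `--name value` command-line argument. The following is a hypothetical sketch of how a training script could parse the hyperparameters defined above. It is not the actual contents of `run_clm.py`, and the helper `str_to_bool` and the default values are our assumptions. Boolean values arrive as strings such as `"True"`, so they need explicit conversion.

```python
import argparse

def str_to_bool(value):
    """Interpret strings such as 'True' or 'false' as booleans."""
    return str(value).strip().lower() in {"1", "true", "yes"}

def parse_args(argv=None):
    """Parse the hyperparameters that the estimator forwards as CLI arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str, required=True)
    parser.add_argument("--dataset_path", type=str, default="/opt/ml/input/data/training")
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--per_device_train_batch_size", type=int, default=1)
    parser.add_argument("--lr", type=float, default=2e-4)
    parser.add_argument("--hf_token", type=str, default=None)
    parser.add_argument("--merge_weights", type=str_to_bool, default=True)
    # Ignore any extra arguments the platform may add.
    args, _ = parser.parse_known_args(argv)
    return args

# Simulate the arguments a training job would receive.
args = parse_args([
    "--model_id", "meta-llama/Llama-3.2-1B",
    "--epochs", "3",
    "--per_device_train_batch_size", "3",
    "--lr", "0.0002",
    "--merge_weights", "True",
])
print(args)
```

Using `type=bool` directly would be a mistake here, because any non-empty string, including `"False"`, converts to `True`.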

We can now start our training job with the `.fit()` method, passing our S3 path to the training script.

In [12]:
# define a data input dictionary with our uploaded s3 uris
data = {'training': training_input_path}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=False)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-qlora-2024-10-24-16-15-19-2024-10-24-16-15-23-131


In our example for Llama 3.2 1B, the SageMaker training job took `???? seconds`, which is about `?.? hours`. The ml.g5.4xlarge instance we used costs `$?.?? per hour` for on-demand usage. As a result, the total cost for training our fine-tuned Llama 3.2 1B model was only ~`$??`.
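
Once the job has finished, the cost estimate is simple arithmetic: billable seconds converted to hours, multiplied by the hourly on-demand price. The sketch below shows the calculation only. The inputs are made-up placeholder numbers, not measurements from this training job and not actual AWS pricing, and the function name `training_cost` is ours.

```python
def training_cost(billable_seconds, hourly_rate_usd):
    """Return (hours, cost in USD) for a job billed per second at an hourly rate."""
    hours = billable_seconds / 3600
    return hours, hours * hourly_rate_usd

# Placeholder inputs for illustration only; substitute the values for your own job.
hours, cost = training_cost(billable_seconds=5400, hourly_rate_usd=2.00)
print(f"{hours:.1f} hours, ${cost:.2f}")
```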

In [None]:
huggingface_estimator.model_data