# Fine-tuning LLaMA 2 13B
## 1. Introduction
In this notebook, I will fine-tune [LLaMA 2 13B](https://huggingface.co/meta-llama/Llama-2-13b-hf) on the training set prepared in previous notebooks. Much of this follows Philipp Schmid's tutorial at https://www.philschmid.de/sagemaker-llama2-qlora. Instead of adjusting the weights of all 13 billion parameters, I am going to use a technique called ***QLoRA***.

`"QLoRA is an efficient finetuning technique that quantizes a pretrained language model to 4 bits and attaches small 'Low-Rank Adapters' which are fine-tuned. This enables fine-tuning of models with up to 65 billion parameters on a single GPU; despite its efficiency, QLoRA matches the performance of full-precision fine-tuning and achieves state-of-the-art results on language tasks."`
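To put the memory savings in rough numbers, here is a back-of-the-envelope sketch of the weights-only footprint at different precisions. It assumes a round 13 billion parameters and ignores activations, optimizer state, the LoRA adapters, and quantization constants, so the real footprint will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the model weights, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 13e9  # assumption: round figure for LLaMA 2 13B

for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

At 4 bits the weights take roughly 6.5 GB, compared with about 26 GB at 16 bits. That difference is what makes a single 24 GB GPU, such as the one on the `ml.g5.4xlarge` instance used later in this notebook, a realistic target.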

Let's get started!

## 2. Connect to Hugging Face
Access must be granted in order to download the LLaMA models from Meta. To verify access, you must be logged into Hugging Face, so I am going to authenticate with my token.

In [6]:
!pip install -q accelerate peft bitsandbytes transformers trl datasets

In [7]:
import io
import json
import boto3
import sagemaker
import pandas as pd
from botocore.exceptions import ClientError  # raised by get_secret() on failure

In [7]:
def get_secret():
    """Get secret from AWS Secrets Manager"""

    secret_name = "reddit_scraper"
    region_name = "us-east-1"

    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )

    try:
        get_secret_value_response = client.get_secret_value(
            SecretId=secret_name
        )
    except ClientError as e:
        raise e

    secret = get_secret_value_response['SecretString']

    return json.loads(secret)


In [8]:
secret = get_secret()
HUGGINGFACE_TOKEN = secret['huggingface_token']

In [15]:
!huggingface-cli login --token {HUGGINGFACE_TOKEN}

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /home/ec2-user/.cache/huggingface/token
Login successful


In [16]:
# Set sagemaker session
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

sagemaker role arn: arn:aws:iam::513033806411:role/service-role/AmazonSageMaker-ExecutionRole-20210815T111148
sagemaker bucket: sagemaker-us-east-1-513033806411
sagemaker session region: us-east-1


## 3. Load and prepare the dataset
Load the dataset from S3 and format it for fine-tuning.

In [31]:
# Initialize the S3 client
s3_client = boto3.client('s3')

# Specify your bucket name and the object key
bucket_name = 'sagemaker-us-east-1-513033806411'
object_key = 'reddit/funny/data/training_data.csv'

# Get the object from S3
response = s3_client.get_object(Bucket=bucket_name, Key=object_key)

# Read the CSV data into a DataFrame
# The 'Body' key of the response contains the file content
train_df = pd.read_csv(response['Body'])

# Keep only the columns needed for fine-tuning
cols = ['submissionId', 'title', 'image_description','topComment']
train_df = train_df[cols]

display(train_df.head())

| | submissionId | title | image_description | topComment |
|---|---|---|---|---|
| 0 | eqq9j8 | In Minnesota, we like to play a game called "a... | - Description: snowy road with cars driving on... | The first person to drive on a snowy road gets... |
| 1 | grrjug | The most suspicious looking technician at toda... | - Description: arafed male doctor in white lab... | “He’s not supposed to be there” |
| 2 | m8kft5 | Slip given out at one of my local bars if secu... | - Description: someone is giving a note to som... | I was at a bar in Colorado a few years back an... |
| 3 | 11fdi6b | My daughter's school don't dress up for world ... | - Description: there is a small white mask wit... | Spud-Who-Must-Not-Be-Named |
| 4 | a2pc8h | Every year I try to disguise my sister's Chris... | - Description: there is a blue and red present... | I bet it's a new car |


In [44]:
from datasets import Dataset

# Convert the pandas DataFrame to a Hugging Face Dataset
reddit_dataset = Dataset.from_pandas(train_df)

print(reddit_dataset)  # To verify the conversion

Dataset({
    features: ['submissionId', 'title', 'image_description', 'topComment'],
    num_rows: 2601
})


To instruction-tune the model, I will convert each structured example into a task described by an instruction. Here, I define a `format_reddit()` function that takes a sample and returns a single string in that instruction format.

In [54]:
def format_reddit(sample):
    instruction = f"### Instruction:\nRespond to this Reddit post with an award winning top comment."
    context = f"### Reddit Post:\n{sample['title']}\n\n### Image Context:\n{sample['image_description']}" 
    response = f"### Response:\n{sample['topComment']}"
    # join all the parts together
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    return prompt

Here is an example:

In [52]:
from random import randrange

print(format_reddit(reddit_dataset[randrange(len(reddit_dataset))]))

### Instruction:
Respond to this Reddit post with an award winning top comment.

### Reddit Post:
🦸‍♂️ Iron-Deficiency Man

### Image Context:
- Description: A person wearing a homemade Iron Man costume made from cardboard and other household materials
- Text: 
- Celebrities: 

### Response:
Love the creative spirit.  Everyone should make their own costumes.  Celebrate creativity over handing your money to corporations.  This guy nailed it.
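
At inference time, the prompt would presumably use the same template but stop after the `### Response:` header so that the model writes the comment itself. A minimal sketch, assuming a hypothetical helper named `format_reddit_prompt` that is not part of the original notebook:

```python
def format_reddit_prompt(title: str, image_description: str) -> str:
    """Build an inference prompt that mirrors the training template.

    Hypothetical helper: same sections as format_reddit(), but the
    response section is left empty for the model to complete.
    """
    instruction = "### Instruction:\nRespond to this Reddit post with an award winning top comment."
    context = f"### Reddit Post:\n{title}\n\n### Image Context:\n{image_description}"
    response = "### Response:\n"
    return "\n\n".join([instruction, context, response])


prompt = format_reddit_prompt(
    title="Iron-Deficiency Man",
    image_description="- Description: A person wearing a homemade Iron Man costume",
)
print(prompt)
```

Keeping the section headers identical to the training format matters, because the model has only seen responses that follow exactly this layout.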


Now that the samples are formatted, I am going to pack multiple samples into one sequence for more efficient training.

In [39]:
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-13b-hf" # sharded weights
tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=True)
tokenizer.pad_token = tokenizer.eos_token




The helper functions below apply the prompt template to each sample, tokenize the text, and pack the tokens into sequences of a fixed length.

In [40]:
from random import randint
from itertools import chain
from functools import partial


# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_reddit(sample)}{tokenizer.eos_token}"
    return sample

# apply prompt template per sample
reddit_dataset = reddit_dataset.map(template_dataset, remove_columns=list(reddit_dataset.features))

In [56]:
# empty list to save remainder from batches to use in next batch
remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}

def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # get max number of tokens that fit into whole chunks; this is 0 when the
    # batch is shorter than chunk_length, so all tokens carry over as remainder
    batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result


# tokenize and chunk dataset
lm_dataset = reddit_dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(reddit_dataset.features)
).map(
    partial(chunk, chunk_length=2048),
    batched=True,
)

# Print total number of samples
print(f"Total number of samples: {len(lm_dataset)}")


Total number of samples: 212


In [57]:
lm_dataset

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 212
})
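
To make the packing step concrete, here is a toy sketch of the same idea on plain lists of token IDs, independent of the tokenizer and of `Dataset.map`. The function name `pack_sequences` and the tiny `chunk_length` are illustrative assumptions.

```python
from itertools import chain


def pack_sequences(sequences, chunk_length, remainder=None):
    """Concatenate token lists and cut them into fixed-size chunks.

    Returns (chunks, new_remainder); the remainder holds leftover tokens
    that did not fill a whole chunk and can be fed into the next call.
    """
    tokens = list(remainder or []) + list(chain.from_iterable(sequences))
    usable = (len(tokens) // chunk_length) * chunk_length
    chunks = [tokens[i : i + chunk_length] for i in range(0, usable, chunk_length)]
    return chunks, tokens[usable:]


# three "tokenized samples" of different lengths
chunks, leftover = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], chunk_length=4)
print(chunks)    # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(leftover)  # [9]
```

Packing is why 2,601 examples become 212 training sequences: 212 × 2048 = 434,176 tokens, or roughly 167 tokens per example on average.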

In [58]:
# save train_dataset to s3
bucket_name = 'sagemaker-us-east-1-513033806411'
training_input_path = f's3://{bucket_name}/reddit/funny/data/train'
lm_dataset.save_to_disk(training_input_path)

print(f"uploaded training dataset to: {training_input_path}")


uploaded data to:
training dataset to: s3://sagemaker-us-east-1-513033806411/reddit/funny/data/train


## 4. Fine-tune LLaMA 2 13B with QLoRA
Now it's time to apply [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314). QLoRA reduces the memory needed to fine-tune large language models without sacrificing performance. In essence, the method involves:

- Quantizing the pre-trained model to 4 bits and freezing it.
- Attaching small, trainable adapter layers known as LoRA.
- Fine-tuning only these adapter layers, with gradients backpropagated through the frozen, quantized model.
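
To give a sense of how small the adapters are, the sketch below counts the trainable parameters that a LoRA adapter adds to a single weight matrix. For a frozen matrix of shape d × k, LoRA trains two matrices of shape d × r and r × k. The 5120 × 5120 shape and the rank of 64 are illustrative assumptions, not values read from `run_clm.py`.

```python
def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Parameters in a rank-r adapter (B: d x r, A: r x k) for a d x k matrix."""
    return r * (d + k)


d, k, r = 5120, 5120, 64  # assumed projection shape and LoRA rank
full = d * k
adapter = lora_trainable_params(d, k, r)

print(f"frozen weights:  {full:,}")
print(f"adapter weights: {adapter:,}")
print(f"trainable share: {adapter / full:.2%}")
```

Under these assumptions the adapter holds 2.5% as many parameters as the matrix it modifies, and only that small share needs gradients and optimizer state.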

I am going to use `run_clm.py`, the script from Philipp Schmid's tutorial, which uses parameter-efficient fine-tuning (PEFT) to train the model with QLoRA. The script can also merge the LoRA weights back into the base model weights after training, so the fine-tuned model can be used like any standard model without extra code.
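
The merging of LoRA weights described above can be illustrated numerically. With a frozen weight matrix W and adapter matrices B and A, the adapted layer computes W x + (alpha / r) B A x, so folding the adapter into W' = W + (alpha / r) B A yields the same outputs with no extra layers. Below is a dependency-free sketch with tiny made-up matrices; it illustrates the arithmetic only and is not the code that `run_clm.py` runs.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, scale=1.0):
    """Return a + scale * b, element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# made-up values: W is 2x3 (frozen), B is 2x1 and A is 1x3 (rank r = 1)
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
B = [[0.5], [-1.0]]
A = [[2.0, 0.0, 4.0]]
alpha, r = 2.0, 1
x = [[1.0], [2.0], [3.0]]  # column vector input

# adapter kept separate: W x + (alpha / r) * B (A x)
separate = add(matmul(W, x), matmul(B, matmul(A, x)), scale=alpha / r)

# adapter merged into the weights: (W + (alpha / r) * B A) x
W_merged = add(W, matmul(B, A), scale=alpha / r)
merged = matmul(W_merged, x)

print(separate)  # [[21.0], [-29.0]]
print(merged)    # [[21.0], [-29.0]]
```

Because both paths give identical outputs, the merged model can be saved and served as an ordinary set of weights.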

In the code below, I create a Hugging Face Estimator and kick off a SageMaker training job to fine-tune the model and store the artifacts for future use.

In [59]:
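import os
# HF_HOME is the Hugging Face cache directory and must not hold the token;
# huggingface_hub reads the access token from the HF_TOKEN variable instead
os.environ["HF_TOKEN"] = HUGGINGFACE_TOKEN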

In [65]:
import time
from sagemaker.huggingface import HuggingFace
from huggingface_hub import HfFolder

# define Training Job Name
job_name = f'huggingface-qlora-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

# hyperparameters, which are passed into the training job
hyperparameters ={
  'model_id': model_id,                             # pre-trained model
  'dataset_path': '/opt/ml/input/data/training',    # path where sagemaker will save training dataset
  'epochs': 3,                                      # number of training epochs
  'per_device_train_batch_size': 2,                 # batch size for training
  'lr': 2e-4,                                       # learning rate used during training
  'hf_token': HfFolder.get_token(),                 # huggingface token to access llama 2
  'merge_weights': True,                            # whether to merge LoRA into the model (needs more memory)
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_clm.py',      # train script
    source_dir           = 'scripts',         # directory which includes all the files needed for training
    instance_type        = 'ml.g5.4xlarge',   # instances type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # IAM role used in training job to access AWS resources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.28',            # the transformers version used in the training job
    pytorch_version      = '2.0',             # the PyTorch version used in the training job
    py_version           = 'py310',           # the python version used in the training job
    hyperparameters      =  hyperparameters,  # the hyperparameters passed to the training job
    environment          = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache" }, # set env variable to cache models in /tmp
)

Start the training job with the `.fit()` method, passing the S3 path of the training dataset.

In [66]:
# define a data input dictionary with our uploaded s3 uris
data = {'training': training_input_path}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-qlora-2024-02-17-17-15-24-2024-02-17-17-16-14-114


2024-02-17 17:16:15 Starting - Starting the training job
2024-02-17 17:16:15 Pending - Training job waiting for capacity...
2024-02-17 17:16:34 Pending - Preparing the instances for training......
2024-02-17 17:17:39 Downloading - Downloading input data...
2024-02-17 17:18:14 Downloading - Downloading the training image.....................
2024-02-17 17:21:25 Training - Training image download completed. Training in progress......
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2024-02-17 17:22:27,786 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2024-02-17 17:22:27,805 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-02-17 17:22:27,814 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2024-02-17 17:22:27,816 sagemaker_pytorch_container.training I

## 5. Conclusion
The SageMaker training job completed in 1.5 hours. The `ml.g5.4xlarge` instance I used costs $1.624 per hour for on-demand usage.

As a result, the total cost for training this fine-tuned LLaMA 2 model was only ~$2.44.
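
The cost figure is simple arithmetic: billed hours times the hourly rate. A small sketch using the numbers quoted above (the helper name is hypothetical, and the estimate ignores storage and data transfer charges):

```python
def training_cost(hours: float, hourly_rate_usd: float) -> float:
    """On-demand cost of a training job, ignoring storage and data transfer."""
    return hours * hourly_rate_usd


cost = training_cost(hours=1.5, hourly_rate_usd=1.624)
print(f"${cost:.2f}")  # $2.44
```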

Now that the fine-tuning is complete, the model is ready for deployment!