# Efficient Large Language Model Training with LoRA and Hugging Face

In this Amazon SageMaker example, we are going to learn how to apply [Low-Rank Adaptation of Large Language Models (LoRA)](https://arxiv.org/abs/2106.09685) to fine-tune BLOOMZ (the 7-billion-parameter, instruction-tuned version of BLOOM) on a single GPU. We are going to leverage Hugging Face [Transformers](https://huggingface.co/docs/transformers/index), [Accelerate](https://huggingface.co/docs/accelerate/index), and [PEFT](https://github.com/huggingface/peft).

You can read the blog post about how to "Train a Large Language Model on a single Amazon SageMaker GPU with Hugging Face and LoRA" [on the AWS Machine Learning Blog](https://aws.amazon.com/blogs/machine-learning/train-a-large-language-model-on-a-single-amazon-sagemaker-gpu-with-hugging-face-and-lora/).

## Overview

[PEFT](https://github.com/huggingface/peft), or Parameter Efficient Fine-tuning, is a new open-source library from Hugging Face to enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model's parameters. PEFT currently includes techniques for:

- LoRA: [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/pdf/2106.09685.pdf)
- Prefix Tuning: [P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks](https://arxiv.org/pdf/2110.07602.pdf)   
- P-Tuning: [GPT Understands, Too](https://arxiv.org/pdf/2103.10385.pdf)
- Prompt Tuning: [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/pdf/2104.08691.pdf)

This notebook walks you through the process of training and deploying a large language model using Amazon SageMaker. The process includes the following steps: 

1. Preparation and setting up of the training job   
2. Training the model     
3. Deploying the model
4. Testing the model

By the end, we will have a working model that can generate text summaries.  

### Glossary
- **Epoch:** One complete pass through the entire training dataset     
- **Batch size:** Number of training examples utilized in one iteration
- **Learning rate:** A tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function    
- **SageMaker:** A fully managed service that provides developers and data scientists with the ability to build, train, and deploy machine learning (ML) models quickly.
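
To make the relationship between epochs and batch size concrete, here is a minimal sketch that counts optimizer steps. The sample count and batch size are illustrative assumptions, not values from this notebook's training job.

```python
import math


def training_steps(num_samples, batch_size, epochs):
    """Return (steps per epoch, total steps) for a simple training loop."""
    # One step consumes one batch; the last batch of an epoch may be smaller.
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


# Hypothetical values chosen only for illustration
steps_per_epoch, total_steps = training_steps(num_samples=1000, batch_size=8, epochs=3)
print(steps_per_epoch, total_steps)
```

With these assumed values, one epoch is 125 steps and three epochs are 375 steps. The learning rate then scales how far the weights move at each of those steps.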

## 1. Set up the development environment

This has been tested on the default SageMaker Studio Notebook image (Image: "Data Science", Kernel: "Python 3"). Note: this is not the same as the image called "Data Science 3.0".

### Install dependencies

Make sure you've cloned not just this notebook from GitHub but also the `/scripts` folder that comes with it.

In [None]:
%pip install -r "scripts/requirements.txt"

In [None]:
%pip install "transformers==4.26.0" "datasets[s3]==2.9.0" sagemaker py7zr --upgrade --quiet

### Configure IAM

If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).

In [None]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")


## 2. Load and prepare the dataset

We will use the [SAMSum](https://huggingface.co/datasets/samsum) dataset, a collection of about 16k messenger-like conversations with summaries. Conversations were created and written down by linguists fluent in English.

For example:

```python
{
  "id": "13818513",
  "summary": "Amanda baked cookies and will bring Jerry some tomorrow.",
  "dialogue": "Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"
}
```

### Load Dataset


In [None]:
from datasets import load_dataset

# Load the training split of the 'samsum' dataset from HuggingFace Datasets. 
# This library provides a large collection of pre-split datasets.
dataset = load_dataset("samsum", split="train")

print(f"Train dataset size: {len(dataset)}")
# Train dataset size: 14732

### Tokenize Dataset 

To train our model, we need to convert our inputs (text) to token IDs. This is done by [a 🤗 Transformers Tokenizer](https://huggingface.co/learn/nlp-course/chapter6/1?fw=tf).

In [None]:
from transformers import AutoTokenizer

model_id="bigscience/bloomz-7b1"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.model_max_length = 2048 # overwrite wrong value

Before we can start training, we need to preprocess our data. Abstractive summarization is a text-generation task: our model takes a text as input and generates a summary as output. We want to understand how long our inputs and outputs are so that we can batch our data efficiently.

We define a `prompt_template` that we use to construct an instruction prompt, which improves the performance of our model. The `prompt_template` has a “fixed” start and end, with the document in the middle. This means the “fixed” template parts plus the document must not exceed the maximum length of the model.
We preprocess our dataset before training, save it to disk, and then upload it to S3. You could also run this step on your local machine or a CPU instance and upload the result to the [Hugging Face Hub](https://huggingface.co/docs/hub/datasets-overview).
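
The length constraint described above can be sketched as a simple token budget. This is only an illustration: `count_tokens` is a hypothetical stand-in that splits on whitespace (a real check would count the tokenizer's `input_ids`), and `reserved_for_summary` is an assumed allowance, not a value from this notebook.

```python
def count_tokens(text):
    # Hypothetical stand-in for len(tokenizer(text)["input_ids"]);
    # whitespace splitting only approximates real token counts.
    return len(text.split())


def max_document_tokens(template, max_length, reserved_for_summary):
    """Tokens left for the document after the fixed template parts and the summary."""
    fixed_part = template.format(dialogue="", summary="", eos_token="")
    return max_length - count_tokens(fixed_part) - reserved_for_summary


template = "Summarize the chat dialogue:\n{dialogue}\n---\nSummary:\n{summary}{eos_token}"
budget = max_document_tokens(template, max_length=2048, reserved_for_summary=100)
print(f"Approximate budget for the dialogue: {budget} tokens")
```

Any dialogue longer than this budget would have to be truncated or split before it is placed into the template.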

In [None]:
from random import randint
from itertools import chain
from functools import partial

# custom instruct prompt start
prompt_template = f"Summarize the chat dialogue:\n{{dialogue}}\n---\nSummary:\n{{summary}}{{eos_token}}"

# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = prompt_template.format(dialogue=sample["dialogue"],
                                            summary=sample["summary"],
                                            eos_token=tokenizer.eos_token)
    return sample


# apply prompt template per sample
dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))

# randint is inclusive on both ends, so the upper bound must be len(dataset) - 1
print(dataset[randint(0, len(dataset) - 1)]["text"])

# empty list to save remainder from batches to use in next batch
remainder = {"input_ids": [], "attention_mask": []}


def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # number of tokens that fit into full chunks (0 if the batch is shorter than one chunk)
    batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result


# tokenize and chunk dataset
lm_dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(dataset.features)
).map(
    partial(chunk, chunk_length=1536),
    batched=True,
)

# Print total number of samples
print(f"Total number of samples: {len(lm_dataset)}")
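
The packing logic in `chunk` can be easier to follow on a toy input. The sketch below is a simplified re-implementation for illustration only: it works on plain lists of integers instead of tokenizer output, and the `pack` helper is a hypothetical name that does not exist in the notebook.

```python
from itertools import chain


def pack(batches, chunk_length):
    """Concatenate token lists batch by batch and cut them into fixed-size chunks."""
    remainder = []  # tokens left over from the previous batch
    chunks = []
    for batch in batches:
        tokens = remainder + list(chain(*batch))
        # Largest multiple of chunk_length that fits into the available tokens
        usable = (len(tokens) // chunk_length) * chunk_length
        chunks.extend(tokens[i : i + chunk_length] for i in range(0, usable, chunk_length))
        remainder = tokens[usable:]
    return chunks, remainder


# Two toy "batches" of tokenized samples
batches = [[[1, 2, 3], [4, 5]], [[6, 7, 8, 9]]]
chunks, leftover = pack(batches, chunk_length=4)
print(chunks)    # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(leftover)  # [9]
```

Token `5` does not fit into the first chunk, so it is carried over and becomes the start of the second chunk. Tokens that never fill a complete chunk (here `9`) stay in the remainder.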

After processing the dataset, we use the [FileSystem integration](https://huggingface.co/docs/datasets/filesystems) to upload it to S3. We are using `sess.default_bucket()`; adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later in our training script.

In [None]:
# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/processed/samsum-sagemaker/train'
lm_dataset.save_to_disk(training_input_path)

print("uploaded data to:")
print(f"training dataset to: {training_input_path}")

## 3. Fine-Tuning BLOOM with LoRA and BnB int-8 using Amazon SageMaker

In this section, we combine two powerful techniques to fine-tune the BLOOM language model: LoRA and bitsandbytes' (BnB) int-8 quantization. 

- **LoRA** or [Low-Rank Adaptation](https://arxiv.org/abs/2106.09685) is an efficient approach to adapt large language models to downstream applications. 

- **BnB int-8 quantization** is a technique provided by the [bitsandbytes library](https://huggingface.co/blog/hf-bitsandbytes-integration) that allows us to reduce the memory required for our model's weights by approximately 4 times compared with 32-bit floating-point weights.
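
A rough back-of-the-envelope sketch shows why these two techniques help. The matrix size, LoRA rank, and parameter count below are illustrative assumptions, not values read from the training script.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters added by LoRA next to one frozen weight matrix of shape (d_out, d_in)."""
    # LoRA trains B (d_out x rank) and A (rank x d_in) instead of the full matrix
    return rank * (d_out + d_in)


def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory needed to hold the weights only."""
    return num_params * bytes_per_param / 1e9


d = 4096  # assumed size of one square weight matrix
r = 8     # assumed LoRA rank
full = d * d
lora = lora_trainable_params(d, d, r)
print(f"full matrix: {full:,} params, LoRA: {lora:,} params ({lora / full:.2%})")

n_params = 7_000_000_000  # roughly 7 billion parameters
fp32_gb = weight_memory_gb(n_params, 4)  # 4 bytes per 32-bit float
int8_gb = weight_memory_gb(n_params, 1)  # 1 byte per int-8 value
print(f"fp32: {fp32_gb:.0f} GB, int8: {int8_gb:.0f} GB")
```

Under these assumptions, LoRA trains well under 1% of the parameters of a single matrix, and int-8 storage needs about a quarter of the memory of 32-bit weights. Activations, optimizer state, and quantization overhead are not included in this estimate.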

For the fine-tuning process, we'll be using a Python script, [run_clm.py](./scripts/run_clm.py), stored in the `scripts` directory. This script employs PEFT (Parameter-Efficient Fine-tuning) to train our model. To delve into the details of this script and understand the fine-tuning process more deeply, check out this [blog post](https://www.philschmid.de/fine-tune-flan-t5-peft).

The training job in Amazon SageMaker requires a `HuggingFace` Estimator. This Estimator performs various tasks like managing the infrastructure, spinning up the required EC2 instances, providing the appropriate HuggingFace container, uploading the necessary scripts, and downloading the dataset from our S3 bucket into the container at `/opt/ml/input/data`.

### Define the training job and create the HuggingFace Estimator

In [None]:
import time

# Set up our training job name using a timestamp to ensure it's unique
job_name = f'huggingface-peft-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

from sagemaker.huggingface import HuggingFace

In the next cell, we'll set up the parameters for our model training. We use a pre-trained model and specify the location of the training data. We also set the number of training rounds (epochs) and other parameters like batch size and learning rate. These are technical parameters that control the model's learning process.

In [None]:
# Set up hyperparameters for the model training
hyperparameters ={
  'model_id': model_id,                                # pre-trained model
  'dataset_path': '/opt/ml/input/data/training',       # path where sagemaker will save training dataset
  'epochs': 3,                                         # number of training epochs
  'per_device_train_batch_size': 1,                    # batch size for training
  'lr': 2e-4,                                          # learning rate used during training
}
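
SageMaker passes these hyperparameters to the entry point as command-line arguments, and exposes each input channel through an `SM_CHANNEL_<NAME>` environment variable. The sketch below shows how a training script could read them. It is a hypothetical illustration, not the actual contents of `run_clm.py`.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters the way a SageMaker entry point typically receives them."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str, default="bigscience/bloomz-7b1")
    # The 'training' channel is exposed as SM_CHANNEL_TRAINING inside the container
    parser.add_argument(
        "--dataset_path",
        type=str,
        default=os.environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training"),
    )
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--per_device_train_batch_size", type=int, default=1)
    parser.add_argument("--lr", type=float, default=2e-4)
    # Ignore any extra arguments the platform may add
    args, _ = parser.parse_known_args(argv)
    return args


# Simulate the arguments SageMaker would pass for some of the hyperparameters above
args = parse_args(["--epochs", "3", "--lr", "0.0002"])
print(args.model_id, args.epochs, args.lr)
```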

In [None]:
# Here we create a HuggingFace Estimator
# This estimator defines the infrastructure that SageMaker will use for the training
# For example, it uses an instance type 'ml.g5.2xlarge', 
# and we are defining that the number of these instances is 1

huggingface_estimator = HuggingFace(
    entry_point          = 'run_clm.py',      # our training script
    source_dir           = 'scripts',         # directory where training scripts are stored
    instance_type        = 'ml.g5.2xlarge',   # type of SageMaker instance for training
    instance_count       = 1,                 # number of instances to be used for training
    base_job_name        = job_name,          # prefix for the training job name
    role                 = role,              # IAM role used in training to access AWS resources like S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.26',            # version of transformers
    pytorch_version      = '1.13',            # version of pytorch
    py_version           = 'py39',            # python version 
    hyperparameters      =  hyperparameters
)

In the next section, we'll actually start the training process for our model. This process uses the parameters we set earlier and runs the training script 'run_clm.py'. This script will load our model, set it up for training, and start the training process, with the `.fit()` method passing our S3 path to the training script.

Our model training process includes several steps:

1. **Load pre-trained model:** We start with a model that has already been trained on a large dataset. This helps our model to understand language patterns better.
2. **Load training data:** We then load our specific training data. This data is in a format that our model can understand.
3. **Train the model:** We then run the training process, where our model tries to learn from our specific training data.
4. **Test the model:** After training, we test our model to make sure it's working as expected.


In [None]:
# define a data input dictionary with our uploaded s3 uris
data = {'training': training_input_path}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)

In our example, the SageMaker training job took `20632 seconds`, which is about `5.7 hours`. The ml.g5.2xlarge instance we used costs `$1.515 per hour` for on-demand usage. As a result, the total cost for training our fine-tuned BLOOMZ-7B model was only about `$8.68`.
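
The cost arithmetic can be checked with a few lines. This sketch uses the unrounded duration and the on-demand price quoted above; current prices may differ.

```python
training_seconds = 20632   # billable training time reported above
price_per_hour = 1.515     # on-demand price for ml.g5.2xlarge quoted above, in USD

training_hours = training_seconds / 3600
total_cost = training_hours * price_per_hour

print(f"{training_hours:.2f} hours, ${total_cost:.2f}")  # 5.73 hours, $8.68
```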

We could further reduce the training costs by using spot instances. However, there is a possibility this would result in the total training time increasing due to spot instance interruptions. See the SageMaker pricing page for instance pricing details.
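
As a sketch of what enabling spot training could look like, the extra estimator arguments can be collected in a dictionary. `use_spot_instances`, `max_run`, `max_wait`, and `checkpoint_s3_uri` are SageMaker Estimator parameters; the time limits and the S3 path below are placeholder assumptions.

```python
max_run_seconds = 36 * 3600  # assumed upper bound for the actual training time

spot_kwargs = {
    "use_spot_instances": True,
    "max_run": max_run_seconds,
    # max_wait covers training time plus time spent waiting for spot capacity,
    # so it must be at least as large as max_run
    "max_wait": max_run_seconds + 12 * 3600,
    # placeholder path; checkpoints allow a job to resume after an interruption
    "checkpoint_s3_uri": "s3://<your-bucket>/checkpoints/bloomz-lora",
}

assert spot_kwargs["max_wait"] >= spot_kwargs["max_run"]
print(spot_kwargs)
```

These arguments would be passed to the estimator as `HuggingFace(..., **spot_kwargs)`. Resuming after an interruption only works if the training script itself saves and reloads checkpoints, which is not covered here.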

## 4. Deploy the model to Amazon SageMaker Endpoint

When using `peft` for training, you normally end up with only the adapter weights. We added a call to `merge_and_unload()` to merge the adapter into the base model, which makes the model easier to deploy because we can then use the `pipelines` feature of the `transformers` library.
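
Conceptually, merging folds the low-rank update into the base weight, `W_merged = W + (alpha / r) * B @ A`, so inference no longer needs a separate adapter path. The sketch below illustrates this with tiny hand-written matrices in plain Python. It is a toy illustration of the idea, not the `peft` implementation, and all values are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(W, A, B, alpha, r):
    """Fold the scaled low-rank update B @ A into the base weight W."""
    scaling = alpha / r
    delta = matmul(B, A)
    return [[W[i][j] + scaling * delta[i][j] for j in range(len(W[0]))] for i in range(len(W))]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (2 x 2)
B = [[1.0], [2.0]]            # LoRA matrix B (2 x r)
A = [[3.0, 4.0]]              # LoRA matrix A (r x 2)
alpha, r = 2, 1
x = [[1.0], [1.0]]            # input column vector

# Output with the merged weight
y_merged = matmul(merge_lora(W, A, B, alpha, r), x)

# Output with separate base and adapter paths
base = matmul(W, x)
adapter = matmul(B, matmul(A, x))
y_adapter = [[base[i][0] + (alpha / r) * adapter[i][0]] for i in range(len(base))]

print(y_merged, y_adapter)
```

Both paths produce the same output, which is why the merged model can be served like an ordinary checkpoint.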

SageMaker starts the deployment process by creating a SageMaker Endpoint Configuration and a SageMaker Endpoint. The Endpoint Configuration defines the model and the instance type.

In [None]:
from sagemaker.huggingface import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   model_data=huggingface_estimator.model_data,
   #model_data="s3://hf-sagemaker-inference/model.tar.gz",  # Change to your model path
   role=role, 
   transformers_version="4.26", 
   pytorch_version="1.13", 
   py_version="py39",
   model_server_workers=1
)

We can now deploy our model using the `deploy()` method on our `HuggingFaceModel` object, passing in our desired number of instances and instance type.

In [None]:
# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
   initial_instance_count=1,
   instance_type= "ml.g5.4xlarge"
)

Note: it may take 5-10 min for the SageMaker endpoint to bring your instance online and download your model in order to be ready to accept inference requests.

This concludes the model deployment section! At this point, we have a working model that's ready to make predictions. In the next section, we'll put our model to the test by having it generate summaries for chat dialogues.


## 5. Test the model

Let's select a random chat dialogue from the `test` split of our original dataset and see how well our model generates a summary.


In [None]:
from random import randint
from datasets import load_dataset

# Now, we load the test split of the 'samsum' dataset separately. 
# This ensures an unbiased evaluation of the model's performance later.
test_dataset = load_dataset("samsum", split="test")

# select a random test sample
# randint is inclusive on both ends, so the upper bound must be len(test_dataset) - 1
sample = test_dataset[randint(0, len(test_dataset) - 1)]

# format sample
prompt_template = f"Summarize the chat dialogue:\n{{dialogue}}\n---\nSummary:\n"

formatted_sample = {
  "inputs": prompt_template.format(dialogue=sample["dialogue"]),
  "parameters": {
    "do_sample": True, # sample output predicted probabilities
    "top_p": 0.9, # nucleus sampling, Holtzman et al. (2019)
    "temperature": 0.1, # increasing the likelihood of high probability words and decreasing the likelihood of low probability words
    "max_new_tokens": 100, # The maximum numbers of tokens to generate, ignoring the number of tokens in the prompt
  }
}

# predict
res = predictor.predict(formatted_sample)

print(res[0]["generated_text"].split("Summary:")[-1])

# Sample model output: Kirsten and Alex are going bowling this Friday at 7 pm. They will meet up and then go together.
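
To build some intuition for the `temperature` and `top_p` parameters used above, here is a small self-contained sketch on made-up logits. It is a simplified illustration of the two ideas, not the generation code used by the endpoint.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Indices of the smallest set of most likely tokens whose total probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
p_default = softmax_with_temperature(logits, temperature=1.0)
p_sharp = softmax_with_temperature(logits, temperature=0.1)

print([round(p, 3) for p in p_default], top_p_filter(p_default, 0.9))
print([round(p, 3) for p in p_sharp], top_p_filter(p_sharp, 0.9))
```

At temperature 1.0 the top two tokens are needed to reach 90% of the probability mass, while at temperature 0.1 the most likely token alone exceeds it. This is why a low temperature makes the output close to deterministic even with `do_sample` enabled.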

We've seen what our model has generated. Now let's compare it with the actual summary.


In [None]:
print(sample["summary"])

# Test sample summary: Kirsten reminds Alex that the youth group meets this Friday at 7 pm to go bowling.
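
For a rough numeric comparison of the two summaries, a simple word-overlap score can be computed. The sketch below is a simplified unigram F1 in the spirit of ROUGE-1. It is not the official ROUGE implementation, and `unigram_f1` is a hypothetical helper that is not part of the notebook.

```python
import re
from collections import Counter


def unigram_f1(candidate, reference):
    """F1 score over lowercased word overlap between a candidate and a reference text."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((cand & ref).values())  # words shared by both, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


generated = "Kirsten and Alex are going bowling this Friday at 7 pm. They will meet up and then go together."
reference = "Kirsten reminds Alex that the youth group meets this Friday at 7 pm to go bowling."

score = unigram_f1(generated, reference)
print(f"Unigram F1: {score:.2f}")
```

A score of 1.0 means both texts use exactly the same words, and 0.0 means they share none. For a proper evaluation, an established metric implementation run over the whole test split would be more reliable than a single-sample overlap score.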

## 6. Delete the model endpoint

Finally, we're cleaning up by deleting the endpoint we created for our model. This concludes our journey of fine-tuning and deploying the BLOOMZ-7B model.


In [None]:
predictor.delete_model()
predictor.delete_endpoint()