# Train Aguila LLM using QLoRA on Amazon SageMaker (adapted from previous AWS notebooks)

In this SageMaker example, we are going to learn how to apply [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314) 
to fine-tune Aguila 7B, a Falcon-based model. QLoRA is an efficient fine-tuning technique that quantizes a pretrained language model to 4 bits and attaches small “Low-Rank Adapters” which are fine-tuned. According to the QLoRA paper, this enables fine-tuning of models with up to 65 billion parameters on a single GPU while matching the performance of full-precision (16-bit) fine-tuning.

In our example, we are going to leverage Hugging Face [Transformers](https://huggingface.co/docs/transformers/index), [Accelerate](https://huggingface.co/docs/accelerate/index), and [PEFT](https://github.com/huggingface/peft). 
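
To give a feel for why 4-bit quantization matters, the following back-of-the-envelope sketch estimates the memory needed just to hold the weights of a model at different precisions. The 7-billion parameter count is a round-number assumption, and the estimate ignores quantization constants, activations, optimizer state, and the adapter weights.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7_000_000_000  # assumption: a round 7B parameters

# Compare common storage precisions for the frozen base model
for label, bits in [("fp32", 32), ("bf16/fp16", 16), ("4-bit", 4)]:
    print(f"{label:>10}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions, the 4-bit weights take a quarter of the space of 16-bit weights, which is what leaves room on the GPU for the adapters and their optimizer state.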

In detail, you will learn how to:
1. Set up the development environment
2. Load and prepare the dataset. In our case, InstructCAT, a set of instructions created from AINA-commissioned datasets.
3. Fine-tune the Falcon-based Aguila 7B with QLoRA on Amazon SageMaker

### Quick intro: PEFT or Parameter Efficient Fine-tuning

[PEFT](https://github.com/huggingface/peft), or Parameter Efficient Fine-tuning, is a new open-source library from Hugging Face to enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model's parameters. PEFT currently includes techniques for:

- (Q)LoRA: [LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS](https://arxiv.org/pdf/2106.09685.pdf)
- Prefix Tuning: [P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks](https://arxiv.org/pdf/2110.07602.pdf)
- P-Tuning: [GPT Understands, Too](https://arxiv.org/pdf/2103.10385.pdf)
- Prompt Tuning: [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/pdf/2104.08691.pdf)
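
As a rough illustration of why LoRA is parameter-efficient: for a frozen weight matrix of shape `d x k`, LoRA trains two low-rank factors of shapes `d x r` and `r x k` instead of the full matrix. The dimensions and rank below are illustrative assumptions, not Aguila's actual configuration.

```python
def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Trainable parameters of a rank-r LoRA adapter on a d x k weight matrix."""
    return r * (d + k)  # factor B is d x r, factor A is r x k


# Hypothetical layer size and rank, chosen only for illustration
d, k, r = 4096, 4096, 16

full = d * k
lora = lora_trainable_params(d, k, r)
print(f"full matrix:  {full:,} parameters")
print(f"LoRA adapter: {lora:,} parameters ({100 * lora / full:.2f}% of full)")
```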

Prepare the Libraries:


In [None]:
!pip install peft==0.6.0 \
    transformers==4.28.1 \
    accelerate==0.21 \
    torch==2.0.1 \
    datasets==2.14.4 \
    bitsandbytes==0.41.1 \
--no-cache-dir

In [None]:
# SageMaker authentication and S3 bucket for data transfer and storage
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None  # e.g. "ainainstructionsfinetunning"
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")


## 2. Load and prepare the dataset

We will use the [InstructCAT](https://huggingface.co/datasets/BSC-LT/InstruCat_v2) collection, created from open-source task-specific datasets generated within the [AINA project](http://projecteaina.cat/). It covers several of the behavioral categories outlined in the [InstructGPT paper](https://arxiv.org/abs/2203.02155), including classification, closed QA, generation, information extraction, open QA, and summarization. A sample looks like this:

```python
{
  "instruction": "En quants actes va sintetitzar Piave l'estructura del drama?",
  "context": "La complexa estructura del drama shakesperià en cinc actes va ser sintetitzada, no sense dificultat, per Piave, en una estructura de quatre actes. Tot i això, la posada en escena resulta difícil, atesos els nombrosos canvis d'escena i les complexes ambientacions; per exemple les dues escenes ambientades en el bosc, en les introduccions del primer i el darrer acte. També s'hi troba una certa incongruència en les dues parts de l'acte tercer: la Gran Escena de l'aparició i el duet Ora di morte e di vendetta. Però, malgrat tot, s'ha de reconèixer la mestria de Piave en la redacció i el seu, si més no difícil, respecte al text de Shakespeare.",
  "response": "quatre"
}
```

To load the `BSC-LT/InstruCat_v2` dataset, we use the `load_dataset()` method from the 🤗 Datasets library.

In [None]:
from datasets import load_dataset
from random import randrange
DATASET = "BSC-LT/InstruCat_v2"
# Load dataset from the hub
dataset = load_dataset(DATASET, split="train")

print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])



To instruction-tune our model, we need to convert our structured examples into a collection of tasks described via instructions. We define a formatting function, `format_dolly`, that takes a sample and returns a string in our instruction format, following the Dolly structure.

In [None]:
def format_dolly(sample):
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # join all the parts together
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    return prompt


Random example to test formatting function.

In [None]:
from random import randrange

print(format_dolly(dataset[randrange(len(dataset))]))
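
For a deterministic check of the template, the sketch below repeats `format_dolly` and applies it to two hypothetical samples (invented for illustration, not taken from InstructCAT), one with a context and one without. It shows that the `### Context` section is dropped when the context is empty.

```python
def format_dolly(sample):
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # join the parts that are present, separated by a blank line
    return "\n\n".join(part for part in [instruction, context, response] if part is not None)


# Hypothetical samples, made up for this example
with_context = {
    "instruction": "Quina és la capital de Catalunya?",
    "context": "Barcelona és la capital de Catalunya.",
    "response": "Barcelona",
}
without_context = {**with_context, "context": ""}

print(format_dolly(with_context))
print("---")
print(format_dolly(without_context))
```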

Define the foundation model from Hugging Face to use for fine-tuning and tokenization.

In [None]:
from transformers import AutoTokenizer

model_id = "projecte-aina/aguila-7b" # sharded weights


tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

We define some helper functions to pack our samples into sequences of a given length and then tokenize them.

In [None]:
from random import randint
from itertools import chain
from functools import partial



# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_dolly(sample)}{tokenizer.eos_token}"
    return sample


# apply prompt template per sample
dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))
# print random sample
print(dataset[randint(0, len(dataset) - 1)]["text"])  # randint is inclusive on both ends

# dict of empty lists to carry leftover tokens from one batch into the next
remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}

def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # largest multiple of chunk_length that fits in the batch (0 if the batch is shorter than one chunk)
    batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result


# tokenize and chunk dataset
lm_dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(dataset.features)
).map(
    partial(chunk, chunk_length=2048),
    batched=True,
)

# Print total number of samples
print(f"Total number of samples: {len(lm_dataset)}")
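
To make the packing step concrete, here is a small self-contained sketch of the same idea on toy token lists: concatenate the batch, prepend the remainder carried over from the previous batch, cut the result into fixed-length chunks, and carry the leftover forward. The `pack` helper and the tiny chunk length are illustrative assumptions; unlike the cell above, it returns the remainder explicitly instead of using a global variable.

```python
from itertools import chain


def pack(batch, remainder, chunk_length):
    """Pack a batch of token lists into fixed-length chunks; return (chunks, new_remainder)."""
    tokens = remainder + list(chain.from_iterable(batch))
    # largest multiple of chunk_length that fits
    usable = (len(tokens) // chunk_length) * chunk_length
    chunks = [tokens[i : i + chunk_length] for i in range(0, usable, chunk_length)]
    return chunks, tokens[usable:]


# Two toy "batches" of already-tokenized samples
batches = [
    [[1, 2, 3], [4, 5]],
    [[6, 7, 8, 9], [10]],
]

remainder = []
all_chunks = []
for batch in batches:
    chunks, remainder = pack(batch, remainder, chunk_length=4)
    all_chunks.extend(chunks)

print(all_chunks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(remainder)   # [9, 10]
```

Tokens left in the final remainder are never emitted as a chunk, so a few tokens at the end of the dataset are dropped.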

After processing the dataset, we use the 🤗 Datasets [FileSystem integration](https://huggingface.co/docs/datasets/filesystems) to upload it to S3. We are using `sess.default_bucket()`; adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later when launching the training job.

In [None]:
# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/training_data_aguila_instrucat/'
lm_dataset.save_to_disk(training_input_path)

print("uploaded data to:")
print(f"training dataset to: {training_input_path}")


We prepared a [run_clm.py](./scripts/run_clm.py) script, which implements QLoRA using PEFT to train our model. The script also merges the LoRA weights into the model weights after training. That way, you can use the model as a normal model without any additional code.
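
The idea behind merging can be illustrated without any deep learning library. With LoRA, an adapted layer computes `W x + (alpha / r) * B (A x)`; merging folds the low-rank product into the base weight so that a single matrix produces the same output. The tiny matrices and the `alpha` and `r` values below are made-up numbers for illustration, and this is not the code from `run_clm.py`.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    return [[x * s for x in row] for row in a]


W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight (2 x 2)
B = [[1.0], [2.0]]            # LoRA factor B (2 x r)
A = [[0.5, -1.0]]             # LoRA factor A (r x 2)
r, alpha = 1, 2               # hypothetical rank and scaling
x = [[1.0], [1.0]]            # input as a column vector

# Adapter kept separate: base output plus scaled low-rank update
unmerged = add(matmul(W, x), scale(matmul(B, matmul(A, x)), alpha / r))

# Adapter merged into the weight: one ordinary matrix, no extra code at inference
W_merged = add(W, scale(matmul(B, A), alpha / r))
merged = matmul(W_merged, x)

print(unmerged, merged)
```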

To create a SageMaker training job, we need a `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks and manages the infrastructure for us. 
SageMaker takes care of starting and managing all the required EC2 instances, provides the correct Hugging Face container, uploads the provided scripts, and downloads the data from our S3 bucket into the container at `/opt/ml/input/data`. Then, it starts the training job by running the entry point script with the hyperparameters defined below.
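
SageMaker passes the estimator's `hyperparameters` to the entry point as command-line arguments of the form `--key value`. A minimal sketch of how a training script could read them with `argparse` follows; the argument names mirror the hyperparameters used in this notebook, but the `parse_args` helper and its defaults are assumptions and not the actual contents of `run_clm.py`.

```python
import argparse


def parse_args(argv=None):
    """Parse hyperparameters the way SageMaker hands them to the entry point (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str, required=True)
    parser.add_argument("--dataset_path", type=str, default="/opt/ml/input/data/training")
    parser.add_argument("--output_dir", type=str, default="/opt/ml/model")
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--per_device_train_batch_size", type=int, default=4)
    parser.add_argument("--lr", type=float, default=2e-4)
    return parser.parse_args(argv)


# Simulate the command line SageMaker would build from the hyperparameters dict
args = parse_args(["--model_id", "projecte-aina/aguila-7b", "--epochs", "3", "--lr", "0.0002"])
print(args.model_id, args.epochs, args.lr, args.dataset_path)
```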


In [None]:
import time
# define Training Job Name 
job_name = f'huggingface-instrucat-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters ={
  'model_id': model_id,                                # pre-trained model
  'dataset_path': '/opt/ml/input/data/training', # path where sagemaker will save training dataset
  'output_dir': '/opt/ml/model', # path where sagemaker will save model
  'epochs': 3,                                         # number of training epochs
  'per_device_train_batch_size': 4,                    # batch size for training
  'lr': 2e-4,                                          # learning rate used during training
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_clm.py',      # train script
    source_dir           = 'scripts',         # directory which includes all the files needed for training
    instance_type        = 'ml.g5.12xlarge', # instances type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # IAM role used in the training job to access AWS resources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.28.1',            # the transformers version used in the training job
    pytorch_version      = '2.0.0',            # the pytorch_version version used in the training job
    py_version           = 'py310',            # the python version used in the training job
    hyperparameters      =  hyperparameters,
    #max_run              = 86400,              # maximum runtime in seconds (default is 24 hours = 86400); increase if training takes longer
    environment          = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache" }, # set env variable to cache models in /tmp
)

We can now start our training job, with the `.fit()` method passing our S3 path to the training script.

In [None]:
# define a data input dictionary with our uploaded S3 URIs
data = {'training': training_input_path}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)

In our example, the SageMaker training job took `Training seconds: 18236`, which is about `5 hours`. The ml.g5.12xlarge instance we used costs `$7.09 per hour` for on-demand usage. As a result, the total cost for training our fine-tuned Aguila 7B model was only ~`$36`.
Once training has started after initialization, you can close the notebook. The training instance will carry on until completion or until `max_run` is exceeded. You can find the model in the S3 bucket defined previously.
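
The cost estimate is simple arithmetic on the two figures quoted above. The sketch below assumes billing is proportional to the training seconds reported by the job and ignores storage and data transfer charges.

```python
TRAINING_SECONDS = 18236   # "Training seconds" reported by the job
PRICE_PER_HOUR_USD = 7.09  # on-demand price quoted above for ml.g5.12xlarge

hours = TRAINING_SECONDS / 3600
cost = hours * PRICE_PER_HOUR_USD

print(f"{hours:.2f} h -> ${cost:.2f}")
```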

## Next Steps 

You can deploy your fine-tuned model to a SageMaker endpoint and use it for inference. For more details, check out [Deploy Falcon 7B & 40B on Amazon SageMaker](https://www.philschmid.de/sagemaker-falcon-llm) and [Securely deploy LLMs inside VPCs with Hugging Face and Amazon SageMaker](https://www.philschmid.de/sagemaker-llm-vpc).