# Fine-tune Falcon Models on Amazon SageMaker

In this SageMaker example, we learn how to fine-tune [Falcon](https://huggingface.co/tiiuae/falcon-40b) using [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314).

QLoRA is an efficient fine-tuning technique that quantizes a pretrained language model to 4 bits and attaches small “Low-Rank Adapters”, which are the only weights that are trained. According to the paper, this enables fine-tuning of models with up to 65 billion parameters on a single GPU while matching the performance of full-precision fine-tuning.
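To get a feel for why 4-bit quantization matters, the following back-of-the-envelope sketch estimates the memory needed for the weights alone. It is an approximation under stated assumptions: the parameter counts are rounded, and activations, adapter weights, optimizer state, and quantization constants are ignored. The helper name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Rounded, illustrative parameter counts (assumption): 40e9 for falcon-40b,
# 65e9 for the "65 billion parameters" figure mentioned above.
for name, n in [("40B", 40e9), ("65B", 65e9)]:
    print(
        f"{name}: 16-bit = {weight_memory_gb(n, 16):.1f} GB, "
        f"4-bit = {weight_memory_gb(n, 4):.1f} GB"
    )
```

Under these assumptions, the 4-bit weights take a quarter of the memory of the 16-bit weights, which is the main reason a large model can fit on far fewer accelerators during fine-tuning.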

In our example, we are going to leverage Hugging Face [Transformers](https://huggingface.co/docs/transformers/index), [Accelerate](https://huggingface.co/docs/accelerate/index), and [PEFT](https://github.com/huggingface/peft). 

### Quick intro: PEFT or Parameter Efficient Fine-tuning

[PEFT](https://github.com/huggingface/peft), or Parameter Efficient Fine-tuning, is a new open-source library from Hugging Face to enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model's parameters. PEFT currently includes techniques for:

- (Q)LoRA: [LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS](https://arxiv.org/pdf/2106.09685.pdf)
- Prefix Tuning: [P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks](https://arxiv.org/pdf/2110.07602.pdf)
- P-Tuning: [GPT Understands, Too](https://arxiv.org/pdf/2103.10385.pdf)
- Prompt Tuning: [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/pdf/2104.08691.pdf)
- IA3: [Infused Adapter by Inhibiting and Amplifying Inner Activations](https://arxiv.org/abs/2205.05638)
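As a rough illustration of what "without fine-tuning all the model's parameters" means for LoRA, the sketch below counts the trainable parameters of a rank-`r` adapter for a single linear layer and compares them with the full layer. The layer size and rank are illustrative assumptions, not values read from the Falcon configuration or from `train.py`.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


d_in = d_out = 8192  # illustrative layer size (assumption)
rank = 64            # illustrative LoRA rank (assumption)

full = d_in * d_out
lora = lora_trainable_params(d_in, d_out, rank)
print(
    f"full layer: {full:,} params, LoRA adapter: {lora:,} params "
    f"({100 * lora / full:.2f}% of the layer)"
)
```

The adapter grows linearly with the layer width, while the full weight matrix grows quadratically, so the saving becomes larger for bigger models.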

## 1. Setup Development Environment

In [2]:
!pip install --upgrade pip --quiet
!pip install "transformers==4.31.0" "datasets[s3]==2.13.0" sagemaker --upgrade --quiet


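The tokenizer and the training job below read a Hugging Face access token (`use_auth_token=True` and `HfFolder.get_token()`), so log in to your Hugging Face account before running them, for example with `huggingface-cli login`.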

If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).



In [116]:
import sagemaker
import boto3
import json
from sagemaker.utils import name_from_base

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist

sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")


sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

s3_key_prefix = "Falcon-spider-dataset"  # folder within bucket where code artifact will go

bucket = sess.default_bucket()
region = sess.boto_region_name

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {bucket}")
print(f"sagemaker session region: {region}")


sagemaker role arn: arn:aws:iam::376678947624:role/service-role/AmazonSageMaker-ExecutionRole-20230315T093911
sagemaker bucket: sagemaker-us-west-2-376678947624
sagemaker session region: us-west-2


## 2. Load and prepare the dataset

In [24]:
from datasets import load_dataset
from random import randrange

# Load dataset from the hub
dataset = load_dataset("spider")

size = dataset["train"]

print(f"dataset size: {len(size)}")
# dataset size: 7000


Found cached dataset spider (/root/.cache/huggingface/datasets/spider/spider/1.0.0/4e5143d825a3895451569c8b9b55432b91a4bc2d04d390376c950837f4680daa)



dataset size: 7000


To instruction-tune our model, we need to convert our structured examples into a collection of tasks described via instructions. We define a `template_dataset` function that takes a sample and returns it with a `text` field containing the formatted prompt.

In [25]:
from transformers import AutoTokenizer

model_id = "tiiuae/falcon-40b" # sharded weights
tokenizer = AutoTokenizer.from_pretrained(model_id,use_auth_token=True)
tokenizer.pad_token = tokenizer.eos_token



In [26]:
from random import randint

# prompt template; the placeholders are filled in per sample
prompt_template = "Question:\n{question}\n---\nQuery:\n{query}{eos_token}"

# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = prompt_template.format(question=sample["question"],
                                            query=sample["query"],
                                            eos_token=tokenizer.eos_token)
    return sample


# apply prompt template per sample
train_dataset = dataset["train"].map(template_dataset, remove_columns=list(dataset["train"].features))

# randint is inclusive on both ends, so subtract 1 to stay within the dataset
print(train_dataset[randint(0, len(train_dataset) - 1)]["text"])

Loading cached processed dataset at /root/.cache/huggingface/datasets/spider/spider/1.0.0/4e5143d825a3895451569c8b9b55432b91a4bc2d04d390376c950837f4680daa/cache-513702d1f2a4a75f.arrow


Question:
List the name, born state and age of the heads of departments ordered by age.
---
Query:
SELECT name ,  born_state ,  age FROM head ORDER BY age<|endoftext|>


The cell above applies the prompt template to every training sample and prints one random formatted example as a check.
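The template can also be tried without loading the dataset or the tokenizer. The following sketch reuses the question and query from the printed example; the `format_sample` helper is a name invented for this illustration, and the EOS token string is taken from the printed output rather than from the tokenizer.

```python
PROMPT_TEMPLATE = "Question:\n{question}\n---\nQuery:\n{query}{eos_token}"


def format_sample(sample: dict, eos_token: str) -> str:
    """Fill the prompt template with one question/query pair."""
    return PROMPT_TEMPLATE.format(
        question=sample["question"],
        query=sample["query"],
        eos_token=eos_token,
    )


sample = {
    "question": "List the name, born state and age of the heads of departments ordered by age.",
    "query": "SELECT name ,  born_state ,  age FROM head ORDER BY age",
}
text = format_sample(sample, eos_token="<|endoftext|>")
print(text)
```

Appending the EOS token to every sample lets the model learn where a query ends, which matters once several samples are packed into one sequence.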

In addition to formatting our samples, we also want to pack multiple samples into one sequence for more efficient training.

We define some helper functions to pack our samples into sequences of a given length and then tokenize them.

In [27]:
from itertools import chain
from functools import partial

# empty list to save remainder from batches to use in next batch
remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}

def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # number of tokens that fit into whole chunks (0 if the batch is shorter than one chunk)
    batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result
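To make the packing logic easier to follow, here is a stateless toy version that works on plain lists of integers instead of tokenizer output. It is a simplified sketch, not the function used above: `pack` is a hypothetical helper that returns the remainder instead of storing it in a global variable.

```python
from itertools import chain


def pack(sequences, chunk_length, remainder=None):
    """Concatenate token lists and cut them into fixed-size chunks.

    Returns (chunks, new_remainder). The remainder holds the trailing tokens
    that did not fill a whole chunk and is meant to be passed to the next call.
    """
    tokens = list(remainder or []) + list(chain.from_iterable(sequences))
    usable = (len(tokens) // chunk_length) * chunk_length  # 0 if too short
    chunks = [tokens[i : i + chunk_length] for i in range(0, usable, chunk_length)]
    return chunks, tokens[usable:]


chunks, rest = pack([[1, 2, 3], [4, 5], [6, 7, 8, 9]], chunk_length=4)
print(chunks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(rest)    # [9]

# The leftover token is carried into the next batch.
chunks, rest = pack([[10, 11]], chunk_length=4, remainder=rest)
print(chunks)  # []
print(rest)    # [9, 10, 11]
```

Note that sample boundaries are not preserved: a chunk can start or end in the middle of a sample, which is why each sample ends with an EOS token.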

In [28]:
# tokenize and chunk dataset
lm_train_dataset = train_dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(train_dataset.features)
).map(
    partial(chunk, chunk_length=2048),
    batched=True,
)

# Print total number of samples
print(f"Total number of samples: {len(lm_train_dataset)}")

Loading cached processed dataset at /root/.cache/huggingface/datasets/spider/spider/1.0.0/4e5143d825a3895451569c8b9b55432b91a4bc2d04d390376c950837f4680daa/cache-f02fa07481a06a8f.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/spider/spider/1.0.0/4e5143d825a3895451569c8b9b55432b91a4bc2d04d390376c950837f4680daa/cache-2c0f2a8e5ee812eb.arrow


Total number of samples: 213


After processing the dataset, we use the [FileSystem integration](https://huggingface.co/docs/datasets/filesystems) of `datasets` to upload it to S3. We are using `sess.default_bucket()`; adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later in our training script.

In [29]:
# save train_dataset to s3
training_input_path = f's3://{bucket}/{s3_key_prefix}/data/train'
lm_train_dataset.save_to_disk(training_input_path)


print("uploaded data to:")
print(f"training dataset to: {training_input_path}")


uploaded data to:
training dataset to: s3://sagemaker-us-west-2-376678947624/Falcon-spider-dataset/data/train


## 3. Fine-Tune Falcon on Amazon SageMaker

We are going to use the method introduced in the paper "[QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)" by Tim Dettmers et al. QLoRA is a technique to reduce the memory footprint of large language models during fine-tuning without sacrificing performance. The TL;DR of how QLoRA works is:

* Quantize the pretrained model to 4 bits and freeze it.
* Attach small, trainable adapter layers. (LoRA)
* Finetune only the adapter layers, while using the frozen quantized model for context.

We prepared a [train.py](./src/train.py), which implements QLoRA using PEFT to train our model. The script also merges the LoRA weights into the model weights after training. That way you can use the model as a normal model without any additional code. The model will be temporarily offloaded to disk if it is too large to fit into memory.

In order to create a SageMaker training job we need a `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks and manages the infrastructure.
SageMaker takes care of starting and managing all the required EC2 instances for us, provides the correct Hugging Face container, uploads the provided scripts, and downloads the data from our S3 bucket into the container at `/opt/ml/input/data`. Then, it starts the training job by running the `train.py` entry point with the hyperparameters passed as command-line arguments.

### Hardware requirements

We also ran several experiments to determine which instance type can be used for the different model sizes. The following table shows the model, instance type, maximum batch size, and context length for each experiment.

| Model      | Instance Type       | Max Batch Size | Context Length |
|------------|---------------------|----------------|----------------|
| falcon-7b  | `(ml.)g5.2xlarge`   | `3`            | `2048`         |
| falcon-40b | `(ml.)p4d.24xlarge` | `2`            | `2048`         |

In [30]:
import time
from sagemaker.huggingface import HuggingFace
from huggingface_hub import HfFolder

# define Training Job Name 
job_name = f'huggingface-qlora-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

# hyperparameters, which are passed into the training job
hyperparameters ={
  'model_id': model_id,                             # pre-trained model
  'dataset_path': '/opt/ml/input/data/training',    # path where sagemaker will save training dataset
  'epochs': 3,                                      # number of training epochs
  'per_device_train_batch_size': 1,                 # batch size for training
  'lr': 2e-4,                                       # learning rate used during training
  'hf_token': HfFolder.get_token(),                 # huggingface token used to download the model from the Hub
  'merge_weights': True,                            # whether to merge LoRA into the model (needs more memory)
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'train.py',      # train script
    source_dir           = 'src',         # directory which includes all the files needed for training
    instance_type        = 'ml.p4d.24xlarge',   # instances type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # prefix of the training job name (SageMaker appends a timestamp)
    role                 = role,              # IAM role used in training job to access AWS resources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.28',            # the transformers version used in the training job
    pytorch_version      = '2.0',             # the pytorch version used in the training job
    py_version           = 'py310',           # the python version used in the training job
    hyperparameters      =  hyperparameters,  # the hyperparameters passed to the training job
    environment          = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache" }, # set env variable to cache models in /tmp
)
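SageMaker passes the entries of `hyperparameters` to the entry point as command-line arguments. The actual `train.py` is not shown in this notebook, so the following is only a hypothetical sketch of how such a script might parse them; the argument names mirror the dictionary keys above, while the defaults and the `str2bool` helper are assumptions.

```python
import argparse


def str2bool(value: str) -> bool:
    """Booleans arrive as strings such as 'True' or 'False' on the command line."""
    return str(value).lower() in ("true", "1", "yes")


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str)
    parser.add_argument("--dataset_path", type=str)
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--per_device_train_batch_size", type=int, default=1)
    parser.add_argument("--lr", type=float, default=2e-4)
    parser.add_argument("--hf_token", type=str, default=None)
    parser.add_argument("--merge_weights", type=str2bool, default=True)
    # parse_known_args ignores any additional arguments the script does not define
    args, _ = parser.parse_known_args(argv)
    return args


# Simulate the command line built from the hyperparameters dictionary
args = parse_args([
    "--model_id", "tiiuae/falcon-40b",
    "--dataset_path", "/opt/ml/input/data/training",
    "--epochs", "3",
    "--per_device_train_batch_size", "1",
    "--lr", "0.0002",
    "--merge_weights", "True",
])
print(args.model_id, args.epochs, args.lr, args.merge_weights)
```

Parsing booleans explicitly avoids a common pitfall: with `type=bool`, any non-empty string, including `"False"`, would evaluate to `True`.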

We can now start our training job by calling the `.fit()` method and passing it the S3 path of our training data.

In [None]:
# define a data input dictionary with our uploaded s3 uris
data = {'training': training_input_path}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-qlora-2023-08-05-00-38-49-2023-08-05-00-38-50-948


Using provided s3_resource
2023-08-05 00:38:51 Starting - Starting the training job......
2023-08-05 00:39:34 Starting - Preparing the instances for training.....................
2023-08-05 00:43:19 Downloading - Downloading input data...
2023-08-05 00:43:39 Training - Downloading the training image.....................
2023-08-05 00:47:05 Training - Training image download completed. Training in progress.......
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-08-05 00:48:14,708 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-08-05 00:48:14,764 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-08-05 00:48:14,771 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-08-05 00:48:14,772 sagemaker_pytorch_container.training INFO     Invoking use

## Next Steps 

- Deploy the model using DJL with DeepSpeed
- Pre-download and store FMs prior to training
- Shard the training across multiple GPUs with Accelerate


In [100]:
model_path = huggingface_estimator.model_data.replace('/model.tar.gz', '')
model_path

's3://sagemaker-us-west-2-376678947624/huggingface-qlora-2023-08-05-00-38-49-2023-08-05-00-38-50-948/output'

### Create Model Package Group

In [101]:
model_group_name = name_from_base(s3_key_prefix)
input_dict = {
    "ModelPackageGroupName": model_group_name,
    "ModelPackageGroupDescription": "Falcon model package group",
}

response = sm_client.create_model_package_group(
    **input_dict
)
model_package_arn = response["ModelPackageGroupArn"]
print(f"ModelPackageGroup Arn : {model_package_arn}")

ModelPackageGroup Arn : arn:aws:sagemaker:us-west-2:376678947624:model-package-group/falcon-spider-dataset-2023-08-05-20-50-41-653


### Register the model in the Model Registry
Once the model is registered, you will see it in the Model Registry tab of the SageMaker Studio UI. In this example, the model is registered with `approval_status` set to `"Approved"`. If you do not set it, the model is registered as `PendingManualApproval`; users can then navigate to the Model Registry and approve the model manually based on any criteria set for model evaluation, or approve it via the API.

In [102]:
inference_instance_type = "ml.g5.24xlarge"
model_package = huggingface_estimator.register(
    model_package_group_name=model_group_name,
    inference_instances=[inference_instance_type],
    content_types=["text/csv"],
    response_types=["text/csv"],
    approval_status="Approved",
)

model_package_arn = model_package.model_package_arn
print("Model Package ARN : ", model_package_arn)

Model Package ARN :  arn:aws:sagemaker:us-west-2:376678947624:model-package/falcon-spider-dataset-2023-08-05-20-50-41-653/1
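If a model package is registered as `PendingManualApproval`, its status can be changed later through the API. The sketch below wraps the boto3 SageMaker client call `update_model_package`; the wrapper name `set_approval_status` is made up for this illustration, and the call itself is left commented out because it needs AWS credentials and an existing model package.

```python
ALLOWED_STATUSES = {"Approved", "Rejected", "PendingManualApproval"}


def set_approval_status(sm_client, model_package_arn: str, status: str = "Approved"):
    """Update the approval status of a registered model package version."""
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"status must be one of {sorted(ALLOWED_STATUSES)}, got {status!r}")
    return sm_client.update_model_package(
        ModelPackageArn=model_package_arn,
        ModelApprovalStatus=status,
    )


# Usage with the client and ARN from this notebook (not executed here):
# set_approval_status(sm_client, model_package_arn, "Approved")
```

Validating the status locally gives a clearer error message than waiting for the service to reject the request.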


## Deployment

In [103]:
!sed -i "s@option.s3url=.*@option.s3url={model_path}@g" deepspeed/serving.properties
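The `sed` command above rewrites the `option.s3url` entry of `deepspeed/serving.properties` so that it points at the trained model. For readers who prefer to do this in Python, here is a roughly equivalent sketch. The sample file content is hypothetical, since the real `serving.properties` is not shown in this notebook, and `set_s3url` is a helper name invented for the example.

```python
import re


def set_s3url(properties_text: str, model_path: str) -> str:
    """Replace the value of every `option.s3url=` entry, similar to the sed command."""
    # A function is used as the replacement so that backslashes in model_path stay literal
    return re.sub(r"option\.s3url=.*", lambda _m: f"option.s3url={model_path}", properties_text)


# Hypothetical file content for illustration only
sample_properties = (
    "engine=DeepSpeed\n"
    "option.s3url=s3://old-bucket/old-prefix\n"
    "option.tensor_parallel_degree=4\n"
)
print(set_s3url(sample_properties, "s3://my-bucket/my-training-job/output"))
```

To apply it to the real file, read `deepspeed/serving.properties`, pass its text and `model_path` to the function, and write the result back.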

In [104]:
!rm -rf `find . -type d -name .ipynb_checkpoints`

In [105]:
!rm -f model.tar.gz
!tar czvf model.tar.gz -C deepspeed .
s3_code_artifact_deepspeed = sess.upload_data("model.tar.gz", bucket, f"{s3_key_prefix}/inference")
print(f"S3 Code or Model tar for deepspeed uploaded to --- > {s3_code_artifact_deepspeed}")

./
./model.py
./requirements.txt
./serving.properties
S3 Code or Model tar for deepspeed uploaded to --- > s3://sagemaker-us-west-2-376678947624/Falcon-spider-dataset/inference/model.tar.gz


### Define the serving container
Here we define the container to use for model inference. We will be using SageMaker's Large Model Inference (LMI) container with DeepSpeed.

In [106]:
# inference_image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/djl-ds:latest"
inference_image_uri = sagemaker.image_uris.retrieve(
    "djl-deepspeed", region=region, version="0.23.0"
)
print(f"Image going to be used is ---- > {inference_image_uri}")

INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.


Image going to be used is ---- > 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.23.0-deepspeed0.9.5-cu118


Create SageMaker model, endpoint configuration and endpoint.

In [107]:
model_name_ds = name_from_base(s3_key_prefix)
print(model_name_ds)

Falcon-spider-dataset-2023-08-05-20-50-49-228


In [108]:
create_model_response = sm_client.create_model(
    ModelName=model_name_ds,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": inference_image_uri, "ModelDataUrl": s3_code_artifact_deepspeed},
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

Created Model: arn:aws:sagemaker:us-west-2:376678947624:model/falcon-spider-dataset-2023-08-05-20-50-49-228


In [109]:
endpoint_config_name = f"{model_name_ds}-config"
endpoint_name = f"{model_name_ds}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name_ds,
            "InstanceType": inference_instance_type,
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 3600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
            # "VolumeSizeInGB": 512
        },
    ],
)
endpoint_config_response

{'EndpointConfigArn': 'arn:aws:sagemaker:us-west-2:376678947624:endpoint-config/falcon-spider-dataset-2023-08-05-20-50-49-228-config',
 'ResponseMetadata': {'RequestId': 'dac118da-9799-4f02-afab-4c782875dd60',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'dac118da-9799-4f02-afab-4c782875dd60',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '133',
   'date': 'Sat, 05 Aug 2023 20:50:50 GMT'},
  'RetryAttempts': 0}}

In [110]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

Created Endpoint: arn:aws:sagemaker:us-west-2:376678947624:endpoint/falcon-spider-dataset-2023-08-05-20-50-49-228-endpoint


In [111]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-west-2:376678947624:endpoint/falcon-spider-dataset-2023-08-05-20-50-49-228-endpoint
Status: InService
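The polling loop above runs for as long as the endpoint reports `Creating`. A variant with an upper time limit can be sketched as follows; `wait_for_status` is a hypothetical helper, and the default interval and timeout are assumptions chosen to mirror the 60-second sleep and the 3600-second timeouts used earlier.

```python
import time


def wait_for_status(get_status, pending="Creating", poll_seconds=60,
                    timeout_seconds=3600, sleep=time.sleep):
    """Poll get_status() until it stops returning `pending` or the timeout elapses."""
    waited = 0
    status = get_status()
    while status == pending:
        if waited >= timeout_seconds:
            raise TimeoutError(f"still {pending} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        status = get_status()
    return status


# Usage with the client from this notebook (not executed here):
# status = wait_for_status(
#     lambda: sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
# )
```

The function returns the first status that is not `Creating`, so the caller should still check whether it is `InService` or `Failed`. boto3 also ships a built-in waiter for this case, `sm_client.get_waiter("endpoint_in_service")`, which may be the simpler option.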


### Run Inference

Large models such as Falcon have a very high accelerator memory footprint. Thus, a very large input payload or a long generated output can cause out-of-memory errors. The inference example below is calibrated so that it works on the `ml.g5.24xlarge` instance within the SageMaker response time limit of 60 seconds. If increasing the input length or generation length leads to CUDA out-of-memory errors, reduce the prompt size or `max_length` again, or move to an instance type with more accelerator memory.

In [132]:
from random import randint
from datasets import load_dataset

# Load dataset from the hub
test_dataset = load_dataset("spider", split="validation")



In [133]:
%%time

# select a random test sample (randint is inclusive on both ends)
sample = test_dataset[randint(0, len(test_dataset) - 1)]

prompt_template = "Question:\n{question}\n---\nQuery:\n"

data = {
    "text": prompt_template.format(question=sample["question"]),
    "properties": {
        "min_length": 10,
        "max_length": 100,
        "do_sample": True,
    },
}

response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(data),
    ContentType="application/json",
)

outputs = json.loads(response_model["Body"].read().decode("utf8"))['outputs']

generated_text = outputs[0]['generated_text']
generated_text

CPU times: user 12.5 ms, sys: 738 µs, total: 13.3 ms
Wall time: 1.95 s


'Question:\nWhat are all of the episodes ordered by ratings?\n---\nQuery:\nSELECT title,  episode_order FROM tv_episodes ORDER BY ratings'

In [137]:
sample["query"]

'SELECT Episode FROM TV_series ORDER BY rating'

In [140]:
print(f"Ground Truth is: {sample['query']}")

Ground Truth is: SELECT Episode FROM TV_series ORDER BY rating
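The endpoint returns the prompt together with the generated continuation, so the SQL has to be separated from the prompt before it can be compared with the ground truth. The following sketch does this for the example above. Both helpers are hypothetical, and the whitespace-insensitive string comparison is only a rough check, not an official Spider evaluation metric.

```python
def extract_query(generated_text: str, eos_token: str = "<|endoftext|>") -> str:
    """Return the text after the last 'Query:' marker, without the EOS token."""
    query = generated_text.split("Query:\n")[-1]
    return query.split(eos_token)[0].strip()


def normalize_sql(sql: str) -> str:
    """Lowercase and collapse whitespace for a rough string comparison."""
    return " ".join(sql.lower().split())


generated = (
    "Question:\nWhat are all of the episodes ordered by ratings?\n---\nQuery:\n"
    "SELECT title,  episode_order FROM tv_episodes ORDER BY ratings"
)
ground_truth = "SELECT Episode FROM TV_series ORDER BY rating"

predicted = extract_query(generated)
print(predicted)
print(normalize_sql(predicted) == normalize_sql(ground_truth))  # False
```

In this example the model produces a syntactically plausible query, but it uses table and column names that differ from the ground truth. The prompt contains only the question and no database schema, so the model has to guess these names.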


### Clean Up
Finally, clean up after you are done.

In [144]:
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name_ds)

{'ResponseMetadata': {'RequestId': '539d8c93-3cc1-4464-a712-234c8fe549fc',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '539d8c93-3cc1-4464-a712-234c8fe549fc',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Sat, 05 Aug 2023 21:39:30 GMT'},
  'RetryAttempts': 0}}