# Efficient Large Language Model training with QLoRA and Hugging Face (4bit Quantization)

In this SageMaker example, we will learn how to apply QLoRA, a quantized variant of [Low-Rank Adaptation of Large Language Models (LoRA)](https://arxiv.org/abs/2106.09685), to fine-tune Falcon on a single GPU. We will use Hugging Face [Transformers](https://huggingface.co/docs/transformers/index), [Accelerate](https://huggingface.co/docs/accelerate/index), and [PEFT](https://github.com/huggingface/peft).

You will learn how to:

1. Set up Development Environment
2. Load and prepare the dataset
3. Fine-Tune Falcon with QLoRA and bnb int-4 on Amazon SageMaker
4. Deploy the model to Amazon SageMaker Endpoint
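
To see why 4-bit quantization matters for single-GPU training, a back-of-the-envelope estimate of weight memory helps. The sketch below assumes roughly 7 billion parameters for the 7B model and counts only the frozen base weights. Activations, adapter weights, optimizer state, and quantization constants are ignored, so real usage is higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: rough parameter count of a 7B model

# compare common storage precisions for the same parameter count
for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>10}: {weight_memory_gb(NUM_PARAMS, bits):5.1f} GB")
```

Under these assumptions, 4-bit storage needs about a quarter of the memory of 16-bit storage for the base weights.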

### Quick intro: PEFT or Parameter Efficient Fine-tuning

[PEFT](https://github.com/huggingface/peft), or Parameter Efficient Fine-tuning, is a new open-source library from Hugging Face to enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model's parameters. PEFT currently includes techniques for:

- QLoRA: [Efficient Finetuning of Quantized LLMs](https://arxiv.org/pdf/2305.14314.pdf)
- Prefix Tuning: [P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks](https://arxiv.org/pdf/2110.07602.pdf)
- Prompt Tuning: [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/pdf/2104.08691.pdf)
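
To illustrate why such methods are parameter-efficient, the following sketch counts trainable parameters for a LoRA-style low-rank update of a single weight matrix. The layer shape and the rank are illustrative assumptions, not values taken from Falcon or from the training script.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix W (d_out x d_in) is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a low-rank update B @ A, with B (d_out x r) and A (r x d_in)."""
    return d_out * rank + rank * d_in


d_out, d_in, rank = 4096, 4096, 8  # hypothetical layer shape and LoRA rank
full = full_finetune_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(f"full fine-tuning: {full:,} trainable parameters")
print(f"low-rank update:  {lora:,} trainable parameters ({lora / full:.2%} of full)")
```

The base matrix stays frozen, and only the two small factors are trained.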


In [2]:
!pip install "transformers==4.26.0" "datasets[s3]==2.9.0" sagemaker py7zr --upgrade --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
awscli 1.27.132 requires PyYAML<5.5,>=3.10, but you have pyyaml 6.0 which is incompatible.

In [3]:
import sagemaker
from sagemaker.experiments.run import Run
from sagemaker.pytorch import PyTorch
from sagemaker.huggingface import HuggingFace
from sagemaker.debugger import TensorBoardOutputConfig  # Debugger TensorBoard config to log training metrics to TensorBoard
from sagemaker import image_uris
from sagemaker.utils import name_from_base
import boto3
import json
import os

In [4]:
role = sagemaker.get_execution_role()  # IAM execution role used by the training job and the endpoint
sess = sagemaker.session.Session()  # SageMaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # default bucket to house datasets and artifacts
model_bucket = bucket  # model artifacts go to the same bucket
s3_key_prefix = "Falcon-7b-spider"  # folder within the bucket where artifacts will go

region = sess.boto_region_name  # region name of the current SageMaker session
account_id = sess.account_id()  # account ID of the current SageMaker session

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {bucket}")
print(f"sagemaker session region: {region}")

sagemaker role arn: arn:aws:iam::376678947624:role/service-role/AmazonSageMaker-ExecutionRole-20230315T093911
sagemaker bucket: sagemaker-us-west-2-376678947624
sagemaker session region: us-west-2


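# Optional alternative dataset. It is not used in the rest of this example,
# which loads the Spider dataset further down, so it is left commented out.
# from datasets import load_dataset
#
# dataset_name = "timdettmers/openassistant-guanaco"
# dataset = load_dataset(dataset_name)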

In [5]:
from transformers import AutoTokenizer

model_id = "tiiuae/falcon-7b" #"tiiuae/falcon-40b"

# Load tokenizer of Falcon
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.model_max_length = 2048 # overwrite wrong value



In [6]:
from datasets import load_dataset

# Load dataset from the hub
dataset = load_dataset("spider")

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['validation'])}")

100%|██████████| 2/2 [00:00<00:00, 41.10it/s]

Train dataset size: 7000
Test dataset size: 1034





In [7]:
from random import randint

# prompt template applied to each sample; the placeholders are filled in by str.format
prompt_template = "Question:\n{question}\n---\nQuery:\n{query}{eos_token}"

# add the formatted prompt to each sample as a "text" field
def template_dataset(sample):
    sample["text"] = prompt_template.format(question=sample["question"],
                                            query=sample["query"],
                                            eos_token=tokenizer.eos_token)
    return sample


# apply prompt template per sample
train_dataset = dataset["train"].map(template_dataset, remove_columns=list(dataset["train"].features))

# print a random training sample (randint is inclusive on both ends)
print(train_dataset[randint(0, len(train_dataset) - 1)]["text"])

100%|██████████| 7000/7000 [00:00<00:00, 17861.19ex/s]


Question:
List the creation year, name and budget of each department.
---
Query:
SELECT creation ,  name ,  budget_in_billions FROM department<|endoftext|>
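
The template can also be exercised without the tokenizer or the dataset. In this minimal sketch, the sample dictionary and the `EOS_TOKEN` constant are hypothetical stand-ins (the real value comes from `tokenizer.eos_token`), and `build_inference_prompt` is an illustrative helper showing that the inference prompt stops right after `Query:` so that the model completes the SQL.

```python
TRAIN_TEMPLATE = "Question:\n{question}\n---\nQuery:\n{query}{eos_token}"
EOS_TOKEN = "<|endoftext|>"  # assumption: stand-in for tokenizer.eos_token


def build_training_text(sample: dict) -> str:
    """Full training text: question, separator, target query, end-of-sequence token."""
    return TRAIN_TEMPLATE.format(
        question=sample["question"], query=sample["query"], eos_token=EOS_TOKEN
    )


def build_inference_prompt(question: str) -> str:
    """Prompt for generation: everything up to (and excluding) the query placeholder."""
    prefix = TRAIN_TEMPLATE.split("{query}")[0]
    return prefix.format(question=question)


# hypothetical sample with the two fields used by the template
sample = {"question": "How many departments are there?", "query": "SELECT count(*) FROM department"}
print(build_training_text(sample))
print("---- inference prompt ----")
print(build_inference_prompt(sample["question"]))
```

Keeping the training text and the inference prompt derived from one template avoids a mismatch between what the model saw during fine-tuning and what it is asked at inference time.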


In [8]:
test_dataset = dataset["validation"].map(template_dataset, remove_columns=list(dataset["validation"].features))

100%|██████████| 1034/1034 [00:00<00:00, 15764.16ex/s]


In [9]:
# tokenize the train and test datasets
lm_train_dataset = train_dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, batch_size=24, remove_columns=list(train_dataset.features)
)


lm_test_dataset = test_dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(test_dataset.features)
)

# Print total number of samples
print(f"Total number of train samples: {len(lm_train_dataset)}")

100%|██████████| 292/292 [00:01<00:00, 249.43ba/s]
100%|██████████| 2/2 [00:00<00:00,  8.80ba/s]

Total number of train samples: 7000
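
The progress bars above report 292 and 2 batches. These counts follow from how `map(batched=True)` splits the rows: the training call sets `batch_size=24`, while the test call relies on the default batch size of 1000 in `datasets`. A short check of that arithmetic, using the dataset sizes printed earlier:

```python
import math


def num_batches(num_rows: int, batch_size: int) -> int:
    """Number of batches needed to cover all rows; the last batch may be smaller."""
    return math.ceil(num_rows / batch_size)


print(num_batches(7000, 24))    # training split with batch_size=24
print(num_batches(1034, 1000))  # validation split with the default batch size
```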





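# Alternative path for a dataset that already provides "train"/"test" splits and a "text" column
# (this appears to target the Guanaco dataset referenced earlier). Spider only has "train" and
# "validation" splits, and running this would overwrite the templated datasets created above,
# so it is left commented out.
# train_dataset = dataset['train']
# test_dataset = dataset['test']
#
# lm_train_dataset = train_dataset.map(
#     lambda sample: tokenizer(sample["text"]), batched=True, batch_size=24, remove_columns=list(train_dataset.features)
# )
#
# lm_test_dataset = test_dataset.map(
#     lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(test_dataset.features)
# )
#
# print(f"Total number of train samples: {len(lm_train_dataset)}")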

In [10]:
# save train_dataset to s3
training_input_path = f's3://{bucket}/{s3_key_prefix}/data/train'
lm_train_dataset.save_to_disk(training_input_path)

testing_input_path = f's3://{bucket}/{s3_key_prefix}/data/test'
lm_test_dataset.save_to_disk(testing_input_path)

print("uploaded data to:")
print(f"training dataset to: {training_input_path}")
print(f"testing dataset to: {testing_input_path}")


                                                                                              

uploaded data to:
training dataset to: s3://sagemaker-us-west-2-376678947624/Falcon-7b-spider/data/train
testing dataset to: s3://sagemaker-us-west-2-376678947624/Falcon-7b-spider/data/test




## Configure SageMaker Training Job

In [20]:
job_name = name_from_base(s3_key_prefix)

# hyperparameters, which are passed into the training job
hyperparameters = {
  'model_id': model_id,                                # pre-trained model
  'train_dataset_path': '/opt/ml/input/data/train',    # path where SageMaker makes the training dataset available
  'test_dataset_path': '/opt/ml/input/data/test',      # path where SageMaker makes the test dataset available
  'epochs': 1,                                         # number of training epochs
  'per_device_train_batch_size': 8,                    # batch size for training
  'lr': 2e-4,                                          # learning rate used during training
}

# create the Estimator
estimator = HuggingFace(
    entry_point          = 'train.py',        # training script
    source_dir           = 'src',             # directory which includes all the files needed for training
    instance_type        = 'ml.g5.2xlarge',   # instance type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # base name of the training job (SageMaker appends a timestamp)
    role                 = role,              # IAM role used in the training job to access AWS resources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.28.1',          # the transformers version used in the training job
    pytorch_version      = '2.0.0',           # the PyTorch version used in the training job
    py_version           = 'py310',           # the Python version used in the training job
    hyperparameters      = hyperparameters,   # hyperparameters passed to the training script
)
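
SageMaker passes the `hyperparameters` dictionary to the entry point as command-line arguments of the form `--key value`, and each input channel is made available under `/opt/ml/input/data/<channel>`. The `src/train.py` script is not shown here, so the following is only a sketch of how such a script might parse these values. The function name `parse_args` and the defaults are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters that the estimator forwards as --key value pairs."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str, default="tiiuae/falcon-7b")
    parser.add_argument("--train_dataset_path", type=str, default="/opt/ml/input/data/train")
    parser.add_argument("--test_dataset_path", type=str, default="/opt/ml/input/data/test")
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--per_device_train_batch_size", type=int, default=8)
    parser.add_argument("--lr", type=float, default=2e-4)
    # parse_known_args ignores any extra arguments instead of raising an error
    args, _ = parser.parse_known_args(argv)
    return args


# simulate the arguments produced from the hyperparameters dictionary
args = parse_args(["--epochs", "1", "--per_device_train_batch_size", "8", "--lr", "0.0002"])
print(args.model_id, args.epochs, args.per_device_train_batch_size, args.lr)
```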

In [21]:
experiment_name = s3_key_prefix
run_name = f"{s3_key_prefix}-run-1"
with Run(experiment_name=experiment_name, sagemaker_session=sess, run_name=run_name) as run:
    estimator.fit(
        {"train": training_input_path, "test": testing_input_path}, wait=False
    )

INFO:sagemaker.experiments.run:The run (falcon-7b-spider-run-1) under experiment (falcon-7b-spider) already exists. Loading it.
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: Falcon-7b-spider-2023-07-14-16-23-03-20-2023-07-14-16-23-04-473


Using provided s3_resource


## 4. Deploy the model to Amazon SageMaker Endpoint

When using `peft` for training, you normally end up with adapter weights only. We added a call to the `merge_and_unload()` method to merge the adapter into the base model, which makes deployment easier because the merged model can be used with the `pipelines` feature of the `transformers` library.
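
Conceptually, merging folds the low-rank update into the base weight, W' = W + (alpha / r) * B A, so that no separate adapter is needed at inference time. The pure-Python sketch below illustrates this on tiny matrices. The values, the `alpha` scaling factor, and the helper names are illustrative assumptions, and this is not the `peft` implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(w, b, a, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the merged weight matrix."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))] for i in range(len(w))]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (2 x 2)
B = [[1.0], [2.0]]            # adapter factor B (2 x 1), so rank = 1
A = [[0.5, -0.5]]             # adapter factor A (1 x 2)

merged = merge_lora(W, B, A, alpha=2, rank=1)
print(merged)  # [[2.0, -1.0], [2.0, -1.0]]
```

After merging, a single matrix multiplication with `merged` gives the same result as applying the base weight and the scaled adapter separately.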

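We can now deploy our model by creating a `HuggingFaceModel` from the model artifact of the training job (`estimator.model_data`) and calling `deploy()` on it, passing in the desired number of instances and the instance type. Because the training job was started with `wait=False`, make sure it has finished before running this cell.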

In [22]:
from sagemaker.huggingface import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   model_data=estimator.model_data,
   role=role, 
   transformers_version="4.26", 
   pytorch_version="1.13", 
   py_version="py39",
   model_server_workers=1
)
# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
   initial_instance_count=1,
   instance_type="ml.g5.4xlarge"
)


INFO:sagemaker:Creating model with name: huggingface-pytorch-inference-2023-07-14-17-24-24-350
INFO:sagemaker:Creating endpoint-config with name huggingface-pytorch-inference-2023-07-14-17-24-25-095
INFO:sagemaker:Creating endpoint with name huggingface-pytorch-inference-2023-07-14-17-24-25-095


--------------!

SageMaker starts the deployment process by creating a SageMaker Endpoint Configuration and a SageMaker Endpoint. The Endpoint Configuration defines the model and the instance type.

Let's test the endpoint with an example from the test split.

In [25]:
from random import randint
from datasets import load_dataset

# Load dataset from the hub
test_dataset = load_dataset("spider", split="validation")

# select a random test sample (randint is inclusive on both ends)
sample = test_dataset[randint(0, len(test_dataset) - 1)]

# format sample: the prompt ends after "Query:" so that the model generates the SQL
prompt_template = "Question:\n{question}\n---\nQuery:\n"

formatted_sample = {
  "inputs": prompt_template.format(question=sample["question"]),
  "parameters": {
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.1,
    "max_new_tokens": 100,
  }
}

# predict
res = predictor.predict(formatted_sample)

print(res[0]["generated_text"].split("Query:")[-1])




SELECT count(*),  T1.name FROM manufacturers AS T1 JOIN models AS T2 ON T1.id = T2.manufacturer_id GROUP BY T1.name ORDER BY count(*) DESC LIMIT 1;
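
The cell above keeps only the text after the last `Query:` marker, which suggests that the returned text contains the prompt followed by the completion. A small helper makes that post-processing explicit. The `response` below is a hypothetical payload shaped like the one indexed above, and `extract_query` is an illustrative name.

```python
def extract_query(generated_text: str, marker: str = "Query:") -> str:
    """Return the text after the last occurrence of the marker, without surrounding whitespace."""
    return generated_text.split(marker)[-1].strip()


# hypothetical response: a list with one dict holding the generated text
response = [
    {"generated_text": "Question:\nHow many singers are there?\n---\nQuery:\nSELECT count(*) FROM singer"}
]
print(extract_query(response[0]["generated_text"]))
```

If the marker is missing, the helper returns the whole stripped text, so callers may want to check for the marker first.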



Let's compare it to the ground truth.

In [26]:
print(sample["query"])

select count(*) ,  t2.fullname from model_list as t1 join car_makers as t2 on t1.maker  =  t2.id group by t2.id;
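
The prediction and the ground truth above differ in table and column names, not only in formatting. To compare many samples, a normalized string comparison can serve as a crude first check. The sketch below uses a hypothetical `normalize_sql` helper. It is not Spider's official evaluation and will count semantically equivalent but differently written queries as mismatches.

```python
import re


def normalize_sql(query: str) -> str:
    """Lowercase, drop a trailing semicolon, and unify whitespace and comma spacing."""
    q = query.strip().rstrip(";").lower()  # note: this also lowercases string literals
    q = re.sub(r"\s+", " ", q)       # collapse runs of whitespace
    q = re.sub(r"\s*,\s*", ", ", q)  # uniform spacing around commas
    return q.strip()


def exact_match(prediction: str, reference: str) -> bool:
    """True if both queries are identical after normalization."""
    return normalize_sql(prediction) == normalize_sql(reference)


predicted = "SELECT count(*),  T1.name FROM manufacturers AS T1 JOIN models AS T2 ON T1.id = T2.manufacturer_id GROUP BY T1.name ORDER BY count(*) DESC LIMIT 1;"
reference = "select count(*) ,  t2.fullname from model_list as t1 join car_makers as t2 on t1.maker  =  t2.id group by t2.id;"

print(exact_match(predicted, reference))
print(exact_match("SELECT count(*) ,  name FROM t;", "select count(*), name from t"))
```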


Finally, clean up after you are done.

In [None]:
predictor.delete_model()
predictor.delete_endpoint()