# Fine-tune Llama 2 (7B) on Vertex AI


Meta developed and publicly released the Llama 2 family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters.

In this tutorial you will learn how to finetune [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) on Vertex AI. 


What you'll learn in this tutorial:

1. [Setup development environment](#1-setup-development-environment)
2. [Load the dataset](#2-load-the-dataset)
3. [Fine-tune Llama-2-7b using `trl` and `SFTTrainer`](#3-fine-tune-llama-2-7b-using-trl-and-sfttrainer)

## 1. Setup development environment


In this example, we use a Vertex AI Workbench instance with an A100 GPU and the [Hugging Face Deep Learning Containers](https://cloud.google.com/deep-learning-containers/docs/choosing-container#hugging-face) (DLCs). The Hugging Face PyTorch DLC comes with the most important libraries pre-installed, including Transformers, Datasets, PEFT, and TRL. This makes it easy to get started, since no environment management is needed. All Hugging Face containers are listed on [Google Cloud](https://cloud.google.com/deep-learning-containers/docs/choosing-container#hugging-face).


**ToDo**: Add info on how to spin up a Workbench instance, or a short intro to Vertex AI Workbench instances.

**ToDo**: Update the link for the image once GPU containers are released.


Once the instance is up and running, we can access a Jupyter environment, which we can use for preparing our dataset and launching the training.

## 2. Load the dataset 

We use the [Abirate/english_quotes](https://huggingface.co/datasets/Abirate/english_quotes) dataset, a collection of quotes retrieved from [Goodreads quotes](https://www.goodreads.com/quotes). This dataset can be used for multi-label text classification and text generation. All quotes are in English.

An example from the dataset:
```python
{
    "quote": "Be yourself; everyone else is already taken.",
    "author": "Oscar Wilde",
    "tags": ["be-yourself", "gilbert-perreira", "honesty", "inspirational", "misattributed-oscar-wilde", "quote-investigator"]
}
```

To load the `Abirate/english_quotes` dataset, we use the `load_dataset()` function from the 🤗 Datasets library.

In [3]:
from datasets import load_dataset

dataset = load_dataset("Abirate/english_quotes", split="train")

To fine-tune our model, we need to convert our structured examples into a format that the `SFTTrainer` supports. We define a `format_dataset` function that concatenates the `quote` and `author` columns into a single `text` field.

In [10]:
def format_dataset(sample):
    sample["text"] = f"Quote: {sample['quote']}\nAuthor: {sample['author']}"
    return sample

Before applying the formatting to the entire dataset, let's test the function on a random example.

In [11]:
from random import randrange

print(format_dataset(dataset[randrange(len(dataset))])['text'])

Quote: “When I saw you I fell in love, and you smiled because you knew.”
Author: Arrigo Boito


We can see that the example was formatted properly: the `quote` and `author` information has been combined into the `text` field.

In [12]:
# apply formatting
dataset = dataset.map(
    format_dataset, remove_columns=list(dataset.features)
)

Map: 100%|███████████████████████████████████████████████████| 2508/2508 [00:00<00:00, 16395.47 examples/s]
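
To make the effect of `map` with `remove_columns` more concrete, here is a plain-Python sketch that mimics it on two records quoted in this tutorial. It does not use 🤗 Datasets; `rows` and `apply_format` are hypothetical names introduced only for illustration, and the tags are shortened or left out.

```python
def format_dataset(sample):
    # Same formatting as above: merge quote and author into one "text" field.
    sample["text"] = f"Quote: {sample['quote']}\nAuthor: {sample['author']}"
    return sample

# Two records taken from this tutorial (tags shortened or omitted for brevity).
rows = [
    {"quote": "Be yourself; everyone else is already taken.",
     "author": "Oscar Wilde",
     "tags": ["be-yourself", "inspirational"]},
    {"quote": "When I saw you I fell in love, and you smiled because you knew.",
     "author": "Arrigo Boito",
     "tags": []},
]

def apply_format(rows, remove_columns):
    # Mimic dataset.map(fn, remove_columns=...): apply fn to a copy of each row,
    # then drop the listed original columns so that only the new ones remain.
    formatted = []
    for row in rows:
        new_row = format_dataset(dict(row))
        formatted.append({k: v for k, v in new_row.items() if k not in remove_columns})
    return formatted

original_columns = list(rows[0].keys())  # plays the role of list(dataset.features)
formatted_rows = apply_format(rows, remove_columns=original_columns)
print(formatted_rows[0]["text"])
```

After this step each record only carries the `text` field, which is the single column the trainer needs.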


## 3. Fine-tune Llama-2-7b using `trl` and `SFTTrainer`

We will use the [SFTTrainer](https://huggingface.co/docs/trl/en/sft_trainer) from 🤗 `trl` to fine-tune our model. The `SFTTrainer` is built on top of the 🤗 Transformers `Trainer` and inherits all the core functionalities like logging, evaluation, and checkpointing, but offers additional enhancements like:

- Packing datasets for more efficient training
- PEFT (parameter-efficient fine-tuning) support including Q-LoRA
- Preparing the model and tokenizer for conversational fine-tuning (e.g. adding special tokens)

You can read more about it in the [trl docs](https://huggingface.co/docs/trl/en/sft_trainer).


LLMs are large, and running or training them on consumer hardware is a major challenge for users and for accessibility. Therefore, we use [QLoRA](https://arxiv.org/abs/2305.14314), a technique that reduces the memory footprint of LLMs during fine-tuning with little to no loss in performance. It works as follows:

- Quantize the pretrained model to 4 bits and freeze it.
- Attach small, trainable adapter layers (LoRA).
- Fine-tune only the adapter layers, while using the frozen quantized model for context.
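
The adapter idea in the steps above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how `peft` implements it: the sizes are tiny, the frozen weight `W` is kept as ordinary floats instead of 4-bit values, and the names `matvec` and `lora_forward` are hypothetical.

```python
import random

def matvec(matrix, vector):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

random.seed(0)
d_out, d_in, r, alpha = 4, 6, 2, 16  # tiny illustrative sizes

# Frozen base weight W (in QLoRA this would be stored in 4-bit; floats here).
W = [[random.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]

# Trainable low-rank adapter: A is (r x d_in), B is (d_out x r).
# B starts at zero, so the adapter initially leaves the base model unchanged.
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0 for _ in range(r)] for _ in range(d_out)]

def lora_forward(x):
    # y = W x + (alpha / r) * B (A x)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

x = [1.0, 0.5, -0.5, 2.0, 0.0, -1.0]
y_before = lora_forward(x)

# Pretend one training step changed only the adapter; W is never touched.
B[0][0] = 0.1
y_after = lora_forward(x)
```

Only `A` and `B` would receive gradient updates during training, which is why the memory needed for gradients and optimizer state stays small.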

To further improve training efficiency, we combine `QLoRA` with `Flash Attention 2`, a recently introduced, high-performance attention implementation that is nicely integrated with Transformers. It can be up to 3x faster than the standard attention mechanism.

In [16]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer



In [17]:
# Hugging Face model id
model_id = "meta-llama/Llama-2-7b-hf"

In [20]:
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,               # quantize the model weights to 4 bits at load time
    bnb_4bit_quant_type="nf4",       # NF4: a 4-bit data type designed for normally distributed weights
    bnb_4bit_use_double_quant=True,  # nested quantization: also quantize the quantization constants
    # Compute dtype used for matrix multiplications. Use torch.float16 on GPUs
    # without bfloat16 support (e.g. T4, V100). Converting from bfloat16 to
    # float16 may overflow; the opposite direction may lose precision.
    bnb_4bit_compute_dtype=torch.bfloat16,
)
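
As a rough back-of-the-envelope check of why the 4-bit configuration above matters, the sketch below estimates the memory needed for the weights alone. It assumes a nominal 7 billion parameters and ignores quantization constants, activations, adapter weights, and optimizer state, so the real footprint will be higher; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bits_per_param):
    # Memory needed to store the weights only, in gigabytes (1 GB = 1e9 bytes).
    return num_params * bits_per_param / 8 / 1e9

num_params = 7e9  # nominal size of a "7B" model; the exact count differs slightly

fp16_gb = weight_memory_gb(num_params, 16)  # half-precision weights
nf4_gb = weight_memory_gb(num_params, 4)    # 4-bit quantized weights

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f" 4-bit weights: ~{nf4_gb:.1f} GB")
```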

To use `meta-llama/Llama-2-7b-hf`, you need a Hugging Face Hub token with access to the model, so make sure to run the following:

```bash
huggingface-cli login # The easiest way to authenticate and it saves the token on your machine. 
```

Other ways to authenticate are described in the [docs](https://huggingface.co/docs/huggingface_hub/en/quick-start).

In [21]:
# Load model
model = AutoModelForCausalLM.from_pretrained(model_id, 
                                             quantization_config=bnb_config, 
                                             device_map="auto",
                                             attn_implementation="flash_attention_2"
                                            )

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

Downloading shards: 100%|████████████████████████████████████████████████████| 2/2 [00:34<00:00, 17.16s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████| 2/2 [00:04<00:00,  2.31s/it]
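
The cell above sets the pad token to the EOS token, presumably because the Llama 2 tokenizer ships without a dedicated padding token. The following plain-Python sketch shows what padding a batch with that token looks like. The token ids are made up, `eos_token_id = 2` is an assumed value used only for illustration, and `pad_batch` is a hypothetical helper rather than a tokenizer API.

```python
def pad_batch(sequences, pad_token_id):
    # Right-pad every sequence to the length of the longest one and build an
    # attention mask (1 = real token, 0 = padding).
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

eos_token_id = 2                    # assumed value, for illustration only
batch = [[1, 15, 27, 33], [1, 42]]  # made-up token ids, not real tokenizer output

input_ids, attention_mask = pad_batch(batch, pad_token_id=eos_token_id)
print(input_ids)
print(attention_mask)
```

The attention mask is what lets the model ignore the padded positions even though they reuse the EOS id.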


To use QLoRA with the `SFTTrainer`, we create a `LoraConfig` and pass it as an argument to the `SFTTrainer`.

In [25]:
from peft import LoraConfig
# LoRA config
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=8,
    bias="none",
    task_type="CAUSAL_LM", 
)
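
To get a feeling for how small a rank-8 adapter is, the sketch below counts the weights that LoRA adds to a single square projection matrix. The hidden size of 4096 is an assumption made for illustration, and `lora_param_count` is a hypothetical helper; which layers actually receive adapters depends on the `peft` defaults for the model.

```python
def lora_param_count(d_in, d_out, r):
    # A is (r x d_in) and B is (d_out x r), so the adapter adds r * (d_in + d_out) weights.
    return r * (d_in + d_out)

d_in = d_out = 4096  # assumed hidden size, for illustration
r = 8                # rank from the LoraConfig above
lora_alpha = 16      # alpha from the LoraConfig above

full = d_in * d_out
adapter = lora_param_count(d_in, d_out, r)
fraction = adapter / full
scaling = lora_alpha / r  # factor applied to the adapter output

print(f"full: {full:,}  adapter: {adapter:,}  fraction: {fraction:.2%}  scaling: {scaling}")
```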

Before we can start training, we need to define the hyperparameters (`TrainingArguments`) we want to use.

In [49]:
training_args = TrainingArguments(
    output_dir="output",             # directory to save the trained model
    max_steps=20,                    # number of training steps
    learning_rate=2e-4,              # learning rate for training
    optim="paged_adamw_8bit",        # optimizer for training
    per_device_train_batch_size=1,   # batch size per device during training
    gradient_accumulation_steps=4,   # number of steps to accumulate gradients before updating the model
    logging_steps=5,                 # log every 5 steps
    # Train in bfloat16. On GPUs without bfloat16 support (e.g. T4, V100), set
    # fp16=True instead. Converting from bfloat16 to float16 may overflow; the
    # opposite direction may lose precision.
    bf16=True,
)
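
The batch-size settings above interact, so a quick calculation helps to see how much data this short run actually covers. The sketch assumes a single GPU and one dataset example per training sequence (no packing); the dataset size of 2508 is taken from the `Map` output earlier.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 20
num_devices = 1      # assumption: a single GPU
dataset_size = 2508  # number of examples reported by the map step earlier

# Number of examples that contribute to each optimizer update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Total examples consumed over the whole run, and the share of one epoch.
examples_seen = effective_batch_size * max_steps
fraction_of_epoch = examples_seen / dataset_size

print(f"effective batch size: {effective_batch_size}")
print(f"examples seen: {examples_seen} ({fraction_of_epoch:.1%} of one epoch)")
```

With these values the run is a quick smoke test rather than a full fine-tune; increasing `max_steps` would cover more of the dataset.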


In [50]:
## Initialize the trl SFTTrainer
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text", # field that contains the text in the dataset
    args = training_args,
    peft_config = peft_config,
)



In [51]:
# start training
trainer.train()

# save model
trainer.save_model()

| Step | Training Loss |
|------|---------------|
| 5    | 1.7562        |
| 10   | 1.3049        |
| 15   | 1.0339        |
| 20   | 1.1327        |
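
One way to read these numbers is to convert them to perplexity. The sketch below assumes the logged value is the mean per-token cross-entropy in nats, in which case perplexity is simply `exp(loss)`. With only 20 steps the values are noisy, so treat them as a rough indication.

```python
import math

# (step, training loss) pairs copied from the log above.
losses = [(5, 1.7562), (10, 1.3049), (15, 1.0339), (20, 1.1327)]

for step, loss in losses:
    # Perplexity is the exponential of the mean cross-entropy loss.
    print(f"step {step:>2}: loss {loss:.4f} -> perplexity {math.exp(loss):.2f}")

first_ppl = math.exp(losses[0][1])
final_ppl = math.exp(losses[-1][1])
```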


In [None]:
# free the memory again
del model
del trainer
torch.cuda.empty_cache()