<a href="https://colab.research.google.com/github/jeffreyhuang45/llama2_oreilly_live_training_202401/blob/main/Fine_Tuning_Llama2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning LLaMA 2-7B-Chat

> Run this notebook in Google Colab with a GPU runtime to fine-tune the LLaMA 2-7B-chat model.

---

## Step 1: Initial Setup

### Installing Required Libraries

We begin by installing the required libraries: `transformers` (model and tokenizer loading), `accelerate` (device placement), `bitsandbytes` (4-bit quantization), `peft` (LoRA adapters), and `trl` (the supervised fine-tuning trainer).





In [None]:
%%capture
%pip install accelerate peft bitsandbytes transformers trl
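
The install command above does not pin versions, so it can help to record what was actually installed before debugging any API mismatch. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages named in the install cell above.
PACKAGES = ["accelerate", "peft", "bitsandbytes", "transformers", "trl"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(PACKAGES)
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```

Running this right after the install cell gives a record of the environment that can be pasted into a bug report or used to pin versions later.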


### Loading Necessary Modules

Next, we import the modules used throughout the notebook: PyTorch, `load_dataset` from the `datasets` library, the model, tokenizer, quantization, and training classes from Hugging Face `transformers`, `LoraConfig` from `peft`, and `SFTTrainer` from `trl`.

In [None]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging
)
from peft import LoraConfig
from trl import SFTTrainer

---

## Step 2: Model Configuration

### Model and Dataset Configuration

In this step, we define the base model, the instruction dataset, and the name under which the fine-tuned model will be saved. The base model is the LLaMA 2-7B-chat checkpoint published on the Hugging Face Hub as `NousResearch/Llama-2-7b-chat-hf`, and the dataset is `mlabonne/guanaco-llama2-1k`.



In [None]:
# Model from Hugging Face hub
base_model = "NousResearch/Llama-2-7b-chat-hf"

# New instruction dataset
guanaco_dataset = "mlabonne/guanaco-llama2-1k"

# Fine-tuned model
new_model = "llama-2-7b-chat-guanaco"

## Step 3: Loading Dataset, Model, and Tokenizer

### Loading the Dataset

We start by loading the training split of the dataset from Hugging Face; the model and tokenizer are loaded in the steps that follow. The dataset is already formatted to be compatible with our model.





In [None]:

dataset = load_dataset(guanaco_dataset, split="train")
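
To make "already formatted" concrete, here is a small sketch of how an instruction and response pair can be wrapped in the Llama 2 chat template. The template is inferred from the `<s>[INST] {prompt} [/INST]` prompt used in the test cell of this notebook; the closing `</s>` tag, the exact whitespace, and the helper name `format_llama2_example` are assumptions, so inspect `dataset[0]["text"]` to confirm what the dataset actually contains.

```python
def format_llama2_example(instruction, response):
    """Wrap one instruction/response pair in an assumed Llama 2 chat template."""
    return f"<s>[INST] {instruction.strip()} [/INST] {response.strip()} </s>"


# Made-up example pair, for illustration only.
sample = format_llama2_example(
    "Who is Leonardo Da Vinci?",
    "Leonardo da Vinci was an Italian Renaissance artist and inventor.",
)
print(sample)
```

A dataset that is not pre-formatted would need a mapping step like this before being passed to the trainer.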



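## Step 4: 4-Bit Quantization Configuration

### Configuring 4-Bit Quantization (QLoRA)

QLoRA keeps the base model's weights frozen in 4-bit precision and trains small LoRA adapters on top of them, which makes it practical to fine-tune large language models on consumer hardware. The configuration below stores the weights in the NF4 format and runs computation in float16, which reduces VRAM usage while aiming to retain most of the model's quality.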



In [None]:
compute_dtype = torch.float16

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)
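
The VRAM saving can be estimated with simple arithmetic. The sketch below compares the storage needed for the weights alone at different precisions. The round figure of 7 billion parameters is an assumption, and the estimate ignores quantization constants, activations, optimizer state, and the LoRA adapters, so real usage will be higher.

```python
# Rough weight-memory estimate; the parameter count is an assumed round figure.
NUM_PARAMS = 7_000_000_000

BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "4-bit (nf4)": 0.5,
}


def weight_memory_gb(num_params, bytes_per_param):
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


estimates = {
    name: weight_memory_gb(NUM_PARAMS, size) for name, size in BYTES_PER_PARAM.items()
}
for name, gb in estimates.items():
    print(f"{name:>12}: {gb:.1f} GB")
```

Under these assumptions the 4-bit weights take roughly a quarter of the float16 footprint, which is what brings a 7B model within reach of a single Colab GPU.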




## Step 5: Loading the LLaMA 2 Model


### Loading the LLaMA 2 Model with 4-Bit Precision

Now we load the LLaMA 2 model using the 4-bit precision configuration. This step is crucial for efficient training on limited hardware.


In [None]:
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=quant_config,
    device_map={"": 0}
)
model.config.use_cache = False
model.config.pretraining_tp = 1

## Step 6: Setting Up Training Parameters

### Setting Training Parameters

Here, we load the tokenizer, configure LoRA, and define the training parameters. These include the batch size, learning rate, optimizer, and scheduler settings that control how the model is trained.

In [None]:
# Load Llama tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
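
The tokenizer settings above reuse the end-of-sequence token as the padding token and pad on the right. The sketch below illustrates what right padding does to a batch of token ids. It is a plain-Python illustration rather than the tokenizer's own implementation; the token ids are made up, and the value `2` for the `</s>` id is an assumption.

```python
def pad_right(sequences, pad_id):
    """Pad token-id lists on the right to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding that the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


EOS_ID = 2  # assumed id of </s> in the Llama 2 vocabulary
batch = [[1, 306, 4966], [1, 15043]]  # made-up token ids for illustration
input_ids, attention_mask = pad_right(batch, pad_id=EOS_ID)
print(input_ids)
print(attention_mask)
```

Because the pad token and the end-of-sequence token share an id, the attention mask is what distinguishes padding from a genuine end of sequence.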

In [None]:
# Load LoRA configuration
peft_args = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)
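
To see why LoRA is cheap to train, the sketch below counts the adapter parameters implied by `r=64`. The architecture figures (hidden size 4096, 32 layers) and the assumption that PEFT adapts only `q_proj` and `v_proj` when `target_modules` is not set are not stated in this notebook, so treat the totals as an estimate. PEFT models expose `print_trainable_parameters()`, which reports the real number.

```python
# Assumed architecture figures for Llama 2 7B and assumed PEFT default targets.
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
TARGET_MODULES = ["q_proj", "v_proj"]

# Values taken from the LoraConfig above.
R = 64
LORA_ALPHA = 16


def lora_params(d_in, d_out, r):
    """Parameters in one adapter: A has shape (r, d_in), B has shape (d_out, r)."""
    return r * d_in + d_out * r


per_module = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, R)
trainable = per_module * len(TARGET_MODULES) * NUM_LAYERS
scaling = LORA_ALPHA / R

print(f"parameters per adapted matrix: {per_module:,}")
print(f"total trainable parameters:    {trainable:,}")
print(f"update scaling (alpha / r):    {scaling}")
```

Under these assumptions the adapters hold about 33.5 million parameters, well under 1% of the base model, and each low-rank update is scaled by `lora_alpha / r = 0.25`.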

In [None]:
training_params = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,  # ignored by the "constant" scheduler set below
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard"
)
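
These settings determine how many optimizer steps the run performs and how often it writes checkpoints and logs. The sketch below works through the arithmetic. The dataset size of 1,000 examples is assumed from the "1k" in the dataset name, and a single GPU is assumed, so the figures are an estimate rather than output from the trainer.

```python
import math

NUM_EXAMPLES = 1000  # assumed from the "1k" in the dataset name
NUM_GPUS = 1         # assumed single Colab GPU

# Values taken from the TrainingArguments above.
PER_DEVICE_BATCH = 4
GRAD_ACCUM = 1
EPOCHS = 1
SAVE_STEPS = 25
LOGGING_STEPS = 25

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS
steps_per_epoch = math.ceil(NUM_EXAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS
checkpoints = total_steps // SAVE_STEPS
log_events = total_steps // LOGGING_STEPS

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps:      {total_steps}")
print(f"checkpoints written:  {checkpoints}")
print(f"log events:           {log_events}")
```

If the checkpoints consume too much Colab disk space, raising `save_steps` is the simplest adjustment.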


## Step 7: Fine-Tuning the Model

In [None]:
# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_args,
    dataset_text_field="text",
    max_seq_length=None,
    tokenizer=tokenizer,
    args=training_params,
    packing=False,
)

Let's train the model!

In [None]:
# Train model
trainer.train()

## Saving the Fine-Tuned Model and Tokenizer



In [None]:
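# Save the trained LoRA adapter weights (not the full base model)
trainer.model.save_pretrained(new_model)
# Save the tokenizer (the same object passed to the trainer) alongside the adapter
tokenizer.save_pretrained(new_model)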

In [None]:
from tensorboard import notebook
log_dir = "results/runs"
notebook.start("--logdir {} --port 4000".format(log_dir))

# Test the model
logging.set_verbosity(logging.CRITICAL)
prompt = "Who is Leonardo Da Vinci?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])
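
The generated text printed above includes the prompt as well as the model's answer. A small helper can strip the prompt and keep only the reply. This is a sketch: the function name `extract_reply` is hypothetical, and the sample string is made up rather than real model output.

```python
def extract_reply(generated_text):
    """Return the text after the last [/INST] tag, or the whole text if no tag is found."""
    marker = "[/INST]"
    _, found, reply = generated_text.rpartition(marker)
    return reply.strip() if found else generated_text.strip()


# Made-up string standing in for result[0]["generated_text"].
example_output = (
    "<s>[INST] Who is Leonardo Da Vinci? [/INST] Leonardo da Vinci was a painter."
)
print(extract_reply(example_output))
```

In the notebook, the same helper could be applied to `result[0]['generated_text']`.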

Credit for this notebook goes to this DataCamp article:
- https://www.datacamp.com/tutorial/fine-tuning-llama-2

# References

- https://www.datacamp.com/tutorial/fine-tuning-llama-2
- https://huggingface.co/docs/optimum/concept_guides/quantization
- https://arxiv.org/abs/2106.09685