## Fine-Tuning Llama 2 Model with QLoRA on a Custom Dataset

* **You can find the complete post for this notebook on [Medium](https://medium.com/@givkashi/fine-tuning-llama-2-model-with-qlora-on-a-custom-dataset-33126b94dee5)**

Fine-tuning a large language model such as Llama 2 can significantly improve its performance on a specific task or domain. This guide walks through fine-tuning the 7-billion-parameter Llama 2 model with the QLoRA technique on a custom dataset. We'll use a Kaggle P100 GPU with high RAM and load the base model in 4-bit precision to drastically reduce VRAM usage.
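To see why 4-bit loading matters on a 16 GB card like the P100, here is a rough estimate of the memory needed for the weights alone. It is only a sketch: it assumes a nominal 7 billion parameters and ignores quantization constants, activations, LoRA weights, and optimizer state. `weight_memory_gb` is a hypothetical helper, not part of any library.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # nominal model size; an assumption for illustration

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions the float16 weights alone would nearly fill the card, while the 4-bit weights leave room for activations and the LoRA adapter.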


### Step 1: Install and Import Required Libraries

First, ensure you have the necessary libraries installed. Run the following command to install them:


In [1]:
!pip install -q accelerate==0.26.1 peft==0.4.0 bitsandbytes==0.42.0 transformers==4.36.2 trl==0.4.7 datasets==2.16.1


**Next, import the required packages:**

In [2]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM, 
    AutoTokenizer, 
    BitsAndBytesConfig, 
    HfArgumentParser, 
    TrainingArguments, 
    pipeline, 
    logging
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer



### Step 2: Define Parameters and Configurations

#### Model and Dataset Parameters

In [3]:
model_name = "NousResearch/Llama-2-7b-hf"
dataset_name = "mlabonne/mini-platypus"
new_model = "llama-2-7b-mini-platypus"

#### QLoRA Parameters

In [4]:
lora_r = 64
lora_alpha = 16
lora_dropout = 0.1
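
For context, `lora_r` is the rank of the low-rank update matrices, `lora_alpha` scales the update by `lora_alpha / lora_r`, and `lora_dropout` is applied on the LoRA branch during training. The sketch below estimates how many parameters these settings would make trainable. It rests on assumptions not stated in this notebook: a hidden size of 4096, 32 decoder layers, and LoRA applied to two square attention projections per layer (`q_proj` and `v_proj`, which PEFT is believed to pick for Llama when no `target_modules` are given).

```python
def lora_params_per_module(r: int, d_in: int, d_out: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) next to a frozen weight."""
    return r * (d_in + d_out)


# Values from this notebook
lora_r, lora_alpha = 64, 16

# Assumptions about the Llama 2 7B architecture and the adapted modules
hidden_size = 4096
num_layers = 32
modules_per_layer = 2  # assumed: q_proj and v_proj

per_module = lora_params_per_module(lora_r, hidden_size, hidden_size)
trainable = per_module * modules_per_layer * num_layers
scaling = lora_alpha / lora_r

print(f"trainable LoRA parameters: {trainable:,}")
print(f"update scaling (alpha / r): {scaling}")
```

If the assumptions hold, roughly 33.6 million parameters are trained, well under 1% of the base model.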

#### BitsAndBytes Parameters

In [5]:
use_4bit = True
bnb_4bit_compute_dtype = "float16"
bnb_4bit_quant_type = "nf4"
use_nested_quant = False
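
`use_nested_quant` controls double quantization, which also quantizes the per-block scaling constants. As a rough illustration of what the flag trades off, the sketch below follows the figures in the QLoRA paper: blocks of 64 weights with a 32-bit constant each, or, with nesting, 8-bit constants that are themselves grouped in blocks of 256. These block sizes are assumptions taken from the paper, not values read from `bitsandbytes`, and `constant_bits_per_param` is a hypothetical helper.

```python
def constant_bits_per_param(block_size: int = 64,
                            nested: bool = False,
                            second_block_size: int = 256) -> float:
    """Extra bits per weight spent on quantization constants."""
    if not nested:
        return 32 / block_size
    # 8-bit first-level constants plus 32-bit second-level constants
    return 8 / block_size + 32 / (block_size * second_block_size)


NUM_PARAMS = 7e9  # nominal model size; an assumption for illustration

for nested in (False, True):
    bits = constant_bits_per_param(nested=nested)
    overhead_gb = NUM_PARAMS * bits / 8 / 1e9
    print(f"nested={nested}: {bits:.3f} bits/param, about {overhead_gb:.2f} GB overhead")
```

With `use_nested_quant = False`, as in this notebook, the overhead is the larger of the two figures. Enabling it would save a few hundred megabytes under these assumptions.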

#### Training Parameters

In [6]:
output_dir = "./results"
num_train_epochs = 1
fp16 = False
bf16 = False
per_device_train_batch_size = 4
per_device_eval_batch_size = 4
gradient_accumulation_steps = 1
gradient_checkpointing = True
max_grad_norm = 0.3
learning_rate = 2e-4
weight_decay = 0.001
optim = "paged_adamw_32bit"
lr_scheduler_type = "constant"
max_steps = -1
warmup_ratio = 0.03
group_by_length = True
save_steps = 25
logging_steps = 25
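
These settings determine how many optimizer steps the run takes. With `max_steps = -1`, the step count follows from `num_train_epochs` and the dataset size. The sketch below approximates that arithmetic, assuming the 1,000-example training split and a single GPU; `optimizer_steps` is a hypothetical helper that mirrors, but is not, the trainer's own logic.

```python
import math


def optimizer_steps(num_examples: int, per_device_batch: int, grad_accum: int,
                    epochs: int = 1, num_devices: int = 1) -> int:
    """Approximate number of optimizer updates for a full run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * epochs


total_steps = optimizer_steps(1000, per_device_batch=4, grad_accum=1)
warmup_steps = math.ceil(total_steps * 0.03)  # what warmup_ratio would imply

print(total_steps)   # 250
print(warmup_steps)  # 8
```

One caveat: to my knowledge the plain `"constant"` schedule in `transformers` does not apply warmup, so `warmup_ratio` would only take effect with a schedule such as `"constant_with_warmup"`.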

### Step 3: Load Data and Initialize Components


1. **Load Dataset:**

In [7]:
dataset = load_dataset(dataset_name, split="train")



2. **Configure BitsAndBytes for 4-bit Quantization:**


In [8]:
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
   load_in_4bit=use_4bit,
   bnb_4bit_quant_type=bnb_4bit_quant_type,
   bnb_4bit_compute_dtype=compute_dtype,
   bnb_4bit_use_double_quant=use_nested_quant,
)

3. **Check GPU Compatibility:**

In [9]:
if compute_dtype == torch.float16 and use_4bit:
   major, _ = torch.cuda.get_device_capability()
   if major >= 8:
       print("=" * 80)
       print("Your GPU supports bfloat16: accelerate training with bf16=True")
       print("=" * 80)

4. **Load the Llama 2 Model and Tokenizer:**


In [10]:
model = AutoModelForCausalLM.from_pretrained(
   model_name,
   quantization_config=bnb_config,
   device_map={"": 0}
)
model.config.use_cache = False
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Reuse the EOS token for padding. Adding a new '[PAD]' token would grow the
# vocabulary without resizing the model's embedding matrix.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


5. **Load LoRA Configuration:**

In [11]:
peft_config = LoraConfig(
   lora_alpha=lora_alpha,
   lora_dropout=lora_dropout,
   r=lora_r,
   bias="none",
   task_type="CAUSAL_LM",
)

6. **Set Training Parameters:**

In [12]:
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard",
)

7. **Initialize SFTTrainer:**

In [13]:
trainer = SFTTrainer(
   model=model,
   train_dataset=dataset,
   peft_config=peft_config,
   dataset_text_field="instruction",
   max_seq_length=None,
   tokenizer=tokenizer,
   args=training_arguments,
   packing=False,
)
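
The trainer above reads only the `instruction` column. If the dataset keeps the expected answer in a separate column, the model would never see the responses during training. The sketch below shows one way to build a single training string per example. It assumes a hypothetical `output` column holding the answer and the `### Instruction` / `### Response` template used in the test step. `to_training_text` and the `text` column name are illustrative, not part of the original notebook.

```python
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"


def to_training_text(example: dict) -> dict:
    """Join prompt and answer into one string under a new 'text' key."""
    instruction = example["instruction"]
    # Wrap the instruction only if it is not already in the template format.
    if not instruction.startswith("### Instruction:"):
        instruction = PROMPT_TEMPLATE.format(instruction=instruction)
    return {"text": instruction + example["output"]}


sample = {"instruction": "Name a prime number.", "output": "2"}
print(to_training_text(sample)["text"])
```

With Hugging Face `datasets`, this could be applied as `dataset = dataset.map(to_training_text)`, after which `dataset_text_field="text"` would point the trainer at the combined column.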





8. **Start Training:**

In [14]:
%%time
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 25   | 0.9815 |
| 50   | 1.3921 |
| 75   | 0.9062 |
| 100  | 1.1962 |
| 125  | 0.9056 |
| 150  | 1.1688 |
| 175  | 0.8902 |
| 200  | 1.1023 |
| 225  | 0.8651 |
| 250  | 1.1485 |




CPU times: user 31min 51s, sys: 19min 2s, total: 50min 54s
Wall time: 50min 51s


TrainOutput(global_step=250, training_loss=1.055658546447754, metrics={'train_runtime': 3050.7469, 'train_samples_per_second': 0.328, 'train_steps_per_second': 0.082, 'total_flos': 1.9977202635177984e+16, 'train_loss': 1.055658546447754, 'epoch': 1.0})
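
As a quick sanity check, the reported metrics can be reproduced from the logged numbers. The sketch below uses only the values printed above, plus the assumption that the training split holds 1,000 examples.

```python
# Losses logged every 25 steps, copied from the training log above
logged_losses = [0.9815, 1.3921, 0.9062, 1.1962, 0.9056,
                 1.1688, 0.8902, 1.1023, 0.8651, 1.1485]

num_examples = 1000          # assumed size of the training split
train_runtime_s = 3050.7469  # 'train_runtime' from TrainOutput
global_step = 250            # 'global_step' from TrainOutput

mean_logged_loss = sum(logged_losses) / len(logged_losses)
samples_per_second = num_examples / train_runtime_s
steps_per_second = global_step / train_runtime_s

print(f"mean logged loss:   {mean_logged_loss:.4f}")
print(f"samples per second: {samples_per_second:.3f}")
print(f"steps per second:   {steps_per_second:.3f}")
```

The mean of the logged losses agrees with the reported `training_loss` of about 1.0557, and both throughput figures match `train_samples_per_second` and `train_steps_per_second`.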

### Step 4: Save the Fine-Tuned Model

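After training, save the fine-tuned weights. Because the model was trained through PEFT, `save_pretrained` writes only the LoRA adapter, not a full copy of the base model: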

In [15]:
trainer.model.save_pretrained(new_model)
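
To get a standalone model, the saved adapter can be merged back into the base weights. The function below is a minimal sketch of that step, not part of the original notebook: it assumes `peft`, `transformers`, and `torch` are installed, that a GPU is available as device 0, and that there is enough free memory to reload the base model in float16. `merge_adapter` and the `merged_dir` argument are hypothetical names.

```python
def merge_adapter(base_model_name: str, adapter_dir: str, merged_dir: str) -> None:
    """Reload the base model in float16, merge the LoRA adapter, and save."""
    # Imported lazily so that defining the function needs no heavy packages.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
        device_map={"": 0},
    )
    # Attach the adapter, then fold its weights into the base layers.
    merged = PeftModel.from_pretrained(base_model, adapter_dir).merge_and_unload()
    merged.save_pretrained(merged_dir)
    AutoTokenizer.from_pretrained(base_model_name).save_pretrained(merged_dir)
```

A call might look like `merge_adapter(model_name, new_model, "llama-2-7b-mini-platypus-merged")`, where the output directory name is only an example. On a 16 GB GPU it is probably safest to restart the session first, so that the 4-bit training model no longer occupies memory.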

### Step 5: Test the Fine-Tuned Model

**Finally, test the fine-tuned model to ensure it works as expected:**

In [16]:
prompt = "What is a large language model?"
instruction = f"### Instruction:\n{prompt}\n\n### Response:\n"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=128)
result = pipe(instruction)
print(result[0]['generated_text'][len(instruction):])




A large language model (LLM) is a type of artificial intelligence (AI) model that uses deep learning techniques to generate human-like text. LLMs are trained on vast amounts of data, including text from books, articles, and other sources, to learn the patterns and relationships between words and phrases. They can then generate new text based on these patterns and relationships, often producing coherent and contextually appropriate sentences.

LLMs have been developed for various applications, such as chatbots, content generation
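
The reply stops mid-sentence because `max_length=128` counts the prompt tokens together with the generated ones. The sketch below shows one possible alternative: it limits only the newly generated tokens with `max_new_tokens` and asks the pipeline to return just the completion with `return_full_text=False`. `build_prompt` and `generate_response` are hypothetical helpers, and the default of 256 new tokens is an arbitrary choice.

```python
def build_prompt(question: str) -> str:
    """Wrap a question in the instruction template used in this guide."""
    return f"### Instruction:\n{question}\n\n### Response:\n"


def generate_response(model, tokenizer, question: str,
                      max_new_tokens: int = 256) -> str:
    """Generate a completion for one question and return only the new text."""
    # Imported lazily so that defining the function needs no heavy packages.
    from transformers import pipeline

    pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer)
    result = pipe(
        build_prompt(question),
        max_new_tokens=max_new_tokens,  # budget for generated tokens only
        return_full_text=False,         # drop the prompt from the output
    )
    return result[0]["generated_text"]


print(build_prompt("What is a large language model?"))
```

With the model and tokenizer from the earlier steps, `generate_response(model, tokenizer, "What is a large language model?")` would replace the manual slicing by `len(instruction)`.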


### Conclusion

By following these steps, you have successfully fine-tuned a Llama 2 model using QLoRA on a custom dataset. This process enables you to tailor large language models to specific tasks or domains while optimizing for limited VRAM resources. Happy fine-tuning!