# FINE TUNING TECHNIQUES

There are different ways to fine-tune LLMs. The most well-known techniques are listed below:

1. Supervised fine-tuning
2. Direct preference optimization
3. Reinforcement learning from human feedback


### 1. SFT
SFT (supervised fine-tuning) is the most common fine-tuning technique. We provide the LLM with pairs of inputs and desired outputs, and train it to produce the output given the input. For example, the input can be "What is a black hole?", and the output can be "A black hole is a cosmic object that pulls objects towards itself".
### 2. DPO
Another technique is DPO (direct preference optimization). Here, we construct the dataset so that each sample contains a question, a preferred answer, and a dispreferred answer. The LLM's objective is to make the preferred answer more likely than the dispreferred one.
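As a rough illustration of that objective, the sketch below computes the standard DPO loss for a single sample from sequence log-probabilities. The function name, argument names, and the `beta` default are assumptions made for this example; in practice a library such as TRL computes this over batches of token log-probabilities.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (question, preferred, dispreferred) sample."""
    # How much more the model being trained favors each answer
    # than the frozen reference model does
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)), computed as softplus(-z) in a numerically stable form
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Same preferences as the reference model: no signal yet
loss_equal = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Preferred answer became more likely, dispreferred less likely: lower loss
loss_better = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(loss_equal, loss_better)
```

The loss falls as the model raises the probability of the preferred answer relative to the dispreferred one, measured against the reference model.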
### 3. RLHF
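A third technique is RLHF (reinforcement learning from human feedback), which was also used to train ChatGPT. Here, each sample in the dataset contains a question, an answer, and human feedback on that answer. The feedback is typically used to train a reward model, and the LLM is then optimized with reinforcement learning to produce answers that the reward model scores highly.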
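To make the difference between the three dataset layouts concrete, here is a minimal sketch of what one sample might look like in each case. The field names, the dispreferred answer, and the feedback score are hypothetical and chosen only for illustration; real datasets and libraries use their own schemas.

```python
# One illustrative sample per technique (all field names are assumptions)
sft_sample = {
    "input": "What is a black hole?",
    "output": "A black hole is a cosmic object that pulls objects towards itself",
}

dpo_sample = {
    "question": "What is a black hole?",
    "preferred": "A black hole is a cosmic object that pulls objects towards itself",
    "dispreferred": "I don't know.",  # made-up weaker answer
}

rlhf_sample = {
    "question": "What is a black hole?",
    "answer": "A black hole is a cosmic object that pulls objects towards itself",
    "feedback": 0.9,  # made-up human rating on a 0-1 scale
}

for name, sample in [("SFT", sft_sample), ("DPO", dpo_sample), ("RLHF", rlhf_sample)]:
    print(name, sorted(sample))
```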


Links to relevant articles:

1. https://mer.vin/2024/03/comparative-overview-sft-dpo-rlhf

2. https://tulsipatro29.medium.com/mathematical-intuition-behind-lora-qlora-411c9f2cabbe

In an LLM, the weights and biases are stored as matrices. By default, each element of these parameters is stored in FP32 format (full precision, also called single precision), meaning each value takes 32 bits. For a 70B-parameter model, 32-bit weights make the model very heavy.
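To put a number on "very heavy", the sketch below estimates the memory needed just to hold the weights at different bit widths. The helper name is made up for this example, and the estimate ignores activations, optimizer state, and any framework overhead.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to store the parameters alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 70_000_000_000  # a 70B-parameter model
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8)]:
    print(f"{name}: {weight_memory_gb(num_params, bits):.0f} GB")
```

At 32 bits per parameter, the weights alone take 280 GB, which is why lower-precision formats matter for large models.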

Therefore, we can convert the model from a higher-precision format to a lower-precision one, for example from FP32 to INT8. This process is called quantization. It makes inference much cheaper because every parameter is stored in 8 bits instead of 32.

NOTE: FP16 is also called half precision. Quantization also causes some loss of accuracy.
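The accuracy loss can be seen in a small round trip. The sketch below uses absmax scaling, one simple INT8 scheme chosen here for illustration (libraries such as bitsandbytes use more elaborate variants), and the weight values are made up.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid dividing by zero for all-zero input
    return [round(v / scale) for v in values], scale

def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.12, -0.53, 0.98, -1.27, 0.004]  # made-up example weights
quantized, scale = quantize_absmax_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(quantized)
print(restored)
print(f"largest rounding error: {max_error:.4f}")
```

Each restored value differs from the original by at most half a quantization step, and small weights such as `0.004` can collapse to zero. This is the loss of accuracy mentioned in the note.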


Modes of quantization:
1. Post-training quantization (PTQ) - we take a pretrained model, quantize it, and use it directly. Some accuracy is lost here.
2. Quantization-aware training (QAT) - we take a pretrained model, quantize it, fine-tune it on a different dataset, and then run inference. This is the approach we are going to implement in this notebook.

# LoRA - Low-Rank Adaptation

LoRA is a fine-tuning technique that enables efficient adaptation of large language models (LLMs). In traditional fine-tuning, all model weights are updated, which can be computationally expensive and memory-intensive. LoRA offers a more efficient approach by introducing the following steps:

1. **Freeze the Original Weights**:
   - The pre-trained weights of the model are kept unchanged during fine-tuning to preserve the knowledge they contain.

2. **Introduce Low-Rank Matrices**:
   - Instead of directly modifying the original weight matrix, LoRA represents the weight update as the product of two smaller low-rank matrices \( A \) and \( B \), which is added on top of the frozen weights.

3. **Optimize the Low-Rank Matrices**:
   - Only \( A \) and \( B \) are optimized during fine-tuning, drastically reducing the number of parameters and memory required.
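The three steps above can be sketched in plain Python. This is a minimal illustration, not the PEFT implementation: it follows this document's naming, where \( A \) has shape `d_out x r` and \( B \) has shape `r x d_in` (the LoRA paper names the two factors the other way around), and it assumes the usual `alpha / r` scaling of the update.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * A (B x). W stays frozen; only A and B are trained."""
    base = matvec(W, x)               # frozen pre-trained path
    update = matvec(A, matvec(B, x))  # low-rank path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

def trainable_params(d_out, d_in, r):
    """Number of values in A (d_out x r) plus B (r x d_in)."""
    return r * (d_out + d_in)

# Parameter savings for one 4096 x 4096 projection with rank 64
full = 4096 * 4096
lora = trainable_params(4096, 4096, 64)
print(f"full fine-tuning: {full} values, LoRA: {lora} values ({lora / full:.1%})")
```

Because the update goes through a rank-`r` bottleneck, the number of trainable values grows with `r * (d_out + d_in)` instead of `d_out * d_in`.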

---

### Example: Fine-Tuning a 3x3 Weight Matrix

Let's consider a smaller-scale example. Suppose our model has a 3x3 weight matrix. During fine-tuning with LoRA, we freeze the original weight matrix and introduce two smaller matrices of rank 1, A and B, with dimensions 3x1 and 1x3, respectively. Multiplying A and B gives a 3x3 matrix, the same shape as the original weight matrix, and this product is added to the frozen weights as the update. This way, we train only 6 values instead of 9, and the saving grows quickly as the matrices get larger. Our objective is to optimize these smaller matrices during the fine-tuning process.
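A small sketch of this 3x3 case is shown below. Note that the product of A and B is an update added to the frozen matrix, not a copy of it. All numeric values are made up, and starting A at zero is an assumption that mirrors the common practice of initializing one factor to zero so that training starts from the unchanged pre-trained weights.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def effective_weights(W, A, B):
    """Frozen weights plus the low-rank update A @ B."""
    delta = matmul(A, B)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]        # frozen 3x3 weight matrix
A = [[0.0], [0.0], [0.0]]    # 3x1, starts at zero
B = [[0.5, -0.2, 0.1]]       # 1x3

# Before training the update is zero, so the model behaves exactly like the original
print(effective_weights(W, A, B) == W)

# Hypothetical values of A after a few training steps
A_trained = [[1.0], [2.0], [3.0]]
W_eff = effective_weights(W, A_trained, B)
print(W_eff)

trainable = len(A) * len(A[0]) + len(B) * len(B[0])  # 3 + 3 values
print(f"trainable values: {trainable} instead of {len(W) * len(W[0])}")
```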

# QLoRA - QUANTIZED LORA (LORA 2.0)

In LoRA, we saw that we create low-rank matrices and optimize them during the training process, while the frozen base weights are typically kept in a 16-bit format (FP16 or BF16). QLoRA goes one step further and quantizes the frozen base weights to 4-bit. The low-rank matrices stay in higher precision and are still the only parameters being trained.
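As a back-of-the-envelope comparison, the sketch below estimates weight memory when the frozen base model is held in 16-bit versus 4-bit, with the adapter matrices kept in 16-bit in both cases. The parameter counts are round, assumed figures, and the estimate ignores activations, optimizer state, and the small overhead of quantization constants.

```python
def size_gb(num_params, bits):
    """Storage for num_params values at the given bit width, in gigabytes."""
    return num_params * bits / 8 / 1e9

base_params = 7_000_000_000    # assumed round figure for a 7B model
adapter_params = 100_000_000   # hypothetical adapter size; depends on rank and target modules

lora_gb = size_gb(base_params, 16) + size_gb(adapter_params, 16)
qlora_gb = size_gb(base_params, 4) + size_gb(adapter_params, 16)

print(f"16-bit base + adapters: {lora_gb:.1f} GB")
print(f" 4-bit base + adapters: {qlora_gb:.1f} GB")
```

Almost all of the saving comes from the frozen base weights, which is why a 7B model becomes trainable on a single consumer or Colab-class GPU.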


# Dataset information
The dataset was taken from Hugging Face and has 1,000 samples. The code below loads it as `mlabonne/guanaco-llama2-1k`. The link is https://huggingface.co/mlabonne/llama-2-7b-guanaco

# FINE TUNING LLAMA2 USING SUPERVISED FINE-TUNING

## 0. Install dependencies

In [None]:
!pip install transformers==4.46.0
!pip install -q -U accelerate peft trl
!pip install -U bitsandbytes

Collecting transformers==4.46.0
  Downloading transformers-4.46.0-py3-none-any.whl.metadata (44 kB)
Reason for being yanked: This version unfortunately does not work with 3.8 but we did not drop the support yet
Downloading transformers-4.46.0-py3-none-any.whl (10.0 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.46.3
    Uninstalling transformers-4.46.3:
      Successfully uninstalled transformers-4.46.3
Successfully installed transformers-4.46.0


Associated code link: https://mlabonne.github.io/blog/posts/Fine_Tune_Your_Own_Llama_2_Model_in_a_Colab_Notebook.html

Associated HuggingFace repo containing model weights: https://huggingface.co/Maaz66/llama-2-7b-miniguanaco/tree/main

Associated Kaggle notebook (it was not saved, so the data is lost): https://www.kaggle.com/code/moaaznnt/beautiful-notebook/edit

## 1. Importing dependencies

In [None]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

## 2. Defining parameters

In [None]:
# The model that you want to train from the Hugging Face hub
model_name = "NousResearch/llama-2-7b-chat-hf"

# The instruction dataset to use
dataset_name = "mlabonne/guanaco-llama2-1k"

# Fine-tuned model name
new_model = "llama-2-7b-miniguanaco"

# Save results
output_dir = "./results"

# Load the entire model on the GPU 0
device_map = {"": 0}

In [None]:
base_model=model_name

## 3. Loading pretrained model with quantization

In [None]:
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model

# Load base model (Llama 2 7B chat) with 4-bit NF4 quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit= True,
    bnb_4bit_quant_type= "nf4",
    bnb_4bit_compute_dtype= torch.bfloat16,
    bnb_4bit_use_double_quant= False,
)

model = AutoModelForCausalLM.from_pretrained(
        base_model,
        quantization_config=bnb_config,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
tokenizer.add_bos_token, tokenizer.add_eos_token

# Add the LoRA adapters to the attention and gate projection layers
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj","gate_proj"]
)
model = get_peft_model(model, peft_config)


## 4. Loading dataset and setting up SFTTrainer

In [None]:
# Load dataset
dataset = load_dataset(dataset_name, split="train")

In [None]:
# Hyperparameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard"
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=None,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
)

Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.



In [None]:
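import os
# Restrict training to GPU 0. Note: this only takes effect if it is set before CUDA is
# initialized, so it is safest to run this at the very top of the notebook.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"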

## 5. Training the model

In [None]:
# Train model
trainer.train()



| Step | Training Loss |
|------|---------------|
| 25   | 1.2891 |
| 50   | 1.5456 |
| 75   | 1.1937 |
| 100  | 1.4177 |
| 125  | 1.1692 |
| 150  | 1.3483 |
| 175  | 1.1641 |
| 200  | 1.4458 |
| 225  | 1.1465 |
| 250  | 1.5156 |




TrainOutput(global_step=250, training_loss=1.323565788269043, metrics={'train_runtime': 5514.3948, 'train_samples_per_second': 0.181, 'train_steps_per_second': 0.045, 'total_flos': 1.7289112257921024e+16, 'train_loss': 1.323565788269043, 'epoch': 1.0})
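The numbers in this output can be cross-checked against the training arguments. The sketch below redoes the arithmetic using the dataset size, batch size, and runtime reported above. The single-GPU setting is an assumption, and the variable names are chosen for this example.

```python
import math

num_samples = 1000                # size of the training split
per_device_train_batch_size = 4
gradient_accumulation_steps = 1
num_train_epochs = 1
num_devices = 1                   # assumption: a single GPU

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
total_steps = math.ceil(num_samples / effective_batch) * num_train_epochs

train_runtime = 5514.3948         # seconds, from the TrainOutput above
samples_per_second = num_samples * num_train_epochs / train_runtime
steps_per_second = total_steps / train_runtime
seconds_per_step = train_runtime / total_steps

print(total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
print(f"{seconds_per_step:.1f} s per step")
```

The computed step count and throughput agree with `global_step`, `train_samples_per_second`, and `train_steps_per_second` in the output, which works out to roughly 22 seconds per optimizer step.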

## 6. Save the fine-tuned model

In [None]:

# Save trained model
trainer.model.save_pretrained(new_model)

## 7. Ask the fine-tuned model some questions

In [None]:
import warnings
warnings.filterwarnings("ignore")

# Run text generation pipeline with our new model
prompt = "What is a large language model?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] What is a large language model? [/INST]<s>[INST] A large language model is a type of artificial intelligence model that is trained on large amounts of text data. It is designed to generate text that is similar to the training data. This type of model is used in natural language processing applications, such as chatbots, language translation, and text summarization. [/INST] A large language model is a type of artificial intelligence model that is trained on large amounts of text data. It is designed to generate text that is similar to the training data. This type of model is used in natural language processing applications, such as chatbots, language translation, and text summarization. The model can be trained on a wide range of text data, including books, articles, and websites. The goal of the model is to generate text that is both coherent and relevant to the context in which it is being used. This can be useful
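The generated text includes the prompt, and in this run the model also echoed an `[INST] ... [/INST]` block before answering. A small helper like the one below can pull out only the final answer. The function name is made up for this example, and it assumes the answer is whatever follows the last `[/INST]` tag.

```python
def extract_answer(generated_text, end_tag="[/INST]"):
    """Return the text after the last instruction end tag, stripped of whitespace."""
    # rpartition returns the whole text as the last element if the tag is missing
    _, _, answer = generated_text.rpartition(end_tag)
    return answer.strip()

sample = "<s>[INST] What is a large language model? [/INST] A large language model is a type of AI model."
print(extract_answer(sample))
```

With the pipeline output above, this would be called as `extract_answer(result[0]['generated_text'])`.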


## 8. Push the model to HuggingFace Hub

In [None]:
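# Read the Hugging Face token from an environment variable (HF_TOKEN is an assumed name)
# instead of hard-coding it in the notebook
secret_hf = os.environ["HF_TOKEN"]
!huggingface-cli login --token $secret_hf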

model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)

## 9. Load the model back from the HuggingFace hub and make inference

In [None]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

In [None]:
model_name = "NousResearch/llama-2-7b-chat-hf"
device_map = {"": 0}

# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)

In [None]:
new_model = "Maaz66/llama-2-7b-miniguanaco"
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


In [None]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our new model
prompt = "What is a large language model?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=1024)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] What is a large language model? [/INST]  A large language model is a type of artificial intelligence (AI) model that is trained on a large corpus of text data to generate language outputs that are coherent and natural-sounding. The model is designed to learn the patterns and structures of language by analyzing a large amount of text data, such as books, articles, and websites.

Large language models are typically trained using deep learning techniques, such as recurrent neural networks (RNNs) or transformer networks, and are designed to generate text that is similar to the training data. The models can be used for a variety of natural language processing tasks, such as language translation, text summarization, and language generation.

Some examples of large language models include:

1. BERT (Bidirectional Encoder Representations from Transformers): Developed by Google, BERT is a powerful language model that has achieved state-of-the-art results on a wide range of natural lan