In this notebook, we apply LoRA (Low-Rank Adaptation), a Parameter-Efficient Fine-Tuning (PEFT) technique, to Mistral 7B and compare the model's output before and after fine-tuning. PEFT methods adapt a pretrained LLM to a specific task by training only a small number of additional parameters while the original weights stay frozen, which keeps memory and compute requirements low.


This notebook runs on a **T4 GPU**.

In [1]:
!pip install -q torch                                  # Pytorch
!pip install -q transformers datasets                  # Comes from HuggingFace
!pip install -q bitsandbytes                           # For quantization from HuggingFace
!pip install -q peft                                   # Parameter-efficient Fine-tuning from HuggingFace
!pip install -q trl                                    # For supervised fine-tuning for LLMs from HuggingFace
!pip install -q accelerate                             # For distributed training from HuggingFace


**1. Loading Mistral Model**

In [5]:
from google.colab import userdata

HF_API_TOKEN = "insert token"

import os
os.environ["HF_TOKEN"] = HF_API_TOKEN

In [None]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.3")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/601 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.55G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

**GPU Out of Memory!**

To reduce the model's memory footprint, we can decrease the memory usage of each parameter through model quantization techniques.
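A rough back-of-the-envelope estimate shows why the unquantized load runs out of memory. The sketch below assumes that the three checkpoint shards in the download log (4.55 GB, 5.00 GB, 4.95 GB) store the weights in 16-bit precision and that a T4 offers a nominal 16 GB of memory. Both are assumptions rather than measurements, and activations and CUDA overhead are ignored.

```python
# Rough weight-memory estimate; ignores activations, KV cache, and CUDA overhead.
SHARD_SIZES_GB = [4.55, 5.00, 4.95]   # shard sizes taken from the download log
BYTES_PER_PARAM_ON_DISK = 2           # assumption: checkpoint is stored in 16-bit
T4_MEMORY_GB = 16                     # assumption: nominal T4 memory

checkpoint_gb = sum(SHARD_SIZES_GB)
n_params = checkpoint_gb * 1e9 / BYTES_PER_PARAM_ON_DISK


def weight_memory_gb(bits_per_param):
    """Memory needed for the weights alone at a given precision."""
    return n_params * bits_per_param / 8 / 1e9


for label, bits in [("float32", 32), ("float16/bfloat16", 16), ("4-bit", 4)]:
    gb = weight_memory_gb(bits)
    verdict = "fits" if gb < T4_MEMORY_GB else "does not fit"
    print(f"{label:>17}: {gb:5.1f} GB -> {verdict} in {T4_MEMORY_GB} GB")
```

If the weights end up in 32-bit precision at load time, they alone need roughly twice the checkpoint size, which is well beyond the card. Even at 16-bit the weights leave almost no headroom, whereas 4-bit weights take only a few gigabytes.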

**2. Use Mistral 7B Model with Quantization Config**


In [1]:
# Load BitsAndBytes from HuggingFace Transformers
from transformers import BitsAndBytesConfig
import torch

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # Store the weights in 4-bit precision
    bnb_4bit_use_double_quant=True,        # Double quantization: also quantize the quantization constants, saving roughly 0.4 bits per parameter
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat (NF4): a data type designed for normally distributed weights, normalized to [-1, 1]
    bnb_4bit_compute_dtype=torch.bfloat16  # Dequantize to 16-bit for computation (as described in the QLoRA paper)
)

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.3",
                                             quantization_config=quant_config)



Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
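The "0.4 bits per parameter" mentioned in the `bnb_4bit_use_double_quant` comment can be reproduced with a short calculation. The sketch below uses the block sizes reported in the QLoRA paper (64 weights per block, 256 constants per second-level block). Whether the installed bitsandbytes version uses exactly these values is an assumption.

```python
# Overhead of quantization constants, in bits per model parameter.
WEIGHT_BLOCK = 64    # weights sharing one quantization constant (QLoRA paper value)
CONST_BLOCK = 256    # first-level constants sharing one second-level constant

# Without double quantization: one 32-bit constant per block of weights.
single = 32 / WEIGHT_BLOCK

# With double quantization: 8-bit first-level constants, plus one 32-bit
# second-level constant per CONST_BLOCK first-level constants.
double = 8 / WEIGHT_BLOCK + 32 / (WEIGHT_BLOCK * CONST_BLOCK)

saving = single - double
print(f"single: {single:.3f} bits/param")
print(f"double: {double:.3f} bits/param")
print(f"saving: {saving:.3f} bits/param")
```

The saving comes out at about 0.37 bits per parameter, which the comment rounds to 0.4.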

**3. Loading the Tokenizer**

In [6]:
from transformers import AutoTokenizer
import os
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3", token=os.environ["HF_TOKEN"])  # `use_auth_token` is deprecated in favor of `token`



tokenizer_config.json:   0%|          | 0.00/137k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/587k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

In [7]:
# Mistral's tokenizer defines no pad token, so reuse EOS; padding makes all sequences in a batch the same length.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
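To make the effect of right padding concrete, here is a minimal pure-Python sketch of what a tokenizer does when it pads a batch. The helper `pad_right` and the token ids are hypothetical and only illustrate the idea; the real tokenizer handles this internally.

```python
def pad_right(batch, pad_id):
    """Pad every sequence on the right to the length of the longest one.

    Returns the padded ids and an attention mask (1 = real token, 0 = padding).
    """
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


EOS_ID = 2  # hypothetical id; the notebook reuses the EOS token as the pad token
ids, mask = pad_right([[5, 6, 7], [8]], pad_id=EOS_ID)
print(ids)   # [[5, 6, 7], [8, 2, 2]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

Because the pad token equals the EOS token here, the attention mask is the only way to tell padding apart from a genuine end-of-sequence token.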

**4. Testing the Model**

In [8]:
prompt = """
What do you know about Germany?
"""

inputs = tokenizer(prompt, return_tensors='pt')

inputs = inputs.to('cuda')

output_tokens = model.generate(
    **inputs,                             # pass both input_ids and attention_mask
    max_new_tokens=40,
    pad_token_id=tokenizer.eos_token_id
)[0]

output = tokenizer.decode(output_tokens, skip_special_tokens=True)

import textwrap
print(textwrap.fill(output, width=80))



 What do you know about Germany?  Germany is a country in Central Europe. It is
bordered to the north by the North Sea, to the northeast by the Baltic Sea, to
the east by Poland and the


## Test Model on Dialogue Summarization Task

In [9]:
# Get a dialogue example:
dialogue = f"""
#Person1#: Ms. Dawson, I need you to take a dictation for me.
#Person2#: Yes, sir...
#Person1#: This should go out as an intra-office memorandum to all employees by this afternoon. Are you ready?
#Person2#: Yes, sir. Go ahead.
#Person1#: Attention all staff... Effective immediately, all office communications are restricted to email correspondence and official memos. The use of Instant Message programs by employees during working hours is strictly prohibited.
#Person2#: Sir, does this apply to intra-office communications only? Or will it also restrict external communications?
#Person1#: It should apply to all communications, not only in this office between employees, but also any outside communications.
#Person2#: But sir, many employees use Instant Messaging to communicate with their clients.
#Person1#: They will just have to change their communication methods. I don't want any - one using Instant Messaging in this office. It wastes too much time! Now, please continue with the memo. Where were we?
#Person2#: This applies to internal and external communications.
#Person1#: Yes. Any employee who persists in using Instant Messaging will first receive a warning and be placed on probation. At second offense, the employee will face termination. Any questions regarding this new policy may be directed to department heads.
#Person2#: Is that all?
#Person1#: Yes. Please get this memo typed up and distributed to all employees before 4 pm.
"""

prompt = f"""
Summarize the following conversation.

### Input: {dialogue}

### Summary:
"""

print(prompt)


Summarize the following conversation.

### Input: 
#Person1#: Ms. Dawson, I need you to take a dictation for me.
#Person2#: Yes, sir...
#Person1#: This should go out as an intra-office memorandum to all employees by this afternoon. Are you ready?
#Person2#: Yes, sir. Go ahead.
#Person1#: Attention all staff... Effective immediately, all office communications are restricted to email correspondence and official memos. The use of Instant Message programs by employees during working hours is strictly prohibited.
#Person2#: Sir, does this apply to intra-office communications only? Or will it also restrict external communications?
#Person1#: It should apply to all communications, not only in this office between employees, but also any outside communications.
#Person2#: But sir, many employees use Instant Messaging to communicate with their clients.
#Person1#: They will just have to change their communication methods. I don't want any - one using Instant Messaging in this office. It wastes t

In [10]:
# Test the model
inputs = tokenizer(prompt, return_tensors='pt')
inputs = inputs.to("cuda")

output_tokens = model.generate(**inputs,  # pass both input_ids and attention_mask
                               max_new_tokens=40,
                               pad_token_id=tokenizer.eos_token_id)[0]
output = tokenizer.decode(output_tokens, skip_special_tokens=True)
res = output.replace(prompt,"")
res

'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'
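Removing the prompt with `output.replace(prompt, "")` only works when the decoded text reproduces the prompt character for character. A possible alternative, sketched here with plain lists and made-up token ids, is to drop the prompt by token count before decoding.

```python
# Hypothetical token ids; in the notebook these would come from the tokenizer.
prompt_ids = [101, 7, 8, 9]
output_ids = prompt_ids + [42, 43]   # generate() returns prompt + new tokens

# Keep only the tokens produced after the prompt.
generated_ids = output_ids[len(prompt_ids):]
print(generated_ids)  # [42, 43]
```

With the tensors used in the notebook, the equivalent slice would be `output_tokens[inputs["input_ids"].shape[1]:]`, passed to `tokenizer.decode`.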

The model's response is clearly poor: it only emits newlines and does not attempt the task. To address this, we **fine-tune** the model on dialogue summarization, specifically with instruction fine-tuning using LoRA (Low-Rank Adaptation).

# Instruction Fine-Tuning: LoRA

**1. Loading the Dataset**

This dataset consists of 13,460 dialogues (plus 100 holdout samples for topic generation) along with their manually labeled summaries and topics. The dataset is acquired from HuggingFace under the ID: [knkarthick/dialogsum](https://huggingface.co/datasets/knkarthick/dialogsum).

This dataset contains three main features:
- **Dialogue**: The text of the dialogue.
- **Summary**: A human-written summary of the dialogue.
- **Topic**: A human-written topic or one-liner describing the dialogue.

In [11]:
from datasets import load_dataset

dataset = load_dataset("knkarthick/dialogsum")

README.md:   0%|          | 0.00/4.65k [00:00<?, ?B/s]

train.csv:   0%|          | 0.00/11.3M [00:00<?, ?B/s]

validation.csv:   0%|          | 0.00/442k [00:00<?, ?B/s]

test.csv:   0%|          | 0.00/1.35M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/12460 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/500 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1500 [00:00<?, ? examples/s]

In [12]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

The dataset is divided into training, validation, and test sets, with **12,460**, **500**, and **1,500** examples, respectively.

In [13]:
dataset["train"][0]

{'id': 'train_0',
 'dialogue': "#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?\n#Person2#: I found it would be a good idea to get a check-up.\n#Person1#: Yes, well, you haven't had one for 5 years. You should have one every year.\n#Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor?\n#Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.\n#Person2#: Ok.\n#Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?\n#Person2#: Yes.\n#Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.\n#Person2#: I've tried hundreds of times, but I just can't seem to kick the habit.\n#Person1#: Well, we have classes and some medications that might help. I'll give you more information before you leave.\n#Person2#: Ok, thanks doctor.",
 'summary': "Mr. Smith'

**2. Data Preprocessing**

Given a dialogue `D` and its human-written summary `S`, we will use the following prompt format:
```
### Instruction:
Summarize the following conversation.

### Input:
{D}

### Summary:
{S}
```

In [14]:
def format_instruction(dialogue, summary):
    return f"""### Instruction:
Summarize the following conversation.

### Input:
{dialogue.strip()}

### Summary:
{summary.strip()}
"""

In [15]:
def convert_to_instruction_format(data_point):
    return {
        "text": format_instruction(data_point["dialogue"], data_point["summary"])
        }

In [16]:
def process_dataset(data):
    return data.map(
        convert_to_instruction_format
        ).remove_columns(['id', 'topic', 'dialogue', 'summary']) #removing unnecessary columns
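The test prompts used elsewhere in this notebook omit the `### Instruction:` header and place the dialogue on the same line as `### Input:`, so they differ slightly from the training template. A hypothetical helper such as `format_inference_prompt` below (not part of the original notebook) would build an inference prompt that matches the training text exactly up to the summary. Whether this changes the output quality is untested here.

```python
def format_instruction(dialogue, summary):
    # Copy of the training template, repeated so this block is self-contained.
    return f"""### Instruction:
Summarize the following conversation.

### Input:
{dialogue.strip()}

### Summary:
{summary.strip()}
"""


def format_inference_prompt(dialogue):
    """Hypothetical helper: the training template with the summary left empty."""
    return f"""### Instruction:
Summarize the following conversation.

### Input:
{dialogue.strip()}

### Summary:
"""


example_dialogue = "#Person1#: Hi.\n#Person2#: Hello."
train_text = format_instruction(example_dialogue, "Two people greet each other.")
prompt = format_inference_prompt(example_dialogue)

# The training text begins with exactly the inference prompt.
print(train_text.startswith(prompt))  # True
```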

In [17]:
# Take a random sample from the training and validation sets and apply process_dataset to them.
train_data = process_dataset(
    dataset["train"].shuffle(seed=42).select(range(500))
    )
validation_data  = process_dataset(
    dataset["validation"].shuffle(seed=42).select(range(50))
    )

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/50 [00:00<?, ? examples/s]

In [18]:
train_data

Dataset({
    features: ['text'],
    num_rows: 500
})

In [19]:
print(train_data['text'][0])

### Instruction:
Summarize the following conversation.

### Input:
#Person1#: Hello, Anna speaking!
#Person2#: Hey, Anna, this is Jason.
#Person1#: Jason, where have you been hiding lately? You know it's been a long time since your last call. Have you been good?
#Person2#: Yes. How are you, Anna?
#Person1#: I am fine. What have you been doing?
#Person2#: Working. I've been really busy these days. I got a promotion.
#Person1#: That's great, congratulations!
#Person2#: Thanks. I am feeling pretty good about myself too. You know, bigger office, a raise and even an assistant.
#Person1#: That's good. So I guess I'll have to make an appointment to see you.
#Person2#: You are kidding.
#Person1#: How long have you been working there?
#Person2#: A bit over two years. This is a fast-moving company, and seniority isn't the only factor in deciding promotions.
#Person1#: How do you like your new boss?
#Person2#: She is very nice and open-minded.
#Person1#: Much better than the last one, huh?
#Perso

**3. PEFT Setup**

In [20]:
print(model)

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32768, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): MistralRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): Mist

Looking at the model architecture, LoRA adapters could be attached to every linear layer in the model. Here we attach them only to the self-attention projections, ["q_proj", "k_proj", "v_proj", "o_proj"], which is a common choice. These matrices implement the attention mechanism, and adapting them with LoRA is often enough to improve task performance noticeably while keeping the number of trainable parameters small.
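The update that LoRA learns for each targeted weight matrix can be written compactly. As a sketch in standard LoRA notation (the symbols $W_0$, $d_{\text{in}}$ and $d_{\text{out}}$ are introduced here for illustration and are not names from the notebook):

$$
W' = W_0 + \frac{\alpha}{r}\, B A, \qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\quad A \in \mathbb{R}^{r \times d_{\text{in}}}
$$

Only $A$ and $B$ are trained while $W_0$ stays frozen, so each adapted matrix adds $r\,(d_{\text{in}} + d_{\text{out}})$ trainable parameters instead of $d_{\text{in}} \cdot d_{\text{out}}$.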

In [21]:
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                                       # Rank of the LoRA matrices A (r x d_in) and B (d_out x r)
    lora_alpha=64,                                              # Scaling factor: W_new = W_old + (alpha / r) * (B @ A)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],    # Apply LoRA to the attention matrices
    lora_dropout=0.1,                                           # Dropout rate to reduce overfitting
    bias="none",                                                # Do not train the bias parameter
    task_type="CAUSAL_LM"                                       # Task type for autoregressive text generation
)

model = get_peft_model(model, lora_config)
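The number of trainable parameters this configuration adds can be estimated from the layer shapes in the printed architecture. The sketch below is a hand calculation that assumes one rank-16 adapter pair per targeted projection in each of the 32 decoder layers, and that the adapter is saved in 32-bit floats.

```python
R = 16          # LoRA rank from lora_config
N_LAYERS = 32   # decoder layers in the printed architecture

# (in_features, out_features) of each targeted projection, from print(model)
TARGETS = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}


def lora_params(d_in, d_out, r):
    """Parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in TARGETS.values())
total = per_layer * N_LAYERS
size_mib = total * 4 / 2**20   # assumption: 4 bytes per parameter (float32)

print(f"trainable LoRA parameters: {total:,}")
print(f"approximate adapter size:  {size_mib:.1f} MiB")
```

The estimate of roughly 52 MiB is consistent with the size of `adapter_model.safetensors` in the directory listing produced after training. Calling `model.print_trainable_parameters()` on the PEFT model reports the actual count.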


**4. Fine-Tuning**

In [22]:
# Configure the training hyperparameters using the `SFTConfig` from HuggingFace.
from trl import SFTConfig

training_arguments = SFTConfig(
    fp16=True,                           # Mixed-precision training: forward and backward computations run in float16
    dataset_text_field="text",           # Specify the text field in the dataset for training
    max_seq_length=1024,                 # Set the maximum sequence length for the training data

    # Batch-related parameters
    per_device_train_batch_size=8,       # Batch size per device during training

    # Optimizer-related parameters
    optim="paged_adamw_32bit",           # AdamW with 32-bit optimizer states, paged to CPU memory when GPU memory runs short
    learning_rate=1e-4,                  # Set the learning rate for training

    # Epochs and saving configuration
    num_train_epochs=2,                  # Number of passes over the training set (more epochs can help, but may overfit)
    save_strategy="epoch",               # Save the model after each epoch
    output_dir="./epoch-finetuned",      # Directory to save the fine-tuned model

    # Validation-related parameters
    eval_strategy="steps",               # Evaluation strategy, performed at specified steps
    eval_steps=0.2,                      # Evaluate after 20% of the training steps

    # Logging-related parameters
    report_to="none",                    # Disable reporting to external tools
    logging_dir="./logs",                # Directory to save the training logs
    logging_steps=20,                    # Number of steps between each log entry
    seed=42,                             # Set a random seed for reproducibility
)

model.gradient_checkpointing_enable()

model.config.use_cache = False

In [23]:
# Import the SFTTrainer from HuggingFace TRL library
from trl import SFTTrainer

trainer = SFTTrainer(

    model=model,                    # already wrapped with LoRA adapters by get_peft_model, so no peft_config is passed
    # tokenizer=tokenizer,

    train_dataset=train_data,
    eval_dataset=validation_data,

    args=training_arguments,
)

Converting train dataset to ChatML:   0%|          | 0/500 [00:00<?, ? examples/s]

Adding EOS to train dataset:   0%|          | 0/500 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/500 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/500 [00:00<?, ? examples/s]

Converting eval dataset to ChatML:   0%|          | 0/50 [00:00<?, ? examples/s]

Adding EOS to eval dataset:   0%|          | 0/50 [00:00<?, ? examples/s]

Tokenizing eval dataset:   0%|          | 0/50 [00:00<?, ? examples/s]

Truncating eval dataset:   0%|          | 0/50 [00:00<?, ? examples/s]

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [24]:
# Start the training process
trainer.train()

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 26   | 1.354         | 1.299895        |
| 52   | 1.2862        | 1.281364        |
| 78   | 1.271         | 1.281662        |
| 104  | 1.1389        | 1.280206        |


TrainOutput(global_step=126, training_loss=1.2344436456286718, metrics={'train_runtime': 1521.4762, 'train_samples_per_second': 0.657, 'train_steps_per_second': 0.083, 'total_flos': 2.0783094026305536e+16, 'train_loss': 1.2344436456286718})
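The step counts in the training log follow from the configuration. The sketch below reproduces them under the assumption of a single GPU with no gradient accumulation, and assumes the fractional `eval_steps` is rounded up to a whole number of steps.

```python
import math

N_TRAIN = 500       # training examples selected above
BATCH_SIZE = 8      # per_device_train_batch_size
EPOCHS = 2          # num_train_epochs
EVAL_RATIO = 0.2    # eval_steps given as a fraction of total steps

steps_per_epoch = math.ceil(N_TRAIN / BATCH_SIZE)   # last batch is partial
total_steps = steps_per_epoch * EPOCHS
eval_every = math.ceil(total_steps * EVAL_RATIO)
eval_points = list(range(eval_every, total_steps + 1, eval_every))

print(steps_per_epoch)  # 63
print(total_steps)      # 126
print(eval_points)      # [26, 52, 78, 104]
```

These values match `global_step=126` and the four evaluation steps in the loss table.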

In [25]:
peft_model_path = "./fine-tuned-mistral"

trainer.model.save_pretrained(peft_model_path)

tokenizer.save_pretrained(peft_model_path)

!ls -lh {peft_model_path}


total 57M
-rw-r--r-- 1 root root  808 May  7 09:43 adapter_config.json
-rw-r--r-- 1 root root  53M May  7 09:43 adapter_model.safetensors
-rw-r--r-- 1 root root 5.0K May  7 09:43 README.md
-rw-r--r-- 1 root root  437 May  7 09:43 special_tokens_map.json
-rw-r--r-- 1 root root 134K May  7 09:43 tokenizer_config.json
-rw-r--r-- 1 root root 3.6M May  7 09:43 tokenizer.json
-rw-r--r-- 1 root root 574K May  7 09:43 tokenizer.model


**5. Testing the Fine-Tuned Model**

In [26]:
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer


peft_model_path = "./fine-tuned-mistral"
tuned_model = AutoPeftModelForCausalLM.from_pretrained(
    peft_model_path,
    quantization_config=quant_config
)


tokenizer = AutoTokenizer.from_pretrained(peft_model_path)


tokenizer.pad_token = tokenizer.eos_token

tokenizer.padding_side = "right"

tuned_model.config.use_cache = True

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

In [27]:
# Get a dialogue example:
dialogue = f"""
#Person1#: Ms. Dawson, I need you to take a dictation for me.
#Person2#: Yes, sir...
#Person1#: This should go out as an intra-office memorandum to all employees by this afternoon. Are you ready?
#Person2#: Yes, sir. Go ahead.
#Person1#: Attention all staff... Effective immediately, all office communications are restricted to email correspondence and official memos. The use of Instant Message programs by employees during working hours is strictly prohibited.
#Person2#: Sir, does this apply to intra-office communications only? Or will it also restrict external communications?
#Person1#: It should apply to all communications, not only in this office between employees, but also any outside communications.
#Person2#: But sir, many employees use Instant Messaging to communicate with their clients.
#Person1#: They will just have to change their communication methods. I don't want any - one using Instant Messaging in this office. It wastes too much time! Now, please continue with the memo. Where were we?
#Person2#: This applies to internal and external communications.
#Person1#: Yes. Any employee who persists in using Instant Messaging will first receive a warning and be placed on probation. At second offense, the employee will face termination. Any questions regarding this new policy may be directed to department heads.
#Person2#: Is that all?
#Person1#: Yes. Please get this memo typed up and distributed to all employees before 4 pm.
"""

# Build an instruction prompt to the model
prompt = f"""
Summarize the following conversation.

### Input: {dialogue}

### Summary:
"""

print(prompt)


Summarize the following conversation.

### Input: 
#Person1#: Ms. Dawson, I need you to take a dictation for me.
#Person2#: Yes, sir...
#Person1#: This should go out as an intra-office memorandum to all employees by this afternoon. Are you ready?
#Person2#: Yes, sir. Go ahead.
#Person1#: Attention all staff... Effective immediately, all office communications are restricted to email correspondence and official memos. The use of Instant Message programs by employees during working hours is strictly prohibited.
#Person2#: Sir, does this apply to intra-office communications only? Or will it also restrict external communications?
#Person1#: It should apply to all communications, not only in this office between employees, but also any outside communications.
#Person2#: But sir, many employees use Instant Messaging to communicate with their clients.
#Person1#: They will just have to change their communication methods. I don't want any - one using Instant Messaging in this office. It wastes t

In [28]:
# Test fine-tuned model
inputs = tokenizer(prompt, return_tensors='pt')
inputs = inputs.to("cuda")

output_tokens = tuned_model.generate(**inputs,  # pass both input_ids and attention_mask
                               max_new_tokens=100,
                               pad_token_id=tokenizer.eos_token_id)[0]
output = tokenizer.decode(output_tokens, skip_special_tokens=True)
res = output.replace(prompt,"")

import textwrap
print('TRAINED MODEL GENERATED TEXT :')
print(textwrap.fill(res, width=80))

TRAINED MODEL GENERATED TEXT :
#Person1# asks Ms. Dawson to take a dictation for him. #Person1# tells Ms.
Dawson to write a memo to all employees, prohibiting the use of Instant Message
programs by employees during working hours.


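The fine-tuned model now produces a coherent summary instead of empty lines, although the summary omits some details of the memo, such as the penalties for violations. Further fine-tuning, for example on more of the training set or for more epochs, may improve it further.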
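The comparison in this notebook is qualitative. One way to put a number on it would be a word-overlap score between a generated summary and a human-written reference. The function below is a simplified stand-in for ROUGE-1 F1 (whitespace tokenization, no stemming), not the official metric, and the example strings are toy inputs invented for illustration rather than dataset entries or model outputs.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Simplified ROUGE-1-style F1 based on lowercase whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())   # clipped count of shared words
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Toy strings, invented for illustration only.
reference = "the memo bans instant messaging"
before = "\n\n\n"                               # output consisting of newlines
after = "the memo bans messaging at work"

print(f"before: {unigram_f1(before, reference):.3f}")
print(f"after:  {unigram_f1(after, reference):.3f}")
```

Averaging such a score over the held-out test split, for both the base and the fine-tuned model, would give a more reliable picture than a single example.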