# Parameter-Efficient Fine-Tuning


Submitted by,
> Suhail Chand <br>
> suhail.chand@outlook.com

**Fine-tuning** is the process of adapting a pre-trained model—like a large language model (LLM)—to perform better on a specific task or domain. Instead of training a model from scratch, fine-tuning leverages the general knowledge already embedded in the model and adjusts it using a smaller, task-specific dataset. This improves performance, especially in specialized areas like legal, medical, or technical domains.

**Parameter-Efficient Fine-Tuning (PEFT)** takes this a step further by updating only a small subset of the model’s parameters, keeping the majority of the model frozen. This drastically reduces computational cost, memory usage, and training time—making it ideal for resource-constrained environments.

**Common Fine-Tuning Techniques**
- *Supervised Fine-Tuning (SFT)*: Traditional method using labeled input-output pairs to adjust the model’s weights for specific tasks.
- *Reinforcement Fine-Tuning (RFT)*: Uses reward signals to guide the model toward better outputs, often used for complex reasoning tasks.
- *Direct Preference Optimization (DPO)*: Fine-tunes the model based on human preferences by comparing pairs of outputs and favoring the preferred one.
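To make the preference objective more concrete, the sketch below computes the DPO loss for a single preference pair from sequence log-probabilities. It is a minimal illustration and not part of this notebook's pipeline; the function name `dpo_loss`, the `beta` value, and the example log-probabilities are assumptions chosen for illustration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # How much more the policy favors each response than the frozen reference does
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Made-up log-probabilities: first the policy equals the reference,
# then the policy has shifted probability toward the preferred response
no_preference = dpo_loss(-12.0, -12.0, -12.0, -12.0)
prefers_chosen = dpo_loss(-10.0, -14.0, -12.0, -12.0)
print(no_preference, prefers_chosen)
```

The loss drops as the policy assigns relatively more probability to the preferred output than the reference model does.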

**Parameter-Efficient Fine-Tuning Techniques**
- *Adapters*: Small modules inserted between layers of the model; only these are trained while the rest of the model remains unchanged.
- ***LoRA (Low-Rank Adaptation)***: Instead of updating all the model’s parameters, LoRA freezes the pre-trained weights and adds small trainable low-rank matrices to key layers (like attention projections). QLoRA builds on this by applying LoRA to a quantized (typically 4-bit) base model, which further reduces memory usage; the number of trainable parameters stays the same as in LoRA.
- *Prefix Tuning*: Prepends learnable vectors (prefixes) to the attention inputs of each layer, guiding the model without altering its pre-trained weights.
- *Prompt Tuning*: Similar to prefix tuning but operates entirely at the input level, learning soft prompt embeddings that are prepended to the input.
- *BitFit*: Fine-tunes only the bias terms of the model, offering a surprisingly effective and lightweight approach.
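
The LoRA idea can be sketched in a few lines of plain Python. This toy example assumes a tiny 3x3 frozen weight matrix, rank 1, and made-up values; the helper names `matvec` and `lora_forward` are illustrative and do not come from any library. The output is the frozen layer's output plus a scaled low-rank correction `(alpha / r) * B(Ax)`, and only `A` and `B` would be trained.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Frozen base output W x plus the scaled low-rank update B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Frozen pre-trained weight (3x3) and an input vector
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0],
     [3.0, 0.0, 1.0]]
x = [1.0, 2.0, 3.0]

r, alpha = 1, 1
A = [[0.5, -0.5, 1.0]]              # shape (r, in_features), trainable
B_init = [[0.0], [0.0], [0.0]]      # shape (out_features, r), starts at zero
B_trained = [[0.1], [0.0], [-0.2]]  # hypothetical values after some training

out_init = lora_forward(W, A, B_init, x, alpha, r)
out_trained = lora_forward(W, A, B_trained, x, alpha, r)
print(out_init)     # identical to the frozen layer's output
print(out_trained)  # frozen output plus the low-rank correction
```

With `r = 16` and 4096-wide layers, as used later in this notebook, `A` and `B` together hold 16 * (4096 + 4096) values instead of the 4096 * 4096 values of the full weight matrix.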

**In this notebook, we fine-tune the Llama 2 7B model for a summarization task using the `QLoRA` technique, implemented using the `unsloth` library. We then evaluate the fine-tuned model’s performance and compare it with the base model to assess improvements in summarization quality.**


## Install and Import Necessary Libraries

In [1]:
# Update the package lists
!apt-get update

# Install ninja-build and cmake
# They help streamline the process of compiling and managing dependencies
!apt-get install -y ninja-build cmake

# Upgrade the ipywidgets Python package
!pip install ipywidgets --upgrade

Get:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Hit:2 https://deb.nodesource.com/node_20.x nodistro InRelease                  
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease                         
Get:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Hit:5 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Fetched 257 kB in 1s (410 kB/s)
Reading package lists... Done
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
ninja-build is already the newest version (1.10.1-1).
cmake is already the newest version (3.22.1-1ubuntu1.22.04.2).
0 upgraded, 0 newly installed, 0 to remove and 94 not upgraded.

In [None]:
# Install essential Python libraries for LLM fine-tuning, optimization, and evaluation
!pip install unsloth
!pip install -q --no-deps xformers trl peft accelerate bitsandbytes
!pip install -q datasets evaluate bert_score

Collecting unsloth
  Downloading unsloth-2025.5.7-py3-none-any.whl (265 kB)
Collecting peft!=0.11.0,>=0.7.1
  Downloading peft-0.15.2-py3-none-any.whl (411 kB)
Collecting xformers>=0.0.27.post2
  Downloading xformers-0.0.30-cp310-cp310-manylinux_2_28_x86_64.whl (31.5 MB)
Collecting unsloth_zoo>=2025.5.8
  Downloading unsloth_zoo-2025.5.8-py3-none-any.whl (146 kB)
Collecting hf_transfer
  Downloading hf_transfer-0.1.9-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3

Installing collected packages: sentencepiece, nvidia-cusparselt-cu12, wheel, typing-extensions, triton, sympy, shtab, safetensors, regex, protobuf, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, msgspec, hf_transfer, docstring-parser, typeguard, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, tyro, tokenizers, nvidia-cusolver-cu12, diffusers, transformers, torch, xformers, torchvision, cut_cross_entropy, bitsandbytes, accelerate, peft, trl, unsloth_zoo, unsloth
  Attempting uninstall: wheel
    Found existing installation: wheel 0.37.1
    Not uninstalling wheel at /usr/lib/python3/dist-packages, outside environment /usr
    Can't uninstall 'wheel'. No files were found to uninstall.
  Attempting uninstall: typing-extensions
    Found existing installation: typing_extensions 4.12.2
    Uninstalling typing_extensions-4.12.2:
      S

In [3]:
# import FastLanguageModel from unsloth library
from unsloth import FastLanguageModel

# import SFTTrainer from trl
from trl import SFTTrainer

# import TrainingArguments from transformers
from transformers import TrainingArguments

# import evaluate library
import evaluate

# import locale library
import locale

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [4]:
# PyTorch for deep learning and tensor computations
import torch

# HuggingFace library for dataset management
import datasets

# Data manipulation and analysis
import numpy as np
import pandas as pd

## Dataset Preprocessing for Text Summarization

### Load Dataset

In [5]:
# Load the customer reviews dataset
sample_reviews_df = pd.read_csv("customer_reviews_dataset.csv")

In [6]:
# Create a dialogue column with concatenated review and response text
sample_reviews_df['dialogue'] = 'customer : ' + sample_reviews_df['review_text'] + '\n' + 'response : ' + sample_reviews_df['response'] + '\n'

In [7]:
# Create an id column with customer IDs
sample_reviews_df['id'] = sample_reviews_df['customer_id']

In [8]:
# Filter specific attributes
sample_reviews_df = sample_reviews_df[['id', 'review_sentiment', 'dialogue', 'summary']]
sample_reviews_df.head()

| | id | review_sentiment | dialogue | summary |
|---|---|---|---|---|
| 0 | CID041 | Positive | customer : I bought this laptop for my son who... | The user purchased a laptop for their son, who... |
| 1 | CID011 | Negative | customer : I was very disappointed with these ... | The user expressed disappointment with poor so... |
| 2 | CID034 | Positive | customer : Awesome power bank, it charges my p... | The user praises the power bank for its fast c... |
| 3 | CID032 | Negative | customer : I bought this phone mainly for its ... | The user expressed disappointment with the pho... |
| 4 | CID051 | Positive | customer : I love these headphones, they are a... | The user praises the headphones for their exce... |


### Split Dataset

In [9]:
# Separate positive and negative reviews
positive_reviews =  sample_reviews_df[sample_reviews_df['review_sentiment'] == 'Positive']
negative_reviews =  sample_reviews_df[sample_reviews_df['review_sentiment'] == 'Negative']

# Sample 2 positive and 2 negative reviews for gold examples
positive_gold_examples = positive_reviews.sample(2, random_state=40)
negative_gold_examples = negative_reviews.sample(2, random_state=40)

# Concatenate positive and negative gold examples
sample_reviews_gold_examples_df = pd.concat([positive_gold_examples, negative_gold_examples])

# Create the training set by excluding gold examples
sample_reviews_examples_df =  sample_reviews_df.drop(index=sample_reviews_gold_examples_df.index)

# Print the shapes of the datasets
print("Training Set Shape:", sample_reviews_examples_df.shape)
print("Gold Examples Shape:", sample_reviews_gold_examples_df.shape)

Training Set Shape: (26, 4)
Gold Examples Shape: (4, 4)


## Model Setup for Fine-tuning

### Load the llama-2-7b-bnb-4bit Model from Unsloth

In [10]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-2-7b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,                 # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
    load_in_4bit=True           # Use 4bit quantization to reduce memory usage.
)

==((====))==  Unsloth 2025.5.7: Fast Llama patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.568 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/3.87G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/183 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/948 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/438 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]
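
A rough back-of-the-envelope estimate shows why 4-bit loading matters on a GPU with about 14.5 GB of memory. The sketch assumes a nominal 7 billion parameters and counts the weights only (no activations, quantization constants, or layers kept in higher precision), so the numbers are approximate.

```python
NOMINAL_PARAMS = 7_000_000_000  # assumption: nominal size of a "7B" model

def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

fp16_gb = weight_memory_gb(NOMINAL_PARAMS, 16)
int4_gb = weight_memory_gb(NOMINAL_PARAMS, 4)
print(f"16-bit weights: ~{fp16_gb:.1f} GB, 4-bit weights: ~{int4_gb:.1f} GB")
```

The 3.87 GB `model.safetensors` download shown above is in the same range as the 4-bit estimate, whereas 16-bit weights alone would nearly fill the T4.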

In [11]:
# Base model
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4

In [12]:
tokenizer

LlamaTokenizerFast(name_or_path='unsloth/llama-2-7b-bnb-4bit', vocab_size=32000, model_max_length=4096, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<unk>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

In [13]:
EOS_TOKEN = tokenizer.eos_token

In [14]:
# Configure a PEFT model with LoRA for optimized resource-efficient training
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha=16,
    lora_dropout=0, # Supports any, but = 0 is optimized
    bias="none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing=True,
    random_state=42,
    use_rslora=False,
    loftq_config=None
)

Unsloth 2025.5.7 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
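
The number of parameters that LoRA adds can be checked by hand. Each adapted linear layer gets a matrix `A` of shape (r x in_features) and a matrix `B` of shape (out_features x r). The sketch below takes the layer shapes from the model printout above and the rank from the configuration; the helper `lora_params` is an illustrative name, not a library function.

```python
RANK = 16
NUM_LAYERS = 32

# (in_features, out_features) of each targeted projection, from the model printout
TARGET_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 4096),
    "v_proj": (4096, 4096),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 11008),
    "up_proj": (4096, 11008),
    "down_proj": (11008, 4096),
}

def lora_params(in_features, out_features, rank):
    """A is (rank x in_features) and B is (out_features x rank)."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")  # 39,976,960
```

This agrees with the trainable-parameter count that Unsloth reports when training starts, roughly 0.57% of the nominal 7 billion parameters.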


### Create Examples

In [15]:
# Structured Alpaca prompt
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

In [16]:
def create_examples_with_seed(dataset, n=2, random_seed=None):
    """
    Draw a class-balanced random sample of 2n examples and return it together
    with the remaining rows.
    Create subsets of each class, choose n random samples from each subset,
    merge them and shuffle the order of the merged samples.
    Passing the same random_seed reproduces the same sample; with
    random_seed=None each run selects a different sample.

    Args:
        dataset (DataFrame): A DataFrame with examples (text + label)
        n (int): number of examples of each class to be selected
        random_seed (int): seed for reproducibility (default is None)

    Returns:
        few_shot_examples_df (DataFrame): The 2n selected examples in random order
        new_df (DataFrame): The input DataFrame excluding the selected examples
    """

    positive_reviews = (dataset.review_sentiment == 'Positive')
    negative_reviews = (dataset.review_sentiment == 'Negative')
    columns_to_select = ['id', 'review_sentiment' ,'dialogue','summary']

    # Seed NumPy's global RNG; DataFrame.sample() draws from it when no random_state is passed
    np.random.seed(random_seed)

    positive_examples = dataset.loc[positive_reviews, columns_to_select].sample(n)
    negative_examples = dataset.loc[negative_reviews, columns_to_select].sample(n)

    few_shot_examples_df = pd.concat([positive_examples, negative_examples])
    # sampling without replacement is equivalent to random shuffling
    few_shot_examples_df = few_shot_examples_df.sample(2 * n, replace=False)

    # Create a new DataFrame excluding selected examples
    new_df = dataset.drop(index=few_shot_examples_df.index)

    return few_shot_examples_df, new_df

In [17]:
# Select 4 training examples (2 per class); the remaining 22 rows form the validation set
sample_reviews_train_examples_df, sample_reviews_validation_examples_df = create_examples_with_seed(sample_reviews_examples_df, 2, random_seed=40)

In [18]:
# Convert to HuggingFace Dataset
training_dataset = datasets.Dataset.from_pandas(sample_reviews_train_examples_df)
validation_dataset = datasets.Dataset.from_pandas(sample_reviews_validation_examples_df)

In [19]:
def prompt_formatter(example, prompt_template):
    """
    Construct a formatted prompt with instruction, example dialogue and summary.
    """
    instruction = 'Summarize the following dialogue'
    dialogue = example["dialogue"]
    summary = example["summary"]

    formatted_prompt = prompt_template.format(instruction, dialogue, summary)

    return {'formatted_prompt': formatted_prompt}
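
Note that `EOS_TOKEN` is defined earlier, but the formatter above does not append it to the training text. A commonly used variant appends the end-of-sequence token so that the model sees where a summary ends, which may help with run-on or repetitive generations. The sketch below is a hypothetical variant (`prompt_formatter_with_eos` is not part of this notebook), shown with a shortened template, a made-up example, and a literal `</s>` standing in for `tokenizer.eos_token`.

```python
# Shortened stand-in for the Alpaca template used in this notebook
prompt_template = """### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = "</s>"  # assumption: stands in for tokenizer.eos_token

def prompt_formatter_with_eos(example, prompt_template, eos_token):
    """Format one example and terminate it with the end-of-sequence token."""
    instruction = 'Summarize the following dialogue'
    formatted_prompt = prompt_template.format(
        instruction, example["dialogue"], example["summary"]
    ) + eos_token
    return {'formatted_prompt': formatted_prompt}

# Made-up example for illustration
example = {
    "dialogue": "customer : Great power bank.\nresponse : Thank you!\n",
    "summary": "The user praises the power bank.",
}
formatted = prompt_formatter_with_eos(example, prompt_template, EOS_TOKEN)
print(formatted['formatted_prompt'])
```

With `Dataset.map`, the extra argument could be passed through `fn_kwargs` alongside `prompt_template`.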

In [20]:
# Apply prompt_formatter to format training data using alpaca_prompt
formatted_training_dataset = training_dataset.map(
    prompt_formatter,
    fn_kwargs={'prompt_template': alpaca_prompt}
)

Map:   0%|          | 0/4 [00:00<?, ? examples/s]

In [21]:
# Apply prompt_formatter to format validation data using alpaca_prompt
formatted_validation_dataset = validation_dataset.map(
    prompt_formatter,
    fn_kwargs={'prompt_template': alpaca_prompt}
)

Map:   0%|          | 0/22 [00:00<?, ? examples/s]

### Initialize Model Parameters

In [22]:
# Initialize a SFT Trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=formatted_training_dataset,
    eval_dataset=formatted_validation_dataset,
    dataset_text_field = "formatted_prompt",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=False,              # Packing is disabled; enabling it can speed up training on short sequences.
    args = TrainingArguments(
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=50,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        output_dir="outputs",
        report_to="none"
    )
)

Unsloth: Tokenizing ["formatted_prompt"] (num_proc=2):   0%|          | 0/4 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["formatted_prompt"] (num_proc=2):   0%|          | 0/22 [00:00<?, ? examples/s]

Detected kernel version 4.14.355, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


## Train and Save the Model

In [23]:
# Retrieve GPU properties, calculate total and reserved memory in GB
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.568 GB.
6.486 GB of memory reserved.


### Train Model

In [24]:
# Start training
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 4 | Num Epochs = 50 | Total steps = 50
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 39,976,960/7,000,000,000 (0.57% trained)


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|---|---|
| 1 | 1.7507 |
| 2 | 1.7507 |
| 3 | 1.7403 |
| 4 | 1.6947 |
| 5 | 1.6148 |
| 6 | 1.5038 |
| 7 | 1.3572 |
| 8 | 1.2146 |
| 9 | 1.0839 |
| 10 | 0.9555 |


In [26]:
# Training stats
trainer_stats

TrainOutput(global_step=50, training_loss=0.399557319059968, metrics={'train_runtime': 192.4311, 'train_samples_per_second': 2.079, 'train_steps_per_second': 0.26, 'total_flos': 3051678519312384.0, 'train_loss': 0.399557319059968})
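
The figures in the training banner and in `trainer_stats` can be reproduced with simple arithmetic. The sketch assumes that each epoch contributes at least one optimizer step even when it contains fewer batches than `gradient_accumulation_steps`, and that throughput is reported against the nominal batch size. Both are assumptions about how the trainer counts, not something this notebook states.

```python
import math

num_examples = 4
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
max_steps = 50
train_runtime = 192.4311  # seconds, from trainer_stats

total_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus

# Assumption: at least one optimizer step is counted per epoch
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
num_epochs = math.ceil(max_steps / updates_per_epoch)

# Nominal number of samples processed over the whole run
nominal_samples = max_steps * total_batch_size
samples_per_second = round(nominal_samples / train_runtime, 3)
steps_per_second = round(max_steps / train_runtime, 2)

print("Total batch size:", total_batch_size)
print("Epochs:", num_epochs)
print("Samples per second:", samples_per_second)
print("Steps per second:", steps_per_second)
```

Under these assumptions, 50 steps correspond to 50 passes over the same 4 training examples, which is worth keeping in mind when reading the training loss.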

### Inference

In [27]:
# Convert gold samples to HuggingFace Dataset
test_dataset = datasets.Dataset.from_pandas(sample_reviews_gold_examples_df)

In [28]:
test_dataset[0]

{'id': 'CID041',
 'review_sentiment': 'Positive',
 'dialogue': "customer : I bought this laptop for my son who is studying engineering. He is very happy with it. It has a good battery life, fast performance, and a sleek design. The keyboard is comfortable and the screen is bright. The laptop came with a one-year warranty and a free antivirus software. I think it is a great value for money.\nresponse : It's fantastic to hear that the laptop you purchased for your son has met his needs and expectations, especially in his engineering studies. A good battery life, fast performance, and sleek design are essential for a student's productivity. The comfortable keyboard and bright screen further enhance the usability of the laptop. If you ever encounter any issues or have questions about the laptop, please feel free to reach out for support. We're here to ensure that your experience continues to be positive. Thank you for choosing our product and taking the time to share your satisfaction!\n",

In [29]:
# Extract a test dialogue and its corresponding summary
instruction = "Summarize the following dialogue"
test_dialogue = test_dataset[0]['dialogue']
test_summary = test_dataset[0]['summary']

In [30]:
# Prepare model for inference
FastLanguageModel.for_inference(model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 4096, padding_idx=0)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.Line

In [31]:
# Tokenize the formatted prompt
inputs = tokenizer(
[
    alpaca_prompt.format(
        instruction,
        test_dialogue,
        "",                 # leave output blank for generation
    )
], return_tensors="pt").to("cuda")

In [32]:
# Model inference
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    use_cache=True,
    do_sample=True,
    temperature=0.2
)
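
The `temperature=0.2` setting sharpens the next-token distribution before sampling, so the most likely tokens are picked far more often than at the default of 1.0. The sketch below illustrates the effect with a temperature-scaled softmax over made-up logits; `softmax_with_temperature` is an illustrative helper, not a library function.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [logit / temperature for logit in logits]
    max_scaled = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.5, 0.5]  # made-up scores for three candidate tokens

probs_default = softmax_with_temperature(logits, 1.0)
probs_low = softmax_with_temperature(logits, 0.2)
print([round(p, 3) for p in probs_default])
print([round(p, 3) for p in probs_low])
```

At the lower temperature, most of the probability mass moves to the highest-scoring token, which makes sampled outputs more deterministic.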

In [33]:
# Decode and print the output text
print(tokenizer.batch_decode(outputs)[0])

<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Summarize the following dialogue

### Input:
customer : I bought this laptop for my son who is studying engineering. He is very happy with it. It has a good battery life, fast performance, and a sleek design. The keyboard is comfortable and the screen is bright. The laptop came with a one-year warranty and a free antivirus software. I think it is a great value for money.
response : It's fantastic to hear that the laptop you purchased for your son has met his needs and expectations, especially in his engineering studies. A good battery life, fast performance, and sleek design are essential for a student's productivity. The comfortable keyboard and bright screen further enhance the usability of the laptop. If you ever encounter any issues or have questions about the laptop, please feel free to reach out for sup

In [34]:
# Test summary
test_summary

'The user purchased a laptop for their son, who is studying engineering. They are satisfied with its battery life, fast performance, sleek design, comfortable keyboard, and bright screen, and its one-year warranty.'

The summary generated by the fine-tuned model closely matches the reference summary, but the output includes unnecessary repetition.

### Save Model

In [35]:
import locale

# Override the default locale encoding to always return "UTF-8"
def getpreferredencoding(do_setlocale=True):
    return "UTF-8"

locale.getpreferredencoding = getpreferredencoding

In [36]:
lora_model_name = "dialogue-summarizer-llama"

In [37]:
# save the model using save_pretrained function from model
model.save_pretrained(lora_model_name)

In [38]:
# List model directory contents
!ls -lh {lora_model_name}

total 153M
-rw-r--r-- 1 root root 5.0K May 23 15:57 README.md
-rw-r--r-- 1 root root  857 May 23 15:57 adapter_config.json
-rw-r--r-- 1 root root 153M May 23 15:57 adapter_model.safetensors
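
Only the LoRA adapter is saved, not the 7B base model, which is why the directory is so small. As a rough check, the sketch below estimates the adapter file size from the trainable-parameter count reported during training. It assumes the adapter weights are stored as 32-bit floats, which this notebook does not state explicitly.

```python
TRAINABLE_PARAMS = 39_976_960  # reported by Unsloth when training starts
BYTES_PER_PARAM = 4            # assumption: adapter weights stored as float32

size_mib = TRAINABLE_PARAMS * BYTES_PER_PARAM / 1024**2
print(f"Estimated adapter size: {size_mib:.1f} MiB")  # 152.5 MiB
```

The estimate is in line with the size of `adapter_model.safetensors` listed above.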


In [39]:
# Copy contents to the /data/ folder
!cp -r {lora_model_name} /data/

cp: 'dialogue-summarizer-llama' and '/data/dialogue-summarizer-llama' are the same file


## Evaluate Model Performance

### Llama 2 Base Model Performance

#### Load Base Model

In [40]:
import torch

# Release unused cached GPU memory (tensors still referenced, such as the fine-tuned model, are not freed)
torch.cuda.empty_cache()

In [41]:
# Load the baseline model from unsloth
baseline_model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-2-7b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,                 # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
    load_in_4bit=True           # Use 4bit quantization to reduce memory usage.
)

==((====))==  Unsloth 2025.5.7: Fast Llama patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.568 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [42]:
# Prepare model for inference
FastLanguageModel.for_inference(baseline_model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4

In [43]:
# Convert gold samples to HuggingFace Dataset
test_dataset = datasets.Dataset.from_pandas(sample_reviews_gold_examples_df)

#### Single Inference

In [44]:
# Extract a test dialogue and its corresponding summary
instruction = "Summarize the following dialogue"
test_dialogue = test_dataset[0]['dialogue']
test_summary = test_dataset[0]['summary']

In [45]:
# Tokenize the formatted prompt
inputs = tokenizer(
    alpaca_prompt.format(
        instruction,
        test_dialogue,
        ""
    ), return_tensors="pt"
).to("cuda")

In [46]:
# Model inference
output = baseline_model.generate(
    **inputs,
    max_new_tokens=128,
    use_cache=True,
    do_sample=True,
    temperature=0.2
)

In [47]:
# Decode and print the output text
print(tokenizer.decode(output[0]))

<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Summarize the following dialogue

### Input:
customer : I bought this laptop for my son who is studying engineering. He is very happy with it. It has a good battery life, fast performance, and a sleek design. The keyboard is comfortable and the screen is bright. The laptop came with a one-year warranty and a free antivirus software. I think it is a great value for money.
response : It's fantastic to hear that the laptop you purchased for your son has met his needs and expectations, especially in his engineering studies. A good battery life, fast performance, and sleek design are essential for a student's productivity. The comfortable keyboard and bright screen further enhance the usability of the laptop. If you ever encounter any issues or have questions about the laptop, please feel free to reach out for sup

The base model has simply generated multiple headlines instead of completing the prompt with a summary of the input content.

#### Batch Inference

In [48]:
# Clear the GPU cache to free up memory
torch.cuda.empty_cache()

In [49]:
instruction = "Summarize the following dialogue"

In [50]:
# List of test dialogues and summaries
test_dialogues = [sample['dialogue'] for sample in test_dataset]
test_summaries = [sample['summary'] for sample in test_dataset]

In [51]:
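def extract_summary_from_string(input_string):
    """
    Extract the summary from the LLM response.

    The summary is the text after the last '### Response:\n' marker, up to the
    end-of-sequence token '</s>' if the model emitted one.
    """
    marker = '### Response:\n'
    marker_pos = input_string.rfind(marker)
    if marker_pos == -1:
        # rfind returns -1 when the marker is missing; do not slice blindly
        print("Error decoding string: response marker not found")
        return None

    summary_str = input_string[marker_pos + len(marker):]

    # Generation may stop at max_new_tokens without emitting '</s>'
    eos_pos = summary_str.rfind('</s>')
    if eos_pos != -1:
        summary_str = summary_str[:eos_pos]

    return summary_str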

In [52]:
# Initialize summary predictions list
predicted_summaries = []

In [53]:
# Generate dialogue summaries using the baseline model
for sample_dialogue in test_dialogues:
    inputs = tokenizer(
        alpaca_prompt.format(
            instruction,
            sample_dialogue,
            ""
        ), return_tensors="pt"
    ).to("cuda")
    
    outputs = baseline_model.generate(
        **inputs,
        max_new_tokens=256,
        use_cache=True
    )
    
    predicted_summary = tokenizer.decode(outputs[0])
    
    output_str = extract_summary_from_string(predicted_summary)
    
    predicted_summaries.append(output_str)

#### Evaluate Performance of Base Model

In [54]:
# BERT scorer
bert_scorer = evaluate.load("bertscore")

Downloading builder script:   0%|          | 0.00/7.95k [00:00<?, ?B/s]

In [55]:
# Provide Prediction Summaries and Test Summaries as input
score = bert_scorer.compute(
    predictions=predicted_summaries,
    references=test_summaries,
    lang="en",
    rescale_with_baseline=True
)

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/482 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.42G [00:00<?, ?B/s]

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [56]:
# Average BERTScore: mean of the per-sample F1 scores across the test set
avg_bert_score = sum(score['f1']) / len(score['f1'])
avg_bert_score

0.1683494783937931
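
Because `rescale_with_baseline=True` is used, the reported F1 values are not raw similarity scores. Raw BERTScore values tend to fall in a narrow high range, so they are rescaled linearly against an empirical baseline, and rescaled scores can end up close to zero or even negative for poor matches. The sketch below shows that rescaling followed by the averaging step. The baseline value and the raw scores are made up for illustration and are not taken from this notebook's run.

```python
def rescale_with_baseline(raw_score, baseline):
    """Linearly rescale a raw score so the baseline maps to 0 and 1 stays 1."""
    return (raw_score - baseline) / (1 - baseline)

HYPOTHETICAL_BASELINE = 0.83              # assumption: illustrative value only
raw_f1_scores = [0.86, 0.84, 0.88, 0.85]  # made-up raw F1 scores

rescaled_f1 = [rescale_with_baseline(s, HYPOTHETICAL_BASELINE) for s in raw_f1_scores]
avg_rescaled_f1 = sum(rescaled_f1) / len(rescaled_f1)
print([round(s, 3) for s in rescaled_f1])
print(round(avg_rescaled_f1, 3))
```

A low rescaled average therefore indicates that the predictions are only slightly better than the baseline similarity, not that they share a fixed small fraction of content with the references.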

### Llama 2 Trained (Fine-tuned) Model Performance

#### Load Trained Model

In [57]:
# Clear the GPU cache to free up memory
torch.cuda.empty_cache()

In [58]:
# Load the fine-tuned model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=lora_model_name,
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True
)

==((====))==  Unsloth 2025.5.7: Fast Llama patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.568 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [59]:
# Prepare model for inference
FastLanguageModel.for_inference(model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 4096, padding_idx=0)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.Line

In [60]:
# Convert gold samples to HuggingFace Dataset
test_dataset = datasets.Dataset.from_pandas(sample_reviews_gold_examples_df)

#### Single Inference

In [61]:
# Extract a test dialogue and its corresponding summary
instruction = "Summarize the following dialogue"
test_dialogue = test_dataset[0]['dialogue']
test_summary = test_dataset[0]['summary']

In [62]:
# Tokenize the formatted prompt (named `inputs` to avoid shadowing the built-in `input`)
inputs = tokenizer(
    alpaca_prompt.format(
        instruction,
        test_dialogue,
        ""
    ), return_tensors="pt"
).to("cuda")

In [63]:
# Model inference
output = model.generate(
    **inputs,
    max_new_tokens=128,
    use_cache=True,
    do_sample=True,
    temperature=0.2
)

In [64]:
# Decode and print the output text
print(tokenizer.decode(output[0]))

<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Summarize the following dialogue

### Input:
customer : I bought this laptop for my son who is studying engineering. He is very happy with it. It has a good battery life, fast performance, and a sleek design. The keyboard is comfortable and the screen is bright. The laptop came with a one-year warranty and a free antivirus software. I think it is a great value for money.
response : It's fantastic to hear that the laptop you purchased for your son has met his needs and expectations, especially in his engineering studies. A good battery life, fast performance, and sleek design are essential for a student's productivity. The comfortable keyboard and bright screen further enhance the usability of the laptop. If you ever encounter any issues or have questions about the laptop, please feel free to reach out for sup

The fine-tuned model follows the instruction and generates an appropriate summary, although the output contains some unnecessary repetition.
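
The decoded output above repeats the full prompt before the generated text. One way to show only the model's continuation is to drop the prompt tokens before decoding. The sketch below illustrates just the slicing step on plain Python lists; `new_token_ids` is a hypothetical helper that is not part of the notebook, and the token IDs are made up for illustration.

```python
def new_token_ids(output_ids, prompt_length):
    """Return only the token IDs that were generated after the prompt."""
    return output_ids[prompt_length:]


# Toy example with made-up token IDs: 3 prompt tokens followed by 3 generated tokens
prompt_ids = [1, 835, 2799]
generated = prompt_ids + [450, 11232, 2]

continuation = new_token_ids(generated, len(prompt_ids))
print(continuation)
```

With the notebook's objects, this would correspond to slicing `output[0]` at the prompt's token count (the second dimension of the tokenized `input_ids`) and passing the result to `tokenizer.decode(..., skip_special_tokens=True)`.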

#### Batch Inference

In [73]:
# Clear the GPU cache to free up memory
torch.cuda.empty_cache()

In [74]:
instruction = "Summarize the following dialogue"

In [75]:
# List of test dialogues and summaries
test_dialogues = [sample['dialogue'] for sample in test_dataset]
test_summaries = [sample['summary'] for sample in test_dataset]

In [76]:
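def extract_summary_from_string(input_string):
    """
    Extract the summary from the decoded LLM output.

    The summary is the text after the last '### Response:' marker, up to the
    end-of-sequence token '</s>' when one is present.
    """
    marker = '### Response:\n'
    marker_pos = input_string.rfind(marker)
    if marker_pos == -1:
        # No response section found; return an empty summary so scoring still works
        print("Response marker not found in model output")
        return ""
    summary_start = marker_pos + len(marker)

    # '</s>' is missing when generation stops at max_new_tokens
    summary_end = input_string.find('</s>', summary_start)
    if summary_end == -1:
        summary_end = len(input_string)

    return input_string[summary_start:summary_end].strip()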

In [77]:
# Initialize summary predictions list
predicted_summaries = []

In [78]:
# Generate dialogue summaries using the fine-tuned model
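for sample_dialogue in test_dialogues:
    # Tokenize the formatted prompt (named `inputs` to avoid shadowing the built-in `input`)
    inputs = tokenizer(
        alpaca_prompt.format(
            instruction,
            sample_dialogue,
            ""
        ), return_tensors="pt"
    ).to("cuda")

    # Generate the summary
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        use_cache=True
    )

    # Decode the full sequence (prompt + response) and keep only the response
    predicted_summary = tokenizer.decode(outputs[0])
    output_str = extract_summary_from_string(predicted_summary)

    predicted_summaries.append(output_str)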

#### Evaluate Performance of Trained Model

In [79]:
# Score the predicted summaries against the reference summaries with BERTScore
score = bert_scorer.compute(
    predictions=predicted_summaries,
    references=test_summaries,
    lang="en",
    rescale_with_baseline=True
)

In [80]:
# Average BERTScore: the sum of the per-sample F1 scores divided by the number of samples
avg_bert_score = sum(score['f1']) / len(score['f1'])
avg_bert_score

0.36662961542606354
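
An average alone can hide weak individual summaries, so it can help to look at the spread of the per-sample scores as well. The following is a minimal sketch using made-up F1 values as a stand-in for `score['f1']`; the helper name `summarize_f1` is hypothetical.

```python
import statistics


def summarize_f1(f1_scores):
    """Return the mean, min, max, and sample standard deviation of per-sample F1 scores."""
    return {
        "mean": statistics.mean(f1_scores),
        "min": min(f1_scores),
        "max": max(f1_scores),
        "stdev": statistics.stdev(f1_scores),
    }


# Made-up values for illustration only; in the notebook this would be score['f1']
example_f1 = [0.25, 0.5, 0.75]
summary = summarize_f1(example_f1)
print(summary)
```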

### Insights

Fine-tuning has enabled the model to follow the summarization instruction more effectively and to produce summaries that more closely resemble the human-written references. This improvement is evident from the BERTScore results obtained when evaluating the base and fine-tuned models; the fine-tuned model reaches an average F1 of about 0.367 (rescaled against the baseline). BERTScore is a similarity metric based on contextual embeddings: it measures how closely the tokens of a generated summary match the tokens of the human-written summary in meaning, so a higher score indicates better coverage of, and relevance to, the reference content. Qualities such as coherence are better judged by reading the outputs directly, as in the single-inference example, where some repetition remains.
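
To make the metric more concrete, the sketch below illustrates the greedy-matching idea behind BERTScore on toy vectors. It is not the `bert_scorer` implementation used above: the 2-dimensional vectors stand in for contextual token embeddings, optional IDF weighting is omitted, and the baseline value passed to `rescale` is made up. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(candidate_vecs, reference_vecs):
    """Return (precision, recall, f1) using greedy token matching."""
    # Precision: match each candidate token to its most similar reference token
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    # Recall: match each reference token to its most similar candidate token
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def rescale(value, baseline):
    """Linearly rescale a score so that the baseline maps to 0 and 1 stays at 1."""
    return (value - baseline) / (1 - baseline)


# Toy "embeddings": two candidate tokens and two reference tokens
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [1.0, 1.0]]

p, r, f1 = greedy_match_scores(candidate, reference)
print(p, r, f1)

# The baseline of 0.8 is an arbitrary illustrative value
print(rescale(f1, 0.8))
```

Rescaling spreads the scores over a wider range, which is why a rescaled F1 can look low in absolute terms even when the raw similarity is high.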