In this notebook we use an LLM, Llama-2-7b-chat-hf, to perform question answering on our dataset. The notebook has three main sections: first the model is given no examples (zero-shot), then a single example (one-shot), and finally it is fine-tuned on our dataset. The notebook ends with a comparison of the three settings.

In [24]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [25]:
import os

os.chdir('/content/drive/MyDrive/Polimi/NLP')
os.getcwd()

'/content/drive/MyDrive/Polimi/NLP'

# Data upload and library imports

In [26]:
!pip install -q "transformers" "peft" "accelerate" "bitsandbytes" "trl" "safetensors" "tiktoken"
!pip install -q -U langchain
!pip install -q langchain-community langchain-core
!pip install -q --no-deps peft
!pip install -q lightning

In [27]:
import pandas as pd
import torch
from transformers import AutoTokenizer, Trainer, TrainingArguments, pipeline
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from sentence_transformers import SentenceTransformer, util
from peft import LoraConfig
from peft import prepare_model_for_kbit_training, get_peft_model
from datasets import Dataset
from torch.utils.data import DataLoader
import lightning as L
from torch.optim import AdamW
import torch.nn.functional as F
from huggingface_hub import login

import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sklearn.metrics.pairwise import cosine_similarity
import string
import re
from collections import Counter
nltk.download('punkt')

In [68]:
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sklearn.metrics.pairwise import cosine_similarity
import string
import re
from collections import Counter
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Load the data.

In [28]:
# Run this on colab
train_set = pd.read_parquet("tuning_data.parquet")
test_set = pd.read_parquet("test_data.parquet")

In [None]:
# Run this on kaggle
train_set=pd.read_parquet("/kaggle/input/nlp-resources/tuning_data.parquet")
test_set=pd.read_parquet("/kaggle/input/nlp-resources/test_data.parquet")

For testing we use a very small subset (the first 30 examples of the test set), because inference with a model of this size is slow.

In [79]:
small_test_set = test_set[:30]

In [6]:
login(token="your_token")

# Load of the model

Here we load the model and its tokenizer.

In [46]:
model_id = "NousResearch/Llama-2-7b-chat-hf"

The full-precision model is too large for the available GPU memory, so we load it with 4-bit quantization (NF4 with double quantization, computing in float16).

In [47]:
# BitsAndBytesConfig int-4 config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
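To make the size argument concrete, the following back-of-the-envelope sketch estimates the memory needed for the weights alone. The figure of 7 billion parameters is a rounded assumption taken from the model name, and the estimate ignores activations, the KV cache, and the small overhead of the quantization constants.

```python
# Rough weight-memory estimate; 7e9 parameters is a rounded assumption.
N_PARAMS = 7e9

def weight_memory_gb(n_params, bits_per_param):
    """Return the approximate weight size in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # half-precision weights
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # 4-bit quantized weights

print(f"fp16: ~{fp16_gb:.1f} GB, 4-bit: ~{nf4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights need roughly a quarter of the memory of the float16 weights.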

We load the model

In [48]:
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quantization_config
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Load the tokenizer.

In [49]:
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [50]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): Lla

# Evaluation and Generation Utilities

This section defines the core utility functions used for answer generation and evaluation. It includes a text generation pipeline, normalization routines, and metrics for computing Exact Match, F1, BLEU, and semantic similarity scores. These functions will be reused in subsequent evaluation experiments across different settings (zero-shot, one-shot, fine-tuning).

In [86]:
generation_args = {
    "max_new_tokens": 3000,
    "return_full_text": False,
    "temperature": 0.1,
    "do_sample": True,
}

We define the pipeline

In [87]:
qa_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)

Device set to use cuda:0


This function returns the generated answer for a given prompt. We will use it for the zero-shot and one-shot settings.

In [88]:
def chatbot(prompt):
    output = qa_pipeline(prompt, **generation_args)
    return output[0]['generated_text']

These functions compute the evaluation metrics used to compare predicted answers with the ground-truth references:

* `normalize_answer` standardizes an answer by lowercasing it and removing punctuation, articles, and extra whitespace.
* `exact_match_score` checks whether the normalized prediction matches the normalized reference exactly.
* `f1_score` measures the token overlap between the prediction and the reference as the harmonic mean of precision and recall.
* `bleu_score` calculates the BLEU score, a metric based on n-gram overlap, with smoothing for short texts.

In [89]:
def normalize_answer(s):
    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)
    def remove_punc(text):
        return ''.join(ch for ch in text if ch not in set(string.punctuation))
    def white_space_fix(text):
        return ' '.join(text.split())
    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))

def exact_match_score(prediction, ground_truth):
    return int(normalize_answer(prediction) == normalize_answer(ground_truth))

def f1_score(prediction, ground_truth):
    pred_tokens = normalize_answer(prediction).split()
    gt_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

def bleu_score(prediction, ground_truth):
    smoothie = SmoothingFunction().method4  # smoothing improves BLEU on short sentences
    pred_tokens = nltk.word_tokenize(prediction.lower())
    gt_tokens = nltk.word_tokenize(ground_truth.lower())
    return sentence_bleu([gt_tokens], pred_tokens, smoothing_function=smoothie)
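As a worked example of how Exact Match and token-level F1 differ, the sketch below applies the same normalization logic to the first test question. The function bodies are condensed copies of the ones above so that the example runs on its own; the `reordered` and `short` predictions are illustrative inputs.

```python
import re
import string
from collections import Counter

def normalize_answer(s):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match_score(prediction, ground_truth):
    return int(normalize_answer(prediction) == normalize_answer(ground_truth))

def f1_score(prediction, ground_truth):
    pred_tokens = normalize_answer(prediction).split()
    gt_tokens = normalize_answer(ground_truth).split()
    num_same = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())
    if num_same == 0:
        return 0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

target = "The music director of the Quebec Symphony Orchestra is Fabien Gabel."
reordered = "Fabien Gabel is the music director of the Quebec Symphony Orchestra."
short = "Fabien Gabel"

# Same tokens in a different order: perfect F1 but no exact match
em_reordered = exact_match_score(reordered, target)
f1_reordered = f1_score(reordered, target)

# Correct but terse answer: precision 2/2, recall 2/9
f1_short = f1_score(short, target)

print(em_reordered, f1_reordered, round(f1_short, 4))
```

Because F1 treats answers as bags of tokens, a reordered answer can reach F1 = 1.0 while Exact Match stays at 0, and a short correct answer is penalized through recall.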


Load a SentenceTransformer model to compute sentence embeddings for the target and generated answers, enabling semantic similarity evaluation

In [94]:
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

Given a single candidate answer and a target answer, this function returns the semantic similarity score between them, computed as the cosine similarity of their sentence embeddings.

In [95]:
def compute_similarity_score(candidate_answer, target_answer):
    # Encode the target and the candidate
    target_embedding = embedding_model.encode(target_answer, convert_to_tensor=True)
    candidate_embedding = embedding_model.encode(candidate_answer, convert_to_tensor=True)

    # Compute cosine similarity
    similarity = util.cos_sim(target_embedding, candidate_embedding).item()

    return similarity
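For reference, cosine similarity is the dot product of two vectors divided by the product of their norms. The sketch below computes it in plain Python on toy two-dimensional vectors; the function name `cosine_similarity_plain` is introduced here for illustration only and is not part of `sentence_transformers`.

```python
import math

def cosine_similarity_plain(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors pointing in the same direction have similarity close to 1
same_direction = cosine_similarity_plain([1.0, 2.0], [2.0, 4.0])
# Orthogonal vectors have similarity 0
orthogonal = cosine_similarity_plain([1.0, 0.0], [0.0, 1.0])

print(same_direction, orthogonal)
```

Floating-point rounding can push the value for identical inputs marginally above 1, which is why an identical prediction may be reported with a score such as 1.0000001 rather than exactly 1.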

# Zero-shot performance

We begin with the zero-shot setting to see how the model answers the questions without being given any example.


## Testing

We will print the answer to the first 5 samples of the test set.

In [90]:
n = 5

for i, row in test_set.head(n).reset_index(drop=True).iterrows():
    question = row['question']
    context = row['context']
    target_answer = row['answer']

    prompt = (
        "[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\n"
        f"Question: {question}\nContext: {context}\nAnswer: [/INST]"
    )

    # Generate the answer
    result = chatbot(prompt)

    # Output
    print(f"Example {i+1}:")
    print("Question:", question)
    print("\nTarget answer:", target_answer)
    print("\nGenerated answer:", result)
    print("-" * 40)

Example 1:
Question: Who is the music director of the Quebec Symphony Orchestra?

Target answer: The music director of the Quebec Symphony Orchestra is Fabien Gabel.

Generated answer:   The music director of the Quebec Symphony Orchestra is Fabien Gabel.
----------------------------------------
Example 2:
Question: Who were the four students of the University of Port Harcourt that were allegedly murdered?

Target answer: The four students of the University of Port Harcourt that were allegedly murdered were Chiadika Lordson, Ugonna Kelechi Obusor, Mike Lloyd Toku and Tekena Elkanah.

Generated answer:   The names of the four students of the University of Port Harcourt who were allegedly murdered are:

1. Chiadika Lordson
2. Ugonna Kelechi Obusor
3. Mike Lloyd Toku
4. Tekena Elkanah

I hope this information helps. Let me know if you have any other questions.
----------------------------------------
Example 3:
Question: What did Paul Wall offer to all U.S. Olympic Medalists?

Target answ

## Evaluation

We evaluate the performance of the zero-shot LLM.

In [100]:
def eval_answers_zero_shot(model, tokenizer, test_data):
    exact_matches = []
    f1_scores = []
    bleu_scores=[]
    cosine_similarity_scores=[]

    for idx, row in test_data.iterrows():
        question = row['question']
        context = row['context']
        target_answer = row['answer']

        prompt = (
            "[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\n"
            f"Question: {question}\nContext: {context}\nAnswer: [/INST]"
        )

        pred_answer = chatbot(prompt)

        em = exact_match_score(pred_answer, target_answer)
        f1 = f1_score(pred_answer, target_answer)
        bleu = bleu_score(pred_answer, target_answer)
        cosine_similarity_score = compute_similarity_score(pred_answer, target_answer)

        exact_matches.append(em)
        f1_scores.append(f1)
        bleu_scores.append(bleu)
        cosine_similarity_scores.append(cosine_similarity_score)

        print("em: ", em)
        print("f1: ", f1)
        print("bleu: ", bleu)
        print("cosine_similarity_scores: ", cosine_similarity_score)
        print("-" * 40)

    return (
      sum(bleu_scores) / len(bleu_scores) * 100,
      sum(exact_matches) / len(exact_matches) * 100,
      sum(f1_scores) / len(f1_scores) * 100,
      sum(cosine_similarity_scores) / len(cosine_similarity_scores) * 100
    )

In [101]:
avg_bleu_zero_shot, avg_em_zero_shot, avg_f1_zero_shot, cosine_similarity_score_zero_shot= eval_answers_zero_shot(model, tokenizer, small_test_set)

em:  1
f1:  1.0
bleu:  1.0
cosine_similarity_scores:  1.0000001192092896
----------------------------------------
em:  0
f1:  0.6060606060606061
bleu:  0.2567003823288495
cosine_similarity_scores:  0.9050230979919434
----------------------------------------
em:  0
f1:  0.8000000000000002
bleu:  0.5454951299940093
cosine_similarity_scores:  0.9453641176223755
----------------------------------------
em:  0
f1:  0.13765182186234817
bleu:  0.0391252795259626
cosine_similarity_scores:  0.8833816051483154
----------------------------------------
em:  0
f1:  0.25352112676056343
bleu:  0.09276033301639343
cosine_similarity_scores:  0.8959423899650574
----------------------------------------
em:  0
f1:  0.29394812680115273
bleu:  0.09436649561108859
cosine_similarity_scores:  0.8304895162582397
----------------------------------------
em:  0
f1:  0.06837606837606837
bleu:  0.005112042726386534
cosine_similarity_scores:  0.02513803541660309
----------------------------------------
em:  0
f1:  0

In [122]:
print(f"Average BLEU score: {avg_bleu_zero_shot:.2f}")
print(f"Exact Match: {avg_em_zero_shot:.2f}%")
print(f"F1 Score: {avg_f1_zero_shot:.2f}%")
print(f"Cosine Similarity Score: {cosine_similarity_score_zero_shot:.2f}%")

Average BLEU score: 18.89
Exact Match: 6.67%
F1 Score: 36.94%
Cosine Similarity Score: 79.79%


# One-shot performance

In this section we give the model a single worked example in addition to the task, and then ask it to answer a few questions from the test set.

In [123]:
EOS_TOKEN = tokenizer.eos_token

# Prompt template matching LLaMA-2 chat format
llama_prompt = """<s>[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

Context: {context}

Question: {question}
[/INST]
{answer}"""

# We use this function to format the llama_prompt correctly
def format_prompt(context, question, answer=None):
    # If answer is None, prepare prompt for generation (no answer)
    if answer is None:
        answer = ""
    return llama_prompt.format(context=context, question=question, answer=answer) + EOS_TOKEN
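Note that `format_prompt` appends the EOS token even when no answer is given, so a prompt meant for generation ends with an end-of-sequence marker. A possible variant, sketched below under the hypothetical name `format_prompt_for_generation`, appends it only to completed examples. This is an illustration, not the function used to produce the results in this notebook, and the EOS string `</s>` is hard-coded here so that the example runs without loading the tokenizer.

```python
EOS_TOKEN = "</s>"  # assumption: stands in for tokenizer.eos_token

LLAMA_CHAT_TEMPLATE = """<s>[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

Context: {context}

Question: {question}
[/INST]
{answer}"""

def format_prompt_for_generation(context, question, answer=None):
    """Hypothetical variant: close the turn with EOS only when an answer is given."""
    if answer is None:
        # Open-ended prompt: leave the answer slot empty and do not terminate it
        return LLAMA_CHAT_TEMPLATE.format(context=context, question=question, answer="")
    return LLAMA_CHAT_TEMPLATE.format(
        context=context, question=question, answer=answer
    ) + EOS_TOKEN

example = format_prompt_for_generation("Some context.", "A question?", "An answer.")
open_prompt = format_prompt_for_generation("Some context.", "A question?")
print(open_prompt)
```

With this variant the one-shot example still ends with EOS, while the test prompt ends right after `[/INST]` so the model continues from there.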

We prepare the prompt containing the example given to the LLM.

In [124]:
one_shot_example = train_set.iloc[0]
one_shot_prompt = format_prompt(
    one_shot_example["context"],
    one_shot_example["question"],
    one_shot_example["answer"]
)

In [125]:
print(one_shot_prompt)

<s>[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

Context: Caption: Tasmanian berry grower Nic Hansen showing Macau chef Antimo Merone around his property as part of export engagement activities.
THE RISE and rise of the Australian strawberry, raspberry and blackberry industries has seen the sectors redouble their international trade focus, with the release of a dedicated export plan to grow their global presence over the next 10 years.
Driven by significant grower input, the Berry Export Summary 2028 maps the sectors’ current position, where they want to be, high-opportunity markets and next steps.
Hort Innovation trade manager Jenny Van de Meeberg said the value and volume of raspberry and blackberry exports rose by 100 per cent between 2016 and 2017. She said the Australian strawberry industry experienced similar success with an almost 30 per cent rise in export volume and a 26 per cent rise in value to $32.6M over the same period.
“Australian berry sectors are in a firm posi

## Testing

We test the one-shot setting on the first 5 samples of the test set.

In [60]:
n = 5

for i, row in test_set.head(n).reset_index(drop=True).iterrows():
    question = row['question']
    context = row['context']
    target_answer = row['answer']

    # Format current test prompt WITHOUT answer
    test_prompt = format_prompt(context, question, answer=None)

    # Combine one-shot example + current test prompt
    full_prompt = one_shot_prompt + "\n" + test_prompt

    # Generate the answer from the full prompt (one-shot example + test question)
    result = chatbot(full_prompt)

    print(f"Example {i+1}:")
    print("Question:", question)
    print("\nTarget answer:", target_answer)
    print("\nGenerated answer:", result)
    print("-" * 40)

Example 1:
Question: Who is the music director of the Quebec Symphony Orchestra?

Target answer: The music director of the Quebec Symphony Orchestra is Fabien Gabel.

Generated answer: The music director of the Quebec Symphony Orchestra is Fabien Gabel.
----------------------------------------
Example 2:
Question: Who were the four students of the University of Port Harcourt that were allegedly murdered?

Target answer: The four students of the University of Port Harcourt that were allegedly murdered were Chiadika Lordson, Ugonna Kelechi Obusor, Mike Lloyd Toku and Tekena Elkanah.

Generated answer: According to the article, the four students of the University of Port Harcourt who were allegedly murdered are:

1. Chiadika Lordson
2. Ugonna Kelechi Obusor
3. Mike Lloyd Toku
4. Tekena Elkanah.
----------------------------------------
Example 3:
Question: What did Paul Wall offer to all U.S. Olympic Medalists?

Target answer: Paul Wall wants to give free gold grills to all U.S. Olympic Me

## Evaluation

We evaluate the one-shot model.

In [126]:
def eval_answers_one_shot(model, tokenizer, test_data):
    exact_matches = []
    f1_scores = []
    bleu_scores=[]
    cosine_similarity_scores=[]

    for idx, row in test_data.iterrows():
        question = row['question']
        context = row['context']
        target_answer = row['answer']

        # Format current test prompt WITHOUT answer
        test_prompt = format_prompt(context, question, answer=None)

        # Combine one-shot example + current test prompt
        full_prompt = one_shot_prompt + "\n" + test_prompt

        pred_answer = chatbot(full_prompt)

        em = exact_match_score(pred_answer, target_answer)
        f1 = f1_score(pred_answer, target_answer)
        bleu = bleu_score(pred_answer, target_answer)
        cosine_similarity_score = compute_similarity_score(pred_answer, target_answer)

        exact_matches.append(em)
        f1_scores.append(f1)
        bleu_scores.append(bleu)
        cosine_similarity_scores.append(cosine_similarity_score)

        print("em: ", em)
        print("f1: ", f1)
        print("bleu: ", bleu)
        print("cosine_similarity_scores: ", cosine_similarity_score)
        print("-" * 40)

    return (
      sum(bleu_scores) / len(bleu_scores) * 100,
      sum(exact_matches) / len(exact_matches) * 100,
      sum(f1_scores) / len(f1_scores) * 100,
      sum(cosine_similarity_scores) / len(cosine_similarity_scores) * 100
    )

In [127]:
avg_bleu_one_shot, avg_em_one_shot, avg_f1_one_shot, cosine_similarity_score_one_shot= eval_answers_one_shot(model, tokenizer, small_test_set)

em:  1
f1:  1.0
bleu:  1.0
cosine_similarity_scores:  1.0000001192092896
----------------------------------------
em:  0
f1:  0.7692307692307693
bleu:  0.38260294162784475
cosine_similarity_scores:  0.9396244883537292
----------------------------------------
em:  0
f1:  0.8275862068965517
bleu:  0.5767908748024404
cosine_similarity_scores:  0.9525476694107056
----------------------------------------
em:  0
f1:  0.32142857142857145
bleu:  0.09128266909356
cosine_similarity_scores:  0.8619110584259033
----------------------------------------
em:  0
f1:  0.3681592039800995
bleu:  0.14125123814566634
cosine_similarity_scores:  0.9050204753875732
----------------------------------------
em:  0
f1:  0.29041095890410956
bleu:  0.08695607321230917
cosine_similarity_scores:  0.8855873942375183
----------------------------------------
em:  0
f1:  0.6126126126126126
bleu:  0.3536744074148674
cosine_similarity_scores:  0.7865282893180847
----------------------------------------
em:  0
f1:  0.89473

In [128]:
print(f"Average BLEU score: {avg_bleu_one_shot:.2f}")
print(f"Exact Match: {avg_em_one_shot:.2f}%")
print(f"F1 Score: {avg_f1_one_shot:.2f}%")
print(f"Cosine Similarity Score: {cosine_similarity_score_one_shot:.2f}")

Average BLEU score: 28.49
Exact Match: 6.67%
F1 Score: 49.85%
Cosine Similarity Score: 86.13


# Fine tuning

In this section we fine-tune the Llama-2-7b-chat-hf LLM on our dataset to see whether question-answering performance improves.

## Preparation of the model

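We use a LoRA adapter because the model is too large to fine-tune in full on the available hardware: only the small low-rank adapter matrices are trained, while the quantized base weights stay frozen.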

In [104]:
adapter_configs = {
    'target_modules': 'all-linear',
    'lora_alpha': 16,
    'lora_dropout': 0.1,
    'r': 16,
    'bias': 'none',
    'task_type': 'CAUSAL_LM'
}

lora_configs = LoraConfig(**adapter_configs)
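To give a sense of scale, a LoRA adapter of rank `r` on a linear layer with `in_features` inputs and `out_features` outputs adds `r * (in_features + out_features)` trainable parameters. The sketch below applies this to the layer shapes printed for the model above. It assumes that `'all-linear'` adapts the seven projection layers in each of the 32 decoder blocks and leaves the output head untouched, so the total is an estimate rather than a value read from the model.

```python
R = 16          # LoRA rank, as in adapter_configs
N_LAYERS = 32   # decoder blocks, from the printed architecture

# (in_features, out_features) of the linear layers in one decoder block
ATTENTION_SHAPES = [(4096, 4096)] * 4                           # q, k, v, o
MLP_SHAPES = [(4096, 11008), (4096, 11008), (11008, 4096)]      # gate, up, down

def lora_params(in_features, out_features, r):
    """Parameters of the two low-rank matrices A (r x in) and B (out x r)."""
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in ATTENTION_SHAPES + MLP_SHAPES)
total = per_layer * N_LAYERS

print(f"per block: {per_layer:,}  total: {total:,}")
```

Under these assumptions the adapter has about 40 million trainable parameters, a small fraction of the roughly 7 billion in the base model.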

Prepare the model for 4-bit quantized training and apply LoRA (Low-Rank Adaptation) using the specified configuration

In [105]:
prepared_model_4bit = prepare_model_for_kbit_training(model)
model2 = get_peft_model(prepared_model_4bit, lora_configs)

In [137]:
# Alpaca-style instruction prompt used for fine-tuning (not the LLaMA-2 chat [INST] format)
llama_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token

# Format the dataset examples into the prompt format used for fine-tuning.
# Each prompt includes the question, context, and expected answer, and ends with an EOS token so the model learns when to stop generating.
def formatting_prompts_func(examples):
    instructions = examples["question"]
    inputs       = examples["context"]
    outputs      = examples["answer"]
    texts = []
    for instruction, context, output in zip(instructions, inputs, outputs):
        text = llama_prompt.format(instruction, context, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

train_dataset = Dataset.from_pandas(train_set)
formatted_train_set = train_dataset.map(formatting_prompts_func, batched=True)

Map:   0%|          | 0/9600 [00:00<?, ? examples/s]

Wrap the model with a PyTorch Lightning module

In [107]:
class LightningWrapper(L.LightningModule):
    def __init__(self, model, tokeniser, lr=1.e-4):
        super().__init__()
        self._model = model
        self._tokeniser = tokeniser
        self._lr = lr

    def configure_optimizers(self):
        # Build optimiser
        optimiser = AdamW(self.parameters(), lr=self._lr)

        return optimiser

    def forward(self, *args, **kwargs):
        return self._model.forward(*args, **kwargs)

    def training_step(self, mini_batch, mini_batch_idx):
        # Unpack the encoding and the target labels
        input_encodings, labels = mini_batch
        # Run generic forward step
        output = self.forward(**input_encodings)
        # Get the logits
        logits: torch.Tensor = output.logits
        # Shift logits to exclude the last element
        logits = logits[..., :-1, :].contiguous()
        # Shift labels to exclude the first element
        labels = labels[..., 1:].contiguous()
        # Compute the mean LM loss over tokens (labels equal to -100 are ignored)
        loss: torch.Tensor = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))

        return loss
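The shift in `training_step` aligns each position's prediction with the token that follows it. The sketch below shows the same alignment on a plain Python list; the token ids are made up for illustration.

```python
# Hypothetical token ids for one sequence (values are illustrative only)
token_ids = [1, 450, 1234, 338, 2]

# Positions whose logits are kept: every position except the last
prediction_positions = token_ids[:-1]
# Targets: every token except the first
targets = token_ids[1:]

# Each pair is (token seen at position t, token to predict at position t + 1)
pairs = list(zip(prediction_positions, targets))
print(pairs)
```

The last position has no next token to predict and the first token is never a target, which is why one element is dropped from each side.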

## Training

Now we can train our model.

Prepare training data loader


In [42]:
tokenizer.pad_token = tokenizer.eos_token

def collate(mini_batch):
    input_encodings = tokenizer([sample['text'] for sample in mini_batch], return_tensors='pt', padding=True)
    labels = input_encodings.input_ids.clone()
    labels[~input_encodings.attention_mask.bool()] = -100

    return input_encodings, labels

data_loader = DataLoader(
    formatted_train_set, collate_fn=collate, shuffle=True, batch_size=1
)
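The masking step in `collate` can be illustrated with plain lists. Because the pad token is set to the EOS token, padding cannot be identified by token id alone; the attention mask is what distinguishes the real EOS at the end of an example from the padding that follows it. The ids and the right-hand padding below are illustrative assumptions.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.functional.cross_entropy
EOS_ID = 2           # hypothetical id shared by the EOS and pad tokens

# Two sequences padded to the same length (illustrative ids)
input_ids = [
    [1, 5, 6, EOS_ID, EOS_ID, EOS_ID],   # real EOS followed by two pad tokens
    [1, 7, 8, 9, 10, EOS_ID],            # no padding needed
]
attention_mask = [
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
]

# Keep the label where the mask is 1, otherwise mark the position as ignored
labels = [
    [tok if keep else IGNORE_INDEX for tok, keep in zip(ids, mask)]
    for ids, mask in zip(input_ids, attention_mask)
]
print(labels)
```

The real EOS token keeps its label, so the model is still trained to end its answer, while padded positions contribute nothing to the loss.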

In [43]:
lightning_model = LightningWrapper(model2, tokenizer)

Configure the training setup with gradient accumulation, mixed precision, and gradient clipping.
This helps improve training efficiency, stability, and performance.

In [44]:
trainer = L.Trainer(
    accumulate_grad_batches=32,
    precision='bf16-mixed',  # Mixed precision (bf16-mixed or 16-mixed)
    gradient_clip_val=1.0,  # Gradient clipping
    max_epochs=1
)
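With a batch size of 1 and 32 accumulation steps, gradients from 32 examples are summed before each optimizer update. The short calculation below works out the effective batch size and the number of updates in one epoch, taking the 9,600 training examples reported by the `Map` progress output.

```python
n_examples = 9600              # training examples, from the Map output
batch_size = 1                 # DataLoader batch size
accumulate_grad_batches = 32   # Trainer setting

# Number of examples contributing to each optimizer update
effective_batch_size = batch_size * accumulate_grad_batches
# Optimizer updates performed in one epoch
optimizer_steps = n_examples // effective_batch_size

print(effective_batch_size, optimizer_steps)
```

One epoch therefore corresponds to 300 parameter updates, which is worth keeping in mind when interpreting the fine-tuning results.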

INFO: Using bfloat16 Automatic Mixed Precision (AMP)
INFO:lightning.pytorch.utilities.rank_zero:Using bfloat16 Automatic Mixed Precision (AMP)
INFO: Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.
INFO:lightning.pytorch.utilities.rank_zero:Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.
INFO: GPU available: True (cuda), used: True
INFO:lightning.pytorch.utilities.rank_zero:GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO:lightning.pytorch.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO: HPU available: False, using: 0 HPUs
INFO:lightning.pytorch.utilities.rank_zero:HPU available: False, using: 0 HPUs


Start the training

In [45]:
trainer.fit(lightning_model, train_dataloaders=data_loader)

INFO: You are using a CUDA device ('NVIDIA A100-SXM4-40GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
INFO:lightning.pytorch.utilities.rank_zero:You are using a CUDA device ('NVIDIA A100-SXM4-40GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:lightning.pytorch.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: 
  | Name   | Type                 | Params | Mode 
--------------------------------------

Training: |          | 0/? [00:00<?, ?it/s]

INFO: `Trainer.fit` stopped: `max_epochs=1` reached.
INFO:lightning.pytorch.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=1` reached.


Save the weights.

In [46]:
torch.save(lightning_model.state_dict(), 'model2_Llama.pth')

In [47]:
trainer.save_checkpoint("model2_Llama.ckpt")

## Testing

Let's test the fine tuned model and see if there are any improvements.

First define the generation arguments, then load the base model and apply the weights obtained from fine-tuning.

In [118]:
generation_args_fine_tuned = {
    "max_new_tokens": 3000,
    "temperature": 0.1,
    "do_sample": True,
}

Load the model.

In [108]:
fine_tuned_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quantization_config
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Load the weights in the model

In [109]:
state_dict = torch.load("model2_Llama.pth", map_location="cuda")
fine_tuned_model.load_state_dict(state_dict, strict=False)
fine_tuned_model.to("cuda")

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): Lla
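One thing worth checking at this step: `load_state_dict(..., strict=False)` skips keys that do not match instead of raising an error, and returns an object listing `missing_keys` and `unexpected_keys`. The saved dictionary comes from the `LightningWrapper`, which stores the PEFT-wrapped model under the attribute `_model`, so its key names may differ from those of a freshly loaded base model; the printed architecture above also shows only `Linear4bit` layers, with no LoRA submodules. The sketch below uses plain Python sets and hypothetical key names to illustrate how such a mismatch would leave nothing loaded. The exact names in the real checkpoint are an assumption and should be inspected directly.

```python
# Hypothetical key names, for illustration only
saved_keys = {
    "_model.base_model.model.model.layers.0.self_attn.q_proj.base_layer.weight",
    "_model.base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight",
    "_model.base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight",
}
target_keys = {
    "model.layers.0.self_attn.q_proj.weight",
}

# With strict=False, only keys present on both sides are copied
loaded = sorted(saved_keys & target_keys)
unexpected = sorted(saved_keys - target_keys)   # in the file, not in the model
missing = sorted(target_keys - saved_keys)      # in the model, not in the file

print("loaded:", len(loaded))
print("unexpected:", len(unexpected))
print("missing:", len(missing))
```

If the value returned by `load_state_dict` reports every saved key as unexpected, the fine-tuned weights were not applied and the model behaves like the base model.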

Define the function that generates and returns the answer. Because `model.generate` returns the prompt tokens followed by the newly generated tokens, the decoded text contains the prompt as well as the answer, possibly with additional explanations or formatting.

In [115]:
def chatbot_fine_tuned(prompt, tokenizer, model, generation_args):
    # Tokenize input
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Generate output
    outputs = model.generate(**inputs, **generation_args_fine_tuned)

    # Decode generated tokens
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

Define a function that extracts only the answer to the question from the generated response.

In [120]:
def extract_answer_only(generated_text):
    # Look for the "### Response:" delimiter and return only what follows it
    if "### Response:" in generated_text:
        return generated_text.split("### Response:")[1].strip()
    elif "[/INST]" in generated_text:
        return generated_text.split("[/INST]")[-1].strip()
    else:
        return generated_text.strip()
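A short usage sketch of the extraction step follows. The function is repeated so that the example runs on its own, and the decoded strings are hypothetical model outputs.

```python
def extract_answer_only(generated_text):
    # Same logic as the function above, repeated so the example is self-contained
    if "### Response:" in generated_text:
        return generated_text.split("### Response:")[1].strip()
    elif "[/INST]" in generated_text:
        return generated_text.split("[/INST]")[-1].strip()
    return generated_text.strip()

# Hypothetical decoded output: the prompt followed by the model's continuation
decoded = (
    "### Instruction:\nWho is the music director?\n\n"
    "### Input:\nSome context.\n\n"
    "### Response:\nFabien Gabel is the music director."
)
answer = extract_answer_only(decoded)
print(answer)

# If the model emits the delimiter again, only the first response is kept
repeated = decoded + "\n\n### Response:\nA second, unwanted continuation."
first_only = extract_answer_only(repeated)
print(first_only)
```

Taking index `[1]` after the split keeps the text between the first and second delimiter, which discards any extra response block the model might append.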

Test on a few samples.

In [142]:
n = 5

for i, row in test_set.head(n).reset_index(drop=True).iterrows():
    question = row['question']
    context = row['context']
    target_answer = row['answer']

    prompt = llama_prompt.format(question, context, "")

    # Generate the answer
    result = chatbot_fine_tuned(prompt, tokenizer, fine_tuned_model, generation_args_fine_tuned)
    clean_answer = extract_answer_only(result)

    # Output
    print(f"Example {i+1}:")
    print("Question:", question)
    print("\nTarget answer:", target_answer)
    print("\nGenerated answer:", clean_answer)
    print("-" * 40)

Example 1:
Question: Who is the music director of the Quebec Symphony Orchestra?

Target answer: The music director of the Quebec Symphony Orchestra is Fabien Gabel.

Generated answer: Fabien Gabel is the music director of the Quebec Symphony Orchestra.
----------------------------------------
Example 2:
Question: Who were the four students of the University of Port Harcourt that were allegedly murdered?

Target answer: The four students of the University of Port Harcourt that were allegedly murdered were Chiadika Lordson, Ugonna Kelechi Obusor, Mike Lloyd Toku and Tekena Elkanah.

Generated answer: The four students of the University of Port Harcourt who were allegedly murdered are:

1. Chiadika Lordson
2. Ugonna Kelechi Obusor
3. Mike Lloyd Toku
4. Tekena Elkanah
----------------------------------------
Example 3:
Question: What did Paul Wall offer to all U.S. Olympic Medalists?

Target answer: Paul Wall wants to give free gold grills to all U.S. Olympic Medalists.

Generated answer:

## Evaluation

Evaluation of the fine tuned model.

In [138]:
def eval_answers_fine_tuning(model, tokenizer, test_data):
    exact_matches = []
    f1_scores = []
    bleu_scores=[]
    cosine_similarity_scores=[]

    for idx, row in test_data.iterrows():
        question = row['question']
        context = row['context']
        target_answer = row['answer']

        prompt = llama_prompt.format(question, context, "")

        result = chatbot_fine_tuned(prompt, tokenizer, model, generation_args_fine_tuned)
        pred_answer = extract_answer_only(result)

        em = exact_match_score(pred_answer, target_answer)
        f1 = f1_score(pred_answer, target_answer)
        bleu = bleu_score(pred_answer, target_answer)
        cosine_similarity_score = compute_similarity_score(pred_answer, target_answer)

        exact_matches.append(em)
        f1_scores.append(f1)
        bleu_scores.append(bleu)
        cosine_similarity_scores.append(cosine_similarity_score)

        print("em: ", em)
        print("f1: ", f1)
        print("bleu: ", bleu)
        print("cosine_similarity_scores: ", cosine_similarity_score)
        print("-" * 40)

    return (
      sum(bleu_scores) / len(bleu_scores) * 100,
      sum(exact_matches) / len(exact_matches) * 100,
      sum(f1_scores) / len(f1_scores) * 100,
      sum(cosine_similarity_scores) / len(cosine_similarity_scores) * 100
    )

In [139]:
avg_bleu_fine_tuning, avg_em_fine_tuning, avg_f1_fine_tuning, cosine_similarity_score_fine_tuning= eval_answers_fine_tuning(fine_tuned_model, tokenizer, small_test_set)

em:  0
f1:  1.0
bleu:  0.7016879391277372
cosine_similarity_scores:  0.9814428091049194
----------------------------------------
em:  0
f1:  0.8163265306122449
bleu:  0.4387328902288626
cosine_similarity_scores:  0.9350205659866333
----------------------------------------
em:  0
f1:  0.9230769230769231
bleu:  0.8091067115702212
cosine_similarity_scores:  0.9610199332237244
----------------------------------------
em:  0
f1:  0.21383647798742136
bleu:  0.09502552658385692
cosine_similarity_scores:  0.7385621070861816
----------------------------------------
em:  0
f1:  0.19819819819819817
bleu:  0.0461812865601642
cosine_similarity_scores:  0.7915804386138916
----------------------------------------
em:  0
f1:  0.3228346456692913
bleu:  0.1068486020366336
cosine_similarity_scores:  0.9212591648101807
----------------------------------------
em:  0
f1:  0.13930348258706468
bleu:  0.018088381880463782
cosine_similarity_scores:  0.4658774137496948
----------------------------------------
e

In [140]:
print(f"Average BLEU score: {avg_bleu_fine_tuning:.2f}")
print(f"Exact Match: {avg_em_fine_tuning:.2f}%")
print(f"F1 Score: {avg_f1_fine_tuning:.2f}%")
print(f"Cosine Similarity Score: {cosine_similarity_score_fine_tuning:.2f}")

Average BLEU score: 25.38
Exact Match: 3.33%
F1 Score: 45.63%
Cosine Similarity Score: 83.00


# Comparison among all the models

In this final section we compare all the models.

In [141]:
# Create a dictionary with the metric values
data = {
    "EM": [avg_em_zero_shot, avg_em_one_shot, avg_em_fine_tuning],
    "F1": [avg_f1_zero_shot, avg_f1_one_shot, avg_f1_fine_tuning],
    "BLEU": [avg_bleu_zero_shot, avg_bleu_one_shot, avg_bleu_fine_tuning],
    "Cosine Similarity": [
        cosine_similarity_score_zero_shot,
        cosine_similarity_score_one_shot,
        cosine_similarity_score_fine_tuning
    ],
}

# Labels for the rows
index = ["Zero-shot", "One-shot", "Fine-tuning"]

# Create and display the table
df = pd.DataFrame(data, index=index)
df = df.round(2)
print(df)

               EM     F1   BLEU  Cosine Similarity
Zero-shot    6.67  36.94  18.89              79.79
One-shot     6.67  49.85  28.49              86.13
Fine-tuning  3.33  45.63  25.38              83.00


The results show an interesting pattern across the zero-shot, one-shot, and fine-tuned settings.

* **Exact Match (EM)**: All approaches score low on exact match (3.33% to 6.67%). On the 30-example test subset these figures correspond to one or two exact matches, so the gap between settings is a single example. The low scores reflect how strict the metric is and how variable natural language is: the first fine-tuned answer, for instance, contains exactly the reference tokens in a different order and therefore gets F1 = 1.0 but EM = 0. Fine-tuning scores slightly lower than zero- and one-shot on EM, possibly because the model rephrases the answer rather than copying the reference wording.

* **F1 Score**: One-shot prompting achieves the highest F1 score (49.85%), ahead of fine-tuning (45.63%) and zero-shot (36.94%). This may indicate that a single in-context example helps the model align with the format and structure of the expected answer, capturing more relevant tokens even if the phrasing varies.

* **BLEU Score**: BLEU scores follow a similar trend: one-shot > fine-tuning > zero-shot. This supports the idea that a single relevant example helps guide the model toward more syntactically and lexically similar responses.

* **Cosine Similarity**: All three methods show relatively high cosine similarity (above 79%), indicating that even when the outputs are not lexically identical, they remain semantically similar. One-shot achieves the highest semantic similarity, which is consistent with its strong F1 and BLEU scores.

**Conclusion**:
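Although fine-tuning is expected to provide task-specific learning, in this experiment one-shot prompting outperforms it in F1, BLEU, and semantic similarity. Possible explanations include limited training (a single epoch of LoRA on 9,600 examples), suboptimal training settings, the different prompt template used for the fine-tuned model, or the fact that the model already had strong general capabilities that were better leveraged through prompting than through additional supervised training. Because the evaluation uses only 30 test examples, these differences should be read as indicative rather than conclusive.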