## Introduction

This notebook explores a Retrieval-Augmented Generation (RAG) setup in three parts:
- Sentence-BERT as a pure retriever.
- A RAG pipeline using T5-small as the generator.
- Zero-shot and fine-tuned evaluation of the generator.

All configurations are evaluated with standard question-answering metrics: Exact Match (EM), token-level F1, and BLEU.
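For reference, the two token-level metrics can be written out as below. This is a sketch of the definitions that the `exact_match_score` and `f1_score` helpers in this notebook appear to follow. The symbols $p$, $g$, $P$, $G$ and $\operatorname{norm}$ are notation introduced here for illustration and do not appear in the notebook: $p$ is the predicted answer, $g$ the reference answer, $\operatorname{norm}$ the normalization step (lowercasing, removing punctuation and articles, collapsing whitespace), and $P$, $G$ the multisets of normalized tokens of $p$ and $g$.

$$
\mathrm{EM}(p, g) = \mathbb{1}\big[\operatorname{norm}(p) = \operatorname{norm}(g)\big],
\qquad
\text{precision} = \frac{|P \cap G|}{|P|},
\qquad
\text{recall} = \frac{|P \cap G|}{|G|},
\qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$

Here $|P \cap G|$ counts shared tokens with multiplicity, and $F_1$ is taken to be $0$ when no tokens are shared.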

## 1. Dataset Loading and Utility Functions

In [2]:
import pandas as pd

data=pd.read_parquet("/kaggle/input/nlp-resources/cleaned_data.parquet")
finetuning_data=pd.read_parquet("/kaggle/input/nlp-resources/tuning_data.parquet")
hyperparameter_tuning_data=pd.read_parquet("/kaggle/input/nlp-resources/hyperparameter_tuning_data.parquet")
test_data=pd.read_parquet("/kaggle/input/nlp-resources/test_data.parquet")

In [3]:
import re
import string
from collections import Counter
from functools import reduce

import nltk
import numpy as np
import pandas as pd
import torch
from datasets import Dataset, load_dataset
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sklearn.metrics import accuracy_score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from transformers import (
    DataCollatorForSeq2Seq,
    EarlyStoppingCallback,
    T5ForConditionalGeneration,
    T5Tokenizer,
    Trainer,
    TrainerCallback,
    TrainingArguments,
)

nltk.download('punkt')

def preprocess(example,tokenizer,max_length_input=512,max_length_labels=64,no_context=False):
    input_text = f"question: {example['question']}" 
    if (not no_context):
        input_text+=f" context: {example['context']}"
    target_text = example["answer"]
    inputs = tokenizer(input_text, max_length=max_length_input, truncation=True, padding="max_length")
    labels = tokenizer(target_text, max_length=max_length_labels, truncation=True, padding="max_length")["input_ids"] 
    inputs["labels"] =labels
    return inputs
def normalize_answer(s):
    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)
    def remove_punc(text):
        return ''.join(ch for ch in text if ch not in set(string.punctuation))
    def white_space_fix(text):
        return ' '.join(text.split())
    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))

def exact_match_score(prediction, ground_truth):
    return int(normalize_answer(prediction) == normalize_answer(ground_truth))

def f1_score(prediction, ground_truth):
    pred_tokens = normalize_answer(prediction).split()
    gt_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)
    
def bleu_score(prediction, ground_truth):
    smoothie = SmoothingFunction().method4  # improves BLEU on short sentences
    pred_tokens = nltk.word_tokenize(prediction.lower())
    gt_tokens = nltk.word_tokenize(ground_truth.lower())
    return sentence_bleu([gt_tokens], pred_tokens, smoothing_function=smoothie)
    
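def get_top_k_contexts(context_embeddings,model_embed,query,k=3):
    # Relies on the global `contexts` list, which must be aligned row by row with context_embeddings.
    query_emb=model_embed.encode([query], convert_to_numpy=True)
    similarity=cosine_similarity(query_emb,context_embeddings)[0]
    # argsort is ascending, so the last k indices are the k most similar contexts (best one last).
    top_k_idx = similarity.argsort()[-k:]
    return [contexts[i] for i in top_k_idx]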
    
def eval_answers(model,tokenizer,test_data=test_data,max_input=512,max_output=64, rag=False, no_context=False):
    exact_matches = []
    f1_scores = []
    bleu_scores=[]
    for idx, row in test_data.iterrows():
        context= None if no_context else row['context']
        pred_answer = generate_answer(model, tokenizer, row['question'],max_input_length=max_input,max_output_length=max_output, device=device,context=context )
        em = exact_match_score(pred_answer, row['answer'])
        f1 = f1_score(pred_answer, row['answer'])
        bleu = bleu_score(pred_answer, row['answer'])
        exact_matches.append(em)
        f1_scores.append(f1)
        bleu_scores.append(bleu)
    return sum(bleu_scores) / len(bleu_scores) * 100,sum(exact_matches) / len(exact_matches) * 100, sum(f1_scores) / len(f1_scores) * 100
def print_scores(avg_bleu=-1, avg_em=-1, avg_f1=-1,model_name="",scores_path="/kaggle/input/nlp-resources/scores_t5_rag.parquet",use_backup=False):
    
    if(use_backup):
        scores=pd.read_parquet(scores_path)
        scores_array=scores[scores["model_name"]==model_name][["avg_bleu","avg_em","avg_f1"]].values[0]
    else:
        scores_array=np.array([avg_bleu, avg_em, avg_f1])
    print(f"Scores of {model_name} model :")
    print(f"Average BLEU score: {scores_array[0]:.2f}")
    print(f"Exact Match: {scores_array[1]:.2f}%")
    print(f"F1 Score: {scores_array[2]:.2f}%")
device=torch.device("cuda" if torch.cuda.is_available() else "cpu")

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [4]:
# to be removed
import re
from bs4 import BeautifulSoup

def clean_text(text):
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = BeautifulSoup(text, "html.parser").get_text()
    text = re.sub(r"\s+", " ", text).strip()
    return text
finetuning_data["context"]=finetuning_data["context"].apply(lambda x : clean_text(x))
hyperparameter_tuning_data["context"]=hyperparameter_tuning_data["context"].apply(lambda x : clean_text(x))
test_data["context"]=test_data["context"].apply(lambda x : clean_text(x))
contexts=pd.concat([test_data["context"],hyperparameter_tuning_data["context"],finetuning_data["context"]]).to_list()




In [5]:
tokenizer_T5_tuned = T5Tokenizer.from_pretrained("/kaggle/input/trained-nlp-models/tokenizer_t5/tokenizer")
model_T5_tuned = T5ForConditionalGeneration.from_pretrained("/kaggle/input/trained-nlp-models/model_t5/trainer")
model_T5_tuned.to(device)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=512, bias=False)
              (k): Linear(in_features=512, out_features=512, bias=False)
              (v): Linear(in_features=512, out_features=512, bias=False)
              (o): Linear(in_features=512, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 8)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=512, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=512, bias=False)
              (dropout): Drop

## 2. Pure Retrieval using Sentence-BERT
Sentence-BERT encodes every context and every query into a dense vector. For each query, the contexts are ranked by the cosine similarity between their embedding and the query embedding, and the top-k are returned.
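The ranking step can be illustrated without loading any model. The following is a minimal sketch using NumPy only; the function name `top_k_by_cosine` and the toy 2-D vectors are hypothetical and stand in for real Sentence-BERT embeddings.

```python
import numpy as np

def top_k_by_cosine(query_vec, doc_matrix, k=3):
    """Return (indices, similarities) of the k rows most similar to query_vec, best first."""
    # After L2-normalisation, a dot product equals the cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    sims = d @ q
    # argsort is ascending, so reverse it and keep the first k entries.
    order = np.argsort(sims)[::-1][:k]
    return order.tolist(), sims[order].tolist()

# Toy "embeddings" (hypothetical values, for illustration only).
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([2.0, 0.1])

idx, scores = top_k_by_cosine(query, docs, k=2)
print(idx, [round(s, 3) for s in scores])
```

Unlike this sketch, the notebook's `get_top_k_contexts` takes `argsort()[-k:]` without reversing, so it returns the same set of contexts in ascending order of similarity. That makes no difference for a membership test such as Recall@K.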

In [6]:
import logging, transformers, sentence_transformers
from sentence_transformers import SentenceTransformer, LoggingHandler
# Silence library logging and the sentence-transformers progress bar.
logging.getLogger("transformers").setLevel(logging.ERROR)
logging.getLogger("sentence_transformers").setLevel(logging.ERROR)
sentence_transformers.util.tqdm = lambda *args, **kwargs: iter([])

model_embed = SentenceTransformer('all-MiniLM-L6-v2') 
context_embeddings = model_embed.encode(contexts, convert_to_numpy=True)

# Recall@k: fraction of test questions whose gold context is among the top-k retrieved contexts
def recall_k(test_data,k):
    hits=0
    for i,r in test_data.iterrows():
        top_contexts=get_top_k_contexts(context_embeddings,model_embed,query=r["question"],k=k)
        if(r["context"] in top_contexts):
            hits+=1
    return hits/len(test_data)


In [7]:
for k in [1, 3, 5]:
    # recall_k returns a fraction in [0, 1], so no percent sign is printed
    print(f"Recall@{k}: {recall_k(test_data, k):.2f}")

Recall@1: 0.80
Recall@3: 0.85
Recall@5: 0.87


### Sentence-BERT Retrieval Evaluation
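Retrieval quality is measured with Recall@K for K = 1, 3 and 5: the fraction of test questions whose gold context appears among the top-K retrieved contexts. The values below are fractions in [0, 1], loaded from the saved scores file.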
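To make the metric concrete, here is a small self-contained sketch on toy data. The function `recall_at_k` and the document ids are hypothetical. It assumes each query has exactly one gold document and that rankings are ordered best first.

```python
def recall_at_k(ranked_ids_per_query, gold_ids, k):
    """Fraction of queries whose gold document id is among the first k retrieved ids."""
    hits = sum(
        1
        for ranked, gold in zip(ranked_ids_per_query, gold_ids)
        if gold in ranked[:k]
    )
    return hits / len(gold_ids)

# Four toy queries: each row is a best-first ranking of document ids.
rankings = [[3, 1, 2], [0, 2, 1], [1, 3, 0], [2, 0, 3]]
gold = [3, 1, 0, 1]  # the single relevant document id for each query

for k in (1, 2, 3):
    print(f"Recall@{k}: {recall_at_k(rankings, gold, k):.2f}")
```

Recall@K can only stay the same or grow as K increases, which matches the trend in the scores reported in this section.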

In [8]:
import pandas as pd
scores=pd.read_parquet("/kaggle/input/nlp-resources/scores_t5_rag.parquet")
for i in [1,3,5]:
    recall=f"rcall{i}"
    print(f"Recall@{i}: {scores[scores['model_name']=='sb_ctx_retrieval'][recall].values[0]} ")

Recall@1: 0.8 
Recall@3: 0.85 
Recall@5: 0.87 


## 3. RAG Pipeline with T5-small as answer generator
The pipeline combines two models:
- **Sentence-BERT**: retrieves the most relevant contexts for a given question.
- **T5-small**: generates an answer from the retrieved context.

### Retrieval Strategy (`on_top_chunk` flag)

Two generation strategies are supported, controlled by the `on_top_chunk` flag:

- `on_top_chunk = True` (default):  
  Only the **top-ranked** context (by Sentence-BERT similarity with the query) is passed to T5 for answer generation.

- `on_top_chunk = False`:  
  The top-k retrieved contexts are split into chunks and the model generates **one answer for each chunk**.  
  A **cross-encoder** then **scores each question-answer pair and selects the best answer** among the candidates.
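The control flow of the two strategies can be sketched independently of any model. In the snippet below, `answer_with_strategy`, `toy_generate` and `toy_score` are hypothetical stand-ins: the generator simply returns the first sentence of a chunk and the scorer counts word overlap with the question, where the real pipeline would use T5 and a cross-encoder.

```python
def answer_with_strategy(question, ranked_chunks, generate, score, on_top_chunk=True):
    """ranked_chunks is ordered best first.

    generate(question, chunk) -> str produces a candidate answer.
    score(question, answer) -> float rates a question-answer pair.
    """
    if on_top_chunk:
        # Strategy 1: trust the retriever and answer from the top-ranked chunk only.
        return generate(question, ranked_chunks[0])
    # Strategy 2: answer from every chunk, then keep the highest-scoring answer.
    candidates = [generate(question, chunk) for chunk in ranked_chunks]
    return max(candidates, key=lambda ans: score(question, ans))

def toy_generate(question, chunk):
    # Stand-in for the generator: return the first sentence of the chunk.
    return chunk.split(".")[0]

def toy_score(question, answer):
    # Stand-in for the cross-encoder: number of distinct words shared with the question.
    return len(set(question.lower().split()) & set(answer.lower().split()))

chunks = [
    "Paris is large. It has museums.",
    "The capital of France is Paris. It is on the Seine.",
]
question = "what is the capital of france"

top_chunk_answer = answer_with_strategy(question, chunks, toy_generate, toy_score, on_top_chunk=True)
best_answer = answer_with_strategy(question, chunks, toy_generate, toy_score, on_top_chunk=False)
print(top_chunk_answer)
print(best_answer)
```

The second strategy costs one generation per chunk plus a scoring pass, so it is slower. It only helps when the answer scorer ranks candidates better than the retriever ranks contexts.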
  

In [9]:
from sentence_transformers import CrossEncoder
class ChunkRag:
    def __init__(
        self,
        corpus_contexts,
        retriever_name="all-MiniLM-L6-v2",
        qa_name="t5-small",
        crossencoder_name="cross-encoder/ms-marco-MiniLM-L-6-v2",
        qa_tok=None,
        qa_model=None,
        device=None,
    ):
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")

        self.embedder = SentenceTransformer(retriever_name, device=self.device)
        self.embedder.eval()
        self.cross_encoder = CrossEncoder(crossencoder_name, device=self.device)
        self.qa_tok = qa_tok if qa_tok else T5Tokenizer.from_pretrained(qa_name)
        self.qa_model = qa_model if qa_model else T5ForConditionalGeneration.from_pretrained(qa_name).to(self.device).eval()

        self.contexts = corpus_contexts
        with torch.no_grad():
            self.ctx_emb = self.embedder.encode(corpus_contexts, convert_to_tensor=True, normalize_embeddings=True)
    def chunk_text(self ,text,  max_tokens=512, stride=None):
        if stride is None or stride >= max_tokens:
            stride = max_tokens
    
        ids = self.qa_tok.encode(text, add_special_tokens=False)
        chunks = []
        start = 0
        while start < len(ids):
            end = min(start + max_tokens, len(ids))
            chunk_ids = ids[start:end]
            chunks.append(self.qa_tok.decode(chunk_ids))
            if end == len(ids):
                break
            start += stride
        return chunks      
    def _get_topk(self, question, k):
        q_emb = self.embedder.encode(question, convert_to_tensor=True, normalize_embeddings=True)

        scores = sentence_transformers.util.cos_sim(q_emb, self.ctx_emb)[0].cpu()
        best = torch.topk(scores, k=k).indices.cpu().tolist()
        return [self.contexts[i] for i in best],self.contexts[np.argmax(scores)]
    def pick_best(self, question, answers):
        pairs = [[question, ans] for ans in answers]
        scores = self.cross_encoder.predict(pairs)
        best_idx = np.argmax(scores)
        return answers[best_idx]

    def answer(self, question, top_k=3, on_top_chunk=True,
               max_tokens=512, stride=None):
        top_contexts,top_ctx = self._get_topk(question, k=top_k)

        all_chunks = []
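        for ctx in top_contexts:
            # Forward the chunking parameters instead of silently using the defaults.
            all_chunks.extend(
                self.chunk_text(ctx, max_tokens=max_tokens, stride=stride)
            )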

        if on_top_chunk:
            best_answer = self._generate_answer(question, top_ctx)
        else:
            candidate_answers = [
                self._generate_answer(question, ch) for ch in all_chunks
            ]
            best_answer = self.pick_best(question, candidate_answers)
    
        return best_answer
        
    def _generate_answer(self, question, context):
        input_text = f"question: {question} context: {context}"
        inputs = self.qa_tok(input_text, return_tensors="pt", max_length=512, truncation=True).to(self.device)
        outputs = self.qa_model.generate(
            **inputs,
            max_length=64,
            num_beams=4,
            early_stopping=True
        )
        return self.qa_tok.decode(outputs[0], skip_special_tokens=True)
    def eval(self,test_data,on_top_chunk=True):
        exact_matches = []
        f1_scores = []
        bleu_scores=[]
        for idx, row in test_data.iterrows():
            pred_answer = self.answer(row["question"],on_top_chunk=on_top_chunk)
            em = exact_match_score(pred_answer, row['answer'])
            f1 = f1_score(pred_answer, row['answer'])
            bleu = bleu_score(pred_answer, row['answer'])
            exact_matches.append(em)
            f1_scores.append(f1)
            bleu_scores.append(bleu)
        return sum(bleu_scores) / len(bleu_scores) * 100,sum(exact_matches) / len(exact_matches) * 100, sum(f1_scores) / len(f1_scores) * 100

In [10]:
rag_tuned_model=ChunkRag(corpus_contexts=contexts,qa_tok=tokenizer_T5_tuned,qa_model=model_T5_tuned)
rag_model=ChunkRag(corpus_contexts=contexts)


In [17]:
#sample an example
sample=test_data.sample(n=1,random_state=42)
sample["question"].values[0]

'What are the responsibilities and requirements for the Software Architect position at the entertainment company?'

In [18]:
sample["context"].values[0]

'an entertainment company specializing in advice-based products and services is looking for a software architect to work alongside their growing team, as a contractor. responsibilities and requirements -develop and analyze the design and architecture of complex software application systems. -provide architectural and implementation oversight and guidance to ensure consistency and quality of design and code. -analyze and document existing systems, review preexisting complex code and provide recommendations to improve performance & maintainability. -ability to communicate technical issues and concepts clearly, both orally and in writing -write test and debug complex problems in various modules of the various software application -code reviews -manage test and acceptance activities -direct contribution to development and test efforts -manage and support system deployment- -design and build reusable modules to be used throughout our applications -collaborate with senior developers to desig

## 4. Aggregated performance
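- Tuned T5, answer generated from the top chunk (`on_top_chunk = True`)
- Tuned T5, best answer selected across chunks (`on_top_chunk = False`)
- Zero-shot T5, best answer selected across chunks (`on_top_chunk = False`)
- Zero-shot T5, answer generated from the top chunk (`on_top_chunk = True`)

Zero-shot performance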

In [12]:
#avg_bleu, avg_em, avg_f1=rag_model.eval(test_data)
#print_scores(avg_bleu, avg_em, avg_f1)
#avg_bleu, avg_em, avg_f1=rag_model.eval(test_data,on_top_chunk=False)
#print_scores(avg_bleu, avg_em, avg_f1)

print_scores(model_name="rag_zero_shot_best_answer",use_backup=True)
print_scores(model_name="rag_zero_shot_best_chunk",use_backup=True)

Scores of rag_zero_shot_best_answer model :
Average BLEU score: 1.43
Exact Match: 0.06%
F1 Score: 10.76%
Scores of rag_zero_shot_best_chunk model :
Average BLEU score: 2.71
Exact Match: 0.06%
F1 Score: 17.51%


Evaluating the zero-shot model on the sampled example:

In [21]:
zero_shot_answer_top_chunk=rag_model.answer(sample["question"].values[0])
zero_shot_answer_top_answer=rag_model.answer(sample["question"].values[0],on_top_chunk=False)
zero_shot_answer_top_answer,zero_shot_answer_top_chunk,sample["answer"].values[0]

('provide pre-sales consulting, analysis, design for networking solutions including assessments, road mapping, architecture/design/deployment/migration projects',
 'the appropriate technology skills, professional experience, knowledge of the industry’s best practice, the agility to deal with any kind of technology, and a can-do attitude',
 'The responsibilities and requirements for the Software Architect position include developing and analyzing the design and architecture of complex software application systems, providing architectural and implementation oversight, analyzing and documenting existing systems, communicating technical issues clearly, writing and debugging complex problems in various software modules, managing test and acceptance activities, designing and building reusable modules, writing application code using latest C# and ASP.NET framework, assisting in building database structure for SQL Server, understanding cloud engineering, handling and prioritizing multiple task

In [28]:
print("Comparisons with dataset answer, on top_answer: "+
f"\nF1:{f1_score(zero_shot_answer_top_answer, sample['answer'].values[0]):.2f}",
      f"BLEU:{bleu_score(zero_shot_answer_top_answer, sample['answer'].values[0]):.2f}",
f"EM:{exact_match_score(zero_shot_answer_top_answer, sample['answer'].values[0]):.2f}","\non context chunk: "+
f"\nF1:{f1_score(zero_shot_answer_top_chunk, sample['answer'].values[0]):.2f}",
      f"BLEU:{bleu_score(zero_shot_answer_top_chunk, sample['answer'].values[0]):.2f}",
f"EM:{exact_match_score(zero_shot_answer_top_chunk, sample['answer'].values[0]):.2f}")

Comparisons with dataset answer, on top_answer: 
F1:0.02 BLEU:0.00 EM:0.00 
on context chunk: 
F1:0.08 BLEU:0.00 EM:0.00


Tuned Q&A model performance

In [13]:
#avg_bleu, avg_em, avg_f1=rag_tuned_model.eval(test_data)
#print_scores(avg_bleu, avg_em, avg_f1)
#avg_bleu, avg_em, avg_f1=rag_tuned_model.eval(test_data,on_top_chunk=False)
#print_scores(avg_bleu, avg_em, avg_f1)

print_scores(model_name="rag_best_answer",use_backup=True)
print_scores(model_name="rag_best_chunk",use_backup=True)

Scores of rag_best_answer model :
Average BLEU score: 27.53
Exact Match: 2.86%
F1 Score: 45.05%
Scores of rag_best_chunk model :
Average BLEU score: 36.35
Exact Match: 8.41%
F1 Score: 55.30%


Evaluating the tuned model on the sampled example:

In [22]:
answer_top_chunk=rag_tuned_model.answer(sample["question"].values[0])
answer_top_answer=rag_tuned_model.answer(sample["question"].values[0],on_top_chunk=False)
answer_top_answer,answer_top_chunk,sample["answer"].values[0]

('The responsibilities and requirements for the Software Architect position at the entertainment company include establishing relationship with customer technical teams and technical leads, communicating technical architectures to customers, focusing on technical benefits and competitive differentiation, assessing business opportunity and technical innovations, translating customer technical and business requirements into technical solutions, building relationships',
 'The responsibilities and requirements for the Software Architect position at the entertainment company include the ability to interact and work collaboratively with all members of the organization, a bachelor’s degree in information technology, computer science, or related field, 3-5 years related experience, industry certifications and membership, excellent problem-',
 'The responsibilities and requirements for the Software Architect position include developing and analyzing the design and architecture of complex softwa

In [29]:
print("Comparisons with dataset answer, on top_answer: "+
f"\nF1:{f1_score(answer_top_answer, sample['answer'].values[0]):.2f}",
      f"BLEU:{bleu_score(answer_top_answer, sample['answer'].values[0]):.2f}",
f"EM:{exact_match_score(answer_top_answer, sample['answer'].values[0]):.2f}","\non context chunk: "+
f"\nF1:{f1_score(answer_top_chunk, sample['answer'].values[0]):.2f}",
      f"BLEU:{bleu_score(answer_top_chunk, sample['answer'].values[0]):.2f}",
f"EM:{exact_match_score(answer_top_chunk, sample['answer'].values[0]):.2f}")

Comparisons with dataset answer, on top_answer: 
F1:0.15 BLEU:0.01 EM:0.00 
on context chunk: 
F1:0.20 BLEU:0.02 EM:0.00


## 5. Final Comparison of All Model Configurations

This table summarizes the performance of all major configurations evaluated throughout the project.

| Model Configuration | `on_top_chunk` | Training Type     | Exact Match (%) | F1 Score (%) | BLEU Score |
|---------------------|----------------|-------------------|-----------------|--------------|------------|
| Zero-Shot           | False          | Pretrained only   | 0.06            | 10.76        | 1.43       |
| Zero-Shot           | True           | Pretrained only   | 0.06            | 17.51        | 2.71       |
| Fine-Tuned          | False          | Supervised tuning | 2.86            | 45.05        | 27.53      |
| Fine-Tuned          | True           | Supervised tuning | 8.41            | 55.30        | 36.35      |