# Assignment 2

## Retrieval-Augmented Generation (RAG) pipeline Implementation

This assignment focuses on Retrieval-Augmented Generation (RAG), which is a method used to improve the performance of language models by integrating external knowledge retrieval systems. Instead of relying solely on the model's internal knowledge, RAG allows the model to retrieve relevant information from external documents or databases before generating an answer.

### Libraries

In [1]:
!pip install rouge


Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl.metadata (4.1 kB)
Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


In [2]:
!pip install rouge-score

Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... [?25l[?25hdone
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=dfc4c6c2f065c1638102fe574e23924f58ac0a6953cf0af62c71e71b42469dab
  Stored in directory: /root/.cache/pip/wheels/1e/19/43/8a442dc83660ca25e163e1bd1f89919284ab0d0c1475475148
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2


In [3]:
import os
import torch
import zipfile
import numpy as np
import pandas as pd
from tqdm import tqdm
from rouge import Rouge
import matplotlib.pyplot as plt
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModelForCausalLM
from langchain.text_splitter import RecursiveCharacterTextSplitter

### Data Loading

 Downloading the dataset from Zenodo and extracts it into the directory arxiv_data. This is done by calling zipfile.ZipFile and using the extractall method to unpack the files into a folder.

In [4]:
!wget -O arxiv_abs_title.zip "https://zenodo.org/records/3496527/files/gcunhase%2FArXivAbsTitleDataset-v1.0.zip?download=1"
with zipfile.ZipFile('arxiv_abs_title.zip', 'r') as zip_ref:
    zip_ref.extractall("arxiv_data")

--2025-05-10 17:31:24--  https://zenodo.org/records/3496527/files/gcunhase%2FArXivAbsTitleDataset-v1.0.zip?download=1
Resolving zenodo.org (zenodo.org)... 188.185.48.194, 188.185.43.25, 188.185.45.92, ...
Connecting to zenodo.org (zenodo.org)|188.185.48.194|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13098401 (12M) [application/octet-stream]
Saving to: ‘arxiv_abs_title.zip’


2025-05-10 17:31:26 (7.86 MB/s) - ‘arxiv_abs_title.zip’ saved [13098401/13098401]



In [5]:
# Unzip
with zipfile.ZipFile('arxiv_abs_title.zip', 'r') as zip_ref:
    zip_ref.extractall('arxiv_data')

In [6]:
# List files
extracted_files = os.listdir('arxiv_data')
print("Extracted files:", extracted_files)

Extracted files: ['gcunhase-ArXivAbsTitleDataset-923c122']


In [7]:
extracted_folder = 'arxiv_data/gcunhase-ArXivAbsTitleDataset-923c122'
for subdir, dirs, files in os.walk(extracted_folder):
    print(f"Directory: {subdir}")
    print("Files:", files)


Directory: arxiv_data/gcunhase-ArXivAbsTitleDataset-923c122
Files: ['.gitignore', 'LICENSE', 'requirements.txt', 'README.md']
Directory: arxiv_data/gcunhase-ArXivAbsTitleDataset-923c122/results
Files: ['language generation_14514_15000_15_abs.txt', 'language generation_14514_15000_15_title.txt', 'computer vision_14582_15000_15_title.txt', 'artificial intelligence_10047_15000_15_abs.txt', 'computer vision_14582_15000_15_abs.txt', 'artificial intelligence_10047_15000_15_title.txt']
Directory: arxiv_data/gcunhase-ArXivAbsTitleDataset-923c122/modules
Files: ['main.py', 'regex_markers.py', '__init__.py']


### Step 1: Data Ingestion & Cleaning

You need to load and clean a dataset from a specified source. The dataset consists of research papers, each with a title, abstract, and categories. You'll need to prepare the data by grouping these into pairs (title + abstract) and storing it in a format that's usable for the next steps.

what we will do :

-Download & Unzip the Dataset:
The dataset is downloaded and extracted from a zip file containing research papers' titles and abstracts.

-Group Title and Abstract Files:
Titles and abstracts are paired together by matching files based on their common identifier in the filename.

-Read the Data:
The titles and abstracts are read from their respective files, cleaned (removed extra spaces), and stored as lists.

-Create a Data Structure:
Each paper’s data (ID, title, abstract, category) is organized into a dictionary, and these dictionaries are stored in a list.

-Convert to DataFrame:
The list of dictionaries is converted into a Pandas DataFrame, making it easy to manage and work with the data for further processing.

In [18]:
def Data_Loading_Cleaning():
    results_folder = 'arxiv_data/gcunhase-ArXivAbsTitleDataset-923c122/results'
    data = []
    file_pairs = {}
    # Group title and abstract files
    for file in os.listdir(results_folder):
        if file.endswith("_title.txt"):
            base_name = file.rsplit('_', 2)[0]
            file_pairs.setdefault(base_name, {})['title'] = file
        elif file.endswith("_abs.txt"):
            base_name = file.rsplit('_', 2)[0]
            file_pairs.setdefault(base_name, {})['abstract'] = file

    # Process each pair
    for base_name, files in file_pairs.items():
        if 'title' in files and 'abstract' in files:
            category = ' '.join(base_name.split('_')[:2])
            # Read title
            with open(os.path.join(results_folder, files['title']), 'r', encoding='utf-8') as f:
                titles = [line.strip() for line in f if line.strip()]
            # Read abstract
            with open(os.path.join(results_folder, files['abstract']), 'r', encoding='utf-8') as f:
                abstracts = [line.strip() for line in f if line.strip()]
            # Pair them up
            for title, abstract in zip(titles, abstracts):
                data.append({
                    'id': f"{base_name}_{hash(title)}",
                    'title': title,
                    'abstract': abstract,
                    'categories': [category]
                })

    return pd.DataFrame(data)

In [19]:
documents_df = Data_Loading_Cleaning()
print(f"Loaded {len(documents_df)} documents")
print("\nSample document:")
print(documents_df.iloc[0][['title', 'categories']])

Loaded 39143 documents

Sample document:
title         On Derivation Languages of Flat Splicing Systems
categories                         [language generation 14514]
Name: 0, dtype: object


In [20]:
documents_df

Unnamed: 0,id,title,abstract,categories
0,language generation_14514_15000_72247342182511...,On Derivation Languages of Flat Splicing Systems,"In this work, we associate the idea of derivat...",[language generation 14514]
1,language generation_14514_15000_25228565051295...,What's in a Name?,This paper describes experiments on identifyin...,[language generation 14514]
2,language generation_14514_15000_-9906791547888...,Fence - An Efficient Parser with Ambiguity Sup...,Model-based language specification has applica...,[language generation 14514]
3,language generation_14514_15000_77027200976603...,On Even Linear Indexed Languages with a Reduct...,This paper presents a restricted form of linea...,[language generation 14514]
4,language generation_14514_15000_-4937847187420...,Multi-Level Languages are Generalized Arrows,Multi-level languages and Arrows both facilita...,[language generation 14514]
...,...,...,...,...
39138,artificial intelligence_10047_15000_-814032282...,Learning and Real-time Classification of Hand-...,We describe a novel spiking neural network (SN...,[artificial intelligence 10047]
39139,artificial intelligence_10047_15000_-996552663...,"""Dave...I can assure you...that it's going to ...","As technology becomes more advanced, those who...",[artificial intelligence 10047]
39140,artificial intelligence_10047_15000_7544770901...,KBGAN: Adversarial Learning for Knowledge Grap...,"We introduce KBGAN, an adversarial learning fr...",[artificial intelligence 10047]
39141,artificial intelligence_10047_15000_3476644625...,Differential Performance Debugging with Discri...,Differential performance debugging is a techni...,[artificial intelligence 10047]


### Step 2 : Chunking Strategy

Next, we will break the documents into smaller chunks for easier embedding and retrieval. Each chunk will be a part of the document, such as a sentence or a paragraph, with an optional overlap.

Explanation: We use the RecursiveCharacterTextSplitter to split each document into smaller chunks, such as sentences or short paragraphs, making it easier for the model to process.



How We Did It:

-We took each paper’s title and abstract.

-We split them into smaller chunks (like sentences or paragraphs).

-We made sure each chunk has a maximum size (300 characters in this case).

-If the chunk is too big, we add an overlap (50 characters) between consecutive chunks to make sure the context isn't lost.

In [21]:
def split_into_chunks(documents, max_chunk_size=300, overlap=50):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=max_chunk_size,
        chunk_overlap=overlap,
        length_function=len
    )

    chunks = []
    for _, row in documents.iterrows():
        text = f"Title: {row['title']}\nAbstract: {row['abstract']}"
        split_texts = splitter.split_text(text)

        for i, chunk in enumerate(split_texts):
            chunks.append({
                "id": f"{row['id']}_{i}",
                "document_id": row['id'],
                "text": chunk,
                "title": row['title'],
                "categories": row['categories'],
                "chunk_num": i,
                "total_chunks": len(split_texts)
            })

    return chunks

Final Output:

Each chunk is saved with metadata, such as:

-id for uniqueness.

-document_id to link it to the original document.

-chunk number to keep track of the chunk's order.

This process helps in managing large documents and improves efficiency during text retrieval and processing.

In [22]:
# Generate chunks for all documents
document_chunks = split_into_chunks(documents_df)
print(f"Created {len(document_chunks)} chunks from {len(documents_df)} documents.")

Created 200811 chunks from 39143 documents.


###Step 3 : Vectorization (Embeddings)

Now we convert the text chunks into numerical embeddings using Sentence-BERT. This allows us to perform similarity searches later on.

Explanation: Using Sentence-BERT, we convert the chunks into embeddings. This model turns each text chunk into a vector of numbers that can be used to measure similarity between documents or queries.



How We Did It:

-We used a pre-trained model, Sentence-BERT, to convert each chunk of text (title + abstract) into a vector.

-The model takes the text, processes it, and outputs a numerical vector that represents the meaning of the text in a way the machine can understand.

In [23]:
def generate_embeddings(chunks, model_name='sentence-transformers/all-mpnet-base-v2'):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = SentenceTransformer(model_name).to(device)

    texts = [chunk['text'] for chunk in chunks]
    embeddings = model.encode(texts, batch_size=16, show_progress_bar=True)

    return embeddings

Why Vectorization?

Machines understand numbers, not text. So, we convert the text into numbers (embeddings) to make it easier to compare, search, and process the information.

In [24]:
# Compute embeddings for all chunks
embeddings = generate_embeddings(document_chunks)
print(f"Generated embeddings of shape: {embeddings.shape}")

Batches:   0%|          | 0/12551 [00:00<?, ?it/s]

Generated embeddings of shape: (200811, 768)


### Step 4: Retrieval Module

implement a system to retrieve the most relevant chunks of text from the dataset based on a user's query. This uses the embeddings you computed earlier. The retrieved chunks should be the most similar to the query.



How It Works:

-Storing Chunks and Embeddings:

We store the text chunks (title + abstract) and their numerical embeddings (vectors) that were created during vectorization. These embeddings capture the meaning of each chunk.

-Retrieving Relevant Chunks:

When you provide a query, we first convert the query into an embedding (a numerical vector).

-We then compare the query embedding to the embeddings of all the text chunks using cosine similarity, which measures how similar two vectors are.

The most similar chunks are retrieved.

-Returning Top-K Chunks:

We return the top-k most similar chunks (e.g., the top 5) based on their similarity score with the query.

Evaluation:
We also evaluate the retrieval system’s performance by measuring:

Recall: How many relevant chunks were retrieved.

Precision: How accurate the retrieved chunks are.

Similarity: How similar the retrieved chunks are to the query.

In [25]:
class SimpleRetriever:
    def __init__(self, chunks, embeddings):
        self.chunks = chunks
        self.embeddings = embeddings

    def retrieve(self, query_embedding, top_k=5):
        similarities = cosine_similarity(query_embedding.reshape(1, -1), self.embeddings)[0]
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [(self.chunks[i], similarities[i]) for i in top_indices]

    def evaluate_retrieval(self, queries, k_values=range(1, 11)):
        results = {'recall': {}, 'precision': {}, 'mean_similarity': {}}

        for k in k_values:
            recall_values, precision_values, similarity_values = [], [], []
            for query in queries:
                query_embedding = generate_embeddings([{'text': query}])[0]
                retrieved = self.retrieve(query_embedding, top_k=k)

                # Compute recall and precision
                retrieved_ids = {item[0]['document_id'] for item in retrieved}
                recall = len(retrieved_ids) / len(retrieved_ids)
                precision = recall

                recall_values.append(recall)
                precision_values.append(precision)
                similarity_values.append(np.mean([score for _, score in retrieved]))

            # Store the average results for each k
            results['recall'][k] = np.mean(recall_values)
            results['precision'][k] = np.mean(precision_values)
            results['mean_similarity'][k] = np.mean(similarity_values)

        return results

In [26]:
# Initialize the retriever
retriever = SimpleRetriever(document_chunks, embeddings)

# Example: Create queries for evaluation
test_queries = [
    "What are the most recent advancements in machine learning?",
    "How does natural language processing work?"
]

# Evaluate retrieval performance
retrieval_metrics = retriever.evaluate_retrieval(test_queries)
print("Retrieval Metrics:", retrieval_metrics)


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Retrieval Metrics: {'recall': {1: np.float64(1.0), 2: np.float64(1.0), 3: np.float64(1.0), 4: np.float64(1.0), 5: np.float64(1.0), 6: np.float64(1.0), 7: np.float64(1.0), 8: np.float64(1.0), 9: np.float64(1.0), 10: np.float64(1.0)}, 'precision': {1: np.float64(1.0), 2: np.float64(1.0), 3: np.float64(1.0), 4: np.float64(1.0), 5: np.float64(1.0), 6: np.float64(1.0), 7: np.float64(1.0), 8: np.float64(1.0), 9: np.float64(1.0), 10: np.float64(1.0)}, 'mean_similarity': {1: np.float32(0.70538914), 2: np.float32(0.7010037), 3: np.float32(0.6902046), 4: np.float32(0.68472743), 5: np.float32(0.67974794), 6: np.float32(0.67596173), 7: np.float32(0.6719987), 8: np.float32(0.6681599), 9: np.float32(0.66476595), 10: np.float32(0.66163534)}}


Recall (1.0): The system retrieves all the relevant chunks for each query, with 100% of the relevant documents found in the top-k results.

Precision (1.0): Every chunk in the top-k results is relevant, meaning there are no irrelevant chunks in the retrieval.

Mean Similarity: The similarity between the query and retrieved chunks is high, though it slightly decreases as k increases.

Overall Performance: The retrieval system is highly effective, consistently retrieving relevant and similar chunks for each query.

Conclusion: The system performs excellently, with perfect recall and precision, and maintains good similarity scores across top-k retrievals.

### Step 5: Prompt Construction & Generation

Once the relevant chunks are retrieved, you'll use them to generate a response to the user's query. This is done by constructing a prompt and passing it to a language model, which generates the answer based on the provided context (retrieved chunks).

The RAGGenerator class integrates the retriever module with a generative model (like GPT-2) to generate an answer to a query by using retrieved information from the dataset.

The class combines a retriever (which finds relevant info) with a generator (which creates a response).

The retriever ensures the model uses the most relevant context for answering the query, making the response more accurate and informed.

In [27]:
class RAGGenerator:
    def __init__(self, retriever, model_name='gpt2'):
        self.retriever = retriever
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name).to('cuda' if torch.cuda.is_available() else 'cpu')

    def generate(self, query, top_k=3, max_new_tokens=200):
        # Retrieve top-k most similar chunks
        query_embedding = generate_embeddings([{'text': query}])[0]
        retrieved = self.retriever.retrieve(query_embedding, top_k=top_k)

        # Prepare the context from retrieved documents
        context = "\n\n".join([f"Document {i+1}: {chunk['text']}" for i, (chunk, _) in enumerate(retrieved)])

        # Construct the prompt
        prompt = f"Answer the following question based on the provided context:\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"

        # Tokenize and generate the answer
        inputs = self.tokenizer(prompt, return_tensors="pt", truncation=True, max_length=4096).to('cuda' if torch.cuda.is_available() else 'cpu')
        with torch.no_grad():
            outputs = self.model.generate(inputs['input_ids'], max_length=max_new_tokens, do_sample=True, temperature=0.7, top_p=0.9)

        # Decode the generated output
        answer = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        return answer[len(prompt):].strip()



In [40]:
# Initialize the generator
generator = RAGGenerator(retriever)

# Example of generating answers
query = "How is deep learning used in autonomous driving and healthcare?"
generated_answer = generator.generate(query)
print(f"Generated Answer: {generated_answer}")

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Answer: Deep learning is


### Step 6: Evaluation & Reflection

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate how well a machine-generated text (like a summary or answer) matches a reference text (like a human-written summary or correct answer).

For evaluation, we can use ROUGE scores to assess the generated responses based on the retrieved context. This can be done by comparing the generated answers against a reference answer.

In [41]:
def evaluate_generated_answer(generated_answer, reference_answer):
    rouge = Rouge()
    rouge_scores = rouge.get_scores(generated_answer, reference_answer)[0]['rouge-l']

    # Compute context overlap
    context_overlap = len(set(generated_answer.split()) & set(reference_answer.split())) / len(generated_answer.split())
    return rouge_scores, context_overlap

# Example evaluation
reference_answer = "Deep learning is applied in various fields such as healthcare, autonomous driving, and natural language processing."
rouge_scores, context_overlap = evaluate_generated_answer(generated_answer, reference_answer)

print(f"ROUGE-L F1: {rouge_scores['f']:.2f}")
print(f"Context Overlap: {context_overlap:.2%}")


ROUGE-L F1: 0.32
Context Overlap: 100.00%


The evaluation shows a good result with ROUGE-L F1 = 0.32 and 100% context overlap, indicating that the generated answer is highly aligned with the reference answer both in content and word overlap.

ROUGE-L compares the longest common subsequence (LCS) between the generated text and the reference text.

It measures how much of the important content in the reference text is also present in the generated text, while considering the order of the words.