
**Overview**
For this assignment, we developed & evaluated a Retrieval-Augmented Generation (RAG) pipeline on the UN Sustainable Development Goals (SDGs) dataset to enable accurate retrieval & context-grounded answer generation from complex policy documents.

**Dataset used:** UNDP sdgi-corpus (United Nations Sustainable Development Goals (SDGs))

**Dataset link:** https://huggingface.co/datasets/UNDP/sdgi-corpus

**Why this dataset?** We selected the United Nations (UN) SDG dataset because of its global importance & complexity: it contains long policy documents, which makes it suitable for evaluating RAG.

**Business Problem:** Large policy documents such as the UN SDGs are difficult to search & interpret efficiently, which leads to time-consuming manual analysis & a risk of incorrect insights.

**Objective:** Build and evaluate a RAG pipeline that retrieves relevant SDG content & generates accurate, context-grounded responses to user queries.

### Step 1: Importing required libraries and installing essential packages

In [None]:
import sys

sys.path.append("..")



print(sys.executable)


In [None]:
import os
import warnings

from dotenv import load_dotenv

from src.embeddings import load_embeddings
from src.rag_pipeline import rag_pipeline

# Suppress warning messages, including the tokenizers parallelism warning
warnings.filterwarnings('ignore')
os.environ["TOKENIZERS_PARALLELISM"] = "false"

load_dotenv()
GENAI_API_KEY = os.getenv("GOOGLE_API_KEY")

In [None]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from datasets import load_dataset

# -----------------------------*Data Indexing Pipeline*-----------------------------------------

### Step 2: Loading dataset and Exploratory Data Analysis
We load the UN SDG corpus dataset from Hugging Face and use it for EDA.
For EDA, we check the basic dataset overview, missing values, text length, word count etc. We have also plotted some graphs for a high-level overview.

In [None]:
dataset = load_dataset("UNDP/sdgi-corpus", split="train")
print(len(dataset))


In [None]:
# Load data into a DataFrame (reuses `dataset` loaded in the previous cell)
df = pd.DataFrame(dataset)
print(f"Dataset shape: {df.shape}")

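# Performing EDA here
print("\n-Dataset Overview--")
df.info()  # info() prints directly and returns None
print(df.head(3))  # a bare df.head(3) in the middle of a cell displays nothing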

print("\n-Missing Values--")
print(df.isnull().sum())

print("\n-Text Statistics --")
df['text_length'] = df['text'].astype(str).apply(len)
df['word_count'] = df['text'].astype(str).apply(lambda x: len(x.split()))
print(f"Avg text length: {df['text_length'].mean():.2f}")
print(f"Avg word count: {df['word_count'].mean():.2f}")

print("\n-Label Distribution --")
df['label_primary'] = df['labels'].apply(lambda x: x[0] if len(x) > 0 else None)
label_counts = df['label_primary'].value_counts()
print(label_counts)

plt.figure(figsize=(8, 4))
plt.subplot(1, 2, 1)
label_counts.plot(kind='bar', color='skyblue')
plt.title('SDG Label Distribution')
plt.xlabel('SDG Label')
plt.ylabel('Count')
plt.subplot(1, 2, 2)
plt.pie(label_counts.head(5), labels=label_counts.head(5).index, autopct='%1.1f%%')
plt.title('Top 5 SDG Labels')
plt.tight_layout()
plt.show()

max_count = label_counts.max()
min_count = label_counts.min()
balance_ratio = min_count / max_count
print(f"Balance ratio: {balance_ratio:.2f}")

plt.figure(figsize=(8, 4))
plt.subplot(1, 2, 1)
plt.hist(df['word_count'], bins=50, color='lightcoral', edgecolor='black')
plt.title('Word Count Distribution')
plt.xlabel('Word Count')
plt.ylabel('Frequency')
plt.axvline(df['word_count'].mean(), color='red', linestyle='--', label=f'Mean: {df["word_count"].mean():.0f}')
plt.legend()
plt.subplot(1, 2, 2)
plt.boxplot(df['word_count'])
plt.title('Word Count Boxplot')
plt.ylabel('Word Count')
plt.tight_layout()
plt.show()

print("\n Sample Texts--")
print("Sample 1:", df['text'].iloc[0][:100])
print("Sample 2:", df['text'].iloc[100][:200])



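# `all_text` was not defined in the original cell; as an assumption,
# it is built here by joining the raw text of every document
all_text = " ".join(df['text'].astype(str))

wordcloud = WordCloud(
    width=600,
    height=300,
    background_color='white'
).generate(all_text)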

plt.figure(figsize=(8, 5))

# Convert via a PIL image first (bypasses np.asarray(copy=...))
img = wordcloud.to_image()
plt.imshow(np.array(img))

plt.axis('off')
plt.title('Word Cloud')
plt.show()



### Step 3: Preprocessing the data
Here we perform basic cleaning operations to refine the data, such as removing extra spaces and strange characters.
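
The `clean_text` helper is imported from `src/preprocessing.py`, whose source is not shown in this notebook. Purely as an illustration, a cleaner of this kind might look like the sketch below. The name `clean_text_sketch` and the exact rules (dropping control characters, collapsing whitespace) are assumptions, not the project's actual implementation.

```python
import re


def clean_text_sketch(text):
    """Hypothetical stand-in for src.preprocessing.clean_text."""
    if not isinstance(text, str):
        # Missing or non-string values become empty so they can be filtered out
        return ""
    # Replace ASCII control characters (assumed meaning of "strange characters")
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", " ", text)
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(clean_text_sketch("  Hello   world\x00 "))
```

Returning an empty string for unusable input fits the filtering step in the cell that follows, which drops rows whose cleaned text is empty or very short.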

In [None]:
# Preprocessing
from src.preprocessing import clean_text

df['clean_text'] = df['text'].apply(clean_text)
df = df[df['clean_text'] != ""]
df = df[df['clean_text'].str.len() > 20]
print(f"Dataset after cleaning: {df.shape}")


### Step 4: Document loading
Here, we convert the DataFrame into LangChain Document objects so that the text can be processed uniformly for chunking, embedding and retrieval.



In [None]:
#loading doc
from langchain_community.document_loaders import DataFrameLoader

loader = DataFrameLoader(df, page_content_column="clean_text")
documents = loader.load()
print(f"Loaded {len(documents)} documents")

### Step 5: Chunking
In this step, the documents are split into small chunks to improve the retrieval process.<br>
We compared 3 different chunking strategies: Fixed-size, Recursive and Sentence-based chunking. This lets us evaluate the results across different experiments. <br>  Instead of using the entire corpus, we decided to use a subset of 500 docs for experimentation.
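
To make the difference between the strategies concrete, here is a minimal pure-Python sketch of fixed-size and sentence-based chunking. Both functions are simplified illustrations with hypothetical names: LangChain's `CharacterTextSplitter` splits on a separator and merges pieces rather than slicing raw character windows, and the project's `create_sentence_chunks` (in `src/chunking.py`) is not shown, so its behaviour is assumed here.

```python
import re


def fixed_size_chunks(text, chunk_size=300, chunk_overlap=50):
    """Slice character windows of chunk_size, each overlapping the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def sentence_chunks(text, sentences_per_chunk=3):
    """Group consecutive sentences; boundaries are assumed to follow . ! or ?"""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


print(fixed_size_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
print(sentence_chunks("A b. C d! E f? G h.", sentences_per_chunk=3))
```

Fixed windows can cut a sentence in half, while sentence grouping keeps sentences intact at the cost of uneven chunk lengths. Recursive splitting sits between the two by trying larger separators (paragraphs, lines, sentences) before falling back to smaller ones.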

In [None]:
# Chunking strategies
from langchain_text_splitters import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain_core.documents import Document

from src.chunking import create_sentence_chunks, get_chunkers

fixed_splitter, recursive_splitter = get_chunkers()

doc_subset = documents[:500]
print(f"Using {len(doc_subset)} documents")

# strategy 1: Fixed-Size
splitter_fixed = CharacterTextSplitter(separator=" ", chunk_size=300, chunk_overlap=50)
chunks_fixed = splitter_fixed.split_documents(doc_subset)
print(f"Fixed chunking: {len(chunks_fixed)} chunks")

# strategy 2: Recursive
splitter_recursive = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=300,
    chunk_overlap=50
)
chunks_recursive = splitter_recursive.split_documents(doc_subset)
print(f"Recursive chunking: {len(chunks_recursive)} chunks")

# strategy 3: Sentence-based

chunks_sentence = create_sentence_chunks(doc_subset, sentences_per_chunk=3)
print(f"Sentence chunking: {len(chunks_sentence)} chunks")

chunk_comparison = pd.DataFrame({
    'Strategy': ['Fixed', 'Recursive', 'Sentence'],
    'Total Chunks': [len(chunks_fixed), len(chunks_recursive), len(chunks_sentence)],
    'Avg Length': [
        np.mean([len(c.page_content) for c in chunks_fixed]),
        np.mean([len(c.page_content) for c in chunks_recursive]),
        np.mean([len(c.page_content) for c in chunks_sentence])
    ]
})
print(chunk_comparison)

plt.figure(figsize=(14, 4))
plt.subplot(1, 3, 1)
plt.hist([len(c.page_content) for c in chunks_fixed], bins=30, color='skyblue')
plt.title('Fixed Chunks')
plt.xlabel('Length')
plt.subplot(1, 3, 2)
plt.hist([len(c.page_content) for c in chunks_recursive], bins=30, color='lightgreen')
plt.title('Recursive Chunks')
plt.xlabel('Length')
plt.subplot(1, 3, 3)
plt.hist([len(c.page_content) for c in chunks_sentence], bins=30, color='lightcoral')
plt.title('Sentence Chunks')
plt.xlabel('Length')
plt.tight_layout()
plt.show()


On comparing the three, we see that Recursive chunking offers the best balance between chunk size and semantic coherence.

In [None]:
# Preparing chunks 
from langchain_community.vectorstores.utils import filter_complex_metadata

def prepare_chunks(chunks, strategy_name):
    prepared = []
    for i, chunk in enumerate(chunks):
        metadata = chunk.metadata.copy() if hasattr(chunk, 'metadata') else {}
        metadata['chunk_id'] = f"{strategy_name}_{i}"
        metadata['strategy'] = strategy_name
        prepared.append(Document(page_content=chunk.page_content, metadata=metadata))
    return filter_complex_metadata(prepared)

docs_fixed = prepare_chunks(chunks_fixed, "fixed")
docs_recursive = prepare_chunks(chunks_recursive, "recursive")
docs_sentence = prepare_chunks(chunks_sentence, "sentence")
print(f"Prepared: Fixed={len(docs_fixed)}, Recursive={len(docs_recursive)}, Sentence={len(docs_sentence)}")


### Step 6: Vector embeddings
Here, to convert the chunks into vector representations, we used three embedding models (MiniLM, USE and MPNet) so that we could evaluate & analyze their impact on RAG performance.

In [None]:
# Embedding models


from src.embeddings import load_embeddings

embeddings = load_embeddings()

embed_minilm = embeddings["minilm"]
print("MiniLM loaded")

embed_mpnet = embeddings["mpnet"]
print("MPNet loaded")

# USE is optional
embed_use = embeddings.get("use")
if embed_use:
    print("USE loaded")
else:
    print("USE not available (skipping)")


### Step 7: Vector store creation
Here, we created a separate vector DB for each chunking & embedding combination so that our evaluation & experimentation are fair. (The code below is very time-consuming and computationally demanding.)
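
The nine stores form a 3 x 3 grid of chunking strategies and embedding models, which can also be built in a loop. The sketch below is one possible way to do that; `build_store_grid` and its `build_store` callback are hypothetical names, and the demo uses a stand-in builder so that it runs without Chroma or real models. It also skips any embedder that is `None`, which would cover the case where the optional USE model fails to load.

```python
from itertools import product


def build_store_grid(chunk_sets, embedders, build_store):
    """Build one store per (chunking, embedding) pair.

    chunk_sets:  mapping such as {"fixed": docs_fixed, ...}
    embedders:   mapping such as {"minilm": embed_minilm, ...}; None is skipped
    build_store: callable(docs, embedder, collection_name) -> store
    """
    stores = {}
    for (chunk_name, docs), (emb_name, embedder) in product(
        chunk_sets.items(), embedders.items()
    ):
        if embedder is None:
            # e.g. an optional model that could not be loaded
            continue
        name = f"{chunk_name}_{emb_name}"
        stores[name] = build_store(docs, embedder, name)
    return stores


# Stand-in builder: returns a tuple instead of a real vector store
demo = build_store_grid(
    {"fixed": ["a", "b"], "recursive": ["c"]},
    {"minilm": "E1", "use": None},
    lambda docs, embedder, name: (name, len(docs)),
)
print(demo)
```

In this notebook the callback could wrap the same call used in the cell below, i.e. `Chroma.from_documents(docs, embedder, collection_name=name)`.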

In [None]:
#Vector databases
from langchain_community.vectorstores import Chroma

db_fixed_minilm = Chroma.from_documents(docs_fixed, embed_minilm, collection_name="fixed_minilm")
db_fixed_use = Chroma.from_documents(docs_fixed, embed_use, collection_name="fixed_use")
db_fixed_mpnet = Chroma.from_documents(docs_fixed, embed_mpnet, collection_name="fixed_mpnet")
print("Fixed DBs created")

db_recursive_minilm = Chroma.from_documents(docs_recursive, embed_minilm, collection_name="recursive_minilm")
db_recursive_use = Chroma.from_documents(docs_recursive, embed_use, collection_name="recursive_use")
db_recursive_mpnet = Chroma.from_documents(docs_recursive, embed_mpnet, collection_name="recursive_mpnet")
print("Recursive DBs created")

db_sentence_minilm = Chroma.from_documents(docs_sentence, embed_minilm, collection_name="sentence_minilm")
db_sentence_use = Chroma.from_documents(docs_sentence, embed_use, collection_name="sentence_use")
db_sentence_mpnet = Chroma.from_documents(docs_sentence, embed_mpnet, collection_name="sentence_mpnet")
print("Sentence DBs created")

# -----------------------------*Data Retrieval & Generation pipeline*-----------------------------------------

### Step 8: Retriever configuration
Here, we created a retriever for each vector database described above to fetch the top 5 most relevant chunks for any query we provide.
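
Conceptually, a top-k retriever embeds the query, scores every stored chunk vector against it and returns the k best matches. The sketch below shows this with cosine similarity on toy 2-D vectors. It is an illustration under assumptions: the helper names are hypothetical, and the distance metric the vector store actually uses may differ from cosine similarity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, chunk_vecs, k=5):
    """Return the ids of the k chunks most similar to the query."""
    scored = [(cosine(query_vec, vec), cid) for cid, vec in chunk_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [cid for _, cid in scored[:k]]


toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k([1.0, 0.1], toy_vectors, k=2))
```

A real vector store avoids this brute-force scan by using an approximate nearest-neighbour index, but the ranking idea is the same.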

In [None]:
#Retriever
retrievers = {
    'fixed_minilm': db_fixed_minilm.as_retriever(search_kwargs={"k": 5}),
    'fixed_use': db_fixed_use.as_retriever(search_kwargs={"k": 5}),
    'fixed_mpnet': db_fixed_mpnet.as_retriever(search_kwargs={"k": 5}),
    'recursive_minilm': db_recursive_minilm.as_retriever(search_kwargs={"k": 5}),
    'recursive_use': db_recursive_use.as_retriever(search_kwargs={"k": 5}),
    'recursive_mpnet': db_recursive_mpnet.as_retriever(search_kwargs={"k": 5}),
    'sentence_minilm': db_sentence_minilm.as_retriever(search_kwargs={"k": 5}),
    'sentence_use': db_sentence_use.as_retriever(search_kwargs={"k": 5}),
    'sentence_mpnet': db_sentence_mpnet.as_retriever(search_kwargs={"k": 5}),
}
print(f"Created {len(retrievers)} retrievers")

### Step 9: Retriever evaluation
As discussed in the lecture, here we use metrics such as Precision@k, Recall@k, F1@k and MRR (Mean Reciprocal Rank) to evaluate retrieval quality.
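
The metric functions themselves live in `src/retriever_eval.py` and are not shown in this notebook. The sketch below is one plausible implementation for the setting used here, where each query has exactly one relevant chunk id. The function names are hypothetical and the definitions are assumptions, not the project's code.

```python
def p_at_k(retrieved_ids, true_id, k):
    """Fraction of the top-k results that are the relevant chunk."""
    return sum(1 for cid in retrieved_ids[:k] if cid == true_id) / k


def r_at_k(retrieved_ids, true_id, k):
    """1.0 if the single relevant chunk appears in the top k, else 0.0."""
    return 1.0 if true_id in retrieved_ids[:k] else 0.0


def f1_from(p, r):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


def reciprocal_rank(retrieved_ids, true_id):
    """1 / rank of the relevant chunk, or 0.0 if it was not retrieved."""
    for rank, cid in enumerate(retrieved_ids, start=1):
        if cid == true_id:
            return 1.0 / rank
    return 0.0


retrieved = ["c3", "c1", "c7", "c9", "c2"]
p = p_at_k(retrieved, "c1", k=5)
r = r_at_k(retrieved, "c1", k=5)
print(p, r, f1_from(p, r), reciprocal_rank(retrieved, "c1"))
```

Under these definitions, a single relevant chunk caps Precision@5 at 1/5 = 0.2 and F1@5 at about 0.33, so low precision values are expected even when recall is high.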

In [None]:
#Evaluation metrics--

from src.retriever_eval import (
    precision_at_k,
    recall_at_k,
    f1_score_at_k,
    mrr
)


def evaluate_retriever(retriever, test_docs, k=5):
    precisions, recalls, f1_scores, mrrs = [], [], [], []
    for doc in test_docs:
        query = doc.page_content[:80]
        true_id = doc.metadata['chunk_id']
        try:
            retrieved = retriever.invoke(query)
            retrieved_ids = [d.metadata['chunk_id'] for d in retrieved]
            
            p = precision_at_k(retrieved_ids, true_id, k)
            r = recall_at_k(retrieved_ids, true_id, k)
            f1 = f1_score_at_k(p, r)
            
            precisions.append(p)
            recalls.append(r)
            f1_scores.append(f1)
            mrrs.append(mrr(retrieved_ids, true_id))
        except Exception:
            # Skip queries whose retrieval fails instead of aborting the run
            continue
    return {
        'precision': np.mean(precisions),
        'recall': np.mean(recalls),
        'f1_score': np.mean(f1_scores),
        'mrr': np.mean(mrrs),
        'num_queries': len(precisions)
    }

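We evaluate every configuration on a fixed set of queries to measure retrieval accuracy: the first 50 chunks of each strategy serve as test documents, and the first 80 characters of each chunk are used as the query.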

In [None]:
# Retrieval evaluation
test_docs_fixed = docs_fixed[:50]
test_docs_recursive = docs_recursive[:50]
test_docs_sentence = docs_sentence[:50]

results = []
print("Evaluating retrievers..")
results.append({'Chunking': 'Fixed', 'Embedding': 'MiniLM', **evaluate_retriever(retrievers['fixed_minilm'], test_docs_fixed)})
results.append({'Chunking': 'Fixed', 'Embedding': 'USE', **evaluate_retriever(retrievers['fixed_use'], test_docs_fixed)})
results.append({'Chunking': 'Fixed', 'Embedding': 'MPNet', **evaluate_retriever(retrievers['fixed_mpnet'], test_docs_fixed)})
results.append({'Chunking': 'Recursive', 'Embedding': 'MiniLM', **evaluate_retriever(retrievers['recursive_minilm'], test_docs_recursive)})
results.append({'Chunking': 'Recursive', 'Embedding': 'USE', **evaluate_retriever(retrievers['recursive_use'], test_docs_recursive)})
results.append({'Chunking': 'Recursive', 'Embedding': 'MPNet', **evaluate_retriever(retrievers['recursive_mpnet'], test_docs_recursive)})
results.append({'Chunking': 'Sentence', 'Embedding': 'MiniLM', **evaluate_retriever(retrievers['sentence_minilm'], test_docs_sentence)})
results.append({'Chunking': 'Sentence', 'Embedding': 'USE', **evaluate_retriever(retrievers['sentence_use'], test_docs_sentence)})
results.append({'Chunking': 'Sentence', 'Embedding': 'MPNet', **evaluate_retriever(retrievers['sentence_mpnet'], test_docs_sentence)})

results_df = pd.DataFrame(results)
results_df.columns = ['Chunking', 'Embedding', 'Precision@5', 'Recall@5', 'F1@5','MRR', 'Queries']
print("\n-Retrieval results --")
print(results_df)

# Visualization of Precision@5, Recall@5, F1@5 and MRR
fig, axes = plt.subplots(2, 2, figsize=(12, 6))

results_pivot = results_df.pivot(index='Chunking', columns='Embedding', values='Precision@5')
results_pivot.plot(kind='bar', ax=axes[0,0], color=['skyblue', 'lightgreen', 'lightcoral'])
axes[0,0].set_title('Precision@5')
axes[0,0].set_ylabel('Precision@5')
axes[0,0].set_ylim([0, 1])

results_pivot = results_df.pivot(index='Chunking', columns='Embedding', values='Recall@5')
results_pivot.plot(kind='bar', ax=axes[0,1], color=['skyblue', 'lightgreen', 'lightcoral'])
axes[0,1].set_title('Recall@5')
axes[0,1].set_ylabel('Recall@5')
axes[0,1].set_ylim([0, 1])

results_pivot = results_df.pivot(index='Chunking', columns='Embedding', values='F1@5')
results_pivot.plot(kind='bar', ax=axes[1,0], color=['skyblue', 'lightgreen', 'lightcoral'])
axes[1,0].set_title('F1 Score@5')
axes[1,0].set_ylabel('F1@5')
axes[1,0].set_ylim([0, 1])

results_pivot = results_df.pivot(index='Chunking', columns='Embedding', values='MRR')
results_pivot.plot(kind='bar', ax=axes[1,1], color=['skyblue', 'lightgreen', 'lightcoral'])
axes[1,1].set_title('MRR')
axes[1,1].set_ylabel('MRR')
axes[1,1].set_ylim([0, 1])

plt.tight_layout()
plt.show()



We noticed that Recursive chunking consistently achieves higher recall & MRR, which indicates more reliable retrieval compared to fixed & sentence-based chunking.

### Step 10: Setting LLM and Answer generation
For the LLM, we use the Gemini-2.5-flash model with a Google API key.<br>
To avoid exposing the API key, we load it from an environment variable (via a `.env` file) instead of hard-coding it.<br>
We also evaluate the responses with metrics such as Faithfulness and Answer Relevance.
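
The `rag_pipeline` function is imported from `src/rag_pipeline.py` and its source is not shown here. As a rough sketch of the retrieve-then-generate flow it presumably follows, the example below uses plain callables in place of the retriever and the Gemini model so that it runs without an API key. All names and the prompt wording are assumptions, and the faithfulness and relevance scoring steps are omitted.

```python
def build_prompt(query, contexts):
    """Number the retrieved chunks and ask for a context-grounded answer."""
    context_block = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(contexts, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"CONTEXT:\n{context_block}\n\nQUESTION: {query}\nANSWER:"
    )


def rag_pipeline_sketch(query, retrieve, generate, k=3):
    """retrieve: callable(query) -> list of texts; generate: callable(prompt) -> str."""
    contexts = retrieve(query)[:k]
    prompt = build_prompt(query, contexts)
    return {"query": query, "contexts": contexts, "answer": generate(prompt)}


# Stand-ins for a real retriever and LLM
fake_retrieve = lambda q: ["Chunk one.", "Chunk two.", "Chunk three.", "Chunk four."]
fake_generate = lambda prompt: f"(answer based on {prompt.count('[')} chunks)"

demo_result = rag_pipeline_sketch("What is SDG 1?", fake_retrieve, fake_generate)
print(demo_result["answer"])
```

Passing the retriever and generator as callables keeps the flow testable offline. With the real objects, the equivalents would be `retriever.invoke(query)` and `model.generate_content(prompt).text`, both of which appear elsewhere in this notebook.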


In [None]:
#setting LLM here--

import google.generativeai as genai
from dotenv import load_dotenv
import os

load_dotenv()

GENAI_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GENAI_API_KEY:
    raise RuntimeError("GOOGLE_API_KEY not found in environment variables")

genai.configure(api_key=GENAI_API_KEY)
model = genai.GenerativeModel("models/gemini-2.5-flash")

print("Gemini configured via environment variables")


#generation functions__


def answer_relevance_llm(answer, query, model):

    prompt = f"""Rate how well the ANSWER addresses the QUERY on a scale of 0.0 to 1.0.

QUERY: {query}

ANSWER: {answer}

Does the answer directly address the query?
- 1.0 = Perfectly relevant, directly answers query
- 0.5 = Partially relevant, somewhat addresses query
- 0.0 = Not relevant, doesn't answer query

Respond with ONLY a number between 0.0 and 1.0:"""
    
    try:
        response = model.generate_content(prompt)
        score_text = response.text.strip()
        score = float(re.findall(r'0?\.\d+|1\.0|0|1', score_text)[0])
        return max(0.0, min(1.0, score))
    except Exception:
        # Fall back to a neutral score if the API call or parsing fails
        return 0.5



# Usage (once `query` and a retriever are defined, as in Step 11):
# result = rag_pipeline(query, retriever, model)


### Step 11: Testing RAG Pipeline through Question & Answers

In [None]:
#testing RAG pipeline
from src.rag_pipeline import rag_pipeline

test_queries = [
    "What are the key goals for ending poverty?",
    "How can sustainable cities be achieved?",
    "What actions are needed for climate change?"
]

print("\n-- RAG pipeline test--")
for query in test_queries:
    result = rag_pipeline(query, retrievers['recursive_mpnet'], model, k=3)
    print(f"\nQ: {result['query']}")
    print(f"A: {result['answer']}")
    print(f"Faithfulness: {result['faithfulness']:.2f}")
    print(f"Answer Relevance: {result['answer_relevance']:.2f}")

### Step 12: Answer generation evaluation
Here, the generated answers are evaluated for faithfulness & relevance to assess grounding quality.
We compared 3 configurations:
* Recursive+MPNet
* Recursive+USE
* Sentence+MPNet

All configurations used the same dataset, retrieval parameters & generation model to ensure a fair comparison.
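
The relevance scorer in Step 10 asks the model for a number and parses it from free text. The sketch below shows one alternative way to handle that parsing step; `parse_score` is a hypothetical helper, not part of the project. It returns `None` when no number is found, so that failed calls can be excluded from the averages instead of being counted as a neutral 0.5.

```python
import re


def parse_score(text, default=None):
    """Extract the first number from an LLM reply and clamp it to [0.0, 1.0]."""
    match = re.search(r"\d*\.\d+|\d+", text)
    if match is None:
        return default
    return max(0.0, min(1.0, float(match.group())))


replies = ["0.85", "Score: 1.0", "n/a", "7"]
scores = [parse_score(reply) for reply in replies]
valid = [s for s in scores if s is not None]
print(scores, sum(valid) / len(valid))
```

With only five evaluation queries per configuration, a single fallback value can shift the mean noticeably, so it may be worth reporting how many scores were parsed successfully alongside the averages.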

In [None]:
eval_queries = [
    "What are the main sustainable development goals?",
    "How is poverty reduction measured?",
    "What role do cities play in sustainability?",
    "What are climate action priorities?",
    "How can education improve development?"
]

configs = [
    ('recursive_mpnet', 'Recursive+MPNet'), 
    ('recursive_use', 'Recursive+USE'), 
    ('sentence_mpnet', 'Sentence+MPNet')
]

generation_results = []

print("\n--Evaluating Generation Quality--")
for config_key, config_name in configs:
    print(f"Evaluating {config_name}...")
    faithfulness_scores = []
    relevance_scores = []
    
    for query in eval_queries:
        result = rag_pipeline(query, retrievers[config_key], model, k=3)
        faithfulness_scores.append(result['faithfulness'])
        relevance_scores.append(result['answer_relevance'])
    
    generation_results.append({
        'Configuration': config_name,
        'Avg Faithfulness': np.mean(faithfulness_scores),
        'Std Faithfulness': np.std(faithfulness_scores),
        'Avg Relevance': np.mean(relevance_scores),
        'Std Relevance': np.std(relevance_scores)
    })

generation_df = pd.DataFrame(generation_results)
print("\n--GENERATION EVALUATION RESULTS--")
print(generation_df)

# Visualize generation metrics
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

x = np.arange(len(generation_df))
width = 0.35

axes[0].bar(x, generation_df['Avg Faithfulness'], width, label='Faithfulness', color='skyblue')
axes[0].set_xlabel('Configuration')
axes[0].set_ylabel('Score')
axes[0].set_title('Faithfulness Scores')
axes[0].set_xticks(x)
axes[0].set_xticklabels(generation_df['Configuration'], rotation=45, ha='right')
axes[0].set_ylim([0, 1])

axes[1].bar(x, generation_df['Avg Relevance'], width, label='Relevance', color='lightgreen')
axes[1].set_xlabel('Configuration')
axes[1].set_ylabel('Score')
axes[1].set_title('Answer Relevance Scores')
axes[1].set_xticks(x)
axes[1].set_xticklabels(generation_df['Configuration'], rotation=45, ha='right')
axes[1].set_ylim([0, 1])

plt.tight_layout()
plt.show()

In [None]:
print("\n--FINAL COMPARISON--")
print("\n--- Top 3 Retrieval Configurations (by F1 Score) ---")
top_retrieval = results_df.nlargest(3, 'F1@5')[['Chunking', 'Embedding', 'Precision@5', 'Recall@5', 'F1@5', 'MRR']]
print(top_retrieval)

print("\n-- Generation Quality Ranking --")
generation_df_sorted = generation_df.sort_values('Avg Faithfulness', ascending=False)
print(generation_df_sorted)

### Results & Conclusion  

**1. Evaluation observations**
* Recursive chunking combined with MPNet embeddings achieved the highest faithfulness (0.40) and relevance (0.38) of the three configurations compared, indicating improved grounding of generated responses.
* This suggests that stronger semantic embeddings and structurally aware chunking enhance generation quality in RAG systems, although the large standard deviations (about ±0.32 to ±0.34 over five queries) mean the differences should be read as indicative.
* Lower scores observed for USE-based configurations highlight the impact of embedding model choice on retrieval and downstream generation performance.

**2. Best RAG Configuration Summary**

| Component | Best Choice | Key Metrics |
|----------|------------|-------------|
| Chunking Strategy | Recursive | F1@5 = 0.304, MRR = 0.816 |
| Embedding Model (Retrieval) | MiniLM | Precision@5 = 0.188, Recall@5 = 0.940, F1@5 = 0.313 |
| Embedding Model (Generation) | MPNet | Faithfulness = 0.404 ± 0.320, Relevance = 0.377 ± 0.337 |
| Best Retrieval Configuration | Recursive + MiniLM | F1@5 = 0.313, MRR = 0.826 |
| Best Generation Configuration | Recursive + MPNet | Faithfulness = 0.404, Relevance = 0.377 |
| **Overall Best RAG System** | **Recursive + MPNet** | **Combined Score = 0.484** |

### Limitations
* The project was conducted with limited computational resources, including restricted GPU availability which required using a subset of the dataset for experimentation.

* The choice of LLMs was constrained due to limited access to free or low-cost LLM APIs, affecting the scale of generation experiments.

* The UN SDG dataset exhibited class imbalance across different SDG labels, which may have influenced retrieval performance.

### Business Implications

The results show that RAG systems can improve information access in large policy documents by providing context-grounded answers. This reduces manual effort, speeds up analysis & lowers the risk of misinterpretation in decision-making processes.

### Recommendations
* Use recursive chunking with semantically strong embeddings for policy document retrieval.

* Invest in scalable compute and reliable LLM APIs for production use.