# RAG Pipeline Exercise

In this exercise you will build and **compare two simple Retrieval-Augmented Generation (RAG) pipelines**.

You will work with a small collection of PDF documents (e.g. medical guidelines) and:

1. Load and chunk the PDF documents.
2. Create a vector index using **embedding model A** (local `BAAI/bge-small-en`).
3. Create a second index using **embedding model B** (e.g. OpenAI or Gemini embeddings).
4. Implement a simple **retriever** and an **answering function** that calls an LLM with retrieved context.
5. Automatically **generate questions** from the documents and use them to **compare two RAG configurations**.

Cells marked with `# TODO` are **for students to implement**.
Everything else is provided scaffolding.

## 0. Setup & Imports

In [2]:
# TODO (easy): skim the imports and make sure you understand what each library is used for.

from dotenv import load_dotenv
import os
import glob
from PyPDF2 import PdfReader
from langchain_text_splitters import RecursiveCharacterTextSplitter
import faiss
from sentence_transformers import SentenceTransformer
import pickle
import random
import numpy as np

# LLM / API clients (we will mainly use OpenAI here; Gemini can be added as a bonus)
from openai import OpenAI


In [3]:
# Load API keys from .env (you need to create this file once and add your keys)
load_dotenv()

deepinfra_key = os.getenv("DEEPINFRA_API_KEY")
openai_api_key = os.getenv("OPENAI_API_KEY")
google_api_key = os.getenv("GOOGLE_API_KEY")
anthropic_api_key = os.getenv("ANTHROPIC_API_KEY")

## 1. Load PDF documents

We assume there is a `data/` folder containing one or more PDF files.

**Task:** implement `load_pdfs(glob_path)` so that it:
- Iterates over all PDF files matching `glob_path`
- Reads them with `PdfReader`
- Concatenates the text of all pages into **one long string**.

In [4]:
def load_pdfs(glob_path: str = "data/*.pdf") -> str:
    """Load all PDFs matching the pattern and return their combined text.

    TODO:
    - Use `glob.glob(glob_path)` to iterate over file paths
    - For each file, open it in binary mode and create a `PdfReader`
    - Loop over `reader.pages` and extract the text of each page with `page.extract_text()`
    - Concatenate everything into a single string `text`
    - Be robust: skip pages where `extract_text()` returns `None` or an empty string
    """
    # YOUR CODE HERE
    text = ""
    for pdf_path in glob.glob(glob_path):
        with open(pdf_path, "rb") as f:
            reader = PdfReader(f)
            for page in reader.pages:
                page_text = page.extract_text()
                if page_text:
                    text += page_text + "\n"
    return text

In [5]:
# Run once and inspect
raw_text = load_pdfs("data/*.pdf")
print("Number of characters:", len(raw_text))
print("Preview:", raw_text[:500])

Number of characters: 230708
Preview: Asthma: diagnosis, 
moni toring and chr onic 
asthma manag emen t (BTS, 
NICE, SI GN) 
NICE guideline 
Published: 27 No vember 202 4 
www .nice.or g.uk/guidance/ng2 45 
© NICE 202 4. All right s reserved. Subject t o Notice of right s (https://www .nice.or g.uk/t erms-and-
conditions#notice-of -right s).
Your r esponsi bility 
The r ecommendations in t his guideline r epresent t he view of NICE, arriv ed at aft er car eful 
consideration of t he evidence a vailable. When e xercising t heir judge


## 2. Chunk the text

We will split the long text into overlapping chunks.

Later you can **experiment** with different `chunk_size` and `chunk_overlap` to see how it affects retrieval.

**Task:** start with the given parameters, run once, then try at least one alternative configuration and note the effects.
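
To build intuition for how `chunk_size` and `chunk_overlap` interact, the sketch below uses a plain sliding window over characters. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (the real splitter prefers to break on separators such as paragraphs and sentences); the function name `sliding_window_chunks` and the dummy text are assumptions made for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


demo_text = "x" * 10_000  # stand-in for raw_text
for size, overlap in [(2000, 200), (1000, 100)]:
    demo_chunks = sliding_window_chunks(demo_text, size, overlap)
    print(f"chunk_size={size}, overlap={overlap}: {len(demo_chunks)} chunks")
```

Each window advances by `chunk_size - chunk_overlap` characters, so halving both parameters roughly doubles the number of chunks. Smaller chunks give more focused retrieval hits but carry less surrounding context into the prompt.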

In [6]:
# Base configuration (RAG A)
chunk_size_a = 2000
chunk_overlap_a = 200

splitter_a = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size_a,
    chunk_overlap=chunk_overlap_a
)

chunks_a = splitter_a.split_text(raw_text)
print(f"RAG A: {len(chunks_a)} chunks produced, first chunk length = {len(chunks_a[0])}")

# TODO (mini-experiment): change chunk_size / chunk_overlap for RAG B and compare
chunk_size_b = 1000
chunk_overlap_b = 100

splitter_b = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size_b,
    chunk_overlap=chunk_overlap_b
)

chunks_b = splitter_b.split_text(raw_text)
print(f"RAG B: {len(chunks_b)} chunks produced, first chunk length = {len(chunks_b[0])}")

RAG A: 130 chunks produced, first chunk length = 1995
RAG B: 260 chunks produced, first chunk length = 979


## 3. Create embeddings and a FAISS index

We start with **Embedding model A: `BAAI/bge-small-en`** using `sentence-transformers`. You can find more models on the MTEB leaderboard (https://huggingface.co/spaces/mteb/leaderboard).
Make sure that any model you pick is not larger than the one used here; otherwise the embedding step will take too long.

Then, as an optional extension, you can build **Embedding model B** using OpenAI or Gemini and compare.

To keep the exercise manageable, the base version only **requires** BGE.

In [7]:
# Embedding model A (local)
model_name_a = "BAAI/bge-small-en"
embedder_a = SentenceTransformer(model_name_a)

# Compute embeddings for all chunks of configuration A
embeddings_a = embedder_a.encode(chunks_a, convert_to_numpy=True)

dimensions_a = embeddings_a.shape[1]
print("Embedding dimensionality (A):", dimensions_a)

index_a = faiss.IndexFlatL2(dimensions_a)
index_a.add(embeddings_a)
print("FAISS index (A) size:", index_a.ntotal)

# Persist index/chunks if you like (optional)
os.makedirs("faiss", exist_ok=True)
faiss.write_index(index_a, "faiss/faiss_index_a.index")
with open("faiss/chunks_a.pkl", "wb") as f:
    pickle.dump(chunks_a, f)

Embedding dimensionality (A): 384
FAISS index (A) size: 130
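
A flat L2 index performs an exhaustive search: it compares the query with every stored vector and returns the closest ones. The sketch below reproduces that idea with plain NumPy on toy 2-D vectors; the function name `brute_force_l2_search` and the toy data are assumptions for illustration, and the sketch returns squared distances, which is also what `IndexFlatL2` reports.

```python
import numpy as np


def brute_force_l2_search(vectors: np.ndarray, query: np.ndarray, k: int):
    """Return (squared distances, indices) of the k vectors closest to query."""
    diffs = vectors - query                 # broadcast the query over all rows
    sq_dists = np.sum(diffs ** 2, axis=1)   # squared Euclidean distance per row
    order = np.argsort(sq_dists)[:k]        # indices of the k smallest distances
    return sq_dists[order], order


toy_vectors = np.array(
    [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]], dtype="float32"
)
toy_query = np.array([0.9, 0.1], dtype="float32")
dists, ids = brute_force_l2_search(toy_vectors, toy_query, k=2)
print("nearest indices:", ids)
print("squared distances:", dists)
```

For a few hundred chunks this exhaustive approach is perfectly adequate; approximate index types only become interesting for much larger collections.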


In [8]:
# Embedding model B using OpenAI embeddings.

# TODO :
# - Use `openai_client.embeddings.create(...)` to compute embeddings for `chunks_b`
# - Create a second FAISS index `index_b`
# - Make sure to check the dimensionality from the first embedding vector


openai_client = OpenAI(api_key=openai_api_key)
response = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input=chunks_b
)
# FAISS works with float32 vectors, so convert explicitly
embeddings_b = np.array([item.embedding for item in response.data], dtype="float32")
dim_b = embeddings_b.shape[1]
index_b = faiss.IndexFlatL2(dim_b)
index_b.add(embeddings_b)
print("FAISS index (B) size:", index_b.ntotal)

FAISS index (B) size: 260
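
Sending all chunks in a single request works for a small corpus, but hosted embedding endpoints generally cap how much one request may contain. A minimal batching sketch is shown below; the helper name `iter_batches` and the batch size of 100 are arbitrary assumptions, so check the current limits of your provider before relying on a value.

```python
from typing import Iterator


def iter_batches(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical usage with the notebook's client (not executed in this sketch):
# vectors = []
# for batch in iter_batches(chunks_b, batch_size=100):
#     resp = openai_client.embeddings.create(model="text-embedding-3-small", input=batch)
#     vectors.extend(item.embedding for item in resp.data)

demo_items = [f"chunk {i}" for i in range(260)]
demo_batches = list(iter_batches(demo_items, batch_size=100))
print([len(batch) for batch in demo_batches])
```

Because the batches are processed in order and each response preserves input order, the resulting vectors stay aligned with the positions in `chunks_b`.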


## 4. Implement a simple retriever

We now implement a generic retrieval function that:
1. Embeds the query.
2. Searches the FAISS index.
3. Returns the corresponding text chunks.

We implement it for configuration A. If you built configuration B, you can reuse the same function.

In [9]:
def retrieve_texts(query: str, k: int, index, chunks, embedder) -> list:
    """Return the top-k most similar chunks for a query.
    - Encode the query with `embedder.encode(...)`
    - Call `index.search(query_embedding, k)`
    - Use the returned indices to select the chunks
    - Return a list of strings (chunks)
    """
    # YOUR CODE HERE
    query_emb = embedder.encode([query], convert_to_numpy=True)
    distances, indices = index.search(query_emb, k)
    # FAISS pads the result with -1 when the index holds fewer than k vectors; skip those
    retrieved = [chunks[i] for i in indices[0] if i != -1]
    return retrieved

# Quick sanity check
test_query = "What is the most important factor in diagnosing asthma?"
retrieved_test = retrieve_texts(
    query=test_query,
    k=5,
    index=index_a,
    chunks=chunks_a,
    embedder=embedder_a
)
print("Number of retrieved chunks:", len(retrieved_test))
print("Preview of first chunk:", retrieved_test[0][:200])

Number of retrieved chunks: 5
Preview of first chunk: and signs of ot her causes of r espirat ory sympt oms but be awar e that e ven if 
examination r esult s are normal, t he person ma y still ha ve ast hma. [NICE 2017] 
Initial tr eatmen t and obje cti


## 5. Implement `answer_query` using an LLM

Now we build the actual RAG call:

1. Use `retrieve_texts` to get top-`k` chunks.
2. Concatenate them into a context string.
3. Build a prompt that:
   - shows the context
   - asks the model to answer the user question based **only** on this context.
4. Call the OpenAI chat completion API.

This is the **core RAG function**.
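
Step 2 above (concatenating chunks into a context string) leaves some room for design choices. The sketch below labels each chunk and stops adding chunks once an approximate character budget is reached; the function name `format_context`, the `[Source n]` labels, and the default budget of 6000 characters are all assumptions for illustration, not part of the exercise scaffolding.

```python
def format_context(chunks: list[str], max_chars: int = 6000) -> str:
    """Join chunks into one labelled context string within an approximate size budget."""
    parts = []
    used = 0
    for number, chunk in enumerate(chunks, start=1):
        block = f"[Source {number}]\n{chunk.strip()}"
        # Always keep the top-ranked chunk; drop lower-ranked ones once over budget
        if parts and used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)


demo_context = format_context(["alpha", "beta", "gamma"])
print(demo_context)
```

Labelling the chunks makes it easier to see which retrieved passage an answer relies on, and the budget keeps the prompt size predictable when `k` or `chunk_size` grows.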

In [17]:
def answer_query(query: str, k: int, index, chunks, embedder, client: OpenAI) -> str:
    """RAG-style answer: retrieve context and ask an LLM.

    TODO (students):
    - Use `retrieve_texts` to get `k` relevant chunks.
    - Join them into a single context string.
    - Build a chat prompt that instructs the model to answer *only* using the context.
    - Call `client.chat.completions.create(...)` with model `"gpt-4o-mini"` (or similar).
    - Return the model's answer text.
    """
    retrieved_chunks = retrieve_texts(query, k, index, chunks, embedder)
    context = "\n\n".join(retrieved_chunks)

    # Remember: strings can be concatenated (like an addition)
    system_prompt = (
        "You are a helpful assistant. Answer the user's question using only "
        "the provided context. If the context does not contain the answer, "
        "say that you do not know."
    )
    # The retrieved context and the question must both be sent to the model
    user_prompt = "Context:\n" + context + "\n\nQuestion: " + query

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages
    )

    return completion.choices[0].message.content.strip()

# Quick manual test
answer = answer_query(test_query, k=3, index=index_a, chunks=chunks_a, embedder=embedder_a, client=openai_client)
print("RAG answer:", answer)


## 6. Generate questions from random chunks (automatic evaluation set)

To compare two RAG configurations, we need **questions**.

We will:
- randomly sample a few chunks from the corpus,
- ask an LLM to generate a **good question** whose answer is contained in the chunk.

Then we can use these question–chunk pairs as a small evaluation set.

We provide most of the implementation. Your job is mainly to:
- inspect the code,
- understand the prompt,
- maybe tweak the number of chunks or retries.
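
Random sampling means that every run produces a different evaluation set, which makes comparisons between runs harder. One possible tweak, sketched below, is to sample with a seeded generator; the helper name `sample_chunks` and the seed value are assumptions for illustration.

```python
import random


def sample_chunks(chunks: list[str], num_chunks: int, seed: int = 42) -> list[str]:
    """Pick up to num_chunks distinct chunks, reproducibly for a given seed."""
    rng = random.Random(seed)             # private generator; global random state is untouched
    count = min(num_chunks, len(chunks))  # sampling more items than exist would raise ValueError
    return rng.sample(chunks, count)


demo_corpus = [f"chunk {i}" for i in range(20)]
first_draw = sample_chunks(demo_corpus, num_chunks=5)
second_draw = sample_chunks(demo_corpus, num_chunks=5)
print(first_draw == second_draw)  # same seed, same selection
```

With a fixed seed, both RAG configurations can be evaluated on the same questions across notebook restarts, as long as the chunk list itself does not change.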

In [20]:
def generate_questions_for_random_chunks(chunks, num_chunks: int = 5, max_retries: int = 2):
    selected_chunks = random.sample(chunks, num_chunks)
    qa_pairs = []

    for chunk in selected_chunks:
        prompt = (
            "Based on the following text, generate an insightful question that covers its key content:\n\n"
            "Text:\n" + chunk + "\n\n"
            "Question:"
        )

        question = None
        for attempt in range(max_retries):
            try:
                completion = openai_client.chat.completions.create(
                    model="gpt-4o-mini",
                    messages=[{"role": "user", "content": prompt}]
                )
                question = completion.choices[0].message.content.strip()
                if question:
                    break
            except Exception as e:
                print("Error while generating question, retrying...", e)

        if not question:
            question = "Error: could not generate question."

        qa_pairs.append((chunk, question))

    return qa_pairs

questions = generate_questions_for_random_chunks(chunks_a, num_chunks=8, max_retries=2)
for i, (chunk, q) in enumerate(questions, 1):
    print(f"Q{i}: {q}\n  From chunk preview: {chunk[:120]}...\n")

Q1: What implications do the recommendations for asthma management in children aged 5 to 11, particularly regarding the use of regular low-dose ICS, have for current treatment practices and future research directions?
  From chunk preview: applies her e. The r ecommendations ar e not based on a specific e vidence sear ch, but t he 
committ ee not ed that peo...

Q2: What evidence led the committee to recommend as-needed combination inhalers over standard SABA treatments for the management of newly diagnosed asthma in individuals aged 12 and over?
  From chunk preview: How the r ecommenda tion mig ht affect practice 
Digital inhalers ar e not r ecommended f or routine use in t he NHS, an...

Q3: What were the main considerations and decisions made by the committee regarding the use of beta-blockers and the recommendations for step 1 antihypertensive treatment in individuals with type 2 diabetes, particularly for those of Black African or African-Caribbean family origin?
  From chunk pre

## 7. Compare two RAG configurations

Now we can:
- Use the generated questions,
- Answer them with RAG configuration A (BGE + chunking A),
- (Optional) Answer them with RAG configuration B (e.g. different chunking and/or different embeddings),
- Compare the answers qualitatively.

To keep the exercise manageable, we start with config A only.
If you implemented config B, reuse `answer_query` with `index_b`, `chunks_b`, and your second embedder.
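
Alongside the qualitative comparison, a simple retrieval check is possible: since each question was generated from a known chunk, one can count how often that chunk appears among the retrieved results. The sketch below is one way to do this under stated assumptions; `retrieval_hit_rate` and the toy retriever are hypothetical names, and exact string matching only makes sense for the configuration whose chunks the questions were sampled from.

```python
from typing import Callable


def retrieval_hit_rate(
    qa_pairs: list[tuple[str, str]],
    retrieve: Callable[[str], list[str]],
) -> float:
    """Fraction of (chunk, question) pairs whose source chunk is among the retrieved chunks."""
    if not qa_pairs:
        return 0.0
    hits = sum(1 for chunk, question in qa_pairs if chunk in retrieve(question))
    return hits / len(qa_pairs)


# Toy stand-in for a retriever: returns every chunk sharing a word with the query
toy_chunks = ["asthma diagnosis tests", "inhaler dosage guidance", "blood pressure targets"]


def toy_retrieve(question: str) -> list[str]:
    words = set(question.lower().split())
    return [chunk for chunk in toy_chunks if words & set(chunk.split())]


toy_pairs = [
    ("asthma diagnosis tests", "which tests confirm asthma"),
    ("blood pressure targets", "what about cholesterol"),
]
toy_rate = retrieval_hit_rate(toy_pairs, toy_retrieve)
print("hit rate:", toy_rate)
```

In the notebook this could be called as `retrieval_hit_rate(questions, lambda q: retrieve_texts(q, 5, index_a, chunks_a, embedder_a))`. For configuration B the chunk boundaries differ, so a looser criterion (for example, text overlap with the source chunk) would be needed.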

In [21]:
def answer_generated_questions(question_tuples, k, index, chunks, embedder, client):
    results = []
    for chunk, question in question_tuples:
        answer = answer_query(question, k, index, chunks, embedder, client)
        results.append({
            "chunk": chunk,
            "question": question,
            "answer": answer
        })
    return results

results_a = answer_generated_questions(
    questions,
    k=5,
    index=index_a,
    chunks=chunks_a,
    embedder=embedder_a,
    client=openai_client,
)

# ============================================================
# Embedder B: Wrapper for OpenAI embeddings
# Provides an .encode() method compatible with retrieve_texts()
# ============================================================

class OpenAIEmbedder:
    def __init__(self, client, model="text-embedding-3-small"):
        self.client = client
        self.model = model

    def encode(self, texts, convert_to_numpy=True):
        # `convert_to_numpy` is accepted only for compatibility with SentenceTransformer.encode
        resp = self.client.embeddings.create(
            model=self.model,
            input=texts
        )
        # Convert embeddings to a float32 numpy array, the dtype FAISS works with
        vectors = np.array([item.embedding for item in resp.data], dtype="float32")
        return vectors


# instantiate embedder_b
embedder_b = OpenAIEmbedder(openai_client)

results_b = answer_generated_questions(
    questions,
    k=5,
    index=index_b,
    chunks=chunks_b,
    embedder=embedder_b,
    client=openai_client,
)


# Both result lists follow the order of `questions`, so pair them up by position
for item_a, item_b in zip(results_a, results_b):
    print("Question:", item_a["question"])
    print("Answer A:", item_a["answer"])
    print("Answer B:", item_b["answer"])
    print("Source chunk preview:", item_a["chunk"][:150], "...")
    print("-" * 80)

### Add RAG B and create a comparison table

If you implemented a second configuration (e.g. different chunking + OpenAI embeddings):

1. Build `index_b` and `embedder_b`.
2. Run `results_b = answer_generated_questions(..., index_b, chunks_b, embedder_b, client)`.
3. For each question, compare:
   - Which answer is more complete / specific?
   - Which one is better grounded in the source chunk?
4. Summarise your findings in a short **markdown cell** or a small table.
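
For step 4, a small helper can turn the two result lists into a markdown table that you can paste into a markdown cell. This is a sketch under stated assumptions: `comparison_table` is a hypothetical helper, it expects the dictionaries produced by `answer_generated_questions` (keys `question` and `answer`), and the 80-character cell width is arbitrary.

```python
def comparison_table(results_a: list[dict], results_b: list[dict], width: int = 80) -> str:
    """Render paired answers as a markdown table, one row per question."""

    def cell(text: str) -> str:
        # Collapse whitespace, shorten long text, then escape pipes for markdown
        flat = " ".join(text.split())
        if len(flat) > width:
            flat = flat[: width - 3] + "..."
        return flat.replace("|", "\\|")

    lines = ["| Question | Answer A | Answer B |", "| --- | --- | --- |"]
    for item_a, item_b in zip(results_a, results_b):
        lines.append(
            f"| {cell(item_a['question'])} | {cell(item_a['answer'])} | {cell(item_b['answer'])} |"
        )
    return "\n".join(lines)


# Toy data standing in for results_a / results_b
toy_a = [{"question": "What is X?", "answer": "X is\na thing."}]
toy_b = [{"question": "What is X?", "answer": "X is A | B."}]
toy_table = comparison_table(toy_a, toy_b)
print(toy_table)
```

The table only shortens the answers for display; judging completeness and grounding still requires reading the full answers against the source chunk.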

---

This concludes the core RAG exercise.