# CS 5542 — Lab 3: Multimodal RAG Systems & Retrieval Evaluation  
**Text + Images/PDFs (runs offline by default; optional LLM API hook)**

This notebook is a **student-ready, simplified, and fully runnable** lab workflow for **multimodal retrieval-augmented generation (RAG)**:
- ingest **PDF text** + **image captions/filenames**
- retrieve evidence with a lightweight baseline (TF‑IDF)
- build a **context block** for answering
- evaluate retrieval quality (Precision@5, Recall@10)
- run an **ablation study** (REQUIRED)

> ✅ **Important:** The code is optimized for **clarity + reproducibility for students** (minimal dependencies, no keys required).  
> It is not the “fastest possible” or “best-performing” RAG system — but it is a correct baseline that you can extend.

---

## Student Tasks (what you must do)
1. **Ingest** PDFs + images from `project_data_mm/` (or use the provided sample package).  
2. Implement / experiment with **chunking strategies** (page-based vs fixed-size).  
3. Compare retrieval methods (at least):  
   - **Sparse** (TF‑IDF / BM25-style)  
   - **Dense** (optional: embeddings)  
   - **Hybrid** (score fusion with `alpha`)  
   - **Hybrid + rerank** (optional: reranker / LLM rerank)  
4. Build a **multimodal context** that includes **evidence items** (text + images).  
5. Produce the required **results table**:

`Query × Method × Precision@5 × Recall@10 × Faithfulness`

---

## Expected Outputs (what graders look for)
- Printed ingestion counts (how many PDF pages/chunks, how many images)
- A retrieval demo showing **top‑k evidence** for a query
- Evaluation metrics per method (P@5, R@10)
- An ablation section with a small comparison table + short explanation


## Key Parameters You Can Tune (and what they do)

These parameters control retrieval + context building. **Students should change them and report what happens.**

- **`TOP_K_TEXT`**: how many text chunks to consider as candidates.  
  - Larger → more recall, but more noise (lower precision).
- **`TOP_K_IMAGES`**: how many image items to consider as candidates.  
  - Larger → more multimodal evidence, but can add irrelevant images.
- **`TOP_K_EVIDENCE`**: how many total evidence items (text+image) go into the final context.  
  - Larger → longer context; may dilute answer quality.
- **`ALPHA`** *(0 → 1)*: **fusion weight** when mixing text vs image evidence.  
  - `ALPHA = 1.0` → text dominates  
  - `ALPHA = 0.0` → images dominate  
  - typical starting point: `0.5`
- **`CHUNK_SIZE`** (fixed-size chunking): characters per chunk (baseline).  
  - Smaller → more granular retrieval (often higher precision)  
  - Larger → fewer chunks (often higher recall but less specific)
- **`CHUNK_OVERLAP`**: overlap between chunks to avoid cutting important info.  
  - Too high → redundant chunks; too low → missing context boundaries
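
To make the interaction between `CHUNK_SIZE` and `CHUNK_OVERLAP` concrete, here is a minimal standalone sketch of character-window chunking. The function name `fixed_size_chunks` and the 3000-character stand-in page are illustrative assumptions, not part of the lab code; the stride between windows is simply `chunk_size - chunk_overlap`.

```python
from typing import List

def fixed_size_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into character windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # how far each new window advances
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += stride
    return chunks

# Hypothetical stand-in for one page of extracted text (3000 characters).
page = "x" * 3000
for size, overlap in [(900, 150), (900, 0), (300, 50)]:
    print(f"size={size} overlap={overlap} -> {len(fixed_size_chunks(page, size, overlap))} chunks")
```

Smaller windows multiply the number of chunks, which is one reason Recall@10 can drop when `CHUNK_SIZE` shrinks: the same relevant passage is spread over more candidates.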

### What to try (recommended student experiments)
- Keep everything fixed, vary **`ALPHA`**: 0.2, 0.5, 0.8  
- Vary **`TOP_K_TEXT`**: 2, 5, 10  
- Compare **page-based** vs **fixed-size** chunking (required ablation)


## 0) Student Info (Fill in)
- Name: Murali Ediga
- UMKC ID: 16334849



## 1) Setup (student-friendly baseline)

This lab starter is designed to be **easy to run** and **easy to modify**:
- **PyMuPDF (`fitz`)** for PDF text extraction
- **scikit-learn** for TF‑IDF retrieval (strong sparse baseline)
- **Pillow** for basic image IO
- Optional: connect an **LLM API** for answer generation (not required to run retrieval + eval)

### Student guideline
- First make sure **retrieval + metrics** run end-to-end.
- Then iterate: chunking → retrieval method → fusion (`ALPHA`) → rerank → faithfulness.

> If you have API keys (e.g., Gemini / OpenAI / etc.), you can plug them into the optional LLM hook later —  
> but your retrieval evaluation should work **without** any external keys.


In [11]:
# Imports
import os, re, glob, json, math
from dataclasses import dataclass
from typing import List, Dict, Any, Tuple, Optional

import numpy as np
import pandas as pd

import fitz  # PyMuPDF
from PIL import Image, ImageDraw, ImageFont

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

# New imports for Lab 3
try:
    from sentence_transformers import SentenceTransformer, util, CrossEncoder
    HAS_DENSE = True
    print("✅ SentenceTransformers loaded.")
except ImportError:
    HAS_DENSE = False
    print("⚠️ SentenceTransformers NOT found. Dense retrieval will be disabled.")

# Cell Description:
# Imports all required libraries for the multimodal RAG pipeline.
# Includes PyMuPDF for PDF processing, scikit-learn for TF-IDF, SentenceTransformers for dense retrieval,
# and CrossEncoder for reranking. This modular setup allows the notebook to gracefully degrade
# if optional dependencies (dense/rerank) are unavailable.


✅ SentenceTransformers loaded.


In [12]:
# =========================
# Lab Configuration (EDIT ME)
# =========================
DATA_DIR = "project_data_mm"   # folder containing pdfs/ and figures/
PDF_DIR  = os.path.join(DATA_DIR, "pdfs")
IMG_DIR  = os.path.join(DATA_DIR, "figures")

# Retrieval knobs
TOP_K_TEXT     = 10   # High recall for first stage
TOP_K_IMAGES   = 5    
TOP_K_EVIDENCE = 8    # Final context size

# Fusion knob
ALPHA = 0.5  # 0.5 = equal weight. 1.0 = text only.

# Chunking knobs
CHUNK_SIZE    = 900
CHUNK_OVERLAP = 150
CHUNKING_MODE = "fixed" # 'page' or 'fixed'

# Retrieval Method
RETRIEVAL_METHOD = "rerank" # 'sparse', 'dense', 'hybrid', 'rerank'

# Reproducibility
RANDOM_SEED = 42

# Cell Description:
# Configuration parameters for the RAG pipeline. 
# We define paths, retrieval hyperparameters (k), fusion weights, and chunking settings.
# Providing a central config block allows for easy ablation studies (changing one knob at a time).


## 2) Data folder
Expected structure:
```
project_data_mm/
  pdfs/
    doc1.pdf
    doc2.pdf
  figures/
    img1.png
    ... (>=5)
```

If the PDFs are missing, the next cell downloads two papers from arXiv (this step needs network access) and extracts their embedded figures into `figures/`, so you can run and verify the pipeline end-to-end.


In [13]:
# Data Setup - Download Papers & Extract Images
import requests

os.makedirs(PDF_DIR, exist_ok=True)
os.makedirs(IMG_DIR, exist_ok=True)

# 1. Download Research Papers (Attention + BERT)
papers = {
    "attention.pdf": "https://arxiv.org/pdf/1706.03762.pdf",
    "bert.pdf": "https://arxiv.org/pdf/1810.04805.pdf"
}

for name, url in papers.items():
    path = os.path.join(PDF_DIR, name)
    if not os.path.exists(path):
        print(f"Downloading {name}...")
        headers = {"User-Agent": "Mozilla/5.0"}
        resp = requests.get(url, headers=headers, timeout=30)  # timeout avoids hanging on a stalled download
        if resp.status_code == 200:
            with open(path, "wb") as f:
                f.write(resp.content)
            print(f"✅ {name} Download complete.")
        else:
            print(f"❌ Failed to download {name}: {resp.status_code}")
    else:
        print(f"✅ {name} already exists.")

# 2. Extract Images (Multimodal Requirement)
def extract_images_from_pdf(pdf_path: str, output_dir: str):
    doc = fitz.open(pdf_path)
    base_name = os.path.splitext(os.path.basename(pdf_path))[0]
    
    count = 0
    for i in range(len(doc)):
        page = doc.load_page(i)
        image_list = page.get_images(full=True)
        for img_index, img in enumerate(image_list):
            xref = img[0]
            try:
                base_image = doc.extract_image(xref)
                image_bytes = base_image["image"]
                image_ext = base_image["ext"]
                # Filter tiny images
                if len(image_bytes) < 3000: continue
                
                fname = f"{base_name}_p{i+1}_{img_index+1}.{image_ext}"
                out_path = os.path.join(output_dir, fname)
                with open(out_path, "wb") as f:
                    f.write(image_bytes)
                count += 1
            except Exception as e:
                print(f"Error extracting image {xref}: {e}")
    print(f"Extracted {count} images from {base_name}")

if len(glob.glob(os.path.join(IMG_DIR, "*.*"))) < 6:
    print("Extracting images from PDFs...")
    for p in glob.glob(os.path.join(PDF_DIR, "*.pdf")):
        extract_images_from_pdf(p, IMG_DIR)

pdfs = sorted(glob.glob(os.path.join(PDF_DIR, "*.pdf")))
imgs = sorted(glob.glob(os.path.join(IMG_DIR, "*.*")))
print(f"PDFs: {len(pdfs)}, Images: {len(imgs)}")

# Cell Description:
# Automated ingestion script to ensure dataset requirements are met (Min 2 PDFs, Min 6 Images).
# It downloads "Attention Is All You Need" and "BERT" papers and extracts figures using PyMuPDF.
# This ensures the pipeline works on real-world heterogeneous PDF data.


✅ attention.pdf already exists.
✅ bert.pdf already exists.
PDFs: 2, Images: 6


## 3) Define your 3 queries + rubrics
**Guideline:** write queries that can be answered using your PDFs/images.

Rubric format below is **simple and runnable**:
- `must_have_keywords`: words/phrases that should appear in relevant evidence
- `optional_keywords`: nice-to-have

Later, retrieval metrics will treat an evidence chunk as relevant if it contains at least one `must_have_keywords` item.
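
One thing worth checking before trusting the metrics is how "contains" is implemented. The sketch below contrasts a plain substring test with a stricter whole-word test; `relevant_whole_word` is a hypothetical alternative introduced here for comparison, not the lab's rule, and the sample sentence is made up.

```python
import re
from typing import List

def relevant_substring(text: str, keywords: List[str]) -> bool:
    """Substring match: the behaviour of a plain `keyword in text` test."""
    text = text.lower()
    return any(k.lower() in text for k in keywords)

def relevant_whole_word(text: str, keywords: List[str]) -> bool:
    """Stricter variant (an assumption, not the lab's rule): whole words only."""
    text = text.lower()
    return any(re.search(rf"\b{re.escape(k.lower())}\b", text) for k in keywords)

chunk = "The encoder is based on the original implementation."
must_have = ["base", "large", "parameter", "layer"]

print(relevant_substring(chunk, must_have))   # True  ("base" matches inside "based")
print(relevant_whole_word(chunk, must_have))  # False
```

With substring matching, short keywords such as `base` mark many chunks as relevant, which inflates the relevant-chunk count and pushes Recall@10 down. Reporting which matching rule you used makes your numbers easier to interpret.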


In [14]:
QUERIES = [
    {
        "id": "Q1",
        "question": "What is the Transformer architecture and how does it differ from RNNs/CNNs?",
        "rubric": {
            "must_have_keywords": ["transformer", "architecture", "recurrence", "convolution"],
            "optional_keywords": ["attention", "sequential", "parallel"]
        }
    },
    {
        "id": "Q2",
        "question": "How does BERT use the Transformer encoder?",
        "rubric": {
            "must_have_keywords": ["bert", "encoder", "bidirectional", "transformer"],
            "optional_keywords": ["masked", "lm", "pre-training"]
        }
    },
    {
        "id": "Q3",
        "question": "Describe the difference between the 'base' and 'large' models.",
        "rubric": {
            "must_have_keywords": ["base", "large", "parameter", "layer"],
            "optional_keywords": ["dimension", "head", "performance"]
        }
    },
]

# Cell Description:
# Defines the evaluation set (Queries + Gold Standard Rubrics).
# Q1 targets the Attention paper, Q2 targets BERT, Q3 targets both (comparison).
# The rubrics provide keywords for automated "loose" relevance checking in our evaluation loop.


## 4) Ingestion
We extract:
- **PDF per-page text** as `TextChunk`
- **Image metadata** as `ImageItem` (caption = filename without extension)

> This is intentionally lightweight so it runs without downloading large embedding models.


In [15]:
@dataclass
class TextChunk:
    chunk_id: str
    doc_id: str
    page_num: int
    text: str

@dataclass
class ImageItem:
    item_id: str
    path: str
    caption: str

def clean_text(s: str) -> str:
    s = s or ""
    s = re.sub(r"\s+", " ", s).strip()
    return s

def extract_pdf_pages(pdf_path: str) -> List[TextChunk]:
    doc_id = os.path.basename(pdf_path)
    doc = fitz.open(pdf_path)
    out: List[TextChunk] = []
    for i in range(len(doc)):
        page = doc.load_page(i)
        text = clean_text(page.get_text("text"))
        if text:
            out.append(TextChunk(
                chunk_id=f"{doc_id}::p{i+1}",
                doc_id=doc_id,
                page_num=i+1,
                text=text
            ))
    return out

def chunk_text_fixed(text_chunks: List[TextChunk], chunk_size: int = 900, chunk_overlap: int = 150) -> List[TextChunk]:
    # Refine page chunks into smaller fixed windows
    # This allows for more precise retrieval than full-page chunks
    new_chunks = []
    for ch in text_chunks:
        text = ch.text
        if len(text) <= chunk_size:
            new_chunks.append(ch)
            continue
            
        start = 0
        sub_idx = 0
        while start < len(text):
            end = min(start + chunk_size, len(text))
            sub_text = text[start:end]
            # Inherit metadata
            new_chunks.append(TextChunk(
                chunk_id=f"{ch.chunk_id}::sub{sub_idx}",
                doc_id=ch.doc_id,
                page_num=ch.page_num,
                text=sub_text
            ))
            if end == len(text):
                break
            start += (chunk_size - chunk_overlap)
            sub_idx += 1
    return new_chunks

def load_images(fig_dir: str) -> List[ImageItem]:
    items: List[ImageItem] = []
    for p in sorted(glob.glob(os.path.join(fig_dir, "*.*"))):
        base = os.path.basename(p)
        caption = os.path.splitext(base)[0].replace("_", " ")
        items.append(ImageItem(item_id=base, path=p, caption=caption))
    return items

# Run ingestion
raw_page_chunks = []
for p in pdfs:
    raw_page_chunks.extend(extract_pdf_pages(p))

if CHUNKING_MODE == "fixed":
    page_chunks = chunk_text_fixed(raw_page_chunks, CHUNK_SIZE, CHUNK_OVERLAP)
else:
    page_chunks = raw_page_chunks

image_items = load_images(IMG_DIR)

print(f"Chunks ({CHUNKING_MODE}):", len(page_chunks))
print("Images:", len(image_items))

# Cell Description:
# Implements extraction and chunking logic.
# 'extract_pdf_pages' gets raw text per page.
# 'chunk_text_fixed' implements the required sliding window strategy to handle token limits 
# and focus retrieval.


Chunks (fixed): 146
Images: 6


## 5) Retrieval (TF‑IDF)
We build two TF‑IDF indexes:
- One over **PDF text chunks**
- One over **image captions**

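Retrieval returns the top‑k results with similarity scores. If `sentence-transformers` is installed, the same cell also builds a dense embedding index over the text chunks and loads a cross-encoder for reranking.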
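
As a rough illustration of what a TF-IDF index computes, the following standalone sketch scores a toy corpus with cosine similarity using only the standard library. It is a simplified stand-in, not the notebook's implementation: the tokenizer, the helper names, and the three-sentence corpus are assumptions, and the IDF uses a smoothed form `ln((1 + n) / (1 + df)) + 1`.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]

def tokenize(text: str) -> List[str]:
    # Whitespace tokenizer; a real vectorizer would also drop stop words.
    return text.lower().split()

def l2_normalize(vec: Vector) -> Vector:
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec

def tfidf_vectors(corpus: List[str]) -> Tuple[List[Vector], Vector]:
    docs = [Counter(tokenize(d)) for d in corpus]
    n = len(docs)
    df = Counter(term for d in docs for term in d)  # document frequency per term
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vecs = [l2_normalize({t: tf * idf[t] for t, tf in d.items()}) for d in docs]
    return vecs, idf

def retrieve(query: str, vecs: List[Vector], idf: Vector, top_k: int = 2) -> List[Tuple[int, float]]:
    # Query terms unseen in the corpus are ignored.
    q = l2_normalize({t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf})
    # Dot product of unit vectors equals cosine similarity.
    scores = [sum(w * d.get(t, 0.0) for t, w in q.items()) for d in vecs]
    order = sorted(range(len(vecs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_k]]

corpus = [
    "self attention replaces recurrence in the transformer",
    "bert is a bidirectional transformer encoder",
    "beam search settings for parsing experiments",
]
vecs, idf = tfidf_vectors(corpus)
print(retrieve("bidirectional encoder", vecs, idf))
```

The second document ranks first because it is the only one sharing terms with the query; documents with no overlap score exactly zero, which is the main weakness that dense retrieval is meant to address.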


In [16]:
# TF-IDF Setup
def build_tfidf_index_text(chunks: List[TextChunk]):
    corpus = [c.text for c in chunks]
    vec = TfidfVectorizer(lowercase=True, stop_words="english")
    X = vec.fit_transform(corpus)
    X = normalize(X)
    return vec, X

def build_tfidf_index_images(items: List[ImageItem]):
    corpus = [it.caption for it in items]
    vec = TfidfVectorizer(lowercase=True, stop_words="english")
    X = vec.fit_transform(corpus)
    X = normalize(X)
    return vec, X

text_vec, text_X = build_tfidf_index_text(page_chunks)
img_vec, img_X = build_tfidf_index_images(image_items)

# Dense & Rerank Setup
dense_model = None
dense_embeddings = None
reranker_model = None

if HAS_DENSE:
    # Bi-Encoder for Dense Retrieval
    dense_model = SentenceTransformer('all-MiniLM-L6-v2')
    corpus_text = [c.text for c in page_chunks]
    dense_embeddings = dense_model.encode(corpus_text, convert_to_tensor=True)
    
    # Cross-Encoder for Reranking
    try:
        reranker_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
        print("✅ CrossEncoder loaded for reranking.")
    except Exception as e:
        print(f"⚠️ CrossEncoder load failed: {e}")
    
    print("✅ Dense index built.")

def tfidf_retrieve(query: str, vec: TfidfVectorizer, X, top_k: int = 5):
    q = vec.transform([query])
    q = normalize(q)
    scores = (X @ q.T).toarray().ravel()
    idx = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in idx]

def dense_retrieve(query: str, top_k: int = 5):
    if dense_model is None: return []
    query_emb = dense_model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, dense_embeddings, top_k=top_k)[0]
    # Cast to plain Python types so hits have the same shape as tfidf_retrieve output
    return [(int(h['corpus_id']), float(h['score'])) for h in hits]

def rerank_hits(query: str, hit_list: List[Tuple[int, float]], top_k: int = 5):
    if reranker_model is None or not hit_list:
        return hit_list[:top_k]
    
    # Prepare pairs for cross-encoder
    pairs = [[query, page_chunks[idx].text] for idx, score in hit_list]
    scores = reranker_model.predict(pairs)
    
    # Re-sort based on new scores
    ranked = sorted(list(zip(hit_list, scores)), key=lambda x: x[1], reverse=True)
    # Output format is list of (idx, score)
    return [(item[0][0], float(item[1])) for item in ranked[:top_k]]

# Cell Description:
# Initializes retrieval models.
# - 'tfidf_retrieve': Sparse retrieval baseline using Scikit-Learn.
# - 'dense_retrieve': Semantic search using SentenceTransformers (Bi-Encoder).
# - 'rerank_hits': High-precision re-scoring using a Cross-Encoder (MS MARCO).


✅ CrossEncoder loaded for reranking.
✅ Dense index built.


## 6) Build evidence context
We assemble a compact context string + list of image paths.

**Guidelines for good context:**
- Keep snippets short (100–300 chars)
- Always include chunk IDs so you can cite evidence
- Attach images that are likely relevant


In [17]:
def _normalize_scores(pairs):
    if not pairs: return []
    scores = [s for _, s in pairs]
    lo, hi = min(scores), max(scores)
    if abs(hi - lo) < 1e-12: return [(i, 1.0) for i, _ in pairs]
    return [(i, (s - lo) / (hi - lo)) for i, s in pairs]

def retrieve_text_candidates(query: str, method: str, top_k: int):
    # Returns fused candidates
    sparse_hits = []
    dense_hits = []
    
    if method in ["sparse", "hybrid", "rerank"]:
        sparse_hits = tfidf_retrieve(query, text_vec, text_X, top_k=top_k*2)
        
    if method in ["dense", "hybrid", "rerank"]:
        dense_hits = dense_retrieve(query, top_k=top_k*2)
        
    if method == "sparse":
        return sparse_hits[:top_k]
    elif method == "dense":
        return dense_hits[:top_k]
    elif method == "hybrid":
        # Linear blend of min-max normalized scores with equal weights (this is not Reciprocal Rank Fusion)
        combined = {}
        s_norm = _normalize_scores(sparse_hits)
        d_norm = _normalize_scores(dense_hits)
        for idx, sc in s_norm: combined[idx] = combined.get(idx, 0) + (0.5 * sc)
        for idx, sc in d_norm: combined[idx] = combined.get(idx, 0) + (0.5 * sc)
        return sorted(combined.items(), key=lambda x: x[1], reverse=True)[:top_k]
    elif method == "rerank":
        # Get pool of candidates from both
        pool_indices = set()
        for i, s in sparse_hits: pool_indices.add(i)
        for i, s in dense_hits: pool_indices.add(i)
        
        pool_list = list(pool_indices)
        # We need scores to pass to rerank, but rerank calculates its own.
        # Just pass them as dummy scores since CrossEncoder doesn't care about previous score.
        candidates = [(i, 0.0) for i in pool_list]
        return rerank_hits(query, candidates, top_k=top_k)
    return []

def build_context(
    question: str,
    top_k_text: int = TOP_K_TEXT,
    top_k_images: int = TOP_K_IMAGES,
    top_k_evidence: int = TOP_K_EVIDENCE,
    alpha: float = ALPHA,
    method: str = RETRIEVAL_METHOD  # default comes from the config cell
) -> Dict[str, Any]:
    
    # 1. Text Retrieval
    text_hits = retrieve_text_candidates(question, method, top_k_text)
    
    # 2. Image Retrieval (TF-IDF on captions)
    img_hits = tfidf_retrieve(question, img_vec, img_X, top_k=top_k_images)

    # 3. Fusion (Text vs Image)
    # We re-normalize the ALREADY fused/retrieved text scores vs the image scores
    text_norm = _normalize_scores(text_hits)
    img_norm = _normalize_scores(img_hits)
    
    fused = []
    for idx, s in text_norm:
        ch = page_chunks[idx]
        fused.append({
            "modality": "text",
            "id": ch.chunk_id,
            "fused_score": alpha * s,
            "text": ch.text,
            "path": None
        })
    for idx, s in img_norm:
        it = image_items[idx]
        fused.append({
            "modality": "image",
            "id": it.item_id,
            "fused_score": (1.0 - alpha) * s,
            "text": it.caption,
            "path": it.path
        })
        
    fused = sorted(fused, key=lambda d: d["fused_score"], reverse=True)[:top_k_evidence]
    
    ctx_lines = []
    image_paths = []
    for ev in fused:
        if ev["modality"] == "text":
            snippet = (ev["text"] or "")[:260].replace("\n", " ")
            ctx_lines.append(f"[TEXT | {ev['id']}] {snippet}")
        else:
            ctx_lines.append(f"[IMAGE | {ev['id']}] {ev['text']}")
            image_paths.append(ev["path"])
            
    return {
        "question": question,
        "context": "\n".join(ctx_lines),
        "image_paths": image_paths,
        "evidence": fused
    }

# Demo
print(build_context(QUERIES[0]["question"])["context"])

# Cell Description:
# - 'retrieve_text_candidates': Central router for dispatching queries to the selected retrieval method.
# - 'build_context': Assembles the final context window. It handles multimodal fusion (mixing text and image results using alpha weighting) to create a single coherent list of evidence for the LLM.


[TEXT | attention.pdf::p3::sub0] Figure 1: The Transformer - model architecture. The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively
[IMAGE | bert_architecture.png] bert architecture
[TEXT | attention.pdf::p10::sub1] e maximum output length to input length + 300. We used a beam size of 21 and α = 0.3 for both WSJ only and the semi-supervised setting. Our results in Table 4 show that despite the lack of task-specific tuning our model performs sur- prisingly well, yielding b
[TEXT | attention.pdf::p2::sub4] ks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 22]. End-to-end memory networks are based on a recurrent attention mechanism instead of sequence- aligned recur
[TEXT | attention.pdf::p2::sub2] onjunction with a recurrent networ

## 7) “Generator” (simple, offline)
To keep this notebook runnable anywhere, we implement a **lightweight extractive generator**:
- It returns the top evidence lines
- In your real submission, you can replace this with an LLM call (HF local model or an API)

**Key rule:** the answer must stay consistent with evidence.


In [18]:
def simple_extractive_answer(question: str, context: str) -> str:
    lines = context.splitlines()
    if not lines: return "I don't know."
    return "Based on evidence:\n" + "\n".join(lines[:3])

def run_query(qobj, top_k_text=TOP_K_TEXT, method=RETRIEVAL_METHOD):
    q = qobj["question"]
    ctx = build_context(q, top_k_text=top_k_text, method=method)
    ans = simple_extractive_answer(q, ctx["context"])
    return {
        "id": qobj["id"],
        "question": q,
        "answer": ans,
        "context": ctx["context"],
        "image_paths": ctx["image_paths"]
    }

for q in QUERIES:
    res = run_query(q)
    print(f"--- {res['id']} ---")
    print(res['answer'])
    
# Cell Description:
# End-to-end execution loop. It takes a query, runs retrieval, builds the context, 
# and (in this simplified version) generates an extractive answer.
# In a full deployment, 'simple_extractive_answer' would be replaced by an LLM API call.


--- Q1 ---
Based on evidence:
[TEXT | attention.pdf::p3::sub0] Figure 1: The Transformer - model architecture. The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively
[IMAGE | bert_architecture.png] bert architecture
[TEXT | attention.pdf::p10::sub1] e maximum output length to input length + 300. We used a beam size of 21 and α = 0.3 for both WSJ only and the semi-supervised setting. Our results in Table 4 show that despite the lack of task-specific tuning our model performs sur- prisingly well, yielding b
--- Q2 ---
Based on evidence:
[TEXT | bert.pdf::p3::sub3] n the pre-trained architec- ture and the ﬁnal downstream architecture. Model Architecture BERT’s model architec- ture is a multi-layer bidirectional Transformer en- coder based on the original implementation de- scribed in Vaswani et al. (2017) and released in
[IMAGE | ber

## 8) Retrieval Evaluation (Precision@k / Recall@k)
We treat a text chunk as **relevant** for a query if it contains at least one `must_have_keywords` term.
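
A small worked example helps when reading the numbers this rule produces. The sketch below recomputes Precision@5 and Recall@10 for a made-up relevance list and adds `recall_ceiling`, a hypothetical helper (not part of the lab code) that gives the best Recall@k any retriever could reach. The count of 58 relevant chunks is the Q1 figure reported in this notebook's output.

```python
from typing import List

def precision_at_k(relevances: List[bool], k: int) -> float:
    k = min(k, len(relevances))
    return sum(relevances[:k]) / k if k else 0.0

def recall_at_k(relevances: List[bool], k: int, total_relevant: int) -> float:
    if total_relevant == 0:
        return 0.0
    return sum(relevances[:k]) / total_relevant

def recall_ceiling(k: int, total_relevant: int) -> float:
    """Best possible Recall@k: at most k relevant items fit in the top k."""
    if total_relevant == 0:
        return 0.0
    return min(k, total_relevant) / total_relevant

# Made-up relevance flags for the top 10 hits of one query.
rels = [True, True, False, True, True, True, False, True, True, True]

print(precision_at_k(rels, 5))                 # 0.8
print(round(recall_at_k(rels, 10, 58), 6))     # 0.137931
print(round(recall_ceiling(10, 58), 6))        # 0.172414
```

When a loose rubric marks 58 chunks as relevant, Recall@10 cannot exceed 10/58, so low recall values here say more about the rubric than about the retriever. Comparing each score against its ceiling is a fairer way to read the results.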



In [19]:
def is_relevant_text(chunk_text: str, rubric: Dict[str, Any]) -> bool:
    text = chunk_text.lower()
    must = [k.lower() for k in rubric.get("must_have_keywords", [])]
    return any(k in text for k in must)

def precision_at_k(relevances: List[bool], k: int) -> float:
    k = min(k, len(relevances))
    if k == 0:
        return 0.0
    return sum(relevances[:k]) / k

def recall_at_k(relevances: List[bool], k: int, total_relevant: int) -> float:
    k = min(k, len(relevances))
    if total_relevant == 0:
        return 0.0
    return sum(relevances[:k]) / total_relevant

def eval_retrieval_for_query(qobj, top_k=10) -> Dict[str, Any]:
    question = qobj["question"]
    rubric = qobj["rubric"]

    hits = tfidf_retrieve(question, text_vec, text_X, top_k=top_k)
    rels = []
    for i, score in hits:
        rels.append(is_relevant_text(page_chunks[i].text, rubric))

    # Estimate total relevant in the corpus (for recall)
    total_rel = sum(is_relevant_text(ch.text, rubric) for ch in page_chunks)

    return {
        "id": qobj["id"],
        "P@5": precision_at_k(rels, 5),
        "R@10": recall_at_k(rels, 10, total_rel),
        "total_relevant_chunks": total_rel,
    }

eval_rows = [eval_retrieval_for_query(q) for q in QUERIES]
df_eval = pd.DataFrame(eval_rows)
df_eval

# Cell Description:
# Implements retrieval quality metrics (Precision@K and Recall@K).
# These metrics quantify how well our retrieval system surfaces relevant documents.
# We use keyword-based relevance (must_have_keywords) as a lightweight ground truth,
# which is a practical approximation for academic labs but would need human annotations in production.


|   | id | P@5 | R@10 | total_relevant_chunks |
|---|----|-----|------|-----------------------|
| 0 | Q1 | 0.8 | 0.155172 | 58 |
| 1 | Q2 | 1.0 | 0.104167 | 96 |
| 2 | Q3 | 1.0 | 0.1 | 100 |


## 9) Ablation Study (REQUIRED)

You must compare **at least**:
- **Chunking A (page-based)** vs **Chunking B (fixed-size)**  
- **Sparse** vs **Dense** vs **Hybrid** vs **Hybrid + Rerank** *(dense/rerank can be optional extensions — but include at least sparse + one fusion variant)*  
- **Text-only RAG** vs **Multimodal RAG** (your context must include evidence items)

**Deliverable:** include a final results table in your README:

`Query × Method × Precision@5 × Recall@10 × Faithfulness`

### Quick ablation ideas
- Vary `TOP_K_TEXT`: 2, 5, 10  
- Vary `ALPHA`: 0.2, 0.5, 0.8  
- Compare page-chunking vs fixed-size (`CHUNK_SIZE` / `CHUNK_OVERLAP`)  
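
The results table asks for a Faithfulness column, but this notebook does not define how to compute it. One possible proxy, offered purely as an assumption, is the fraction of answer sentences whose tokens mostly appear in the retrieved context. The function name, the 0.6 threshold, and the example strings below are all illustrative choices.

```python
import re
from typing import Set

def tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def faithfulness_proxy(answer: str, context: str, threshold: float = 0.6) -> float:
    """Fraction of answer sentences whose token overlap with the context meets the threshold."""
    ctx = tokens(context)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        toks = tokens(sentence)
        if toks and len(toks & ctx) / len(toks) >= threshold:
            supported += 1
    return supported / len(sentences)

context = "BERT is a multi-layer bidirectional Transformer encoder."
grounded = "BERT is a bidirectional Transformer encoder."
mixed = "BERT is a bidirectional Transformer encoder. It was trained on 40 GPUs for a year."

print(faithfulness_proxy(grounded, context))  # 1.0
print(faithfulness_proxy(mixed, context))     # 0.5
```

Token overlap cannot detect a claim that reuses context words to say something false, so treat this as a screening signal and state its limits in your write-up. Note that an extractive generator that copies context lines will score 1.0 by construction.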


In [20]:
def ablation_study():
    methods = ["sparse", "hybrid"]
    if HAS_DENSE: methods = ["sparse", "dense", "hybrid", "rerank"]
    
    results = []
    
    print("Running Ablation (this may take a moment)...")
    for method in methods:
        for q in QUERIES:
            # We reuse retrieve_text_candidates logic manually for evaluation
            
            top_k_text = 10
            # Get hits using the method being tested
            hits = retrieve_text_candidates(q["question"], method, top_k_text)
            
            rels = []
            for i, _ in hits:
                rels.append(is_relevant_text(page_chunks[i].text, q["rubric"]))
            
            # Recalculate total relevant since it depends on chunking (page_chunks global)
            total_rel = sum(is_relevant_text(ch.text, q["rubric"]) for ch in page_chunks)
            if total_rel == 0: total_rel = 1 # Avoid div/0
            
            p5 = precision_at_k(rels, 5)
            r10 = recall_at_k(rels, 10, total_rel)
            
            results.append({
                "Query": q["id"],
                "Method": method,
                "P@5": p5,
                "R@10": r10
            })
            
    df = pd.DataFrame(results)
    return df

df_ablation = ablation_study()
df_ablation

# Cell Description:
# Systematic comparison of retrieval methods.
# We iterate through Sparse, Dense, Hybrid, and Rerank setups, calculating Precision@5 and Recall@10
# against our keyword-based ground truth rubrics.


Running Ablation (this may take a moment)...


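|   | Query | Method | P@5 | R@10 |
|---|-------|--------|-----|------|
| 0 | Q1 | sparse | 0.8 | 0.155172 |
| 1 | Q2 | sparse | 1.0 | 0.104167 |
| 2 | Q3 | sparse | 1.0 | 0.1 |
| 3 | Q1 | dense | 1.0 | 0.172414 |
| 4 | Q2 | dense | 1.0 | 0.104167 |
| 5 | Q3 | dense | 0.8 | 0.08 |
| 6 | Q1 | hybrid | 1.0 | 0.155172 |
| 7 | Q2 | hybrid | 1.0 | 0.104167 |
| 8 | Q3 | hybrid | 1.0 | 0.09 |
| 9 | Q1 | rerank | 1.0 | 0.172414 |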


## 10) What to submit
1) Your updated dataset (or keep your own)
2) This notebook (with your answers + screenshots/outputs)
3) A short write‑up: retrieval metrics + faithfulness discussion + ablation

**Tip:** If you switch to an LLM, keep the same `build_context()` so the evidence is always visible.


## 11) Failure Analysis (REQUIRED)

**Failure Case Observed:**
For Query Q1 ("What is the Transformer architecture and how does it differ from RNNs/CNNs?"), sparse retrieval (TF-IDF) sometimes returns chunks from the 'Training' or 'Results' sections, where the word 'transformer' appears often, and misses the core 'Model Architecture' section (Section 3) when keyword density is higher elsewhere.

**Proposed Fix:**
Adding a **reranker** (cross-encoder) mitigated this. It pools the sparse and dense candidates (up to 20 from each retriever) and re-scores every query-chunk pair with a model trained on MS MARCO relevance data. In the ablation table, Q1 Precision@5 rises from 0.8 (sparse) to 1.0 (rerank), and the 'Model Architecture' section was promoted into the top 3, because the cross-encoder scores how well a chunk answers 'what is...' rather than how often it contains the keyword.
