# RAG

Implement a basic RAG module in DSPy.
Given a question, retrieve the top-k documents from a corpus of HTML documents, then pass them as context to an LLM.

Refer to https://dspy.ai/tutorials/rag/. 


In [1]:
import dspy
from sentence_transformers import SentenceTransformer

# Load an extremely efficient local model for retrieval
model = SentenceTransformer("sentence-transformers/static-retrieval-mrl-en-v1", device="cpu")

# Create an embedder using the model's encode method
embedder = dspy.Embedder(model.encode)

# Walk a directory, read each HTML file, and extract its visible text
import os
from bs4 import BeautifulSoup


def read_html_files(directory):
    texts = []
    for filename in os.listdir(directory):
        if filename.endswith(".html"):
            with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
                soup = BeautifulSoup(file, 'html.parser')
                texts.append(soup.get_text())
    return texts

In [2]:
corpus = read_html_files("../PragmatiCQA-sources/The Legend of Zelda")
print(f"Loaded {len(corpus)} documents. Will encode them below.")

Loaded 406 documents. Will encode them below.


In [3]:
# Parameters for the retriever
max_characters = 10000  # for truncating >99th percentile of documents
topk_docs_to_retrieve = 5  # number of documents to retrieve per search query

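# Truncate very long documents before embedding them
corpus = [doc[:max_characters] for doc in corpus]
search = dspy.retrievers.Embeddings(embedder=embedder, corpus=corpus, k=topk_docs_to_retrieve)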

In [4]:
# lm = dspy.LM('ollama_chat/devstral', api_base='http://localhost:11434', api_key='')

lm = dspy.LM('xai/grok-3-mini', api_key="")  # fill in your own API key
dspy.configure(lm=lm)

In [5]:
class RAG(dspy.Module):
    def __init__(self):
        self.respond = dspy.ChainOfThought('context, question -> response')

    def forward(self, question):
        context = search(question).passages
        return self.respond(context=context, question=question)
    
rag = RAG()

In [6]:
answer = rag(question="What is the main plot of The Legend of Zelda?")  # Example query

print(answer.response)  # Print the response from the RAG model

The main plot of The Legend of Zelda revolves around a young hero named Link who must save the kingdom of Hyrule from the evil Ganon, the Prince of Darkness. Ganon steals the Triforce of Power and seeks to conquer Hyrule, prompting Princess Zelda to break the Triforce of Wisdom into eight fragments and hide them across the land. Zelda sends her nursemaid, Impa, to find a brave warrior to stop Ganon. Link embarks on a quest to collect the fragments, explore dungeons, battle enemies, and ultimately confront Ganon to rescue Princess Zelda and restore peace to Hyrule.


In [7]:
q = 'What year did the Legend of Zelda come out?' 

print(rag(question=q).response)

The Legend of Zelda was first released in 1986.


# Part 1 — Traditional QA Baseline (SemanticF1)

## Setup (what we evaluated)

* **Model:** `distilbert-base-cased-distilled-squad` (extractive QA via Hugging Face `pipeline`). We keep the pipeline's default chunking: long contexts are split into overlapping windows (`max_seq_len` defaults to 384 tokens, with a document stride) unless overridden.
* **Data slice:** the **first question of each conversation** in `val.jsonl` from **PragmatiCQA**. The dataset includes both **literal** and **pragmatic** annotations for answers, designed to test whether systems handle implied intent, not just verbatim spans.
* **Configs compared:**

  1. **Literal** (gold literal spans → QA),
  2. **Pragmatic** (gold pragmatic spans → QA),
  3. **Retrieved** (top-k passages retrieved from HTML sources → QA).
* **Metric:** **SemanticF1** (DSPy), computed with an LLM that compares candidate vs. gold answers using key-idea overlap. We use the **batched** API so each item is a `dspy.Example` with `example` (gold) and `pred` (system) inputs.
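
To make the metric's arithmetic concrete, here is a minimal sketch of an F1 over key ideas. It is not DSPy's implementation: in `SemanticF1` an LLM judges which ideas are covered, whereas this sketch assumes the ideas have already been extracted into hand-labelled sets. The function name `key_idea_f1` and the example ideas are hypothetical.

```python
def key_idea_f1(pred_ideas: set, gold_ideas: set) -> dict:
    """Precision, recall, and F1 over two sets of (hand-labelled) key ideas."""
    if not pred_ideas or not gold_ideas:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    overlap = len(pred_ideas & gold_ideas)
    precision = overlap / len(pred_ideas)  # share of predicted ideas that are in the gold answer
    recall = overlap / len(gold_ideas)     # share of gold ideas that the prediction covers
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: a short extracted span covers one of three gold ideas.
gold = {"released in 1986", "published by Nintendo", "first game in the series"}
pred = {"released in 1986"}
scores = key_idea_f1(pred, gold)
print(scores)
```

In this toy case the prediction is fully precise but covers only a third of the gold ideas, so its F1 is 0.5. A short extracted span can hit this ceiling even when it is correct.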


In [8]:
# === Part 1 — setup (baseline QA + data) ===
from pathlib import Path
import os, json
import pandas as pd
from bs4 import BeautifulSoup
from transformers import pipeline
from tqdm import tqdm
BASE = Path.cwd().parent  # adjust if your notebook is elsewhere
DATA_VAL = BASE / "PragmatiCQA" / "data" / "val.jsonl"
SOURCES = BASE / "PragmatiCQA-sources"

# Extractive QA baseline
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad", device=-1)

# First question per validation conversation (≈179)
def load_val_first_questions(jsonl_path: Path):
    rows = []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            q0 = rec["qas"][0]
            literal = " ".join(x.get("text","") for x in q0.get("a_meta",{}).get("literal_obj", [])).strip()
            pragmatic = " ".join(x.get("text","") for x in q0.get("a_meta",{}).get("pragmatic_obj", [])).strip()
            rows.append({
                "topic": rec["topic"],
                "question": q0["q"],
                "gold_answer": q0["a"],
                "literal_ctx": literal,
                "pragmatic_ctx": pragmatic
            })
    return pd.DataFrame(rows)


Device set to use cpu


In [13]:
max_characters = 10000

def read_topic_html_texts(topic_dir: Path):
    texts = []
    if not topic_dir.exists():
        return texts
    for fname in os.listdir(topic_dir):
        if fname.endswith(".html"):
            try:
                with open(topic_dir / fname, "r", encoding="utf-8", errors="ignore") as f:
                    soup = BeautifulSoup(f, "html.parser")
                    txt = soup.get_text(" ", strip=True)
                    if txt:
                        texts.append(txt[:max_characters])
            except Exception:
                pass
    return texts

RETRIEVERS = {}

def get_retriever_for_topic(topic: str, k: int = 6):
    # cache even the "None" so we don't retry building on empty topics each time
    if topic in RETRIEVERS:
        return RETRIEVERS[topic]

    corpus = read_topic_html_texts(SOURCES / topic)
    if not corpus:
        RETRIEVERS[topic] = None
        return None

    # Make sure the embedder returns numpy output from SentenceTransformer.encode
    global embedder
    try:
        _ = embedder(["probe"])  # should return a (1, D) array
    except TypeError:
        # Rebuild the embedder with explicit numpy conversion
        from sentence_transformers import SentenceTransformer
        # "gpu" is not a valid device string; use "cpu" (or "cuda" if a GPU is available)
        model = SentenceTransformer("sentence-transformers/static-retrieval-mrl-en-v1", device="cpu")
        embedder = dspy.Embedder(lambda xs: model.encode(xs, convert_to_numpy=True))

    RETRIEVERS[topic] = dspy.retrievers.Embeddings(
        embedder=embedder, corpus=corpus, k=k, normalize=True
    )
    return RETRIEVERS[topic]

def retrieve_context(topic: str, question: str, k: int = 6) -> str:
    retr = get_retriever_for_topic(topic, k=k)
    if retr is None:
        return ""   # <- gracefully skip empty topics
    return " ".join(retr(question).passages)


def build_sf1_batch(preds, golds, questions):
    batch = []
    for p, g, q in zip(preds, golds, questions):
        gold_ex = dspy.Example(question=q, response=g)
        pred_ex = dspy.Prediction(response=p or "")
        item = dspy.Example(example=gold_ex, pred=pred_ex).with_inputs("example", "pred")
        batch.append(item)
    return batch
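
import string
from collections import Counter

# Assumed normalization (the original helper is not shown): lowercase, strip punctuation, split on whitespace
def _normalize(text):
    text = (text or "").lower().translate(str.maketrans("", "", string.punctuation))
    return text.split()

def lexical_f1(pred: str, gold: str):
    p, g = _normalize(pred), _normalize(gold)
    if not p or not g:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    pc, gc = Counter(p), Counter(g)
    common = sum((pc & gc).values())  # number of overlapping tokens (multiset intersection)
    prec = common / sum(pc.values())
    rec  = common / sum(gc.values())
    f1   = 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)
    return {"precision": prec, "recall": rec, "f1": f1}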

In [14]:
# === Part 1 — run the three configurations and evaluate ===
val_df = load_val_first_questions(DATA_VAL)

rows = []
for _, r in tqdm(val_df.iterrows(), total=len(val_df), desc="Evaluating Part 1"):
    topic, q, gold = r["topic"], r["question"], r["gold_answer"]

    # 1) literal
    pred_lit  = qa(question=q, context=r["literal_ctx"]).get("answer","") if r["literal_ctx"] else ""

    # 2) pragmatic
    pred_prag = qa(question=q, context=r["pragmatic_ctx"]).get("answer","") if r["pragmatic_ctx"] else ""

    # 3) retrieved (uses your DSPy embeddings retriever per topic)
    ctx_ret   = retrieve_context(topic, q, k=6)
    pred_ret  = qa(question=q, context=ctx_ret).get("answer","") if ctx_ret else ""

    rows.append({
        "topic": topic,
        "question": q,
        "gold_answer": gold,
        "pred_literal": pred_lit,
        "pred_pragmatic": pred_prag,
        "pred_retrieved": pred_ret
    })

pred_df = pd.DataFrame(rows)

Evaluating Part 1: 100%|██████████| 179/179 [08:24<00:00,  2.82s/it]


In [15]:
try:
    from dspy.evaluate import SemanticF1
    sf1 = SemanticF1(decompositional=True)

    golds = pred_df["gold_answer"].tolist()
    qs    = pred_df["question"].tolist()
    batch_lit  = build_sf1_batch(pred_df["pred_literal"].tolist(),   golds, qs)
    batch_prag = build_sf1_batch(pred_df["pred_pragmatic"].tolist(), golds, qs)
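    # Score "retrieved" only on rows with a non-empty prediction, so its mean
    # covers fewer rows than the literal and pragmatic means
    mask_ret   = pred_df["pred_retrieved"].fillna("").str.len() > 0
    batch_ret  = build_sf1_batch(pred_df.loc[mask_ret, "pred_retrieved"].tolist(),
                                 pred_df.loc[mask_ret, "gold_answer"].tolist(),
                                 pred_df.loc[mask_ret, "question"].tolist())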


    scores_lit  = sf1.batch(batch_lit,  num_threads=8)
    scores_prag = sf1.batch(batch_prag, num_threads=8)
    scores_ret  = sf1.batch(batch_ret,  num_threads=8)

    metric_used = "SemanticF1"
except Exception as e:
    print(e)
    metric_used = "LexicalF1"
    scores_lit  = [lexical_f1(p,g) for p,g in zip(pred_df["pred_literal"],   pred_df["gold_answer"])]
    scores_prag = [lexical_f1(p,g) for p,g in zip(pred_df["pred_pragmatic"], pred_df["gold_answer"])]
    scores_ret  = [lexical_f1(p,g) for p,g in zip(pred_df["pred_retrieved"], pred_df["gold_answer"])]

Processed 179 / 179 examples: 100%|██████████| 179/179 [04:32<00:00,  1.52s/it]
Processed 179 / 179 examples: 100%|██████████| 179/179 [04:07<00:00,  1.38s/it]
Processed 135 / 135 examples: 100%|██████████| 135/135 [05:15<00:00,  2.34s/it]


AttributeError: 'float' object has no attribute 'get'

In [17]:
import numpy as np
def summarize_any(scores, metric_used: str):
    if not scores:
        return {"precision": np.nan, "recall": np.nan, "f1": 0.0}

    first = scores[0]
    # Case 1: LexicalF1 -> list of dicts with precision/recall/f1
    if isinstance(first, dict):
        prec = float(np.mean([s.get("precision", 0.0) for s in scores]))
        rec  = float(np.mean([s.get("recall",    0.0) for s in scores]))
        f1   = float(np.mean([s.get("f1",        0.0) for s in scores]))
        return {"precision": prec, "recall": rec, "f1": f1}

    # Case 2: SemanticF1 -> list of floats
    f1 = float(np.mean([float(s) for s in scores]))
    return {"precision": np.nan, "recall": np.nan, "f1": f1}

summary = pd.DataFrame.from_records([
    {"config": "literal",   **summarize_any(scores_lit,  metric_used)},
    {"config": "pragmatic", **summarize_any(scores_prag, metric_used)},
    {"config": "retrieved", **summarize_any(scores_ret,  metric_used)},
])
summary["metric"] = metric_used

pred_df.to_csv("traditional_qa_results.csv", index=False)
summary.to_csv("traditional_qa_summary.csv", index=False)
summary

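|   |    config | precision | recall |       f1 |     metric |
| -: | --------: | --------: | -----: | -------: | ---------: |
| 0 |   literal |       NaN |    NaN |  0.39588 | SemanticF1 |
| 1 | pragmatic |       NaN |    NaN | 0.366978 | SemanticF1 |
| 2 | retrieved |       NaN |    NaN | 0.095506 | SemanticF1 |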


In [18]:
missing = []
for t in val_df["topic"].unique():
    if not any((SOURCES / t).glob("*.html")):
        missing.append(t)
len(missing), missing[:10]

(4,
 ['A Nightmare on Elm Street (2010 film)',
  'Alexander Hamilton',
  'The Wonderful Wizard of Oz (book)',
  'Popeye'])
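
Because the retrieved configuration is scored only on rows with a non-empty prediction (135 of 179 above), one option is to also report every configuration on that shared subset. The following is a minimal sketch with plain lists. The helper name `mean_on_shared_rows` and the toy numbers are hypothetical, not results from this notebook, and it assumes each config has one score per row (placeholders for unscored rows are ignored via the mask).

```python
from statistics import mean


def mean_on_shared_rows(scores_by_config: dict, keep: list) -> dict:
    """Average each config's per-row scores over the rows where keep[i] is True."""
    out = {}
    for name, scores in scores_by_config.items():
        if len(scores) != len(keep):
            raise ValueError(f"{name}: expected {len(keep)} scores, got {len(scores)}")
        kept = [s for s, k in zip(scores, keep) if k]
        out[name] = mean(kept) if kept else float("nan")
    return out


# Toy data: row 1 has no retrieved prediction, so it is dropped for every config.
keep = [True, False, True, True]
toy_scores = {
    "literal":   [0.5, 0.2, 1.0, 0.0],
    "retrieved": [0.25, 0.0, 0.5, 0.0],  # the 0.0 at row 1 is a placeholder and is masked out
}
shared = mean_on_shared_rows(toy_scores, keep)
print(shared)
```

To apply this to the notebook's outputs, the retrieved scores would first need to be expanded back to one entry per row, since `scores_ret` only holds the scored rows.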

## Results (SemanticF1 ↑ is better)

|    Config | SemanticF1 |
| --------: | ---------: |
|   Literal | **0.3959** |
| Pragmatic | **0.3670** |
| Retrieved | **0.0955** |

## What this shows

* **Literal > Pragmatic > Retrieved.** This is expected: extractive QA excels when the answer is present literally; **pragmatic** answers require inference beyond verbatim spans; **retrieved** is hardest because the model must first get the right evidence then extract a short span. This ordering aligns with the dataset’s goal of stressing pragmatic understanding.
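* **Absolute levels are modest but plausible** for an extractive baseline: the model returns a short span while the gold answers are free-form, so even under SemanticF1 (a semantic, not lexical, match metric) part of the gold content is likely to go uncovered.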

## Why “Retrieved” lags (diagnosis)

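1. **Missing sources in our corpus.** Four topics have **no HTML files**, so their rows get empty contexts and therefore empty answers. Because empty predictions are filtered out before scoring, only 135 of the 179 questions are scored for the retrieved config. These rows do not pull its mean down, but they do mean it is averaged over a different subset than the literal and pragmatic configs.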
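2. **Long concatenated contexts.** The QA pipeline splits a long context into overlapping windows (`max_seq_len` defaults to 384 tokens, with a document stride) and returns the best-scoring span across all of them. Concatenating six passages of up to 10,000 characters each produces many windows, which gives the model many distractor spans and raises the chance that it picks the wrong one. Raising `max_seq_len` (e.g., to 512) and `max_answer_len`, or passing fewer passages, may help.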
3. **Retriever quality and ordering.** A **bi-encoder** retriever may surface relevant passages, but **re-ranking** its candidates with a **cross-encoder** (e.g., MS MARCO MiniLM-L6-v2) usually moves the most relevant passages toward the top and improves downstream QA. Hybrid pipelines that combine **BM25 and dense** retrieval before re-ranking are also common practice.
4. **Cosine vs. dot product.** The `dspy.retrievers.Embeddings` calls in Part 1 already pass `normalize=True`. If you switch to a FAISS index directly, L2-**normalize the embeddings** and use an **inner-product** index, which is equivalent to ranking by cosine similarity.
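
The windowing in point 2 can be pictured with a small sketch. It splits a token list into overlapping windows in the spirit of the pipeline's `max_seq_len` and stride settings. It is a simplification that ignores the question tokens and special tokens a real tokenizer adds, and `sliding_windows` is a hypothetical helper, not a Transformers function.

```python
def sliding_windows(tokens, max_len, overlap):
    """Split tokens into windows of at most max_len; consecutive windows share `overlap` tokens."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the context
        start += step
    return windows


tokens = [f"t{i}" for i in range(10)]
windows = sliding_windows(tokens, max_len=4, overlap=2)
for w in windows:
    print(w)
```

Every token lands in at least one window, so nothing is dropped outright. The number of windows grows with the context length, though, so concatenating more passages gives the reader model more candidate spans to rank.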

## Takeaways (for the report)

* The **extractive baseline** behaves as expected on PragmatiCQA: strong on literal spans, somewhat lower on pragmatic answers, and much lower when relying on retrieval.
* **Actionable fixes** (optional but recommended to note):

  * Fill the **missing topic folders** so that every question has a retrieved context and all three configs are scored on the same rows.
  * Increase QA pipeline limits to handle longer concatenated contexts (`max_seq_len`, `max_answer_len`).
  * Improve retrieval: **hybrid (BM25+dense)** + **cross-encoder re-ranking** before feeding to the QA model.
  * If you index with FAISS, use **inner product on L2-normalized vectors**, which is equivalent to cosine similarity.
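
The last point rests on a simple identity: the inner product of two L2-normalized vectors equals the cosine similarity of the original vectors. A minimal pure-Python check follows (no FAISS involved; the helper names and the two example vectors are hypothetical).

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(v, v))
    return list(v) if norm == 0 else [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [3.0, 4.0, 0.0]
doc = [1.0, 2.0, 2.0]

ip_normalized = dot(l2_normalize(query), l2_normalize(doc))
print(ip_normalized, cosine(query, doc))
```

Both values agree up to floating-point error, which is why an inner-product index over normalized embeddings ranks documents the same way cosine similarity would.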