<a href="https://colab.research.google.com/github/scarlettyu2023/AI_agent_workshop/blob/main/Topic5RAG/manual_rag_pipeline_universal_scarlett.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Manual RAG Pipeline: Mechanisms First

This notebook builds a Retrieval-Augmented Generation (RAG) pipeline from scratch.
You'll see every step explicitly before we move to frameworks like LangChain.

**Works on:** Google Colab, Local Jupyter (Mac/Windows/Linux)

**Pipeline Overview:**
```
Documents → Chunking → Embedding → Index (FAISS)
                                        ↓
User Query → Embed Query → Similarity Search → Top-K Chunks
                                                    ↓
                                        Prompt Assembly → LLM → Answer
```
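
To make the diagram concrete before the real components appear, here is a minimal sketch in plain Python. It is a toy under stated assumptions: `embed` is a hypothetical bag-of-words counter standing in for a sentence-embedding model, a linear scan stands in for the FAISS index, and none of these helper names are taken from the notebook's later cells.

```python
import math
import re
from collections import Counter

def chunk(text, size=40):
    """Split text into fixed-size character chunks (no overlap)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, chunks, k=2):
    """Score every chunk against the query and keep the k best."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

def build_prompt(query, retrieved):
    """Assemble the retrieved chunks and the question into one prompt."""
    context = "\n".join(c for _, c in retrieved)
    return f"CONTEXT:\n{context}\n\nQUESTION: {query}\n\nANSWER:"

# Illustrative text only; not taken from the course corpora.
manual = "Adjust the carburetor with the dash rod. Check the oil level with the petcocks."
chunks = chunk(manual, size=41)
print(build_prompt("carburetor adjustment", top_k("carburetor adjustment", chunks, k=1)))
```

The real pipeline replaces `embed` with a neural embedding model and the linear scan with an index lookup, but the data flow is the same.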

## TODO: Topic 5 RAG Course Project Checklist

- **Exercise 0:** Set-up: Get notebook running; unzip Corpora.zip. Use PDFs from `Corpora/<corpus>/pdf_embedded/`.
- **Exercise 1:** Open model RAG vs no RAG: Compare Qwen 2.5 1.5B with/without RAG on Model T manual and Congressional Record.
- **Exercise 2:** Open model + RAG vs large model: Run GPT-4o Mini with no tools on same queries.
- **Exercise 3:** Open model + RAG vs frontier chat: Compare local Qwen+RAG vs GPT-4/Claude (web).
- **Exercise 4:** Effect of top-K: Test k = 1, 3, 5, 10, 20.
- **Exercise 5:** Unanswerable questions: Off-topic, related-but-missing, false premise.
- **Exercise 6:** Query phrasing sensitivity: Same question in 5+ phrasings.
- **Exercise 7:** Chunk overlap: Re-chunk with overlap 0, 64, 128, 256.
- **Exercise 8:** Chunk size: Chunk at 128, 256, 512, 1024, 2048.
- **Exercise 9:** Retrieval score analysis: 10 queries, top-10 chunks, score distribution.
- **Exercise 10:** Prompt template variations: Minimal, strict grounding, citation, permissive, structured.
- **Exercise 11:** Failure mode catalog: Computation, temporal, comparison, ambiguous, multi-hop, etc.
- **Exercise 12:** Cross-document synthesis: Questions needing multiple chunks.

## Setup

First, let's install the required packages and detect our compute environment.

In [None]:
# Install dependencies
# On Colab, these install quickly. Locally, you may already have them.
# Use a kernel-aware install when available; fall back to subprocess otherwise.
# 'openai' is only needed for the GPT-4o Mini comparisons (Exercises 2 and 3).
PACKAGES = ['torch', 'transformers', 'sentence-transformers', 'faiss-cpu',
            'pymupdf', 'accelerate', 'ipyfilechooser', 'openai']
try:
    ip = get_ipython()
    ip.run_line_magic('pip', 'install -q ' + ' '.join(PACKAGES))
except NameError:
    # Not running inside IPython/Jupyter: install with the current interpreter.
    import subprocess, sys
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', *PACKAGES])


## Load Your Documents

**Cell 1:** Configure your document source and select/upload files
- **Local Jupyter**: Use the folder picker, then run Cell 2
- **Colab + Upload**: Files upload immediately (blocking), then run Cell 2
- **Colab + Drive**: Set `USE_GOOGLE_DRIVE = True`, mounts Drive and shows picker, then run Cell 2

**Cell 2:** Confirms selection and lists documents

## Exercise 3: Open Model + RAG vs. State-of-the-Art Chat Model
Compare your local RAG pipeline against a frontier model (GPT-5.2, Claude 4.6, etc.).

Setup:

- Local: Qwen 2.5 1.5B with RAG using the Model T manual
- Cloud: GPT-4 or Claude via their web interface (no file upload)

Queries to try:

- All the ones from Exercise 1.

Document:

- Where does the frontier model's general knowledge succeed?
- When did the frontier model appear to be using live web search to help answer your questions?
- Where does your RAG system provide more accurate, specific answers?
- What does this tell you about when RAG adds value vs. when a powerful model suffices?

In [None]:
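import os
import pandas as pd
from openai import OpenAI

# ------------------------------------------
# 1. OpenAI Client (Cloud Model)
# ------------------------------------------

# On Colab, read the key from the Secrets panel; elsewhere, fall back to the
# OPENAI_API_KEY environment variable so the cell also runs in local Jupyter.
try:
    from google.colab import userdata
    api_key = userdata.get('OPENAI_API_KEY')
except ImportError:
    api_key = os.environ.get('OPENAI_API_KEY')

client = OpenAI(api_key=api_key)

def ask_gpt4o(question: str) -> str:
    """Send one question to the cloud model with no tools and no retrieved context."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # or "gpt-4o"
        messages=[
            {"role": "user", "content": question}
        ],
        temperature=0
    )
    return response.choices[0].message.content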


# ------------------------------------------
# 2. Questions (same as Exercise 1)
# ------------------------------------------

questions = [
    "How do I adjust the carburetor on a Model T?",
    "What is the correct spark plug gap for a Model T Ford?",
    "How do I fix a slipping transmission band?",
    "What oil should I use in a Model T engine?",
    "What did Mr. Flood have to say about Mayor David Black in Congress on January 13, 2026?",
    "What mistake did Elise Stefanovic make in Congress on January 23, 2026?",
    "What is the purpose of the Main Street Parity Act?",
    "Who in Congress has spoken for and against funding of pregnancy centers?"
]


# ------------------------------------------
# 3. Run Comparison
# ------------------------------------------

results = []

for i, q in enumerate(questions, 1):

    print(f"\nQ{i}: {q}")
    print("-" * 60)

    # ----- Local RAG -----
    local_rag_answer = rag_query(q, top_k=5, show_context=False)
    print("\n[Local Qwen + RAG]")
    print(local_rag_answer)

    # ----- Frontier Model -----
    frontier_answer = ask_gpt4o(q)
    print("\n[GPT-4o Mini]")
    print(frontier_answer)

    results.append({
        "question": q,
        "local_rag_answer": local_rag_answer,
        "gpt4o_answer": frontier_answer
    })


# ------------------------------------------
# 4. Save Results
# ------------------------------------------

df = pd.DataFrame(results)
df.to_csv("exercise3_comparison.csv", index=False)

print("\nSaved: exercise3_comparison.csv")


# Exercise 3 — Local RAG (Qwen 2.5 1.5B) vs Frontier Chat Model (GPT-4o Mini)

## Setup

- **Local system:** Qwen 2.5 1.5B + RAG over the **Model T manual** index (`top_k=5`)
- **Frontier model:** GPT-4o Mini (no file upload, no tools; general knowledge only)
- **Queries:** All 8 questions from Exercise 1 (Model T + Congressional Record)

> Note: For Q5–Q8 (Congressional Record questions), the local RAG system was still indexed on the **Model T manual**, so it correctly lacked relevant documents. This comparison is still useful for testing “refusal vs hallucination,” but it is **not** a fair retrieval test for CR unless you rebuild the index on the CR corpus.

---

## Results Summary (By Question)

### Model T questions (Q1–Q4)

**Q1. Carburetor adjustment**
- **Local Qwen+RAG:** Mentions the *dash adjustment* idea (turn left for cold start, right in warm weather). This is aligned with retrieved manual snippets, but it stays high-level and avoids mechanical detail.
- **GPT-4o Mini:** Gives a modern “generic carburetor tuning procedure” (mixture screw turns, idle RPM, etc.) that is plausible but **not clearly grounded** in the Model T manual you retrieved.

**Q2. Spark plug gap**
- **Local Qwen+RAG:** Incorrectly interprets the manual’s “about the thickness of a smooth dime” line and outputs a nonsensical value (e.g., “7/4 inches / 1.75 inches”). This is a *reading/interpretation error* triggered by OCR noise (“74”).
- **GPT-4o Mini:** Answers “0.025 inches,” a common modern-ish value, but **not supported** by the manual excerpt you retrieved. Likely **general-knowledge guess**.

**Q3. Slipping transmission band**
- **Local Qwen+RAG:** Closely follows the manual: loosen lock nut, turn adjusting screw, avoid dragging, remove cover for brake/reverse bands; includes the “don’t drop tools” warning. **Most grounded + specific.**
- **GPT-4o Mini:** Gives a generic automatic-transmission style troubleshooting checklist (fluid level, torque converter, etc.), which is **not Model T–specific** and likely partially irrelevant.

**Q4. Engine oil**
- **Local Qwen+RAG:** The retrieved context talks about *oil level management via petcocks*, not oil type. The answer drifts into brand suggestions (Mobil 1, etc.) which are **not in context** → this is **hallucination / ungrounded addition**.
- **GPT-4o Mini:** Recommends “non-detergent SAE 30,” which is plausible vintage-car advice but **not grounded** in your retrieved manual snippet.

**Takeaway for Q1–Q4:**  
Local RAG helps most when the manual contains direct procedural steps (Q3). However, if retrieval returns OCR-corrupted text (Q2) or doesn’t contain the requested attribute (Q4 oil “type”), RAG can still produce wrong or made-up details unless the prompt strictly enforces “say you don’t know.”

---

### Congressional Record questions (Q5–Q8)

Since local RAG was indexed on the **Model T manual**, it predictably retrieves irrelevant text.

**Q5–Q6 (Jan 2026 specific events)**
- **Local Qwen+RAG:** Correctly says the corpus/context does not contain this information (good refusal behavior).
- **GPT-4o Mini:** Also refuses due to training cutoff (“I don’t have events after Oct 2023”). This is also correct.

**Q7–Q8 (Main Street Parity Act; pregnancy centers)**
- **Local Qwen+RAG:** Correctly refuses (no relevant info in Model T manual).
- **GPT-4o Mini:** Produces confident policy-style answers with named politicians.
  - These are **very likely hallucinations or at least unverified** for the specific Congressional Record issues/dates you intended, because the model has no access to your CR corpus and no tool/search.

**Takeaway for Q5–Q8:**  
Both systems refuse correctly for “post-cutoff event lookup” questions (Q5–Q6). For broad political questions (Q7–Q8), GPT-4o Mini is more likely to produce *plausible-sounding but ungrounded* content, while your RAG system (indexed on unrelated corpus) properly refuses.

---

## Where the Frontier Model’s General Knowledge Succeeds

- Gives fluent, well-structured “how-to” style answers (Q1, Q3, Q4) that read like standard automotive advice.
- Correctly refuses post-Oct-2023 event questions (Q5–Q6).

However, fluency ≠ correctness: for Model T specifics, it often gives *generic modern car* procedures.

---

## Did the Frontier Model Appear to Use Live Web Search?

No clear evidence.

- It explicitly claims a training cutoff and refuses Q5–Q6 rather than searching.
- For Q7–Q8 it answers confidently but without citing sources or showing signs of retrieval.
- With live web search you would expect citations, links, up-to-date references, or an explicit statement that a search was performed; none appeared. The API call also passed no tools, so search was not available to the model.

---

## Where Local RAG Provides More Accurate / Specific Answers

- **Q3 (transmission band)** is the best example: local RAG mirrors the manual’s exact steps and warnings.
- When the manual contains direct instructions, RAG can beat a general model by anchoring details.

---

## What This Suggests About When RAG Adds Value

RAG adds the most value when:
- The question requires **corpus-specific details** (exact procedure, exact wording, niche domain facts).
- The corpus is **properly indexed** for the question domain (e.g., CR questions require CR corpus index).
- The prompt enforces **strict grounding** (refuse when missing).

A powerful model can suffice when:
- The question is **generic** and not dependent on a particular document.
- You mainly want a plausible overview rather than citation-level correctness.

---

## Key Failure Modes Observed

- **OCR-induced misread** → wrong numeric interpretation (Local RAG Q2).
- **“Answer even when context doesn’t contain it”** → ungrounded additions (Local RAG Q4; GPT-4o Q7–Q8).
- **Wrong corpus indexed** → correct refusal but cannot answer (Local RAG Q5–Q8).

---

## Next Step to Make This a Fair Exercise-3 Comparison

To compare the two systems fairly on the Exercise 1 queries:
1. Rebuild your RAG index on the **Congressional Record corpus**.
2. Re-run Q5–Q8 with the CR index.
3. Compare:
   - Local Qwen+RAG (grounded in CR text)
   - GPT-4o Mini (general knowledge, no corpus)

That will directly test whether frontier general knowledge can compete with small+RAG on *document-grounded tasks*.


## Exercise 4: Effect of Top-K Retrieval Count
Vary the number of chunks retrieved and observe how it affects answer quality. You can use any of the corpora and your own queries.

Test with: k = 1, 3, 5, 10, 20

For each k value:

- Run the same 3-5 queries
- Note answer quality, completeness, and accuracy
- Note response latency

Questions to explore:

- At what point does adding more context stop helping?
- When does too much context hurt (irrelevant information, confusion)?
- How does k interact with chunk size?

In [None]:
import time
import pandas as pd

queries = [
    "How do I adjust the carburetor on a Model T?",
    "What is the correct spark plug gap for a Model T Ford?",
    "How do I fix a slipping transmission band?",
    "What oil should I use in a Model T engine?",
]

ks = [1, 3, 5, 10, 20]
MAX_CHARS = 600
records = []

if index.ntotal > 0:
    for q in queries:
        print("\n" + "#"*80)
        print("QUESTION:", q)

        for k in ks:
            print("\n" + "-"*60)
            print(f"TOP_K = {k}")

            t0 = time.perf_counter()
            ans = rag_query(q, top_k=k, show_context=False)
            dt = time.perf_counter() - t0

            preview = ans[:MAX_CHARS] + ("..." if len(ans) > MAX_CHARS else "")
            print(preview)
            print(f"[latency] {dt:.3f} sec")

            records.append({
                "question": q,
                "top_k": k,
                "latency_sec": dt,
                "answer": ans
            })

    df = pd.DataFrame(records)
    display(df)
    df.to_csv("exercise4_topk_results.csv", index=False)
    print("\nSaved: exercise4_topk_results.csv")

else:
    print("Please complete the pipeline setup first (index is empty).")


# Exercise 4 — Effect of Top-K Retrieval Count

## Goal

This experiment evaluates how the number of retrieved chunks (**Top-K**) affects:

- Answer quality  
- Completeness and accuracy  
- Response latency  

The same queries were tested using:

**k = 1, 3, 5, 10, 20**

Queries:

1. How do I adjust the carburetor on a Model T?  
2. What is the correct spark plug gap for a Model T Ford?  
3. How do I fix a slipping transmission band?  
4. What oil should I use in a Model T engine?  

---

## Results

### Answer Quality

**k = 1**
- Often insufficient context
- Missing key facts (e.g., spark plug gap unknown)
- Some vague or incomplete answers
- Weak grounding in the manual

**k = 3**
- Large improvement in grounding
- Correct factual retrieval begins to appear
- Answers clearer and more complete

**k = 5**
- Best balance of accuracy and completeness
- Most answers grounded in retrieved manual text
- Minimal noise

**k = 10**
- Slight improvement in completeness
- Some answers become longer but not significantly more accurate

**k = 20**
- No consistent improvement
- Sometimes introduces irrelevant or noisy context
- Occasionally reduces answer clarity

---

### Latency

- Latency generally increases as **k increases**
- Larger context → longer generation time
- Some queries show significant slowdown at large k (e.g., k=10)
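
The latency trend can be checked against the `records` list (or the saved CSV) by averaging per k. A minimal sketch, assuming each record carries the `top_k` and `latency_sec` keys written by the Exercise 4 cell; `mean_latency_by_k` is a hypothetical helper, and the sample values are placeholders, not measurements from this run.

```python
from collections import defaultdict
from statistics import mean

def mean_latency_by_k(records):
    """Group latency measurements by top_k and return the mean for each k."""
    by_k = defaultdict(list)
    for r in records:
        by_k[r["top_k"]].append(r["latency_sec"])
    return {k: mean(v) for k, v in sorted(by_k.items())}

# Placeholder values for illustration only.
sample = [
    {"top_k": 1, "latency_sec": 2.0},
    {"top_k": 1, "latency_sec": 4.0},
    {"top_k": 5, "latency_sec": 6.0},
]
print(mean_latency_by_k(sample))  # {1: 3.0, 5: 6.0}
```

Averaging over the queries for each k makes it easier to tell a general trend from a single slow generation.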

---

## Analysis

### At what point does adding more context stop helping?

From the experiment:

- Major improvement from **k=1 → k=3**
- Small improvement from **k=3 → k=5**
- Little to no improvement beyond **k≈5–10**

**Conclusion:**  
Adding more context stops helping around **k ≈ 5–10**.

---

### When does too much context hurt?

At large k values (especially **k=20**):

- Retrieval includes more irrelevant chunks
- Noise increases
- Answers may become less precise
- Model may synthesize mixed or unclear instructions

**Conclusion:**  
Too much context hurts when irrelevant information outweighs useful content, typically once k reaches the **10–20** range.

---

### How does k interact with chunk size?

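Chunk size was held fixed in this experiment, so the following is an expectation to verify in Exercise 8 rather than a measured result: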

- **Small chunks + large k**
  - Many fragments retrieved
  - Higher noise
  - Lower precision

- **Larger chunks + small k**
  - More complete context per chunk
  - Better grounding

**Insight:**  
Top-K and chunk size trade off between **recall** and **precision**.  
Best performance usually occurs with **moderate chunk size and k ≈ 3–5**.
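
One way to reason about the interaction is to estimate how much text reaches the prompt, since the retrieved context grows roughly as k times the chunk size. The sketch below tabulates that product for the k values tested here and the chunk sizes listed for Exercise 8. Treating the sizes as characters and using an 8,000-character budget are illustrative assumptions, not properties of the notebook's model.

```python
def context_chars(k, chunk_size):
    """Approximate size of the retrieved context (ignores overlap and prompt boilerplate)."""
    return k * chunk_size

def combos_within_budget(ks, chunk_sizes, budget):
    """Return the (k, chunk_size) pairs whose estimated context fits the budget."""
    return [(k, s) for k in ks for s in chunk_sizes if context_chars(k, s) <= budget]

KS = [1, 3, 5, 10, 20]
CHUNK_SIZES = [128, 256, 512, 1024, 2048]
BUDGET = 8000  # hypothetical context budget, for illustration only

for k in KS:
    row = [context_chars(k, s) for s in CHUNK_SIZES]
    print(f"k={k:>2}: {row}")

print(len(combos_within_budget(KS, CHUNK_SIZES, BUDGET)), "combinations fit the budget")
```

Under these assumptions, large k only fits with small chunks, which is the regime where fragmented, noisy context is most likely.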

---

## Conclusion

This experiment shows:

- Best answer quality occurs around **k = 3–5**
- Increasing k beyond **≈10 gives little benefit**
- Large k increases latency and may reduce accuracy
- More retrieval is **not always better**

Careful tuning of Top-K is important for optimal RAG performance.


## Exercise 5: Handling Unanswerable Questions
Test how well your system handles questions that cannot be answered from the corpus. You can use any of the corpora and your own queries.

Types of unanswerable questions:

- Completely off-topic: "What is the capital of France?"
- Related but not in corpus: "What's the horsepower of a 1925 Model T?" (if not in your manual)
- False premises: "Why does the manual recommend synthetic oil?" (when it doesn't)

Document:

- Does the model admit it doesn't know?
- Does it hallucinate plausible-sounding but wrong answers?
- Does retrieved context help or hurt? (Does irrelevant context encourage hallucination?)

Experiment: Modify your prompt template to add "If the context doesn't contain the answer, say 'I cannot answer this from the available documents.'" Does this help?

In [None]:
# ============================================================
# Exercise 5 — Handling Unanswerable Questions (CODE)
# ============================================================
# Assumes you already ran the earlier notebook cells so these exist:
# - index, embed_model, retrieve(), rag_query(), generate_response()
# - PROMPT_TEMPLATE (used as the baseline template below)
# - rag_query() must accept a prompt_template keyword argument
#
# This script:
# 1) Tests 3 unanswerable-question types
# 2) Runs BOTH:
#    - baseline RAG prompt (your current PROMPT_TEMPLATE)
#    - strict refusal prompt (adds required refusal phrase)
# 3) Optionally also runs NO-RAG (direct_query) for comparison
# 4) Prints retrieved context for transparency
# 5) Records latency + answers to CSV

import time
import pandas as pd

# ----------------------------
# 1) Unanswerable questions
# ----------------------------
UNANSWERABLE = [
    {
        "type": "off_topic",
        "question": "What is the capital of France?"
    },
    {
        "type": "related_missing",
        "question": "What’s the horsepower of a 1925 Model T?"
    },
    {
        "type": "false_premise",
        "question": "Why does the manual recommend synthetic oil?"
    },
]

# ----------------------------
# 2) Prompt templates
# ----------------------------
BASELINE_TEMPLATE = PROMPT_TEMPLATE  # use your existing one

STRICT_REFUSAL_TEMPLATE = """You are a helpful assistant that answers questions based on the provided context.

CONTEXT:
{context}

QUESTION: {question}

INSTRUCTIONS:
- Answer the question based ONLY on the information in the context above.
- If the context does not contain the answer, say EXACTLY:
  "I cannot answer this from the available documents."
- Do NOT use outside knowledge.
- Do NOT guess.
- Be concise.

ANSWER:"""

# ----------------------------
# 3) Helper wrappers
# ----------------------------
def direct_query(question: str, max_new_tokens: int = 512) -> str:
    """Ask the local LLM directly (no RAG)"""
    prompt = f"""Answer this question:

{question}

Answer:"""
    return generate_response(prompt, max_new_tokens=max_new_tokens)

def rag_query_with_template(question: str, top_k: int, prompt_template: str, show_context: bool = True) -> str:
    """RAG query but with a provided prompt template."""
    return rag_query(
        question,
        top_k=top_k,
        show_context=show_context,
        prompt_template=prompt_template
    )

# ----------------------------
# 4) Run experiment
# ----------------------------
TOP_K = 5
RUN_NO_RAG = True     # set False if you only want RAG
SHOW_CONTEXT = True   # print retrieved context (recommended)

records = []

assert index.ntotal > 0, "FAISS index is empty. Build your index first."

for item in UNANSWERABLE:
    qtype = item["type"]
    question = item["question"]

    print("\n" + "=" * 80)
    print(f"[{qtype.upper()}] Question: {question}")
    print("=" * 80)

    # ---- NO-RAG (optional) ----
    if RUN_NO_RAG:
        t0 = time.time()
        no_rag_ans = direct_query(question)
        dt = time.time() - t0
        print("\n[NO RAG] Answer:")
        print(no_rag_ans)
        print(f"[NO RAG latency] {dt:.3f} sec")

        records.append({
            "question_type": qtype,
            "question": question,
            "mode": "no_rag",
            "top_k": None,
            "latency_sec": dt,
            "answer": no_rag_ans
        })

    # ---- RAG baseline prompt ----
    t0 = time.time()
    baseline_ans = rag_query_with_template(
        question,
        top_k=TOP_K,
        prompt_template=BASELINE_TEMPLATE,
        show_context=SHOW_CONTEXT
    )
    dt = time.time() - t0
    print("\n[RAG baseline prompt] Answer:")
    print(baseline_ans)
    print(f"[RAG baseline latency] {dt:.3f} sec")

    records.append({
        "question_type": qtype,
        "question": question,
        "mode": "rag_baseline",
        "top_k": TOP_K,
        "latency_sec": dt,
        "answer": baseline_ans
    })

    # ---- RAG strict refusal prompt ----
    t0 = time.time()
    strict_ans = rag_query_with_template(
        question,
        top_k=TOP_K,
        prompt_template=STRICT_REFUSAL_TEMPLATE,
        show_context=SHOW_CONTEXT
    )
    dt = time.time() - t0
    print("\n[RAG strict refusal prompt] Answer:")
    print(strict_ans)
    print(f"[RAG strict latency] {dt:.3f} sec")

    records.append({
        "question_type": qtype,
        "question": question,
        "mode": "rag_strict_refusal",
        "top_k": TOP_K,
        "latency_sec": dt,
        "answer": strict_ans
    })

# ----------------------------
# 5) Save results
# ----------------------------
df = pd.DataFrame(records)
display(df)

out_csv = "exercise5_unanswerable_results.csv"
df.to_csv(out_csv, index=False)
print(f"\nSaved: {out_csv}")


# Exercise 5 — Handling Unanswerable Questions

## Goal
Evaluate how the system handles questions that **cannot be answered from the corpus**, across three categories:

1. Off-topic (completely unrelated)
2. Related but missing (domain-related but not present in corpus)
3. False premise (question assumes something the corpus does not say)

We compare:

- **No RAG** (direct model answer)
- **RAG (baseline prompt)**
- **RAG (strict refusal prompt)**: adds  
  > "If the context does not contain the answer, say exactly:  
  > `I cannot answer this from the available documents.`"

All RAG runs used **top_k = 5** and printed retrieved context.

---

## Test Cases and Results

### 1) Off-topic question  
**Question:** *What is the capital of France?*

- **No RAG:** answered **“Paris”** (correct but unrelated to corpus; uses general knowledge).
- **RAG baseline:** produced a **partly-hallucinated justification**. It still answered “Paris,” but not based on retrieved documents.
- **RAG strict refusal:**  
  `Paris I cannot answer this from the available documents.`  
  This is almost a correct refusal, but it leaked an external answer before refusing.

**Observation:**  
Baseline RAG may still output correct facts for the wrong reason. The strict refusal prompt reduces hallucination, but it must also forbid any extra tokens before the refusal.

---

### 2) Related but missing  
**Question:** *What’s the horsepower of a 1925 Model T?*

- **No RAG:** answered **~20 horsepower** (likely historically correct, but not grounded in corpus).
- **RAG baseline:** correctly stated that the corpus does not contain horsepower → grounded behavior.
- **RAG strict refusal:**  
  `I cannot answer this from the available documents.`

**Observation:**  
RAG successfully prevents guessing when the corpus lacks information.

---

### 3) False premise  
**Question:** *Why does the manual recommend synthetic oil?*

- **No RAG:** hallucinated a confident explanation about synthetic oil benefits (not in corpus).
- **RAG baseline:** also hallucinated, misinterpreting unrelated lubrication text as evidence.
- **RAG strict refusal:**  
  `I cannot answer this from the available documents.`

**Observation:**  
False-premise questions are dangerous. Weakly related retrieved context can *encourage hallucination*. Strict refusal significantly improves reliability.

---

## Does the model admit it doesn’t know?

- **No RAG:** rarely admits uncertainty; tends to answer confidently.
- **RAG baseline:** mixed — sometimes admits missing knowledge, sometimes hallucinates.
- **RAG strict refusal:** usually admits inability to answer, though minor leakage occurred once.

---

## Does it hallucinate plausible-sounding but wrong answers?

Yes:

- **No RAG:** hallucinated for the false-premise question.
- **RAG baseline:** also hallucinated when irrelevant context was present.

---

## Does retrieved context help or hurt?

- **Helps** when:
  - The question is answerable.
  - The corpus clearly lacks the information (encourages refusal).

- **Hurts** when:
  - Context is loosely related.
  - The question contains a false premise.
  - The model uses irrelevant context to justify invented claims.

---

## Effect of Strict Refusal Prompt

Adding:

> "If the context does not contain the answer, say exactly:  
> `I cannot answer this from the available documents.`"

**Improved behavior significantly:**

- Prevented hallucination for the false-premise case.
- Produced correct refusal for the related-missing case.
- Minor leakage occurred in off-topic case (“Paris” before refusal).

**Recommendation:**  
Strengthen the prompt further:

- Require that the output be **only** the refusal sentence.
- Disallow any answer tokens before the refusal.
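
Prompt changes alone may not stop a small model from emitting stray tokens, so a post-processing step can act as a backstop. The sketch below is one possible approach, not something the notebook implements: `enforce_refusal` and `is_refusal` are hypothetical helpers that collapse any answer containing the refusal sentence down to that sentence alone.

```python
REFUSAL = "I cannot answer this from the available documents."

def enforce_refusal(answer: str, refusal: str = REFUSAL) -> str:
    """If the refusal sentence appears anywhere in the answer, return only that sentence."""
    if refusal.lower() in answer.lower():
        return refusal
    return answer.strip()

def is_refusal(answer: str, refusal: str = REFUSAL) -> bool:
    """True when the answer reduces to the refusal sentence."""
    return enforce_refusal(answer, refusal) == refusal

leaked = "Paris I cannot answer this from the available documents."
print(enforce_refusal(leaked))
print(is_refusal("Turn the adjusting screw to the right."))
```

The trade-off is that a partially grounded answer followed by the refusal sentence is discarded as well, so this suits evaluations where a clean refuse-or-answer decision matters more than partial credit.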

---

## Latency Observations

- Baseline RAG sometimes had high latency (e.g., ~106s).
- Strict refusal RAG was consistently faster (~10–11s) due to short output.

---

## Conclusion

- **No RAG:** prone to confident hallucination.
- **Baseline RAG:** improves grounding but can still hallucinate when context is weak or misleading.
- **Strict refusal prompt:** greatly improves safety and grounding, though prompt constraints should be tightened to prevent answer leakage.


## Exercise 6: Query Phrasing Sensitivity
Test how different phrasings of the same question affect retrieval. You can use any of the corpora and your own queries or the ones below for the Model T or Learjet corpus.

Choose one underlying question and phrase it 5+ different ways:

- Formal: "What is the recommended maintenance schedule for the engine?"
- Casual: "How often should I service the engine?"
- Keywords only: "engine maintenance intervals"
- Question form: "When do I need to check the engine?"
- Indirect: "Preventive maintenance requirements"

For each phrasing:

- Record the top 5 retrieved chunks
- Note similarity scores
- Compare overlap between result sets

Questions to explore:

- Which phrasings retrieve the best chunks?
- Do keyword-style queries work better or worse than natural questions?
- What does this tell you about potential query rewriting strategies?



In [None]:
# ============================================================
# Exercise 6 — Query Phrasing Sensitivity
# ============================================================
# Assumes you already ran the notebook up through:
# - documents loaded
# - all_chunks built
# - embed_model loaded
# - FAISS index built
# - retrieve(query, top_k) is defined
#
# This cell:
# 1) defines 5+ phrasings of ONE underlying question
# 2) retrieves top-5 chunks for each phrasing
# 3) prints scores + chunk previews
# 4) computes overlap across result sets (by source_file + chunk_index)
# 5) saves results to CSV for your write-up

import pandas as pd
from itertools import combinations

TOP_K = 5
PREVIEW_CHARS = 260

# ----------------------------
# 1) Pick ONE underlying question + 5+ phrasings
# ----------------------------
underlying_question = "Engine maintenance schedule / intervals"

phrasings = [
    ("formal",        "What is the recommended maintenance schedule for the engine?"),
    ("casual",        "How often should I service the engine?"),
    ("keywords",      "engine maintenance intervals"),
    ("question_form", "When do I need to check the engine?"),
    ("indirect",      "Preventive maintenance requirements for the engine"),
    ("extra",         "How frequently do I need to do engine upkeep?"),
]

# ----------------------------
# 2) Run retrieval for each phrasing and record top-5
# ----------------------------
rows = []
results_by_phrase = {}  # phrase_label -> list of identifiers

for label, q in phrasings:
    retrieved = retrieve(q, top_k=TOP_K)  # returns [(Chunk, score), ...]
    print("\n" + "="*90)
    print(f"[{label.upper()}] {q}")
    print("="*90)

    ids = []
    for rank, (chunk, score) in enumerate(retrieved, start=1):
        chunk_id = f"{chunk.source_file}::chunk{chunk.chunk_index}"
        ids.append(chunk_id)

        preview = chunk.text[:PREVIEW_CHARS].replace("\n", " ")
        if len(chunk.text) > PREVIEW_CHARS:
            preview += "..."

        print(f"\n  ({rank}) score={score:.4f}  id={chunk_id}")
        print(f"      {preview}")

        rows.append({
            "underlying_question": underlying_question,
            "phrasing_label": label,
            "phrasing_text": q,
            "rank": rank,
            "score": score,
            "source_file": chunk.source_file,
            "chunk_index": chunk.chunk_index,
            "chunk_id": chunk_id,
            "chunk_preview": preview,
        })

    results_by_phrase[label] = ids

df = pd.DataFrame(rows)

# ----------------------------
# 3) Overlap analysis between phrasing result sets
# ----------------------------
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / max(1, len(a | b))

overlap_rows = []
labels = [lbl for (lbl, _) in phrasings]

print("\n" + "#"*90)
print("OVERLAP SUMMARY (Top-5 sets per phrasing)")
print("#"*90)

for a, b in combinations(labels, 2):
    A, B = results_by_phrase[a], results_by_phrase[b]
    inter = sorted(set(A) & set(B))
    overlap_rows.append({
        "phrasing_a": a,
        "phrasing_b": b,
        "overlap_count": len(inter),
        "jaccard": jaccard(A, B),
        "overlap_ids": "; ".join(inter),
    })

overlap_df = pd.DataFrame(overlap_rows).sort_values(
    by=["overlap_count", "jaccard"], ascending=False
)

print(overlap_df[["phrasing_a", "phrasing_b", "overlap_count", "jaccard"]].to_string(index=False))

# ----------------------------
# 4) Save CSVs for your write-up
# ----------------------------
df.to_csv("exercise6_phrasing_retrieval_top5.csv", index=False)
overlap_df.to_csv("exercise6_phrasing_overlap.csv", index=False)

print("\nSaved:")
print(" - exercise6_phrasing_retrieval_top5.csv")
print(" - exercise6_phrasing_overlap.csv")

# Optional: display tables in notebook
try:
    display(df.head(20))
    display(overlap_df.head(20))
except Exception:
    pass

# Exercise 6 — Query Phrasing Sensitivity

## Goal

This experiment evaluates how **different phrasings of the same question** affect retrieval behavior in a RAG system.

We test whether:

- Different wording retrieves different chunks  
- Some phrasings produce better semantic matches  
- Keyword-style queries behave differently from natural language  
- Query phrasing significantly impacts retrieval quality  

---

## Underlying Question

Engine maintenance schedule / intervals

---

## Query Variants Tested

1. **Formal** — "What is the recommended maintenance schedule for the engine?"
2. **Casual** — "How often should I service the engine?"
3. **Keywords** — "engine maintenance intervals"
4. **Question form** — "When do I need to check the engine?"
5. **Indirect** — "Preventive maintenance requirements for the engine"
6. **Extra** — "How frequently do I need to do engine upkeep?"

Top-5 chunks were retrieved for each phrasing.

---

## Retrieval Results Summary

### Similarity Score Observations

| Phrasing        | Typical Score Range |
|-----------------|--------------------|
| Question form   | **Highest (~0.49)** |
| Indirect        | High (~0.44–0.45)   |
| Formal          | Medium (~0.38–0.40) |
| Casual          | Medium (~0.37–0.39) |
| Keywords        | Slightly lower (~0.35–0.37) |
| Extra           | Similar to formal/casual |

Interestingly, the **question-form query had the highest similarity scores**, but retrieved **completely different chunks** from the other phrasings.

---

## Overlap Between Retrieval Sets

Key findings from the pairwise overlap of top-5 sets (counts shown here; the Jaccard scores are in `exercise6_phrasing_overlap.csv`):

| Pair                     | Overlap |
|--------------------------|---------|
| Formal vs Casual         | **4 / 5 chunks** (high) |
| Formal vs Extra          | **4 / 5 chunks** |
| Casual vs Extra          | **4 / 5 chunks** |
| Formal vs Keywords       | Moderate |
| Keywords vs Indirect     | Moderate |
| **Question form vs Others** | **0 overlap** |

The **question-form phrasing retrieved a completely different semantic region** of the corpus.
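
To relate these overlap counts to the `jaccard` column: two top-5 sets that share 4 chunks have 6 distinct chunks in their union, so the Jaccard score is 4/6 ≈ 0.67 rather than 0.8. A minimal sketch of that arithmetic follows; the chunk IDs are made up for illustration and are not taken from the run.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / max(1, len(a | b))

# Hypothetical chunk IDs: two phrasings sharing 4 of their top-5 chunks.
formal = ["c10", "c11", "c12", "c13", "c14"]
casual = ["c10", "c11", "c12", "c13", "c99"]
# A phrasing with nothing in common, like the question-form query.
question_form = ["c50", "c51", "c52", "c53", "c54"]

score_high = jaccard(formal, casual)
score_zero = jaccard(formal, question_form)
print(f"4 of 5 shared -> jaccard = {score_high:.3f}")
print(f"0 of 5 shared -> jaccard = {score_zero:.3f}")
```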

---

## Analysis

### Which phrasings retrieved the best chunks?

- **Formal, Casual, and Extra phrasing** produced the most stable and consistent retrieval.
- These queries retrieved **nearly identical top-5 chunks**, suggesting semantic robustness.
- The retrieved chunks included relevant maintenance-related text.

---

### Did keyword-style queries work better or worse?

**Keyword queries performed slightly worse.**

Observations:

- Lower similarity scores  
- Slightly different chunk mix  
- Still semantically reasonable  

Keyword queries are less expressive, so embeddings may capture less context.

---

### Why did the Question-Form query behave differently?

The query:

> "When do I need to check the engine?"

shifted the semantic meaning toward:

- Engine operation  
- Ignition  
- Running behavior  

instead of **maintenance schedule**.

This shows that **small wording changes can redirect retrieval** to different semantic regions.

---

### What does this tell us about query rewriting?

Key insight:

**Query phrasing strongly affects retrieval.**

Implications:

- Natural-language questions were generally more stable than keywords in this run, with the question-form query as the one clear exception  
- Slight semantic shifts can drastically change retrieved context  
- Query rewriting / normalization could improve RAG robustness  
- Systems may benefit from:
  - Query expansion
  - Synonym rewriting
  - Hybrid keyword + semantic retrieval
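
One way to act on these implications is to retrieve with several rewrites of the same question and merge the ranked lists. The sketch below uses reciprocal rank fusion (RRF), a technique swapped in here for illustration that was not part of this experiment. The rankings and chunk IDs are hypothetical; in the notebook they would presumably come from calling `retrieve()` once per phrasing.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of chunk IDs.

    k=60 is a commonly used default, not a value tuned for this corpus.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

# Hypothetical top-3 results for three phrasings of one question.
rankings = [
    ["c10", "c11", "c12"],  # formal
    ["c11", "c10", "c13"],  # casual
    ["c50", "c51", "c10"],  # an outlier phrasing
]
fused = reciprocal_rank_fusion(rankings)
print(fused)
```

Chunks that several phrasings agree on rise to the top, while chunks found only by an outlier phrasing stay in the list but rank lower.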

---

## Conclusion

This experiment demonstrates that:

- Retrieval is **sensitive to phrasing**
- Semantically similar wording often retrieves similar chunks
- Keyword queries are slightly weaker than natural-language queries
- Certain phrasings can redirect retrieval to unrelated content
- Query rewriting could significantly improve RAG reliability

Overall, phrasing matters — even when the underlying question is identical.

## Exercise 7: Chunk Overlap Experiment
Test how overlap between chunks affects retrieval of information that spans chunk boundaries. You can use any of the corpora and your own queries. Note: this exercise takes a long time to run. Only try it on Colab or a similar platform with a T4 or better GPU.

Setup: Re-chunk your corpus with different overlap values while keeping chunk size constant (e.g., 512 characters):

- Overlap = 0 (no overlap)
- Overlap = 64
- Overlap = 128
- Overlap = 256

For each configuration:

1. Rebuild the index
2. Find a question whose answer spans what would be a chunk boundary
3. Test retrieval quality

Document:

- Does higher overlap improve retrieval of complete information?
- What's the cost? (Index size, redundant information in context)
- Is there a point of diminishing returns?
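
To make the setup concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is an assumption about how such a chunker could look, not the implementation behind `rebuild_pipeline()`, which may split differently (for example at sentence boundaries). The name `chunk_text` and the synthetic text are illustrative only.

```python
def chunk_text(text, chunk_size=512, chunk_overlap=0):
    """Split text into character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and < chunk_size")
    stride = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

text = "x" * 10_000  # stand-in for a corpus
counts = {ov: len(chunk_text(text, 512, ov)) for ov in (0, 64, 128, 256)}
for ov, n in counts.items():
    print(f"overlap={ov:>3}: {n} chunks")
```

Because the stride shrinks as overlap grows, the same text yields more chunks, which is the source of the index cost the exercise asks you to document.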

# ============================================================
# EXERCISE 7 — Chunk overlap experiment (RETRIEVAL ONLY)
# ============================================================

OVERLAPS = [0, 64, 128, 256]
CHUNK_SIZE_FIXED = 512

EX7_QUERIES = [
    # Model T
    "How do I adjust the carburetor on a Model T?",
    "What is the correct spark plug gap for a Model T Ford?",
    # Congressional Record
    "What mistake did Elise Stefanik make in Congress on January 23, 2026?",
]

TOP_K = 5

ex7_retrieval_results = {}

for ov in OVERLAPS:
    print("\n" + "="*90)
    print(f"Rebuilding pipeline: chunk_size={CHUNK_SIZE_FIXED}, chunk_overlap={ov}")
    rebuild_pipeline(chunk_size=CHUNK_SIZE_FIXED, chunk_overlap=ov)

    per_ov = []
    for q in EX7_QUERIES:
        hits = retrieve(q, top_k=TOP_K)  # returns [(Chunk, score), ...]
        per_ov.append(hits)

        print("\n" + "-"*90)
        print("Q:", q)
        for i, (chunk, score) in enumerate(hits, 1):
            snippet = chunk.text[:220].replace("\n", " ")
            print(f"[{i}] score={score:.4f} | {chunk.source_file} | chunk#{chunk.chunk_index}")
            print("    ", snippet, "...")
    ex7_retrieval_results[ov] = per_ov

print("\nDone. Results saved in ex7_retrieval_results (key = overlap).")

## Exercise 7 — Effect of Chunk Overlap on RAG Retrieval

### Setup
We fixed `chunk_size = 512` and varied `chunk_overlap ∈ {0, 64, 128, 256}`.  
For each overlap, we **re-chunked, re-embedded, and rebuilt the FAISS index** using the provided `rebuild_pipeline()` helper.  
We evaluated retrieval using the same set of queries and compared both **retrieval quality** and **index cost**.

---

### Index Size / Cost

| Overlap | #Chunks | Change |
|--------:|--------:|--------|
| 0       | 888     | baseline |
| 64      | 1051    | +18% |
| 128     | 1286    | +45% |
| 256     | 2016    | +127% |

As overlap increases, chunk duplication grows significantly, leading to:
- larger index size  
- more embeddings computed  
- longer rebuild time  

Thus, overlap introduces a clear **computational and storage cost**.
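
The percentages in the table can be reproduced from the chunk counts alone. A short sketch of that arithmetic, using only the numbers reported above:

```python
# Chunk counts from the table above, keyed by overlap.
chunk_counts = {0: 888, 64: 1051, 128: 1286, 256: 2016}
baseline = chunk_counts[0]

growth_pct = {
    ov: round(100 * (n - baseline) / baseline)
    for ov, n in chunk_counts.items()
}
for ov, pct in growth_pct.items():
    print(f"overlap={ov:>3}: {chunk_counts[ov]:>4} chunks ({pct:+d}% vs. overlap 0)")
```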

---

### Retrieval Quality

**1. Carburetor adjustment (procedural answer)**  
- All overlaps retrieve relevant chunks.
- Moderate overlap (64) slightly improves snippet completeness.
- Larger overlap (128/256) sometimes retrieves semantically related but less direct chunks (e.g., principle instead of procedure).

**Observation:** Increasing overlap does not always improve top-1 relevance. It can introduce more semantically similar but less useful candidates.

---

**2. Spark plug gap (numeric value near chunk boundary)**  
- With small overlap, the retrieved chunk may truncate the sentence containing the numeric value.
- With larger overlap (especially 256), the full sentence including the spark gap value is retrieved.

**Observation:** Overlap helps when critical information lies near chunk boundaries.
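
The boundary effect can be reproduced on synthetic text. The sketch below assumes a plain fixed-stride character chunker (the notebook's own chunker may differ) and a placeholder passage with `<VALUE>` standing in for the real figure. With fixed-stride windows, any span no longer than the overlap is guaranteed to appear intact in some chunk; longer spans survive only if they happen to fall well relative to a window start.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Fixed-size character windows with stride chunk_size - chunk_overlap."""
    stride = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), stride)]

# Placeholder passage (134 characters) whose key value sits at the very end.
passage = "STEP: " + "detail " * 15 + "set the gap to <VALUE>."
# Start it at offset 420 so it straddles the 512-character boundary.
text = "a" * 420 + passage + "b" * 1000

found = {}
for ov in (0, 64, 128, 256):
    chunks = sliding_chunks(text, 512, ov)
    found[ov] = any(passage in c for c in chunks)
    print(f"overlap={ov:>3}: passage intact in some chunk = {found[ov]}")
```

In this toy case the passage is cut at overlap 0 and 64 and recovered at 128 and 256, which mirrors the pattern observed for the spark plug query.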

---

**3. Out-of-corpus query (Stefanik question)**  
- No overlap setting retrieved a relevant chunk.
- Similarity scores stayed low, which suggests the answer is not in the indexed corpus.

**Observation:** Overlap cannot compensate for missing relevant content.
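
Low scores like these could be used as a signal to abstain instead of answering. The sketch below is one hypothetical way to do that: `has_confident_hit` and the `min_score=0.30` threshold are assumptions, not part of the notebook, and the threshold would need calibrating against real score distributions. It also assumes that higher scores mean closer matches and that hits are `(chunk, score)` pairs as returned by `retrieve()`.

```python
def has_confident_hit(hits, min_score=0.30):
    """Return True if the best retrieval score reaches min_score.

    hits: list of (chunk, score) pairs.
    min_score: placeholder threshold, not a calibrated value.
    """
    return bool(hits) and max(score for _, score in hits) >= min_score

# Hypothetical scores for an in-corpus and an out-of-corpus query.
in_corpus_hits = [("chunk A", 0.52), ("chunk B", 0.47)]
out_of_corpus_hits = [("chunk C", 0.18), ("chunk D", 0.15)]

for name, hits in [("in-corpus", in_corpus_hits),
                   ("out-of-corpus", out_of_corpus_hits)]:
    action = "answer" if has_confident_hit(hits) else "abstain"
    print(f"{name}: {action}")
```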

---

### Diminishing Returns

- Increasing overlap from **0 → 64** provides modest improvement with limited cost.
- Beyond **128**, index size grows rapidly while retrieval gains are inconsistent.
- Some queries improve, while others become noisier due to more overlapping candidates.

**Conclusion:** Chunk overlap improves boundary robustness but introduces higher cost and diminishing returns. A moderate overlap (around 64) provides the best tradeoff in this experiment.

# ============================================================
# EXERCISE 7 — Chunk overlap experiment (FULL RAG)
# ============================================================

import time

OVERLAPS = [0, 64, 128, 256]
CHUNK_SIZE_FIXED = 512
TOP_K = 5

EX7_QUERIES = [
    "How do I fix a slipping transmission band?",
    "What is the purpose of the Main Street Parity Act?",
]

ex7_rag_results = {}

for ov in OVERLAPS:
    print("\n" + "="*90)
    print(f"Rebuilding pipeline: chunk_size={CHUNK_SIZE_FIXED}, chunk_overlap={ov}")
    t0 = time.time()
    rebuild_pipeline(chunk_size=CHUNK_SIZE_FIXED, chunk_overlap=ov)  # re-chunk, re-embed, rebuild index
    rebuild_s = time.time() - t0
    print(f"Rebuild took {rebuild_s:.2f}s")

    per_ov = []
    for q in EX7_QUERIES:
        print("\n" + "-"*90)
        print("Q:", q)

        hits = retrieve(q, top_k=TOP_K)  # returns [(Chunk, score), ...]
        for i, (chunk, score) in enumerate(hits, 1):
            snippet = chunk.text[:200].replace("\n", " ")
            print(f"[retr {i}] score={score:.4f} | {chunk.source_file} | chunk#{chunk.chunk_index}")
            print("         ", snippet, "...")

        t1 = time.time()
        ans = rag_query(q, top_k=TOP_K, show_context=False)  # retrieval + generation
        gen_s = time.time() - t1

        print("\nAnswer:")
        print(ans)
        print(f"(generation took {gen_s:.2f}s)")

        per_ov.append({
            "query": q,
            "overlap": ov,
            "chunk_size": CHUNK_SIZE_FIXED,
            "top_k": TOP_K,
            "rebuild_seconds": rebuild_s,
            "generation_seconds": gen_s,
            "retrieved": hits,
            "answer": ans,
        })

    ex7_rag_results[ov] = per_ov

print("\nDone. Results saved in ex7_rag_results (key = overlap).")

## Exercise 7 — Chunk Overlap Experiment (RAG)

### Setup
- Fixed `chunk_size = 512`
- Varied `chunk_overlap ∈ {0, 64, 128, 256}`
- For each overlap: **re-chunk → re-embed → rebuild FAISS index**, then ran the same queries with `top_k=5`.

---

### Cost (Index Size + Rebuild Time)

| Overlap | #Chunks | Rebuild Time |
|--------:|--------:|-------------:|
| 0       | 888     | 14.89s |
| 64      | 1051    | 17.25s |
| 128     | 1286    | 20.52s |
| 256     | 2016    | 37.78s |

**Observation:** Larger overlap substantially increases index size and rebuild time: going from 0 to 256 more than doubles the chunk count (888 → 2016) and raises rebuild time about 2.5× (14.89s → 37.78s).

---

### Retrieval + Answer Quality

#### Query 1: “How do I fix a slipping transmission band?”
- **Overlap 0:** Top retrieval directly contains the procedural steps (“loosen lock nut… turn adjusting screw…”). Answer is concise and grounded.
- **Overlap 64/256:** Top retrieval often shifts toward nearby *caution* text (dropping tools into transmission case), which is related but less direct. The generated answer becomes longer/noisier and may include irrelevant repetition.
- **Overlap 128:** Top retrieval includes both caution + the actual “Bands adjusted? Answer…” section; answer is mostly grounded.

**Takeaway:** Overlap can change which nearby segment becomes top-1. More overlap does not guarantee a better top-1 chunk, and can increase “nearby but less useful” context.

#### Query 2: “What is the purpose of the Main Street Parity Act?”
- Across overlaps, retrieved chunks are unrelated (Model T manual content), indicating the act is **not in the indexed corpus**.
- The model sometimes correctly says “not in context” (overlap 0/128/256), but at overlap 64 it hallucinates a purpose.

**Takeaway:** Overlap cannot fix missing content. When the corpus lacks the answer, the model may still hallucinate depending on retrieved noise.

---

### Diminishing Returns
- **0 → 64 → 128:** gradual cost increase; retrieval quality mixed (sometimes better context, sometimes more distraction).
- **128 → 256:** large cost jump (chunks +57% from 128; rebuild time +84%), with no consistent improvement in answer quality.
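
The step-to-step figures above follow from the cost table. A short sketch of that calculation, using only the reported chunk counts and rebuild times (single-run timings, so the time percentages are approximate):

```python
overlaps = [0, 64, 128, 256]
chunks = [888, 1051, 1286, 2016]
rebuild_s = [14.89, 17.25, 20.52, 37.78]

def pct_change(old, new):
    """Percentage change from old to new."""
    return 100 * (new - old) / old

# One row per step: (previous overlap, overlap, chunk growth %, time growth %).
steps = []
for i in range(1, len(overlaps)):
    steps.append((
        overlaps[i - 1],
        overlaps[i],
        pct_change(chunks[i - 1], chunks[i]),
        pct_change(rebuild_s[i - 1], rebuild_s[i]),
    ))

for prev_ov, ov, d_chunks, d_time in steps:
    print(f"{prev_ov:>3} -> {ov:>3}: chunks {d_chunks:+.0f}%, rebuild time {d_time:+.0f}%")
```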

**Conclusion:** Chunk overlap improves boundary robustness in principle, but in this experiment it introduces substantial cost and can increase retrieval noise. A moderate overlap (≈64–128) is a more reasonable tradeoff than 256.