
# Step 3 — Evaluation Phase I (Top‑1 Evidence)  
**Reuses Step 2 functions via `%run`**

This notebook is designed to be saved anywhere and run **after** your Step 2 notebook.  
It reuses `retrieve(query, top_k=1)` and your vector store setup from Step 2, then evaluates multiple prompting strategies with **SQuAD F1/EM**.

**What this notebook does**
1. Configure the path to your **Step 2** notebook.
2. `%run` Step 2 to load the vector store and the `retrieve()` function.
3. Load the `test` split of the `question-answer` configuration from *rag-mini-wikipedia*.
4. Define three prompt strategies (Instruction / CoT / Persona).
5. Generate answers using a local Transformers model (or OpenAI if an API key is available).
6. Compute **exact_match** and **f1** using the Hugging Face `squad` metric.
7. Print the best strategy and its scores.

> If you need stronger decoupling, move the Step 2 core into a Python module (e.g., `rag_core.py`) and import it here.
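
As a rough sketch of what such a module could look like: the block below stands in for a hypothetical `rag_core.py` that exposes a `retrieve(query, top_k)` method. The `InMemoryStore` class, the bag-of-words `embed` function, and the cosine scoring are illustrative assumptions that replace the Step 2 embedding model and Milvus collection; only the `(id, text, score)` hit shape follows the Step 2 output.

```python
# rag_core.py -- hypothetical module; a toy stand-in for the Step 2 setup.
import math
import re
from collections import Counter
from typing import List, Tuple

Hit = Tuple[str, str, float]  # (chunk_id, chunk_text, similarity score)


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: Step 2 uses a real encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InMemoryStore:
    """Keeps chunks in a list; a real module would wrap the vector store instead."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, str, Counter]] = []

    def add(self, chunk_id: str, text: str) -> None:
        self._rows.append((chunk_id, text, embed(text)))

    def retrieve(self, query: str, top_k: int = 1) -> List[Hit]:
        q = embed(query)
        scored = [(cid, text, cosine(q, vec)) for cid, text, vec in self._rows]
        scored.sort(key=lambda hit: hit[2], reverse=True)  # highest score first
        return scored[:top_k]
```

With a module like this, the evaluation notebook could import the retrieval function directly instead of re-executing the whole Step 2 notebook.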



## 0) Configure the Step 2 notebook path

- Set `STEP2_NOTEBOOK` to the **absolute path** of your Step 2 notebook.  
- Example: `/path/to/Naive_RAG_Milvus_Step2.ipynb` (Linux/Mac) or `C:\\path\\to\\Naive_RAG_Milvus_Step2.ipynb` (Windows).
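
A small check like the following can catch a wrong path before `%run` is attempted. It is only a sketch: `check_notebook_path` is a hypothetical helper, not part of Step 2, and it assumes the notebook is a regular `.ipynb` file on the local (or mounted) file system.

```python
from pathlib import Path


def check_notebook_path(path_str: str) -> Path:
    """Return the path if it is an absolute, existing .ipynb file; raise otherwise."""
    path = Path(path_str).expanduser()
    if not path.is_absolute():
        raise ValueError(f"Expected an absolute path, got: {path_str!r}")
    if path.suffix != ".ipynb":
        raise ValueError(f"Expected a .ipynb file, got: {path.name!r}")
    if not path.is_file():
        raise FileNotFoundError(f"Step 2 notebook not found at: {path}")
    return path
```

On Colab, a path under Google Drive only exists after the drive has been mounted, so a check of this kind would fail until `drive.mount(...)` has run.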


In [None]:

# >>>> EDIT THIS TO MATCH YOUR FILE SYSTEM <<<<
STEP2_NOTEBOOK = "My Drive/Colab Notebooks/Naive_RAG_Milvus_Step2.ipynb"  # change to your actual path
print("Using Step 2 notebook at:", STEP2_NOTEBOOK)


Using Step 2 notebook at: My Drive/Colab Notebooks/Naive_RAG_Milvus_Step2.ipynb



## 1) Bring in Step 2 (vector store & retrieve) via `%run`

- This cell executes your Step 2 notebook so the current kernel has `retrieve()` and the vector store loaded.
- If Step 2 rebuilds the index on run, that's fine; otherwise it will reuse whatever is already persisted (e.g., Milvus Lite `milvus.db`).


In [None]:
import os

# Execute Step 2 so we can reuse retrieve() and the embedding/vector store setup
notebook_path = os.path.join('/content', STEP2_NOTEBOOK)

if not os.path.exists(notebook_path):
    raise FileNotFoundError(f"Step 2 notebook not found at: {notebook_path}")

%run "{notebook_path}"

# Sanity: check retrieve is available
assert 'retrieve' in globals(), "retrieve() not found; please confirm the Step 2 notebook path."
print("Step 2 executed. Functions available:", [n for n in ('retrieve','answer_with_context','model','col') if n in globals()])

FileNotFoundError: Step 2 notebook not found at: /content/My Drive/Colab Notebooks/Naive_RAG_Milvus_Step2.ipynb

In [None]:
%pip install pymilvus[milvus_lite]


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os

# >>>> EDIT THIS TO MATCH YOUR FILE SYSTEM <<<<
# Assuming your notebook is in the "Colab Notebooks" folder of "My Drive"
STEP2_NOTEBOOK = "/content/drive/My Drive/Colab Notebooks/Naive_RAG_Milvus_Step2.ipynb"  # change to your actual path
print("Using Step 2 notebook at:", STEP2_NOTEBOOK)

Using Step 2 notebook at: /content/drive/My Drive/Colab Notebooks/Naive_RAG_Milvus_Step2.ipynb


In [None]:
import os

# Execute Step 2 so we can reuse retrieve() and the embedding/vector store setup
notebook_path = STEP2_NOTEBOOK # Use the updated path directly

if not os.path.exists(notebook_path):
    raise FileNotFoundError(f"Step 2 notebook not found at: {notebook_path}")

%run "{notebook_path}"

# Sanity: check retrieve is available
assert 'retrieve' in globals(), "retrieve() not found; please confirm the Step 2 notebook path."
print("Step 2 executed. Functions available:", [n for n in ('retrieve','answer_with_context','model','col') if n in globals()])



OK: datasets
OK: sentence_transformers
OK: pymilvus
OK: numpy
OK: pandas
OK: openai



Dataset({
    features: ['passage', 'id'],
    num_rows: 3200
})
Dataset({
    features: ['question', 'answer', 'id'],
    num_rows: 918
})
Sample passage: Uruguay (official full name in  ; pron.  , Eastern Republic of  Uruguay) is a country located in the southeastern part of South America.  It is home to 3.3 million people, of which 1.7 million live in the capital Montevideo and its metropolitan area. ...
Total passages: 3200
Total chunks: 3299
Sample chunk: Uruguay (official full name in  ; pron.  , Eastern Republic of  Uruguay) is a country located in the southeastern part of South America.  It is home to 3.3 million people, of which 1.7 million live in ...



Embeddings shape: (3299, 384)
Connected to Milvus Lite: True
Created collection: rag_mini_wiki_chunks
Index created.
Installing pymilvus[milvus_lite]...
Installation complete.
Inserted rows: 3299
Collection num_entities: 3299
[('36-0', "Montevideo, Uruguay's capital.", 0.409418523311615), ('2279-0', 'French explorer Samuel de Champlain arrived in 1603 and established the first permanent European settlements at Port Royal in 1605 and Quebec City in 1608. These would become respectively the capitals of Acadia and Canada. Among French colonists of New France, Canadiens extensively settled the St. Lawrence River valley, Acadians settled the present-day Maritimes, while French fur traders and Catholic missionaries explored the Great Lakes, Hudson Bay and the Mississippi watershed to Louisiana. The French and Iroquois Wars broke out over control of the fur trade.', 0.4016871452331543), ('893-0', 'Map of Egypt, showing the 26 capitals of governorates, in addition to the self-governing city of


## 2) Install and import evaluation dependencies

We use:  
- `datasets` to load *rag-mini-wikipedia*  
- `evaluate` for the **SQuAD** metric  
- `transformers` for a local generation baseline (fallback to OpenAI if configured)


In [None]:

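import sys, subprocess, importlib.util

def ensure(pkg, pip_name=None):
    """Install `pip_name` with pip if module `pkg` cannot be found."""
    pip_name = pip_name or pkg
    # importlib.util.find_spec replaces the deprecated pkgutil.find_loader
    if importlib.util.find_spec(pkg) is None:
        print(f"Installing: {pip_name}")
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pip_name])
    else:
        print(f"OK: {pkg}")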

ensure("datasets", "datasets")
ensure("evaluate", "evaluate")
ensure("transformers", "transformers")

from datasets import load_dataset
import evaluate
import numpy as np




OK: datasets
Installing: evaluate
OK: transformers


## 3) Load the QA split and the SQuAD metric


In [None]:
ds = load_dataset("rag-datasets/rag-mini-wikipedia", "question-answer")
qa = ds["test"] # Use the correct split name 'test'
squad = evaluate.load("squad")

print("QA size:", len(qa))
print("Sample QA:", {k: qa[0][k] for k in qa[0].keys() if k in ("question","answer","answers")})


QA size: 918
Sample QA: {'question': 'Was Abraham Lincoln the sixteenth President of the United States?', 'answer': 'yes'}
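
To make the two scores concrete, here is a plain-Python sketch of per-example exact match and token-level F1. It follows the normalization used by the SQuAD v1.1 evaluation script as commonly described (lowercase, strip punctuation, drop the articles a/an/the, collapse whitespace); the function names are illustrative, and the Hugging Face `squad` metric additionally averages over all examples and reports percentages.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, remove punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    return normalize_answer(prediction) == normalize_answer(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# A short answer matches the gold "yes" exactly ...
short_em = exact_match("Yes.", "yes")
# ... while a longer sentence containing "yes" is penalized on precision.
long_f1 = token_f1("Yes, he was the sixteenth president.", "yes")
```

Because many gold answers in this dataset are as short as `yes`, verbose predictions lose exact match entirely and keep only a fraction of the F1.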



## 4) Define prompting strategies

We compare three simple, reproducible templates:
- **Instruction**
- **Chain-of-Thought (CoT)** (concise: asks to reason step by step)
- **Persona** (concise editor style)


In [None]:

def build_prompt_instruction(context, question):
    return (
        "Answer STRICTLY using the context. If insufficient, reply 'I don't know.'\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def build_prompt_cot(context, question):
    return (
        "You are a careful analyst. Use ONLY the context. "
        "If insufficient, say 'I don't know.' Carefully plan your reasoning step by step.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def build_prompt_persona(context, question):
    return (
        "You are a concise encyclopedia editor. Use ONLY the context. "
        "If insufficient, say 'I don't know.' Keep answers factual and brief.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

PROMPTS = {
    "instruction": build_prompt_instruction,
    "cot": build_prompt_cot,
    "persona": build_prompt_persona,
}
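
The three templates share the same context/question block and differ only in their opening instruction. The sketch below makes that explicit with a hypothetical `make_builder` factory (not part of the notebook) and prints what the model receives for one example; the sample context and question are illustrative.

```python
from typing import Callable

PromptBuilder = Callable[[str, str], str]


def make_builder(preamble: str) -> PromptBuilder:
    """Return a builder that prepends `preamble` to the shared context/question block."""
    def build(context: str, question: str) -> str:
        return f"{preamble}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
    return build


# Same opening line as the "instruction" strategy.
instruction = make_builder(
    "Answer STRICTLY using the context. If insufficient, reply 'I don't know.'"
)
example = instruction("Montevideo, Uruguay's capital.", "What is the capital of Uruguay?")
print(example)
```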



## 5) Generation backend (local Transformers by default; OpenAI optional)

- Default uses a small local model (`google/flan-t5-base`) to keep the notebook self-contained.
- If `OPENAI_API_KEY` is set in your environment, you can switch to an OpenAI chat model by uncommenting the Option B block (as written, it uses `gpt-3.5-turbo`).


In [None]:

# Option A: local Transformers (default)

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

try:
    _tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
    _mdl = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
    pipe = pipeline("text2text-generation", model=_mdl, tokenizer=_tok)

    def generate_answer(prompt, max_new_tokens=128, temperature=0.0):
        # Greedy decoding (do_sample=False) keeps runs reproducible.
        # `temperature` is accepted only for parity with Option B and is ignored here.
        out = pipe(prompt, max_new_tokens=max_new_tokens, do_sample=False)[0]["generated_text"]
        return out.strip()

    print("Using local Transformers: google/flan-t5-base")
except Exception as e:
    print("Local Transformers unavailable; switch to Option B below if you have an API key.", e)

# --- Option B: OpenAI (uncomment to use) ---
# Read the key from the environment; never paste it into the notebook.
# import os
# from openai import OpenAI
# client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# def generate_answer(prompt, max_new_tokens=128, temperature=0.2):
#     resp = client.chat.completions.create(
#         model="gpt-3.5-turbo",
#         messages=[{"role": "user", "content": prompt}],
#         temperature=temperature, max_tokens=max_new_tokens,
#     )
#     return resp.choices[0].message.content.strip()



Device set to use cuda:0


Using local Transformers: google/flan-t5-base




## 6) Enforce the Step 3 constraint: Top‑1 evidence only

We call `retrieve(query, top_k=1)` from Step 2 and pass the *single* retrieved chunk into the prompt.


In [None]:

def get_top1_context(query):
    hits = retrieve(query, top_k=1)  # <-- strict top-1 evidence
    return hits[0][1] if hits else ""
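
The helper relies on each hit being an `(id, text, score)` tuple, as in the Step 2 sample output. The sketch below exercises the same logic against a hypothetical `fake_retrieve` stub so the empty-result case can be checked without a vector store; the stub, its rounded score, and the `retrieve_fn` parameter are illustrative assumptions.

```python
from typing import Callable, List, Tuple

Hit = Tuple[str, str, float]  # (chunk_id, chunk_text, score)


def fake_retrieve(query: str, top_k: int = 1) -> List[Hit]:
    """Stub standing in for Step 2's retrieve(); returns nothing for blank queries."""
    corpus: List[Hit] = [
        ("36-0", "Montevideo, Uruguay's capital.", 0.41),  # score rounded for illustration
    ]
    return corpus[:top_k] if query.strip() else []


def get_top1_context(query: str, retrieve_fn: Callable[..., List[Hit]] = fake_retrieve) -> str:
    """Return the text of the single best hit, or an empty string if there is none."""
    hits = retrieve_fn(query, top_k=1)
    return hits[0][1] if hits else ""


print(repr(get_top1_context("What is the capital of Uruguay?")))
print(repr(get_top1_context("   ")))
```

An empty context still produces a valid prompt, in which case the templates instruct the model to answer 'I don't know.'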



## 7) Evaluation loop (compute SQuAD F1 / EM)


In [None]:

def evaluate_strategy(prompt_name, n_samples=None):
    build = PROMPTS[prompt_name]
    preds, refs = [], []
    total = len(qa) if n_samples is None else min(n_samples, len(qa))

    for i in range(total):
        q = qa[i]["question"]
        gold = qa[i]["answer"] if "answer" in qa[i] else qa[i]["answers"]
        gold_text = gold if isinstance(gold, str) else gold[0]

        ctx = get_top1_context(q)
        prompt = build(ctx, q)
        pred = generate_answer(prompt)

        preds.append({"id": str(i), "prediction_text": pred})
        refs.append({"id": str(i), "answers": {"text": [gold_text], "answer_start": [0]}})

    # Reuse the `squad` metric loaded in Section 3 instead of reloading it on every call
    return squad.compute(predictions=preds, references=refs), preds

# quick sanity on a subset first
for name in PROMPTS:
    m, _ = evaluate_strategy(name, n_samples=20)
    print(name, m)


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


instruction {'exact_match': 35.0, 'f1': 44.64102564102564}
cot {'exact_match': 0.0, 'f1': 8.6432740512443}
persona {'exact_match': 35.0, 'f1': 44.64102564102564}



## 8) Full run & select best strategy


In [None]:

results = {}
for name in PROMPTS:
    metrics, _ = evaluate_strategy(name, n_samples=None)  # full set
    results[name] = metrics

print("All results:", results)
best_by_f1 = max(results.items(), key=lambda kv: kv[1]["f1"])
best_by_em = max(results.items(), key=lambda kv: kv[1]["exact_match"])
print("Best F1:", best_by_f1)
print("Best EM:", best_by_em)


All results: {'instruction': {'exact_match': 22.004357298474947, 'f1': 28.500566154630903}, 'cot': {'exact_match': 2.396514161220044, 'f1': 9.77070173875652}, 'persona': {'exact_match': 23.856209150326798, 'f1': 31.24413005587416}}
Best F1: ('persona', {'exact_match': 23.856209150326798, 'f1': 31.24413005587416})
Best EM: ('persona', {'exact_match': 23.856209150326798, 'f1': 31.24413005587416})



## 9) Notes & Tips for Reproduction
- Ensure the **Step 2** notebook builds or loads the vector index before running this notebook.
- Keep generation temperature low (0–0.2) for reproducibility.
- Save results (e.g., to `results/step3_baseline.json`) for later comparison in Steps 4–6.
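
Saving the scores could look like the sketch below. `save_results` and `load_results` are hypothetical helpers, and the JSON layout (one entry per strategy, as in the `results` dict from Section 8) is an assumption rather than a format required by later steps.

```python
import json
from pathlib import Path


def save_results(results: dict, path: str) -> Path:
    """Write the per-strategy metrics to `path` as JSON, creating parent folders."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, sort_keys=True)
    return out


def load_results(path: str) -> dict:
    """Read metrics previously written by save_results."""
    with Path(path).open("r", encoding="utf-8") as f:
        return json.load(f)


# Usage in the notebook, after the full run in Section 8:
# save_results(results, "results/step3_baseline.json")
```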
