# 🧠 CS 546 Advanced NLP — Homework 2  
## Retrieval-Augmented Generation (RAG) under Knowledge Edits

### 📘 Overview
In this assignment, you will build and evaluate a **Retrieval-Augmented Generation (RAG)** system that must reason correctly under **knowledge edits** — when a single factual statement in a knowledge base (say [WIKIDATA](https://www.wikidata.org/wiki/Wikidata:Main_Page)) is modified.

For example, if the original fact was:  
> “Leonardo DiCaprio was born in the United States.”  
and it is **edited** to:  
> “Leonardo DiCaprio was born in Syria.”  

then a query like  
> “What is the currency of the country where Leonardo DiCaprio was born?”  
must now produce **“Syrian Pound”**, not **“US Dollar.”**

Your system should learn to retrieve the **modified fact** and use it to answer downstream reasoning questions that depend on it.

---

### 🎯 Learning Goals
By completing this homework, you will:
1. Implement **two retrievers** — BM25 and dense embedding–based — and evaluate **Hit@K** performance.  
2. Use the retrieved facts as **context** for an LLM to perform reasoning for multiple-choice questions.  
3. Experiment with **in-context prompting** and compare performance with varying numbers of retrieved documents (`k ∈ {0,1,2,4,8}`).  
4. Quantify and understand how retrieval quality influences answer accuracy.

---

### 📦 Deliverables
1. **Code Notebook (`ANLP_HW2_RAG.ipynb`)** – with completed TODOs and results cells executed.  

---

### 💡 Evaluation
Your submission will be evaluated on:
- ✅ Correct implementation of retrievers and metrics  
- ✅ Quality of reasoning accuracy results (with vs. without retrieval)   
- ✅ Reproducibility and code organization  

---


## 🧱 Section 1 — Environment Setup, Hugging Face Authentication, and Testing Enviroment

In this section, you’ll install all the required Python libraries and configure your Hugging Face access so that you can load pretrained models for retrieval and generation.

### 🧩 Step 1 – Install Required Packages

We’ll install a stable environment for the RAG homework:

- **`vllm == 0.11.0`** – fast local inference for prompt-based evaluation  
- **`transformers`, `tokenizers`, `accelerate`, `huggingface_hub`** – for working with pretrained LLMs  
- **`rank_bm25`** – lexical retrieval baseline  
- **`gdown`, `json_repair`** – for dataset download + JSON cleanup  

> 💡 The first two commands uninstall and purge cached packages to prevent version conflicts.  
> If the kernel suggests a restart after installation, **accept it** before continuing.

Run the next code cell and wait for it to complete successfully.

In [3]:
# !pip uninstall -y transformers tokenizers accelerate huggingface_hub
# !pip cache purge
# !pip install "vllm==0.11.0" "transformers>=4.51.0" "tokenizers>=0.21.0" "accelerate>=1.0.0" "huggingface_hub>=0.26.0" "rank_bm25" "gdown" "json_repair"


### 🧩 Step 2 – Authenticate with Hugging Face

Some models (e.g., `sentence-transformers` or `Qwen` series) require a valid Hugging Face access token.

1. Go to **[https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)**  
2. Create or copy a token with at least **“read”** permissions  
3. Run the login cell below and paste your token when prompted  

> ⚠️ The command `huggingface-cli login` is deprecated.  
> You can still use it, or switch to the modern equivalents:
> ```bash
> !huggingface-cli login --token <YOUR_TOKEN>
> ```
> or
> ```bash
> !hf auth login
> ```

Once your token is accepted, you should see a message like:

**Login successful.**  
**The current active token is `research`.**

In [1]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `hf auth whoami` to get more information or `hf auth logout` if you want to log out.
    Setting a new token will erase the existing one.
    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
The token `cs

### 🧩 Step 3 – Import Packages and Verify Installation

Now that all dependencies are installed, let’s import the required packages and confirm that the environment is correctly configured.

This step ensures:
- All essential libraries (`torch`, `transformers`, `vllm`, etc.) were installed successfully  
- Their versions match the expected setup for this assignment  

> ✅ If all imports run without errors and version checks pass, you’re ready to proceed to the next section.

In [2]:
import sys
import os
import json
from json_repair import repair_json
import torch
import accelerate
import huggingface_hub
import tokenizers
import numpy as np
import gdown
from rank_bm25 import BM25Okapi
import sentence_transformers
from sentence_transformers import SentenceTransformer, CrossEncoder
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import re
import pathlib
import gdown
import random
from tqdm import tqdm
import gc
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"  # required before importing vLLM
import vllm
from vllm import LLM, SamplingParams
random.seed(42)

INFO 10-31 22:23:55 [__init__.py:216] Automatically detected platform cuda.


In [7]:
# !pip install -q --force-reinstall "transformers==4.57.0"
# !pip install -q --force-reinstall --no-deps "accelerate==1.10.1"
# !pip install -q --force-reinstall --no-deps "huggingface_hub==0.35.3"
# !pip install -q --force-reinstall --no-deps "sentence_transformers==5.1.1"
# !pip install -q --force-reinstall --no-deps "numpy==2.0.2"
# !pip install -q --force-reinstall --no-deps "vllm==0.11.0"
# !pip install -q --force-reinstall --no-deps "gdown==5.2.0"

In [3]:
# --- Expected versions ---
expected = {
    "python": "3.12",
    "torch": "2.8.0+cu126",
    "transformers": "4.57.0",
    "accelerate": "1.10.1",
    "huggingface_hub": "0.35.3",
    "tokenizers": "0.22.1",
    "sentence_transformers": "5.1.1",
    "numpy": "2.0.2",
    "vllm": "0.11.0",
    "gdown": "5.2.0",
}

assert sys.version.startswith(expected["python"]), f"Python version mismatch: {sys.version}"
assert torch.__version__ == expected["torch"], f"Torch version mismatch: {torch.__version__}"
assert transformers.__version__ == expected["transformers"], f"Transformers version mismatch: {transformers.__version__}"
assert accelerate.__version__ == expected["accelerate"], f"Accelerate version mismatch: {accelerate.__version__}"
assert huggingface_hub.__version__ == expected["huggingface_hub"], f"HuggingFace Hub version mismatch: {huggingface_hub.__version__}"
assert tokenizers.__version__ == expected["tokenizers"], f"Tokenizers version mismatch: {tokenizers.__version__}"
assert sentence_transformers.__version__ == expected["sentence_transformers"], f"SentenceTransformers version mismatch: {sentence_transformers.__version__}"
assert np.__version__ == expected["numpy"], f"Numpy version mismatch: {np.__version__}"
assert vllm.__version__ == expected["vllm"], f"vLLM version mismatch: {vllm.__version__}"
assert gdown.__version__ == expected["gdown"], f"gdown version mismatch: {gdown.__version__}"

print("✅ All package versions match expected values.")

✅ All package versions match expected values.


## 📂 Section 2 — Dataset Overview and Loading

In this section, you’ll load and inspect the dataset used for this homework.  
Each example represents a **modified fact** paired with **queries** that require reasoning based on this change.

### 🧠 Dataset Description

The dataset has two main components:

1. **📘 Edited Facts**  
   - Each document encodes one modified statement (e.g., *“Leonardo DiCaprio was born in Syria.”*).  
   - These serve as the **knowledge base** for retrieval.

2. **❓ Queries**  
   - Each query asks a reasoning question affected by the edited fact.  
   - Each includes:
     - `question`: the query text  
     - `choices`: six options (A–F)  
     - `correct_choice`: correct option key  
     - `correct_document_ids`: IDs of the relevant edited fact(s)

> 💡 **Goal:** Retrieve the correct edited fact and answer according to the **modified world**, not the original one.

### 🧩 Step 1 – Download, Save, and Inspect the Dataset

1. **Download** the dataset files using the `gdrive_get()` function implemented below.  
2. **Save** them locally in your working directory at `/content/datasets/`.  
3. **Load** the dataset using `json.load()` or an equivalent utility.  
4. **Inspect a few examples**, each containing:
   - `documents` — verify each includes a single edited fact  
   - `queries` — ensure they reference valid `correct_document_ids`  
   - The `documents` remain the same across all `queries` within a given data split (`val`, `test`).  
5. The dataset has two splits: **validation (`val`)** and **test (`test`)**.  
   - For the **test** split, `correct_choice` is always set to `"A"` and `correct_document_ids` to `0`.  
   - Ground-truth labels for the test split are **not released**, as your submissions will be evaluated later.

**Example checks:**
- Verify dataset sizes (`len(documents)`, `len(queries)`)  
- Print a few representative examples  

In [4]:
def gdrive_get(url, out):
    fid = re.search(r'(?:/d/|id=)([-\w]{10,})', url).group(1)
    p = pathlib.Path(out); p.parent.mkdir(parents=True, exist_ok=True)
    gdown.download(id=fid, output=str(p), quiet=False)
    if p.read_bytes()[:32].lstrip().startswith((b'<!DOCTYPE html', b'<html')):
        gdown.download(url=f"https://drive.google.com/uc?id={fid}&export=download",
                       output=str(p), quiet=False)

In [5]:
files = {
    "datasets/val_dataset.json":  "https://drive.google.com/file/d/1fbRNGPpNebv8lDJGtjg0kvEXoYs5R4XC/view?usp=sharing",
    "datasets/test_dataset.json": "https://drive.google.com/file/d/1BnB3cAakw5oB1z7yzt8HKXF9c5V3rTWw/view?usp=sharing",
}

datasets = {}

print("\n📦 Downloading & Loading Datasets...\n" + "="*60)
for out_path, url in files.items():
    out = pathlib.Path(out_path)
    split_name = "val" if "val" in out.stem else "test"

    gdrive_get(url, out)

    with open(out, "r") as f:
        dataset = json.load(f)
    datasets[split_name] = dataset

    num_queries = len(dataset)
    num_docs = len(dataset[0]["all_documents"]) if num_queries > 0 else 0
    print(f"📂 Split: {split_name.upper():<5}")
    print(f"   • Path: {out.resolve()}")
    print(f"   • Queries:  {num_queries:,}")
    print(f"   • Documents: {num_docs:,}")
    print("-" * 60)

print("✅ All datasets ready.\n")


📦 Downloading & Loading Datasets...


Downloading...
From: https://drive.google.com/uc?id=1fbRNGPpNebv8lDJGtjg0kvEXoYs5R4XC
To: /content/datasets/val_dataset.json
100%|██████████| 51.3M/51.3M [00:00<00:00, 81.8MB/s]


📂 Split: VAL  
   • Path: /content/datasets/val_dataset.json
   • Queries:  1,492
   • Documents: 371
------------------------------------------------------------


Downloading...
From (original): https://drive.google.com/uc?id=1BnB3cAakw5oB1z7yzt8HKXF9c5V3rTWw
From (redirected): https://drive.google.com/uc?id=1BnB3cAakw5oB1z7yzt8HKXF9c5V3rTWw&confirm=t&uuid=64b3513b-912b-48b6-95df-d583fb3fd244
To: /content/datasets/test_dataset.json
100%|██████████| 108M/108M [00:01<00:00, 71.6MB/s]


📂 Split: TEST 
   • Path: /content/datasets/test_dataset.json
   • Queries:  1,955
   • Documents: 535
------------------------------------------------------------
✅ All datasets ready.



In [6]:
def print_sample(split, item_id):
    query_id = datasets[split][item_id]['query_id']
    query = datasets[split][item_id]['query']
    all_document_ids = datasets[split][item_id]['all_document_ids']
    all_documents = datasets[split][item_id]['all_documents']
    gold_document_ids = datasets[split][item_id]['correct_document_ids']
    gold_document = all_documents[gold_document_ids[0]]

    choices = datasets[split][item_id]['choices']
    correct_choice = datasets[split][item_id]['correct_choice']
    correct_choice_value = choices[correct_choice]

    print("=" * 80)
    print(f"📘 Split: {split} | Item ID: {item_id} | Query ID: {query_id}")
    print("-" * 80)
    print(f"🧠 Query:\n{query}\n")
    print("📚 All Documents:")
    for doc_id, doc_text in zip(all_document_ids, all_documents):
        print(f"  [{doc_id}] {doc_text}")
    print("\n🏆 Gold Document:")
    print(f"  ID: {gold_document_ids[0]}")
    print(f"  Text: {gold_document}\n")
    print("🗳️ Choices:")
    for key, val in choices.items():
        print(f"  {key}: {val}")
    print(f"\n✅ Correct Choice: {correct_choice} → {correct_choice_value}")
    print("=" * 80)

In [7]:
print_sample('val', 1400)

📘 Split: val | Item ID: 1400 | Query ID: 1402
--------------------------------------------------------------------------------
🧠 Query:
The name of the anthem of the country Michigan–Ohio State football rivalry is associated with is

📚 All Documents:
  [0] The name of the country of citizenship of Leonardo DiCaprio is Syria.
  [1] The name of the country which Academy Award for Best Picture is associated with is Wassoulou Empire.
  [2] The name of the spouse of Ron DeSantis is Carol Chu.
  [3] The names of the siblings of Janice Dickinson are Antoine-Jean-Matthieu Séguier.
  [4] Big Mouth is followed by 1977–78 French Division 2.
  [5] The name of the anthem of Philippines is Hatikvah.
  [6] The name of the country of citizenship of Jerrod Carmichael is Terengganu.
  [7] The name of the composer of Vikram is Johnny Reine.
  [8] The place of burial of Princess Alice of Battenberg is Panteón de Marinos Ilustres.
  [9] Soviet Union follows 2011 Greece Junior Badminton Championships – men'

In [8]:
print_sample('test', 1400)

📘 Split: test | Item ID: 1400 | Query ID: 1400
--------------------------------------------------------------------------------
🧠 Query:
The name of the head of state of the country 5 Forge Row is associated with is

📚 All Documents:
  [0] The name of the country which Goursez Vreizh is associated with is Franche-Comté.
  [1] The name of the country which Pralayakkad South is associated with is Sui dynasty.
  [2] The gender of Jose L Castillo is cisgender female.
  [3] The occupation of Emily I Jones is philatelist.
  [4] The name of the country which Suttor is associated with is Dutch Republic.
  [5] The name of the country which canton of Orcières is associated with is Chuvash Republic.
  [6] The occupation of G.L. Defer is Greek prefect.
  [7] The name of the country which Shockwave is associated with is Republic of Abkhazia.
  [8] The occupation of Nicholas D Rintala is police dog.
  [9] The name of the mother of Stephana Warnock is Sheila Mary Nolan.
  [10] The gender of Henry Joh

## 🔎 Section 3 — Implement the Retriever

In this section, you have to implement a **retriever** that retrieves the most relevant documents for every query.  
You can choose to use either:

- **BM25** — a lexical retriever based on token overlap, or  
- **Bi-encoder** — a neural retriever that uses **sentence embeddings**.

> 💡 Read more about different types of model-based retrievers here: [blog](https://blog.dailydoseofds.com/p/visual-guide-to-bi-encoders-cross)

Each retriever computes **similarity scores** between every pair of query and document.  
For each query, your goal is to identify the **top-k documents** with the highest similarity scores.


### 🧩 Step 1 – Fill the missing code.

Fill in the `<fill block> ... </fill block>` section in the code cells below to:

1. Compute a similarity score matrix of shape **[num_queries, num_documents]**.  
2. Retrieve **top-k** documents per query (and optionally their similarity scores).

In [9]:
def get_similarity_scores(retrieval_dataset, retriever_type="model", model_name="hkunlp/instructor-large"):
    queries = [item["query"] for item in retrieval_dataset] # queries.
    documents = retrieval_dataset[0]["all_documents"] # documents.
    for idx in range(len(retrieval_dataset)): # documents for different queries in a split are exactly same.
      assert retrieval_dataset[idx]["all_documents"] == documents

    if retriever_type == 'model':
      print(f"Using retriever type: {retriever_type}, model name: {model_name}")
    else:
      print(f"Using retriever type: {retriever_type}")

    if retriever_type == "model":
        # Use the SentenceTransformer specified by model_name to compute similarity scores (tensor of shape [len(queries), len(documents)]).
        # <fill block>
        st_model = SentenceTransformer(model_name)
        doc_emb = st_model.encode(documents, convert_to_numpy=True, normalize_embeddings=True, show_progress_bar=False)
        query_emb = st_model.encode(queries, convert_to_numpy=True, normalize_embeddings=True, show_progress_bar=False)

        doc_t   = torch.from_numpy(doc_emb).float()    # [D, dim]
        query_t = torch.from_numpy(query_emb).float()  # [Q, dim]
        similarity = query_t @ doc_t.T                  # [Q, D]
        # </fill block>

    elif retriever_type == "bm25":
        # Use the BM25Okapi to compute similarity scores (tensor of shape [len(queries), len(documents)]).
        # <fill block>
        def tok(x):
            return re.findall(r"\w+", x.lower())

        tokenized_docs = [tok(d) for d in documents]
        bm25 = BM25Okapi(tokenized_docs)

        rows = []
        for q in queries:
            scores = bm25.get_scores(tok(q))
            rows.append(torch.tensor(scores, dtype=torch.float32))
        similarity = torch.stack(rows, dim=0)           # [Q, D]
        # </fill block>

    else:
        raise ValueError(f"Unknown retriever type: {retriever_type}")

    # --- Output consistency checks ---
    assert isinstance(similarity, torch.Tensor)
    assert similarity.shape[0] == len(queries) and similarity.shape[1] == len(documents)
    return similarity

In [10]:
def get_topk(similarity, documents, k):
    """
    Return top-k indices and documents per query for a single k.
    """
    # Use tensor manipulation to fetch indices of top-k most similar documents corresponding to each query.

    # <fill block>
    topk_indices = torch.topk(similarity, k=k, dim=1).indices.cpu().tolist()  # [[i1..ik], ...]
    topk_docs = [[documents[j] for j in row] for row in topk_indices]
    # </fill block>


    # --- Output consistency checks ---
    assert len(topk_indices) == similarity.size(0), (
        f"Mismatch: got {len(topk_indices)} query results, expected {similarity.size(0)}"
    )
    assert all(len(row) == k for row in topk_indices), "Each query must return exactly k indices"
    assert all(len(row) == k for row in topk_docs), "Each query must return exactly k docs"
    assert len(topk_docs) == len(topk_indices), "Mismatch between indices and docs output lengths"
    assert all(isinstance(idx, int) for row in topk_indices for idx in row), \
        "All elements in topk_indices must be integers"
    assert all(isinstance(doc, str) for row in topk_docs for doc in row), \
        "All elements in topk_docs must be strings"
    return topk_indices, topk_docs

In [11]:
def hitrate_at_k_from_indices(topk_indices, gold_ids):
    """
    Compute hit-rate@k given top-k indices (per query) and gold doc IDs (per query).
    """
    n = len(gold_ids)
    hits = sum(bool(set(gold_ids[i]).intersection(topk_indices[i])) for i in range(n))
    return hits / n if n else 0.0

### 🧩 Step 2 – Set Dataset Split.

Before running the cell below, make sure to **set the dataset split** correctly.

1. **Validation Phase**  
   - Fill in the required code sections.  
   - Run **all cells** in the notebook using the **validation (`val`)** split.  
   - Verify that your implementation and performance metrics are satisfactory.

2. **Testing Phase**  
   - Once you are satisfied with your results on the validation split, **restart the runtime/session** to ensure a clean environment.  
   - Then, run **all cells again** using the **test** split.

⚠️ **Note:** The ground truth labels for the test split are **not visible**.  
As a result, the notebook may show **lower scores** on the test split, even if your implementation performs well on validation.  
Your final test predictions will be evaluated separately.

In [12]:
CURRENT_SPLIT = "val" # set current split

In [13]:
# setting the dataset variables
retrieval_dataset = datasets[CURRENT_SPLIT]
gold_ids  = [item["correct_document_ids"] for item in datasets[CURRENT_SPLIT]]
queries   = [item["query"] for item in datasets[CURRENT_SPLIT]]
documents = datasets[CURRENT_SPLIT][0]["all_documents"]
answer_choices = []
for item in datasets[CURRENT_SPLIT]:
  item_choices = item['choices']
  item_choices['correct_choice'] = item['correct_choice']
  answer_choices.append(item_choices)

### 🧩 Step 3 – Run Retriever, Visualize Examples, and Save Results

In this step, you will:

1. **Run the Retriever**  
   - Execute the retrieval pipeline using your selected retriever type (e.g., `bm25`, `instructor`, or `qwen3` model).  
   - Compute similarity scores between all **queries** and **documents**.  
   - Evaluate **Hit@K** metrics across different values of K (e.g., 1, 2, 4, 8, 16, 32, 64).

2. **Visualize Retrieval Examples**  
   - Randomly inspect a few **Top-1 retrieved documents** for sample queries.  
   - Check whether the retrieved results match the **ground-truth document IDs**.

3. **Save Retrieval Results**  
   - Store the computed top-K indices, retrieved documents, and hit rates in a dictionary (e.g., `TOP_K_RETRIEVAL_RESULTS`).  
   - Later, write these results to disk for submission or further evaluation.

⚙️ **Tip:**  
Experiment with multiple retrievers, but **use the one that achieves the highest HitRate on your validation split** when running the final cells, since this retriever will be used for the generation stage in the next section.


📊 **Outcome:**  
After completing this step, you should see printed **Hit@K** scores (e.g., HitRate@1, HitRate@4, HitRate@16, etc.) and several sample retrievals that help verify the correctness of your retriever implementation.

In [14]:
# Try retriever "bm25" and "model" with embedding models such as "hkunlp/instructor-large" or "Qwen/Qwen3-Embedding-0.6B"
# <fill block>
RETRIEVER_TYPE, RETRIEVER_MODEL_NAME = "model", "Qwen/Qwen3-Embedding-0.6B"
# </fill block>

TOP_K_RETRIEVAL_RESULTS = dict() # saving the retrieval results here.
similarity_scores = get_similarity_scores(retrieval_dataset=retrieval_dataset, retriever_type=RETRIEVER_TYPE, model_name=RETRIEVER_MODEL_NAME).cpu()

ks = (1, 2, 4, 8, 16, 32, 64)
for k in ks:
    topk_idx, topk_docs = get_topk(similarity_scores, documents, k=k)
    if k not in TOP_K_RETRIEVAL_RESULTS:
      TOP_K_RETRIEVAL_RESULTS[k] = dict()
    hitrate = hitrate_at_k_from_indices(topk_idx, gold_ids)
    TOP_K_RETRIEVAL_RESULTS[k] = {'indices': topk_idx, 'documents': topk_docs, 'hitrate': hitrate}
    if k == 1 and CURRENT_SPLIT == "val":
      assert hitrate > 0.80, "For validation split the hitrate@1 should be greater than 0.8"
    print(f"HitRate@{k}: {hitrate:.3f}")

Using retriever type: model, model name: Qwen/Qwen3-Embedding-0.6B


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/215 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

config.json:   0%|          | 0.00/727 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.19G [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/313 [00:00<?, ?B/s]

HitRate@1: 0.870
HitRate@2: 0.897
HitRate@4: 0.920
HitRate@8: 0.940
HitRate@16: 0.963
HitRate@32: 0.979
HitRate@64: 0.991


In [15]:
random.seed(42)
print("\n=== Sample Top-1 Retrievals (Random Order) ===")
for i in random.sample(range(len(queries)), k=min(20, len(queries))):
    top1_idx = TOP_K_RETRIEVAL_RESULTS[1]['indices'][i][0]
    gold = set(gold_ids[i])
    is_hit = top1_idx in gold
    symbol = "✅" if is_hit else "❌"

    print(f"\n{symbol} Query {i}: {queries[i]}")
    print(f"  ➤ Top-1 Retrieved Doc[{top1_idx}]: {topk_docs[i][0][:150]}{'...' if len(topk_docs[i][0])>150 else ''}")
    print(f"  🎯 Ground Truth IDs: {sorted(gold)}")


=== Sample Top-1 Retrievals (Random Order) ===

✅ Query 1309: 1986 German Badminton Championships – women's doubles is followed by
  ➤ Top-1 Retrieved Doc[328]: Drishyam 2 follows 1986 German Badminton Championships – women's doubles.
  🎯 Ground Truth IDs: [328]

✅ Query 228: The name of the continent which the country of citizenship of Sunil Lahri is part of is
  ➤ Top-1 Retrieved Doc[56]: The name of the country of citizenship of Sunil Lahri is Viceroyalty of New Granada.
  🎯 Ground Truth IDs: [56]

✅ Query 51: The name of the spouse of the mother of Richard Nixon is
  ➤ Top-1 Retrieved Doc[11]: The name of the mother of Richard Nixon is Caretene.
  🎯 Ground Truth IDs: [11]

✅ Query 563: The name of the child of Dorothy Bridges is
  ➤ Top-1 Retrieved Doc[132]: The names of the siblings of Jeff Bridges are Carol Nugent.
  🎯 Ground Truth IDs: [132]

❌ Query 501: The name of the spouse of the founder of Most Eminent Order of the Indian Empire is
  ➤ Top-1 Retrieved Doc[367]: The name o

In [16]:
# Save the retrieval results.

output_file = f"anlp_hw2_outputs/retrieval_results_{RETRIEVER_TYPE}_{RETRIEVER_MODEL_NAME.replace('/', '.')}_{CURRENT_SPLIT}.json"
os.makedirs(os.path.dirname(output_file), exist_ok=True)

with open(output_file, "w") as f:
    json.dump(TOP_K_RETRIEVAL_RESULTS, f, indent=2)

print(f"✅ Retrieval results saved successfully to: {output_file}")

✅ Retrieval results saved successfully to: anlp_hw2_outputs/retrieval_results_model_Qwen.Qwen3-Embedding-0.6B_val.json


### 🧠 Section 4 — Generate Answers with an LLM

In this section, you’ll use a **generator LLM** to answer queries **conditioned on the retrieved documents**. Your objective is to improve the model’s reasoning and accuracy by systematically experimenting with different configurations.

**Your tasks:**
1. **Try different generator models**  
   - Experiment with different small models.  
   - Compare their performance on the validation split to identify which model generalizes best.

2. **Tune the input prompts**  
   - Refine the system and user prompts to improve reasoning structure and output consistency.  
   - Emphasize step-by-step reasoning and enforce strict JSON-only answers.

3. **Vary the number of top-K retrieved documents**  
   - Experiment with different context sizes (e.g., K = 0, 1, 2, 4, 8, 16, 64).  
   - Observe how increasing or decreasing K affects both accuracy and generation cost.

4. **Save your results to disk**  
   - Store outputs for atleast two configuration (model, K, prompt variant) (one with K>=1 and one with K=0) on both **validation** and **test** splits.  

> **Note:** Test split labels are hidden, so accuracy will appear low or undefined. This is expected — focus on achieving strong validation performance first, then run inference on the test split for evaluation.


### 🧩 Step 1 – Load the retrieval results from the best found retriever

In this step, you will **reload the retrieval outputs** obtained from the best-performing retriever model (identified in the previous section).  
These results contain the **top-K retrieved documents** for each query and will serve as input for the **generator LLM** in the next stages.

You’ll load the saved `.json` file from the `outputs/` directory, which stores the retrieval scores and document IDs for both validation and test splits.

In [17]:
RETRIEVER_TYPE, RETRIEVER_MODEL_NAME = "model", "Qwen/Qwen3-Embedding-0.6B"
retrieval_results_file = f"anlp_hw2_outputs/retrieval_results_{RETRIEVER_TYPE}_{RETRIEVER_MODEL_NAME.replace('/', '.')}_{CURRENT_SPLIT}.json"
TOP_K_RETRIEVAL_RESULTS = json.load(open(retrieval_results_file))

### 🧩 Step 2 – Setup the generator LLM using vLLM

In this step, you will **initialize the generator model** that will produce answers based on the retrieved documents.  
The generator LLM takes as input the query, context (retrieved top-K documents), and the prompt instructions — and outputs the most likely correct answer.

You will:
1. **Select the generator model**  
   Choose from models like `Qwen2.5-3B-Instruct`, `Llama-3.2-1B-Instruct`, or `SmolLM3-3B`.  
   Larger models generally exhibit stronger reasoning abilities but require more GPU memory and may take longer to load.

2. **Initialize the tokenizer**  
   The tokenizer formats the input messages using the appropriate chat template (system + user roles) before sending them to the model.

3. **Load the model with vLLM**  
   Use the `LLM()` class from [**vLLM**](https://vllm.ai) — a high-performance inference engine optimized for **fast, memory-efficient, and scalable generation**.  
   vLLM uses techniques such as **PagedAttention**, dynamic batching, and GPU memory sharing, allowing you to run large models efficiently with minimal latency.

   Key parameters to tune:
   - `dtype`: `"float16"` or `"bfloat16"` (depending on GPU support)  
   - `tensor_parallel_size`: Increase if running on multiple GPUs  
   - `gpu_memory_utilization`: Adjust between `0.85–0.95` depending on available memory  
   - `max_model_len`: Set according to the model’s context window  

> 💡 **Tip:** If any model fails to load, it’s likely due to **insufficient free GPU memory**.  
> Before retrying, free up GPU memory by restarting the kernel.

In [20]:
# Try different small generator models such as "Qwen/Qwen2.5-3B-Instruct" or "meta-llama/Llama-3.2-1B-Instruct" or "HuggingFaceTB/SmolLM3-3B"
# <fill block>
GENERATOR_NAME = "HuggingFaceTB/SmolLM3-3B"
# </fill block>

# ---- Initialize tokenizer (for chat templating) ----
generator_tokenizer = AutoTokenizer.from_pretrained(GENERATOR_NAME)

# ---- Initialize vLLM engine ----
generator_model = LLM(
    model=GENERATOR_NAME,
    dtype="float16",                # or "bfloat16" if supported
    tensor_parallel_size=1,         # increase if multi-GPU
    gpu_memory_utilization=0.85,     # try 0.85–0.95 depending on GPU headroom
    max_model_len=2048,
    enable_prefix_caching=True      # saves the compute if prompts share a common prefix (e.g. a system prompt)
)
print(f"✅ Model {GENERATOR_NAME} loaded successfully.")

INFO 10-31 22:29:07 [utils.py:233] non-default args: {'dtype': 'float16', 'max_model_len': 2048, 'enable_prefix_caching': True, 'gpu_memory_utilization': 0.85, 'disable_log_stats': True, 'model': 'HuggingFaceTB/SmolLM3-3B'}
INFO 10-31 22:29:09 [model.py:547] Resolved architecture: SmolLM3ForCausalLM
INFO 10-31 22:29:09 [model.py:1510] Using max model len 2048
INFO 10-31 22:29:09 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 10-31 22:29:57 [llm.py:306] Supported_tasks: ['generate']
✅ Model HuggingFaceTB/SmolLM3-3B loaded successfully.


### 🧩 Step 3 – Tune the Prompts to Generator

In this step, you will design **prompt templates** that will guide your **generator model** (e.g., `Llama-3`, `Qwen2`, `Gemma`, etc.) to produce accurate and well-structured answers based on the retrieved documents.

Your task is to:

1. **Understand the Context and Task**
   - You are given a list of modified WIKIDATA facts (`contextual_facts`), a question (`question`), and multiple answer choices (`answer_options`).
   - The model must reason over the modified facts — treating them as the *true, authoritative version* — and pick the correct choice.

2. **Write the System and User Prompts**
   - The **system prompt** should explain the reasoning process and enforce the correct output format (a single JSON line like `{"correct_choice": "A"}`).
   - The **user prompt** should include the given context, the question, and all answer choices.

3. **Goal**
   - Guide the model to reason step-by-step using the provided facts.
   - Ensure the final output is machine-parseable (strict JSON schema) so it can be automatically evaluated.

4. **Instructions**
   - Fill the `# <fill block> ... </fill block>` section by writing both:
     - A clear system instruction on how to reason and format the answer.
     - A user instruction that includes the contextual facts, question, and answer choices.
   - Verify that the return value is a list of two messages:
     ```python
     [
       {"role": "system", "content": sys},
       {"role": "user", "content": user}
     ]
     ```

🧠 *Tip:* Try different phrasing styles for the system and user prompts — for instance, make the system more detailed and the user shorter (or vice versa) — to see which yields the best validation accuracy.

In [22]:
def prepare_generator_input(contextual_facts, question, answer_options):
    ctx = " ".join(contextual_facts)
    answer_choices_string = "\n".join([f"{choice}: {answer_options['choice_' + choice]}" for choice in ['A', 'B', 'C', 'D', 'E', 'F']])
    # NOTE: Write the system and user prompt.
    # System prompt should guide the LLM about how to use context, how to reason and how to geneate output in the parseable format {"correct_choice": "label"}, where "label" can only be "A", "B", "C", "D", "E", "F".
    # User prompt will be dependent on contextual facts, question and answer_option.
    # More tips on tuning prompts:
    # =========================
    # SYSTEM PROMPT CHECKLIST
    # =========================
    # Purpose: Teach the model HOW to think and respond.

    # 1) Define model role clearly:
    #       - State the model is an expert in Wikidata facts and logical reasoning.

    # 2) Explain function of the provided context:
    #       - Context temporarily modifies/overrides relationships in the Wikidata graph.

    # 3) Set precedence of new facts:
    #       - Modified facts in context take precedence over any prior/latent knowledge.

    # 4) Specify reasoning style:
    #       - Require 3–6 concise, non-redundant bullet points.
    #       - Bullets must reference the modified relationships and their ripple effects.

    # 5) Enforce output structure:
    #       - Exactly two sections, in order:
    #         1) "Reasoning:" (bullets)
    #         2) "Final:" (single JSON line)

    # 6) Enforce strict JSON format in Final:
    #       - Exactly: {"correct_choice": "label"}
    #       - "label" ∈ {"A", "B", "C", "D", "E", "F"}

    # 7) Restrict JSON location:
    #       - JSON must appear ONLY in the Final section (never inside Reasoning).

    # 8) Encourage step-by-step logic:
    #       - Instruct to explain how modified links cause one choice to be correct.


    # =======================
    # USER PROMPT CHECKLIST
    # =======================
    # Purpose: Provide the specific instance (facts, question, choices).

    # 1) Include contextual facts:
    #       - Start with "Context:" containing the edited/modified Wikidata relationships.

    # 2) Present the question clearly:
    #       - Follow with "Question:" that refers to entities impacted by the context.

    # 3) List answer choices:
    #       - End with "Answer Choices:" listing options A–F, clearly labeled.

    # 4) Keep structure consistent:
    #       - Order must be: Context → Question → Answer Choices.
    #       - Use clean line breaks; avoid extra filler text.

    # 5) Avoid meta-instructions:
    #       - Do NOT include reasoning guidance or examples—only task content.

    # 6) Keep inputs parseable:
    #       - Consistent labels and formatting to allow automated extraction.


    # <fill block>
    sys = (
    "You will answer multiple-choice questions using only the facts in the Context. "
    "If anything conflicts with your own knowledge, follow the Context instead.\n\n"
    "Write a short reasoning section as bullet points under the heading 'Reasoning:'. "
    "Keep it brief and refer to relationships mentioned in the Context.\n\n"
    "After that, add a 'Final:' section with exactly one JSON line:\n"
    '  {\"correct_choice\": \"X\"}\n'
    "Use only one of A, B, C, D, E, F. "
    "Do not include JSON anywhere else."
    )

    user = (
    "Context:\n"
    f"{ctx}\n\n"
    "Question:\n"
    f"{question}\n\n"
    "Answer Choices:\n"
    f"{answer_choices_string}"
    )
    # </fill block>

    return [
        {"role": "system", "content": sys},
        {"role": "user", "content": user}
    ]

In [23]:
def chat_response(list_messages, generator_model, generator_tokenizer, max_new_tokens, temperature=0.0):
    # Prepare the chat input text
    prompts = []
    for messages in list_messages:
      prompt = generator_tokenizer.apply_chat_template(
          messages,
          tokenize=False,
          add_generation_prompt=True
      )
      prompts.append(prompt)

    # --- Define sampling parameters ---
    sp = SamplingParams(
        temperature=temperature,
        max_tokens=max_new_tokens,
    )

    # --- Run inference with vLLM ---
    responses = generator_model.generate(prompts, sp)
    responses = [response.outputs[0].text for response in responses]

    return responses

### 🧩 Step 4 – Prepare Generator Inputs with Top-K Retrieved Documents

In this step, you’ll **combine the retriever results** (Top-K documents) with the query and answer options to form the **input messages** for your generator LLM.

Your tasks:

1. **Select K (number of retrieved documents)**
   - Set the variable `K` to control how many top-retrieved documents are passed to the generator.
   - Try values like `K = 1`, `K = 2`, `K = 4`, or `K = 8` to observe how retrieval depth affects generation accuracy.
   - When `K = 0`, the model receives **no context** (baseline, zero-context reasoning).

2. **Sample Evaluation Queries**
   - A random subset of queries will be selected to visualize input examples and sanity-check that the right documents are being used.

3. **Generate Formatted Messages**
   - For each query, you will create a message pair:
     - **System + User prompts** (from `prepare_generator_input`)
     - **Ground-truth answer label**
   - These structured inputs will later be passed to your generator in the next step for inference.

4. **Inspect a Few Samples**
   - After preparing the list `messages_and_labels`, print a few samples to confirm:
     - Retrieved context snippets
     - Corresponding question and answer choices
     - JSON-formatted LLM messages and correct answer labels

💡 *Tip:* Start with `K = 0` and `K = 1` for quick debugging, then progressively increase K. Larger K may improve recall but can hurt precision if irrelevant documents are added.

In [24]:
# NOTE: Choose how many top-k documents (experiment with atleast K=0 and K=1) you want to put the in the context of generator LLM.
# <fill block>
K = 1
# </fill block>

if K == 0:
  topk_idx, topk_docs = [[] for _ in range(len(TOP_K_RETRIEVAL_RESULTS[str(1)]['indices']))], [[] for _ in range(len(TOP_K_RETRIEVAL_RESULTS[str(1)]['documents']))]
else:
  topk_idx, topk_docs = TOP_K_RETRIEVAL_RESULTS[str(K)]['indices'], TOP_K_RETRIEVAL_RESULTS[str(K)]['documents']

messages_and_labels = [
    {
        'messages': prepare_generator_input(topk_docs[i], queries[i], answer_choices[i]),
        'correct_choice': answer_choices[i]['correct_choice'].split('_')[-1],
    }
    for i in range(len(queries))
]

In [25]:
# visualise some geneator inptus
random.seed(42)
indices = random.sample(range(len(queries)), k=5)

for i in indices:
    print(f"\n=== Query {i} ===")
    print(f"❓ {queries[i]}")
    print(f"📚 Context ({len(topk_docs[i])} docs):")
    for j, doc in enumerate(topk_docs[i]):
        print(f"  [{j+1}] {doc[:150]}{'...' if len(doc)>150 else ''}")
    print(f"\n🧠 LLM Messages:\n{json.dumps(messages_and_labels[indices.index(i)]['messages'], indent=4)}\n")
    print(f"Correct Choice:\n{messages_and_labels[indices.index(i)]['correct_choice']}")


=== Query 1309 ===
❓ 1986 German Badminton Championships – women's doubles is followed by
📚 Context (1 docs):
  [1] Drishyam 2 follows 1986 German Badminton Championships – women's doubles.

🧠 LLM Messages:
[
    {
        "role": "system",
        "content": "You will answer multiple-choice questions using only the facts in the Context. If anything conflicts with your own knowledge, follow the Context instead.\n\nWrite a short reasoning section as bullet points under the heading 'Reasoning:'. Keep it brief and refer to relationships mentioned in the Context.\n\nAfter that, add a 'Final:' section with exactly one JSON line:\n  {\"correct_choice\": \"X\"}\nUse only one of A, B, C, D, E, F. Do not include JSON anywhere else."
    },
    {
        "role": "user",
        "content": "Context:\nThe name of the country of citizenship of Leonardo DiCaprio is Syria.\n\nQuestion:\nThe name of the currency in the country of citizenship of Leonardo DiCaprio is\n\nAnswer Choices:\nA: Japanese yen

### 🧩 Step 5 – Compute Accuracy and Save the Generated Responses

In this final step, you’ll **evaluate your generator model** by comparing its predicted answers against the ground truth labels and compute overall accuracy.

Your tasks:

1. **Parse and Validate Model Outputs**
   - The generator produces answers in JSON format (e.g., `{"correct_choice": "A"}`).
   - The function `repair_json()` ensures that slightly malformed outputs (like missing braces or quotes) are corrected before evaluation.

2. **Compute Accuracy and Parsing Error**
   - For each response, extract the predicted label and compare it with the true label.
   - If the model output fails to parse correctly, it’s counted toward the **parsing error fraction**.
   - The `accuracy()` function returns both:
     - `accuracy`: fraction of correctly predicted answers  
     - `parse_error_fraction`: percentage of invalid or unparsable outputs

3. **Run Generator and Evaluate**
   - Use the `chat_response()` function to generate predictions for all prepared prompts.
   - Pass these responses into `accuracy()` to compute the final metrics.

4. **Inspect Results**
   - Print the final accuracy and parsing error (both shown as percentages).
   - A progress bar displays how many queries were processed successfully.
   - Example output:
     ```
     Final Accuracy: 50.60% || Parsing Error: 1.07%
     ```

5. **(Optional Extension)**
   - Save responses and scores for further analysis:
     - Try saving them as a JSONL file for leaderboard or post-hoc evaluation.
     - Example: `json.dump({"responses": responses, "labels": labels}, open('results/generator_outputs.json', 'w'))`

💡 *Tip:* A low parsing error (<2%) indicates your prompt formatting is robust.  
If parsing errors spike, revisit your system prompt or tighten JSON output instructions.

---

📘 **Runtime Note:**  
On a **T4 GPU**, the validation set should take **about one hour** to run.  
If you have access to the **free *Colab Pro ([Student Edition](https://colab.research.google.com/signup))***, you can switch your runtime to an **A100 GPU**, which will complete the validation run in **about one minute**.

In [26]:
def accuracy(responses, labels):
    pred_labels = []
    parsing_error = 0.0
    for response, label in zip(responses, labels):
        pred_obj = repair_json(response, return_objects=True)
        if isinstance(pred_obj, list):
            pred_obj = pred_obj[-1]
        if isinstance(pred_obj, dict) and "correct_choice" in pred_obj:
            pred_label = pred_obj["correct_choice"]
        else:
            pred_label = "A"  # default fallback
            parsing_error += 1
        pred_labels.append(pred_label)

    correct = sum(p == l for p, l in zip(pred_labels, labels))
    return {"accuracy": correct / len(labels) if labels else 0.0, "parse_error_fraction": parsing_error / len(labels) if labels else 0.0}

In [27]:
generator_messages = [item['messages'] for item in messages_and_labels]
labels = [item['correct_choice'] for item in messages_and_labels]

responses = chat_response(
    generator_messages, generator_model, generator_tokenizer, max_new_tokens=1502
)

# compute final accuracy
results = accuracy(responses, labels)
print(f"Final Accuracy: {results['accuracy']:.3%} || Parsing Error: {results['parse_error_fraction']:.4%}")
assert results['accuracy'] > 0.5, "Final accuracy should be greater than 0.5 after tuning value of K and prompt."
assert results['parse_error_fraction'] < 0.05, "Parsing error fraction should be less than 5%."

Adding requests:   0%|          | 0/1492 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/1492 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s…

Final Accuracy: 50.469% || Parsing Error: 1.4745%


In [28]:
# Save the generator results.
output_file = f"anlp_hw2_outputs/generator_results_{GENERATOR_NAME.replace('/', '.')}_K-{K}_{CURRENT_SPLIT}.json"
os.makedirs(os.path.dirname(output_file), exist_ok=True)

with open(output_file, "w") as f:
    json.dump({"responses": responses, "generator_messages": generator_messages}, f, indent=2)

print(f"✅ Generator results saved successfully to: {output_file}")

✅ Generator results saved successfully to: anlp_hw2_outputs/generator_results_HuggingFaceTB.SmolLM3-3B_K-1_val.json


## 🔎 Section 5 — Upload Your Outputs to Hugging Face Hub

In this step, you will upload your generated **JSON output files** to your **own Hugging Face repository**.  
This ensures your results are versioned, shareable, and verifiable by the instructors.

#### Instructions

1. **Login to Hugging Face**  (you already did this in the beginning)
   Run the following in a code cell:
   ```bash
   huggingface-cli login
   ```
   Get your access token from:  
   <https://huggingface.co/settings/tokens>

2. **Run the upload cell below**  
   It will:
   - Create a repository named `anlp-hw2-outputs` under your username (if it doesn’t already exist).
   - Upload all `.json` files inside `/content/anlp_hw2_outputs/`.
   - Print a **commit URL** and a **repository URL** upon success.

3. **Verify your upload**
   - Open the printed commit URL in your browser.
   - Confirm that all expected `.json` files are present in the repository.
   - Ensure each file has valid JSON content and is non-empty.

#### ✅ Verification Checklist

- `/content/anlp_hw2_outputs/` contains multiple `.json` files.
- All files are visible on Hugging Face and have non-zero size.
- The commit URL opens correctly and shows your uploaded files.


In [30]:
from huggingface_hub import HfApi, create_repo, upload_folder

# NOTE: specify your username to upload the upload the test and val results.
# <fill block>
user_name = "jameshan5040"
# </fill block>
repo_id = f"{user_name}/anlp-hw2-outputs"      # 🔹 change this
folder_path = "/content/anlp_hw2_outputs"       # 🔹 local folder path


api = HfApi()
create_repo(repo_id, exist_ok=True)

upload_folder(
    folder_path=folder_path,
    repo_id=repo_id,
    commit_message="Upload JSON outputs",
    ignore_patterns=["*.ipynb", "*.py", "*.DS_Store"],  # optional cleanup
)


Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...3-Embedding-0.6B_val.json:  88%|########8 | 16.4MB / 18.6MB            

CommitInfo(commit_url='https://huggingface.co/jameshan5040/anlp-hw2-outputs/commit/07824a4fdbd7615348c1def349708a0a50837f16', commit_message='Upload JSON outputs', commit_description='', oid='07824a4fdbd7615348c1def349708a0a50837f16', pr_url=None, repo_url=RepoUrl('https://huggingface.co/jameshan5040/anlp-hw2-outputs', endpoint='https://huggingface.co', repo_type='model', repo_id='jameshan5040/anlp-hw2-outputs'), pr_revision=None, pr_num=None)