
# Step 2 — Naive RAG (Milvus Lite Edition)

This notebook is **self-contained** and can be run independently of any Step 1 notebook.  
It will:

1. Load the **RAG Mini Wikipedia** dataset from Hugging Face
2. (Optionally) chunk passages to manageable sizes
3. Generate embeddings with **sentence-transformers/all-MiniLM-L6-v2** (384-dim)
4. Create a **Milvus Lite** collection, index vectors, and insert data
5. Implement a simple **retrieve** function (exact search with a FLAT index + IP on normalized vectors; Milvus Lite does not support HNSW in local mode)
6. Implement a minimal **answer_with_context** function that retrieves top-k passages and calls an LLM
7. Run an end-to-end demo using a question from the QA split

> **Note**: If you prefer FAISS, you can swap the vector store step for a FAISS index.  
> **Milvus Lite** runs locally via a single file (e.g., `milvus.db`) using `pymilvus`.



## 0. (Optional) Install Dependencies

Run this cell if your environment doesn't already have these packages.  
Restart the kernel if any package is installed or upgraded.


In [None]:

import sys, subprocess, importlib.util

def pip_install(pkg):
    print(f"Installing: {pkg}")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pkg])

# Maps importable module name -> pip requirement string
need = {
    "datasets": "datasets",
    "sentence_transformers": "sentence-transformers",
    "pymilvus": "pymilvus>=2.4.0",
    "numpy": "numpy",
    "pandas": "pandas",
    "openai": "openai",  # only if you plan to use OpenAI
}

for mod, pip_name in need.items():
    # importlib.util.find_spec replaces the deprecated pkgutil.find_loader
    if importlib.util.find_spec(mod) is None:
        pip_install(pip_name)
    else:
        print(f"OK: {mod}")




OK: datasets
OK: sentence_transformers
Installing: pymilvus>=2.4.0
OK: numpy
OK: pandas
OK: openai



## 1. Imports & Basic Setup


In [None]:

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import numpy as np
import pandas as pd
import os

from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility

# Reproducibility
import random
random.seed(42)
np.random.seed(42)



## 2. Load the RAG Mini Wikipedia Dataset

We load two dataset configurations:
- `text-corpus`: documents/passages to be indexed
- `question-answer`: evaluation questions (we'll sample one to demo)


In [None]:
ds_corpus = load_dataset("rag-datasets/rag-mini-wikipedia", "text-corpus")
corpus = ds_corpus["passages"]  # the text-corpus config exposes a single "passages" split
# Cap on corpus size for quick runs (the outputs shown below come from the full 3,200-passage corpus)
N_DOCS = 1000
corpus = corpus.select(range(min(N_DOCS, len(corpus))))
ds_qa = load_dataset("rag-datasets/rag-mini-wikipedia", "question-answer")
qa = ds_qa["test"]  # the question-answer config exposes a single "test" split


print(corpus)
print(qa)

# Extract raw texts
raw_passages = [row.get("text") or row.get("passage") or "" for row in corpus]
print("Sample passage:", raw_passages[0][:300], "...")
print("Total passages:", len(raw_passages))

Dataset({
    features: ['passage', 'id'],
    num_rows: 3200
})
Dataset({
    features: ['question', 'answer', 'id'],
    num_rows: 918
})
Sample passage: Uruguay (official full name in  ; pron.  , Eastern Republic of  Uruguay) is a country located in the southeastern part of South America.  It is home to 3.3 million people, of which 1.7 million live in the capital Montevideo and its metropolitan area. ...
Total passages: 3200



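## 3. Character-Based Chunking

Each passage is split into consecutive, non-overlapping windows of at most 600 characters. Chunk ids take the form `{passage_index}-{chunk_index}`.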




In [None]:

def simple_chunks(text, max_chars=600):
    # Avoid empty chunks
    if not text:
        return []
    return [text[i:i+max_chars] for i in range(0, len(text), max_chars)]

docs = []
for i, t in enumerate(raw_passages):
    chunks = simple_chunks(t)
    for j, ch in enumerate(chunks):
        docs.append({"id": f"{i}-{j}", "text": ch})

print("Total chunks:", len(docs))
print("Sample chunk:", docs[0]["text"][:200], "...")


Total chunks: 4046
Sample chunk: Uruguay (official full name in  ; pron.  , Eastern Republic of  Uruguay) is a country located in the southeastern part of South America.  It is home to 3.3 million people, of which 1.7 million live in ...
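
Fixed-width slicing can split a sentence or a name across two chunks, which may hurt retrieval for queries that match the split text. One common mitigation is to let consecutive windows overlap. The sketch below is an illustrative variant, not part of this notebook's pipeline; `overlapping_chunks` and its `overlap` parameter are hypothetical names introduced here.

```python
def overlapping_chunks(text, max_chars=600, overlap=100):
    """Split text into windows of max_chars that share `overlap` characters.

    Hypothetical helper for illustration; with overlap=0 it produces the
    same output as plain fixed-width slicing.
    """
    if not text:
        return []
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")
    step = max_chars - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reaches the end of the text
    return chunks

demo = overlapping_chunks("abcdefghij", max_chars=4, overlap=2)
print(demo)
```

Overlap increases the number of chunks (and therefore embedding and storage cost), so the value is a trade-off to tune rather than a fixed recommendation.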



## 4. Embeddings with all-MiniLM-L6-v2

Per assignment requirements, we use **sentence-transformers/all-MiniLM-L6-v2** (384-dim).  
We also **L2-normalize** embeddings so that **Inner Product (IP)** becomes cosine similarity.


In [None]:

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
text_list = [d["text"] for d in docs]

# Encode in batches to reduce memory usage
emb = model.encode(text_list, batch_size=64, show_progress_bar=True, normalize_embeddings=True)
emb = emb.astype("float32")  # Milvus expects float vectors

dim = emb.shape[1]
assert dim == 384, f"Unexpected dim {dim}"
print("Embeddings shape:", emb.shape)


Embeddings shape: (4046, 384)
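
The claim that inner product on L2-normalized vectors equals cosine similarity can be checked on a toy example. The sketch below uses plain Python lists rather than the model's embeddings; the helper names are hypothetical and exist only for this illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returns a copy unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

ip_normalized = inner_product(l2_normalize(a), l2_normalize(b))
cos = cosine_similarity(a, b)

# The two values agree up to floating-point rounding
print(abs(ip_normalized - cos) < 1e-9)
```

This is why the notebook passes `normalize_embeddings=True` at encoding time and then uses the `IP` metric at search time.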



## 5. Milvus Lite: Connect, Define Schema, Create Collection & Index

- Connect to Milvus Lite (embedded mode) via `uri="milvus.db"`
- Define a schema with:
  - `id` (primary key, VARCHAR)
  - `embedding` (FLOAT_VECTOR, dim=384)
  - `text` (VARCHAR, carry the chunk text)
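- Create a FLAT (exact-search) index with IP metric (on normalized vectors); Milvus Lite's local mode rejects HNSW and supports only FLAT, IVF_FLAT, and AUTOINDEX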


In [None]:
# Connect (Milvus Lite will create a local file if it doesn't exist)
connections.connect(alias="default", uri="milvus.db")
print("Connected to Milvus Lite:", connections.has_connection("default"))

COLL_NAME = "rag_mini_wiki_chunks"

# Drop if exists (to make the notebook idempotent)
if utility.has_collection(COLL_NAME):
    utility.drop_collection(COLL_NAME)

fields = [
    FieldSchema(name="id", dtype=DataType.VARCHAR, is_primary=True, auto_id=False, max_length=64),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=8192),  # comfortably fits our 600-char chunks
]
schema = CollectionSchema(fields, description="Naive RAG chunks")
col = Collection(COLL_NAME, schema=schema, consistency_level="Strong")
print("Created collection:", col.name)

# Create a FLAT index (exact search); Milvus Lite's local mode does not support HNSW
index_params = {
    "index_type": "FLAT",
    "metric_type": "IP",
}
col.create_index(field_name="embedding", index_params=index_params)
print("Index created.")

# Load the collection to memory to serve queries
col.load()

2025-09-28 13:22:54,937 [ERROR][handler]: RPC error: [create_index], <MilvusException: (code=65535, message=invalid index type: HNSW, local mode only support FLAT IVF_FLAT AUTOINDEX: )>, <Time:{'RPC start': '2025-09-28 13:22:54.936065', 'RPC error': '2025-09-28 13:22:54.937634'}> (decorators.py:140)


Connected to Milvus Lite: True
Created collection: rag_mini_wiki_chunks


MilvusException: <MilvusException: (code=65535, message=invalid index type: HNSW, local mode only support FLAT IVF_FLAT AUTOINDEX: )>

In [None]:
import sys, subprocess
# Milvus Lite is installed through the milvus_lite extra of pymilvus
print("Installing pymilvus[milvus_lite]...")
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "pymilvus[milvus_lite]"])
print("Installation complete.")

Installing pymilvus[milvus_lite]...
Installation complete.



## 6. Insert Data into Milvus

We insert `id`, `embedding`, and `text` columns in aligned order.


In [None]:

# Prepare entities
ids = [d["id"] for d in docs]
texts = [d["text"] for d in docs]
entities = [ids, emb.tolist(), texts]

# Insert and flush
mr = col.insert(entities)
col.flush()

print("Inserted rows:", mr.insert_count)
print("Collection num_entities:", col.num_entities)
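
Column-based inserts depend on every column having the same length and order, and large collections are often inserted in batches rather than in one call. The sketch below shows one way to check alignment and slice the columns; `iter_column_batches` is a hypothetical helper and the batch size is an arbitrary assumption, not a Milvus requirement.

```python
def iter_column_batches(columns, batch_size=1000):
    """Yield column-aligned slices; every column must have the same length.

    Hypothetical helper for illustration. `columns` is a list of equally
    long lists, e.g. [ids, vectors, texts].
    """
    lengths = {len(c) for c in columns}
    if len(lengths) != 1:
        raise ValueError(f"columns are misaligned: lengths {sorted(lengths)}")
    n = lengths.pop()
    for start in range(0, n, batch_size):
        yield [c[start:start + batch_size] for c in columns]

# Toy data standing in for ids, embeddings, and texts
demo_ids = ["0-0", "0-1", "1-0"]
demo_vecs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
demo_texts = ["a", "b", "c"]

batches = list(iter_column_batches([demo_ids, demo_vecs, demo_texts], batch_size=2))
print(len(batches))
```

In this notebook, each yielded batch would have the same shape as `entities` and could presumably be passed to `col.insert(...)` in a loop, followed by a single `col.flush()`.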



## 7. Define a Retrieve Function

- Encode the query with the same embedding model
- Search top-k neighbors using `col.search` on the vector field
- Use IP metric (cosine similarity on normalized vectors)


In [None]:

def retrieve(query, top_k=5, output_fields=("id","text")):
    qv = model.encode([query], normalize_embeddings=True).astype("float32")[0].tolist()
    # FLAT is an exact search and needs no tuning parameters (ef applies only to HNSW)
    search_params = {"metric_type": "IP", "params": {}}
    res = col.search(
        data=[qv],
        anns_field="embedding",
        param=search_params,
        limit=top_k,
        output_fields=list(output_fields),
    )
    hits = []
    for h in res[0]:
        eid = h.entity.get("id")
        etext = h.entity.get("text")
        hits.append((eid, etext, float(h.distance)))
    return hits

# Quick smoke test
print(retrieve("What is the capital of France?", top_k=3))
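
With the FLAT index created in Section 5, the search compares the query against every stored vector. The toy sketch below mirrors that idea in plain Python: score everything by inner product, then keep the best `top_k`. It is a conceptual illustration only, not Milvus's implementation, and `flat_search` is a hypothetical name.

```python
import heapq

def flat_search(query_vec, vectors, top_k=5):
    """Exhaustive inner-product search: score every vector, keep the top_k.

    Returns (index, score) pairs sorted by score, highest first.
    """
    scored = (
        (sum(q * v for q, v in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(vectors)
    )
    # nlargest compares (score, idx) tuples, so ties go to the higher index
    return [(idx, score) for score, idx in heapq.nlargest(top_k, scored)]

toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # already unit length
toy_hits = flat_search([1.0, 0.0], toy_vectors, top_k=2)
print(toy_hits)
```

Exhaustive search is exact but scales linearly with collection size, which is acceptable for a few thousand chunks like the ones indexed here.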



## 8. Minimal Generation with Context

This cell provides two options:
- **OpenAI**: If `OPENAI_API_KEY` is set, call a chat completion model (e.g., `gpt-4o-mini`).  
- **Local fallback**: If no key is present, return a template answer with the top context (for testing the pipeline).


In [None]:

def answer_with_context(query, top_k=5, max_ctx_chars=2000):
    hits = retrieve(query, top_k=top_k)
    context = "\n\n".join([h[1] for h in hits])[:max_ctx_chars]

    try:
        from openai import OpenAI
        # Read the key from the environment; never hard-code keys in a notebook
        api_key = os.getenv("OPENAI_API_KEY")
        if not api_key:
            raise RuntimeError("OPENAI_API_KEY is not set")
        client = OpenAI(api_key=api_key)

        prompt = (
            "You are a helpful assistant. Answer strictly using the provided context. "
            "If the context is insufficient, answer 'I don't know.'\n\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role":"user","content": prompt}],
            temperature=0.2,
        )
        answer = resp.choices[0].message.content
    except Exception as e:
        # Fallback for a missing key or a failed API call; report the actual cause
        answer = (
            f"[Local fallback] LLM call skipped or failed ({e}). Top context snippet:\n\n"
            + context[:500] + ("..." if len(context) > 500 else "")
        )
    return answer, hits
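
Slicing the joined context with `[:max_ctx_chars]` can cut the last passage mid-sentence. One alternative is to add whole passages in rank order until the character budget is exhausted. The sketch below is an assumption about how that could look; `build_context` is a hypothetical helper, not part of the notebook.

```python
def build_context(passages, max_ctx_chars=2000, sep="\n\n"):
    """Join whole passages in rank order until the character budget is used up."""
    kept, used = [], 0
    for p in passages:
        # The separator only costs characters once a passage is already kept
        cost = len(p) + (len(sep) if kept else 0)
        if used + cost > max_ctx_chars:
            break
        kept.append(p)
        used += cost
    return sep.join(kept)

demo_ctx = build_context(["aaaa", "bbbb", "cccc"], max_ctx_chars=10)
print(repr(demo_ctx))
```

Note that this version returns an empty string when even the top passage exceeds the budget, so a caller may want to fall back to plain truncation in that case.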



## 9. End-to-End Test

As a smoke test, we take the first question from the QA split and run it through the full pipeline. A single question shows that the pieces connect end to end; it does not measure answer quality.


In [None]:

sample_q = qa[0]["question"]
print("Question:", sample_q)

answer, hits = answer_with_context(sample_q, top_k=5)
print("\n=== Answer ===\n", answer)
print("\n=== Top Hits (id, score, text) ===")
for eid, etxt, score in hits:
    print(eid, "| score:", round(score, 4), "| text:", (etxt[:120] + "..."))
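
To go beyond eyeballing a single answer, the generated text can be compared against the reference `answer` field of the QA split. The sketch below shows a normalized exact-match check of the kind commonly used for short-answer QA; the normalization rules (lowercasing, dropping punctuation and English articles) are an assumption, and `normalize_answer` and `exact_match` are hypothetical names.

```python
import re
import string

def normalize_answer(s):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, reference):
    """True when both strings are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(reference)

print(exact_match("The capital is Montevideo.", "capital is montevideo"))
```

In this notebook the comparison would presumably be `exact_match(answer, qa[0]["answer"])`. Exact match is strict, so a verbose but correct LLM answer will often score as a miss.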
