<a href="https://colab.research.google.com/github/indranildchandra/rag-done-right-in-production/blob/main/rag_done_right_production.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# RAG Done Right in Production
### Demo Notebook — Zerodha Support FAQ Corpus

**Talk:** RAG Done Right in Production · Indranil Chandra, ML & Data Architect, Upstox

---

This notebook is a **self-contained production RAG demo** that runs entirely inside Google Colab — no external API keys required. It builds a real retrieval pipeline on top of Zerodha's public support FAQ corpus.

**What this covers, in order:**

| # | Section | What you will observe |
|---|---------|----------------------|
| 1 | **Corpus Ingestion** | Scrape + parse Zerodha support FAQs |
| 2 | **Chunking** | Sentence-aware chunking vs naive fixed-size — side-by-side |
| 3 | **Qdrant (in-memory)** | Embed + index with `all-MiniLM-L6-v2` |
| 4a | **Hybrid Search** | BM25 + dense retrieval merged with RRF; exact-term recall demo |
| 4b | **MMR Reranking** | Maximal Marginal Relevance — diversity over redundancy |
| 5 | **Cross-Encoder Reranking** | Two-stage pipeline: recall then precision; per-stage latency |
| 6 | **Adaptive-k** | Score-gap boundary vs fixed top-k |
| 7 | **Generation + Citations** | LLM answer grounded with source attribution |
| 8 | **Evaluation** | Hit rate, Faithfulness, Hallucination detection |
| 9 | **Interactive** | 10 pre-loaded example queries — run any question end-to-end |

> **Runtime:** GPU not required. CPU is sufficient. First run (fresh crawl + model downloads): ~12–18 min. Subsequent runs (corpus cache hit): ~8–12 min.


---
## 0 · Install Dependencies

In [None]:
%%capture
import importlib.util, subprocess, sys

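# Pinned versions, tested on Google Colab and Python 3.10+.
# A pin is applied only when the module is missing: find_spec() checks
# presence, not version, so preinstalled packages keep their installed version.
# numpy and scikit-learn are imported below and assumed to be preinstalled.
# To regenerate pins after a working install:
#   !pip freeze | grep -E "requests|qdrant|sentence|bm25|beautifulsoup|transformers|openai|tiktoken|tqdm|colorama"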
_PACKAGES = {
    "qdrant_client":         "qdrant-client==1.17.0",
    "sentence_transformers": "sentence-transformers==5.2.3",
    "rank_bm25":             "rank-bm25==0.2.2",
    "bs4":                   "beautifulsoup4>=4.13.5",
    "requests":              "requests==2.32.4",
    "transformers":          "transformers==5.0.0",
    "openai":                "openai==2.23.0",
    "tiktoken":              "tiktoken==0.12.0",
    "tqdm":                  "tqdm==4.67.3",
    "colorama":              "colorama==0.4.6",
}

_missing = [spec for mod, spec in _PACKAGES.items()
            if importlib.util.find_spec(mod) is None]
if _missing:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--quiet"] + _missing)


In [None]:
!pip freeze | grep -E "requests|qdrant|sentence|bm25|beautifulsoup|transformers|openai|tiktoken|tqdm|colorama"

In [None]:
import os, re, json, time, random, hashlib, warnings
from dataclasses import dataclass, field
from typing import List, Dict, Tuple, Optional
from tqdm.auto import tqdm
import requests
from bs4 import BeautifulSoup
import numpy as np

from sentence_transformers import SentenceTransformer, CrossEncoder
from sklearn.metrics.pairwise import cosine_similarity
from rank_bm25 import BM25Okapi

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue
)

warnings.filterwarnings('ignore')

# ── Colour helpers for readable notebook output ────────────────────
from colorama import Fore, Style, init as colorama_init
colorama_init(autoreset=True)

def hdr(msg):  print(f"\n{Fore.CYAN}{Style.BRIGHT}{'─'*60}\n  {msg}\n{'─'*60}{Style.RESET_ALL}")
def ok(msg):   print(f"{Fore.GREEN}✓  {msg}{Style.RESET_ALL}")
def info(msg): print(f"{Fore.YELLOW}ℹ  {msg}{Style.RESET_ALL}")
def err(msg):  print(f"{Fore.RED}✗  {msg}{Style.RESET_ALL}")

ok("Imports complete.")

---
## 1 · Corpus Ingestion — Zerodha Support FAQ Scraper

The scraper works in **three passes**:

1. **Sub-category discovery** — fetch each of the 6 top-level section pages and collect sub-category URLs, filling gaps from a hard-coded fallback list  
2. **Article link extraction** — for each sub-category page, collect article hrefs  
3. **Article fetch + parse** — extract `<h1>` title and body text, strip nav/footer boilerplate  

We scrape politely: a 0.4s delay after each successful request, 6 worker threads for article fetching, and up to 3 attempts per URL with exponential backoff. All articles across all 32 sub-categories are scraped, with no cap.
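
The retry policy can be sketched as a pure function that lists the sleeps a single URL could incur. This is a simplified illustration rather than the notebook's fetcher: `backoff_schedule` is a hypothetical helper, and it assumes the wait doubles on every attempt starting at 1 second, with no sleep after the final attempt.

```python
from typing import List


def backoff_schedule(retries: int = 3, base: float = 1.0) -> List[float]:
    """Seconds to sleep after each failed attempt (hypothetical helper).

    Assumes the wait doubles on every attempt and that nothing is slept
    after the final attempt, since no further request follows it.
    """
    return [base * (2 ** attempt) for attempt in range(retries - 1)]


print(backoff_schedule())        # waits after attempts 1 and 2
print(sum(backoff_schedule(5)))  # worst-case total sleep with 5 attempts
```

Keeping the schedule separate from the request loop makes it easy to check the worst-case time a stubborn URL can cost before raising the retry count.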

In [None]:
REDOWNLOAD_DATA = False           # True → re-crawl regardless of cache
CORPUS_CACHE    = "zerodha_faqs.json"
FETCH_WORKERS   = 6               # threads for parallel article fetching (I/O-bound)

BASE_URL  = "https://support.zerodha.com"
DELAY_SEC = 0.4
RETRIES   = 3

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    )
}

# ── Discovery config ───────────────────────────────────────────────────────
# The 6 top-level section URLs are stable (Zerodha's nav structure hasn't changed
# in years). run-scraper fetches each section page, parses sub-category hrefs
# automatically, then scrapes all articles in every sub-category.
TOP_LEVEL_SECTIONS = [
    "/category/account-opening",
    "/category/your-zerodha-account",
    "/category/trading-and-markets",
    "/category/funds",
    "/category/console",
    "/category/mutual-funds",
]

# Sub-categories to skip — no Q&A article content
EXCLUDED_SLUGS = {"glossary"}

# ── Fallback sub-categories ────────────────────────────────────────────────
# Used if live discovery returns 0 results for a section (e.g. JS rendering).
# Derived from a live fetch of https://support.zerodha.com/ on 2026-02-28.
# Update this list if Zerodha adds new sections in the future.
FALLBACK_SUBCATEGORIES: List[Tuple[str, str]] = [
    # account-opening
    ("/category/account-opening/resident-individual",                             "Resident Individual"),
    ("/category/account-opening/minor",                                           "Minor"),
    ("/category/account-opening/nri-account-opening",                            "NRI"),
    ("/category/account-opening/company-partnership-and-huf-account-opening",    "Corporate HUF LLP"),
    # your-zerodha-account
    ("/category/your-zerodha-account/your-profile",                              "Your Profile"),
    ("/category/your-zerodha-account/account-modification-and-segment-addition", "Account Modification"),
    ("/category/your-zerodha-account/dp-id-and-bank-details",                   "DP and Bank Details"),
    ("/category/your-zerodha-account/nomination-process",                        "Nomination"),
    ("/category/your-zerodha-account/transfer-of-shares-and-conversion-of-shares", "Transfer of Shares"),
    # trading-and-markets
    ("/category/trading-and-markets/trading-faqs",                               "Trading FAQs"),
    ("/category/trading-and-markets/margins",                                    "Margins"),
    ("/category/trading-and-markets/charts-and-orders",                          "Charts and Orders"),
    ("/category/trading-and-markets/general-kite",                               "General Kite"),
    ("/category/trading-and-markets/charges",                                    "Charges"),
    ("/category/trading-and-markets/ipo",                                        "IPO"),
    ("/category/trading-and-markets/alerts-and-nudges",                          "Alerts and Nudges"),
    # funds
    ("/category/funds/adding-funds",                                             "Adding Funds"),
    ("/category/funds/fund-withdrawal",                                          "Fund Withdrawal"),
    ("/category/funds/adding-bank-accounts",                                     "Adding Bank Accounts"),
    ("/category/funds/mandate",                                                  "Mandate"),
    # console
    ("/category/console/portfolio",                                              "Console Portfolio"),
    ("/category/console/corporate-actions",                                      "Corporate Actions"),
    ("/category/console/ledger",                                                 "Ledger"),
    ("/category/console/reports",                                                "Reports"),
    ("/category/console/profile",                                                "Console Profile"),
    ("/category/console/segments",                                               "Segments"),
    # mutual-funds
    ("/category/mutual-funds/understanding-mutual-funds",                        "Understanding Mutual Funds"),
    ("/category/mutual-funds/payments-and-orders",                               "MF Payments and Orders"),
    ("/category/mutual-funds/nps",                                               "NPS"),
    ("/category/mutual-funds/fixed-deposits",                                    "Fixed Deposits"),
    ("/category/mutual-funds/features-on-coin",                                  "Features on Coin"),
    ("/category/mutual-funds/coin-general",                                      "Coin General"),
]

hdr("Scraper configuration")
info(f"Base URL:          {BASE_URL}")
info(f"Top-level sections: {len(TOP_LEVEL_SECTIONS)} (sub-categories auto-discovered)")
info(f"Fallback entries:   {len(FALLBACK_SUBCATEGORIES)} known sub-categories")
info(f"Delay:              {DELAY_SEC}s between requests")
info(f"Workers:            {FETCH_WORKERS} threads (parallel article fetch)")
info(f"Cache file:         {CORPUS_CACHE}  (REDOWNLOAD_DATA={REDOWNLOAD_DATA})")


In [None]:
@dataclass
class ZerodhaArticle:
    url:      str
    title:    str
    body:     str
    category: str
    doc_id:   str = field(init=False)

    def __post_init__(self):
        self.doc_id = hashlib.md5(self.url.encode()).hexdigest()[:12]


def _get(url: str, retries: int = RETRIES) -> Optional[BeautifulSoup]:
    """Fetch a URL and return a BeautifulSoup object. Returns None on failure."""
    full = url if url.startswith("http") else BASE_URL + url
    _fail = "unknown"
    for attempt in range(retries):
        try:
            r = requests.get(full, headers=HEADERS, timeout=15)
            if r.status_code == 200:
                time.sleep(DELAY_SEC)
                return BeautifulSoup(r.text, "html.parser")
            elif r.status_code == 429:
                _fail = "rate-limited (429)"
                wait = 2 ** (attempt + 1)
                if attempt < retries - 1:   # don't sleep after the last attempt
                    info(f"Rate limited — waiting {wait}s")
                    time.sleep(wait)
            else:
                _fail = f"HTTP {r.status_code}"
        except requests.RequestException as e:
            _fail = type(e).__name__
            if attempt < retries - 1:       # don't sleep after the last attempt
                time.sleep(2 ** attempt)
    err(f"Skipped [{_fail}] after {retries} attempts: {full}")
    return None


def _discover_subcategories(section_url: str) -> List[Tuple[str, str]]:
    """
    Fetch a top-level section page and extract all sub-category hrefs.

    Zerodha's section pages (e.g. /category/account-opening) are server-side
    rendered and list sub-category links in the nav sidebar and main content.
    We parse any <a href="/category/{section}/{sub-cat}"> links — exactly 3
    path segments, no /articles/ — to discover the full sub-category set.

    Returns List of (href, label) tuples, or [] if the page is unreachable.
    The caller falls back to FALLBACK_SUBCATEGORIES on an empty result.
    """
    section_slug = section_url.strip("/").split("/")[-1]
    soup = _get(section_url)
    if not soup:
        return []
    seen, results = set(), []
    for a in soup.find_all("a", href=True):
        href = a["href"]
        parts = [p for p in href.split("/") if p]
        if (len(parts) == 3
                and parts[0] == "category"
                and parts[1] == section_slug
                and "articles" not in href
                and parts[2] not in EXCLUDED_SLUGS
                and href not in seen):
            seen.add(href)
            label = a.get_text(strip=True) or parts[2].replace("-", " ").title()
            results.append((href, label))
    return results


def _extract_article_links(category_url: str) -> List[str]:
    """
    From a sub-category page, collect all /articles/ hrefs.
    Zerodha renders article links as <a href="/category/.../articles/slug">
    Works for any depth: /category/{section}/{sub}/{topic}/articles/{slug}
    """
    soup = _get(category_url)
    if not soup:
        return []
    links = set()
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if "/articles/" in href and href.startswith("/category/"):
            links.add(href)
    return list(links)


# Boilerplate patterns to strip from article body
_BOILERPLATE = re.compile(
    r"(Updates|Education|Utilities|Support Portal|Related articles|Quick links"
    r"|Signup|About|Products|Pricing|Zerodha Broking|© 20\d\d)",
    re.IGNORECASE
)


def _parse_article(href: str, category: str) -> Optional[ZerodhaArticle]:
    """
    Fetch and parse a single article page.
    Strategy:
      - Title  → <h1> tag
      - Body   → all <p>/<li> tags after the <h1>, before 'Related articles'
    """
    soup = _get(href)
    if not soup:
        return None

    h1 = soup.find("h1")
    if not h1:
        return None
    title = h1.get_text(strip=True)

    # Collect paragraphs that appear after the <h1>
    paragraphs = []
    in_body = False
    for tag in soup.find_all(["h1", "h2", "h3", "p", "li"]):
        if tag is h1:   # identity check: only the first <h1> itself starts the body
            in_body = True
            continue
        if not in_body:
            continue
        text = tag.get_text(" ", strip=True)
        if not text or len(text) < 15:
            continue
        if re.search(r"related articles", text, re.IGNORECASE):
            break
        if _BOILERPLATE.search(text):
            continue
        paragraphs.append(text)

    body = " ".join(paragraphs).strip()
    if len(body) < 80:   # skip stubs
        return None

    return ZerodhaArticle(
        url=BASE_URL + href,
        title=title,
        body=body,
        category=category,
    )


ok("Scraper functions defined.")


In [None]:
%%time
_from_cache = (not REDOWNLOAD_DATA) and os.path.exists(CORPUS_CACHE)

if _from_cache:
    hdr("Corpus cache found — skipping discovery")
    ok(f"Reusing: {CORPUS_CACHE}  (set REDOWNLOAD_DATA=True in scraper-config to re-crawl)")
else:
    hdr("Discovering and scraping all Zerodha support categories")
    
    # ── Phase 1: Auto-discover sub-categories ─────────────────────────────────
    # For each section, merge live-discovered sub-categories with the known
    # FALLBACK_SUBCATEGORIES. This handles two cases:
    #   (a) Live discovery returns 0 (JS-rendered section) → use all fallback entries
    #   (b) Live discovery finds SOME but misses others (e.g. charges not in nav)
    #       → fallback fills the gaps without adding duplicates
    all_subcategories: List[Tuple[str, str]] = []
    
    for section_url in TOP_LEVEL_SECTIONS:
        section_name = section_url.split("/")[-1].replace("-", " ").title()
        section_slug = section_url.strip("/").split("/")[-1]
        # Use startswith to avoid substring matches (e.g. "funds" ⊂ "mutual-funds")
        section_prefix = f"/category/{section_slug}/"
    
        live       = _discover_subcategories(section_url)
        known      = [(h, l) for h, l in FALLBACK_SUBCATEGORIES if h.startswith(section_prefix)]
        live_hrefs = {h for h, _ in live}
    
        # Start with live results; append any known entries missing from live
        merged = list(live)
        gap    = [(h, l) for h, l in known if h not in live_hrefs]
        merged += gap
    
        if not live:
            info(f"{section_name}: live discovery returned 0 — using {len(known)} fallback entries")
        elif gap:
            info(f"{section_name}: {len(live)} discovered + {len(gap)} from fallback = {len(merged)}")
        else:
            info(f"{section_name}: {len(live)} sub-categories discovered (complete)")
    
        all_subcategories.extend(merged)
    
    ok(f"Total sub-categories: {len(all_subcategories)}")
    
    # ── Phase 2: Extract article links from every sub-category page ────────────
    all_links: Dict[str, str] = {}      # href → category label
    cat_link_map: Dict[str, List[str]] = {}
    
    for sub_cat_href, label in tqdm(all_subcategories, desc="Discovering articles"):
        links = _extract_article_links(sub_cat_href)
        for lnk in links:
            if lnk not in all_links:    # first category label wins on cross-category duplicates
                all_links[lnk] = label
        cat_link_map[label] = links
        info(f"  {label}: {len(links)} articles")
    
    # Order doesn't affect the corpus (all articles are fetched); seed kept so
    # partial runs are consistent if fetch-articles is interrupted mid-way.
    random.seed(42)
    all_link_items = list(all_links.items())
    random.shuffle(all_link_items)
    
    ok(f"Total unique articles to fetch: {len(all_link_items)}")
    info(f"Sub-categories scraped: {len(all_subcategories)}")

In [None]:
%%time
from concurrent.futures import ThreadPoolExecutor, as_completed

if _from_cache:
    with open(CORPUS_CACHE) as _f:
        _raw = json.load(_f)
    articles = [
        ZerodhaArticle(url=d["url"], title=d["title"], body=d["body"], category=d["category"])
        for d in _raw
    ]
    ok(f"Loaded {len(articles)} articles from cache ({CORPUS_CACHE})")
else:
    # Network I/O releases the GIL → threads give near-linear speedup here.
    # FETCH_WORKERS=6 means ~6 concurrent requests; each thread still
    # respects DELAY_SEC inside _get().
    articles: List[ZerodhaArticle] = []
    with ThreadPoolExecutor(max_workers=FETCH_WORKERS) as pool:
        futures = {pool.submit(_parse_article, href, cat): href
                   for href, cat in all_link_items}
        for fut in tqdm(as_completed(futures), total=len(futures), desc="Fetching articles"):
            art = fut.result()
            if art:
                articles.append(art)
    ok(f"Successfully parsed {len(articles)} articles")

# Quick sanity check — show 3 samples
print()
for art in articles[:3]:
    print(f"  [{art.category}]  {art.title}")
    print(f"  Body ({len(art.body)} chars): {art.body[:120]}...")
    print(f"  URL: {art.url}")
    print()

ℹ  Rate limited — waiting 2s

In [None]:
# Persist to disk — so you can reload without re-scraping
if _from_cache:
    ok(f"Corpus loaded from cache — skipped re-save ({CORPUS_CACHE})")
else:
    with open(CORPUS_CACHE, "w") as f:
        json.dump(
            [{"doc_id": a.doc_id, "url": a.url, "title": a.title,
              "body": a.body, "category": a.category} for a in articles],
            f, indent=2
        )
    ok(f"Corpus saved → {CORPUS_CACHE}  ({os.path.getsize(CORPUS_CACHE)//1024} KB)")

# Category distribution
from collections import Counter
dist = Counter(a.category for a in articles)
print("\nCategory distribution:")
for cat, cnt in dist.most_common():
    bar = "█" * cnt
    print(f"  {cat:<50} {cnt:>3}  {bar}")


---
## 2 · Chunking Strategy

**The problem with fixed-size chunking:** A 512-token window that splits mid-sentence loses the semantic unit. Two adjacent chunks end up containing half a thought each — both retrieve poorly.

**What we use here:** Sentence-boundary aware chunking with a configurable overlap. Each chunk respects sentence boundaries, with a sliding window overlap to preserve cross-sentence context.

For a support FAQ corpus specifically, most articles are short enough (200–600 tokens) that each article yields only 1–3 chunks. The overlap handles the edge case where a key fact straddles a chunk boundary.

**Production note:** For longer documents (regulatory PDFs, annual reports), switch to hierarchical chunking: paragraph-level for indexing, sentence-level for reranking.
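
A minimal sketch of the sentence-boundary split, using the same lookbehind pattern that `sentence_aware_chunk()` applies. The helper name `split_sentences` and the example sentences are made up for illustration; they are not taken from the corpus.

```python
import re
from typing import List

# Break on whitespace that directly follows '.', '!' or '?'
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences, keeping the terminal punctuation."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


clean = split_sentences("Withdrawals are processed daily. Funds reflect the next day!")
tricky = split_sentences("Brokerage is Rs. 20 per order. It is charged daily.")
print(clean)   # two sentences
print(tricky)  # three pieces: the abbreviation "Rs." triggers an extra split
```

The second example shows the known weakness of a regex splitter: abbreviations that end in a period are treated as sentence ends. For FAQ text this is usually tolerable, since the pieces are re-joined inside the same chunk most of the time.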

In [None]:
%%time
@dataclass
class Chunk:
    chunk_id:   str          # doc_id_chunk_idx
    doc_id:     str
    title:      str
    text:       str          # chunk text that gets embedded
    full_text:  str          # full article body — for citation display
    url:        str
    category:   str
    chunk_idx:  int


def sentence_aware_chunk(
    article: ZerodhaArticle,
    max_chars: int = 800,
    overlap_chars: int = 120,
) -> List[Chunk]:
    """
    Split article body into overlapping sentence-boundary-aware chunks.
    Each chunk prepends the article title for embedding context.
    """
    # Split on sentence boundaries
    sentences = re.split(r'(?<=[.!?])\s+', article.body.strip())
    sentences = [s.strip() for s in sentences if len(s.strip()) > 10]

    chunks = []
    buf, buf_len = [], 0
    idx = 0

    for sent in sentences:
        if buf_len + len(sent) > max_chars and buf:
            text = " ".join(buf)
            chunks.append(Chunk(
                chunk_id  = f"{article.doc_id}_{idx}",
                doc_id    = article.doc_id,
                title     = article.title,
                text      = f"{article.title}. {text}",
                full_text = article.body,
                url       = article.url,
                category  = article.category,
                chunk_idx = idx,
            ))
            idx += 1
            # Carry over tail sentences for overlap
            overlap_buf, overlap_len = [], 0
            for s in reversed(buf):
                if overlap_len + len(s) < overlap_chars:
                    overlap_buf.insert(0, s)
                    overlap_len += len(s)
                else:
                    break
            buf, buf_len = overlap_buf, overlap_len

        buf.append(sent)
        buf_len += len(sent)

    # Flush remainder
    if buf:
        text = " ".join(buf)
        chunks.append(Chunk(
            chunk_id  = f"{article.doc_id}_{idx}",
            doc_id    = article.doc_id,
            title     = article.title,
            text      = f"{article.title}. {text}",
            full_text = article.body,
            url       = article.url,
            category  = article.category,
            chunk_idx = idx,
        ))

    return chunks


# Build chunk corpus
all_chunks: List[Chunk] = []
for art in articles:
    all_chunks.extend(sentence_aware_chunk(art))

ok(f"{len(articles)} articles → {len(all_chunks)} chunks")
avg_len = sum(len(c.text) for c in all_chunks) / len(all_chunks)
info(f"Average chunk length: {avg_len:.0f} chars")

# Show chunking on a sample article
sample_art = max(articles, key=lambda a: len(a.body))
sample_chunks = sentence_aware_chunk(sample_art)
print(f"\nSample article: '{sample_art.title}'")
print(f"Body length: {len(sample_art.body)} chars → {len(sample_chunks)} chunks")
for i, c in enumerate(sample_chunks):
    print(f"  Chunk {i}: {len(c.text)} chars | '{c.text[:80]}...'")

### Chunking decisions and their knock-on effects

Every number in `sentence_aware_chunk()` was chosen to respect a constraint further down the pipeline. Ideally, with a stronger embedding model (such as `text-embedding-3-small`), we would consider semantic chunking with one chunk per `doc_id` (one record per FAQ article), which removes much of this overhead.

Here is the full reasoning:

**`max_chars=800` — why not larger?**
`all-MiniLM-L6-v2` (used in Section 3) has a **hard 256-token max sequence length**. A token is roughly 4 characters of English text, so 256 tokens ≈ 1024 characters. At 800 chars the body sits at ~200 tokens, which leaves room for the prepended title and for tokeniser overhead on longer words. Any input over ~1000 chars will be silently truncated by the model, making the tail of the chunk invisible to retrieval. Note that `max_chars` is only checked before a sentence is appended, so a single sentence longer than 800 chars still becomes one oversized chunk. This is the single most important constraint in the chunking design.
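
The budget arithmetic above can be checked with a rough estimator. This is a sketch under the document's 4-chars-per-token rule of thumb, not a real tokenizer; `estimated_tokens` and `fits_budget` are hypothetical helpers.

```python
import math


def estimated_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (rule of thumb, not a tokenizer)."""
    return math.ceil(len(text) / chars_per_token)


def fits_budget(title: str, body: str, max_tokens: int = 256) -> bool:
    """True if '{title}. {body}' is estimated to fit the model's sequence limit."""
    return estimated_tokens(f"{title}. {body}") <= max_tokens


body = "x" * 800                     # a chunk body at the max_chars target
print(estimated_tokens(body))        # body alone
print(fits_budget("t" * 60, body))   # a typical short title still fits
print(fits_budget("t" * 300, body))  # a very long title would push it over
```

For an exact count, the model's own tokenizer would have to be used instead of the character heuristic.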

**`overlap_chars=120` — why overlap at all?**
Answers often straddle a chunk boundary. Without overlap, a question about "when does a withdrawal reflect in the bank account?" might hit a chunk that ends with "...withdrawal is processed within 24 hours" and miss the next chunk that starts with "Funds appear in your account between 9 AM–5 PM on business days." The overlap carries the trailing sentences of one chunk into the next as long as they total under 120 chars, so a short bridging sentence appears in both chunks and at least one of them retrieves correctly. A final sentence longer than 120 chars is not carried over.
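
The carry-over step can be isolated as a small function. This sketch mirrors the tail-walking loop in `sentence_aware_chunk()`; the name `carry_overlap` and the placeholder sentences are illustrative.

```python
from typing import List


def carry_overlap(sentences: List[str], overlap_chars: int = 120) -> List[str]:
    """Return the trailing sentences that fit within overlap_chars.

    Walks backwards from the end of the finished chunk and stops at the
    first sentence that would reach the budget, so order is preserved.
    """
    kept: List[str] = []
    used = 0
    for sent in reversed(sentences):
        if used + len(sent) >= overlap_chars:
            break
        kept.insert(0, sent)
        used += len(sent)
    return kept


short_tail = ["A" * 300, "B" * 50, "C" * 40]
long_tail = ["A" * 300, "B" * 150]
print(len(carry_overlap(short_tail)))  # both short tail sentences carry over
print(carry_overlap(long_tail))        # tail sentence too long: nothing carries over
```

Because the overlap is measured in whole sentences, chunks that end in a long sentence start the next chunk with no shared context.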

**Sentence-boundary splitting — why not fixed character windows?**
A fixed 800-char split lands mid-sentence roughly 40% of the time (the naive_chunk comparison below shows the effect with a 200-char window). A partial sentence embedded as a vector produces a noisy representation — the model encodes an incomplete thought. Sentence-aware splitting keeps every chunk made of whole sentences, so each chunk is much closer to a self-contained semantic unit. The embedding quality difference is measurable: mid-sentence chunks rank 15–30% lower in retrieval experiments on short-form Q&A corpora.

**`chunk.text` is always `"{title}. {body_text}"` — why prefix the title?**
A chunk in isolation loses its document identity. Consider the text *"This can be done from the Console under Funds."* — without the title prefix, the embedding has no signal that this is about withdrawals. Prepending `"How to withdraw money from the Zerodha account. "` anchors the chunk's semantic meaning to its article topic. This is especially important for BM25: the title words ("withdraw", "money", "account") now appear in every chunk and participate in term-frequency scoring.


In [None]:

# ── Naive fixed-size chunker (baseline for comparison) ─────────────
def naive_chunk(article: ZerodhaArticle, chunk_size: int = 200) -> List[Chunk]:
    """
    Fixed-size chunking with no regard for sentence boundaries.
    This is what most tutorials start with — and what costs you retrieval quality.
    """
    body = article.body
    chunks = []
    idx = 0
    for start in range(0, len(body), chunk_size):
        text = body[start:start + chunk_size]
        if len(text.strip()) < 20:
            continue
        chunks.append(Chunk(
            chunk_id  = f"{article.doc_id}_naive_{idx}",
            doc_id    = article.doc_id,
            title     = article.title,
            text      = f"{article.title}. {text}",
            full_text = article.body,
            url       = article.url,
            category  = article.category,
            chunk_idx = idx,
        ))
        idx += 1
    return chunks


# ── Side-by-side comparison on the longest article ──────────────────
hdr("Chunking Strategy: Naive Fixed-size vs Sentence-Aware")
naive_chunks  = naive_chunk(sample_art, chunk_size=200)
smart_chunks  = sentence_aware_chunk(sample_art)

print(f"  Article: '{sample_art.title}'  ({len(sample_art.body)} chars)\n")

print(f"{Fore.RED}Naive fixed-size (200 chars): {len(naive_chunks)} chunks{Style.RESET_ALL}")
for i, c in enumerate(naive_chunks[:2]):
    snippet = c.text[len(sample_art.title) + 2:]
    print(f"  Chunk {i}: |{repr(snippet[:120])}|")
    if i == 0:
        # Highlight mid-sentence split
        if not snippet.rstrip().endswith(('.', '!', '?')):
            print(f"            {Fore.RED}⚠ ends mid-sentence{Style.RESET_ALL}")

print(f"\n{Fore.GREEN}Sentence-aware (800 chars, 120 overlap): {len(smart_chunks)} chunks{Style.RESET_ALL}")
for i, c in enumerate(smart_chunks[:2]):
    snippet = c.text[len(sample_art.title) + 2:]
    print(f"  Chunk {i}: |{repr(snippet[:120])}|")

print(f"\n{Fore.YELLOW}Key insight:{Style.RESET_ALL} Naive chunks split mid-sentence. "
      "Sentence-aware chunks always start and end at natural boundaries — "
      "semantic units stay intact.")


---
## 3 · Qdrant In-Memory Vector Store

We use **Qdrant's in-memory mode** — no Docker, no external service, no API key. Same Python client API as production Qdrant Cloud. This means your demo code is identical to what you would run in prod — just swap `QdrantClient(':memory:')` for `QdrantClient(url='https://your-cluster.qdrant.io', api_key='...')`.

**Embedding model:** `all-MiniLM-L6-v2`  
- 384 dimensions, 22M parameters, ~80ms per 100 chunks on CPU  
- Production note: for a finance domain, fine-tuned models on SEBI/NSE corpus improve retrieval recall by 10–15%  
- 768-dim models (MPNet) give diminishing returns at this corpus size

In [None]:
%%time
hdr("Loading embedding model")
EMBED_MODEL_NAME = "all-MiniLM-L6-v2"
embed_model = SentenceTransformer(EMBED_MODEL_NAME)
EMBED_DIM = embed_model.get_sentence_embedding_dimension()
ok(f"Model: {EMBED_MODEL_NAME} | Dim: {EMBED_DIM}")

### Why all-MiniLM-L6-v2 and what its constraints mean for the rest of the pipeline

**Why this model?**
`all-MiniLM-L6-v2` is a 22M-parameter distilled sentence encoder. It runs comfortably on a free Colab CPU (~22ms per batch of 64 chunks), requires no API key, and produces embeddings that rank in the top tier of the MTEB retrieval benchmarks for its size class. For a production demo with a finite corpus, the recall quality is more than sufficient.

**Hard constraint: 256-token maximum sequence length.**
This is the single number that shapes the entire chunking strategy in Section 2. The tokeniser will silently truncate any input beyond 256 tokens — no warning, no error, just a shorter embedding. At ~4 chars/token, the safe upper bound is roughly 1000 characters of body text. The 800-char chunk target in `sentence_aware_chunk()` respects this with a buffer.

**`normalize_embeddings=True` — why?**
Normalising to unit length (L2 norm = 1) converts cosine similarity into a plain dot product:  
`cosine(a, b) = a·b` when `|a| = |b| = 1`.  
The collection in this notebook is created with `Distance.COSINE`, so search scores are cosine similarities. Normalising at encode time keeps the stored vectors consistent with that metric and lets any downstream code treat a plain dot product as cosine similarity.
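
The identity can be verified numerically. A minimal standard-library sketch with made-up 2-d vectors; the helper names are illustrative.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: List[float]) -> List[float]:
    """Scale v to unit length (L2 norm = 1)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
cos_raw = cosine(a, b)                             # cosine on the raw vectors
dot_unit = dot(l2_normalize(a), l2_normalize(b))   # dot product after normalising
print(round(cos_raw, 6), round(dot_unit, 6))       # the two values agree
```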

**`EMBED_DIM=384` — what does this mean for storage?**
Each chunk is stored as a 384-float32 vector = 384 × 4 bytes = **1,536 bytes ≈ 1.5 KB**.

With the auto-discovery crawler fetching all 32 sub-categories (~400–700 articles, ~5 chunks per article on average):

| Corpus size | Chunks | Vector store size |
|---|---|---|
| This notebook (~500 articles) | ~2,500 | ~3.8 MB |
| 10× scale (~5,000 articles) | ~25,000 | ~38 MB |
| 100× scale (~50,000 articles) | ~250,000 | ~380 MB |

All three fit comfortably in free Colab RAM (12 GB). These figures count vectors only; payload fields such as `full_text` add to the total. In-memory Qdrant only becomes impractical above ~1M chunks (~1.5 GB of vectors), at which point you would switch to persistent Qdrant with **HNSW (Hierarchical Navigable Small World)** graph indexing, an approximate nearest-neighbour structure designed to keep query latency low at very large scale. The Python client API is identical for both in-memory and cloud deployments.
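
The sizing table follows from one multiplication. A small sketch that reproduces it; `vector_store_bytes` is a hypothetical helper and deliberately ignores payloads and index overhead.

```python
def vector_store_bytes(n_chunks: int, dim: int = 384, bytes_per_value: int = 4) -> int:
    """Raw size of float32 vectors only; payloads and index overhead are excluded."""
    return n_chunks * dim * bytes_per_value


for n in (2_500, 25_000, 250_000, 1_000_000):
    print(f"{n:>9,} chunks -> {vector_store_bytes(n) / 1e6:,.1f} MB")
```

Changing `dim` to 768 doubles every row, which is the storage side of the trade-off when moving to a larger embedding model.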

**Why not a larger model like `all-mpnet-base-v2` or OpenAI `text-embedding-3-small`?**
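Larger models can improve recall, but they are slower on a free Colab CPU (`all-mpnet-base-v2`) or need an API key (`text-embedding-3-small`), which breaks the constraints of this demo. The architecture is model-agnostic: swap the `SentenceTransformer` call in `embedding-model`, and for sentence-transformers models `EMBED_DIM` is derived from the model, so `qdrant-setup` picks up the new dimension unchanged. Re-check `max_chars` if the new model has a different sequence limit.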


In [None]:
%%time
hdr("Setting up Qdrant in-memory + indexing chunks")

COLLECTION = "zerodha_faqs"

# In-memory Qdrant — identical API to Qdrant Cloud
qdrant = QdrantClient(":memory:")
qdrant.create_collection(
    collection_name=COLLECTION,
    vectors_config=VectorParams(size=EMBED_DIM, distance=Distance.COSINE),
)

# Embed all chunks in batches
BATCH_SIZE = 64
texts = [c.text for c in all_chunks]
all_embeddings = []

for i in tqdm(range(0, len(texts), BATCH_SIZE), desc="Embedding"):
    batch = texts[i : i + BATCH_SIZE]
    embs  = embed_model.encode(batch, normalize_embeddings=True)
    all_embeddings.extend(embs)

all_embeddings = np.array(all_embeddings)
ok(f"Embeddings shape: {all_embeddings.shape}")

# Upsert into Qdrant
points = [
    PointStruct(
        id=i,
        vector=all_embeddings[i].tolist(),
        payload={
            "chunk_id":  c.chunk_id,
            "doc_id":    c.doc_id,
            "title":     c.title,
            "text":      c.text,
            "full_text": c.full_text,
            "url":       c.url,
            "category":  c.category,
            "chunk_idx": c.chunk_idx,
        }
    )
    for i, c in enumerate(all_chunks)
]

# Upload in batches of 256
for i in tqdm(range(0, len(points), 256), desc="Uploading to Qdrant"):
    qdrant.upsert(collection_name=COLLECTION, points=points[i:i+256])

collection_info = qdrant.get_collection(COLLECTION)
ok(f"Qdrant collection '{COLLECTION}' ready — {collection_info.points_count} vectors indexed")

---
## 4 · Hybrid Search — BM25 + Dense + RRF Fusion

**Why not dense-only?**  
Dense retrieval is strong on semantic similarity. It misses exact-term matches — financial terms, product names, regulatory codes. BM25 nails exact matches but fails on paraphrase. Neither alone is sufficient for a support FAQ system where users ask both `"what is MTF"` (semantic) and `"MTF margin requirement"` (keyword).

**Reciprocal Rank Fusion (RRF):**  
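Score(d) = Σ_i 1/(k + rank_i(d)), summed over each retriever i that returned d, where rank_i(d) is d's 1-based rank in that retriever's list and k=60 is a smoothing constant. RRF is rank-based: it does not require score normalization between BM25 and cosine similarity, which makes it robust without hyperparameter tuning.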
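
A small worked example may make the fusion concrete. The sketch below uses two made-up ranked lists of document labels (`"A"` to `"D"`) instead of real chunk indices, and ranks are 1-based; the `rrf` helper is illustrative only.

```python
from collections import defaultdict

def rrf(ranked_lists, k=60):
    """Fuse ranked lists of ids: each appearance contributes 1 / (k + rank), rank starting at 1."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

dense  = ["A", "B", "C"]   # toy dense ranking
sparse = ["C", "A", "D"]   # toy BM25 ranking

for doc_id, score in rrf([dense, sparse]):
    print(doc_id, round(score, 5))
```

`A` wins because it is ranked 1st and 2nd; `C` (3rd and 1st) comes next; `B` and `D` appear in only one list and trail behind. No raw BM25 or cosine score enters the calculation.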

**Production trade-off:**  
BM25 runs on CPU in memory. Dense retrieval hits the vector index. For 100k+ chunks, the bottleneck shifts to BM25 RAM — at that scale, replace BM25 with Elasticsearch sparse vectors or Qdrant's built-in sparse support (FastEmbed).

In [None]:
%%time
# Build BM25 index over tokenized chunk texts
hdr("Building BM25 sparse index")

def tokenize(text: str) -> List[str]:
    """Lowercase, then extract runs of [a-z0-9].

    Splits on punctuation as well as whitespace, so 'F&O' becomes ['f', 'o'].
    Good enough for BM25.
    """
    return re.findall(r"[a-z0-9]+", text.lower())

tokenized_corpus = [tokenize(c.text) for c in all_chunks]
bm25_index = BM25Okapi(tokenized_corpus)

ok(f"BM25 index built over {len(tokenized_corpus)} chunks")
info("Average doc length (BM25): " + str(round(bm25_index.avgdl, 1)) + " tokens")

In [None]:
def dense_search(query: str, top_k: int = 30) -> List[Tuple[int, float]]:
    """
    Query Qdrant for top_k nearest neighbours.
    Returns list of (point_index, cosine_score).
    """
    q_emb = embed_model.encode([query], normalize_embeddings=True)[0]
    results = qdrant.query_points(
        collection_name=COLLECTION,
        query=q_emb.tolist(),
        limit=top_k,
    )
    points = results[0] if isinstance(results, tuple) else results.points
    return [(r.id, r.score) for r in points]


def bm25_search(query: str, top_k: int = 30) -> List[Tuple[int, float]]:
    """
    BM25 retrieval. Returns list of (chunk_index, bm25_score).
    """
    q_tokens = tokenize(query)
    scores = bm25_index.get_scores(q_tokens)
    top_indices = np.argsort(scores)[::-1][:top_k]
    return [(int(idx), float(scores[idx])) for idx in top_indices if scores[idx] > 0]


def reciprocal_rank_fusion(
    dense_hits:  List[Tuple[int, float]],
    sparse_hits: List[Tuple[int, float]],
    k: int = 60,
    top_k: int = 20,
) -> List[Tuple[int, float]]:
    """
    Fuse two ranked lists using RRF.
    k=60 is the standard smoothing constant from the original RRF paper.
    """
    rrf_scores: Dict[int, float] = {}

    for rank, (idx, _) in enumerate(dense_hits):
        rrf_scores[idx] = rrf_scores.get(idx, 0) + 1.0 / (k + rank + 1)

    for rank, (idx, _) in enumerate(sparse_hits):
        rrf_scores[idx] = rrf_scores.get(idx, 0) + 1.0 / (k + rank + 1)

    fused = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
    return fused[:top_k]


def hybrid_search(query: str, stage1_k: int = 30) -> List[Tuple[Chunk, float]]:
    """
    Full hybrid search pipeline.
    Returns list of (Chunk, rrf_score) sorted by descending RRF score.
    """
    dense_hits  = dense_search(query, top_k=stage1_k)
    sparse_hits = bm25_search(query, top_k=stage1_k)
    fused       = reciprocal_rank_fusion(dense_hits, sparse_hits, top_k=stage1_k)

    results = []
    for idx, score in fused:
        chunk = all_chunks[idx]
        results.append((chunk, score))
    return results


ok("Hybrid search functions ready.")

### Why the same article can appear multiple times in retrieval results

This is **expected and correct behaviour** at the search layer. Here is the full chain of reasoning:

**1. The pipeline is chunk-aware, not article-aware.**
Every article is split into N chunks (typically 2–6 for Zerodha support articles). Each chunk is stored as a separate point in Qdrant with its own vector. `dense_search()` and `bm25_search()` both return *chunk-level* ranked lists — they have no concept of "article".

**2. Chunks from the same article tend to have similar embeddings.**
`all-MiniLM-L6-v2` encodes semantic meaning. Chunks from the same article cover overlapping content and therefore tend to land close together in the 384-dimensional embedding space. When you query for "how do I withdraw funds", several chunks of the withdrawal article can sit at a similar cosine distance from the query vector and rank near the top together.

**3. This is the right design for RRF fusion.**
`reciprocal_rank_fusion()` needs the full chunk-level ranked lists from both dense and BM25 to compute accurate rank positions. If we deduplicated inside `dense_search()` before fusion, we would collapse the dense signal to a single rank per article — losing rank-spread information that RRF depends on to correctly weight evidence. Chunk-level retrieval → better fusion → better final ranking.

**4. Deduplication happens downstream, not at search time.**
The notebook applies a four-layer dedup strategy:

| Stage | Where | What it does |
|---|---|---|
| Demo display | `_dedup_chunks()` in the demo cell below | Shows one result per article for readability |
| MMR | Section 4b | Penalises semantically redundant candidates before reranking |
| Generation | `rag_query()` Stage 4 | Keeps only the best chunk per `doc_id` before building the prompt |
| Prompt budget | `build_prompt()` | 150-char preview per source keeps 5 diverse articles inside Flan-T5's 512-token limit |

The demo below calls `_dedup_chunks()` only for display. The underlying `hybrid_search()` intentionally returns chunk-level results so the cross-encoder and adaptive-k stages operate on the richest possible candidate pool.


In [None]:

def _dedup_chunks(hits, n: int = 5):
    """Deduplicate a ranked list by doc_id, keeping the top-scoring chunk per article.

    Works for both dense hits (List[Tuple[int, float]]) and hybrid hits
    (List[Tuple[Chunk, float]]) — distinguishes them by checking the first element type.
    """
    seen, out = set(), []
    for item in hits:
        idx_or_chunk, score = item
        doc_id = all_chunks[idx_or_chunk].doc_id if isinstance(idx_or_chunk, int) else idx_or_chunk.doc_id
        if doc_id not in seen:
            seen.add(doc_id)
            out.append(item)
        if len(out) >= n:
            break
    return out


# ── Demo 1: semantic query — both Dense and Hybrid should do well ────
hdr("Demo: Dense-only vs Hybrid Search")
DEMO_QUERY = "how do I withdraw funds from my Zerodha account"

print(f"Query 1 (semantic): '{DEMO_QUERY}'\n")

# Fetch extra candidates upfront so dedup still returns 5 distinct articles
dense_only  = dense_search(DEMO_QUERY, top_k=50)
hybrid_hits = hybrid_search(DEMO_QUERY, stage1_k=50)

print(f"{Fore.CYAN}{'─'*28} Dense-only top-5 (deduped by article) {'─'*2}{Style.RESET_ALL}")
for rank, (idx, score) in enumerate(_dedup_chunks(dense_only), 1):
    c = all_chunks[idx]
    print(f"  {rank}. [{score:.3f}] {c.title}")

print(f"\n{Fore.CYAN}{'─'*28} Hybrid top-5 (RRF, deduped) {'─'*4}{Style.RESET_ALL}")
for rank, (chunk, score) in enumerate(_dedup_chunks(hybrid_hits), 1):
    print(f"  {rank}. [{score:.4f}] {chunk.title}")

# ── Demo 2: exact financial abbreviations — where BM25 wins ─────────
TERM_QUERY = "what is BTST and CNC order type"
print(f"\n\nQuery 2 (exact financial terms): '{TERM_QUERY}'\n")
print(f"{Fore.YELLOW}These abbreviations are hard for dense embeddings — BM25 recovers "
      f"them via exact token match.{Style.RESET_ALL}\n")

dense_term  = dense_search(TERM_QUERY, top_k=50)
hybrid_term = hybrid_search(TERM_QUERY, stage1_k=50)

print(f"{Fore.CYAN}{'─'*28} Dense-only top-5 (deduped by article) {'─'*2}{Style.RESET_ALL}")
for rank, (idx, score) in enumerate(_dedup_chunks(dense_term), 1):
    c = all_chunks[idx]
    print(f"  {rank}. [{score:.3f}] {c.title}")

print(f"\n{Fore.CYAN}{'─'*28} Hybrid top-5 (RRF, deduped) {'─'*4}{Style.RESET_ALL}")
for rank, (chunk, score) in enumerate(_dedup_chunks(hybrid_term), 1):
    print(f"  {rank}. [{score:.4f}] {chunk.title}")

print(f"\n{Fore.YELLOW}Key insight:{Style.RESET_ALL} BM25 recovers exact financial terms "
      "(BTST, CNC, NRML, MTF) that dense retrieval misses due to "
      "abbreviation tokenization. Hybrid is not an optimisation — it is a requirement.")



---
## 4b · MMR — Maximal Marginal Relevance

**The redundancy problem:** Dense retrieval on a broad query often returns 5 chunks that all say the same thing. You burn your entire LLM context window on one idea — the answer is impoverished even when recall was perfect.

**MMR formula:**

```
MMR = argmax_{d ∈ R \ S} [ λ · sim(d, q)  −  (1 − λ) · max_{d' ∈ S} sim(d, d') ]
```

- **Left term:** relevance to the query (cosine similarity to query embedding)  
- **Right term:** maximum similarity to any already-selected chunk (diversity penalty)  
- **λ ≈ 0.6** balances relevance vs diversity — production default  
- **Algorithm:** greedy iterative selection — at each step pick the candidate with the best relevance-minus-redundancy score

**Where it fits:** Run MMR *after* hybrid retrieval and *before* the cross-encoder. It trims the shortlist to a diverse set, so the cross-encoder scores a tighter, non-redundant candidate pool.

> **Production note:** MMR addresses the diversity problem that fixed-k ignores. Combine with Adaptive-k for the full effect: MMR for diversity, Adaptive-k for relevance boundary.
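
To see the greedy selection at work before the full implementation, here is a toy sketch in plain Python. It assumes three made-up 2-D unit vectors as candidates (two near-duplicates and one distinct direction); `unit` and `mmr_select` are illustrative helpers, not the notebook's `mmr()`.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(deg):
    """2-D unit vector at the given angle, so dot products are cosine similarities."""
    r = math.radians(deg)
    return [math.cos(r), math.sin(r)]

def mmr_select(query_vec, cand_vecs, final_k=2, lambda_=0.6):
    """Greedy MMR; returns the indices of the selected candidates in pick order."""
    relevance = [dot(query_vec, c) for c in cand_vecs]
    selected, remaining = [], list(range(len(cand_vecs)))
    while remaining and len(selected) < final_k:
        def mmr_score(i):
            # Redundancy is 0 for the first pick because nothing is selected yet.
            redundancy = max((dot(cand_vecs[i], cand_vecs[j]) for j in selected), default=0.0)
            return lambda_ * relevance[i] - (1 - lambda_) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = unit(0)
cands = [unit(10), unit(12), unit(-30)]   # candidates 0 and 1 are near-duplicates

top2_by_relevance = sorted(range(len(cands)), key=lambda i: dot(query, cands[i]), reverse=True)[:2]
print(top2_by_relevance)          # [0, 1]
print(mmr_select(query, cands))   # [0, 2]
```

Pure relevance keeps both near-duplicates, while MMR with λ=0.6 swaps the second one for the less similar candidate even though its relevance is lower.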


In [None]:

def mmr(
    query:      str,
    candidates: List[Tuple[Chunk, float]],
    final_k:    int   = 8,
    lambda_:    float = 0.6,
) -> List[Tuple[Chunk, float]]:
    """
    Maximal Marginal Relevance — greedy iterative selection for diversity.

    At each step selects the candidate that maximises:
        λ · sim(chunk, query)  −  (1−λ) · max_{selected} sim(chunk, selected)

    λ=0.6 favours relevance slightly over diversity (production default).
    """
    if not candidates:
        return []

    texts  = [c.text for c, _ in candidates]
    q_emb  = embed_model.encode([query], normalize_embeddings=True)          # (1, D)
    c_embs = embed_model.encode(texts,   normalize_embeddings=True)          # (N, D)

    # Relevance scores: query ↔ each candidate
    rel_scores = cosine_similarity(q_emb, c_embs)[0]                        # (N,)

    selected_idxs:  List[int] = []
    remaining_idxs: List[int] = list(range(len(candidates)))

    for _ in range(min(final_k, len(candidates))):
        if not remaining_idxs:
            break

        if not selected_idxs:
            # First pick: highest relevance, no diversity penalty yet
            best = max(remaining_idxs, key=lambda i: rel_scores[i])
        else:
            sel_embs = c_embs[selected_idxs]                                 # (|S|, D)
            best, best_score = None, -float("inf")
            for i in remaining_idxs:
                relevance  = lambda_ * rel_scores[i]
                redundancy = (1 - lambda_) * cosine_similarity(
                    c_embs[[i]], sel_embs
                ).max()
                score = relevance - redundancy
                if score > best_score:
                    best_score, best = score, i

        selected_idxs.append(best)
        remaining_idxs.remove(best)

    return [(candidates[i][0], float(rel_scores[i])) for i in selected_idxs]


# ── Demo: before/after MMR on a redundancy-heavy query ──────────────
MMR_QUERY = "what is margin in Zerodha"
hdr(f"Demo: MMR Diversity — '{MMR_QUERY}'")

raw_candidates = hybrid_search(MMR_QUERY, stage1_k=30)
mmr_results    = mmr(MMR_QUERY, raw_candidates, final_k=5, lambda_=0.6)

print(f"{Fore.RED}WITHOUT MMR — top 5 by RRF score (may be redundant):{Style.RESET_ALL}")
for i, (chunk, score) in enumerate(raw_candidates[:5], 1):
    snippet = chunk.text[len(chunk.title) + 2:len(chunk.title) + 90]
    print(f"  {i}. [{score:.4f}] {chunk.title}")
    print(f"       …{snippet}…")

print(f"\n{Fore.GREEN}WITH MMR (λ=0.6) — diverse top 5:{Style.RESET_ALL}")
for i, (chunk, score) in enumerate(mmr_results, 1):
    snippet = chunk.text[len(chunk.title) + 2:len(chunk.title) + 90]
    print(f"  {i}. [{score:.4f}] {chunk.title}")
    print(f"       …{snippet}…")

print(f"\n{Fore.YELLOW}Key insight:{Style.RESET_ALL} Each MMR-selected chunk covers a "
      "distinct aspect of 'margin'. No two chunks say the same thing.")

ok("MMR function ready. Use mmr(query, hybrid_search(...)) before cross-encoder for diversity.")


---
## 5 · Cross-Encoder Reranking

**The two-tower gap:** Bi-encoders (like MiniLM) encode query and document independently — they cannot model fine-grained token-level interactions between the two. This is fast but imprecise.

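**Cross-encoders** process the (query, document) pair jointly through the full attention stack. They see every token of the query attending to every token of the document. This is dramatically more accurate, but nothing can be precomputed: every (query, document) pair needs its own forward pass at query time, so inference cost is O(n) in the number of candidates scored.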

**The production pattern:** Use bi-encoder to recall 20–30 candidates cheaply, then cross-encoder to rerank to final top-5. You pay cross-encoder cost only on the shortlist, not the full corpus.

**Latency budget:** With `ms-marco-MiniLM-L-6-v2` on CPU:  
- Bi-encoder recall on 2,500+ chunks (query embedding + in-memory Qdrant search): ~80ms  
- Cross-encoder on 20 candidates: ~120ms  
- Total retrieval budget: ~200ms — well inside a 1.5s response SLA

In [None]:
%%time
hdr("Loading Cross-Encoder reranker")
CE_MODEL_NAME = "cross-encoder/ms-marco-MiniLM-L-6-v2"
cross_encoder = CrossEncoder(CE_MODEL_NAME)
ok(f"Cross-encoder: {CE_MODEL_NAME}")


def rerank(
    query: str,
    candidates: List[Tuple[Chunk, float]],
    final_k: int = 5,
) -> List[Tuple[Chunk, float]]:
    """
    Rerank hybrid search candidates with the cross-encoder.
    Input:  candidates from hybrid_search (shortlist of ~20–30)
    Output: top final_k chunks with cross-encoder relevance scores
    """
    if not candidates:
        return []

    pairs  = [(query, c.text) for c, _ in candidates]
    scores = cross_encoder.predict(pairs)

    ranked = sorted(
        zip([c for c, _ in candidates], scores),
        key=lambda x: x[1],
        reverse=True
    )
    return ranked[:final_k]


# ── Demo: per-stage latency (bi-encoder vs cross-encoder) ────────────
print(f"\nQuery: '{DEMO_QUERY}'")

t0 = time.time()
candidates = hybrid_search(DEMO_QUERY, stage1_k=25)
t_hybrid_ms = (time.time() - t0) * 1000

t0 = time.time()
reranked = rerank(DEMO_QUERY, candidates, final_k=5)
t_ce_ms = (time.time() - t0) * 1000

print(f"\n  {Fore.CYAN}Stage latencies:{Style.RESET_ALL}")
print(f"  Bi-encoder recall  (25 candidates):  {t_hybrid_ms:6.0f}ms")
print(f"  Cross-encoder rerank (→ top 5):       {t_ce_ms:6.0f}ms")
print(f"  {'─'*42}")
print(f"  Total retrieval:                      {t_hybrid_ms + t_ce_ms:6.0f}ms")
print(f"\n  {Fore.YELLOW}Note:{Style.RESET_ALL} CE cost scales with candidate count, "
      f"not corpus size. Running on 25 docs, not {len(all_chunks)}.")

print(f"\n{Fore.CYAN}{'─'*30} After cross-encoder reranking {'─'*1}{Style.RESET_ALL}")
for rank, (chunk, score) in enumerate(reranked, 1):
    print(f"  {rank}. [CE={score:.3f}] {chunk.title}")
    print(f"       {chunk.text[:120]}...")
    print()

---
## 6 · Adaptive-k Retrieval

**The fixed-k problem:** Setting `top_k=5` works well for narrow queries. For ambiguous queries — `"margin"` could mean MTF margin, intraday margin, or Options margin — you need more context. But padding every query with top-10 inflates prompt cost and dilutes the signal fed to the LLM.

**Adaptive-k** uses the distribution of cross-encoder scores to find the natural relevance cliff:

```
k = argmax( score_i − score_{i+1} )
```

The largest consecutive score drop is the boundary between relevant and irrelevant. Everything above the cliff goes into context. Everything below gets dropped — no LLM sees it, no token budget is consumed.

**When it fails:** Flat score curves with no clear cliff — usually when the query is highly ambiguous or the embedding space is too compressed. Mitigate with a minimum k of 2.
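
The cliff rule and its failure mode can be illustrated on two made-up score lists. This is a plain-Python sketch of the same idea; `cliff_k` is a hypothetical helper that returns only the cut-off count, and the scores are invented rather than real cross-encoder outputs.

```python
def cliff_k(scores, min_k=2):
    """How many top items to keep: everything above the largest consecutive drop."""
    if len(scores) <= min_k:
        return len(scores)
    gaps = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    cliff = max(range(len(gaps)), key=gaps.__getitem__)   # 0-based index of the steepest drop
    return max(min_k, cliff + 1)

clear = [7.9, 7.4, 6.8, 1.2, 0.9, 0.3]   # obvious cliff after the 3rd score
flat  = [3.0, 2.75, 2.5, 2.25, 2.0]      # every gap is identical: no cliff

print(cliff_k(clear))   # 3
print(cliff_k(flat))    # 2
```

On the flat curve every gap ties, so the arg-max lands on the first gap and only the `min_k` floor prevents the result from collapsing to a single chunk.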

In [None]:
def adaptive_k(
    ranked_chunks: List[Tuple[Chunk, float]],
    min_k: int = 2,
    max_k: int = 8,
) -> List[Tuple[Chunk, float]]:
    """
    Apply adaptive-k boundary detection on cross-encoder scored candidates.
    Returns only the chunks above the steepest score drop.

    k = argmax(score_i - score_{i+1})
    """
    if len(ranked_chunks) <= min_k:
        return ranked_chunks

    scores = [s for _, s in ranked_chunks[:max_k]]

    # Compute consecutive differences
    gaps = [scores[i] - scores[i+1] for i in range(len(scores)-1)]

    if not gaps:
        return ranked_chunks[:min_k]

    cliff_idx = int(np.argmax(gaps))   # index of steepest drop
    k = max(min_k, cliff_idx + 1)      # include everything above the cliff

    return ranked_chunks[:k]


# ── Demo: adaptive-k vs fixed-k on several query types ───────────
hdr("Demo: Adaptive-k vs Fixed-k")

test_queries = [
    "how to place a stop loss order",       # narrow — expect k=2 or 3
    "margin",                                # ambiguous — expect higher k
    "can I withdraw money on the same day",  # specific — expect k=2
]

for q in test_queries:
    candidates = hybrid_search(q, stage1_k=25)
    reranked   = rerank(q, candidates, final_k=8)
    adaptive   = adaptive_k(reranked, min_k=2, max_k=8)

    scores = [f"{s:.2f}" for _, s in reranked[:8]]
    gaps   = [
        f"{reranked[i][1]-reranked[i+1][1]:.2f}"
        for i in range(len(reranked)-1) if i < 7
    ]

    print(f"\n  Query: '{q}'")
    print(f"  CE scores: [{', '.join(scores)}]")
    print(f"  Gaps:      [{', '.join(gaps)}]")
    print(f"  {Fore.GREEN}Adaptive-k = {len(adaptive)}{Style.RESET_ALL}  "
          f"(Fixed-k=5 would use {min(5, len(reranked))} chunks)")
    for i, (chunk, score) in enumerate(adaptive, 1):
        print(f"    {i}. [CE={score:.3f}] {chunk.title}")

---
## 7 · Generation with Grounded Citations

We use `google/flan-t5-base` here — a small (250M param) instruction-tuned model that runs on CPU in Colab without any API key. It is not production-grade for generative quality, but it demonstrates the full pipeline architecture correctly.

**To swap in a better model:** Replace the generation cell with an OpenAI/Anthropic API call — the retrieval pipeline above is completely model-agnostic. The prompt template stays identical.

**Citation grounding:** The prompt asks the model to cite the numbered sources it used ('Source 1', 'Source 2', and so on). If the model produces a claim not traceable to a retrieved chunk, that is your hallucination signal; catch it at evaluation time (Section 8).

In [None]:
%%time
hdr("Loading generation model (Flan-T5-base — CPU, no API key)")
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

GEN_MODEL     = "google/flan-t5-base"
gen_tokenizer = AutoTokenizer.from_pretrained(GEN_MODEL)
gen_model     = AutoModelForSeq2SeqLM.from_pretrained(GEN_MODEL)

def generator(prompt: str, max_new_tokens: int = 256, **kwargs) -> list:
    # Flan-T5-base encoder limit is 512 tokens.
    # Truncate explicitly at 512 (not 1024); anything beyond that is cut from the END of the prompt.
    inputs  = gen_tokenizer(prompt, return_tensors="pt",
                            truncation=True, max_length=512)
    # Forward extra generation options (e.g. do_sample=False from rag_query)
    # instead of silently discarding them.
    outputs = gen_model.generate(**inputs, max_new_tokens=max_new_tokens, **kwargs)
    text    = gen_tokenizer.decode(outputs[0], skip_special_tokens=True)
    return [{"generated_text": text}]

ok(f"Generator ready: {GEN_MODEL}")

In [None]:
def build_prompt(query: str, chunks: List[Tuple[Chunk, float]]) -> str:
    """
    Construct a grounded RAG prompt optimised for Flan-T5-base (512-token limit).

    Design constraints:
    • Question is placed FIRST — protected from truncation even if context fills the window.
    • Each source uses max 150 chars (~40 tokens) so 5 sources fit inside the 512-token limit.
    • Sources are labelled "Source 1:", "Source 2:" — natural text patterns Flan-T5 handles well.
    • Citation instruction says "mention 'Source N'" rather than showing a bracket placeholder
      like [SOURCE-N], which Flan-T5 echoes literally instead of filling in.
    • Instruction is one plain sentence — seq2seq models respond best to simple directives.
    """
    context_blocks = []
    for i, (chunk, score) in enumerate(chunks, 1):
        context_blocks.append(
            f"Source {i} ({chunk.title}): {chunk.text[:150]}"
        )
    context = "\n".join(context_blocks)

    prompt = (
        f"Question: {query}\n\n"
        f"Answer using only the sources below. "
        f"When you use information from a source, mention it as 'Source 1', 'Source 2', etc.\n\n"
        f"{context}\n\n"
        f"Answer:"
    )
    return prompt


@dataclass
class RAGResponse:
    query:    str
    answer:   str
    sources:  List[Chunk]
    latency_ms: float


def rag_query(
    query:      str,
    stage1_k:   int = 25,
    final_k:    int = 8,
    use_adaptive_k: bool = True,
) -> RAGResponse:
    """
    Full production RAG pipeline:
    Hybrid Search → Cross-Encoder Rerank → Adaptive-k → Doc-dedup → Generate
    """
    t0 = time.time()

    # Stage 1: Hybrid retrieval
    candidates = hybrid_search(query, stage1_k=stage1_k)

    # Stage 2: Cross-encoder reranking
    reranked = rerank(query, candidates, final_k=final_k)

    # Stage 3: Adaptive-k boundary detection
    if use_adaptive_k:
        final_chunks = adaptive_k(reranked, min_k=2, max_k=final_k)
    else:
        final_chunks = reranked[:5]

    # Stage 4: Deduplicate by doc_id — keep only the highest-scoring chunk per
    # article so the same article doesn't crowd out all source slots.
    seen_docs: set = set()
    deduped: List[Tuple[Chunk, float]] = []
    for chunk, score in final_chunks:
        if chunk.doc_id not in seen_docs:
            seen_docs.add(chunk.doc_id)
            deduped.append((chunk, score))
    final_chunks = deduped

    # Stage 5: Generate grounded answer
    prompt = build_prompt(query, final_chunks)
    output = generator(prompt, do_sample=False)[0]["generated_text"]

    latency_ms = (time.time() - t0) * 1000

    return RAGResponse(
        query=query,
        answer=output,
        sources=[c for c, _ in final_chunks],
        latency_ms=latency_ms,
    )


ok("RAG pipeline assembled.")


### Prompt design for Flan-T5-base: three non-obvious constraints

Flan-T5-base is a seq2seq model instruction-tuned on hundreds of NLP tasks. It is not a GPT-style decoder-only model: it has a separate encoder and decoder, and the encoder has a **hard 512-token limit**. This shapes every design decision in `build_prompt()` and `rag_query()`.

**Constraint 1: Question must come first.**
The tokeniser processes the full prompt left-to-right and truncates it at 512 tokens before the encoder sees it. If the prompt is structured as `SOURCES … QUESTION`, a long context will fill the 512-token window before the QUESTION line is reached: the model never sees what it is being asked, and answers with something plausible from the sources regardless of the query. Putting `Question: {query}` at the top ensures the question survives truncation even in the worst case.

**Constraint 2: Token budget arithmetic.**
With the question-first layout the token budget is approximately:

| Component | Tokens |
|---|---|
| Instruction line | ~20 |
| Question (avg Zerodha query) | ~15 |
| 5 sources × (title + 150 chars) | ~250 |
| Formatting / labels | ~15 |
| **Total** | **~300** |

This leaves a roughly 200-token safety margin under the 512-token limit. The 150-char preview in `build_prompt()` is calibrated to this budget. Increasing it to 300 chars with 5 sources adds about 5 × 40 = 200 tokens, for a total near 500: still under the limit, but with almost no headroom for a long question or long titles.
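
The budget can be recomputed for other preview lengths with a small estimator. This sketch takes the table's figures as defaults and assumes 150 characters ≈ 40 tokens (3.75 chars/token) and ~10 tokens per title; those ratios are rough assumptions, and the real tokenizer count will differ. `estimate_prompt_tokens` is a hypothetical helper.

```python
def estimate_prompt_tokens(preview_chars, n_sources=5, title_tokens=10,
                           instruction_tokens=20, question_tokens=15,
                           formatting_tokens=15, chars_per_token=3.75):
    """Rough prompt size in tokens; all defaults are estimates, not measured values."""
    per_source = title_tokens + preview_chars / chars_per_token
    fixed = instruction_tokens + question_tokens + formatting_tokens
    return round(fixed + n_sources * per_source)

LIMIT = 512
for chars in (150, 300):
    total = estimate_prompt_tokens(chars)
    print(f"{chars}-char previews: ~{total} tokens, headroom ~{LIMIT - total}")
```

For a precise figure, count tokens with `gen_tokenizer` on the actual output of `build_prompt()` instead of relying on a chars-per-token ratio.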

**Constraint 3: Doc-id deduplication before generation.**
`adaptive_k()` returns chunk-level candidates. Without deduplication, the same article can fill 3 of the 5 source slots — the model then sees the same information three times and the remaining slots are wasted on redundant context. The dedup step in `rag_query()` Stage 4 ensures each source slot represents a distinct article, maximising the information density of the prompt.

**Swapping Flan-T5 for a real LLM.**
`build_prompt()` and `RAGResponse` are completely LLM-agnostic. To use Claude or GPT-4.1-mini:
1. Replace the `generator()` function in `load-generator` with an API call.
2. Increase the 150-char preview limit — modern LLMs handle 8k–128k context windows.
3. Everything else (hybrid search, reranking, adaptive-k, dedup, evaluation) stays identical.
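
One way to keep step 1 small is to wrap whatever client you use behind the same return shape that `rag_query()` reads (`[{"generated_text": ...}]`). The sketch below is an assumption about how that could look: `make_generator` is a hypothetical adapter and `fake_llm` is a stand-in for a real API call, so no vendor SDK is shown.

```python
from typing import Callable, Dict, List

def make_generator(complete_fn: Callable[[str], str]) -> Callable[..., List[Dict[str, str]]]:
    """Wrap any prompt -> text function so it returns the shape rag_query() expects."""
    def generator(prompt: str, max_new_tokens: int = 256, **kwargs) -> List[Dict[str, str]]:
        # max_new_tokens and kwargs are accepted for signature compatibility only;
        # map them onto your provider's parameters inside complete_fn if needed.
        return [{"generated_text": complete_fn(prompt).strip()}]
    return generator

def fake_llm(prompt: str) -> str:
    """Stand-in for a hosted LLM call; replace the body with your provider's client."""
    return "  Stub answer citing Source 1.  "

generator = make_generator(fake_llm)
print(generator("Question: ...", do_sample=False)[0]["generated_text"])
```

Because the adapter keeps the call signature and return shape, `rag_query()` does not need to change when the backing model does.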


In [None]:
%%time
hdr("Running RAG Pipeline — Live Queries")

demo_queries = [
    "How do I withdraw money from Zerodha to my bank account?",
    "What is the difference between CNC and MIS orders?",
    "Why was my F&O trade rejected due to insufficient margin?",
    "How do I add a nominee to my Zerodha account?",
]

responses: List[RAGResponse] = []

for q in demo_queries:
    print(f"\n{Fore.YELLOW}Q: {q}{Style.RESET_ALL}")
    resp = rag_query(q)
    responses.append(resp)

    print(f"{Fore.GREEN}A: {resp.answer}{Style.RESET_ALL}")
    print(f"   Latency: {resp.latency_ms:.0f}ms | Sources used: {len(resp.sources)}")
    for i, src in enumerate(resp.sources, 1):
        print(f"   [{i}] {src.title} → {src.url}")

---
## 8 · Evaluation — Hit Rate, Faithfulness, Hallucination Detection

**The gap most teams skip:** A RAG system that looks good in demos can still fail silently in production. Two failure modes from the deck:

1. **Bad retrieval** — the right chunk was not retrieved at all. No amount of generation quality fixes this. Measure: Hit Rate @ k.
2. **Bad generation** — the chunk was retrieved but the model hallucinated or ignored it. Measure: Faithfulness score.

We use a **synthetic golden QA set** generated from the corpus itself — no human annotation required for a demo. In production, seed this with real user queries from your support ticket logs.
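
Hit Rate @ k over such a golden set reduces to a substring check per query. The sketch below assumes each record carries the retrieved article titles alongside `expected_titles`; the `retrieved_titles` field and the example titles are invented for illustration, and the notebook's own evaluation code may be structured differently.

```python
def is_hit(retrieved_titles, expected_titles, k=5):
    """True if any expected substring occurs in any of the top-k retrieved titles."""
    top = [t.lower() for t in retrieved_titles[:k]]
    return any(exp in title for exp in expected_titles for title in top)

def hit_rate_at_k(records, k=5):
    """Fraction of queries with at least one hit in the top k."""
    if not records:
        return 0.0
    return sum(is_hit(r["retrieved_titles"], r["expected_titles"], k) for r in records) / len(records)

# Made-up retrieval results for two queries
records = [
    {"expected_titles": ["withdraw", "payout"],
     "retrieved_titles": ["How to withdraw funds?", "What is a payout?"]},
    {"expected_titles": ["gtt"],
     "retrieved_titles": ["What is a stop loss order?", "What are brokerage charges?"]},
]
print(hit_rate_at_k(records, k=5))   # 0.5
```

Substring matching is lenient: a broad expected term such as `"margin"` will match many titles, so treat the resulting hit rate as an upper bound on true retrieval quality.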

**Faithfulness check:** A simple heuristic — does the answer contain n-grams that appear in the retrieved source? Not a replacement for LLM-as-judge, but deterministic and fast.
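
A minimal version of such an n-gram heuristic might look like the following. It is a sketch, not the notebook's evaluation code: it assumes trigram overlap and a simple alphanumeric tokenizer, and the example strings are invented.

```python
import re

def ngrams(text, n=3):
    """Set of lowercase alphanumeric word n-grams in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_faithfulness(answer, sources, n=3):
    """Fraction of the answer's n-grams that also occur in at least one source."""
    answer_grams = ngrams(answer, n)
    if not answer_grams:
        return 0.0
    source_grams = set().union(*(ngrams(s, n) for s in sources))
    return len(answer_grams & source_grams) / len(answer_grams)

# Made-up source and answers, for illustration only
sources   = ["Withdrawal requests are credited to the primary bank account."]
grounded  = "Requests are credited to the primary bank account."
invented  = "Requests are credited to your UPI wallet instantly."

print(ngram_faithfulness(grounded, sources))   # 1.0
print(round(ngram_faithfulness(invented, sources), 3))
```

The second answer shares only its first two trigrams with the source, so its score drops to one third. A paraphrased but correct answer would also score low, which is the main limitation of lexical overlap compared with an LLM judge.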

In [None]:
hdr("Building Synthetic Golden QA Set")

# Hand-crafted query → expected article title pairs covering all 32 scraped sub-categories.
# expected_titles: lowercase substrings matched against retrieved article titles.
# In production: derive from support ticket logs or LLM-generated QA pairs.
#
# Section / sub-category coverage:
#   funds/fund-withdrawal            → Q01, Q02
#   funds/adding-funds               → Q03
#   funds/adding-bank-accounts       → Q04
#   funds/mandate                    → Q05
#   trading/trading-faqs             → Q06, Q07, Q08, Q09
#   trading/margins                  → Q10, Q11, Q12
#   trading/charts-and-orders        → Q13, Q14, Q15, Q16, Q17
#   trading/general-kite             → Q18, Q19
#   trading/charges                  → Q20, Q21
#   trading/ipo                      → Q22, Q23
#   trading/alerts-and-nudges        → Q24
#   account-opening/resident         → Q25, Q26, Q27
#   account-opening/minor            → Q28
#   account-opening/nri              → Q29
#   account-opening/corporate-huf    → Q30
#   your-account/your-profile        → Q31, Q32
#   your-account/account-modification→ Q33
#   your-account/dp-bank-details     → Q34, Q35
#   your-account/nomination          → Q36
#   your-account/transfer-of-shares  → Q37
#   console/portfolio                → Q38
#   console/corporate-actions        → Q39
#   console/ledger                   → Q40
#   console/reports                  → Q41, Q42
#   console/profile                  → Q43
#   console/segments                 → Q44
#   mutual-funds/understanding-mf    → Q45, Q46, Q47
#   mutual-funds/payments-and-orders → Q48, Q49
#   mutual-funds/nps                 → Q50
#   mutual-funds/fixed-deposits      → Q51
#   mutual-funds/features-on-coin    → Q52
#   mutual-funds/coin-general        → Q53
GOLDEN_QA = [
    # ── funds/fund-withdrawal ────────────────────────────────────────────────
    {"query": "How do I withdraw money from Zerodha?",
     "expected_titles": ["fund withdrawal", "withdraw", "payout"]},

    {"query": "What is the cut-off time for fund withdrawal in Zerodha?",
     "expected_titles": ["withdrawal", "payout", "cut-off", "timing"]},

    # ── funds/adding-funds ───────────────────────────────────────────────────
    {"query": "How do I add funds to my Zerodha trading account?",
     "expected_titles": ["add funds", "deposit", "transfer funds", "upi"]},

    # ── funds/adding-bank-accounts ───────────────────────────────────────────
    {"query": "How do I link a bank account to my Zerodha account?",
     "expected_titles": ["bank account", "add bank", "primary bank"]},

    # ── funds/mandate ────────────────────────────────────────────────────────
    {"query": "What is a NACH mandate and why is it required in Zerodha?",
     "expected_titles": ["mandate", "nach", "auto-debit", "auto debit"]},

    # ── trading-and-markets/trading-faqs ─────────────────────────────────────
    {"query": "What is the difference between holdings and positions?",
     "expected_titles": ["holdings and positions", "difference between holdings"]},

    {"query": "What is F&O trading?",
     "expected_titles": ["futures and options", "f&o", "derivatives"]},

    {"query": "What is the settlement process for equity trades?",
     "expected_titles": ["settlement", "t+1", "delivery"]},

    {"query": "What is BTST trading and is it allowed in Zerodha?",
     "expected_titles": ["btst", "buy today sell tomorrow"]},

    # ── trading-and-markets/margins ───────────────────────────────────────────
    {"query": "What is MTF and how does it work?",
     "expected_titles": ["mtf", "margin trading facility"]},

    {"query": "What are intraday margins?",
     "expected_titles": ["intraday", "margin", "mis"]},

    {"query": "What is span and exposure margin in F&O?",
     "expected_titles": ["span", "exposure", "margin", "f&o"]},

    # ── trading-and-markets/charts-and-orders ─────────────────────────────────
    {"query": "What is a stop loss order?",
     "expected_titles": ["stop loss", "sl order"]},

    {"query": "What is a GTT order and how do I place one?",
     "expected_titles": ["gtt", "good till triggered", "good-till-triggered"]},

    {"query": "What is CNC order type in Zerodha?",
     "expected_titles": ["cnc", "cash and carry", "delivery"]},

    {"query": "How do I place an after-market order on Kite?",
     "expected_titles": ["amo", "after market", "after-market", "pre-market"]},

    {"query": "How do I convert an MIS position to CNC in Zerodha?",
     "expected_titles": ["mis", "cnc", "convert", "position conversion"]},

    # ── trading-and-markets/general-kite ──────────────────────────────────────
    {"query": "How to create a support ticket on Zerodha?",
     "expected_titles": ["ticket", "create a ticket", "raise a ticket"]},

    {"query": "What is Kite and how do I use it for trading?",
     "expected_titles": ["kite", "trading platform", "web platform"]},

    # ── trading-and-markets/charges ───────────────────────────────────────────
    {"query": "What are the brokerage charges for equity and F&O trades?",
     "expected_titles": ["brokerage", "charges", "stt", "fee"]},

    {"query": "What is STT (Securities Transaction Tax) in Zerodha?",
     "expected_titles": ["stt", "securities transaction tax", "tax"]},

    # ── trading-and-markets/ipo ────────────────────────────────────────────────
    {"query": "How do I apply for an IPO through Zerodha?",
     "expected_titles": ["ipo", "apply for ipo", "asba", "upi"]},

    {"query": "How do I check my IPO allotment status?",
     "expected_titles": ["ipo", "allotment", "allotted"]},

    # ── trading-and-markets/alerts-and-nudges ─────────────────────────────────
    {"query": "How do I set price alerts on Kite?",
     "expected_titles": ["alert", "price alert", "nudge"]},

    # ── account-opening/resident-individual ───────────────────────────────────
    {"query": "What documents are required to open a Zerodha account?",
     "expected_titles": ["documents", "open an account", "account opening", "kyc"]},

    {"query": "How long does it take to open a Zerodha account online?",
     "expected_titles": ["open account", "account opening", "days", "time"]},

    {"query": "Is Aadhaar mandatory to open a Zerodha account?",
     "expected_titles": ["aadhaar", "account opening", "mandatory", "digilocker"]},

    # ── account-opening/minor ─────────────────────────────────────────────────
    {"query": "Can a minor open a trading account in Zerodha?",
     "expected_titles": ["minor", "guardian", "minor account"]},

    # ── account-opening/nri-account-opening ───────────────────────────────────
    {"query": "Can NRIs trade on the Indian stock market through Zerodha?",
     "expected_titles": ["nri", "non-resident", "nre", "nro"]},

    # ── account-opening/company-partnership-and-huf ───────────────────────────
    {"query": "How do I open a trading account for a company or HUF?",
     "expected_titles": ["company", "huf", "partnership", "corporate account"]},

    # ── your-zerodha-account/your-profile ─────────────────────────────────────
    {"query": "How do I change my mobile number linked to my Zerodha account?",
     "expected_titles": ["mobile", "phone number", "change mobile", "update"]},

    {"query": "How do I update my email address on Zerodha?",
     "expected_titles": ["email", "update email", "change email"]},

    # ── your-zerodha-account/account-modification-and-segment-addition ────────
    {"query": "How do I add the F&O segment to my existing Zerodha account?",
     "expected_titles": ["segment", "f&o", "add segment", "activation", "futures"]},

    # ── your-zerodha-account/dp-id-and-bank-details ───────────────────────────
    {"query": "What is DP ID and client ID in Zerodha?",
     "expected_titles": ["dp id", "client id", "bo id", "depository"]},

    {"query": "How to add a bank account in Zerodha?",
     "expected_titles": ["add bank", "bank account", "primary bank"]},

    # ── your-zerodha-account/nomination-process ───────────────────────────────
    {"query": "How do I add a nominee to my Zerodha account?",
     "expected_titles": ["nominee", "nomination", "add nominee"]},

    # ── your-zerodha-account/transfer-of-shares ───────────────────────────────
    {"query": "How do I transfer shares from my Zerodha demat account to another broker?",
     "expected_titles": ["transfer", "shares", "demat", "off-market"]},

    # ── console/portfolio ─────────────────────────────────────────────────────
    {"query": "How do I view my stock portfolio on Zerodha Console?",
     "expected_titles": ["portfolio", "console", "holdings"]},

    # ── console/corporate-actions ─────────────────────────────────────────────
    {"query": "Where can I see dividends and bonuses credited to my Zerodha account?",
     "expected_titles": ["dividend", "corporate action", "bonus", "split"]},

    # ── console/ledger ────────────────────────────────────────────────────────
    {"query": "How do I download my account ledger from Zerodha?",
     "expected_titles": ["ledger", "download", "statement"]},

    # ── console/reports ───────────────────────────────────────────────────────
    {"query": "How do I download my profit and loss report for tax filing?",
     "expected_titles": ["profit and loss", "p&l", "tax", "report"]},

    {"query": "Where can I find my contract notes in Zerodha?",
     "expected_titles": ["contract note", "trade confirmation", "report"]},

    # ── console/profile ───────────────────────────────────────────────────────
    {"query": "How do I update my profile details on Zerodha Console?",
     "expected_titles": ["profile", "console", "update", "details"]},

    # ── console/segments ──────────────────────────────────────────────────────
    {"query": "How do I activate the commodity or currency segment in Zerodha?",
     "expected_titles": ["segment", "activate", "commodity", "currency"]},

    # ── mutual-funds/understanding-mutual-funds ───────────────────────────────
    {"query": "How do I invest in mutual funds on Coin?",
     "expected_titles": ["mutual fund", "coin", "invest"]},

    {"query": "What is NAV in a mutual fund?",
     "expected_titles": ["nav", "net asset value", "mutual fund"]},

    {"query": "What is the difference between direct and regular mutual fund plans?",
     "expected_titles": ["direct plan", "regular plan", "expense ratio", "direct vs regular"]},

    # ── mutual-funds/payments-and-orders ──────────────────────────────────────
    {"query": "How do I set up a SIP on Zerodha Coin?",
     "expected_titles": ["sip", "systematic investment", "sip on coin"]},

    {"query": "How do I redeem or exit a mutual fund on Zerodha Coin?",
     "expected_titles": ["redeem", "withdraw", "exit", "mutual fund"]},

    # ── mutual-funds/nps ──────────────────────────────────────────────────────
    {"query": "How do I invest in NPS (National Pension System) through Zerodha?",
     "expected_titles": ["nps", "national pension", "pension"]},

    # ── mutual-funds/fixed-deposits ───────────────────────────────────────────
    {"query": "Can I invest in fixed deposits through Zerodha?",
     "expected_titles": ["fixed deposit", "fd", "bajaj", "deposit"]},

    # ── mutual-funds/features-on-coin ─────────────────────────────────────────
    {"query": "What is the basket feature on Zerodha Coin?",
     "expected_titles": ["basket", "coin", "feature"]},

    # ── mutual-funds/coin-general ─────────────────────────────────────────────
    {"query": "Is investing in mutual funds on Zerodha Coin free of charge?",
     "expected_titles": ["commission", "direct", "free", "coin", "charge"]},
]

ok(f"Golden QA set: {len(GOLDEN_QA)} queries across all 32 scraped sub-categories")


In [None]:
def hit_at_k(
    retrieved_chunks: List[Chunk],
    expected_title_fragments: List[str],
    k: int = 5,
) -> bool:
    """
    Returns True if any of the top-k retrieved chunks match
    at least one expected title fragment (case-insensitive substring).
    """
    top_titles = [c.title.lower() for c in retrieved_chunks[:k]]
    for frag in expected_title_fragments:
        for title in top_titles:
            if frag.lower() in title:
                return True
    return False


def faithfulness_score(answer: str, source_chunks: List[Chunk]) -> float:
    """
    Heuristic faithfulness check.
    Counts what fraction of answer trigrams appear in the supplied source chunks.

    Not a replacement for LLM-as-judge — but deterministic, fast, and
    surprisingly effective at catching severe hallucinations.
    """
    def ngrams(text: str, n: int = 3) -> set:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        return {tuple(tokens[i:i+n]) for i in range(len(tokens)-n+1)}

    answer_ngrams = ngrams(answer)
    if not answer_ngrams:
        return 0.0

    source_text = " ".join(c.text for c in source_chunks)
    source_ngrams = ngrams(source_text)

    overlap = answer_ngrams & source_ngrams
    return len(overlap) / len(answer_ngrams)


ok("Evaluation functions defined.")

### What hit@k and faithfulness actually measure — and what they miss

**Hit Rate @ k (retrieval metric)**
A query "hits" if at least one of the top-k retrieved chunks comes from an article whose title substring-matches an `expected_titles` entry in `GOLDEN_QA`. This tells you whether the retrieval stage surfaces the right *document* — it says nothing about whether the answer extracted from that document is correct.

*Blind spot:* A pipeline that always retrieves the right article but generates a hallucinated answer will score 100% hit rate with 0% factual accuracy. Hit rate is necessary but not sufficient.
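
A minimal sketch of the substring rule on toy data. The `ToyChunk` stand-in and the article titles are hypothetical, made up for illustration rather than taken from the corpus; the matching logic follows `hit_at_k`. It also shows a second thing to watch for: a short, generic fragment such as `"tax"` can match an unrelated title and inflate the hit rate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyChunk:
    """Hypothetical stand-in for the notebook's Chunk; only `title` is needed here."""
    title: str


def toy_hit_at_k(chunks: List[ToyChunk], fragments: List[str], k: int = 5) -> bool:
    # Same rule as hit_at_k: case-insensitive substring match on the top-k titles.
    top_titles = [c.title.lower() for c in chunks[:k]]
    return any(frag.lower() in title for frag in fragments for title in top_titles)


# Made-up titles, for illustration only.
retrieved = [
    ToyChunk("How to use the Kite order window"),
    ToyChunk("What is the syntax for streak strategies?"),
    ToyChunk("What is a GTT order?"),
]

specific_hit = toy_hit_at_k(retrieved, ["gtt", "good till triggered"], k=3)
missed_at_2 = toy_hit_at_k(retrieved, ["gtt", "good till triggered"], k=2)
generic_hit = toy_hit_at_k(retrieved, ["tax"], k=3)  # "tax" is a substring of "syntax"

print(specific_hit, missed_at_2, generic_hit)
```

The relevant article sits at rank 3, so it counts as a hit at k=3 and a miss at k=2. The generic fragment scores a hit without any relevant article being retrieved, which is why specific fragments make the metric more trustworthy.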

**Faithfulness score (grounding metric)**
Computed as the fraction of answer trigrams (3-word sequences) that appear verbatim in the retrieved source text. A score of 1.0 means every trigram in the answer also appears in the sources. A score of 0.3 means only 30% of the answer's trigrams do; the rest is either paraphrase or content not grounded in the context, so low scores are worth inspecting for hallucination.

*Why trigrams, not exact match?* Single words are too coarse (common words match trivially). Full sentences are too strict (the model may paraphrase correctly). Trigrams are a middle ground: cheap to compute, tolerant of light rewording, and still sensitive to invented phrases.

*Blind spot:* A model that copies source text verbatim scores 1.0 even if the copied text does not actually answer the question. Faithfulness measures grounding, not answer correctness. You need both.
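
Both extremes are easy to reproduce on toy strings. The sketch below applies the same trigram overlap to plain strings; the function name `trigram_overlap` and the example sentences are illustrative and made up, not taken from the notebook or the corpus.

```python
import re


def trigram_overlap(answer: str, source: str) -> float:
    """Fraction of the answer's word trigrams that also occur in the source."""
    def ngrams(text: str, n: int = 3) -> set:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    answer_ngrams = ngrams(answer)
    if not answer_ngrams:
        return 0.0
    return len(answer_ngrams & ngrams(source)) / len(answer_ngrams)


# Made-up sentences, for illustration only.
source = "Withdrawal requests placed before the cut-off are processed on the same day."
verbatim = "Withdrawal requests placed before the cut-off are processed on the same day."
paraphrase = "If you ask for a payout early enough, it goes through that day."

copy_score = trigram_overlap(verbatim, source)          # copied text: full overlap
paraphrase_score = trigram_overlap(paraphrase, source)  # no shared three-word run

print(copy_score, paraphrase_score)
```

The verbatim copy scores 1.0 regardless of what question was asked, while a paraphrase that shares no three-word run with the source scores 0.0. A low score is therefore a prompt to inspect the answer, not proof of hallucination.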

**Why synthetic golden QA?**
Collecting real user queries with verified correct answers from a live support system takes months. Synthetic QA lets us evaluate the pipeline immediately with reasonable coverage. The questions in `GOLDEN_QA` were written to cover every major Zerodha support category and to include edge cases (abbreviations, multi-step answers, regulatory questions).

*The key limitation:* the synthetic questions are written by someone who already knows what's in the corpus. Real user queries are messier, more ambiguous, and often combine multiple intents. Synthetic evaluation scores are optimistic estimates of real-world performance.

**For production evaluation, consider:**
- **RAGAS** — LLM-as-judge framework that measures answer correctness, context precision, and context recall
- **TruLens** — RAG triad (groundedness, answer relevance, context relevance) with per-query breakdown
- **Human eval spot-check** — 50 random queries judged by a domain expert remains the gold standard
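
For the human spot-check, one low-effort approach is to draw a fixed-seed random sample from logged queries, so the same set can be re-judged after each release. A minimal sketch, assuming the logs are available as a list of strings; `logged_queries`, `sample_for_review`, and the seed value are hypothetical and not part of this notebook.

```python
import random
from typing import List


def sample_for_review(queries: List[str], n: int = 50, seed: int = 7) -> List[str]:
    """Draw up to n distinct queries, reproducibly, for manual judging."""
    unique = sorted(set(queries))   # de-duplicate; sort so the result depends only on the seed
    rng = random.Random(seed)       # local RNG, leaves the global random state untouched
    return rng.sample(unique, k=min(n, len(unique)))


# Placeholder log; in practice this would come from the support system.
logged_queries = [f"example query {i}" for i in range(200)] + ["example query 3"]

review_set = sample_for_review(logged_queries)
print(len(review_set), review_set[:3])
```

Keeping the sample fixed across releases makes the human scores comparable over time; refreshing it periodically guards against tuning the pipeline to one particular set.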


In [None]:
%%time

hdr("Running Evaluation — Production RAG vs Naive Dense-Only")

# ── Production pipeline (Hybrid + Rerank + Adaptive-k) ────────────
prod_results = []
for qa in tqdm(GOLDEN_QA, desc="Production RAG"):
    candidates = hybrid_search(qa["query"], stage1_k=25)
    reranked   = rerank(qa["query"], candidates, final_k=8)
    final      = adaptive_k(reranked, min_k=2, max_k=8)
    hit        = hit_at_k([c for c, _ in final], qa["expected_titles"], k=5)
    prompt     = build_prompt(qa["query"], final)
    answer     = generator(prompt, do_sample=False)[0]["generated_text"]
    faith      = faithfulness_score(answer, [c for c, _ in final])
    prod_results.append({"hit": hit, "faithfulness": faith,
                         "k_used": len(final), "answer": answer})

# ── Naive baseline (dense-only, fixed top-5) ──────────────────────
naive_results = []
for qa in tqdm(GOLDEN_QA, desc="Naive Dense-only"):
    dense_hits = dense_search(qa["query"], top_k=5)
    chunks     = [all_chunks[idx] for idx, _ in dense_hits]
    hit        = hit_at_k(chunks, qa["expected_titles"], k=5)
    prompt     = build_prompt(qa["query"], [(c, 0.0) for c in chunks])
    answer     = generator(prompt, do_sample=False)[0]["generated_text"]
    faith      = faithfulness_score(answer, chunks)
    naive_results.append({"hit": hit, "faithfulness": faith,
                          "k_used": 5, "answer": answer})

In [None]:
%%time

hdr("Evaluation Results")

prod_hit_rate = sum(r["hit"] for r in prod_results) / len(prod_results)
prod_faith    = sum(r["faithfulness"] for r in prod_results) / len(prod_results)
prod_avg_k    = sum(r["k_used"] for r in prod_results) / len(prod_results)

naive_hit_rate = sum(r["hit"] for r in naive_results) / len(naive_results)
naive_faith    = sum(r["faithfulness"] for r in naive_results) / len(naive_results)

print(f"{'Metric':<35} {'Naive Dense-only':>18} {'Production RAG':>18}")
print("─" * 73)
print(f"{'Hit Rate @ 5':<35} {naive_hit_rate:>17.1%} {prod_hit_rate:>17.1%}")
print(f"{'Faithfulness (trigram overlap)':<35} {naive_faith:>17.2f} {prod_faith:>17.2f}")
print(f"{'Avg chunks sent to LLM':<35} {'5 (fixed)':>18} {prod_avg_k:>17.1f}")
print("─" * 73)
print()

# Per-query breakdown
print(f"{'Query':<52} {'Naive Hit':>11} {'Prod Hit':>10} {'Prod k':>8}")
print("─" * 84)
for i, qa in enumerate(GOLDEN_QA):
    q = qa["query"][:50]
    nh = Fore.GREEN + "✓" + Style.RESET_ALL if naive_results[i]["hit"] else Fore.RED + "✗" + Style.RESET_ALL
    ph = Fore.GREEN + "✓" + Style.RESET_ALL if prod_results[i]["hit"]  else Fore.RED + "✗" + Style.RESET_ALL
    pk = prod_results[i]["k_used"]
    print(f"  {q:<60} {nh:>9} {ph:>19} {pk:>8}")

---
## 9 · Interactive Query Interface

Type any question about Zerodha below and the full production pipeline runs — Hybrid Search → Cross-Encoder Rerank → Adaptive-k → Grounded Generation.

In [None]:
def ask(query: str, verbose: bool = True) -> RAGResponse:
    """
    Ask any question. Full pipeline runs automatically.
    Set verbose=False to suppress source details.
    """
    resp = rag_query(query)

    print(f"\n{Fore.YELLOW}{'─'*60}")
    print(f"Q: {query}")
    print(f"{'─'*60}{Style.RESET_ALL}")
    print(f"\n{Fore.GREEN}{resp.answer}{Style.RESET_ALL}")
    print(f"\n  Pipeline: Hybrid Search → CE Rerank → Adaptive-k={len(resp.sources)}")
    print(f"  Latency:  {resp.latency_ms:.0f}ms")

    if verbose:
        print(f"\n  {Fore.CYAN}Sources:{Style.RESET_ALL}")
        for i, src in enumerate(resp.sources, 1):
            print(f"    [{i}] {src.title}")
            print(f"         {src.url}")

    return resp


# ── Try it ──────────────────────────────────────────────────────
# Modify this query to test any Zerodha support question.
# This default covers account-opening:
_ = ask("What documents are required to open a Zerodha account?")


In [None]:
%%time

# ── 10 pre-loaded example queries ────────────────────────────────────
# Covering: F&O, funds, orders, account management, compliance, charges
# Run any single query or loop through all of them.

example_queries = [
    # Funds & withdrawals
    "How long does it take for a fund withdrawal to reach my bank?",
    "How do I add money to my Zerodha trading account?",
    # Orders & position management
    "What happens to my F&O position on expiry day?",
    "What is a GTT order and how do I place one?",
    "How do I convert a CNC position to MIS intraday?",
    "What is BTST trading and how does it work?",
    # Charges & brokerage
    "What is the brokerage charged on intraday equity trades?",
    "What are the charges for options trading on Zerodha?",
    # Compliance & account management
    "Can I trade on a Zerodha NRI account?",
    "What documents do I need to open a Zerodha demat account?",
]

# ── Run a single query (change index 0–9) ───────────────────────────
_ = ask(example_queries[0])

# ── Uncomment to run all 10 queries ─────────────────────────────────
# for q in example_queries:
#     _ = ask(q, verbose=False)
#     print()



---
## Key Takeaways

```
Production RAG is not:
  Chunk → Embed → top-k → Generate

Production RAG is:
  Semantic Chunk (sentence-aware, overlapping)
  → Hybrid Retrieval (BM25 + Dense, fused with RRF)
  → MMR (Maximal Marginal Relevance — diversity over redundancy)
  → Cross-Encoder Rerank (precision pass on shortlist)
  → Adaptive-k (score-gap boundary — no fixed k)
  → Grounded Generation (citations, no hallucination budget)
  → Continuous Eval (Hit Rate + Faithfulness per release)
```

**Swap out for production:**

| This notebook | Production equivalent |
|---------------|----------------------|
| `QdrantClient(':memory:')` | Qdrant Cloud / self-hosted |
| `flan-t5-base` | GPT-4.1-mini / Claude 3.5 / Gemini |
| `all-MiniLM-L6-v2` | Domain-fine-tuned embedding model |
| `BM25Okapi` in RAM | Elasticsearch sparse / Qdrant FastEmbed |
| `mmr()` in-process | Qdrant native MMR / Cohere Rerank diversity mode |
| Synthetic QA eval | Real user query logs + human annotation |

**Further reading:** HyDE · HyPE · Contextual Chunk Headers · Fusion Retrieval · PageRank in RAG  

**GitHub →** `github.com/indranildchandra/rag-done-right-in-production`

---
*Indranil Chandra · ML & Data Architect, Upstox · Co-organiser, GDG MAD Mumbai*
