# 📓 The GenAI Revolution Cookbook

**Title:** How to Add RAG Reranking with bge-reranker for Trusted Citations

**Description:** Upgrade your RAG with cross-encoder reranking, schema-validated JSON, provenance-rich citations, and calibrated confidence scores for consistently precise, trustworthy, production-ready answers.

---

*This Jupyter notebook contains executable code examples. Run the cells below to try the code yourself.*



## Why This Approach Works

Basic RAG systems fail silently: they hallucinate, cite nonexistent sources, and erode user trust. In production, you need precision, provenance, and calibrated confidence. This tutorial builds a production-grade RAG answerer that fixes these issues with two-stage retrieval (embedding + reranking), OpenAI Structured Outputs for strict JSON, and server-side confidence scoring. You'll learn to build a system that answers only with verifiable facts, cites exact sources, and quantifies its own reliability.

**What you'll build:** A complete RAG pipeline that retrieves candidates with FAISS, reranks for precision with a cross-encoder, generates grounded answers via OpenAI's Structured Outputs, and validates citations with Pydantic, all with latency logging and confidence calibration.

**Prerequisites:** Python 3.10+, OpenAI API key (with access to `gpt-4o-mini` or newer models supporting Structured Outputs), CPU baseline (optional GPU for faster reranking).

---

## How It Works (High-Level Overview)

1. **Ingest & Chunk:** Split documents into overlapping chunks with metadata (doc_id, title, URL, character spans).
2. **Embed & Index:** Encode chunks with BGE-small and store in a FAISS index for fast retrieval.
3. **Retrieve Candidates:** Use semantic search to fetch top-k candidates (high recall, lower precision).
4. **Rerank for Precision:** Apply a cross-encoder (BGE-reranker) to score query-passage pairs and select top-n.
5. **Generate Grounded Answer:** Pass top passages to OpenAI with a strict JSON schema enforcing citations and answer structure.
6. **Validate & Calibrate:** Backfill reranker scores into citations, compute confidence from passage quality and coverage, validate with Pydantic.

**Why two stages?** Embedding models are fast but imprecise; rerankers are slow but accurate. Combining both gives you speed and quality.
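
The funnel shape can be sketched without any models. The following is a toy illustration only: `two_stage_search`, `word_overlap`, and `phrase_match` are hypothetical stand-ins for embedding search and a cross-encoder, not part of the pipeline built below. The point is that the expensive scorer only ever sees the shortlist.

```python
from typing import Callable, List, Tuple

def two_stage_search(
    query: str,
    corpus: List[str],
    cheap_score: Callable[[str, str], float],
    precise_score: Callable[[str, str], float],
    top_k: int = 3,
    top_n: int = 1,
) -> List[Tuple[str, float]]:
    """Stage 1 keeps top_k by cheap_score; stage 2 re-sorts those by precise_score."""
    # Stage 1: score every document with the cheap function (stand-in for embedding search)
    shortlist = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:top_k]
    # Stage 2: score only the shortlist with the expensive function (stand-in for a cross-encoder)
    rescored = [(d, precise_score(query, d)) for d in shortlist]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_n]

def word_overlap(query: str, doc: str) -> float:
    """Toy stage-1 scorer: number of shared lowercase words, ignoring word order."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def phrase_match(query: str, doc: str) -> float:
    """Toy stage-2 scorer: 1.0 only if the full query phrase appears in the document."""
    return 1.0 if query.lower() in doc.lower() else 0.0

corpus = [
    "limits on rate of change apply",
    "the api rate limits are 100 requests per minute",
    "rate limits api",
    "welcome to the onboarding guide",
]
best = two_stage_search("rate limits", corpus, word_overlap, phrase_match, top_k=3, top_n=1)
print(best)
```

The first scorer cannot tell the three "rate"/"limits" documents apart, so the first document in the corpus would win on its own. The second scorer breaks the tie in favor of a document that actually contains the phrase.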

---

## Setup & Installation

Run this cell first to install dependencies (Colab-ready):

In [None]:
!pip install --quiet sentence-transformers==3.0.1 FlagEmbedding==1.2.11 faiss-cpu==1.8.0 openai==1.51.0 pydantic==2.9.2 "torch>=2.2,<3"

**Set your OpenAI API key** (Colab-safe):

In [None]:
import os
from getpass import getpass

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

# Validate key presence
if not os.environ.get("OPENAI_API_KEY"):
    raise ValueError("OPENAI_API_KEY is required. Set it via environment variable or the cell above.")

**Check your environment:**

In [None]:
import torch
print(f"Device: {'CUDA' if torch.cuda.is_available() else 'CPU'}")
print(f"PyTorch version: {torch.__version__}")

---

## Step-by-Step Implementation

### Step 1: Ingest and Chunk Documents

Split documents into overlapping chunks to preserve context across boundaries. Each chunk stores metadata for citation.

In [None]:
from dataclasses import dataclass
from typing import List, Dict, Any
import uuid

@dataclass
class Document:
    doc_id: str
    title: str
    url: str
    text: str

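def chunk_text(text: str, chunk_size=800, overlap=200) -> List[tuple]:
    """Splits text into overlapping (start, end, text) chunks."""
    if not 0 <= overlap < chunk_size:
        # Otherwise the window never advances and the loop below would not terminate
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    chunks = []
    start = 0
    n = len(text)
    while start < n:
        end = min(start + chunk_size, n)
        chunks.append((start, end, text[start:end]))
        if end == n:
            break
        start = end - overlap  # step back so adjacent chunks share `overlap` characters
    return chunks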

def ingest_documents(raw_docs: List[Dict[str, str]], chunk_size=800, overlap=200) -> tuple:
    """Ingests raw documents and chunks them with metadata."""
    docs = [
        Document(doc_id=str(uuid.uuid4()), title=rd["title"], url=rd["url"], text=rd["text"])
        for rd in raw_docs
    ]
    chunks = []
    for d in docs:
        for i, (s, e, chunk_text_) in enumerate(chunk_text(d.text, chunk_size, overlap)):
            chunk = {
                "doc_id": d.doc_id,
                "title": d.title,
                "url": d.url,
                "chunk_id": f"{d.doc_id}::chunk_{i}",
                "char_start": s,
                "char_end": e,
                "text": chunk_text_
            }
            chunks.append(chunk)
    return docs, chunks

**Example corpus:**

In [None]:
raw_corpus = [
    {
        "title": "Product Onboarding Guide",
        "url": "https://example.com/onboarding",
        "text": "Welcome to Acme. This guide covers account creation, SSO configuration, and role management. Start by creating an admin account, then configure SAML or OAuth for single sign-on. Assign roles to users based on their responsibilities."
    },
    {
        "title": "Security & Compliance",
        "url": "https://example.com/security",
        "text": "We support SOC 2 Type II, data encryption at rest (AES-256), and in transit (TLS 1.2+). Data residency options include US, EU, and APAC regions. All data is encrypted using industry-standard protocols."
    },
    {
        "title": "API Rate Limits",
        "url": "https://example.com/rate-limits",
        "text": "The default rate limit is 100 requests/min per API key. Burst behavior allows up to 150 requests in a 10-second window. Implement exponential backoff for retries. Contact support to increase limits for enterprise plans."
    },
]

docs, chunks = ingest_documents(raw_corpus, chunk_size=400, overlap=80)
print(f"Ingested {len(docs)} documents into {len(chunks)} chunks.")
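
To see how the overlap shapes the character spans stored in each chunk's metadata, here is a standalone sketch. `chunk_spans` is a hypothetical helper that mirrors the windowing logic of `chunk_text` but returns only `(start, end)` pairs; the input check is an added assumption about what counts as a valid configuration.

```python
from typing import List, Tuple

def chunk_spans(n_chars: int, chunk_size: int, overlap: int) -> List[Tuple[int, int]]:
    """Returns (start, end) character spans for a text of length n_chars."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    spans = []
    start = 0
    while start < n_chars:
        end = min(start + chunk_size, n_chars)
        spans.append((start, end))
        if end == n_chars:
            break
        start = end - overlap  # step back so adjacent chunks share `overlap` characters
    return spans

spans = chunk_spans(n_chars=1000, chunk_size=400, overlap=80)
print(spans)  # [(0, 400), (320, 720), (640, 1000)]
```

Each window advances by `chunk_size - overlap` characters (320 here), and the final chunk is truncated at the end of the text. The three sample documents above are each shorter than 400 characters, so each one becomes a single chunk.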

---

### Step 2: Embed and Index with FAISS

Use BGE-small to encode chunks. BGE models expect instruction prefixes for queries to improve retrieval quality.

In [None]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Initialize embedding model
EMB_MODEL_NAME = "BAAI/bge-small-en-v1.5"
embedder = SentenceTransformer(EMB_MODEL_NAME)
embedder.max_seq_length = 512

# Query instruction recommended by the bge-*-en-v1.5 model card.
# Passages are encoded as-is; the model card does not use a passage prefix.
QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "

def encode(texts: List[str], is_query=False) -> np.ndarray:
    """Encodes texts into embeddings with optional query instruction."""
    if is_query:
        # Prefix queries only; passages are embedded without an instruction
        texts = [QUERY_INSTRUCTION + t for t in texts]
    
    embs = embedder.encode(
        texts,
        batch_size=64,
        show_progress_bar=False,
        convert_to_numpy=True,
        normalize_embeddings=True
    )
    return embs.astype(np.float32)

# Encode chunks and build FAISS index
chunk_texts = [c["text"] for c in chunks]
chunk_embs = encode(chunk_texts, is_query=False)
dim = chunk_embs.shape[1]

index = faiss.IndexFlatIP(dim)  # Inner Product for cosine similarity
index.add(chunk_embs)
print(f"Indexed {index.ntotal} chunks with dimension {dim}.")
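
The index uses inner product because the embeddings are L2-normalized, and for unit-length vectors the inner product equals cosine similarity. A minimal pure-Python check of that identity follows; the vectors are made-up values and the helper names are illustrative.

```python
import math
from typing import List

def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v: List[float]) -> List[float]:
    """Scales a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]

# Inner product after normalization, which is what an inner-product index scores
ip_normalized = dot(l2_normalize(a), l2_normalize(b))

print(ip_normalized, cosine(a, b))
```

Both values come out to 0.96 (up to floating-point error). If normalization were skipped, inner-product scores would also depend on vector length and would no longer be comparable as cosine similarities.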

---

### Step 3: Retrieve Candidates

Fetch top-k candidates using semantic search. High recall, moderate precision.

In [None]:
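def retrieve(query: str, top_k=40) -> List[Dict[str, Any]]:
    """Retrieves top-k candidates based on query embedding."""
    if index.ntotal == 0:
        return []
    # FAISS pads results with id -1 when top_k exceeds the index size, and
    # chunks[-1] would silently return the last chunk, so cap k and skip padding.
    top_k = min(top_k, index.ntotal)
    q_emb = encode([query], is_query=True)[0].reshape(1, -1)
    scores, ids = index.search(q_emb, top_k)
    results = []
    for score, idx in zip(scores[0], ids[0]):
        if idx < 0:
            continue
        c = chunks[int(idx)]
        results.append({**c, "recall_score": float(score)})
    return results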

---

### Step 4: Rerank for Precision

Use a cross-encoder to score query-passage pairs. Reranking is slower but dramatically improves precision.

In [None]:
from FlagEmbedding import FlagReranker

# Auto-select reranker model and fp16 based on device
device = "cuda" if torch.cuda.is_available() else "cpu"
use_fp16 = device == "cuda"
RERANK_MODEL_NAME = "BAAI/bge-reranker-base" if device == "cpu" else "BAAI/bge-reranker-large"

reranker = FlagReranker(RERANK_MODEL_NAME, use_fp16=use_fp16)
print(f"Reranker: {RERANK_MODEL_NAME} on {device} (fp16={use_fp16})")

def rerank(query: str, candidates: List[Dict[str, Any]], top_n=5, normalize=True) -> tuple:
    """Reranks candidates using a cross-encoder."""
    if not candidates:
        return [], []
    pairs = [[query, c["text"]] for c in candidates]
    scores = reranker.compute_score(pairs, normalize=normalize)
    # Some FlagEmbedding versions return a bare float when given a single pair
    if not isinstance(scores, (list, tuple)):
        scores = [scores]
    for c, s in zip(candidates, scores):
        c["rerank_score"] = float(s)
    ranked = sorted(candidates, key=lambda x: x["rerank_score"], reverse=True)
    return ranked[:top_n], ranked
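
The `normalize=True` flag matters for the later steps because the citation `score` and the confidence value are both constrained to the range 0 to 1. To my understanding, FlagEmbedding implements this normalization by passing the raw cross-encoder logits through a sigmoid; treat that as an assumption and check the source for your installed version. The sketch below uses made-up logits to show the effect.

```python
import math
from typing import List

def sigmoid(x: float) -> float:
    """Maps a raw logit to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Made-up raw reranker logits: irrelevant, borderline, and relevant passages
raw_logits: List[float] = [-4.0, 0.0, 4.0]
normalized = [sigmoid(x) for x in raw_logits]

for logit, score in zip(raw_logits, normalized):
    print(f"logit={logit:+.1f} -> score={score:.3f}")
```

The mapping is monotonic, so the ranking is unchanged; only the scale differs. With `normalize=False` the scores are unbounded and would fail the `Field(ge=0.0, le=1.0)` checks used later in the notebook.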

---

### Step 5: Generate Grounded Answer with Structured Outputs

Define a strict JSON schema for OpenAI. The LLM returns only the answer and citations; we compute confidence server-side.

In [None]:
from openai import OpenAI
import json

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# JSON schema for Structured Outputs (no LLM-generated scores or confidence)
# Strict mode requires every property to be listed in "required";
# optional fields such as "quote" are expressed as nullable instead.
answer_schema = {
    "name": "rag_answer_schema",
    "schema": {
        "type": "object",
        "additionalProperties": False,
        "properties": {
            "answer": {
                "type": "string",
                "description": "A concise answer grounded strictly in the provided context."
            },
            "citations": {
                "type": "array",
                "items": {
                    "type": "object",
                    "additionalProperties": False,
                    "properties": {
                        "doc_id": {"type": "string"},
                        "title": {"type": "string"},
                        "url": {"type": "string"},
                        "chunk_id": {"type": "string"},
                        "quote": {"type": ["string", "null"]}
                    },
                    "required": ["doc_id", "title", "url", "chunk_id", "quote"]
                }
            }
        },
        "required": ["answer", "citations"]
    },
    "strict": True
}

SYSTEM_PROMPT = """You are a precise assistant that answers only with facts from the supplied CONTEXT.
Rules:
- If the context is insufficient, say you don't know.
- Cite sources using the exact doc_id, title, url, chunk_id provided.
- Include brief supporting quotes when possible.
- Do not invent URLs or IDs.
"""

def make_user_prompt(query: str, context: str) -> str:
    """Constructs the user prompt for the LLM."""
    return f"""QUESTION:
{query}

CONTEXT:
{context}

INSTRUCTIONS:
- Answer concisely in 2-5 sentences.
- Use only information from CONTEXT.
- Provide citations referencing chunk_id(s) that support each claim.
- Include short quotes in citations when useful.
"""

def build_context(passages: List[Dict[str, Any]]) -> str:
    """Builds context string from top passages."""
    blocks = []
    for i, p in enumerate(passages, 1):
        block = (
            f"[{i}] title={p['title']}\n"
            f"url={p['url']}\n"
            f"doc_id={p['doc_id']} chunk_id={p['chunk_id']} span=({p['char_start']},{p['char_end']})\n"
            f"rerank_score={p['rerank_score']:.4f}\n"
            f"{p['text']}"
        )
        blocks.append(block)
    return "\n\n".join(blocks)

def generate_answer(query: str, top_passages: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Generates answer using OpenAI with Structured Outputs."""
    context = build_context(top_passages)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Fallback model with Structured Outputs support
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": make_user_prompt(query, context)}
        ],
        response_format={"type": "json_schema", "json_schema": answer_schema}
    )
    
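    # Parse the JSON payload; content is None when the model refuses to answer
    message = response.choices[0].message
    if getattr(message, "refusal", None):
        raise ValueError(f"Model refused to answer: {message.refusal}")
    try:
        data = json.loads(message.content)
    except (TypeError, json.JSONDecodeError) as e:
        raise ValueError(f"Failed to parse LLM response: {e}")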
    
    return data
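
According to OpenAI's Structured Outputs documentation at the time of writing, strict mode expects every object in the schema to set `additionalProperties` to false and to list all of its properties under `required`. A schema that breaks either rule is rejected at request time, so it can be worth linting before the first API call. The checker below is a hypothetical helper (`find_strict_violations` is not part of the OpenAI SDK) and covers only these two rules.

```python
from typing import Any, Dict, List

def find_strict_violations(schema: Dict[str, Any], path: str = "$") -> List[str]:
    """Recursively lists objects that break two strict-mode rules."""
    problems: List[str] = []
    if schema.get("type") == "object":
        props = schema.get("properties", {})
        missing = sorted(set(props) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{path}: properties not in required: {missing}")
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be false")
        for name, sub in props.items():
            problems.extend(find_strict_violations(sub, f"{path}.{name}"))
    elif schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        problems.extend(find_strict_violations(schema["items"], f"{path}[]"))
    return problems

# Illustrative schema with one deliberate mistake
example = {
    "type": "object",
    "additionalProperties": False,
    "properties": {
        "answer": {"type": "string"},
        "citations": {
            "type": "array",
            "items": {
                "type": "object",
                "additionalProperties": False,
                "properties": {
                    "chunk_id": {"type": "string"},
                    "quote": {"type": ["string", "null"]},
                },
                "required": ["chunk_id"],  # "quote" is left out on purpose
            },
        },
    },
    "required": ["answer", "citations"],
}

violations = find_strict_violations(example)
print(violations)
```

To lint the notebook's schema, pass the inner dictionary: `find_strict_violations(answer_schema["schema"])`. An empty list means both rules hold.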

---

### Step 6: Validate and Calibrate Confidence

Backfill reranker scores into citations and compute confidence from passage quality and coverage. Validate with Pydantic.

In [None]:
from pydantic import BaseModel, Field, HttpUrl
from typing import Optional

class Citation(BaseModel):
    doc_id: str
    title: str
    url: HttpUrl
    chunk_id: str
    score: float = Field(ge=0.0, le=1.0)
    quote: Optional[str] = None

class RAGAnswer(BaseModel):
    answer: str
    confidence: float = Field(ge=0.0, le=1.0)
    citations: List[Citation]

def compute_confidence(citations: List[Dict[str, Any]], top_passages: List[Dict[str, Any]]) -> float:
    """Computes confidence from reranker scores and citation coverage."""
    score_by_chunk = {p["chunk_id"]: p["rerank_score"] for p in top_passages}
    cited_chunks = set([c["chunk_id"] for c in citations if c.get("chunk_id") in score_by_chunk])
    
    if cited_chunks:
        mean_score = sum(score_by_chunk[c] for c in cited_chunks) / len(cited_chunks)
    else:
        mean_score = 0.0
    
    coverage = len(cited_chunks) / max(1, len(top_passages))
    conf = 0.7 * float(mean_score) + 0.3 * float(coverage)
    return round(max(0.0, min(1.0, conf)), 3)

def assemble_final_answer(raw_llm: Dict[str, Any], top_passages: List[Dict[str, Any]]) -> RAGAnswer:
    """Backfills scores, computes confidence, validates with Pydantic."""
    score_map = {p["chunk_id"]: p["rerank_score"] for p in top_passages}
    
    # Backfill reranker scores into citations
    for c in raw_llm.get("citations", []):
        if c["chunk_id"] in score_map:
            c["score"] = float(score_map[c["chunk_id"]])
        else:
            c["score"] = 0.0  # Fallback if chunk_id not found
    
    # Compute confidence server-side
    conf = compute_confidence(raw_llm.get("citations", []), top_passages)
    raw_llm["confidence"] = conf
    
    return RAGAnswer(**raw_llm)
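
A worked example makes the blend easier to reason about. The sketch below restates the same formula as `compute_confidence` (0.7 × mean reranker score of the cited chunks + 0.3 × the fraction of top passages cited) in a standalone function; `blended_confidence` and the sample scores are illustrative, not outputs of the pipeline.

```python
from typing import List

def blended_confidence(
    cited_scores: List[float],
    n_top_passages: int,
    w_score: float = 0.7,
    w_coverage: float = 0.3,
) -> float:
    """Blends mean cited-passage score with citation coverage, clamped to [0, 1]."""
    mean_score = sum(cited_scores) / len(cited_scores) if cited_scores else 0.0
    coverage = len(cited_scores) / max(1, n_top_passages)
    conf = w_score * mean_score + w_coverage * coverage
    return round(max(0.0, min(1.0, conf)), 3)

# Two strong citations out of five passages: 0.7 * 0.85 + 0.3 * 0.4
strong = blended_confidence([0.9, 0.8], n_top_passages=5)

# One strong citation out of five passages: 0.7 * 0.9 + 0.3 * 0.2
single = blended_confidence([0.9], n_top_passages=5)

# No valid citations at all
none = blended_confidence([], n_top_passages=5)

print(strong, single, none)  # 0.715 0.69 0.0
```

The coverage term penalizes answers that cite few of the retrieved passages, so a single excellent citation with `top_n=5` lands just under 0.7. If most of your questions are answerable from one chunk, a smaller `top_n` or a lower coverage weight may suit them better.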

---

### Step 7: End-to-End Pipeline with Latency Logging

Combine all steps into a single function with timing for each stage.

In [None]:
import time

def answer_query(query: str, top_k=40, top_n=5) -> tuple:
    """Answers a query with retrieval, reranking, generation, and validation."""
    t0 = time.perf_counter()
    cand = retrieve(query, top_k=top_k)
    t1 = time.perf_counter()
    
    top, _ = rerank(query, cand, top_n=top_n, normalize=True)
    t2 = time.perf_counter()
    
    raw = generate_answer(query, top)
    t3 = time.perf_counter()
    
    final = assemble_final_answer(raw, top)
    t4 = time.perf_counter()
    
    timings = {
        "retrieve_ms": round((t1 - t0) * 1000, 1),
        "rerank_ms": round((t2 - t1) * 1000, 1),
        "generate_ms": round((t3 - t2) * 1000, 1),
        "validate_ms": round((t4 - t3) * 1000, 1),
        "total_ms": round((t4 - t0) * 1000, 1),
    }
    return final, timings
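
If the number of stages grows, the paired `perf_counter()` calls can get repetitive. One possible alternative is a small context manager that writes each stage's duration into a shared dictionary. `stage_timer` is a hypothetical helper, and the workloads below are placeholders for the real stages.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator

@contextmanager
def stage_timer(timings: Dict[str, float], name: str) -> Iterator[None]:
    """Records the wall-clock duration of the enclosed block as '<name>_ms'."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the stage raises, so failed requests still show up in logs
        timings[f"{name}_ms"] = round((time.perf_counter() - start) * 1000, 1)

timings: Dict[str, float] = {}

with stage_timer(timings, "retrieve"):
    sum(range(10_000))  # placeholder for retrieve(...)

with stage_timer(timings, "rerank"):
    sorted(range(10_000), reverse=True)  # placeholder for rerank(...)

timings["total_ms"] = round(sum(timings.values()), 1)
print(timings)
```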

---

## Run and Validate

Test the pipeline with two queries and inspect the JSON output.

In [None]:
def render(final: RAGAnswer):
    """Renders the final answer and citations."""
    print("Answer:")
    print(final.answer)
    print(f"\nConfidence: {final.confidence:.2f}")
    print("\nCitations:")
    for c in final.citations:
        print(f"- {c.title} [{c.chunk_id}] ({c.url}) score={c.score:.3f}")
        if c.quote:
            print(f'  "{c.quote[:160]}..."')

# Query 1: Security and compliance
q1 = "How is data encrypted and do you have SOC 2?"
final1, timings1 = answer_query(q1, top_k=40, top_n=5)
render(final1)
print("\nTimings (ms):", timings1)

print("\n" + "="*60 + "\n")

# Query 2: Rate limits
q2 = "What are the API rate limits?"
final2, timings2 = answer_query(q2, top_k=40, top_n=5)
render(final2)
print("\nTimings (ms):", timings2)

**Interpreting the output:**

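- **Answer:** Restricted by the prompt to the top passages. Check each claim against its citation rather than assuming it is grounded.
- **Confidence:** Computed from reranker scores and citation coverage (0.7 × mean reranker score of cited chunks + 0.3 × coverage). Values above 0.7 indicate high reliability in this setup; tune the threshold on your own data.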
- **Citations:** Each includes the exact chunk_id, reranker score, and optional quote for verification.
- **Timings:** Retrieval is fast (<50ms), reranking slower (100-500ms depending on device), generation depends on LLM latency.

**Validation checks:**

- All URLs in citations are valid (Pydantic enforces HttpUrl).
- Confidence is between 0.0 and 1.0.
- Citations whose chunk_id is not among the top passages receive a score of 0.0 and are excluded from the confidence calculation.
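
The checks above could be tightened by separating citations the pipeline can verify from those it cannot. The following is a sketch under two assumptions: a citation counts as valid only if its `chunk_id` was in the supplied passages, and any quote must appear verbatim in that chunk's text. `split_citations` is a hypothetical helper, and the sample data is invented.

```python
from typing import Any, Dict, List, Tuple

def split_citations(
    citations: List[Dict[str, Any]],
    top_passages: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Splits citations into (valid, rejected); rejected entries carry a 'reason'."""
    text_by_chunk = {p["chunk_id"]: p["text"] for p in top_passages}
    valid, rejected = [], []
    for c in citations:
        chunk_text = text_by_chunk.get(c.get("chunk_id"))
        quote = c.get("quote")
        if chunk_text is None:
            rejected.append({**c, "reason": "unknown chunk_id"})
        elif quote and quote not in chunk_text:
            # Exact substring match; paraphrased quotes are rejected by design
            rejected.append({**c, "reason": "quote not found in chunk"})
        else:
            valid.append(c)
    return valid, rejected

passages = [
    {"chunk_id": "doc1::chunk_0", "text": "The default rate limit is 100 requests/min per API key."},
]
llm_citations = [
    {"chunk_id": "doc1::chunk_0", "quote": "100 requests/min per API key"},
    {"chunk_id": "doc1::chunk_0", "quote": "500 requests/min"},
    {"chunk_id": "doc9::chunk_3", "quote": None},
]

valid, rejected = split_citations(llm_citations, passages)
print(len(valid), [r["reason"] for r in rejected])
```

Whether to drop rejected citations, lower the confidence, or fail the request is a product decision. Exact matching is strict, so whitespace or punctuation differences in a quote will also be rejected.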

---

## Conclusion

You've built a production-grade RAG answerer that retrieves with FAISS, reranks for precision, generates grounded answers via OpenAI Structured Outputs, and validates citations with Pydantic. Key design choices:

- **Two-stage retrieval:** Fast embedding search + slow reranking = speed + accuracy.
- **Structured Outputs:** Enforces a strict JSON schema, so responses parse reliably; refusals and truncated outputs still need handling.
- **Server-side confidence:** Computed from reranker scores and coverage, not guessed by the LLM.
- **Provenance:** Every claim is tied to a verifiable chunk_id, title, and URL.

**Next steps:**

- **Evaluation:** Benchmark retrieval quality with BEIR datasets and answer quality with RAGAS (faithfulness, relevance).
- **Hybrid retrieval:** Combine dense embeddings with BM25 for keyword-heavy queries.
- **FastAPI deployment:** Wrap `answer_query()` in a REST endpoint with async support and rate limiting.
- **Observability:** Log queries, latencies, and confidence scores to track performance in production.
- **Cost management:** Cache embeddings, batch reranking, and monitor OpenAI token usage.
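
For the hybrid retrieval step, one common way to merge a dense ranking with a BM25 ranking is reciprocal rank fusion (RRF), which needs only the rank positions and not comparable scores. This is a sketch of that one option, not the only way to combine the two. The chunk ids and rankings below are invented, and `k=60` is the constant commonly used with RRF.

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuses ranked lists of ids; each list contributes 1 / (k + rank) per item."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=lambda item: scores[item], reverse=True)

# Invented rankings of chunk ids from two retrievers
dense_ranking = ["c2", "c1", "c3"]
bm25_ranking = ["c1", "c4", "c2"]

fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
print(fused)  # ['c1', 'c2', 'c4', 'c3']
```

Chunks that appear in both lists rise to the top, and the fused list can then be passed to the reranker as the candidate set in place of the dense-only results.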

This system is ready to handle real queries with verifiable answers and calibrated confidence: no silent failures, no hallucinations.