# Lab 3: Building the RAG Service
## Learning Objectives
By the end of this lab, you will:
- Configure a ChromaDB vector store with HNSW indexing
- Implement a VectorStoreManager for batch ingestion and search
- Build an end-to-end RAGService that orchestrates Retrieve → Augment → Generate
- Test your RAG system with real queries
## Setup

In [None]:
!uv pip install chromadb numpy -q


## Part 1: Vector Store Setup

**ChromaDB** is an open-source embedding database that makes it easy to store, search, and retrieve vector embeddings. Under the hood, it uses **HNSW (Hierarchical Navigable Small World)** as its approximate nearest neighbor (ANN) index.

Key HNSW parameters:
- `hnsw:space` — Distance metric (`cosine`, `l2`, `ip`). We use `cosine` for text embeddings.
- `hnsw:construction_ef` — Controls index build quality. Higher = better recall, slower build.
- `hnsw:search_ef` — Controls search quality. Higher = better recall, slower search.
- `hnsw:M` — Number of bi-directional links per node. Higher = better recall, more memory.
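
For intuition about what the ANN index approximates, here is a minimal brute-force sketch of exact cosine search in plain Python. The helpers `cosine_similarity` and `exact_top_k` are names made up for this illustration and are not part of ChromaDB. HNSW aims to return approximately the same neighbors without scoring every stored vector. The sketch also shows the relationship used later in this lab: cosine distance is `1 - cosine similarity`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (independent of their lengths)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exact_top_k(query, vectors, k=2):
    """Brute-force search: score every vector, return the k best (index, similarity) pairs."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]  # Same direction as vectors[0], but twice as long

top = exact_top_k(query, vectors)
for idx, sim in top:
    print(f"vector {idx}: similarity={sim:.3f}, cosine distance={1 - sim:.3f}")
```

This prints vector 0 with similarity 1.000 (distance 0.000) and vector 2 with similarity 0.707 (distance 0.293). The length of the query does not affect the ranking, which is why cosine is a common choice for text embeddings.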

We will create a `VectorStoreManager` class that wraps ChromaDB to provide a clean interface for document ingestion and search.

In [None]:
import chromadb
import numpy as np
import hashlib

class VectorStoreManager:
    """Manages ChromaDB with HNSW indexing."""
    
    def __init__(self, collection_name="research_papers"):
        # Use in-memory client for this lab
        self.client = chromadb.Client()
        
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            metadata={
                "hnsw:space": "cosine",
                "hnsw:construction_ef": 200,
                "hnsw:search_ef": 100,
                "hnsw:M": 16,
            }
        )
        print(f"Collection '{collection_name}' ready (HNSW cosine)")
    
    def add_documents(self, documents, batch_size=100):
        """Add documents with embeddings in batches."""
        total = 0
        for i in range(0, len(documents), batch_size):
            batch = documents[i:i+batch_size]
            ids, embeddings, metadatas, texts = [], [], [], []
            
            for doc in batch:
                content_hash = hashlib.md5(doc['text'].encode()).hexdigest()
                ids.append(f"doc_{content_hash}")
                embeddings.append(doc['embedding'])
                metadatas.append(doc.get('metadata', {}))
                texts.append(doc['text'])
            
            self.collection.add(
                embeddings=embeddings,
                documents=texts,
                metadatas=metadatas,
                ids=ids
            )
            total += len(batch)
        
        # "Submitted" rather than "Added": IDs that already exist may not be stored again
        print(f"Submitted {total} documents. Collection size: {self.collection.count()}")
        return total
    
    def search(self, query_embedding, n_results=5, filter_conditions=None):
        """Search for similar documents."""
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=n_results,
            where=filter_conditions,
            include=["documents", "metadatas", "distances"]
        )
        
        formatted = []
        for i in range(len(results["documents"][0])):
            formatted.append({
                "text": results["documents"][0][i],
                "metadata": results["metadatas"][0][i],
                # Cosine distance -> cosine similarity; only valid because
                # the collection was created with "hnsw:space": "cosine"
                "score": 1 - results["distances"][0][i]
            })
        return formatted
    
    def get_stats(self):
        return {"count": self.collection.count()}

# Initialize
store = VectorStoreManager()
print(f"Stats: {store.get_stats()}")


## Part 2: Populate with Sample Data

We will create simulated research paper chunks with synthetic embeddings. In a production system, these embeddings would come from your ingestion and embedding pipeline (e.g., the `EmbeddingGenerator` from Lab 2).

To make the simulation meaningful, we inject a topic-specific signal into each embedding so that semantically related chunks are closer together in vector space.
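
The effect of the topic signal can be checked in isolation. The sketch below is a dependency-free illustration using the same recipe as this lab (Gaussian noise with standard deviation 0.1, plus 0.5 on a 50-dimension block, then normalization). The helper names `make_embedding` and `cosine` and the fixed seed are assumptions made for the example.

```python
import math
import random

rng = random.Random(0)  # Fixed seed so the run is repeatable

def make_embedding(signal_range=None, dim=384):
    """Gaussian noise, plus +0.5 on one block of dimensions if a topic range is given."""
    vec = [rng.gauss(0.0, 0.1) for _ in range(dim)]
    if signal_range is not None:
        for i in range(*signal_range):
            vec[i] += 0.5
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine(a, b):
    # Inputs are unit vectors, so the dot product is the cosine similarity
    return sum(x * y for x, y in zip(a, b))

rag_a = make_embedding((0, 50))      # "RAG" topic
rag_b = make_embedding((0, 50))      # "RAG" topic, different noise
vec_c = make_embedding((50, 100))    # "vector" topic

same_topic = cosine(rag_a, rag_b)
cross_topic = cosine(rag_a, vec_c)
print(f"same topic:  {same_topic:.3f}")
print(f"cross topic: {cross_topic:.3f}")
```

A rough estimate explains the gap: the signal contributes about 50 × 0.25 = 12.5 to the squared norm and the noise about 384 × 0.01 ≈ 3.84, so two same-topic vectors should land near 12.5 / 16.3 ≈ 0.77, while vectors with different topic blocks should land near 0.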

In [None]:
# Simulated research paper chunks with pre-computed embeddings
# In production, these would come from your ingestion + embedding pipeline
np.random.seed(42)
DIM = 384  # Using smaller dimension for demo

sample_chunks = [
    {"text": "RAG combines retrieval with generation to ground LLM responses in external knowledge. This reduces hallucinations and enables citation of sources.", 
     "metadata": {"title": "RAG Survey", "section": "introduction", "chunk_id": 0}},
    {"text": "The transformer architecture uses self-attention mechanisms to process all positions in a sequence simultaneously, enabling massive parallelization during training.",
     "metadata": {"title": "Attention Is All You Need", "section": "architecture", "chunk_id": 0}},
    {"text": "HNSW (Hierarchical Navigable Small World) is an approximate nearest neighbor algorithm that builds a multi-layered graph for fast vector search with high recall.",
     "metadata": {"title": "HNSW Paper", "section": "algorithm", "chunk_id": 0}},
    {"text": "Cosine similarity measures the angle between two vectors, making it robust to document length variation. It is the standard metric for text embeddings.",
     "metadata": {"title": "Vector Search Guide", "section": "metrics", "chunk_id": 0}},
    {"text": "Fine-tuning changes model weights through additional training, while RAG keeps the model frozen and provides knowledge through retrieval. RAG is more cost-effective for dynamic knowledge.",
     "metadata": {"title": "RAG vs Fine-tuning", "section": "comparison", "chunk_id": 0}},
    {"text": "Chunking is the process of splitting documents into smaller pieces for embedding. The optimal chunk size balances context preservation with retrieval precision.",
     "metadata": {"title": "Chunking Strategies", "section": "fundamentals", "chunk_id": 0}},
    {"text": "Cross-encoder re-ranking takes query-document pairs as input and produces a relevance score. It is more accurate than bi-encoders but slower.",
     "metadata": {"title": "Re-ranking Survey", "section": "methods", "chunk_id": 0}},
    {"text": "BM25 is a probabilistic ranking function based on term frequency. It excels at exact keyword matching, complementing semantic search in hybrid systems.",
     "metadata": {"title": "Hybrid Search", "section": "bm25", "chunk_id": 0}},
    {"text": "Vector quantization reduces memory by storing vectors with lower precision. Scalar quantization (SQ8) achieves 4x compression with minimal recall loss.",
     "metadata": {"title": "Scaling Vectors", "section": "quantization", "chunk_id": 0}},
    {"text": "The evaluation of RAG systems requires both retrieval metrics (Hit Rate, MRR) and generation metrics (faithfulness, relevance). DeepEval provides automated LLM-as-Judge evaluation.",
     "metadata": {"title": "RAG Evaluation", "section": "metrics", "chunk_id": 0}},
]

# Generate embeddings that cluster by topic (simulated)
# In production, use your EmbeddingGenerator from Lab 2
for i, chunk in enumerate(sample_chunks):
    # Create a base vector + topic-specific signal
    base = np.random.randn(DIM) * 0.1
    # Add topic signal so similar chunks are closer
    if "RAG" in chunk["text"] or "retrieval" in chunk["text"].lower():
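        base[:50] += 0.5  # RAG topic signal
    if "vector" in chunk["text"].lower() or "embedding" in chunk["text"].lower():
        base[50:100] += 0.5  # Vector topic signal
    if "chunk" in chunk["text"].lower() or "split" in chunk["text"].lower():
        base[100:150] += 0.5  # Chunking topic signal (matches RAGService._embed_query)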
    chunk["embedding"] = (base / np.linalg.norm(base)).tolist()

# Add to vector store
store.add_documents(sample_chunks)


## Part 3: Search Testing

Let's test that our vector store returns relevant results. We create a query embedding with a RAG topic signal and verify that the top results are indeed about RAG.

In [None]:
# Create a query embedding (simulated)
query = "What is RAG and how does it work?"
# Query embedding with RAG topic signal
query_emb = np.random.randn(DIM) * 0.1
query_emb[:50] += 0.5  # RAG topic signal
query_emb = (query_emb / np.linalg.norm(query_emb)).tolist()

results = store.search(query_emb, n_results=3)
print(f"Query: {query}\n")
print(f"Top {len(results)} results:")
for i, r in enumerate(results):
    print(f"\n  [{i+1}] Score: {r['score']:.4f}")
    print(f"      Title: {r['metadata'].get('title', 'N/A')}")
    print(f"      Text: {r['text'][:120]}...")


### Exercise 3.1: Metadata Filtering

ChromaDB supports metadata filtering via the `where` parameter. This allows you to combine vector similarity search with structured filters -- for example, searching only within a specific section or document title.
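
Filters are plain dictionaries. The sketch below shows three filter shapes that ChromaDB's `where` parameter accepts (plain equality, `$in`, and `$and`) and evaluates them locally against sample metadata. The `matches` helper is hypothetical: it covers only this small subset of the syntax to show what each filter selects, and it is not how ChromaDB evaluates filters internally.

```python
# Filter dictionaries in the shape ChromaDB's `where` parameter accepts
by_section = {"section": "architecture"}                    # Plain equality
by_sections = {"section": {"$in": ["metrics", "methods"]}}  # Any of several values
by_both = {"$and": [{"section": "metrics"}, {"title": "RAG Evaluation"}]}

def matches(metadata: dict, where: dict) -> bool:
    """Hypothetical helper: evaluate a small subset of the filter syntax locally."""
    for key, condition in where.items():
        if key == "$and":
            ok = all(matches(metadata, sub) for sub in condition)
        elif key == "$or":
            ok = any(matches(metadata, sub) for sub in condition)
        elif isinstance(condition, dict) and "$in" in condition:
            ok = metadata.get(key) in condition["$in"]
        else:
            ok = metadata.get(key) == condition
        if not ok:
            return False
    return True

chunks = [
    {"title": "Attention Is All You Need", "section": "architecture"},
    {"title": "Vector Search Guide", "section": "metrics"},
    {"title": "RAG Evaluation", "section": "metrics"},
    {"title": "Re-ranking Survey", "section": "methods"},
]

for name, where in [("by_section", by_section),
                    ("by_sections", by_sections),
                    ("by_both", by_both)]:
    titles = [c["title"] for c in chunks if matches(c, where)]
    print(f"{name}: {titles}")
```

Any of these dictionaries can be passed as `filter_conditions` to `store.search`, which forwards it to `where`.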

In [None]:
# TODO: Search for chunks specifically about "architecture" section
# Use the filter_conditions parameter with: {"section": "architecture"}

filtered_results = None  # Replace with: store.search(query_emb, n_results=5, filter_conditions={"section": "architecture"})

assert filtered_results is not None, "Call store.search with filter_conditions"
assert len(filtered_results) > 0, "Should find at least one result"
assert all(r['metadata']['section'] == 'architecture' for r in filtered_results), \
    "All results should be from 'architecture' section"
print(f"Found {len(filtered_results)} results in 'architecture' section")
for r in filtered_results:
    print(f"  {r['metadata']['title']}: {r['text'][:80]}...")
print("Metadata filtering working!")


## Part 4: The RAG Service

Now we build the complete RAG pipeline that orchestrates three stages:

1. **Retrieve** -- Embed the user query and search the vector store for relevant chunks.
2. **Augment** -- Format the retrieved chunks into a context string and build a grounded prompt.
3. **Generate** -- Send the augmented prompt to the LLM to produce a cited answer.

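In this lab we simulate embedding and generation. In production, you would replace `_embed_query` with a call to an embedding model (for example, the `EmbeddingGenerator` from Lab 2 or a hosted embeddings API) and `_generate` with a call to an LLM provider such as OpenAI or Anthropic. Queries must be embedded with the same model that produced the document embeddings; otherwise the similarity scores are meaningless.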
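
One way to make that swap painless is to pass the embedding and generation components in as arguments instead of hard-coding them. The sketch below is an assumption about how this could be structured, not part of the lab's `RAGService`: `Embedder`, `Generator`, `KeywordEmbedder`, `EchoGenerator`, `PluggableRAG`, and `fake_search` are all hypothetical names, and the stand-ins do not call any real provider.

```python
from typing import Callable, Protocol

class Embedder(Protocol):
    def embed(self, text: str) -> list[float]: ...

class Generator(Protocol):
    def generate(self, prompt: str) -> str: ...

class KeywordEmbedder:
    """Stand-in embedder: one dimension per keyword, 1.0 if the keyword appears."""
    KEYWORDS = ("rag", "vector", "chunk")

    def embed(self, text: str) -> list[float]:
        lowered = text.lower()
        return [1.0 if kw in lowered else 0.0 for kw in self.KEYWORDS]

class EchoGenerator:
    """Stand-in generator: reports the prompt size instead of calling an LLM."""
    def generate(self, prompt: str) -> str:
        return f"(simulated answer to a {len(prompt)}-character prompt)"

class PluggableRAG:
    """Hypothetical variant of a RAG service that receives its providers as arguments."""

    def __init__(self, embedder: Embedder, generator: Generator,
                 search: Callable[[list[float]], list[str]]):
        self.embedder = embedder
        self.generator = generator
        self.search = search

    def answer(self, query: str) -> str:
        chunks = self.search(self.embedder.embed(query))   # Retrieve
        context = "\n\n".join(chunks)                       # Augment
        prompt = f"CONTEXT:\n{context}\n\nQUESTION: {query}\n\nANSWER:"
        return self.generator.generate(prompt)              # Generate

def fake_search(embedding: list[float]) -> list[str]:
    # Stand-in for a vector store: ignores the embedding, returns one fixed chunk
    return ["RAG combines retrieval with generation."]

demo = PluggableRAG(KeywordEmbedder(), EchoGenerator(), fake_search)
print(demo.answer("What is RAG?"))
```

With this shape, moving from the simulation to a real provider means writing one new class with an `embed` or `generate` method; the pipeline code stays unchanged.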

In [None]:
class RAGService:
    """Orchestrates Retrieve -> Augment -> Generate."""
    
    def __init__(self, vector_store):
        self.vector_store = vector_store
        self.embed_dim = DIM
    
    def _embed_query(self, query: str) -> list:
        """Generate query embedding. In production: call OpenAI API."""
        # Simulated embedding based on keywords
        emb = np.random.randn(self.embed_dim) * 0.1
        query_lower = query.lower()
        if any(w in query_lower for w in ["rag", "retrieval", "generation"]):
            emb[:50] += 0.5
        if any(w in query_lower for w in ["vector", "embedding", "cosine"]):
            emb[50:100] += 0.5
        if any(w in query_lower for w in ["chunk", "split"]):
            emb[100:150] += 0.5
        return (emb / np.linalg.norm(emb)).tolist()
    
    def _build_context(self, results: list) -> str:
        """Format retrieved chunks into context string."""
        parts = []
        for i, doc in enumerate(results, 1):
            title = doc['metadata'].get('title', 'Unknown')
            parts.append(f"[{i}] (Source: {title})\n{doc['text']}")
        return "\n\n---\n\n".join(parts)
    
    def _build_prompt(self, query: str, context: str) -> str:
        """Build the full prompt for the LLM."""
        system = """You are a Research Assistant. Answer using ONLY the provided context.
Cite sources using [N] format. If the answer is not in the context, say so."""
        
        return f"""SYSTEM: {system}

CONTEXT:
{context}

QUESTION: {query}

ANSWER:"""
    
    def _generate(self, prompt: str) -> str:
        """Generate answer. In production: call OpenAI/Claude API."""
        # Simulated generation: quote the top-ranked chunk from the context.
        # _build_context writes each chunk as a "[N] (Source: ...)" header line
        # followed by the chunk text, so the text is the line after the header.
        lines = prompt.split("\n")
        for idx, line in enumerate(lines[:-1]):
            if line.startswith("[1] (Source:"):
                return f"Based on the retrieved sources: {lines[idx + 1][:200]}... [1]"
        return "I don't have enough information to answer this question."
    
    def answer(self, query: str, top_k: int = 3) -> dict:
        """Full RAG pipeline: Retrieve -> Augment -> Generate."""
        # 1. RETRIEVE
        query_embedding = self._embed_query(query)
        results = self.vector_store.search(query_embedding, n_results=top_k)
        
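        if not results:
            # Return the same keys as the success path so callers can rely on them
            return {"answer": "No relevant information found.", "sources": [],
                    "scores": [], "prompt_preview": ""}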
        
        # 2. AUGMENT
        context = self._build_context(results)
        prompt = self._build_prompt(query, context)
        
        # 3. GENERATE
        answer = self._generate(prompt)
        
        return {
            "answer": answer,
            "sources": [r['metadata'] for r in results],
            "scores": [r['score'] for r in results],
            "prompt_preview": prompt[:500] + "..."
        }

# Test the complete pipeline
rag = RAGService(store)
response = rag.answer("What is RAG and why is it useful?")

print(f"Answer: {response['answer']}")
print(f"\nSources:")
for s in response['sources']:
    print(f"  - {s.get('title', 'N/A')} ({s.get('section', 'N/A')})")
print(f"\nRetrieval scores: {[f'{s:.3f}' for s in response['scores']]}")


### Exercise 4.1: Add Score Threshold

In production, not all retrieved results are useful. Low-scoring results can introduce noise into the context and degrade generation quality. Implement a `min_score` threshold to filter out weak matches.

In [None]:
class ImprovedRAGService(RAGService):
    """RAG Service with minimum score threshold."""
    
    def answer(self, query: str, top_k: int = 3, min_score: float = 0.0) -> dict:
        # TODO: Override the answer method to filter out results below min_score
        # Steps:
        # 1. Call self._embed_query(query)
        # 2. Call self.vector_store.search(embedding, n_results=top_k)
        # 3. Filter results where score >= min_score
        # 4. If no results pass the threshold, return a "no information" message
        # 5. Build context and generate as before
        pass

# Test
improved_rag = ImprovedRAGService(store)
result = improved_rag.answer("What is RAG?", min_score=0.5)
from tests import checks
checks.check_lab_3_4(result)


### Exercise 4.2: Deduplication Check

When ingesting documents at scale, you may encounter duplicates. Let's verify how ChromaDB handles re-adding documents that already exist in the collection.
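
Duplicates are detectable here because `add_documents` derives each ID from the document text, so identical text always maps to the same ID. The sketch below reproduces that ID scheme outside ChromaDB and adds a hypothetical `drop_seen` pre-filter to show the idea. It says nothing about what ChromaDB itself does with a repeated ID, which is what the exercise checks.

```python
import hashlib

def make_doc_id(text: str) -> str:
    """Same ID scheme as add_documents: 'doc_' + MD5 hex digest of the UTF-8 text."""
    return f"doc_{hashlib.md5(text.encode()).hexdigest()}"

def drop_seen(texts, seen_ids):
    """Hypothetical pre-filter: keep only texts whose ID is not in seen_ids yet."""
    fresh = []
    for text in texts:
        doc_id = make_doc_id(text)
        if doc_id not in seen_ids:
            seen_ids.add(doc_id)
            fresh.append(text)
    return fresh

seen = set()
first = drop_seen(["RAG grounds answers.", "HNSW is a graph index."], seen)
second = drop_seen(["RAG grounds answers.", "RAG grounds answers. "], seen)
print(len(first), len(second))  # → 2 1
```

Note the limitation visible in the second call: a single trailing space produces a different hash, so content-hash IDs only catch byte-identical text. Normalizing whitespace and case before hashing is a common mitigation.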

In [None]:
# TODO: Verify that adding the same documents again doesn't create duplicates
initial_count = store.get_stats()['count']

# Re-add the same documents
store.add_documents(sample_chunks[:3])
final_count = store.get_stats()['count']
from tests import checks
checks.check_lab_3_4_dedup(initial_count, final_count)


## Reflection Questions
1. **HNSW Parameters**: What would happen if you set `hnsw:M` to 2 instead of 16? What about 128?
2. **Score Interpretation**: A result comes back with score 0.72. Is this "good enough"? What factors determine the threshold?
3. **Production Gap**: What are 3 things missing from our RAGService that a production system would need?

*Your answers here:*
1. ...
2. ...
3. ...