# Module 2: RAG Systems - Production Implementation

## 🎯 What You'll Learn

Build production-ready Retrieval-Augmented Generation (RAG) systems with security.

**Time to Complete:** 3-4 hours

---

## 📚 Module Outline

### Part 1: Understanding RAG (30 min)
- What is RAG and why use it
- The three stages: Ingest → Retrieve → Generate
- When RAG helps vs. hurts

### Part 2: Chunking Strategies (45 min)
- Why chunking matters
- 3 strategies: Fixed, Sentence, Semantic
- Choosing the right strategy

### Part 3: Retrieval Methods (60 min)
- Vector search basics
- BM25 (keyword search)
- Hybrid search (combining both)
- Reranking for precision

### Part 4: Security & RBAC (45 min)
- Role-based access control
- Audit logging
- Multi-tenant isolation

### Part 5: Practice & Review (30 min)
- Build your own RAG
- Practice exercises

---

## 🧭 Learning Path

```
[Start] → Understand RAG → Learn Chunking → Master Retrieval
   ↓
Add Security → Practice → [Complete]
```

---

## Setup

**Time:** 5 minutes

In [None]:
%pip install -q chromadb sentence-transformers rank-bm25
%pip install -q pandas numpy matplotlib

import re
import hashlib
import numpy as np
from typing import List, Dict, Any, Tuple
from dataclasses import dataclass
from collections import defaultdict
from datetime import datetime

print('✅ Dependencies loaded!')
print('📖 Ready to learn RAG systems!')

---

# Part 1: Understanding RAG

**Time:** 30 min | **Difficulty:** Beginner

## 🎯 The Problem RAG Solves

**Without RAG:**
- LLM only knows what it learned during training
- Can't access your private documents
- Makes up answers (hallucinates)
- Knowledge cutoff date

**With RAG:**
- Retrieves relevant documents
- Grounds answers in YOUR data
- Reduces hallucinations
- Stays as current as the documents you index

## 🔄 How RAG Works (3 Stages)

### Stage 1: Ingest (One-time setup)
```
Your Documents → Split into Chunks → Convert to Embeddings → Store in Vector DB
```

### Stage 2: Retrieve (Per query)
```
User Question → Convert to Embedding → Find Similar Chunks → Return Top-K
```

### Stage 3: Generate (Per query)
```
Question + Retrieved Chunks → Send to LLM → Get Grounded Answer
```
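
The three stages can be sketched end to end with a toy in-memory store. This is a minimal illustration under stated assumptions, not a production design: the word-count `embed` function stands in for a real embedding model, `ToyStore` and `build_prompt` are hypothetical names, and the final LLM call is left out.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

def embed(text: str) -> Dict[str, int]:
    """Toy embedding: word counts (assumption; a real system uses a trained model)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class ToyStore:
    """Hypothetical stand-in for a vector database."""

    def __init__(self) -> None:
        self.rows: List[Tuple[str, Dict[str, int]]] = []

    def ingest(self, chunks: List[str]) -> None:
        # Stage 1: embed each chunk once and store it
        self.rows.extend((c, embed(c)) for c in chunks)

    def retrieve(self, question: str, top_k: int = 2) -> List[str]:
        # Stage 2: embed the question and return the most similar chunks
        q = embed(question)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[1]), reverse=True)
        return [text for text, _ in ranked[:top_k]]

def build_prompt(question: str, chunks: List[str]) -> str:
    # Stage 3 (input side): the prompt that would be sent to the LLM
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

store = ToyStore()
store.ingest([
    "Full-time employees get 15 days of annual leave.",
    "Medical leave requires a doctor's note after 3 consecutive days.",
    "The office is closed on public holidays.",
])
question = "How many days of annual leave do employees get?"
top_chunks = store.retrieve(question, top_k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

Only the retrieved chunk reaches the prompt, which is what keeps the answer grounded in your data.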

## ⚖️ When to Use RAG

**Use RAG when:**
- ✅ You have private/proprietary documents
- ✅ Knowledge changes frequently
- ✅ Need citations/sources
- ✅ Need access control (different users see different docs)

**Don't use RAG when:**
- ❌ Questions are about general knowledge (LLM already knows)
- ❌ The documents are small enough to fit in the prompt (just include them directly)
- ❌ Need real-time data (use API tools instead)

---

# Part 2: Chunking Strategies

**Time:** 45 min | **Difficulty:** Intermediate

## 🎯 Why Chunking Matters

**Problem:** Documents are often too long for the LLM's context window.

**Solution:** Split into smaller chunks that fit in context.

**Challenge:** How you split affects retrieval quality!

## 📏 Three Main Strategies

### 1. Fixed-Size Chunking
- **What:** Split every N characters
- **Pros:** Simple, fast
- **Cons:** Breaks sentences mid-way
- **Use when:** Code, highly structured data

### 2. Sentence-Aware Chunking
- **What:** Split at sentence boundaries
- **Pros:** Preserves meaning, readable
- **Cons:** Slightly more complex to implement
- **Use when:** Articles, policies, Q&A documents

### 3. Semantic Chunking
- **What:** Split when topic changes
- **Pros:** Best retrieval quality
- **Cons:** Slowest, most complex
- **Use when:** Long-form content, books
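
Fixed-size chunking is the simplest of the three to write down. The sketch below is one possible version, assuming sizes are measured in characters; `chunk_fixed` is a hypothetical helper name, not part of any library.

```python
from typing import List

def chunk_fixed(text: str, chunk_size: int = 400, overlap: int = 50) -> List[str]:
    """Split text every `chunk_size` characters, repeating `overlap` characters between neighbours."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

pieces = chunk_fixed("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=3)
print(pieces)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, and the cuts ignore word and sentence boundaries, which is the drawback listed above.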

---

## 💻 Code Example: Document Chunker

**What this does:**
- Implements sentence-aware chunking
- Adds overlap between chunks (important!)
- Tracks metadata

**Why overlap matters:** The last sentence of one chunk is repeated as the first sentence of the next, so a fact that straddles a boundary still appears intact in at least one chunk.

In [None]:
@dataclass
class Chunk:
    text: str
    chunk_id: str
    start_idx: int
    end_idx: int

class DocumentChunker:
    """Split documents into retrievable chunks."""
    
    def __init__(self, chunk_size=400, overlap=50):
        self.chunk_size = chunk_size  # Target size in characters
        # Sentence-aware chunking overlaps by whole sentences, so any value > 0
        # carries the last sentence into the next chunk; 0 disables overlap
        self.overlap = overlap
    
    def chunk_by_sentences(self, text: str, doc_id: str) -> List[Chunk]:
        """Chunk by sentence boundaries (recommended for most docs)."""
        
        # Split into sentences, dropping empty pieces left by trailing whitespace
        sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
        
        # Record where each sentence sits in the source text
        spans = []
        cursor = 0
        for sentence in sentences:
            start = text.find(sentence, cursor)
            cursor = start + len(sentence)
            spans.append((start, cursor))
        
        chunks = []
        
        def make_chunk(ids: List[int]) -> Chunk:
            """Build a Chunk from sentence indices; offsets refer to the source text."""
            return Chunk(
                text=' '.join(sentences[i] for i in ids),
                chunk_id=f'{doc_id}_chunk_{len(chunks)}',
                start_idx=spans[ids[0]][0],
                end_idx=spans[ids[-1]][1]
            )
        
        current = []  # Indices of the sentences in the chunk being built
        current_length = 0
        
        for i, sentence in enumerate(sentences):
            sentence_len = len(sentence)
            
            # If adding this sentence exceeds limit, create chunk
            if current_length + sentence_len > self.chunk_size and current:
                chunks.append(make_chunk(current))
                
                # Overlap: Keep last sentence for next chunk
                if self.overlap > 0 and len(current) > 1:
                    current = current[-1:]
                    current_length = len(sentences[current[0]])
                else:
                    current = []
                    current_length = 0
            
            current.append(i)
            current_length += sentence_len
        
        # Don't forget the last chunk!
        if current:
            chunks.append(make_chunk(current))
        
        return chunks

# 📝 Try it!
sample_doc = """Company Leave Policy

Full-time employees with 1-3 years tenure get 15 days annual leave.
Part-time employees get pro-rated leave.
Medical leave requires a doctor's note after 3 consecutive days.
"""

chunker = DocumentChunker(chunk_size=150, overlap=20)  # Small for demo; fits two sentences per chunk so the overlap shows
chunks = chunker.chunk_by_sentences(sample_doc, 'policy_doc')

print(f"📄 Original: {len(sample_doc)} characters")
print(f"✂️  Chunked into: {len(chunks)} chunks\n")

for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}: {len(chunk.text)} chars")
    print(f"  {chunk.text[:80]}...\n")

print("💡 Notice: Chunks have some overlap to preserve context!")

## ✅ Section Summary: Chunking

### What You Learned:
1. ✓ Documents must be split to fit context limits
2. ✓ Sentence-aware chunking preserves meaning
3. ✓ Overlap prevents information loss at boundaries
4. ✓ Chunk size affects retrieval quality

### Key Takeaways:
- 📌 **Recommended size: 400-600 characters** (or 100-150 tokens)
- 📌 **Always use overlap** (50-100 characters)
- 📌 **Sentence-aware > fixed-size** for most documents
- 📌 **Test different sizes** - optimal varies by content

### Common Mistakes:
- ❌ Chunks too large (reduces retrieval precision)
- ❌ Chunks too small (loses context)
- ❌ No overlap (information loss at boundaries)
- ❌ Breaking mid-sentence (confuses retrieval)

### Quick Guide:
- **Articles/Policies**: 400 chars, sentence-aware
- **Code**: 512 tokens, fixed-size
- **Books**: 800 chars, semantic (topic-based)
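
Semantic (topic-based) chunking from the guide above has no code in this module, so here is a rough sketch of the idea. It assumes word overlap (Jaccard similarity) between adjacent sentences as a cheap stand-in for embedding similarity; `chunk_by_topic` and its `threshold` value are hypothetical, and a real implementation would compare sentence embeddings instead.

```python
import re
from typing import List, Set

def _words(sentence: str) -> Set[str]:
    """Lowercased set of words in a sentence."""
    return set(re.findall(r"\w+", sentence.lower()))

def chunk_by_topic(text: str, threshold: float = 0.1) -> List[str]:
    """Start a new chunk when adjacent sentences share too few words.

    Word overlap is a stand-in for embedding similarity (assumption).
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        a, b = _words(prev), _words(cur)
        similarity = len(a & b) / len(a | b) if a | b else 0.0
        if similarity < threshold:
            groups.append([cur])    # topic shift: open a new chunk
        else:
            groups[-1].append(cur)  # same topic: extend the current chunk
    return [" ".join(group) for group in groups]

doc = (
    "Annual leave is 15 days. Unused annual leave expires in March. "
    "Laptops are replaced every three years. Damaged laptops are reported to IT."
)
topic_chunks = chunk_by_topic(doc)
for chunk in topic_chunks:
    print(chunk)
```

The leave sentences and the laptop sentences end up in separate chunks because the second and third sentences share no words.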

---

# Part 3: Retrieval Methods

**Time:** 60 min | **Difficulty:** Intermediate

## 🎯 Goal: Find the Most Relevant Chunks

## 🔍 Three Approaches

### 1. Vector Search (Semantic)
**How it works:**
- Convert text to embeddings (vectors)
- Find chunks with similar vectors
- Uses cosine similarity

**Good for:** Understanding meaning, paraphrases
**Example:** Query "PTO" finds "vacation days"
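
Cosine similarity itself is a short calculation. The sketch below uses hand-written 3-dimensional vectors as placeholders for model embeddings, so the numbers are illustrative only; `cosine_similarity` and `vector_top_k` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def vector_top_k(query_vec: Sequence[float],
                 chunk_vecs: List[Sequence[float]],
                 top_k: int = 2) -> List[Tuple[int, float]]:
    """Return (chunk index, similarity) pairs, most similar first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Illustrative vectors; a real system gets these from an embedding model
chunk_vecs = [
    [0.9, 0.1, 0.0],  # "vacation days"
    [0.0, 0.2, 0.9],  # "expense reports"
    [0.7, 0.6, 0.1],  # "paid time off policy"
]
query_vec = [1.0, 0.2, 0.0]  # "PTO"

vec_hits = vector_top_k(query_vec, chunk_vecs)
for idx, sim in vec_hits:
    print(idx, round(sim, 3))
```

The query shares no words with "vacation days", yet that chunk ranks first because its vector points in nearly the same direction.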

### 2. BM25 (Keyword)
**How it works:**
- Statistical keyword matching
- Like traditional search engines

**Good for:** Exact terms, IDs, names
**Example:** Query "Section 3.2" finds exact section
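
To make "statistical keyword matching" concrete, here is a from-scratch sketch of Okapi BM25 scoring. It assumes the commonly used parameters `k1=1.5` and `b=0.75` and an IDF variant with `+1` inside the logarithm; libraries differ in these details, so the scores may not match `rank_bm25` exactly. `tokenize` and `bm25_scores` are hypothetical names.

```python
import math
import re
from collections import Counter
from typing import List

def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())

def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: how many documents contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = tokenize(query)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms get a higher weight; "+ 1" keeps the IDF non-negative
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Longer-than-average documents are penalised through b
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    "Section 3.2 covers medical leave.",
    "Section 4.1 covers travel expenses.",
    "Vacation days are listed in the handbook.",
]
kw_scores = bm25_scores("Section 3.2", docs)
print([round(s, 3) for s in kw_scores])
```

The first document scores highest because it matches all three query tokens, the second matches only "section", and the third matches nothing and scores zero.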

### 3. Hybrid (Best!)
**How it works:**
- Combine vector + BM25 scores
- Get best of both worlds

**Good for:** Production systems
**Improvement:** Gains of +20-30% over a single method are often cited, but measure on your own data
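
One common way to combine the two score lists is weighted fusion after rescaling. The sketch below assumes min-max normalization and an `alpha` that weights the vector score (with `1 - alpha` on BM25); both choices are assumptions rather than the only option, the raw scores are made up for illustration, and `min_max` and `fuse` are hypothetical names.

```python
from typing import List, Tuple

def min_max(scores: List[float]) -> List[float]:
    """Rescale scores to [0, 1] so BM25 and cosine values are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(bm25: List[float], vector: List[float],
         alpha: float = 0.5, top_k: int = 3) -> List[Tuple[int, float]]:
    """Weighted sum per document: alpha * vector + (1 - alpha) * BM25."""
    combined = [
        alpha * v + (1 - alpha) * k
        for k, v in zip(min_max(bm25), min_max(vector))
    ]
    ranked = sorted(enumerate(combined), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]

bm25_raw = [2.1, 0.0, 0.4, 1.2]        # illustrative keyword scores, one per document
vector_raw = [0.30, 0.85, 0.80, 0.60]  # illustrative cosine similarities, one per document

fused = fuse(bm25_raw, vector_raw, alpha=0.5, top_k=2)
for idx, score in fused:
    print(idx, round(score, 3))
```

Document 3 wins even though it tops neither list, because it scores reasonably well on both signals.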

---

## 💻 Code Example: Hybrid Search

**What this shows:**
- How to combine BM25 and vector search
- Weighted fusion of scores
- Comparison of methods

**Study tip:** Run with the sample data, then try your own documents.

In [None]:
# Simple hybrid retrieval demo
from rank_bm25 import BM25Okapi

class SimpleHybridRetriever:
    """Combine keyword and semantic search."""
    
    def __init__(self):
        self.documents = []
        self.bm25 = None
    
    @staticmethod
    def _tokenize(text: str) -> List[str]:
        """Lowercase and keep word characters only, so 'learning.' matches 'learning'."""
        return re.findall(r'\w+', text.lower())
    
    def index(self, documents: List[str]):
        """Index documents."""
        self.documents = documents
        
        # Build BM25 index (keyword search)
        tokenized = [self._tokenize(doc) for doc in documents]
        self.bm25 = BM25Okapi(tokenized)
        
        print(f"✅ Indexed {len(documents)} documents")
    
    def search_bm25(self, query: str, top_k=3) -> List[Tuple[int, float]]:
        """Keyword search."""
        tokens = self._tokenize(query)
        scores = self.bm25.get_scores(tokens)
        
        # Get top-k (highest score first)
        top_indices = np.argsort(scores)[::-1][:top_k]
        return [(int(idx), float(scores[idx])) for idx in top_indices]
    
    def search_hybrid(self, query: str, top_k=3) -> List[Tuple[int, float]]:
        """Hybrid search (for demo, just using BM25)."""
        # In production: combine with vector search
        return self.search_bm25(query, top_k)

# 📝 Try it!
test_docs = [
    'Python is a programming language used for data science and machine learning.',
    'Machine learning models require large datasets for training.',
    'Natural language processing helps computers understand human language.',
    'Vector databases store embeddings for similarity search.',
    'RAG combines retrieval with generation for better accuracy.',
]

print("🔍 HYBRID SEARCH DEMONSTRATION\n")

retriever = SimpleHybridRetriever()
retriever.index(test_docs)

# Test queries
queries = [
    'Python machine learning',
    'NLP language understanding',
    'vector similarity',
]

for query in queries:
    print(f"\nQuery: '{query}'")
    print("-" * 70)
    
    results = retriever.search_hybrid(query, top_k=2)
    
    for rank, (idx, score) in enumerate(results, 1):
        print(f"{rank}. [Score: {score:.2f}] {test_docs[idx][:60]}...")

print("\n\n💡 Tip: This demo ranks with BM25 only; a full hybrid retriever adds semantic (vector) scores on top of keyword matching!")

## ✅ Section Summary: Retrieval

### What You Learned:
1. ✓ Vector search finds semantic matches
2. ✓ BM25 finds keyword matches
3. ✓ Hybrid combines both for best results
4. ✓ Top-k parameter controls how many chunks to retrieve

### Key Takeaways:
- 📌 **Vector search**: Good for paraphrases ("PTO" → "vacation")
- 📌 **BM25**: Good for exact terms ("Section 3.2")
- 📌 **Hybrid**: Usually the best default for production (gains of +20-30% are often cited, but measure on your own data)
- 📌 **Reranking**: Add cross-encoder for even better results

### Common Mistakes:
- ❌ Using only vector search (misses exact term matches)
- ❌ Using only keyword search (misses semantic matches)
- ❌ top_k too high (retrieves irrelevant chunks)
- ❌ top_k too low (misses relevant information)

### Optimal Settings:
- **top_k**: Start with 5, tune based on evaluation
- **Hybrid weight (alpha)**: 0.5 (equal weight BM25 and vector)
- **Similarity threshold**: 0.7 (filter low-confidence matches)
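
The `top_k` and similarity-threshold settings can be applied together as a final filter. This is a small sketch with made-up scores; `select_chunks` is a hypothetical helper, and the threshold only makes sense for scores on a known scale such as cosine similarity.

```python
from typing import List, Tuple

def select_chunks(scored: List[Tuple[str, float]],
                  top_k: int = 5,
                  min_similarity: float = 0.7) -> List[Tuple[str, float]]:
    """Drop low-confidence matches, then keep the top_k best of what remains."""
    kept = [(chunk_id, s) for chunk_id, s in scored if s >= min_similarity]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]

scored = [("c1", 0.91), ("c2", 0.42), ("c3", 0.78), ("c4", 0.69), ("c5", 0.83)]

selected = select_chunks(scored, top_k=2)
print(selected)  # [('c1', 0.91), ('c5', 0.83)]

all_confident = select_chunks(scored, top_k=5)
print(all_confident)  # [('c1', 0.91), ('c5', 0.83), ('c3', 0.78)]
```

With `top_k=5` only three chunks come back, because the threshold removes the two weak matches before the cut-off is applied.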

---

# Part 4: Security - RBAC and Audit Logging

**Time:** 45 min | **Difficulty:** Advanced

## 🎯 The Security Challenge

**Problem:** Different users should see different documents.

**Example:**
- Public users: Only public docs
- Employees: Public + internal docs
- HR: All docs including compensation

## 🔐 Solution: Role-Based Access Control (RBAC)

**Key concept:**
1. Tag each document with allowed roles
2. Filter retrieval results by user's role
3. Log all accesses for audit

---

## 💻 Code Example: Secure RAG with RBAC

**What this shows:**
- Documents tagged with allowed roles
- Retrieval filtered by user role
- Audit log for compliance

**Study progression:**
1. See how documents are ingested (`ingest`)
2. Understand retrieval filtering (`retrieve`)
3. Notice audit logging (`_log_access`)

In [None]:
from datetime import datetime, timezone

class SecureRAG:
    """RAG with role-based access control."""
    
    def __init__(self):
        self.documents = []  # In production: vector DB
        self.audit_log = []
    
    def ingest(self, text: str, doc_id: str, allowed_roles: set):
        """Add document with access control."""
        self.documents.append({
            'doc_id': doc_id,
            'text': text,
            'allowed_roles': allowed_roles
        })
        print(f"✅ Ingested: {doc_id} (allowed: {allowed_roles})")
    
    def retrieve(self, query: str, user_role: str, top_k=2) -> List[dict]:
        """Retrieve documents user can access."""
        
        # Filter by role before any ranking, so restricted docs never enter the results
        accessible = [
            doc for doc in self.documents
            if user_role in doc['allowed_roles'] or 'public' in doc['allowed_roles']
        ]
        
        # In production: do actual similarity search over `accessible`
        # For demo: return all accessible (limited to top_k)
        results = accessible[:top_k]
        
        # Log access with the number of documents actually returned
        self._log_access(query, user_role, len(results))
        return results
    
    def _log_access(self, query: str, role: str, results: int):
        """Audit logging for compliance."""
        self.audit_log.append({
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
            'timestamp': datetime.now(timezone.utc).isoformat(),
            'query': query[:50],  # Truncated to keep log entries short
            'user_role': role,
            'results_count': results
        })

# 🧪 Test RBAC
print("🔒 SECURE RAG DEMONSTRATION\n")

rag = SecureRAG()

# Ingest with different access levels
rag.ingest('Company holidays: Jan 1, Jul 4, Dec 25', 'holidays', {'public', 'employee', 'hr'})
rag.ingest('Leave policy: 15 days after 1 year', 'leave', {'employee', 'hr'})
rag.ingest('L4 Engineer salary: $150K-$180K', 'compensation', {'hr'})  # HR only!

print("\n🔍 Testing access control:\n")

# Test different roles
for query, role in [('holidays', 'public'), ('salary', 'employee'), ('salary', 'hr')]:
    results = rag.retrieve(query, role, top_k=5)
    print(f"Query: '{query}' | Role: {role:<10} | Results: {len(results)}")
    for r in results:
        print(f"  → {r['doc_id']}")
    print()

print("💡 Notice: 'employee' can't see salary, but 'hr' can!")

## ✅ Section Summary: Security

### What You Learned:
1. ✓ Different users need different document access
2. ✓ RBAC filters results by user role
3. ✓ Audit logs track all accesses
4. ✓ Security prevents data leakage

### Key Takeaways:
- 📌 **Tag at ingestion** - mark allowed roles when indexing
- 📌 **Filter at retrieval** - only return allowed documents
- 📌 **Log everything** - audit trail for compliance
- 📌 **Default to most restrictive** - require explicit permission

### Common Mistakes:
- ❌ Filtering in application code after retrieval (too late: restricted chunks have already been retrieved and can crowd allowed ones out of the top-k)
- ❌ No audit logging (can't prove compliance)
- ❌ Using user input for role (security bypass!)
- ❌ Not testing cross-tenant access
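
Two of these mistakes, trusting user input for the role and skipping cross-tenant tests, can be illustrated with a small sketch. Everything here is hypothetical: the `SESSIONS` table stands in for whatever verified identity source your system uses (for example a validated token), and `Doc` and `visible_docs` are illustrative names.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List

@dataclass(frozen=True)
class Doc:
    doc_id: str
    tenant_id: str
    # Default to most restrictive: an untagged document is visible to nobody
    allowed_roles: FrozenSet[str] = field(default_factory=frozenset)

# Hypothetical server-side session table: token -> verified identity
SESSIONS: Dict[str, Dict[str, str]] = {
    "tok-alice": {"tenant_id": "acme", "role": "hr"},
    "tok-bob": {"tenant_id": "globex", "role": "employee"},
}

def visible_docs(session_token: str, docs: List[Doc]) -> List[Doc]:
    """Resolve tenant and role from the session, never from request parameters."""
    identity = SESSIONS.get(session_token)
    if identity is None:
        return []  # unknown session: deny by default
    return [
        d for d in docs
        if d.tenant_id == identity["tenant_id"] and identity["role"] in d.allowed_roles
    ]

docs = [
    Doc("acme-comp", "acme", frozenset({"hr"})),
    Doc("acme-leave", "acme", frozenset({"employee", "hr"})),
    Doc("globex-leave", "globex", frozenset({"employee", "hr"})),
    Doc("acme-untagged", "acme"),
]

print([d.doc_id for d in visible_docs("tok-alice", docs)])   # ['acme-comp', 'acme-leave']
print([d.doc_id for d in visible_docs("tok-bob", docs)])     # ['globex-leave']
print([d.doc_id for d in visible_docs("tok-forged", docs)])  # []
```

The three calls double as cross-tenant tests: the `acme` HR user never sees `globex` documents, and a forged token sees nothing.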

### Security Checklist:
- [ ] Documents tagged with allowed roles
- [ ] Retrieval filters by role
- [ ] Audit log retention (1+ years)
- [ ] Regular access reviews
- [ ] Test unauthorized access attempts

---

# 📝 Module 2 Review & Practice

## 🎓 Concepts Mastered

### RAG Fundamentals
- ✅ Why RAG improves accuracy
- ✅ Three stages: Ingest → Retrieve → Generate
- ✅ When to use RAG

### Chunking
- ✅ Why chunking matters
- ✅ Sentence-aware strategy
- ✅ Importance of overlap

### Retrieval
- ✅ Vector vs. keyword search
- ✅ Hybrid search benefits
- ✅ Top-k selection

### Security
- ✅ RBAC implementation
- ✅ Audit logging
- ✅ Multi-tenant isolation

---

## 🎯 Practice Exercises

### Exercise 1: Chunking Strategy (Medium)
You have a 50-page PDF policy document. Which chunking strategy and why?

<details>
<summary>Show answer</summary>

**Recommendation:** Sentence-aware with 400-char chunks, 50-char overlap

**Why:**
- Policy documents have clear sentences
- Need to preserve complete rules (sentence-aware helps)
- 400 chars ≈ 3-5 sentences (good granularity)
- Overlap prevents losing info at boundaries

**Alternative:** Semantic chunking if document has clear sections
</details>

### Exercise 2: Precision vs Recall (Hard)
Your RAG has:
- Precision: 40% (many irrelevant chunks retrieved)
- Recall: 90% (finds almost all relevant chunks)

Users complain about irrelevant results. What do you do?

<details>
<summary>Show answer</summary>

**Root cause:** top_k too high + no reranking

**Solutions (in rough order of expected impact; the gains are rough estimates that vary by dataset):**
1. **Add reranking** (roughly +30% precision)
   - Use cross-encoder to rerank top results
   - Often the most impactful single improvement

2. **Add similarity threshold** (roughly +20% precision)
   - Only return chunks above 0.7 similarity
   - Filters low-confidence matches

3. **Reduce top_k** (roughly +15% precision)
   - If using top_k=10, try top_k=5
   - Fewer results usually means higher precision, at some cost to recall

4. **Better chunking** (roughly +10% both metrics)
   - Semantic chunking preserves context better

**Expected:** 40% → 80%+ precision is a reasonable target; confirm it on your own evaluation set
</details>
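
Precision and recall for a single query can be computed in a few lines, which makes the trade-off in this exercise easy to see. The chunk IDs and relevance labels below are invented for illustration, and `precision_recall` is a hypothetical helper.

```python
from typing import List, Set, Tuple

def precision_recall(retrieved: List[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision = hits / retrieved; recall = hits / relevant."""
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"c1", "c2", "c3", "c4"}  # ground-truth relevant chunks for one query
ranked = ["c1", "x1", "c2", "x2", "c3", "x3", "x4", "c4", "x5", "x6"]  # retriever output, best first

p10, r10 = precision_recall(ranked[:10], relevant)
p5, r5 = precision_recall(ranked[:5], relevant)
print(f"top_k=10: precision={p10:.2f}, recall={r10:.2f}")  # precision=0.40, recall=1.00
print(f"top_k=5:  precision={p5:.2f}, recall={r5:.2f}")    # precision=0.60, recall=0.75
```

Halving `top_k` raises precision here but drops one relevant chunk, which is why reranking (reordering before cutting) tends to help more than simply retrieving less.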

---

## 🎓 Module 2 Complete!

**Time invested:** 3-4 hours

### 📚 Next Module:
**Module 3: LangChain** - Learn to build chains, agents, and evaluation frameworks

---