# Demo #7: Corrective RAG (Self-Reflective Retrieval)

## 🎯 Learning Objectives

In this demonstration, you will learn:

1. **Retrieval Quality Evaluation**: How to assess the relevance of retrieved documents using LLM-based grading
2. **Self-Corrective Mechanisms**: How to implement conditional workflows based on retrieval assessment
3. **Fallback Strategies**: How to trigger alternative paths (re-retrieval, web search, direct generation)
4. **Adaptive Behavior**: How to build RAG systems that "reflect" on their own performance

## 📚 Theoretical Background

### The Problem with Static RAG Pipelines

Traditional RAG systems follow a fixed pipeline: **Retrieve → Generate**. They assume that:
- The retrieved documents are always relevant
- The knowledge base contains the answer
- The retrieval method successfully found the right content

However, these assumptions often fail in practice, leading to common failure modes:

#### Common RAG Failure Points (from the curriculum)

- **FP1: Missing Content (Corpus Failure)**: The knowledge base doesn't contain the answer → System may hallucinate
- **FP2: Missed Top Ranked (Retrieval Failure)**: The correct document exists but isn't retrieved → Poor answers despite good data
- **FP3: Not in Context (Post-Retrieval Failure)**: Document was retrieved but filtered out → Information loss
- **FP4: Not Extracted (Generation Failure)**: Answer is in context but LLM fails to extract it → Wasted retrieval effort

### The Corrective RAG Solution

Corrective RAG (CRAG) introduces a **self-reflective layer** between retrieval and generation:

```
Traditional RAG:  Query → Retrieve → Generate

Corrective RAG:   Query → Retrieve → EVALUATE → [Route Decision]
                                        ↓
                        ┌───────────────┼───────────────┐
                        ↓               ↓               ↓
                   RELEVANT      PARTIALLY         NOT RELEVANT
                        ↓           RELEVANT            ↓
                   Generate    Re-retrieve +       Web Search
                               Transform            Fallback
```

### Key Components

1. **Relevance Grader**: An LLM-based critic that scores retrieval quality (1-5 scale)
2. **Decision Logic**: Conditional routing based on relevance scores
3. **Corrective Actions**:
   - **High Relevance (4-5)**: Proceed with standard generation
   - **Medium Relevance (2-3)**: Trigger query transformation and re-retrieval
   - **Low Relevance (<2)**: Fall back to web search or direct LLM knowledge
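The score bands above can be sketched as a small routing function. This is a minimal illustration rather than the notebook's implementation: `Thresholds` and `route` are hypothetical names, and the cut-offs simply mirror the bands listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical container for the score cut-offs (assumed values)."""
    relevant: float = 4.0  # at or above: standard generation
    partial: float = 2.0   # at or above (but below `relevant`): re-retrieval


def route(score: float, thresholds: Thresholds = Thresholds()) -> str:
    """Map a 1-5 relevance score to one of the three corrective actions."""
    if score >= thresholds.relevant:
        return "generate"
    if score >= thresholds.partial:
        return "re-retrieve"
    return "fallback"


for s in (5, 3, 1):
    print(s, "->", route(s))
```

Keeping the cut-offs in one object makes it easier to tune them per domain without touching the routing logic.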

### Why This Matters

Corrective RAG transforms the system from a **blind executor** into a **reflective agent** that:
- **Detects its own failures** before producing bad answers
- **Takes corrective action** to improve retrieval quality
- **Falls back gracefully** when knowledge is missing
- **Reduces hallucinations** by recognizing insufficient context

---

## 🔧 Implementation

We'll implement a complete Corrective RAG system with:
- A deliberately limited corpus (to trigger failure scenarios)
- An LLM-based relevance grader
- Three retrieval paths (standard, corrective, fallback)
- Test queries designed to exercise each path

In [5]:
# Install required packages (run once)
# !pip install llama-index-core llama-index-llms-azure-openai llama-index-embeddings-azure-openai python-dotenv

Collecting llama-index-core
  Downloading llama_index_core-0.14.5-py3-none-any.whl.metadata (2.5 kB)
Collecting llama-index-llms-azure-openai
  Downloading llama_index_llms_azure_openai-0.4.2-py3-none-any.whl.metadata (3.7 kB)
Collecting llama-index-embeddings-azure-openai
  Downloading llama_index_embeddings_azure_openai-0.4.1-py3-none-any.whl.metadata (503 bytes)
Collecting aiosqlite (from llama-index-core)
  Using cached aiosqlite-0.21.0-py3-none-any.whl.metadata (4.3 kB)
Collecting banks<3,>=2.2.0 (from llama-index-core)
  Downloading ban

In [6]:
# Import required libraries
import os
from dotenv import load_dotenv
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.schema import QueryBundle
from enum import Enum
from typing import List, Dict, Any, Tuple
import warnings

warnings.filterwarnings('ignore')

# Load environment variables
load_dotenv()

print("✅ Libraries imported successfully")

✅ Libraries imported successfully


## Step 1: Environment Setup

Configure Azure OpenAI for both the LLM (GPT-4) and the embedding model.

In [9]:
# Configure Azure OpenAI LLM
llm = AzureOpenAI(
    model="gpt-4",
    deployment_name=os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_version="2024-02-15-preview",
    temperature=0.1  # Low temperature for consistent evaluation
)

# Configure Azure OpenAI Embeddings
embed_model = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name=os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_version="2024-02-15-preview",
)

# Set global defaults
Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 512
Settings.chunk_overlap = 50

print("✅ Azure OpenAI configured successfully")
print(f"   LLM: {llm.model}")
print(f"   Embeddings: {embed_model.model_name}")

✅ Azure OpenAI configured successfully
   LLM: gpt-4
   Embeddings: text-embedding-ada-002


## Step 2: Deliberately Limited Data Preparation

We'll load only 3 documents from the `tech_docs` directory to create a limited knowledge base. This will enable us to test scenarios where:
- **Query is well-covered**: Documents contain comprehensive information
- **Query is partially covered**: Documents have some relevant info but gaps
- **Query is not covered**: Information is completely missing (FP1: Missing Content)

In [10]:
# Load a limited set of documents (3 docs only)
documents = SimpleDirectoryReader(
    input_files=[
        "data/tech_docs/transformer_architecture.md",
        "data/tech_docs/bert_model.md",
        "data/tech_docs/gpt4_model.md"
    ]
).load_data()

print(f"✅ Loaded {len(documents)} documents")
print("\n📄 Document topics:")
for i, doc in enumerate(documents, 1):
    print(f"   {i}. {doc.metadata.get('file_name', 'Unknown')} ({len(doc.text)} chars)")

print("\n⚠️  Note: Limited corpus intentionally excludes topics like:")
print("   - Docker, REST APIs, embeddings (in tech_docs but not loaded)")
print("   - Finance, ML algorithms (in other directories)")
print("   - Recent events (not in any document)")

✅ Loaded 3 documents

📄 Document topics:
   1. transformer_architecture.md (4247 chars)
   2. bert_model.md (2358 chars)
   3. gpt4_model.md (2814 chars)

⚠️  Note: Limited corpus intentionally excludes topics like:
   - Docker, REST APIs, embeddings (in tech_docs but not loaded)
   - Finance, ML algorithms (in other directories)
   - Recent events (not in any document)


## Step 3: Build Baseline Vector Index

Create a standard vector index for retrieval.

In [11]:
# Build vector index
index = VectorStoreIndex.from_documents(documents)
print("✅ Vector index built successfully")

# Create retriever with moderate top_k
retriever = index.as_retriever(similarity_top_k=3)
print("   Configured retriever with top_k=3")

✅ Vector index built successfully
   Configured retriever with top_k=3


## Step 4: Implement Relevance Grader

This is the **core component** of Corrective RAG. The grader:
1. Takes a query and retrieved documents
2. Uses an LLM to assess relevance on a 1-5 scale
3. Returns both a score and a decision (RELEVANT/PARTIAL/NOT_RELEVANT)

In [12]:
class RelevanceDecision(Enum):
    """Enum for relevance decision outcomes"""
    RELEVANT = "relevant"  # Score 4-5: Proceed with generation
    PARTIAL = "partial"    # Score 2-3: Re-retrieve with query transformation
    NOT_RELEVANT = "not_relevant"  # Score <2: Fallback to web search or direct LLM


class RelevanceGrader:
    """LLM-based relevance grader for assessing retrieval quality"""
    
    def __init__(self, llm):
        self.llm = llm
        
    def assess_relevance(
        self, 
        query: str, 
        retrieved_texts: List[str]
    ) -> Tuple[float, RelevanceDecision, str]:
        """
        Assess the relevance of retrieved documents to the query.
        
        Args:
            query: The user's query
            retrieved_texts: List of retrieved document texts
            
        Returns:
            Tuple of (relevance_score, decision, explanation)
        """
        # Combine retrieved texts; only the first 500 characters of each
        # document are shown to the grader
        combined_context = "\n\n---\n\n".join(
            [f"Document {i+1}:\n{text[:500]}..." 
             for i, text in enumerate(retrieved_texts)]
        )
        
        # Construct grading prompt
        grading_prompt = f"""You are a relevance grading expert. Your task is to assess how relevant a set of retrieved documents are to answering a user's query.

USER QUERY: {query}

RETRIEVED DOCUMENTS:
{combined_context}

GRADING INSTRUCTIONS:
Assess the relevance of these documents on a 1-5 scale:
- 5: Highly relevant - Documents directly and comprehensively address the query
- 4: Relevant - Documents contain substantial information to answer the query
- 3: Partially relevant - Documents have some related information but with gaps
- 2: Minimally relevant - Documents tangentially related but insufficient
- 1: Not relevant - Documents do not contain information to answer the query

Respond ONLY in this exact format:
SCORE: [number 1-5]
REASONING: [One sentence explaining your score]
"""
        
        # Get LLM assessment
        response = self.llm.complete(grading_prompt)
        response_text = response.text.strip()
        
        # Parse response (strip each line so indented labels still match)
        try:
            lines = [line.strip() for line in response_text.split('\n')]
            score_line = [line for line in lines if line.startswith('SCORE:')][0]
            score = float(score_line.split('SCORE:', 1)[1].strip())
            
            reasoning_line = [line for line in lines if line.startswith('REASONING:')][0]
            reasoning = reasoning_line.split('REASONING:', 1)[1].strip()
        except Exception as e:
            print(f"⚠️  Failed to parse LLM response: {e}")
            print(f"   Response: {response_text}")
            score = 3.0  # Default to partial
            reasoning = "Failed to parse grading response"
        
        # Make decision based on score
        if score >= 4.0:
            decision = RelevanceDecision.RELEVANT
        elif score >= 2.0:
            decision = RelevanceDecision.PARTIAL
        else:
            decision = RelevanceDecision.NOT_RELEVANT
        
        return score, decision, reasoning


# Initialize grader
grader = RelevanceGrader(llm)
print("✅ Relevance grader initialized")

✅ Relevance grader initialized
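
The grader's parsing assumes the model follows the `SCORE:` / `REASONING:` format exactly. A more tolerant parser could be sketched as below. `parse_grade` is a hypothetical helper, not part of the notebook, and the variations it accepts (mixed case, Markdown bold, bracketed numbers, a trailing `/5`) are assumptions about how a model might deviate from the format.

```python
import re
from typing import Optional, Tuple

# Accept "SCORE: 4", "**Score:** 3/5", "SCORE: [4]" and similar variants
_SCORE_RE = re.compile(r"SCORE:[\s*\[]*(\d+(?:\.\d+)?)", re.IGNORECASE)
_REASON_RE = re.compile(r"REASONING:\**\s*(.+)", re.IGNORECASE)


def parse_grade(response_text: str) -> Tuple[Optional[float], str]:
    """Extract (score, reasoning) from a grader reply.

    The score is None when no number follows a SCORE: label, so the caller
    can choose its own default instead of this function guessing one.
    """
    score_match = _SCORE_RE.search(response_text)
    reason_match = _REASON_RE.search(response_text)

    score = None
    if score_match:
        # Clamp to the 1-5 scale in case the model returns an out-of-range value
        score = min(5.0, max(1.0, float(score_match.group(1))))

    reasoning = reason_match.group(1).strip() if reason_match else ""
    return score, reasoning


print(parse_grade("SCORE: 4\nREASONING: Documents cover the topic."))
```

Returning `None` instead of a default score keeps the policy decision (for example, defaulting to the partial path) in the grader rather than in the parser.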


## Step 5: Implement Corrective RAG System

Now we'll build the complete Corrective RAG pipeline with three execution paths:

1. **Path A (Relevant)**: Standard generation with retrieved context
2. **Path B (Partial)**: Re-retrieval with HyDE query transformation
3. **Path C (Not Relevant)**: Fallback to direct LLM generation (simulating web search)

In [13]:
class CorrectiveRAG:
    """Corrective RAG system with self-reflective retrieval"""
    
    def __init__(self, index, llm, grader):
        self.index = index
        self.llm = llm
        self.grader = grader
        self.retriever = index.as_retriever(similarity_top_k=3)
        self.hyde_transform = HyDEQueryTransform(llm=llm, include_original=True)
        
    def query(
        self, 
        query_text: str, 
        verbose: bool = True
    ) -> Dict[str, Any]:
        """
        Execute corrective RAG query with adaptive routing.
        
        Returns dict with:
            - answer: Final generated answer
            - path: Execution path taken
            - relevance_score: Initial relevance assessment
            - decision: Routing decision
            - metadata: Additional execution metadata
        """
        if verbose:
            print(f"\n{'='*80}")
            print(f"🔍 QUERY: {query_text}")
            print(f"{'='*80}\n")
        
        # Step 1: Initial Retrieval
        if verbose:
            print("📥 STEP 1: Initial Retrieval")
        
        retrieved_nodes = self.retriever.retrieve(query_text)
        retrieved_texts = [node.node.text for node in retrieved_nodes]
        
        if verbose:
            print(f"   Retrieved {len(retrieved_texts)} documents")
            for i, text in enumerate(retrieved_texts, 1):
                print(f"   Doc {i}: {text[:100]}...")
        
        # Step 2: Relevance Assessment
        if verbose:
            print("\n🧠 STEP 2: Relevance Assessment")
        
        score, decision, reasoning = self.grader.assess_relevance(
            query_text, 
            retrieved_texts
        )
        
        if verbose:
            print(f"   Relevance Score: {score}/5")
            print(f"   Decision: {decision.value.upper()}")
            print(f"   Reasoning: {reasoning}")
        
        # Step 3: Conditional Routing
        if decision == RelevanceDecision.RELEVANT:
            return self._path_relevant(
                query_text, 
                retrieved_nodes, 
                score, 
                reasoning, 
                verbose
            )
        elif decision == RelevanceDecision.PARTIAL:
            return self._path_partial(
                query_text, 
                score, 
                reasoning, 
                verbose
            )
        else:  # NOT_RELEVANT
            return self._path_not_relevant(
                query_text, 
                score, 
                reasoning, 
                verbose
            )
    
    def _path_relevant(
        self, 
        query_text: str, 
        retrieved_nodes, 
        score: float, 
        reasoning: str, 
        verbose: bool
    ) -> Dict[str, Any]:
        """Path A: Standard generation with relevant context"""
        if verbose:
            print("\n✅ STEP 3: Path A - Standard Generation")
            print("   Confidence: HIGH - Proceeding with retrieved context")
        
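        # Use query engine for generation.
        # Note: this runs retrieval again with the same top_k; the already-graded
        # `retrieved_nodes` argument is not reused here.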
        query_engine = self.index.as_query_engine(similarity_top_k=3)
        response = query_engine.query(query_text)
        
        return {
            "answer": response.response,
            "path": "A: RELEVANT",
            "relevance_score": score,
            "decision": "relevant",
            "reasoning": reasoning,
            "source_nodes": response.source_nodes,
            "corrective_action": None
        }
    
    def _path_partial(
        self, 
        query_text: str, 
        score: float, 
        reasoning: str, 
        verbose: bool
    ) -> Dict[str, Any]:
        """Path B: Re-retrieval with HyDE query transformation"""
        if verbose:
            print("\n⚠️  STEP 3: Path B - Corrective Re-Retrieval")
            print("   Confidence: MEDIUM - Attempting query transformation + re-retrieval")
        
        # Transform query using HyDE
        query_bundle = QueryBundle(query_str=query_text)
        transformed_query = self.hyde_transform(query_bundle)
        
        if verbose:
            print(f"\n   🔄 HyDE Transformation:")
            print(f"   Original: {query_text}")
            if hasattr(transformed_query, 'embedding_strs'):
                print(f"   Hypothetical Doc: {transformed_query.embedding_strs[0][:200]}...")
        
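        # Wrap the base engine so the HyDE transform is applied at query time.
        # (query_transform is not a documented option of as_query_engine(), so
        # the transform is attached explicitly via TransformQueryEngine.)
        from llama_index.core.query_engine import TransformQueryEngine

        base_query_engine = self.index.as_query_engine(similarity_top_k=3)
        hyde_query_engine = TransformQueryEngine(
            base_query_engine,
            query_transform=self.hyde_transform
        )
        response = hyde_query_engine.query(query_text)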
        
        if verbose:
            print(f"\n   ✓ Re-retrieved with transformed query")
        
        return {
            "answer": response.response,
            "path": "B: PARTIAL (Re-retrieval)",
            "relevance_score": score,
            "decision": "partial",
            "reasoning": reasoning,
            "source_nodes": response.source_nodes,
            "corrective_action": "HyDE query transformation"
        }
    
    def _path_not_relevant(
        self, 
        query_text: str, 
        score: float, 
        reasoning: str, 
        verbose: bool
    ) -> Dict[str, Any]:
        """Path C: Fallback to direct LLM generation (simulating web search)"""
        if verbose:
            print("\n❌ STEP 3: Path C - Fallback to Direct LLM")
            print("   Confidence: LOW - Knowledge base lacks relevant information")
            print("   Action: Using LLM's parametric knowledge (simulating web search)")
        
        # Generate directly from LLM without RAG context
        fallback_prompt = f"""The knowledge base does not contain relevant information for this query.
Using your parametric knowledge, provide a helpful answer. If you don't know, say so.

Query: {query_text}

Answer:"""
        
        response = self.llm.complete(fallback_prompt)
        
        if verbose:
            print("   ⚠️  Note: This is a graceful fallback. In production, this would:")
            print("      - Trigger web search (Bing, Google)")
            print("      - Query external APIs")
            print("      - Or return 'Information not available' with suggestions")
        
        return {
            "answer": response.text,
            "path": "C: NOT RELEVANT (Fallback)",
            "relevance_score": score,
            "decision": "not_relevant",
            "reasoning": reasoning,
            "source_nodes": [],
            "corrective_action": "Direct LLM generation (web search simulation)"
        }


# Initialize Corrective RAG system
crag_system = CorrectiveRAG(index, llm, grader)
print("✅ Corrective RAG system initialized")

✅ Corrective RAG system initialized
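
Path C above only simulates web search with the LLM's parametric knowledge. A search-grounded variant could be sketched as follows. This is an illustration under stated assumptions: `fallback_with_search`, `search_fn`, and `generate_fn` are hypothetical names, and no particular search API is implied.

```python
from typing import Any, Callable, Dict, List


def fallback_with_search(
    query_text: str,
    search_fn: Callable[[str], List[str]],
    generate_fn: Callable[[str], str],
    max_snippets: int = 3,
) -> Dict[str, Any]:
    """Hypothetical Path C variant: ground the answer in search snippets.

    `search_fn` and `generate_fn` are placeholders supplied by the caller
    (for example a search API client and an LLM completion call).
    """
    snippets = search_fn(query_text)[:max_snippets]

    if not snippets:
        # Nothing to ground on: decline rather than rely on parametric knowledge
        return {
            "answer": "Information not available in the knowledge base or search results.",
            "sources": [],
            "corrective_action": "No results (declined to answer)",
        }

    context = "\n\n".join(f"[{i}] {s}" for i, s in enumerate(snippets, 1))
    prompt = (
        "Answer the query using only the search results below. "
        "If they are insufficient, say so.\n\n"
        f"Search results:\n{context}\n\nQuery: {query_text}\n\nAnswer:"
    )
    return {
        "answer": generate_fn(prompt),
        "sources": snippets,
        "corrective_action": "Web search fallback",
    }
```

With the notebook's objects, `generate_fn` could be `lambda p: llm.complete(p).text`; the search function would have to come from whichever provider is integrated.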


## Step 6: Test Suite - Exercising All Three Paths

We'll run three test queries, each designed to trigger a different execution path:

### Test Query 1: Well-Covered Topic (Expected: Path A - RELEVANT)
Query about transformers - topic directly covered in loaded documents

In [14]:
# Test Query 1: Expected Path A (RELEVANT)
query1 = "What is the self-attention mechanism in transformer architecture and why is it important?"

result1 = crag_system.query(query1, verbose=True)

print("\n" + "="*80)
print("📊 RESULT SUMMARY")
print("="*80)
print(f"Path Taken: {result1['path']}")
print(f"Relevance Score: {result1['relevance_score']}/5")
print(f"Corrective Action: {result1['corrective_action']}")
print(f"\n💡 Answer:\n{result1['answer']}")
print("\n📚 Sources:")
for i, node in enumerate(result1['source_nodes'][:2], 1):
    print(f"   {i}. {node.node.metadata.get('file_name', 'Unknown')}: {node.node.text[:150]}...")


🔍 QUERY: What is the self-attention mechanism in transformer architecture and why is it important?

📥 STEP 1: Initial Retrieval
   Retrieved 3 documents
   Doc 1: # Transformer Architecture in Deep Learning

The **Transformer architecture** revolutionized natural...
   Doc 2: Multi-head self-attention
  2. Layer normalization
  3. Feed-forward network
  4. Residual connectio...
   Doc 3: # BERT: Bidirectional Encoder Representations from Transformers

**BERT** is a transformer-based lan...

🧠 STEP 2: Relevance Assessment
   Relevance Score: 4.0/5
   Decision: RELEVANT
   Reasoning: Document 1 provides substantial information on the self-attention mechanism within transformer architecture, explaining its importance in enabling parallel processing and capturing long-range dependencies, which directly addresses the user's query.

✅ STEP 3: Path A - Standard Generation
   Confidence: HIGH - Proceeding with retrieved context

📊 RESULT SUMMARY
Path Taken: A: RELEVANT
Relevance Score: 4.0/5


### Test Query 2: Partially Covered Topic (Expected: Path B - PARTIAL)
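A query that is only tangentially covered by the corpus and is intended to trigger re-retrieval. In the recorded run below, however, the grader scored the retrieved context 5.0, so the query was routed to Path A rather than Path B.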

In [15]:
# Test Query 2: Expected Path B (PARTIAL - Re-retrieval)
query2 = "How do language models handle bidirectional context compared to unidirectional approaches?"

result2 = crag_system.query(query2, verbose=True)

print("\n" + "="*80)
print("📊 RESULT SUMMARY")
print("="*80)
print(f"Path Taken: {result2['path']}")
print(f"Relevance Score: {result2['relevance_score']}/5")
print(f"Corrective Action: {result2['corrective_action']}")
print(f"\n💡 Answer:\n{result2['answer']}")
print("\n📚 Sources:")
for i, node in enumerate(result2['source_nodes'][:2], 1):
    print(f"   {i}. {node.node.metadata.get('file_name', 'Unknown')}: {node.node.text[:150]}...")


🔍 QUERY: How do language models handle bidirectional context compared to unidirectional approaches?

📥 STEP 1: Initial Retrieval
   Retrieved 3 documents
   Doc 1: # BERT: Bidirectional Encoder Representations from Transformers

**BERT** is a transformer-based lan...
   Doc 2: Multi-head self-attention
  2. Layer normalization
  3. Feed-forward network
  4. Residual connectio...
   Doc 3: **Pre-training**: Learning from a massive text corpus to predict the next token
2. **Instruction Tun...

🧠 STEP 2: Relevance Assessment
   Relevance Score: 5.0/5
   Decision: RELEVANT
   Reasoning: Document 1 directly addresses the query by explaining BERT's bidirectional approach, contrasting it with unidirectional models, which is central to understanding how language models handle bidirectional context.

✅ STEP 3: Path A - Standard Generation
   Confidence: HIGH - Proceeding with retrieved context

📊 RESULT SUMMARY
Path Taken: A: RELEVANT
Relevance Score: 5.0/5
Corrective Action: None

💡 Answer:
L
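
In the recorded run above, the grader scored this query 5.0 and routed it to Path A, so Path B was not exercised. One way to exercise a specific path deterministically is to swap in a stub grader that returns a fixed score. The sketch below is self-contained: it re-declares a `RelevanceDecision` enum equivalent to the one in Step 4, and `FixedScoreGrader` is a hypothetical helper, not part of the notebook.

```python
from enum import Enum
from typing import List, Tuple


class RelevanceDecision(Enum):
    """Same three outcomes as the enum defined in Step 4."""
    RELEVANT = "relevant"
    PARTIAL = "partial"
    NOT_RELEVANT = "not_relevant"


class FixedScoreGrader:
    """Hypothetical stub: ignores its inputs and always returns the same score."""

    def __init__(self, score: float):
        self.score = score

    def assess_relevance(
        self, query: str, retrieved_texts: List[str]
    ) -> Tuple[float, RelevanceDecision, str]:
        # Same thresholds as the Step 4 grader
        if self.score >= 4.0:
            decision = RelevanceDecision.RELEVANT
        elif self.score >= 2.0:
            decision = RelevanceDecision.PARTIAL
        else:
            decision = RelevanceDecision.NOT_RELEVANT
        return self.score, decision, f"Stubbed score of {self.score}"


score, decision, reason = FixedScoreGrader(3.0).assess_relevance("any query", [])
print(score, decision.value, reason)
```

Inside the notebook, the enum re-declaration should be skipped so that the Step 4 enum is reused. Passing `FixedScoreGrader(3.0)` as the `grader` argument of `CorrectiveRAG` would then force Path B for any query.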

### Test Query 3: Not Covered Topic (Expected: Path C - NOT RELEVANT)
Query about a topic not in our limited corpus (Docker, recent events, etc.)

In [16]:
# Test Query 3: Expected Path C (NOT RELEVANT - Fallback)
query3 = "What are the best practices for containerizing machine learning models with Docker and Kubernetes?"

result3 = crag_system.query(query3, verbose=True)

print("\n" + "="*80)
print("📊 RESULT SUMMARY")
print("="*80)
print(f"Path Taken: {result3['path']}")
print(f"Relevance Score: {result3['relevance_score']}/5")
print(f"Corrective Action: {result3['corrective_action']}")
print(f"\n💡 Answer:\n{result3['answer']}")
print(f"\n📚 Sources: {len(result3['source_nodes'])} (fallback used LLM parametric knowledge)")


🔍 QUERY: What are the best practices for containerizing machine learning models with Docker and Kubernetes?

📥 STEP 1: Initial Retrieval
   Retrieved 3 documents
   Doc 1: **Pre-training**: Learning from a massive text corpus to predict the next token
2. **Instruction Tun...
   Doc 2: # GPT-4: Generative Pre-trained Transformer 4

**GPT-4** is OpenAI's fourth-generation large languag...
   Doc 3: # Transformer Architecture in Deep Learning

The **Transformer architecture** revolutionized natural...

🧠 STEP 2: Relevance Assessment
   Relevance Score: 1.0/5
   Decision: NOT_RELEVANT
   Reasoning: None of the documents contain information related to containerizing machine learning models with Docker and Kubernetes.

❌ STEP 3: Path C - Fallback to Direct LLM
   Confidence: LOW - Knowledge base lacks relevant information
   Action: Using LLM's parametric knowledge (simulating web search)
   ⚠️  Note: This is a graceful fallback. In production, this would:
      - Trigger web search (Bing, 

## Step 7: Comparative Analysis - Corrective vs. Baseline RAG

Let's compare Corrective RAG against a baseline system that always proceeds with retrieval, regardless of quality.

In [17]:
# Create baseline (non-corrective) query engine
baseline_engine = index.as_query_engine(similarity_top_k=3)

def run_comparison(query: str):
    """Run same query through both systems and compare"""
    print(f"\n{'='*100}")
    print(f"🔬 COMPARISON TEST")
    print(f"{'='*100}")
    print(f"Query: {query}\n")
    
    # Baseline RAG (no correction)
    print("[1] BASELINE RAG (No Correction)")
    print("-" * 50)
    baseline_response = baseline_engine.query(query)
    print(f"Answer: {baseline_response.response}\n")
    
    # Corrective RAG
    print("[2] CORRECTIVE RAG (With Self-Reflection)")
    print("-" * 50)
    crag_result = crag_system.query(query, verbose=False)
    print(f"Path: {crag_result['path']}")
    print(f"Relevance: {crag_result['relevance_score']}/5")
    print(f"Corrective Action: {crag_result['corrective_action']}")
    print(f"Answer: {crag_result['answer']}\n")
    
    print("="*100)
    print("\n📊 ANALYSIS:")
    if crag_result['decision'] == 'relevant':
        print("   ✅ Both systems performed similarly (high-quality retrieval)")
    elif crag_result['decision'] == 'partial':
        print("   ⚠️  Corrective RAG improved retrieval via query transformation")
    else:
        print("   ❌ Corrective RAG avoided using irrelevant context (prevented potential hallucination)")
        print("      Baseline system may have generated answer from poor context")

# Test on the Docker query (known to be out of corpus)
run_comparison("What are the best practices for containerizing machine learning models with Docker?")


🔬 COMPARISON TEST
Query: What are the best practices for containerizing machine learning models with Docker?

[1] BASELINE RAG (No Correction)
--------------------------------------------------
Answer: Containerizing machine learning models with Docker involves several best practices to ensure efficiency, scalability, and maintainability. Here are some key practices:

1. **Use Lightweight Base Images**: Start with a minimal base image to reduce the size of the container. Alpine Linux is often recommended for its small footprint.

2. **Separate Build and Runtime Environments**: Use multi-stage builds to separate the environment needed for building the model from the one needed for running it. This helps in keeping the runtime environment clean and lightweight.

3. **Optimize Dependencies**: Only include necessary dependencies in the Docker image. This minimizes the image size and reduces potential security vulnerabilities.

4. **Environment Variables for Configuration**: Use environmen

## Step 8: Visualizing the Decision Tree

Let's summarize in a table how each query was routed.

In [18]:
import pandas as pd

# Compile results
results_data = [
    {
        "Query": "Self-attention in transformers",
        "Relevance Score": result1['relevance_score'],
        "Decision": result1['decision'],
        "Path": result1['path'],
        "Corrective Action": result1['corrective_action'] or "None (sufficient)"
    },
    {
        "Query": "Bidirectional vs unidirectional context",
        "Relevance Score": result2['relevance_score'],
        "Decision": result2['decision'],
        "Path": result2['path'],
        "Corrective Action": result2['corrective_action'] or "None"
    },
    {
        "Query": "Docker containerization",
        "Relevance Score": result3['relevance_score'],
        "Decision": result3['decision'],
        "Path": result3['path'],
        "Corrective Action": result3['corrective_action'] or "None"
    }
]

df_results = pd.DataFrame(results_data)

print("\n" + "="*100)
print("📊 CORRECTIVE RAG EXECUTION SUMMARY")
print("="*100)
print(df_results.to_string(index=False))
print("\n" + "="*100)

# Path distribution
path_counts = df_results['Decision'].value_counts()
print("\n🎯 Execution Path Distribution:")
for decision, count in path_counts.items():
    print(f"   {decision.upper()}: {count} queries")


📊 CORRECTIVE RAG EXECUTION SUMMARY
                                  Query  Relevance Score     Decision                       Path                             Corrective Action
         Self-attention in transformers              4.0     relevant                A: RELEVANT                             None (sufficient)
Bidirectional vs unidirectional context              5.0     relevant                A: RELEVANT                                          None
                Docker containerization              1.0 not_relevant C: NOT RELEVANT (Fallback) Direct LLM generation (web search simulation)


🎯 Execution Path Distribution:
   RELEVANT: 2 queries
   NOT_RELEVANT: 1 queries


## 🎓 Key Takeaways

### What We Learned

1. **Self-Reflection Prevents Failures**
   - Traditional RAG blindly proceeds with retrieved context, even if irrelevant
   - Corrective RAG evaluates retrieval quality BEFORE generation, which reduces the risk of hallucination from irrelevant context
   - LLM-as-judge provides flexible, semantic quality assessment

2. **Adaptive Routing Improves Robustness**
   - High relevance → Standard path (efficient)
   - Medium relevance → Corrective action (query transformation, re-retrieval)
   - Low relevance → Fallback (web search, direct LLM, or "I don't know")

3. **Failure Modes Mapping**
   - **FP1 (Missing Content)**: Detected by Path C (not relevant) → Fallback triggered
   - **FP2 (Missed Top Ranked)**: Detected by Path B (partial) → Re-retrieval with HyDE
   - **FP3-FP5**: Can be caught post-generation with additional grading

4. **Production Considerations**
   - Relevance grading adds latency (extra LLM call)
   - Can cache grading decisions for similar queries
   - Threshold tuning (4-5, 2-3, <2) depends on domain
   - Web search integration provides genuine fallback capability
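
The caching idea above can be sketched with a small wrapper. This is a simplified illustration: `CachedGrader` and `normalize` are hypothetical names, and the cache only matches queries that are identical after lowercasing and whitespace collapsing. Matching semantically similar queries would need an embedding-based lookup, which is not shown.

```python
from typing import Callable, Dict, Sequence, Tuple

Grade = Tuple[float, str]  # (score, reasoning)


def normalize(query: str) -> str:
    """Collapse case and whitespace so trivially different queries share a key."""
    return " ".join(query.lower().split())


class CachedGrader:
    """Hypothetical wrapper that memoizes grades per (query, documents) pair."""

    def __init__(self, grade_fn: Callable[[str, Sequence[str]], Grade]):
        self._grade_fn = grade_fn
        self._cache: Dict[Tuple[str, Tuple[str, ...]], Grade] = {}
        self.hits = 0
        self.misses = 0

    def grade(self, query: str, docs: Sequence[str]) -> Grade:
        # The documents are part of the key: the same query graded against
        # different retrieved context must not reuse an old grade
        key = (normalize(query), tuple(docs))
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        result = self._grade_fn(query, docs)
        self._cache[key] = result
        return result
```

Tracking `hits` and `misses` gives a rough view of how much grading latency the cache actually saves.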

### Architectural Insights

Corrective RAG represents a fundamental shift:
- From **passive retrieval** to **active quality control**
- From **fixed pipelines** to **adaptive workflows**
- From **blind generation** to **self-aware systems**

This is a bridge between traditional RAG and fully Agentic RAG, where the system begins to reason about its own capabilities and limitations.

### Next Steps

To extend this implementation:
1. **Add Web Search**: Integrate Bing/Google API for Path C
2. **Multi-Pass Correction**: Allow multiple re-retrieval attempts
3. **Post-Generation Validation**: Add faithfulness grading after answer generation
4. **Confidence Calibration**: Tune thresholds based on domain-specific evaluation
5. **Logging & Monitoring**: Track path distribution to identify corpus gaps
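
As one possible shape for the multi-pass correction step, the loop below retries retrieval with rewritten queries until the grade is high enough or the attempt budget is spent. It is a sketch under stated assumptions: `retrieve_with_retries` and its three callables (`retrieve`, `grade`, `rewrite`) are hypothetical placeholders for the retriever, the relevance grader, and a query rewriter such as HyDE.

```python
from typing import Callable, List, Tuple


def retrieve_with_retries(
    query: str,
    retrieve: Callable[[str], List[str]],
    grade: Callable[[str, List[str]], float],
    rewrite: Callable[[str, int], str],
    min_score: float = 4.0,
    max_attempts: int = 3,
) -> Tuple[List[str], float, int]:
    """Retry retrieval with rewritten queries until the grade is good enough.

    Returns (best_docs, best_score, attempts_used).
    """
    best_docs: List[str] = []
    best_score = float("-inf")
    current_query = query
    attempts = 0

    for attempt in range(1, max_attempts + 1):
        attempts = attempt
        docs = retrieve(current_query)
        # Always grade against the ORIGINAL query, not the rewritten one
        score = grade(query, docs)
        if score > best_score:
            best_docs, best_score = docs, score
        if score >= min_score:
            break
        current_query = rewrite(query, attempt)

    return best_docs, best_score, attempts
```

Bounding the loop with `max_attempts` matters because each pass adds at least one grading call, and keeping the best-scoring documents means a failed final attempt does not discard an earlier, better result.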

---

## 📚 References

From the curriculum (AdvancedRAGWorkshop.md):
- **Reference 10**: Seven Failure Points When Engineering a RAG System (arXiv)
- **Reference 11**: The Common Failure Points of LLM RAG Systems and How to Overcome Them

Key concepts demonstrated:
- Self-reflective retrieval and adaptive behavior
- Conditional workflow based on retrieval quality assessment
- Fallback mechanisms for graceful degradation
- Query transformation as corrective action (HyDE)

---

**Status**: Demo #7 Complete ✅