# Demo #7: Corrective RAG (CRAG) - Self-Correcting Retrieval

## Objective
Implement a self-reflective system that evaluates retrieval quality and triggers corrective actions (web search fallback) when internal knowledge is insufficient.

## Core Concepts
- **Self-correction and retrieval evaluation**: System assesses its own retrieval quality
- **Dynamic routing based on confidence scores**: High/Low/Ambiguous confidence triggers different paths
- **Fallback to external knowledge sources**: Web search when internal knowledge fails
- **Knowledge refinement**: Document grading and filtering

## What is Corrective RAG (CRAG)?

Traditional RAG assumes that retrieved documents are always relevant and useful. **CRAG challenges this assumption** by:

1. **Evaluating** retrieval quality before generation
2. **Correcting** by routing to alternative sources when needed
3. **Refining** retrieved content through knowledge strip filtering

### The Problem with Naive RAG
```
Query → Retrieve → Generate
         ↓
   (Assumes retrieval is good)
```

**Issues:**
- ❌ No evaluation of retrieval quality
- ❌ Generates answers even with poor/irrelevant context
- ❌ No fallback when knowledge base lacks information
- ❌ Leads to hallucinations or low-quality answers

### CRAG Solution
```
Query → Retrieve → Evaluate Confidence
                        ↓
         ┌──────────────┼──────────────┐
         ↓              ↓              ↓
      HIGH           LOW          AMBIGUOUS
         ↓              ↓              ↓
   Use Internal   Web Search    Merge Both
       Docs          Only         Sources
         ↓              ↓              ↓
         └──────────────┴──────────────┘
                        ↓
                  LLM Generation
```

## Data Flow
```
Query
  ↓
Retrieve from Internal Knowledge Base
  ↓
Evaluator scores relevance (0-1)
  ↓
Decision:
  • High (≥0.7): Use internal docs
  • Low (<0.4): Use web search
  • Ambiguous (0.4 to <0.7): Merge both
  ↓
Optional: Knowledge strip filtering
  ↓
LLM Generation with refined context
```
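
The decision step in the flow above can be sketched as a small pure function. This is only an illustration: `route_for_score` is a hypothetical helper name, and the boundary handling follows the engine built in Step 4 (a score of exactly 0.7 counts as HIGH, exactly 0.4 as AMBIGUOUS).

```python
from typing import Literal

Route = Literal["INTERNAL", "WEB_SEARCH", "HYBRID"]


def route_for_score(score: float, high: float = 0.7, low: float = 0.4) -> Route:
    """Map an evaluator score in [0, 1] to one of the three CRAG routes."""
    if score >= high:
        return "INTERNAL"    # internal docs are judged sufficient
    if score < low:
        return "WEB_SEARCH"  # internal docs are judged irrelevant
    return "HYBRID"          # partial coverage: merge both sources
```

Keeping the routing rule separate from retrieval and generation makes the thresholds easy to unit-test and tune.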

## Setup: Install Dependencies and Load Environment

In [None]:
# Install required packages
# Run this cell if packages are not already installed
# !pip install llama-index llama-index-llms-azure-openai llama-index-embeddings-azure-openai
# !pip install python-dotenv matplotlib

In [None]:
import os
import json
import re
from typing import List, Dict, Tuple
from dotenv import load_dotenv
import warnings
warnings.filterwarnings('ignore')

# Load environment variables
load_dotenv()

# Verify Azure OpenAI credentials
required_vars = [
    'AZURE_OPENAI_API_KEY',
    'AZURE_OPENAI_ENDPOINT',
    'AZURE_OPENAI_API_VERSION',
    'AZURE_OPENAI_DEPLOYMENT_NAME',
    'AZURE_OPENAI_EMBEDDING_DEPLOYMENT'
]

missing_vars = [var for var in required_vars if not os.getenv(var)]
if missing_vars:
    print(f"❌ Missing environment variables: {', '.join(missing_vars)}")
else:
    print("✅ All required environment variables are set.")

## Step 1: Setup Limited Internal Knowledge Base

We'll create a focused knowledge base on **machine learning concepts**. This will allow us to test:
- **In-domain queries**: Questions about ML algorithms (should score HIGH)
- **Out-of-domain queries**: Questions about unrelated topics (should score LOW)
- **Ambiguous queries**: Questions partially covered (should score AMBIGUOUS)

In [None]:
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.core import Settings

# Initialize Azure OpenAI
azure_llm = AzureOpenAI(
    model="gpt-4",
    deployment_name=os.getenv('AZURE_OPENAI_DEPLOYMENT_NAME'),
    api_key=os.getenv('AZURE_OPENAI_API_KEY'),
    azure_endpoint=os.getenv('AZURE_OPENAI_ENDPOINT'),
    api_version=os.getenv('AZURE_OPENAI_API_VERSION'),
    temperature=0.1
)

azure_embed = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name=os.getenv('AZURE_OPENAI_EMBEDDING_DEPLOYMENT'),
    api_key=os.getenv('AZURE_OPENAI_API_KEY'),
    azure_endpoint=os.getenv('AZURE_OPENAI_ENDPOINT'),
    api_version=os.getenv('AZURE_OPENAI_API_VERSION'),
)

Settings.llm = azure_llm
Settings.embed_model = azure_embed

print("✅ Azure OpenAI initialized")

In [None]:
# Load only ML concepts (limited knowledge base)
documents = SimpleDirectoryReader('data/ml_concepts').load_data()

print(f"📚 Internal Knowledge Base: {len(documents)} documents")
print("\nDocuments:")
for i, doc in enumerate(documents, 1):
    filename = os.path.basename(doc.metadata.get('file_name', 'Unknown'))
    print(f"   {i}. {filename}")

print("\n⚠️ This is a LIMITED knowledge base focused on ML concepts.")
print("   Queries outside this domain should trigger web search fallback.")

In [None]:
# Create vector index
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes, embed_model=azure_embed)

print(f"✅ Vector index created with {len(nodes)} chunks")

## Step 2: Implement Retrieval Evaluator

The evaluator is the core of CRAG. It assesses whether retrieved documents are sufficient to answer the query.

In [None]:
def evaluate_retrieval_relevance(query: str, retrieved_chunks: List[str], llm) -> Tuple[float, str]:
    """
    Evaluate the relevance of retrieved chunks for a given query.
    
    Returns:
        Tuple of (confidence_score, reasoning)
        - confidence_score: 0-1 scale
        - reasoning: Explanation of the score
    """
    # Combine chunks for evaluation
    context = "\n\n".join([f"Document {i+1}: {chunk[:300]}..." for i, chunk in enumerate(retrieved_chunks[:5])])
    
    eval_prompt = f"""
You are a retrieval quality evaluator. Assess whether the retrieved documents contain sufficient information to answer the query.

Query: {query}

Retrieved Documents:
{context}

Evaluate the retrieval quality and provide:
1. A confidence score between 0 and 1:
   - 0.0-0.4: LOW - Documents are irrelevant or insufficient
   - 0.4-0.7: AMBIGUOUS - Partial information, may need supplementation
   - 0.7-1.0: HIGH - Documents are highly relevant and sufficient

2. Brief reasoning (1-2 sentences)

Respond in JSON format:
{{
    "confidence_score": 0.85,
    "reasoning": "Your explanation here"
}}
"""
    
    response = llm.complete(eval_prompt)
    
    try:
        # Parse JSON response
        # Extract JSON from response (handle cases where LLM adds extra text)
        response_text = response.text.strip()
        # Find JSON in response
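        json_match = re.search(r'\{.*\}', response_text, re.DOTALL)
        if json_match:
            result = json.loads(json_match.group())
            # Cast and clamp: the LLM may return the score as a string or outside 0-1
            score = min(1.0, max(0.0, float(result['confidence_score'])))
            return score, result['reasoning']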
        else:
            # Fallback: try to extract score from text
            score_match = re.search(r'(0\.[0-9]+|1\.0)', response_text)
            if score_match:
                return float(score_match.group()), response_text
            return 0.5, "Could not parse evaluation"
    except Exception as e:
        print(f"⚠️ Error parsing evaluation: {e}")
        return 0.5, "Evaluation error"

print("✅ Retrieval evaluator function created")
print("   Thresholds: HIGH (>=0.7), AMBIGUOUS (0.4 to <0.7), LOW (<0.4)")

## Step 3: Implement Web Search Fallback

For this demo, we'll use a **mock web search** that simulates external knowledge retrieval.

In [None]:
def web_search_fallback(query: str) -> List[str]:
    """
    Simulate web search results.
    
    In a production system, this would integrate with:
    - Bing Search API (Azure)
    - Google Search API
    - Tavily Search API
    - SerpAPI
    
    For demo purposes, we return mock results based on query keywords.
    """
    print(f"\n🌐 Simulating web search for: {query}")
    
    # Mock results based on query content
    if "climate" in query.lower() or "global warming" in query.lower():
        return [
            "[Web Source 1] Climate change refers to long-term shifts in temperatures and weather patterns. Since the 1800s, human activities have been the main driver of climate change, primarily due to burning fossil fuels.",
            "[Web Source 2] The effects of climate change include rising sea levels, more frequent extreme weather events, and significant impacts on ecosystems and biodiversity.",
            "[Web Source 3] The Paris Agreement is an international treaty on climate change, adopted in 2015, with the goal of limiting global warming to well below 2°C above pre-industrial levels."
        ]
    elif "quantum computing" in query.lower() or "quantum" in query.lower():
        return [
            "[Web Source 1] Quantum computing leverages quantum mechanics principles like superposition and entanglement to process information in fundamentally different ways than classical computers.",
            "[Web Source 2] Quantum computers use quantum bits (qubits) which can exist in multiple states simultaneously, enabling parallel processing of vast amounts of data.",
            "[Web Source 3] Current applications of quantum computing include cryptography, drug discovery, optimization problems, and simulation of quantum systems."
        ]
    elif "blockchain" in query.lower() or "cryptocurrency" in query.lower():
        return [
            "[Web Source 1] Blockchain is a distributed ledger technology that records transactions across multiple computers in a way that makes it difficult to alter retroactively.",
            "[Web Source 2] Cryptocurrencies like Bitcoin use blockchain technology to enable peer-to-peer transactions without the need for a central authority.",
            "[Web Source 3] Beyond cryptocurrencies, blockchain has applications in supply chain management, smart contracts, and digital identity verification."
        ]
    else:
        # Generic fallback
        return [
            f"[Web Source 1] Web search result related to: {query}. This is a simulated result from an external knowledge source.",
            f"[Web Source 2] Additional information about {query} from web search. In a real system, this would come from Bing/Google API.",
            f"[Web Source 3] Further context on {query} retrieved from external sources."
        ]

print("✅ Mock web search function created")
print("   In production, replace with Bing Search API or similar service.")
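
When the mock is swapped for a real provider, a thin wrapper can keep the rest of the pipeline unchanged. The sketch below assumes a hypothetical `search_fn` callable that takes a query and returns plain-text snippets; the actual client call (Bing, Tavily, and so on) would live inside that callable and is not shown here. The wrapper labels results in the same `[Web Source N]` style as the mock and returns an empty list if the provider fails.

```python
from typing import Callable, List

SearchFn = Callable[[str], List[str]]


def make_web_search(search_fn: SearchFn, max_results: int = 3) -> SearchFn:
    """Wrap a search callable so it returns labelled snippets and never raises."""

    def search(query: str) -> List[str]:
        try:
            snippets = search_fn(query)[:max_results]
        except Exception as exc:  # network errors, quota limits, etc.
            print(f"Web search failed: {exc}")
            return []
        return [f"[Web Source {i}] {s.strip()}" for i, s in enumerate(snippets, 1)]

    return search
```

Returning an empty list on failure lets the caller decide whether to fall back to internal chunks or to report that no answer is available.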

## Step 4: Build CRAG Query Engine

Now we'll create the main CRAG system that ties everything together.

In [None]:
class CorrectiveRAGEngine:
    """
    Corrective RAG Query Engine with self-evaluation and dynamic routing.
    """
    
    def __init__(self, index, llm, embed_model, retriever_top_k=5):
        self.index = index
        self.llm = llm
        self.embed_model = embed_model
        self.retriever = index.as_retriever(similarity_top_k=retriever_top_k)
        
        # Confidence thresholds
        self.HIGH_THRESHOLD = 0.7
        self.LOW_THRESHOLD = 0.4
    
    def query(self, query_str: str, verbose: bool = True) -> Dict:
        """
        Execute CRAG pipeline:
        1. Retrieve from internal KB
        2. Evaluate retrieval quality
        3. Route based on confidence
        4. Generate answer
        
        Returns:
            Dict with keys: answer, confidence, route, reasoning,
            num_internal_chunks, num_web_chunks
        """
        if verbose:
            print(f"\n{'='*80}")
            print(f"🔍 Query: {query_str}")
            print(f"{'='*80}\n")
        
        # Step 1: Retrieve from internal KB
        if verbose:
            print("📚 Step 1: Retrieving from internal knowledge base...")
        
        retrieved_nodes = self.retriever.retrieve(query_str)
        retrieved_texts = [node.node.text for node in retrieved_nodes]
        
        if verbose:
            print(f"   Retrieved {len(retrieved_texts)} chunks from internal KB")
            for i, node in enumerate(retrieved_nodes[:3], 1):
                filename = os.path.basename(node.node.metadata.get('file_name', 'Unknown'))
                print(f"   {i}. {filename} (score: {node.score:.4f})")
        
        # Step 2: Evaluate retrieval quality
        if verbose:
            print("\n🎯 Step 2: Evaluating retrieval quality...")
        
        confidence, reasoning = evaluate_retrieval_relevance(
            query_str, retrieved_texts, self.llm
        )
        
        if verbose:
            print(f"   Confidence Score: {confidence:.2f}")
            print(f"   Reasoning: {reasoning}")
        
        # Step 3: Route based on confidence
        if confidence >= self.HIGH_THRESHOLD:
            route = "INTERNAL"
            context_chunks = retrieved_texts
            if verbose:
                print(f"\n✅ Step 3: Route = {route} (confidence >= {self.HIGH_THRESHOLD})")
                print("   Using internal documents only.")
        
        elif confidence < self.LOW_THRESHOLD:
            route = "WEB_SEARCH"
            web_results = web_search_fallback(query_str)
            context_chunks = web_results
            if verbose:
                print(f"\n🌐 Step 3: Route = {route} (confidence < {self.LOW_THRESHOLD})")
                print("   Internal KB insufficient. Using web search results.")
                for i, result in enumerate(web_results, 1):
                    print(f"   {i}. {result[:80]}...")
        
        else:
            route = "HYBRID"
            web_results = web_search_fallback(query_str)
            context_chunks = retrieved_texts + web_results
            if verbose:
                print(f"\n🔀 Step 3: Route = {route} ({self.LOW_THRESHOLD} <= confidence < {self.HIGH_THRESHOLD})")
                print("   Merging internal KB and web search results.")
        
        # Step 4: Generate answer
        if verbose:
            print(f"\n🤖 Step 4: Generating answer with {len(context_chunks)} context chunks...")
        
        context = "\n\n".join([f"[Source {i+1}] {chunk}" for i, chunk in enumerate(context_chunks)])
        
        generation_prompt = f"""
Answer the following question based on the provided context. Be accurate and cite sources.

Question: {query_str}

Context:
{context}

Answer:
"""
        
        response = self.llm.complete(generation_prompt)
        
        return {
            'answer': response.text,
            'confidence': confidence,
            'route': route,
            'reasoning': reasoning,
            'num_internal_chunks': len(retrieved_texts),
            'num_web_chunks': len(web_results) if route in ['WEB_SEARCH', 'HYBRID'] else 0
        }

# Initialize CRAG engine
crag_engine = CorrectiveRAGEngine(
    index=index,
    llm=azure_llm,
    embed_model=azure_embed,
    retriever_top_k=5
)

print("\n✅ Corrective RAG Engine initialized")
print("   Ready to handle in-domain, out-of-domain, and ambiguous queries.")

## Step 5: Evaluation Scenarios

Let's test the three scenarios: In-domain, Out-of-domain, and Ambiguous queries.

### Scenario 1: In-Domain Query (HIGH Confidence)

This query asks about backpropagation in neural networks, a topic our ML knowledge base should cover fully.

In [None]:
in_domain_query = "What is backpropagation in neural networks and how does it work?"

result_in_domain = crag_engine.query(in_domain_query, verbose=True)

print(f"\n{'='*80}")
print("📝 FINAL ANSWER:")
print(f"{'='*80}")
print(result_in_domain['answer'])
print(f"\n{'='*80}")
print(f"Route: {result_in_domain['route']}")
print(f"Confidence: {result_in_domain['confidence']:.2f}")
print(f"{'='*80}")

### Scenario 2: Out-of-Domain Query (LOW Confidence)

This query is about climate change, which is NOT in our ML-focused knowledge base.

In [None]:
out_of_domain_query = "What are the main causes of climate change and global warming?"

result_out_domain = crag_engine.query(out_of_domain_query, verbose=True)

print(f"\n{'='*80}")
print("📝 FINAL ANSWER:")
print(f"{'='*80}")
print(result_out_domain['answer'])
print(f"\n{'='*80}")
print(f"Route: {result_out_domain['route']}")
print(f"Confidence: {result_out_domain['confidence']:.2f}")
print(f"{'='*80}")

### Scenario 3: Ambiguous Query (AMBIGUOUS Confidence)

This query mentions machine learning but asks about a specific application (quantum computing) not fully covered.

In [None]:
ambiguous_query = "How can machine learning algorithms be used in quantum computing applications?"

result_ambiguous = crag_engine.query(ambiguous_query, verbose=True)

print(f"\n{'='*80}")
print("📝 FINAL ANSWER:")
print(f"{'='*80}")
print(result_ambiguous['answer'])
print(f"\n{'='*80}")
print(f"Route: {result_ambiguous['route']}")
print(f"Confidence: {result_ambiguous['confidence']:.2f}")
print(f"{'='*80}")

## Step 6: Comparative Analysis

Let's compare CRAG with a naive RAG system that doesn't evaluate or correct.

In [None]:
from llama_index.core.query_engine import RetrieverQueryEngine

# Naive RAG (no evaluation, no correction)
naive_retriever = index.as_retriever(similarity_top_k=5)
# from_args accepts an llm; the plain constructor does not
naive_engine = RetrieverQueryEngine.from_args(naive_retriever, llm=azure_llm)

print("✅ Naive RAG engine created for comparison")
print("   This engine always uses internal KB, even when insufficient.")

In [None]:
# Test naive RAG on out-of-domain query
print(f"\n🔍 Testing NAIVE RAG on out-of-domain query:\n{out_of_domain_query}\n")

naive_response = naive_engine.query(out_of_domain_query)

print("📝 Naive RAG Answer (No Correction):")
print("="*80)
print(naive_response.response)
print("="*80)
print("\n⚠️ Issues with Naive RAG:")
print("   - May hallucinate or provide irrelevant information")
print("   - No awareness that knowledge base lacks relevant information")
print("   - No fallback to external sources")

In [None]:
# Compare side-by-side
print("\n" + "="*80)
print("SIDE-BY-SIDE COMPARISON: Out-of-Domain Query")
print("="*80)

print(f"\nQuery: {out_of_domain_query}\n")

print("\n[NAIVE RAG - No Evaluation]")
print("-"*80)
print(naive_response.response[:400] + "...")
print("\nRoute: INTERNAL ONLY (always)")
print("Confidence: N/A (no evaluation)")

print("\n\n[CORRECTIVE RAG - With Evaluation]")
print("-"*80)
print(result_out_domain['answer'][:400] + "...")
print(f"\nRoute: {result_out_domain['route']}")
print(f"Confidence: {result_out_domain['confidence']:.2f}")
print(f"Reasoning: {result_out_domain['reasoning']}")

print("\n" + "="*80)
print("✅ CRAG recognizes the knowledge gap and routes the query to an")
print("   external source instead of answering from irrelevant context.")
print("="*80)

## Visualization: CRAG Decision Flow

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Collect results
scenarios = [
    ('In-Domain\n(Neural Networks)', result_in_domain),
    ('Out-of-Domain\n(Climate Change)', result_out_domain),
    ('Ambiguous\n(ML + Quantum)', result_ambiguous)
]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Plot 1: Confidence Scores
labels = [s[0] for s in scenarios]
confidences = [s[1]['confidence'] for s in scenarios]
colors = ['green' if c >= 0.7 else 'red' if c < 0.4 else 'orange' for c in confidences]

bars = ax1.bar(labels, confidences, color=colors, alpha=0.7, edgecolor='black', linewidth=1.5)
ax1.axhline(y=0.7, color='green', linestyle='--', linewidth=2, label='HIGH threshold (0.7)')
ax1.axhline(y=0.4, color='red', linestyle='--', linewidth=2, label='LOW threshold (0.4)')
ax1.set_ylabel('Confidence Score', fontsize=12, fontweight='bold')
ax1.set_title('CRAG Confidence Evaluation', fontsize=14, fontweight='bold')
ax1.set_ylim(0, 1.0)
ax1.legend(loc='upper right')
ax1.grid(axis='y', alpha=0.3)

# Add value labels on bars
for bar, conf in zip(bars, confidences):
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height + 0.02,
             f'{conf:.2f}', ha='center', va='bottom', fontweight='bold', fontsize=11)

# Plot 2: Routing Decisions
routes = [s[1]['route'] for s in scenarios]
route_colors = {'INTERNAL': 'green', 'WEB_SEARCH': 'red', 'HYBRID': 'orange'}
route_labels = {'INTERNAL': 'Internal KB Only', 'WEB_SEARCH': 'Web Search Only', 'HYBRID': 'Hybrid (Both)'}

route_counts = {r: routes.count(r) for r in set(routes)}
route_names = [route_labels[r] for r in route_counts.keys()]
route_values = list(route_counts.values())
route_colors_list = [route_colors[r] for r in route_counts.keys()]

ax2.pie(route_values, labels=route_names, colors=route_colors_list, autopct='%1.0f%%',
        startangle=90, textprops={'fontsize': 11, 'fontweight': 'bold'})
ax2.set_title('CRAG Routing Distribution', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print("\n📊 Decision Summary:")
for label, result in scenarios:
    print(f"\n{label}:")
    print(f"  Confidence: {result['confidence']:.2f}")
    print(f"  Route: {result['route']}")
    print(f"  Reasoning: {result['reasoning']}")

## Key Takeaways

### 1. **Self-Evaluation is Critical**
- CRAG evaluates its own retrieval quality before generating
- Reduces the risk of hallucinations and low-quality answers
- Provides confidence scores for transparency

### 2. **Dynamic Routing Improves Robustness**
✅ **Three routing strategies:**
- **HIGH confidence (≥0.7)**: Use internal KB only (most efficient)
- **LOW confidence (<0.4)**: Use web search only (knowledge gap detected)
- **AMBIGUOUS (0.4 to <0.7)**: Merge both sources (hybrid approach)

### 3. **Fallback Mechanisms are Essential**
- No knowledge base is complete
- External sources (web search, APIs) fill gaps
- System remains useful even for out-of-domain queries

### 4. **CRAG vs. Naive RAG**

| Aspect | Naive RAG | Corrective RAG |
|--------|-----------|----------------|
| Retrieval Evaluation | ❌ None | ✅ LLM-based scoring |
| Knowledge Gap Detection | ❌ No | ✅ Yes |
| Fallback Mechanism | ❌ No | ✅ Web search |
| Out-of-Domain Handling | ❌ Poor | ✅ Good |
| Hallucination Risk | ⚠️ Higher | ✅ Lower |
| Cost | 💰 Lower | 💰💰 Higher (extra LLM call) |

### 5. **When to Use CRAG**

✅ **Use CRAG when:**
- Knowledge base is limited or specialized
- Query distribution is unpredictable
- Answer accuracy is critical (e.g., medical, legal, financial)
- You have access to reliable external sources
- The cost of a wrong answer exceeds the cost of the extra evaluation

❌ **Skip CRAG when:**
- Knowledge base is comprehensive and well-curated
- All queries are guaranteed to be in-domain
- Cost/latency is a critical constraint
- No reliable external sources available

### 6. **Implementation Considerations**

**Threshold Tuning:**
- Adjust HIGH/LOW thresholds based on your tolerance for:
  - False positives (using internal KB when insufficient)
  - False negatives (triggering web search unnecessarily)
- Recommended starting point: HIGH=0.7, LOW=0.4
- Monitor and adjust based on real-world performance
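
One way to tune thresholds is to replay manually reviewed queries against candidate values and count both error types. The sketch below is a minimal illustration: `routing_errors` is a hypothetical helper, and the `labelled` pairs are made-up placeholder data, not measurements from this demo.

```python
from typing import List, Tuple


def routing_errors(samples: List[Tuple[float, bool]], high: float) -> Tuple[int, int]:
    """
    samples: (confidence_score, internal_kb_was_sufficient) pairs from manual review.
    Returns (false_positives, false_negatives) for a candidate HIGH threshold:
      false positive = internal KB used alone although it was insufficient
      false negative = web search triggered although the internal KB was sufficient
    """
    false_pos = sum(1 for score, ok in samples if score >= high and not ok)
    false_neg = sum(1 for score, ok in samples if score < high and ok)
    return false_pos, false_neg


# Placeholder review data, for illustration only
labelled = [(0.9, True), (0.75, False), (0.65, True), (0.5, False), (0.2, False)]
for high in (0.6, 0.7, 0.8):
    print(high, routing_errors(labelled, high))
```

The same replay can be repeated for the LOW threshold, using whether web-only answers were adequate as the label.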

**Evaluator Design:**
- LLM-based evaluation is flexible but adds latency/cost
- Alternative: Train a lightweight classifier (faster, cheaper)
- Hybrid: Use heuristics (chunk scores, keyword matching) + LLM fallback
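
The hybrid option might look like the sketch below: retrieval similarity scores serve as a cheap first pass, and the LLM evaluator is called only when they are inconclusive. `hybrid_confidence` and both cutoffs are assumptions for illustration; suitable values depend on the embedding model and should be checked against real data.

```python
from typing import Callable, List, Tuple


def hybrid_confidence(
    similarity_scores: List[float],
    llm_evaluate: Callable[[], Tuple[float, str]],
    accept_above: float = 0.85,  # assumed cutoff, tune per embedding model
    reject_below: float = 0.60,  # assumed cutoff, tune per embedding model
) -> Tuple[float, str]:
    """Return (confidence, reasoning), calling the LLM only in the uncertain band."""
    if not similarity_scores:
        return 0.0, "Heuristic: nothing was retrieved"
    best = max(similarity_scores)
    if best >= accept_above:
        return 1.0, f"Heuristic: top similarity {best:.2f} is clearly high"
    if best < reject_below:
        return 0.0, f"Heuristic: top similarity {best:.2f} is clearly low"
    return llm_evaluate()  # inconclusive: pay for the LLM judgment
```

In the engine above, `llm_evaluate` could be a `lambda` that closes over the query and retrieved texts and calls `evaluate_retrieval_relevance`.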

**Web Search Integration:**
- Production: Use Bing Search API, Google Search API, or Tavily
- Parse and chunk web results the same way as internal docs
- Consider freshness (web results may be more up-to-date)
- Add source citations to answers
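
As a rough sketch of chunking web results while keeping their source for citation, the code below splits page text into overlapping word windows and attaches the URL to each piece. `WebChunk` and `chunk_web_result` are hypothetical names, and the word-based windows are a simple stand-in for the `SentenceSplitter` used for internal documents.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WebChunk:
    text: str
    url: str  # kept so the generated answer can cite its source


def chunk_web_result(
    text: str, url: str, chunk_words: int = 80, overlap_words: int = 10
) -> List[WebChunk]:
    """Split page text into overlapping word windows, each tagged with its URL."""
    if overlap_words >= chunk_words:
        raise ValueError("overlap_words must be smaller than chunk_words")
    words = text.split()
    step = chunk_words - overlap_words
    chunks: List[WebChunk] = []
    for start in range(0, len(words), step):
        chunks.append(WebChunk(" ".join(words[start:start + chunk_words]), url))
        if start + chunk_words >= len(words):
            break  # this window already reached the end of the text
    return chunks
```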

### 7. **Advanced Extensions**

**Knowledge Strip Filtering:**
- Further refine by scoring individual sentences
- Remove low-relevance sentences before generation
- Reduces token count and improves precision
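
A minimal sketch of sentence-level filtering follows. The scorer is pluggable; `keyword_overlap` is only a crude stand-in used to keep the example self-contained, and a real system would more likely score each sentence with an LLM or a trained relevance model. Both function names and the `min_score` default are assumptions.

```python
import re
from typing import Callable, List


def keyword_overlap(query: str, sentence: str) -> float:
    """Stand-in scorer: fraction of query words that appear in the sentence."""
    query_words = set(re.findall(r"\w+", query.lower()))
    sentence_words = set(re.findall(r"\w+", sentence.lower()))
    return len(query_words & sentence_words) / len(query_words) if query_words else 0.0


def filter_strips(
    query: str,
    chunks: List[str],
    scorer: Callable[[str, str], float] = keyword_overlap,
    min_score: float = 0.5,
) -> List[str]:
    """Keep only sentences scoring at least min_score; drop chunks left empty."""
    kept = []
    for chunk in chunks:
        sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
        relevant = [s for s in sentences if s and scorer(query, s) >= min_score]
        if relevant:
            kept.append(" ".join(relevant))
    return kept
```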

**Multi-Hop CRAG:**
- Iteratively retrieve and evaluate
- Refine query based on evaluation results
- Enable complex reasoning over multiple sources
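
The iterative idea can be sketched as a loop over three pluggable steps. Everything here is hypothetical scaffolding: `retrieve`, `evaluate`, and `rewrite` are callables the caller supplies (for example the retriever, the LLM evaluator, and an LLM-based query rewriter), and `max_hops` caps cost.

```python
from typing import Callable, List, Tuple


def multi_hop_retrieve(
    query: str,
    retrieve: Callable[[str], List[str]],
    evaluate: Callable[[str, List[str]], Tuple[float, str]],
    rewrite: Callable[[str, str], str],
    high: float = 0.7,
    max_hops: int = 3,
) -> Tuple[List[str], float, int]:
    """Retrieve, evaluate, and rewrite until confidence reaches `high` or hops run out."""
    if max_hops < 1:
        raise ValueError("max_hops must be at least 1")
    current = query
    best_chunks: List[str] = []
    best_score = 0.0
    hops_used = 0
    for hop in range(1, max_hops + 1):
        hops_used = hop
        chunks = retrieve(current)
        # Always judge against the original question, not the rewritten one
        score, reasoning = evaluate(query, chunks)
        if score > best_score or not best_chunks:
            best_chunks, best_score = chunks, score
        if score >= high:
            break
        current = rewrite(current, reasoning)
    return best_chunks, best_score, hops_used
```

If the best score is still below the threshold after the last hop, the result can be passed to the usual web-search or hybrid route.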

**Confidence Calibration:**
- Log confidence scores and actual answer quality
- Fine-tune thresholds over time
- Improve evaluator with domain-specific examples
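
A simple way to inspect calibration is to bin logged confidence scores and compare each bin with the observed answer accuracy. The sketch below assumes the logs are `(confidence, answer_was_correct)` pairs collected from review; `calibration_table` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def calibration_table(
    logs: List[Tuple[float, bool]], n_bins: int = 5
) -> Dict[str, Tuple[int, float]]:
    """Map each confidence bin to (number of queries, fraction answered correctly)."""
    bins = defaultdict(list)
    for confidence, correct in logs:
        idx = min(int(confidence * n_bins), n_bins - 1)  # 1.0 falls in the top bin
        bins[idx].append(correct)
    table = {}
    for idx in sorted(bins):
        low, high = idx / n_bins, (idx + 1) / n_bins
        outcomes = bins[idx]
        table[f"{low:.1f}-{high:.1f}"] = (len(outcomes), sum(outcomes) / len(outcomes))
    return table
```

A well-calibrated evaluator shows accuracy rising with the bin; a high-confidence bin with low accuracy suggests raising the HIGH threshold or revising the evaluation prompt.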

## Architecture Diagram

```
                    User Query
                        ↓
        ┌───────────────────────────────┐
        │  Retrieve from Internal KB    │
        │  (Top-K chunks)               │
        └───────────┬───────────────────┘
                    ↓
        ┌───────────────────────────────┐
        │  LLM Evaluator                │
        │  (Score: 0-1)                 │
        └───────────┬───────────────────┘
                    ↓
         ┌──────────┼──────────┐
         ↓          ↓          ↓
     Score ≥ 0.7  0.4-0.7   < 0.4
         ↓          ↓          ↓
    ┌────────┐ ┌────────┐ ┌────────┐
    │Internal│ │ Hybrid │ │  Web   │
    │   KB   │ │ (Both) │ │ Search │
    └────┬───┘ └────┬───┘ └────┬───┘
         └──────────┼──────────┘
                    ↓
        ┌───────────────────────────────┐
        │  Optional: Knowledge Strip    │
        │  Filtering (sentence-level)   │
        └───────────┬───────────────────┘
                    ↓
        ┌───────────────────────────────┐
        │  LLM Generation               │
        │  (High-quality context)       │
        └───────────┬───────────────────┘
                    ↓
              Final Answer
          (with source citations)
```

## Next Steps

To enhance your CRAG system:
1. **Integrate real web search** (Bing API, Tavily, etc.)
2. **Fine-tune the evaluator** on domain-specific examples
3. **Add knowledge strip filtering** for sentence-level refinement
4. **Implement caching** to reduce repeated evaluations
5. **Monitor confidence calibration** and adjust thresholds
6. **Add source citations** to generated answers
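
For item 4, an in-memory cache keyed on the query and retrieved chunks is one possible starting point. This is a sketch under assumptions: `EvaluationCache` is a hypothetical class, `evaluate` is any callable taking `(query, chunks)`, and a production system would probably want expiry and a shared store.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Evaluation = Tuple[float, str]


class EvaluationCache:
    """Memoize evaluator results for identical (query, chunks) inputs."""

    def __init__(self, evaluate: Callable[[str, List[str]], Evaluation]):
        self._evaluate = evaluate
        self._store: Dict[str, Evaluation] = {}
        self.hits = 0

    @staticmethod
    def _key(query: str, chunks: List[str]) -> str:
        # Normalize the query lightly; chunk order and content must match exactly
        payload = "\x1f".join([query.strip().lower(), *chunks])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def __call__(self, query: str, chunks: List[str]) -> Evaluation:
        key = self._key(query, chunks)
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = self._evaluate(query, chunks)
        return self._store[key]
```

With the evaluator from Step 2, this could be wired up as `EvaluationCache(lambda q, c: evaluate_retrieval_relevance(q, c, azure_llm))`.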

---

**Demo Complete! ✅**

You've successfully implemented Corrective RAG with self-evaluation, dynamic routing, and web search fallback.