# Demo #8: Self-RAG with Reflection Tokens and Adaptive Retrieval

## 🎯 Learning Objectives

In this demonstration, you will learn:

1. **Self-Reflective Generation**: How AI systems can critique and evaluate their own outputs
2. **Adaptive Retrieval**: How to decide *when* to retrieve vs. when to rely on parametric knowledge
3. **Reflection Tokens**: A framework for multi-dimensional critique (relevance, support, utility)
4. **Controllable Generation**: How weighted critique scores enable task-specific behavior

## 📚 Theoretical Background

### The Evolution of RAG Systems

```
Traditional RAG:   ALWAYS retrieve → Generate
                   (Blind execution)

Corrective RAG:    Retrieve → EVALUATE → Route
                   (Post-retrieval reflection)

Self-RAG:          DECIDE → [Conditionally retrieve] → Generate + CRITIQUE
                   (Pre-retrieval AND during-generation reflection)
```

### Key Innovation: Self-RAG as Meta-Reasoner

Self-RAG (Self-Reflective Retrieval-Augmented Generation) turns retrieval from a fixed pipeline step into a decision the model reasons about, a shift from **reactive** to **proactive** behavior:

- **Traditional RAG**: Assumes retrieval is always necessary → Wasted compute, added latency
- **Corrective RAG**: Evaluates retrieval quality after the fact → Corrects mistakes but still retrieves unnecessarily
- **Self-RAG**: Performs meta-reasoning BEFORE retrieval → "Do I even need external information?"
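
The three control flows compared above can be sketched with stub functions. This is a simplified illustration, not the demo's implementation: `retrieve`, `generate`, `is_relevant`, and `needs_retrieval` are hypothetical stand-ins for a retriever, an LLM call, and critic calls, and the post-generation critique step of Self-RAG is left out for brevity.

```python
from typing import Callable, List

# Hypothetical stand-ins: in practice each would wrap a retriever, an LLM, or a critic.
Retrieve = Callable[[str], List[str]]
Generate = Callable[[str, List[str]], str]
IsRelevant = Callable[[str, str], bool]
NeedsRetrieval = Callable[[str], bool]


def traditional_rag(query: str, retrieve: Retrieve, generate: Generate) -> str:
    """Always retrieve, then generate."""
    return generate(query, retrieve(query))


def corrective_rag(query: str, retrieve: Retrieve, generate: Generate,
                   is_relevant: IsRelevant) -> str:
    """Always retrieve, then evaluate passages before generating."""
    passages = [p for p in retrieve(query) if is_relevant(query, p)]
    return generate(query, passages)


def self_rag(query: str, retrieve: Retrieve, generate: Generate,
             is_relevant: IsRelevant, needs_retrieval: NeedsRetrieval) -> str:
    """Decide first; retrieve and filter only when the critic asks for it."""
    passages: List[str] = []
    if needs_retrieval(query):
        passages = [p for p in retrieve(query) if is_relevant(query, p)]
    return generate(query, passages)
```

The only structural difference in `self_rag` is the guard around `retrieve`, which is where the compute and latency savings come from when the guard returns `False`.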

From the curriculum:
> "The LLM performs meta-reasoning *about* the task itself, asking questions like: 'Is the information I have sufficient?', 'Do I need to use a different tool?', or 'Should I rephrase my query?'"

### Reflection Token Framework

Self-RAG introduces a structured critique system using **reflection tokens**:

#### 1. Retrieval Decision Tokens
- `[Retrieval]`: External knowledge is required
- `[No Retrieval]`: Parametric knowledge is sufficient
- `[Continue to Use Evidence]`: Use previously retrieved context

#### 2. Relevance Tokens (Post-Retrieval)
- `[Relevant]`: Retrieved passage is on-topic and useful
- `[Irrelevant]`: Retrieved passage should be discarded

#### 3. Support/Grounding Tokens (Post-Generation)
- `[Fully supported]`: Generated text is entirely grounded in evidence
- `[Partially supported]`: Some claims lack evidence
- `[No support / Contradictory]`: Hallucination or contradiction detected

#### 4. Utility Tokens (Quality Assessment)
- `[Utility:5]`: Excellent, comprehensive answer
- `[Utility:4]`: Good, addresses query well
- `[Utility:3]`: Adequate but incomplete
- `[Utility:2]`: Minimal value
- `[Utility:1]`: Poor or unhelpful
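
In a fine-tuned Self-RAG model, these tokens appear inline in the generated text. As a rough sketch of how they could be pulled back out, the snippet below uses regular expressions built from the token spellings listed above. The helper name `extract_reflection_tokens` and the group names are illustrative assumptions, not part of any library.

```python
import re
from typing import Dict, List

# Token spellings copied from the lists above; the group names are illustrative.
TOKEN_PATTERNS = {
    "retrieval": r"\[(?:Retrieval|No Retrieval|Continue to Use Evidence)\]",
    "relevance": r"\[(?:Relevant|Irrelevant)\]",
    "support": r"\[(?:Fully supported|Partially supported|No support / Contradictory)\]",
    "utility": r"\[Utility:[1-5]\]",
}


def extract_reflection_tokens(text: str) -> Dict[str, List[str]]:
    """Return every reflection token found in `text`, grouped by token type."""
    return {name: re.findall(pattern, text) for name, pattern in TOKEN_PATTERNS.items()}


sample = "[Retrieval] Transformers use self-attention. [Relevant] [Fully supported] [Utility:5]"
tokens = extract_reflection_tokens(sample)
print(tokens["retrieval"])  # ['[Retrieval]']
print(tokens["utility"])    # ['[Utility:5]']
```

Because each pattern includes the opening bracket, `[Relevant]` is not mistaken for part of `[Irrelevant]`, and `[Retrieval]` is not matched inside `[No Retrieval]`.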

### Architectural Distinction

| Aspect | Traditional RAG | Corrective RAG | Self-RAG |
|--------|----------------|----------------|----------|
| **Retrieval** | Always | Always | Conditional (adaptive) |
| **Evaluation** | None | After retrieval | Before + During + After |
| **Decision Point** | None | Post-retrieval | Pre-retrieval + Iterative |
| **Self-Awareness** | None | Limited | Full (multi-dimensional critique) |
| **Use Case** | General QA | Robust QA with fallback | High-stakes, controllable generation |

### Implementation Note

⚠️ **Important**: True Self-RAG requires fine-tuning an LLM to predict special reflection tokens.

This demo **simulates** Self-RAG behavior using:
- **Prompting** to elicit critic decisions from Azure OpenAI
- **Structured parsing** to extract reflection token predictions
- **Multi-step orchestration** to implement the Self-RAG pipeline

For production Self-RAG, see the original implementation: [AkariAsai/self-rag](https://github.com/AkariAsai/self-rag)
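
"Structured parsing" here means pulling labelled fields out of a free-text LLM reply. A minimal sketch, assuming replies follow the `LABEL: value` line format that the critic prompts in this demo request; `parse_labelled_fields` is a hypothetical helper, not something the demo defines.

```python
from typing import Dict


def parse_labelled_fields(response_text: str) -> Dict[str, str]:
    """Parse lines of the form `LABEL: value` into a dict keyed by upper-cased label.

    Lines without a colon are ignored; only the first occurrence of a label is kept.
    """
    fields: Dict[str, str] = {}
    for line in response_text.splitlines():
        label, sep, value = line.partition(":")  # split on the first colon only
        label = label.strip().upper()
        if sep and label and label not in fields:
            fields[label] = value.strip()
    return fields


reply = "TOKEN: [No Retrieval]\nREASONING: Simple arithmetic needs no external facts."
fields = parse_labelled_fields(reply)
print(fields["TOKEN"])      # [No Retrieval]
print(fields["REASONING"])  # Simple arithmetic needs no external facts.
```

Splitting on the first colon only keeps values such as `[Utility:5]` intact.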

---

## 🔧 Implementation

In [1]:
# Install required packages (run once)
# !pip install llama-index-core llama-index-llms-azure-openai llama-index-embeddings-azure-openai python-dotenv pandas

In [2]:
# Import required libraries
import os
from dotenv import load_dotenv
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from enum import Enum
from typing import List, Dict, Any, Optional, Tuple
from dataclasses import dataclass
import warnings

warnings.filterwarnings('ignore')

# Load environment variables
load_dotenv()

print("✅ Libraries imported successfully")

✅ Libraries imported successfully


## Step 1: Environment Setup

In [4]:
# Configure Azure OpenAI LLM
llm = AzureOpenAI(
    model="gpt-4",
    deployment_name=os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_version="2024-02-15-preview",
    temperature=0.0  # Greedy decoding for repeatable critique (not strictly deterministic)
)

# Configure Azure OpenAI Embeddings
embed_model = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name=os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_version="2024-02-15-preview",
)

# Set global defaults
Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 512
Settings.chunk_overlap = 50

print("✅ Azure OpenAI configured successfully")
print(f"   LLM: {llm.model}")
print(f"   Embeddings: {embed_model.model_name}")

✅ Azure OpenAI configured successfully
   LLM: gpt-4
   Embeddings: text-embedding-ada-002
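
The configuration cell reads its settings with `os.getenv`, which silently returns `None` for an unset variable. A small pre-flight check, sketched below, can surface a missing setting before the first API call. The variable names are the ones used in the cell above; `missing_env_vars` is a hypothetical helper.

```python
import os
from typing import Iterable, List, Mapping

# Variable names taken from the configuration cell above.
REQUIRED_VARS = [
    "AZURE_OPENAI_DEPLOYMENT_NAME",
    "AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
]


def missing_env_vars(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in `env`."""
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```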


## Step 2: Define Reflection Token System

Create enums and data structures for all reflection token types.

In [5]:
# Reflection Token Enums

class RetrievalDecision(Enum):
    """Tokens for deciding whether to retrieve"""
    RETRIEVAL = "[Retrieval]"
    NO_RETRIEVAL = "[No Retrieval]"
    CONTINUE = "[Continue to Use Evidence]"


class RelevanceToken(Enum):
    """Tokens for assessing document relevance"""
    RELEVANT = "[Relevant]"
    IRRELEVANT = "[Irrelevant]"


class SupportToken(Enum):
    """Tokens for assessing generation groundedness"""
    FULLY_SUPPORTED = "[Fully supported]"
    PARTIALLY_SUPPORTED = "[Partially supported]"
    NO_SUPPORT = "[No support / Contradictory]"


class UtilityToken(Enum):
    """Tokens for assessing answer quality"""
    UTILITY_5 = "[Utility:5]"
    UTILITY_4 = "[Utility:4]"
    UTILITY_3 = "[Utility:3]"
    UTILITY_2 = "[Utility:2]"
    UTILITY_1 = "[Utility:1]"


@dataclass
class CritiqueResult:
    """Container for all critique dimensions"""
    retrieval_decision: Optional[RetrievalDecision] = None
    relevance: Optional[RelevanceToken] = None
    support: Optional[SupportToken] = None
    utility: Optional[UtilityToken] = None
    reasoning: str = ""
    
    def to_dict(self) -> Dict[str, Optional[str]]:
        return {
            "retrieval_decision": self.retrieval_decision.value if self.retrieval_decision else None,
            "relevance": self.relevance.value if self.relevance else None,
            "support": self.support.value if self.support else None,
            "utility": self.utility.value if self.utility else None,
            "reasoning": self.reasoning
        }


print("✅ Reflection token system defined")
print("   Retrieval Tokens:", [t.value for t in RetrievalDecision])
print("   Relevance Tokens:", [t.value for t in RelevanceToken])
print("   Support Tokens:", [t.value for t in SupportToken])
print("   Utility Tokens:", [t.value for t in UtilityToken])

✅ Reflection token system defined
   Retrieval Tokens: ['[Retrieval]', '[No Retrieval]', '[Continue to Use Evidence]']
   Relevance Tokens: ['[Relevant]', '[Irrelevant]']
   Support Tokens: ['[Fully supported]', '[Partially supported]', '[No support / Contradictory]']
   Utility Tokens: ['[Utility:5]', '[Utility:4]', '[Utility:3]', '[Utility:2]', '[Utility:1]']
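
Enum members can be looked up by value, which gives a compact way to turn a raw string or an integer score into a typed token. The sketch below repeats `UtilityToken` so it runs on its own; `utility_token_from_score` is a hypothetical helper, and clamping out-of-range scores is an assumption made for illustration.

```python
from enum import Enum


class UtilityToken(Enum):
    """Same values as the enum defined above, repeated so the sketch is self-contained."""
    UTILITY_5 = "[Utility:5]"
    UTILITY_4 = "[Utility:4]"
    UTILITY_3 = "[Utility:3]"
    UTILITY_2 = "[Utility:2]"
    UTILITY_1 = "[Utility:1]"


def utility_token_from_score(score: int) -> UtilityToken:
    """Clamp `score` into 1-5 and look the member up by its string value."""
    clamped = max(1, min(5, score))
    return UtilityToken(f"[Utility:{clamped}]")


print(utility_token_from_score(4).value)   # [Utility:4]
print(UtilityToken("[Utility:2]").name)    # UTILITY_2
```

A lookup with a string that is not one of the defined values raises `ValueError`, which is a convenient signal that the LLM reply did not contain a valid token.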


## Step 3: Implement Self-RAG Critic

The critic is the heart of Self-RAG. It implements four core critique functions.

In [6]:
class SelfRAGCritic:
    """
    Simulated Self-RAG critic using Azure OpenAI.
    In production, this would be a fine-tuned model with special tokens.
    """
    
    def __init__(self, llm):
        self.llm = llm
    
    def should_retrieve(self, query: str, current_context: str = "") -> Tuple[RetrievalDecision, str]:
        """
        Critique function 1: Decide if retrieval is needed.
        
        Returns:
            (RetrievalDecision, reasoning)
        """
        prompt = f"""You are a retrieval decision expert. Your task is to determine if answering a query requires external information retrieval.

QUERY: {query}

DECISION CRITERIA:
- Predict [Retrieval] if:
  * Query asks for specific facts, statistics, or recent events
  * Query requires domain-specific knowledge not in general LLM training
  * Query asks about specific entities, documents, or technical details

- Predict [No Retrieval] if:
  * Query asks for general knowledge, definitions, or common concepts
  * Query requires reasoning or mathematical operations (no facts needed)
  * Query is conversational or asks for opinions

Respond ONLY in this format:
TOKEN: [Retrieval] OR [No Retrieval]
REASONING: [One sentence explaining your decision]
"""
        
        response = self.llm.complete(prompt)
        response_text = response.text.strip()
        
        # Parse response
        try:
            token_line = [l for l in response_text.split('\n') if 'TOKEN:' in l][0]
            reasoning_line = [l for l in response_text.split('\n') if 'REASONING:' in l][0]
            
            reasoning = reasoning_line.split('REASONING:')[1].strip()
            
            if "[No Retrieval]" in token_line or "No Retrieval" in token_line:
                decision = RetrievalDecision.NO_RETRIEVAL
            else:
                decision = RetrievalDecision.RETRIEVAL
        except Exception as e:
            print(f"⚠️  Failed to parse retrieval decision: {e}")
            decision = RetrievalDecision.RETRIEVAL  # Safe default
            reasoning = "Failed to parse"
        
        return decision, reasoning
    
    def assess_relevance(self, query: str, passage: str) -> Tuple[RelevanceToken, str]:
        """
        Critique function 2: Assess if retrieved passage is relevant.
        
        Returns:
            (RelevanceToken, reasoning)
        """
        prompt = f"""You are a relevance assessment expert. Determine if a retrieved passage is relevant to the query.

QUERY: {query}

PASSAGE: {passage[:500]}...

DECISION CRITERIA:
- Predict [Relevant] if the passage contains information that directly helps answer the query
- Predict [Irrelevant] if the passage is off-topic or does not provide useful information

Respond ONLY in this format:
TOKEN: [Relevant] OR [Irrelevant]
REASONING: [One sentence explaining your decision]
"""
        
        response = self.llm.complete(prompt)
        response_text = response.text.strip()
        
        try:
            token_line = [l for l in response_text.split('\n') if 'TOKEN:' in l][0]
            reasoning_line = [l for l in response_text.split('\n') if 'REASONING:' in l][0]
            
            reasoning = reasoning_line.split('REASONING:')[1].strip()
            
            if "[Irrelevant]" in token_line or "Irrelevant" in token_line:
                relevance = RelevanceToken.IRRELEVANT
            else:
                relevance = RelevanceToken.RELEVANT
        except Exception as e:
            print(f"⚠️  Failed to parse relevance: {e}")
            relevance = RelevanceToken.RELEVANT
            reasoning = "Failed to parse"
        
        return relevance, reasoning
    
    def assess_groundedness(self, generation: str, context: str) -> Tuple[SupportToken, str]:
        """
        Critique function 3: Check if generation is grounded in context.
        
        Returns:
            (SupportToken, reasoning)
        """
        prompt = f"""You are a factual grounding expert. Assess if a generated answer is supported by the provided context.

GENERATED ANSWER: {generation}

CONTEXT: {context[:800]}...

ASSESSMENT CRITERIA:
- [Fully supported]: Every claim in the answer is directly supported by the context
- [Partially supported]: Some claims are supported, but others lack evidence or are inferred
- [No support / Contradictory]: Answer contains claims not in context or contradicts it

Respond ONLY in this format:
TOKEN: [Fully supported] OR [Partially supported] OR [No support / Contradictory]
REASONING: [One sentence explaining your assessment]
"""
        
        response = self.llm.complete(prompt)
        response_text = response.text.strip()
        
        try:
            token_line = [l for l in response_text.split('\n') if 'TOKEN:' in l][0]
            reasoning_line = [l for l in response_text.split('\n') if 'REASONING:' in l][0]
            
            reasoning = reasoning_line.split('REASONING:')[1].strip()
            
            if "Fully supported" in token_line:
                support = SupportToken.FULLY_SUPPORTED
            elif "Partially supported" in token_line:
                support = SupportToken.PARTIALLY_SUPPORTED
            else:
                support = SupportToken.NO_SUPPORT
        except Exception as e:
            print(f"⚠️  Failed to parse grounding: {e}")
            support = SupportToken.PARTIALLY_SUPPORTED
            reasoning = "Failed to parse"
        
        return support, reasoning
    
    def assess_utility(self, generation: str, query: str) -> Tuple[int, str]:
        """
        Critique function 4: Score answer utility (1-5).
        
        Returns:
            (utility_score, reasoning)
        """
        prompt = f"""You are an answer quality expert. Rate the utility of an answer to a query on a 1-5 scale.

QUERY: {query}

ANSWER: {generation}

SCORING CRITERIA:
5: Excellent - Comprehensive, accurate, directly addresses all aspects of the query
4: Good - Solid answer, addresses query well with minor gaps
3: Adequate - Provides some useful information but incomplete
2: Minimal - Tangentially related or very incomplete
1: Poor - Does not answer query or is unhelpful

Respond ONLY in this format:
SCORE: [1-5]
REASONING: [One sentence explaining your score]
"""
        
        response = self.llm.complete(prompt)
        response_text = response.text.strip()
        
        try:
            score_line = [l for l in response_text.split('\n') if 'SCORE:' in l][0]
            reasoning_line = [l for l in response_text.split('\n') if 'REASONING:' in l][0]
            
            score = int(next(ch for ch in score_line.split('SCORE:')[1] if ch in '12345'))  # First digit 1-5, tolerating replies such as "[4]"
            reasoning = reasoning_line.split('REASONING:')[1].strip()
        except Exception as e:
            print(f"⚠️  Failed to parse utility: {e}")
            score = 3
            reasoning = "Failed to parse"
        
        return score, reasoning


# Initialize critic
critic = SelfRAGCritic(llm)
print("✅ Self-RAG critic initialized")
print("   Critique Functions: should_retrieve(), assess_relevance(), assess_groundedness(), assess_utility()")

✅ Self-RAG critic initialized
   Critique Functions: should_retrieve(), assess_relevance(), assess_groundedness(), assess_utility()
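
Because the critic only calls `llm.complete(prompt)` and reads `.text` from the result, its parsing logic can be exercised offline with a stub. The sketch below is an assumption-laden illustration: `FakeLLM`, `FakeResponse`, and `parse_retrieval_token` are hypothetical names, and the parsing function mirrors the fall-back-to-retrieval behavior of `should_retrieve` rather than calling the class itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeResponse:
    text: str


class FakeLLM:
    """Returns a canned reply, mimicking the `.complete(prompt).text` shape the critic relies on."""

    def __init__(self, canned_reply: str):
        self.canned_reply = canned_reply
        self.prompts: List[str] = []  # recorded so a test can inspect what was sent

    def complete(self, prompt: str) -> FakeResponse:
        self.prompts.append(prompt)
        return FakeResponse(self.canned_reply)


def parse_retrieval_token(llm, query: str) -> str:
    """Mirror of the critic's parsing: default to retrieval when the reply is malformed."""
    reply = llm.complete(f"QUERY: {query}").text
    token_lines = [line for line in reply.splitlines() if "TOKEN:" in line]
    if token_lines and "No Retrieval" in token_lines[0]:
        return "[No Retrieval]"
    return "[Retrieval]"


stub = FakeLLM("TOKEN: [No Retrieval]\nREASONING: Simple arithmetic.")
print(parse_retrieval_token(stub, "What is 2 + 2?"))  # [No Retrieval]
```

Testing against a stub like this makes it possible to check the malformed-reply path without spending API calls.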


## Step 4: Data Preparation

Load diverse documents to test adaptive behavior.

In [7]:
# Load documents from multiple directories
documents = SimpleDirectoryReader(
    input_files=[
        "data/tech_docs/transformer_architecture.md",
        "data/tech_docs/bert_model.md",
        "data/ml_concepts/neural_networks.md",
        "data/ml_concepts/random_forests.md",
        "data/ml_concepts/gradient_boosting.md"
    ]
).load_data()

print(f"✅ Loaded {len(documents)} documents")
for i, doc in enumerate(documents, 1):
    print(f"   {i}. {doc.metadata.get('file_name', 'Unknown')} ({len(doc.text)} chars)")

# Build vector index
index = VectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(similarity_top_k=2)
print("\n✅ Vector index built successfully")

✅ Loaded 5 documents
   1. transformer_architecture.md (4247 chars)
   2. bert_model.md (2358 chars)
   3. neural_networks.md (1873 chars)
   4. random_forests.md (2488 chars)
   5. gradient_boosting.md (1612 chars)

✅ Vector index built successfully
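
`similarity_top_k=2` asks the retriever for the two chunks whose embeddings are closest to the query embedding. Independent of how LlamaIndex implements this internally, the idea can be sketched with cosine similarity over toy vectors. The three-dimensional "embeddings" below are made up for illustration, and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: Sequence[float], doc_vecs: Dict[str, Sequence[float]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Return the k (name, score) pairs with the highest cosine similarity."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors, made up for illustration only.
toy_docs = {
    "transformer_architecture.md": [0.9, 0.1, 0.0],
    "bert_model.md": [0.8, 0.3, 0.1],
    "random_forests.md": [0.0, 0.2, 0.9],
}
results = top_k([1.0, 0.0, 0.0], toy_docs, k=2)
print([name for name, _ in results])
# ['transformer_architecture.md', 'bert_model.md']
```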



## Step 5: Implement Self-RAG Pipeline

The complete Self-RAG system with adaptive retrieval and multi-dimensional critique.

In [8]:
class SelfRAGSystem:
    """
    Self-Reflective RAG with adaptive retrieval and multi-dimensional critique.
    """
    
    def __init__(self, index, llm, critic, retriever):
        self.index = index
        self.llm = llm
        self.critic = critic
        self.retriever = retriever
        
        # Critique weights (for composite scoring)
        self.w_relevance = 1.0
        self.w_support = 2.0  # Higher weight for factual grounding
        self.w_utility = 1.0
    
    def query(
        self, 
        query_text: str, 
        verbose: bool = True
    ) -> Dict[str, Any]:
        """
        Execute Self-RAG pipeline with adaptive retrieval and critique.
        """
        if verbose:
            print(f"\n{'='*90}")
            print(f"🧠 SELF-RAG PIPELINE")
            print(f"{'='*90}")
            print(f"Query: {query_text}\n")
        
        trace = []  # Decision trace for transparency
        
        # STEP 1: Adaptive Retrieval Decision
        if verbose:
            print("[STEP 1] 🔍 Retrieval Decision")
        
        retrieval_decision, retrieval_reasoning = self.critic.should_retrieve(query_text)
        
        if verbose:
            print(f"   Token: {retrieval_decision.value}")
            print(f"   Reasoning: {retrieval_reasoning}")
        
        trace.append({
            "step": "Retrieval Decision",
            "token": retrieval_decision.value,
            "reasoning": retrieval_reasoning
        })
        
        # STEP 2: Conditional Retrieval + Relevance Filtering
        relevant_passages = []
        
        if retrieval_decision == RetrievalDecision.RETRIEVAL:
            if verbose:
                print("\n[STEP 2] 📥 Retrieval + Relevance Filtering")
            
            retrieved_nodes = self.retriever.retrieve(query_text)
            
            for i, node in enumerate(retrieved_nodes):
                passage = node.node.text
                relevance_token, relevance_reasoning = self.critic.assess_relevance(
                    query_text, 
                    passage
                )
                
                if verbose:
                    print(f"\n   Document {i+1}:")
                    print(f"   Token: {relevance_token.value}")
                    print(f"   Reasoning: {relevance_reasoning}")
                
                trace.append({
                    "step": f"Relevance Assessment {i+1}",
                    "token": relevance_token.value,
                    "reasoning": relevance_reasoning
                })
                
                if relevance_token == RelevanceToken.RELEVANT:
                    relevant_passages.append(passage)
            
            if verbose and relevant_passages:
                print(f"\n   ✓ {len(relevant_passages)} relevant passages selected")
            elif verbose:
                print("\n   ⚠️  No relevant passages found")
        
        else:
            if verbose:
                print("\n[STEP 2] ⏭️  Skipped (No retrieval needed)")
        
        # STEP 3: Generation
        if verbose:
            print("\n[STEP 3] ✍️  Generation")
        
        if relevant_passages:
            context = "\n\n".join(relevant_passages)
            generation_prompt = f"""Answer the following query using ONLY the provided context.

Context:
{context}

Query: {query_text}

Answer:"""
        else:
            generation_prompt = f"""Answer the following query using your knowledge.

Query: {query_text}

Answer:"""
        
        response = self.llm.complete(generation_prompt)
        answer = response.text.strip()
        
        if verbose:
            print(f"   Generated answer ({len(answer)} chars)")
        
        # STEP 4: Groundedness Check
        if verbose:
            print("\n[STEP 4] 🔬 Groundedness Check")
        
        if relevant_passages:
            context_for_check = "\n\n".join(relevant_passages)
            support_token, support_reasoning = self.critic.assess_groundedness(
                answer, 
                context_for_check
            )
        else:
            # No context, so can't be fully supported
            support_token = SupportToken.NO_SUPPORT
            support_reasoning = "No retrieved context to ground answer"
        
        if verbose:
            print(f"   Token: {support_token.value}")
            print(f"   Reasoning: {support_reasoning}")
        
        trace.append({
            "step": "Groundedness Check",
            "token": support_token.value,
            "reasoning": support_reasoning
        })
        
        # STEP 5: Utility Assessment
        if verbose:
            print("\n[STEP 5] ⭐ Utility Assessment")
        
        utility_score, utility_reasoning = self.critic.assess_utility(answer, query_text)
        
        if verbose:
            print(f"   Token: [Utility:{utility_score}]")
            print(f"   Reasoning: {utility_reasoning}")
        
        trace.append({
            "step": "Utility Assessment",
            "token": f"[Utility:{utility_score}]",
            "reasoning": utility_reasoning
        })
        
        # STEP 6: Composite Scoring
        composite_score = self._calculate_composite_score(
            len(relevant_passages),
            support_token,
            utility_score
        )
        
        return {
            "answer": answer,
            "retrieval_decision": retrieval_decision.value,
            "num_relevant_passages": len(relevant_passages),
            "support_token": support_token.value,
            "utility_score": utility_score,
            "composite_score": composite_score,
            "trace": trace
        }
    
    def _calculate_composite_score(
        self, 
        num_relevant: int, 
        support: SupportToken, 
        utility: int
    ) -> float:
        """
        Calculate weighted composite critique score.
        """
        # Relevance score (binary: any relevant passages = 1)
        relevance_score = 1.0 if num_relevant > 0 else 0.0
        
        # Support score (0-1 scale)
        support_map = {
            SupportToken.FULLY_SUPPORTED: 1.0,
            SupportToken.PARTIALLY_SUPPORTED: 0.5,
            SupportToken.NO_SUPPORT: 0.0
        }
        support_score = support_map[support]
        
        # Utility score (already 1-5, normalize to 0-1)
        utility_score = (utility - 1) / 4.0  # Maps 1-5 to 0-1
        
        # Weighted combination
        total_weight = self.w_relevance + self.w_support + self.w_utility
        composite = (
            self.w_relevance * relevance_score +
            self.w_support * support_score +
            self.w_utility * utility_score
        ) / total_weight
        
        return round(composite, 3)


# Initialize Self-RAG system
selfrag = SelfRAGSystem(index, llm, critic, retriever)
print("✅ Self-RAG system initialized")
print(f"   Critique Weights: relevance={selfrag.w_relevance}, support={selfrag.w_support}, utility={selfrag.w_utility}")

✅ Self-RAG system initialized
   Critique Weights: relevance=1.0, support=2.0, utility=1.0
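
To make the composite score concrete, the arithmetic in `_calculate_composite_score` can be reproduced as a standalone function and evaluated by hand. The function below is a sketch that mirrors the formula above; the name `composite_score` and its argument layout are illustrative.

```python
def composite_score(relevance: float, support: float, utility: int,
                    w_relevance: float = 1.0, w_support: float = 2.0,
                    w_utility: float = 1.0) -> float:
    """Weighted mean of relevance (0 or 1), support (0, 0.5, 1), and utility (1-5 mapped to 0-1)."""
    utility_norm = (utility - 1) / 4.0
    total_weight = w_relevance + w_support + w_utility
    weighted = (w_relevance * relevance
                + w_support * support
                + w_utility * utility_norm)
    return round(weighted / total_weight, 3)


# No relevant passages, no support, utility 5, default weights (1, 2, 1):
# (1*0 + 2*0 + 1*1) / 4
print(composite_score(0.0, 0.0, 5))  # 0.25

# One relevant passage, fully supported, utility 3:
# (1*1 + 2*1 + 1*0.5) / 4
print(composite_score(1.0, 1.0, 3))  # 0.875
```

With the default weights, an answer generated without retrieval can score at most 0.25, because relevance and support both contribute zero regardless of how useful the answer is.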


## Step 6: Test Suite - Exercising Adaptive Retrieval

### Test 1: No Retrieval Needed (Parametric Knowledge)
Expected: `[No Retrieval]` token

In [9]:
# Test Query 1: Should NOT retrieve
query1 = "What is 2 + 2? Explain your reasoning."

result1 = selfrag.query(query1, verbose=True)

print("\n" + "="*90)
print("📊 RESULT SUMMARY")
print("="*90)
print(f"Answer: {result1['answer']}")
print(f"\nRetrieval Decision: {result1['retrieval_decision']}")
print(f"Support Token: {result1['support_token']}")
print(f"Utility Score: {result1['utility_score']}/5")
print(f"Composite Score: {result1['composite_score']}")


🧠 SELF-RAG PIPELINE
Query: What is 2 + 2? Explain your reasoning.

[STEP 1] 🔍 Retrieval Decision
   Token: [No Retrieval]
   Reasoning: The query asks for a simple mathematical operation that can be answered using general knowledge without needing external information.

[STEP 2] ⏭️  Skipped (No retrieval needed)

[STEP 3] ✍️  Generation
   Generated answer (478 chars)

[STEP 4] 🔬 Groundedness Check
   Token: [No support / Contradictory]
   Reasoning: No retrieved context to ground answer

[STEP 5] ⭐ Utility Assessment
   Token: [Utility:5]
   Reasoning: The answer i

### Test 2: Retrieval Required (Specific Technical Knowledge)
Expected: `[Retrieval]` → Relevance filtering → Grounded generation

In [10]:
# Test Query 2: Should retrieve
query2 = "What are the key components of a transformer architecture and how do they work?"

result2 = selfrag.query(query2, verbose=True)

print("\n" + "="*90)
print("📊 RESULT SUMMARY")
print("="*90)
print(f"Answer: {result2['answer']}")
print(f"\nRetrieval Decision: {result2['retrieval_decision']}")
print(f"Relevant Passages: {result2['num_relevant_passages']}")
print(f"Support Token: {result2['support_token']}")
print(f"Utility Score: {result2['utility_score']}/5")
print(f"Composite Score: {result2['composite_score']}")


🧠 SELF-RAG PIPELINE
Query: What are the key components of a transformer architecture and how do they work?

[STEP 1] 🔍 Retrieval Decision
   Token: [No Retrieval]
   Reasoning: The query asks for general knowledge about the transformer architecture, which is a common concept covered in general LLM training.

[STEP 2] ⏭️  Skipped (No retrieval needed)

[STEP 3] ✍️  Generation
   Generated answer (2957 chars)

[STEP 4] 🔬 Groundedness Check
   Token: [No support / Contradictory]
   Reasoning: No retrieved context to ground answer

[STEP 5] ⭐ Utility Assessment
   Token:

### Test 3: Ambiguous Case (General Topic)
Expected: System makes adaptive decision

In [11]:
# Test Query 3: Ambiguous
query3 = "Explain the concept of machine learning"

result3 = selfrag.query(query3, verbose=True)

print("\n" + "="*90)
print("📊 RESULT SUMMARY")
print("="*90)
print(f"Answer: {result3['answer']}")
print(f"\nRetrieval Decision: {result3['retrieval_decision']}")
print(f"Relevant Passages: {result3['num_relevant_passages']}")
print(f"Support Token: {result3['support_token']}")
print(f"Utility Score: {result3['utility_score']}/5")
print(f"Composite Score: {result3['composite_score']}")


🧠 SELF-RAG PIPELINE
Query: Explain the concept of machine learning

[STEP 1] 🔍 Retrieval Decision
   Token: [No Retrieval]
   Reasoning: The query asks for a general concept explanation that is covered by common knowledge within the training of a language model.

[STEP 2] ⏭️  Skipped (No retrieval needed)

[STEP 3] ✍️  Generation
   Generated answer (2086 chars)

[STEP 4] 🔬 Groundedness Check
   Token: [No support / Contradictory]
   Reasoning: No retrieved context to ground answer

[STEP 5] ⭐ Utility Assessment
   Token: [Utility:5]
   Reasoning: The answer is comprehensi

## Step 7: Comparative Analysis

Compare Self-RAG's retrieval decisions and critique scores across the three test queries.

In [12]:
import pandas as pd

# Compile results
comparison_data = [
    {
        "Query": "2 + 2",
        "Retrieval Decision": result1['retrieval_decision'],
        "Retrieved Docs": result1['num_relevant_passages'],
        "Support": result1['support_token'],
        "Utility": result1['utility_score'],
        "Composite Score": result1['composite_score']
    },
    {
        "Query": "Transformer components",
        "Retrieval Decision": result2['retrieval_decision'],
        "Retrieved Docs": result2['num_relevant_passages'],
        "Support": result2['support_token'],
        "Utility": result2['utility_score'],
        "Composite Score": result2['composite_score']
    },
    {
        "Query": "Explain ML",
        "Retrieval Decision": result3['retrieval_decision'],
        "Retrieved Docs": result3['num_relevant_passages'],
        "Support": result3['support_token'],
        "Utility": result3['utility_score'],
        "Composite Score": result3['composite_score']
    }
]

df = pd.DataFrame(comparison_data)

print("\n" + "="*100)
print("📊 SELF-RAG PERFORMANCE SUMMARY")
print("="*100)
print(df.to_string(index=False))
print("\n" + "="*100)

# System behavior analysis
print("\n🎯 Adaptive Behavior Analysis:")
print(f"   Queries triggering retrieval: {sum(1 for r in [result1, result2, result3] if r['retrieval_decision'] == '[Retrieval]')}/3")
print(f"   Average utility score: {sum(r['utility_score'] for r in [result1, result2, result3]) / 3:.1f}/5")
print(f"   Fully supported answers: {sum(1 for r in [result1, result2, result3] if '[Fully supported]' in r['support_token'])}/3")


📊 SELF-RAG PERFORMANCE SUMMARY
                 Query Retrieval Decision  Retrieved Docs                      Support  Utility  Composite Score
                 2 + 2     [No Retrieval]               0 [No support / Contradictory]        5             0.25
Transformer components     [No Retrieval]               0 [No support / Contradictory]        5             0.25
            Explain ML     [No Retrieval]               0 [No support / Contradictory]        5             0.25


🎯 Adaptive Behavior Analysis:
   Queries triggering retrieval: 0/3
   Average utility score: 5.0/5
   Fully supported answers: 0/3
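
The test headings state an expected retrieval token for the first two queries, so the observed decisions can be checked against them programmatically. The sketch below uses the decisions shown in the summary table above; `find_mismatches` is a hypothetical helper, and `None` marks a query with no specific expectation.

```python
from typing import Dict, List, Optional, Tuple

# Expected tokens from the test headings (None = no specific expectation).
expected: Dict[str, Optional[str]] = {
    "2 + 2": "[No Retrieval]",
    "Transformer components": "[Retrieval]",
    "Explain ML": None,
}
# Observed tokens copied from the summary table above.
observed: Dict[str, str] = {
    "2 + 2": "[No Retrieval]",
    "Transformer components": "[No Retrieval]",
    "Explain ML": "[No Retrieval]",
}


def find_mismatches(expected: Dict[str, Optional[str]],
                    observed: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (query, expected, observed) for every query whose decision differs."""
    mismatches = []
    for query, want in expected.items():
        got = observed.get(query, "<missing>")
        if want is not None and got != want:
            mismatches.append((query, want, got))
    return mismatches


for query, want, got in find_mismatches(expected, observed):
    print(f"{query}: expected {want}, got {got}")
# Transformer components: expected [Retrieval], got [No Retrieval]
```

In this run the transformer query is the only mismatch: the prompted critic classed it as general knowledge, so the retrieval, relevance, and groundedness branches were never exercised.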


## Step 8: Critique-Weighted Scoring Demonstration

Show how adjusting the critique weights changes the composite score for the same query.

In [13]:
# Create two variants with different weight configurations

# Variant 1: Prioritize factual grounding (for fact-checking tasks)
selfrag_factual = SelfRAGSystem(index, llm, critic, retriever)
selfrag_factual.w_relevance = 0.5
selfrag_factual.w_support = 3.0  # Very high weight on grounding
selfrag_factual.w_utility = 0.5

# Variant 2: Prioritize utility (for user satisfaction)
selfrag_utility = SelfRAGSystem(index, llm, critic, retriever)
selfrag_utility.w_relevance = 0.5
selfrag_utility.w_support = 0.5
selfrag_utility.w_utility = 3.0  # Very high weight on utility

# Test same query with both variants
test_query = "What are the key components of a transformer architecture?"

print("\n" + "="*100)
print("🔬 WEIGHT CONFIGURATION COMPARISON")
print("="*100)
print(f"Test Query: {test_query}\n")

print("[CONFIG 1] Factual-Focused (w_support=3.0)")
print("-" * 50)
result_factual = selfrag_factual.query(test_query, verbose=False)
print(f"Composite Score: {result_factual['composite_score']}")
print(f"Support: {result_factual['support_token']}")
print(f"Utility: {result_factual['utility_score']}/5\n")

print("[CONFIG 2] Utility-Focused (w_utility=3.0)")
print("-" * 50)
result_utility = selfrag_utility.query(test_query, verbose=False)
print(f"Composite Score: {result_utility['composite_score']}")
print(f"Support: {result_utility['support_token']}")
print(f"Utility: {result_utility['utility_score']}/5\n")

print("="*100)
print("\n💡 Insight: Different weight configurations optimize for different objectives.")
print("   - Factual tasks (medical, legal): High w_support")
print("   - User engagement tasks: High w_utility")
print("   - Retrieval-critical tasks: High w_relevance")


🔬 WEIGHT CONFIGURATION COMPARISON
Test Query: What are the key components of a transformer architecture?

[CONFIG 1] Factual-Focused (w_support=3.0)
--------------------------------------------------
Composite Score: 0.125
Support: [No support / Contradictory]
Utility: 5/5

[CONFIG 2] Utility-Focused (w_utility=3.0)
--------------------------------------------------
Composite Score: 0.75
Support: [No support / Contradictory]
Utility: 5/5


💡 Insight: Different weight configurations optimize for different objectives.
   - Factual tasks (medical, legal): High w_support
   - User engagement tasks: High w_utility
   - Retrieval-critical tasks: High w_relevance
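
The scores above can be sanity-checked by hand. The sketch below is not the `SelfRAGSystem` implementation; it assumes a weighted average of three critique values in the range 0 to 1, with relevance and support contributing 0 (no documents retrieved, no support) and a 5/5 utility mapping to 1.0. The function name `composite_score` and this normalization are assumptions for illustration. Under those assumptions the arithmetic matches the 0.125 reported for the factual configuration and the 0.75 reported for the utility configuration.

```python
def composite_score(relevance, support, utility,
                    w_relevance, w_support, w_utility):
    """Weighted average of critique values (hypothetical formula, each value in [0, 1])."""
    total_weight = w_relevance + w_support + w_utility
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = (w_relevance * relevance
                    + w_support * support
                    + w_utility * utility)
    return weighted_sum / total_weight


# Assumed critique values for the test query: nothing retrieved, no support, utility 5/5.
relevance, support, utility = 0.0, 0.0, 1.0

# Factual-focused weights: (0.5*0 + 3.0*0 + 0.5*1) / 4.0
factual = composite_score(relevance, support, utility, 0.5, 3.0, 0.5)

# Utility-focused weights: (0.5*0 + 0.5*0 + 3.0*1) / 4.0
utility_focused = composite_score(relevance, support, utility, 0.5, 0.5, 3.0)

print(factual, utility_focused)  # 0.125 0.75
```

The critique tokens are identical in both configurations; only the weighting changes, which is why the same answer can rank very differently depending on the task profile.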

## 🎓 Key Takeaways

### What We Learned

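1. **Adaptive Retrieval is Efficient**
   - Traditional RAG retrieves for EVERY query, even when the model already knows the answer
   - Self-RAG decides *when* to retrieve based on query type
   - Skipping retrieval saves compute and latency for queries answerable with parametric knowledge (in the run above, 0/3 queries triggered retrieval)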

2. **Multi-Dimensional Critique Enables Control**
   - **Retrieval tokens**: Adaptive behavior
   - **Relevance tokens**: Quality filtering
   - **Support tokens**: Hallucination detection
   - **Utility tokens**: User satisfaction measurement

3. **Weighted Scoring Enables Task-Specific Optimization**
   - Adjusting `w_support` optimizes for factual accuracy
   - Adjusting `w_utility` optimizes for helpfulness
   - Adjusting `w_relevance` optimizes for retrieval precision

4. **Self-RAG as Meta-Reasoner**
   - From the curriculum: "The LLM performs meta-reasoning about the task itself"
   - Self-RAG asks: "Do I need help?" before acting
   - This is the bridge to fully Agentic RAG systems

### Architectural Insights

**The Progression of Intelligence:**
1. **Traditional RAG**: Reactive executor
2. **Corrective RAG**: Post-hoc corrector
3. **Self-RAG**: Proactive meta-reasoner ← We are here
4. **Agentic RAG**: Autonomous orchestrator (next demos)

### Production Considerations

1. **Latency Trade-offs**
   - Multiple critique calls add latency (4-5 LLM calls per query)
   - Some critiques can run in parallel (for example, per-passage relevance assessment)
   - Adaptive retrieval saves time when `[No Retrieval]` is predicted

2. **Implementation Reality**
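   - This demo *simulates* Self-RAG with prompting: reflection tokens come from prompted LLM calls, not from tokens the model was trained to emit
   - The original Self-RAG fine-tunes the generator so that reflection tokens are part of its vocabulary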
   - See original paper: [Self-RAG (arXiv:2310.11511)](https://arxiv.org/abs/2310.11511)

3. **When to Use Self-RAG**
   - High-stakes domains (medical, legal, financial)
   - Need for explainability (audit trail of decisions)
   - Variable query complexity (mix of simple and complex queries)
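
The parallelization mentioned under Latency Trade-offs can be sketched with a thread pool, since each per-passage relevance check is an independent, I/O-bound call. This is a minimal sketch, not the demo's critic: `assess_relevance` is a hypothetical stand-in that uses keyword overlap instead of an LLM call, and the `[Irrelevant]` label is assumed as the counterpart of `[Relevant]`.

```python
from concurrent.futures import ThreadPoolExecutor


def assess_relevance(query: str, passage: str) -> str:
    """Hypothetical stand-in for one critic LLM call (keyword overlap only)."""
    query_terms = set(query.lower().split())
    passage_terms = set(passage.lower().split())
    return "[Relevant]" if query_terms & passage_terms else "[Irrelevant]"


def assess_all(query: str, passages: list[str], max_workers: int = 4) -> list[str]:
    """Run one relevance check per passage concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input passages.
        return list(pool.map(lambda p: assess_relevance(query, p), passages))


tokens = assess_all(
    "transformer attention",
    ["Attention is all you need", "Bananas are yellow"],
)
print(tokens)  # ['[Relevant]', '[Irrelevant]']
```

With real LLM calls, wall-clock time for the relevance stage approaches the slowest single call rather than the sum of all calls, subject to provider rate limits.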

### Comparison with Corrective RAG

| Aspect | Corrective RAG | Self-RAG |
|--------|---------------|----------|
| **When evaluated** | After retrieval | Before retrieval, during generation, and after generation |
| **Retrieval** | Always | Conditional (adaptive) |
| **Critique dimensions** | 1 (relevance) | 4 (retrieval, relevance, support, utility) |
| **Controllability** | Limited | High (weighted scoring) |
| **Latency** | +1 LLM call | +4-5 LLM calls |
| **Use case** | General robustness | High-stakes, explainable AI |

### Next Steps

To extend this implementation:
1. **Segment-Level Generation**: Implement iterative generation with per-segment critique
2. **Beam Search**: Maintain multiple candidate answers scored by composite critique
3. **Fine-Tuning**: Train a model to predict reflection tokens natively
4. **Caching**: Cache retrieval decisions for similar queries
5. **Monitoring**: Track critique distributions to identify system weaknesses
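
As a starting point for the monitoring item, the sketch below aggregates critique outputs across queries. It assumes each result is a dict with the `support_token` and `utility_score` keys used in this notebook; the helper name `summarize_critiques` is hypothetical.

```python
from collections import Counter
from statistics import mean


def summarize_critiques(results):
    """Count support tokens and average utility across a batch of query results."""
    support_counts = Counter(r["support_token"] for r in results)
    avg_utility = mean(r["utility_score"] for r in results) if results else 0.0
    return {"support_counts": dict(support_counts), "avg_utility": avg_utility}


# Illustrative inputs, not results from the demo run.
sample_results = [
    {"support_token": "[Fully supported]", "utility_score": 5},
    {"support_token": "[No support / Contradictory]", "utility_score": 3},
    {"support_token": "[No support / Contradictory]", "utility_score": 4},
]
summary = summarize_critiques(sample_results)
print(summary["support_counts"])
print(summary["avg_utility"])
```

A distribution dominated by `[No support / Contradictory]` would be one signal worth investigating, for example whether retrieval is being skipped for queries that need it.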

---

## 📚 References

**From the curriculum (AdvancedRAGWorkshop.md):**
- Meta-reasoning and self-reflection in Agentic RAG systems
- LLM as orchestrator: "Is the information I have sufficient?"

**Original Self-RAG Paper:**
- Asai et al., "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection" (ICLR 2024, Oral)
- arXiv:2310.11511
- Implementation: https://github.com/AkariAsai/self-rag

**Key Concepts Demonstrated:**
- Adaptive retrieval based on query analysis
- Multi-dimensional critique framework (4 token types)
- Controllable generation via weighted scoring
- Meta-reasoning: system reasoning about its own capabilities

---

**Status**: Demo #8 Complete ✅