# RAG system design

Retrieval-augmented generation (RAG) systems extend agent capabilities by grounding responses in external knowledge. Rather than relying solely on training data, RAG-enabled agents can query document stores, databases, and knowledge bases to retrieve relevant information before generating responses. This approach enables agents to access up-to-date information, cite sources, and provide answers grounded in specific organizational knowledge. However, poorly designed RAG systems introduce their own problems: noisy retrieval that clutters context with irrelevant information, critical details lost to inadequate chunking, and tokens wasted on redundant or low-quality content.

RAG system design is fundamentally about maximizing signal-to-noise ratio in retrieved context. Every decision - how documents are chunked, how chunks are indexed, what metadata is attached, how results are ranked, and how many results to return - affects the quality of context provided to the agent. Well-designed RAG systems deliver precisely the information needed to answer queries accurately while minimizing tokens spent on irrelevant details. This requires thoughtful strategies for chunking that preserve semantic coherence, rich metadata that enables filtering, provenance markers that support verification, and relevance scoring that prioritizes quality over quantity.

In this notebook, we explore systematic techniques for designing RAG systems that maximize retrieval quality and context efficiency. We will examine effective chunking strategies that respect document structure, semantic indexing with metadata enrichment, retrieval pipelines with provenance markers, relevance indicators and confidence scores, source attribution for verifiability, and techniques for optimizing signal-to-noise ratio.

In [1]:
import os
from typing import List, Dict, Optional, Tuple
from dataclasses import dataclass
from datetime import datetime
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field
import numpy as np

- **`langchain_openai`**: Provides the integration with OpenAI's models (Chat and Embeddings).
- **`langchain_text_splitters`**: Tools to split long text into smaller, manageable chunks.
- **`langchain_community.vectorstores`**: Contains the FAISS vector store implementation for efficient similarity search.
- **`pydantic`**: Used for data validation and defining structured data models.

### Initialize the language model and embeddings
We initialize the OpenAI chat model and embeddings model.

In [2]:
# Initialize the language model and embeddings
llm = ChatOpenAI(
    model="gpt-4o-mini-2024-07-18",
    api_key=os.getenv("OPENAI_API_KEY", "").strip(),
    temperature=0  # Set to 0 for more deterministic outputs
)
embeddings = OpenAIEmbeddings(api_key=os.getenv("OPENAI_API_KEY", "").strip())

- **`temperature=0`**: Makes the model favor its most likely response, so outputs are more consistent across runs (identical outputs are not guaranteed).
- **`OpenAIEmbeddings`**: Converts text into vector representations that capture semantic meaning.

### Sample document
We define a sample document (a product return policy) to use for demonstrating the RAG concepts.

In [3]:
# Sample document
sample_document = """Product Return Policy

At TechStore, we want you to be completely satisfied with your purchase. Our return policy is designed to be fair and straightforward.

Return Window: You have 30 days from the date of delivery to return most items. Electronics have a 14-day return window due to their nature.

Condition Requirements: Items must be in original condition with all accessories, manuals, and packaging. Opened software cannot be returned due to licensing restrictions.

Refund Process: Once we receive your return, we'll inspect it within 2 business days. Approved refunds are processed to your original payment method within 5-7 business days.

Shipping Costs: Return shipping is free for defective items. For other returns, a $9.99 shipping fee will be deducted from your refund unless you use our prepaid return label.

Exceptions: Final sale items, gift cards, and personalized products cannot be returned. Contact customer service for assistance with these items."""

print(f"Original document length: {len(sample_document)} characters")
print(f"Original document:\n{sample_document[:200]}...")

Original document length: 972 characters
Original document:
Product Return Policy

At TechStore, we want you to be completely satisfied with your purchase. Our return policy is designed to be fair and straightforward.

Return Window: You have 30 days from the ...



## Part 1: Effective chunking strategies

Chunking transforms long documents into retrievable units, but this seemingly simple operation profoundly affects retrieval quality. Chunk too large and we retrieve irrelevant information alongside relevant content, wasting tokens and diluting signal. Chunk too small and we fragment context, losing the connections and coherence needed to understand the information. The chunking strategy determines whether our RAG system can find and return exactly what's needed or buries valuable information in noise.

Different chunking strategies suit different content types and retrieval needs. Fixed-size chunking with overlap provides consistency and preserves some context across boundaries. Recursive chunking respects document structure by splitting on meaningful delimiters like paragraphs and sentences. Semantic chunking groups content by topic, creating chunks where every sentence contributes to answering a specific type of query. Understanding these approaches and when to apply each is essential for effective RAG system design.

### Bad approach: Naive fixed-size chunking without overlap
This approach simply splits the text into fixed-size chunks without considering any document structure.

In [4]:
# Naive chunking - splits at arbitrary character positions
def naive_chunk(text: str, chunk_size: int = 200) -> List[str]:
    """Split text into fixed-size chunks without considering content."""
    # Use list comprehension to slice the text at fixed intervals
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

# Apply naive chunking to the sample document
naive_chunks = naive_chunk(sample_document, chunk_size=200)

print("❌ Naive chunking results:")
print(f"Number of chunks: {len(naive_chunks)}\n")
# Iterate and print each chunk to observe the splits
for i, chunk in enumerate(naive_chunks[:3]):
    print(f"Chunk {i+1}:")
    print(chunk)
    print("---")

print("\n⚠️ Problems:")
print("  - Splits mid-sentence")
print("  - Separates related information")
print("  - No context about what comes before/after")
print("  - Impossible to understand chunk 3 alone")

❌ Naive chunking results:
Number of chunks: 5

Chunk 1:
Product Return Policy

At TechStore, we want you to be completely satisfied with your purchase. Our return policy is designed to be fair and straightforward.

Return Window: You have 30 days from the 
---
Chunk 2:
date of delivery to return most items. Electronics have a 14-day return window due to their nature.

Condition Requirements: Items must be in original condition with all accessories, manuals, and pack
---
Chunk 3:
aging. Opened software cannot be returned due to licensing restrictions.

Refund Process: Once we receive your return, we'll inspect it within 2 business days. Approved refunds are processed to your o
---

⚠️ Problems:
  - Splits mid-sentence
  - Separates related information
  - No context about what comes before/after
  - Impossible to understand chunk 3 alone


### Strategy 1: Fixed-size chunking with overlap
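We use `CharacterTextSplitter` to split the text on a separator and merge the pieces into chunks of up to `chunk_size` characters. `chunk_overlap` carries trailing pieces into the next chunk only when they fit within the overlap budget. Here every paragraph is longer than 50 characters, so the chunks in the output below share no text.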

In [5]:
# Initialize CharacterTextSplitter with specific parameters
text_splitter = CharacterTextSplitter(
    separator="\n\n",  # Split on paragraphs
    chunk_size=300,  # Target size for each chunk in characters
    chunk_overlap=50,  # 50 character overlap preserves context
    length_function=len,  # Function to measure text length (standard len())
)

# Split the sample document using the configured splitter
fixed_chunks = text_splitter.split_text(sample_document)

print("✅ Fixed-size chunking with overlap:")
print(f"Number of chunks: {len(fixed_chunks)}\n")
for i, chunk in enumerate(fixed_chunks[:3]):
    print(f"Chunk {i+1} ({len(chunk)} chars):")
    print(chunk)
    print("---")

print("\n✅ Improvements:")
print("  - Respects paragraph boundaries")
print("  - Overlap provides context")
print("  - Each chunk is more coherent")

✅ Fixed-size chunking with overlap:
Number of chunks: 5

Chunk 1 (299 chars):
Product Return Policy

At TechStore, we want you to be completely satisfied with your purchase. Our return policy is designed to be fair and straightforward.

Return Window: You have 30 days from the date of delivery to return most items. Electronics have a 14-day return window due to their nature.
---
Chunk 2 (171 chars):
Condition Requirements: Items must be in original condition with all accessories, manuals, and packaging. Opened software cannot be returned due to licensing restrictions.
---
Chunk 3 (174 chars):
Refund Process: Once we receive your return, we'll inspect it within 2 business days. Approved refunds are processed to your original payment method within 5-7 business days.
---

✅ Improvements:
  - Respects paragraph boundaries
  - Overlap provides context
  - Each chunk is more coherent
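
For comparison, a character-level overlap can be sketched without any library. The helper below is a minimal illustration, not part of LangChain: `sliding_window_chunks` is a hypothetical name, and the digit string is made-up test data. Each chunk repeats the last `overlap` characters of the previous one, which guarantees shared context but still ignores sentence and paragraph boundaries.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Fixed-size chunks; each chunk starts `overlap` characters before the previous one ends."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Made-up input: 500 digits, so the overlap is easy to see
text = "".join(str(i % 10) for i in range(500))
chunks = sliding_window_chunks(text, chunk_size=200, overlap=50)

print(len(chunks))                         # 3
print(chunks[0][-50:] == chunks[1][:50])   # True
```

In practice the overlap budget is a trade-off: a larger overlap protects facts that straddle a boundary but stores and retrieves more duplicate text.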


### Strategy 2: Recursive chunking (respects document structure)
We use `RecursiveCharacterTextSplitter` to split text hierarchically, trying to keep related text together.

In [6]:
# Recursive chunking - tries multiple separators in order
recursive_splitter = RecursiveCharacterTextSplitter(
    # List of separators to try in order: paragraphs, lines, sentences, words, chars
    separators=["\n\n", "\n", ". ", " ", ""],  # Try these in order
    chunk_size=300,
    chunk_overlap=50,
    length_function=len,
)

# Split the document recursively
recursive_chunks = recursive_splitter.split_text(sample_document)

print("✅ Recursive chunking (structure-aware):")
print(f"Number of chunks: {len(recursive_chunks)}\n")
for i, chunk in enumerate(recursive_chunks[:3]):
    print(f"Chunk {i+1} ({len(chunk)} chars):")
    print(chunk)
    print("---")

print("\n✅ Benefits:")
print("  - Respects document hierarchy")
print("  - Falls back gracefully (paragraph -> sentence -> word)")
print("  - Preserves semantic units")

✅ Recursive chunking (structure-aware):
Number of chunks: 5

Chunk 1 (299 chars):
Product Return Policy

At TechStore, we want you to be completely satisfied with your purchase. Our return policy is designed to be fair and straightforward.

Return Window: You have 30 days from the date of delivery to return most items. Electronics have a 14-day return window due to their nature.
---
Chunk 2 (171 chars):
Condition Requirements: Items must be in original condition with all accessories, manuals, and packaging. Opened software cannot be returned due to licensing restrictions.
---
Chunk 3 (174 chars):
Refund Process: Once we receive your return, we'll inspect it within 2 business days. Approved refunds are processed to your original payment method within 5-7 business days.
---

✅ Benefits:
  - Respects document hierarchy
  - Falls back gracefully (paragraph -> sentence -> word)
  - Preserves semantic units


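- **`separators`**: The splitter tries these in order. It first tries to split by paragraphs (`\n\n`), then lines (`\n`), then sentences (`. `), and so on. This respects the natural structure of the document better than fixed-size splitting. In this short document every paragraph already fits within `chunk_size`, so the chunks are identical to those from Strategy 1; the finer separators only come into play when a paragraph is too long.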
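
The fallback order can be illustrated with a simplified pure-Python sketch. This is not LangChain's implementation: it ignores `chunk_overlap`, it drops the separator wherever it cuts, and `recursive_split` is a hypothetical name. It only shows the idea of splitting on the coarsest separator first and descending to finer ones for pieces that are still too long.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split on the coarsest separator; recurse with finer ones only for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cuts at fixed character positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for part in text.split(sep):
        if not part:
            continue
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging neighbours while they fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This piece alone is too long: try the next, finer separator
            chunks.extend(recursive_split(part, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks


doc = (
    "Short intro.\n\n"
    "First sentence of a long paragraph. Second sentence of a long paragraph.\n\n"
    "End."
)
for chunk in recursive_split(doc, ["\n\n", ". ", " ", ""], chunk_size=50):
    print(repr(chunk))
```

Only the middle paragraph exceeds 50 characters, so it is the only one split at the sentence level; the other paragraphs stay whole.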

### Strategy 3: Semantic chunking (content-aware)
We define a custom function to chunk text based on semantic boundaries (keywords) rather than just length.

In [7]:
def semantic_chunk(text: str, embeddings_model) -> List[str]:
    """Chunk text at topic-marker keywords (a simplified stand-in for embedding-based chunking)."""
    # Split text into sentences based on '. ' (paragraph breaks are not treated as boundaries)
    sentences = [s.strip() for s in text.split('. ') if s.strip()]
    
    # For small documents, group by topic manually; embeddings_model is unused here
    # In production, use embeddings to detect topic shifts
    chunks = []
    current_chunk = []
    
    for sentence in sentences:
        current_chunk.append(sentence)
        
        # Group by topic markers (in production, use embedding similarity)
        # Check if the sentence contains any topic shift keywords
        if any(keyword in sentence.lower() for keyword in ['window:', 'requirements:', 'process:', 'costs:', 'exceptions:']):
            # If a keyword is found and we have accumulated sentences, finalize the current chunk
            if len(current_chunk) > 1:
                # Join sentences to form the chunk, excluding the current sentence (start of new topic)
                chunks.append('. '.join(current_chunk[:-1]) + '.')
                # Start the new chunk with the current sentence
                current_chunk = [current_chunk[-1]]

    # Append any remaining sentences as the final chunk
    if current_chunk:
        # rstrip avoids a doubled period when the last sentence already ends with one
        chunks.append('. '.join(current_chunk).rstrip('.') + '.')
    
    return chunks

# Apply semantic chunking
semantic_chunks = semantic_chunk(sample_document, embeddings)

print("✅ Semantic chunking (topic-based):")
print(f"Number of chunks: {len(semantic_chunks)}\n")
for i, chunk in enumerate(semantic_chunks[:4]):
    print(f"Topic {i+1} ({len(chunk)} chars):")
    print(chunk)
    print("---")

print("\n✅ Benefits:")
print("  - Each chunk covers one topic")
print("  - Natural semantic boundaries")
print("  - More relevant retrieval results")

✅ Semantic chunking (topic-based):
Number of chunks: 6

Topic 1 (95 chars):
Product Return Policy

At TechStore, we want you to be completely satisfied with your purchase.
---
Topic 2 (142 chars):
Our return policy is designed to be fair and straightforward.

Return Window: You have 30 days from the date of delivery to return most items.
---
Topic 3 (167 chars):
Electronics have a 14-day return window due to their nature.

Condition Requirements: Items must be in original condition with all accessories, manuals, and packaging.
---
Topic 4 (152 chars):
Opened software cannot be returned due to licensing restrictions.

Refund Process: Once we receive your return, we'll inspect it within 2 business days.
---

✅ Benefits:
  - Each chunk covers one topic
  - Natural semantic boundaries
  - More relevant retrieval results


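This custom function iterates through sentences and starts a new chunk whenever it detects a keyword that marks a new topic. Because sentences are split on `'. '` and paragraph breaks are not treated as sentence boundaries, each topic label (for example `Return Window:`) stays glued to the final sentence of the preceding paragraph. In the output above, every chunk therefore straddles two sections: Topic 3 pairs the electronics return window with the condition requirements. Splitting on paragraph breaks first, or detecting topic shifts with embedding similarity, is needed before each chunk is truly self-contained.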
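
The embedding-based variant mentioned in the code comments can be sketched as follows. Everything here is an assumption made for illustration: `toy_embed` is a bag-of-words stand-in so the example runs without an API key, the `0.2` threshold is arbitrary, and the sentences are invented. With a real embedding model, `embed` would return dense vectors and the threshold would need tuning on your own data.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def toy_embed(sentence: str) -> Vector:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return dict(Counter(re.findall(r"[a-z0-9]+", sentence.lower())))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_chunks(
    sentences: List[str],
    embed: Callable[[str], Vector] = toy_embed,
    threshold: float = 0.2,
) -> List[str]:
    """Start a new chunk when a sentence is dissimilar to the sentence before it."""
    chunks: List[str] = []
    current: List[str] = []
    previous: Vector = {}
    for sentence in sentences:
        vector = embed(sentence)
        if current and cosine(previous, vector) < threshold:
            chunks.append(" ".join(current))  # topic shift detected
            current = []
        current.append(sentence)
        previous = vector
    if current:
        chunks.append(" ".join(current))
    return chunks


# Invented sentences: two about returns, two about refunds
sentences = [
    "Returns are accepted within 30 days.",
    "Electronics returns are accepted within 14 days.",
    "Refunds go to the original payment method.",
    "Refunds to the original payment method take 5 business days.",
]
for chunk in similarity_chunks(sentences):
    print(chunk)
```

Comparing only adjacent sentences keeps the sketch short; comparing against a running average of the current chunk is a common refinement when single sentences are noisy.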

### Choosing the right chunking strategy

| Strategy | Best For | Pros | Cons |
|----------|----------|------|------|
| Fixed-size | Consistent chunk lengths, simple documents | Fast, predictable | Ignores structure |
| Recursive | Structured documents (markdown, code) | Respects hierarchy | May create uneven chunks |
| Semantic | Topic-based documents, FAQs | Chunks align with topics, which tends to improve retrieval relevance | Slower, more complex |

## Part 2: Semantic indexing with metadata

Metadata transforms chunks from isolated text fragments into rich, queryable information objects. Without metadata, chunks float in isolation with no context about their source, recency, category, or reliability. With well-designed metadata, agents can filter by document type, prioritize recent information, verify sources, and understand the provenance of every fact. This enables precise retrieval where only relevant, trustworthy information makes it into the agent's context.

The key is choosing metadata that supports filtering and ranking without adding excessive overhead. Core metadata like source and document type enable basic filtering. Structural metadata like section and chunk position help agents understand context. Temporal metadata like creation and update timestamps support freshness-based ranking. Categorical metadata like tags and keywords enable semantic filtering. Together, these metadata fields transform simple text retrieval into intelligent information discovery.

### Chunks without metadata
First, let's see what chunks look like without metadata.

In [8]:
# Create Document objects from raw text chunks without adding metadata
chunks_without_metadata = [
    Document(page_content=chunk) 
    for chunk in recursive_chunks
]

print("❌ Chunks without metadata:")
for i, doc in enumerate(chunks_without_metadata[:2]):
    print(f"\nChunk {i+1}:")
    print(f"  Content: {doc.page_content[:100]}...")
    print(f"  Metadata: {doc.metadata}")

print("\n⚠️ Problems:")
print("  - No source attribution")
print("  - Can't filter by document type")
print("  - No timestamp for freshness")
print("  - Missing context about document structure")

❌ Chunks without metadata:

Chunk 1:
  Content: Product Return Policy

At TechStore, we want you to be completely satisfied with your purchase. Our ...
  Metadata: {}

Chunk 2:
  Content: Condition Requirements: Items must be in original condition with all accessories, manuals, and packa...
  Metadata: {}

⚠️ Problems:
  - No source attribution
  - Can't filter by document type
  - No timestamp for freshness
  - Missing context about document structure


Without metadata, the `metadata` dictionary is empty. This limits our ability to filter or track the source of the information.

### Rich metadata for better retrieval
Now we will create a function to automatically attach rich metadata to each chunk.

In [9]:
# With comprehensive metadata
@dataclass
class ChunkMetadata:
    """Metadata schema for document chunks."""
    source: str  # Source document
    document_type: str  # policy, faq, product_info, etc.
    section: str  # Section within document
    chunk_index: int  # Position in document
    total_chunks: int  # Total chunks in document
    created_at: str  # When indexed
    last_updated: str  # Last modification
    keywords: List[str]  # Extracted keywords
    category: str  # High-level category

def create_enriched_chunks(text: str, source: str, document_type: str) -> List[Document]:
    """Create chunks with rich metadata."""
    # Use recursive splitter for better content preservation
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=300,
        chunk_overlap=50,
    )
    chunks = splitter.split_text(text)
    
    documents = []
    for i, chunk in enumerate(chunks):
        # Heuristic to determine section based on keywords in the chunk
        section = "General"
        if "shipping" in chunk.lower(): section = "Shipping Costs"
        elif "refund" in chunk.lower(): section = "Refund Process"
        elif "return" in chunk.lower(): section = "Standard Returns"
        
        # Create metadata object with all relevant fields
        metadata = ChunkMetadata(
            source=source,
            document_type=document_type,
            section=section,
            chunk_index=i,
            total_chunks=len(chunks),
            created_at=datetime.now().isoformat(),
            last_updated=datetime.now().isoformat(),
            keywords=["policy", "returns", section.lower()],
            category="customer_service"
        )
        
        # Create Document object with content and metadata dictionary
        documents.append(Document(page_content=chunk, metadata=metadata.__dict__))
    return documents

# Generate enriched chunks from the sample document
enriched_chunks = create_enriched_chunks(
    sample_document, 
    source="return_policy_v2.md",
    document_type="policy"
)

print("✅ Chunks with rich metadata:")
for i, doc in enumerate(enriched_chunks[:2]):
    print(f"\nChunk {i+1}:")
    print(f"  Content: {doc.page_content[:100]}...")
    print(f"  Metadata:")
    for key, value in doc.metadata.items():
        print(f"    - {key}: {value}")

✅ Chunks with rich metadata:

Chunk 1:
  Content: Product Return Policy

At TechStore, we want you to be completely satisfied with your purchase. Our ...
  Metadata:
    - source: return_policy_v2.md
    - document_type: policy
    - section: Standard Returns
    - chunk_index: 0
    - total_chunks: 5
    - created_at: 2025-11-28T18:52:52.840134
    - last_updated: 2025-11-28T18:52:52.840143
    - keywords: ['policy', 'returns', 'standard returns']
    - category: customer_service

Chunk 2:
  Content: Condition Requirements: Items must be in original condition with all accessories, manuals, and packa...
  Metadata:
    - source: return_policy_v2.md
    - document_type: policy
    - section: Standard Returns
    - chunk_index: 1
    - total_chunks: 5
    - created_at: 2025-11-28T18:52:52.840174
    - last_updated: 2025-11-28T18:52:52.840176
    - keywords: ['policy', 'returns', 'standard returns']
    - category: customer_service


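- **`section`**: We use a simple heuristic to guess the section based on keywords. Because the checks run in order and match substrings, the "Condition Requirements" chunk is labeled "Standard Returns" (it contains the word "returned"). In a real system, the section would more reliably come from the document structure.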
- **`chunk_index`**: Helps in ordering chunks if we need to reconstruct the document.
- **`keywords`**: Allows for keyword-based filtering in addition to semantic search.
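
Reading the section from the document structure can be sketched in a few lines. This assumes each section paragraph opens with a `Label:` prefix, as the sample policy does; `label_sections` and `HEADER_PATTERN` are hypothetical names, and the policy text is abbreviated from the sample document.

```python
import re
from typing import List, Tuple

# A paragraph that opens with "Some Label: ..." is treated as a labelled section
HEADER_PATTERN = re.compile(r"^([A-Z][A-Za-z ]{1,40}):\s")


def label_sections(text: str, default: str = "General") -> List[Tuple[str, str]]:
    """Return (section, paragraph) pairs, taking the section from each paragraph's leading label."""
    labelled = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        match = HEADER_PATTERN.match(paragraph)
        section = match.group(1) if match else default
        labelled.append((section, paragraph))
    return labelled


# Abbreviated version of the sample policy
policy = (
    "Product Return Policy\n\n"
    "Return Window: You have 30 days from the date of delivery to return most items.\n\n"
    "Condition Requirements: Items must be in original condition.\n\n"
    "Shipping Costs: Return shipping is free for defective items."
)
for section, paragraph in label_sections(policy):
    print(f"{section}: {paragraph[:40]}")
```

Labels taken from the text itself avoid the substring collisions of the keyword heuristic, at the cost of depending on a consistent document layout.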

### Using metadata for filtered retrieval
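We will create a vector store and perform a filtered search. The filter restricts results to chunks whose metadata matches. Note that LangChain's FAISS wrapper applies the filter after the similarity search: it fetches `fetch_k` candidates (20 by default) and then discards those that do not match, so a filtered query can return fewer than `k` results.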

In [10]:
# Create FAISS vector store with metadata
vectorstore = FAISS.from_documents(enriched_chunks, embeddings)

query = "What's the shipping cost for returns?"

# Retrieval without metadata filtering
basic_results = vectorstore.similarity_search(query, k=2)

print("Basic retrieval (no metadata filtering):")
for i, doc in enumerate(basic_results):
    print(f"\nResult {i+1}:")
    print(f"  Section: {doc.metadata['section']}")
    print(f"  Content: {doc.page_content[:150]}...")


# Retrieval with metadata filtering
filtered_results = vectorstore.similarity_search(
    query,
    k=2,
    filter={"section": "Shipping Costs"}  # Only search specific section
)

print("\n" + "="*60)
print("Filtered retrieval (section: Shipping Costs):")
for i, doc in enumerate(filtered_results):
    print(f"\nResult {i+1}:")
    print(f"  Section: {doc.metadata['section']}")
    print(f"  Content: {doc.page_content[:150]}...")

print("\n✅ Benefits of metadata filtering:")
print("  - More precise results")
print("  - Can filter by date, type, category")
print("  - Reduces noise in retrieved context")

Basic retrieval (no metadata filtering):

Result 1:
  Section: Shipping Costs
  Content: Shipping Costs: Return shipping is free for defective items. For other returns, a $9.99 shipping fee will be deducted from your refund unless you use ...

Result 2:
  Section: Standard Returns
  Content: Product Return Policy

At TechStore, we want you to be completely satisfied with your purchase. Our return policy is designed to be fair and straightf...

Filtered retrieval (section: Shipping Costs):

Result 1:
  Section: Shipping Costs
  Content: Shipping Costs: Return shipping is free for defective items. For other returns, a $9.99 shipping fee will be deducted from your refund unless you use ...

✅ Benefits of metadata filtering:
  - More precise results
  - Can filter by date, type, category
  - Reduces noise in retrieved context


- **`FAISS.from_documents`**: Creates the vector index from our enriched chunks.
- **`filter={"section": "Shipping Costs"}`**: This argument tells the vector store to keep only chunks whose `section` metadata field equals "Shipping Costs". Every returned result therefore comes from that section, even if other sections contain similar keywords, but fewer than `k` results may come back, as in the output above where only one chunk matches.
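
The effect of filtering after the similarity search can be shown with a small pure-Python model. It assumes post-filtering semantics (take the `fetch_k` nearest candidates, then drop non-matching ones), which is how LangChain's FAISS wrapper is understood to behave; check this against the version you have installed. `post_filtered_search` is a hypothetical name, and the distances are invented.

```python
from typing import Dict, List, Tuple

Candidate = Tuple[str, Dict[str, str], float]  # (content, metadata, distance)


def post_filtered_search(
    candidates: List[Candidate],
    metadata_filter: Dict[str, str],
    k: int = 2,
    fetch_k: int = 20,
) -> List[Candidate]:
    """Take the fetch_k nearest candidates, drop those failing the filter, return up to k."""
    nearest = sorted(candidates, key=lambda c: c[2])[:fetch_k]
    matching = [
        c for c in nearest
        if all(c[1].get(key) == value for key, value in metadata_filter.items())
    ]
    return matching[:k]


# Invented distances; lower means closer to the query
candidates: List[Candidate] = [
    ("intro", {"section": "Standard Returns"}, 0.30),
    ("conditions", {"section": "Standard Returns"}, 0.35),
    ("shipping", {"section": "Shipping Costs"}, 0.40),
    ("refunds", {"section": "Refund Process"}, 0.55),
]
wanted = {"section": "Shipping Costs"}

print(post_filtered_search(candidates, wanted, k=2, fetch_k=2))   # []
print(post_filtered_search(candidates, wanted, k=2, fetch_k=20))  # one match
```

With a small `fetch_k`, the matching chunk never reaches the filter step. On a large index with a selective filter, raising `fetch_k` (or using a store that filters before searching) avoids silently empty results.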

## Part 3: Retrieval pipelines with provenance markers
Provenance markers identify where information came from, enabling verification and building trust.

In [11]:
# Without provenance - can't verify sources
def basic_retrieval(query: str) -> str:
    """Basic retrieval without provenance."""
    results = vectorstore.similarity_search(query, k=2)
    context = "\n\n".join([doc.page_content for doc in results])
    return context

query = "Can I return electronics?"
basic_context = basic_retrieval(query)

print("❌ Context without provenance:")
print(basic_context)
print("\n⚠️ Problems:")
print("  - No way to verify information")
print("  - Can't trace back to source")
print("  - User can't fact-check")
print("  - Difficult to debug retrieval issues")

❌ Context without provenance:
Product Return Policy

At TechStore, we want you to be completely satisfied with your purchase. Our return policy is designed to be fair and straightforward.

Return Window: You have 30 days from the date of delivery to return most items. Electronics have a 14-day return window due to their nature.

Condition Requirements: Items must be in original condition with all accessories, manuals, and packaging. Opened software cannot be returned due to licensing restrictions.

⚠️ Problems:
  - No way to verify information
  - Can't trace back to source
  - User can't fact-check
  - Difficult to debug retrieval issues


### Adding provenance markers
We will implement retrieval with source tracking. It is crucial for the agent to know *where* information came from so it can cite it.

In [12]:
def retrieval_with_provenance(query: str) -> Tuple[str, List[Dict]]:
    """Retrieves documents and formats them with source markers."""
    # Retrieve top 2 documents
    results = vectorstore.similarity_search(query, k=2)
    
    # Build context with provenance markers
    context_parts = []
    sources = []
    
    for i, doc in enumerate(results, 1):
        # Add provenance marker for the LLM (e.g., [Source 1: filename - section])
        marker = f"[Source {i}: {doc.metadata['source']} - {doc.metadata['section']}]"
        # Append the marker and content to the context list
        context_parts.append(f"{marker}\n{doc.page_content}")
        
        # Collect source information
        sources.append({
            "id": i,
            "source": doc.metadata['source'],
            "section": doc.metadata['section'],
            "document_type": doc.metadata['document_type'],
            "last_updated": doc.metadata['last_updated'],
        })

    # Join all context parts with double newlines
    context = "\n\n".join(context_parts)
    return context, sources

# Test the function
context_with_provenance, sources = retrieval_with_provenance(query)

print("✅ Context with provenance markers:")
print(context_with_provenance)
print("\n" + "="*60)
print("\nSource metadata:")
for source in sources:
    print(f"\nSource {source['id']}:")
    for key, value in source.items():
        if key != 'id':
            print(f"  {key}: {value}")

✅ Context with provenance markers:
[Source 1: return_policy_v2.md - Standard Returns]
Product Return Policy

At TechStore, we want you to be completely satisfied with your purchase. Our return policy is designed to be fair and straightforward.

Return Window: You have 30 days from the date of delivery to return most items. Electronics have a 14-day return window due to their nature.

[Source 2: return_policy_v2.md - Standard Returns]
Condition Requirements: Items must be in original condition with all accessories, manuals, and packaging. Opened software cannot be returned due to licensing restrictions.


Source metadata:

Source 1:
  source: return_policy_v2.md
  section: Standard Returns
  document_type: policy
  last_updated: 2025-11-28T18:52:52.840143

Source 2:
  source: return_policy_v2.md
  section: Standard Returns
  document_type: policy
  last_updated: 2025-11-28T18:52:52.840176


- **`[Source i: ...]`**: We prepend this string to each chunk. The LLM sees this and can use it to reference the information.
- **`sources` list**: We return the structured metadata separately so the application can display citations to the user.

### Using provenance in agent responses
We instruct the agent to use these markers in its response.

In [13]:
# Define a prompt template that instructs the model to use citations
provenance_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a customer service agent. Answer questions using the provided context.
    
IMPORTANT: When citing information, reference the source number in brackets, e.g., [1] or [2].

Context with sources:
{context}
"""),
    ("human", "{question}")
])

# Format the messages with the retrieved context and user question
messages = provenance_prompt.format_messages(
    context=context_with_provenance,
    question=query
)

# Invoke the LLM to generate a response
response = llm.invoke(messages)

print("Agent response with citations:")
print(response.content)
print("\n" + "="*60)
print("\nSources:")
for source in sources:
    print(f"[{source['id']}] {source['source']} - {source['section']} (updated: {source['last_updated']})")

Agent response with citations:
Yes, you can return electronics, but please note that they have a 14-day return window from the date of delivery due to their nature. Make sure the items are in original condition with all accessories, manuals, and packaging included [1].


Sources:
[1] return_policy_v2.md - Standard Returns (updated: 2025-11-28T18:52:52.840143)
[2] return_policy_v2.md - Standard Returns (updated: 2025-11-28T18:52:52.840176)


The system prompt explicitly tells the model to use the `[1]` format. Because the context contains `[Source 1: ...]`, the model can easily map the information to the source number.
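
Because the model is only instructed, not forced, to cite correctly, it can help to check the citations afterwards. The sketch below is one possible check: `check_citations` is a hypothetical helper, and the answer string is invented so that it contains one valid and one invalid marker.

```python
import re
from typing import Dict, List, Set


def check_citations(answer: str, sources: List[Dict]) -> Dict[str, Set[int]]:
    """Compare the [n] markers in an answer with the ids of the retrieved sources."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    available = {source["id"] for source in sources}
    return {
        "cited": cited & available,     # markers that point at a real source
        "unknown": cited - available,   # markers with no matching source
        "unused": available - cited,    # retrieved sources the answer never cites
    }


# Invented answer: [1] exists, [3] does not
answer = "Electronics have a 14-day return window [1]. Refunds take 5-7 business days [3]."
retrieved = [
    {"id": 1, "source": "return_policy_v2.md"},
    {"id": 2, "source": "return_policy_v2.md"},
]
report = check_citations(answer, retrieved)
print(report)
```

An application could use `unknown` to flag a possibly fabricated citation and `unused` to avoid listing sources the answer never relied on.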

## Part 4: Relevance indicators and confidence scores
Not all retrieved chunks are equally relevant. Adding relevance scores helps the agent weight information appropriately.

In [14]:
# Retrieval without relevance scores
def retrieve_without_scores(query: str, k: int = 3) -> List[Document]:
    """Retrieve without confidence indicators."""
    return vectorstore.similarity_search(query, k=k)

results_no_scores = retrieve_without_scores("return policy for laptops")

print("❌ Retrieval without relevance scores:")
for i, doc in enumerate(results_no_scores):
    print(f"\nResult {i+1}:")
    print(f"  Content: {doc.page_content[:100]}...")
    print(f"  Relevance: Unknown")

print("\n⚠️ Problems:")
print("  - Agent can't prioritize information")
print("  - No indication of match quality")
print("  - May use less relevant chunks equally")

❌ Retrieval without relevance scores:

Result 1:
  Content: Product Return Policy

At TechStore, we want you to be completely satisfied with your purchase. Our ...
  Relevance: Unknown

Result 2:
  Content: Condition Requirements: Items must be in original condition with all accessories, manuals, and packa...
  Relevance: Unknown

Result 3:
  Content: Shipping Costs: Return shipping is free for defective items. For other returns, a $9.99 shipping fee...
  Relevance: Unknown

⚠️ Problems:
  - Agent can't prioritize information
  - No indication of match quality
  - May use less relevant chunks equally


### Adding relevance scores and confidence levels
We will add confidence scores to retrieval results. This helps us filter out irrelevant results that might otherwise lead the agent to hallucinate.
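
The distance-to-confidence mapping used in the next cell can be worked through on its own. The `1 / (1 + distance)` conversion and the 0.8 / 0.6 thresholds are taken from that cell; the function names and sample distances are illustrative.

```python
def distance_to_similarity(distance: float) -> float:
    """Map a non-negative distance onto (0, 1]; a distance of 0 gives similarity 1."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + distance)


def confidence_band(similarity: float) -> str:
    """Bucket a similarity score using the notebook's thresholds."""
    if similarity > 0.8:
        return "high"
    if similarity > 0.6:
        return "medium"
    return "low"


# Illustrative distances, not taken from a real index
for distance in (0.0, 0.2, 0.5, 1.0):
    similarity = distance_to_similarity(distance)
    print(f"distance={distance:.2f} -> similarity={similarity:.3f} ({confidence_band(similarity)})")
```

Inverting the formula shows what the thresholds mean in distance terms: "high" requires a distance below 0.25, and "medium" a distance below about 0.667. The score is a monotonic rescaling rather than a probability, so the bands should be calibrated against queries with known answers.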

In [15]:
# Define a Pydantic model for structured retrieval results
class RetrievalResult(BaseModel):
    """Retrieval result with confidence scoring."""
    content: str
    source: str
    section: str
    # Ensure similarity score is between 0 and 1
    similarity_score: float = Field(ge=0.0, le=1.0)
    confidence_level: str  # high, medium, low
    relevance_reason: str

def retrieve_with_scores(query: str, k: int = 3) -> List[RetrievalResult]:
    """Retrieve with confidence scores and relevance indicators."""
    # Get results with scores (lower is better) from FAISS
    results_with_scores = vectorstore.similarity_search_with_score(query, k=k)
    
    retrieval_results = []
    
    for doc, score in results_with_scores:
        # Convert FAISS distance to similarity (lower distance = higher similarity)
        # Normalize to 0-1 range (this is approximate)
        similarity = 1 / (1 + score)
        
        # Determine confidence level
        if similarity > 0.8:
            confidence = "high"
            reason = "Strong semantic match"
        elif similarity > 0.6:
            confidence = "medium"
            reason = "Moderate semantic match"
        else:
            confidence = "low"
            reason = "Weak semantic match - verify relevance"

        # Create a structured result object
        retrieval_results.append(RetrievalResult(
            content=doc.page_content,
            source=doc.metadata['source'],
            section=doc.metadata['section'],
            similarity_score=round(similarity, 3),
            confidence_level=confidence,
            relevance_reason=reason
        ))
    
    return retrieval_results

# Test with a query
scored_results = retrieve_with_scores("return policy for laptops", k=3)

print("✅ Retrieval with confidence scores:")
for i, result in enumerate(scored_results, 1):
    print(f"\nResult {i}:")
    print(f"  Source: {result.source} - {result.section}")
    print(f"  Similarity: {result.similarity_score:.3f}")
    print(f"  Confidence: {result.confidence_level.upper()}")
    print(f"  Reason: {result.relevance_reason}")
    print(f"  Content: {result.content[:100]}...")

✅ Retrieval with confidence scores:

Result 1:
  Source: return_policy_v2.md - Standard Returns
  Similarity: 0.743
  Confidence: MEDIUM
  Reason: Moderate semantic match
  Content: Product Return Policy

At TechStore, we want you to be completely satisfied with your purchase. Our ...

Result 2:
  Source: return_policy_v2.md - Standard Returns
  Similarity: 0.708
  Confidence: MEDIUM
  Reason: Moderate semantic match
  Content: Condition Requirements: Items must be in original condition with all accessories, manuals, and packa...

Result 3:
  Source: return_policy_v2.md - Shipping Costs
  Similarity: 0.694
  Confidence: MEDIUM
  Reason: Moderate semantic match
  Content: Shipping Costs: Return shipping is free for defective items. For other returns, a $9.99 shipping fee...


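- **`similarity_search_with_score`**: Returns each document together with its distance from the query. With the default FAISS L2 index this is an L2 (Euclidean) distance, which FAISS reports in squared form, so lower is better.
- **`1 / (1 + score)`**: Maps the unbounded distance (where 0 is identical) onto the range (0, 1] (where 1 is identical). The mapping preserves ranking and makes thresholds easier to set, but it is a heuristic, not a calibrated probability.
- **Thresholds**: We arbitrarily define >0.8 as "high" and >0.6 as "medium"; everything else is "low". These cut-offs depend on the embedding model and the corpus, so they would need tuning in a real application.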
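
One way to tune such a cut-off, sketched here under assumptions, is to hand-label a small set of retrieved chunks as relevant or not and sweep candidate thresholds, keeping the one with the best F1. The `best_threshold` helper and the labeled distances below are made up for illustration; they are not part of the notebook or of any library.

```python
from typing import List, Tuple


def distance_to_similarity(distance: float) -> float:
    """Map a non-negative distance onto (0, 1]; a distance of 0 gives 1.0."""
    return 1.0 / (1.0 + distance)


def best_threshold(labeled: List[Tuple[float, bool]]) -> Tuple[float, float]:
    """Return (threshold, f1) that maximizes F1 when keeping similarity >= threshold."""
    best = (0.0, 0.0)
    # Only observed similarity values can change which items are kept
    for threshold in sorted({sim for sim, _ in labeled}):
        tp = sum(1 for sim, rel in labeled if sim >= threshold and rel)
        fp = sum(1 for sim, rel in labeled if sim >= threshold and not rel)
        fn = sum(1 for sim, rel in labeled if sim < threshold and rel)
        if tp == 0:
            continue  # F1 is 0 here, so this threshold cannot win
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[1]:
            best = (threshold, f1)
    return best


# Hypothetical hand-labeled (distance, is_relevant) pairs
labeled_distances = [
    (0.20, True), (0.35, True), (0.45, True),
    (0.40, False), (0.60, False), (0.90, False),
]
labeled = [(distance_to_similarity(d), rel) for d, rel in labeled_distances]

threshold, f1 = best_threshold(labeled)
print(f"threshold={threshold:.3f}, F1={f1:.3f}")  # → threshold=0.690, F1=0.857
```

With real data the labeled set should come from queries users actually ask, and the chosen threshold should be rechecked whenever the embedding model or chunking changes.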

### Using confidence scores in context

In [16]:
def build_scored_context(query: str) -> str:
    """Build context with confidence indicators."""
    results = retrieve_with_scores(query, k=3)
    
    context_parts = []
    for i, result in enumerate(results, 1):
        # Add confidence indicator
        confidence_marker = f"[Source {i} - {result.confidence_level.upper()} confidence ({result.similarity_score:.2f})]"
        context_parts.append(f"{confidence_marker}\n{result.content}")
    
    return "\n\n".join(context_parts)

scored_context = build_scored_context("return policy for laptops")

print("Context with confidence scores:")
print(scored_context)

print("\n" + "="*60)
print("\n✅ Benefits:")
print("  - Agent can weight high-confidence information more")
print("  - Low-confidence results flagged for verification")
print("  - Users understand information quality")
print("  - Enables confidence-based filtering")

Context with confidence scores:
[Source 1 - MEDIUM confidence (0.74)]
Product Return Policy

At TechStore, we want you to be completely satisfied with your purchase. Our return policy is designed to be fair and straightforward.

Return Window: You have 30 days from the date of delivery to return most items. Electronics have a 14-day return window due to their nature.

[Source 2 - MEDIUM confidence (0.71)]
Condition Requirements: Items must be in original condition with all accessories, manuals, and packaging. Opened software cannot be returned due to licensing restrictions.

[Source 3 - MEDIUM confidence (0.69)]
Shipping Costs: Return shipping is free for defective items. For other returns, a $9.99 shipping fee will be deducted from your refund unless you use our prepaid return label.


✅ Benefits:
  - Agent can weight high-confidence information more
  - Low-confidence results flagged for verification
  - Users understand information quality
  - Enables confidence-based filtering
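
The "flagged for verification" benefit can be made explicit in the context itself. The following is a minimal sketch, assuming a plain dataclass (`ScoredChunk`, a hypothetical stand-in for `RetrievalResult`) and made-up chunk text; it appends a visible warning to any chunk labeled low confidence so the agent is told not to rely on it alone.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredChunk:
    """Hypothetical stand-in for RetrievalResult, reduced to the fields used here."""
    content: str
    similarity_score: float
    confidence_level: str  # "high", "medium", or "low"


def build_flagged_context(chunks: List[ScoredChunk]) -> str:
    """Build context where low-confidence chunks carry an explicit warning."""
    parts = []
    for i, chunk in enumerate(chunks, 1):
        marker = (
            f"[Source {i} - {chunk.confidence_level.upper()} "
            f"confidence ({chunk.similarity_score:.2f})]"
        )
        if chunk.confidence_level == "low":
            # Tell the agent this chunk is a weak match
            marker += " [VERIFY: weak match, do not rely on this alone]"
        parts.append(f"{marker}\n{chunk.content}")
    return "\n\n".join(parts)


# Made-up example chunks
chunks = [
    ScoredChunk("Electronics have a 14-day return window.", 0.74, "medium"),
    ScoredChunk("Gift cards are sold in fixed denominations.", 0.41, "low"),
]
flagged_context = build_flagged_context(chunks)
print(flagged_context)
```

Whether to flag weak chunks or drop them outright is a design choice: flagging keeps potentially useful context, while dropping saves tokens.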


## Part 5: Source attribution and signal-to-noise ratio

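Maximizing signal-to-noise ratio means giving the agent the chunks that help answer the query and leaving out the rest. Raising `k` without filtering does the opposite: every extra chunk costs tokens and gives the agent more irrelevant text to sort through.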

In [17]:
# High noise example - retrieving too much
def noisy_retrieval(query: str) -> str:
    """Retrieve many results without filtering."""
    results = vectorstore.similarity_search(query, k=5)  # Too many
    return "\n\n".join([doc.page_content for doc in results])

noisy_context = noisy_retrieval("laptop return window")

print("❌ High-noise retrieval (k=5, no filtering):")
print(f"Context length: {len(noisy_context)} characters")
print(noisy_context[:300] + "...")
print("\n⚠️ Problems:")
print("  - Too much irrelevant information")
print("  - Agent must sort through noise")
print("  - Wastes tokens")
print("  - May confuse the agent")

❌ High-noise retrieval (k=5, no filtering):
Context length: 972 characters
Product Return Policy

At TechStore, we want you to be completely satisfied with your purchase. Our return policy is designed to be fair and straightforward.

Return Window: You have 30 days from the date of delivery to return most items. Electronics have a 14-day return window due to their nature.
...

⚠️ Problems:
  - Too much irrelevant information
  - Agent must sort through noise
  - Wastes tokens
  - May confuse the agent


### Optimizing signal-to-noise ratio

In [18]:
def optimized_retrieval(query: str, min_confidence: float = 0.7) -> Tuple[str, Dict]:
    """Retrieve with confidence filtering for maximum signal-to-noise."""
    # Get results with scores
    results = retrieve_with_scores(query, k=5)
    
    # Filter by confidence threshold
    high_quality = [r for r in results if r.similarity_score >= min_confidence]
    
    # If too few results pass the threshold, relax it to 0.5
    # (note: this can admit results labeled "low" confidence)
    if len(high_quality) < 2:
        high_quality = [r for r in results if r.similarity_score >= 0.5]
    
    # Build concise context with only relevant info
    context_parts = []
    for i, result in enumerate(high_quality[:3], 1):  # Max 3 results
        marker = f"[{i}. {result.section} ({result.confidence_level})]"  
        context_parts.append(f"{marker} {result.content}")
    
    context = "\n\n".join(context_parts)
    
    # Metadata about retrieval quality (guard against an empty result set,
    # where np.mean would return nan)
    metadata = {
        "total_retrieved": len(results),
        "filtered_count": len(high_quality),
        "avg_confidence": np.mean([r.similarity_score for r in high_quality]) if high_quality else 0.0,
        "min_confidence_threshold": min_confidence,
    }
    
    return context, metadata

optimized_context, metadata = optimized_retrieval("laptop return window", min_confidence=0.7)

print("✅ Optimized retrieval (confidence-filtered):")
print(f"Context length: {len(optimized_context)} characters")
print(optimized_context)
print("\n" + "="*60)
print("\nRetrieval metadata:")
for key, value in metadata.items():
    print(f"  {key}: {value}")

print("\n✅ Improvements:")
print(f"  - Reduced noise: {len(noisy_context) - len(optimized_context)} chars saved")
print("  - Only high-confidence results included")
print("  - Concise and focused")
print("  - Better token efficiency")

✅ Optimized retrieval (confidence-filtered):
Context length: 740 characters
[1. Standard Returns (medium)] Product Return Policy

At TechStore, we want you to be completely satisfied with your purchase. Our return policy is designed to be fair and straightforward.

Return Window: You have 30 days from the date of delivery to return most items. Electronics have a 14-day return window due to their nature.

[2. Standard Returns (medium)] Condition Requirements: Items must be in original condition with all accessories, manuals, and packaging. Opened software cannot be returned due to licensing restrictions.

[3. Shipping Costs (medium)] Shipping Costs: Return shipping is free for defective items. For other returns, a $9.99 shipping fee will be deducted from your refund unless you use our prepaid return label.


Retrieval metadata:
  total_retrieved: 5
  filtered_count: 3
  avg_confidence: 0.734000007311503
  min_confidence_threshold: 0.7

✅ Improvements:
  - Reduced noise: 232 chars saved
  - Only high-confidence results included
  - Concise and focused
  - Better token efficiency
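
Confidence filtering controls quality but not total size. A complementary sketch, assuming a simple character budget stands in for a token budget, is to pack the highest-scoring chunks greedily until the budget is used up. The `pack_context` helper and the example chunks are illustrative assumptions, not part of the notebook.

```python
from typing import List, Tuple


def pack_context(scored_chunks: List[Tuple[str, float]], max_chars: int) -> List[str]:
    """Greedily keep the highest-scoring chunks that fit within max_chars."""
    selected: List[str] = []
    used = 0
    # Visit chunks from most to least similar
    for text, _score in sorted(scored_chunks, key=lambda pair: pair[1], reverse=True):
        if used + len(text) > max_chars:
            continue  # too big for the remaining budget; a smaller chunk may still fit
        selected.append(text)
        used += len(text)
    return selected


# Made-up (text, similarity) pairs
chunks = [
    ("Electronics have a 14-day return window.", 0.74),
    ("Items must be in original condition with all accessories.", 0.71),
    ("A $9.99 shipping fee applies to non-defective returns.", 0.69),
]

packed = pack_context(chunks, max_chars=100)
for text in packed:
    print(text)
```

In practice the budget would be measured with the model's tokenizer, and this step would run after the confidence filter so that the budget is spent only on chunks that already passed it.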


## Putting it all together: Production RAG system
We will combine all these techniques into a single `ProductionRAG` class that handles the entire pipeline.

In [19]:
class ProductionRAG:
    """Production-ready RAG system with all best practices."""
    
    def __init__(self, embeddings_model, llm_model):
        # Initialize embeddings and LLM
        self.embeddings = embeddings_model
        self.llm = llm_model
        self.vectorstore = None
        
    def index_documents(self, documents: List[str], sources: List[str], 
                       doc_types: List[str], chunking_strategy: str = "recursive"):
        """Index documents with rich metadata.

        Note: chunking_strategy is accepted but not used in this method;
        chunking is delegated to create_enriched_chunks.
        """
        all_chunks = []
        
        for doc, source, doc_type in zip(documents, sources, doc_types):
            # Create chunks with metadata
            chunks = create_enriched_chunks(doc, source, doc_type)
            all_chunks.extend(chunks)
        
        # Create vector store
        self.vectorstore = FAISS.from_documents(all_chunks, self.embeddings)
        print(f"✅ Indexed {len(all_chunks)} chunks from {len(documents)} documents")
    
    def retrieve(self, query: str, k: int = 3, min_confidence: float = 0.6,
                filters: Optional[Dict] = None) -> Dict:
        """Retrieve with confidence scoring, provenance, and filtering."""
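        if self.vectorstore is None:
            raise ValueError("No documents indexed; call index_documents() first")

        # Over-fetch (k*2) so that k results can survive the confidence filter.
        # filter=None means no metadata filtering.
        results = self.vectorstore.similarity_search_with_score(
            query, k=k*2, filter=filters
        )
        
        # Score and filter
        scored_results = []
        for doc, score in results:
            similarity = 1 / (1 + score)
            if similarity >= min_confidence:
                # Same tiers as retrieve_with_scores, so a min_confidence below
                # 0.6 cannot relabel weak matches as "medium"
                if similarity > 0.8:
                    confidence = 'high'
                elif similarity > 0.6:
                    confidence = 'medium'
                else:
                    confidence = 'low'
                scored_results.append({
                    'content': doc.page_content,
                    'metadata': doc.metadata,
                    'similarity': round(similarity, 3),
                    'confidence': confidence
                })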
        
        # Take top k
        scored_results = scored_results[:k]
        
        # Build context with provenance
        context_parts = []
        for i, result in enumerate(scored_results, 1):
            provenance = f"[Source {i}: {result['metadata']['source']} - {result['metadata']['section']} | Confidence: {result['confidence'].upper()} ({result['similarity']})]"  
            context_parts.append(f"{provenance}\n{result['content']}")
        
        return {
            'context': "\n\n".join(context_parts),
            'results': scored_results,
            'num_results': len(scored_results),
            'avg_confidence': np.mean([r['similarity'] for r in scored_results]) if scored_results else 0
        }
    
    def answer_query(self, query: str, min_confidence: float = 0.6) -> Dict:
        """Answer query with full RAG pipeline."""
        # Retrieve context
        retrieval = self.retrieve(query, k=3, min_confidence=min_confidence)
        
        # Create prompt
        prompt = ChatPromptTemplate.from_messages([
            ("system", """You are a helpful customer service agent. Answer using the provided context.

IMPORTANT:
- Cite sources using [1], [2], etc.
- If confidence is MEDIUM, mention "According to our records..."
- If no relevant context found, say "I don't have that information"

Context:
{context}
"""),
            ("human", "{question}")
        ])
        
        # Generate response
        messages = prompt.format_messages(
            context=retrieval['context'],
            question=query
        )
        response = self.llm.invoke(messages)
        
        return {
            'answer': response.content,
            'sources': retrieval['results'],
            'confidence': retrieval['avg_confidence'],
            'num_sources': retrieval['num_results']
        }

# Initialize production RAG
rag = ProductionRAG(embeddings, llm)

# Index documents
rag.index_documents(
    documents=[sample_document],
    sources=["return_policy_v2.md"],
    doc_types=["policy"]
)

# Test queries
test_queries = [
    "Can I return a laptop after 20 days?",
    "What's the shipping cost for returns?",
    "Are gift cards returnable?"
]

print("\n" + "="*60)
print("Testing Production RAG System")
print("="*60)

for query in test_queries:
    print(f"\n\nQuery: {query}")
    print("-" * 60)
    
    result = rag.answer_query(query)
    
    print(f"\nAnswer: {result['answer']}")
    print(f"\nConfidence: {result['confidence']:.3f}")
    print(f"Sources used: {result['num_sources']}")
    
    print("\nSource details:")
    for i, source in enumerate(result['sources'], 1):
        print(f"  [{i}] {source['metadata']['source']} - {source['metadata']['section']} (confidence: {source['confidence']})")

✅ Indexed 5 chunks from 1 documents

Testing Production RAG System


Query: Can I return a laptop after 20 days?
------------------------------------------------------------

Answer: According to our records, you can return a laptop within 14 days of delivery due to its nature as an electronic item. Since you mentioned that it has been 20 days, unfortunately, you would not be able to return the laptop [1].

Confidence: 0.727
Sources used: 3

Source details:
  [1] return_policy_v2.md - Standard Returns (confidence: medium)
  [2] return_policy_v2.md - Standard Returns (confidence: medium)
  [3] return_policy_v2.md - Refund Process (confidence: medium)


Query: What's the shipping cost for returns?
------------------------------------------------------------

Answer: Return shipping is free for defective items. For other returns, a $9.99 shipping fee will be deducted from your refund unless you use our prepaid return label [1].

Confidence: 0.761
Sources used: 3

Source details:
  [1] ret