## Setup: Prerequisites from Previous Parts

We'll reuse some components from Part 2 (Data Indexing) and Part 3 (Retrieval Strategies).

In [2]:
# Import sample data and required libraries
from sample_data import SAMPLE_TEXT
import nltk
import os
nltk.download('punkt', quiet=True)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import chromadb
from sentence_transformers import SentenceTransformer, CrossEncoder
from rank_bm25 import BM25Okapi
import numpy as np
from groq import Groq
import torch

print("✅ All libraries imported successfully!")

✅ All libraries imported successfully!


In [3]:
# Semantic chunking function (from Part 2)
def chunk_by_semantic_similarity(text: str, similarity_threshold: float = 0.5, overlap_sentences: int = 2, min_chunk_size: int = 2) -> list:
    """Semantic chunking based on sentence similarity using TF-IDF vectors"""
    sentences = nltk.sent_tokenize(text)
    if len(sentences) <= min_chunk_size:
        return [text]
    
    vectorizer = TfidfVectorizer(stop_words='english')
    try:
        sentence_vectors = vectorizer.fit_transform(sentences)
    except ValueError:
        return [text]
    
    similarities = [cosine_similarity(sentence_vectors[i:i+1], sentence_vectors[i+1:i+2])[0][0] 
                   for i in range(len(sentences) - 1)]
    
    chunk_boundaries = [0]
    current_chunk_size = 1
    
    for i, sim in enumerate(similarities):
        if sim < similarity_threshold and current_chunk_size >= min_chunk_size:
            chunk_boundaries.append(i + 1)
            current_chunk_size = 1
        else:
            current_chunk_size += 1
    
    if chunk_boundaries[-1] != len(sentences):
        chunk_boundaries.append(len(sentences))
    
    chunks = []
    for i in range(len(chunk_boundaries) - 1):
        start_idx = chunk_boundaries[i]
        end_idx = chunk_boundaries[i + 1]
        
        if i > 0 and overlap_sentences > 0:
            overlap_start = max(0, start_idx - overlap_sentences)
            chunk_sentences = sentences[overlap_start:end_idx]
        else:
            chunk_sentences = sentences[start_idx:end_idx]
        
        chunk = " ".join(chunk_sentences)
        chunks.append(chunk)
    
    return chunks

# Create semantic chunks
chunks = chunk_by_semantic_similarity(SAMPLE_TEXT, similarity_threshold=0.15, overlap_sentences=2)
print(f"✅ Created {len(chunks)} semantic chunks")

✅ Created 10 semantic chunks


In [4]:
# Build vector store (from Part 2)
print("🔄 Loading embedding model...")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
print("✅ Embedding model loaded!\n")

print(f"🔄 Generating embeddings for {len(chunks)} chunks...")
embeddings = embedding_model.encode(chunks, show_progress_bar=False)
print(f"✅ Generated {len(embeddings)} embeddings\n")

# Initialize ChromaDB
chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection(
    name="augmentation_demo_chunks",
    metadata={"hnsw:space": "cosine"}
)

# Store chunks with metadata
collection.add(
    ids=[f"chunk_{i}" for i in range(len(chunks))],
    embeddings=embeddings.tolist(),
    documents=chunks,
    metadatas=[{
        "chunk_index": i, 
        "length": len(chunk),
        "source": "Remote Work Policy",
        "topic": chunk.split('\n')[0] if '\n' in chunk else "General"
    } for i, chunk in enumerate(chunks)]
)
print(f"✅ Vector store ready with {collection.count()} chunks\n")

🔄 Loading embedding model...
✅ Embedding model loaded!

🔄 Generating embeddings for 10 chunks...
✅ Generated 10 embeddings

✅ Vector store ready with 10 chunks



In [5]:
# Hybrid retrieval function (from Part 3)
def hybrid_retrieval(query: str, top_k: int = 3, alpha: float = 0.7):
    """Hybrid retrieval combining BM25 + vector embeddings"""
    # Dense retrieval
    query_embedding = embedding_model.encode([query])[0]
    vector_results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=top_k
    )
    
    vector_scores_dict = {}
    for chunk_id, distance in zip(vector_results['ids'][0], vector_results['distances'][0]):
        chunk_idx = int(chunk_id.split('_')[1])
        vector_scores_dict[chunk_idx] = 1 - distance
    
    # BM25 (sparse) scores: computed once over all chunks, then used to
    # re-score only the chunks returned by the vector search above
    tokenized_chunks = [chunk.lower().split() for chunk in chunks]
    query_tokens = query.lower().split()
    bm25 = BM25Okapi(tokenized_chunks)
    all_bm25_scores = bm25.get_scores(query_tokens)
    
    bm25_scores_dict = {
        chunk_idx: all_bm25_scores[chunk_idx] for chunk_idx in vector_scores_dict
    }
    
    # Normalize BM25
    bm25_scores_list = list(bm25_scores_dict.values())
    bm25_min = min(bm25_scores_list)
    bm25_max = max(bm25_scores_list)
    bm25_scores_normalized_dict = {
        chunk_idx: (score - bm25_min) / (bm25_max - bm25_min + 1e-10)
        for chunk_idx, score in bm25_scores_dict.items()
    }
    
    # Combine scores
    hybrid_scores_dict = {
        chunk_idx: alpha * vector_scores_dict[chunk_idx] + (1 - alpha) * bm25_scores_normalized_dict[chunk_idx]
        for chunk_idx in vector_scores_dict.keys()
    }
    
    sorted_results = sorted(hybrid_scores_dict.items(), key=lambda x: x[1], reverse=True)
    
    results = []
    for chunk_idx, hybrid_score in sorted_results:
        # Get metadata from collection
        chunk_metadata = collection.get(ids=[f"chunk_{chunk_idx}"])['metadatas'][0]
        results.append({
            'chunk_idx': chunk_idx,
            'hybrid_score': hybrid_score,
            'vector_score': vector_scores_dict[chunk_idx],
            'bm25_score': bm25_scores_normalized_dict[chunk_idx],
            'content': chunks[chunk_idx],
            'metadata': chunk_metadata
        })
    
    return results

print("✅ Hybrid retrieval function ready!")

✅ Hybrid retrieval function ready!


In [6]:
# Initialize LLM client
groq_client = Groq(api_key=os.environ.get("GROQ_API_KEY"))
print("✅ LLM client initialized successfully!\n")

✅ LLM client initialized successfully!



## 1. Prompt Template Design

The prompt template is the blueprint for how we combine retrieved context with the user's question. A well-designed prompt template:

- **Provides clear instructions** to the LLM about its role
- **Structures the context** in an easy-to-parse format
- **Guides response format** (e.g., "cite sources", "be concise")
- **Sets constraints** (e.g., "only use provided context")

### Common Prompt Template Patterns:

1. **Basic QA Template**: Simple question + context format
2. **Instructional Template**: Detailed instructions with role definition
3. **Citation-focused Template**: Emphasizes source attribution
4. **Chain-of-Thought Template**: Encourages step-by-step reasoning
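
One way to keep these patterns interchangeable is to store them in a lookup table keyed by name and fill them with `str.format`. The sketch below is illustrative only: the `TEMPLATES` dict, the two shortened template strings, and `render_prompt` are hypothetical names, not part of this notebook's code.

```python
# Hypothetical, shortened templates; both expose the same two placeholders.
TEMPLATES = {
    "basic": "Context:\n{context}\n\nQuestion: {query}\n\nAnswer:",
    "instructional": (
        "Use ONLY the context below. If the answer is missing, say so.\n\n"
        "Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"
    ),
}


def render_prompt(template_name: str, context: str, query: str) -> str:
    """Look up a template by name and fill in its placeholders."""
    try:
        template = TEMPLATES[template_name]
    except KeyError:
        raise ValueError(f"Unknown template: {template_name}") from None
    # str.format only interprets braces in the template itself;
    # braces inside `context` or `query` are inserted verbatim.
    return template.format(context=context, query=query)


prompt = render_prompt(
    "basic",
    context="[Chunk 1]\nMinimum speed: 25 Mbps download.",
    query="What is the minimum download speed?",
)
print(prompt)
```

Because every template shares the `{context}` and `{query}` placeholders, switching patterns only requires changing the name passed in.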

In [7]:
# Define various prompt templates

# Template 1: Basic QA
BASIC_QA_TEMPLATE = """Context:
{context}

Question: {query}

Answer:"""

# Template 2: Instructional with Role
INSTRUCTIONAL_TEMPLATE = """You are a helpful assistant answering questions about remote work policies.

Instructions:
- Use ONLY the information provided in the context below
- If the answer is not in the context, say "I don't have enough information to answer this question"
- Be concise and accurate
- Cite the relevant section when possible

Context:
{context}

Question: {query}

Answer:"""

# Template 3: Citation-Focused
CITATION_TEMPLATE = """You are a policy assistant that provides accurate answers with proper citations.

Below is relevant information from our company's Remote Work Policy:

{context}

Question: {query}

Instructions:
- Answer the question using the information above
- Include citations in the format [Source: chunk_X] after each fact
- If information is not available, clearly state this

Answer:"""

# Template 4: Chain-of-Thought
CHAIN_OF_THOUGHT_TEMPLATE = """You are an analytical assistant that explains your reasoning.

Context Information:
{context}

Question: {query}

Instructions:
1. First, identify the relevant information from the context
2. Explain your reasoning step-by-step
3. Provide a clear, final answer

Response:"""

print("✅ Prompt templates defined!")
print("\nAvailable templates:")
print("1. BASIC_QA_TEMPLATE")
print("2. INSTRUCTIONAL_TEMPLATE")
print("3. CITATION_TEMPLATE")
print("4. CHAIN_OF_THOUGHT_TEMPLATE")

# Demo: Show what each template looks like with sample data
print("\n" + "=" * 80)
print("DEMO: Prompt Template Examples")
print("=" * 80 + "\n")

# Sample data for demonstration
sample_context = """[Chunk 1]
Remote workers must have access to reliable internet connection with minimum speeds of 25 Mbps download and 5 Mbps upload.

[Chunk 2]
All devices must have up-to-date antivirus software and firewalls enabled. The IT department provides technical support."""

sample_query = "What are the internet requirements for remote work?"

print("-" * 80)
print("Template 1: BASIC_QA_TEMPLATE")
print("-" * 80)
example1 = BASIC_QA_TEMPLATE.format(context=sample_context, query=sample_query)
print(example1)
print()

print("-" * 80)
print("Template 2: INSTRUCTIONAL_TEMPLATE")
print("-" * 80)
example2 = INSTRUCTIONAL_TEMPLATE.format(context=sample_context, query=sample_query)
print(example2)
print()

print("-" * 80)
print("Template 3: CITATION_TEMPLATE")
print("-" * 80)
example3 = CITATION_TEMPLATE.format(context=sample_context, query=sample_query)
print(example3)
print()

print("-" * 80)
print("Template 4: CHAIN_OF_THOUGHT_TEMPLATE")
print("-" * 80)
example4 = CHAIN_OF_THOUGHT_TEMPLATE.format(context=sample_context, query=sample_query)
print(example4)
print()

print("=" * 80)
print("💡 Notice how each template structures the same information differently!")
print("=" * 80)

✅ Prompt templates defined!

Available templates:
1. BASIC_QA_TEMPLATE
2. INSTRUCTIONAL_TEMPLATE
3. CITATION_TEMPLATE
4. CHAIN_OF_THOUGHT_TEMPLATE

DEMO: Prompt Template Examples

--------------------------------------------------------------------------------
Template 1: BASIC_QA_TEMPLATE
--------------------------------------------------------------------------------
Context:
[Chunk 1]
Remote workers must have access to reliable internet connection with minimum speeds of 25 Mbps download and 5 Mbps upload.

[Chunk 2]
All devices must have up-to-date antivirus software and firewalls enabled. The IT department provides technical support.

Question: What are the internet requirements for remote work?

Answer:

--------------------------------------------------------------------------------
Template 2: INSTRUCTIONAL_TEMPLATE
--------------------------------------------------------------------------------
You are a helpful assistant answering questions about remote work policies.

Instr

## 2. Context Integration Strategies

After retrieving relevant chunks, we need to decide **how** to integrate them into the prompt. Different strategies work better for different scenarios:

### Strategy 1: **Simple Concatenation**
- Join all chunks with separators
- Fast and preserves all details
- Risk: May exceed token limits with many chunks

### Strategy 2: **Numbered/Labeled Context**
- Add identifiers to each chunk (e.g., [Chunk 1], [Section A])
- Enables easy citation and reference
- Better for attribution and debugging

### Strategy 3: **Metadata-Enriched Context**
- Include metadata (source, date, topic) with each chunk
- Provides additional context to the LLM
- Useful for multi-document retrieval

### Strategy 4: **Summarized Context**
- Use LLM to summarize retrieved chunks before augmentation
- Reduces token usage
- Risk: May lose important details

### Strategy 5: **Hierarchical Context**
- Organize chunks by relevance or topic
- Present most relevant first
- Helps LLM prioritize information
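
As a rough illustration of Strategies 2 and 5 combined, the sketch below sorts results by score and labels each chunk before joining them. The `results` list and its scores are made-up sample data, and `ordered_labeled_context` is a hypothetical helper; the dictionary keys (`chunk_idx`, `hybrid_score`, `content`) mirror the output of `hybrid_retrieval`.

```python
def ordered_labeled_context(results: list[dict]) -> str:
    """Put the highest-scoring chunks first and tag each with its ID and score."""
    ranked = sorted(results, key=lambda r: r["hybrid_score"], reverse=True)
    parts = [
        f"[Chunk {r['chunk_idx']} | Score: {r['hybrid_score']:.2f}]\n{r['content']}"
        for r in ranked
    ]
    return "\n\n".join(parts)


# Made-up retrieval results, for illustration only
results = [
    {"chunk_idx": 4, "hybrid_score": 0.41, "content": "Antivirus software is required."},
    {"chunk_idx": 7, "hybrid_score": 0.83, "content": "Minimum speed is 25 Mbps download."},
]
context = ordered_labeled_context(results)
print(context)
```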

In [8]:
# Context Integration Strategy Implementations


def simple_concatenation(retrieved_results):
    """Strategy 1: Simple concatenation with separators"""
    # Joins all chunks with separators - fast, preserves all details, but no structure
    context = "\n\n---\n\n".join([result["content"] for result in retrieved_results])
    return context


def numbered_context(retrieved_results):
    """Strategy 2: Add chunk numbers for citation"""
    # Adds [Chunk X] labels to enable easy citation and reference tracking
    context_parts = []
    for result in retrieved_results:
        context_parts.append(f"[Chunk {result['chunk_idx']}]\n{result['content']}")
    return "\n\n".join(context_parts)


def metadata_enriched_context(retrieved_results):
    """Strategy 3: Include metadata with each chunk"""
    # Adds source, topic, and chunk ID headers - provides rich context to LLM
    context_parts = []
    for result in retrieved_results:
        metadata = result["metadata"]
        header = f"[Source: {metadata['source']} | Topic: {metadata['topic']} | Chunk {result['chunk_idx']}]"
        context_parts.append(f"{header}\n{result['content']}")
    return "\n\n".join(context_parts)


def hierarchical_context(retrieved_results):
    """Strategy 5: Organize by relevance score"""
    # Shows relevance scores with each chunk - helps LLM prioritize information
    context_parts = []
    for result in retrieved_results:
        relevance = (
            "High"
            if result["hybrid_score"] > 0.7
            else "Medium" if result["hybrid_score"] > 0.5 else "Low"
        )
        header = f"[Relevance: {relevance} | Score: {result['hybrid_score']:.3f}]"
        context_parts.append(f"{header}\n{result['content']}")
    return "\n\n".join(context_parts)


def summarized_context(retrieved_results, groq_client):
    """Strategy 4: Summarize chunks before augmentation"""
    # Uses LLM to compress context - reduces tokens but may lose details
    # First concatenate all chunks
    full_context = "\n\n".join([result["content"] for result in retrieved_results])

    # Ask LLM to summarize
    summary_prompt = f"""Summarize the following text concisely, preserving key facts and details:

{full_context}

Summary:"""

    response = groq_client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[{"role": "user", "content": summary_prompt}],
        temperature=0.3,
        max_tokens=300,
    )

    return response.choices[0].message.content.strip()


print("✅ Context integration strategies implemented!")
print("\nAvailable strategies:")
print("1. simple_concatenation()")
print("2. numbered_context()")
print("3. metadata_enriched_context()")
print("4. summarized_context()")
print("5. hierarchical_context()")

# Demo: Test each strategy with a sample retrieval
print("\n" + "=" * 80)
print("DEMO: Context Integration Strategies")
print("=" * 80 + "\n")

sample_query = "What are the internet speed requirements?"
print(f"🔍 Query: '{sample_query}'")
print("📥 Retrieving chunks...\n")

sample_results = hybrid_retrieval(sample_query, top_k=2)

# Show raw retrieved results BEFORE formatting
print("=" * 80)
print("RAW RETRIEVED RESULTS (Before Context Integration)")
print("=" * 80)
for i, result in enumerate(sample_results, 1):
    print(f"\nResult {i}:")
    print(f"  Chunk ID: {result['chunk_idx']}")
    print(f"  Hybrid Score: {result['hybrid_score']:.4f}")
    print(f"  Vector Score: {result['vector_score']:.4f}")
    print(f"  BM25 Score: {result['bm25_score']:.4f}")
    print(f"  Metadata: {result['metadata']}")
    print(f"  Full Content:")
    print(f"  {'-' * 76}")
    print(f"  {result['content']}")
    print(f"  {'-' * 76}")
print("\n" + "=" * 80)
print()

print("-" * 80)
print("Strategy 1: Simple Concatenation")
print("💡 Joins chunks with separators - fast, preserves all details")
print("-" * 80)
context1 = simple_concatenation(sample_results)
print(context1)
print()

print("-" * 80)
print("Strategy 2: Numbered Context")
print("💡 Adds [Chunk X] labels for easy citation and reference tracking")
print("-" * 80)
context2 = numbered_context(sample_results)
print(context2)
print()

print("-" * 80)
print("Strategy 3: Metadata-Enriched Context")
print("💡 Includes source, topic, and chunk info - provides rich context to LLM")
print("-" * 80)
context3 = metadata_enriched_context(sample_results)
print(context3)
print()

# Strategy 4 (summarized_context) is not run in this demo; it requires an extra LLM call
print("-" * 80)
print("Strategy 5: Hierarchical Context")
print("💡 Shows relevance scores - helps LLM prioritize important information")
print("-" * 80)
context4 = hierarchical_context(sample_results)
print(context4)
print()

print("=" * 80)
print("💡 Each strategy formats the same retrieved chunks differently!")
print("=" * 80)
print()

✅ Context integration strategies implemented!

Available strategies:
1. simple_concatenation()
2. numbered_context()
3. metadata_enriched_context()
4. summarized_context()
5. hierarchical_context()

DEMO: Context Integration Strategies

🔍 Query: 'What are the internet speed requirements?'
📥 Retrieving chunks...

RAW RETRIEVED RESULTS (Before Context Integration)

Result 1:
  Chunk ID: 3
  Hybrid Score: 0.6077
  Vector Score: 0.4396
  BM25 Score: 1.0000
  Metadata: {'chunk_index': 3, 'source': 'Remote Work Policy', 'length': 495, 'topic': 'Eligible employees must '}
  Full Content:
  ----------------------------------------------------------------------------
  Eligible employees must 
demonstrate strong performance in their current role, possess excellent communication skills, and have 
access to appropriate technology and workspace. The eligibility criteria ensure that remote work benefits 
both the employee and the company. Performance reviews will be considered when evaluati

## 3. Complete Augmentation Pipeline

Now let's build a complete augmentation function that combines:
1. Retrieval (from Part 3)
2. Context integration strategy
3. Prompt template selection
4. Final prompt generation
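
The four steps can be sketched as a single function that wires together a retriever, a context formatter, and a template. This is a simplified stand-in for illustration: `fake_retrieve`, `FORMATTERS`, `TEMPLATES`, and `build_prompt` are hypothetical, and the retriever returns hard-coded data rather than calling `hybrid_retrieval`.

```python
from typing import Callable


def fake_retrieve(query: str, top_k: int = 2) -> list[dict]:
    """Stand-in retriever returning hard-coded chunks (step 1)."""
    corpus = [
        {"chunk_idx": 0, "content": "Minimum speed is 25 Mbps download."},
        {"chunk_idx": 1, "content": "Firewalls must be enabled."},
    ]
    return corpus[:top_k]


# Step 2: context formatters, selected by name
FORMATTERS: dict[str, Callable[[list[dict]], str]] = {
    "simple": lambda rs: "\n\n---\n\n".join(r["content"] for r in rs),
    "numbered": lambda rs: "\n\n".join(
        f"[Chunk {r['chunk_idx']}]\n{r['content']}" for r in rs
    ),
}

# Step 3: prompt templates, selected by name
TEMPLATES = {
    "basic": "Context:\n{context}\n\nQuestion: {query}\n\nAnswer:",
}


def build_prompt(
    query: str,
    top_k: int = 2,
    context_strategy: str = "numbered",
    template_type: str = "basic",
) -> dict:
    """Retrieve, format the context, then fill the template."""
    if context_strategy not in FORMATTERS:
        raise ValueError(f"Unknown context strategy: {context_strategy}")
    if template_type not in TEMPLATES:
        raise ValueError(f"Unknown template type: {template_type}")

    results = fake_retrieve(query, top_k)            # Step 1
    context = FORMATTERS[context_strategy](results)  # Step 2
    prompt = TEMPLATES[template_type].format(        # Steps 3-4
        context=context, query=query
    )
    return {"prompt": prompt, "context": context, "retrieved_results": results}


result = build_prompt("What is the minimum download speed?")
print(result["prompt"])
```

A lookup table keeps the list of valid strategy names in one place, which can be easier to extend than a chain of `if`/`elif` branches.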

In [9]:
def augment_query(
    query: str,
    top_k: int = 3,
    context_strategy: str = "numbered",
    template_type: str = "instructional"
):
    """
    Complete augmentation pipeline:
    1. Retrieve relevant chunks
    2. Apply context integration strategy
    3. Build final prompt using template
    
    Args:
        query: User's question
        top_k: Number of chunks to retrieve
        context_strategy: 'simple', 'numbered', 'metadata', 'hierarchical', 'summarized'
        template_type: 'basic', 'instructional', 'citation', 'chain_of_thought'
    
    Returns:
        dict with 'prompt', 'retrieved_results', 'context'
    """
    
    # Step 1: Retrieve relevant chunks
    print(f"📥 Retrieving top {top_k} chunks...")
    retrieved_results = hybrid_retrieval(query, top_k=top_k)
    print(f"✅ Retrieved {len(retrieved_results)} chunks\n")
    
    # Step 2: Apply context integration strategy
    print(f"🔧 Applying context strategy: {context_strategy}")
    if context_strategy == "simple":
        context = simple_concatenation(retrieved_results)
    elif context_strategy == "numbered":
        context = numbered_context(retrieved_results)
    elif context_strategy == "metadata":
        context = metadata_enriched_context(retrieved_results)
    elif context_strategy == "hierarchical":
        context = hierarchical_context(retrieved_results)
    elif context_strategy == "summarized":
        context = summarized_context(retrieved_results, groq_client)
    else:
        raise ValueError(f"Unknown context strategy: {context_strategy}")
    
    print(f"✅ Context prepared ({len(context)} chars)\n")
    
    # Step 3: Select and apply prompt template
    print(f"📝 Applying template: {template_type}")
    if template_type == "basic":
        template = BASIC_QA_TEMPLATE
    elif template_type == "instructional":
        template = INSTRUCTIONAL_TEMPLATE
    elif template_type == "citation":
        template = CITATION_TEMPLATE
    elif template_type == "chain_of_thought":
        template = CHAIN_OF_THOUGHT_TEMPLATE
    else:
        raise ValueError(f"Unknown template type: {template_type}")
    
    # Build final prompt
    final_prompt = template.format(context=context, query=query)
    print(f"✅ Final prompt ready ({len(final_prompt)} chars)\n")
    
    return {
        'prompt': final_prompt,
        'retrieved_results': retrieved_results,
        'context': context,
        'query': query
    }

print("✅ Complete augmentation pipeline ready!")

✅ Complete augmentation pipeline ready!


## 4. Demo: Augmentation Pipeline in Action

Let's run the complete augmentation pipeline on a practical example, using the numbered context strategy and the citation template.

In [10]:
# Test query
test_query = "What are the internet speed requirements for remote work?"

print("=" * 80)
print("AUGMENTATION DEMO: End-to-End Pipeline with Citation")
print("=" * 80)
print(f"\n🔍 User Query: '{test_query}'\n")

print("💡 Why Citations Matter:")
print("   - Builds user trust and credibility")
print("   - Allows verification of information")
print("   - Reduces hallucinations by grounding answers in sources")
print("   - Enables traceability for compliance and auditing")
print("   - Helps users explore related information\n")

# Use numbered context + citation template (best for attribution)
augmented_result = augment_query(
    query=test_query,
    top_k=3,
    context_strategy="numbered",
    template_type="citation"
)

print("\n📋 Final Augmented Prompt:")
print("=" * 80)
print(augmented_result['prompt'])
print("=" * 80)

AUGMENTATION DEMO: End-to-End Pipeline with Citation

🔍 User Query: 'What are the internet speed requirements for remote work?'

💡 Why Citations Matter:
   - Builds user trust and credibility
   - Allows verification of information
   - Reduces hallucinations by grounding answers in sources
   - Enables traceability for compliance and auditing
   - Helps users explore related information

📥 Retrieving top 3 chunks...
✅ Retrieved 3 chunks

🔧 Applying context strategy: numbered
✅ Context prepared (1523 chars)

📝 Applying template: citation
✅ Final prompt ready (1940 chars)


📋 Final Augmented Prompt:
You are a policy assistant that provides accurate answers with proper citations.

Below is relevant information from our company's Remote Work Policy:

[Chunk 1]

Remote Work Policy

TOPIC: Introduction and Benefits
This remote work policy establishes guidelines for employees who work from home or other remote locations. Our company recognizes the benefits of remote w

## 5. Advanced: Context Window Management

---

### 💡 Note: Context Window vs. LLM Memory

**Context Window** = Maximum tokens the LLM can process in one request (what we're managing here)

**LLM Memory** = Conversation history across multiple turns (not covered in this notebook)

In a complete RAG chatbot, your context window must fit:
- System prompt + Retrieved chunks (this notebook) + Conversation history (memory) + Current query + Response space

Context window management ensures retrieved chunks fit. Memory management (conversation history) would be an additional concern for multi-turn chatbots.
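
To make the budget concrete, the arithmetic below subtracts each fixed component from the window to find what is left for retrieved chunks. All of the numbers are assumed values chosen for illustration, not measurements from this notebook, and `remaining_chunk_budget` is a hypothetical helper.

```python
def remaining_chunk_budget(
    context_window: int,
    system_prompt: int,
    history: int,
    query: int,
    response_reserve: int,
) -> int:
    """Tokens left for retrieved chunks after every other component is counted."""
    used = system_prompt + history + query + response_reserve
    return max(0, context_window - used)


# Assumed token counts, for illustration only
budget = remaining_chunk_budget(
    context_window=8192,
    system_prompt=150,
    history=1200,
    query=40,
    response_reserve=500,
)
print(budget)  # 8192 - (150 + 1200 + 40 + 500) = 6302
```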

### The Problem:
Every LLM has a **context window** (maximum token limit). For example:
- GPT-3.5-turbo: ~4K-16K tokens (depending on version)
- GPT-4: ~8K-128K tokens (depending on version)
- Llama-3.1-8b: ~128K tokens

When you retrieve many chunks or have long documents, the combined context can exceed this limit, causing:
- API errors (request rejected)
- Truncated context (LLM only sees partial information)
- Increased costs (more tokens = higher price)

### Real-World Challenge:
**What if retrieved context exceeds the LLM's token limit?**

You need intelligent strategies to fit context within budget while preserving the most important information.

### Available Strategies:

1. **Truncation (Simple & Fast)**
   - Keep only top-N highest-scoring chunks
   - Stops adding chunks when token limit reached
   - ✅ Preserves most relevant information
   - ❌ May lose valuable context from lower-ranked chunks

2. **Summarization (Intelligent Compression)**
   - Use another LLM call to compress context
   - Condenses multiple chunks into key points
   - ✅ Maintains semantic meaning in less space
   - ❌ Adds latency and cost (extra LLM call)
   - ❌ May lose specific details or nuances

3. **Chunked Processing (Divide & Conquer)**
   - Process chunks in batches, generate partial answers
   - Merge partial answers into final response
   - ✅ Handles very large contexts
   - ❌ Complex implementation, multiple LLM calls

4. **Smart Filtering (Preprocessing)**
   - Remove redundant sentences using similarity
   - Filter out low-information content
   - ✅ Reduces noise without LLM calls
   - ❌ Requires additional processing

**In this demo, we'll implement and compare strategies #1 (Truncation) and #2 (Summarization).**
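
Strategy #4 is not among the strategies implemented in this demo. As a rough sketch of the idea, the code below drops sentences that overlap heavily with ones already kept. It uses Jaccard overlap on lowercase word sets as a simple stand-in for a similarity measure (TF-IDF or embedding similarity would be alternatives); the `0.6` threshold, the regex sentence splitter, and the function names are assumptions made for illustration.

```python
import re


def jaccard(a: set, b: set) -> float:
    """Share of words two sets have in common (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def drop_redundant_sentences(text: str, threshold: float = 0.6) -> str:
    """Keep a sentence only if it is not too similar to any already-kept one."""
    # Naive splitter: break after ., !, or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    kept, kept_words = [], []
    for sentence in sentences:
        words = set(re.findall(r"\w+", sentence.lower()))
        if all(jaccard(words, seen) < threshold for seen in kept_words):
            kept.append(sentence)
            kept_words.append(words)
    return " ".join(kept)


text = (
    "Remote workers need 25 Mbps download speed. "
    "Remote workers need a 25 Mbps download speed. "
    "Firewalls must be enabled."
)
filtered = drop_redundant_sentences(text)
print(filtered)
```

The second sentence shares seven of eight distinct words with the first, so it falls above the threshold and is removed; the unrelated third sentence is kept.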



In [11]:
# ============================================================================
# TOKEN ESTIMATION & CONTEXT WINDOW MANAGEMENT
# ============================================================================

def estimate_tokens(text):
    """
    Estimate the number of tokens in text.
    
    This is a rough approximation. Real tokenization depends on the specific
    tokenizer used by the LLM (e.g., tiktoken for OpenAI models).
    
    Rule of thumb for English:
    - 1 token ≈ 4 characters
    - 1 token ≈ 0.75 words
    - 100 tokens ≈ 75 words
    
    For production use, consider using actual tokenizers:
    - OpenAI: tiktoken library
    - Hugging Face: transformers.AutoTokenizer
    
    Args:
        text: Input text to estimate tokens for
        
    Returns:
        Estimated token count (int)
    """
    return len(text) // 4


def manage_context_window(
    retrieved_results,
    max_tokens=2000,
    strategy="truncate"
):
    """
    Intelligently manage context to fit within LLM token budget.
    
    This function implements two strategies for handling contexts that may
    exceed the LLM's maximum token limit:
    
    1. TRUNCATION: Greedily add chunks (sorted by relevance) until budget exhausted
       - Fast and simple
       - Preserves most relevant chunks
       - No additional LLM calls
       
    2. SUMMARIZATION: Use LLM to compress context into shorter form
       - Better semantic preservation
       - Requires extra LLM call (adds latency + cost)
       - May lose specific details
    
    Args:
        retrieved_results: List of dicts with 'content', 'chunk_idx', 
                          'hybrid_score', and 'metadata' keys
        max_tokens: Maximum tokens allowed for context (reserve space for 
                   prompt template and LLM response)
        strategy: 'truncate' or 'summarize'
        
    Returns:
        List of chunk dicts that fit within token budget
        
    Example:
        >>> chunks = hybrid_retrieval("query", top_k=10)
        >>> managed = manage_context_window(chunks, max_tokens=2000, strategy="truncate")
        >>> # managed now contains only chunks that fit in 2000 tokens
    """
    
    if strategy == "truncate":
        # STRATEGY 1: TRUNCATION
        # Iterate through chunks (already sorted by relevance) and add them
        # until we hit the token budget limit
        
        selected = []  # Chunks that fit within budget
        current_tokens = 0  # Running token count
        
        for result in retrieved_results:
            # Estimate tokens for this chunk
            chunk_tokens = estimate_tokens(result['content'])
            
            # Check if adding this chunk would exceed budget
            if current_tokens + chunk_tokens <= max_tokens:
                selected.append(result)
                current_tokens += chunk_tokens
            else:
                # Budget exhausted, stop adding chunks
                # Note: This means lower-ranked chunks are discarded
                break
        
        print(f"✂️ Truncated to {len(selected)}/{len(retrieved_results)} chunks")
        print(f"📊 Estimated tokens: {current_tokens}/{max_tokens}")
        print(f"💰 Token savings: {estimate_tokens(''.join([r['content'] for r in retrieved_results])) - current_tokens} tokens")
        return selected
    
    elif strategy == "summarize":
        # STRATEGY 2: SUMMARIZATION
        # First check if context exceeds budget, then use LLM to compress
        
        # Combine all chunks into single text
        full_text = "\n\n".join([r['content'] for r in retrieved_results])
        total_tokens = estimate_tokens(full_text)
        
        if total_tokens > max_tokens:
            # Context too large - need to summarize
            print(f"📉 Context too large ({total_tokens} tokens > {max_tokens} limit)")
            print("🤖 Calling LLM to summarize...")
            
            # Use the summarized_context function (defined earlier) to compress
            # This makes an LLM call to condense the context
            summary = summarized_context(retrieved_results, groq_client)
            summary_tokens = estimate_tokens(summary)
            
            print(f"✅ Summarized to {summary_tokens} tokens (reduction: {total_tokens - summary_tokens} tokens)")
            
            # Return as a pseudo-chunk with special metadata
            return [{
                'chunk_idx': -1,  # Special ID indicating this is a summary
                'content': summary,
                'hybrid_score': 1.0,  # Max score since it's a summary of all
                'metadata': {
                    'source': 'Summary',
                    'topic': 'Consolidated',
                    'original_chunks': len(retrieved_results),
                    'original_tokens': total_tokens
                }
            }]
        else:
            # Context already fits within budget - no action needed
            print(f"✅ Context within budget ({total_tokens}/{max_tokens} tokens)")
            return retrieved_results
    
    # Fallback: return original results if strategy not recognized
    return retrieved_results


# ============================================================================
# DEMONSTRATION: Context Window Management
# ============================================================================

print("\n" + "=" * 80)
print("CONTEXT WINDOW MANAGEMENT DEMO")
print("=" * 80 + "\n")

print("💡 Scenario: We retrieve 5 chunks, but they might exceed our token budget")
print("   Let's see how truncation and summarization handle this...\n")

# Retrieve more chunks than usual to demonstrate the problem
large_retrieval = hybrid_retrieval(test_query, top_k=5)

# Show original context size
original_text = "\n\n".join([r['content'] for r in large_retrieval])
original_tokens = estimate_tokens(original_text)
print(f"📏 Original retrieval: {len(large_retrieval)} chunks, ~{original_tokens} tokens\n")

# -------------------------
# STRATEGY 1: TRUNCATION
# -------------------------
print("=" * 80)
print("Strategy 1: Truncation (Keep top chunks until budget exhausted)")
print("=" * 80)
managed_truncate = manage_context_window(large_retrieval, max_tokens=500, strategy="truncate")
print(f"\n👉 Result: Kept {len(managed_truncate)} most relevant chunks")
print("   Use case: When you want to preserve exact wording of top chunks\n")

# -------------------------
# STRATEGY 2: SUMMARIZATION
# -------------------------
print("=" * 80)
print("Strategy 2: Summarization (Compress context using LLM)")
print("=" * 80)
managed_summarize = manage_context_window(large_retrieval, max_tokens=500, strategy="summarize")
print(f"\n👉 Result: {len(managed_summarize)} chunk(s) - {'Summary' if managed_summarize[0]['chunk_idx'] == -1 else 'Original'}")
print("   Use case: When you need to fit more information in less space")

print("\n" + "=" * 80)
print("🎯 Key Takeaway:")
print("   - Truncation: Fast, preserves details, but discards lower-ranked chunks")
print("   - Summarization: Compresses all info, but adds latency and may lose details")
print("   - Choose based on: token budget, latency requirements, and detail importance")
print("=" * 80)


CONTEXT WINDOW MANAGEMENT DEMO

💡 Scenario: We retrieve 5 chunks, but they might exceed our token budget
   Let's see how truncation and summarization handle this...

📏 Original retrieval: 5 chunks, ~608 tokens

Strategy 1: Truncation (Keep top chunks until budget exhausted)
✂️ Truncated to 4/5 chunks
📊 Estimated tokens: 490/500
💰 Token savings: 116 tokens

👉 Result: Kept 4 most relevant chunks
   Use case: When you want to preserve exact wording of top chunks

Strategy 2: Summarization (Compress context using LLM)
📉 Context too large (608 tokens > 500 limit)
🤖 Calling LLM to summarize...
✅ Summarized to 218 tokens (reduction: 390 tokens)

👉 Result: 1 chunk(s) - Summary
   Use case: When you need to fit more information in less space

🎯 Key Takeaway:
   - Truncation: Fast, preserves details, but discards lower-ranked chunks
   - Summarization: Compresses all info, but adds latency and may lose details
   - Choose based on: token budget, latency requirements, and detail importance

## 6. Comparison: Different Augmentation Approaches

Let's compare how different augmentation strategies affect the final answer quality. We'll briefly use generation here for comparison purposes only.

In [14]:
# Helper function for generation (used only for comparison)
def generate_answer(augmented_result, model="llama-3.1-8b-instant", temperature=0.3):
    """Generate answer using LLM with augmented prompt."""
    response = groq_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": augmented_result['prompt']}],
        temperature=temperature,
        max_tokens=500
    )
    return response.choices[0].message.content.strip()

print("\n" + "=" * 80)
print("COMPARATIVE ANALYSIS: AUGMENTATION STRATEGIES")
print("=" * 80)

test_queries = [
    "What are the internet speed requirements for remote work?",
    "How is remote work performance evaluated?",
    "What communication tools should remote workers use?"
]

strategies = [
    ("simple", "basic", "Simple + Basic"),
    ("numbered", "citation", "Numbered + Citation"),
    ("metadata", "instructional", "Metadata + Instructional"),
]

# Test first query with all strategies
query = test_queries[0]
print(f"\n🔍 Query: '{query}'\n")

results_comparison = []

for context_strat, template_type, label in strategies:
    print(f"\n{'='*80}")
    print(f"Testing: {label}")
    print(f"{'='*80}")
    
    # Augment
    augmented = augment_query(
        query=query,
        top_k=3,
        context_strategy=context_strat,
        template_type=template_type
    )
    
    # Generate
    print("🤖 Generating answer...")
    answer = generate_answer(augmented, temperature=0.2)  # Low temp for consistency
    
    # Calculate metrics (character counts, not tokens)
    prompt_length = len(augmented['prompt'])
    answer_length = len(answer)
    context_length = len(augmented['context'])
    
    results_comparison.append({
        'strategy': label,
        'answer': answer,
        'prompt_length': prompt_length,
        'answer_length': answer_length,
        'context_length': context_length
    })
    
    print(f"\n📝 Answer ({answer_length} chars):")
    print("-" * 80)
    print(answer)
    print("-" * 80)

# Summary table
print("\n" + "=" * 80)
print("COMPARISON SUMMARY")
print("=" * 80)
print(f"\n{'Strategy':<30} {'Prompt':>10} {'Context':>10} {'Answer':>10}")
print("-" * 80)
for r in results_comparison:
    print(f"{r['strategy']:<30} {r['prompt_length']:>10} {r['context_length']:>10} {r['answer_length']:>10}")
print("-" * 80)

print("\n💡 Insights:")
print("   - Citation template produces answers with source attribution")
print("   - Metadata enrichment provides more context to the LLM")
print("   - Simple strategies use shorter prompts (so lower latency and cost) but may lack structure")
print("   - Choose based on your use case: speed vs. attribution vs. detail")


COMPARATIVE ANALYSIS: AUGMENTATION STRATEGIES

🔍 Query: 'What are the internet speed requirements for remote work?'


Testing: Simple + Basic
📥 Retrieving top 3 chunks...
✅ Retrieved 3 chunks

🔧 Applying context strategy: simple
✅ Context prepared (1503 chars)

📝 Applying template: basic
✅ Final prompt ready (1590 chars)

🤖 Generating answer...

📝 Answer (100 chars):
--------------------------------------------------------------------------------
The internet speed requirements for remote work are a minimum of 25 Mbps download and 5 Mbps upload.
--------------------------------------------------------------------------------

Testing: Numbered + Citation
📥 Retrieving top 3 chunks...
✅ Retrieved 3 chunks

🔧 Applying context strategy: numbered
✅ Context prepared (1523 chars)

📝 Applying template: citation
✅ Final prompt ready (1940 chars)

🤖 Generating answer...

📝 Answer (442 chars):
----------------------------------------------------------

## 7. Best Practices & Recommendations

### ✅ Key Takeaways:

#### **Prompt Template Selection**
- **For customer support**: Use the instructional template with clear constraints
- **For research/analysis**: Use the chain-of-thought template
- **For compliance/legal**: Use the citation template with strict source requirements
- **For chatbots**: Use conversational templates with personality

#### **Context Integration Strategy**
- **Default choice**: Numbered context (enables citations, minimal overhead)
- **When context is large**: Use summarized context or truncation
- **Multi-document scenarios**: Use metadata-enriched context
- **High-precision tasks**: Use hierarchical context (relevance-sorted)
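
As a minimal sketch of the numbered-context default, the snippet below labels each chunk so the model can refer back to it. It assumes each retrieved chunk is a dict with a `content` key, as in the retrieval results used in this notebook. The function name `format_numbered_context`, the `[Source N]` label, and the second example sentence are illustrative, not the notebook's own helper or data.

```python
def format_numbered_context(chunks):
    """Join chunks into one string, prefixing each with a [Source N] label."""
    parts = []
    for i, chunk in enumerate(chunks, start=1):
        parts.append(f"[Source {i}]\n{chunk['content'].strip()}")
    return "\n\n".join(parts)


# Made-up example chunks, only to show the output shape.
example_chunks = [
    {"content": "Remote workers need a minimum of 25 Mbps download and 5 Mbps upload."},
    {"content": "Placeholder text standing in for a second retrieved chunk."},
]
numbered_context = format_numbered_context(example_chunks)
print(numbered_context)
```

The only overhead is one short label per chunk, which is why numbering is a reasonable default when citations are wanted.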

#### **Token Budget Management**
- Always estimate token usage before sending a prompt to the LLM
- Reserve 20-30% of the context window for the LLM's response
- Consider using a smaller model for the summarization step
- Implement graceful fallbacks (for example, truncation) for when the context is too large
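
The budgeting advice above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's `estimate_tokens` or `manage_context_window`: the helper names, the 4-characters-per-token heuristic, and the 8,192-token window are all hypothetical choices.

```python
def rough_token_estimate(text, chars_per_token=4):
    """Crude estimate assuming ~4 characters per token (a heuristic, not a tokenizer)."""
    return max(1, len(text) // chars_per_token)


def context_budget(context_window, response_fraction=0.25, prompt_overhead=0):
    """Tokens left for retrieved context after reserving room for the response."""
    if not 0 <= response_fraction < 1:
        raise ValueError("response_fraction must be in [0, 1)")
    reserved_for_response = int(context_window * response_fraction)
    return max(0, context_window - reserved_for_response - prompt_overhead)


def fit_chunks(chunks, budget):
    """Keep chunks in ranked order until the next one would exceed the budget."""
    kept, used = [], 0
    for chunk in chunks:
        cost = rough_token_estimate(chunk["content"])
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept, used


# Hypothetical numbers: an 8,192-token window, 25% reserved for the response,
# and 200 tokens set aside for the instructions and the question.
budget = context_budget(context_window=8192, response_fraction=0.25, prompt_overhead=200)
demo_chunks = [{"content": "x" * 400} for _ in range(3)]  # ~100 tokens each
kept, used = fit_chunks(demo_chunks, budget=250)
print(budget, len(kept), used)
```

If `kept` comes back empty, even the top-ranked chunk does not fit, and that is the point where a fallback such as summarization would take over.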

#### **Citation & Trust**
- Always include source identifiers in context
- Teach the LLM to cite sources through examples and instructions
- Extract and validate citations programmatically
- Consider adding confidence scores to citations
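
One way to extract and validate citations programmatically is a regular expression plus a range check. The sketch below assumes the model was instructed to cite as `[Source N]`. That format and the name `validate_citations` are assumptions for illustration, so adjust the pattern to whatever your citation template actually asks for.

```python
import re

# Assumed citation format: [Source 1], [Source 2], ...
CITATION_PATTERN = re.compile(r"\[Source (\d+)\]")


def validate_citations(answer, num_sources):
    """Split cited source numbers into in-range and out-of-range lists."""
    cited = {int(n) for n in CITATION_PATTERN.findall(answer)}
    valid = sorted(n for n in cited if 1 <= n <= num_sources)
    invalid = sorted(n for n in cited if not 1 <= n <= num_sources)
    return {"valid": valid, "invalid": invalid, "has_citations": bool(cited)}


report = validate_citations(
    "The minimum is 25 Mbps download [Source 1] and 5 Mbps upload [Source 4].",
    num_sources=3,
)
print(report)
# {'valid': [1], 'invalid': [4], 'has_citations': True}
```

A range check only confirms that the cited source exists. It does not confirm that the source supports the claim, which needs a separate grounding check.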

#### **Testing & Iteration**
- Test different strategies on representative queries
- Measure: answer quality, citation accuracy, token usage
- A/B test with real users when possible
- Monitor for hallucinations (answers not grounded in context)
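
For hallucination monitoring, a very rough first signal is lexical overlap between the answer and the retrieved context. The sketch below is a crude heuristic under that assumption, not an established metric: `grounding_score` is a hypothetical name, and a low score flags an answer for review rather than proving it is wrong.

```python
import re


def _tokens(text):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def grounding_score(answer, context):
    """Fraction of the answer's distinct tokens that also occur in the context."""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & _tokens(context)) / len(answer_tokens)


# Illustrative context sentence, not taken from the sample data.
context = "Remote employees need a minimum of 25 Mbps download and 5 Mbps upload."
grounded = grounding_score("A minimum of 25 Mbps download and 5 Mbps upload.", context)
ungrounded = grounding_score("Fiber is mandatory.", context)
print(grounded, ungrounded)
```

Paraphrased but correct answers will score lower than verbatim ones, so this works better for spotting outliers across many queries than as a hard pass/fail gate.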

### 🔄 Recommended Pipeline:

```
1. Retrieve (hybrid search, top-5)
   ↓
2. Rerank (cross-encoder, keep top-3)
   ↓
3. Context Integration (numbered)
   ↓
4. Prompt Template (citation-focused)
   ↓
5. Generate (with citations)
   ↓
6. Validate Citations
   ↓
7. Return Answer + Sources
```
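
The seven steps can be wired together roughly as shown below. This is a skeleton under assumptions, not the notebook's `augment_query`: `run_pipeline` is a hypothetical name, the retriever, reranker, and generator are passed in as plain callables, and the `[Source N]` citation format is assumed.

```python
import re


def run_pipeline(query, retrieve, rerank, generate, retrieve_k=5, keep_k=3):
    """Wire the seven steps together; retrieve, rerank and generate are injected callables."""
    candidates = retrieve(query, retrieve_k)                    # 1. Retrieve
    top_chunks = rerank(query, candidates)[:keep_k]             # 2. Rerank, keep top-k
    context = "\n\n".join(                                      # 3. Numbered context
        f"[Source {i}]\n{chunk['content']}" for i, chunk in enumerate(top_chunks, start=1)
    )
    prompt = (                                                  # 4. Citation-focused prompt
        "Answer using only the sources below and cite them as [Source N].\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )
    answer = generate(prompt)                                   # 5. Generate
    cited = sorted({int(n) for n in re.findall(r"\[Source (\d+)\]", answer)})
    valid = [n for n in cited if 1 <= n <= len(top_chunks)]     # 6. Validate citations
    return {"answer": answer, "sources": [top_chunks[n - 1] for n in valid]}  # 7. Return


# Stand-in components so the sketch runs without a vector store or an LLM.
docs = [{"content": f"Document {i}"} for i in range(1, 6)]
result = run_pipeline(
    "example question",
    retrieve=lambda q, k: docs[:k],
    rerank=lambda q, chunks: list(reversed(chunks)),
    generate=lambda prompt: "Stub answer [Source 2] [Source 9].",
)
print(result["sources"])
# [{'content': 'Document 4'}]
```

Injecting the components keeps each step swappable, so a different reranker or prompt template can be tested without touching the rest of the flow.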