## Setup: Import Required Libraries

First, ensure you've installed dependencies from `requirements.txt`:
```bash
pip install -r requirements.txt
```

Now let's import the required libraries.

## Semantic Chunking Function

We'll use semantic chunking to split our text into meaningful chunks based on topic similarity.

In [None]:
# Import sample data and required libraries
from sample_data import SAMPLE_TEXT
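import nltk
nltk.download('punkt', quiet=True)
# Newer NLTK releases load the sentence tokenizer data from 'punkt_tab' instead
nltk.download('punkt_tab', quiet=True)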

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def chunk_by_semantic_similarity(text: str, similarity_threshold: float = 0.5, overlap_sentences: int = 2, min_chunk_size: int = 2) -> list:
    """
    Semantic chunking based on sentence similarity using TF-IDF vectors
    
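    Parameters:
    - text: Input text to split into chunks
    - similarity_threshold: Cosine similarity threshold (0-1). A boundary is placed where the
      similarity between consecutive sentences drops below it, so higher = more chunks
    - overlap_sentences: Number of sentences from previous chunk to include as context
    - min_chunk_size: Minimum number of sentences per chunk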
    
    Groups semantically similar sentences together and creates boundaries where
    the semantic topic shifts (similarity drops below threshold).
    """
    # STEP 1: Split the text into individual sentences
    sentences = nltk.sent_tokenize(text)
    if len(sentences) <= min_chunk_size:
        return [text]
    
    # STEP 2: Convert sentences to numerical vectors using TF-IDF
    vectorizer = TfidfVectorizer(stop_words='english')
    try:
        sentence_vectors = vectorizer.fit_transform(sentences)
    except ValueError:
        return [text]
    
    # STEP 3: Calculate similarity between consecutive sentences
    similarities = [cosine_similarity(sentence_vectors[i:i+1], sentence_vectors[i+1:i+2])[0][0] 
                   for i in range(len(sentences) - 1)]
    
    # STEP 4: Identify chunk boundaries where topics change
    chunk_boundaries = [0]
    current_chunk_size = 1
    
    for i, sim in enumerate(similarities):
        if sim < similarity_threshold and current_chunk_size >= min_chunk_size:
            chunk_boundaries.append(i + 1)
            current_chunk_size = 1
        else:
            current_chunk_size += 1
    
    if chunk_boundaries[-1] != len(sentences):
        chunk_boundaries.append(len(sentences))
    
    # STEP 5: Create actual text chunks with optional overlap
    chunks = []
    for i in range(len(chunk_boundaries) - 1):
        start_idx = chunk_boundaries[i]
        end_idx = chunk_boundaries[i + 1]
        
        if i > 0 and overlap_sentences > 0:
            overlap_start = max(0, start_idx - overlap_sentences)
            chunk_sentences = sentences[overlap_start:end_idx]
        else:
            chunk_sentences = sentences[start_idx:end_idx]
        
        chunk = " ".join(chunk_sentences)
        chunks.append(chunk)
    
    return chunks

# Create semantic chunks
chunks = chunk_by_semantic_similarity(SAMPLE_TEXT, similarity_threshold=0.15, overlap_sentences=2)
print(f"✅ Created {len(chunks)} semantic chunks\n")

for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} ({len(chunk)} chars): {chunk[:100]}...\n")
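
To see how `similarity_threshold` and `min_chunk_size` interact, the boundary rule from STEP 4 can be isolated and run on hand-picked numbers. This is only a sketch: `find_boundaries` is a hypothetical helper name, and the similarity values are invented for illustration rather than computed from `SAMPLE_TEXT`.

```python
def find_boundaries(similarities, threshold, min_chunk_size=2):
    """Return chunk boundary indices for a list of consecutive-sentence similarities."""
    n_sentences = len(similarities) + 1
    boundaries = [0]
    current_size = 1
    for i, sim in enumerate(similarities):
        # Start a new chunk only when similarity drops AND the current chunk is big enough
        if sim < threshold and current_size >= min_chunk_size:
            boundaries.append(i + 1)
            current_size = 1
        else:
            current_size += 1
    boundaries.append(n_sentences)
    return boundaries

# Made-up similarities between 6 consecutive sentences
sims = [0.6, 0.1, 0.4, 0.05, 0.3]

low = find_boundaries(sims, threshold=0.15, min_chunk_size=1)
high = find_boundaries(sims, threshold=0.5, min_chunk_size=1)
print(low)   # fewer boundaries with the lower threshold
print(high)  # more boundaries with the higher threshold
```

With these inputs the higher threshold produces more, smaller chunks, and raising `min_chunk_size` suppresses boundaries that would create very short chunks.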

## Build Vector Store

Now we'll create embeddings for our chunks and store them in ChromaDB.

In [None]:
import chromadb
from sentence_transformers import SentenceTransformer

# Initialize embedding model
print("🔄 Loading embedding model...")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')  # 384-dimensional embeddings
print("✅ Embedding model loaded!\n")

# Generate embeddings for all chunks
print(f"🔄 Generating embeddings for {len(chunks)} chunks...")
embeddings = embedding_model.encode(chunks, show_progress_bar=True)
print(f"✅ Generated {len(embeddings)} embeddings, each with {embeddings[0].shape[0]} dimensions\n")

# Initialize ChromaDB
print("🔄 Initializing ChromaDB vector store...")
chroma_client = chromadb.Client()

# Create collection with cosine similarity
collection = chroma_client.get_or_create_collection(
    name="retrieval_demo_chunks",
    metadata={
        "hnsw:space": "cosine",
        "description": "Chunks for retrieval strategy demonstration"
    }
)
print("✅ ChromaDB collection created!\n")

# Store chunks with embeddings
print(f"🔄 Storing {len(chunks)} chunks in vector database...")
collection.add(
    ids=[f"chunk_{i}" for i in range(len(chunks))],
    embeddings=embeddings.tolist(),
    documents=chunks,
    metadatas=[{"chunk_index": i, "length": len(chunk)} for i, chunk in enumerate(chunks)]
)
print("✅ All chunks stored in vector database!\n")

print("=" * 80)
print("VECTOR STORE SUMMARY")
print("=" * 80)
print(f"• Total chunks stored: {collection.count()}")
print(f"• Embedding dimensions: {embeddings[0].shape[0]}")
print(f"‚Ä¢ Similarity metric: Cosine")
print("=" * 80)

## 1. Retrieval Strategies

Different strategies for fetching relevant chunks:
- **Dense (Vector Search)**: Uses embeddings for semantic similarity
- **Hybrid (BM25 + Embeddings)**: Combines keyword matching with semantic search (recommended for production)

### 1.1 Dense Retrieval (Vector Search Only)

Pure semantic similarity using embeddings.

In [None]:
def dense_retrieval(query: str, top_k: int = 3):
    """
    Dense retrieval using vector embeddings only.
    Returns top_k most semantically similar chunks.
    """
    # Convert query to embedding
    query_embedding = embedding_model.encode([query])[0]
    
    # Search vector database
    results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=top_k
    )
    
    return results

# Test dense retrieval
query = "What are the internet speed requirements for remote work?"

print("=" * 80)
print("DENSE RETRIEVAL (Vector Search Only)")
print("=" * 80)
print(f"\n🔍 Query: '{query}'\n")

results = dense_retrieval(query, top_k=3)

print("Top 3 results by semantic similarity:\n")
for i, (doc, metadata, distance) in enumerate(zip(
    results['documents'][0], 
    results['metadatas'][0],
    results['distances'][0]
), 1):
    similarity_score = 1 - distance
    print(f"Rank {i} | Similarity: {similarity_score:.4f}")
    print(f"Content: {doc}")
    print()

print("=" * 80)
print("💡 Dense retrieval captures semantic meaning but may miss exact keywords")
print("=" * 80)
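
The `1 - distance` conversion above works because the collection was created with `"hnsw:space": "cosine"`, so the reported distance is one minus the cosine similarity. A minimal pure-Python sketch of that relationship follows; `cosine_sim` is a local illustrative helper (not the scikit-learn function) and the two-dimensional vectors are toy values, not real embeddings.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vec = [1.0, 0.0]
examples = {
    "same direction": [2.0, 0.0],   # longer vector, same direction
    "orthogonal": [0.0, 3.0],       # unrelated direction
}

for name, vec in examples.items():
    sim = cosine_sim(query_vec, vec)
    distance = 1 - sim  # cosine distance
    print(f"{name}: similarity={sim:.2f}, distance={distance:.2f}")
```

Vector length does not matter here, only direction, which is why cosine is a common choice for comparing text embeddings.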

### 1.2 Hybrid Retrieval (BM25 + Embeddings)

Combines keyword matching (BM25) with semantic search for best results.

**What is BM25?**

BM25 (Best Matching 25) is a ranking function used in information retrieval that scores documents based on the query terms appearing in each document. 

**Why "25"?** The "25" simply indicates it's the 25th iteration/variation in the "Best Matching" family of ranking functions developed by researchers. Earlier versions included BM1, BM11, BM15, etc. BM25 emerged as the most effective version and became the standard.

**How BM25 works - Key components:**

1. **Term Frequency (TF)**: How often a query term appears in a document
   - **Query term**: A word from your search query (e.g., if you search "internet speed", the query terms are "internet" and "speed")
   - **Diminishing returns**: The relevance benefit of each additional occurrence decreases. Here's why:
     - **Real-world intuition**: If you're searching for "python tutorial" and find a document:
       - With 1 occurrence: "This Python tutorial covers basics" → Clearly relevant ✓
       - With 3 occurrences: Document discusses Python tutorials multiple times → More relevant ✓✓
       - With 100 occurrences: Likely keyword stuffing or spam ("python python python tutorial tutorial...") → Not 100x more useful!
     - **The problem with linear counting**: If we simply count occurrences, spam documents that repeat words would always rank #1
     - **BM25's solution**: Uses a saturation function that increases score rapidly at first, then flattens out
     - **Mathematical behavior** (term-frequency factor tf·(k1+1)/(tf+k1) with k1 = 1.5, ignoring length normalization): 
       - 1 occurrence → score: 1.00
       - 2 occurrences → score: 1.43 (43% increase)
       - 3 occurrences → score: 1.67 (17% increase)
       - 5 occurrences → score: 1.92 (smaller increase)
       - 100 occurrences → score: 2.46 (only about 28% more than at 5 occurrences, despite 20x the count)
     - **Key insight**: The factor can never exceed k1 + 1 = 2.5, so after ~5-10 occurrences additional mentions add very little value - the curve "saturates"
   
2. **Inverse Document Frequency (IDF)**: How rare/common a term is across all documents
   - **Inverse**: "Opposite" or "reverse" - we want RARE terms to score HIGHER
   - **Document Frequency**: How many documents contain this term
   - **Logic**: If "the" appears in 1000/1000 documents, it tells us nothing useful (low IDF weight)
   - **Logic**: If "photosynthesis" appears in 2/1000 documents, it's highly specific (high IDF weight)
   - **Formula**: IDF = log(total_documents / documents_containing_term) (simplified; BM25 implementations use a smoothed variant of this ratio)
   - **Example** (natural log): 
     - "the" appears in 950/1000 docs → IDF = ln(1000/950) ≈ 0.05 (low weight)
     - "photosynthesis" appears in 2/1000 docs → IDF = ln(1000/2) ≈ 6.2 (high weight)
   
3. **Document Length Normalization**: Adjusts scores based on document length
   - **Problem**: Longer documents naturally contain more words, giving them unfair advantage
   - **Solution**: Normalize scores by document length relative to average document length
   - **Example**: A 1000-word document with 3 matches isn't necessarily better than a 100-word document with 2 matches
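
The three components above can be put together in a few lines. The sketch below is a simplified illustration, not the exact scoring used by any particular library: it assumes the textbook `log(N / n)` IDF, the common defaults `k1 = 1.5` and `b = 0.75`, and hypothetical helper names and document statistics chosen only for the example.

```python
import math

def tf_saturation(tf, k1=1.5):
    """BM25 term-frequency factor with length normalization switched off."""
    return tf * (k1 + 1) / (tf + k1)

def simple_idf(n_docs, docs_with_term):
    """Textbook IDF using the natural log (real implementations smooth this)."""
    return math.log(n_docs / docs_with_term)

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, docs_with_term, k1=1.5, b=0.75):
    """Contribution of one query term to one document's score."""
    # Documents longer than average get a larger denominator, i.e. a lower score
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    tf_part = tf * (k1 + 1) / (tf + k1 * length_norm)
    return simple_idf(n_docs, docs_with_term) * tf_part

# 1. Saturation: each extra occurrence adds less, approaching k1 + 1
for tf in (1, 2, 3, 5, 100):
    print(f"tf={tf:>3} -> {tf_saturation(tf):.2f}")

# 2. IDF: rare terms weigh more than common ones
print(f"common term: {simple_idf(1000, 950):.2f}, rare term: {simple_idf(1000, 2):.2f}")

# 3. Length normalization: 2 matches in a short doc vs 3 matches in a long doc
short_doc = bm25_term_score(tf=2, doc_len=100, avg_doc_len=300, n_docs=1000, docs_with_term=20)
long_doc = bm25_term_score(tf=3, doc_len=1000, avg_doc_len=300, n_docs=1000, docs_with_term=20)
print(f"short doc: {short_doc:.2f}, long doc: {long_doc:.2f}")
```

With these assumed numbers the short document scores higher even though it has fewer matches, which is the effect length normalization is meant to produce.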

**Why Hybrid (BM25 + Embeddings)?**

- **BM25 strengths**: Excellent at finding exact keyword/phrase matches (e.g., "Article 5", "GDP growth")
- **Embedding strengths**: Understands semantic meaning and synonyms (e.g., "WiFi" ≈ "internet", "quick" ≈ "fast")
- **Combined**: Handles both "WiFi requirements" and "internet speed needs" queries effectively

In [None]:
from rank_bm25 import BM25Okapi
import numpy as np

def hybrid_retrieval(query: str, top_k: int = 3, alpha: float = 0.7):
    """
    Hybrid retrieval combining BM25 keyword matching with vector embeddings.
    
    This function re-scores dense candidates with BM25 in three steps:
    1. Use vector embeddings to retrieve the top_k semantically similar chunks
    2. Score those same chunks with BM25 to capture exact keyword/term matches
    3. Combine both scores for robust retrieval
    
    Parameters:
    - query: Search query string from the user
    - top_k: Number of results to return (default: 3)
    - alpha: Weight for vector score (0-1). Higher = more semantic, lower = more keyword
             Example: alpha=0.7 means 70% vector + 30% BM25
    
    Returns: List of dictionaries containing chunk indices, scores, and content
    """
    # Step 1: Dense retrieval (get top candidates by vector similarity)
    # Convert the query text into a numerical embedding vector
    query_embedding = embedding_model.encode([query])[0]
    
    # Query the vector database to find semantically similar chunks
    vector_results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=top_k
    )
    
    # Extract chunk indices and vector scores from results
    # Vector scores are calculated as (1 - cosine_distance) for similarity
    vector_scores_dict = {}
    for chunk_id, distance in zip(vector_results['ids'][0], vector_results['distances'][0]):
        chunk_idx = int(chunk_id.split('_')[1])  # Extract numeric index from 'chunk_0', 'chunk_1', etc.
        vector_scores_dict[chunk_idx] = 1 - distance  # Convert distance to similarity
    
    # Step 2: BM25 keyword matching
    # Tokenize all chunks into words for BM25 processing
    tokenized_chunks = [chunk.lower().split() for chunk in chunks]
    
    query_tokens = query.lower().split()
    # Initialize BM25 algorithm with tokenized chunks
    # BM25 uses term frequency and inverse document frequency for scoring
    bm25 = BM25Okapi(tokenized_chunks)
    
    # Get BM25 scores for the top_k chunks identified by vector search
    # This ensures we only score chunks that are already semantically relevant
    # bm25.get_scores() returns one score per chunk in the corpus, so call it
    # once and index the result instead of recomputing it inside the loop
    all_bm25_scores = bm25.get_scores(query_tokens)
    bm25_scores_dict = {}
    for chunk_idx in vector_scores_dict.keys():
        # Example: if chunk_idx=2, we get the BM25 score for the 3rd chunk (0-indexed)
        # Store it so it can later be combined with the vector score
        bm25_scores_dict[chunk_idx] = all_bm25_scores[chunk_idx]
    
    # Step 3: Normalize BM25 scores to 0-1 range
    # This makes BM25 scores comparable with vector scores (cosine similarity, typically 0-1 here)
    bm25_scores_list = list(bm25_scores_dict.values())
    bm25_min = min(bm25_scores_list)
    bm25_max = max(bm25_scores_list)
    # Min-max normalization: (score - min) / (max - min)
    # Add small epsilon (1e-10) to avoid division by zero
    bm25_scores_normalized_dict = {}
    for chunk_idx, score in bm25_scores_dict.items():
        bm25_scores_normalized_dict[chunk_idx] = (score - bm25_min) / (bm25_max - bm25_min + 1e-10)
    
    # Step 4: Combine scores with weighted average
    # Final score = (alpha × vector_score) + ((1-alpha) × bm25_score)
    # This balances semantic understanding with keyword matching
    hybrid_scores_dict = {}
    for chunk_idx in vector_scores_dict.keys():
        vector_score = vector_scores_dict[chunk_idx]
        bm25_score = bm25_scores_normalized_dict[chunk_idx]
        hybrid_scores_dict[chunk_idx] = alpha * vector_score + (1 - alpha) * bm25_score
    
    # Step 5: Sort by hybrid score (highest first)
    # This gives us the most relevant chunks based on combined scoring
    sorted_results = sorted(hybrid_scores_dict.items(), key=lambda x: x[1], reverse=True)
    
    # Build final results list with all relevant information
    # Include individual scores for transparency and debugging
    results = []
    for chunk_idx, hybrid_score in sorted_results:
        results.append({
            'chunk_idx': chunk_idx,
            'hybrid_score': hybrid_score,
            'vector_score': vector_scores_dict[chunk_idx],
            'bm25_score': bm25_scores_normalized_dict[chunk_idx],
            'content': chunks[chunk_idx]
        })
    
    return results

# Test hybrid retrieval with a sample query
print("=" * 80)
print("HYBRID RETRIEVAL (BM25 + Embeddings)")
print("=" * 80)
print(f"\n🔍 Query: '{query}'\n")

# Run hybrid retrieval with alpha=0.7 (70% semantic, 30% keyword)
hybrid_results = hybrid_retrieval(query, top_k=3, alpha=0.7)

# Display results with scores breakdown
print(f"Top 3 hybrid results (α=0.7: 70% semantic, 30% keyword):\n")
for i, result in enumerate(hybrid_results, 1):
    print(f"Rank {i} | Hybrid: {result['hybrid_score']:.4f} " +
          f"(Vector: {result['vector_score']:.4f}, BM25: {result['bm25_score']:.4f})")
    print(f"Content: {result['content'][:150]}...")
    print()

print("=" * 80)
print("💡 Hybrid retrieval combines semantic understanding with keyword matching")
print("   Adjust α: <0.5 favors keywords, >0.5 favors semantics")
print("=" * 80)
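
The normalization and weighting in Steps 3 and 4 can be tried in isolation to get a feel for `alpha`. The following is a sketch with made-up scores; `min_max_normalize` and `fuse` are hypothetical helper names, and the numbers are not taken from the sample corpus.

```python
def min_max_normalize(scores):
    """Rescale scores to 0-1; if all scores are equal, return zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(vector_scores, bm25_scores, alpha=0.7):
    """Weighted sum: alpha * vector score + (1 - alpha) * normalized BM25 score."""
    bm25_norm = min_max_normalize(bm25_scores)
    return [alpha * v + (1 - alpha) * k for v, k in zip(vector_scores, bm25_norm)]

def ranking(scores):
    """Chunk indices ordered from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Made-up scores for three chunks: chunk 0 wins on semantics, chunk 2 on keywords
vector_scores = [0.80, 0.75, 0.40]
bm25_scores = [0.0, 1.0, 3.0]

for alpha in (1.0, 0.7, 0.3):
    fused = fuse(vector_scores, bm25_scores, alpha)
    print(f"alpha={alpha}: scores={[round(s, 2) for s in fused]} ranking={ranking(fused)}")
```

As `alpha` decreases, the keyword-heavy chunk moves up the ranking, which is a quick way to sanity-check a chosen weight before running it against real queries.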

## 2. Query Expansion: Improving Retrieval with Better Queries

**Problem:** Users often phrase queries in ways that don't match document terminology.

**Examples:**
- User: "How fast should my internet be?" → Document: "minimum bandwidth requirements"
- User: "WFH guidelines" → Document: "remote work policy"
- User: "What's needed for home office?" → Document: "equipment specifications"

**Solution:** Use an LLM to expand or rewrite queries for better matching.

### LLM-Based Query Expansion

1. **Intelligent Synonyms**: LLMs understand context and generate appropriate synonyms
   - "fast internet" → "high-speed connection, broadband, bandwidth requirements"

2. **Acronym Handling**: Automatically expands abbreviations
   - "WFH" → "work from home, remote work, telecommuting"

3. **Contextual Rewriting**: Rephrases queries to match document style
   - "How fast should internet be?" → "internet speed requirements, minimum bandwidth specifications"

4. **Multi-Query Generation**: Creates diverse query variations
   - Original: "remote work requirements"
   - Variant 1: "work from home eligibility criteria"
   - Variant 2: "home office setup guidelines"
   - Retrieve from all, deduplicate results
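
The "retrieve from all, deduplicate" step in item 4 amounts to keeping the best score seen for each chunk across every query variant. A small sketch with invented result lists is shown here; `merge_results` and the `score` key are illustrative names, and no retrieval or LLM call is made.

```python
def merge_results(result_lists, top_k=3):
    """Keep the highest-scoring entry per chunk across all query variants."""
    best = {}
    for results in result_lists:
        for r in results:
            idx = r["chunk_idx"]
            if idx not in best or r["score"] > best[idx]["score"]:
                best[idx] = r
    # Sort the deduplicated entries by score, best first
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)[:top_k]

# Invented results for the original query and two rewritten variants
original = [{"chunk_idx": 0, "score": 0.71}, {"chunk_idx": 3, "score": 0.55}]
variant_1 = [{"chunk_idx": 3, "score": 0.82}, {"chunk_idx": 5, "score": 0.40}]
variant_2 = [{"chunk_idx": 0, "score": 0.65}, {"chunk_idx": 7, "score": 0.60}]

merged = merge_results([original, variant_1, variant_2], top_k=3)
print([(r["chunk_idx"], r["score"]) for r in merged])
```

Chunk 3 ranks first here only because a rewritten variant matched it better than the original query did, which is the benefit multi-query retrieval is after.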


In [None]:
import os
from groq import Groq

groq_client = Groq(api_key=os.environ.get("GROQ_API_KEY"))
print("✅ LLM client initialized successfully!\n")

def llm_query_expansion(query: str, num_variations: int = 3):
    """Ask the LLM for reworded versions of `query`; returns [query, variation_1, ...]."""
    prompt = (
        f"You are a search query optimizer. Given a user's search query, generate {num_variations} "
        "improved variations that will help find relevant documents.\n\n"
        "Each variation should:\n"
        "- Use different words/synonyms but preserve the same search intent\n"
        "- Expand abbreviations (e.g., 'WFH' → 'work from home')\n"
        "- Add relevant terms that might appear in documents\n"
        "- Use both casual and formal terminology where appropriate\n\n"
        f"Original query: {query}\n\n"
        f"Return ONLY {num_variations} optimized query variations, one per line, "
        f"numbered 1-{num_variations}:"
    )
    response = groq_client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=150
    )
    variations_text = response.choices[0].message.content.strip()
    # Keep the original query first, then parse numbered or bulleted lines from the reply
    variations = [query]
    for line in variations_text.split('\n'):
        line = line.strip()
        if line and (line[0].isdigit() or line.startswith('-') or line.startswith('•')):
            cleaned = line.lstrip('0123456789.-•) ').strip()
            if cleaned and cleaned not in variations:
                variations.append(cleaned)
    return variations[:num_variations + 1]

def multi_query_retrieval(query: str, top_k: int = 3, num_variations: int = 2):
    query_variations = llm_query_expansion(query, num_variations=num_variations)
    all_results = {}
    for variation in query_variations:
        results = hybrid_retrieval(variation, top_k=top_k * 2, alpha=0.7)
        for result in results:
            chunk_idx = result['chunk_idx']
            if chunk_idx not in all_results or result['hybrid_score'] > all_results[chunk_idx]['hybrid_score']:
                all_results[chunk_idx] = result
    sorted_results = sorted(all_results.values(), key=lambda x: x['hybrid_score'], reverse=True)[:top_k]
    return sorted_results

print("=" * 80)
print("LLM-BASED QUERY EXPANSION & MULTI-QUERY RETRIEVAL DEMO")
print("=" * 80)

test_query = "What are the WFH internet requirements?"
print(f"User Query: {test_query}\n")
query_variations = llm_query_expansion(test_query, num_variations=3)

# DEMO 1: Retrieval using the ORIGINAL query
original_results = hybrid_retrieval(test_query, top_k=3, alpha=0.7)
print("-" * 80)
print("DEMO 1: Retrieval using the ORIGINAL query")
print("-" * 80)
for i, res in enumerate(original_results, 1):
    print(f"Result {i} | Score: {res['hybrid_score']:.4f}\n  Content: {res['content'][:200]}...\n")

# DEMO 2: Retrieval using the FIRST expanded query
print("-" * 80)
print("DEMO 2: Retrieval using the FIRST expanded query")
print("-" * 80)
print(f"Using expanded query: '{query_variations[1]}'")
expanded_results = hybrid_retrieval(query_variations[1], top_k=3, alpha=0.7)
for i, res in enumerate(expanded_results, 1):
    print(f"Result {i} | Score: {res['hybrid_score']:.4f}\n  Content: {res['content'][:200]}...\n")

# DEMO 3: Multi-query retrieval (aggregate best matches from ALL expanded queries)
print("-" * 80)
print("DEMO 3: MULTI-QUERY RETRIEVAL (aggregate best matches from ALL expanded queries)")
print("-" * 80)
test_query2 = "What are the internet speed requirements for remote work?"
multi_results = multi_query_retrieval(test_query2, top_k=3, num_variations=2)
for i, res in enumerate(multi_results, 1):
    print(f"Result {i} | Score: {res['hybrid_score']:.4f}\n  Content: {res['content'][:200]}...\n")

# Score comparison table
def _score_row(label, results, n=3):
    # Pad with 'n/a' so the row still prints when fewer than n chunks were returned
    cells = [f"{r['hybrid_score']:>8.4f}" for r in results[:n]]
    cells += [f"{'n/a':>8}"] * (n - len(cells))
    return f"{label:<30}" + "".join(cells)

print("-" * 80)
print("SCORE COMPARISON TABLE (Scores Only)")
print(f"{'Demo':<30}{'R1':>8}{'R2':>8}{'R3':>8}")
print(f"{'-'*30}{'-'*8}{'-'*8}{'-'*8}")
print(_score_row('Original Query', original_results))
print(_score_row('First Expanded Query', expanded_results))
print(_score_row('Multi-Query Retrieval', multi_results))
print(f"{'-'*30}{'-'*8}{'-'*8}{'-'*8}")


## 3. Re-ranking: Post-filtering for Better Relevance

Re-ranking refines initial retrieval results using more sophisticated models.

**Common approaches:**
- **Cross-encoder rerankers**: More accurate but slower (e.g., `ms-marco-MiniLM`)
- **LLM-based reranking**: Use LLM to score relevance
- **Metadata filtering**: Filter by user role, time window, etc.

**Strategy:**
1. Retrieve larger set (e.g., TOP-K=10) with fast retrieval
2. Rerank to identify best 3-5 chunks
3. Use reranked results for generation
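
The three-step strategy above can be sketched independently of any particular model. In the example below, `rerank` accepts any scoring function; `word_overlap` is a deliberately crude stand-in used only so the snippet runs without downloading a model, and the candidate texts are invented. In practice the scorer would be a cross-encoder or an LLM.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Score each candidate against the query and keep the top_n best."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]

def word_overlap(query, text):
    """Toy stand-in scorer: fraction of query words that appear in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q)

# Step 1: pretend these came back from a fast first-stage retriever
candidates = [
    "office parking rules",
    "minimum internet speed for remote work",
    "remote work equipment list",
    "internet outage troubleshooting",
]

# Step 2: rerank the larger set down to the best few
top = rerank("internet speed remote work", candidates, word_overlap, top_n=2)

# Step 3: these are the chunks that would be passed on to generation
for score, text in top:
    print(f"{score:.2f}  {text}")
```

Keeping the scorer pluggable makes it straightforward to swap the toy function for a real reranking model without changing the surrounding pipeline.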

**How is the cross-encoder score different from dense vector search and BM25 scores?**

- **Dense vector search score**: This uses embeddings to measure how similar the query and chunk are in meaning. Each is turned into a vector separately, and their similarity (like cosine similarity) is calculated. It captures overall semantic similarity but doesn't look at the exact way the query and chunk relate word-by-word.

- **BM25 score**: This is a traditional keyword-based method. It scores based on how many query words appear in the chunk, how rare those words are, and document length. It's great for exact matches but doesn't understand meaning or synonyms.

- **Cross-encoder score**: Here, a neural network looks at the query and chunk together, considering the full context of both at once. It can understand complex relationships, word order, and deeper meaning, making it more accurate for ranking relevance. It's slower, but often gives better results for the top matches.

In [None]:
# Cross-Encoder Reranker using sentence-transformers
from sentence_transformers import CrossEncoder
import torch

# --- CROSS-ENCODER RERANKING ---
# This approach uses a neural network to jointly encode the query and each candidate chunk,
# then predicts a relevance score for each pair. This is much more accurate than simple scoring,
# but also slower, so it's best used on a small set of top candidates (e.g., top 5-10 from initial retrieval).

# Load a cross-encoder model (downloads weights if not cached; may take a while the first time)
# The neural network is inside the CrossEncoder model. 
# When you create cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2'), 
# you are loading a pre-trained neural network that has learned to judge how well a query matches a chunk of text. 
if 'cross_encoder' not in globals():
    print("\n🔄 Loading cross-encoder model for reranking...")
    cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    print("✅ Cross-encoder loaded!\n")

def cross_encoder_reranker(query, initial_results, top_n=3):
    """
    Rerank initial results using a cross-encoder model for more accurate relevance scoring.
    
    Args:
        query (str): The user query.
        initial_results (list): List of dicts with 'content' key for each chunk (from initial retrieval).
        top_n (int): Number of top results to return after reranking.
    
    Returns:
        List of dicts, reranked by cross-encoder score (descending).
    """
    # Prepare input pairs: (query, chunk) for each candidate chunk.
    # The cross-encoder will score each pair for relevance.
    pairs = [(query, result['content']) for result in initial_results]
    
    # Get relevance scores from the cross-encoder model.
    # The model outputs one score per (query, chunk) pair: higher = more relevant.
    #
    # torch.no_grad() tells PyTorch not to track gradients inside the block.
    # Gradients measure how the model's output changes as its parameters change;
    # they are only needed for training, so skipping them during inference
    # saves memory and time.
    with torch.no_grad():
        # Score every (query, chunk) pair; these scores drive the final sort below.
        scores = cross_encoder.predict(pairs)
    
    # Attach the cross-encoder score to each result and build a new list.
    reranked = []
    for result, score in zip(initial_results, scores):
        result_ce = result.copy()  # Copy original result dict
        result_ce['cross_encoder_score'] = float(score)  # Add the new score
        reranked.append(result_ce)
    
    # Sort all results by cross-encoder score (highest first)
    reranked.sort(key=lambda x: x['cross_encoder_score'], reverse=True)
    return reranked[:top_n]  # Return only the top_n results

# --- DEMONSTRATION: Cross-Encoder Reranking ---
if '_ce_demo_ran' not in globals():
    query = "What are the internet speed requirements for remote work?"

    # Step 1: Hybrid search results and scores
    print("=" * 80)
    print("STEP 1 - Hybrid search results and scores (Top 5)")
    print("=" * 80)
    initial_results = hybrid_retrieval(query, top_k=5, alpha=0.7)
    for i, result in enumerate(initial_results, 1):
        print(f"Rank {i} | Hybrid Score: {result['hybrid_score']:.4f} | Vector: {result['vector_score']:.4f} | BM25: {result['bm25_score']:.4f}")
        print(f"  Content: {result['content'][:100]}...\n")

    # Step 2: Cross-encoder results and scores
    print("-" * 80)
    print("STEP 2 - Cross-encoder reranking results and scores (Top 3)")
    print("-" * 80)
    reranked_results = cross_encoder_reranker(query, initial_results, top_n=3)
    for i, result in enumerate(reranked_results, 1):
        print(f"Rank {i} | Cross-Encoder Score: {result['cross_encoder_score']:.4f}")
        print(f"  Hybrid Score: {result['hybrid_score']:.4f} | Vector: {result['vector_score']:.4f} | BM25: {result['bm25_score']:.4f}")
        print(f"  Content: {result['content'][:100]}...\n")

    # Step 3: Comparison table sorted by cross-encoder score
    print("=" * 80)
    print("FINAL COMPARISON: Top 5 Hybrid Results Sorted by Cross-Encoder Score")
    print("=" * 80)
    # Attach cross-encoder scores to all initial results for fair comparison
    all_pairs = [(query, result['content']) for result in initial_results]
    with torch.no_grad():
        all_scores = cross_encoder.predict(all_pairs)
    for i, (result, ce_score) in enumerate(sorted(zip(initial_results, all_scores), key=lambda x: x[1], reverse=True), 1):
        print(f"Rank {i} | Cross-Encoder Score: {ce_score:.4f} | Hybrid Score: {result['hybrid_score']:.4f}")
        print(f"  Content: {result['content'][:100]}...\n")

    print("=" * 80)
    print("💡 Cross-encoder reranking provides a more accurate final ranking by deeply analyzing the query and chunk together.")
    print("   Use hybrid search for recall, then cross-encoder for precision in the final answer.")
    print("=" * 80)
    _ce_demo_ran = True

## 4. Context Filtering: Deciding Relevance to Query Intent

Context filtering ensures that retrieved chunks are actually relevant to the user's intent and the needs of the downstream application. This step is crucial for improving the quality of answers and reducing irrelevant or distracting information.

### Key Context Filtering Techniques (Theory)

- **Similarity Threshold Filtering:**
  - Only include chunks whose similarity or relevance score (from hybrid or cross-encoder reranker) is above a certain threshold.
  - Lower thresholds include more context (risking noise), higher thresholds are stricter (risking missing useful info).

- **Top-N Filtering:**
  - Return only the top N most relevant chunks, regardless of their absolute score.
  - Useful for limiting context size for LLMs or downstream models.

- **Metadata Filtering:**
  - Use metadata fields (e.g., document type, user role, department, date) to include or exclude chunks.
  - Example: Only show policy documents to HR users, or filter by recent date for time-sensitive queries.

- **Intent Detection and Filtering:**
  - Classify the user's query to understand what kind of information they are seeking (e.g., 'policy', 'requirements', 'troubleshooting').
  - Filter or boost chunks that match the detected intent, either by keyword, metadata, or LLM-based classification.
  - Example: suppose the user query is:
      > "What are the troubleshooting steps for slow internet?"

    **Step 1: Detect Intent**
    - Use simple keyword matching or rules:
      - If the query contains words like "troubleshooting", "fix", or "resolve", classify the intent as `troubleshooting`.
      - If the query contains "requirements", the intent is `requirements`.
      - If the query contains "policy", the intent is `policy`.

    **Step 2: Filter or Boost**
    - After retrieval, only include or boost chunks that mention the intent keyword (e.g., "troubleshooting") or are tagged as such in metadata.
    - The same classification and filtering can also be done with an LLM instead of keyword rules.

- **Timeline Filtering:**
  - Filter or prioritize chunks based on recency or time relevance (e.g., only include documents from the last year).
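
The techniques listed above compose naturally into a single post-retrieval filter. The sketch below is one possible arrangement under stated assumptions: the result dictionaries, their `score`, `doc_type`, `intent`, and `updated` fields, and the helper names `detect_intent` and `filter_context` are all hypothetical and not part of the notebook's earlier code.

```python
from datetime import date

# Keyword rules for intent detection (crude substring matching, for illustration)
INTENT_KEYWORDS = {
    "troubleshooting": ("troubleshooting", "fix", "resolve"),
    "requirements": ("requirements",),
    "policy": ("policy",),
}

def detect_intent(user_query):
    """Return the first intent whose keywords appear in the query, else None."""
    q = user_query.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(k in q for k in keywords):
            return intent
    return None

def filter_context(results, min_score=0.0, top_n=None, doc_type=None,
                   newer_than=None, intent=None):
    """Apply threshold, metadata, timeline, and intent filters, then top-N."""
    kept = []
    for r in results:
        if r["score"] < min_score:                                   # similarity threshold
            continue
        if doc_type is not None and r["doc_type"] != doc_type:       # metadata
            continue
        if newer_than is not None and r["updated"] < newer_than:     # timeline
            continue
        if intent is not None and r["intent"] != intent:             # intent
            continue
        kept.append(r)
    kept.sort(key=lambda r: r["score"], reverse=True)
    return kept if top_n is None else kept[:top_n]                   # top-N

# Invented retrieval results with metadata
results = [
    {"id": "a", "score": 0.91, "doc_type": "guide", "intent": "troubleshooting", "updated": date(2024, 5, 1)},
    {"id": "b", "score": 0.74, "doc_type": "policy", "intent": "policy", "updated": date(2021, 1, 15)},
    {"id": "c", "score": 0.35, "doc_type": "guide", "intent": "troubleshooting", "updated": date(2024, 8, 20)},
    {"id": "d", "score": 0.66, "doc_type": "guide", "intent": "requirements", "updated": date(2023, 11, 3)},
]

intent = detect_intent("What are the troubleshooting steps for slow internet?")
filtered = filter_context(results, min_score=0.5, intent=intent, newer_than=date(2023, 1, 1))
print(intent, [r["id"] for r in filtered])
print([r["id"] for r in filter_context(results, top_n=2)])
```

Hard filters like these drop chunks outright; a softer alternative is to add a small bonus to the score of matching chunks so that borderline but useful context is not lost.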
