# Hybrid Search Experiment - Semantic + BM25

This notebook experiments with hybrid search combining:
- **Semantic Search**: Embedding-based similarity (current method)
- **Keyword Search**: BM25 algorithm for exact term matching
- **Fusion**: Reciprocal Rank Fusion (RRF) to combine both

**Goal:** Improve retrieval accuracy, especially for technical terms and acronyms.

In [28]:
import sys
sys.path.append('..')

from src.vector_store import initialize_chroma_db
from rank_bm25 import BM25Okapi
import numpy as np
import pickle
import os
from typing import List, Dict, Tuple

## Step 1: Load ChromaDB Collection

In [29]:
print("Loading ChromaDB...")
client, collection = initialize_chroma_db(
    persist_directory="../chroma_db",
    collection_name="documents"
)
doc_count = collection.count()
print(f"✅ Loaded {doc_count:,} documents")

if doc_count == 0:
    print("\n❌ ERROR: No documents in collection!")
    print("Please run: python3 index_documents.py --yes")
    raise SystemExit("Cannot continue without documents")

Loading ChromaDB...
Initializing ChromaDB at: ../chroma_db
✅ Loaded existing collection: documents
   Documents in collection: 31393
✅ Loaded 31,393 documents


## Step 2: Get All Documents for BM25 Index

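BM25 scores depend on corpus-wide statistics (how many documents contain each term, and the average document length), so the index must be built from every document in the collection.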
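
To make the dependence on the full corpus concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is illustrative only: the function name `bm25_scores`, the IDF variant, and the `k1`/`b` defaults are assumptions chosen for readability and are not guaranteed to match the internals of `rank_bm25`'s `BM25Okapi`. The sketch also assumes a non-empty corpus.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_scores(query: List[str], corpus: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document in `corpus` against `query`."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq: Dict[str, int] = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))

    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Smoothed, always-positive IDF (an assumed variant).
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalization uses the corpus-wide average length.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus with made-up tokens, purely for illustration.
toy_corpus = [
    ["can", "bus", "protocol"],
    ["cotton", "dress", "indigo"],
    ["can", "fd", "network", "speed"],
]
toy_scores = bm25_scores(["can", "protocol"], toy_corpus)
print(toy_scores)
```

Both `doc_freq` and `avg_len` are computed over the whole corpus, which is why a partial fetch would distort every score.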

In [30]:
# Get all documents from ChromaDB
print("Fetching all documents for BM25 indexing...")
all_data = collection.get(
    limit=doc_count,
    include=["documents", "metadatas"]
)

all_documents = all_data['documents']
all_metadatas = all_data['metadatas']
all_ids = all_data['ids']

print(f"✅ Fetched {len(all_documents):,} documents")

if len(all_documents) > 0:
    print(f"\nSample document preview:")
    print(all_documents[0][:200] + "...")
else:
    print("\n⚠️  No documents fetched!")

Fetching all documents for BM25 indexing...
✅ Fetched 31,393 documents

Sample document preview:
BrandName: life
Deatils: solid cotton blend collar neck womens a-line dress - indigo
Sizes: Size:Large,Medium,Small,X-Large,X-Small
Category: Westernwear-Women
Original Price: Rs
1699
Selling Price: 8...


## Step 3: Build BM25 Index

Tokenize documents and create BM25 index.

In [31]:
# Simple tokenization (lowercase + split)
def tokenize(text: str) -> List[str]:
    """Simple tokenization for BM25."""
    return text.lower().split()

print("Building BM25 index...")

if len(all_documents) == 0:
    print("❌ Cannot build BM25 index: No documents available!")
    raise SystemExit("Please index documents first")

tokenized_docs = [tokenize(doc) for doc in all_documents]
bm25 = BM25Okapi(tokenized_docs)
print(f"✅ BM25 index built with {len(tokenized_docs):,} documents")

Building BM25 index...
✅ BM25 index built with 31,393 documents
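
Whitespace splitting leaves punctuation attached to tokens, so the query token `protocol?` would not match `protocol` in a document. The sketch below is one possible alternative (the name `tokenize_alnum` and the regular expression are assumptions, not part of the notebook) that drops surrounding punctuation while keeping hyphenated terms such as `obd-ii` intact.

```python
import re
from typing import List

# Runs of letters/digits, optionally joined by single hyphens (e.g. "obd-ii").
_TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def tokenize_alnum(text: str) -> List[str]:
    """Lowercase the text and extract alphanumeric tokens, keeping hyphenated terms."""
    return _TOKEN_RE.findall(text.lower())


print(tokenize_alnum("What is CAN protocol?"))
print(tokenize_alnum("OBD-II diagnostic system"))
```

If a tokenizer like this were adopted, it would have to be applied to both documents and queries, and the BM25 index rebuilt.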


## Step 4: Test Queries

Compare semantic search vs BM25 search vs hybrid.

In [32]:
# Test queries
test_queries = [
    "What is CAN protocol?",  # Technical term
    "OBD-II diagnostic system",  # Acronym
    "dresses under 1000 rupees",  # Specific price
    "jewellery brands",  # General query
]

### 4a. Semantic Search (Current Method)

In [33]:
def semantic_search(query: str, n_results: int = 5, domain: str = None) -> List[Dict]:
    """Semantic search using embeddings."""
    filter_metadata = {"domain": domain} if domain else None
    
    results = collection.query(
        query_texts=[query],
        n_results=n_results,
        where=filter_metadata
    )
    
    return [
        {
            "id": results['ids'][0][i],
            "document": results['documents'][0][i],
            "distance": results['distances'][0][i],
            "metadata": results['metadatas'][0][i]
        }
        for i in range(len(results['ids'][0]))
    ]

# Test semantic search
query = test_queries[0]
print(f"Query: {query}\n")
semantic_results = semantic_search(query, n_results=5, domain="automotive")

print("=" * 60)
print("SEMANTIC SEARCH RESULTS:")
print("=" * 60)
for i, result in enumerate(semantic_results, 1):
    print(f"\n{i}. Distance: {result['distance']:.4f}")
    print(f"   Source: {result['metadata'].get('source', 'Unknown')}")
    print(f"   Preview: {result['document'][:150]}...")

Query: What is CAN protocol?

SEMANTIC SEARCH RESULTS:

1. Distance: 0.9009
   Source: data/automotive/CAN.pdf
   Preview: party device (e.g. a sensor-to-CAN module) to inject data into an existing CAN bus. If you do not ensure the global 
 uniqueness of the CAN IDs of ext...

2. Distance: 0.9544
   Source: data/automotive/CAN.pdf
   Preview: UDS vs CAN bus: Standards & OSI model 
 To better understand UDS, we will look at how it relates to CAN bus and the OSI model. 
 As explained in our  ...

3. Distance: 0.9693
   Source: data/automotive/CAN.pdf
   Preview: Quick overview of the UDS OSI model layers 
 In the following we provide a quick breakdown of each layer of the OSI model: 
 ●  Application: This is d...

4. Distance: 0.9991
   Source: data/automotive/CAN.pdf
   Preview: bytes. Further, the network speed is limited to 1 Mbit/s, restricting the implementation of data-producing features. CAN 
 FD resolves these issues - ...

5. Distance: 1.0086
   Source: data/automotive/CAN.pdf

### 4b. BM25 Keyword Search

In [34]:
def bm25_search(query: str, n_results: int = 5, domain: str = None) -> List[Dict]:
    """BM25 keyword search."""
    tokenized_query = tokenize(query)
    scores = bm25.get_scores(tokenized_query)
    
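    # Rank every document by descending BM25 score
    top_indices = np.argsort(scores)[::-1]
    
    # Walk the ranking, applying the optional domain filter
    results = []
    for idx in top_indices:
        if scores[idx] <= 0:
            break  # no query term matched; the rest of the ranking is noise
        if domain and all_metadatas[idx].get('domain') != domain:
            continue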
        
        results.append({
            "id": all_ids[idx],
            "document": all_documents[idx],
            "score": scores[idx],
            "metadata": all_metadatas[idx]
        })
        
        if len(results) >= n_results:
            break
    
    return results

# Test BM25 search
bm25_results = bm25_search(query, n_results=5, domain="automotive")

print("\n" + "=" * 60)
print("BM25 KEYWORD SEARCH RESULTS:")
print("=" * 60)
for i, result in enumerate(bm25_results, 1):
    print(f"\n{i}. Score: {result['score']:.4f}")
    print(f"   Source: {result['metadata'].get('source', 'Unknown')}")
    print(f"   Preview: {result['document'][:150]}...")


BM25 KEYWORD SEARCH RESULTS:

1. Score: 16.2089
   Source: data/automotive/CAN.pdf
   Preview: bytes. Further, the network speed is limited to 1 Mbit/s, restricting the implementation of data-producing features. CAN 
 FD resolves these issues - ...

2. Score: 16.0587
   Source: data/automotive/CAN.pdf
   Preview: Table of Contents 
 CANedge - 2 x CAN/LIN Bus Data Logger  2 
 CAN Bus Explained - A Simple Intro  2 
 What is CAN bus?  2 
 Top 4 benefits of CAN bus ...

3. Score: 15.2533
   Source: data/automotive/CAN.pdf
   Preview: UDS data logging - applications  60 
 CANopen Explained - A Simple Intro  61 
 What is CANopen?  61 
 Six core CANopen concepts  63 
 CANopen communic...

4. Score: 14.6621
   Source: data/automotive/CAN.pdf
   Preview: retrofitting a CAN data logger to record all messages 
 being communicated on the CAN buses. However, in 
 the case of LOG, the aim is to allow for the...

5. Score: 14.5882
   Source: data/automotive/CAN.pdf
   Preview: CAN DBC File Explain

### 4c. Hybrid Search (Reciprocal Rank Fusion)

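**RRF Formula:**
```
RRF(d) = Σ_r 1 / (k + rank_r(d))
```
where the sum runs over each ranked list r that contains document d, rank_r(d) is the 1-based position of d in that list, and k = 60 is a smoothing constant (the value commonly used in the literature).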
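
As a worked example of the formula, the snippet below recomputes scores from ranks alone. The ranks come from the CAN query outputs in sections 4a and 4b: the chunk beginning "bytes. Further, ..." is ranked 4th by semantic search and 1st by BM25, while the top semantic chunk does not appear in the BM25 top 5. The helper name `rrf_score` is an assumption for illustration.

```python
from typing import Iterable


def rrf_score(ranks: Iterable[int], k: int = 60) -> float:
    """Sum 1 / (k + rank) over every ranked list that contains the document."""
    return sum(1.0 / (k + rank) for rank in ranks)


both_lists = rrf_score([4, 1])   # semantic rank 4, BM25 rank 1
one_list = rrf_score([1])        # rank 1 in a single list only

print(f"{both_lists:.6f}")  # 0.032018
print(f"{one_list:.6f}")    # 0.016393
```

A document found by both methods collects two terms, so even a modest rank in one list can outscore a first place in a single list.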

In [35]:
def reciprocal_rank_fusion(
    semantic_results: List[Dict],
    bm25_results: List[Dict],
    k: int = 60
) -> List[Dict]:
    """
    Combine semantic and BM25 results using Reciprocal Rank Fusion.
    
    Args:
        semantic_results: Results from semantic search
        bm25_results: Results from BM25 search
        k: RRF constant (default: 60)
    
    Returns:
        Fused results sorted by RRF score
    """
    # Calculate RRF scores
    rrf_scores = {}
    
    # Add semantic results
    for rank, result in enumerate(semantic_results, 1):
        doc_id = result['id']
        rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + 1 / (k + rank)
    
    # Add BM25 results
    for rank, result in enumerate(bm25_results, 1):
        doc_id = result['id']
        rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + 1 / (k + rank)
    
    # Sort by RRF score
    sorted_ids = sorted(rrf_scores.keys(), key=lambda x: rrf_scores[x], reverse=True)
    
    # Index each result by id, preferring the semantic copy when both lists contain it
    docs_by_id = {}
    for result in semantic_results + bm25_results:
        docs_by_id.setdefault(result['id'], result)
    
    # Build result list
    fused_results = [
        {
            "id": doc_id,
            "document": docs_by_id[doc_id]['document'],
            "rrf_score": rrf_scores[doc_id],
            "metadata": docs_by_id[doc_id]['metadata']
        }
        for doc_id in sorted_ids
    ]
    
    return fused_results

# Test hybrid search
hybrid_results = reciprocal_rank_fusion(semantic_results, bm25_results)

print("\n" + "=" * 60)
print("HYBRID SEARCH RESULTS (RRF):")
print("=" * 60)
for i, result in enumerate(hybrid_results[:5], 1):
    print(f"\n{i}. RRF Score: {result['rrf_score']:.6f}")
    print(f"   Source: {result['metadata'].get('source', 'Unknown')}")
    print(f"   Preview: {result['document'][:150]}...")


HYBRID SEARCH RESULTS (RRF):

1. RRF Score: 0.032018
   Source: data/automotive/CAN.pdf
   Preview: bytes. Further, the network speed is limited to 1 Mbit/s, restricting the implementation of data-producing features. CAN 
 FD resolves these issues - ...

2. RRF Score: 0.016393
   Source: data/automotive/CAN.pdf
   Preview: party device (e.g. a sensor-to-CAN module) to inject data into an existing CAN bus. If you do not ensure the global 
 uniqueness of the CAN IDs of ext...

3. RRF Score: 0.016129
   Source: data/automotive/CAN.pdf
   Preview: UDS vs CAN bus: Standards & OSI model 
 To better understand UDS, we will look at how it relates to CAN bus and the OSI model. 
 As explained in our  ...

4. RRF Score: 0.016129
   Source: data/automotive/CAN.pdf
   Preview: Table of Contents 
 CANedge - 2 x CAN/LIN Bus Data Logger  2 
 CAN Bus Explained - A Simple Intro  2 
 What is CAN bus?  2 
 Top 4 benefits of CAN bus ...

5. RRF Score: 0.015873
   Source: data/automotive/CAN.pdf
   Previe

## Step 5: Compare All Methods

Test all queries and compare results.

In [36]:
def compare_search_methods(query: str, domain: str = None, n_results: int = 3):
    """Compare semantic, BM25, and hybrid search."""
    print("\n" + "=" * 80)
    print(f"QUERY: {query}")
    if domain:
        print(f"DOMAIN: {domain}")
    print("=" * 80)
    
    # Get results from all methods
    semantic_res = semantic_search(query, n_results=n_results, domain=domain)
    bm25_res = bm25_search(query, n_results=n_results, domain=domain)
    hybrid_res = reciprocal_rank_fusion(semantic_res, bm25_res)[:n_results]
    
    # Display results
    print("\nüìä SEMANTIC SEARCH:")
    for i, r in enumerate(semantic_res, 1):
        print(f"{i}. [{r['metadata'].get('source', 'Unknown')}] {r['document'][:100]}...")
    
    print("\nüî§ BM25 KEYWORD SEARCH:")
    for i, r in enumerate(bm25_res, 1):
        print(f"{i}. [{r['metadata'].get('source', 'Unknown')}] {r['document'][:100]}...")
    
    print("\nüîÑ HYBRID SEARCH (RRF):")
    for i, r in enumerate(hybrid_res, 1):
        print(f"{i}. [{r['metadata'].get('source', 'Unknown')}] {r['document'][:100]}...")

# Test all queries
compare_search_methods("What is CAN protocol?", domain="automotive")
compare_search_methods("OBD-II diagnostic system", domain="automotive")
compare_search_methods("dresses under 1000 rupees", domain="fashion")


QUERY: What is CAN protocol?
DOMAIN: automotive

📊 SEMANTIC SEARCH:
1. [data/automotive/CAN.pdf] party device (e.g. a sensor-to-CAN module) to inject data into an existing CAN bus. If you do not en...
2. [data/automotive/CAN.pdf] UDS vs CAN bus: Standards & OSI model 
 To better understand UDS, we will look at how it relates to ...
3. [data/automotive/CAN.pdf] Quick overview of the UDS OSI model layers 
 In the following we provide a quick breakdown of each l...

🔤 BM25 KEYWORD SEARCH:
1. [data/automotive/CAN.pdf] bytes. Further, the network speed is limited to 1 Mbit/s, restricting the implementation of data-pro...
2. [data/automotive/CAN.pdf] Table of Contents 
 CANedge - 2 x CAN/LIN Bus Data Logger  2 
 CAN Bus Explained - A Simple Intro  2...
3. [data/automotive/CAN.pdf] UDS data logging - applications  60 
 CANopen Explained - A Simple Intro  61 
 What is CANopen?  61 ...

🔄 HYBRID SEARCH (RRF):
1. [data/automotive/CAN.pdf] party device (e.g. a sensor-to-CAN module) to i

## Step 6: Analysis

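**Expected Observations:**

1. **Semantic Search:**
   - Good for conceptual queries
   - May miss exact term matches

2. **BM25 Search:**
   - Strong on exact terms ("CAN", "OBD-II")
   - Misses paraphrases and synonyms, and can favor term-dense chunks such as tables of contents (ranks 2 and 3 for the CAN query)

3. **Hybrid (RRF):**
   - Combines the strengths of both rankings
   - Documents returned by both methods accumulate two reciprocal-rank terms and rise to the top (0.032018 vs. 0.016393 for the CAN query)
   - Less sensitive to the failure modes of either method alone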
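
Whether fusion helps depends on how much the two result lists overlap. In the Step 5 comparison with `n_results=3`, the semantic and BM25 previews for the CAN query appear to share no chunks, so RRF can only interleave them. A small helper such as the following (the name `result_overlap` is hypothetical) can quantify the overlap for any pair of result lists shaped like those returned by `semantic_search` and `bm25_search`.

```python
from typing import Dict, List, Set


def result_overlap(a: List[Dict], b: List[Dict]) -> Set[str]:
    """Return the ids that appear in both result lists."""
    return {r["id"] for r in a} & {r["id"] for r in b}


# Toy result lists with made-up ids, shaped like the notebook's result dicts.
semantic_demo = [{"id": "c1"}, {"id": "c2"}, {"id": "c3"}]
bm25_demo = [{"id": "c3"}, {"id": "c4"}, {"id": "c5"}]

shared = result_overlap(semantic_demo, bm25_demo)
print(f"{len(shared)} shared of {len(semantic_demo)}: {sorted(shared)}")
```

An overlap that stays near zero suggests retrieving more candidates from each method before fusing.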

**Next Steps:**
- Integrate hybrid search into `src/hybrid_search.py`
- Update `main.py` to use hybrid search
- Add option to switch between search modes
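
One possible shape for that module is sketched below. Everything in it is an assumption rather than the project's actual code: the function name `hybrid_search`, the `mode` values, and the `candidate_pool` parameter are hypothetical. The retrievers are passed in as callables so the sketch runs on its own; in this notebook they would be `semantic_search` and `bm25_search`. Fetching a larger candidate pool than `n_results` from each retriever gives RRF more chances to find documents ranked by both.

```python
from typing import Callable, Dict, List

Retriever = Callable[[str, int], List[Dict]]


def hybrid_search(query: str, semantic_fn: Retriever, keyword_fn: Retriever,
                  n_results: int = 5, candidate_pool: int = 20,
                  mode: str = "hybrid", k: int = 60) -> List[Dict]:
    """Dispatch to one retriever, or fuse both with Reciprocal Rank Fusion."""
    if mode == "semantic":
        return semantic_fn(query, n_results)
    if mode == "keyword":
        return keyword_fn(query, n_results)
    if mode != "hybrid":
        raise ValueError(f"Unknown mode: {mode!r}")

    fused: Dict[str, Dict] = {}
    for results in (semantic_fn(query, candidate_pool),
                    keyword_fn(query, candidate_pool)):
        for rank, result in enumerate(results, 1):
            # Copy the first-seen result and accumulate its RRF terms.
            entry = fused.setdefault(result["id"], {**result, "rrf_score": 0.0})
            entry["rrf_score"] += 1.0 / (k + rank)
    ranked = sorted(fused.values(), key=lambda r: r["rrf_score"], reverse=True)
    return ranked[:n_results]


# Stub retrievers with made-up ids, standing in for the real search functions.
def fake_semantic(query: str, n: int) -> List[Dict]:
    return [{"id": i} for i in ["a", "b", "c"]][:n]


def fake_keyword(query: str, n: int) -> List[Dict]:
    return [{"id": i} for i in ["c", "a", "d"]][:n]


top = hybrid_search("CAN protocol", fake_semantic, fake_keyword, n_results=2)
print([r["id"] for r in top])
```

A domain filter could be bound to each retriever beforehand, for example with `functools.partial`, so the fusion code stays independent of metadata filtering.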

## Step 7: Save BM25 Index (Optional)

For production, we can pickle the BM25 index to avoid rebuilding.

In [37]:
import pickle

# Save BM25 index
with open('../bm25_index.pkl', 'wb') as f:
    pickle.dump({
        'bm25': bm25,
        'documents': all_documents,
        'metadatas': all_metadatas,
        'ids': all_ids
    }, f)

print("✅ BM25 index saved to: bm25_index.pkl")
print("   Size:", round(os.path.getsize('../bm25_index.pkl') / 1024 / 1024, 2), "MB")

✅ BM25 index saved to: bm25_index.pkl
   Size: 22.18 MB
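
Loading the file back is the mirror image of the save. The sketch below assumes the same dictionary keys as the cell above; the function name `load_bm25_index` and the consistency checks are illustrative additions. Note that `pickle.load` can execute arbitrary code, so it should only be used on files you created yourself, and that a saved index goes stale whenever the ChromaDB collection changes.

```python
import os
import pickle
import tempfile
from typing import Any, Dict, Optional


def load_bm25_index(path: str, expected_count: Optional[int] = None) -> Dict[str, Any]:
    """Load a pickled index payload and check that its parts are consistent."""
    with open(path, "rb") as f:
        payload = pickle.load(f)  # only unpickle files from a trusted source

    n_docs = len(payload["documents"])
    if not (n_docs == len(payload["metadatas"]) == len(payload["ids"])):
        raise ValueError("documents, metadatas and ids have different lengths")
    if expected_count is not None and n_docs != expected_count:
        raise ValueError(f"Index has {n_docs} documents, expected {expected_count}")
    return payload


# Round-trip demo with a tiny stand-in payload ('bm25' would be the BM25Okapi object).
demo_path = os.path.join(tempfile.mkdtemp(), "bm25_index_demo.pkl")
with open(demo_path, "wb") as f:
    pickle.dump({"bm25": None, "documents": ["doc"], "metadatas": [{}], "ids": ["1"]}, f)

restored = load_bm25_index(demo_path, expected_count=1)
print(restored["ids"])
```

In the notebook, passing `expected_count=collection.count()` would flag an index that no longer matches the collection.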
