# Information Retrieval: BM25 vs Jina Embeddings v4

This notebook compares two retrieval approaches:
- **BM25**: Traditional lexical (keyword-based) retrieval
- **Jina Embeddings v4**: Modern semantic (meaning-based) retrieval

We'll explore their strengths and weaknesses with carefully designed examples.

## Setup

In [5]:


%%capture
%pip install pandas==2.3.3

In [6]:
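import warnings

import numpy as np
import pandas as pd
from IPython.display import display  # built into Jupyter; imported explicitly for clarity

from retrievers.bm25 import BM25
from retrievers.embeddings import JinaEmbedder, DummyEmbedder, EmbeddingRetriever

# Silence library warnings to keep the notebook output readable.
warnings.filterwarnings('ignore')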



## Test Corpus

We've designed this corpus to highlight different retrieval scenarios:
1. **Exact keyword matches** - BM25 should excel
2. **Semantic/paraphrased queries** - Embeddings should excel
3. **Technical terms** - BM25's strength with rare terms
4. **Conceptual understanding** - Embeddings' strength

In [None]:
corpus = [
    {
        "doc_id": "climate1",
        "text": "Climate change is causing rising sea levels and extreme weather events, "
               "threatening coastal cities worldwide. Scientists warn of increasing hurricanes and floods."
    },
    {
        "doc_id": "climate2", 
        "text": "Renewable energy sources like solar and wind power are essential for "
               "reducing greenhouse gas emissions and combating global warming."
    },
    {
        "doc_id": "ml1",
        "text": "Machine learning algorithms can identify patterns in large datasets and "
               "make predictions based on training data using neural networks."
    },
    {
        "doc_id": "ml2",
        "text": "Deep neural networks use multiple layers to learn hierarchical "
               "representations of data for complex tasks like image recognition."
    },
    {
        "doc_id": "ml3",
        "text": "Artificial intelligence systems can now understand natural language, "
               "recognize objects in images, and even generate creative content."
    },
    {
        "doc_id": "space1",
        "text": "The James Webb Space Telescope is revealing unprecedented details about "
               "distant galaxies and the early universe using infrared technology."
    },
    {
        "doc_id": "quantum1",
        "text": "Quantum computing leverages quantum entanglement and superposition to solve "
               "complex computational problems exponentially faster than classical computers."
    },
    {
        "doc_id": "bio1",
        "text": "CRISPR gene editing technology enables precise modifications to DNA sequences, "
               "opening new possibilities for treating genetic diseases and improving crops."
    }
]

# Display corpus
df_corpus = pd.DataFrame(corpus)
print(f"📚 Corpus: {len(corpus)} documents\n")
df_corpus

## Test Queries

Each query is designed to test specific retrieval characteristics:

### BM25 Strengths:
- **Exact keywords** (queries 1, 4, 5)
- **Rare technical terms** (queries 1, 4)

### Embedding Strengths:
- **Semantic similarity** (queries 2, 7)
- **Paraphrasing** (query 6)
- **Conceptual understanding** (query 3)

In [None]:
queries = [
    {
        "id": "q1",
        "query": "quantum entanglement superposition",
        "expected": "quantum1",
        "type": "Exact keywords (BM25 strength)",
        "description": "Contains rare, technical terms that appear in only one document"
    },
    {
        "id": "q2",
        "query": "global warming and rising temperatures",
        "expected": "climate2",
        "type": "Semantic similarity (Embedding strength)",
        "description": "'global warming' matches climate2 verbatim, but 'rising' also occurs in climate1; 'rising temperatures' has to be matched by meaning"
    },
    {
        "id": "q3",
        "query": "How do computers learn from data?",
        "expected": "ml1",
        "type": "Conceptual question (Embedding strength)",
        "description": "Question form with almost no keyword overlap ('data' is the only shared word), but conceptually about machine learning"
    },
    {
        "id": "q4",
        "query": "CRISPR DNA editing",
        "expected": "bio1",
        "type": "Exact technical terms (BM25 strength)",
        "description": "Specific acronym and technical terms"
    },
    {
        "id": "q5",
        "query": "James Webb infrared telescope",
        "expected": "space1",
        "type": "Multi-keyword match (BM25 strength)",
        "description": "All four query terms appear in the target document"
    },
    {
        "id": "q6",
        "query": "AI that understands human language",
        "expected": "ml3",
        "type": "Paraphrasing (Embedding strength)",
        "description": "Different words, same meaning as 'understand natural language'"
    },
    {
        "id": "q7",
        "query": "environmental impact of CO2",
        "expected": "climate2",
        "type": "Conceptual understanding (Embedding strength)",
        "description": "CO2 → greenhouse gas, conceptual link without exact words"
    },
]

df_queries = pd.DataFrame(queries)
print(f"🔍 Queries: {len(queries)} test cases\n")
df_queries[["id", "query", "type", "expected"]]

## Initialize Retrievers

### 1. BM25 Retriever

In [None]:
print("📊 Initializing BM25 retriever...")
bm25 = BM25()
bm25.fit(corpus)
print("✓ BM25 ready!")
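
The `BM25` class comes from the local `retrievers` package, whose source is not shown here. As a rough guide to what a BM25 scorer computes, the following is a minimal sketch of Okapi BM25, assuming a simple regex tokenizer and the common defaults `k1=1.5` and `b=0.75`. The names `tokenize` and `bm25_scores` and the toy documents are illustrative only, and the actual implementation may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF; the "+ 1" inside the log keeps it non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy documents for illustration only (not the notebook corpus).
toy_docs = [
    "quantum entanglement and superposition",
    "neural networks learn from data",
    "solar and wind power",
]
toy_scores = bm25_scores("quantum superposition", toy_docs)
print(toy_scores)
```

Only the first toy document contains either query term, so it is the only one with a non-zero score. Documents that share no term with the query always score zero, which is the root of the vocabulary-mismatch problem explored later in the notebook.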

### 2. Jina Embeddings Retriever

**Note:** The first run downloads a ~7.5 GB model. Subsequent runs load it from the local cache.

In [None]:
jina_retriever = None
jina_embedder = None

try:
    print("🚀 Loading Jina embeddings v4 model...")
    print("   (This may take a few minutes on first run)\n")
    
    jina_embedder = JinaEmbedder(model_name="jinaai/jina-embeddings-v4", task="retrieval")
    print("✓ Model loaded!")
    
    # Test encoding
    test_emb = jina_embedder.encode(["test"], prompt_name="passage")
    print(f"✓ Embedding dimension: {test_emb.shape[1]}")
    print(f"✓ Normalized: {np.allclose(np.linalg.norm(test_emb[0]), 1.0, rtol=1e-3)}")
    
    jina_retriever = EmbeddingRetriever(embedder=jina_embedder)
    jina_retriever.fit(corpus)
    print("✓ Jina retriever ready!\n")
    
except Exception as e:
    print(f"❌ Error loading Jina model: {e}")
    print("   Install dependencies: pip install -r requirements.txt")
    print("   Continuing with BM25 only...\n")

### 3. Dummy Embeddings (Baseline)

Random normalized embeddings for comparison.

In [None]:
print("🎲 Initializing Dummy embedder (baseline)...")
dummy_embedder = DummyEmbedder()
dummy_retriever = EmbeddingRetriever(embedder=dummy_embedder)
dummy_retriever.fit(corpus)
print("✓ Dummy retriever ready!")
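
The source of `DummyEmbedder` is not shown, so the following is only a sketch of what a random-embedding baseline might look like. `RandomEmbedder`, its `dim` and `seed` parameters, and the sample texts are hypothetical and not part of the `retrievers` package.

```python
import numpy as np


class RandomEmbedder:
    """Hypothetical stand-in: maps each text to a random unit vector."""

    def __init__(self, dim=64, seed=0):
        self.dim = dim
        self.rng = np.random.default_rng(seed)

    def encode(self, texts):
        vectors = self.rng.standard_normal((len(texts), self.dim))
        # L2-normalize each row so dot products equal cosine similarities.
        return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)


embedder = RandomEmbedder()
doc_vecs = embedder.encode(["doc a", "doc b", "doc c"])
query_vec = embedder.encode(["some query"])[0]

# The resulting ranking carries no information about the texts.
scores = doc_vecs @ query_vec
print(np.argsort(-scores))
```

A baseline like this shows what chance looks like. With 8 documents and `k=3`, a uniformly random ranking places the expected document in the top 3 with probability 3/8, so a random baseline should average about 37.5% on the accuracy measure used below.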

## Run Retrieval Experiments

In [None]:
def run_retrieval(retriever, query, k=3, name="Retriever"):
    """Run retrieval and return results."""
    if retriever is None:
        return None
    
    try:
        results = retriever.rank(query, k=k)
        return results
    except Exception as e:
        print(f"Error with {name}: {e}")
        return None


def format_results(results, expected_doc=None):
    """Format results as dataframe with highlighting."""
    if results is None:
        return None
    
    df = pd.DataFrame(results, columns=["doc_id", "score"])
    df["rank"] = range(1, len(df) + 1)
    df["correct"] = (df["doc_id"] == expected_doc) if expected_doc else False
    df = df[["rank", "doc_id", "score", "correct"]]
    return df

## Comparison Results

For each query, we'll show:
- Top 3 results from each retriever
- Whether the expected document was retrieved
- Analysis of why each method succeeded or failed

In [None]:
# Store all results
all_results = []

for q in queries:
    print("=" * 80)
    print(f"Query {q['id']}: {q['query']}")
    print(f"Type: {q['type']}")
    print(f"Expected: {q['expected']} - {q['description']}")
    print("=" * 80)
    
    # BM25
    print("\n📊 BM25 Results:")
    bm25_results = run_retrieval(bm25, q['query'], k=3, name="BM25")
    bm25_df = format_results(bm25_results, q['expected'])
    if bm25_df is not None:
        display(bm25_df)
        bm25_correct = bm25_df[bm25_df['correct']].shape[0] > 0
        bm25_rank = bm25_df[bm25_df['correct']]['rank'].values[0] if bm25_correct else None
    else:
        bm25_correct = False
        bm25_rank = None
    
    # Jina Embeddings
    print("\n🚀 Jina Embeddings Results:")
    jina_results = run_retrieval(jina_retriever, q['query'], k=3, name="Jina")
    jina_df = format_results(jina_results, q['expected'])
    if jina_df is not None:
        display(jina_df)
        jina_correct = jina_df[jina_df['correct']].shape[0] > 0
        jina_rank = jina_df[jina_df['correct']]['rank'].values[0] if jina_correct else None
    else:
        print("   (Not available)")
        jina_correct = False
        jina_rank = None
    
    # Dummy (baseline)
    print("\n🎲 Dummy Embeddings Results:")
    dummy_results = run_retrieval(dummy_retriever, q['query'], k=3, name="Dummy")
    dummy_df = format_results(dummy_results, q['expected'])
    if dummy_df is not None:
        display(dummy_df)
        dummy_correct = dummy_df[dummy_df['correct']].shape[0] > 0
        dummy_rank = dummy_df[dummy_df['correct']]['rank'].values[0] if dummy_correct else None
    else:
        dummy_correct = False
        dummy_rank = None
    
    # Store results
    all_results.append({
        'query_id': q['id'],
        'query': q['query'],
        'type': q['type'],
        'expected': q['expected'],
        'bm25_correct': bm25_correct,
        'bm25_rank': bm25_rank,
        'jina_correct': jina_correct,
        'jina_rank': jina_rank,
        'dummy_correct': dummy_correct,
        'dummy_rank': dummy_rank,
    })
    
    print("\n\n")

## Summary Analysis

In [None]:
df_results = pd.DataFrame(all_results)

print("=" * 80)
print("SUMMARY: Retrieval Performance")
print("=" * 80)

# Overall accuracy
print("\n📊 Accuracy (found expected document in top 3):")
print(f"  BM25:            {df_results['bm25_correct'].sum()}/{len(queries)} = {df_results['bm25_correct'].mean():.1%}")
if jina_retriever is not None:
    print(f"  Jina Embeddings: {df_results['jina_correct'].sum()}/{len(queries)} = {df_results['jina_correct'].mean():.1%}")
print(f"  Dummy (baseline): {df_results['dummy_correct'].sum()}/{len(queries)} = {df_results['dummy_correct'].mean():.1%}")

# By query type
print("\n📈 Performance by Query Type:")
display(df_results.groupby('type').agg({
    'bm25_correct': 'mean',
    'jina_correct': 'mean',
    'dummy_correct': 'mean'
}).round(2))

print("\n💡 Full Results Table:")
display(df_results)

## Key Insights

### BM25 Strengths:
✅ Excellent with **exact keyword matches**  
✅ Strong on **rare technical terms** (high IDF)  
✅ Fast and efficient  
✅ Interpretable (you can see which terms matched)  

### BM25 Weaknesses:
❌ No semantic understanding ("global warming" ≠ "greenhouse gas")  
❌ Struggles with **paraphrasing**  
❌ Can't handle **conceptual queries**  
❌ Vocabulary mismatch problems  
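
Vocabulary mismatch is easy to see on query q6 and its target document `ml3`. The sketch below assumes a plain lowercase word tokenizer with no stemming, which may not match the tokenizer used by the `BM25` class. The helper name `tokens` is illustrative.

```python
import re


def tokens(text):
    """Lowercased word set; no stemming (an assumption about the tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


query = "AI that understands human language"
doc = ("Artificial intelligence systems can now understand natural language, "
       "recognize objects in images, and even generate creative content.")

overlap = tokens(query) & tokens(doc)
print(overlap)  # → {'language'}
```

Under these assumptions, "AI" does not match "Artificial intelligence", and "understands" does not match "understand", so only one query word can contribute to a lexical score.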

### Jina Embeddings Strengths:
✅ Understands **semantic similarity**  
✅ Handles **paraphrasing** well  
✅ Works with **conceptual queries**  
✅ Robust to vocabulary mismatch  
✅ Multilingual and multimodal (text + images)  

### Jina Embeddings Weaknesses:
❌ Computationally expensive  
❌ Requires GPU for fast inference at scale  
❌ Less interpretable (black box)  
❌ May miss exact matches on rare terms, codes, or identifiers that were uncommon in the training data  

### Best Practice:
🎯 **Hybrid retrieval** - Combine both approaches!
- Use BM25 for keyword matches
- Use embeddings for semantic understanding
- Merge and re-rank results
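
One common way to merge two ranked lists is reciprocal rank fusion (RRF). The sketch below is one possible approach, not something this notebook implements. The function name is illustrative, `k=60` is the conventional smoothing constant, and the two input rankings are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list, best first."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it contains.
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


# Made-up rankings for illustration only (not results from this notebook).
bm25_ranking = ["quantum1", "ml1", "space1"]
embedding_ranking = ["ml1", "ml2", "quantum1"]

fused_ranking = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
print(fused_ranking)  # → ['ml1', 'quantum1', 'ml2', 'space1']
```

RRF uses only ranks, so BM25 scores and cosine similarities never need to be put on a common scale before merging.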

## Similarity Inspection (Jina Embeddings)

Let's inspect how Jina scores the semantic relationship between pairs of short phrases.

In [None]:
if jina_embedder is not None:
    print("🔬 Pairwise Semantic Similarities\n")
    
    pairs = [
        ("climate change", "global warming"),
        ("machine learning", "artificial intelligence"),
        ("neural networks", "deep learning"),
        ("quantum computing", "classical computing"),
        ("climate change", "quantum computing"),  # Unrelated
    ]
    
    similarity_data = []
    
    for text1, text2 in pairs:
        emb1 = jina_embedder.encode([text1], prompt_name="passage")[0]
        emb2 = jina_embedder.encode([text2], prompt_name="passage")[0]
        
        # Cosine similarity (dot product for normalized vectors)
        similarity = float(np.dot(emb1, emb2))
        
        similarity_data.append({
            'text1': text1,
            'text2': text2,
            'similarity': similarity,
            'interpretation': (
                'Very similar' if similarity > 0.8 else
                'Similar' if similarity > 0.6 else
                'Somewhat similar' if similarity > 0.4 else
                'Different'
            )
        })
    
    df_similarity = pd.DataFrame(similarity_data)
    display(df_similarity)
else:
    print("⚠️  Jina embeddings not available")
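
The cell above uses a plain dot product, which equals cosine similarity only when both vectors have unit length. For embeddings that are not normalized, a general version could look like the sketch below. The helper name `cosine_similarity` and the toy vectors are illustrative.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity that does not assume unit-length inputs."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


u = np.array([3.0, 4.0])  # norm 5, not normalized
v = np.array([6.0, 8.0])  # same direction, norm 10

raw_dot = float(np.dot(u, v))
cos_sim = cosine_similarity(u, v)
print(raw_dot)  # raw dot product grows with vector length
print(cos_sim)  # cosine depends only on direction
```

Here the raw dot product is 50 while the cosine similarity is 1 (up to floating-point error), so thresholds such as 0.8 or 0.6 are only meaningful on normalized vectors.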

## Conclusion

Both BM25 and neural embeddings have their place in modern IR systems:

- **BM25**: Fast, interpretable, great for exact matches
- **Embeddings**: Semantic understanding, handles paraphrasing
- **Best approach**: Hybrid systems that combine both strengths

For production systems, consider:
1. First-stage retrieval with BM25 (fast, broad recall)
2. Re-ranking with embeddings (precise, semantic)
3. Evaluation metrics (Precision@k, MRR, NDCG)
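
The metrics named above can be sketched in a few lines for the binary-relevance case used in this notebook, where each query has one expected document. The function names and the example rankings are illustrative and made up. They are not results from the experiments above.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """NDCG@k with binary relevance (gain 1 for relevant, 0 otherwise)."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Made-up (ranking, relevant set) pairs for illustration only.
runs = [
    (["quantum1", "ml1", "ml2"], {"quantum1"}),  # hit at rank 1
    (["ml2", "ml1", "ml3"], {"ml1"}),            # hit at rank 2
]

mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
print(f"MRR = {mrr:.2f}")  # → MRR = 0.75
```

Unlike the top-3 accuracy reported earlier, MRR and NDCG reward placing the expected document higher in the list, which makes them better suited to comparing retrievers that both find the right document.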