# Information Retrieval: BM25 vs Jina Embeddings v4

This notebook compares two retrieval approaches:
- **BM25**: Traditional lexical (keyword-based) retrieval
- **Jina Embeddings v4**: Modern semantic (meaning-based) retrieval

We'll explore their strengths and weaknesses with carefully designed examples.

## Setup

In [86]:


%%capture
%pip install pandas==2.3.3



In [87]:
import sys
from pathlib import Path
import json

# Add parent directory to path so we can import from retrievers
sys.path.insert(0, str(Path.cwd().parent))

import numpy as np
import pandas as pd
from retrievers.embeddings import JinaEmbedder, DummyEmbedder, EmbeddingRetriever
from retrievers.bm25 import BM25
from metrics import precision_at_k, mrr, ndcg_at_k
import warnings
warnings.filterwarnings('ignore', message='.*urllib3 v2.*')

## Test Corpus

We've designed this corpus to highlight different retrieval scenarios:
1. **Exact keyword matches** - BM25 should excel
2. **Semantic/paraphrased queries** - Embeddings should excel
3. **Technical terms** - BM25's strength with rare terms
4. **Conceptual understanding** - Embeddings' strength

In [88]:
# Load corpus from data/corpus.jsonl
corpus = []
corpus_path = Path("../data/corpus.jsonl")
with open(corpus_path, 'r') as f:
    for line in f:
        entry = json.loads(line.strip())
        corpus.append(entry)

# Display corpus
df_corpus = pd.DataFrame(corpus)
print(f"📚 Corpus loaded from {corpus_path}: {len(corpus)} documents\n")
df_corpus

📚 Corpus loaded from ../data/corpus.jsonl: 8 documents



Unnamed: 0,doc_id,text
0,climate1,Climate change is causing rising sea levels an...
1,climate2,Renewable energy sources like solar and wind p...
2,ml1,Machine learning algorithms can identify patte...
3,ml2,Deep neural networks use multiple layers to le...
4,ml3,Artificial intelligence systems can now unders...
5,space1,The James Webb Space Telescope is revealing un...
6,quantum1,Quantum computing leverages quantum entangleme...
7,bio1,CRISPR gene editing technology enables precise...


## Test Queries

Each query is designed to test specific retrieval characteristics:

### BM25 Strengths:
- **Exact keywords** (queries 1, 4, 5)
- **Rare technical terms**

### Embedding Strengths:
- **Semantic similarity** (query 2)
- **Paraphrasing** (query 6)
- **Conceptual understanding** (queries 3, 7)

In [89]:
# Load queries from data/queries.jsonl
queries_basic = []
queries_path = Path("../data/queries.jsonl")
with open(queries_path, 'r') as f:
    for line in f:
        entry = json.loads(line.strip())
        queries_basic.append(entry)

# Add metadata for each query (expected doc, type, description)
query_metadata = {
    "q1": {
        "expected": "quantum1",
        "type": "Exact keywords (BM25 strength)",
        "description": "Contains rare, technical terms that appear in only one document"
    },
    "q2": {
        "expected": "climate2",
        "type": "Semantic similarity (Embedding strength)",
        "description": "'global warming' is semantically similar to 'greenhouse gas emissions'"
    },
    "q3": {
        "expected": "ml1",
        "type": "Conceptual question (Embedding strength)",
        "description": "Question form, no exact keywords but conceptually about machine learning"
    },
    "q4": {
        "expected": "bio1",
        "type": "Exact technical terms (BM25 strength)",
        "description": "Specific acronym and technical terms"
    },
    "q5": {
        "expected": "space1",
        "type": "Multi-keyword match (BM25 strength)",
        "description": "All three keywords appear in target document"
    },
    "q6": {
        "expected": "ml3",
        "type": "Paraphrasing (Embedding strength)",
        "description": "Different words, same meaning as 'understand natural language'"
    },
    "q7": {
        "expected": "climate2",
        "type": "Conceptual understanding (Embedding strength)",
        "description": "CO2 → greenhouse gas, conceptual link without exact words"
    }
}

# Merge query text with metadata
queries = []
for q_basic in queries_basic:
    qid = q_basic['qid']
    metadata = query_metadata.get(qid, {})
    queries.append({
        "id": qid,
        "query": q_basic['query'],
        "expected": metadata.get("expected", ""),
        "type": metadata.get("type", "Unknown"),
        "description": metadata.get("description", "")
    })

df_queries = pd.DataFrame(queries)
print(f"🔍 Queries loaded from {queries_path}: {len(queries)} test cases\n")
df_queries[["id", "query", "type", "expected"]]

🔍 Queries loaded from ../data/queries.jsonl: 7 test cases



Unnamed: 0,id,query,type,expected
0,q1,quantum entanglement superposition,Exact keywords (BM25 strength),quantum1
1,q2,global warming and rising temperatures,Semantic similarity (Embedding strength),climate2
2,q3,How do computers learn from data?,Conceptual question (Embedding strength),ml1
3,q4,CRISPR DNA editing,Exact technical terms (BM25 strength),bio1
4,q5,James Webb infrared telescope,Multi-keyword match (BM25 strength),space1
5,q6,AI that understands human language,Paraphrasing (Embedding strength),ml3
6,q7,environmental impact of CO2,Conceptual understanding (Embedding strength),climate2


## Initialize Retrievers

### 1. BM25 Retriever

In [90]:
print("📊 Initializing BM25 retriever...")
bm25 = BM25()
bm25.fit(corpus)
print("✓ BM25 ready!")

📊 Initializing BM25 retriever...
✓ BM25 ready!
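
The `BM25` class imported from `retrievers.bm25` is not shown in this notebook. As a rough illustration of what such a retriever typically computes, here is a minimal sketch of Okapi BM25 scoring. The whitespace tokenizer, the `k1=1.5` and `b=0.75` defaults, the smoothed IDF variant, the toy documents, and the helper names are all assumptions for illustration, not the notebook's actual implementation.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real tokenizers differ.
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document in `docs` (a list of strings)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = tokenize(query)

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            # Smoothed, non-negative IDF: rarer terms receive a larger weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

# Hypothetical toy documents (not the notebook's corpus).
toy_docs = [
    "quantum computing leverages quantum entanglement",
    "climate change is causing rising sea levels",
    "deep neural networks learn from data",
]
toy_scores = bm25_scores("quantum entanglement", toy_docs)
```

In this sketch a document that shares no term with the query scores exactly 0.0, which is consistent with the 0.0 entries that appear in several BM25 result lists in this notebook.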


### 2. Jina Embeddings Retriever

**Note:** First run will download ~7.5GB model. Subsequent runs load from cache.

In [91]:
jina_retriever = None
jina_embedder = None

try:
    print("🚀 Loading Jina embeddings v4 model...")
    print("   (This may take a few minutes on first run)\n")
    
    jina_embedder = JinaEmbedder(model_name="jinaai/jina-embeddings-v4", task="retrieval")
    print("✓ Model loaded!")
    
    # Test encoding
    test_emb = jina_embedder.encode(["test"], prompt_name="passage")
    print(f"✓ Embedding dimension: {test_emb.shape[1]}")
    print(f"✓ Normalized: {np.allclose(np.linalg.norm(test_emb[0]), 1.0, rtol=1e-3)}")
    
    jina_retriever = EmbeddingRetriever(embedder=jina_embedder)
    jina_retriever.fit(corpus)
    print("✓ Jina retriever ready!\n")
    
except Exception as e:
    print(f"❌ Error loading Jina model: {e}")
    print("   Install dependencies: pip install -r requirements.txt")
    print("   Continuing with BM25 only...\n")

🚀 Loading Jina embeddings v4 model...
   (This may take a few minutes on first run)





✓ Model loaded!
✓ Embedding dimension: 2048
✓ Normalized: True
✓ Jina retriever ready!



### 3. Dummy Embeddings (Baseline)

Random normalized embeddings for comparison.

In [92]:
print("🎲 Initializing Dummy embedder (baseline)...")
dummy_embedder = DummyEmbedder()
dummy_retriever = EmbeddingRetriever(embedder=dummy_embedder)
dummy_retriever.fit(corpus)
print("✓ Dummy retriever ready!")

🎲 Initializing Dummy embedder (baseline)...
✓ Dummy retriever ready!
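
`EmbeddingRetriever` and `DummyEmbedder` are imported from `retrievers.embeddings` and are not shown here. The sketch below illustrates the general idea they presumably share: embed everything as unit-length vectors and rank documents by dot product, which equals cosine similarity for normalized vectors. The helper names, the 16-dimensional random vectors, and the document IDs are hypothetical, and the code uses only the standard library to stay self-contained.

```python
import math
import random

def normalize(vec):
    # Scale to unit L2 norm so that a dot product equals cosine similarity.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rank_by_cosine(query_vec, doc_vecs, k=3):
    """Return the top-k (doc_id, score) pairs, highest similarity first."""
    scored = [(doc_id, dot(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def random_unit_vector(dim=16):
    # Stand-in for a "dummy" embedding: a random direction with no meaning.
    return normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])

# Hypothetical document IDs, each mapped to a random unit vector.
doc_vecs = {doc_id: random_unit_vector() for doc_id in ["d1", "d2", "d3", "d4"]}
query_vec = random_unit_vector()

top3 = rank_by_cosine(query_vec, doc_vecs, k=3)
```

Because random directions carry no information about the text, any correct hit from such a baseline is a matter of chance, which is what makes it a useful floor for comparison.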


## Run Retrieval Experiments

In [93]:
def run_retrieval(retriever, query, k=3, name="Retriever"):
    """Run retrieval and return results."""
    if retriever is None:
        return None
    
    try:
        results = retriever.rank(query, k=k)
        return results
    except Exception as e:
        print(f"Error with {name}: {e}")
        return None


def format_results(results, expected_doc=None):
    """Format results as dataframe with highlighting."""
    if results is None:
        return None
    
    df = pd.DataFrame(results, columns=["doc_id", "score"])
    df["rank"] = range(1, len(df) + 1)
    df["correct"] = (df["doc_id"] == expected_doc) if expected_doc else False
    df = df[["rank", "doc_id", "score", "correct"]]
    return df

## Comparison Results

For each query, we'll show:
- Top 3 results from each retriever
- Whether the expected document was retrieved
- Analysis of why each method succeeded or failed

In [94]:
# Store all results
all_results = []

for q in queries:
    print("=" * 80)
    print(f"Query {q['id']}: {q['query']}")
    print(f"Type: {q['type']}")
    print(f"Expected: {q['expected']} - {q['description']}")
    print("=" * 80)
    
    # BM25
    print("\n📊 BM25 Results:")
    bm25_results = run_retrieval(bm25, q['query'], k=3, name="BM25")
    bm25_df = format_results(bm25_results, q['expected'])
    if bm25_df is not None:
        display(bm25_df)
        bm25_correct = bm25_df[bm25_df['correct']].shape[0] > 0
        bm25_rank = bm25_df[bm25_df['correct']]['rank'].values[0] if bm25_correct else None
    else:
        bm25_correct = False
        bm25_rank = None
    
    # Jina Embeddings
    print("\n🚀 Jina Embeddings Results:")
    jina_results = run_retrieval(jina_retriever, q['query'], k=3, name="Jina")
    jina_df = format_results(jina_results, q['expected'])
    if jina_df is not None:
        display(jina_df)
        jina_correct = jina_df[jina_df['correct']].shape[0] > 0
        jina_rank = jina_df[jina_df['correct']]['rank'].values[0] if jina_correct else None
    else:
        print("   (Not available)")
        jina_correct = False
        jina_rank = None
    
    # Dummy (baseline)
    print("\n🎲 Dummy Embeddings Results:")
    dummy_results = run_retrieval(dummy_retriever, q['query'], k=3, name="Dummy")
    dummy_df = format_results(dummy_results, q['expected'])
    if dummy_df is not None:
        display(dummy_df)
        dummy_correct = dummy_df[dummy_df['correct']].shape[0] > 0
        dummy_rank = dummy_df[dummy_df['correct']]['rank'].values[0] if dummy_correct else None
    else:
        dummy_correct = False
        dummy_rank = None
    
    # Store results
    all_results.append({
        'query_id': q['id'],
        'query': q['query'],
        'type': q['type'],
        'expected': q['expected'],
        'bm25_correct': bm25_correct,
        'bm25_rank': bm25_rank,
        'jina_correct': jina_correct,
        'jina_rank': jina_rank,
        'dummy_correct': dummy_correct,
        'dummy_rank': dummy_rank,
    })
    
    print("\n\n")

Query q1: quantum entanglement superposition
Type: Exact keywords (BM25 strength)
Expected: quantum1 - Contains rare, technical terms that appear in only one document

📊 BM25 Results:


Unnamed: 0,rank,doc_id,score,correct
0,1,quantum1,5.626808,True
1,2,climate1,0.0,False
2,3,climate2,0.0,False



🚀 Jina Embeddings Results:


Unnamed: 0,rank,doc_id,score,correct
0,1,quantum1,0.654099,True
1,2,space1,0.435976,False
2,3,ml3,0.430316,False



🎲 Dummy Embeddings Results:


Unnamed: 0,rank,doc_id,score,correct
0,1,bio1,0.041708,False
1,2,ml3,0.014613,False
2,3,quantum1,-0.007347,True





Query q2: global warming and rising temperatures
Type: Semantic similarity (Embedding strength)
Expected: climate2 - 'global warming' is semantically similar to 'greenhouse gas emissions'

📊 BM25 Results:


Unnamed: 0,rank,doc_id,score,correct
0,1,climate2,2.046342,True
1,2,climate1,1.914822,False
2,3,ml3,0.190853,False



🚀 Jina Embeddings Results:


Unnamed: 0,rank,doc_id,score,correct
0,1,climate1,0.754224,False
1,2,climate2,0.643384,True
2,3,space1,0.506174,False



🎲 Dummy Embeddings Results:


Unnamed: 0,rank,doc_id,score,correct
0,1,climate1,0.060412,False
1,2,quantum1,0.025392,False
2,3,ml2,-0.002697,False





Query q3: How do computers learn from data?
Type: Conceptual question (Embedding strength)
Expected: ml1 - Question form, no exact keywords but conceptually about machine learning

📊 BM25 Results:


Unnamed: 0,rank,doc_id,score,correct
0,1,ml2,1.829934,False
1,2,climate1,0.0,False
2,3,climate2,0.0,False



🚀 Jina Embeddings Results:


Unnamed: 0,rank,doc_id,score,correct
0,1,ml1,0.620494,True
1,2,ml2,0.591145,False
2,3,quantum1,0.524754,False



🎲 Dummy Embeddings Results:


Unnamed: 0,rank,doc_id,score,correct
0,1,ml2,0.106231,False
1,2,ml3,0.04428,False
2,3,space1,0.001975,False





Query q4: CRISPR DNA editing
Type: Exact technical terms (BM25 strength)
Expected: bio1 - Specific acronym and technical terms

📊 BM25 Results:


Unnamed: 0,rank,doc_id,score,correct
0,1,bio1,5.234873,True
1,2,climate1,0.0,False
2,3,climate2,0.0,False



🚀 Jina Embeddings Results:


Unnamed: 0,rank,doc_id,score,correct
0,1,bio1,0.766719,True
1,2,ml3,0.42257,False
2,3,quantum1,0.403295,False



🎲 Dummy Embeddings Results:


Unnamed: 0,rank,doc_id,score,correct
0,1,ml3,0.128329,False
1,2,climate2,0.069529,False
2,3,climate1,0.059128,False





Query q5: James Webb infrared telescope
Type: Multi-keyword match (BM25 strength)
Expected: space1 - All three keywords appear in target document

📊 BM25 Results:


Unnamed: 0,rank,doc_id,score,correct
0,1,space1,5.359307,True
1,2,climate1,0.0,False
2,3,climate2,0.0,False



🚀 Jina Embeddings Results:


Unnamed: 0,rank,doc_id,score,correct
0,1,space1,0.790224,True
1,2,ml3,0.38803,False
2,3,climate1,0.383982,False



🎲 Dummy Embeddings Results:


Unnamed: 0,rank,doc_id,score,correct
0,1,ml2,0.071142,False
1,2,ml3,0.036583,False
2,3,bio1,0.015456,False





Query q6: AI that understands human language
Type: Paraphrasing (Embedding strength)
Expected: ml3 - Different words, same meaning as 'understand natural language'

📊 BM25 Results:


Unnamed: 0,rank,doc_id,score,correct
0,1,climate1,0.0,False
1,2,climate2,0.0,False
2,3,ml1,0.0,False



🚀 Jina Embeddings Results:


Unnamed: 0,rank,doc_id,score,correct
0,1,ml3,0.752927,True
1,2,ml1,0.592,False
2,3,ml2,0.548452,False



🎲 Dummy Embeddings Results:


Unnamed: 0,rank,doc_id,score,correct
0,1,climate2,0.045536,False
1,2,ml2,0.02919,False
2,3,ml3,0.026435,True





Query q7: environmental impact of CO2
Type: Conceptual understanding (Embedding strength)
Expected: climate2 - CO2 → greenhouse gas, conceptual link without exact words

📊 BM25 Results:


Unnamed: 0,rank,doc_id,score,correct
0,1,ml2,1.308225,False
1,2,climate1,1.192117,False
2,3,climate2,0.0,True



🚀 Jina Embeddings Results:


Unnamed: 0,rank,doc_id,score,correct
0,1,climate1,0.650154,False
1,2,climate2,0.610414,True
2,3,space1,0.433448,False



🎲 Dummy Embeddings Results:


Unnamed: 0,rank,doc_id,score,correct
0,1,quantum1,0.085921,False
1,2,space1,0.066898,False
2,3,ml3,0.053408,False







## Summary Analysis

In [95]:
df_results = pd.DataFrame(all_results)

print("=" * 80)
print("SUMMARY: Retrieval Performance")
print("=" * 80)

# Overall accuracy
print("\n📊 Accuracy (found expected document in top 3):")
print(f"  BM25:            {df_results['bm25_correct'].sum()}/{len(queries)} = {df_results['bm25_correct'].mean():.1%}")
if jina_retriever:
    print(f"  Jina Embeddings: {df_results['jina_correct'].sum()}/{len(queries)} = {df_results['jina_correct'].mean():.1%}")
print(f"  Dummy (baseline): {df_results['dummy_correct'].sum()}/{len(queries)} = {df_results['dummy_correct'].mean():.1%}")

# By query type
print("\n📈 Performance by Query Type:")
display(df_results.groupby('type').agg({
    'bm25_correct': 'mean',
    'jina_correct': 'mean',
    'dummy_correct': 'mean'
}).round(2))

print("\n💡 Full Results Table:")
print("(Rank shows position 1-3 if found, '>3' if not in top 3)\n")

# Replace NaN with ">3" for better readability
df_results_display = df_results.copy()
rank_columns = ['bm25_rank', 'jina_rank', 'dummy_rank']
for col in rank_columns:
    if col in df_results_display.columns:
        df_results_display[col] = df_results_display[col].fillna('>3').astype(str).str.replace('.0', '', regex=False)

display(df_results_display)

SUMMARY: Retrieval Performance

📊 Accuracy (found expected document in top 3):
  BM25:            5/7 = 71.4%
  Jina Embeddings: 7/7 = 100.0%
  Dummy (baseline): 2/7 = 28.6%

📈 Performance by Query Type:


type,bm25_correct,jina_correct,dummy_correct
Conceptual question (Embedding strength),0.0,1.0,0.0
Conceptual understanding (Embedding strength),1.0,1.0,0.0
Exact keywords (BM25 strength),1.0,1.0,1.0
Exact technical terms (BM25 strength),1.0,1.0,0.0
Multi-keyword match (BM25 strength),1.0,1.0,0.0
Paraphrasing (Embedding strength),0.0,1.0,1.0
Semantic similarity (Embedding strength),1.0,1.0,0.0



💡 Full Results Table:
(Rank shows position 1-3 if found, '>3' if not in top 3)



Unnamed: 0,query_id,query,type,expected,bm25_correct,bm25_rank,jina_correct,jina_rank,dummy_correct,dummy_rank
0,q1,quantum entanglement superposition,Exact keywords (BM25 strength),quantum1,True,1,True,1,True,3
1,q2,global warming and rising temperatures,Semantic similarity (Embedding strength),climate2,True,1,True,2,False,>3
2,q3,How do computers learn from data?,Conceptual question (Embedding strength),ml1,False,>3,True,1,False,>3
3,q4,CRISPR DNA editing,Exact technical terms (BM25 strength),bio1,True,1,True,1,False,>3
4,q5,James Webb infrared telescope,Multi-keyword match (BM25 strength),space1,True,1,True,1,False,>3
5,q6,AI that understands human language,Paraphrasing (Embedding strength),ml3,False,>3,True,1,True,3
6,q7,environmental impact of CO2,Conceptual understanding (Embedding strength),climate2,True,3,True,2,False,>3
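
One caveat when reading this table: BM25 is credited with q7 because `climate2` appears at rank 3, yet its score in the per-query output is 0.0. A zero score suggests that no query term matched, so the position is likely a tie-breaking artifact, not a real hit. The sketch below shows one possible stricter check that ignores results at or below a minimum score. The helper name `hit_rank` and the `min_score` parameter are hypothetical; the result lists are copied from the q7 output earlier in this notebook.

```python
def hit_rank(results, expected_doc, min_score=0.0):
    """Return the 1-based rank of `expected_doc` among results whose score is
    strictly greater than `min_score`, or None if it is not among them."""
    kept = [(doc_id, score) for doc_id, score in results if score > min_score]
    for rank, (doc_id, _score) in enumerate(kept, start=1):
        if doc_id == expected_doc:
            return rank
    return None

# BM25 top-3 for q7, as printed earlier in this notebook.
bm25_q7 = [("ml2", 1.308225), ("climate1", 1.192117), ("climate2", 0.0)]
# Jina top-3 for q7, as printed earlier in this notebook.
jina_q7 = [("climate1", 0.650154), ("climate2", 0.610414), ("space1", 0.433448)]

bm25_rank = hit_rank(bm25_q7, "climate2")  # the zero-score entry is discarded
jina_rank = hit_rank(jina_q7, "climate2")
```

Under this stricter reading, BM25 would find the expected document for 4 of the 7 queries instead of 5. The zero threshold only makes sense for BM25-style scores; cosine similarities can legitimately be negative, so for embedding retrievers `min_score=float("-inf")` would be the safer choice.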


## Detailed Metrics Analysis

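Let's calculate standard IR metrics: **Precision@k**, **MRR** (Mean Reciprocal Rank), and **NDCG@k** (Normalized Discounted Cumulative Gain). Because the binary qrels built in the next cell mark exactly one relevant document per query, P@3 can be at most 1/3 (about 0.333), so a P@3 of 0.333 means the expected document was in the top 3 for every query.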
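
The functions `precision_at_k`, `mrr`, and `ndcg_at_k` come from the local `metrics` module, which is not shown. The sketch below gives common textbook definitions with signatures inferred from how the notebook calls them. The exponential gain `2**rel - 1` for `method="exponential"` and the linear fallback are assumptions about that module, not its confirmed behavior.

```python
import math

def precision_at_k(ranked_ids, qrels, k):
    """Fraction of the top-k results whose relevance is greater than 0."""
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if qrels.get(doc_id, 0) > 0) / k

def mrr(ranked_ids, qrels):
    """Reciprocal rank of the first relevant result (0.0 if there is none)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if qrels.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, qrels, k, method="exponential"):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    def gain(rel):
        # Assumed gain functions: exponential (2^rel - 1) or linear (rel).
        return 2 ** rel - 1 if method == "exponential" else rel

    def dcg(rels):
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(gain(rel) / math.log2(i + 2) for i, rel in enumerate(rels))

    actual = dcg([qrels.get(doc_id, 0) for doc_id in ranked_ids[:k]])
    ideal = dcg(sorted(qrels.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

# Jina's top-3 for q2 from the results above, with the binary judgment for q2.
jina_q2 = ["climate1", "climate2", "space1"]
binary_q2 = {"climate2": 1}

p3 = precision_at_k(jina_q2, binary_q2, k=3)
rr = mrr(jina_q2, binary_q2)
ndcg3 = ndcg_at_k(jina_q2, binary_q2, k=3)
```

With the expected document at rank 2, these definitions give a reciprocal rank of 0.5 and an NDCG@3 of `1 / log2(3)`, roughly 0.63.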


In [96]:
# Build qrels (query relevance judgments) from our queries
# Each query has exactly one expected document with relevance score 1
qrels = {}
for q in queries:
    qrels[q['id']] = {q['expected']: 1}

# Build run dictionaries (query_id -> ranked list of doc_ids) for each retriever
def build_run(retriever, queries, k=3):
    """Rank every query with `retriever` and return {qid: [ranked_doc_ids]}."""
    run = {}
    for q in queries:
        results = retriever.rank(q['query'], k=k)
        run[q['id']] = [doc_id for doc_id, _score in results]
    return run

# Build runs for each retriever
print("📊 Calculating IR Metrics (P@3, MRR, NDCG@3)...\n")

bm25_run = build_run(bm25, queries)
# Jina may have failed to load; keep an empty run in that case
jina_run = build_run(jina_retriever, queries) if jina_retriever else {}
dummy_run = build_run(dummy_retriever, queries)

# Calculate metrics for each retriever
def calculate_metrics(run, qrels, k=3):
    """Calculate average metrics across all queries."""
    p_sum = mrr_sum = ndcg_sum = 0.0
    n_queries = 0
    
    for qid, ranked_ids in run.items():
        if qid not in qrels:
            continue
        n_queries += 1
        p_sum += precision_at_k(ranked_ids, qrels[qid], k)
        mrr_sum += mrr(ranked_ids, qrels[qid])
        ndcg_sum += ndcg_at_k(ranked_ids, qrels[qid], k, method="exponential")
    
    if n_queries == 0:
        return {f"P@{k}": 0.0, "MRR": 0.0, f"NDCG@{k}": 0.0}
    
    return {
        f"P@{k}": round(p_sum / n_queries, 3),
        "MRR": round(mrr_sum / n_queries, 3),
        f"NDCG@{k}": round(ndcg_sum / n_queries, 3)
    }

# Calculate metrics for each retriever
k = 3
bm25_metrics = calculate_metrics(bm25_run, qrels, k=k)
jina_metrics = calculate_metrics(jina_run, qrels, k=k) if jina_retriever else {f"P@{k}": 0.0, "MRR": 0.0, f"NDCG@{k}": 0.0}
dummy_metrics = calculate_metrics(dummy_run, qrels, k=k)

# Create comparison table
metrics_comparison = pd.DataFrame({
    'BM25': bm25_metrics,
    'Jina Embeddings': jina_metrics,
    'Dummy (baseline)': dummy_metrics
}).T

print("=" * 80)
print("IR METRICS COMPARISON")
print("=" * 80)
print()
display(metrics_comparison)

print("\n📖 Metric Definitions:")
print(f"  • P@{k} (Precision@{k}): Fraction of the top {k} results that are relevant")
print("  • MRR (Mean Reciprocal Rank): Average of 1/rank of first relevant doc")
print(f"  • NDCG@{k}: Normalized DCG - measures ranking quality (0-1, higher=better)")
print("\n💡 All metrics averaged across all queries. Higher is better!")


📊 Calculating IR Metrics (P@3, MRR, NDCG@3)...

IR METRICS COMPARISON



Unnamed: 0,P@3,MRR,NDCG@3
BM25,0.238,0.619,0.643
Jina Embeddings,0.333,0.857,0.895
Dummy (baseline),0.095,0.095,0.143



📖 Metric Definitions:
  • P@3 (Precision@3): Fraction of the top 3 results that are relevant
  • MRR (Mean Reciprocal Rank): Average of 1/rank of first relevant doc
  • NDCG@3: Normalized DCG - measures ranking quality (0-1, higher=better)

💡 All metrics averaged across all queries. Higher is better!


### Per-Query Metrics Breakdown

Let's examine metrics for each individual query to understand where each retriever excels.


In [97]:
# Calculate per-query metrics
per_query_metrics = []

for q in queries:
    qid = q['id']
    qtype = q['type']
    
    # BM25 metrics
    bm25_ranked = bm25_run[qid]
    bm25_p = precision_at_k(bm25_ranked, qrels[qid], k=1)
    bm25_mrr = mrr(bm25_ranked, qrels[qid])
    bm25_ndcg = ndcg_at_k(bm25_ranked, qrels[qid], k=3, method="exponential")
    
    # Jina metrics
    if jina_retriever:
        jina_ranked = jina_run[qid]
        jina_p = precision_at_k(jina_ranked, qrels[qid], k=1)
        jina_mrr = mrr(jina_ranked, qrels[qid])
        jina_ndcg = ndcg_at_k(jina_ranked, qrels[qid], k=3, method="exponential")
    else:
        jina_p = jina_mrr = jina_ndcg = 0.0
    
    # Dummy metrics
    dummy_ranked = dummy_run[qid]
    dummy_p = precision_at_k(dummy_ranked, qrels[qid], k=1)
    dummy_mrr = mrr(dummy_ranked, qrels[qid])
    dummy_ndcg = ndcg_at_k(dummy_ranked, qrels[qid], k=3, method="exponential")
    
    per_query_metrics.append({
        'Query ID': qid,
        'Query Type': qtype,
        'BM25 P@1': round(bm25_p, 2),
        'BM25 MRR': round(bm25_mrr, 2),
        'BM25 NDCG@3': round(bm25_ndcg, 2),
        'Jina P@1': round(jina_p, 2),
        'Jina MRR': round(jina_mrr, 2),
        'Jina NDCG@3': round(jina_ndcg, 2),
        'Dummy P@1': round(dummy_p, 2),
        'Dummy MRR': round(dummy_mrr, 2),
        'Dummy NDCG@3': round(dummy_ndcg, 2),
    })

df_per_query = pd.DataFrame(per_query_metrics)

# Calculate aggregate scores (mean across all queries)
aggregate_row = {
    'Query ID': 'AVERAGE',
    'Query Type': 'All queries',
    'BM25 P@1': round(df_per_query['BM25 P@1'].mean(), 2),
    'BM25 MRR': round(df_per_query['BM25 MRR'].mean(), 2),
    'BM25 NDCG@3': round(df_per_query['BM25 NDCG@3'].mean(), 2),
    'Jina P@1': round(df_per_query['Jina P@1'].mean(), 2),
    'Jina MRR': round(df_per_query['Jina MRR'].mean(), 2),
    'Jina NDCG@3': round(df_per_query['Jina NDCG@3'].mean(), 2),
    'Dummy P@1': round(df_per_query['Dummy P@1'].mean(), 2),
    'Dummy MRR': round(df_per_query['Dummy MRR'].mean(), 2),
    'Dummy NDCG@3': round(df_per_query['Dummy NDCG@3'].mean(), 2),
}

# Add aggregate row to the dataframe
df_per_query_with_avg = pd.concat([df_per_query, pd.DataFrame([aggregate_row])], ignore_index=True)

# Display with better formatting
print("📊 Per-Query Metrics Breakdown:\n")
display(df_per_query_with_avg)

# Highlight best performing retriever for each query type.
# Note: max() returns the first maximal entry, so a tie (e.g. BM25 and Jina
# both at 1.00) is reported as BM25 because it is listed first.
print("\n🏆 Best Performer by Query Type (based on NDCG@3):")
for qtype in df_per_query['Query Type'].unique():
    type_rows = df_per_query[df_per_query['Query Type'] == qtype]
    avg_bm25 = type_rows['BM25 NDCG@3'].mean()
    avg_jina = type_rows['Jina NDCG@3'].mean()
    avg_dummy = type_rows['Dummy NDCG@3'].mean()
    
    best = max([('BM25', avg_bm25), ('Jina', avg_jina), ('Dummy', avg_dummy)], key=lambda x: x[1])
    print(f"  • {qtype[:50]}... → {best[0]} ({best[1]:.2f})")


📊 Per-Query Metrics Breakdown:



Unnamed: 0,Query ID,Query Type,BM25 P@1,BM25 MRR,BM25 NDCG@3,Jina P@1,Jina MRR,Jina NDCG@3,Dummy P@1,Dummy MRR,Dummy NDCG@3
0,q1,Exact keywords (BM25 strength),1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.33,0.5
1,q2,Semantic similarity (Embedding strength),1.0,1.0,1.0,0.0,0.5,0.63,0.0,0.0,0.0
2,q3,Conceptual question (Embedding strength),0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0
3,q4,Exact technical terms (BM25 strength),1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0
4,q5,Multi-keyword match (BM25 strength),1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0
5,q6,Paraphrasing (Embedding strength),0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.33,0.5
6,q7,Conceptual understanding (Embedding strength),0.0,0.33,0.5,0.0,0.5,0.63,0.0,0.0,0.0
7,AVERAGE,All queries,0.57,0.62,0.64,0.71,0.86,0.89,0.0,0.09,0.14



🏆 Best Performer by Query Type (based on NDCG@3):
  • Exact keywords (BM25 strength)... → BM25 (1.00)
  • Semantic similarity (Embedding strength)... → BM25 (1.00)
  • Conceptual question (Embedding strength)... → Jina (1.00)
  • Exact technical terms (BM25 strength)... → BM25 (1.00)
  • Multi-keyword match (BM25 strength)... → BM25 (1.00)
  • Paraphrasing (Embedding strength)... → Jina (1.00)
  • Conceptual understanding (Embedding strength)... → Jina (0.63)


### Graded Relevance Scoring

**Why use graded relevance instead of binary?**

So far, we've used binary relevance (relevant=1, not relevant=0). But in real-world IR, documents can be:
- **Highly relevant** (rel=3): Perfect match, directly answers the query
- **Moderately relevant** (rel=2): Partially relevant, contains useful info
- **Somewhat relevant** (rel=1): Tangentially related
- **Not relevant** (rel=0): Irrelevant

NDCG is particularly well-suited for graded relevance because it rewards systems that rank highly-relevant documents higher.

Let's compare binary vs graded relevance using our qrels from `data/qrels.jsonl`.
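
To make the difference concrete, here is a worked calculation for Jina on q2, whose top 3 were `climate1`, `climate2`, `space1`. It assumes the exponential gain $2^{rel} - 1$ (the usual meaning of an "exponential" NDCG variant), uses the judgments `climate2` = 3 and `climate1` = 2 that the next cell prints, and assumes `space1` is unjudged for q2 and therefore counts as 0.

$$
\mathrm{DCG@3} = \frac{2^{2}-1}{\log_2 2} + \frac{2^{3}-1}{\log_2 3} + \frac{2^{0}-1}{\log_2 4} \approx 3 + 4.417 + 0 = 7.417
$$

$$
\mathrm{IDCG@3} = \frac{2^{3}-1}{\log_2 2} + \frac{2^{2}-1}{\log_2 3} \approx 7 + 1.893 = 8.893
$$

$$
\mathrm{NDCG@3} = \frac{\mathrm{DCG@3}}{\mathrm{IDCG@3}} \approx \frac{7.417}{8.893} \approx 0.834
$$

If both judged documents are collapsed to relevance 1, swapping them no longer changes the DCG, so the same ranking scores a perfect 1.0. Graded judgments are what expose that the moderately relevant document was placed above the highly relevant one.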


In [98]:
# Load graded relevance judgments from data/qrels.jsonl
graded_qrels = {}
qrels_path = Path("../data/qrels.jsonl")
with open(qrels_path, 'r') as f:
    for line in f:
        entry = json.loads(line.strip())
        qid = entry['qid']
        doc_id = entry['doc_id']
        rel = entry['rel']
        if qid not in graded_qrels:
            graded_qrels[qid] = {}
        graded_qrels[qid][doc_id] = rel

print("📊 Loaded graded relevance judgments from data/qrels.jsonl")
print(f"Queries with graded qrels: {len(graded_qrels)}")
print(f"\nExample: Query 'q2' (global warming):")
print(f"  climate2: rel={graded_qrels['q2']['climate2']} (highly relevant)")
print(f"  climate1: rel={graded_qrels['q2']['climate1']} (moderately relevant)")

# Compare binary vs graded NDCG for all queries
comparison_results = []

for q in queries:
    qid = q['id']
    
    if qid not in graded_qrels:
        continue
    
    # Get rankings
    bm25_ranked = bm25_run[qid]
    jina_ranked = jina_run[qid] if jina_retriever else []
    
    # Calculate NDCG with graded relevance
    bm25_ndcg_graded = ndcg_at_k(bm25_ranked, graded_qrels[qid], k=3, method="exponential")
    jina_ndcg_graded = ndcg_at_k(jina_ranked, graded_qrels[qid], k=3, method="exponential") if jina_retriever else 0.0
    
    # Calculate NDCG with binary relevance (convert all rel>0 to 1)
    binary_qrels = {doc: 1 for doc, rel in graded_qrels[qid].items() if rel > 0}
    bm25_ndcg_binary = ndcg_at_k(bm25_ranked, binary_qrels, k=3, method="exponential")
    jina_ndcg_binary = ndcg_at_k(jina_ranked, binary_qrels, k=3, method="exponential") if jina_retriever else 0.0
    
    comparison_results.append({
        'Query ID': qid,
        'Query': q['query'][:40] + '...',
        'BM25 NDCG (Binary)': round(bm25_ndcg_binary, 3),
        'BM25 NDCG (Graded)': round(bm25_ndcg_graded, 3),
        'BM25 Δ': round(bm25_ndcg_graded - bm25_ndcg_binary, 3),
        'Jina NDCG (Binary)': round(jina_ndcg_binary, 3),
        'Jina NDCG (Graded)': round(jina_ndcg_graded, 3),
        'Jina Δ': round(jina_ndcg_graded - jina_ndcg_binary, 3),
    })

df_graded_comparison = pd.DataFrame(comparison_results)

# Calculate average scores
avg_bm25_binary = df_graded_comparison['BM25 NDCG (Binary)'].mean()
avg_bm25_graded = df_graded_comparison['BM25 NDCG (Graded)'].mean()
avg_jina_binary = df_graded_comparison['Jina NDCG (Binary)'].mean()
avg_jina_graded = df_graded_comparison['Jina NDCG (Graded)'].mean()

# Create average row
average_row = {
    'Query ID': 'AVERAGE',
    'Query': 'All queries',
    'BM25 NDCG (Binary)': round(avg_bm25_binary, 3),
    'BM25 NDCG (Graded)': round(avg_bm25_graded, 3),
    'BM25 Δ': round(avg_bm25_graded - avg_bm25_binary, 3),
    'Jina NDCG (Binary)': round(avg_jina_binary, 3),
    'Jina NDCG (Graded)': round(avg_jina_graded, 3),
    'Jina Δ': round(avg_jina_graded - avg_jina_binary, 3),
}

# Add average row to dataframe
df_graded_comparison_with_avg = pd.concat([df_graded_comparison, pd.DataFrame([average_row])], ignore_index=True)

print("\n" + "="*80)
print("BINARY vs GRADED RELEVANCE COMPARISON (NDCG@3)")
print("="*80)
print()
display(df_graded_comparison_with_avg)

print(f"\n💡 Key Insights:")
print(f"  • Graded relevance reveals finer differences in ranking quality")
print(f"  • Negative Δ means the system ranked lower-relevance docs higher")
print(f"  • Positive Δ means the system ranked higher-relevance docs higher")
print(f"  • Use graded relevance when document quality varies significantly")


📊 Loaded graded relevance judgments from data/qrels.jsonl
Queries with graded qrels: 7

Example: Query 'q2' (global warming):
  climate2: rel=3 (highly relevant)
  climate1: rel=2 (moderately relevant)

BINARY vs GRADED RELEVANCE COMPARISON (NDCG@3)



Unnamed: 0,Query ID,Query,BM25 NDCG (Binary),BM25 NDCG (Graded),BM25 Δ,Jina NDCG (Binary),Jina NDCG (Graded),Jina Δ
0,q1,quantum entanglement superposition...,0.613,0.917,0.304,1.0,1.0,0.0
1,q2,global warming and rising temperatures...,1.0,1.0,0.0,1.0,0.834,-0.166
2,q3,How do computers learn from data?...,0.469,0.319,-0.15,0.765,0.947,0.181
3,q4,CRISPR DNA editing...,1.0,1.0,0.0,1.0,1.0,0.0
4,q5,James Webb infrared telescope...,1.0,1.0,0.0,1.0,1.0,0.0
5,q6,AI that understands human language...,0.307,0.066,-0.241,1.0,1.0,0.0
6,q7,environmental impact of CO2...,0.693,0.606,-0.087,1.0,0.834,-0.166
7,AVERAGE,All queries,0.726,0.701,-0.025,0.966,0.945,-0.021



💡 Key Insights:
  • Graded relevance reveals finer differences in ranking quality
  • Negative Δ means the system ranked lower-relevance docs higher
  • Positive Δ means the system ranked higher-relevance docs higher
  • Use graded relevance when document quality varies significantly


### When to Use Graded vs Binary Relevance

**Use Binary Relevance (0/1) when:**
- ✅ Documents are clearly relevant or not (e.g., product search - item matches or doesn't)
- ✅ You want simpler annotation (faster, cheaper)
- ✅ Using metrics like Precision@k, Recall@k, or MRR
- ✅ You have limited annotation resources

**Use Graded Relevance (e.g., a 0-3 or 0-4 scale) when:**
- ✅ Document quality varies significantly (e.g., research papers, news articles)
- ✅ You want to distinguish "perfect" from "acceptable" results
- ✅ Using NDCG or similar metrics that leverage graded judgments
- ✅ You need fine-grained evaluation of ranking quality
- ✅ User satisfaction depends on result quality, not just relevance

**Real-World Examples:**
- **Search engines**: Often graded (for example, Google's rater guidelines describe a multi-level "Needs Met" scale)
- **E-commerce**: Often binary (product matches query or not)
- **Research retrieval**: Graded (papers can be highly/moderately/tangentially relevant)
- **FAQ matching**: Binary (answer is correct or not)

**💡 Best Practice:** Start with binary relevance for quick evaluation, then add graded relevance for production systems where ranking quality matters.


## Key Insights

### BM25 Strengths:
✅ Excellent with **exact keyword matches**  
✅ Strong on **rare technical terms** (high IDF)  
✅ Fast and efficient  
✅ Interpretable (you can see which terms matched)  

### BM25 Weaknesses:
❌ No semantic understanding ("global warming" ≠ "greenhouse gas")  
❌ Struggles with **paraphrasing**  
❌ Struggles with **conceptual queries** that share no terms with the target document  
❌ Vocabulary mismatch problems  

### Jina Embeddings Strengths:
✅ Understands **semantic similarity**  
✅ Handles **paraphrasing** well  
✅ Works with **conceptual queries**  
✅ Robust to vocabulary mismatch  
✅ Multilingual and multimodal (text + images)  

### Jina Embeddings Weaknesses:
❌ Computationally expensive  
❌ Requires GPU for fast inference at scale  
❌ Less interpretable (black box)  
❌ May miss exact matches on rare tokens (IDs, codes, names) that were rare or absent in training data  

### Best Practice:
🎯 **Hybrid retrieval** - Combine both approaches!
- Use BM25 for keyword matches
- Use embeddings for semantic understanding
- Merge and re-rank results
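
One common way to merge and re-rank results is reciprocal rank fusion (RRF), which needs only each retriever's ranks and so sidesteps the fact that BM25 scores and cosine similarities live on different scales. The sketch below is a minimal version: the function name is hypothetical, `k=60` is a conventional default and not a tuned value, and the input lists are the q7 top-3 rankings printed earlier in this notebook.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs: each document's fused score is the sum of
    1 / (k + rank) over every list in which it appears (rank is 1-based)."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Top-3 rankings for q7, as printed earlier in this notebook.
bm25_q7 = ["ml2", "climate1", "climate2"]
jina_q7 = ["climate1", "climate2", "space1"]

fused_q7 = reciprocal_rank_fusion([bm25_q7, jina_q7])
fused_ids = [doc_id for doc_id, _score in fused_q7]
```

Documents that appear in both lists (`climate1`, `climate2`) rise above those found by only one retriever (`ml2`, `space1`). Fusion rewards agreement between retrievers; it does not by itself correct an ordering that both retrievers get wrong.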

## Similarity Inspection (Jina Embeddings)

Let's inspect pairwise cosine similarities to see how Jina scores semantic relationships. The interpretation labels use fixed, illustrative thresholds.

In [99]:
if jina_embedder:
    print("🔬 Pairwise Semantic Similarities\n")
    
    pairs = [
        ("climate change", "global warming"),
        ("machine learning", "artificial intelligence"),
        ("neural networks", "deep learning"),
        ("quantum computing", "classical computing"),
        ("climate change", "quantum computing"),  # Unrelated
    ]
    
    similarity_data = []
    
    for text1, text2 in pairs:
        emb1 = jina_embedder.encode([text1], prompt_name="passage")[0]
        emb2 = jina_embedder.encode([text2], prompt_name="passage")[0]
        
        # Cosine similarity (dot product for normalized vectors)
        similarity = float(np.dot(emb1, emb2))
        
        similarity_data.append({
            'text1': text1,
            'text2': text2,
            'similarity': similarity,
            'interpretation': (
                'Very similar' if similarity > 0.8 else
                'Similar' if similarity > 0.6 else
                'Somewhat similar' if similarity > 0.4 else
                'Different'
            )
        })
    
    df_similarity = pd.DataFrame(similarity_data)
    display(df_similarity)
else:
    print("⚠️  Jina embeddings not available")

🔬 Pairwise Semantic Similarities



Unnamed: 0,text1,text2,similarity,interpretation
0,climate change,global warming,0.915079,Very similar
1,machine learning,artificial intelligence,0.837106,Very similar
2,neural networks,deep learning,0.827342,Very similar
3,quantum computing,classical computing,0.869445,Very similar
4,climate change,quantum computing,0.731588,Similar
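
Note that the fixed thresholds label even the deliberately unrelated pair ("climate change", "quantum computing") as "Similar" at 0.73, which suggests that raw cosine values for short phrases sit in a narrow, high band here. One hedged way to read them is relative to an unrelated baseline instead of on an absolute scale. The sketch below rescales the printed similarities; using a single unrelated pair as the baseline and the helper name `relative_similarity` are assumptions for illustration, and a real calibration would average many unrelated pairs.

```python
# Similarities copied from the table above.
similarities = {
    ("climate change", "global warming"): 0.915079,
    ("machine learning", "artificial intelligence"): 0.837106,
    ("neural networks", "deep learning"): 0.827342,
    ("quantum computing", "classical computing"): 0.869445,
}
# ("climate change", "quantum computing"), the pair marked as unrelated.
baseline = 0.731588

def relative_similarity(score, baseline):
    """Rescale so the baseline maps to 0 and a perfect match (1.0) maps to 1."""
    return (score - baseline) / (1.0 - baseline)

rescaled = {
    pair: round(relative_similarity(score, baseline), 3)
    for pair, score in similarities.items()
}
# Pairs ordered from most to least similar after rescaling.
ranked_pairs = sorted(rescaled, key=rescaled.get, reverse=True)
```

The ordering of the pairs is unchanged, since the rescaling is monotonic, but the spread between them becomes easier to see than in the raw 0.83 to 0.92 range.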


## Conclusion

Both BM25 and neural embeddings have their place in modern IR systems:

- **BM25**: Fast, interpretable, great for exact matches
- **Embeddings**: Semantic understanding, handles paraphrasing
- **Best approach**: Hybrid systems that combine both strengths

For production systems, consider:
1. First-stage retrieval with BM25 (fast, broad recall)
2. Re-ranking with embeddings (precise, semantic)
3. Ongoing evaluation with metrics such as Precision@k, MRR, and NDCG
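
Steps 1 and 2 can be sketched as a retrieve-then-re-rank function. Everything here is illustrative: the function name, the scoring callables, and the toy score tables are hypothetical stand-ins for a lexical scorer and an embedding scorer, not values from this notebook.

```python
def two_stage_search(query, doc_ids, first_stage_score, rerank_score,
                     n_candidates=100, k=3):
    """Keep the `n_candidates` best documents under the cheap first-stage
    scorer, then order only those candidates with the re-ranking scorer."""
    candidates = sorted(
        doc_ids, key=lambda d: first_stage_score(query, d), reverse=True
    )[:n_candidates]
    reranked = sorted(
        candidates, key=lambda d: rerank_score(query, d), reverse=True
    )
    return reranked[:k]

# Hypothetical toy scores standing in for real lexical and semantic scorers.
lexical = {"a": 3.0, "b": 2.0, "c": 1.0, "d": 0.0}
semantic = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 0.95}

top2 = two_stage_search(
    "toy query",
    list(lexical),
    first_stage_score=lambda query, doc_id: lexical[doc_id],
    rerank_score=lambda query, doc_id: semantic[doc_id],
    n_candidates=3,
    k=2,
)
```

Document `d` has the highest semantic score but never reaches the re-ranker, because the first stage cut it. That is the main design trade-off of this pipeline: the candidate pool must be large enough that lexical misses, such as the paraphrased q6 in this notebook, can still be recovered.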