# Notebook 3: Neural IR with Bi-Encoders and Cross-Encoders

This notebook covers **Step 3** of the project. We will build a two-stage neural IR system:

1.  **First Stage (Retrieval)**: Use a **bi-encoder** model to efficiently scan the entire corpus. It encodes all documents and queries into a shared vector space and retrieves an initial set of top candidates (e.g., top 100) using a high-speed FAISS index.
2.  **Second Stage (Re-ranking)**: Use a more powerful but slower **cross-encoder** model. This model examines the query and each candidate document *together*, providing a much more accurate relevance score to re-rank the initial set and produce the final results.
3.  **Evaluation**: Apply the same evaluation metrics from Notebook 2 to measure the performance of our neural system.

## 1. Setup and Installation

We'll need the `sentence-transformers` library from Hugging Face for the neural models, `faiss-cpu` for efficient vector search, and `torch` as the backend.

In [1]:
!pip install sentence-transformers faiss-cpu torch pandas



## 2. Load Processed Data

Let's load the same dataframes we used in the previous notebook. This ensures our evaluation is comparable.

In [2]:
import pandas as pd
import json
import os
import numpy as np
import torch
from tqdm.autonotebook import tqdm

# Check if a GPU is available and set the device
if torch.cuda.is_available():
    device = 'cuda'
    print("Using GPU")
else:
    device = 'cpu'
    print("Using CPU")

DATA_DIR = '../fiqa/processed_data'

# Load data
print("Loading processed data...")
corpus_df = pd.read_pickle(os.path.join(DATA_DIR, 'corpus_processed.pkl'))
queries_df = pd.read_pickle(os.path.join(DATA_DIR, 'queries_processed.pkl'))
qrels_df = pd.read_pickle(os.path.join(DATA_DIR, 'qrels.pkl'))

# --- Create mappings for faster lookups ---
# Use original text for neural models as they understand full sentences
doc_id_to_text = pd.Series(corpus_df.text.values, index=corpus_df.doc_id).to_dict()
query_id_to_text = pd.Series(queries_df.text.values, index=queries_df.query_id).to_dict()

# Keep a list of document IDs for mapping FAISS results back
doc_ids = corpus_df['doc_id'].tolist()

print("Data loaded successfully.")



Using CPU
Loading processed data...
Data loaded successfully.


---

## 3. First Stage: Bi-Encoder Retrieval with FAISS

The bi-encoder's job is to quickly find a set of potentially relevant documents from the entire corpus. We'll use a model pre-trained for semantic search.

### 3.1. Initialize the Bi-Encoder

We'll use a model from the `sentence-transformers` library that is well-suited for asymmetric search (matching short queries to longer passages).

In [3]:
from sentence_transformers import SentenceTransformer

# Initialize the bi-encoder model
# 'msmarco-distilbert-base-v4' is a great baseline model for semantic search
bi_encoder_model = SentenceTransformer('msmarco-distilbert-base-v4', device=device)

# Get the embedding dimension
embedding_dim = bi_encoder_model.get_sentence_embedding_dimension()
print(f"Bi-encoder loaded. Embedding dimension: {embedding_dim}")



Bi-encoder loaded. Embedding dimension: 768


### 3.2. Encode the Corpus

We need to convert every document in our corpus into a vector embedding. This is a one-time operation. For large corpora, this can take a while.

In [4]:
# --- Encode all documents in the corpus ---
# We will use the original, unprocessed text for the neural models
corpus_texts = corpus_df['text'].tolist()

# It's recommended to encode in batches for efficiency
batch_size = 256
corpus_embeddings = bi_encoder_model.encode(
    corpus_texts,
    show_progress_bar=True,
    batch_size=batch_size,
    convert_to_tensor=True
)

# Move embeddings to CPU for FAISS (FAISS works with NumPy on CPU)
corpus_embeddings = corpus_embeddings.cpu().numpy()
print(f"Corpus encoded. Shape of embeddings: {corpus_embeddings.shape}")


Corpus encoded. Shape of embeddings: (57638, 768)


### 3.3. Build the FAISS Index

FAISS (Facebook AI Similarity Search) is a library that allows for incredibly fast searching over billions of vectors. We'll build an index to store our document embeddings.

In [5]:
import faiss

# --- Build a FAISS index ---
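# We use IndexFlatL2, an exact (brute-force) index that ranks by squared L2 distance.
# It needs no training and is a simple starting point for a corpus of this size (~57k vectors).
# Caveat: the model card lists this bi-encoder as tuned for cosine similarity,
# so L2 on unnormalized embeddings may rank candidates differently.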
index = faiss.IndexFlatL2(embedding_dim)

# Add the document embeddings to the index
index.add(corpus_embeddings)

print(f"FAISS index built. Total documents in index: {index.ntotal}")

FAISS index built. Total documents in index: 57638
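
A side note on the choice of metric: `IndexFlatL2` ranks by squared L2 distance, which agrees with a cosine-similarity ranking only when the vectors have unit length, because in that case ||q - d||² = 2 - 2·(q·d). The sketch below illustrates this with toy 2-D vectors in plain Python. The vectors and helper names are made up for illustration and are not part of the notebook's pipeline.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity, computed as the dot product of unit vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

# Toy data (hypothetical): one query and three "documents" with different norms.
query = [1.0, 0.0]
docs = {
    "long_aligned": [10.0, 1.0],
    "short_aligned": [1.0, 0.2],
    "short_off": [0.5, 0.5],
}

# Raw vectors: smallest L2 distance first vs. largest cosine similarity first.
raw_l2_ranking = sorted(docs, key=lambda n: squared_l2(query, docs[n]))
cosine_ranking = sorted(docs, key=lambda n: cosine(query, docs[n]), reverse=True)

# Unit-length vectors: the L2 ranking now coincides with the cosine ranking.
unit_query = normalize(query)
unit_l2_ranking = sorted(
    docs, key=lambda n: squared_l2(unit_query, normalize(docs[n]))
)

print("raw L2: ", raw_l2_ranking)
print("cosine: ", cosine_ranking)
print("unit L2:", unit_l2_ranking)
```

In FAISS terms, one option would be to L2-normalize the embeddings (for example with `faiss.normalize_L2`) before adding and searching. Whether that changes retrieval quality on this dataset would need to be measured.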


---

## 4. Second Stage: Cross-Encoder Re-ranking

Now that we can retrieve candidates quickly, we use a cross-encoder to re-rank them with higher accuracy.

### 4.1. Initialize the Cross-Encoder

Cross-encoder models are also available in the `sentence-transformers` library. They are trained to take a pair of texts (query, document) and output a single relevance score.

In [6]:
from sentence_transformers.cross_encoder import CrossEncoder

# Initialize the cross-encoder model
# This model is specifically trained for re-ranking in search tasks.
cross_encoder_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', device=device)
print("Cross-encoder loaded.")

Cross-encoder loaded.


### 4.2. Create the Full Retrieval and Re-ranking Pipeline

Let's combine the two stages into a single function. This function will take a query, retrieve 100 candidates with the bi-encoder/FAISS, and then re-rank them with the cross-encoder to get the final list.

In [7]:
def search_and_rerank(query_text, top_k_retrieval=100, top_k_rerank=50):
    """
    Performs a two-stage search: retrieval with bi-encoder and re-ranking with cross-encoder.
    """
    # --- Stage 1: Bi-Encoder Retrieval ---
    query_embedding = bi_encoder_model.encode(query_text, convert_to_tensor=True).cpu().numpy().reshape(1, -1)
    
    # Search the FAISS index
    distances, indices = index.search(query_embedding, top_k_retrieval)
    
    # Get the doc IDs of the retrieved candidates
    # (FAISS pads the result with -1 when fewer than k neighbors are found)
    retrieved_doc_ids = [doc_ids[i] for i in indices[0] if i != -1]
    
    # --- Stage 2: Cross-Encoder Re-ranking ---
    # Create pairs of [query, document_text] for the cross-encoder
    cross_encoder_input = [[query_text, doc_id_to_text[doc_id]] for doc_id in retrieved_doc_ids]
    
    # Get scores from the cross-encoder
    cross_scores = cross_encoder_model.predict(cross_encoder_input, show_progress_bar=False)
    
    # Combine doc IDs with their new scores
    reranked_results = list(zip(retrieved_doc_ids, cross_scores))
    
    # Sort by the new scores in descending order
    reranked_results.sort(key=lambda x: x[1], reverse=True)
    
    # Return the top_k re-ranked documents and their scores
    return reranked_results[:top_k_rerank]

# --- Example Usage ---
sample_query = "What are the tax implications of a business car lease?"
results = search_and_rerank(sample_query)

print(f"--- Top 5 Results for query: '{sample_query}' ---")
for doc_id, score in results[:5]:
    print(f"Score: {score:.4f}\tDoc ID: {doc_id}")
    #print(f"Text: {doc_id_to_text[doc_id][:150]}...")

--- Top 5 Results for query: 'What are the tax implications of a business car lease?' ---
Score: nan	Doc ID: 429899
Score: nan	Doc ID: 185405
Score: nan	Doc ID: 427884
Score: nan	Doc ID: 307158
Score: nan	Doc ID: 181187
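
The scores printed above are `nan`, so the descending sort has nothing meaningful to order by. A small sanity check on the cross-encoder output can catch this early. The sketch below uses a hypothetical helper, `check_scores`, on plain Python floats. It is not part of `sentence-transformers`, and it does not diagnose why the scores were NaN in this run.

```python
import math

def check_scores(scores):
    """Return the positions of non-finite scores (NaN or +/-inf).

    Hypothetical helper for illustration; float() also accepts NumPy scalars.
    """
    return [i for i, s in enumerate(scores) if not math.isfinite(float(s))]

# NaN compares False against every number, so a sort keyed on it is unreliable.
nan = float("nan")
nan_is_less = nan < 1.0
nan_is_greater = nan > 1.0

good_scores = [2.31, -0.75, 0.12]   # made-up values
bad_scores = [2.31, nan, 0.12]      # made-up values with one NaN

good_positions = check_scores(good_scores)
bad_positions = check_scores(bad_scores)

print(nan_is_less, nan_is_greater)
print(good_positions)
print(bad_positions)
```

Inside `search_and_rerank`, a call such as `check_scores(cross_scores)` right after `predict` could raise an error or log a warning whenever the returned list is non-empty.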


---

## 5. Run and Evaluate the Neural IR System

Now, we'll run our pipeline on the same 10 sample queries from Notebook 2 and evaluate the final, re-ranked results.

### 5.1. Run Retrieval for Sample Queries

In [8]:
# Select the same 10 sample queries for a fair comparison:
# the first 10 query IDs that have more than 5 relevant documents in the qrels
rel_counts = qrels_df.groupby('query_id').size()
rich_queries = rel_counts[rel_counts > 5]

sample_query_ids = rich_queries.index[:10].tolist()
results_neural = {}

for qid in tqdm(sample_query_ids, desc="Running Neural Search & Re-ranking"):
    query_text = query_id_to_text[qid]
    
    # Get the ranked list of document IDs
    ranked_list = search_and_rerank(query_text, top_k_retrieval=100, top_k_rerank=100)
    results_neural[qid] = [doc_id for doc_id, score in ranked_list]

print("\n--- Sample Neural Results for Query ID:", sample_query_ids[0])
print(results_neural[sample_query_ids[0]][:5])



--- Sample Neural Results for Query ID: 10028
['476068', '75747', '227485', '458235', '44105']


### 5.2. Evaluation

We'll use the **exact same evaluation functions** from the previous notebook to calculate our metrics.

In [9]:
import math

# Create a dictionary for easy lookup of relevant documents for each query
relevant_docs = qrels_df.groupby('query_id')['doc_id'].apply(list).to_dict()

# --- Metric Implementations (Copied from Notebook 2) ---

def precision_at_k(retrieved, relevant, k):
    retrieved_k = retrieved[:k]
    return len(set(retrieved_k) & set(relevant)) / k

def recall_at_k(retrieved, relevant, k):
    retrieved_k = retrieved[:k]
    if not relevant:
        return 0.0
    return len(set(retrieved_k) & set(relevant)) / len(relevant)

def average_precision(retrieved, relevant):
    if not relevant:
        return 0.0
    hits = 0
    sum_precisions = 0.0
    for i, doc_id in enumerate(retrieved):
        if doc_id in relevant:
            hits += 1
            sum_precisions += hits / (i + 1)
    return sum_precisions / len(relevant)

def mean_average_precision(results, relevant_docs):
    aps = [average_precision(results[qid], relevant_docs.get(qid, [])) for qid in results]
    return np.mean(aps)

def mean_reciprocal_rank(results, relevant_docs):
    rrs = []
    for qid in results:
        relevant = relevant_docs.get(qid, [])
        for i, doc_id in enumerate(results[qid]):
            if doc_id in relevant:
                rrs.append(1 / (i + 1))
                break
        else:
            rrs.append(0.0)
    return np.mean(rrs)

def ndcg_at_k(retrieved, relevant, k):
    retrieved_k = retrieved[:k]
    dcg = 0.0
    for i, doc_id in enumerate(retrieved_k):
        if doc_id in relevant:
            dcg += 1 / math.log2(i + 2)
    idcg = 0.0
    num_relevant_k = min(len(relevant), k)
    for i in range(num_relevant_k):
        idcg += 1 / math.log2(i + 2)
    return dcg / idcg if idcg > 0 else 0.0

def evaluate_model(results, relevant_docs, k_values=(1, 3, 5, 10)):
    metrics = {}
    for k in k_values:
        precisions = [precision_at_k(results[qid], relevant_docs.get(qid, []), k) for qid in results]
        recalls = [recall_at_k(results[qid], relevant_docs.get(qid, []), k) for qid in results]
        ndcgs = [ndcg_at_k(results[qid], relevant_docs.get(qid, []), k) for qid in results]
        metrics[f'P@{k}'] = np.mean(precisions)
        metrics[f'R@{k}'] = np.mean(recalls)
        metrics[f'nDCG@{k}'] = np.mean(ndcgs)
    
    metrics['MAP'] = mean_average_precision(results, relevant_docs)
    metrics['MRR'] = mean_reciprocal_rank(results, relevant_docs)
    
    return metrics
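
To make the metric definitions concrete, here is a small worked example on a made-up ranking. The block re-states compact versions of three of the functions so that it runs on its own. The document IDs are hypothetical, and relevance is binary, as in the implementations above. By hand: two of the top five are relevant, so P@5 = 2/5. The hits occur at ranks 1 and 4, so AP = (1/1 + 2/4) / 3 = 0.5. nDCG@5 = (1 + 1/log2(5)) / (1 + 1/log2(3) + 1/log2(4)) ≈ 0.671.

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return len(set(retrieved[:k]) & set(relevant)) / k

def average_precision(retrieved, relevant):
    """Mean of precision values at each rank where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance nDCG with a log2 rank discount."""
    dcg = sum(
        1 / math.log2(i + 2)
        for i, doc_id in enumerate(retrieved[:k])
        if doc_id in relevant
    )
    idcg = sum(1 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0

# Toy ranking (hypothetical IDs): relevant docs sit at ranks 1 and 4; "d9" is never retrieved.
retrieved = ["d1", "d2", "d3", "d4", "d5"]
relevant = ["d1", "d4", "d9"]

p5 = precision_at_k(retrieved, relevant, 5)
ap = average_precision(retrieved, relevant)
ndcg5 = ndcg_at_k(retrieved, relevant, 5)

print(f"P@5={p5:.3f}  AP={ap:.3f}  nDCG@5={ndcg5:.3f}")
# prints: P@5=0.400  AP=0.500  nDCG@5=0.671
```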

### 5.3. Final Performance Comparison

Let's see how our new neural model stacks up!

In [10]:
# Calculate metrics for the neural model
neural_metrics = evaluate_model(results_neural, relevant_docs)


comparison_df = pd.DataFrame(neural_metrics, index=['Neural (Bi+Cross)'])

print("--- Model Performance Comparison ---")
display(comparison_df)

--- Model Performance Comparison ---


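| Model | P@1 | R@1 | nDCG@1 | P@3 | R@3 | nDCG@3 | P@5 | R@5 | nDCG@5 | P@10 | R@10 | nDCG@10 | MAP | MRR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Neural (Bi+Cross) | 0.4 | 0.061905 | 0.4 | 0.233333 | 0.107143 | 0.270392 | 0.18 | 0.131429 | 0.224633 | 0.1 | 0.148095 | 0.19777 | 0.137836 | 0.429636 |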


---

## 6. Analysis

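The table above reports only the neural system, so compare it side by side with the BM25 results from Notebook 2 before drawing conclusions. A two-stage neural pipeline is generally expected to improve on BM25, especially on the rank-sensitive metrics **nDCG, MAP, and MRR**. One caveat: the example output in Section 4.2 shows `nan` cross-encoder scores. If the same happens during evaluation, the sort in `search_and_rerank` has no meaningful scores to order by, and the metrics do not reflect genuine re-ranking.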

* **Why the Improvement?** Neural models move beyond simple keyword matching. They understand the *semantic meaning* and *intent* behind the query and document text. For example, a query about "car tax costs" could match a document that talks about "vehicle excise duty" or "automobile registration fees," even if the exact words don't overlap. BM25 would miss this connection.
* **The Power of Two Stages**:
    * The **bi-encoder** provides the speed needed to search a huge corpus by representing everything in a compact vector format.
    * The **cross-encoder** provides the high accuracy. By looking at the query and document text *at the same time*, it can notice fine-grained details, context, and term relationships that the bi-encoder might miss. This combination gives you the best of both worlds: speed and accuracy.
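
The speed side of this trade-off can be made concrete by counting transformer forward passes. The sketch below uses the corpus size reported earlier in this notebook (57,638 documents), the 100-candidate cutoff, and the 10 sample queries. It counts model calls only and ignores batching, sequence length, and hardware, so treat it as rough arithmetic rather than a benchmark.

```python
corpus_size = 57_638     # documents in the FAISS index (from the notebook output)
top_k_retrieval = 100    # candidates handed to the cross-encoder per query
num_queries = 10         # sample queries evaluated in Section 5

# Bi-encoder: every document is encoded once, offline.
bi_offline_passes = corpus_size

# Cross-encoder over the whole corpus: one pass per (query, document) pair.
cross_full_passes = num_queries * corpus_size

# Two-stage pipeline at query time: one query encoding plus
# top_k_retrieval cross-encoder passes for each query.
two_stage_online_passes = num_queries * (1 + top_k_retrieval)

# How many times fewer cross-encoder passes the candidate cutoff requires.
cross_pass_reduction = cross_full_passes / (num_queries * top_k_retrieval)

print(f"offline bi-encoder passes:      {bi_offline_passes}")
print(f"cross-encoder on full corpus:   {cross_full_passes}")
print(f"two-stage passes at query time: {two_stage_online_passes}")
print(f"cross-encoder pass reduction:   {cross_pass_reduction:.0f}x")
```

The offline encoding cost is paid once and reused for every query, which is why the query-time count stays small even as the number of queries grows.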