# Information Retrieval Lab WiSe 2024/2025: Baseline Retrieval System

This Jupyter notebook serves as a baseline retrieval system that you can improve upon.
We use subsets of the MS MARCO datasets to retrieve passages of web documents.
We will show you how to create a software submission to TIRA from this notebook.

An overview of all corpora that we use in the current course is available at [https://tira.io/datasets?query=ir-lab-wise-2024](https://tira.io/datasets?query=ir-lab-wise-2024). The dataset IDs for loading the datasets are:

- `ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training`: A subsample of the TREC 2019/2020 Deep Learning tracks on the MS MARCO v1 passage dataset. Use this dataset to tune your system(s).
- `ir-lab-wise-2024/subsampled-ms-marco-rag-20241202-training` (_work in progress_): A subsample of the TREC 2024 Retrieval-Augmented Generation track on the MS MARCO v2.1 passage dataset. Use this dataset to tune your system(s).
- `ir-lab-wise-2024/ms-marco-rag-20241203-test` (_work in progress_): The test corpus that we have created together in the course, based on the MS MARCO v2.1 passage dataset. We will use this dataset as the test dataset, i.e., evaluation scores become available only after the submission deadline.

### Step 1: Import libraries

We will use [tira](https://tira.io/), an information retrieval shared task platform, and [ir_datasets](https://ir-datasets.com/) for loading the datasets. Subsequently, we will build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source search engine framework.

First, we need to install the required libraries.

In [8]:
!pip3 install 'tira>=0.0.139' ir-datasets python-terrier==0.10.0

Create an API client to interact with the TIRA platform (e.g., to load datasets and submit runs).

In [7]:
from tira.third_party_integrations import ensure_pyterrier_is_loaded
from tira.rest_api_client import Client

ensure_pyterrier_is_loaded()
tira = Client()

### Step 2: Load the dataset

We load the dataset by its ir_datasets ID (as listed in the Readme). Just be sure to add the `irds:` prefix before the dataset ID to tell PyTerrier to load the data from ir_datasets.

In [9]:
!pip install numpy==2.2.0 --force-reinstall


Collecting numpy==2.2.0
  Using cached numpy-2.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
Using cached numpy-2.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.4 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.24.3
    Uninstalling numpy-1.24.3:
      Successfully uninstalled numpy-1.24.3
ERROR: Operation cancelled by user

In [10]:
from pyterrier import get_dataset

pt_dataset = get_dataset('irds:ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training')


### Step 3: Build an index

We will then create an index from the documents in the dataset we just loaded.

In [11]:
from pyterrier import IterDictIndexer

indexer = IterDictIndexer(
    # Store the index in the `index` directory.
    "../data/index",
    meta={'docno': 50, 'text': 4096},
    # If an index already exists there, then overwrite it.
    overwrite=True,
)
index = indexer.index(pt_dataset.get_corpus_iter())

ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents:   0%|          | 0/68261 [00:0…

18:18:46.283 [ForkJoinPool-1-worker-3] WARN org.terrier.structures.indexing.Indexer -- Indexed 1 empty documents


### Step 4: Define the retrieval pipeline

We will define a simple retrieval pipeline using just BM25 as a baseline. For details, refer to the PyTerrier [documentation](https://pyterrier.readthedocs.io) or [tutorial](https://github.com/terrier-org/ecir2021tutorial).

In [12]:
from pyterrier import BatchRetrieve

bm25 = BatchRetrieve(index, wmodel="BM25")



### Step 5: Create the run
In the next steps, we apply our retrieval system to some topics to prepare a 'run' file that contains the retrieved documents.

First, let's have a short look at the first three topics:

In [13]:
# The `'text'` argument below selects the topics' `text` field as the query.
pt_dataset.get_topics('text').head(3)

Unnamed: 0,qid,query
0,1030303,who is aziz hashim
1,1037496,who is rep scalise
2,1043135,who killed nicholas ii of russia


Now, retrieve results for all the topics (may take a while):

In [14]:
run = bm25(pt_dataset.get_topics('text'))

That's it for the retrieval. Here are the first 10 entries of the run:

In [15]:
run.head(10)

Unnamed: 0,qid,docid,docno,rank,score,query
0,1030303,53852,8726436,0,31.681671,who is aziz hashim
1,1030303,56041,8726433,1,25.966276,who is aziz hashim
2,1030303,62116,8726435,2,23.863442,who is aziz hashim
3,1030303,32183,8726429,3,23.391821,who is aziz hashim
4,1030303,35867,8726437,4,21.030669,who is aziz hashim
5,1030303,17637,8726430,5,19.9672,who is aziz hashim
6,1030303,42957,7156982,6,19.9672,who is aziz hashim
7,1030303,21803,8726434,7,19.474804,who is aziz hashim
8,1030303,59828,1305520,8,17.849161,who is aziz hashim
9,1030303,60002,3302257,9,17.832781,who is aziz hashim


### Step 6: Evaluate your run


In [16]:
from pyterrier import Experiment

Experiment([bm25],
    pt_dataset.get_topics('text'),
    pt_dataset.get_qrels(),
    eval_metrics = ["map", "recip_rank", "ndcg_cut_10", "P_1", "P_5", "P_10"]
)

Unnamed: 0,name,map,recip_rank,ndcg_cut_10,P_1,P_5,P_10
0,TerrierRetr(BM25),0.412718,0.786653,0.489469,0.701031,0.62268,0.574227
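
To make the reported measures more concrete, here is a minimal sketch of how precision at k, reciprocal rank, and average precision can be computed for a single topic. It assumes binary relevance and uses hypothetical docnos; the function names are illustrative and are not part of PyTerrier, which also averages each measure over all topics.

```python
def precision_at_k(ranked_docnos, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    top_k = ranked_docnos[:k]
    return sum(1 for docno in top_k if docno in relevant) / k


def reciprocal_rank(ranked_docnos, relevant):
    """1 / position of the first relevant document (0.0 if none is retrieved)."""
    for position, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            return 1.0 / position
    return 0.0


def average_precision(ranked_docnos, relevant):
    """Mean of the precision values at each relevant document's position."""
    hits = 0
    precision_sum = 0.0
    for position, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / len(relevant) if relevant else 0.0


# Toy ranking and relevance judgments (hypothetical, not taken from the dataset).
ranking = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}

print("P@5:", precision_at_k(ranking, relevant, 5))
print("RR: ", reciprocal_rank(ranking, relevant))
print("AP: ", average_precision(ranking, relevant))
```

In this toy example the first relevant document sits at position 2, so the reciprocal rank is 0.5 even though two of the five retrieved documents are relevant.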


### Step 7: Improve

Building your own index is one way to improve upon this baseline (for example, if you want to focus on creating good document representations). Other options include reformulating queries, tuning parameters, or building better retrieval pipelines.

# Data Cleaning and Preprocessing

In [17]:
# Import required libraries
import nltk
from nltk.corpus import stopwords
import re

# Download stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Function to remove stopwords
def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word.lower() not in stop_words])

# Function to preprocess text (lowercase and remove punctuation)
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    return text

# Function to validate documents (filter out very short documents)
def is_valid_document(doc):
    return len(doc['text'].split()) > 5



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [18]:
# Generate clean_docs by applying preprocessing functions.
# This uses the `is_valid_document` from the previous cell, which counts words in the raw text.
clean_docs = [
    {**doc, 'text': preprocess_text(remove_stopwords(doc['text']))}
    for doc in pt_dataset.get_corpus_iter() if is_valid_document(doc)
]

# Check the number of cleaned documents and inspect a sample
print("Number of cleaned documents:", len(clean_docs))
print("Sample document:", clean_docs[0])

# Stricter check that counts words after cleaning. It is defined after `clean_docs`
# was built, so it does not affect the list above.
def is_valid_document(doc):
    cleaned_text = preprocess_text(remove_stopwords(doc['text']))
    return len(cleaned_text.split()) > 5  # Keep only documents with more than 5 words


ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents:   0%|          | 0/68261 [00:0…

Number of cleaned documents: 68253
Sample document: {'text': 'voe im really sure voe know need get permit exactly get voe', 'docno': '4459825'}
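
The cleaning above removes stopwords first and strips punctuation afterwards. The following sketch illustrates why the order can matter: a stopword that is attached to punctuation (such as `it?`) is not matched until the punctuation is gone. The tiny `STOP_WORDS` set is a stand-in for the NLTK list and is an assumption made for illustration only.

```python
import re

# Tiny stand-in for NLTK's English stopword list (assumption, illustration only).
STOP_WORDS = {"the", "is", "it"}


def remove_stopwords(text):
    """Drop whitespace-separated tokens that are stopwords."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)


def preprocess_text(text):
    """Lowercase the text and remove punctuation."""
    return re.sub(r"[^\w\s]", "", text.lower())


raw = "The permit is needed, is it?"

# Order used in the notebook: stopwords first, punctuation second.
stop_then_clean = preprocess_text(remove_stopwords(raw))
# Alternative order: punctuation first, stopwords second.
clean_then_stop = remove_stopwords(preprocess_text(raw))

print(stop_then_clean)  # permit needed it
print(clean_then_stop)  # permit needed
```

Whether the second order actually helps retrieval on this dataset is an open question that would need to be checked with an experiment.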


In [19]:
import os
import pyterrier as pt

# Initialize PyTerrier
if not pt.started():
    pt.init()

# Define the index path
index_path = "/content/wir-2024-relevancers/data/clean_index"

# Create the directory if it doesn't exist
os.makedirs(index_path, exist_ok=True)

# Example function: Validate documents for indexing
def is_valid_document(doc):
    # Ensure the document is not empty after cleaning
    cleaned_text = preprocess_text(remove_stopwords(doc['text']))
    return len(cleaned_text.split()) > 5  # Keep only documents with more than 5 words

# Create the index using IterDictIndexer
indexer = pt.IterDictIndexer(index_path, meta={'docno': 50, 'text': 4096}, overwrite=True)

# Replace `clean_docs` with your iterable of documents
# Ensure `clean_docs` is an iterable of dictionaries with at least `docno` and `text` keys
index = indexer.index(iter(clean_docs))

# Print success message
print("Index created successfully!")

# Load the created index
index_object = pt.IndexFactory.of(index_path)

# Validate the index by printing collection statistics
print(index_object.getCollectionStatistics().toString())

# Convert the index path to an IndexRef
index_ref = pt.IndexRef.of(index_path)

# Print to confirm the type of index_ref
print("IndexRef type:", type(index_ref))  # Prints <class 'jnius.reflect.org.terrier.querying.IndexRef'>




18:20:06.244 [ForkJoinPool-2-worker-3] WARN org.terrier.structures.indexing.Indexer -- Indexed 1 empty documents
Index created successfully!
Number of documents: 68253
Number of terms: 95646
Number of postings: 1565316
Number of fields: 1
Number of tokens: 2196421
Field names: [text]
Positions:   false

IndexRef type: <class 'jnius.reflect.org.terrier.querying.IndexRef'>


# BM25 and RM3 Query Expansion


In [43]:
# Load the index
index = pt.IndexFactory.of(index_path)

# Define BM25 retrieval pipeline
bm25 = pt.terrier.Retriever(index, wmodel="BM25")


BM25 with tuned parameters

In [44]:
# Load the index
index = pt.IndexFactory.of(index_path)

# Define BM25 with tuned parameters
bm25_tuned = pt.terrier.Retriever(index, wmodel="BM25", controls={"k1": 1.5, "b": 0.5})

# Load topics (queries) from the dataset
topics = pt_dataset.get_topics('text')

# Retrieve results for the topics
bm25_tuned_results = bm25_tuned.transform(topics)

# Display the top 10 results
print("BM25 Tuned Results:")
print(bm25_tuned_results.head(10))


BM25 Tuned Results:
       qid  docid    docno  rank      score               query
0  1030303  53844  8726436     0  32.018835  who is aziz hashim
1  1030303  56033  8726433     1  25.228221  who is aziz hashim
2  1030303  62108  8726435     2  23.653081  who is aziz hashim
3  1030303  32178  8726429     3  23.410157  who is aziz hashim
4  1030303  35862  8726437     4  21.360466  who is aziz hashim
5  1030303  17634  8726430     5  20.500117  who is aziz hashim
6  1030303  42951  7156982     6  20.500117  who is aziz hashim
7  1030303  21800  8726434     7  19.706391  who is aziz hashim
8  1030303  21333  1305528     8  18.732769  who is aziz hashim
9  1030303  59820  1305520     9  17.766626  who is aziz hashim
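
To illustrate what `k1` and `b` control, here is a minimal sketch of a textbook BM25 term score. It is not Terrier's implementation (which differs in details such as the IDF variant), so absolute scores will not match the output above. The function name and the example statistics (`df=50`, document lengths) are hypothetical; the average length of 32 roughly matches the tokens-per-document ratio of the cleaned index.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, df, num_docs, k1=1.2, b=0.75):
    """Textbook BM25 score of one query term for one document."""
    # Inverse document frequency: rare terms contribute more.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # b controls how strongly long documents are penalized (b=0 disables it).
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences of a term saturate.
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)


stats = dict(avg_doc_len=32, df=50, num_docs=68253)

short_doc = bm25_term_score(tf=2, doc_len=20, **stats)
long_doc = bm25_term_score(tf=2, doc_len=80, **stats)
short_doc_b0 = bm25_term_score(tf=2, doc_len=20, b=0.0, **stats)
long_doc_b0 = bm25_term_score(tf=2, doc_len=80, b=0.0, **stats)

print("b=0.75:", short_doc, long_doc)      # the shorter document scores higher
print("b=0.0: ", short_doc_b0, long_doc_b0)  # length no longer matters
```

If changing `k1` or `b` in a retrieval run leaves every score unchanged, that is a hint that the parameters did not reach the weighting model.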


# BM25+RM3 with Tuned BM25 Parameters

In [46]:
# Define BM25 Retrieval Pipeline
bm25 = pt.terrier.Retriever(index_ref, wmodel="BM25")

# Chain RM3 with BM25
bm25_rm3 = bm25 >> pt.rewrite.RM3(index_ref) >> bm25

# Define BM25 with tuned parameters
bm25_tuned = pt.terrier.Retriever(index_ref, wmodel="BM25", controls={"k1": 1.5, "b": 0.5})

# Chain RM3 with tuned BM25
bm25_rm3_tuned = bm25_tuned >> pt.rewrite.RM3(index_ref) >> bm25_tuned

# Load topics and test the pipeline
rm3_tuned_results = bm25_rm3_tuned.transform(topics)

print("BM25+RM3 Tuned Results:")
print(rm3_tuned_results.head(10))


BM25+RM3 Tuned Results:
       qid  docid    docno  rank      score             query_0  \
0  1030303  53844  8726436     0  43.750434  who is aziz hashim   
1  1030303  62108  8726435     1  27.309012  who is aziz hashim   
2  1030303  56033  8726433     2  25.227136  who is aziz hashim   
3  1030303  35862  8726437     3  24.529528  who is aziz hashim   
4  1030303  32178  8726429     4  24.270290  who is aziz hashim   
5  1030303  42951  7156982     5  21.406443  who is aziz hashim   
6  1030303  17634  8726430     6  20.499075  who is aziz hashim   
7  1030303  21800  8726434     7  19.705389  who is aziz hashim   
8  1030303  21333  1305528     8  18.732769  who is aziz hashim   
9  1030303  59820  1305520     9  17.766626  who is aziz hashim   

                                               query  
0  applypipeline:off partner^0.017394459 sy^0.034...  
1  applypipeline:off partner^0.017394459 sy^0.034...  
2  applypipeline:off partner^0.017394459 sy^0.034...  
3  applypipeline:o

In [50]:
# Compare Default BM25 vs. Default BM25+RM3
from pyterrier import Experiment

results = Experiment(
    [bm25, bm25_rm3],
    topics,
    pt_dataset.get_qrels(),
    eval_metrics=["map", "recip_rank", "ndcg_cut_10", "P_1", "P_5", "P_10"],
    names=["BM25", "BM25+RM3"]
)

print("Evaluation Results:")
print(results)


Evaluation Results:
       name       map  recip_rank  ndcg_cut_10       P_1       P_5      P_10
0      BM25  0.406065    0.789421     0.489185  0.690722  0.616495  0.569072
1  BM25+RM3  0.452028    0.778025     0.523518  0.690722  0.661856  0.616495


In [51]:
# Compare Default BM25 vs. Tuned BM25+RM3
bm25_rm3_tuned = bm25 >> pt.rewrite.RM3(index_ref, fb_terms=15, fb_docs=5, fb_lambda=0.7) >> bm25

results_tuned = Experiment(
    [bm25, bm25_tuned, bm25_rm3_tuned],  # Compare Default BM25, Tuned BM25, and Tuned BM25+RM3
    topics,
    pt_dataset.get_qrels(),
    eval_metrics=["map", "recip_rank", "ndcg_cut_10", "P_1", "P_5", "P_10"],
    names=["BM25 Default", "BM25 Tuned", "BM25+RM3 Tuned"]
)

print("Tuned RM3 Evaluation Results:")
print(results_tuned)

Tuned RM3 Evaluation Results:
             name       map  recip_rank  ndcg_cut_10       P_1       P_5  \
0    BM25 Default  0.406065    0.789421     0.489185  0.690722  0.616495   
1      BM25 Tuned  0.406065    0.789421     0.489185  0.690722  0.616495   
2  BM25+RM3 Tuned  0.441407    0.790812     0.522558  0.721649  0.639175   

       P_10  
0  0.569072  
1  0.569072  
2  0.609278  



# **Evaluation Analysis: Default vs. Tuned**

## **Default BM25 vs. BM25+RM3**

### **Evaluation Results**
| Metric         | BM25     | BM25+RM3  | Improvement (%) |
|----------------|----------|-----------|-----------------|
| MAP            | 0.406065 | 0.452028  | **+11.3%**      |
| Recip Rank     | 0.789421 | 0.778025  | **-1.4%**       |
| NDCG@10        | 0.489185 | 0.523518  | **+7.0%**       |
| P@1            | 0.690722 | 0.690722  | **+0.0%**       |
| P@5            | 0.616495 | 0.661856  | **+7.4%**       |
| P@10           | 0.569072 | 0.616495  | **+8.3%**       |
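
The improvement column is the relative change over the BM25 baseline. A minimal sketch that recomputes it from the scores reported above (the helper `relative_change` and the dictionary names are illustrative and not part of the notebook):

```python
def relative_change(baseline, system):
    """Relative change of `system` over `baseline`, in percent."""
    return (system - baseline) / baseline * 100


# Scores copied from the evaluation output above.
scores_bm25 = {
    "map": 0.406065, "recip_rank": 0.789421, "ndcg_cut_10": 0.489185,
    "P_1": 0.690722, "P_5": 0.616495, "P_10": 0.569072,
}
scores_bm25_rm3 = {
    "map": 0.452028, "recip_rank": 0.778025, "ndcg_cut_10": 0.523518,
    "P_1": 0.690722, "P_5": 0.661856, "P_10": 0.616495,
}

changes = {
    metric: round(relative_change(scores_bm25[metric], scores_bm25_rm3[metric]), 1)
    for metric in scores_bm25
}

for metric, change in changes.items():
    print(f"{metric:12s} {change:+.1f}%")
```

Note that these are relative changes of averaged scores; they say nothing about statistical significance.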

### **Key Insights**
- **MAP** improved by **11.3%**, suggesting that RM3's query expansion helped retrieve more relevant results overall.
- **NDCG@10** increased by **7.0%**, meaning RM3 improved the ranking quality in the top 10 results.
- **P@5** and **P@10** improved, while **P@1** was unchanged, indicating that RM3 enhanced precision below the first rank:
  - **P@10** saw the largest gain with **+8.3%**.
- However, **Recip Rank** decreased by **1.4%**, implying that RM3 may occasionally introduce noise in retrieving the first relevant document.

---

## **Default BM25+RM3 vs. Tuned BM25+RM3**

### **Tuned Evaluation Results**
| Metric         | BM25 Default | BM25 Tuned | BM25+RM3 Tuned | Change vs. Default RM3 |
|----------------|--------------|------------|----------------|-------------------------|
| MAP            | 0.406065     | 0.406065   | 0.441407       | **-2.3%**              |
| Recip Rank     | 0.789421     | 0.789421   | 0.790812       | **+1.6%**              |
| NDCG@10        | 0.489185     | 0.489185   | 0.522558       | **-0.2%**              |
| P@1            | 0.690722     | 0.690722   | 0.721649       | **+4.5%**              |
| P@5            | 0.616495     | 0.616495   | 0.639175       | **-3.4%**              |
| P@10           | 0.569072     | 0.569072   | 0.609278       | **-1.2%**              |

### **Key Insights**
- **P@1** improved by **4.5%**, showing better relevance for the top-ranked result after tuning.
- **Recip Rank** improved slightly by **1.6%**, indicating that the first relevant document was ranked higher on average than with the default RM3.
- **P@5** and **P@10** decreased by **3.4%** and **1.2%**, so precision further down the top 10 was slightly worse than with the default RM3.
- **MAP** and **NDCG@10** also decreased slightly compared to the default RM3, indicating minor trade-offs introduced by tuning.
- **BM25 Tuned** matches **BM25 Default** on every metric, which suggests that the `k1` and `b` controls had no effect in this run.

---

## **Summary of Findings**
1. **Default RM3 vs. BM25**:
   - RM3 substantially improved ranking and precision, especially for **MAP** and **NDCG@10** (no significance test was run).
   - The default RM3 is already effective as a query expansion strategy, boosting relevance.

2. **Tuned RM3 vs. Default RM3**:
   - Tuning RM3 introduced gains in **P@1** and **Recip Rank**, but it did not outperform the default RM3 in **MAP** or **NDCG@10**.
   - This suggests that tuning RM3’s parameters for this dataset may require further exploration or optimization.

---


# Grid Search for BM25 and RM3 Parameters

### Grid Search for BM25

In [55]:
# Define grid of BM25 parameters
k1_values = [0.5, 1.0, 1.5, 2.0]
b_values = [0.2, 0.5, 0.75, 1.0]

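# Create BM25 variations
# Note: the output below is identical for every (k1, b) pair, which suggests that the
# control names "k1" and "b" are not picked up. PyTerrier's tuning documentation
# uses "bm25.k_1" and "bm25.b" instead.
bm25_variants = [
    pt.terrier.Retriever(index, wmodel="BM25", controls={"k1": k1, "b": b})
    for k1 in k1_values
    for b in b_values
]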

# Experiment with BM25 grid
bm25_results_grid = Experiment(
    bm25_variants,
    topics,
    pt_dataset.get_qrels(),
    eval_metrics=["map", "ndcg_cut_10", "P_10"],
    names=[f"BM25(k1={k1}, b={b})" for k1 in k1_values for b in b_values]
)

# Print results
print("BM25 Grid Search Results:")
print(bm25_results_grid)

# Find the best configuration
best_bm25 = bm25_results_grid.sort_values(by="map", ascending=False).iloc[0]
print(f"Best BM25 Configuration: {best_bm25['name']} with MAP: {best_bm25['map']}")


BM25 Grid Search Results:
                    name       map  ndcg_cut_10      P_10
0    BM25(k1=0.5, b=0.2)  0.406065     0.489185  0.569072
1    BM25(k1=0.5, b=0.5)  0.406065     0.489185  0.569072
2   BM25(k1=0.5, b=0.75)  0.406065     0.489185  0.569072
3    BM25(k1=0.5, b=1.0)  0.406065     0.489185  0.569072
4    BM25(k1=1.0, b=0.2)  0.406065     0.489185  0.569072
5    BM25(k1=1.0, b=0.5)  0.406065     0.489185  0.569072
6   BM25(k1=1.0, b=0.75)  0.406065     0.489185  0.569072
7    BM25(k1=1.0, b=1.0)  0.406065     0.489185  0.569072
8    BM25(k1=1.5, b=0.2)  0.406065     0.489185  0.569072
9    BM25(k1=1.5, b=0.5)  0.406065     0.489185  0.569072
10  BM25(k1=1.5, b=0.75)  0.406065     0.489185  0.569072
11   BM25(k1=1.5, b=1.0)  0.406065     0.489185  0.569072
12   BM25(k1=2.0, b=0.2)  0.406065     0.489185  0.569072
13   BM25(k1=2.0, b=0.5)  0.406065     0.489185  0.569072
14  BM25(k1=2.0, b=0.75)  0.406065     0.489185  0.569072
15   BM25(k1=2.0, b=1.0)  0.406065     0.48918

### Grid Search for RM3

In [None]:
# Define grid of RM3 parameters
fb_terms_values = [5, 10, 15]
fb_docs_values = [3, 5, 10]
fb_lambda_values = [0.5, 0.7, 0.9]

# Create RM3 variations
rm3_variants = [
    bm25 >> pt.rewrite.RM3(index, fb_terms=fb_terms, fb_docs=fb_docs, fb_lambda=fb_lambda) >> bm25
    for fb_terms in fb_terms_values
    for fb_docs in fb_docs_values
    for fb_lambda in fb_lambda_values
]

# Experiment with RM3 grid
rm3_results_grid = Experiment(
    rm3_variants,
    topics,
    pt_dataset.get_qrels(),
    eval_metrics=["map", "ndcg_cut_10", "P_10"],
    names=[
        f"RM3(fb_terms={fb_terms}, fb_docs={fb_docs}, fb_lambda={fb_lambda})"
        for fb_terms in fb_terms_values
        for fb_docs in fb_docs_values
        for fb_lambda in fb_lambda_values
    ]
)

# Print results
print("RM3 Grid Search Results:")
print(rm3_results_grid)

# Find the best configuration
best_rm3 = rm3_results_grid.sort_values(by="map", ascending=False).iloc[0]
print(f"Best RM3 Configuration: {best_rm3['name']} with MAP: {best_rm3['map']}")
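
Both grid searches build the pipelines and their names in two separate comprehensions, which must iterate in exactly the same order. One way to keep them aligned is to enumerate the grid once and derive the names from it. This is a sketch only; `grid` and `config_name` are hypothetical helpers, and the pipelines themselves are left out so that the snippet runs without an index.

```python
from itertools import product

fb_terms_values = [5, 10, 15]
fb_docs_values = [3, 5, 10]
fb_lambda_values = [0.5, 0.7, 0.9]

# Enumerate every parameter combination exactly once.
grid = [
    {"fb_terms": terms, "fb_docs": docs, "fb_lambda": lam}
    for terms, docs, lam in product(fb_terms_values, fb_docs_values, fb_lambda_values)
]


def config_name(params):
    """Readable label such as RM3(fb_terms=5, fb_docs=3, fb_lambda=0.5)."""
    return "RM3(" + ", ".join(f"{key}={value}" for key, value in params.items()) + ")"


names = [config_name(params) for params in grid]

print(len(grid), "configurations")
print(names[0])
```

In the notebook, each entry of `grid` could then be unpacked into the rewriter, for example `bm25 >> pt.rewrite.RM3(index, **params) >> bm25`, so that pipelines and names come from the same list.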


# Re-ranking

In [5]:
!pip install numpy==1.24.3

Collecting numpy==1.24.3
  Using cached numpy-1.24.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Using cached numpy-1.24.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.2.0
    Uninstalling numpy-2.2.0:
      Successfully uninstalled numpy-2.2.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
albucore 0.0.19 requires numpy>=1.24.4, but you have numpy 1.24.3 which is incompatible.
albumentations 1.4.20 requires numpy>=1.24.4, but you have numpy 1.24.3 which is incompatible.
cudf-cu12 24.10.1 requires pandas<2.2.3dev0,>=2.0, but you have pandas 2.2.3 which is incompatible.
gensim 4.3.3 requires scipy<1.14.0,>=1.7.0, but you have scipy 1.14.1 which is incompatible.
pymc 5.19.1 requires nump

In [1]:
!pip install sentence_transformers

Define Sentence-BERT Re-Ranker

In [3]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# Load Sentence-BERT model
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')

# Define a re-ranking function using Sentence-BERT
def sbert_rerank(query, results):
    """
    Re-rank the results using Sentence-BERT embeddings.
    :param query: Query string
    :param results: DataFrame with a `text` column for one query (BM25 results)
    :return: Re-ranked copy of the DataFrame
    """
    # Work on a copy so the caller's DataFrame is not modified in place
    results = results.copy()

    # Compute embeddings for query and retrieved documents
    query_embedding = sbert_model.encode(query)
    doc_embeddings = sbert_model.encode(results['text'].tolist())

    # Compute cosine similarity between query and documents
    results['score'] = cosine_similarity([query_embedding], doc_embeddings).flatten()

    # Sort by score in descending order (higher similarity first) and renumber the ranks
    results = results.sort_values(by='score', ascending=False).reset_index(drop=True)
    results['rank'] = range(len(results))
    return results
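
The re-ranking step itself is independent of the embedding model: score every candidate by cosine similarity to the query vector, then sort. The sketch below shows this with hand-made two-dimensional vectors instead of Sentence-BERT embeddings; the vectors, docnos, and the helper names `cosine` and `rerank` are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rerank(query_vec, candidates):
    """Re-rank (docno, vector) pairs from a first-stage retriever by similarity."""
    scored = [(docno, cosine(query_vec, vec)) for docno, vec in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [
        {"docno": docno, "score": score, "rank": rank}
        for rank, (docno, score) in enumerate(scored)
    ]


# Toy vectors standing in for query and document embeddings.
query_vec = [1.0, 0.0]
candidates = [("d1", [0.0, 1.0]), ("d2", [1.0, 1.0]), ("d3", [2.0, 0.0])]

reranked = rerank(query_vec, candidates)
for row in reranked:
    print(row)
```

Here `d3` moves to the top because it points in the same direction as the query; its larger length does not matter, since cosine similarity ignores vector magnitude.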



Apply Sentence-BERT to BM25 Results

In [25]:
import pyterrier as pt

# Initialize PyTerrier
if not pt.java.started():
    pt.java.init()

# Load the index (make sure the `index_path` is correct)
index = pt.IndexFactory.of("/content/wir-2024-relevancers/data/clean_index")

# Define BM25 retrieval pipeline
bm25 = pt.terrier.Retriever(index, wmodel="BM25")

# Load topics
pt_dataset = pt.get_dataset('irds:ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training')
topics = pt_dataset.get_topics('text')
