# Information Retrieval Lab WiSe 2024/2025: Baseline Retrieval System

This Jupyter notebook serves as a baseline retrieval system that you can improve upon.
We use subsets of the MS MARCO datasets to retrieve passages of web documents.
We will show you how to create a software submission to TIRA from this notebook.

An overview of all corpora that we use in the current course is available at [https://tira.io/datasets?query=ir-lab-wise-2024](https://tira.io/datasets?query=ir-lab-wise-2024). The dataset IDs for loading the datasets are:

- `ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training`: A subsample of the TREC 2019/2020 Deep Learning tracks on the MS MARCO v1 passage dataset. Use this dataset to tune your system(s).
- `ir-lab-wise-2024/subsampled-ms-marco-rag-20241202-training` (_work in progress_): A subsample of the TREC 2024 Retrieval-Augmented Generation track on the MS MARCO v2.1 passage dataset. Use this dataset to tune your system(s).
- `ir-lab-wise-2024/ms-marco-rag-20241203-test` (_work in progress_): The test corpus that we have created together in the course, based on the MS MARCO v2.1 passage dataset. We will use this dataset as the test dataset, i.e., evaluation scores become available only after the submission deadline.

### Step 1: Import libraries

We will use [tira](https://tira.io/), an information retrieval shared task platform, and [ir_datasets](https://ir-datasets.com/) for loading the datasets. Subsequently, we will build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source search engine framework.

First, we need to install the required libraries.

In [16]:
!pip3 install "tira>=0.0.139" ir-datasets python-terrier==0.10.0





Create an API client to interact with the TIRA platform (e.g., to load datasets and submit runs).

In [17]:
from tira.third_party_integrations import ensure_pyterrier_is_loaded
from tira.rest_api_client import Client

ensure_pyterrier_is_loaded()
tira = Client()

### Step 2: Load the dataset

We load the dataset by its ir_datasets ID (as listed in the Readme). Just be sure to add the `irds:` prefix before the dataset ID to tell PyTerrier to load the data from ir_datasets.

In [21]:
from pyterrier import get_dataset

pt_dataset = get_dataset('irds:ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training')


### Step 3: Build an index

We will then create an index from the documents in the dataset we just loaded.

In [19]:
from pyterrier import IterDictIndexer

indexer = IterDictIndexer(
    # Store the index in the `../data/index` directory.
    "../data/index",
    meta={'docno': 50, 'text': 4096},
    # If an index already exists there, then overwrite it.
    overwrite=True,
)
index = indexer.index(pt_dataset.get_corpus_iter())

ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents: 100%|█████| 68261/68261 [00:24<00:00, 2809.80it/s]

12:13:47.839 [main] WARN org.terrier.structures.indexing.Indexer -- Indexed 1 empty documents
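
To make the indexing step more concrete, here is a minimal sketch of the core idea behind an inverted index: a mapping from each term to the documents that contain it. It is a toy illustration in plain Python, not Terrier's on-disk format; the function name `build_inverted_index` and the three documents are made up for this example. It also shows why an empty document, like the one reported in the warning above, contributes no postings.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a dict of {docno: term frequency}."""
    index = defaultdict(dict)
    for doc in docs:
        # Naive whitespace tokenisation; Terrier additionally applies
        # stemming and stop-word removal by default.
        for term in doc["text"].lower().split():
            index[term][doc["docno"]] = index[term].get(doc["docno"], 0) + 1
    return index


# Hypothetical mini-corpus in the same dict format the indexer consumes.
docs = [
    {"docno": "d1", "text": "who is aziz hashim"},
    {"docno": "d2", "text": "aziz hashim franchise owner"},
    {"docno": "d3", "text": ""},  # an empty document adds no postings
]
index = build_inverted_index(docs)
print(dict(index["aziz"]))
```

At query time, a retrieval model only needs to look at the postings of the query terms instead of scanning every document.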


### Step 4: Define the retrieval pipeline

We will define a simple retrieval pipeline using just BM25 as a baseline. For details, refer to the PyTerrier [documentation](https://pyterrier.readthedocs.io) or [tutorial](https://github.com/terrier-org/ecir2021tutorial).

In [24]:
from pyterrier import terrier

bm25 = terrier.Retriever(index, wmodel="BM25")
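
To illustrate what `wmodel="BM25"` computes, the following sketch scores a few toy documents with a textbook BM25 formula. It is a simplified approximation under stated assumptions: the parameter values `k1=1.2` and `b=0.75` are common defaults, the IDF variant is one of several in use, and Terrier's implementation differs in details. The function name `bm25_scores` and the sample documents are made up for this example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` (token lists) against `query` (token list)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed inverse document frequency (non-negative variant).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "aziz hashim is a franchise owner".split(),
    "nicholas ii of russia was the last tsar".split(),
    "the franchise owner opened a new store".split(),
]
scores = bm25_scores("who is aziz hashim".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Documents that share no term with the query receive a score of zero, which is why lexical models such as BM25 benefit from query expansion.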

### Step 5: Create the run
Next, we apply our retrieval system to the topics to produce a 'run': the ranked list of documents retrieved for each topic.

First, let's have a short look at the first three topics:

In [78]:
# The `'text'` argument below selects the topics' `text` field as the query.
pt_dataset.get_topics('text').head(3)

|   | qid     | query                            |
|---|---------|----------------------------------|
| 0 | 1030303 | who is aziz hashim               |
| 1 | 1037496 | who is rep scalise               |
| 2 | 1043135 | who killed nicholas ii of russia |


Now, retrieve results for all the topics (may take a while):

In [28]:
run = bm25(pt_dataset.get_topics('text'))

That's it for the retrieval. Here are the first 10 entries of the run:

In [29]:
run.head(10)

|   | qid     | docid | docno   | rank | score     | query              |
|---|---------|-------|---------|------|-----------|--------------------|
| 0 | 1030303 | 53852 | 8726436 | 0    | 31.681671 | who is aziz hashim |
| 1 | 1030303 | 56041 | 8726433 | 1    | 25.966276 | who is aziz hashim |
| 2 | 1030303 | 62116 | 8726435 | 2    | 23.863442 | who is aziz hashim |
| 3 | 1030303 | 32183 | 8726429 | 3    | 23.391821 | who is aziz hashim |
| 4 | 1030303 | 35867 | 8726437 | 4    | 21.030669 | who is aziz hashim |
| 5 | 1030303 | 17637 | 8726430 | 5    | 19.9672   | who is aziz hashim |
| 6 | 1030303 | 42957 | 7156982 | 6    | 19.9672   | who is aziz hashim |
| 7 | 1030303 | 21803 | 8726434 | 7    | 19.474804 | who is aziz hashim |
| 8 | 1030303 | 59828 | 1305520 | 8    | 17.849161 | who is aziz hashim |
| 9 | 1030303 | 60002 | 3302257 | 9    | 17.832781 | who is aziz hashim |
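
A run is commonly exchanged as a plain-text file in the TREC run format, with one line per retrieved document: `qid Q0 docno rank score tag`. The sketch below formats the first three rows of the run above by hand to show the layout. The helper name `to_trec_lines` and the tag `bm25-baseline` are illustrative choices; in practice PyTerrier's `pt.io.write_results` can write a run DataFrame to disk.

```python
def to_trec_lines(rows, tag="bm25-baseline"):
    """Format (qid, docno, rank, score) tuples as TREC run lines."""
    return [
        f"{qid} Q0 {docno} {rank} {score:.6f} {tag}"
        for qid, docno, rank, score in rows
    ]


# First three entries of the run shown above.
rows = [
    ("1030303", "8726436", 0, 31.681671),
    ("1030303", "8726433", 1, 25.966276),
    ("1030303", "8726435", 2, 23.863442),
]
lines = to_trec_lines(rows)
print("\n".join(lines))
```

The `Q0` column is a historical placeholder, and the tag identifies the system that produced the run.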


### Step 6: Evaluate your run

We compare the run against the relevance judgments (qrels) of the dataset and report standard effectiveness measures.


In [30]:
from pyterrier import Experiment

Experiment([bm25],
    pt_dataset.get_topics('text'),
    pt_dataset.get_qrels(),
    eval_metrics = ["map", "recip_rank", "ndcg_cut_10", "P_1", "P_5", "P_10"]
)

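|   | name              | map      | recip_rank | ndcg_cut_10 | P_1      | P_5     | P_10     |
|---|-------------------|----------|------------|-------------|----------|---------|----------|
| 0 | TerrierRetr(BM25) | 0.412718 | 0.786653   | 0.489469    | 0.701031 | 0.62268 | 0.574227 |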
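
To clarify what these measures capture, here is a small sketch that computes them for a single query in plain Python. The ranking and relevance labels are hypothetical, any label above 0 is treated as relevant for the binary measures, and nDCG uses linear gains. These are assumptions for illustration; the numbers reported by `Experiment` come from the underlying evaluation library, averaged over all topics.

```python
import math


def precision_at_k(rels, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for r in rels[:k] if r > 0) / k


def reciprocal_rank(rels):
    """1 / rank of the first relevant result (0 if none is retrieved)."""
    for i, r in enumerate(rels, start=1):
        if r > 0:
            return 1.0 / i
    return 0.0


def average_precision(rels, num_relevant):
    """Mean of the precision values at each relevant result."""
    hits, total = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r > 0:
            hits += 1
            total += hits / i
    return total / num_relevant if num_relevant else 0.0


def ndcg_at_k(rels, judged_rels, k):
    """Discounted cumulative gain, normalised by the ideal ordering."""
    def dcg(gains):
        return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))
    ideal = dcg(sorted(judged_rels, reverse=True))
    return dcg(rels) / ideal if ideal else 0.0


# Hypothetical query: labels of the retrieved documents in rank order,
# and the labels of all relevant documents in the qrels (one was not retrieved).
ranked_rels = [0, 2, 1, 0, 0]
judged = [2, 1, 1]

p5 = precision_at_k(ranked_rels, 5)
rr = reciprocal_rank(ranked_rels)
ap = average_precision(ranked_rels, num_relevant=len(judged))
ndcg = ndcg_at_k(ranked_rels, judged, 5)
print(p5, rr, round(ap, 4), round(ndcg, 4))
```

MAP is the mean of the per-query average precision, and `recip_rank` is the mean of the per-query reciprocal rank.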


### Step 7: Improve

Building your own index is already one way to improve upon this baseline (for example, if you want to focus on creating good document representations). Other options include reformulating queries, tuning parameters, and building better retrieval pipelines.

# Data Cleaning and Preprocessing

In [32]:
# Import required libraries
import nltk
from nltk.corpus import stopwords
import re

# Download stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Function to remove stopwords
def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word.lower() not in stop_words])

# Function to preprocess text (lowercase and remove punctuation)
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    return text

# Function to validate documents (filter out very short documents)
def is_valid_document(doc):
    return len(doc['text'].split()) > 5



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ekhar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
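
The order in which these helpers are applied matters. The sketch below uses a tiny stand-in stop-word set (an assumption; the notebook uses NLTK's English list) and a made-up sentence to show that a token with attached punctuation, such as `it?`, escapes the stop-word filter when stop words are removed before punctuation is stripped. The function names are hypothetical.

```python
import re

# Tiny stand-in for NLTK's English stop-word list (assumption for this sketch).
STOP_WORDS = {"the", "what", "is", "it"}


def strip_punctuation(text):
    """Lowercase the text and remove punctuation."""
    return re.sub(r"[^\w\s]", "", text.lower())


def drop_stop_words(text):
    """Remove tokens that appear in the stop-word set."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)


def stopwords_then_clean(text):
    # Same order as preprocess_text(remove_stopwords(...)).
    return strip_punctuation(drop_stop_words(text))


def clean_then_stopwords(text):
    # Alternative order: strip punctuation first, then filter stop words.
    return drop_stop_words(strip_punctuation(text))


sample = "The permit: what is it?"
print(stopwords_then_clean(sample))
print(clean_then_stopwords(sample))
```

Whichever order is chosen, the same normalisation should be kept consistent between documents and queries.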


In [33]:
# Generate clean_docs by applying preprocessing functions
clean_docs = [
    {**doc, 'text': preprocess_text(remove_stopwords(doc['text']))}
    for doc in pt_dataset.get_corpus_iter() if is_valid_document(doc)
]

# Check the number of cleaned documents and inspect a sample
print("Number of cleaned documents:", len(clean_docs))
print("Sample document:", clean_docs[0])

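# Stricter validity check that counts words after cleaning. It is defined after
# `clean_docs` was built, so it only affects later calls, not the list above.
def is_valid_document(doc):
    cleaned_text = preprocess_text(remove_stopwords(doc['text']))
    return len(cleaned_text.split()) > 5  # Keep only documents with more than 5 words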


ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents: 100%|████| 68261/68261 [00:02<00:00, 31823.77it/s]

Number of cleaned documents: 68253
Sample document: {'text': 'voe im really sure voe know need get permit exactly get voe', 'docno': '4459825'}





In [34]:
import os
import pyterrier as pt

# Initialize PyTerrier
if not pt.started():
    pt.init()

# Define the index path
index_path = "C:/Users/ekhar/wir-2024-relevancers/data/clean_index"

# Create the directory if it doesn't exist
os.makedirs(index_path, exist_ok=True)


# Create the index
indexer = pt.IterDictIndexer(index_path, meta={'docno': 50, 'text': 4096}, overwrite=True)
index = indexer.index(iter(clean_docs))

# Print success message
print("Index created successfully!")

# Load the created index
index_object = pt.IndexFactory.of(index_path)

# Validate the index by printing collection statistics
print(index_object.getCollectionStatistics().toString())

# Convert the index path to an IndexRef 
index_ref = pt.IndexRef.of(index_path)

# Print to confirm the type of index_ref
print("IndexRef type:", type(index_ref))  # Prints the Java-backed class, e.g. <class 'jnius.reflect.org.terrier.querying.IndexRef'>



12:19:02.968 [main] WARN org.terrier.structures.indexing.Indexer -- Indexed 1 empty documents
12:19:02.976 [main] ERROR org.terrier.structures.indexing.Indexer -- Could not rename index
java.io.IOException: Rename of index structure file 'C:/Users/ekhar/wir-2024-relevancers/data/clean_index/data_1.direct.bf' (exists) to 'C:/Users/ekhar/wir-2024-relevancers/data/clean_index/data.direct.bf' (exists) failed - likely that source file is still open. Possible indexing bug?
	at org.terrier.structures.IndexUtil.renameIndex(IndexUtil.java:379)
	at org.terrier.structures.indexing.Indexer.index(Indexer.java:388)
Index created successfully!
Number of documents: 68253
Number of terms: 95646
Number of postings: 1565316
Number of fields: 1
Number of tokens: 2196421
Field names: [text]
Positions:   false

IndexRef type: <class 'jnius.reflect.org.terrier.querying.IndexRef'>


# Basic BM25


In [36]:
# Load the index
index = pt.IndexFactory.of(index_path)

# Define BM25 retrieval pipeline using the new Retriever API
bm25 = pt.terrier.Retriever(index, wmodel="BM25")

# Load topics (queries) from the dataset
topics = pt_dataset.get_topics('text')  # `pt_dataset` was loaded in Step 2

# Retrieve results for the topics
bm25_results = bm25.transform(topics)

# Display the top 10 results
print("BM25 Results:")
print(bm25_results.head(10))


BM25 Results:
       qid  docid    docno  rank      score               query
0  1030303  53844  8726436     0  32.018835  who is aziz hashim
1  1030303  56033  8726433     1  25.228221  who is aziz hashim
2  1030303  62108  8726435     2  23.653081  who is aziz hashim
3  1030303  32178  8726429     3  23.410157  who is aziz hashim
4  1030303  35862  8726437     4  21.360466  who is aziz hashim
5  1030303  17634  8726430     5  20.500117  who is aziz hashim
6  1030303  42951  7156982     6  20.500117  who is aziz hashim
7  1030303  21800  8726434     7  19.706391  who is aziz hashim
8  1030303  21333  1305528     8  18.732769  who is aziz hashim
9  1030303  59820  1305520     9  17.766626  who is aziz hashim


# Retrieval with RM3 Query Expansion

In [74]:
# Step 1: Define BM25 Retrieval Pipeline
bm25 = pt.terrier.Retriever(index_ref, wmodel="BM25")

# Step 2: Chain RM3 with BM25
bm25_rm3 = bm25 >> pt.rewrite.RM3(index_ref) >> bm25

# Step 3: Test RM3 Pipeline
topics = pt_dataset.get_topics('text')  # Load topics
rm3_results = bm25_rm3.transform(topics)

print("BM25+RM3 Results:")
print(rm3_results.head())
print("Columns in RM3 Results:", rm3_results.columns)

BM25+RM3 Results:
       qid  docid    docno  rank      score             query_0  \
0  1030303  53844  8726436     0  43.750434  who is aziz hashim   
1  1030303  62108  8726435     1  27.309012  who is aziz hashim   
2  1030303  56033  8726433     2  25.227136  who is aziz hashim   
3  1030303  35862  8726437     3  24.529528  who is aziz hashim   
4  1030303  32178  8726429     4  24.270290  who is aziz hashim   

                                               query  
0  applypipeline:off partner^0.017394459 sy^0.034...  
1  applypipeline:off partner^0.017394459 sy^0.034...  
2  applypipeline:off partner^0.017394459 sy^0.034...  
3  applypipeline:off partner^0.017394459 sy^0.034...  
4  applypipeline:off partner^0.017394459 sy^0.034...  
Columns in RM3 Results: Index(['qid', 'docid', 'docno', 'rank', 'score', 'query_0', 'query'], dtype='object')
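
The weighted terms in the `query` column above (for example `partner^0.017394459`) are the expansion terms that RM3 derives from the top-ranked feedback documents. The following sketch illustrates the general idea in plain Python: estimate a relevance model from the feedback documents, keep the strongest terms, and interpolate with the original query. It is a simplified approximation, not Terrier's implementation. In particular, the convention that `lam` weights the original query is an assumption of this sketch, and the function name `rm3_expand` and the toy documents are made up.

```python
from collections import Counter


def rm3_expand(query_terms, feedback_docs, fb_terms=3, lam=0.5):
    """Return a term -> weight mapping for the expanded query.

    `feedback_docs` is a list of (tokens, retrieval_score) pairs.
    """
    # Original query model: maximum-likelihood estimate over the query terms.
    q_counts = Counter(query_terms)
    q_model = {t: c / len(query_terms) for t, c in q_counts.items()}

    # Relevance model (RM1): P(w|R) is proportional to sum_d P(w|d) * weight(d).
    total_weight = sum(weight for _, weight in feedback_docs)
    rm1 = Counter()
    for tokens, weight in feedback_docs:
        for term, count in Counter(tokens).items():
            rm1[term] += (count / len(tokens)) * (weight / total_weight)

    # Keep only the strongest expansion terms and renormalise them.
    top = dict(rm1.most_common(fb_terms))
    norm = sum(top.values())
    top = {t: p / norm for t, p in top.items()}

    # RM3: interpolate the original query model with the truncated RM1 model.
    terms = set(q_model) | set(top)
    return {t: lam * q_model.get(t, 0.0) + (1 - lam) * top.get(t, 0.0) for t in terms}


feedback = [
    ("aziz hashim franchise partner".split(), 2.0),
    ("hashim franchise owner".split(), 1.0),
]
expanded = rm3_expand(["aziz", "hashim"], feedback, fb_terms=3, lam=0.5)
for term, weight in sorted(expanded.items(), key=lambda kv: -kv[1]):
    print(f"{term}^{weight:.3f}")
```

Terms that co-occur with the query in highly ranked documents, such as `franchise` here, enter the expanded query, which then drives the second BM25 pass of the pipeline.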


In [75]:
# Step 4: Run Experiment
from pyterrier import Experiment

results = Experiment(
    [bm25, bm25_rm3],  # Compare BM25 and BM25+RM3 pipelines
    topics,
    pt_dataset.get_qrels(),
    eval_metrics=["map", "recip_rank", "ndcg_cut_10", "P_1", "P_5", "P_10"],
    names=["BM25", "BM25+RM3"]
)

print("Evaluation Results:")
print(results)


Evaluation Results:
       name       map  recip_rank  ndcg_cut_10       P_1       P_5      P_10
0      BM25  0.406065    0.789421     0.489185  0.690722  0.616495  0.569072
1  BM25+RM3  0.452028    0.778025     0.523518  0.690722  0.661856  0.616495


In [76]:
# Step 5: Tune RM3 Parameters
bm25_rm3_tuned = bm25 >> pt.rewrite.RM3(index_ref, fb_terms=15, fb_docs=5, fb_lambda=0.7) >> bm25

results_tuned = Experiment(
    [bm25, bm25_rm3_tuned],
    topics,
    pt_dataset.get_qrels(),
    eval_metrics=["map", "recip_rank", "ndcg_cut_10", "P_1", "P_5", "P_10"],
    names=["BM25", "BM25+RM3 Tuned"]
)

print("Tuned Evaluation Results:")
print(results_tuned)


Tuned Evaluation Results:
             name       map  recip_rank  ndcg_cut_10       P_1       P_5  \
0            BM25  0.406065    0.789421     0.489185  0.690722  0.616495   
1  BM25+RM3 Tuned  0.441407    0.790812     0.522558  0.721649  0.639175   

       P_10  
0  0.569072  
1  0.609278  


# Evaluation Analysis: BM25 vs. BM25+RM3

### **Evaluation Results**
| Metric         | BM25     | BM25+RM3  | Improvement (%) |
|----------------|----------|-----------|-----------------|
| MAP            | 0.406065 | 0.452028  | **+11.3%**      |
| Recip Rank     | 0.789421 | 0.778025  | **-1.4%**       |
| NDCG@10        | 0.489185 | 0.523518  | **+7.0%**       |
| P@1            | 0.690722 | 0.690722  | No change       |
| P@5            | 0.616495 | 0.661856  | **+7.4%**       |
| P@10           | 0.569072 | 0.616495  | **+8.3%**       |

- **MAP** (mean average precision over the full ranking) improved by **11.3%**, so RM3 ranks relevant documents higher overall.
- **NDCG@10** (ranking quality of the top 10 results) improved by **7.0%**.
- **P@5** and **P@10** (precision in the top 5 and top 10 results) improved by **7.4%** and **8.3%**.
- **Recip Rank** dropped slightly by **1.4%**, meaning the first relevant document was ranked slightly lower on average with RM3.
- **P@1** (precision at the first result) stayed the same.
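
The percentages in the table are relative changes with respect to BM25. As a quick check, they can be recomputed from the scores reported by `Experiment`. The helper name `relative_change` is an illustrative choice.

```python
# Scores copied from the evaluation output above.
bm25_scores = {
    "map": 0.406065, "recip_rank": 0.789421, "ndcg_cut_10": 0.489185,
    "P_1": 0.690722, "P_5": 0.616495, "P_10": 0.569072,
}
rm3_scores = {
    "map": 0.452028, "recip_rank": 0.778025, "ndcg_cut_10": 0.523518,
    "P_1": 0.690722, "P_5": 0.661856, "P_10": 0.616495,
}


def relative_change(new, old):
    """Relative change from `old` to `new` in percent."""
    return (new - old) / old * 100


changes = {
    metric: round(relative_change(rm3_scores[metric], bm25_scores[metric]), 1)
    for metric in bm25_scores
}
for metric, change in changes.items():
    print(f"{metric}: {change:+.1f}%")
```

Relative changes on their own do not show whether a difference is statistically significant; that would require a significance test over the per-topic scores.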

---

# Tuned Evaluation Results

| Metric         | BM25     | BM25+RM3 (default) | BM25+RM3 Tuned | Change vs. Default RM3 |
|----------------|----------|--------------------|----------------|------------------------|
| MAP            | 0.406065 | 0.452028           | 0.441407       | **-2.3%**              |
| Recip Rank     | 0.789421 | 0.778025           | 0.790812       | **+1.6%**              |
| NDCG@10        | 0.489185 | 0.523518           | 0.522558       | **-0.2%**              |
| P@1            | 0.690722 | 0.690722           | 0.721649       | **+4.5%**              |
| P@5            | 0.616495 | 0.661856           | 0.639175       | **-3.4%**              |
| P@10           | 0.569072 | 0.616495           | 0.609278       | **-1.2%**              |

- **P@1** (first result relevance) improved by **4.5%**: the top-ranked result is relevant for more queries.
- **Recip Rank** improved slightly by **1.6%**, meaning the first relevant document was ranked higher on average.
- **P@5** and **P@10** dropped by **3.4%** and **1.2%** relative to default RM3, although both remain above plain BM25 (+3.7% and +7.1%).
- **MAP** dropped by **2.3%**, meaning relevant documents were ranked slightly lower over the full ranking.
- **NDCG@10** dropped marginally (**-0.2%**), so the quality of the top 10 is essentially unchanged.


### **Conclusion from the Analysis**

- **BM25+RM3** performed better than BM25 overall, with improvements in MAP (+11.3%) and NDCG@10 (+7.0%), indicating better ranking and retrieval of relevant documents.
- P@5 (+7.4%) and P@10 (+8.3%) also improved, meaning more relevant documents were retrieved in the top 5 and top 10 results.
- However, Recip Rank dropped slightly (-1.4%), showing that RM3 occasionally ranked the first relevant document lower.

For the **tuned BM25+RM3**, P@1 (+4.5%) and Recip Rank (+1.6%) improved over default RM3, meaning the first relevant document was ranked higher and the top result was relevant more often. However, MAP (-2.3%), P@5 (-3.4%), P@10 (-1.2%), and NDCG@10 (-0.2%) declined relative to default RM3, which may indicate that the tuned feedback settings add noisy expansion terms. Because both configurations were evaluated on the same topics and no significance test was run, these differences should be read with caution.


# Re-ranking with MonoT5

In [4]:
pip install python-terrier==0.12.0 pyterrier_transformers


Collecting python-terrier==0.12.0
  Downloading python_terrier-0.12.0-py3-none-any.whl.metadata (11 kB)
Note: you may need to restart the kernel to use updated packages.


ERROR: Could not find a version that satisfies the requirement pyterrier_transformers (from versions: none)
ERROR: No matching distribution found for pyterrier_transformers
