# Information Retrieval Lab WiSe 2024/2025: Baseline Retrieval System

This Jupyter notebook serves as a baseline retrieval system that you can improve upon.
We use subsets of the MS MARCO datasets to retrieve passages of web documents.
We will show you how to create a software submission to TIRA from this notebook.

An overview of all corpora that we use in the current course is available at [https://tira.io/datasets?query=ir-lab-wise-2024](https://tira.io/datasets?query=ir-lab-wise-2024). The dataset IDs for loading the datasets are:

- `ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training`: A subsample of the TREC 2019/2020 Deep Learning tracks on the MS MARCO v1 passage dataset. Use this dataset to tune your system(s).
- `ir-lab-wise-2024/subsampled-ms-marco-rag-20241202-training` (_work in progress_): A subsample of the TREC 2024 Retrieval-Augmented Generation track on the MS MARCO v2.1 passage dataset. Use this dataset to tune your system(s).
- `ir-lab-wise-2024/ms-marco-rag-20241203-test` (_work in progress_): The test corpus that we have created together in the course, based on the MS MARCO v2.1 passage dataset. We will use this dataset as the test dataset, i.e., evaluation scores become available only after the submission deadline.

### Step 1: Import libraries

We will use [tira](https://tira.io/), an information retrieval shared task platform, and [ir_datasets](https://ir-datasets.com/) for loading the datasets. Subsequently, we will build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source search engine framework.

First, we need to install the required libraries.

In [9]:
# Quote the version specifiers so that the shell does not treat `>` as an output redirection.
!pip3 install 'tira>=0.0.139' ir-datasets 'python-terrier==0.10.0'


Create an API client to interact with the TIRA platform (e.g., to load datasets and submit runs).

In [10]:
from tira.third_party_integrations import ensure_pyterrier_is_loaded
from tira.rest_api_client import Client

ensure_pyterrier_is_loaded()
tira = Client()

### Step 2: Load the dataset

We load the dataset by its ir_datasets ID (as listed in the Readme). Just be sure to add the `irds:` prefix before the dataset ID to tell PyTerrier to load the data from ir_datasets.

In [11]:
from pyterrier import get_dataset

pt_dataset = get_dataset('irds:ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training')


### Step 3: Build an index

We will then create an index from the documents in the dataset we just loaded.

In [12]:
from pyterrier import IterDictIndexer

indexer = IterDictIndexer(
    # Store the index in the `../data/index` directory.
    "../data/index",
    meta={'docno': 50, 'text': 4096},
    # If an index already exists there, then overwrite it.
    overwrite=True,
)
index = indexer.index(pt_dataset.get_corpus_iter())

ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents:   0%|          | 0/68261 [00:0…

16:18:13.540 [ForkJoinPool-2-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 1 empty documents
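
To make concrete what the indexer builds, here is a minimal sketch of an inverted index in plain Python. It is not how Terrier stores its index (Terrier also applies its own tokenization, stemming, and compression); the toy documents and the helper name `build_inverted_index` are illustrative assumptions.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each term to a posting list of {docno: term frequency}."""
    index = defaultdict(dict)
    for doc in docs:
        # Naive whitespace tokenization; real indexers also stem and drop stopwords.
        for term, tf in Counter(doc["text"].lower().split()).items():
            index[term][doc["docno"]] = tf
    return dict(index)

# Hypothetical toy corpus in the same dict format that IterDictIndexer consumes.
toy_docs = [
    {"docno": "d1", "text": "how to get a building permit"},
    {"docno": "d2", "text": "building permit fees and costs"},
    {"docno": "d3", "text": ""},  # an empty document contributes no postings
]

toy_index = build_inverted_index(toy_docs)
print(toy_index["permit"])  # posting list for the term "permit"
print(len(toy_index))       # number of distinct terms
```

The empty document `d3` is accepted but adds nothing to any posting list, which is the situation the `Indexed 1 empty documents` warning above reports.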


### Step 4: Define the retrieval pipeline

We will define a simple retrieval pipeline using just BM25 as a baseline. For details, refer to the PyTerrier [documentation](https://pyterrier.readthedocs.io) or [tutorial](https://github.com/terrier-org/ecir2021tutorial).

In [14]:
from pyterrier import BatchRetrieve

bm25 = BatchRetrieve(index, wmodel="BM25")
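
For intuition about what the BM25 weighting model computes, the following sketch scores a few toy documents with a textbook BM25 variant. It is an assumption-laden illustration: Terrier's implementation differs in details (for example the IDF form and query-side weighting), so the scores will not match those of `BatchRetrieve`. The function name `bm25_scores` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with a textbook BM25 variant."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents need more matches for the same score.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

toy_docs = [
    "building permit fees and costs",
    "how to get a building permit for a garage",
    "weather forecast for the weekend",
]
scores = bm25_scores("building permit", toy_docs)
ranking = sorted(range(len(toy_docs)), key=lambda i: scores[i], reverse=True)
print(ranking)  # document indices, best match first
```

Both matching documents contain each query term once, so the shorter one ranks first purely because of length normalization; `k1` and `b` are the parameters one would tune.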

### Step 5: Create the run

Next, we apply our retrieval system to the topics to produce a 'run', i.e., the ranked list of retrieved documents for each topic.

First, let's have a short look at the first three topics:

In [11]:
# The `'text'` argument below selects the topics' `text` field as the query.
pt_dataset.get_topics('text').head(3)

Download from Zenodo: https://zenodo.org/records/14254044/files/subsampled-ms-marco-deep-learning-20241201-training-truths.zip


Download: 100%|██████████| 61.7k/61.7k [00:00<00:00, 276kiB/s]


Download finished. Extract...
Extraction finished:  /root/.tira/extracted_datasets/ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training/


Unnamed: 0,qid,query
0,1030303,who is aziz hashim
1,1037496,who is rep scalise
2,1043135,who killed nicholas ii of russia


Now, retrieve results for all the topics (may take a while):

In [15]:
run = bm25(pt_dataset.get_topics('text'))

That's it for the retrieval. Here are the first 10 entries of the run:

In [16]:
run.head(10)

Unnamed: 0,qid,docid,docno,rank,score,query
0,1030303,53852,8726436,0,31.681671,who is aziz hashim
1,1030303,56041,8726433,1,25.966276,who is aziz hashim
2,1030303,62116,8726435,2,23.863442,who is aziz hashim
3,1030303,32183,8726429,3,23.391821,who is aziz hashim
4,1030303,35867,8726437,4,21.030669,who is aziz hashim
5,1030303,17637,8726430,5,19.9672,who is aziz hashim
6,1030303,42957,7156982,6,19.9672,who is aziz hashim
7,1030303,21803,8726434,7,19.474804,who is aziz hashim
8,1030303,59828,1305520,8,17.849161,who is aziz hashim
9,1030303,60002,3302257,9,17.832781,who is aziz hashim


### Step 6: Evaluate your run


In [17]:
from pyterrier import Experiment

Experiment([bm25],
    pt_dataset.get_topics('text'),
    pt_dataset.get_qrels(),
    eval_metrics = ["map", "recip_rank", "ndcg_cut_10", "P_1", "P_5", "P_10"]
)

Unnamed: 0,name,map,recip_rank,ndcg_cut_10,P_1,P_5,P_10
0,BR(BM25),0.412718,0.786653,0.489469,0.701031,0.62268,0.574227
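
The metric names passed to `Experiment` can be made concrete with a small sketch that computes them for a single query. This is a simplified illustration under stated assumptions: the labels are hypothetical, and the evaluation tooling behind `Experiment` handles ties and unjudged documents in ways this sketch ignores. `map` is the mean of `average_precision` over all queries.

```python
import math

def precision_at_k(rels, k):
    """Fraction of the top-k results that are relevant (rels holds graded labels)."""
    return sum(1 for r in rels[:k] if r > 0) / k

def reciprocal_rank(rels):
    """1 / rank of the first relevant result, or 0 if none is relevant."""
    for rank, r in enumerate(rels, start=1):
        if r > 0:
            return 1.0 / rank
    return 0.0

def average_precision(rels, num_relevant):
    """Sum of precision at each relevant rank, divided by all relevant documents."""
    hits, total = 0, 0.0
    for rank, r in enumerate(rels, start=1):
        if r > 0:
            hits += 1
            total += hits / rank
    return total / num_relevant if num_relevant else 0.0

def ndcg_at_k(rels, judged_rels, k):
    """DCG of the ranking divided by the DCG of an ideal ordering of the judgments."""
    def dcg(labels):
        return sum(label / math.log2(i + 2) for i, label in enumerate(labels[:k]))
    ideal = dcg(sorted(judged_rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

# Hypothetical graded labels of one ranked list, top result first.
rels = [0, 2, 1, 0, 1]
# All relevant judgments for this query; one relevant document was not retrieved.
judged = [2, 1, 1, 1]

p5 = precision_at_k(rels, 5)
rr = reciprocal_rank(rels)
ap = average_precision(rels, num_relevant=len(judged))
ndcg = ndcg_at_k(rels, judged, 10)
print(p5, rr, ap, ndcg)
```

Reciprocal rank only looks at the first relevant result, whereas average precision and nDCG reward every relevant document and are therefore more sensitive to changes further down the ranking.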


### Step 7: Improve

Building your own index is one way to improve upon this baseline (for example, if you want to focus on creating better document representations). Other options include reformulating queries, tuning parameters, and building better retrieval pipelines.

# Data Cleaning and Preprocessing

In [18]:
# Import required libraries
import nltk
from nltk.corpus import stopwords
import re

# Download stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Function to remove stopwords
def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word.lower() not in stop_words])

# Function to preprocess text (lowercase and remove punctuation)
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    return text

# Function to validate documents (filter out very short documents)
def is_valid_document(doc):
    return len(doc['text'].split()) > 5



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [19]:
# Generate clean_docs by applying the preprocessing functions.
# Note: is_valid_document() filters on the raw text, so a document can still
# end up short or empty after cleaning.
clean_docs = [
    {**doc, 'text': preprocess_text(remove_stopwords(doc['text']))}
    for doc in pt_dataset.get_corpus_iter() if is_valid_document(doc)
]

# Check the number of cleaned documents and inspect a sample
print("Number of cleaned documents:", len(clean_docs))
print("Sample document:", clean_docs[0])


ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents:   0%|          | 0/68261 [00:0…

Number of cleaned documents: 68253
Sample document: {'text': 'voe im really sure voe know need get permit exactly get voe', 'docno': '4459825'}
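
The token `im` in the sample document comes from `I'm`: stopwords are removed before punctuation is stripped, so words that carry punctuation are not matched against the stopword list. The sketch below illustrates the effect of the order. It uses a hypothetical mini stopword list (not NLTK's) and helper names that are assumptions made for this example.

```python
import re

# Hypothetical mini stopword list standing in for NLTK's English list.
# It contains apostrophe-free forms such as "im" on purpose.
STOP_WORDS = {"i", "the", "is", "it", "do", "to", "a", "im", "dont"}

def strip_punctuation(text):
    """Lowercase the text and drop everything except word characters and whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower())

def drop_stopwords(text):
    """Remove whitespace-separated tokens that appear in STOP_WORDS."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

text = "I'm sure it is the permit, do it."

# Order used in the cell above: stopwords first, punctuation second.
stop_then_punct = strip_punctuation(drop_stopwords(text))
# Alternative order: punctuation first, stopwords second.
punct_then_stop = drop_stopwords(strip_punctuation(text))

print(stop_then_punct)  # im sure permit it
print(punct_then_stop)  # sure permit
```

Neither order is free of caveats: stripping punctuation first only catches contractions if the stopword list contains their apostrophe-free forms, which is an assumption of this sketch rather than a property of NLTK's list.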


In [20]:
import os
import pyterrier as pt

# Initialize PyTerrier
if not pt.started():
    pt.init()

# Define the index path
index_path = "/content/wir-2024-relevancers/data/clean_index"

# Create the directory if it doesn't exist
os.makedirs(index_path, exist_ok=True)

# Create the index using IterDictIndexer
indexer = pt.IterDictIndexer(index_path, meta={'docno': 50, 'text': 4096}, overwrite=True)

# `clean_docs` is a list of dictionaries with `docno` and `text` keys
index = indexer.index(iter(clean_docs))

# Print success message
print("Index created successfully!")

# Load the created index
index_object = pt.IndexFactory.of(index_path)

# Validate the index by printing collection statistics
print(index_object.getCollectionStatistics().toString())

# Convert the index path to an IndexRef
index_ref = pt.IndexRef.of(index_path)

# Print to confirm the type of index_ref
print("IndexRef type:", type(index_ref))  # A proxy class for the Java type org.terrier.querying.IndexRef


16:20:24.557 [ForkJoinPool-3-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 1 empty documents
Index created successfully!
Number of documents: 68253
Number of terms: 95646
Number of postings: 1565316
Number of fields: 1
Number of tokens: 2196421
Field names: [text]
Positions:   false

IndexRef type: <class 'jnius.reflect.org.terrier.querying.IndexRef'>


# Basic BM25


In [22]:
# Load the index
index = pt.IndexFactory.of(index_path)

# Define the BM25 retrieval pipeline on the cleaned index
bm25 = pt.BatchRetrieve(index, wmodel="BM25")

# Load topics (queries) from the dataset
topics = pt_dataset.get_topics('text')

# Retrieve results for the topics
bm25_results = bm25.transform(topics)

# Display the top 10 results
print("BM25 Results:")
print(bm25_results.head(10))


BM25 Results:
       qid  docid    docno  rank      score               query
0  1030303  53844  8726436     0  32.018835  who is aziz hashim
1  1030303  56033  8726433     1  25.228221  who is aziz hashim
2  1030303  62108  8726435     2  23.653081  who is aziz hashim
3  1030303  32178  8726429     3  23.410157  who is aziz hashim
4  1030303  35862  8726437     4  21.360466  who is aziz hashim
5  1030303  17634  8726430     5  20.500117  who is aziz hashim
6  1030303  42951  7156982     6  20.500117  who is aziz hashim
7  1030303  21800  8726434     7  19.706391  who is aziz hashim
8  1030303  21333  1305528     8  18.732769  who is aziz hashim
9  1030303  59820  1305520     9  17.766626  who is aziz hashim


# Retrieval with RM3 Query Expansion

In [24]:
# Step 1: Define BM25 Retrieval Pipeline
bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25")

# Step 2: Chain RM3 with BM25
bm25_rm3 = bm25 >> pt.rewrite.RM3(index_ref) >> bm25

# Step 3: Test RM3 Pipeline
topics = pt_dataset.get_topics('text')  # Load topics
rm3_results = bm25_rm3.transform(topics)

print("BM25+RM3 Results:")
print(rm3_results.head())
print("Columns in RM3 Results:", rm3_results.columns)

BM25+RM3 Results:
       qid  docid    docno  rank      score             query_0  \
0  1030303  53844  8726436     0  38.501560  who is aziz hashim   
1  1030303  62108  8726435     1  30.684877  who is aziz hashim   
2  1030303  56033  8726433     2  27.497540  who is aziz hashim   
3  1030303  35862  8726437     3  26.724978  who is aziz hashim   
4  1030303  32178  8726429     4  24.604161  who is aziz hashim   

                                               query  
0  applypipeline:off partner^0.028783683 capit^0....  
1  applypipeline:off partner^0.028783683 capit^0....  
2  applypipeline:off partner^0.028783683 capit^0....  
3  applypipeline:off partner^0.028783683 capit^0....  
4  applypipeline:off partner^0.028783683 capit^0....  
Columns in RM3 Results: Index(['qid', 'docid', 'docno', 'rank', 'score', 'query_0', 'query'], dtype='object')
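
The rewritten `query` column above holds weighted expansion terms. The following sketch shows the general idea behind RM3 on toy data: build a relevance model from the top-ranked feedback documents, then interpolate it with the original query. It is a simplified illustration, not the terrier-prf implementation. In particular, approximating the query likelihood by normalized first-pass scores and letting `orig_weight` weight the original query are assumptions; how `fb_lambda` maps onto this weight should be checked in the library documentation.

```python
from collections import Counter

def rm3_expand(query_terms, feedback_docs, doc_scores, fb_terms=3, orig_weight=0.6):
    """Interpolate a uniform query model with a relevance model from feedback docs."""
    # Normalize the first-pass scores into weights for the feedback documents.
    total_score = sum(doc_scores)
    doc_weights = [s / total_score for s in doc_scores]

    # Relevance model: P(w|R) = sum over documents of P(w|d) * weight(d).
    relevance_model = Counter()
    for tokens, weight in zip(feedback_docs, doc_weights):
        for term, count in Counter(tokens).items():
            relevance_model[term] += weight * count / len(tokens)

    # Keep only the fb_terms strongest terms and renormalize them.
    top = dict(relevance_model.most_common(fb_terms))
    norm = sum(top.values())
    top = {term: w / norm for term, w in top.items()}

    # Original query model: uniform over the query terms.
    query_model = {term: 1 / len(query_terms) for term in query_terms}

    return {
        term: orig_weight * query_model.get(term, 0.0)
        + (1 - orig_weight) * top.get(term, 0.0)
        for term in set(query_model) | set(top)
    }

# Hypothetical feedback documents (already tokenized) and their first-pass scores.
feedback_docs = [
    "building permit application form city office".split(),
    "permit application fees city".split(),
]
expanded = rm3_expand(["building", "permit"], feedback_docs, doc_scores=[2.0, 1.0])
for term, weight in sorted(expanded.items(), key=lambda kv: -kv[1]):
    print(f"{term}^{weight:.3f}")
```

Terms that occur in both the query and the feedback documents receive the highest weight, and new terms such as `application` enter the query with a small weight, which is how expansion can raise recall-oriented metrics like MAP.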


In [25]:
# Step 4: Run Experiment
from pyterrier import Experiment

results = Experiment(
    [bm25, bm25_rm3],  # Compare BM25 and BM25+RM3 pipelines
    topics,
    pt_dataset.get_qrels(),
    eval_metrics=["map", "recip_rank", "ndcg_cut_10", "P_1", "P_5", "P_10"],
    names=["BM25", "BM25+RM3"]
)

print("Evaluation Results:")
print(results)


Evaluation Results:
       name       map  recip_rank  ndcg_cut_10       P_1       P_5      P_10
0      BM25  0.406065    0.789421     0.489185  0.690722  0.616495  0.569072
1  BM25+RM3  0.447490    0.785040     0.516655  0.701031  0.651546  0.610309


# **Evaluation Analysis: BM25 vs. BM25+RM3**

### **Evaluation Results**
| Metric         | BM25     | BM25+RM3  | Improvement (%) |
|----------------|----------|-----------|-----------------|
| MAP            | 0.406065 | 0.447490  | **+10.2%**      |
| Recip Rank     | 0.789421 | 0.785040  | **-0.6%**       |
| NDCG@10        | 0.489185 | 0.516655  | **+5.6%**       |
| P@1            | 0.690722 | 0.701031  | **+1.5%**       |
| P@5            | 0.616495 | 0.651546  | **+5.7%**       |
| P@10           | 0.569072 | 0.610309  | **+7.2%**       |

- **MAP** improved by **10.2%**, indicating overall better relevance with RM3.
- **NDCG@10** improved by **5.6%**, highlighting better-ranked top-10 results.
- **P@1**, **P@5**, and **P@10** showed consistent improvements, indicating better precision for the top retrieved results.
- **Recip Rank** decreased slightly by **0.6%**, indicating that for some queries RM3 ranked the first relevant document at a slightly lower position.
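
The percentages in the table can be recomputed from the scores reported by `Experiment`. A minimal sketch, with the values copied from the output above and a hypothetical helper name `relative_change`:

```python
# Scores copied from the Experiment output above.
bm25_scores = {"map": 0.406065, "recip_rank": 0.789421, "ndcg_cut_10": 0.489185,
               "P_1": 0.690722, "P_5": 0.616495, "P_10": 0.569072}
rm3_scores = {"map": 0.447490, "recip_rank": 0.785040, "ndcg_cut_10": 0.516655,
              "P_1": 0.701031, "P_5": 0.651546, "P_10": 0.610309}

def relative_change(baseline, system):
    """Relative change of each metric in percent, measured against the baseline."""
    return {m: 100.0 * (system[m] - baseline[m]) / baseline[m] for m in baseline}

changes = relative_change(bm25_scores, rm3_scores)
for metric, pct in changes.items():
    print(f"{metric}: {pct:+.1f}%")
```

Small relative changes deserve caution. The two P@1 values are consistent with 97 evaluated queries (67/97 and 68/97), in which case the +1.5% corresponds to a single query. `Experiment` also accepts a `baseline` argument (for example `baseline=0`) that adds paired significance tests against the chosen system.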


In [26]:
# Step 5: Tune RM3 Parameters
bm25_rm3_tuned = bm25 >> pt.rewrite.RM3(index_ref, fb_terms=15, fb_docs=5, fb_lambda=0.7) >> bm25

results_tuned = Experiment(
    [bm25, bm25_rm3_tuned],
    topics,
    pt_dataset.get_qrels(),
    eval_metrics=["map", "recip_rank", "ndcg_cut_10", "P_1", "P_5", "P_10"],
    names=["BM25", "BM25+RM3 Tuned"]
)

print("Tuned Evaluation Results:")
print(results_tuned)


Tuned Evaluation Results:
             name       map  recip_rank  ndcg_cut_10       P_1       P_5  \
0            BM25  0.406065    0.789421     0.489185  0.690722  0.616495   
1  BM25+RM3 Tuned  0.436078    0.784178     0.510363  0.711340  0.618557   

       P_10  
0  0.569072  
1  0.594845  
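
The cell above tries a single alternative parameter setting. A more systematic approach is an exhaustive grid search, sketched below in plain Python. The `fake_map` function is a synthetic stand-in with an arbitrary shape, not a real measurement; in practice it would be replaced by a function that builds `bm25 >> pt.rewrite.RM3(...) >> bm25` with the given parameters and returns, for example, MAP on a tuning split.

```python
from itertools import product

def grid_search(evaluate, grid):
    """Try every parameter combination and return the best one with its score."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def fake_map(fb_terms, fb_docs, fb_lambda):
    """Synthetic stand-in for an evaluation run (hypothetical, arbitrary shape)."""
    return (0.45 - 0.001 * abs(fb_terms - 10)
            - 0.01 * abs(fb_docs - 3) - 0.1 * abs(fb_lambda - 0.6))

grid = {"fb_terms": [5, 10, 15], "fb_docs": [3, 5, 10], "fb_lambda": [0.5, 0.6, 0.7]}
best_params, best_score = grid_search(fake_map, grid)
print(best_params, best_score)
```

Parameters selected on the same topics that are used for the final comparison tend to look better than they are, so it is safer to tune on one subset of the topics and report results on another.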


# **Tuned Evaluation Results**

| Metric         | BM25     | BM25+RM3 (default) | BM25+RM3 Tuned | Change vs. Default RM3 |
|----------------|----------|--------------------|----------------|------------------------|
| MAP            | 0.406065 | 0.447490           | 0.436078       | **-2.6%**              |
| Recip Rank     | 0.789421 | 0.785040           | 0.784178       | **-0.1%**              |
| NDCG@10        | 0.489185 | 0.516655           | 0.510363       | **-1.2%**              |
| P@1            | 0.690722 | 0.701031           | 0.711340       | **+1.5%**              |
| P@5            | 0.616495 | 0.651546           | 0.618557       | **-5.1%**              |
| P@10           | 0.569072 | 0.610309           | 0.594845       | **-2.5%**              |

- **P@1** improved by **1.5%** over the default RM3; it is the only metric on which the tuned configuration is better.
- **P@5** (**-5.1%**) and **P@10** (**-2.5%**) decreased relative to the default RM3, although P@10 remains above plain BM25 (+4.5%).
- **MAP**, **Recip Rank**, and **NDCG@10** decreased slightly compared to the default RM3, so this parameter setting trades overall ranking quality for a better first result.

### **Conclusion from the Analysis**

- **BM25+RM3** performed better than BM25 overall, with improvements in MAP (+10.2%), NDCG@10 (+5.6%), and P@10 (+7.2%). This indicates better ranking and retrieval of relevant documents, especially in the top 10 results.
- P@5 (+5.7%) and P@1 (+1.5%) also improved, meaning more relevant documents were retrieved in the top results.
- However, **Recip Rank** decreased slightly (-0.6%), suggesting that for some queries RM3 ranked the first relevant document at a lower position.

- For the **tuned BM25+RM3** (`fb_terms=15`, `fb_docs=5`, `fb_lambda=0.7`), only P@1 improved over the default RM3 (+1.5%). MAP (-2.6%), Recip Rank (-0.1%), NDCG@10 (-1.2%), P@5 (-5.1%), and P@10 (-2.5%) all decreased, so this single alternative configuration does not improve on the default parameters overall, although it still outperforms plain BM25 on every metric except Recip Rank.

# Re-ranking with MonoT5