# Information Retrieval Lab WiSe 2024/2025: Baseline Retrieval System

This Jupyter notebook serves as a baseline retrieval system that you can improve upon.
We use subsets of the MS MARCO datasets to retrieve passages of web documents.
We will show you how to create a software submission to TIRA from this notebook.

An overview of all corpora that we use in the current course is available at [https://tira.io/datasets?query=ir-lab-wise-2024](https://tira.io/datasets?query=ir-lab-wise-2024). The dataset IDs for loading the datasets are:

- `ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training`: A subsample of the TREC 2019/2020 Deep Learning tracks on the MS MARCO v1 passage dataset. Use this dataset to tune your system(s).
- `ir-lab-wise-2024/subsampled-ms-marco-rag-20241202-training` (_work in progress_): A subsample of the TREC 2024 Retrieval-Augmented Generation track on the MS MARCO v2.1 passage dataset. Use this dataset to tune your system(s).
- `ir-lab-wise-2024/ms-marco-rag-20241203-test` (_work in progress_): The test corpus that we have created together in the course, based on the MS MARCO v2.1 passage dataset. We will use this dataset as the test dataset, i.e., evaluation scores become available only after the submission deadline.

### Step 1: Import libraries

We will use [tira](https://tira.io/), an information retrieval shared task platform, and [ir_datasets](https://ir-datasets.com/) for loading the datasets. Subsequently, we will build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source search engine framework.

First, we need to install the required libraries.

In [1]:
!pip3 install 'tira>=0.0.139' ir-datasets 'python-terrier==0.10.0'

Collecting python-terrier==0.10.0
  Using cached python_terrier-0.10.0-py3-none-any.whl
Installing collected packages: python-terrier
  Attempting uninstall: python-terrier
    Found existing installation: python-terrier 0.12.1
    Uninstalling python-terrier-0.12.1:
      Successfully uninstalled python-terrier-0.12.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pyterrier-caching 0.3.0 requires python-terrier>=0.11.0, but you have python-terrier 0.10.0 which is incompatible.
Successfully installed python-terrier-0.10.0


Create an API client to interact with the TIRA platform (e.g., to load datasets and submit runs).

In [2]:
from tira.third_party_integrations import ensure_pyterrier_is_loaded
from tira.rest_api_client import Client
import pyterrier as pt

ensure_pyterrier_is_loaded()
tira = Client()

PyTerrier 0.10.0 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7



### Step 2: Load the dataset

We load the dataset by its ir_datasets ID (as listed in the Readme). Just be sure to add the `irds:` prefix before the dataset ID to tell PyTerrier to load the data from ir_datasets.

In [3]:
from pyterrier import get_dataset

pt_dataset = get_dataset('irds:ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training')

### Step 3: Build an index

We will then create an index from the documents in the dataset we just loaded.

In [4]:
from pyterrier import IterDictIndexer

indexer = IterDictIndexer(
    # Store the index in the `../data/index` directory.
    "../data/index",
    meta={'docno': 50, 'text': 4096},
    # If an index already exists there, then overwrite it.
    overwrite=True,
)
index = indexer.index(pt_dataset.get_corpus_iter())

ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents:   0%|          | 0/68261 [00:0…

16:36:32.734 [ForkJoinPool-1-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 1 empty documents
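To give an intuition for what the indexer builds, here is a minimal sketch of an inverted index in plain Python. It is not how Terrier stores its index; the function name `build_inverted_index` and the toy corpus are assumptions made for illustration only.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {docno: term frequency} for a dict of docno -> text."""
    index = defaultdict(dict)
    for docno, text in docs.items():
        # Naive tokenization: lowercase and split on whitespace.
        for term in text.lower().split():
            index[term][docno] = index[term].get(docno, 0) + 1
    return dict(index)


# Hypothetical toy corpus, not taken from the course dataset.
toy_docs = {
    "d1": "terrier is a search engine",
    "d2": "a terrier is a dog",
}
toy_index = build_inverted_index(toy_docs)
print(toy_index["terrier"])  # {'d1': 1, 'd2': 1}
print(toy_index["a"])        # {'d1': 1, 'd2': 2}
```

A retrieval model such as BM25 only needs these postings (plus document lengths and collection statistics) to score documents for a query.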


### Step 4: Define the retrieval pipeline

We will define a simple retrieval pipeline using just BM25 as a baseline. For details, refer to the PyTerrier [documentation](https://pyterrier.readthedocs.io) or [tutorial](https://github.com/terrier-org/ecir2021tutorial).

In [5]:
from pyterrier import BatchRetrieve

bm25 = BatchRetrieve(index, wmodel="BM25")

### Step 5: Create the run
Next, we apply our retrieval system to the topics to produce a 'run' that contains the retrieved documents for each query.

First, let's have a short look at the topics:

In [6]:
# The `'text'` argument below selects the topics' `text` field as the query.
df = pt_dataset.get_topics('text')
df

| | qid | query |
|---|---|---|
| 0 | 1030303 | who is aziz hashim |
| 1 | 1037496 | who is rep scalise |
| 2 | 1043135 | who killed nicholas ii of russia |
| 3 | 1051399 | who sings monk theme song |
| 4 | 1064670 | why do hunters pattern their shotguns |
| ... | ... | ... |
| 92 | 405717 | is cdg airport in main paris |
| 93 | 182539 | example of monotonic function |
| 94 | 1113437 | what is physical description of spruce |
| 95 | 1129237 | hydrogen is a liquid below what temperature |


Now, retrieve results for all the topics (may take a while):

In [7]:
run = bm25(pt_dataset.get_topics('text'))

That's it for the retrieval. Here are the first 10 entries of the run:

In [8]:
run.head(10)

| | qid | docid | docno | rank | score | query |
|---|---|---|---|---|---|---|
| 0 | 1030303 | 53852 | 8726436 | 0 | 31.681671 | who is aziz hashim |
| 1 | 1030303 | 56041 | 8726433 | 1 | 25.966276 | who is aziz hashim |
| 2 | 1030303 | 62116 | 8726435 | 2 | 23.863442 | who is aziz hashim |
| 3 | 1030303 | 32183 | 8726429 | 3 | 23.391821 | who is aziz hashim |
| 4 | 1030303 | 35867 | 8726437 | 4 | 21.030669 | who is aziz hashim |
| 5 | 1030303 | 17637 | 8726430 | 5 | 19.9672 | who is aziz hashim |
| 6 | 1030303 | 42957 | 7156982 | 6 | 19.9672 | who is aziz hashim |
| 7 | 1030303 | 21803 | 8726434 | 7 | 19.474804 | who is aziz hashim |
| 8 | 1030303 | 59828 | 1305520 | 8 | 17.849161 | who is aziz hashim |
| 9 | 1030303 | 60002 | 3302257 | 9 | 17.832781 | who is aziz hashim |


### Step 6: Evaluate your run

In [9]:
from pyterrier import Experiment

Experiment([bm25],
    pt_dataset.get_topics('text'),
    pt_dataset.get_qrels(),
    eval_metrics = ["map", "recip_rank", "ndcg_cut_10", "P_1", "P_5", "P_10"]
)

| | name | map | recip_rank | ndcg_cut_10 | P_1 | P_5 | P_10 |
|---|---|---|---|---|---|---|---|
| 0 | BR(BM25) | 0.412718 | 0.786653 | 0.489469 | 0.701031 | 0.62268 | 0.574227 |
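To make the metric names in the table above more concrete, the following sketch computes precision at k, reciprocal rank, and nDCG at k for a single toy query. It assumes linear gains (the relevance grade itself is used as the gain) and is meant as an illustration, not as a replacement for the evaluation that `Experiment` performs. The ranking and relevance grades are invented.

```python
import math


def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranking, gains, k):
    """nDCG@k with linear gains and a log2 rank discount."""
    dcg = sum(gains.get(doc, 0) / math.log2(i + 2) for i, doc in enumerate(ranking[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(i + 2) for i, gain in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranking and graded judgments for one query.
toy_ranking = ["d3", "d1", "d7", "d2"]
toy_gains = {"d1": 2, "d2": 1}
toy_relevant = {doc for doc, gain in toy_gains.items() if gain > 0}

print(precision_at_k(toy_ranking, toy_relevant, 2))  # 0.5
print(reciprocal_rank(toy_ranking, toy_relevant))    # 0.5
print(round(ndcg_at_k(toy_ranking, toy_gains, 10), 3))
```

The values reported by `Experiment` are averages of such per-query scores over all topics.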


### Step 7: Improve

Building your own index is already one way to improve upon this baseline (if you want to focus on creating good document representations). Other options include reformulating queries, tuning parameters, or building better retrieval pipelines.

# Data Cleaning and Preprocessing

In [10]:
import re

def preprocess_text(text):
    # Text cleaning pipeline: lowercase, strip punctuation and digits, normalize whitespace
    text = text.lower()  # Lowercase
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = re.sub(r'\d+', '', text)  # Remove numbers

    # Advanced cleaning steps
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra whitespace

    return text

# Stop word removal
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word not in stop_words])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
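The two helpers above are defined but need to be applied to queries or documents before they have any effect. The sketch below shows what the same cleaning steps do to a sample query. To stay self-contained it uses a small, hypothetical stop word set (`TOY_STOP_WORDS`) instead of the NLTK list, and the function names `clean` and `drop_stopwords` are illustrative stand-ins.

```python
import re

# Small hypothetical stop word list; the notebook itself uses NLTK's English list.
TOY_STOP_WORDS = {"is", "the", "a", "of", "who"}


def clean(text):
    """Lowercase, strip punctuation and digits, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    text = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", text).strip()


def drop_stopwords(text, stop_words=TOY_STOP_WORDS):
    """Remove every whitespace-separated token that is a stop word."""
    return " ".join(word for word in text.split() if word not in stop_words)


query = "Who killed Nicholas II of Russia?"
cleaned = clean(query)
print(cleaned)                  # who killed nicholas ii of russia
print(drop_stopwords(cleaned))  # killed nicholas ii russia

# Removing digits can also remove useful information such as years:
print(clean("TREC 2019 deep learning track"))  # trec deep learning track
```

Whether digit removal helps is worth checking on the training topics, since some queries depend on numbers.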


# Comparison of multiple retrieval models with parameter tuning (BM25 + TF_IDF)

In [11]:
# Multiple retrieval models with parameter tuning

# BM25 with varying k1 and b values
# k1: Controls the term frequency saturation. A higher value means more weight on term frequency.
# b: Controls the length normalization. Values closer to 1 mean greater length normalization.
# The PyTerrier tuning documentation names these controls `bm25.k_1` and `bm25.b`;
# the `c` control sets the length normalization parameter, not k1.
bm25_tuned = [
    BatchRetrieve(index, wmodel="BM25", controls={"bm25.k_1": k1, "bm25.b": b})
    for k1 in [0.5, 1.0, 1.5]
    for b in [0.2, 0.5, 0.8]
]

# TF-IDF model
tfidf = BatchRetrieve(index, wmodel="TF_IDF")
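To illustrate what `k1` and `b` do, here is a sketch of the per-term BM25 weight in a common textbook form. Terrier's implementation differs in details (for example, it also has a query-term parameter `k3`), so treat the function `bm25_term_score` and the numbers below as an illustration under these assumptions rather than as Terrier's exact formula.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, df, n_docs, k1=1.2, b=0.75):
    """Textbook BM25 weight of one query term in one document."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # Length normalization: b=0 ignores document length, b=1 normalizes fully.
    norm = 1 - b + b * doc_len / avg_doc_len
    # Term frequency saturation: larger k1 lets repeated terms count for more.
    return idf * tf * (k1 + 1) / (tf + k1 * norm)


# Hypothetical statistics for an average-length document.
common = dict(doc_len=100, avg_doc_len=100, df=10, n_docs=1000)
for k1 in [0.5, 1.0, 1.5]:
    gain = bm25_term_score(10, k1=k1, **common) / bm25_term_score(1, k1=k1, **common)
    print(f"k1={k1}: tf=10 scores {gain:.2f}x as much as tf=1")

# Hypothetical document that is twice as long as the average.
long_doc = dict(tf=3, doc_len=200, avg_doc_len=100, df=10, n_docs=1000)
for b in [0.2, 0.5, 0.8]:
    print(f"b={b}: score = {bm25_term_score(b=b, **long_doc):.3f}")
```

In this sketch, the ratio between a term occurring ten times and once grows with `k1`, and the score of the long document shrinks as `b` increases.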

In [32]:
# Define retrievers with labels (each entry is a (retriever, label) pair)
bm25_labels = [f"BM25(k1={k1}, b={b})" for k1 in [0.5, 1.0, 1.5] for b in [0.2, 0.5, 0.8]]
retrievers = [(bm25, "BM25")] + list(zip(bm25_tuned, bm25_labels)) + [(tfidf, "TF-IDF")]

# Extract only the retrieval systems
retr_systems_only = [retriever for retriever, label in retrievers]

results = Experiment(
    retr_systems=retr_systems_only,
    topics=pt_dataset.get_topics('text'),
    qrels=pt_dataset.get_qrels(),
    eval_metrics=["map", "recip_rank", "ndcg_cut_10", "P_1", "P_5", "P_10"]
)

# Add labels back to the results
results['System'] = [label for _, label in retrievers]

print(results)

# Try out more models (BM25F, PL2, DirichletLM)

In [13]:
# BM25F: Field-based BM25 model for multi-field indexing
# PL2: A probabilistic retrieval model based on the Divergence From Randomness (DFR) framework
# DirichletLM: A language modeling approach with Dirichlet smoothing

# BM25F (Field-based BM25)
bm25f_tuned = [
    (BatchRetrieve(index, wmodel="BM25F", controls={"k1": k1, "b": b}), f"BM25F(k1={k1}, b={b})")
    for k1 in [0.5, 1.0, 1.5]
    for b in [0.2, 0.5, 0.8]
]

# PL2 (DFR model)
pl2_tuned = [
    (BatchRetrieve(index, wmodel="PL2", controls={"c": c}), f"PL2(c={c})")
    for c in [0.1, 0.5, 1.0]
]

# DirichletLM (Language model with Dirichlet smoothing)
dirichletlm_tuned = [
    (BatchRetrieve(index, wmodel="DirichletLM", controls={"mu": mu}), f"DirichletLM(mu={mu})")
    for mu in [500, 1000, 1500]
]

# Define retrievers with labels
retrievers = bm25f_tuned + pl2_tuned + dirichletlm_tuned

# Extract only the retrieval systems
retr_systems_only = [retriever for retriever, label in retrievers]

In [14]:
# Run the experiment
results = Experiment(
    retr_systems=retr_systems_only,
    topics=pt_dataset.get_topics('text'),
    qrels=pt_dataset.get_qrels(),
    eval_metrics=["map", "recip_rank", "ndcg_cut_10", "P_1", "P_5", "P_10"]
)

# Add labels back to the results
results['System'] = [label for _, label in retrievers]

print(results)

               name       map  recip_rank  ndcg_cut_10       P_1       P_5  \
0         BR(BM25F)  0.401502    0.790825     0.470550  0.701031  0.616495   
1         BR(BM25F)  0.401502    0.790825     0.470550  0.701031  0.616495   
2         BR(BM25F)  0.401502    0.790825     0.470550  0.701031  0.616495   
3         BR(BM25F)  0.401502    0.790825     0.470550  0.701031  0.616495   
4         BR(BM25F)  0.401502    0.790825     0.470550  0.701031  0.616495   
5         BR(BM25F)  0.401502    0.790825     0.470550  0.701031  0.616495   
6         BR(BM25F)  0.401502    0.790825     0.470550  0.701031  0.616495   
7         BR(BM25F)  0.401502    0.790825     0.470550  0.701031  0.616495   
8         BR(BM25F)  0.401502    0.790825     0.470550  0.701031  0.616495   
9           BR(PL2)  0.227536    0.631489     0.293182  0.505155  0.412371   
10          BR(PL2)  0.358087    0.731683     0.432186  0.618557  0.581443   
11          BR(PL2)  0.393056    0.775629     0.469707  0.690722

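BM25F outperforms PL2 and DirichletLM on all metrics. All nine BM25F configurations produce identical scores, which suggests that the `k1` and `b` controls are not picked up by the model, rather than that BM25F is insensitive to these parameters.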

# Try out Rank Fusion (CombSum & RRF)

In [58]:
# Define retrieval models
bm25 = BatchRetrieve(index, wmodel="BM25")
tfidf = BatchRetrieve(index, wmodel="TF_IDF")
pl2 = BatchRetrieve(index, wmodel="PL2")
dirichletlm = BatchRetrieve(index, wmodel="DirichletLM")

In [59]:
# CombSUM
# Retrieve results from each model
bm25_results = bm25(pt_dataset.get_topics('text'))
tfidf_results = tfidf(pt_dataset.get_topics('text'))
pl2_results = pl2(pt_dataset.get_topics('text'))
dirichletlm_results = dirichletlm(pt_dataset.get_topics('text'))

# Combine results using CombSUM
comb_sum_results = bm25_results \
    .merge(tfidf_results, on=["qid", "docno"], how="outer", suffixes=("_bm25", "_tfidf")) \
    .merge(pl2_results, on=["qid", "docno"], how="outer", suffixes=("", "_pl2")) \
    .merge(dirichletlm_results, on=["qid", "docno"], how="outer", suffixes=("", "_dirichlet"))

# Replace NaN scores with 0 and sum the scores.
# The unsuffixed `score` column holds the PL2 scores after the merges above.
comb_sum_results["score"] = comb_sum_results[["score_bm25", "score_tfidf", "score", "score_dirichlet"]].fillna(0).sum(axis=1)

# Sort by query ID and score
comb_sum_results = comb_sum_results[["qid", "docno", "score"]].sort_values(by=["qid", "score"], ascending=[True, False])
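The cell above sums raw scores, so models with larger score ranges carry more weight in the fusion. A common variation is to normalize each run before summing. The sketch below shows CombSUM with min-max normalization on plain dictionaries for a single query; the helper names and the toy scores are assumptions for illustration, not part of the notebook's pipeline.

```python
def min_max_normalize(scores):
    """Rescale the scores of one run for one query into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (score - lo) / (hi - lo) for doc, score in scores.items()}


def comb_sum(*runs):
    """Sum the normalized scores of each document; missing documents add 0."""
    fused = {}
    for run in runs:
        for doc, score in min_max_normalize(run).items():
            fused[doc] = fused.get(doc, 0.0) + score
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores of two systems for the same query, on different scales.
run_a = {"d1": 30.0, "d2": 20.0, "d3": 10.0}
run_b = {"d2": 0.9, "d4": 0.5, "d1": 0.1}

fused_ranking = comb_sum(run_a, run_b)
print(fused_ranking)  # [('d2', 1.5), ('d1', 1.0), ('d4', 0.5), ('d3', 0.0)]
```

Without normalization, `run_a` would dominate the sum here because its scores are roughly thirty times larger than those of `run_b`.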

In [60]:
# Define RRF with k parameter
import pandas as pd


def reciprocal_rank_fusion(*runs, k=60):
    combined_list = []
    for run in runs:
        run = run.copy()
        run["rank"] = run.groupby("qid")["score"].rank(ascending=False, method="first")
        run["score"] = 1 / (k + run["rank"])
        combined_list.append(run)
    # Concatenate all runs into a single dataframe
    combined = pd.concat(combined_list)
    # Aggregate scores by qid and docno
    combined = combined.groupby(["qid", "docno"], as_index=False).sum()
    return combined.sort_values(by=["qid", "score"], ascending=[True, False])

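# Perform RRF. The BM25F run is not computed in the cells above, so retrieve it here.
bm25f_results = BatchRetrieve(index, wmodel="BM25F")(pt_dataset.get_topics('text'))
rrf_results = reciprocal_rank_fusion(bm25_results, bm25f_results, tfidf_results, pl2_results, dirichletlm_results)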
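Reciprocal rank fusion ignores the scores and only uses ranks: each document receives the sum of 1 / (k + rank) over all runs in which it appears. The following sketch shows the same idea on plain ranked lists for one query; the function name `rrf` and the toy rankings are assumptions for illustration.

```python
def rrf(rankings, k=60):
    """Fuse ranked lists of document IDs with reciprocal rank fusion."""
    fused = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            # Documents missing from a ranking simply contribute nothing.
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings of three systems for the same query.
toy_rankings = [
    ["d1", "d2", "d3"],
    ["d2", "d3", "d1"],
    ["d2", "d1", "d4"],
]
fused = rrf(toy_rankings)
print([doc for doc, _ in fused])  # ['d2', 'd1', 'd3', 'd4']
```

Because only ranks matter, RRF needs no score normalization, and a larger `k` reduces the advantage of the very top ranks.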

In [61]:
# Create evaluation systems
evaluation_systems = [
    (bm25_results, "BM25"),
    (tfidf_results, "TF-IDF"),
    (pl2_results, "PL2"),
    (dirichletlm_results, "DirichletLM"),
    (comb_sum_results, "CombSUM Fusion"),
    (rrf_results, "RRF Fusion"),
]

# Extract systems and labels
systems = [system for system, label in evaluation_systems]
labels = [label for system, label in evaluation_systems]

# Run experiment
results = Experiment(
    retr_systems=systems,
    topics=pt_dataset.get_topics('text'),
    qrels=pt_dataset.get_qrels(),
    eval_metrics=["map", "recip_rank", "ndcg_cut_10", "P_1", "P_5", "P_10"]
)

# Add system labels
results['System'] = labels

print(results)

                                                name       map  recip_rank  \
0             qid  docid    docno  rank      scor...  0.412718    0.786653   
1             qid  docid    docno  rank      scor...  0.410917    0.786086   
2             qid  docid    docno  rank      scor...  0.393056    0.775629   
3             qid  docid    docno  rank     score...  0.363818    0.690372   
4              qid    docno      score\n1004    1...  0.413748    0.791683   
5              qid    docno   docid    rank     s...  0.408603    0.767280   

   ndcg_cut_10       P_1       P_5      P_10          System  
0     0.489469  0.701031  0.622680  0.574227            BM25  
1     0.488676  0.701031  0.620619  0.572165          TF-IDF  
2     0.469707  0.690722  0.608247  0.555670             PL2  
3     0.391900  0.597938  0.505155  0.483505     DirichletLM  
4     0.486363  0.711340  0.622680  0.571134  CombSUM Fusion  
5     0.480001  0.659794  0.620619  0.570103      RRF Fusion  


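BM25 is the strongest standalone system. CombSUM fusion is marginally better on MAP (0.4137 vs. 0.4127), reciprocal rank, and P@1, while BM25 stays slightly ahead on nDCG@10 and P@10. RRF fusion does not improve over BM25 on any metric. Combining multiple systems can therefore enhance retrieval effectiveness, but the gains here are small.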

# Improve further with Query Expansion

Building on the CombSUM baseline, we try to improve the results further with query expansion methods.

In [62]:
# Perform initial retrieval using CombSUM
initial_results = comb_sum_results

In [70]:
# Bo1 Divergence from Randomness query expansion model: rewrites the query based on the occurrences of
# terms in the feedback documents provided for each query
bo1 = pt.rewrite.Bo1QueryExpansion(index, fb_terms=20, fb_docs=10)
pipeline_with_bo1 = initial_results >> bo1 >> initial_results


In [71]:
# Experiment with both pipelines
pt.Experiment(
    [initial_results, pipeline_with_bo1],
    topics=pt_dataset.get_topics('text'),
    qrels=pt_dataset.get_qrels(),
    eval_metrics=["map", "ndcg", "recip_rank"],
    names=["CombSUM Baseline", "CombSUM + bo1"]
)

| | name | map | ndcg | recip_rank |
|---|---|---|---|---|
| 0 | CombSUM Baseline | 0.412718 | 0.687574 | 0.786653 |
| 1 | CombSUM + bo1 | 0.412718 | 0.687574 | 0.786653 |


In [22]:
# Metzler and Croft’s sequential dependence model, designed to boost the scores of documents where the query terms occur in close proximity
sdm = pt.rewrite.SequentialDependence(index)
pipeline_with_sdm = initial_results >> sdm >> initial_results


In [23]:
# Experiment with both pipelines
pt.Experiment(
    [initial_results, pipeline_with_sdm],
    topics=pt_dataset.get_topics('text'),
    qrels=pt_dataset.get_qrels(),
    eval_metrics=["map", "ndcg", "recip_rank"],
    names=["CombSUM Baseline", "CombSUM + SDM"]
)

| | name | map | ndcg | recip_rank |
|---|---|---|---|---|
| 0 | CombSUM Baseline | 0.413748 | 0.687757 | 0.791683 |
| 1 | CombSUM + SDM | 0.413748 | 0.687757 | 0.791683 |


In [24]:
# RM3, query expansion strategy based on language modeling
rm3 = pt.rewrite.RM3(index)
pipeline_with_rm3 = initial_results >> rm3 >> initial_results


In [25]:
# Experiment with both pipelines
pt.Experiment(
    [initial_results, pipeline_with_rm3],
    topics=pt_dataset.get_topics('text'),
    qrels=pt_dataset.get_qrels(),
    eval_metrics=["map", "ndcg", "recip_rank"],
    names=["CombSUM Baseline", "CombSUM + rm3"]
)

| | name | map | ndcg | recip_rank |
|---|---|---|---|---|
| 0 | CombSUM Baseline | 0.413748 | 0.687757 | 0.791683 |
| 1 | CombSUM + rm3 | 0.413748 | 0.687757 | 0.791683 |


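Query expansion with Bo1, SDM, and RM3 leads to no improvement over the CombSUM baseline. The scores are identical to the baseline in every case, which suggests that the rewritten queries never reach a retriever: each pipeline ends in the precomputed `initial_results` data frame rather than in a retrieval model that searches the index with the expanded query.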
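For intuition, pseudo-relevance feedback works in three steps: retrieve with the original query, pick informative terms from the top-ranked documents, and retrieve again with the expanded query. The sketch below covers only the middle step and uses raw term counts, which is much simpler than Bo1 or RM3. The function `expand_query`, the stop word set, and the feedback documents are assumptions for illustration.

```python
from collections import Counter


def expand_query(query, feedback_docs, fb_terms=3, stop_words=frozenset()):
    """Append the most frequent new terms from the feedback documents."""
    query_terms = query.lower().split()
    counts = Counter(
        term
        for doc in feedback_docs
        for term in doc.lower().split()
        if term not in stop_words and term not in query_terms
    )
    expansion = [term for term, _ in counts.most_common(fb_terms)]
    return " ".join(query_terms + expansion)


# Hypothetical top-ranked documents for the query.
feedback = [
    "a monotonic function is increasing or decreasing",
    "an increasing function is monotonic",
]
stop = frozenset({"a", "an", "is", "or", "of"})

expanded = expand_query("example of monotonic function", feedback, fb_terms=1, stop_words=stop)
print(expanded)  # example of monotonic function increasing
```

The expanded query only changes the ranking if it is passed to a retriever that searches the index again, which in PyTerrier typically means a pipeline of the form retriever, then query rewriter, then retriever.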

# Try to improve Indexing with Doc2Query

In [26]:
!pip install pyterrier-doc2query pyterrier-caching

Collecting python-terrier>=0.5.0 (from pyterrier-doc2query)
  Using cached python_terrier-0.12.1-py3-none-any.whl.metadata (11 kB)
Using cached python_terrier-0.12.1-py3-none-any.whl (147 kB)
Installing collected packages: python-terrier
  Attempting uninstall: python-terrier
    Found existing installation: python-terrier 0.10.0
    Uninstalling python-terrier-0.10.0:
      Successfully uninstalled python-terrier-0.10.0
Successfully installed python-terrier-0.12.1


In [None]:
from pyterrier_caching import IndexerCache
from pyterrier_doc2query import Doc2Query

doc2query = Doc2Query(append=False, num_samples=5) # append=False stores the generated queries in a separate field (`querygen`) instead of appending them to the document text
# Step 1: Generate queries and cache them
cache = IndexerCache('doc2query_new.cache')
(doc2query >> cache).index(pt_dataset.get_corpus_iter())
# Step 2: Index from the cache
indexer = pt.IterDictIndexer('doc2query.terrier')
indexer.index(cache.get_corpus_iter())

ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents:   0%|          | 0/68261 [00:0…