# Information Retrieval Lab WiSe 2024/2025: Baseline Retrieval System

This Jupyter notebook serves as a baseline retrieval system that you can improve upon.
We use subsets of the MS MARCO datasets to retrieve passages of web documents.
We will show you how to create a software submission to TIRA from this notebook.

An overview of all corpora that we use in the current course is available at [https://tira.io/datasets?query=ir-lab-wise-2024](https://tira.io/datasets?query=ir-lab-wise-2024). The dataset IDs for loading the datasets are:

- `ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training`: A subsample of the TREC 2019/2020 Deep Learning tracks on the MS MARCO v1 passage dataset. Use this dataset to tune your system(s).
- `ir-lab-wise-2024/subsampled-ms-marco-rag-20250105-training`: A subsample of the 2024 TREC RAG track on the MS MARCO v2.1 passage dataset. You can use this dataset to develop your system.
- `ir-lab-wise-2024/subsampled-ms-marco-ir-lab-20250105-test`: The test corpus that we all developed together throughout the course on the MS MARCO v2.1 passage dataset. This dataset is the final test dataset, i.e., evaluation scores become available only after the submission deadline.

### Step 1: Import libraries

We will use [tira](https://tira.io/), an information retrieval shared task platform, and [ir_datasets](https://ir-datasets.com/) for loading the datasets. Subsequently, we will build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source search engine framework.

First, we need to install the required libraries.

In [1]:
!pip3 install 'tira>=0.0.141' ir-datasets 'python-terrier==0.10.0'

Collecting python-terrier==0.10.0
  Downloading python-terrier-0.10.0.tar.gz (107 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting matchpy (from python-terrier==0.10.0)
  Using cached matchpy-0.5.5-py3-none-any.whl.metadata (12 kB)
Collecting scikit-learn (from python-terrier==0.10.0)
  Downloading scikit_learn-1.6.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting nptyping==1.4.4 (from python-terrier==0.10.0)
  Using cached nptyping-1.4.4-py3-none-any.whl.metadata (7.7 kB)
Collecting typish>=1.7.0 (from nptyping==1.4.4->python-terrier==0.10.0)
  Using cached typish-1.9.3-py3-none-any.whl.metadata (7.2 kB)
Collecting multiset<3.0,>=2.0 (from matchpy->python-terrier==0.10.0)
  U

Create an API client to interact with the TIRA platform (e.g., to load datasets and submit runs).

In [2]:
from tira.third_party_integrations import ensure_pyterrier_is_loaded
from tira.rest_api_client import Client

ensure_pyterrier_is_loaded()
tira = Client()

terrier-assemblies 5.7 jar-with-dependencies not found, downloading to /home/heinrich/.pyterrier...
Done
terrier-python-helper 0.0.7 jar not found, downloading to /home/heinrich/.pyterrier...
Done


PyTerrier 0.10.0 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


### Step 2: Load the dataset

We load the dataset by its ir_datasets ID (as listed in the Readme). Just be sure to add the `irds:` prefix before the dataset ID to tell PyTerrier to load the data from ir_datasets.

In [19]:
from pyterrier import get_dataset

pt_dataset = get_dataset('irds:ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training')
# pt_dataset = get_dataset('irds:ir-lab-wise-2024/subsampled-ms-marco-rag-20250105-training')
# pt_dataset = get_dataset('irds:ir-lab-wise-2024/subsampled-ms-marco-ir-lab-20250105-test')

### Step 3: Build an index

We will then create an index from the documents in the dataset we just loaded.

In [20]:
from pyterrier import IterDictIndexer

indexer = IterDictIndexer(
    # Store the index in the `../data/index` directory.
    "../data/index",
    meta={'docno': 50, 'text': 4096},
    # If an index already exists there, then overwrite it.
    overwrite=True,
)
index = indexer.index(pt_dataset.get_corpus_iter())

Download from Zenodo: https://zenodo.org/records/14602253/files/subsampled-ms-marco-ir-lab-20250105-test-inputs.zip


Download: 100%|██████████| 106M/106M [00:20<00:00, 5.56MiB/s] 


Download finished. Extract...
Extraction finished:  /home/heinrich/.tira/extracted_datasets/ir-lab-wise-2024/subsampled-ms-marco-ir-lab-20250105-test/


ir-lab-wise-2024/subsampled-ms-marco-ir-lab-20250105-test documents: 100%|██████████| 125112/125112 [01:27<00:00, 1433.04it/s]
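
Conceptually, indexing turns the stream of documents into an inverted index that maps each term to the documents containing it. The following is a minimal sketch of that idea, not Terrier's actual implementation: the function name `build_inverted_index`, the toy documents, and the whitespace tokenization are assumptions for illustration only. The input mirrors the `docno` and `text` fields used in the `meta` argument above.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to a {docno: term frequency} posting dict."""
    postings = defaultdict(dict)
    doc_lengths = {}
    for doc in docs:
        # Naive tokenization (an assumption): lowercase and split on whitespace.
        # Terrier applies its own tokenizer, stopword list, and stemmer.
        tokens = doc["text"].lower().split()
        doc_lengths[doc["docno"]] = len(tokens)
        for term, tf in Counter(tokens).items():
            postings[term][doc["docno"]] = tf
    return dict(postings), doc_lengths


# Hypothetical toy documents with the same fields as the corpus iterator.
toy_docs = [
    {"docno": "d1", "text": "latex color box"},
    {"docno": "d2", "text": "color of the box"},
]
postings, doc_lengths = build_inverted_index(toy_docs)
print(postings["color"])  # {'d1': 1, 'd2': 1}
```

Storing term frequencies and document lengths alongside the postings is what later allows a ranking function to score documents without re-reading their text.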


### Step 4: Define the retrieval pipeline

We will define a simple retrieval pipeline using just BM25 as a baseline. For details, refer to the PyTerrier [documentation](https://pyterrier.readthedocs.io) or [tutorial](https://github.com/terrier-org/ecir2021tutorial).

In [21]:
from pyterrier import BatchRetrieve

bm25 = BatchRetrieve(index, wmodel="BM25")
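
To give an intuition for what BM25 computes, here is a minimal, self-contained sketch of a textbook BM25 variant. It is an illustration under stated assumptions, not Terrier's implementation: the function `bm25_scores`, the toy documents, the whitespace tokenization, and the parameter values `k1=1.2` and `b=0.75` are all chosen for this example, and Terrier's exact formula may differ.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` (docno -> text) against `query`."""
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    tokenized = {docno: text.lower().split() for docno, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))

    scores = {}
    for docno, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[docno] = score
    return scores


# Hypothetical toy collection.
toy_docs = {
    "d1": "latex color box",
    "d2": "how to paint a wooden box",
    "d3": "weather forecast",
}
scores = bm25_scores("latex color box", toy_docs)
print(sorted(scores, key=scores.get, reverse=True))
```

In this toy collection, `d1` matches all query terms and ranks first, `d2` matches only `box` and is penalized for its length, and `d3` matches nothing and scores zero.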

### Step 5: Create the run

In the next steps, we apply our retrieval system to the topics to produce a 'run': the ranked list of retrieved documents for each topic.

First, let's have a short look at the first three topics:

In [22]:
# The `'text'` argument below selects the topics' `text` field as the query.
pt_dataset.get_topics('text').head(3)

Download from Zenodo: https://zenodo.org/records/14602253/files/subsampled-ms-marco-ir-lab-20250105-test-truths.zip


Download: 100%|██████████| 12.5k/12.5k [00:00<00:00, 1.89MiB/s]

Download finished. Extract...
Extraction finished:  /home/heinrich/.tira/extracted_datasets/ir-lab-wise-2024/subsampled-ms-marco-ir-lab-20250105-test/





|   | qid | query |
|---|-----|-------|
| 0 | 34 | latex color box |
| 1 | 39 | technology behind fpv drones |
| 2 | 41 | eco friendly logistics solutions for green pro... |


Now, retrieve results for all the topics (may take a while):

In [23]:
run = bm25(pt_dataset.get_topics('text'))

That's it for the retrieval. Here are the first 10 entries of the run:

In [24]:
run.head(10)

|   | qid | docid | docno | rank | score | query |
|---|-----|-------|-------|------|-------|-------|
| 0 | 34 | 37054 | msmarco_v2.1_doc_00_559772133#2_1023411002 | 0 | 24.336448 | latex color box |
| 1 | 34 | 21521 | msmarco_v2.1_doc_41_700042731#8_1496549190 | 1 | 24.134513 | latex color box |
| 2 | 34 | 9418 | msmarco_v2.1_doc_00_559772133#3_1023412049 | 2 | 23.650914 | latex color box |
| 3 | 34 | 31673 | msmarco_v2.1_doc_51_461434843#3_944006293 | 3 | 23.610725 | latex color box |
| 4 | 34 | 42698 | msmarco_v2.1_doc_41_700042731#7_1496547286 | 4 | 23.59397 | latex color box |
| 5 | 34 | 27794 | msmarco_v2.1_doc_51_461434843#2_944004995 | 5 | 23.439174 | latex color box |
| 6 | 34 | 23212 | msmarco_v2.1_doc_41_700033541#1_1496515805 | 6 | 23.395326 | latex color box |
| 7 | 34 | 12541 | msmarco_v2.1_doc_50_674652070#3_1371739662 | 7 | 23.286553 | latex color box |
| 8 | 34 | 24739 | msmarco_v2.1_doc_20_1217204421#4_2676012792 | 8 | 23.181642 | latex color box |
| 9 | 34 | 11566 | msmarco_v2.1_doc_39_941719527#3_1917293465 | 9 | 23.172034 | latex color box |
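
The `qid`, `docno`, `rank`, and `score` columns of the run map onto the conventional six-column TREC run format (`qid Q0 docno rank score tag`). The sketch below shows that mapping in plain Python. The helper `to_trec_run_lines` is hypothetical, and whether TIRA stores runs in exactly this layout is an assumption, not something this notebook confirms.

```python
def to_trec_run_lines(rows, tag):
    """Format rows with qid, docno, rank, and score as TREC-style run lines."""
    return [
        # "Q0" is a constant placeholder column kept for historical reasons.
        f"{row['qid']} Q0 {row['docno']} {row['rank']} {row['score']} {tag}"
        for row in rows
    ]


# The first two entries of the run shown above.
rows = [
    {"qid": 34, "docno": "msmarco_v2.1_doc_00_559772133#2_1023411002",
     "rank": 0, "score": 24.336448},
    {"qid": 34, "docno": "msmarco_v2.1_doc_41_700042731#8_1496549190",
     "rank": 1, "score": 24.134513},
]
lines = to_trec_run_lines(rows, tag="bm25-baseline")
print("\n".join(lines))
```

Note that the internal `docid` and the `query` text are not part of this format; only the external `docno` identifies a document across systems.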


### Step 6: Persist and upload run to TIRA

The output of our retrieval system is a run file. Once persisted, the run can be evaluated later, e.g., in a different notebook or by a different person. We therefore first store the run and upload it to TIRA.

In [25]:
from tira.third_party_integrations import persist_and_normalize_run

persist_and_normalize_run(
    run,
    # Give your approach a short but descriptive name tag.
    system_name='bm25-baseline', 
    default_output='data/runs',
    upload_to_tira=pt_dataset,
)

The run file is normalized outside the TIRA sandbox, I will store it at "data/runs".
Done. run file is stored under "data/runs/run.txt.gz".
Run uploaded to TIRA. Claim ownership via: https://www.tira.io/claim-submission/4f8cc0e7-3feb-4934-bdfe-888ab5d8fdd8


Click on the link in the cell output above to claim your submission on TIRA.

In [26]:
# Optionally, clean up the outputs.
from pathlib import Path
Path('data/runs/run.txt.gz').unlink()

### Step 7: Improve

Building your own index is already one way to improve upon this baseline (e.g., if you want to focus on creating good document representations). Other options include reformulating queries, tuning parameters, or building better retrieval pipelines.
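
To judge whether a change actually improves on the baseline, you need an effectiveness measure computed on one of the training datasets. The sketch below computes nDCG@10 for a single query in plain Python. It rests on assumptions: this notebook does not state which measure is used for the final evaluation, the function `ndcg_at_k` and the toy relevance labels are hypothetical, and the gain is taken to be the raw relevance label (other nDCG variants use an exponential gain).

```python
import math


def ndcg_at_k(ranked_docnos, relevance, k=10):
    """nDCG@k for one query; `relevance` maps docno -> graded label."""

    def dcg(gains):
        # Gain at 0-based position i is discounted by log2(i + 2).
        return sum(gain / math.log2(i + 2) for i, gain in enumerate(gains))

    # Unjudged documents are treated as non-relevant (label 0).
    gains = [relevance.get(docno, 0) for docno in ranked_docnos[:k]]
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal_gains)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical relevance judgments and two candidate rankings for one query.
toy_relevance = {"a": 2, "b": 1}
baseline_score = ndcg_at_k(["c", "b", "a"], toy_relevance)
improved_score = ndcg_at_k(["a", "b", "c"], toy_relevance)
print(baseline_score, improved_score)
```

Averaging such per-query scores over all topics of a training dataset gives a single number with which to compare the baseline against a modified pipeline.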