# BM25 retrieval with PyTerrier

In [1]:
from sys import modules

if "google.colab" in modules:
    # This is only needed in Google Colab.
    !pip install ir-datasets~=0.5.5 ir-measures~=0.3.3 python-terrier~=0.10.0 tira~=0.0.79 tqdm~=4.66

### Step 1: Import libraries and load variables

In [2]:
from pathlib import Path

from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run

Ensure that the PyTerrier integration is loaded:

In [3]:
ensure_pyterrier_is_loaded()

Due to execution in TIRA, I have patched ir_datasets to always return the single input dataset mounted to the sandbox.
Start PyTerrier with version=5.7, helper_version=0.0.7, no_download=True


PyTerrier 0.10.0 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


In [4]:
from pyterrier.batchretrieve import BatchRetrieve
from pyterrier.bootstrap import IndexFactory
from pyterrier.datasets import get_dataset
from pyterrier.index import IterDictIndexer

Specify the output directory for the run file and the directory for the index:

In [5]:
output_directory = "./output"
index_directory = "./index"

### Step 2: Load the data

In [7]:
dataset = get_dataset("irds:ir-lab-jena-leipzig-wise-2023/training-20231104-training")
dataset

Load ir_dataset "ir-lab-jena-leipzig-wise-2023/training-20231104-training" from tira.


IRDSDataset('ir-lab-jena-leipzig-wise-2023/training-20231104-training')

Let's look at the topics:

In [8]:
topics = dataset.get_topics(variant="title")
topics

No settings given in /home/heinrich/.tira/.tira-settings.json. I will use defaults.


| | qid | query |
|---|---|---|
| 0 | q06223196 | car shelter |
| 1 | q062228 | airport |
| 2 | q062287 | antivirus comparison |
| 3 | q06223261 | free antivirus |
| 4 | q062291 | orange antivirus |
| ... | ... | ... |
| 667 | q062224914 | tax garden shed |
| 668 | q062224961 | land of france |
| 669 | q062225030 | find my training pole job |
| 670 | q062225194 | gpl car |


And how many documents do we have for training?

In [9]:
print(f"The dataset has {dataset.irds_ref().docs_count()} documents.")

No settings given in /home/heinrich/.tira/.tira-settings.json. I will use defaults.
The dataset has 47064 documents.


The `docs_store` is a view of the dataset that provides random access to documents by their ID.

In [10]:
docs_store = dataset.irds_ref().docs_store()

Let's look at the document with the ID `doc062200109610`:

In [11]:
print(docs_store.get("doc062200109610"))

GenericDoc(doc_id='doc062200109610', text='\n\nEDF\n-\nGDF School-Valentine (25480)\n- Opening of electricity and gas meter Opening of your electricity or gas meter at École-Valentin on the Enedis/ErDF or GrDF network with papernest Free and non-binding service Announcement\n- papernest is not a partner of EDF.\nThank you.\nYour request has been taken into account A counsellor will call you back to the I understood\nIt seems that there is an error with our service Try again Opening your electricity or gas meter at École-Valentin on the Enedis/ErDF or GrDF network\nwith agence-france-electricite.fr Call the Me to call back Simple and quick: 5 minutes is enough No commitment or cancellation fee On 13 users Announcement\n- agency-france-electricite.fr\nis not a partner of Edf Contacts and rates of Engie gas offers to École-Valentin\nEngie\n, formerly SFM\nSuez, is one of the main suppliers of energy in Franche-Comté and throughout France.\nThe company emerged from the merger between Suez 

### Step 3: Create the index

In [12]:
if Path(index_directory).exists():
    # Reuse the index from a previous execution.
    index_ref = index_directory
else:
    # Build a new index in the configured index directory.
    indexer = IterDictIndexer(index_directory, overwrite=False)
    index_ref = indexer.index(dataset.get_corpus_iter(verbose=True))
index = IndexFactory.of(index_ref)
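
To give an intuition for what the indexing step produces, here is a minimal pure-Python sketch of an inverted index. It is not how Terrier stores its index: the whitespace tokenizer, the function name `build_inverted_index`, and the toy corpus are assumptions made for illustration, and a real indexer usually also removes stopwords and stems terms. The documents are assumed to be dictionaries with `docno` and `text` keys.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to {docno: term frequency} and record document lengths."""
    postings = defaultdict(dict)
    doc_lengths = {}
    for doc in docs:
        # Naive lowercase whitespace tokenizer (assumption, for illustration only).
        tokens = doc["text"].lower().split()
        doc_lengths[doc["docno"]] = len(tokens)
        for term, tf in Counter(tokens).items():
            postings[term][doc["docno"]] = tf
    return dict(postings), doc_lengths


# Hypothetical toy corpus, not taken from the dataset.
toy_docs = [
    {"docno": "d1", "text": "car shelter for the garden"},
    {"docno": "d2", "text": "airport car rental"},
    {"docno": "d3", "text": "garden shed tax"},
]

postings, doc_lengths = build_inverted_index(toy_docs)
print(postings["car"])    # documents containing "car" with their term frequencies
print(doc_lengths["d1"])  # number of tokens in d1
```

The per-term posting lists and the document lengths are the statistics that a ranking function such as BM25 needs at query time.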

### Step 4: Create retrieval pipeline

In [13]:
bm25 = BatchRetrieve(index, wmodel="BM25", verbose=True)
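
For readers who want to see roughly what the `BM25` weighting model computes, the following is a sketch of one common textbook variant of the BM25 formula in pure Python. It is not Terrier's implementation, which may differ in the IDF definition and in additional parameters. The function name `bm25_score`, the toy statistics, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math


def bm25_score(query_terms, docno, postings, doc_lengths, k1=1.2, b=0.75):
    """Score one document for a query with a textbook BM25 variant."""
    n_docs = len(doc_lengths)
    avg_len = sum(doc_lengths.values()) / n_docs
    score = 0.0
    for term in query_terms:
        term_postings = postings.get(term, {})
        tf = term_postings.get(docno, 0)
        if tf == 0:
            continue  # the term does not occur in this document
        df = len(term_postings)  # number of documents containing the term
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        # Length normalization: longer-than-average documents are penalized.
        norm = tf + k1 * (1.0 - b + b * doc_lengths[docno] / avg_len)
        score += idf * tf * (k1 + 1.0) / norm
    return score


# Hypothetical toy statistics: term -> {docno: term frequency}, and document lengths.
toy_postings = {
    "car": {"d1": 1, "d2": 1},
    "shelter": {"d1": 1},
    "airport": {"d2": 1},
}
toy_lengths = {"d1": 5, "d2": 3, "d3": 3}

scores = {
    docno: bm25_score(["car", "shelter"], docno, toy_postings, toy_lengths)
    for docno in toy_lengths
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy example, `d1` matches both query terms and ranks first, `d2` matches only the more frequent term, and `d3` matches nothing and receives a score of zero.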

### Step 5: Create the run
This will retrieve documents with the BM25 model for all queries.

In [14]:
run = bm25.transform(topics)

BR(BM25): 100%|██████████| 672/672 [00:58<00:00, 11.55q/s]


Let's look at a few results of the run:

In [15]:
run.iloc[995:1005]

| | qid | docid | docno | rank | score | query |
|---|---|---|---|---|---|---|
| 995 | q06223196 | 10827 | doc062205406611 | 995 | 3.249728 | car shelter |
| 996 | q06223196 | 15619 | doc062200206552 | 996 | 3.249516 | car shelter |
| 997 | q06223196 | 9686 | doc062202003811 | 997 | 3.249338 | car shelter |
| 998 | q06223196 | 43217 | doc062205700094 | 998 | 3.249158 | car shelter |
| 999 | q06223196 | 5372 | doc062201801330 | 999 | 3.249158 | car shelter |
| 1000 | q062228 | 18692 | doc062214607455 | 0 | 9.122952 | airport |
| 1001 | q062228 | 37997 | doc062214701942 | 1 | 9.060229 | airport |
| 1002 | q062228 | 24031 | doc062214408047 | 2 | 9.058265 | airport |
| 1003 | q062228 | 34995 | doc062214509661 | 3 | 9.052274 | airport |
| 1004 | q062228 | 11166 | doc062208006681 | 4 | 9.050859 | airport |
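
The slice straddles the boundary between the first and the second query: ranks are 0-based and restart at 0 for each `qid`, so the first query occupies rows 0 to 999. As a small illustration of that layout, the sketch below groups a few rows copied from the output by query and keeps only the top-ranked hits. The helper name `top_k_per_query` is hypothetical, and plain tuples stand in for the DataFrame rows.

```python
from collections import defaultdict

# A few rows copied from the output above: (qid, docno, rank, score).
rows = [
    ("q06223196", "doc062205700094", 998, 3.249158),
    ("q06223196", "doc062201801330", 999, 3.249158),
    ("q062228", "doc062214607455", 0, 9.122952),
    ("q062228", "doc062214701942", 1, 9.060229),
    ("q062228", "doc062214408047", 2, 9.058265),
]


def top_k_per_query(rows, k):
    """Keep the k best-ranked hits per query (ranks are 0-based)."""
    by_query = defaultdict(list)
    for qid, docno, rank, score in rows:
        if rank < k:
            by_query[qid].append((rank, docno, score))
    # Sort each query's hits by rank.
    return {qid: sorted(hits) for qid, hits in by_query.items()}


top2 = top_k_per_query(rows, k=2)
for qid, hits in top2.items():
    print(qid, [docno for _, docno, _ in hits])
```

Only `q062228` appears in the result, because the sampled rows of `q06223196` all have ranks far below the cutoff.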


### Step 6: Persist run

Perfect! All that's left is to persist the run in the standard TREC format:

In [16]:
persist_and_normalize_run(run, output_file=output_directory, system_name="BM25", depth=1000)

Done. run file is stored under "./output/run.txt".
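
For orientation, a line in the standard TREC run format consists of six whitespace-separated fields: query ID, the literal `Q0`, document ID, rank, score, and a system tag. The sketch below formats a few rows from the run in that layout. It is an illustration only, not the implementation of `persist_and_normalize_run`: how that function normalizes ranks or scores is not shown in this notebook, so the values are kept as they are, and the helper name `to_trec_lines` is hypothetical.

```python
def to_trec_lines(rows, system_name):
    """Format (qid, docno, rank, score) tuples as TREC run lines."""
    return [
        f"{qid} Q0 {docno} {rank} {score} {system_name}"
        for qid, docno, rank, score in rows
    ]


# Rows copied from the run shown earlier in this notebook.
sample_rows = [
    ("q062228", "doc062214607455", 0, 9.122952),
    ("q062228", "doc062214701942", 1, 9.060229),
]

trec_lines = to_trec_lines(sample_rows, system_name="BM25")
print("\n".join(trec_lines))
```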
