# BM25 Retrieval with PyTerrier

### Step 1: Import everything and load variables

In [8]:
import pyterrier as pt
import pandas as pd

# We use three helpers from tira.third_party_integrations that mitigate common pitfalls:
# - ensure_pyterrier_is_loaded:
#   Loads PyTerrier without an internet connection
#   (in TIRA, retrieval approaches have no internet access, which improves reproducibility).
#
# - get_input_directory_and_output_directory:
#   Your software is expected to read its data from an input directory and to write its results (i.e., the run file) to an output directory.
#   Both directories are passed as arguments when the software is executed within TIRA.
#   This helper lets you run the same notebook during development and in TIRA:
#   it returns the default input directory passed to it (which might be mounted) if the software is not executed in TIRA.
#
# - persist_and_normalize_run:
#   Writing run files comes with some non-obvious edge cases (e.g., score ties).
#   This helper takes care of several frequent ones.
#
# You do not have to use any of these helpers; in the end, the task is only to generate an output from an input.
# Pull requests that improve the handling of frequently used patterns are welcome.
# Documentation: https://github.com/tira-io/tira/blob/main/python-client/tira/third_party_integrations.py
#
from tira.third_party_integrations import ensure_pyterrier_is_loaded, get_input_directory_and_output_directory, persist_and_normalize_run
import json
from tqdm import tqdm

ensure_pyterrier_is_loaded()
input_directory, output_directory = get_input_directory_and_output_directory('./iranthology-dataset-tira')


I will use a small hardcoded example located in ./iranthology-dataset-tira.
The output directory is /tmp/
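
The fallback behaviour described in the comments of the import cell can be sketched as follows. This is a minimal stand-in, not the TIRA implementation: the function name `resolve_directories` is hypothetical, and the environment variable names `TIRA_INPUT_DIRECTORY` and `TIRA_OUTPUT_DIRECTORY` are assumptions made for illustration.

```python
import os

def resolve_directories(default_input, default_output="/tmp/", env=None):
    """Hypothetical helper: prefer directories supplied by the execution
    environment, otherwise fall back to local development defaults."""
    env = os.environ if env is None else env
    # Assumed variable names; check the TIRA client source for the real ones.
    input_directory = env.get("TIRA_INPUT_DIRECTORY") or default_input
    output_directory = env.get("TIRA_OUTPUT_DIRECTORY") or default_output
    return input_directory, output_directory

# Local development: nothing is set, so the defaults are returned.
print(resolve_directories("./iranthology-dataset-tira", env={}))

# Simulated platform run: the supplied directories take precedence.
print(resolve_directories(
    "./iranthology-dataset-tira",
    env={"TIRA_INPUT_DIRECTORY": "/in", "TIRA_OUTPUT_DIRECTORY": "/out"},
))
```

Passing `env` explicitly keeps the sketch easy to test without touching the real process environment.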


In [9]:
print('The input directory contains the following files:\n')
!ls -lh {input_directory}

The input directory contains the following files:

total 67M
-rw-r--r-- 1 root root  67M May 30 08:12 anthology_documents.jsonl
-rwxrwxrwx 1 root root 2.3K May  2 08:19 anthology_topics.xml
-rw-r--r-- 1 root root 8.8K May 30 07:51 qrels.txt


### Step 2: Load the Data

In [10]:
print('Step 2: Load the data.')

queries = pt.io.read_topics(input_directory + '/anthology_topics.xml', format='trecxml')

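# Each line of the JSONL file holds one document; the context manager closes the file afterwards.
with open(input_directory + '/anthology_documents.jsonl', 'r') as f:
    documents = [json.loads(line) for line in f]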


Step 2: Load the data.
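
For larger collections, reading the JSON Lines file lazily avoids holding every document in memory at once. The sketch below is one possible way to do this; the helper name `read_jsonl` and the sample records are made up for illustration and are not part of the dataset or of PyTerrier.

```python
import json
import os
import tempfile

def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:
                yield json.loads(line)

# Tiny stand-in file with made-up records (not taken from the dataset).
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample_documents.jsonl")
    with open(sample_path, "w", encoding="utf-8") as out:
        out.write('{"docno": "d1", "text": "index exhaustivity"}\n')
        out.write("\n")  # blank lines are skipped
        out.write('{"docno": "d2", "text": "query expansion"}\n')
    sample_documents = list(read_jsonl(sample_path))

print([d["docno"] for d in sample_documents])
```

A generator like this can be handed to an indexer that accepts an iterator of dictionaries, although a progress bar then has no total to report.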


In [11]:
print('We look at the first document:\n')
print(documents[0])

We look at the first document:

{'docno': '2019.sigirconf_workshop-2019birndl.0', 'text': "'series': 'CEUR Workshop Proceedings', 'volume': '2414', 'publisher': 'CEUR-WS.org', 'year': '2019', 'url': 'http://ceur-ws.org/Vol-2414', 'urn': 'urn:nbn:de:0074-2414-3', 'biburl': 'https://dblp.org/rec/conf/sigir/2019birndl.bib', 'bibsource': 'dblp computer science bibliography, https://dblp.org', 'bibkey': 'DBLP:conf/sigir/2019birndl', 'bibtype': 'proceedings', 'booktitle': 'Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2019) co-located with the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), Paris, France, July 25, 2019', 'authors': [], 'editors': ['Muthu Kumar Chandrasekaran', 'Philipp Mayr'], 'venue': 'SIGIR', 'date': 1581522299.0, 'abstract': '', 'title': 'Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Re

In [12]:
print('We look at the first query:\n')
print(queries.iloc[0].to_dict())

We look at the first query:

{'qid': '1', 'query': ' relevant documents include the words index and or indexing and exhaustivity in combination documents only containing indexing and or index or only containing exhaustivity are not relevant'}


### Step 3: Create the Index

In [13]:
print('Step 3: Create the Index.')

!rm -Rf ./index
iter_indexer = pt.IterDictIndexer("./index", meta={'docno' : 100})
index_ref = iter_indexer.index(tqdm(documents))

Step 3: Create the Index.



  0%|          | 0/53673 [00:00<?, ?it/s]
  ...
 99%|█████████▉| 53041/53673 [03:13<00:00, 759.22it/s]

### Step 4: Create Retrieval Pipeline

In [14]:
bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25", verbose=True)
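
To give an intuition for what the `BM25` weighting model computes, here is a minimal pure-Python sketch of a textbook BM25 variant. It is not Terrier's implementation: the whitespace tokenization, the IDF form, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration, so the scores will not match those in the run below.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` (docno -> text) against `query`
    with a textbook BM25 formula and plain whitespace tokenization."""
    tokenized = {docno: text.lower().split() for docno, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))

    scores = {}
    for docno, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            # The +1 inside the log keeps the IDF non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores[docno] = score
    return scores

# Made-up toy collection, not taken from the IR Anthology.
toy_docs = {
    "d1": "indexing exhaustivity and index term specificity",
    "d2": "neural ranking models for web search",
    "d3": "index compression for web search engines",
}
toy_scores = bm25_scores("index exhaustivity", toy_docs)
ranking = sorted(toy_scores, key=toy_scores.get, reverse=True)
print(ranking)
```

In this toy collection, `d1` matches both query terms, `d3` matches only `index`, and `d2` matches neither, so it receives a score of zero.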

### Step 5: Create the Run

In [15]:
print('Step 5: Create Run.')

run = bm25(queries)

Step 5: Create Run.



BR(BM25): 100%|██████████| 6/6 [00:11<00:00,  1.93s/q]


In [16]:
print('We look at the first 10 results of the run:\n')
run.head(10)

We look at the first 10 results of the run:



|   | qid | docid | docno | rank | score | query |
|---|-----|-------|-------|------|-------|-------|
| 0 | 1 | 15484 | 2017.wsdm_conference-2017.24 | 0 | 16.608272 | relevant documents include the words index an... |
| 1 | 1 | 51891 | 2008.ipm_journal-ir0anthology0volumeA44A4.8 | 1 | 16.378209 | relevant documents include the words index an... |
| 2 | 1 | 37385 | 2006.trec_conference-2006.9 | 2 | 15.981597 | relevant documents include the words index an... |
| 3 | 1 | 2193 | 2009.clef_workshop-2009w.90 | 3 | 15.922364 | relevant documents include the words index an... |
| 4 | 1 | 37794 | 2001.trec_conference-2001.65 | 4 | 15.701605 | relevant documents include the words index an... |
| 5 | 1 | 52854 | 2015.tois_journal-ir0anthology0volumeA33A4.1 | 5 | 15.258457 | relevant documents include the words index an... |
| 6 | 1 | 15120 | 2013.wsdm_conference-2013.9 | 6 | 15.105135 | relevant documents include the words index an... |
| 7 | 1 | 22826 | 2010.cikm_conference-2010.49 | 7 | 15.062287 | relevant documents include the words index an... |
| 8 | 1 | 6275 | 2011.sigirconf_conference-2011.58 | 8 | 14.880142 | relevant documents include the words index an... |
| 9 | 1 | 52856 | 2015.tois_journal-ir0anthology0volumeA33A4.3 | 9 | 14.837122 | relevant documents include the words index an... |


### Step 6: Persist Run

In [20]:
print('Step 6: Persist Run.')

persist_and_normalize_run(run, output_file=output_directory, system_name='BM25', depth=1000)

print('Done :)')

Step 6: Persist Run.
Done :)
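
The import cell mentions score ties as one edge case when writing run files. The sketch below shows one way to produce deterministic TREC-style run lines (`qid Q0 docno rank score system`) from scored results. It is not the implementation of `persist_and_normalize_run`: the function name `format_trec_run`, the tie-break by `docno`, and the 1-based ranks are assumptions for illustration.

```python
from collections import defaultdict

def format_trec_run(rows, system_name, depth=1000):
    """Turn (qid, docno, score) rows into TREC run lines:
    qid Q0 docno rank score system_name."""
    by_query = defaultdict(list)
    for qid, docno, score in rows:
        by_query[qid].append((docno, score))

    lines = []
    for qid in sorted(by_query):
        # Sort by descending score; ties are broken by docno so the order is reproducible.
        ranked = sorted(by_query[qid], key=lambda pair: (-pair[1], pair[0]))
        # Keep at most `depth` results per query.
        for rank, (docno, score) in enumerate(ranked[:depth], start=1):
            lines.append(f"{qid} Q0 {docno} {rank} {score} {system_name}")
    return lines

# Made-up scores with a tie between d1 and d2 for query 1.
toy_rows = [("1", "d2", 3.5), ("1", "d1", 3.5), ("1", "d3", 7.0), ("2", "d1", 1.0)]
run_lines = format_trec_run(toy_rows, system_name="BM25", depth=2)
print("\n".join(run_lines))
```

With `depth=2`, the tied document `d2` is cut off for query 1 because `d1` sorts first; without a fixed tie-break, which of the two survives could change between executions.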
