<a href="https://colab.research.google.com/github/lorenzomadiai/information_retrieval/blob/main/bm25_pyserini_msmarco_passage_demo_(Assignment).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pyserini Demo on the MS MARCO Passage Dataset

This notebook replicates the BM25 baseline for the [MS MARCO passage ranking task](http://www.msmarco.org/) with [Pyserini](http://pyserini.io/), the Python interface to [Anserini](http://anserini.io).
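
For intuition about what the BM25 baseline computes, here is a minimal pure-Python sketch of one common BM25 formulation over whitespace-tokenized toy documents. It is not Pyserini's or Lucene's implementation, which differs in details such as analysis and length encoding. The function name `bm25_scores`, the toy documents, and the values `k1=0.9` and `b=0.4` are illustrative assumptions.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Score each tokenized document against a tokenized query (illustrative BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed idf that never goes negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy documents (assumptions for illustration, not MS MARCO passages).
docs = [
    "why do bears hibernate in winter".split(),
    "bears eat fish".split(),
    "the stock market fell today".split(),
]
scores = bm25_scores("why do bears hibernate".split(), docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

The document that matches every query term ranks first, the one that matches only "bears" ranks second, and the unrelated one scores zero.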


## Installation


Install Python dependencies:


In [None]:
!pip install pyserini==0.12.0
!pip install pytrec_eval
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyserini==0.12.0
  Downloading pyserini-0.12.0-py3-none-any.whl (67.5 MB)
Collecting transformers>=4.0.0
  Downloading transformers-4.27.1-py3-none-any.whl (6.7 MB)
Collecting pyjnius>=1.2.1
  Downloading pyjnius-1.4.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.5 MB)
Collecting sentencepiece>=0.1.95
  Downloading sentencepiece-0.1.97-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

## Usage

You can use the `search` function to search over an index. The topics (i.e., queries) are already distributed with Pyserini:

In [None]:
from pyserini.search import get_topics

topics = get_topics('msmarco-passage-dev-subset')
print(f'{len(topics)} queries total')

6980 queries total


Let's take a look at a specific question. Topics often have different "fields"; "title" is the one we want. (This is just TREC parlance.)

In [None]:
topics[1102400]['title']

'why do bears hibernate'

Next, we can initialize a searcher with a pre-built index, which Pyserini will automatically download:

In [None]:
from pyserini.search import SimpleSearcher

searcher = SimpleSearcher.from_prebuilt_index('msmarco-passage')

Attempting to initialize pre-built index msmarco-passage.
Downloading index at https://git.uwaterloo.ca/jimmylin/anserini-indexes/raw/master/index-msmarco-passage-20201117-f87c94.tar.gz...


index-msmarco-passage-20201117-f87c94.tar.gz: 2.07GB [03:03, 12.1MB/s]                            


Extracting /root/.cache/pyserini/indexes/index-msmarco-passage-20201117-f87c94.tar.gz into /root/.cache/pyserini/indexes/index-msmarco-passage-20201117-f87c94.1efad4f1ae6a77e235042eff4be1612d...
Initializing msmarco-passage...


Now we can search:

In [None]:
import json

hits = searcher.search(topics[1102400]['title'])

# Prints the first 10 hits
for i in range(0, 10):
    jsondoc = json.loads(hits[i].raw)
    print(f'{i+1:2} {hits[i].score:.5f} {jsondoc["contents"][:80]}...')

 1 17.33580 Why do Bears hibernate? March 31, 2010, Joan, Leave a comment. Why do bears hibe...
 2 13.23090 Why do bears hibernate? Watch this to discover how much effort is spent on survi...
 3 13.13570 Technically, as the other anwerer said, bears do not hibernate, but there isn't ...
 4 13.01460 It is a common misconception that bears hibernate during the winter. While bears...
 5 13.00390 To prepare for hibernation, grizzlies must prepare a den, and consume an immense...
 6 12.68940 Some zoo bears are fed year round, and do not hibernate. Since they do not under...
 7 12.55450 Bears in zoos will not hibernate if food is available, though they will slow dow...
 8 12.51710 All kinds of bears technically don't hibernate. They enter into a phase called t...
 9 12.43500 Date: 12-11-2012. It is a common misconception that bears hibernate during the w...
10 12.37460 While bears tend to slow down during the winter, they are not true hibernators. ...


Each entry in `hits` holds the `docid`, the retrieval score, and the raw document content:

In [None]:
from IPython.display import display, HTML
display(HTML('<div style="font-family: Times New Roman; padding-bottom:10px">' + hits[0].raw + '</div>'))
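
The raw content is a JSON string, so it can be parsed and escaped before being placed inside HTML. The sketch below uses a hypothetical stand-in for `hits[0].raw`. The `"contents"` field matches what the search loop above reads, while the `"id"` field and the sample text are assumptions.

```python
import html
import json

# Hypothetical stand-in for hits[0].raw; the real string comes from the index.
raw = json.dumps({"id": "0", "contents": "Bears <b>sleep</b> & rest in winter."})

doc = json.loads(raw)
# Escape markup characters so passage text cannot alter the surrounding HTML.
safe_text = html.escape(doc["contents"])
snippet = (
    '<div style="font-family: Times New Roman; padding-bottom:10px">'
    + safe_text
    + '</div>'
)
print(snippet)
```

In the notebook, `snippet` could then be passed to `display(HTML(...))` in place of the unescaped raw string.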

Let's run all the queries from the dev set:

In [None]:
from pyserini.search import SimpleSearcher

def run_all_queries(file, topics, searcher):
    # Write up to 1000 hits per query in TREC run format:
    # <query id> Q0 <docid> <rank> <score> <run tag>
    with open(file, 'w') as runfile:
        cnt = 0
        print('Running {} queries in total'.format(len(topics)))
        for qid in topics:
            query = topics[qid]['title']
            hits = searcher.search(query, 1000)
            for i in range(0, len(hits)):
                runfile.write('{} Q0 {} {} {:.6f} Anserini\n'.format(qid, hits[i].docid, i+1, hits[i].score))
            cnt += 1
            if cnt % 100 == 0:
                print(f'{cnt} queries completed')

searcher = SimpleSearcher.from_prebuilt_index('msmarco-passage')

run_all_queries('run-msmarco-passage-bm25.txt', topics, searcher)


Attempting to initialize pre-built index msmarco-passage.
/root/.cache/pyserini/indexes/index-msmarco-passage-20201117-f87c94.1efad4f1ae6a77e235042eff4be1612d already exists, skipping download.
Initializing msmarco-passage...
Running 6980 queries in total
100 queries completed
200 queries completed
300 queries completed
400 queries completed
500 queries completed
600 queries completed
700 queries completed
800 queries completed
900 queries completed
1000 queries completed
1100 queries completed
1200 queries completed
1300 queries completed
1400 queries completed
1500 queries completed
1600 queries completed
1700 queries completed
1800 queries completed
1900 queries completed
2000 queries completed
2100 queries completed
2200 queries completed
2300 queries completed
2400 queries completed
2500 queries completed
2600 queries completed
2700 queries completed
2800 queries completed
2900 queries completed
3000 queries completed
3100 queries completed
3200 queries completed
3300 queries comp
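
Each line of the run file written above has six whitespace-separated fields: query id, the literal `Q0`, docid, rank, score, and a run tag. The following sketch formats and parses such lines without Pyserini. The helper names and the docids `d1` and `d2` are hypothetical, and the nested dictionary is only meant to resemble the shape that `pytrec_eval.parse_run` is expected to produce.

```python
from collections import defaultdict

def format_run_line(qid, docid, rank, score, tag="Anserini"):
    """Build one TREC-style run line: qid Q0 docid rank score tag."""
    return f"{qid} Q0 {docid} {rank} {score:.6f} {tag}"

def parse_run_lines(lines):
    """Parse run lines into {qid: {docid: score}}."""
    run = defaultdict(dict)
    for line in lines:
        qid, _q0, docid, _rank, score, _tag = line.split()
        run[qid][docid] = float(score)
    return dict(run)

# Hypothetical docids; the scores are taken from the sample search output.
lines = [
    format_run_line(1102400, "d1", 1, 17.3358),
    format_run_line(1102400, "d2", 2, 13.2309),
]
run = parse_run_lines(lines)
print(lines[0])
print(run)
```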

Let's evaluate the run with `pytrec_eval`. The expected MAP, Recall@1000, and NDCG@10 should be: X, Y, Z.

In [None]:
!wget https://raw.githubusercontent.com/castorini/anserini-tools/28a938134b652a9153172edc0d82b7b765b66216/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt

--2023-03-19 11:56:32--  https://raw.githubusercontent.com/castorini/anserini-tools/28a938134b652a9153172edc0d82b7b765b66216/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 143300 (140K) [text/plain]
Saving to: ‘qrels.msmarco-passage.dev-subset.txt’


2023-03-19 11:56:32 (4.95 MB/s) - ‘qrels.msmarco-passage.dev-subset.txt’ saved [143300/143300]



In [None]:
import pytrec_eval
import numpy as np
with open('qrels.msmarco-passage.dev-subset.txt', 'r') as f_qrel:
    qrel = pytrec_eval.parse_qrel(f_qrel)

with open('run-msmarco-passage-bm25.txt', 'r') as f_run:
    first_run = pytrec_eval.parse_run(f_run)

measures = {'map', 'ndcg_cut.10', 'recall.1000'}
evaluator = pytrec_eval.RelevanceEvaluator(qrel, measures)
results = evaluator.evaluate(first_run)
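# Result keys use '_' where the measure names use '.', e.g. 'ndcg_cut_10'.
# Sorting the measures keeps the printing order stable across runs.
for measure in sorted(measures):
    mean_measure = np.mean([ele[measure.replace(".","_")] for ele in results.values()])
    print(measure, mean_measure)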
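
To make the three measures concrete, the sketch below computes average precision, recall@k, and nDCG@k by hand for a single toy ranking with binary relevance. It is a simplified illustration under that assumption, not a replacement for `pytrec_eval`. The function names and the toy docids are hypothetical.

```python
import math

def average_precision(ranked, relevant):
    """Mean of precision values at the ranks where relevant documents appear."""
    found, total = 0, 0.0
    for rank, docid in enumerate(ranked, start=1):
        if docid in relevant:
            found += 1
            total += found / rank
    return total / len(relevant) if relevant else 0.0

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents retrieved in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-gain nDCG with a log2(rank + 1) discount."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, docid in enumerate(ranked[:k], start=1)
        if docid in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal else 0.0

# Toy ranking: relevant documents appear at ranks 2 and 4.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

ap = average_precision(ranked, relevant)
rec3 = recall_at_k(ranked, relevant, 3)
ndcg10 = ndcg_at_k(ranked, relevant, 10)
print(ap, rec3, round(ndcg10, 4))
```

Averaging such per-query values over all queries gives the means that the evaluation cell prints.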

# Loading queries and document collection

We provide a loader function, used below for both the queries and the document collection, so that you can look up the text of a query or document by its id. This could help answer exercise two.

## Downloading files

In [None]:
!wget https://raw.githubusercontent.com/castorini/anserini-tools/28a938134b652a9153172edc0d82b7b765b66216/topics-and-qrels/topics.msmarco-passage.dev-subset.txt
!wget https://msmarco.z22.web.core.windows.net/msmarcoranking/collection.tar.gz

--2024-03-14 07:06:36--  https://raw.githubusercontent.com/castorini/anserini-tools/28a938134b652a9153172edc0d82b7b765b66216/topics-and-qrels/topics.msmarco-passage.dev-subset.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 290193 (283K) [text/plain]
Saving to: ‘topics.msmarco-passage.dev-subset.txt’


2024-03-14 07:06:36 (8.13 MB/s) - ‘topics.msmarco-passage.dev-subset.txt’ saved [290193/290193]

--2024-03-14 07:06:36--  https://msmarco.z22.web.core.windows.net/msmarcoranking/collection.tar.gz
Resolving msmarco.z22.web.core.windows.net (msmarco.z22.web.core.windows.net)... 20.150.34.1
Connecting to msmarco.z22.web.core.windows.net (msmarco.z22.web.core.windows.net)|20.150.34.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 103

In [None]:
!tar -xvzf collection.tar.gz

collection.tsv


In [None]:
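import tqdm

def read_collection(f_path):
  # Map each id (first tab-separated field) to its text (rest of the line)
  collection = {}
  with open(f_path, "r", encoding="utf-8") as fp:
    for line in tqdm.tqdm(fp, desc="reading {}".format(f_path)):
      # maxsplit=1 keeps any tab characters that occur inside the text
      did, dtext = line.strip().split("\t", 1)
      collection[did] = dtext
  return collection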

In [None]:
queries_dict = read_collection("topics.msmarco-passage.dev-subset.txt")

reading topics.msmarco-passage.dev-subset.txt: 6980it [00:00, 879774.08it/s]


In [None]:
documents_dict = read_collection("collection.tsv")

reading collection.tsv: 8841823it [00:20, 429018.07it/s]
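
Holding all 8.8 million passages in a dictionary takes a lot of memory. If only a few documents are needed, one option is to keep just the wanted ids while scanning the file. This is a minimal sketch, assuming the same `id<TAB>text` layout. The helper `read_subset` and the toy file are hypothetical.

```python
import os
import tempfile

def read_subset(f_path, wanted_ids):
    """Return {id: text} only for the requested ids, stopping once all are found."""
    wanted = set(wanted_ids)
    subset = {}
    with open(f_path, "r", encoding="utf-8") as fp:
        for line in fp:
            did, dtext = line.rstrip("\n").split("\t", 1)
            if did in wanted:
                subset[did] = dtext
                if len(subset) == len(wanted):
                    break
    return subset

# Tiny stand-in file so the sketch runs without downloading collection.tsv.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_collection.tsv")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write("0\tfirst passage\n1\tsecond passage\n2\tthird passage\n")
    subset = read_subset(path, ["0", "2"])

print(subset)
```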


#### Demo: showing the content of a document and a query

In [None]:
#get id of first document in the dict.
print("id of first document in the documents_dict: ", list(documents_dict.keys())[0])
#get id of first query in the dict.
print("id of first query in the queries_dict: ", list(queries_dict.keys())[0])

print("\n----- here we print their content ----\n")

print("content of document id 0: ", documents_dict['0'])
print("content of query id 1048585: ", queries_dict['1048585'])

id of first document in the documents_dict:  0
id of first query in the queries_dict:  1048585

----- here we print their content ----

content of document id 0:  The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.
content of query id 1048585:  what is paula deen's brother


## Clues on answering questions manually


In [None]:
!wget https://www.dropbox.com/s/7wcskq2o9dr8qiv/bert-run.text

--2023-03-19 12:00:42--  https://www.dropbox.com/s/7wcskq2o9dr8qiv/bert-run.text
Resolving www.dropbox.com (www.dropbox.com)... 162.125.80.18, 2620:100:6035:18::a27d:5512
Connecting to www.dropbox.com (www.dropbox.com)|162.125.80.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/raw/7wcskq2o9dr8qiv/bert-run.text [following]
--2023-03-19 12:00:43--  https://www.dropbox.com/s/raw/7wcskq2o9dr8qiv/bert-run.text
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://ucba85837bbacfdd9d0d758fa70b.dl.dropboxusercontent.com/cd/0/inline/B4gGjLB8HXjKFZAUFstr3ao51ZAJZGaWCa_1-EMCOeAymwTT7BIrwTDQTwfsxmFjTiav04qiQdTAH3JjVsSnuBiWiYkMmbDI2Pht1Y0BNQEETCqOT5VeKQ0XoYP5OA0rsGlXIhxsPG0UcGTjiiIcpQBXMPOS_qlSNeIMxm9boGjXsA/file# [following]
--2023-03-19 12:00:43--  https://ucba85837bbacfdd9d0d758fa70b.dl.dropboxusercontent.com/cd/0/inline/B4gGjLB8HXjKFZAUFstr3ao51ZAJZGaWCa_1-EMCOeAymwTT7BIrwTDQTwfsxmFjTiav04qiQd

In [None]:
import pytrec_eval
#load qrel, bm25 ranking run file, and bert ranking run file
with open('qrels.msmarco-passage.dev-subset.txt', 'r') as f_qrel:
    qrel = pytrec_eval.parse_qrel(f_qrel)

with open('run-msmarco-passage-bm25.txt', 'r') as f_run:
    bm25_run = pytrec_eval.parse_run(f_run)

with open('bert-run.text', 'r') as f_run:
    bert_run = pytrec_eval.parse_run(f_run)

# set measures and initialize evaluator
measures = {'recall.500', 'recall.10'}
evaluator = pytrec_eval.RelevanceEvaluator(qrel, measures)

# evaluate bm25 per query
bm25_results = evaluator.evaluate(bm25_run)

# evaluate bert per query
bert_results = evaluator.evaluate(bert_run)

In [None]:
bm25_recall500_zeros = {}
for query, measures_results in bm25_results.items():
  recall_500 = float(measures_results['recall_500'])
  if recall_500 == 0.0:
    bm25_recall500_zeros[query] = 0

#print last 10 queries with recall@500=0
print(list(bm25_recall500_zeros.keys())[-10:])

['1101552', '199572', '857943', '1083278', '320792', '717751', '329114', '1029791', '1083268', '1083267']


In [None]:
#print BERT effectiveness for the first query for which BM25 recall@500=0 and BERT recall@10=1
for query, measures_results in bert_results.items():
  recall_10 = float(measures_results['recall_10'])
  if recall_10 == 1.0 and query in bm25_recall500_zeros.keys():
    print("bert effectiveness for {} query in terms of recall@10 is:".format(query), bert_results[query]['recall_10'])
    break

bert effectiveness for 999921 query in terms of recall@10 is: 1.0
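
When inspecting such a query manually, it can help to see where each run places the relevant documents. The sketch below works on toy dictionaries shaped like the parsed qrel and run files, `{qid: {docid: value}}`. The helper `ranks_of_relevant` and all toy ids and scores are hypothetical.

```python
def ranks_of_relevant(run, qrel, qid):
    """Return {docid: rank or None} for each relevant document of a query."""
    # Sort the retrieved documents by descending score to recover their ranks.
    ranked = sorted(run.get(qid, {}).items(), key=lambda kv: kv[1], reverse=True)
    position = {docid: rank for rank, (docid, _score) in enumerate(ranked, start=1)}
    return {
        docid: position.get(docid)  # None means the document was not retrieved
        for docid, rel in qrel.get(qid, {}).items()
        if rel > 0
    }

# Toy data standing in for the parsed qrel, BM25 run, and BERT run.
toy_qrel = {"q1": {"d2": 1}}
toy_bm25 = {"q1": {"d1": 9.0, "d3": 8.5}}
toy_bert = {"q1": {"d2": 0.97, "d1": 0.40}}

bm25_ranks = ranks_of_relevant(toy_bm25, toy_qrel, "q1")
bert_ranks = ranks_of_relevant(toy_bert, toy_qrel, "q1")
print(bm25_ranks, bert_ranks)
```

With the real data, the returned docids can be looked up in `documents_dict` and the query id in `queries_dict` to read the texts side by side.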
