<a href="https://colab.research.google.com/github/mallpriyanshu/pyserini-robust/blob/main/pyserini_robust04_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pyserini on Robust04

This notebook provides a brief overview of how to use [Pyserini](http://pyserini.io/), the Python interface to [Anserini](http://anserini.io), to search the collection from the TREC 2004 Robust Track.


## Installation


Install the Python dependencies and set `JAVA_HOME` to the Java 11 JDK:

In [1]:
%%capture
!pip install pyserini==0.12.0

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"

## SimpleSearcher Usage

You can use the `SimpleSearcher` to search over an index. We can initialize the searcher with a pre-built index, which Pyserini will automatically download:

In [None]:
from pyserini.search import SimpleSearcher

searcher = SimpleSearcher.from_prebuilt_index('robust04')

Attempting to initialize pre-built index robust04.
Downloading index at https://git.uwaterloo.ca/jimmylin/anserini-indexes/raw/master/index-robust04-20191213.tar.gz...


index-robust04-20191213.tar.gz: 1.70GB [01:58, 15.3MB/s]                            


Extracting /root/.cache/pyserini/indexes/index-robust04-20191213.tar.gz into /root/.cache/pyserini/indexes/index-robust04-20191213.15f3d001489c97849a010b0a4734d018...
Initializing robust04...


Now we can search:

In [None]:
hits = searcher.search('black bear attacks')

# Prints the first 10 hits
for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:15} {hits[i].score:.5f}')

 1 LA092790-0015   7.06680
 2 LA081689-0039   6.89020
 3 FBIS4-16530     6.61630
 4 LA102589-0076   6.46450
 5 FT932-15491     6.25090
 6 FBIS3-12276     6.24630
 7 LA091090-0085   6.17030
 8 FT922-13519     6.04270
 9 LA052790-0205   5.94060
10 LA103089-0041   5.90650
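
The loop above assumes that at least 10 hits come back. As a minimal sketch, the same formatting can be applied to however many hits are returned by slicing the list first. The `Hit` tuple and `format_hits` helper below are hypothetical stand-ins for illustration, not part of Pyserini; they only model the `docid` and `score` attributes used above.

```python
from collections import namedtuple

# Hypothetical stand-in for a search hit; only docid and score are modeled.
Hit = namedtuple('Hit', ['docid', 'score'])

def format_hits(hits, k=10):
    """Return one formatted line per hit, for at most k hits."""
    return [f'{rank:2} {hit.docid:15} {hit.score:.5f}'
            for rank, hit in enumerate(hits[:k], start=1)]

# Sample data copied from the first three results shown above
sample = [
    Hit('LA092790-0015', 7.0668),
    Hit('LA081689-0039', 6.8902),
    Hit('FBIS4-16530', 6.6163),
]

# Only three lines are printed, even though k defaults to 10
for line in format_hits(sample):
    print(line)
```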


Each entry in `hits` holds the `docid`, the retrieval score, and the raw document content:

In [None]:
from IPython.display import display, HTML
display(HTML('<div style="font-family: Times New Roman; padding-bottom:10px">' + hits[0].raw + '</div>'))
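
The raw content is rendered as HTML above because it contains markup. If plain text is preferred, a rough sketch is to strip anything that looks like a tag. The `strip_markup` helper and the sample string are hypothetical; the sample only imitates the general shape of a marked-up document and is not taken from the collection.

```python
import re

def strip_markup(raw):
    """Remove anything that looks like an SGML/HTML tag and collapse whitespace."""
    text = re.sub(r'<[^>]+>', ' ', raw)
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical snippet; real documents are much longer
sample_raw = (
    '<DOC>\n<DOCNO> LA092790-0015 </DOCNO>\n'
    '<TEXT>\n<P>\nSome text.\n</P>\n</TEXT>\n</DOC>'
)
print(strip_markup(sample_raw))
```

This is a regex-based approximation and will also remove any literal text that happens to sit between `<` and `>`.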

You can also configure search options, such as the BM25 parameters and RM3 relevance feedback:

In [None]:
searcher.set_bm25(0.8, 0.4)      # k1=0.8, b=0.4
searcher.set_rm3(20, 10, 0.5)    # 20 feedback terms, 10 feedback documents, original query weight 0.5

hits2 = searcher.search('black bear attacks', 1000)

# Prints the first 10 hits
for i in range(0, 10):
    print(f'{i+1:2} {hits2[i].docid:15} {hits2[i].score:.5f}')

 1 LA081689-0039   1.86260
 2 LA092790-0015   1.78510
 3 FBIS4-16530     1.75220
 4 LA091090-0085   1.73180
 5 FT922-13519     1.69180
 6 FT932-15491     1.64840
 7 LA102589-0076   1.64730
 8 LA052790-0205   1.63230
 9 LA103089-0041   1.56050
10 FR940902-1-00057 1.49490


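Note that the results have changed: the top two documents swap places, FR940902-1-00057 enters the top 10 in place of FBIS3-12276, and the scores are on a different scale.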
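
To make the difference between the two rankings concrete, here is a small sketch that compares the two top-10 lists printed above. The docids are copied from those outputs; the `compare_rankings` helper is hypothetical and written only for this illustration.

```python
# Top-10 docids from the first search (default settings)
first_run = [
    'LA092790-0015', 'LA081689-0039', 'FBIS4-16530', 'LA102589-0076',
    'FT932-15491', 'FBIS3-12276', 'LA091090-0085', 'FT922-13519',
    'LA052790-0205', 'LA103089-0041',
]

# Top-10 docids from the second search (after set_bm25 and set_rm3)
second_run = [
    'LA081689-0039', 'LA092790-0015', 'FBIS4-16530', 'LA091090-0085',
    'FT922-13519', 'FT932-15491', 'LA102589-0076', 'LA052790-0205',
    'LA103089-0041', 'FR940902-1-00057',
]

def compare_rankings(before, after):
    """Return docids that entered, docids that dropped out, and rank shifts."""
    entered = [d for d in after if d not in before]
    dropped = [d for d in before if d not in after]
    # A positive shift means the document moved up in the second ranking
    shifts = {d: before.index(d) - after.index(d) for d in after if d in before}
    return entered, dropped, shifts

entered, dropped, shifts = compare_rankings(first_run, second_run)
print('entered:', entered)
print('dropped:', dropped)
print('moved up:', {d: s for d, s in shifts.items() if s > 0})
```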

## IndexReader Usage

The `IndexReader` class provides various methods to read the index directly. For example, we can fetch a raw document from the index given its `docid`:

In [None]:
from pyserini.index import IndexReader

reader = IndexReader.from_prebuilt_index('robust04')

doc = reader.doc('LA092790-0015').raw()
display(HTML('<div style="font-family: Times New Roman; padding-bottom:10px">' + doc + '</div>'))

Attempting to initialize pre-built index robust04.
/root/.cache/pyserini/indexes/index-robust04-20191213.15f3d001489c97849a010b0a4734d018 already exists, skipping download.
Initializing robust04...


Note that the result is exactly the same as displaying the hit contents above. Given the raw text, we can obtain its analyzed form (i.e., tokenized, stemmed, stopwords removed, etc.). Here we show the first ten tokens:

In [None]:
analyzed = reader.analyze(doc)
analyzed[0:10]

['date',
 'p',
 'septemb',
 '27',
 '1990',
 'thursdai',
 'ventura',
 'counti',
 'edit',
 'p']
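
As a quick illustration of how analyzed tokens relate to term frequencies, the sketch below counts the ten tokens shown above with `collections.Counter` and keeps those that occur more than once. It only uses the tokens printed in this notebook, not the full document.

```python
from collections import Counter

# The first ten analyzed tokens, copied from the output above
tokens = ['date', 'p', 'septemb', '27', '1990',
          'thursdai', 'ventura', 'counti', 'edit', 'p']

term_freqs = Counter(tokens)

# Keep only the terms that appear more than once
repeated = {term: tf for term, tf in term_freqs.items() if tf > 1}
print(repeated)  # {'p': 2}
```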

The index also stores the raw document vector, which we can obtain as a Python dictionary of analyzed terms to counts (i.e., term frequency).
For brevity, we only look at terms that appear more than once:

In [None]:
doc_vector = reader.get_document_vector('LA092790-0015')
# Keep only the terms whose frequency is greater than one
{ k: v for (k, v) in doc_vector.items() if v > 1 }

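To run a batch of queries, we first need the topics. The following helper parses a TREC-style topics file into parallel lists of topic ids and titles:

In [None]: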
import re

def processQuery(filePath):
    """Parse a TREC-style topics file and return (ids, queries).

    Expects each topic inside <top> ... </top>, with <num>...</num> and
    <title>...</title> each on a single line.
    """
    ids = []
    queries = []
    inDoc = False
    with open(filePath, 'r', encoding='ISO-8859-1') as f:
        for line in f:
            stripped = line.strip()
            if not inDoc:
                # Skip everything outside a <top> ... </top> block
                inDoc = (stripped == "<top>")
            elif stripped == "</top>":
                inDoc = False
            elif line.startswith("<num>"):
                m = re.search('<num>(.+?)</num>', line)
                ids.append(m.group(1))
            elif line.startswith("<title>"):
                m = re.search('<title>(.+?)</title>', line)
                # Slashes in the title are replaced with spaces
                queries.append(m.group(1).replace("/", " "))
    return ids, queries
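
For reference, here is a self-contained sketch of the input layout the parser above appears to expect, together with a compact regex-based variant that works on a string instead of a file. The sample topics, the ids `901` and `902`, and the `parse_topics` name are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical sample: one <top> block per topic, <num> and <title> on single lines
SAMPLE_TOPICS = """<top>
<num>901</num>
<title>black bear attacks</title>
</top>
<top>
<num>902</num>
<title>hubble space/telescope</title>
</top>
"""

def parse_topics(text):
    """Return (ids, queries) from topics text, keeping the two lists aligned."""
    ids, queries = [], []
    for block in re.findall(r'<top>(.*?)</top>', text, flags=re.DOTALL):
        num = re.search(r'<num>(.+?)</num>', block)
        title = re.search(r'<title>(.+?)</title>', block)
        # Skip topics that lack either field so ids and queries stay paired
        if num and title:
            ids.append(num.group(1).strip())
            queries.append(title.group(1).replace('/', ' ').strip())
    return ids, queries

ids, queries = parse_topics(SAMPLE_TOPICS)
print(ids)
print(queries)
```

Pairing the id and title per topic block means a malformed topic is skipped as a whole rather than shifting every later query onto the wrong id.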

In [None]:
topic_ids, queries = processQuery('trec678-robust.xml')

In [None]:
searcher.set_bm25(0.8, 0.4)
searcher.set_rm3(20, 70, 0.8)

# Write a run file with one line per hit: topic id, Q0, docid, rank, score, run tag
with open('out20-70-0.8.txt', 'w') as out_file:
    for topic_id, query in zip(topic_ids, queries):
        hits2 = searcher.search(query, 1000)
        # Iterate over the hits actually returned; a query may yield fewer than 1000
        for rank, hit in enumerate(hits2, start=1):
            out_file.write(f'{topic_id}\tQ0\t{hit.docid}\t{rank}\t{hit.score}\tprm\n')
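
Each line written above has six tab-separated fields: topic id, the literal `Q0`, docid, rank, score, and a run tag. The sketch below shows that layout in isolation and reads a line back, which can be handy for sanity-checking a run file. Both helper names are hypothetical, and the topic id `301` is just an example value.

```python
def format_run_line(qid, docid, rank, score, tag='prm'):
    """Build one run-file line in the six-column layout used above."""
    return f'{qid}\tQ0\t{docid}\t{rank}\t{score}\t{tag}\n'

def parse_run_line(line):
    """Split a run-file line back into typed fields; raises ValueError if malformed."""
    qid, _q0, docid, rank, score, tag = line.split()
    return qid, docid, int(rank), float(score), tag

line = format_run_line('301', 'LA092790-0015', 1, 7.0668)
print(repr(line))
print(parse_run_line(line))
```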

In [2]:
from pyserini.search import SimpleSearcher

searcher = SimpleSearcher.from_prebuilt_index('robust04')
hits = searcher.search('hubble space telescope')

# Print the first 10 hits:
for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:15} {hits[i].score:.5f}')

Attempting to initialize pre-built index robust04.
Downloading index at https://git.uwaterloo.ca/jimmylin/anserini-indexes/raw/master/index-robust04-20191213.tar.gz...


index-robust04-20191213.tar.gz: 1.70GB [00:32, 55.5MB/s]                            


Extracting /root/.cache/pyserini/indexes/index-robust04-20191213.tar.gz into /root/.cache/pyserini/indexes/index-robust04-20191213.15f3d001489c97849a010b0a4734d018...
Initializing robust04...
 1 LA071090-0047   16.85690
 2 FT934-5418      16.75630
 3 FT921-7107      16.68290
 4 LA052890-0021   16.37390
 5 LA070990-0052   16.36460
 6 LA062990-0180   16.19260
 7 LA070890-0154   16.15610
 8 FT934-2516      16.08950
 9 LA041090-0148   16.08810
10 FT944-128       16.01920


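Pyserini also provides a command-line interface for batch retrieval. The following command runs BM25 over the Robust04 topics and writes the results to `run.robust04.txt`:

In [5]: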
!python -m pyserini.search --topics robust04 --index robust04 --output run.robust04.txt --bm25

Attempting to initialize pre-built index robust04.
/root/.cache/pyserini/indexes/index-robust04-20191213.15f3d001489c97849a010b0a4734d018 already exists, skipping download.
Initializing robust04...
Running robust04 topics, saving to run.robust04.txt...
100% 250/250 [00:23<00:00, 10.54it/s]


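We can then evaluate the run, reporting mean average precision (`map`) and precision at 30 (`P.30`):

In [7]: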
!python -m pyserini.eval.trec_eval -m map -m P.30 robust04 run.robust04.txt

Downloading https://search.maven.org/remotecontent?filepath=uk/ac/gla/dcs/terrierteam/jtreceval/0.0.5/jtreceval-0.0.5-jar-with-dependencies.jar to /root/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar...
jtreceval-0.0.5-jar-with-dependencies.jar: 1.79MB [00:00, 4.52MB/s]                
Running command: ['java', '-jar', '/root/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar', '-m', 'map', '-m', 'P.30', '/root/.cache/pyserini/topics-and-qrels/qrels.robust04.txt', 'run.robust04.txt']
Results:
map                   	all	0.2531
P_30                  	all	0.3102
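
To clarify what these two numbers measure, here is a minimal sketch of precision at k and average precision for a single topic, using the textbook definitions. MAP is the mean of average precision over all topics. This is an illustration only, not the evaluation tool's implementation; the docids and relevance judgments below are made up.

```python
def precision_at_k(ranked_docids, relevant, k):
    """Fraction of the top k ranked documents that are relevant."""
    return sum(1 for d in ranked_docids[:k] if d in relevant) / k

def average_precision(ranked_docids, relevant):
    """Mean of the precision values at each rank where a relevant document appears,
    divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    found = 0
    total = 0.0
    for rank, docid in enumerate(ranked_docids, start=1):
        if docid in relevant:
            found += 1
            total += found / rank
    return total / len(relevant)

# Hypothetical ranking and judgments for one topic
ranking = ['d1', 'd2', 'd3', 'd4', 'd5']
relevant = {'d1', 'd3', 'd6'}  # d6 is relevant but was never retrieved

p_at_5 = precision_at_k(ranking, relevant, 5)
ap = average_precision(ranking, relevant)
print(f'P@5 = {p_at_5:.4f}')
print(f'AP  = {ap:.4f}')
```

In this example two of the top five documents are relevant, so P@5 is 0.4. The relevant documents appear at ranks 1 and 3, and one relevant document is missed, so AP is (1/1 + 2/3) / 3.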

