# Pyserini Library

Reading and processing the CISI.ALL file, which contains the documents to be searched/retrieved.

In [1]:
with open('CISI.ALL', 'r') as file:
    content = file.read()

In [2]:
documents = content.split('.I ')

For each document, the title, author, and text are concatenated to assist in information retrieval.

In [3]:
docs = []
for document in documents:
  # Each field runs from its marker (.T, .A, .W) up to the next marker.
  title = document[document.find('\n.T') + 3:document.find('\n.A')].strip()
  author = document[document.find('\n.A') + 3:document.find('\n.W')].strip()
  text = document[document.find('\n.W') + 3:document.find('\n.X')].strip()
  # Concatenate the three fields into a single searchable string.
  doc = title + ' ' + author + ' ' + text
  docs.append(doc)
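
To make the slicing above concrete, here is a minimal sketch that applies the same `find`-based extraction to a single made-up record. The sample text and the helper name `parse_record` are illustrative assumptions; real records come from CISI.ALL.

```python
def parse_record(document):
    """Extract title, author, and text from one record using the
    same field markers as above (.T, .A, .W, .X)."""
    title = document[document.find('\n.T') + 3:document.find('\n.A')].strip()
    author = document[document.find('\n.A') + 3:document.find('\n.W')].strip()
    text = document[document.find('\n.W') + 3:document.find('\n.X')].strip()
    return title + ' ' + author + ' ' + text

# Made-up record for illustration only.
sample = "1\n.T\nA Sample Title\n.A\nDoe, J.\n.W\nSome abstract text.\n.X\n1\t5\t1\n"
parsed = parse_record(sample)
print(parsed)
```

Note that `str.find` returns -1 when a marker is missing, so a record without all four markers would be sliced incorrectly without raising an error; such records would need separate handling.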

First, the documents must be converted to one of the formats accepted by the library. Here we use the JSONL format, with one JSON object per line.

In [4]:
import json
import os

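directory = "json"

# Create the output directory if it does not exist yet.
os.makedirs(directory, exist_ok=True)

# docs[0] is the chunk before the first '.I ' marker, not a document,
# so numbering starts at 1.
data = []
for i in range(1, len(docs)):
  row = {"id": i, "contents": docs[i]}
  data.append(row)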

with open('json/CISI.jsonl', 'w') as outfile:
    for d in data:
        json.dump(d, outfile)
        outfile.write('\n')
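
As a quick sanity check on the format, the sketch below writes two made-up rows to a temporary JSONL file and reads them back, one JSON object per line. The rows and the temporary path are assumptions for illustration and are unrelated to the CISI data.

```python
import json
import os
import tempfile

# Made-up rows with the same two keys used above.
rows = [
    {"id": 1, "contents": "first made-up document"},
    {"id": 2, "contents": "second made-up document"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.jsonl")

    # Write one JSON object per line.
    with open(path, "w") as outfile:
        for row in rows:
            outfile.write(json.dumps(row) + "\n")

    # Read the file back, parsing each line independently.
    with open(path) as infile:
        loaded = [json.loads(line) for line in infile]

print(loaded == rows)
```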


In [5]:
!pip install pyserini

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyserini
  Downloading pyserini-0.20.0-py3-none-any.whl (137.1 MB)
Collecting nmslib>=2.1.1
  Downloading nmslib-2.1.1-cp38-cp38-manylinux2010_x86_64.whl (13.4 MB)
Collecting sentencepiece>=0.1.95
  Downloading sentencepiece-0.1.97-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting pandas>=1.4.0
  Downloading pandas-1.5.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.2 MB)

In [6]:
!python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input json \
  --index indexes/cisi_jsonl \
  --generator DefaultLuceneDocumentGenerator \
  --threads 1 \
  --storePositions --storeDocvectors --storeRaw

2023-02-22 20:05:44,772 INFO  [main] index.IndexCollection (IndexCollection.java:391) - Setting log level to INFO
2023-02-22 20:05:44,776 INFO  [main] index.IndexCollection (IndexCollection.java:394) - Starting indexer...
2023-02-22 20:05:44,776 INFO  [main] index.IndexCollection (IndexCollection.java:396) - DocumentCollection path: json
2023-02-22 20:05:44,777 INFO  [main] index.IndexCollection (IndexCollection.java:397) - CollectionClass: JsonCollection
2023-02-22 20:05:44,777 INFO  [main] index.IndexCollection (IndexCollection.java:398) - Generator: DefaultLuceneDocumentGenerator
2023-02-22 20:05:44,778 INFO  [main] index.IndexCollection (IndexCollection.java:399) - Threads: 1
2023-02-22 20:05:44,778 INFO  [main] index.IndexCollection (IndexCollection.java:400) - Language: en
2023-02-22 20:05:44,778 INFO  [main] index.IndexCollection (IndexCollection.java:401) - Stemmer: porter
2023-02-22 20:05:44,779 INFO  [main] index.IndexCollection (IndexCollection.java:402) - Keep stopwords? fa

Additionally, it is necessary to convert the queries to the format expected by the library.

Reading and processing the CISI.QRY file, which contains the queries.

In [7]:
with open('CISI.QRY', 'r') as file:
    content = file.read()

In [8]:
queries = content.split('.I ')

In [9]:
query_docs = []
for query in queries:
  text = query[query.find('.W\n') + 3:].strip()
  query_docs.append(text)

In [10]:
with open('queries.tsv', 'w') as outfile:
  for i in range(1, len(query_docs)):
    outfile.write(str(i) + '\t' + query_docs[i].replace('\n',' '))
    outfile.write('\n')

Reading and processing the CISI.REL file, which contains the reference relevance pairs that relate each query to its relevant documents.

In [11]:
from collections import defaultdict

map_query_to_docs = defaultdict(list)
total = 0

with open('CISI.REL', 'r') as file:
    for line in file:
        cols = line.split()
        query_id = cols[0]
        doc_id = cols[1]
        map_query_to_docs[int(query_id)].append(int(doc_id))
        total += 1

Next, we compute the maximum number of relevant documents for any single query, which is used as the number of hits.

As we'll see later, the command-line interface of this library requires a fixed number of hits per query, which makes it difficult to apply the percentile-based threshold heuristic used with Rank BM25. However, as we will also see, the results obtained here were better, even with a fixed number of documents per query.

In [12]:
# Largest number of relevant documents listed for any query.
max_hits = max(len(v) for v in map_query_to_docs.values())

In [13]:
max_hits

155
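
To illustrate the difference between the two cutoff strategies mentioned above, the sketch below applies a fixed top-k cutoff and a percentile-based threshold to the same made-up score list. The scores, the 80th-percentile threshold, and the helper names are assumptions for illustration; the exact percentile used with Rank BM25 is not stated here.

```python
import numpy as np

def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    order = np.argsort(scores)[::-1]
    return order[:k].tolist()

def above_percentile(scores, q):
    """Indices whose score is at or above the q-th percentile, best first."""
    threshold = np.percentile(scores, q)
    order = np.argsort(scores)[::-1]
    return [int(i) for i in order if scores[i] >= threshold]

# Made-up retrieval scores for ten documents.
scores = np.array([0.1, 2.5, 0.0, 1.7, 0.3, 0.0, 3.2, 0.2, 0.0, 0.9])

fixed = top_k(scores, 3)                  # always returns exactly 3 documents
adaptive = above_percentile(scores, 80)   # size depends on the score distribution
print(fixed, adaptive)
```

With a fixed cutoff every query returns the same number of documents, whereas the percentile threshold adapts the size of the result list to how the scores are distributed for each query.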

Installation of auxiliary libraries

In [14]:
!pip install torch

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [15]:
!pip install faiss-cpu

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting faiss-cpu
  Downloading faiss_cpu-1.7.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.0 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.7.3


We first run a test with the hyperparameter values taken from the documentation. Even before the hyperparameter tuning step, the recall and F-1 metrics exceed those obtained with the Rank BM25 library implementations.

In [16]:
!python -m pyserini.search.lucene \
  --index indexes/cisi_jsonl \
  --topics queries.tsv \
  --output runs/run.txt \
  --output-format msmarco \
  --hits 155 \
  --bm25 --k1 0.82 --b 0.68

2023-02-22 20:13:52.584112: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-22 20:13:54.719929: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-02-22 20:13:54.720160: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
Setting BM25 parameters: k1=0.82, b=0.
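
The same search can also be run through Pyserini's Python API instead of the command line. The sketch below assumes `LuceneSearcher` from `pyserini.search.lucene` with its `set_bm25` and `search` methods, and reuses the hyperparameters from the command above as defaults. The helper names `to_msmarco_lines` and `search_to_run` are hypothetical, and the tab-separated "query id, doc id, rank" layout is an assumption about the `msmarco` output format. Running the search itself requires the index built earlier and a working Java installation, so treat this as a sketch rather than a tested replacement for the command.

```python
def to_msmarco_lines(query_id, doc_ids):
    """Format ranked doc ids as 'query_id<TAB>doc_id<TAB>rank' lines (rank starts at 1)."""
    return [f"{query_id}\t{doc_id}\t{rank}"
            for rank, doc_id in enumerate(doc_ids, start=1)]

def search_to_run(index_dir, queries, k1=0.82, b=0.68, hits=155):
    """Search each query with BM25 and return the run as a list of lines.

    `queries` maps query id -> query text.
    """
    # Imported lazily so the formatting helper works without Pyserini installed.
    from pyserini.search.lucene import LuceneSearcher

    searcher = LuceneSearcher(index_dir)
    searcher.set_bm25(k1, b)

    lines = []
    for query_id, text in queries.items():
        results = searcher.search(text, k=hits)
        lines.extend(to_msmarco_lines(query_id, [hit.docid for hit in results]))
    return lines

# The formatting helper can be checked on its own with made-up doc ids.
example = to_msmarco_lines(1, ["28", "35"])
print(example)
```

A call such as `search_to_run('indexes/cisi_jsonl', {1: 'some query text'})` would then return the lines for one query, without writing an intermediate run file.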

Function for evaluating the results

In [17]:
import numpy as np
from collections import defaultdict

def eval_bm25(filename):
  # Read the run file: the first two columns are the query id and the doc id.
  map_retrieved_docs = defaultdict(list)
  with open(filename, 'r') as output:
    for line in output:
      values = line.split()
      map_retrieved_docs[int(values[0])].append(int(values[1]))

  precisions = []
  recalls = []
  for i in range(1, len(query_docs)):
    relevant_docs = set(map_query_to_docs.get(i, []))
    if not relevant_docs:
      continue  # queries without relevance judgments are skipped

    retrieved_docs = map_retrieved_docs.get(i, [])
    hits = sum(1 for doc in retrieved_docs if doc in relevant_docs)

    recalls.append(hits / len(relevant_docs))
    # Guard against queries for which nothing was retrieved.
    precisions.append(hits / len(retrieved_docs) if retrieved_docs else 0.0)

  # The value reported as MAP in this notebook is the mean, over queries,
  # of the precision of the full retrieved list.
  mean_precision = sum(precisions) / len(precisions)
  mean_recall = sum(recalls) / len(recalls)
  return mean_precision, mean_recall

In [18]:
map, recall = eval_bm25('runs/run.txt')
print('MAP = ', map)
print('Recall = ', recall)
print('F-1 = ', 2*map*recall/(map + recall))

MAP =  0.1161290322580645
Recall =  0.5154883783658808
F-1 =  0.18955515004177198
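
Note that `eval_bm25` averages, over queries, the precision of the full retrieved list and labels this value MAP. For comparison, here is a minimal sketch of the more common rank-aware definition (average precision per query, which is then averaged over queries), together with the F-1 computation used above. The function names and the toy ranking are assumptions for illustration; the numbers reported in this notebook do not come from this code.

```python
def average_precision(retrieved, relevant):
    """Average of precision@k over the ranks k where a relevant doc appears."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)

def f1(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy ranking: relevant docs 3 and 1 are found at ranks 1 and 3; doc 5 is missed.
ap = average_precision([3, 7, 1, 9], relevant=[3, 1, 5])
print(ap)
print(f1(0.5, 0.5))
```

Unlike precision over a fixed-length list, average precision rewards placing relevant documents near the top of the ranking.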


Next, we run a grid search that varies:
* the k1 hyperparameter, and
* the b hyperparameter.
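
For context, the roles of `k1` and `b` can be seen in a textbook form of the BM25 term score, sketched below. This is an illustrative formula, not necessarily the exact variant implemented by Lucene; the function name and the toy numbers are assumptions. `k1` controls how quickly the contribution of term frequency saturates, and `b` controls how strongly scores are normalized by document length.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of one query term in one document."""
    # Inverse document frequency: rarer terms weigh more.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: b = 0 disables it, b = 1 applies it fully.
    norm = 1 - b + b * doc_len / avg_doc_len
    # Term-frequency saturation controlled by k1.
    return idf * tf * (k1 + 1) / (tf + k1 * norm)

common = dict(tf=2, avg_doc_len=100, n_docs=1000, doc_freq=10)

# With b = 0, document length has no effect on the score.
short_b0 = bm25_term_score(doc_len=50, b=0.0, **common)
long_b0 = bm25_term_score(doc_len=200, b=0.0, **common)

# With b = 1, the longer document is penalized.
short_b1 = bm25_term_score(doc_len=50, b=1.0, **common)
long_b1 = bm25_term_score(doc_len=200, b=1.0, **common)

print(short_b0 == long_b0, short_b1 > long_b1)
```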

In [19]:
from IPython import get_ipython
ipython = get_ipython()

best_map = 0
best_k1 = None
best_b = None
best_recall = 0

for k1 in [0.5, 1.0, 1.2, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]:
  for b in [0.25, 0.5, 0.75, 1.0]:
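    # `ipython.system(...)` is what a `!command` line runs under the hood,
    # so this issues the same search command for each (k1, b) pair.
    ipython.system(
        'python -m pyserini.search.lucene '
        '--index indexes/cisi_jsonl '
        '--topics queries.tsv '
        f'--output runs/run_k1_{k1}_b_{b}.txt '
        '--output-format msmarco '
        '--hits 155 '
        f'--bm25 --k1 {k1} --b {b}'
    )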

    map, recall = eval_bm25(f'runs/run_k1_{k1}_b_{b}.txt')
    print('MAP = ', map)
    print('Recall = ', recall)
    print('F-1 = ', 2*map*recall/(map + recall))

    if map > best_map:
      best_k1 = k1
      best_b = b
      best_map = map
      best_recall = recall

print('Best MAP = ', best_map)
print('Recall = ', best_recall)
print('Best k1 = ', best_k1)
print('Best b = ', best_b)
print('Best F-1 = ', 2*best_map*best_recall/(best_map + best_recall))

2023-02-22 20:16:56.481219: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-22 20:16:58.304802: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-02-22 20:16:58.304995: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
Setting BM25 parameters: k1=0.5, b=0.2

**Comments**: The best MAP value obtained here was around 12%, slightly better than the MAP value obtained with Rank BM25. Recall also improved significantly (by around 31 percentage points), leading to an improvement of 8 percentage points in F-1. Although not shown here, we also evaluated Rank BM25 with a fixed cutoff of 155 returned documents, but the results were worse, with MAP not even reaching 8%. This suggests that, for that library, choosing the number of returned documents from the percentiles of the scores yields better results.