# Pyserini Library

Read and process the CISI.ALL file, which contains the documents to be searched and retrieved.

In [4]:
with open('CISI.ALL', 'r') as file:
    content = file.read()

In [5]:
documents = content.split('.I ')

For each document, the title, author, and text are concatenated into a single string to aid retrieval.

In [6]:
docs = []
for i, document in enumerate(documents):
  title = document[document.find('\n.T') + 3:document.find('\n.A')].strip()
  author = document[document.find('\n.A') + 3: document.find('\n.W')].strip()
  text = document[document.find('\n.W') + 3: document.find('\n.X')].strip()
  doc = title + ' ' + author + ' ' + text
  docs.append(doc)
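
The slicing above can be checked on a single record. The sketch below wraps the same `find`-based logic in a hypothetical helper, `parse_record` (not part of the notebook), and applies it to a made-up record laid out like a CISI entry; the sample text is illustrative and not taken from CISI.ALL.

```python
def parse_record(record):
    """Return 'title author text' for one record, using the same slicing as the loop above."""
    title = record[record.find('\n.T') + 3:record.find('\n.A')].strip()
    author = record[record.find('\n.A') + 3:record.find('\n.W')].strip()
    text = record[record.find('\n.W') + 3:record.find('\n.X')].strip()
    return title + ' ' + author + ' ' + text


# Made-up record, shaped like one chunk produced by content.split('.I ').
sample = "1\n.T\nA Sample Title\n.A\nDoe, J.\n.W\nSome abstract text.\n.X\n2\t5\t1\n"
parsed = parse_record(sample)
print(parsed)
```

Note that `str.find` returns -1 when a marker is absent, so a record missing one of the fields would be sliced incorrectly rather than raising an error.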

First, the documents must be converted to one of the formats accepted by the library. Here we use the JSONL format.

In [10]:
import json
import os

directory = "json"

if not os.path.exists(directory):
    os.makedirs(directory)

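data = []
# docs[0] is whatever precedes the first '.I ' marker, so indexing starts at 1
# and the loop index doubles as the document id.
for i in range(1, len(docs)):
  row = {"id":i, "contents":docs[i]}
  data.append(row)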

with open('json/CISI.jsonl', 'w') as outfile:
    for d in data:
        json.dump(d, outfile)
        outfile.write('\n')
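
As a quick sanity check of the format, the sketch below writes two made-up rows with the same `id` and `contents` fields to a temporary file and reads them back, showing that each line of a JSONL file is an independent JSON object. The file name and row contents are illustrative assumptions.

```python
import json
import os
import tempfile

# Made-up rows using the same two fields as the cell above.
rows = [
    {"id": 1, "contents": "first made-up document"},
    {"id": 2, "contents": "second made-up document"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.jsonl")
    # One JSON object per line.
    with open(path, "w") as outfile:
        for row in rows:
            outfile.write(json.dumps(row) + "\n")
    # Each line can be parsed on its own.
    with open(path) as infile:
        loaded = [json.loads(line) for line in infile]

print(loaded)
```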


In [8]:
!pip install pyserini

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyserini
  Downloading pyserini-0.20.0-py3-none-any.whl (137.1 MB)
Collecting pandas>=1.4.0
  Downloading pandas-1.5.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.2 MB)
Collecting sentencepiece>=0.1.95
  Downloading sentencepiece-0.1.97-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting onnxruntime>=1.8.1
  Downloading onnxruntime-1.14.0-cp38-cp38-manylinux_2_27_x86_64.whl (5.0 MB)

In [11]:
!python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input json \
  --index indexes/cisi_jsonl \
  --generator DefaultLuceneDocumentGenerator \
  --threads 1 \
  --storePositions --storeDocvectors --storeRaw

2023-02-21 01:14:49,532 INFO  [main] index.IndexCollection (IndexCollection.java:391) - Setting log level to INFO
2023-02-21 01:14:49,539 INFO  [main] index.IndexCollection (IndexCollection.java:394) - Starting indexer...
2023-02-21 01:14:49,542 INFO  [main] index.IndexCollection (IndexCollection.java:396) - DocumentCollection path: json
2023-02-21 01:14:49,543 INFO  [main] index.IndexCollection (IndexCollection.java:397) - CollectionClass: JsonCollection
2023-02-21 01:14:49,546 INFO  [main] index.IndexCollection (IndexCollection.java:398) - Generator: DefaultLuceneDocumentGenerator
2023-02-21 01:14:49,547 INFO  [main] index.IndexCollection (IndexCollection.java:399) - Threads: 1
2023-02-21 01:14:49,547 INFO  [main] index.IndexCollection (IndexCollection.java:400) - Language: en
2023-02-21 01:14:49,548 INFO  [main] index.IndexCollection (IndexCollection.java:401) - Stemmer: porter
2023-02-21 01:14:49,548 INFO  [main] index.IndexCollection (IndexCollection.java:402) - Keep stopwords? fa

In addition, the queries must be converted to the format expected by the library.

Read and process the CISI.QRY file, which contains the queries.

In [14]:
with open('CISI.QRY', 'r') as file:
    content = file.read()

In [15]:
queries = content.split('.I ')

In [16]:
query_docs = []
for query in queries:
  text = query[query.find('.W\n') + 3:].strip()
  query_docs.append(text)

In [25]:
with open('queries.tsv', 'w') as outfile:
  for i in range(1, len(query_docs)):
    outfile.write(str(i) + '\t' + query_docs[i].replace('\n',' '))
    outfile.write('\n')

Read and process the CISI.REL file, which contains the relevance judgments linking each query to its relevant documents.

In [19]:
from collections import defaultdict

map_query_to_docs = defaultdict(list)
total = 0

with open('CISI.REL', 'r') as file:
    for line in file:
        cols = line.split()
        query_id = cols[0]
        doc_id = cols[1]
        map_query_to_docs[int(query_id)].append(int(doc_id))
        total += 1

Compute the maximum number of hits, taken as the largest number of relevant documents listed for any single query. As shown further below, this library's command-line interface requires the number of hits to be specified, which makes it hard to apply the percentile-based threshold heuristic used above with Rank BM25. However, as will also be shown below, the results obtained here were better, even with a fixed number of documents per query.

In [20]:
max_hits = 0
for v in map_query_to_docs.values():
  if len(v) > max_hits:
    max_hits = len(v)

In [21]:
max_hits

155
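
The same quantity can be written more compactly with `max` over a generator. The sketch below uses made-up relevance judgments (the names `sample_judgments` and `sample_max_hits` are illustrative) rather than the CISI.REL data.

```python
from collections import defaultdict

# Made-up relevance judgments: query id -> list of relevant document ids.
sample_judgments = defaultdict(list, {1: [3, 7, 9], 2: [4], 3: [1, 2]})

# Largest number of relevant documents for any single query (0 if there are none).
sample_max_hits = max((len(v) for v in sample_judgments.values()), default=0)
print(sample_max_hits)
```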

Installing auxiliary libraries

In [22]:
!pip install torch

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [23]:
!pip install faiss-cpu

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting faiss-cpu
  Downloading faiss_cpu-1.7.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.0 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.7.3


Test run with the hyperparameter values taken from the documentation. Even before the hyperparameter tuning step, it is worth noting that the metrics exceed those obtained with the Rank BM25 library implementations.

In [26]:
!python -m pyserini.search.lucene \
  --index indexes/cisi_jsonl \
  --topics queries.tsv \
  --output runs/run.txt \
  --output-format msmarco \
  --hits 155 \
  --bm25 --k1 0.82 --b 0.68

2023-02-21 01:20:39.816227: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-21 01:20:41.177135: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-02-21 01:20:41.177254: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
Setting BM25 parameters: k1=0.82, b=0.
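
For orientation, the role of the two hyperparameters can be illustrated with the textbook BM25 term weight. This is a minimal sketch of the usual formulation, not Pyserini's or Lucene's exact implementation; the function name `bm25_term_weight` and all numbers are made up for illustration.

```python
def bm25_term_weight(tf, doc_len, avg_doc_len, idf, k1, b):
    """Textbook BM25 weight of one query term in one document."""
    # b controls how strongly the document length is normalized.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences of the term saturate.
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


# With b = 0, document length has no effect on the weight.
short_doc = bm25_term_weight(tf=3, doc_len=50, avg_doc_len=100, idf=1.0, k1=1.2, b=0.0)
long_doc = bm25_term_weight(tf=3, doc_len=200, avg_doc_len=100, idf=1.0, k1=1.2, b=0.0)

# With b = 0.75, the longer-than-average document is penalized.
long_doc_b = bm25_term_weight(tf=3, doc_len=200, avg_doc_len=100, idf=1.0, k1=1.2, b=0.75)

# Gain from raising tf from 1 to 10, for a small and a large k1 (b = 0).
gain_low_k1 = (bm25_term_weight(10, 100, 100, 1.0, 0.5, 0.0)
               / bm25_term_weight(1, 100, 100, 1.0, 0.5, 0.0))
gain_high_k1 = (bm25_term_weight(10, 100, 100, 1.0, 5.0, 0.0)
                / bm25_term_weight(1, 100, 100, 1.0, 5.0, 0.0))

print(short_doc, long_doc, long_doc_b)
print(gain_low_k1, gain_high_k1)
```

In this formulation, a larger `k1` lets additional term occurrences keep adding to the score, while a larger `b` penalizes long documents more heavily.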

Function for evaluating the results

In [27]:
import numpy as np

def eval_bm25(filename):
  # Read the run file: the first two columns of each line are query id and document id.
  map_retrieved_docs = defaultdict(list)

  with open(filename, 'r') as output:
    for line in output:
      values = line.split()
      query_id = values[0]
      doc_id = values[1]
      map_retrieved_docs[int(query_id)].append(int(doc_id))

  query_precision = defaultdict(list)
  query_recall = defaultdict(list)

  for i in range(1, len(query_docs)):
    relevant_docs = map_query_to_docs[i]
    # Queries without relevance judgments are skipped.
    if len(relevant_docs) > 0:
      retrieved_docs = map_retrieved_docs[i]

      # Recall: fraction of the relevant documents that were retrieved.
      relevant_retrieved = 0
      for doc in relevant_docs:
        if doc in retrieved_docs:
          relevant_retrieved += 1
      recall = relevant_retrieved/len(relevant_docs)
      query_recall[i].append(recall)

      # Precision: fraction of the retrieved documents that are relevant
      # (0 when nothing was retrieved, which avoids a division by zero).
      retrieved_relevant = 0
      for doc in retrieved_docs:
        if doc in relevant_docs:
          retrieved_relevant += 1
      precision = retrieved_relevant/len(retrieved_docs) if retrieved_docs else 0
      query_precision[i].append(precision)

  # Average the per-query values over all evaluated queries.
  # Local names avoid shadowing the built-in map().
  mean_precision = 0
  mean_recall = 0
  n = 0
  for q in query_precision:
    mean_precision += sum(query_precision[q])/len(query_precision[q])
    mean_recall += sum(query_recall[q])/len(query_recall[q])
    n += 1

  mean_precision = mean_precision/n
  mean_recall = mean_recall/n
  return mean_precision, mean_recall

In [29]:
map, recall = eval_bm25('runs/run.txt')
print('MAP = ', map)
print('Recall = ', recall)
print('F-1 = ', 2*map*recall/(map + recall))

MAP =  0.1161290322580645
Recall =  0.5154883783658808
F-1 =  0.18955515004177198
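
The value labelled MAP above is computed by `eval_bm25` as the mean, over queries, of precision on the full retrieved list, so it does not depend on the order of the results. For comparison, a rank-aware average precision could be sketched as below; `average_precision` is a hypothetical helper, not part of the notebook, and the document ids are made up.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average of precision@k taken at each rank k where a relevant document appears."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


ranked = [10, 4, 7, 2, 9]   # made-up ranking, best first
relevant = [4, 2, 30]       # made-up judgments; document 30 is never retrieved

ap = average_precision(ranked, relevant)
plain_precision = len(set(ranked) & set(relevant)) / len(ranked)
print(ap, plain_precision)
```

Averaging `average_precision` over all queries would give a rank-sensitive MAP, which generally differs from the order-independent mean precision reported above.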


Next, a grid search that varies:
* the k1 hyperparameter
* the b hyperparameter

In [31]:
from IPython import get_ipython
ipython = get_ipython()

best_map = 0
best_k1 = None
best_b = None
best_recall = 0

for k1 in [0.5, 1.0, 1.2, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]:
  for b in [0.25, 0.5, 0.75, 1.0]:
    code = ipython.transform_cell(f'!python -m pyserini.search.lucene \
      --index indexes/cisi_jsonl \
      --topics queries.tsv \
      --output runs/run_k1_{k1}_b_{b}.txt \
      --output-format msmarco \
      --hits 155 \
      --bm25 --k1 {k1} --b {b}')
    exec(code)

    map, recall = eval_bm25(f'runs/run_k1_{k1}_b_{b}.txt')
    print('MAP = ', map)
    print('Recall = ', recall)
    print('F-1 = ', 2*map*recall/(map + recall))

    if map > best_map:
      best_k1 = k1
      best_b = b
      best_map = map
      best_recall = recall

print('Best MAP = ', best_map)
print('Recall = ', best_recall)
print('Best k1 = ', best_k1)
print('Best b = ', best_b)
print('Best F-1 = ', 2*best_map*best_recall/(best_map + best_recall))

2023-02-21 01:23:56.618134: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-21 01:23:57.962650: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-02-21 01:23:57.962802: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
Setting BM25 parameters: k1=0.5, b=0.2
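
As an alternative to `transform_cell` plus `exec`, the same command line could be assembled as an argument list and launched with `subprocess`. The sketch below is one possible way to do this, using only the flags that appear in the cell above; `build_search_command` and `run_search` are hypothetical helpers, and actually running the search assumes Pyserini is installed and the index has been built.

```python
import subprocess
import sys


def build_search_command(k1, b, index="indexes/cisi_jsonl", topics="queries.tsv", hits=155):
    """Return the argument list and the output path for one BM25 search run."""
    output = f"runs/run_k1_{k1}_b_{b}.txt"
    command = [
        sys.executable, "-m", "pyserini.search.lucene",
        "--index", index,
        "--topics", topics,
        "--output", output,
        "--output-format", "msmarco",
        "--hits", str(hits),
        "--bm25", "--k1", str(k1), "--b", str(b),
    ]
    return command, output


def run_search(k1, b):
    """Run one search; raises CalledProcessError if the command fails."""
    command, output = build_search_command(k1, b)
    subprocess.run(command, check=True)
    return output


# Only build and display the command here; run_search is not invoked.
command, output_path = build_search_command(1.2, 0.75)
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` makes a failed run stop the grid search instead of silently evaluating a missing or stale output file.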

**Comments**: The best MAP obtained here was about 12%, whereas the best MAP obtained with Rank BM25 was about 10%. Although not shown here, a fixed number of 155 results was also evaluated for Rank BM25, but the results were worse, with MAP not even reaching 8%. This suggests that, for that library, choosing the number of returned documents from the percentiles of the scores yields better results.