# Boolean/bag-of-words search engine on TREC-DL 2020

This notebook implements a boolean search engine. It only considers whether each query term occurs in each document, regardless of how many times the term occurs (a bag-of-words approach). Documents are ranked by the number of query terms they contain.
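
As a minimal sketch of the idea, the toy example below builds an inverted index over three made-up passages and scores them by query-term presence. The passages, `toy_index`, and `toy_search` are hypothetical names used only for illustration, and plain whitespace splitting stands in for the Lucene analyzer used later in the notebook.

```python
from collections import defaultdict

# Toy corpus (hypothetical passages, not taken from MS MARCO).
docs = {
    0: "the manhattan project built the atomic bomb",
    1: "atomic energy has peaceful uses",
    2: "the project project project was secret",
}

# Inverted index: term -> set of ids of the documents containing it.
toy_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        toy_index[term].add(doc_id)

def toy_search(query):
    """Add one point to a document for every query term it contains."""
    scores = defaultdict(int)
    for term in query.split():
        for doc_id in toy_index.get(term, ()):
            scores[doc_id] += 1
    return dict(scores)

print(toy_search("atomic project"))
```

Document 0 matches both terms and scores 2. Document 2 repeats "project" three times but still scores 1, because the posting set records presence only.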

## Dataset download

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
main_path = '/content/drive/MyDrive/Unicamp-aula-2/'

import os

if not os.path.exists(main_path):
  os.makedirs(main_path)
else:
  print('Directory already exists')

Directory already exists


## Downloading auxiliary tools

In [3]:
!pip install pyserini

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyserini
  Downloading pyserini-0.20.0-py3-none-any.whl (137.1 MB)
Collecting sentencepiece>=0.1.95
  Downloading sentencepiece-0.1.97-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting pandas>=1.4.0
  Downloading pandas-1.5.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.2 MB)
Collecting lightgbm>=3.3.2
  Downloading lightgbm-3.3.5-py3-none-manylinux1_x86_64.whl (2.0 MB)

In [None]:
!git clone https://github.com/castorini/pyserini.git --recurse-submodules {main_path}/pyserini

In [None]:
!cd {main_path}/pyserini/tools/eval && tar xvfz trec_eval.9.0.4.tar.gz && cd trec_eval.9.0.4 && make && cd ../../..
!cd {main_path}/pyserini/tools/eval/ndeval && make && cd ../../..

## Building the inverted index

In [4]:
from pyserini.analysis import Analyzer, get_lucene_analyzer

In [5]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 27.3 gigabytes of available RAM

You are using a high-RAM runtime!


In [6]:
analyzer = Analyzer(get_lucene_analyzer(stemmer='porter'))

def preprocess_and_tokenize(text):
  return analyzer.analyze(text)

In [7]:

collection_path = main_path + '/collections/msmarco-passage/collection.tsv'



In [43]:
import nltk
import string

from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

nltk.download('stopwords')  # Download stopwords if not already downloaded

from nltk.corpus import stopwords

stop_words = stopwords.words('english')
stop_words = set(stop_words)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [45]:

import pandas as pd
from collections import defaultdict

# Number of passages read from the TSV per chunk
chunk_size = 1000
# Inverted index: token -> set of ids of the passages that contain it
inverted_index = defaultdict(set)

def process(row):
  tokenized_text = preprocess_and_tokenize(row[1])
  doc_id = row[0]
  for token in tokenized_text:
    # Experiment to reduce the index size: skip NLTK stopwords
    if token not in stop_words:
      inverted_index[token].add(doc_id)

chunk_id = 0
# Iterate through the collection in chunks to keep memory usage bounded
for chunk in pd.read_csv(collection_path, sep='\t', header=None, chunksize=chunk_size):
  if (chunk_id % 1000) == 0:
    print(f'Processing chunk {chunk_id}')
  for index, row in chunk.iterrows():
    process(row)
  del chunk
  chunk_id += 1

Processing chunk 0
Processing chunk 1000
Processing chunk 2000
Processing chunk 3000
Processing chunk 4000
Processing chunk 5000
Processing chunk 6000
Processing chunk 7000
Processing chunk 8000


In [46]:
len(inverted_index)

2660662

In [None]:
!head {collection_path}

0	The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.
1	The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peaceful uses of atomic energy continues to have an impact on history and science.
2	Essay on The Manhattan Project - The Manhattan Project The Manhattan Project was to see if making an atomic bomb possible. The success of this project would forever change the world forever making it known that something this powerful can be manmade.
3	The Manhattan Project was the name for a project conducted during World War II, to develop the first atomic bomb. It refers specifically to the period of the project from 194 â¦ 2-1946 under the control of the U.S. Army Corps of Engineers

## Evaluation

In [11]:
topics_file = main_path + '/pyserini/tools/topics-and-qrels/topics.dl20.txt'
qrels_eval = main_path + '/pyserini/tools/topics-and-qrels/qrels.dl20-passage.txt'

In [12]:
!head {topics_file}

1030303	who is aziz hashim
1037496	who is rep scalise?
1043135	who killed nicholas ii of russia
1045109	who owns barnhart crane
1049519	who said no one can make you feel inferior
1051399	who sings monk theme song
1056416	who was the highest career passer  rating in the nfl
1064670	why do hunters pattern their shotguns?
1065636	why do some places on my scalp feel sore
1071750	why is pete rose banned from hall of fame


In [13]:
!head {qrels_eval}

23849 0 1020327 2
23849 0 1034183 3
23849 0 1120730 0
23849 0 1139571 1
23849 0 1143724 0
23849 0 1147202 0
23849 0 1150311 0
23849 0 1158886 2
23849 0 1175024 1
23849 0 1201385 0


In [25]:
def search(query, use_stopwords=False):
  doc_scores = defaultdict(int) # int (doc_id) -> int (score)
  query_tokens = preprocess_and_tokenize(query)

  for token in query_tokens:
    if token in inverted_index:
      if use_stopwords or token not in stop_words:
        doc_ids = inverted_index[token]
        for doc_id in doc_ids:
          doc_scores[doc_id] += 1

  return doc_scores

In [16]:
results = search('who is aziz hashim', True)

['who', 'aziz', 'hashim']


In [17]:
len(results)

479793

In [18]:
results = search('who is aziz hashim')

['who', 'aziz', 'hashim']


In [19]:
len(results)

245

In [44]:

#6989780, 1305521, 4358004, 1815707, 7508059

Note: the stopword list of the Lucene analyzer appears to be quite restricted. To reduce the index size, one alternative is to combine it with the NLTK stopword list, that is, to store a token in the inverted index only if it is not in the NLTK stopword list.
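
The effect of the extra filter can be sketched on a toy corpus. Everything here is an assumption made for illustration: `extra_stop_words` is a tiny stand-in for the NLTK English list, whitespace splitting replaces the Lucene analyzer, and `build_index` is a hypothetical helper rather than the indexing loop used above.

```python
from collections import defaultdict

# Hypothetical stand-ins: a tiny stopword set instead of NLTK's English list,
# and whitespace splitting instead of the Lucene analyzer.
extra_stop_words = {"who", "is", "the", "of"}

def build_index(docs, stop_words=frozenset()):
    """Map each non-stopword token to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token not in stop_words:
                index[token].add(doc_id)
    return index

docs = {
    0: "who is the founder of the fund",
    1: "the fund is managed by the founder",
}

full_index = build_index(docs)
reduced_index = build_index(docs, extra_stop_words)
print(len(full_index), len(reduced_index))
```

On a real collection the saving comes mostly from dropping the very long posting sets of frequent words, not from the handful of vocabulary entries removed.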

In [20]:
def get_document_by_id(id):
  # Linear scan of the collection; the id may be given as int or str
  target = str(id)
  result = None
  with open(collection_path, 'r') as f:
    for line in f:
      fields = line.strip().split('\t')
      doc_id = fields[0]
      if doc_id == target:
        result = fields[1]
        break

  return result
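
Each lookup above rescans the collection file from the start, which is slow when many documents have to be fetched. One possible alternative, sketched here under assumptions, is to record the byte offset of every line once and then seek directly to it. `build_offset_index` and `get_document_at` are hypothetical helpers, and the temporary TSV file is a stand-in for `collection.tsv`. For the full collection the offset dictionary would itself take a noticeable amount of memory.

```python
import os
import tempfile

def build_offset_index(path):
    """Map each document id (as str) to the byte offset of its line."""
    offsets = {}
    with open(path, 'rb') as f:
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            doc_id = line.split(b'\t', 1)[0].decode('utf-8')
            offsets[doc_id] = pos
    return offsets

def get_document_at(path, offsets, doc_id):
    """Seek directly to the document's line instead of scanning the file."""
    pos = offsets.get(str(doc_id))
    if pos is None:
        return None
    with open(path, 'rb') as f:
        f.seek(pos)
        line = f.readline().decode('utf-8')
    return line.strip().split('\t')[1]

# Tiny hypothetical TSV standing in for collection.tsv.
with tempfile.NamedTemporaryFile('w', suffix='.tsv', delete=False,
                                 encoding='utf-8', newline='') as tmp:
    tmp.write('0\tfirst passage\n1\tsecond passage\n')
    toy_path = tmp.name

toy_offsets = build_offset_index(toy_path)
second = get_document_at(toy_path, toy_offsets, 1)
missing = get_document_at(toy_path, toy_offsets, 99)
print(second)
os.remove(toy_path)
```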

In [21]:
sorted_results = sorted(results.items(), key=lambda x: x[1], reverse=True)[:10]

In [22]:
sorted_results

[(7156982, 2),
 (8726429, 2),
 (8726430, 2),
 (8726433, 2),
 (8726434, 2),
 (8726435, 2),
 (8726436, 2),
 (8726437, 2),
 (794624, 1),
 (4820481, 1)]

In [23]:
get_document_by_id('8726437')

'Aziz Hashim is one of the worldâ\x80\x99s leading experts on franchising and a highly regarded executive in the U.S. and international franchise space. He is the Founder and Managing Partner of NRD Capital (NRD), the first business fund both sponsored and managed by a former multi-unit franchisee.'

In [47]:
query_to_results = dict()

with open(topics_file, 'r') as f:
  for line in f:
    fields = line.strip().split('\t')
    query_id = fields[0]
    query_text = fields[1]
    results = search(query_text)
    query_to_results[int(query_id)] = sorted(results.items(), key=lambda x: x[1], reverse=True)[:10]

with open('run.dl20.boolean.trec', 'w') as f:
  for query_id, results in query_to_results.items():
    for i, (doc_id, score) in enumerate(results):
      f.write(f'{query_id}\tQ0\t{doc_id}\t{i+1}\t{score}\tboolean\n')
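
Because the score is a small integer, many documents tie, and which tied documents end up in the top 10 depends on the iteration order of the posting sets. A possible way to make the cut reproducible is sketched below. `top_k` is a hypothetical helper, and breaking ties by ascending document id is an arbitrary choice made for illustration, not a rule required by the task.

```python
import heapq

def top_k(doc_scores, k=10):
    """Return the k best (doc_id, score) pairs; ties are broken by ascending doc_id."""
    return heapq.nsmallest(k, doc_scores.items(), key=lambda item: (-item[1], item[0]))

toy_scores = {30: 1, 10: 2, 20: 2, 40: 1}
print(top_k(toy_scores, k=3))
```

`heapq.nsmallest` also avoids fully sorting the hundreds of thousands of candidates a query can match when only ten are kept.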

In [32]:
!head run.dl20.boolean.trec

1030303	Q0	7156982	1	2	boolean
1030303	Q0	8726429	2	2	boolean
1030303	Q0	8726430	3	2	boolean
1030303	Q0	8726433	4	2	boolean
1030303	Q0	8726434	5	2	boolean
1030303	Q0	8726435	6	2	boolean
1030303	Q0	8726436	7	2	boolean
1030303	Q0	8726437	8	2	boolean
1030303	Q0	794624	9	1	boolean
1030303	Q0	4820481	10	1	boolean


In [36]:
!python {main_path}/pyserini/tools/scripts/msmarco/convert_msmarco_to_trec_qrels.py \
   --input {qrels_eval} \
   --output qrels.dl20.trec

Done!


In [37]:
!head qrels.dl20.trec

23849 0 1020327 2
23849 0 1034183 3
23849 0 1120730 0
23849 0 1139571 1
23849 0 1143724 0
23849 0 1147202 0
23849 0 1150311 0
23849 0 1158886 2
23849 0 1175024 1
23849 0 1201385 0


In [48]:
!{main_path}/pyserini/tools/eval/trec_eval.9.0.4/trec_eval -c -m map -m ndcg_cut.10 -l 2 \
   qrels.dl20.trec run.dl20.boolean.trec

map                   	all	0.1117
ndcg_cut_10           	all	0.3189
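
To make the second metric concrete, here is a rough sketch of nDCG@10 for a single query, assuming linear gains (the relevance grade itself) and a log2(rank + 1) discount. The judgments and ranking are made up, `dcg_at_k` and `ndcg_at_k` are hypothetical helpers, and trec_eval's handling of ties and edge cases may differ. In the command above, `-l 2` sets the minimum grade counted as relevant for binary measures such as MAP, while nDCG uses the graded judgments.

```python
import math

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain with linear gains and a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query; qrels maps doc id -> graded relevance (unjudged = 0)."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical judgments and ranking for a single query.
toy_qrels = {"d1": 3, "d2": 2, "d3": 0, "d4": 1}
toy_ranking = ["d2", "d9", "d1", "d3"]
toy_ndcg = ndcg_at_k(toy_ranking, toy_qrels)
print(round(toy_ndcg, 4))
```

The value reported for a run is the average of the per-query scores.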
