<a href="https://colab.research.google.com/github/leonardo3108/IA368dd/blob/main/exercicios/Aula_6/Aula_6_Indexing_Search_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment

* Train a seq2seq model (starting from T5-base) on the document expansion task using doc2query
* Use the "tiny" MS MARCO dataset as training data for the doc2query task
https://storage.googleapis.com/unicamp-dl/ia368dd_2023s1/msmarco/msmarco_triples.train.tiny.tsv
* doc2query: the input is the passage and the target is the query
Note that only (query, relevant passage) pairs are used for training.
Training is relatively fast (<1 hour).
* Validate every X steps using sacreBLEU
* The slow part of this exercise is the pre-indexing: for each document in the collection, we have to generate one or more queries, which are then concatenated to the original document, and this "expanded" document is indexed.
* Evaluate on TREC-COVID (171K docs), since it is smaller than MS MARCO/TREC-DL 2020 (8.8M passages).
  * Inverted index for TREC-COVID in pyserini: beir-v1.0.0-trec-covid-flat
  * Corpus and queries on HF: https://huggingface.co/datasets/BeIR/trec-covid
  * qrels: https://huggingface.co/datasets/BeIR/trec-covid-qrels
  * Use nDCG@10
  * Compare BM25 with and without the documents expanded by doc2query
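
The doc2query setup described above (passage as input, query as target, relevant pairs only) can be sketched as follows. This is a minimal illustration, not the training code used for this exercise. It assumes each row of the triples file is `query <TAB> relevant passage <TAB> non-relevant passage`; the function name `doc2query_pairs` and the sample rows are hypothetical.

```python
import csv
import io


def doc2query_pairs(tsv_lines):
    """Yield (source, target) pairs for doc2query training.

    Assumption: each row is query <TAB> relevant passage <TAB> non-relevant passage.
    The non-relevant passage is ignored, since only relevant pairs are used.
    """
    reader = csv.reader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip malformed rows
        query, positive = row[0], row[1]
        # Source is the passage, target is the query
        yield positive, query


# Hypothetical two-row sample standing in for the real TSV file
sample = io.StringIO(
    "what is bm25\tBM25 is a ranking function.\tCats are mammals.\n"
    "who wrote hamlet\tHamlet was written by Shakespeare.\tBM25 is a ranking function.\n"
)
pairs = list(doc2query_pairs(sample))
for source, target in pairs:
    print(f"input: {source!r} -> target: {target!r}")
```

With the real file, an open file handle can be passed in place of `sample`.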

# Setup

## Google Drive integration

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Installing libraries

In [None]:
!pip install pyserini
!pip install faiss-cpu
!pip install trectools

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting trectools
  Downloading trectools-0.0.49.tar.gz (28 kB)
  Preparing metadata (setup.py) ... done
Collecting sarge>=0.1.1
  Downloading sarge-0.1.7.post1-py2.py3-none-any.whl (18 kB)
Collecting bs4>=0.0.0.1
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: trectools, bs4
  Building wheel for trectools (setup.py) ... done
  Created wheel for trectools: filename=trectools-0.0.49-py3-none-any.whl size=27141 sha256=79d2010256328cedcdf90c4984fe25796d663abda792b6c26a90de22235a275c
  Stored in directory: /root/.cache/pip/wheels/b2/1d/4d/445b0fb9a145de0dc24861a535cbe755f637327da7f5d65ed7

## Importing libraries

In [None]:
import json

import pandas as pd
import trectools

# Data preparation

## Fetching the corpus

In [None]:
!wget https://huggingface.co/datasets/BeIR/trec-covid/resolve/main/corpus.jsonl.gz

--2023-04-12 21:36:04--  https://huggingface.co/datasets/BeIR/trec-covid/resolve/main/corpus.jsonl.gz
Resolving huggingface.co (huggingface.co)... 18.67.0.67, 18.67.0.34, 18.67.0.90, ...
Connecting to huggingface.co (huggingface.co)|18.67.0.67|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/repos/a8/10/a810e88b0e7b233be82b89c1fa6ec2d75efc6d55784c2ada9dcac8434a634f3a/e9e97686e3138eaff989f67c04cd32e8f8f4c0d4857187e3f180275b23e24e85?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27corpus.jsonl.gz%3B+filename%3D%22corpus.jsonl.gz%22%3B&response-content-type=application%2Fgzip&Expires=1681594565&Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cHM6Ly9jZG4tbGZzLmh1Z2dpbmdmYWNlLmNvL3JlcG9zL2E4LzEwL2E4MTBlODhiMGU3YjIzM2JlODJiODljMWZhNmVjMmQ3NWVmYzZkNTU3ODRjMmFkYTlkY2FjODQzNGE2MzRmM2EvZTllOTc2ODZlMzEzOGVhZmY5ODlmNjdjMDRjZDMyZThmOGY0YzBkNDg1NzE4N2UzZjE4MDI3NWIyM2UyNGU4NT9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzcG9u

In [None]:
!gzip -dv corpus.jsonl.gz

corpus.jsonl.gz:	 66.8% -- replaced with corpus.jsonl


In [None]:
!head corpus.jsonl

{"_id": "ug7v899j", "title": "Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia", "text": "OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified, 33 (82.5%) of whom required admission. Most infections (92.5%) were community-acquired. The infection affected all age groups but was most common in infants (32.5%) and pre-school children (22.5%). It occurred year-round but was most common in the fall (35%) and spring (30%). More than three-quarters of patients (77.5%) had comorbidities. Twenty-four isolates (60

## Extracting the texts

In [None]:
docs_data = []
with open('corpus.jsonl', 'r') as fin:
    for line in fin:
        doc_data = json.loads(line)
        # Add the 'id' and 'contents' fields used by the JsonCollection indexer
        doc_data['id'] = doc_data['_id']
        doc_data['contents'] = doc_data['title'] + '\n' + doc_data['text']
        docs_data.append(doc_data)
len(docs_data)

171332

## Dumping the original documents

In [None]:
!mkdir corpus_original
!mkdir index_original

In [None]:
with open('corpus_original/corpus.jsonl', 'w') as fout:
    for doc_data in docs_data:
        fout.write(json.dumps(doc_data, ensure_ascii=True))
        fout.write('\n')

## Loading the generated queries

In [None]:
path = '/content/drive/MyDrive/temp'

# One line of generated queries per document, in the same order as corpus.jsonl
with open(path + '/generated_queries.txt') as fin:
    generated_queries = list(fin)
len(generated_queries)

171332

## Building the augmented documents

In [None]:
# Append the generated queries to the contents of the matching document
for i in range(len(generated_queries)):
    docs_data[i]['contents'] = docs_data[i]['contents'] + '\n' + generated_queries[i].rstrip()

print('Example - document 123:')
print('=================================================')
for field in docs_data[123].keys():
    print(field + ':', docs_data[123][field])
    print('=================================================')

Example - document 123:
_id: y2nhss9u
title: Nucleolus: the fascinating nuclear body
text: Nucleoli are the prominent contrasted structures of the cell nucleus. In the nucleolus, ribosomal RNAs are synthesized, processed and assembled with ribosomal proteins. RNA polymerase I synthesizes the ribosomal RNAs and this activity is cell cycle regulated. The nucleolus reveals the functional organization of the nucleus in which the compartmentation of the different steps of ribosome biogenesis is observed whereas the nucleolar machineries are in permanent exchange with the nucleoplasm and other nuclear bodies. After mitosis, nucleolar assembly is a time and space regulated process controlled by the cell cycle. In addition, by generating a large volume in the nucleus with apparently no RNA polymerase II activity, the nucleolus creates a domain of retention/sequestration of molecules normally active outside the nucleolus. Viruses interact with the nucleolus and recruit nucleolar proteins to fa
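
The expansion loop above relies on the generated queries being aligned line by line with the corpus. A minimal sketch of the same step as a function that checks this alignment and leaves the input documents untouched is shown below. The name `expand_documents` and the toy data are hypothetical, not part of the original notebook.

```python
def expand_documents(docs, queries_per_doc):
    """Return copies of docs with the generated queries appended to 'contents'.

    Assumption: queries_per_doc[i] holds the generated text for docs[i].
    """
    if len(docs) != len(queries_per_doc):
        raise ValueError(
            f"{len(docs)} documents but {len(queries_per_doc)} lines of queries"
        )
    expanded = []
    for doc, query in zip(docs, queries_per_doc):
        new_doc = dict(doc)  # shallow copy so the original document is not mutated
        new_doc["contents"] = doc["contents"] + "\n" + query.rstrip()
        expanded.append(new_doc)
    return expanded


# Toy data standing in for docs_data and generated_queries
docs = [
    {"id": "a", "contents": "Title A\nBody A"},
    {"id": "b", "contents": "Title B\nBody B"},
]
queries = ["what is a\n", "what is b\n"]
expanded = expand_documents(docs, queries)
print(expanded[0]["contents"])
```

Returning copies keeps the original and expanded collections available side by side, so both can be dumped in any order.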

## Dumping the augmented documents

In [None]:
!mkdir corpus
!mkdir index

mkdir: cannot create directory ‘corpus’: File exists
mkdir: cannot create directory ‘index’: File exists


In [None]:
with open('corpus/augmented_corpus.jsonl', 'w') as fout:
    for doc_data in docs_data:
        fout.write(json.dumps(doc_data, ensure_ascii=True))
        fout.write('\n')

## Fetching the relevance judgments (qrels)

In [None]:
!wget https://huggingface.co/datasets/BeIR/trec-covid-qrels/raw/main/test.tsv

--2023-04-12 23:25:58--  https://huggingface.co/datasets/BeIR/trec-covid-qrels/raw/main/test.tsv
Resolving huggingface.co (huggingface.co)... 18.67.0.90, 18.67.0.67, 18.67.0.34, ...
Connecting to huggingface.co (huggingface.co)|18.67.0.90|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 980831 (958K) [text/plain]
Saving to: ‘test.tsv.1’


2023-04-12 23:25:58 (4.13 MB/s) - ‘test.tsv.1’ saved [980831/980831]



In [None]:
qrels = pd.read_csv('test.tsv', delimiter='\t')
qrels

Unnamed: 0,query-id,corpus-id,score
0,1,005b2j4b,2
1,1,00fmeepz,1
2,1,g7dhmyyo,2
3,1,0194oljo,1
4,1,021q9884,1
...,...,...,...
66331,50,zvop8bxh,2
66332,50,zwf26o63,1
66333,50,zwsvlnwe,0
66334,50,zxr01yln,1


In [None]:
qrels['Q0'] = '0'
qrels = qrels[['query-id', 'Q0', 'corpus-id', 'score']]
qrels

Unnamed: 0,query-id,Q0,corpus-id,score
0,1,0,005b2j4b,2
1,1,0,00fmeepz,1
2,1,0,g7dhmyyo,2
3,1,0,0194oljo,1
4,1,0,021q9884,1
...,...,...,...,...
66331,50,0,zvop8bxh,2
66332,50,0,zwf26o63,1
66333,50,0,zwsvlnwe,0
66334,50,0,zxr01yln,1


In [None]:
qrels.to_csv('qrels_adjusted.tsv', sep='\t', header=None, index=None)

In [None]:
qrels = trectools.TrecQrel('qrels_adjusted.tsv')

## Fetching the TREC-COVID queries

In [None]:
!wget https://huggingface.co/datasets/BeIR/trec-covid/resolve/main/queries.jsonl.gz

--2023-04-12 23:26:12--  https://huggingface.co/datasets/BeIR/trec-covid/resolve/main/queries.jsonl.gz
Resolving huggingface.co (huggingface.co)... 18.67.0.90, 18.67.0.67, 18.67.0.34, ...
Connecting to huggingface.co (huggingface.co)|18.67.0.90|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/repos/a8/10/a810e88b0e7b233be82b89c1fa6ec2d75efc6d55784c2ada9dcac8434a634f3a/9eadcc2cdf140addc9dae83648bb2c6611f5e4b66eaed7475fa5a0ca48eda371?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27queries.jsonl.gz%3B+filename%3D%22queries.jsonl.gz%22%3B&response-content-type=application%2Fgzip&Expires=1681601173&Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cHM6Ly9jZG4tbGZzLmh1Z2dpbmdmYWNlLmNvL3JlcG9zL2E4LzEwL2E4MTBlODhiMGU3YjIzM2JlODJiODljMWZhNmVjMmQ3NWVmYzZkNTU3ODRjMmFkYTlkY2FjODQzNGE2MzRmM2EvOWVhZGNjMmNkZjE0MGFkZGM5ZGFlODM2NDhiYjJjNjYxMWY1ZTRiNjZlYWVkNzQ3NWZhNWEwY2E0OGVkYTM3MT9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzc

In [None]:
!gzip -dv 'queries.jsonl.gz'

queries.jsonl.gz:	 71.8% -- replaced with queries.jsonl


In [None]:
!head 'queries.jsonl'

{"_id": "1", "text": "what is the origin of COVID-19", "metadata": {"query": "coronavirus origin", "narrative": "seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans"}}
{"_id": "2", "text": "how does the coronavirus respond to changes in the weather", "metadata": {"query": "coronavirus response to weather changes", "narrative": "seeking range of information about the SARS-CoV-2 virus viability in different weather/climate conditions as well as information related to transmission of the virus in different climate conditions"}}
{"_id": "3", "text": "will SARS-CoV2 infected people develop immunity? Is cross protection possible?", "metadata": {"query": "coronavirus immunity", "narrative": "seeking studies of immunity developed due to infection with SARS-CoV2 or cross protection gained due to infection with other coronavirus types"}}
{"_id": "4", "text": "what causes death from Covid-19?", "metadata": {"

In [None]:
queries = {}
with open('queries.jsonl') as fin:
    for line in fin:
        data = json.loads(line)
        queries[data["_id"]] = data["text"]

# Keep only the queries that have relevance judgments in the qrels
queries_qrels = list(qrels.qrels_data["query"].unique())
queries = {query_id: value for query_id, value in queries.items() if str(query_id) in queries_qrels}

In [None]:
df_queries = pd.DataFrame(queries, index=[0]).T
df_queries = df_queries.reset_index()
df_queries.to_csv('queries_adjusted.tsv', header=None, index=None, sep='\t')

# Index generation

## Augmented docs

In [None]:
!python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input corpus \
  --index index \
  --generator DefaultLuceneDocumentGenerator \
  --threads 1 \
  --storePositions --storeDocvectors --storeRaw

2023-04-12 23:09:50,291 INFO  [main] index.IndexCollection (IndexCollection.java:380) - Setting log level to INFO
2023-04-12 23:09:50,294 INFO  [main] index.IndexCollection (IndexCollection.java:383) - Starting indexer...
2023-04-12 23:09:50,294 INFO  [main] index.IndexCollection (IndexCollection.java:385) - DocumentCollection path: corpus
2023-04-12 23:09:50,295 INFO  [main] index.IndexCollection (IndexCollection.java:386) - CollectionClass: JsonCollection
2023-04-12 23:09:50,295 INFO  [main] index.IndexCollection (IndexCollection.java:387) - Generator: DefaultLuceneDocumentGenerator
2023-04-12 23:09:50,296 INFO  [main] index.IndexCollection (IndexCollection.java:388) - Threads: 1
2023-04-12 23:09:50,296 INFO  [main] index.IndexCollection (IndexCollection.java:389) - Language: en
2023-04-12 23:09:50,297 INFO  [main] index.IndexCollection (IndexCollection.java:390) - Stemmer: porter
2023-04-12 23:09:50,297 INFO  [main] index.IndexCollection (IndexCollection.java:391) - Keep stopwords? 

## Original docs

In [None]:
!python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input corpus_original \
  --index index_original \
  --generator DefaultLuceneDocumentGenerator \
  --threads 1 \
  --storePositions --storeDocvectors --storeRaw 

2023-04-13 00:22:14,857 INFO  [main] index.IndexCollection (IndexCollection.java:380) - Setting log level to INFO
2023-04-13 00:22:14,860 INFO  [main] index.IndexCollection (IndexCollection.java:383) - Starting indexer...
2023-04-13 00:22:14,861 INFO  [main] index.IndexCollection (IndexCollection.java:385) - DocumentCollection path: corpus_original
2023-04-13 00:22:14,861 INFO  [main] index.IndexCollection (IndexCollection.java:386) - CollectionClass: JsonCollection
2023-04-13 00:22:14,862 INFO  [main] index.IndexCollection (IndexCollection.java:387) - Generator: DefaultLuceneDocumentGenerator
2023-04-13 00:22:14,862 INFO  [main] index.IndexCollection (IndexCollection.java:388) - Threads: 1
2023-04-13 00:22:14,862 INFO  [main] index.IndexCollection (IndexCollection.java:389) - Language: en
2023-04-13 00:22:14,863 INFO  [main] index.IndexCollection (IndexCollection.java:390) - Stemmer: porter
2023-04-13 00:22:14,863 INFO  [main] index.IndexCollection (IndexCollection.java:391) - Keep st

# Running and evaluation

In [None]:
!mkdir 'runs'

## Augmented docs

In [None]:
!python -m pyserini.search.lucene \
  --index index \
  --topics queries_adjusted.tsv \
  --output runs/run.augmented_index.bm25.txt \
  --output-format trec \
  --hits 1000 \
  --bm25 --k1 0.82 --b 0.68

Setting BM25 parameters: k1=0.82, b=0.68
Running queries_adjusted.tsv topics, saving to runs/run.augmented_index.bm25.txt...
100% 50/50 [00:07<00:00,  7.01it/s]
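
To give some intuition for the `k1` and `b` values passed above, here is a rough sketch of a textbook BM25 term score. It is an illustration only: Lucene's actual implementation differs in details, and the collection statistics used in the example (document lengths, document frequency) are made-up values, not measurements from this index.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=0.82, b=0.68):
    """Textbook BM25 contribution of one query term to one document.

    tf: occurrences of the term in the document
    doc_freq: number of documents containing the term
    """
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls how strongly long documents are penalized
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences saturate
    return idf * tf * (k1 + 1) / (tf + length_norm)


# Hypothetical case: a query term is absent from the original document
# but appears once in the text appended by doc2query.
before = bm25_term_score(tf=0, doc_len=200, avg_doc_len=200, n_docs=171332, doc_freq=5000)
after = bm25_term_score(tf=1, doc_len=210, avg_doc_len=200, n_docs=171332, doc_freq=5000)
print(f"score before expansion: {before:.4f}, after expansion: {after:.4f}")
```

This is the mechanism document expansion relies on: a term that only appears in the appended queries moves from contributing nothing to contributing a positive score.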


In [None]:
!python -m pyserini.eval.trec_eval -c -mndcg_cut.10 -mmap qrels_adjusted.tsv runs/run.augmented_index.bm25.txt

Downloading https://search.maven.org/remotecontent?filepath=uk/ac/gla/dcs/terrierteam/jtreceval/0.0.5/jtreceval-0.0.5-jar-with-dependencies.jar to /root/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar...
/root/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar already exists!
Skipping download.
Running command: ['java', '-jar', '/root/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar', '-c', '-mndcg_cut.10', '-mmap', 'qrels_adjusted.tsv', 'runs/run.augmented_index.bm25.txt']
Results:
map                   	all	0.2109
ndcg_cut_10           	all	0.6373
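
For reference, the `ndcg_cut_10` metric reported above can be sketched for a single query as follows. This is a simplified version assuming linear gains (the graded `score` from the qrels) and a log2 rank discount; it is not the trec_eval code, and the document IDs in the example are hypothetical.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k positions (rank starts at 1)."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, judged, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: doc ids in the order returned by the search
    judged: dict mapping doc id -> graded relevance; unjudged docs count as 0
    """
    gains = [judged.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(judged.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments for one query
judged = {"d1": 2, "d2": 1, "d3": 0}
perfect = ndcg_at_k(["d1", "d2", "d3"], judged)
reversed_order = ndcg_at_k(["d3", "d2", "d1"], judged)
print(perfect, reversed_order)
```

The value reported for a run is the mean of this per-query score over all 50 queries.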


## Original docs

In [None]:
!python -m pyserini.search.lucene \
  --index index_original \
  --topics queries_adjusted.tsv \
  --output runs/run.original_index.bm25.txt \
  --output-format trec \
  --hits 1000 \
  --bm25 --k1 0.82 --b 0.68

Setting BM25 parameters: k1=0.82, b=0.68
Running queries_adjusted.tsv topics, saving to runs/run.original_index.bm25.txt...
100% 50/50 [00:07<00:00,  6.31it/s]


In [None]:
!python -m pyserini.eval.trec_eval -c -mndcg_cut.10 -mmap qrels_adjusted.tsv runs/run.original_index.bm25.txt

Downloading https://search.maven.org/remotecontent?filepath=uk/ac/gla/dcs/terrierteam/jtreceval/0.0.5/jtreceval-0.0.5-jar-with-dependencies.jar to /root/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar...
/root/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar already exists!
Skipping download.
Running command: ['java', '-jar', '/root/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar', '-c', '-mndcg_cut.10', '-mmap', 'qrels_adjusted.tsv', 'runs/run.original_index.bm25.txt']
Results:
map                   	all	0.1880
ndcg_cut_10           	all	0.5963
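
The two runs can be compared directly from the numbers printed above. A small sketch, with the metric values copied by hand from the two `trec_eval` outputs (the dictionary layout and the `relative_gain` helper are just for illustration):

```python
# Values copied from the trec_eval outputs of the two runs
results = {
    "original": {"map": 0.1880, "ndcg_cut_10": 0.5963},
    "augmented": {"map": 0.2109, "ndcg_cut_10": 0.6373},
}


def relative_gain(base, new):
    """Relative change of new with respect to base."""
    return (new - base) / base


gains = {}
for metric in ("map", "ndcg_cut_10"):
    base = results["original"][metric]
    new = results["augmented"][metric]
    gains[metric] = relative_gain(base, new)
    print(f"{metric}: {base:.4f} -> {new:.4f} ({new - base:+.4f}, {100 * gains[metric]:.1f}% relative)")
```

On these figures, BM25 over the doc2query-expanded documents scores about 0.04 nDCG@10 higher than BM25 over the original documents.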
