<a href="https://colab.research.google.com/github/leonardo3108/IA368dd/blob/main/exercicios/Aula_10/Aula_10_Tradeoffs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Enunciado


Exercício desta semana: Trade-offs de eficiência e qualidade

O objetivo do exercício desta semana é construir alguns pipelines de busca e analisá-los em termos das seguintes métricas:
- Qualidade dos resultados: nDCG@10;
- Latência (seg/query);
- USD por query assumindo utilização "perfeita": assim que terminou de processar -uma query, já tem outra para ser processada;
- USD/mês para deixar o sistema rodando para poucos usuários (ex: 100 queries/dia);
- Custo de indexação em USD;

Iremos avaliar os pipelines no TREC-COVID.
A latência precisa ser menor que 2 segundos por query.
Não assumir processamento de queries em batch.

Considerar:
- 1,50 USD/hora por A100 ou 0,21 USD/hora por T4 ou 0,50 USD/hora por V100
- 0,03 USD/hora por CPU core
- 0,005 USD/hora por GB de CPU RAM

Dicas:
- Utilizar modelos de busca "SOTA" já treinados no MS MARCO como parte do pipeline, como o SPLADE distil (esparso), contriever (denso), Colbert-v2 (denso), miniLM (reranker), monoT5-3B (reranker), doc2query minus-minus (expansão de documentos + filtragem com reranqueador na etapa de indexação)
- Pode usar API's como Cohere, OpenAI Embeddings

Variar parâmetros como número de documentos retornados em cada estágio. Por exemplo, BM25 retorna 1000 documentos, um modelo denso ou esparso pode franqueá-los, e passar os top 50 para o miniLM/monoT5 fazer um ranqueamento final.



# Setup

## Criação das pastas

In [1]:
!mkdir runs

## Instalação de libs

In [2]:
!pip install evaluate
!pip install faiss-cpu
!pip install pyserini
!pip install transformers
!pip install trectools

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.4/81.4 kB[0m [31m10.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting datasets>=2.0.0 (from evaluate)
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m474.6/474.6 kB[0m [31m34.9 MB/s[0m eta [36m0:00:00[0m
Collecting dill (from evaluate)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m110.5/110.5 kB[0m [31m16.0 MB/s[0m eta [36m0:00:00[0m
Collecting xxhash (from evaluate)
  Downloading xxhash-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m212.5/212.5 kB[0m [31m28.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting multiproc

## Importação de libs

In [3]:
import json
import numpy as np
import os
import pandas as pd
import time
import torch.nn.functional as F
import torch

from evaluate import load
from pyserini.search import get_topics
from pyserini.search.lucene import LuceneSearcher
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForSequenceClassification, BatchEncoding

## Carga do avaliador

In [4]:
trec_eval = load("trec_eval")

Downloading builder script:   0%|          | 0.00/5.51k [00:00<?, ?B/s]

## Avaliação do número de processadores

In [5]:
!nproc

12


## Utilização de GPUs

In [6]:
if torch.cuda.is_available(): 
   dev = "cuda:0"
else: 
   dev = "cpu"
device = torch.device(dev)
print('Using {}'.format(device))

Using cuda:0


In [7]:
if dev != 'cpu':
    !nvidia-smi

Wed May 10 02:34:18 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    53W / 400W |      3MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Preparação

## Obtenção do corpus - TREC-COVID

In [8]:
!wget -nc https://huggingface.co/datasets/BeIR/trec-covid/resolve/main/corpus.jsonl.gz

--2023-05-10 02:34:18--  https://huggingface.co/datasets/BeIR/trec-covid/resolve/main/corpus.jsonl.gz
Resolving huggingface.co (huggingface.co)... 18.155.68.116, 18.155.68.121, 18.155.68.38, ...
Connecting to huggingface.co (huggingface.co)|18.155.68.116|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/repos/a8/10/a810e88b0e7b233be82b89c1fa6ec2d75efc6d55784c2ada9dcac8434a634f3a/e9e97686e3138eaff989f67c04cd32e8f8f4c0d4857187e3f180275b23e24e85?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27corpus.jsonl.gz%3B+filename%3D%22corpus.jsonl.gz%22%3B&response-content-type=application%2Fgzip&Expires=1683938580&Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cHM6Ly9jZG4tbGZzLmh1Z2dpbmdmYWNlLmNvL3JlcG9zL2E4LzEwL2E4MTBlODhiMGU3YjIzM2JlODJiODljMWZhNmVjMmQ3NWVmYzZkNTU3ODRjMmFkYTlkY2FjODQzNGE2MzRmM2EvZTllOTc2ODZlMzEzOGVhZmY5ODlmNjdjMDRjZDMyZThmOGY0YzBkNDg1NzE4N2UzZjE4MDI3NWIyM2UyNGU4NT9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uP

In [9]:
!gzip -dv corpus.jsonl.gz

corpus.jsonl.gz:	 66.8% -- replaced with corpus.jsonl


## Extração dos textos

In [10]:
doc = {}
corpus = []
for idx, line in enumerate(open('corpus.jsonl', 'r')):
    text_data = json.loads(line)
    if ('title' in text_data) and len(text_data['title']) >= 5:
        contents = text_data['title'] + '. ' + text_data['text']
    else:
        contents = text_data['text']
    doc[text_data['_id']] = {'contents': contents, 'title': text_data['title'], 'text': text_data['text'], 'idx': idx}
    corpus.append({'id': text_data['_id'], 'contents': contents})
len(doc), len(corpus)

(171332, 171332)

In [11]:
print(doc['mq6qjs2s'])
print(corpus[170891])

{'contents': 'Mechanisms and evidence of vertical transmission of infections in pregnancy including SARS‐CoV‐2. There remain unanswered questions concerning mother‐to‐child‐transmission (MTCT) of SARS‐CoV‐2. Despite reports of neonatal COVID‐19, SARS‐CoV‐2 has not been consistently isolated in perinatal samples thus, definitive proof of transplacental infection is still lacking. To address these questions, we assessed investigative tools used to confirm maternal‐fetal infection and known protective mechanisms of the placental barrier that prevent transplacental pathogen migration. Forty studies of COVID‐19 pregnancies reviewed suggest a lack of consensus on diagnostic strategy for congenital infection. While RT‐PCR of neonatal swabs was universally performed, a wide range of clinical samples was screened including vaginal secretions (22.5%), amniotic fluid (35%), breast milk (22.5%) and umbilical cord blood. Neonatal COVID‐19 was reported in eight studies, two of which were based on th

## Obtenção de queries

In [12]:
topics = get_topics('covid-round5')
len(topics)

50

In [13]:
topics[1]

{'question': 'what is the origin of COVID-19',
 'query': 'coronavirus origin',
 'narrative': "seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans"}

In [14]:
with open('queries.tsv', 'w') as w:
    for id in sorted(topics.keys()):
        w.write(str(id) + '\t' + topics[id]['query'] + '\n')

!head queries.tsv

1	coronavirus origin
2	coronavirus response to weather changes
3	coronavirus immunity
4	how do people die from the coronavirus
5	animal models of COVID-19
6	coronavirus test rapid testing
7	serological tests for coronavirus
8	coronavirus under reporting
9	coronavirus in Canada
10	coronavirus social distancing impact


## Obtenção das avaliações (qrels)

In [15]:
!wget -nc https://huggingface.co/datasets/BeIR/trec-covid-qrels/raw/main/test.tsv

--2023-05-10 02:34:31--  https://huggingface.co/datasets/BeIR/trec-covid-qrels/raw/main/test.tsv
Resolving huggingface.co (huggingface.co)... 18.155.68.116, 18.155.68.121, 18.155.68.38, ...
Connecting to huggingface.co (huggingface.co)|18.155.68.116|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 980831 (958K) [text/plain]
Saving to: ‘test.tsv’


2023-05-10 02:34:31 (3.76 MB/s) - ‘test.tsv’ saved [980831/980831]



## Tratamento das avaliações

In [16]:
qrels = pd.read_csv('test.tsv', delimiter='\t', skiprows = 1, names=["query", "docid", "rel"])
qrels["q0"] = "q0"
qrels

Unnamed: 0,query,docid,rel,q0
0,1,005b2j4b,2,q0
1,1,00fmeepz,1,q0
2,1,g7dhmyyo,2,q0
3,1,0194oljo,1,q0
4,1,021q9884,1,q0
...,...,...,...,...
66331,50,zvop8bxh,2,q0
66332,50,zwf26o63,1,q0
66333,50,zwsvlnwe,0,q0
66334,50,zxr01yln,1,q0


In [17]:
qrels_dict = qrels.to_dict(orient="list")
qrels_dict['query'][0], qrels_dict['docid'][0], qrels_dict['rel'][0]

(1, '005b2j4b', 2)

## Classe de Dataset - Rerank

In [18]:
class DatasetQueryText(Dataset):
    def __init__(self, texts: np.ndarray, tokenizer):
      self.texts = texts
      self.tokenizer = tokenizer
      self.max_seq_length = tokenizer.model_max_length

      input_ids = []
      token_type_ids = []
      attention_masks = []
      for query, text in tqdm(texts, desc='encoding query+doc'):
          encoding = tokenizer.encode_plus(
              query, 
              text,
              add_special_tokens=True,
              max_length=self.max_seq_length,
              padding='max_length',
              return_tensors = 'pt',
              truncation=True,
              return_attention_mask=True,
              return_token_type_ids=True
          )
          input_ids.append(encoding['input_ids'].long())
          token_type_ids.append(encoding['token_type_ids'].long())
          attention_masks.append(encoding['attention_mask'].long())
      self.input_ids = torch.stack(input_ids).squeeze(1)
      self.attention_masks = torch.stack(attention_masks).squeeze(1)
      self.token_type_ids = torch.stack(token_type_ids).squeeze(1)

    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        return {
            'input_ids': self.input_ids[idx],
            'attention_mask': self.attention_masks[idx],
            'token_type_ids': self.token_type_ids[idx]
        }

## Carga do modelo - Rerank

In [19]:
model_name = 'cross-encoder/ms-marco-MiniLM-L-6-v2'
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Downloading (…)lve/main/config.json:   0%|          | 0.00/794 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/316 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

## Função - execução rerank

In [20]:
def executar_rerank(run_base, model, tokenizer, batch_size, K):
    df = pd.DataFrame(run_base, columns=["query", "q0", "docid", "rank", "score", "system"])
    df['query_text'] = df['query'].apply(lambda query: topics[query]['question']).astype(str)
    df['doc_text'] = df['docid'].apply(lambda docid: doc[docid]['contents']).astype(str)
    dataset_rerank = DatasetQueryText(texts = df[['query_text','doc_text']].values, tokenizer=tokenizer)
    dataloader_rerank = DataLoader(dataset_rerank, batch_size= 400, shuffle=False)
    scores = []
    model.eval()
    with torch.no_grad():
        for ndx, batch in tqdm(enumerate(dataloader_rerank), total=len(dataloader_rerank), mininterval=0.5, desc='reranking', disable=False):
            logits = model(**BatchEncoding(batch).to(device)).logits
            scores.extend(logits.squeeze().cpu().numpy())
    df.rename(columns={'rank': 'rank_bm25', 'score': 'score_bm25'}, inplace=True)
    df['score_rerank'] = scores
    df = df.groupby('query', group_keys=False).apply(lambda x: x.sort_values(['score_rerank'], ascending=[False]))
    df['rerank'] = df.groupby('query').cumcount() + 1
    df = df.query('rerank <= ' + str(K))
    df['system'] = df['system'] + '+rerank'
    df.rename(columns={'rerank': 'rank', 'score_rerank':'score'}, inplace=True)
    return df[["query", "q0", "docid", "rank", "score", "system"]]

## Função - cálculo custo

In [28]:
gpu = None
if dev != 'cpu':
    !nvidia-smi > 'nvidia-smi.txt'
    for line in open('nvidia-smi.txt'):
        if 'A100' in line:
            gpu = 'A100'
print('GPU:', gpu)            

GPU: A100


In [32]:
def calculate_cost(latency, device, gpu, memory):
    nproc = os.cpu_count()
    cost_query = latency * (.03 * nproc + .005 * memory)
    if gpu:
        if gpu == 'A100':
            cost_hour = 1.50
        elif gpu == 'V100':
            cost_hour = .5
        else:          # 'T4'
            cost_hour = .21
        cost_query += latency * cost_hour / 3600
    cost_month = cost_query * 100 * 30
    return cost_query, cost_month

# Execução - BM25

## Carga do índice

In [30]:
searcher = LuceneSearcher.from_prebuilt_index('beir-v1.0.0-trec-covid.flat')
searcher.num_docs

Downloading index at https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene-index.beir-v1.0.0-trec-covid.flat.20221116.505594.tar.gz...


lucene-index.beir-v1.0.0-trec-covid.flat.20221116.505594.tar.gz: 216MB [00:21, 10.8MB/s]                           


171331

## BM25 K=10

In [33]:
run = []
tempo_inicial = time.time()
for id in topics:
    query = topics[id]['question']
    hits = searcher.search(query, 10)
    for i in range(0, len(hits)):
        run.append((id, "Q0", hits[i].docid, i+1, hits[i].score, "BM25"))
duracao_bm25_10 = time.time() - tempo_inicial
print(f'{duracao_bm25_10} segundos')

0.7576761245727539 segundos


## Avaliação

In [34]:
run = pd.DataFrame(run, columns=["query", "q0", "docid", "rank", "score", "system"])
run

Unnamed: 0,query,q0,docid,rank,score,system
0,44,Q0,xfjexm5b,1,12.713800,BM25
1,44,Q0,28utunid,2,11.653200,BM25
2,44,Q0,qi1henyy,3,11.653199,BM25
3,44,Q0,qp77vl6h,4,11.350500,BM25
4,44,Q0,ugkxxaeb,5,11.312800,BM25
...,...,...,...,...,...,...
495,43,Q0,7eksp1sj,6,16.434700,BM25
496,43,Q0,lcmkribq,7,15.623700,BM25
497,43,Q0,ekajojon,8,15.335800,BM25
498,43,Q0,j0qperwz,9,15.170100,BM25


In [35]:
results_bm25 = trec_eval.compute(predictions=[run.to_dict(orient="list")], references=[qrels_dict])
print()
print()
print(f"nDCG@10: {results_bm25['NDCG@10']}")

To preserve the previous behavior, use

	>>> .groupby(..., group_keys=False)


	>>> .groupby(..., group_keys=True)
  topX = topX.groupby("query").apply(lambda x: x.head(n_relevant_docs.loc[x.name])).reset_index(drop=True)




nDCG@10: 0.5946917010118077


In [36]:
latency = duracao_bm25_10 / len(topics)
cost_query, cost_month = calculate_cost(latency, 'cpu', None, .3)

In [37]:
lista_resultados = []
lista_resultados.append({'pipeline': 'bm25', 'ndcg':round(results_bm25['NDCG@10'],4), 'latency': latency,
                         'cost/query': cost_query, 'cost/month': cost_month, 'indexing': 0})
lista_resultados

[{'retriever': 'bm25',
  'ndcg': 0.5947,
  'latency': 0.015153522491455079,
  'cost/query': 0.005477998380661011,
  'cost/month': 16.433995141983033,
  'indexing': 0}]

# Execução - BM25 + Rerank

## BM25 K=1000

In [38]:
run = []
tempo_inicial = time.time()
for id in topics:
    query = topics[id]['question']
    hits = searcher.search(query, 1000)
    for i in range(0, len(hits)):
        run.append((id, "Q0", hits[i].docid, i+1, hits[i].score, "BM25"))
#duracao_bm25_1k = time.time() - tempo_inicial
#print(f'\n{duracao_bm25_1k} segundos')

## Rerank

In [39]:
df_rerank = executar_rerank(run, model, tokenizer, 400, 10)
duracao_rerank = time.time() - tempo_inicial
#duracao_rerank = duracao_bm25_1k + time.time() - tempo_inicial
print(f'{duracao_rerank} segundos')

encoding query+doc: 100%|██████████| 50000/50000 [01:23<00:00, 601.53it/s]
reranking: 100%|██████████| 125/125 [01:11<00:00,  1.75it/s]

159.54768419265747 segundos



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['system'] = df['system'] + '+rerank'
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.rename(columns={'rerank': 'rank', 'score_rerank':'score'}, inplace=True)


In [40]:
df_rerank

Unnamed: 0,query,q0,docid,rank,score,system
17057,1,Q0,4dtk1kyh,1,9.036143,BM25+rerank
17074,1,Q0,deee71uw,2,8.527672,BM25+rerank
17078,1,Q0,utsr0zv7,3,8.175335,BM25+rerank
17224,1,Q0,1mjaycee,4,8.066898,BM25+rerank
17770,1,Q0,v99vlnox,5,8.007099,BM25+rerank
...,...,...,...,...,...,...
6024,50,Q0,aju2nr9x,6,4.930460,BM25+rerank
6064,50,Q0,g4qak0bu,7,4.777392,BM25+rerank
6372,50,Q0,6m9llyta,8,4.666793,BM25+rerank
6214,50,Q0,7qd8z5e7,9,4.654888,BM25+rerank


## Avaliação

In [42]:
results_rerank = trec_eval.compute(predictions=[df_rerank.to_dict(orient="list")], references=[qrels_dict])
print()
print()
print(f"nDCG@10: {results_rerank['NDCG@10']}")

To preserve the previous behavior, use

	>>> .groupby(..., group_keys=False)


	>>> .groupby(..., group_keys=True)
  topX = topX.groupby("query").apply(lambda x: x.head(n_relevant_docs.loc[x.name])).reset_index(drop=True)




nDCG@10: 0.7069302977414234


In [41]:
# assumindo 2 CPUs, 2.5 GB de RAM
latency = duracao_rerank / len(topics)
cost_query, cost_month = calculate_cost(latency, dev, gpu, 4)

In [43]:
lista_resultados = lista_resultados[0:1]
lista_resultados.append({'pipeline': 'bm25+rerank', 'ndcg':round(results_rerank['NDCG@10'],4), 'latency': latency,
                         'cost/query': cost_query, 'cost/month': cost_month, 'indexing': 0})
lista_resultados

[{'retriever': 'bm25',
  'ndcg': 0.5947,
  'latency': 0.015153522491455079,
  'cost/query': 0.005477998380661011,
  'cost/month': 16.433995141983033,
  'indexing': 0},
 {'retriever': 'bm25+rerank',
  'ndcg': 0.7069,
  'latency': 3.1909536838531496,
  'cost/query': 1.1548593207478524,
  'cost/month': 3464.577962243557,
  'indexing': 0}]

# Doc2Query - Preparação

## Obtenção das queries

In [44]:
#Arquivo obtido por upload manual na VM

generated_queries = []
for line in open('generated_queries.txt'):
    generated_queries.append(line)
len(generated_queries)

171332

## Geração dos documentos aumentados

In [45]:
for idx, query in enumerate(generated_queries):
    corpus[idx]['contents'] = corpus[idx]['contents'] + ' \n' + query.rstrip()

In [46]:
generated_queries[170891], corpus[170891]

('what is the evidence of transplacental transmission of sars\n',
 {'id': 'mq6qjs2s',
  'contents': 'Mechanisms and evidence of vertical transmission of infections in pregnancy including SARS‐CoV‐2. There remain unanswered questions concerning mother‐to‐child‐transmission (MTCT) of SARS‐CoV‐2. Despite reports of neonatal COVID‐19, SARS‐CoV‐2 has not been consistently isolated in perinatal samples thus, definitive proof of transplacental infection is still lacking. To address these questions, we assessed investigative tools used to confirm maternal‐fetal infection and known protective mechanisms of the placental barrier that prevent transplacental pathogen migration. Forty studies of COVID‐19 pregnancies reviewed suggest a lack of consensus on diagnostic strategy for congenital infection. While RT‐PCR of neonatal swabs was universally performed, a wide range of clinical samples was screened including vaginal secretions (22.5%), amniotic fluid (35%), breast milk (22.5%) and umbilical cor

In [47]:
!mkdir corpus

In [48]:
with open('corpus/augmented_corpus.jsonl', 'w') as fout:
    for doc_data in corpus:
        fout.write(json.dumps(doc_data, ensure_ascii=True))
        fout.write('\n')

## Indexação

In [49]:
tempo_inicial = time.time()
!python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input corpus \
  --index index \
  --generator DefaultLuceneDocumentGenerator \
  --threads 1 \
  --storePositions --storeDocvectors --storeRaw
duracao_indexing_doc2query = time.time() - tempo_inicial
print(f'\n{duracao_indexing_doc2query} segundos')

2023-05-10 02:43:27,636 INFO  [main] index.IndexCollection (IndexCollection.java:380) - Setting log level to INFO
2023-05-10 02:43:27,637 INFO  [main] index.IndexCollection (IndexCollection.java:383) - Starting indexer...
2023-05-10 02:43:27,638 INFO  [main] index.IndexCollection (IndexCollection.java:385) - DocumentCollection path: corpus
2023-05-10 02:43:27,638 INFO  [main] index.IndexCollection (IndexCollection.java:386) - CollectionClass: JsonCollection
2023-05-10 02:43:27,638 INFO  [main] index.IndexCollection (IndexCollection.java:387) - Generator: DefaultLuceneDocumentGenerator
2023-05-10 02:43:27,639 INFO  [main] index.IndexCollection (IndexCollection.java:388) - Threads: 1
2023-05-10 02:43:27,639 INFO  [main] index.IndexCollection (IndexCollection.java:389) - Language: en
2023-05-10 02:43:27,639 INFO  [main] index.IndexCollection (IndexCollection.java:390) - Stemmer: porter
2023-05-10 02:43:27,639 INFO  [main] index.IndexCollection (IndexCollection.java:391) - Keep stopwords? 

# Execução - Doc2query + BM25

## Execução

In [50]:
tempo_inicial = time.time()
!python -m pyserini.search.lucene \
  --index index \
  --topics queries.tsv \
  --output run.augmented_index.bm25.10.txt \
  --output-format trec \
  --hits 10 \
  --bm25 --k1 0.82 --b 0.68
duracao_doc2query_bm25_10 = time.time() - tempo_inicial
print(f'\n{duracao_doc2query_bm25_10} segundos')

Setting BM25 parameters: k1=0.82, b=0.68
Running queries.tsv topics, saving to run.augmented_index.bm25.10.txt...
100% 50/50 [00:00<00:00, 112.96it/s]

6.430842876434326 segundos


## Avaliação

In [51]:
run = pd.read_csv('run.augmented_index.bm25.10.txt', names=["query", "q0", "docid", "rank", "score", "system"], sep=' ')
run

Unnamed: 0,query,q0,docid,rank,score,system
0,1,Q0,pl48ev5o,1,4.372200,Anserini
1,1,Q0,irkjiqll,2,4.363800,Anserini
2,1,Q0,k86pf2yf,3,4.363799,Anserini
3,1,Q0,h8ahn8fw,4,4.355500,Anserini
4,1,Q0,75773gwg,5,4.354300,Anserini
...,...,...,...,...,...,...
495,50,Q0,ptvsie6m,6,6.722100,Anserini
496,50,Q0,0fx1b7ph,7,6.512500,Anserini
497,50,Q0,7u6ofjul,8,6.345700,Anserini
498,50,Q0,akbq0ogs,9,6.281400,Anserini


In [52]:
results_bm25_d2q = trec_eval.compute(predictions=[run.to_dict(orient="list")], references=[qrels_dict])
print()
print()
print(f"nDCG@10: {results_bm25_d2q['NDCG@10']}")

To preserve the previous behavior, use

	>>> .groupby(..., group_keys=False)


	>>> .groupby(..., group_keys=True)
  topX = topX.groupby("query").apply(lambda x: x.head(n_relevant_docs.loc[x.name])).reset_index(drop=True)




nDCG@10: 0.5992369520604796


In [53]:
latency = duracao_doc2query_bm25_10 / len(topics)
cost_query, cost_month = calculate_cost(latency, 'cpu', None, .3)
index_cost, _ = calculate_cost(duracao_indexing_doc2query, 'cpu', None, .5)

In [55]:
lista_resultados = lista_resultados[0:2]
lista_resultados.append({'pipeline': 'doc2query+bm25', 'ndcg':round(results_bm25_d2q['NDCG@10'],4), 'latency': latency,
                         'cost/query': cost_query, 'cost/month': cost_month, 'indexing': index_cost})
lista_resultados

[{'retriever': 'bm25',
  'ndcg': 0.5947,
  'latency': 0.015153522491455079,
  'cost/query': 0.005477998380661011,
  'cost/month': 16.433995141983033,
  'indexing': 0},
 {'retriever': 'bm25+rerank',
  'ndcg': 0.7069,
  'latency': 3.1909536838531496,
  'cost/query': 1.1548593207478524,
  'cost/month': 3464.577962243557,
  'indexing': 0},
 {'retriever': 'doc2query+bm25',
  'ndcg': 0.5992,
  'latency': 0.1286168575286865,
  'cost/query': 0.046494993996620174,
  'cost/month': 139.48498198986053,
  'indexing': 14.259586738944053}]

# Execução Doc2query + BM25 + Rerank

## Execução BM25 K=1000

In [56]:
tempo_inicial = time.time()
!python -m pyserini.search.lucene \
  --index index \
  --topics queries.tsv \
  --output run.augmented_index.bm25.1k.txt \
  --output-format trec \
  --hits 1000 \
  --bm25 --k1 0.82 --b 0.68

Setting BM25 parameters: k1=0.82, b=0.68
Running queries.tsv topics, saving to run.augmented_index.bm25.1k.txt...
100% 50/50 [00:03<00:00, 12.55it/s]


In [57]:
run_bm25 = pd.read_csv('run.augmented_index.bm25.1k.txt', names=["query", "q0", "docid", "rank", "score", "system"], sep=' ')
run_bm25

Unnamed: 0,query,q0,docid,rank,score,system
0,1,Q0,pl48ev5o,1,4.372200,Anserini
1,1,Q0,irkjiqll,2,4.363800,Anserini
2,1,Q0,k86pf2yf,3,4.363799,Anserini
3,1,Q0,h8ahn8fw,4,4.355500,Anserini
4,1,Q0,75773gwg,5,4.354300,Anserini
...,...,...,...,...,...,...
49995,50,Q0,8apepcdk,996,3.470700,Anserini
49996,50,Q0,8vt87jon,997,3.469600,Anserini
49997,50,Q0,q8gxbwfj,998,3.469400,Anserini
49998,50,Q0,1aqt65cc,999,3.469399,Anserini


In [58]:
df_rerank = executar_rerank(run_bm25, model, tokenizer, 400, 10)
duracao_d2q_rerank = time.time() - tempo_inicial
print(f'{duracao_d2q_rerank} segundos')

encoding query+doc: 100%|██████████| 50000/50000 [01:18<00:00, 633.19it/s]
reranking: 100%|██████████| 125/125 [01:08<00:00,  1.82it/s]

157.8170280456543 segundos



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['system'] = df['system'] + '+rerank'
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.rename(columns={'rerank': 'rank', 'score_rerank':'score'}, inplace=True)


In [59]:
df_rerank

Unnamed: 0,query,q0,docid,rank,score,system
18,1,Q0,4dtk1kyh,1,9.036143,Anserini+rerank
179,1,Q0,deee71uw,2,8.527672,Anserini+rerank
16,1,Q0,utsr0zv7,3,8.175335,Anserini+rerank
88,1,Q0,1mjaycee,4,8.066898,Anserini+rerank
628,1,Q0,v99vlnox,5,8.007099,Anserini+rerank
...,...,...,...,...,...,...
49018,50,Q0,aju2nr9x,6,4.930460,Anserini+rerank
49079,50,Q0,7qd8z5e7,7,4.654888,Anserini+rerank
49122,50,Q0,qq22z25y,8,4.604731,Anserini+rerank
49334,50,Q0,yrrz7oef,9,4.543234,Anserini+rerank


## Avaliação

In [60]:
results_d2q_rerank = trec_eval.compute(predictions=[df_rerank.to_dict(orient="list")], references=[qrels_dict])
print()
print()
print(f"nDCG@10: {results_d2q_rerank['NDCG@10']}")

To preserve the previous behavior, use

	>>> .groupby(..., group_keys=False)


	>>> .groupby(..., group_keys=True)
  topX = topX.groupby("query").apply(lambda x: x.head(n_relevant_docs.loc[x.name])).reset_index(drop=True)




nDCG@10: 0.7326663724773185


In [61]:
# assumindo 2 CPUs, 2.5 GB de RAM
latency = duracao_d2q_rerank / len(topics)
cost_query, cost_month = calculate_cost(latency, dev, gpu, 4)

In [62]:
lista_resultados = lista_resultados[0:3]
lista_resultados.append({'pipeline': 'doc2query+bm25+rerank', 'ndcg':round(results_rerank['NDCG@10'],4), 'latency': latency,
                         'cost/query': cost_query, 'cost/month': cost_month, 'indexing': index_cost})
lista_resultados

[{'retriever': 'bm25',
  'ndcg': 0.5947,
  'latency': 0.015153522491455079,
  'cost/query': 0.005477998380661011,
  'cost/month': 16.433995141983033,
  'indexing': 0},
 {'retriever': 'bm25+rerank',
  'ndcg': 0.7069,
  'latency': 3.1909536838531496,
  'cost/query': 1.1548593207478524,
  'cost/month': 3464.577962243557,
  'indexing': 0},
 {'retriever': 'doc2query+bm25',
  'ndcg': 0.5992,
  'latency': 0.1286168575286865,
  'cost/query': 0.046494993996620174,
  'cost/month': 139.48498198986053,
  'indexing': 14.259586738944053},
 {'retriever': 'doc2query+bm25+rerank',
  'ndcg': 0.7069,
  'latency': 3.156340560913086,
  'cost/query': 1.200724555047353,
  'cost/month': 3602.1736651420592,
  'indexing': 14.259586738944053}]