<a href="https://colab.research.google.com/github/leonardo3108/IA368dd/blob/main/exercicios/Aula_7/Aula_7_DPR_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
nome = 'Leonardo Augusto da Silva Pacheco'
print(f'Meu nome é {nome}')

Meu nome é Leonardo Augusto da Silva Pacheco


# Assignment - Fine-tune a dense retriever

Use the "tiny" MS MARCO dataset for training:
https://storage.googleapis.com/unicamp-dl/ia368dd_2023s1/msmarco/msmarco_triples.train.tiny.tsv

Evaluate the model on TREC-COVID and compare the results with BM25 and doc2query.

Compare "exhaustive" search (similarity between the query vector and every vector in the corpus) with approximate search (Approximate Nearest Neighbor, ANN).

For approximate search, use the algorithms available in the sentence-transformers library (e.g., hnswlib) OR implement one yourself (bonus!).

Hints:

- Use the mean of the transformer's last-layer vectors (known as mean pooling) to represent queries and passages; alternatively, use only the [CLS] vector of the last layer.

- Start with a loss that is easy to implement, such as cross-entropy.

- Start training from microsoft/MiniLM-L12-H384-uncased.

- Evaluate the pipeline with an already well-trained model: sentence-transformers/all-mpnet-base-v2.

- Compare results using cosine similarity and dot product as similarity functions.

- To check that your evaluation code is correct, compare your performance with that of the model already trained on MS MARCO: https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2; its nDCG@10 on TREC-COVID should be ~0.47.

- Use the sentence-transformers library to evaluate the model.
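
The pooling hint above contrasts mean pooling with taking only the [CLS] vector. A minimal sketch of both follows, using NumPy arrays in place of real transformer outputs so that it runs without downloading a model. The names `token_vectors` and `attention_mask` are hypothetical stand-ins for the model's `last_hidden_state` and the tokenizer's attention mask; the same arithmetic would apply to tensors.

```python
import numpy as np

def mean_pooling(token_vectors, attention_mask):
    """Average the token vectors of each sequence, ignoring padding positions.

    token_vectors: (batch, seq_len, hidden); attention_mask: (batch, seq_len) of 0/1.
    """
    mask = attention_mask[:, :, None].astype(token_vectors.dtype)
    summed = (token_vectors * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

def cls_pooling(token_vectors):
    """Use only the vector at position 0 (the [CLS] token)."""
    return token_vectors[:, 0, :]

# Toy batch: 2 sequences, 3 token positions, hidden size 2.
token_vectors = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

mean_vectors = mean_pooling(token_vectors, attention_mask)
cls_vectors = cls_pooling(token_vectors)
```

The padded position in the first sequence does not affect its mean, which is the reason for multiplying by the mask before summing.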


# Setup

## Google Drive integration

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Hyperparameters

In [3]:
max_length = 256
batch_size = 32
tokenizer_name = "microsoft/MiniLM-L12-H384-uncased"

## Local copy of the models

In [4]:
!mkdir -p model_dir/passages model_dir/queries

!cp /content/drive/MyDrive/temp/passages/* model_dir/passages/
!cp /content/drive/MyDrive/temp/queries/*  model_dir/queries/

## Installing libraries

In [5]:
!pip install transformers
!pip install datasets
!pip install pyserini
!pip install faiss-gpu
!pip install sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.4-py3-none-any.whl (200 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.13.4 tokenizers-0.13.3 transformers-4.28.1
Looking in indexes: https://pypi.org/simple, htt

## Importing libraries

In [6]:
import numpy as np
import json
import torch
import os
import torch.nn.functional as F
from torch.utils.data import DataLoader
from transformers import AutoModel, AutoTokenizer, BatchEncoding
from sentence_transformers import SentenceTransformer, util
from tqdm.auto import tqdm

## Seeds

In [7]:
np.random.seed(42)
torch.manual_seed(42)

## GPU usage

In [8]:
if torch.cuda.is_available(): 
   dev = "cuda:0"
else: 
   dev = "cpu"
device = torch.device(dev)
print('Using {}'.format(device))

Using cuda:0


In [9]:
if dev != 'cpu':
    !nvidia-smi

Thu Apr 20 00:35:45 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   56C    P8    10W /  70W |      3MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Data preparation

## Downloading TREC-COVID

In [10]:
!wget -nc 'https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/trec-covid.zip'

--2023-04-20 00:35:45--  https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/trec-covid.zip
Resolving public.ukp.informatik.tu-darmstadt.de (public.ukp.informatik.tu-darmstadt.de)... 130.83.167.186
Connecting to public.ukp.informatik.tu-darmstadt.de (public.ukp.informatik.tu-darmstadt.de)|130.83.167.186|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 73876720 (70M) [application/zip]
Saving to: ‘trec-covid.zip’


2023-04-20 00:35:53 (11.8 MB/s) - ‘trec-covid.zip’ saved [73876720/73876720]



In [11]:
!unzip -n trec-covid.zip

Archive:  trec-covid.zip
   creating: trec-covid/
   creating: trec-covid/qrels/
  inflating: trec-covid/qrels/test.tsv  
  inflating: trec-covid/corpus.jsonl  
  inflating: trec-covid/queries.jsonl  


## Converting the qrels to the expected format

In [12]:
with open('trec-covid/qrels/test.tsv', 'r') as fin:
    data = fin.read().splitlines(True)

data[:11]

['query-id\tcorpus-id\tscore\n',
 '1\t005b2j4b\t2\n',
 '1\t00fmeepz\t1\n',
 '1\tg7dhmyyo\t2\n',
 '1\t0194oljo\t1\n',
 '1\t021q9884\t1\n',
 '1\t02f0opkr\t1\n',
 '1\t047xpt2c\t0\n',
 '1\t04ftw7k9\t0\n',
 '1\tpl9ht0d0\t0\n',
 '1\t05vx82oo\t0\n']

In [13]:
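with open('trec-covid/qrels/test_adjusted.tsv', 'w') as fout:
    # Skip the header row ('query-id', 'corpus-id', 'score'); it is not a qrels entry.
    for line in data[1:]:
        fields = line.split()
        # TREC qrels layout: query id, iteration (fixed at 0), document id, relevance.
        fout.write(f'{fields[0]}\t0\t{fields[1]}\t{fields[2]}\n')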
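
The cell above maps BEIR's three-column qrels (`query-id`, `corpus-id`, `score`) to the four-column TREC layout (query id, iteration, document id, relevance) expected by the evaluation tool. A small sketch of the same mapping as a function that also drops the header row; `beir_to_trec_qrels` is a hypothetical helper name, not part of any library.

```python
def beir_to_trec_qrels(lines):
    """Convert BEIR qrels lines (query-id, corpus-id, score) to TREC qrels lines."""
    converted = []
    for line in lines:
        fields = line.split()
        # Skip blank or malformed lines and the header row.
        if len(fields) != 3 or fields[0] == 'query-id':
            continue
        query_id, doc_id, relevance = fields
        # The second TREC column (iteration) is conventionally written as 0.
        converted.append(f'{query_id}\t0\t{doc_id}\t{relevance}')
    return converted

sample = ['query-id\tcorpus-id\tscore\n', '1\t005b2j4b\t2\n', '1\t047xpt2c\t0\n']
trec_lines = beir_to_trec_qrels(sample)
```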

## Extracting the corpus

In [14]:
corpus = []
with open('trec-covid/corpus.jsonl') as fin:
    for line in fin:
        doc = json.loads(line)
        # Each document is represented by its title followed by its text.
        corpus.append((doc['_id'], f"{doc['title']} {doc['text']}"))
print(len(corpus), 'documents parsed. First 10:')
for doc in corpus[:10]:
    print(doc)

171332 documents parsed. First 10:
('ug7v899j', 'Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified, 33 (82.5%) of whom required admission. Most infections (92.5%) were community-acquired. The infection affected all age groups but was most common in infants (32.5%) and pre-school children (22.5%). It occurred year-round but was most common in the fall (35%) and spring (30%). More than three-quarters of patients (77.5%) had comorbidities. Twenty-four isol

## Extracting the queries

In [15]:
queries = []
with open('trec-covid/queries.jsonl') as fin:
    for line in fin:
        query = json.loads(line)
        queries.append({'id': query['_id'], 'text': query['text']})
print(len(queries), 'queries parsed:')
for query in queries:
    print(query)

50 queries parsed:
{'id': '1', 'text': 'what is the origin of COVID-19'}
{'id': '2', 'text': 'how does the coronavirus respond to changes in the weather'}
{'id': '3', 'text': 'will SARS-CoV2 infected people develop immunity? Is cross protection possible?'}
{'id': '4', 'text': 'what causes death from Covid-19?'}
{'id': '5', 'text': 'what drugs have been active against SARS-CoV or SARS-CoV-2 in animal studies?'}
{'id': '6', 'text': 'what types of rapid testing for Covid-19 have been developed?'}
{'id': '7', 'text': 'are there serological tests that detect antibodies to coronavirus?'}
{'id': '8', 'text': 'how has lack of testing availability led to underreporting of true incidence of Covid-19?'}
{'id': '9', 'text': 'how has COVID-19 affected Canada'}
{'id': '10', 'text': 'has social distancing had an impact on slowing the spread of COVID-19?'}
{'id': '11', 'text': 'what are the guidelines for triaging patients infected with coronavirus?'}
{'id': '12', 'text': 'what are best practices in h

## Tokenizer

In [16]:
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

## Dataset class

In [17]:
class DatasetDPR(torch.utils.data.Dataset):
    """Tokenizes the text of (id, text) pairs on demand."""
    def __init__(self, tokenizer, texts, max_seq_length = max_length):
        self.max_seq_length = max_seq_length
        self.tokenizer = tokenizer
        self.texts = texts
    def __len__(self):
        return len(self.texts)
    def __getitem__(self, idx):
        # Only truncate here; padding to a common length is done per batch in collate_fn.
        return self.tokenizer(self.texts[idx][1], truncation=True, max_length=self.max_seq_length)

In [18]:
def collate_fn(batch):
    return BatchEncoding(tokenizer.pad(batch, return_tensors='pt'))
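
`collate_fn` relies on `tokenizer.pad` to bring every sequence in a batch to the same length. A rough pure-Python sketch of that idea, not the library's implementation; `pad_batch` and the default `pad_token_id=0` are assumptions made for illustration.

```python
def pad_batch(sequences, pad_token_id=0):
    """Pad token-id lists to the longest sequence in the batch and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = longest - len(seq)
        input_ids.append(seq + [pad_token_id] * padding)
        # 1 marks a real token, 0 marks a padding position.
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

batch = pad_batch([[101, 2054, 102], [101, 102]])
```

Padding per batch rather than to `max_length` keeps batches of short texts small, which matters when encoding the whole corpus.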

## Dataset and DataLoader

In [19]:
dataset_trec_covid = DatasetDPR(tokenizer, corpus)
dataloader_trec_covid = DataLoader(dataset_trec_covid, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)

In [20]:
print(corpus[0])
print(dataset_trec_covid[0])
print('decode:', ' '.join(tokenizer.batch_decode(dataset_trec_covid[0]['input_ids'], skip_special_tokens=True)).replace(' ##', ''))

('ug7v899j', 'Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified, 33 (82.5%) of whom required admission. Most infections (92.5%) were community-acquired. The infection affected all age groups but was most common in infants (32.5%) and pre-school children (22.5%). It occurred year-round but was most common in the fall (35%) and spring (30%). More than three-quarters of patients (77.5%) had comorbidities. Twenty-four isolates (60%) were associated with pne

In [21]:
len(dataset_trec_covid), len(dataset_trec_covid[0]['input_ids']), len(dataloader_trec_covid), batch_size

(171332, 256, 5355, 32)

# Generating the dense vectors

## Loading the pretrained passage model

In [22]:
model_passages = AutoModel.from_pretrained('model_dir/passages').to(device)

## Vectors - Corpus

In [23]:
model_passages.eval()

# Collect the per-batch embeddings and concatenate once at the end,
# which avoids re-copying the growing matrix on every iteration.
batch_vectors = []
with torch.no_grad():
    for batch in tqdm(dataloader_trec_covid, mininterval=0.5, desc='Extraindo vetores dos documentos trec-covid', disable=False):
        outputs_passages = model_passages(**batch.to(device))
        # [CLS] vector of the last layer, L2-normalized so that the dot product equals cosine similarity.
        tcls_passages = outputs_passages.last_hidden_state[:, 0, :]
        batch_vectors.append(F.normalize(tcls_passages, dim=1))

matrix_trec_covid = torch.cat(batch_vectors, dim=0)
print(matrix_trec_covid.size())

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


torch.Size([171332, 384])


In [24]:
matrix_trec_covid[0]

tensor([-1.8228e-02, -8.2913e-03,  6.7667e-03, -3.3719e-02,  5.8914e-02,
        -5.0007e-02, -1.3450e-02,  4.8622e-02,  9.0097e-03,  3.6606e-03,
         2.5577e-02,  4.7016e-03,  3.6636e-02, -5.1368e-03, -8.9948e-02,
        -4.4101e-03,  2.3629e-02, -3.4375e-02, -3.2875e-02,  8.9923e-03,
        -2.4711e-02,  2.5858e-02, -2.5893e-02, -2.2126e-02,  2.7185e-02,
         3.6465e-02,  1.3539e-02, -5.8188e-02, -2.7350e-02, -2.0427e-01,
         1.2730e-02,  1.2565e-02,  1.6696e-02, -2.9886e-02,  5.4365e-02,
         6.3844e-03, -2.1583e-02, -3.8462e-02,  8.2104e-03,  4.8611e-02,
        -1.3951e-02,  1.3109e-02,  1.5594e-02,  2.4341e-02, -3.3577e-02,
        -4.6783e-02, -2.5867e-02,  7.0238e-04,  4.5544e-02,  2.7213e-02,
        -5.0063e-02, -3.0408e-02, -2.5205e-02,  4.6188e-02, -5.3137e-02,
         1.3053e-02, -4.6701e-02, -5.9392e-03, -4.2127e-02,  4.1899e-02,
         8.5847e-03, -4.0065e-03, -1.1671e-01,  6.9298e-02, -3.5188e-02,
         1.3782e-02, -7.7020e-03, -2.9861e-02,  4.3

In [25]:
#torch.save(matrix_trec_covid, 'matriz_docs_trec_covid.pt')
#torch.save(matrix_trec_covid, 'drive/My Drive/temp/matriz_docs_trec_covid.pt')
#matrix_trec_covid = torch.load('matriz_docs_trec_covid.pt').to(device)

## Loading the pretrained query model

In [26]:
model_queries  = AutoModel.from_pretrained('model_dir/queries').to(device)

## Vectors - Queries

In [27]:
for query in queries:
    query_tokens = tokenizer(query['text'], padding=True, truncation=True, max_length=max_length, return_tensors='pt')
    with torch.no_grad():
        output_query = model_queries(**query_tokens.to(device))
        tcls_query    = output_query.last_hidden_state[:, 0, :]
        nt_cls_query    = F.normalize(tcls_query.squeeze(), dim=0)
    query['dpr'] = nt_cls_query

print('Example:')
print(f"id: {queries[0]['id']} /  text: {queries[0]['text']} /  dpr: {queries[0]['dpr'].size()}")

Example:
id: 1 /  text: what is the origin of COVID-19 /  dpr: torch.Size([384])
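
Both the passage and the query vectors are L2-normalized before being stored, so the dot product used in the search below coincides with cosine similarity. A small NumPy check of that identity on made-up vectors (the helper names are illustrative only):

```python
import numpy as np

def l2_normalize(vectors):
    """Scale each vector (along the last axis) to unit Euclidean length."""
    norms = np.linalg.norm(vectors, axis=-1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([3.0, 4.0])
doc = np.array([4.0, 3.0])

dot_raw = float(np.dot(query, doc))  # depends on the vector lengths
cos_raw = cosine_similarity(query, doc)
dot_normalized = float(np.dot(l2_normalize(query), l2_normalize(doc)))
```

With unnormalized vectors the two similarity functions can rank documents differently, which is why the assignment asks for both to be compared.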


# Exhaustive search

## Initial test

In [28]:
print('Examples:')
for query in queries[:5]:
    print(query['text'])
    score = torch.matmul(matrix_trec_covid, query['dpr'])
    sorted_score, indices_score = torch.sort(score, descending=True)
    for idx in indices_score[:5]:
        print('\t', corpus[idx])

Examples:
what is the origin of COVID-19
	 ('u7u75sl0', 'Strategies to trace back the origin of COVID-19 ')
	 ('tpwzwfkv', 'The many uncertainties of COVID-19. ')
	 ('hb4oua9h', 'Technology in the COVID-19 era: pushing the boundaries ')
	 ('2mw6myt8', 'Continuing Professional Development in the Era of COVID-19. ')
	 ('fpezij7o', 'Spreading of COVID-19 in a Common Place: Medical Architecture Analysis ')
how does the coronavirus respond to changes in the weather
	 ('hyyw23mz', 'Reducing risks from coronavirus transmission in the home-the role of viral load. ')
	 ('gctnx6j1', 'What next for the coronavirus response? ')
	 ('ek7114i3', 'Corona Virus International Public Health Emergencies: Implications for Radiology Management The outbreak of 2019 novel coronavirus (2019-nCoV) pneumonia was reported in Wuhan, Hubei Province, China in December 2019 and has spread internationally. This article discusses how radiology departments can most effectively respond to this public health emergency.')


## Search function

In [29]:
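def search_dpr(query, n = 1000):
    # Both sides are L2-normalized, so this dot product is the cosine similarity.
    score = torch.matmul(matrix_trec_covid, query['dpr'])
    # topk returns the n best scores already sorted in descending order.
    top_score, top_indices = torch.topk(score, k=min(n, score.numel()))
    ids_docs = [corpus[idx][0] for idx in top_indices.tolist()]
    return zip(ids_docs, top_score.tolist())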

## Running the searches

In [30]:
with open('run-dpr-exaustiva.txt', 'w') as runfile:
    for query in queries:
        # TREC run line: query id, Q0, document id, rank, score, run tag.
        for i, (id_doc, score) in enumerate(search_dpr(query)):
            runfile.write(f"{query['id']} Q0 {id_doc} {i+1} {float(score):.6f} Pesquisa_densa\n")

## Evaluation

In [31]:
!python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 trec-covid/qrels/test_adjusted.tsv run-dpr-exaustiva.txt

Downloading https://search.maven.org/remotecontent?filepath=uk/ac/gla/dcs/terrierteam/jtreceval/0.0.5/jtreceval-0.0.5-jar-with-dependencies.jar to /root/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar...
jtreceval-0.0.5-jar-with-dependencies.jar: 1.79MB [00:02, 717kB/s]                 
Running command: ['java', '-jar', '/root/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar', '-c', '-m', 'ndcg_cut.10', 'trec-covid/qrels/test_adjusted.tsv', 'run-dpr-exaustiva.txt']
Results:
ndcg_cut_10           	all	0.2717
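
For reference, a minimal sketch of how nDCG@k can be computed for a single query, assuming linear gains (the relevance grade itself) and a log2 rank discount. `ndcg_at_k` is a hypothetical helper, and the evaluation tool remains the authoritative source for the number reported above.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """qrels maps doc id -> graded relevance for one query; unjudged docs count as 0."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

qrels_q1 = {'005b2j4b': 2, '00fmeepz': 1, '047xpt2c': 0}
perfect = ndcg_at_k(['005b2j4b', '00fmeepz', '047xpt2c'], qrels_q1)
swapped = ndcg_at_k(['00fmeepz', '005b2j4b', '047xpt2c'], qrels_q1)
```

Placing the grade-1 document ahead of the grade-2 document lowers the score below 1, because the higher grade is discounted more heavily at rank 2.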


# sentence-transformers/all-mpnet-base-v2

In [33]:
ids_trec_covid, docs_trec_covid = zip(*corpus)
len(ids_trec_covid), len(docs_trec_covid)

(171332, 171332)

In [34]:
!pip install sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125942 sha256=ae912098932f3ddf3033127f2df371cf16182556ea0802bc29063a19173191c1
  Stored in directory: /root/.cache/pip/wheels/71/67/06/162a3760c40d74dd40bc855d527008d26341c2b0ecf3e8e11f
Successfully built sentence-transformers
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-2.2.2


In [36]:
from sentence_transformers import SentenceTransformer, util

model_mpnet = SentenceTransformer('all-mpnet-base-v2').to(device)

docs_embeddings = model_mpnet.encode(docs_trec_covid)

torch.save(docs_embeddings, 'docs_embeddings-all-mpnet-base-v2.pt')
#docs_embeddings = torch.load('docs_embeddings-all-mpnet-base-v2.pt')

## Generating the vectors - Queries

In [39]:
for query in queries:
    query['dpr_mnet'] = model_mpnet.encode(query['text'])

print('Example:')
print(f"id: {queries[0]['id']} /  text: {queries[0]['text']} /  dpr: {len(queries[0]['dpr_mnet'])}")

Example:
id: 1 /  text: what is the origin of COVID-19 /  dpr: 768


## Search function

In [48]:
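def search_dpr_aprox(query, n = 1000):
    # Despite the "aprox" name, this scores the query against every document embedding
    # (exhaustive dot-product search); no ANN index is involved.
    score = util.dot_score(query['dpr_mnet'], docs_embeddings).squeeze()
    sorted_score, indices_score = torch.sort(score, descending=True)
    ids_docs = [corpus[idx][0] for idx in indices_score[:n]]
    return zip(ids_docs, sorted_score[:n])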

## Running the searches

In [49]:
with open('run-dpr-aproximada.txt', 'w') as runfile:
    for query in queries:
        for i, (id_doc, score) in enumerate(search_dpr_aprox(query)):
            runfile.write(f"{query['id']} Q0 {id_doc} {i+1} {float(score):.6f} Pesquisa_densa\n")

## Evaluation

In [50]:
!python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 trec-covid/qrels/test_adjusted.tsv run-dpr-aproximada.txt

Downloading https://search.maven.org/remotecontent?filepath=uk/ac/gla/dcs/terrierteam/jtreceval/0.0.5/jtreceval-0.0.5-jar-with-dependencies.jar to /root/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar...
/root/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar already exists!
Skipping download.
Running command: ['java', '-jar', '/root/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar', '-c', '-m', 'ndcg_cut.10', 'trec-covid/qrels/test_adjusted.tsv', 'run-dpr-aproximada.txt']
Results:
ndcg_cut_10           	all	0.5032
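
The search in this section, like the one in the exhaustive section, scores every document in the corpus. As a rough illustration of the approximate (ANN) alternative mentioned in the assignment, the sketch below builds a tiny inverted-file (IVF) style index in NumPy: documents are grouped around a few centroids, and a query is compared only against the documents in its `nprobe` best-matching groups. The class name, parameters, and random data are all assumptions made for illustration; libraries such as hnswlib or faiss use more elaborate structures.

```python
import numpy as np

class TinyIVFIndex:
    """Toy inverted-file index: assign each vector to its most similar centroid."""

    def __init__(self, vectors, n_clusters=8, seed=42):
        rng = np.random.default_rng(seed)
        self.vectors = vectors
        # Assumption: centroids are randomly chosen documents (no k-means refinement).
        chosen = rng.choice(len(vectors), size=n_clusters, replace=False)
        self.centroids = vectors[chosen]
        assignment = np.argmax(vectors @ self.centroids.T, axis=1)
        self.lists = [np.flatnonzero(assignment == c) for c in range(n_clusters)]

    def search(self, query, k=10, nprobe=2):
        # Visit only the nprobe clusters whose centroids best match the query.
        closest = np.argsort(-(self.centroids @ query))[:nprobe]
        candidates = np.concatenate([self.lists[c] for c in closest])
        scores = self.vectors[candidates] @ query
        order = np.argsort(-scores)[:k]
        return candidates[order], scores[order]

def exhaustive_search(vectors, query, k=10):
    """Score the query against every vector and keep the k best."""
    scores = vectors @ query
    order = np.argsort(-scores)[:k]
    return order, scores[order]

rng = np.random.default_rng(0)
docs = rng.normal(size=(2000, 32))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[123]  # a query identical to one stored document

index = TinyIVFIndex(docs, n_clusters=8)
approx_ids, _ = index.search(query, k=10, nprobe=2)
full_ids, _ = index.search(query, k=10, nprobe=8)  # probing every cluster
exact_ids, _ = exhaustive_search(docs, query, k=10)
recall = len(set(approx_ids.tolist()) & set(exact_ids.tolist())) / len(exact_ids)
```

Raising `nprobe` trades speed for recall: probing every cluster reproduces the exhaustive result, while small values inspect only a fraction of the corpus and may miss some of the true nearest neighbors.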
