<a href="https://colab.research.google.com/github/leonardo3108/IA368dd/blob/main/exercicios/Aula_11/Aula_11_Retrieve.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment

Implement a multi-document QA pipeline: given a user question, we retrieve the most relevant passages from a large collection and send them to an aggregator system, which generates the final answer.

- Evaluate on the IIRC dataset.
- Main metric: F1.
- Limit the test set to 50 examples to save on costs.
- Use gpt-3.5-turbo as the aggregator model. Use vicuna-13B as an open-source alternative:

 - https://huggingface.co/helloollel/vicuna-13b
 - https://chat.lmsys.org/

Tips:

- Draw inspiration from the Visconde pipeline: https://github.com/neuralmind-ai/visconde
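
Since F1 is the main metric, here is a minimal sketch of a token-level F1 between a predicted and a reference answer. The assignment does not say which F1 variant to use, so the SQuAD-style normalization (lowercasing, dropping punctuation and English articles) and the function names `normalize_answer` and `f1_score` are assumptions made for illustration.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, and split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens = normalize_answer(prediction)
    ref_tokens = normalize_answer(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; only one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(f1_score("Nevada", "Nevada"))
print(f1_score("between 1886 and 1894", "1886"))
```

A verbose prediction that contains the reference is penalized through precision, which is why short, direct answers from the aggregator tend to score better under this metric.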


# Setup

## Hyperparameters

In [1]:
K_qa = 50
K_BM25 = 1000
K_rerank = 10
dir_sentences = 'sentences'
dir_indexes = 'indexes'
model_name = 'cross-encoder/ms-marco-MiniLM-L-6-v2'

## Installing the libraries

In [2]:
!pip install pyserini
!pip install faiss-cpu
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyserini
  Downloading pyserini-0.21.0-py3-none-any.whl (154.1 MB)
Collecting pyjnius>=1.4.0 (from pyserini)
  Downloading pyjnius-1.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB)
Collecting transformers>=4.6.0 (from pyserini)
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting sentencepiece>=0.1.95 (from pyserini)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

## Importing the libraries

In [3]:
import json
import numpy as np
import pandas as pd
import pickle
import random
import re
import spacy
import torch

from pyserini.index import IndexReader
from pyserini.search.lucene import LuceneSearcher
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForSequenceClassification, BatchEncoding

## Seeding the random generator

In [4]:
random.seed(42)
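
To illustrate why the seed matters for the 50-question sample drawn later, the sketch below shows that re-seeding Python's `random` module reproduces the same sample. The population and the sample size `K` are toy values chosen for illustration, not the notebook's data.

```python
import random

K = 5  # hypothetical sample size, smaller than K_qa for readability
population = list(range(100))

random.seed(42)
first = random.sample(population, K)

random.seed(42)  # re-seeding restores the generator to the same state
second = random.sample(population, K)

print(first == second)  # the two samples are identical
```

Note that `random.seed` only affects Python's built-in generator. NumPy and PyTorch keep separate generators (`np.random.seed`, `torch.manual_seed`), which this notebook does not seed.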

## Creating the directories

In [5]:
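import locale

# Workaround (assumed to target Colab): force a UTF-8 preferred encoding
# so that shell commands run with `!` keep working in this session.
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding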

In [6]:
!mkdir $dir_sentences
!mkdir $dir_indexes

## GPU usage

In [7]:
if torch.cuda.is_available():
    dev = "cuda:0"
else:
    dev = "cpu"
device = torch.device(dev)
print('Using {}'.format(device))

Using cuda:0


In [8]:
if dev != 'cpu':
    !nvidia-smi

Wed May 17 23:50:54 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   30C    P0    49W / 400W |    717MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Data preparation

## Downloading the IIRC dataset

- [Baseline repository](https://github.com/jferguson144/IIRC-baseline)
- [Setup instructions](https://github.com/jferguson144/IIRC-baseline/blob/main/setup.sh)

In [9]:
!wget -nc http://jamesf-incomplete-qa.s3.amazonaws.com/iirc.tar.gz

--2023-05-17 23:50:54--  http://jamesf-incomplete-qa.s3.amazonaws.com/iirc.tar.gz
Resolving jamesf-incomplete-qa.s3.amazonaws.com (jamesf-incomplete-qa.s3.amazonaws.com)... 52.92.241.9, 52.92.194.17, 52.92.196.169, ...
Connecting to jamesf-incomplete-qa.s3.amazonaws.com (jamesf-incomplete-qa.s3.amazonaws.com)|52.92.241.9|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5713947 (5.4M) [application/x-gzip]
Saving to: ‘iirc.tar.gz’


2023-05-17 23:50:55 (10.6 MB/s) - ‘iirc.tar.gz’ saved [5713947/5713947]



In [10]:
!tar -xzf iirc.tar.gz

In [11]:
!head -48 iirc/dev.json

[
  {
    "pid": "p_4754",
    "questions": [
      {
        "answer": {
          "type": "span",
          "answer_spans": [
            {
              "start": 141,
              "end": 152,
              "text": "Switzerland",
              "passage": "university of geneva"
            }
          ]
        },
        "question": "In what country did Bain attend doctoral seminars of Wlad Godzich?",
        "question_links": [
          "University of Geneva"
        ],
        "qid": "q_10839",
        "context": [
          {
            "passage": "main",
            "text": "and later attended the doctoral seminars of Wlad Godzich in the University of Geneva.",
            "indices": [
              705,
              790
            ]
          },
          {
            "passage": "main",
            "text": "He completed M. Phil at the Geneva-based IUEE (Institute for European Studies), and later attended the doctoral seminars of Wlad Godzich in the University of Geneva.",


## Downloading the articles

In [12]:
!wget -nc http://jamesf-incomplete-qa.s3.amazonaws.com/context_articles.tar.gz

--2023-05-17 23:50:55--  http://jamesf-incomplete-qa.s3.amazonaws.com/context_articles.tar.gz
Resolving jamesf-incomplete-qa.s3.amazonaws.com (jamesf-incomplete-qa.s3.amazonaws.com)... 52.218.194.2, 52.92.241.241, 52.92.227.81, ...
Connecting to jamesf-incomplete-qa.s3.amazonaws.com (jamesf-incomplete-qa.s3.amazonaws.com)|52.218.194.2|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 385263479 (367M) [application/x-gzip]
Saving to: ‘context_articles.tar.gz’


2023-05-17 23:51:04 (45.7 MB/s) - ‘context_articles.tar.gz’ saved [385263479/385263479]



In [13]:
!tar -xzf context_articles.tar.gz

In [14]:
!head context_articles.json

{
  "san diego padres": "The San Diego Padres are an American <a href=\"professional%20baseball\">professional baseball</a> team based in <a href=\"San%20Diego\">San Diego</a>, <a href=\"California\">California</a>. They compete in <a href=\"Major%20League%20Baseball\">Major League Baseball</a> (MLB) as a member club of the <a href=\"National%20League\">National League</a> (NL) <a href=\"National%20League%20West\">West division</a>. Founded in <a href=\"1969%20San%20Diego%20Padres%20season\">1969</a>, the Padres have won two <a href=\"List%20of%20National%20League%20pennant%20winners\">NL pennants</a> \u2014 in <a href=\"1984%20San%20Diego%20Padres%20season\">1984</a> and <a href=\"1998%20San%20Diego%20Padres%20season\">1998</a>, losing in the <a href=\"World%20Series\">World Series</a> both years. As of <a href=\"2017%20San%20Diego%20Padres%20season\">2018</a>, they have had 14 winning seasons in franchise history. The Padres are one of two Major League Baseball teams (the other being

## Processing the questions and answers

In [15]:
with open('iirc/dev.json', 'r') as f:
    dev_set = json.load(f)
len(dev_set)

430

In [16]:
qa_dev = []

for item in dev_set:
    # Only the first question of each passage is used.
    qa = item['questions'][0]
    question = qa['question']
    answer = qa['answer']
    answer_type = answer['type']
    assert answer_type in ('binary', 'value', 'span', 'none')
    # Questions of type 'none' have no gold answer and are skipped.
    if answer_type in ('binary', 'value'):
        qa_dev.append({'question': question, 'answer': answer['answer_value'].rstrip().rstrip(',')})
    elif answer_type == 'span':
        # For span answers, keep only the text of the first span.
        qa_dev.append({'question': question, 'answer': answer['answer_spans'][0]['text'].rstrip().rstrip(',')})

len(qa_dev)

312
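
The answer handling above can also be sketched as a small helper, which makes the four answer types explicit. The function name `extract_answer` and the toy records are hypothetical; the field names follow the `dev.json` excerpt and the loop above.

```python
from typing import Optional


def extract_answer(answer: dict) -> Optional[str]:
    """Return the gold answer string, or None for questions with no answer."""
    answer_type = answer["type"]
    if answer_type in ("binary", "value"):
        text = answer["answer_value"]
    elif answer_type == "span":
        # Only the first span is kept, mirroring the loop above.
        text = answer["answer_spans"][0]["text"]
    elif answer_type == "none":
        return None
    else:
        raise ValueError(f"unexpected answer type: {answer_type}")
    # Trailing whitespace and a trailing comma are removed; leading
    # whitespace is left untouched, as in the original cell.
    return text.rstrip().rstrip(",")


# Toy records shaped like the dataset entries (values are illustrative).
span_answer = {"type": "span", "answer_spans": [{"text": "Switzerland, "}]}
value_answer = {"type": "value", "answer_value": "95"}
no_answer = {"type": "none"}

print(extract_answer(span_answer))
print(extract_answer(value_answer))
print(extract_answer(no_answer))
```

Because only `rstrip` is applied, a leading space survives, as in the answer ` Marty Raybon` printed by the sampling cell. Using `strip()` instead would remove it, which may matter for exact-match comparisons.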

In [17]:
selected_qa = random.sample(qa_dev, K_qa)
for item in selected_qa[:5]:
    print('\nQ:', item['question'])
    print('A:', item['answer'])


Q: What were the combined ages of Sheikh Abdul Rahman Al Sudais and Mohammed Bin Rashid Al Maktoum the year that Bukhatir first performed at the "Holy Qura'an" competition?
A: 95

Q: When was Tower Bridge built?
A: between 1886 and 1894

Q: What state did the Senator serve who Wingfield became friends with?
A: Nevada

Q: Which European female monarch who wore high heels during the 16th century had a longer reign?
A: Catherine de' Medici

Q: Who are the members of the band whose recording with Krauss brought her to the country music Top Ten for the first time?
A:  Marty Raybon


## Processing the articles

Stripping the HTML markup from the article text.

Reference: https://github.com/jferguson144/IIRC-baseline/blob/main/util.py

In [18]:
with open("context_articles.json", 'r') as f:
    context_articles = json.load(f)
len(context_articles)

56550

In [19]:
def clean(html):
  return re.sub("<[^>]*>", "", html).strip()
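
As a quick check of what this regular expression does, the sketch below applies the same `clean` function to a shortened sample in the style of the article dump. The second call is a made-up string that shows a limitation: any text between a literal `<` and the next `>` is removed, even when it is not a tag.

```python
import re


def clean(html: str) -> str:
    """Remove anything that looks like an HTML tag and trim whitespace."""
    return re.sub("<[^>]*>", "", html).strip()


sample = 'based in <a href="San%20Diego">San Diego</a>, <a href="California">California</a>.'
print(clean(sample))

# Caveat: comparison signs are treated as tag delimiters as well.
print(clean("1 < 2 and 3 > 2"))
```

For the anchor-only markup shown in `context_articles.json` this simple pattern appears sufficient; a full HTML parser would be the safer choice for arbitrary pages.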

In [20]:
adjusted_articles = [{"title": title, "content": clean(context_articles[title])} for title in context_articles.keys()]
len(adjusted_articles)

56550

In [21]:
for article in adjusted_articles[:10]:
    print(article)

{'title': 'san diego padres', 'content': 'The San Diego Padres are an American professional baseball team based in San Diego, California. They compete in Major League Baseball (MLB) as a member club of the National League (NL) West division. Founded in 1969, the Padres have won two NL pennants — in 1984 and 1998, losing in the World Series both years. As of 2018, they have had 14 winning seasons in franchise history. The Padres are one of two Major League Baseball teams (the other being the Los Angeles Angels) in California to originate from that state; the Athletics were originally from Philadelphia (and moved to the state from Kansas City), and the Dodgers and Giants are originally from two New York City boroughs – Brooklyn and Manhattan, respectively. The Padres are the only MLB team that does not share its city with another franchise in the four major American professional sports leagues. The Padres are the only major professional sports franchise to be located in San Diego, follow

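## Generating the segments

Each article is split into overlapping windows of up to 3 sentences, advancing 2 sentences at a time, so consecutive segments share one sentence.

Reference: https://github.com/neuralmind-ai/visconde/blob/main/iirc_create_indices.ipynb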
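
To make the windowing concrete, here is a small sketch that applies the same stride and window logic to a toy list of sentences, without spaCy. The helper name `sentence_windows` and the placeholder sentences are illustrative only.

```python
def sentence_windows(sentences, stride=2, max_length=3):
    """Group sentences into overlapping windows of up to max_length items."""
    windows = []
    for start in range(0, len(sentences), stride):
        windows.append(sentences[start:start + max_length])
        if start + max_length >= len(sentences):
            break  # this window already reaches the end of the article
    return windows


toy = ["S0.", "S1.", "S2.", "S3.", "S4."]
for w in sentence_windows(toy):
    print(w)
# ['S0.', 'S1.', 'S2.']
# ['S2.', 'S3.', 'S4.']
```

With a stride of 2 and a window of 3, each segment repeats the last sentence of the previous one, which reduces the chance that an answer is cut at a segment boundary. The early `break` avoids emitting a trailing fragment that is fully contained in the previous window.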

In [22]:
%%time

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

stride = 2
max_length = 3

def window(documents, stride=2, max_length=3):
  treated_documents = []

  for j, document in enumerate(tqdm(documents)):
    doc_text = document['content']
    doc = nlp(doc_text)
    sentences = [sent.text.strip() for sent in doc.sents]
    for i in range(0, len(sentences), stride):
      segment = ' '.join(sentences[i:i + max_length])
      treated_documents.append({
          "title": document['title'],
          "contents": document['title']+". "+segment,
          "segment": segment
      })
      if i + max_length >= len(sentences):
        break
  return treated_documents

segmented_articles = window(adjusted_articles)
len(segmented_articles)

100%|██████████| 56550/56550 [08:11<00:00, 115.00it/s]

CPU times: user 8min 8s, sys: 4.76 s, total: 8min 13s
Wall time: 8min 11s





3121678

In [23]:
segmented_articles[0:2]

[{'title': 'san diego padres',
  'contents': 'san diego padres. The San Diego Padres are an American professional baseball team based in San Diego, California. They compete in Major League Baseball (MLB) as a member club of the National League (NL) West division. Founded in 1969, the Padres have won two NL pennants — in 1984 and 1998, losing in the World Series both years.',
  'segment': 'The San Diego Padres are an American professional baseball team based in San Diego, California. They compete in Major League Baseball (MLB) as a member club of the National League (NL) West division. Founded in 1969, the Padres have won two NL pennants — in 1984 and 1998, losing in the World Series both years.'},
 {'title': 'san diego padres',
  'contents': 'san diego padres. Founded in 1969, the Padres have won two NL pennants — in 1984 and 1998, losing in the World Series both years. As of 2018, they have had 14 winning seasons in franchise history. The Padres are one of two Major League Baseball te

In [24]:
with open(dir_sentences + '/segmented_articles.jsonl', 'w') as f:
    for i, doc in enumerate(segmented_articles):
        doc['id'] = i
        if doc['segment'] != "":
            f.write(json.dumps(doc)+"\n")

# BM25 search

## Indexing

In [25]:
%%time

!python3 -m pyserini.index \
    -collection JsonCollection \
    -generator DefaultLuceneDocumentGenerator \
    -threads 1 \
    -input {dir_sentences} \
    -index {dir_indexes} 

pyserini.index is deprecated, please use pyserini.index.lucene.
2023-05-18 00:00:03,259 INFO  [main] index.IndexCollection (IndexCollection.java:380) - Setting log level to INFO
2023-05-18 00:00:03,261 INFO  [main] index.IndexCollection (IndexCollection.java:383) - Starting indexer...
2023-05-18 00:00:03,262 INFO  [main] index.IndexCollection (IndexCollection.java:385) - DocumentCollection path: sentences
2023-05-18 00:00:03,262 INFO  [main] index.IndexCollection (IndexCollection.java:386) - CollectionClass: JsonCollection
2023-05-18 00:00:03,262 INFO  [main] index.IndexCollection (IndexCollection.java:387) - Generator: DefaultLuceneDocumentGenerator
2023-05-18 00:00:03,262 INFO  [main] index.IndexCollection (IndexCollection.java:388) - Threads: 1
2023-05-18 00:00:03,263 INFO  [main] index.IndexCollection (IndexCollection.java:389) - Language: en
2023-05-18 00:00:03,263 INFO  [main] index.IndexCollection (IndexCollection.java:390) - Stemmer: porter
2023-05-18 00:00:03,263 INFO  [main] 

In [26]:
index_reader = IndexReader(dir_indexes)
index_reader.stats()

{'total_terms': 148018179,
 'documents': 3121678,
 'non_empty_documents': 3121678,
 'unique_terms': 897887}
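
For intuition about what the index is used for, the sketch below scores a few toy documents with a textbook-style BM25 formula. It is not Pyserini's or Lucene's implementation: the tokenization is plain word lists (the indexing log above shows that the real index applies Porter stemming), and the values of `k1` and `b` are illustrative assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Score each tokenized document against a tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] / norm
        scores.append(score)
    return scores


docs = [
    ["tower", "bridge", "built", "1894"],
    ["bridge", "over", "river"],
    ["baseball", "team"],
]
scores = bm25_scores(["tower", "bridge", "built"], docs)
print(scores)
```

The first document matches all three query terms and ranks highest, the second matches only the common term `bridge`, and the third scores zero. Because BM25 relies on term overlap alone, a passage can rank well without containing the answer, which is what the reranking stage is meant to correct.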

## Tests

In [27]:
bm25_seacher = LuceneSearcher(dir_indexes)
hits = bm25_seacher.search('When was Tower Bridge built?', k=1000)

for i in range(0, 10):
    print(f"{i+1:2} {hits[i].docid:7} {hits[i].score:.5f} {segmented_articles[int(hits[i].docid)]['segment']}")

print(len(hits))

 1 1383155 11.10100 The 4640 tonne structure cost about 2.1 million marks when it was built during World War I. Since the bridge was a major military construction project, both abutments of the bridge were flanked by stone towers with fortified foundations that could shelter up to a full battalion of men. The towers were designed with fighting loopholes for troops. From the flat roof of the towers troops had a good view of the valley.
 2 676577  10.95050 Three road bridges cross the Great Float:

A red girdered bascule bridge at Tower Road connects the Seacombe district of Wallasey with Birkenhead. Known as the Four Bridges, as originally four movable bridges existed along Tower Road: two between the Great Float and Alfred Dock, one between the Great Float and Wallasey Dock and one between the Great Float and Egerton Dock. When originally built, all four were hydraulic swing bridge types.
 3 3059009 10.80380 Exhibition. The Tower Bridge Exhibition is a display housed in the bridge's tw

In [28]:
for item in selected_qa[:5]:
    print('\nQ:', item['question'])
    print('EA:', item['answer'])
    hits = bm25_seacher.search(item['question'])
    for i in range(0, 5):
        print(f'{i+1:2} {hits[i].docid:7} {hits[i].score:.5f}')


Q: What were the combined ages of Sheikh Abdul Rahman Al Sudais and Mohammed Bin Rashid Al Maktoum the year that Bukhatir first performed at the "Holy Qura'an" competition?
EA: 95
 1 1378415 40.71450
 2 1378308 39.42290
 3 1378414 38.96920
 4 1378416 38.76260
 5 1378361 38.71810

Q: When was Tower Bridge built?
EA: between 1886 and 1894
 1 1383155 11.10100
 2 676577  10.95050
 3 3059009 10.80380
 4 3058948 10.52660
 5 3059017 10.27090

Q: What state did the Senator serve who Wingfield became friends with?
EA: Nevada
 1 2886809 11.79420
 2 774232  11.70510
 3 2817319 11.65730
 4 1190216 11.18490
 5 3063202 10.98060

Q: Which European female monarch who wore high heels during the 16th century had a longer reign?
EA: Catherine de' Medici
 1 2332751 14.62510
 2 1988263 13.77540
 3 2340532 13.63720
 4 1790837 13.40540
 5 1637091 13.16330

Q: Who are the members of the band whose recording with Krauss brought her to the country music Top Ten for the first time?
EA:  Marty Raybon
 1 2040470 

## Execution

In [29]:
%%time

run = []
for item in selected_qa:
    hits = bm25_seacher.search(item['question'], k = K_BM25)
    
    for i, hit in enumerate(hits):
        run.append({'question': item['question'], 'answer': item['answer'], 'docid': hit.docid, 'text': segmented_articles[int(hit.docid)]['segment'], 'bm25_score': hit.score, 'bm25_rank': i+1})
len(run)

CPU times: user 6.29 s, sys: 48 ms, total: 6.34 s
Wall time: 3.51 s


50000

In [30]:
df_run = pd.DataFrame(run)
df_run

Unnamed: 0,question,answer,docid,text,bm25_score,bm25_rank
0,What were the combined ages of Sheikh Abdul Ra...,95,1378415,- Hessa bint Mohammed bin Rashid Al Maktoum (b...,40.714500,1
1,What were the combined ages of Sheikh Abdul Ra...,95,1378308,Sheikh Mohammed is the third son of Sheikh Ras...,39.422901,2
2,What were the combined ages of Sheikh Abdul Ra...,95,1378414,Daughters. Six daughters married into royal fa...,38.969200,3
3,What were the combined ages of Sheikh Abdul Ra...,95,1378416,- Shaikha bint Mohammed bin Rashed Al Maktoum ...,38.762600,4
4,What were the combined ages of Sheikh Abdul Ra...,95,1378361,"In June 2017, two new initiatives were added t...",38.718102,5
...,...,...,...,...,...,...
49995,How old was Jan Piwnik the year that all railw...,30,1887496,"In the Old City are the Emir's Palace, the Gre...",7.612400,996
49996,How old was Jan Piwnik the year that all railw...,30,1862877,The Mersey Railway was the first part of the p...,7.611400,997
49997,How old was Jan Piwnik the year that all railw...,30,391743,On 19 October Murad II used his sipahi cavalry...,7.611000,998
49998,How old was Jan Piwnik the year that all railw...,30,2320831,"He is a 5th grader and has a buzz cut, and wea...",7.610500,999


# Rerank

## Dataset class

In [31]:
class DatasetQueryText(Dataset):
    def __init__(self, texts: np.ndarray, tokenizer):
      self.texts = texts
      self.tokenizer = tokenizer
      self.max_seq_length = tokenizer.model_max_length

      input_ids = []
      token_type_ids = []
      attention_masks = []
      for query, text in tqdm(texts, desc='encoding query+doc'):
          encoding = tokenizer.encode_plus(
              query, 
              text,
              add_special_tokens=True,
              max_length=self.max_seq_length,
              padding='max_length',
              return_tensors = 'pt',
              truncation=True,
              return_attention_mask=True,
              return_token_type_ids=True
          )
          input_ids.append(encoding['input_ids'].long())
          token_type_ids.append(encoding['token_type_ids'].long())
          attention_masks.append(encoding['attention_mask'].long())
      self.input_ids = torch.stack(input_ids).squeeze(1)
      self.attention_masks = torch.stack(attention_masks).squeeze(1)
      self.token_type_ids = torch.stack(token_type_ids).squeeze(1)

    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        return {
            'input_ids': self.input_ids[idx],
            'attention_mask': self.attention_masks[idx],
            'token_type_ids': self.token_type_ids[idx]
        }

## Loading the model

In [32]:
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Downloading (…)lve/main/config.json:   0%|          | 0.00/794 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/316 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

## Dataset and DataLoader

In [33]:
dataset_rerank = DatasetQueryText(texts = df_run[['question','text']].values, tokenizer=tokenizer)
dataloader_rerank = DataLoader(dataset_rerank, batch_size=500, shuffle=False)

encoding query+doc: 100%|██████████| 50000/50000 [00:44<00:00, 1119.38it/s]


## Execution

In [34]:
scores = []
model.eval()
with torch.no_grad():
    for ndx, batch in tqdm(enumerate(dataloader_rerank), total=len(dataloader_rerank), mininterval=0.5, desc='reranking', disable=False):
        logits = model(**BatchEncoding(batch).to(device)).logits
        # Squeeze only the label dimension so a batch of size 1 still yields a 1-D array.
        scores.extend(logits.squeeze(-1).cpu().numpy())

reranking: 100%|██████████| 1563/1563 [01:18<00:00, 19.99it/s]


## Processing the results

In [36]:
df_run['score_rerank'] = scores
df_run = df_run.groupby('question', group_keys=False).apply(lambda x: x.sort_values(['score_rerank'], ascending=[False]))
df_run['rerank'] = df_run.groupby('question').cumcount() + 1
df_run = df_run.query('rerank <= ' + str(K_rerank))
df_run

Unnamed: 0,question,answer,docid,text,bm25_score,bm25_rank,score_rerank,rerank
31007,Did the same team win the cup finals Watford p...,no,1601617,The cup has been won by the same team in two o...,13.814700,8,8.217612,1
31017,Did the same team win the cup finals Watford p...,no,26877,The victory meant Southampton reached the semi...,13.214600,18,6.593122,2
31058,Did the same team win the cup finals Watford p...,no,516756,In the quarter finals they played fellow Premi...,11.695900,59,6.506072,3
31019,Did the same team win the cup finals Watford p...,no,516746,The match was the fourth time that the two tea...,13.063100,20,6.458253,4
31025,Did the same team win the cup finals Watford p...,no,1601618,"The cup is currently held by Manchester City, ...",12.873300,26,6.447309,5
...,...,...,...,...,...,...,...,...
7004,Who was the founder of Bell Aircraft?,Larry Bell,2416112,"On February 23, 1909, Bell was present as the ...",8.528400,5,3.704788,6
7014,Who was the founder of Bell Aircraft?,Larry Bell,1722969,Burford has twice had a bell foundry: one run ...,7.946300,15,3.567063,7
7013,Who was the founder of Bell Aircraft?,Larry Bell,1782967,"Before long, Bell became general manager and b...",7.955300,14,3.506255,8
7055,Who was the founder of Bell Aircraft?,Larry Bell,1782963,"The company was purchased in 1960 by Textron, ...",7.150600,56,3.359585,9
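
The per-question selection performed by the pandas cell above can be sketched in plain Python, which makes the three steps explicit: group by question, sort by reranker score, and keep the first k rows with a 1-based rank. The helper name `top_k_per_question` and the toy rows with made-up scores are illustrative only.

```python
from collections import defaultdict


def top_k_per_question(rows, k):
    """Keep the k highest-scoring rows per question and add a 1-based rank."""
    by_question = defaultdict(list)
    for row in rows:
        by_question[row["question"]].append(row)
    selected = []
    for question, group in by_question.items():
        ranked = sorted(group, key=lambda r: r["score_rerank"], reverse=True)
        for rank, row in enumerate(ranked[:k], start=1):
            selected.append({**row, "rerank": rank})
    return selected


# Toy rows with made-up scores, only to show the mechanics.
rows = [
    {"question": "q1", "docid": "a", "score_rerank": 0.2},
    {"question": "q1", "docid": "b", "score_rerank": 3.1},
    {"question": "q1", "docid": "c", "score_rerank": 1.5},
    {"question": "q2", "docid": "d", "score_rerank": -0.7},
]
top = top_k_per_question(rows, k=2)
print([(r["question"], r["docid"], r["rerank"]) for r in top])
# [('q1', 'b', 1), ('q1', 'c', 2), ('q2', 'd', 1)]
```

One difference from the pandas version: this sketch keeps questions in their order of first appearance, while `groupby` orders the groups by key, which is why the table above starts with a question other than the first one sampled.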


In [37]:
df_run[['question', 'text', 'rerank', 'docid', 'answer']].to_csv('retriever.csv', index=False)
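
The saved passages are the input for the aggregator step described in the assignment. As a rough sketch of that hand-off, the function below formats one question and its reranked passages into a single prompt string. The prompt wording, the function name `build_prompt`, and the example passages are assumptions for illustration; no model is called here.

```python
def build_prompt(question, passages):
    """Format a question and its reranked passages into one prompt string."""
    lines = ["Answer the question using the documents below.", ""]
    for i, passage in enumerate(passages, start=1):
        lines.append(f"Document {i}: {passage}")
    lines.extend(["", f"Question: {question}", "Answer:"])
    return "\n".join(lines)


# Made-up passages standing in for rows read back from retriever.csv.
prompt = build_prompt(
    "When was Tower Bridge built?",
    ["First retrieved passage.", "Second retrieved passage."],
)
print(prompt)
```

Listing the passages in rerank order keeps the strongest evidence closest to the top, and ending with `Answer:` encourages a short completion, which suits a token-level F1 evaluation.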