## Imports

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from pathlib import Path
import pandas as pd
from tqdm import tqdm
import json
import os
from subprocess import Popen, PIPE, STDOUT

from haystack.preprocessor.cleaning import clean_wiki_text
from haystack.preprocessor.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
from haystack.retriever.sparse import ElasticsearchRetriever
from haystack.pipeline import ExtractiveQAPipeline

03/06/2021 10:33:27 - INFO - faiss.loader -   Loading faiss with AVX2 support.
03/06/2021 10:33:27 - INFO - faiss.loader -   Loading faiss.
03/06/2021 10:33:28 - INFO - farm.modeling.prediction_head -   Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .


## Load data

There are four types of dataset associated with AmazonQA:

* Products: combination of Amazon reviews and questions
* QA pairs: heuristics applied to the products dataset to generate QA pairs and query-relevant review snippets (the main contribution from the paper)
* SQuAD style: conversion of QA pairs to extractive QA format
* MS MARCO: conversion of QA pairs to abstractive QA format

Task:

> Given a set of product reviews and a question concerning a specific product, generate an informative
natural language answer.

So we could build a system where you search for a product and then ask questions about it. We will need a way to map each Amazon Standard Identification Number (ASIN) to a human-readable product name.

In [None]:
data = Path('./data/amazon-qa')
!ls {data}

test-qar_squad_all.jsonl  train-qar_msmarco.jsonl      train-qar_squad.json
train-qar.jsonl		  train-qar_products.jsonl     train-qar_squad.jsonl
train-qar.jsonl.bak	  train-qar_squad-music.json   val-qar_squad-music.json
train-qar_meta.jsonl	  train-qar_squad-music.jsonl  val-qar_squad.jsonl


### Products

In [None]:
products_df = pd.read_json(data/'train-qar_products.jsonl', lines=True, nrows=10)
products_df.head()

Unnamed: 0,asin,questions,reviews,category
0,B007F357HQ,"[{'questionText': 'I had shoulder surgery 6 months ago and have a 4"" wide sc...","[{'helpful': [1, 1], 'reviewText': 'Love this - wasn't sure I would as I tho...",Beauty
1,B00CRAJZFW,"[{'questionText': 'is it for iphones', 'questionType': 'yesno', 'answers': [...","[{'helpful': [2, 4], 'reviewText': 'This product arrived exactly as pictured...",Cell_Phones_and_Accessories
2,B002ZAZ7H4,"[{'questionText': 'what is the width and ht of the cells?', 'questionType': ...","[{'helpful': [0, 0], 'reviewText': 'Very well constructed and designed.I lik...",Home_and_Kitchen
3,B008SCP8UE,"[{'questionText': 'is it big for a bunny', 'questionType': 'yesno', 'answers...","[{'helpful': [0, 0], 'reviewText': 'Absolutely love!!! It was cute and easy ...",Pet_Supplies
4,B001O4F8Y4,[{'questionText': 'We are having a problem with the range. We have an old ho...,"[{'helpful': [0, 0], 'reviewText': 'We have a two story home and notice that...",Tools_and_Home_Improvement


### QA pairs

In [None]:
qar_df = pd.read_json(data/'train-qar.jsonl', lines=True, nrows=10)
qar_df.head()

Unnamed: 0,asin,category,questionText,questionType,review_snippets,answers,is_answerable,qid
0,B000MP20BU,Toys_and_Games,"Many have stated similar to the following: ""Paint Chips Off Easily; Pieces a...",descriptive,[A lot of reviewers have said things about this puzzle not being that durabl...,[{'answerText': 'The paint has held up through two toddlers and still going ...,1,0
1,B00BOXZZU2,Health_and_Personal_Care,Will these work with the Phillips sonicare handles?,descriptive,[I didn't even realize such a small electric tooth-brush existed till I acci...,"[{'answerText': 'The answer unfortunately, is no. The Slim Sonic is a compac...",0,1
2,B00CSYD4M2,Cell_Phones_and_Accessories,What kind of sim card it use?,descriptive,[I bought this phone a few weeks ago.I am using it in Costa Rica with a Kolb...,"[{'answerText': 'This phone is an unlocked GSM device, it requires a MINI SI...",1,2
3,B00C5TNSRG,Home_and_Kitchen,does anyone know if this dinnerware set does not contain lead or traces of l...,descriptive,[I love my new dishes! They are so versatile. I can set a casual table and y...,[{'answerText': 'According to the internet search: three-layer glass lamin...,0,3
4,B0099XQBD4,Musical_Instruments,I'm thinking of getting in to modular synthesizers. Would this work for that?,descriptive,[Will order another in the near future and arrived very quickly. Easy to ins...,"[{'answerText': 'Yes it will.', 'answerType': 'NA', 'helpful': [1, 1]}, {'an...",0,4


### SQuAD

In [None]:
squad_df = pd.read_json(data/'train-qar_squad.jsonl', lines=True, nrows=10)
squad_df.head()

Unnamed: 0,context,qas
0,This is the perfect kit to get started. Everything is miniaturized and comes...,"[{'id': 331392, 'is_impossible': False, 'question': 'What exactly comes in t..."
1,"... it doesn't last quite as long as advertised so therefore, I had to conti...","[{'id': 684949, 'is_impossible': False, 'question': 'How do you apply this p..."
2,This is a pretty cool filter. If you spin it around it will totally change t...,"[{'id': 604553, 'is_impossible': False, 'question': 'Does this come with a c..."
3,This product was the exact match of the original manufactured equipment.. On...,"[{'id': 341653, 'is_impossible': False, 'question': 'How to remove midgate r..."
4,Nice kit. Works well. Adjustable for proper sighting. Good quality. Instruct...,"[{'id': 192046, 'is_impossible': False, 'question': 'does this include the m..."


### MARCO

In [None]:
marco_df = pd.read_json(data/'train-qar_msmarco.jsonl', lines=True, nrows=10)
marco_df.head()

Unnamed: 0,answers,passages,query,query_id,query_type,wellFormedAnswers
0,[If your VHS tapes are getting old you may have problems like that. Also He...,"[{'is_selected': 1, 'url': '', 'passage_text': 'This product arrived in a &#...",I am having issues with dropped video. Audio is fine. I'm getting video abou...,384160,DESCRIPTION,[]
1,"[No they are not made in USA not the ones I got sorry, I am not sure. I bel...","[{'is_selected': 1, 'url': '', 'passage_text': 'My wife and I looked for mon...",Are all mattress materials made in the USA?,282084,DESCRIPTION,[]
2,[It's not really suited for handling large debris. It should only be used fo...,"[{'is_selected': 1, 'url': '', 'passage_text': 'We purchased our home a few ...",How well does it handle leaves? especially large leaves? I had a KK severa...,454427,DESCRIPTION,[]
3,"[There is no adapter, just the micro SD card., It comes with a standard SD c...","[{'is_selected': 1, 'url': '', 'passage_text': 'I purchased this card for my...",Does this card come with the adapter for the larger slots as in a RaspberryPI?,193420,DESCRIPTION,[]
4,[I have done that to another grease gun before. You would probably have to ...,"[{'is_selected': 1, 'url': '', 'passage_text': 'The first one I got didn't w...",would this work with oil instead of grease? I have a old milling machine wi...,661529,YESNO,[]


### Metadata
Let's pick out the mapping from ASIN to QID from the training set:

In [None]:
rows = []
with open(data/'train-qar.jsonl', 'r') as f:
    for line in tqdm(f):
        row = json.loads(line)
        rows.append((row['asin'], row['category'], row['qid'], row['is_answerable']))

738776it [00:44, 16552.33it/s]
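
As an alternative to parsing every line by hand, `pd.read_json` can stream a JSON Lines file in chunks when `lines=True` and `chunksize` are both set. The sketch below runs on a tiny temporary file; the helper name `load_meta`, the toy records and the chunk size are assumptions made for illustration, not part of the original notebook.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

COLUMNS = ["asin", "category", "qid", "is_answerable"]


def load_meta(path, chunksize=100_000):
    """Stream a JSON Lines file and keep only the metadata columns."""
    chunks = pd.read_json(path, lines=True, chunksize=chunksize)
    return pd.concat([chunk[COLUMNS] for chunk in chunks], ignore_index=True)


# Toy records standing in for train-qar.jsonl (hypothetical content).
records = [
    {"asin": "B000MP20BU", "category": "Toys_and_Games", "qid": 0,
     "is_answerable": 1, "questionText": "toy question one"},
    {"asin": "B00BOXZZU2", "category": "Health_and_Personal_Care", "qid": 1,
     "is_answerable": 0, "questionText": "toy question two"},
]

with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy.jsonl"
    toy_path.write_text("\n".join(json.dumps(r) for r in records))
    toy_meta = load_meta(toy_path, chunksize=1)

print(toy_meta)
```

Whether this is faster than the explicit loop on the full file is not something this sketch establishes; its main appeal is that only the selected columns are kept in memory.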


In [None]:
meta_df = pd.DataFrame(rows, columns=['asin', 'category', 'qid', 'is_answerable'])
meta_df.head()

Unnamed: 0,asin,category,qid,is_answerable
0,B000MP20BU,Toys_and_Games,0,1
1,B00BOXZZU2,Health_and_Personal_Care,1,0
2,B00CSYD4M2,Cell_Phones_and_Accessories,2,1
3,B00C5TNSRG,Home_and_Kitchen,3,0
4,B0099XQBD4,Musical_Instruments,4,0


In [None]:
meta_df.shape

(738776, 4)

In [None]:
assert meta_df['qid'].nunique() == len(meta_df)

In [None]:
meta_df['category'].value_counts()

Electronics                    169764
Home_and_Kitchen               107423
Sports_and_Outdoors             70824
Tools_and_Home_Improvement      62995
Health_and_Personal_Care        47589
Automotive                      45892
Cell_Phones_and_Accessories     42211
Patio_Lawn_and_Garden           36693
Toys_and_Games                  30838
Office_Products                 26086
Beauty                          24956
Pet_Supplies                    21668
Baby                            14427
Musical_Instruments             14285
Grocery_and_Gourmet_Food        11553
Video_Games                      5901
Clothing_Shoes_and_Jewelry       5671
Name: category, dtype: int64

In [None]:
meta_df.groupby('category')['asin'].nunique().sort_values(ascending=False)

category
Electronics                    28696
Home_and_Kitchen               17640
Sports_and_Outdoors            11881
Tools_and_Home_Improvement     10283
Automotive                      8172
Health_and_Personal_Care        8051
Cell_Phones_and_Accessories     7133
Patio_Lawn_and_Garden           6108
Toys_and_Games                  5725
Beauty                          4450
Office_Products                 4339
Pet_Supplies                    3349
Baby                            2466
Musical_Instruments             2366
Grocery_and_Gourmet_Food        2272
Clothing_Shoes_and_Jewelry      1021
Video_Games                      886
Name: asin, dtype: int64

In [None]:
meta_df.groupby('category')['is_answerable'].value_counts()

category                     is_answerable
Automotive                   0                 23656
                             1                 22236
Baby                         1                 10255
                             0                  4172
Beauty                       1                 15243
                             0                  9713
Cell_Phones_and_Accessories  1                 27446
                             0                 14765
Clothing_Shoes_and_Jewelry   1                  3315
                             0                  2356
Electronics                  1                108614
                             0                 61150
Grocery_and_Gourmet_Food     1                  6774
                             0                  4779
Health_and_Personal_Care     1                 29539
                             0                 18050
Home_and_Kitchen             1                 66384
                             0                 41039
Mus
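
Among the categories visible in the truncated output above, `Automotive` is the only one with more unanswerable than answerable questions. Because `is_answerable` is a 0/1 flag, its mean per category gives the answerable fraction directly. A minimal sketch on a toy frame (the rows are made up; on the real data one would use `meta_df` instead of `toy_meta`):

```python
import pandas as pd

# Toy stand-in for meta_df; values are invented for illustration.
toy_meta = pd.DataFrame(
    {
        "category": ["Automotive", "Automotive", "Baby", "Baby", "Baby", "Baby"],
        "is_answerable": [0, 1, 1, 1, 1, 0],
    }
)

# The mean of a 0/1 column is the fraction of rows equal to 1.
answerable_rate = (
    toy_meta.groupby("category")["is_answerable"]
    .mean()
    .sort_values(ascending=False)
)
print(answerable_rate)
```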

In [None]:
meta_df.query("category == 'Musical_Instruments'")

Unnamed: 0,asin,category,qid,is_answerable
4,B0099XQBD4,Musical_Instruments,4,0
55,B00F9ECDRU,Musical_Instruments,55,0
59,B0083FTVB8,Musical_Instruments,59,0
231,B005ETZ7NW,Musical_Instruments,231,0
269,B000EJTXZU,Musical_Instruments,269,0
...,...,...,...,...
738436,B001KPWU7A,Musical_Instruments,738436,0
738527,B00AMPDYDS,Musical_Instruments,738527,0
738581,B005IQGKX2,Musical_Instruments,738581,1
738652,B001KPWU7A,Musical_Instruments,738652,1


## Warmup: no fine-tuning

Let's pick a single category, such as `Musical_Instruments`, and build a `DataFrame` with `asin` and context columns. We can then use it to create a simple QA system with an existing model fine-tuned on SQuAD:

In [None]:
qid2category = pd.Series(meta_df["category"].values, index=meta_df["qid"]).to_dict()
qid2category[0]

'Toys_and_Games'

In [None]:
qid2asin = pd.Series(meta_df["asin"].values, index=meta_df["qid"]).to_dict()
qid2asin[0]

'B000MP20BU'

It seems that all SQuAD entries are answerable (does this make sense?). What about SQuAD v2 with impossible questions?

In [None]:
qid2isanswer = pd.Series(meta_df["is_answerable"].values, index=meta_df["qid"]).to_dict()
qid2isanswer[4]

0

In [None]:
qid2asin[331392]

'B0057JCYYE'

In [None]:
rows = []

with open(data/'train-qar_squad.jsonl', 'r', encoding='utf-8') as f:
    for line in tqdm(f):
        row = json.loads(line)
        qid = row["qas"][0]["id"]
        if qid2category[qid] == "Musical_Instruments":
            rows.append((qid2asin[qid], row["context"], row["qas"], qid2isanswer[qid]))

455931it [00:30, 15122.32it/s]


In [None]:
qa_df = pd.DataFrame(rows, columns=['asin', 'text', "qas", 'is_answerable'])
qa_df.head()

Unnamed: 0,asin,text,qas,is_answerable
0,B005OZE9SA,Works perfectly and easy to use. Software download also great.The only surpr...,"[{'id': 943, 'is_impossible': False, 'question': 'ipad', 'answers': [{'answe...",1
1,B001RR9BZA,I'm not totally happy with it because it squeals a lot and doesn't really he...,"[{'id': 6381, 'is_impossible': False, 'question': 'Does it amplify your voic...",1
2,B00B9060X6,I've tried computer studios but prefer twiddling knobs. With my Portastudio ...,"[{'id': 440217, 'is_impossible': False, 'question': 'Can you record a drum m...",1
3,B009VDW4OW,I just received this drum in the mail. I had no idea how much assembly was r...,"[{'id': 693739, 'is_impossible': False, 'question': 'What are thoughts on di...",1
4,B004STXY3E,This DMX controller is a great start for getting into DMX lighting control. ...,"[{'id': 27683, 'is_impossible': False, 'question': 'I have RGB par cans and ...",1


In [None]:
qa_df['is_answerable'].value_counts()

1    8694
Name: is_answerable, dtype: int64
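
The count above relies on the `is_answerable` flag from the metadata. A complementary check is to count the `is_impossible` flags stored in the SQuAD-style records themselves. The sketch below uses toy records that mimic the structure shown earlier; the helper name `count_impossible` is hypothetical.

```python
from collections import Counter


def count_impossible(records):
    """Count is_impossible flags across SQuAD-style line records."""
    return Counter(qa["is_impossible"] for rec in records for qa in rec["qas"])


# Toy records mimicking the layout of train-qar_squad.jsonl (invented content).
toy_records = [
    {"context": "Nice kit.", "qas": [{"id": 1, "is_impossible": False}]},
    {"context": "Works well.", "qas": [{"id": 2, "is_impossible": False}]},
]

flags = count_impossible(toy_records)
print(flags)
```

If every record reports `False`, the file contains no SQuAD v2 style impossible questions, which would agree with the metadata count.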

In [None]:
qa_df.shape

(8694, 4)

In [None]:
qa_df['asin'].nunique()

2100

### Boot ES

In [None]:
! wget -nc https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30
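
A fixed sleep can be too short on a slow machine and needlessly long on a fast one. One option is to poll the HTTP endpoint until it answers. This is a minimal sketch using only the standard library; the function name `wait_for_http` and its defaults are assumptions, and it only checks that something responds, not that the cluster is healthy.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url, timeout=60.0, interval=1.0):
    """Return True once `url` answers an HTTP request, False after `timeout` seconds."""
    # Bypass any configured proxy, since the target is a local service.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with opener.open(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server responded, even if with an error status.
            return True
        except (urllib.error.URLError, OSError):
            # Nothing is listening yet; wait and retry.
            time.sleep(interval)
    return False
```

In this notebook the call could look like `wait_for_http("http://localhost:9200", timeout=120)`, using the address that appears in the Elasticsearch log lines.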

In [None]:
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

03/05/2021 15:11:00 - INFO - elasticsearch -   HEAD http://localhost:9200/ [status:200 request:0.094s]
03/05/2021 15:11:00 - INFO - elasticsearch -   HEAD http://localhost:9200/document [status:200 request:0.011s]
03/05/2021 15:11:00 - INFO - elasticsearch -   GET http://localhost:9200/document [status:200 request:0.004s]
03/05/2021 15:11:00 - INFO - elasticsearch -   PUT http://localhost:9200/document/_mapping [status:200 request:0.024s]
03/05/2021 15:11:00 - INFO - elasticsearch -   HEAD http://localhost:9200/label [status:200 request:0.002s]


In [None]:
document_store.delete_all_documents()

03/04/2021 21:20:03 - INFO - elasticsearch -   POST http://localhost:9200/document/_delete_by_query [status:200 request:0.970s]


### Index docs

In [None]:
docs = [{"text": row["text"], "meta":{"asin": row["asin"], "is_answerable": row["is_answerable"]}} for _, row in qa_df.iterrows()]
docs[0]

{'text': "Works perfectly and easy to use. Software download also great.The only surprise was that the one I ordered, (USB) doesn't work with an iPad.I was thinking it would work with both PC and iPad. My mistake. I use this with Logic Pro X on an iMac running Mavericks (it's replacing an Mbox) and with a Sony Vaio running Windows 7 and get excellent results (don't forget to install the Windows drivers or you'll run into latency issues). I also use it with the Auria App on my iPad Air. I did appreciate the direct line in switch...I could hear exactly what was being played into the unit without having to route through the computer. That was a nice feature. More recently, I was very happy to get this working with my ipad mini. I did purchase a recommended usb powered hub Belkin model &#34; F4U020&#34; and with that - I'm good to play music into and out of my ipad. Focusrite. An industry standard.I bought this specifically for use with an iPad to do mobile recording. The app I use is Auri

In [None]:
document_store.write_documents(docs)

03/05/2021 15:11:02 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.124s]
03/05/2021 15:11:04 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.382s]
03/05/2021 15:11:05 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.381s]
03/05/2021 15:11:06 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.281s]
03/05/2021 15:11:08 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.237s]
03/05/2021 15:11:09 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.254s]
03/05/2021 15:11:10 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.260s]
03/05/2021 15:11:12 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.275s]


### Retriever

In [None]:
retriever = ElasticsearchRetriever(document_store=document_store)

### Reader

In [None]:
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True, context_window_size=500)

03/06/2021 10:59:24 - INFO - farm.utils -   Using device: CUDA 
03/06/2021 10:59:24 - INFO - farm.utils -   Number of GPUs: 1
03/06/2021 10:59:24 - INFO - farm.utils -   Distributed Training: False
03/06/2021 10:59:24 - INFO - farm.utils -   Automatic Mixed Precision: None
Some weights of RobertaModel were not initialized from the model checkpoint at deepset/roberta-base-squad2 and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
03/06/2021 10:59:42 - INFO - farm.utils -   Using device: CUDA 
03/06/2021 10:59:42 - INFO - farm.utils -   Number of GPUs: 1
03/06/2021 10:59:42 - INFO - farm.utils -   Distributed Training: False
03/06/2021 10:59:42 - INFO - farm.utils -   Automatic Mixed Precision: None
03/06/2021 10:59:45 - INFO - farm.infer -   Got ya 7 parallel workers to do inference ...
03/06/2021 10:59:45 - INFO - farm.infer -    0    0    0

In [None]:
# check evaluation on SQuAD v2
reader_eval_results = reader.eval_on_file("data/squad", "dev-v2.0.json", device='cuda')

Preprocessing Dataset data/squad/dev-v2.0.json: 100%|██████████| 1204/1204 [00:07<00:00, 162.32 Dicts/s]
Evaluating: 100%|██████████| 274/274 [02:36<00:00,  1.75it/s]


In [None]:
print("Reader Top-N-Accuracy:", reader_eval_results["top_n_accuracy"])
## Reader Exact Match is the proportion of questions where the predicted answer is exactly the same as the correct answer
print("Reader Exact Match:", reader_eval_results["EM"])
## Reader F1-Score is the average overlap between the predicted answers and the correct answers
print("Reader F1-Score:", reader_eval_results["f1"])

Reader Top-N-Accuracy: 0.9746483618293608
Reader Exact Match: 0.7843005137707403
Reader F1-Score: 0.8260896852846605
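
To make the two metrics concrete, here is a sketch of Exact Match and token-level F1 for a single prediction. The normalization (lowercasing, dropping punctuation and English articles) follows the widely used SQuAD evaluation convention; whether `eval_on_file` applies exactly the same steps is an assumption.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # Both empty counts as a match; only one empty counts as a miss.
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The late 1990s.", "late 1990s"))
print(f1_score("in the late 1990s", "late 1990s"))
```

The second call shows why F1 is more forgiving than Exact Match: the prediction contains one extra token, so it is not an exact match but still overlaps heavily with the reference.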


In [None]:
# check evaluation on AmazonQA
reader_eval_results = reader.eval_on_file("data/amazon-qa", "val-qar_squad-music.json", device='cuda')

Preprocessing Dataset data/amazon-qa/val-qar_squad-music.json: 100%|██████████| 1150/1150 [00:03<00:00, 371.15 Dicts/s]
Evaluating: 100%|██████████| 133/133 [01:17<00:00,  1.72it/s]


In [None]:
print("Reader Top-N-Accuracy:", reader_eval_results["top_n_accuracy"])
## Reader Exact Match is the proportion of questions where the predicted answer is exactly the same as the correct answer
print("Reader Exact Match:", reader_eval_results["EM"])
## Reader F1-Score is the average overlap between the predicted answers and the correct answers
print("Reader F1-Score:", reader_eval_results["f1"])

Reader Top-N-Accuracy: 0.542608695652174
Reader Exact Match: 0.0008695652173913044
Reader F1-Score: 0.0752376647890378


In [None]:
pipe = ExtractiveQAPipeline(reader, retriever)

In [None]:
query = "Is a snare included?"
# DIY drumkit
asin = "B009VDW4OW"
number_of_answers_to_fetch = 3

prediction = pipe.run(query=query, filters={"asin": [asin]}, top_k_retriever=10, top_k_reader=number_of_answers_to_fetch)
print(f"Question: {prediction['query']}")
print("\n")
# Iterate over the answers actually returned, which may be fewer than requested
for i, answer in enumerate(prediction['answers'], start=1):
    print(f"#{i}")
    print(f"Answer: {answer['answer']}")
    print(f"ASIN: {answer['meta']['asin']}")
    print(f"Is answerable?: {answer['meta']['is_answerable']}")
    print(f"Context: {answer['context']}")
    print('\n\n')

03/05/2021 14:39:19 - INFO - elasticsearch -   POST http://localhost:9200/document/_search [status:200 request:0.088s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  7.25 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.66 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.14 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 15.83 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 15.97 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 15.93 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 18.37 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 18.49 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 15.88 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 15.86 Batches/s]

Question: Is a snare included?


#1
Answer: this one only came with one
ASIN: B009VDW4OW
Is answerable?: 1
Context: the correct sounds out of it. When I slapped the "bass", it would play a "snare" sound combined with the bass. When I slapped the "snare", I would just get a wood sound.I've also seen images that most cajons come with multiple snares... this one only came with one.I'm really not sure what else to say. I wanted a Cajon to play with.but didn't want to pay 100.00 plus. This was a great option, Easy to put together with the limited tools I had on hand. And cheap enough that I wasn't worried to have 



#2
Answer: this one only came with one
ASIN: B009VDW4OW
Is answerable?: 1
Context: the correct sounds out of it. When I slapped the "bass", it would play a "snare" sound combined with the bass. When I slapped the "snare", I would just get a wood sound.I've also seen images that most cajons come with multiple snares... this one only came with one.I'm really not sure what else to




## Fine-tuning

### Converting to the true SQuAD format

One problem with our SQuAD dataset is that it is composed of _line-separated_ JSON instead of the single JSON object that SQuAD traditionally uses. So instead of having examples like 

```json
{
    "context": "blah blah",
    "qas": [
        {
            "id": 331392,
            "is_impossible": false,
            "question": "blah blah?",
            "answers": [
                {
                    "answer_start": 2881,
                    "text": "blah blah"
                },
                ...
            ],
            "human_answers": [
                "blah blah",
                ...
            ]
        }
    ]
}
```

what we really need is a JSON of the form

```json
{
    "data": [
        {
            "title": "Beyoncé",
            "paragraphs": [
                {
                    "qas": [
                        {
                            "question": "When did Beyonce start becoming popular?",
                            "id": "56be85543aeaaa14008c9063",
                            "answers": [
                                {
                                    "text": "in the late 1990s",
                                    "answer_start": 269
                                }
                            ],
                            "is_impossible": false
                        }
                        ...
                    ],
                    "context": "Beyoncé ..."
                },
                ...
            ]
        }
    ]
}
```

Let's write a function that does the conversion for us. To warm up, let's load a few examples from the training set that belong to two products:

In [None]:
examples = []

with open(data/"train-qar_squad.jsonl", 'r', encoding='utf-8') as f:
    for line in f:
        ex = json.loads(line)
        qid = ex["qas"][0]["id"]
        asin = qid2asin[qid]
        if asin == "B0057JCYYE" or asin == "B00F9ECDRU":
            examples.append(ex)
        if len(examples) > 4:
            break
examples

We don't need the human answers, but we do need the mapping from `qid` to `asin` so that we can collect all questions together that belong to the same product.

In [None]:
asin2qas = {}
seen_asin = set()

for ex in examples:
    qid = ex["qas"][0]["id"]
    asin = qid2asin[qid]
    qas = [{k:v for k,v in ex["qas"][0].items() if k != "human_answers"}]
    par = [{"qas": qas, "context": ex["context"]}]

    if asin in seen_asin:
        asin2qas[asin].extend(par)
    else:
        asin2qas[asin] = par
        seen_asin.add(asin)


# asin2qas
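
The same grouping can be written with `collections.defaultdict`, which removes the need to track seen ASINs separately. This is a sketch on toy inputs; the helper name `group_by_asin`, the ids and the ASIN strings are made up for illustration.

```python
from collections import defaultdict


def group_by_asin(examples, qid2asin):
    """Group SQuAD-style line records into one paragraph list per ASIN."""
    asin2paragraphs = defaultdict(list)
    for ex in examples:
        # Drop human_answers, which the SQuAD format does not use.
        qa = {k: v for k, v in ex["qas"][0].items() if k != "human_answers"}
        asin = qid2asin[qa["id"]]
        asin2paragraphs[asin].append({"qas": [qa], "context": ex["context"]})
    return dict(asin2paragraphs)


# Toy inputs (hypothetical ids and ASINs).
toy_qid2asin = {1: "ASIN_A", 2: "ASIN_A", 3: "ASIN_B"}
toy_examples = [
    {"context": "c1", "qas": [{"id": 1, "question": "q1", "human_answers": ["x"]}]},
    {"context": "c2", "qas": [{"id": 2, "question": "q2", "human_answers": ["y"]}]},
    {"context": "c3", "qas": [{"id": 3, "question": "q3", "human_answers": ["z"]}]},
]

grouped = group_by_asin(toy_examples, toy_qid2asin)
print({asin: len(pars) for asin, pars in grouped.items()})
```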

In [None]:
squad_data = []

for k,v in asin2qas.items():
    squad_ex = {}
    squad_ex["title"] = k
    squad_ex["paragraphs"] = v
    squad_data.append(squad_ex)
    
squad_data

In [None]:
squad_dict = {"data": squad_data}

In [None]:
with open(data/"train-qar_squad.json", 'w', encoding='utf-8') as f:
    json.dump(squad_dict, f)

In [None]:
# pick out answer fields
with open(data/"val-qar_squad.jsonl", 'r', encoding='utf-8') as f:
    for line in f:
        ex = json.loads(line)
        break

In [None]:
[k for k in ex["qas"][0].keys() if k.startswith("answers")]

['answers_snippet_spans_bleu2',
 'answers_snippet_spans_bleu4',
 'answers_snippet_spans_rouge',
 'answers_sentence_ir',
 'answers_sentence_bleu2',
 'answers_sentence_bleu4']
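
The first validation example exposes six `answers_*` fields, whereas the SQuAD format expects a single `answers` key. One thing to keep in mind when mapping them: a Python dict keeps only the last value written for a key, so renaming all six to `answers` silently keeps just one of them. The sketch below illustrates this on a toy entry with two fields; the answer texts are invented, and picking `answers_snippet_spans_bleu2` in the second half is an arbitrary choice, not a recommendation from the source.

```python
# Toy question entry with several answers_* fields (invented content).
qa = {
    "id": 1,
    "answers_snippet_spans_bleu2": [{"text": "first", "answer_start": 0}],
    "answers_sentence_bleu4": [{"text": "last", "answer_start": 7}],
}

# Renaming every answers_* key to "answers" makes the keys collide,
# so later entries overwrite earlier ones.
collapsed = {("answers" if k.startswith("answers") else k): v for k, v in qa.items()}
print(collapsed)

# Choosing one field explicitly avoids the silent overwrite.
chosen = "answers_snippet_spans_bleu2"  # arbitrary choice for this sketch
explicit = {k: v for k, v in qa.items() if not k.startswith("answers")}
explicit["answers"] = qa[chosen]
print(explicit)
```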

In [None]:
def convert_to_squad_format(input_file: Path, output_file: Path, category: str = "Musical_Instruments"):
    """Group the line-separated records of one category by ASIN and write one SQuAD-style JSON file."""
    asin2qas = {}

    with open(input_file, 'r', encoding='utf-8') as f:
        for line in tqdm(f):
            row = json.loads(line)
            qid = row["qas"][0]["id"]
            if qid2category[qid] == category:
                asin = qid2asin[qid]
                # Any field whose name starts with `answers` is renamed to `answers`.
                # If a row has several such fields they collide and the last one wins.
                qas = [{("answers" if k.startswith("answers") else k): v for k, v in row["qas"][0].items()}]
                asin2qas.setdefault(asin, []).append({"qas": qas, "context": row["context"]})

    # One SQuAD "article" per product, titled by its ASIN
    squad_data = [{"title": asin, "paragraphs": pars} for asin, pars in asin2qas.items()]

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump({"data": squad_data}, f)

In [None]:
convert_to_squad_format(data/'train-qar_squad.jsonl', data/'train-qar_squad-music.json')

455931it [00:11, 39335.58it/s]


In [None]:
convert_to_squad_format(data/'val-qar_squad.jsonl', data/'val-qar_squad-music.json')

58969it [00:03, 18374.60it/s]


### Load single example

In [None]:
val_df = pd.read_json(data/'val-qar_squad-music.json')

In [None]:
val_df

Unnamed: 0,data
0,"{'title': 'B007A98S8U', 'paragraphs': [{'qas': [{'id': 63917, 'is_impossible..."
1,"{'title': 'B00D6RMFG6', 'paragraphs': [{'qas': [{'id': 63342, 'is_impossible..."
2,"{'title': 'B007566BLE', 'paragraphs': [{'qas': [{'id': 86214, 'is_impossible..."
3,"{'title': 'B001QCXSDW', 'paragraphs': [{'qas': [{'id': 21580, 'is_impossible..."
4,"{'title': 'B006Z9D9UI', 'paragraphs': [{'qas': [{'id': 8108, 'is_impossible'..."
...,...
909,"{'title': 'B005M1U7GO', 'paragraphs': [{'qas': [{'id': 34674, 'is_impossible..."
910,"{'title': 'B005OEM43S', 'paragraphs': [{'qas': [{'id': 77380, 'is_impossible..."
911,"{'title': 'B00CZ6VB6Y', 'paragraphs': [{'qas': [{'id': 83011, 'is_impossible..."
912,"{'title': 'B001EC5ECW', 'paragraphs': [{'qas': [{'id': 21693, 'is_impossible..."


### Fine-tune model

Either something is wrong with my data preparation or getting the model to generalise is _hard_!
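
One way to probe the data-preparation hypothesis is to check that every `answer_start` offset really points at the answer text inside its context. The sketch below assumes each answer is a dict with `text` and `answer_start` keys, as in the SQuAD layout shown earlier; the helper name `find_span_mismatches` and the toy data are hypothetical.

```python
def find_span_mismatches(squad_dict):
    """Return (question id, expected text, text found at answer_start) for bad spans."""
    mismatches = []
    for article in squad_dict["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa.get("answers", []):
                    start = answer["answer_start"]
                    text = answer["text"]
                    found = context[start:start + len(text)]
                    if found != text:
                        mismatches.append((qa["id"], text, found))
    return mismatches


# Toy SQuAD-style data: the first span is correct, the second is not.
toy_squad = {
    "data": [
        {
            "title": "ASIN_A",
            "paragraphs": [
                {
                    "context": "It comes with one snare.",
                    "qas": [
                        {"id": 1, "answers": [{"text": "one snare", "answer_start": 14}]},
                        {"id": 2, "answers": [{"text": "two snares", "answer_start": 0}]},
                    ],
                }
            ],
        }
    ]
}

mismatches = find_span_mismatches(toy_squad)
print(mismatches)
```

Running the same check on the converted training and validation files would show whether misaligned spans are present, which is one possible (but unconfirmed) explanation for very low Exact Match scores.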

In [None]:
train_data = "data/amazon-qa/"

In [None]:
reader.train(data_dir=train_data, 
             train_filename="train-qar_squad-music.json", 
             dev_filename="val-qar_squad-music.json", 
             use_gpu=True, n_epochs=1, save_dir="models/haystack")

03/06/2021 11:37:42 - INFO - farm.utils -   Using device: CUDA 
03/06/2021 11:37:42 - INFO - farm.utils -   Number of GPUs: 1
03/06/2021 11:37:42 - INFO - farm.utils -   Distributed Training: False
03/06/2021 11:37:42 - INFO - farm.utils -   Automatic Mixed Precision: None
Preprocessing Dataset data/amazon-qa/train-qar_squad-music.json: 100%|██████████| 8694/8694 [00:11<00:00, 742.55 Dicts/s] 
Preprocessing Dataset data/amazon-qa/val-qar_squad-music.json: 100%|██████████| 1150/1150 [00:02<00:00, 388.12 Dicts/s]
03/06/2021 11:38:06 - INFO - farm.modeling.optimization -   Loading optimizer `TransformersAdamW`: '{'correct_bias': False, 'weight_decay': 0.01, 'lr': 1e-05}'
03/06/2021 11:38:06 - INFO - farm.modeling.optimization -   Using scheduler 'get_linear_schedule_with_warmup'
03/06/2021 11:38:06 - INFO - farm.modeling.optimization -   Loading schedule `get_linear_schedule_with_warmup`: '{'num_training_steps': 4210, 'num_warmup_steps': 842}'
Train epoch 0/0 (Cur. train loss: 2.0989):   
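
The log reports 4210 training steps, 842 of which are warmup, so the learning rate ramps up over the first 20% of training. The sketch below computes the multiplier that a linear warmup followed by linear decay applies to the base learning rate. It is written from the usual definition of such a schedule and is assumed, not verified, to match the `get_linear_schedule_with_warmup` scheduler named in the log.

```python
NUM_TRAINING_STEPS = 4210  # 'num_training_steps' in the log
NUM_WARMUP_STEPS = 842     # 'num_warmup_steps' in the log
BASE_LR = 1e-05            # 'lr' in the optimizer log line


def lr_scale(step, warmup=NUM_WARMUP_STEPS, total=NUM_TRAINING_STEPS):
    """Multiplier on the base learning rate at a given optimizer step."""
    if step < warmup:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, warmup)
    # Linear decay from 1 to 0 over the remaining steps.
    return max(0.0, (total - step) / max(1, total - warmup))


warmup_fraction = NUM_WARMUP_STEPS / NUM_TRAINING_STEPS
print("warmup fraction:", warmup_fraction)
for step in (0, 421, 842, 2526, 4210):
    print(step, BASE_LR * lr_scale(step))
```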

In [None]:
new_reader = FARMReader(model_name_or_path="models/haystack")

03/06/2021 12:40:51 - INFO - farm.utils -   Using device: CUDA 
03/06/2021 12:40:51 - INFO - farm.utils -   Number of GPUs: 1
03/06/2021 12:40:51 - INFO - farm.utils -   Distributed Training: False
03/06/2021 12:40:51 - INFO - farm.utils -   Automatic Mixed Precision: None
03/06/2021 12:40:54 - INFO - farm.utils -   Using device: CUDA 
03/06/2021 12:40:54 - INFO - farm.utils -   Number of GPUs: 1
03/06/2021 12:40:54 - INFO - farm.utils -   Distributed Training: False
03/06/2021 12:40:54 - INFO - farm.utils -   Automatic Mixed Precision: None
03/06/2021 12:40:55 - INFO - farm.infer -   Got ya 7 parallel workers to do inference ...
03/06/2021 12:40:55 - INFO - farm.infer -    0    0    0    0    0    0    0 
03/06/2021 12:40:55 - INFO - farm.infer -   /w\  /w\  /w\  /w\  /w\  /w\  /w\
03/06/2021 12:40:55 - INFO - farm.infer -   /'\  / \  /'\  /'\  / \  / \  /'\
03/06/2021 12:40:55 - INFO - farm.infer -               


In [None]:
# eval
reader_eval_results = new_reader.eval_on_file("data/amazon-qa", "val-qar_squad-music.json", device='cuda')

Preprocessing Dataset data/amazon-qa/val-qar_squad-music.json: 100%|██████████| 1150/1150 [00:02<00:00, 390.39 Dicts/s]
Evaluating: 100%|██████████| 133/133 [01:17<00:00,  1.71it/s]


In [None]:
print("Reader Top-N-Accuracy:", reader_eval_results["top_n_accuracy"])
## Reader Exact Match is the proportion of questions where the predicted answer is exactly the same as the correct answer
print("Reader Exact Match:", reader_eval_results["EM"])
## Reader F1-Score is the average overlap between the predicted answers and the correct answers
print("Reader F1-Score:", reader_eval_results["f1"])

Reader Top-N-Accuracy: 0.7417391304347826
Reader Exact Match: 0.0
Reader F1-Score: 0.0


In [None]:
pipe = ExtractiveQAPipeline(new_reader, retriever)

In [None]:
query = "Is a snare included?"
# DIY drumkit
asin = "B009VDW4OW"
number_of_answers_to_fetch = 3

prediction = pipe.run(query=query, filters={"asin": [asin]}, top_k_retriever=10, top_k_reader=number_of_answers_to_fetch)
print(f"Question: {prediction['query']}")
print("\n")
# Iterate over the answers actually returned, which may be fewer than requested
for i, answer in enumerate(prediction['answers'], start=1):
    print(f"#{i}")
    print(f"Answer: {answer['answer']}")
    print(f"ASIN: {answer['meta']['asin']}")
    print(f"Is answerable?: {answer['meta']['is_answerable']}")
    print(f"Context: {answer['context']}")
    print('\n\n')

Traceback (most recent call last):
  File "/root/miniconda3/envs/transformerlab/lib/python3.8/site-packages/urllib3/connection.py", line 156, in _new_conn
    conn = connection.create_connection(
  File "/root/miniconda3/envs/transformerlab/lib/python3.8/site-packages/urllib3/util/connection.py", line 84, in create_connection
    raise err
  File "/root/miniconda3/envs/transformerlab/lib/python3.8/site-packages/urllib3/util/connection.py", line 74, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/envs/transformerlab/lib/python3.8/site-packages/elasticsearch/connection/http_urllib3.py", line 245, in perform_request
    response = self.pool.urlopen(
  File "/root/miniconda3/envs/transformerlab/lib/python3.8/site-packages/urllib3/connectionpool.py", line 719, in urlopen
    retries = retries.increment(
  File "/ro

ConnectionError: ConnectionError(<urllib3.connection.HTTPConnection object at 0x7ff914bc2d00>: Failed to establish a new connection: [Errno 111] Connection refused) caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x7ff914bc2d00>: Failed to establish a new connection: [Errno 111] Connection refused)

### Evaluation

#### New reader

In [None]:
reader_eval_results = new_reader.eval_on_file(train_data, "val-qar_squad-music.json", device='cuda')

## Reader Top-N-Accuracy is the proportion of predicted answers that match with their corresponding correct answer
print("Reader Top-N-Accuracy:", reader_eval_results["top_n_accuracy"])
## Reader Exact Match is the proportion of questions where the predicted answer is exactly the same as the correct answer
print("Reader Exact Match:", reader_eval_results["EM"])
## Reader F1-Score is the average overlap between the predicted answers and the correct answers
print("Reader F1-Score:", reader_eval_results["f1"])

Preprocessing Dataset data/amazon-qa/val-qar_squad-music.json: 100%|██████████| 1828/1828 [00:03<00:00, 507.82 Dicts/s]
Evaluating: 100%|██████████| 238/238 [02:16<00:00,  1.74it/s]


Reader Top-N-Accuracy: 0.5
Reader Exact Match: 0.0
Reader F1-Score: 0.0


#### SQuAD reader

In [None]:
reader_eval_results = reader.eval_on_file(train_data, "train-qar_squad-music.json", device='cuda')

## Reader Top-N-Accuracy is the proportion of predicted answers that match with their corresponding correct answer
print("Reader Top-N-Accuracy:", reader_eval_results["top_n_accuracy"])
## Reader Exact Match is the proportion of questions where the predicted answer is exactly the same as the correct answer
print("Reader Exact Match:", reader_eval_results["EM"])
## Reader F1-Score is the average overlap between the predicted answers and the correct answers
print("Reader F1-Score:", reader_eval_results["f1"])

Preprocessing Dataset data/amazon-qa/train-qar_squad-music.json: 100%|██████████| 2100/2100 [00:03<00:00, 664.52 Dicts/s]
Evaluating: 100%|██████████| 210/210 [02:00<00:00,  1.75it/s]


Reader Top-N-Accuracy: 0.0
Reader Exact Match: 0.0
Reader F1-Score: 0.0
