### BERT for Question Answering

In [1]:
from transformers import BertTokenizer, BertForQuestionAnswering
import torch

  from .autonotebook import tqdm as notebook_tqdm


### A bert base model fined tuned with SQuAD 2 dataset

In [2]:
class PeanutQASystem():
    def __init__(self): 
        self.tokenizer = BertTokenizer.from_pretrained("deepset/bert-base-cased-squad2")
        self.model = BertForQuestionAnswering.from_pretrained("deepset/bert-base-cased-squad2")

    def predict_answer_for_qi(self, qid, data):
        return self.predict_answer(data["question"][qid], data["text"][qid])

    def predict_answer(self, question, context):
        #encoding question + context to input a single vector to bert
        input_ids = self.tokenizer.encode(question, context)
        tokens = self.tokenizer.convert_ids_to_tokens(input_ids) # getting tokens back

        sep_idx = input_ids.index(self.tokenizer.sep_token_id)

        #number of tokens in segment Q (question)
        num_seg_q = sep_idx + 1
        #number of tokens in segment T (text)
        num_seg_t = len(input_ids) - num_seg_q

        #creating the segment ids
        segment_ids = [0]*num_seg_q + [1]*num_seg_t #to differentiate our segments - question and text

        #making sure that every input token has a segment id
        assert len(segment_ids) == len(input_ids)

        output = self.model(torch.tensor([input_ids]),  token_type_ids=torch.tensor([segment_ids]))

        #tokens with highest start and end scores
        answer_start = torch.argmax(output.start_logits)
        answer_end = torch.argmax(output.end_logits)
        if answer_end >= answer_start:
            answer = tokens[answer_start]
            for i in range(answer_start+1, answer_end+1): #removing Bert spetials symbols like ##
                if tokens[i][0:2] == "##":
                    answer += tokens[i][2:]
                else:
                    answer += " " + tokens[i]
        else:
            answer = ""

        if answer.startswith("[CLS]"):
            answer = ""

        return answer

In [3]:
import pandas as pd

### SQuAD

In [4]:
cols = ["text","question","answer"]
squad_raw_data = pd.read_json('./dev-v2.0.json')
comp_list = []
for index, row in squad_raw_data.iterrows():
    for i in range(len(row["data"]["paragraphs"])): # each article
        for j in range(len(row["data"]["paragraphs"][i])): # each paragraph
            for k in range(len(row["data"]["paragraphs"][i]['qas'])):# each question about the selected paragraph
                temp_list = []
                temp_list.append(row["data"]["paragraphs"][i]["context"])
                temp_list.append(row["data"]["paragraphs"][i]['qas'][k]["question"])
                try:
                    temp_list.append(row["data"]["paragraphs"][i]['qas'][k]["answers"][0]["text"])
                except IndexError:
                    temp_list.append("")
                comp_list.append(temp_list)
new_df = pd.DataFrame(comp_list, columns=cols) 
new_df.to_csv('./dev-v2.0.csv', index=False)
data = pd.read_csv('./dev-v2.0.csv')


In [5]:
data.head()


Unnamed: 0,text,question,answer
0,The Normans (Norman: Nourmands; French: Norman...,In what country is Normandy located?,France
1,The Normans (Norman: Nourmands; French: Norman...,When were the Normans in Normandy?,10th and 11th centuries
2,The Normans (Norman: Nourmands; French: Norman...,From which countries did the Norse originate?,"Denmark, Iceland and Norway"
3,The Normans (Norman: Nourmands; French: Norman...,Who was the Norse leader?,Rollo
4,The Normans (Norman: Nourmands; French: Norman...,What century did the Normans first gain their ...,10th century


In [6]:
import numpy as np

In [7]:
peanut_qa = PeanutQASystem()
random_num = np.random.randint(0,len(data))
prediction = peanut_qa.predict_answer_for_qi(random_num, data)

print("Context: ")
print(data["text"][random_num])
print("Question: ")
print(data["question"][random_num])
print("Prediction: ")
print(prediction)

Context: 
In the final years of the apartheid era, parents at white government schools were given the option to convert to a "semi-private" form called Model C, and many of these schools changed their admissions policies to accept children of other races. Following the transition to democracy, the legal form of "Model C" was abolished, however, the term continues to be used to describe government schools formerly reserved for white children.. These schools tend to produce better academic results than government schools formerly reserved for other race groups . Former "Model C" schools are not private schools, as they are state-controlled. All schools in South Africa (including both independent schools and public schools) have the right to set compulsory school fees, and formerly model C schools tend to set much higher school fees than other public schools.
Question: 
How do academic results in former Model C schools compare to other schools?
Prediction: 
better academic results


### CoQA

In [8]:
coqa_raw_data = pd.read_json('coqa-dev-v1.0.json')
comp_list = []
for index, row in coqa_raw_data.iterrows():
    for i in range(len(row["data"]["questions"])):
        temp_list = []
        temp_list.append(row["data"]["story"])
        temp_list.append(row["data"]["questions"][i]["input_text"])
        temp_list.append(row["data"]["answers"][i]["input_text"])
        comp_list.append(temp_list)
new_df = pd.DataFrame(comp_list, columns=cols) 
new_df.to_csv('coqa-dev-v1.0.csv', index=False)
coqa_data = pd.read_csv('coqa-dev-v1.0.csv')

In [9]:
coqa_data.head()

Unnamed: 0,text,question,answer
0,"Once upon a time, in a barn near a farm house,...",What color was Cotton?,white
1,"Once upon a time, in a barn near a farm house,...",Where did she live?,in a barn
2,"Once upon a time, in a barn near a farm house,...",Did she live alone?,no
3,"Once upon a time, in a barn near a farm house,...",Who did she live with?,with her mommy and 5 sisters
4,"Once upon a time, in a barn near a farm house,...",What color were her sisters?,orange and white


In [18]:
random_num = np.random.randint(0,len(coqa_data))
prediction = peanut_qa.predict_answer_for_qi(random_num, coqa_data)

print("Context: ")
print(data["text"][random_num])
print("Question: ")
print(data["question"][random_num])
print("Prediction: ")
print(prediction)

Context: 
The first commercially successful true engine, in that it could generate power and transmit it to a machine, was the atmospheric engine, invented by Thomas Newcomen around 1712. It was an improvement over Savery's steam pump, using a piston as proposed by Papin. Newcomen's engine was relatively inefficient, and in most cases was used for pumping water. It worked by creating a partial vacuum by condensing steam under a piston within a cylinder. It was employed for draining mine workings at depths hitherto impossible, and also for providing a reusable water supply for driving waterwheels at factories sited away from a suitable "head". Water that had passed over the wheel was pumped back up into a storage reservoir above the wheel.
Question: 
Who conceptualized the vacuum?
Prediction: 



### Functions to compute basic metrics

---
|    X   |exact match |f1 score|
|---|---|---
| SQUAD | 60.0 | 73.33|
|---|---|---|
| COQA  |14.8|23.30|
---
