# Question Answering with PyTorch Transformers: Part 2
## A simple vector index with scikit-learn

Article for this notebook: https://medium.com/@patonw/question-answering-with-pytorch-transformers-part-2-a31900294673

> In Part 1 we briefly examined the problem of question answering in machine learning and how recent breakthroughs have greatly improved the quality of answers produced by computer systems.
>
> Using the pipeline API of the Transformers library, we were able to run a pre-trained model in a few lines of code. In this article we’ll prototype an information retrieval system around it. In later articles we’ll turn that into web services that can be queried by browsers and mobile apps.

In [None]:
# Prepare to run on Paperspace. On your own machine, manage these dependencies with pipenv or conda.
#import os
#os.environ["PYTORCH_PRETRAINED_BERT_CACHE"] = "/storage/torch"

!pip install torch transformers scikit-learn spacy[cuda100] pyarrow
!python -m spacy download en_core_web_sm
!/bin/bash -c "[[ ! -f train-v2.0.json ]] && wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json"

In [2]:
import os
import random
import pandas as pd
import json
import sklearn
import spacy

import numpy as np
import torch
import torch.nn.functional as F
from itertools import islice
from tqdm import tqdm

from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import pipeline

In [3]:
LEMMA_CACHE = "squad_context.feather"

In [4]:
spacy.prefer_gpu()
sp = spacy.load("en_core_web_sm")

### Extract questions from SQuAD 2.0

In [5]:
with open("train-v2.0.json") as f:
    doc = json.load(f)
doc.keys(), type(doc["data"]), len(doc["data"])

(dict_keys(['version', 'data']), list, 442)

In [6]:
paragraphs = []
questions = []
for topic in doc["data"]:
    for pgraph in topic["paragraphs"]:
        paragraphs.append(pgraph["context"])
        for qa in pgraph["qas"]:
            if not qa["is_impossible"]:
                questions.append(qa["question"])
        
len(paragraphs), len(questions), random.sample(paragraphs, 2), random.sample(questions, 5)

(19035,
 86821,
 ['The main campus in Provo, Utah, United States sits on approximately 560 acres (2.3 km2) nestled at the base of the Wasatch Mountains and includes 295 buildings. The buildings feature a wide variety of architectural styles, each building being built in the style of its time. The grass, trees, and flower beds on BYU\'s campus are impeccably maintained. Furthermore, views of the Wasatch Mountains, (including Mount Timpanogos) can be seen from the campus. BYU\'s Harold B. Lee Library (also known as "HBLL"), which The Princeton Review ranked as the No. 1 "Great College Library" in 2004, has approximately 8½ million items in its collections, contains 98 miles (158 km) of shelving, and can seat 4,600 people. The Spencer W. Kimball Tower, shortened to SWKT and pronounced Swicket by many students, is home to several of the university\'s departments and programs and is the tallest building in Provo, Utah. Furthermore, BYU\'s Marriott Center, used as a basketball arena, can sea

### Map words to lemmas

In [7]:
def lemmatize(phrase):
    return " ".join([word.lemma_ for word in sp(phrase)])

In [8]:
%%time

if not os.path.isfile(LEMMA_CACHE):
    lemmas = [lemmatize(par) for par in tqdm(paragraphs)]
    df = pd.DataFrame(data={'context': paragraphs, 'lemmas': lemmas})
    df.to_feather(LEMMA_CACHE)
    
df = pd.read_feather(LEMMA_CACHE)
paragraphs = df.context
lemmas = df.lemmas

CPU times: user 238 ms, sys: 48.3 ms, total: 286 ms
Wall time: 285 ms


In [9]:
rand_idx = [random.randint(0, len(lemmas)-1) for i in range(10)]

# TODO display in left/right columns
[(paragraphs[i][:80], lemmas[i][:80]) for i in rand_idx]

[('The loss of eight battleships and 2,403 Americans at Pearl Harbor forced the U.S',
  'the loss of eight battleship and 2,403 Americans at Pearl Harbor force the U.S. '),
 ('The most expensive part of a CD is the jewel case. In 1995, material costs were ',
  'the most expensive part of a CD be the jewel case . in 1995 , material cost be 3'),
 ('The third-generation iPod had a weak bass response, as shown in audio tests. The',
  'the third - generation iPod have a weak bass response , as show in audio test . '),
 ('On 6 September 2007, Belgian-based International Polar Foundation unveiled the P',
  'on 6 September 2007 , Belgian - base International Polar Foundation unveil the P'),
 ('Despite the small land mass, place names are repeated; there are, for example, t',
  'despite the small land mass , place name be repeat ; there be , for example , tw'),
 ("Copper's greater conductivity versus other metals enhances the electrical energy",
  "copper 's great conductivity versus other meta

### Vectorize corpus by TF-IDF

In [10]:
%%time
vectorizer = TfidfVectorizer(
    stop_words='english', min_df=5, max_df=.5, ngram_range=(1,3))
tfidf = vectorizer.fit_transform(lemmas)
len(vectorizer.vocabulary_)

CPU times: user 14.4 s, sys: 508 ms, total: 14.9 s
Wall time: 14.8 s


37931
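To make the weighting step less opaque, here is a minimal pure-Python sketch of the arithmetic behind TF-IDF. It assumes scikit-learn's documented defaults (raw term counts, smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, then L2 normalization) and deliberately skips the stop-word, `min_df`/`max_df`, and n-gram handling used in the cell above. The helper `tfidf_vectors` and the toy documents are hypothetical and only for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (idf, vectors): an IDF table and one L2-normalized
    {term: weight} dict per whitespace-tokenized document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (assumed to match scikit-learn's smooth_idf=True default)
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return idf, vectors

# Hypothetical, already-lemmatized toy corpus
toy_docs = [
    "the gregorian calendar replace the julian calendar",
    "the ipod have a weak bass response",
    "the julian calendar be use in many country",
]
toy_idf, toy_vectors = tfidf_vectors(toy_docs)
```

Terms that occur in every document (here `the`) get the minimum IDF, while rarer terms such as `ipod` are weighted more heavily. That is the property the retrieval step relies on.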

### Fetch contexts related to question

In [11]:
question = "When did the last country to adopt the Gregorian calendar start using it?"
query = vectorizer.transform([lemmatize(question)])
(query > 0).sum(), vectorizer.inverse_transform(query)

(9, [array(['adopt', 'calendar', 'country', 'gregorian', 'gregorian calendar',
         'start', 'start use', 'use', 'use pron'], dtype='<U42')])

In [12]:
%%time
scores = (tfidf * query.T).toarray()
results = (np.flip(np.argsort(scores, axis=0)))
[paragraphs[i] for i in results[:3, 0]]

CPU times: user 6.94 ms, sys: 273 µs, total: 7.22 ms
Wall time: 6.03 ms


['"Old Style" (OS) and "New Style" (NS) are sometimes added to dates to identify which system is used in the British Empire and other countries that did not immediately change. Because the Calendar Act of 1750 altered the start of the year, and also aligned the British calendar with the Gregorian calendar, there is some confusion as to what these terms mean. They can indicate that the start of the Julian year has been adjusted to start on 1 January (NS) even though contemporary documents use a different start of year (OS); or to indicate that a date conforms to the Julian calendar (OS), formerly in use in many countries, rather than the Gregorian calendar (NS).',
 'During the period between 1582, when the first countries adopted the Gregorian calendar, and 1923, when the last European country adopted it, it was often necessary to indicate the date of some event in both the Julian calendar and in the Gregorian calendar, for example, "10/21 February 1750/51", where the dual year accounts
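The ranking in the cell above works because `TfidfVectorizer` L2-normalizes each row by default, so the product `tfidf * query.T` yields cosine similarities directly. A minimal sketch of the same idea with plain dictionaries follows; the helpers `dot` and `top_k` and the toy vectors are hypothetical and are not part of the notebook.

```python
def dot(u, v):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def top_k(query_vec, doc_vecs, k=3):
    """Return up to k (doc_index, score) pairs, highest score first."""
    scored = [(dot(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, s) for s, i in scored[:k]]

# Hypothetical unit-length vectors, so the dot product is the cosine similarity
toy_doc_vecs = [
    {"gregorian": 0.6, "calendar": 0.8},
    {"ipod": 1.0},
    {"julian": 0.8, "calendar": 0.6},
]
toy_query = {"gregorian": 0.6, "calendar": 0.8}
ranking = top_k(toy_query, toy_doc_vecs, k=2)
```

In this toy case the first document matches the query exactly, and the third ranks second because it shares only the `calendar` term. The notebook gets the same ordering at scale with `np.argsort` over the sparse matrix product.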

### Extract answers from contexts

In [13]:
qapipe = pipeline('question-answering',
                  model='distilbert-base-uncased-distilled-squad',
                  tokenizer='bert-base-uncased')

In [14]:
%%time
THRESH = 0.01
# scores has shape (n_paragraphs, 1); take the scalar so the threshold test compares floats
candidate_idxs = [ (i, scores[i, 0]) for i in results[0:10, 0] ]
contexts = [ (paragraphs[i],s)
    for (i,s) in candidate_idxs if s > THRESH ]

question_df = pd.DataFrame.from_records([ {
    'question': question,
    'context':  ctx
} for (ctx,s) in contexts ])

question_df.to_feather("question_context.feather")

CPU times: user 3.19 ms, sys: 417 µs, total: 3.6 ms
Wall time: 2.75 ms


In [18]:
%%time
preds = qapipe(question_df.to_dict(orient="records"))
answer_df = pd.DataFrame.from_records(preds)
answer_df["context"] = question_df["context"]
answer_df = answer_df.sort_values(by="score", ascending=False)
answer_df.head()

Converting examples to features: 100%|██████████| 10/10 [00:00<00:00, 111.43it/s]


CPU times: user 266 ms, sys: 103 ms, total: 369 ms
Wall time: 369 ms


|   | score | start | end | answer | context |
|---|---|---|---|---|---|
| 1 | 0.973023 | 93 | 98 | 1923, | During the period between 1582, when the first... |
| 8 | 0.920178 | 473 | 489 | 15 October 1582. | Philip II of Spain decreed the change from the... |
| 3 | 0.809257 | 324 | 334 | 1 January, | Extending the Gregorian calendar backwards to ... |
| 0 | 0.641907 | 441 | 450 | 1 January | "Old Style" (OS) and "New Style" (NS) are some... |
| 2 | 0.547818 | 566 | 590 | Friday, 15 October 1582, | In conjunction with the system of months there... |


In [19]:
answer_df.head().to_dict(orient="records")

[{'score': 0.9730233193969333,
  'start': 93,
  'end': 98,
  'answer': '1923,',
  'context': 'During the period between 1582, when the first countries adopted the Gregorian calendar, and 1923, when the last European country adopted it, it was often necessary to indicate the date of some event in both the Julian calendar and in the Gregorian calendar, for example, "10/21 February 1750/51", where the dual year accounts for some countries already beginning their numbered year on 1 January while others were still using some other date. Even before 1582, the year sometimes had to be double dated because of the different beginnings of the year in various countries. Woolley, writing in his biography of John Dee (1527–1608/9), notes that immediately after 1582 English letter writers "customarily" used "two dates" on their letters, one OS and one NS.'},
 {'score': 0.9201782766888371,
  'start': 473,
  'end': 489,
  'answer': '15 October 1582.',
  'context': 'Philip II of Spain decreed the chang