# Question Answering with PyTorch Transformers: Part 2
## A simple vector index with scikit-learn

Read the full article: https://medium.com/@patonw/question-answering-with-pytorch-transformers-part-2-a31900294673

> In Part 1 we briefly examined the problem of question answering in machine learning and how recent breakthroughs have greatly improved the quality of answers produced by computer systems.
>
> Using the pipeline API of the Transformers library, we were able to run a pre-trained model in a few lines of code. In this article we’ll prototype an information retrieval system around it. In later articles we’ll turn that into web services that can be queried by browsers and mobile apps.

In [1]:
# Prepare to run in Paperspace. On your own machine, manage dependencies with pipenv or conda instead.
# Run init_container from a Terminal window for debugging
# I'd rather not have the output filling up the screen here.
%run init_container.py

In [2]:
from constants import *

In [3]:
import os
import requests
import random
import pickle

import pandas as pd
import json
import sklearn
import spacy

import numpy as np
import torch
import torch.nn.functional as F
from itertools import islice
from tqdm import tqdm

from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import pipeline  # the only Transformers name used in the cells below

In [4]:
spacy.prefer_gpu()
sp = spacy.load("en_core_web_sm")

### Extract Questions from SQuAD 2.0

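Download the dataset if it is not already present, then load it and collect every paragraph along with its answerable questions (those not marked `is_impossible`).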
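
The loading cell opens `SQUAD_TRAIN` directly, so the file has to exist before it runs. A minimal sketch of a download-if-missing step follows. It assumes the training set is fetched from the public SQuAD explorer URL shown in the code; `ensure_file` and `SQUAD_URL` are hypothetical names that do not appear in the notebook, and the standard library is used in place of `requests`.

```python
import os
import urllib.request

# Assumption: the SQuAD 2.0 training set as published on the SQuAD explorer site.
SQUAD_URL = "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json"


def ensure_file(path, url=SQUAD_URL):
    """Download `url` to `path` unless the file already exists; return `path`."""
    if os.path.isfile(path):
        return path  # already downloaded, nothing to do
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    tmp = path + ".part"
    urllib.request.urlretrieve(url, tmp)  # write to a temporary name first
    os.replace(tmp, path)                 # rename only after a complete download
    return path
```

Calling `ensure_file(SQUAD_TRAIN)` before the loading cell would make the notebook self-sufficient on a fresh machine.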

In [5]:
with open(SQUAD_TRAIN) as f:
    doc = json.load(f)
doc.keys(), type(doc["data"]), len(doc["data"])

(dict_keys(['version', 'data']), list, 442)

In [6]:
paragraphs = []
questions = []
for topic in doc["data"]:
    for pgraph in topic["paragraphs"]:
        paragraphs.append(pgraph["context"])
        for qa in pgraph["qas"]:
            if not qa["is_impossible"]:
                questions.append(qa["question"])
        
len(paragraphs), len(questions), random.sample(paragraphs, 2), random.sample(questions, 5)

(19035,
 86821,
 ["Since the 1979 Revolution, to overcome foreign embargoes, Iran has developed its own military industry, produced its own tanks, armored personnel carriers, guided missiles, submarines, military vessels, guided missile destroyer, radar systems, helicopters and fighter planes. In recent years, official announcements have highlighted the development of weapons such as the Hoot, Kowsar, Zelzal, Fateh-110, Shahab-3 and Sejjil missiles, and a variety of unmanned aerial vehicles (UAVs). The Fajr-3 (MIRV) is currently Iran's most advanced ballistic missile, it is a liquid fuel missile with an undisclosed range which was developed and produced domestically.",
  'Israel operates under a parliamentary system as a democratic republic with universal suffrage. A member of parliament supported by a parliamentary majority becomes the prime minister—usually this is the chair of the largest party. The prime minister is the head of government and head of the cabinet. Israel is governed

### Map words to lemmas

In [7]:
def lemmatize(phrase):
    return " ".join([word.lemma_ for word in sp(phrase)])

In [8]:
%%time

if not os.path.isfile(LEMMA_CACHE):
    lemmas = [lemmatize(par) for par in tqdm(paragraphs)]
    df = pd.DataFrame(data={'context': paragraphs, 'lemmas': lemmas})
    df.to_feather(LEMMA_CACHE)
    
df = pd.read_feather(LEMMA_CACHE)
paragraphs = df.context
lemmas = df.lemmas

CPU times: user 33.3 ms, sys: 20.6 ms, total: 53.9 ms
Wall time: 88.4 ms
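
This cell and the TF-IDF cell further down share one pattern: compute an expensive result once, write it to disk, and load it on later runs (the millisecond timings shown are presumably cache hits). Here is a sketch of that pattern as a reusable helper, assuming the value can be pickled. `load_or_compute` is a hypothetical name, and the notebook itself stores the DataFrame as Feather rather than pickle.

```python
import os
import pickle


def load_or_compute(path, compute):
    """Return the object pickled at `path`; on a miss, call `compute()` and save it."""
    if os.path.isfile(path):
        with open(path, "rb") as f:
            return pickle.load(f)  # only unpickle files you created yourself
    value = compute()
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(value, f)
    os.replace(tmp, path)  # rename last, so an interrupted run leaves no partial cache
    return value
```

With such a helper, the vectorizer cell could shrink to a single call that passes a function building the `dict(vectorizer=..., tfidf=...)` payload.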


In [9]:
rand_idx = [random.randint(0, len(lemmas)-1) for i in range(10)]

# TODO display in left/right columns
[(paragraphs[i][:80], lemmas[i][:80]) for i in rand_idx]

[('Most prime ministers in parliamentary systems are not appointed for a specific t',
  'Most prime minister in parliamentary system be not appoint for a specific term i'),
 ('The Roman Catholic Diocese of Charleston Office of Education also operates out o',
  'the roman Catholic Diocese of Charleston Office of Education also operate out of'),
 ('The Super Nintendo Entertainment System (officially abbreviated the Super NES[b]',
  'the Super Nintendo Entertainment System ( officially abbreviate the Super NES[b '),
 ('Paris is a major international air transport hub with the 4th busiest airport sy',
  'Paris be a major international air transport hub with the 4th busy airport syste'),
 ('For temperature studies, subjects must remain awake but calm and semi-reclined i',
  'for temperature study , subject must remain awake but calm and semi - recline in'),
 ('The Government of Estonia (Estonian: Vabariigi Valitsus) or the executive branch',
  'the Government of Estonia ( Estonian : Vabarii

### Vectorize corpus by TF-IDF

In [10]:
%%time
if not os.path.isfile(VECTOR_CACHE):
    vectorizer = TfidfVectorizer(
        stop_words='english', min_df=5, max_df=.5, ngram_range=(1,3))
    tfidf = vectorizer.fit_transform(lemmas)
    with open(VECTOR_CACHE, "wb") as f:
        pickle.dump(dict(vectorizer=vectorizer, tfidf=tfidf), f)
else:
    with open(VECTOR_CACHE, "rb") as f:
        cache = pickle.load(f)
        tfidf = cache["tfidf"]
        vectorizer = cache["vectorizer"]
        
len(vectorizer.vocabulary_)

CPU times: user 546 ms, sys: 101 ms, total: 647 ms
Wall time: 645 ms


37931
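
For intuition about what ends up in `tfidf`, here is a minimal pure-Python sketch of the weighting scikit-learn applies with its default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. It deliberately leaves out the stop-word list, the `min_df`/`max_df` pruning and the n-grams configured above. `tfidf_vectors` and the toy documents are illustrative names, not part of the notebook.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF, as used by scikit-learn when smooth_idf=True (the default).
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = []
    for doc in docs:
        weights = {t: count * idf[t] for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # L2-normalize so that every non-empty document has unit length.
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


docs = [
    ["gregorian", "calendar", "adopt"],
    ["julian", "calendar"],
    ["paris", "airport"],
]
vecs = tfidf_vectors(docs)
```

In the first toy document, `calendar` receives a lower weight than `gregorian` because it also occurs in the second document. This is the effect that lets rare query terms dominate retrieval.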

### Fetch contexts related to question

In [11]:
question = "When did the last country to adopt the Gregorian calendar start using it?"
query = vectorizer.transform([lemmatize(question)])
(query > 0).sum(), vectorizer.inverse_transform(query)

(9, [array(['adopt', 'calendar', 'country', 'gregorian', 'gregorian calendar',
         'start', 'start use', 'use', 'use pron'], dtype='<U42')])

In [12]:
%%time
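# TF-IDF rows are L2-normalized by default, so this sparse product is the cosine similarity.
scores = (tfidf * query.T).toarray()           # shape: (n_paragraphs, 1)
results = np.flip(np.argsort(scores, axis=0))  # paragraph indices, best match first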
[paragraphs[i] for i in results[:3, 0]]

CPU times: user 4.49 ms, sys: 562 µs, total: 5.05 ms
Wall time: 4.25 ms


['"Old Style" (OS) and "New Style" (NS) are sometimes added to dates to identify which system is used in the British Empire and other countries that did not immediately change. Because the Calendar Act of 1750 altered the start of the year, and also aligned the British calendar with the Gregorian calendar, there is some confusion as to what these terms mean. They can indicate that the start of the Julian year has been adjusted to start on 1 January (NS) even though contemporary documents use a different start of year (OS); or to indicate that a date conforms to the Julian calendar (OS), formerly in use in many countries, rather than the Gregorian calendar (NS).',
 'During the period between 1582, when the first countries adopted the Gregorian calendar, and 1923, when the last European country adopted it, it was often necessary to indicate the date of some event in both the Julian calendar and in the Gregorian calendar, for example, "10/21 February 1750/51", where the dual year accounts
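
Because `TfidfVectorizer` L2-normalizes every row by default, the sparse product `tfidf * query.T` above yields the cosine similarity between the question and each paragraph. The same ranking can be sketched without NumPy on dictionary vectors. `dot`, `top_k` and the hand-written vectors are illustrative only, and `heapq.nlargest` stands in for the full `argsort` used in the notebook.

```python
import heapq


def dot(a, b):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def top_k(query_vec, doc_vecs, k=3, thresh=0.0):
    """Return up to k (score, index) pairs with score > thresh, best first."""
    scored = ((dot(query_vec, d), i) for i, d in enumerate(doc_vecs))
    return heapq.nlargest(k, ((s, i) for s, i in scored if s > thresh))


# Hand-written unit-length vectors, standing in for rows of the TF-IDF matrix.
doc_vecs = [
    {"calendar": 0.6, "gregorian": 0.8},
    {"calendar": 1.0},
    {"airport": 1.0},
]
query_vec = {"calendar": 0.6, "gregorian": 0.8}
hits = top_k(query_vec, doc_vecs, k=3)
```

Only the first two toy documents share terms with the query, so `hits` contains two entries even though `k=3`. The third document scores zero and is dropped by the threshold.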

### Extract answers from contexts

In [13]:
qapipe = pipeline('question-answering',
                  model='distilbert-base-uncased-distilled-squad',
                  tokenizer='bert-base-uncased')

In [14]:
%%time
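THRESH = 0.01  # minimum TF-IDF similarity for a paragraph to be passed to the QA model
# Take the ten best-scoring paragraphs; scores has shape (n, 1), so index its single column.
candidate_idxs = [ (i, scores[i, 0]) for i in results[0:10, 0] ]
contexts = [ (paragraphs[i],s)
    for (i,s) in candidate_idxs if s > THRESH ]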

question_df = pd.DataFrame.from_records([ {
    'question': question,
    'context':  ctx
} for (ctx,s) in contexts ])

question_df.to_feather("cache/question_context.feather")

CPU times: user 0 ns, sys: 3.2 ms, total: 3.2 ms
Wall time: 2.44 ms


In [15]:
%%time
preds = qapipe(question_df.to_dict(orient="records"))
answer_df = pd.DataFrame.from_records(preds)
answer_df["context"] = question_df["context"]
answer_df = answer_df.sort_values(by="score", ascending=False)
answer_df.head()

Converting examples to features: 100%|██████████| 10/10 [00:00<00:00, 150.06it/s]


CPU times: user 174 ms, sys: 100 ms, total: 274 ms
Wall time: 308 ms


|   | score | start | end | answer | context |
|---|-------|-------|-----|--------|---------|
| 1 | 0.973023 | 93 | 98 | 1923, | During the period between 1582, when the first... |
| 8 | 0.920178 | 473 | 489 | 15 October 1582. | Philip II of Spain decreed the change from the... |
| 3 | 0.809254 | 324 | 334 | 1 January, | Extending the Gregorian calendar backwards to ... |
| 0 | 0.641908 | 441 | 450 | 1 January | "Old Style" (OS) and "New Style" (NS) are some... |
| 2 | 0.547817 | 566 | 590 | Friday, 15 October 1582, | In conjunction with the system of months there... |
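
The `start` and `end` columns are character offsets into `context`, so each answer is simply the slice `context[start:end]`. For the top row that slice includes a trailing comma. A small post-processing step could tidy such spans before display; the sketch below uses hypothetical helpers `extract_span` and `clean_answer` and a shortened copy of the first context.

```python
def extract_span(context, start, end):
    """Recover the answer text from the character offsets the pipeline returns."""
    return context[start:end]


def clean_answer(text):
    """Trim surrounding whitespace and trailing sentence punctuation."""
    return text.strip().rstrip(".,;:")


# Shortened copy of the top-ranked context from the results above.
context = (
    "During the period between 1582, when the first countries adopted "
    "the Gregorian calendar, and 1923, when the last European country adopted it"
)
raw = extract_span(context, 93, 98)
print(repr(raw), "->", repr(clean_answer(raw)))  # prints '1923,' -> '1923'
```

Stripping punctuation this way is a heuristic: it would also shorten answers that legitimately end in a period, such as abbreviations, so it suits display better than evaluation.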


In [16]:
answer_df.head().to_dict(orient="records")

[{'score': 0.973023084214109,
  'start': 93,
  'end': 98,
  'answer': '1923,',
  'context': 'During the period between 1582, when the first countries adopted the Gregorian calendar, and 1923, when the last European country adopted it, it was often necessary to indicate the date of some event in both the Julian calendar and in the Gregorian calendar, for example, "10/21 February 1750/51", where the dual year accounts for some countries already beginning their numbered year on 1 January while others were still using some other date. Even before 1582, the year sometimes had to be double dated because of the different beginnings of the year in various countries. Woolley, writing in his biography of John Dee (1527–1608/9), notes that immediately after 1582 English letter writers "customarily" used "two dates" on their letters, one OS and one NS.'},
 {'score': 0.9201781041192767,
  'start': 473,
  'end': 489,
  'answer': '15 October 1582.',
  'context': 'Philip II of Spain decreed the change