# Question Answering with PyTorch Transformers: Part 2
## A simple vector index with scikit-learn

Article for this notebook: https://medium.com/@patonw/question-answering-with-pytorch-transformers-part-2-a31900294673

> In Part 1 we briefly examined the problem of question answering in machine learning and how recent breakthroughs have greatly improved the quality of answers produced by computer systems.
>
> Using the pipeline API of the Transformers library, we were able to run a pre-trained model in a few lines of code. In this article we’ll prototype an information retrieval system around it. In later articles we’ll turn that into web services that can be queried by browsers and mobile apps.

In [None]:
# Prepare to run on Paperspace. On your own machine, manage these dependencies with pipenv or conda.

!pip install torch transformers scikit-learn spacy[cuda100] pyarrow
!python -m spacy download en_core_web_sm

In [2]:
import os
import requests
import random
import pickle

import pandas as pd
import json
import sklearn
import spacy

import numpy as np
import torch
import torch.nn.functional as F
from itertools import islice
from tqdm import tqdm

from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import pipeline

In [3]:
SQUAD_URL = "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json"
SQUAD_TRAIN = "data/train-v2.0.json"
LEMMA_CACHE = "cache/lemmas.feather"
VECTOR_CACHE = "cache/vectors.pickle"

In [4]:
spacy.prefer_gpu()
sp = spacy.load("en_core_web_sm")

### Extract questions from SQuAD 2.0

Download the dataset if it is not already present.

In [5]:
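# Create the working directories if they do not exist yet.
os.makedirs("data", exist_ok=True)
os.makedirs("cache", exist_ok=True)

if not os.path.isfile(SQUAD_TRAIN):
    response = requests.get(SQUAD_URL, stream=True)
    response.raise_for_status()  # fail early instead of saving an error page

    with open(SQUAD_TRAIN, "wb") as handle:
        # iter_content() defaults to 1-byte chunks; request larger ones.
        for data in tqdm(response.iter_content(chunk_size=8192)):
            handle.write(data)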

In [6]:
with open(SQUAD_TRAIN) as f:
    doc = json.load(f)
doc.keys(), type(doc["data"]), len(doc["data"])

(dict_keys(['version', 'data']), list, 442)

In [7]:
paragraphs = []
questions = []
for topic in doc["data"]:
    for pgraph in topic["paragraphs"]:
        paragraphs.append(pgraph["context"])
        for qa in pgraph["qas"]:
            if not qa["is_impossible"]:
                questions.append(qa["question"])
        
len(paragraphs), len(questions), random.sample(paragraphs, 2), random.sample(questions, 5)

(19035,
 86821,
 ['Mexico City has numerous museums dedicated to art, including Mexican colonial, modern and contemporary art, and international art. The Museo Tamayo was opened in the mid-1980s to house the collection of international contemporary art donated by famed Mexican (born in the state of Oaxaca) painter Rufino Tamayo. The collection includes pieces by Picasso, Klee, Kandinsky, Warhol and many others, though most of the collection is stored while visiting exhibits are shown. The Museo de Arte Moderno (Museum of Modern Art) is a repository of Mexican artists from the 20th century, including Rivera, Orozco, Siqueiros, Kahlo, Gerzso, Carrington, Tamayo, among others, and also regularly hosts temporary exhibits of international modern art. In southern Mexico City, the Museo Carrillo Gil (Carrillo Gil Museum) showcases avant-garde artists, as does the University Museum/Contemporary Art (Museo Universitario Arte Contemporáneo – or MUAC), designed by famed Mexican architect Teodoro 

### Map words to lemmas

In [8]:
def lemmatize(phrase):
    return " ".join([word.lemma_ for word in sp(phrase)])

In [9]:
%%time

if not os.path.isfile(LEMMA_CACHE):
    lemmas = [lemmatize(par) for par in tqdm(paragraphs)]
    df = pd.DataFrame(data={'context': paragraphs, 'lemmas': lemmas})
    df.to_feather(LEMMA_CACHE)
    
df = pd.read_feather(LEMMA_CACHE)
paragraphs = df.context
lemmas = df.lemmas

CPU times: user 24.1 ms, sys: 30.7 ms, total: 54.8 ms
Wall time: 76.9 ms
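The compute-once, reload-later pattern used for the lemma cache can be factored into a small helper. The following is a minimal sketch, not part of the original notebook: the `cached` function, the `expensive` stand-in, and the temporary path are all hypothetical, and it uses pickle rather than Feather so that it works for any Python object.

```python
import os
import pickle
import tempfile


def cached(path, compute):
    """Load a pickled result from `path`, computing and storing it first if needed."""
    if not os.path.isfile(path):
        # Create the parent directory on demand, then store the fresh result.
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        with open(path, "wb") as f:
            pickle.dump(compute(), f)
    with open(path, "rb") as f:
        return pickle.load(f)


calls = []


def expensive():
    """Hypothetical stand-in for a slow step such as lemmatizing the corpus."""
    calls.append(1)
    return {"answer": 42}


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "cache", "demo.pickle")
    first = cached(cache_path, expensive)   # computes and writes the cache
    second = cached(cache_path, expensive)  # reads the cache, no recompute

print(len(calls), first == second)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.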


In [10]:
rand_idx = [random.randint(0, len(lemmas)-1) for i in range(10)]

# TODO display in left/right columns
[(paragraphs[i][:80], lemmas[i][:80]) for i in rand_idx]

[('For vertebrates, the early stages of neural development are similar across all s',
  'for vertebrate , the early stage of neural development be similar across all spe'),
 ('Compression efficiency of encoders is typically defined by the bit rate, because',
  'compression efficiency of encoder be typically define by the bit rate , because '),
 ('The Byzantine Empire ruled the northern shores of the Sahara from the 5th to the',
  'the Byzantine Empire rule the northern shore of the Sahara from the 5th to the 7'),
 ("The mosaics of St. Peter's often show lively Baroque compositions based on desig",
  "the mosaic of St. Peter 's often show lively Baroque composition base on design "),
 ('After 1870, the new railroads across the Plains brought hunters who killed off a',
  'after 1870 , the new railroad across the Plains bring hunter who kill off almost'),
 ('Numerous live performance events dedicated to house music were founded during th',
  'numerous live performance event dedicate to ho

### Vectorize corpus by TF-IDF

In [11]:
%%time
if not os.path.isfile(VECTOR_CACHE):
    vectorizer = TfidfVectorizer(
        stop_words='english', min_df=5, max_df=.5, ngram_range=(1,3))
    tfidf = vectorizer.fit_transform(lemmas)
    with open(VECTOR_CACHE, "wb") as f:
        pickle.dump(dict(vectorizer=vectorizer, tfidf=tfidf), f)
else:
    with open(VECTOR_CACHE, "rb") as f:
        cache = pickle.load(f)
        tfidf = cache["tfidf"]
        vectorizer = cache["vectorizer"]
len(vectorizer.vocabulary_)

CPU times: user 615 ms, sys: 40.5 ms, total: 655 ms
Wall time: 650 ms


37931
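To see what these vectorizer settings do, here is a minimal sketch on a three-sentence toy corpus. The sentences and the `toy_*` names are made up for illustration, and `min_df`/`max_df` are left at their defaults because a corpus this small cannot be filtered by document frequency.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, already lower-cased and lemmatized by hand.
toy_docs = [
    "the gregorian calendar replace the julian calendar",
    "the julian calendar be use in rome",
    "rome be the capital of italy",
]

# Same stop-word list and n-gram range as the cell above.
toy_vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 3))
toy_tfidf = toy_vectorizer.fit_transform(toy_docs)

# N-grams are built after stop words are removed, so "julian calendar"
# becomes a feature of its own next to "julian" and "calendar".
print("julian calendar" in toy_vectorizer.vocabulary_)

# Rows are L2-normalized by default (norm="l2"), so the sparse dot product
# of two rows is their cosine similarity.
row_norms = np.sqrt(np.asarray(toy_tfidf.multiply(toy_tfidf).sum(axis=1))).ravel()
similarity = (toy_tfidf @ toy_tfidf.T).toarray()
print(np.round(similarity, 2))
```

Documents that share terms after stop-word removal get a positive similarity, and documents with no terms in common score exactly zero.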

### Fetch contexts related to question

In [12]:
question = "When did the last country to adopt the Gregorian calendar start using it?"
query = vectorizer.transform([lemmatize(question)])
(query > 0).sum(), vectorizer.inverse_transform(query)

(9, [array(['adopt', 'calendar', 'country', 'gregorian', 'gregorian calendar',
         'start', 'start use', 'use', 'use pron'], dtype='<U42')])

In [13]:
%%time
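# Sparse matrix product: one similarity score per paragraph.
scores = (tfidf @ query.T).toarray()
# Paragraph indices ordered from highest to lowest score.
results = np.flip(np.argsort(scores, axis=0))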
[paragraphs[i] for i in results[:3, 0]]

CPU times: user 3.5 ms, sys: 492 µs, total: 3.99 ms
Wall time: 3.43 ms


['"Old Style" (OS) and "New Style" (NS) are sometimes added to dates to identify which system is used in the British Empire and other countries that did not immediately change. Because the Calendar Act of 1750 altered the start of the year, and also aligned the British calendar with the Gregorian calendar, there is some confusion as to what these terms mean. They can indicate that the start of the Julian year has been adjusted to start on 1 January (NS) even though contemporary documents use a different start of year (OS); or to indicate that a date conforms to the Julian calendar (OS), formerly in use in many countries, rather than the Gregorian calendar (NS).',
 'During the period between 1582, when the first countries adopted the Gregorian calendar, and 1923, when the last European country adopted it, it was often necessary to indicate the date of some event in both the Julian calendar and in the Gregorian calendar, for example, "10/21 February 1750/51", where the dual year accounts
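The scoring and ranking steps above can be wrapped in one function for reuse. This is a sketch under assumptions rather than the notebook's own code: `top_k_contexts` and the toy data are hypothetical, and the function expects the question to be lemmatized already, the same way the corpus was.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def top_k_contexts(question, vectorizer, tfidf, contexts, k=3, thresh=0.01):
    """Return up to k (score, context) pairs scoring above thresh, best first."""
    query = vectorizer.transform([question])
    # tfidf is (n_docs, n_terms) and query is (1, n_terms), so the product
    # holds one similarity score per document.
    scores = (tfidf @ query.T).toarray().ravel()
    order = np.argsort(scores)[::-1][:k]
    return [(float(scores[i]), contexts[i]) for i in order if scores[i] > thresh]


# Hypothetical toy corpus, already lower-cased and lemmatized by hand.
toy_contexts = [
    "the gregorian calendar replace the julian calendar",
    "rome be the capital of italy",
    "the julian calendar be use in rome",
]
toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_tfidf = toy_vectorizer.fit_transform(toy_contexts)

hits = top_k_contexts("gregorian calendar", toy_vectorizer, toy_tfidf, toy_contexts, k=3)
for score, context in hits:
    print(f"{score:.3f}  {context}")
```

The threshold drops documents that share no terms with the question, so fewer than `k` results can come back.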

### Extract answers from contexts

In [14]:
qapipe = pipeline('question-answering',
                  model='distilbert-base-uncased-distilled-squad',
                  tokenizer='bert-base-uncased')

In [15]:
%%time
THRESH = 0.01
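# Keep the ten best paragraphs, then drop any that barely match the query.
# scores has shape (n, 1), so index the single column to get a scalar.
candidate_idxs = [ (i, scores[i, 0]) for i in results[0:10, 0] ]
contexts = [ (paragraphs[i],s)
    for (i,s) in candidate_idxs if s > THRESH ]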

question_df = pd.DataFrame.from_records([ {
    'question': question,
    'context':  ctx
} for (ctx,s) in contexts ])

question_df.to_feather("cache/question_context.feather")

CPU times: user 0 ns, sys: 2.79 ms, total: 2.79 ms
Wall time: 2.14 ms


In [16]:
%%time
preds = qapipe(question_df.to_dict(orient="records"))
answer_df = pd.DataFrame.from_records(preds)
answer_df["context"] = question_df["context"]
answer_df = answer_df.sort_values(by="score", ascending=False)
answer_df.head()

Converting examples to features: 100%|██████████| 10/10 [00:00<00:00, 154.16it/s]


CPU times: user 190 ms, sys: 80.6 ms, total: 270 ms
Wall time: 296 ms


|   | score | start | end | answer | context |
|---|-------|-------|-----|--------|---------|
| 1 | 0.973023 | 93 | 98 | 1923, | During the period between 1582, when the first... |
| 8 | 0.920178 | 473 | 489 | 15 October 1582. | Philip II of Spain decreed the change from the... |
| 3 | 0.809254 | 324 | 334 | 1 January, | Extending the Gregorian calendar backwards to ... |
| 0 | 0.641908 | 441 | 450 | 1 January | "Old Style" (OS) and "New Style" (NS) are some... |
| 2 | 0.547817 | 566 | 590 | Friday, 15 October 1582, | In conjunction with the system of months there... |
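Several of the extracted answers above carry trailing punctuation, and two of them differ only by a comma. One possible clean-up step is sketched below. The helpers `clean_answer` and `best_unique_answers` are hypothetical and not part of the notebook or the Transformers API; the sample records reuse the scores and answers from the output above.

```python
import string


def clean_answer(text):
    """Strip surrounding whitespace and punctuation from an extracted span."""
    return text.strip(string.whitespace + string.punctuation)


def best_unique_answers(records):
    """Keep the highest-scoring record for each cleaned answer string."""
    best = {}
    for rec in sorted(records, key=lambda r: r["score"], reverse=True):
        key = clean_answer(rec["answer"])
        # setdefault keeps the first (highest-scoring) record per answer.
        best.setdefault(key, {**rec, "answer": key})
    return list(best.values())


sample = [
    {"score": 0.973023, "answer": "1923,"},
    {"score": 0.920178, "answer": "15 October 1582."},
    {"score": 0.809254, "answer": "1 January,"},
    {"score": 0.641908, "answer": "1 January"},
    {"score": 0.547817, "answer": "Friday, 15 October 1582,"},
]

cleaned = best_unique_answers(sample)
for rec in cleaned:
    print(f"{rec['score']:.3f}  {rec['answer']}")
```

Stripping all punctuation is a blunt choice: it would also remove a legitimate closing parenthesis or quotation mark, so a real system may want a narrower character set.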


In [17]:
answer_df.head().to_dict(orient="records")

[{'score': 0.973023084214109,
  'start': 93,
  'end': 98,
  'answer': '1923,',
  'context': 'During the period between 1582, when the first countries adopted the Gregorian calendar, and 1923, when the last European country adopted it, it was often necessary to indicate the date of some event in both the Julian calendar and in the Gregorian calendar, for example, "10/21 February 1750/51", where the dual year accounts for some countries already beginning their numbered year on 1 January while others were still using some other date. Even before 1582, the year sometimes had to be double dated because of the different beginnings of the year in various countries. Woolley, writing in his biography of John Dee (1527–1608/9), notes that immediately after 1582 English letter writers "customarily" used "two dates" on their letters, one OS and one NS.'},
 {'score': 0.9201781041192767,
  'start': 473,
  'end': 489,
  'answer': '15 October 1582.',
  'context': 'Philip II of Spain decreed the change