# Question Answering with PyTorch Transformers: Part 1

Article for this notebook: https://medium.com/@patonw/question-answering-with-pytorch-transformers-part-1-8736196bf20e

> In the first part of this series we’ll look at the problem of question answering and the SQuAD datasets. Then we’ll see how the Transformers’ pipeline API allows us to easily use pre-trained models to answer questions.
>
> In later parts we’ll explore how to build systems around it that are generally useful to the average person. There have been many academic articles written on the topics we’ll explore. However, I want to focus on the engineering aspects and demonstrate how simple it is to build useful systems by leveraging a handful of high quality open-source libraries.

Skip ahead to Parts 2 and 3 for the meat.

This notebook is just an exploration of the SQuAD 2.0 dataset.

In [1]:
# Prepare to run on Paperspace. On your own machine, you should manage these dependencies with pipenv or conda.
# Run init_container from a Terminal window for debugging
# I'd rather not have the output filling up the screen here.
%run init_container.py

In [2]:
from constants import SQUAD_TRAIN  # path to the SQuAD training JSON file

In [3]:
import os
import random
import pandas as pd
import json

In [4]:
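# The file contains non-ASCII text (e.g. 'Beyoncé'), so read it explicitly as UTF-8.
with open(SQUAD_TRAIN, encoding="utf-8") as f: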
    doc = json.load(f)
doc.keys(), type(doc["data"]), len(doc["data"])

(dict_keys(['version', 'data']), list, 442)

In [5]:
doc["data"][0].keys(), doc["data"][0]["title"]

(dict_keys(['title', 'paragraphs']), 'Beyoncé')

In [6]:
len(doc["data"][0]["paragraphs"]), doc["data"][0]["paragraphs"][0].keys(), len(doc["data"][0]["paragraphs"][0]["qas"])

(66, dict_keys(['qas', 'context']), 15)

In [7]:
doc["data"][0]["paragraphs"][0]["context"]

'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".'

In [8]:
doc["data"][0]["paragraphs"][0]["qas"][0]

{'question': 'When did Beyonce start becoming popular?',
 'id': '56be85543aeaaa14008c9063',
 'answers': [{'text': 'in the late 1990s', 'answer_start': 269}],
 'is_impossible': False}
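
The `answer_start` field is a character offset into the paragraph's `context`, so slicing the context at that offset should reproduce the answer `text`. Here is a minimal sketch of that check, using a made-up record rather than the real file. The helper name `answer_span` and the toy strings are illustrative assumptions, not part of the dataset or the notebook.

```python
def answer_span(context, answer):
    """Return (start, end) character offsets of a SQuAD-style answer.

    Raises ValueError if the text at the recorded offset does not match.
    """
    start = answer["answer_start"]
    end = start + len(answer["text"])
    if context[start:end] != answer["text"]:
        raise ValueError(f"offset {start} does not point at {answer['text']!r}")
    return start, end


# Toy record shaped like one entry of qa["answers"] (hypothetical data).
toy_context = "She rose to fame in the late 1990s as a lead singer."
toy_answer = {
    "text": "in the late 1990s",
    "answer_start": toy_context.find("in the late 1990s"),
}

start, end = answer_span(toy_context, toy_answer)
print(toy_context[start:end])
```

With the real data loaded, the same helper could be called on `doc["data"][0]["paragraphs"][0]["context"]` and the first entry of a question's `answers` list.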

In [9]:
paragraphs = []  # every context paragraph, across all topics
questions = []   # (question, context) pairs that can be answered from the context
impossible = []  # (question, context) pairs marked as unanswerable
for topic in doc["data"]:
    for pgraph in topic["paragraphs"]:
        paragraphs.append(pgraph["context"])
        for qa in pgraph["qas"]:
            if not qa["is_impossible"]:
                questions.append((qa["question"], pgraph["context"]))
            else:
                impossible.append((qa["question"], pgraph["context"]))

len(paragraphs), len(questions), len(impossible)

(19035, 86821, 43498)
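
The nested layout explored above (topics, each with paragraphs, each with question records) can also be flattened into one row per question, which is easier to filter or load into a table. The sketch below assumes only the keys seen in the earlier outputs. `flatten_squad` and the toy document are hypothetical, for illustration only.

```python
def flatten_squad(doc):
    """Flatten a SQuAD-style document into one dict per question."""
    rows = []
    for topic in doc["data"]:
        for pgraph in topic["paragraphs"]:
            for qa in pgraph["qas"]:
                rows.append({
                    "title": topic["title"],
                    "question": qa["question"],
                    "context": pgraph["context"],
                    "is_impossible": qa["is_impossible"],
                })
    return rows


# Tiny stand-in for the real file (made-up content, same key layout).
toy_doc = {
    "version": "toy",
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "The sky is blue.",
            "qas": [
                {"question": "What color is the sky?", "id": "q1",
                 "answers": [{"text": "blue", "answer_start": 11}],
                 "is_impossible": False},
                {"question": "What color is the sea?", "id": "q2",
                 "answers": [], "is_impossible": True},
            ],
        }],
    }],
}

rows = flatten_squad(toy_doc)
answerable = [r for r in rows if not r["is_impossible"]]
print(len(rows), len(answerable))
```

With the real file loaded as `doc`, passing the result to `pd.DataFrame(...)` should give a table with one row per question. Going by the counts above, that would be 86821 + 43498 = 130319 rows.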

In [10]:
random.sample(paragraphs, 2)

["The battle in the Marshall Islands caused irreparable damage, especially on Japanese bases. During the American bombing, the islands' population suffered from lack of food and various injuries. U.S. attacks started in mid-1943, and caused half the Japanese garrison of 5,100 people in the atoll Mili to die from hunger by August 1945.",
 'The amount of crossover between the AC chart and the Hot 100 has varied based on how much the passing pop music trends of the times appealed to adult listeners. Not many disco or new wave songs were particularly successful on the AC chart during the late 1970s and early 1980s, and much of the hip-hop and harder rock music featured on CHR formats later in the decade would have been unacceptable on AC radio.']
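
Note that `random.sample` draws a different sample on every run, so the paragraphs shown above will not reappear when the notebook is re-executed. If reproducible samples are useful, one option is a dedicated seeded generator. The seed value and the stand-in list below are arbitrary assumptions.

```python
import random

# Stand-in for the `paragraphs` list built earlier (hypothetical data).
toy_paragraphs = [f"paragraph {i}" for i in range(100)]

# A private generator avoids touching the global random state.
rng = random.Random(42)
first = rng.sample(toy_paragraphs, 2)

# Re-creating the generator with the same seed repeats the draw.
rng = random.Random(42)
second = rng.sample(toy_paragraphs, 2)

print(first == second)
```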

In [11]:
random.sample(questions, 5)

[('What are the intercellular junctions unique to animal cells?',
  'All animals have eukaryotic cells, surrounded by a characteristic extracellular matrix composed of collagen and elastic glycoproteins. This may be calcified to form structures like shells, bones, and spicules. During development, it forms a relatively flexible framework upon which cells can move about and be reorganized, making complex structures possible. In contrast, other multicellular organisms, like plants and fungi, have cells held in place by cell walls, and so develop by progressive growth. Also, unique to animal cells are the following intercellular junctions: tight junctions, gap junctions, and desmosomes.'),
 ('What did the biggest star on the flag of China in 1949 represent?',
  'After the Communist Party of China took power in 1949, the flag of China became a red flag with a large star symbolizing the Communist Party, and smaller stars symbolizing workers, peasants, the urban middle class and rural middle

In [12]:
random.sample(impossible, 5)

[(' Who rejected the HDI?',
  'The Human Development Index (HDI) is a composite statistic of life expectancy, education, and income per capita indicators, which are used to rank countries into four tiers of human development. A country scores higher HDI when the life expectancy at birth is longer, the education period is longer, and the income per capita is higher. The HDI was developed by the Pakistani economist Mahbub ul Haq, often framed in terms of whether people are able to "be" and "do" desirable things in their life, and was published by the United Nations Development Programme.'),
 ("What quality hasn't LEDs been used as?",
  'LEDs have also been used as a medium-quality voltage reference in electronic circuits. The forward voltage drop (e.g. about 1.7 V for a normal red LED) can be used instead of a Zener diode in low-voltage regulators. Red LEDs have the flattest I/V curve above the knee. Nitride-based LEDs have a fairly steep I/V curve and are useless for this purpose. Altho