# SQuAD 2.0 Dataset

SQuAD (the Stanford Question Answering Dataset) is a hugely popular dataset containing question and answer pairs built from Wikipedia articles, covering topics ranging from Beyonce to physics. As one of the most comprehensive Q&A datasets available, it's only natural that we will be making use of it. So let's explore it.

First, we'll need to download the data. There are two JSON files that we are interested in, train and dev, which we can download from `http`. Here we will be storing the SQuAD data in the *../../data/squad* directory, so we must first check whether it already exists and, if not, create it.

In [1]:
import os

squad_dir = '../../data/squad'

# create the directory (and any missing parent directories) if it doesn't exist yet
os.makedirs(squad_dir, exist_ok=True)

Now let's define our SQuAD URL and files.

In [2]:
url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/'
files = ['train-v2.0.json', 'dev-v2.0.json']

And now we can download and write both files to our *../../data/squad* directory.

In [3]:
import requests

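for file in files:
    # stream the response so the body is fetched in chunks rather than all at once
    res = requests.get(url+file, stream=True)
    # stop here if the download failed (e.g. a 404)
    res.raise_for_status()
    # write to file in chunks
    with open(os.path.join(squad_dir, file), 'wb') as f:
        for chunk in res.iter_content(chunk_size=8192):
            f.write(chunk)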

Now that we have both files stored locally, let's open up the training data and see what we have.

In [4]:
import json

with open(os.path.join(squad_dir, 'train-v2.0.json'), 'rb') as f:
    squad = json.load(f)

The JSON structure has a top-level `'data'` key which holds a list of *groups*, where each group is a topic, such as *Beyonce*, *Chopin*, or *Matter*. We can take a look at the first paragraph of the first and last groups respectively.

In [None]:
squad['data'][0]['paragraphs'][0]

{'qas': [{'question': 'When did Beyonce start becoming popular?',
   'id': '56be85543aeaaa14008c9063',
   'answers': [{'text': 'in the late 1990s', 'answer_start': 269}],
   'is_impossible': False},
  {'question': 'What areas did Beyonce compete in when she was growing up?',
   'id': '56be85543aeaaa14008c9065',
   'answers': [{'text': 'singing and dancing', 'answer_start': 207}],
   'is_impossible': False},
  {'question': "When did Beyonce leave Destiny's Child and become a solo singer?",
   'id': '56be85543aeaaa14008c9066',
   'answers': [{'text': '2003', 'answer_start': 526}],
   'is_impossible': False},
  {'question': 'In what city and state did Beyonce  grow up? ',
   'id': '56bf6b0f3aeaaa14008c9601',
   'answers': [{'text': 'Houston, Texas', 'answer_start': 166}],
   'is_impossible': False},
  {'question': 'In which decade did Beyonce become famous?',
   'id': '56bf6b0f3aeaaa14008c9602',
   'answers': [{'text': 'late 1990s', 'answer_start': 276}],
   'is_impossible': False},
  {'q

In [9]:
squad['data'][-1]['paragraphs'][0]

{'qas': [{'plausible_answers': [{'text': 'ordinary matter composed of atoms',
     'answer_start': 50}],
   'question': 'What did the term matter include after the 20th century?',
   'id': '5a7db48670df9f001a87505f',
   'answers': [],
   'is_impossible': True},
  {'plausible_answers': [{'text': 'matter', 'answer_start': 59}],
   'question': 'What are atoms composed of?',
   'id': '5a7db48670df9f001a875060',
   'answers': [],
   'is_impossible': True},
  {'plausible_answers': [{'text': 'light or sound', 'answer_start': 128}],
   'question': 'What are two examples of matter?',
   'id': '5a7db48670df9f001a875061',
   'answers': [],
   'is_impossible': True},
  {'plausible_answers': [{'text': "its (possibly massless) constituents' motion and interaction energies",
     'answer_start': 315}],
   'question': "What can an object's mass not come from?",
   'id': '5a7db48670df9f001a875062',
   'answers': [],
   'is_impossible': True},
  {'plausible_answers': [{'text': 'fundamental', 'answer_sta

### Processing SQuAD Training Data

If we compare the first entry on *Beyonce* and the last on *Matter*, we can see that the answers are sometimes stored under the `'answers'` key, and sometimes under the `'plausible_answers'` key. In the samples above, the latter appears where `'is_impossible'` is `True` and `'answers'` is empty. So when processing this data we will need some additional logic to deal with both cases.
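
To get a feel for how often each case occurs, we could count question-answer pairs by their `'is_impossible'` flag. The following is a minimal sketch: `count_answerable` is a hypothetical helper (not part of the notebook), and it runs on a small hand-made sample that mirrors the structure shown above rather than on the real file.

```python
from collections import Counter

def count_answerable(data):
    """Count QA pairs by their 'is_impossible' flag (hypothetical helper)."""
    counts = Counter()
    for group in data:
        for paragraph in group['paragraphs']:
            for qa_pair in paragraph['qas']:
                # assumption: a missing flag is treated as answerable
                counts[bool(qa_pair.get('is_impossible', False))] += 1
    return counts

# hand-made sample mirroring the structure above, contents are illustrative only
sample = [{
    'paragraphs': [{
        'context': 'Some paragraph text.',
        'qas': [
            {'question': 'An answerable question?',
             'answers': [{'text': 'paragraph', 'answer_start': 5}],
             'is_impossible': False},
            {'question': 'An unanswerable question?',
             'plausible_answers': [{'text': 'text', 'answer_start': 15}],
             'answers': [],
             'is_impossible': True},
        ]
    }]
}]

counts = count_answerable(sample)
print(counts[False], 'answerable,', counts[True], 'impossible')
```

On the real data the same helper would be called as `count_answerable(squad['data'])`.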

Secondly, for all samples, we need to iterate through multiple levels. On the highest level we have *groups*, which is where our topics like *'Beyonce'* and *'Matter'* belong. At the next level we have paragraphs, and below those we have our question-answer pairs. The structure looks like this:

![train-v2.0.json structure](../../assets/images/squad_train_json_structure.png)

*(You can check out this structure by opening the **train-v2.0.json** in Jupyter)*
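
The same nesting can also be inspected in code by listing the keys found at each level. This is only a sketch on a toy dictionary: the values are made up, and the `'version'` and `'title'` keys are included on the assumption that they match the real file, which is worth confirming against **train-v2.0.json** itself.

```python
# toy stand-in for the loaded JSON, values are made up for illustration
toy = {
    'version': 'v2.0',
    'data': [{
        'title': 'Example_Topic',
        'paragraphs': [{
            'context': 'Some paragraph text.',
            'qas': [{
                'question': 'An example question?',
                'id': 'example-id',
                'answers': [{'text': 'paragraph', 'answer_start': 5}],
                'is_impossible': False
            }]
        }]
    }]
}

# step down one level at a time: group -> paragraph -> qa_pair
group = toy['data'][0]
paragraph = group['paragraphs'][0]
qa_pair = paragraph['qas'][0]

level_keys = {
    'group': sorted(group),
    'paragraph': sorted(paragraph),
    'qa_pair': sorted(qa_pair)
}

for level, keys in level_keys.items():
    print(level, '->', keys)
```

Swapping `toy` for `squad` would print the keys found in the first group, paragraph, and question-answer pair of the real training data.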

We'll work through parsing this data into a cleaner format that we will be using in later notebooks. We need to create a format that consists of a list of dictionaries where each dictionary contains a single `question`, `answer`, and `context`.

In [10]:
# initialize list where we will place all of our data
new_squad = []

# we need to loop through groups -> paragraphs -> qa_pairs
for group in squad['data']:
    for paragraph in group['paragraphs']:
        # we pull out the context from here
        context = paragraph['context']
        for qa_pair in paragraph['qas']:
            # we pull out the question
            question = qa_pair['question']
            # now the logic to check if we have 'answers' or 'plausible_answers'
            if qa_pair.get('answers'):
                answer = qa_pair['answers'][0]['text']
            elif qa_pair.get('plausible_answers'):
                answer = qa_pair['plausible_answers'][0]['text']
            else:
                # this shouldn't happen, but just in case we just set answer = None
                answer = None
            # append dictionary sample to parsed squad
            new_squad.append({
                'question': question,
                'answer': answer,
                'context': context
            })

In [11]:
new_squad[:2], new_squad[-2:]

([{'question': 'When did Beyonce start becoming popular?',
   'answer': 'in the late 1990s',
   'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".'},
  {'question': 'What areas did Beyonce compete in when she was growing up?',
   'answer': 'singing and dancing',
   'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born

We can save our parsed data to file as JSON like so:

In [12]:
with open(os.path.join(squad_dir, 'train.json'), 'w') as f:
    json.dump(new_squad, f)
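
As a sanity check, it can be worth reading a saved file back and confirming that nothing changed in the round trip. The sketch below illustrates the idea with two made-up records and a temporary directory, so it does not touch the real *train.json*; the record contents and the `records` name are assumptions for illustration only.

```python
import json
import os
import tempfile

# made-up records in the same question/answer/context format
records = [
    {'question': 'An example question?',
     'answer': 'an answer',
     'context': 'Some context.'},
    {'question': 'Another question?',
     'answer': None,  # None is written as JSON null and read back as None
     'context': 'More context.'}
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'train.json')
    # save in the same way as above
    with open(path, 'w') as f:
        json.dump(records, f)
    # read the file back in
    with open(path, 'r') as f:
        loaded = json.load(f)

print(loaded == records)
```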