# Validation Parse

We previously downloaded the SQuAD validation dataset to *../../data/squad/dev-v2.0.json*. Here we apply the same parsing logic to the validation data that we applied to our training data.

In [1]:
squad_dir = '../../data/squad'

Let's open up the validation data and confirm that it shares the same format as the training data.

In [2]:
import os
import json

with open(os.path.join(squad_dir, 'dev-v2.0.json'), 'rb') as f:
    squad = json.load(f)

As before, the JSON structure has a top-level `'data'` key holding a list of *groups*, where each group is a topic. We can take a look at a few examples from the start and end of the dataset.

In [3]:
squad['data'][0]['paragraphs'][0]

{'qas': [{'question': 'In what country is Normandy located?',
   'id': '56ddde6b9a695914005b9628',
   'answers': [{'text': 'France', 'answer_start': 159},
    {'text': 'France', 'answer_start': 159},
    {'text': 'France', 'answer_start': 159},
    {'text': 'France', 'answer_start': 159}],
   'is_impossible': False},
  {'question': 'When were the Normans in Normandy?',
   'id': '56ddde6b9a695914005b9629',
   'answers': [{'text': '10th and 11th centuries', 'answer_start': 94},
    {'text': 'in the 10th and 11th centuries', 'answer_start': 87},
    {'text': '10th and 11th centuries', 'answer_start': 94},
    {'text': '10th and 11th centuries', 'answer_start': 94}],
   'is_impossible': False},
  {'question': 'From which countries did the Norse originate?',
   'id': '56ddde6b9a695914005b962a',
   'answers': [{'text': 'Denmark, Iceland and Norway', 'answer_start': 256},
    {'text': 'Denmark, Iceland and Norway', 'answer_start': 256},
    {'text': 'Denmark, Iceland and Norway', 'answer_star

In [4]:
squad['data'][-1]['paragraphs'][0]

{'qas': [{'question': 'What concept did philosophers in antiquity use to study simple machines?',
   'id': '573735e8c3c5551400e51e71',
   'answers': [{'text': 'force', 'answer_start': 46},
    {'text': 'force', 'answer_start': 46},
    {'text': 'the concept of force', 'answer_start': 31},
    {'text': 'the concept of force', 'answer_start': 31},
    {'text': 'force', 'answer_start': 46},
    {'text': 'force', 'answer_start': 46}],
   'is_impossible': False},
  {'question': 'What was the belief that maintaining motion required force?',
   'id': '573735e8c3c5551400e51e72',
   'answers': [{'text': 'fundamental error', 'answer_start': 387},
    {'text': 'A fundamental error', 'answer_start': 385},
    {'text': 'A fundamental error', 'answer_start': 385},
    {'text': 'A fundamental error', 'answer_start': 385},
    {'text': 'A fundamental error', 'answer_start': 385},
    {'text': 'A fundamental error', 'answer_start': 385}],
   'is_impossible': False},
  {'question': 'Who had mathmatical 

We can see that both `'answers'` and `'plausible_answers'` fields appear in this dataset too. However, this time each question has multiple answers, many of which are duplicates, so we'll need to adjust our logic to deal with that.
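To gauge how common these duplicates are, the nested structure can be walked and the questions with repeated answer texts counted. The following is a minimal sketch that runs on a small hypothetical `sample` dictionary mirroring the SQuAD layout (the sample data and the `count_questions_with_duplicates` helper are illustrative assumptions, not part of the original notebook):

```python
# Toy sample mirroring the SQuAD v2.0 nesting (hypothetical data, not from the real file)
sample = {
    'data': [{
        'title': 'Example',
        'paragraphs': [{
            'context': 'Normandy is a region in France.',
            'qas': [
                {'question': 'In what country is Normandy located?',
                 'id': 'q1',
                 'answers': [{'text': 'France', 'answer_start': 24},
                             {'text': 'France', 'answer_start': 24}],
                 'is_impossible': False},
                {'question': 'Who founded Normandy?',
                 'id': 'q2',
                 'answers': [],
                 'plausible_answers': [{'text': 'France', 'answer_start': 24}],
                 'is_impossible': True},
            ],
        }],
    }]
}

def count_questions_with_duplicates(squad_dict):
    """Count questions whose answer list contains repeated answer texts."""
    total, with_duplicates = 0, 0
    for group in squad_dict['data']:
        for paragraph in group['paragraphs']:
            for qa_pair in paragraph['qas']:
                total += 1
                # fall back to 'plausible_answers' when 'answers' is missing or empty
                answers = qa_pair.get('answers') or qa_pair.get('plausible_answers', [])
                texts = [item['text'] for item in answers]
                # a set drops repeats, so a shorter set means duplicates exist
                if len(set(texts)) < len(texts):
                    with_duplicates += 1
    return total, with_duplicates

total, with_duplicates = count_questions_with_duplicates(sample)
print(f'{with_duplicates} of {total} questions have duplicate answers')
```

Passing the loaded `squad` dictionary instead of `sample` should give the equivalent counts for the real validation file.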

We'll produce the same format that we built previously: a list of dictionaries, where each dictionary contains a single `question`, `answer`, and `context`.

If we find duplicates in the *answers* lists, we should remove them.
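Converting a list to a `set` and back removes duplicates but does not guarantee the order of the result. If the original order matters, one alternative is `dict.fromkeys`, which keeps the first occurrence of each item. This is a small sketch of that alternative, using a hypothetical `unique_in_order` helper and the answer texts shown in the output above:

```python
def unique_in_order(texts):
    """Remove duplicates while keeping the first occurrence of each item."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(texts))

answers = ['10th and 11th centuries', 'in the 10th and 11th centuries',
           '10th and 11th centuries', '10th and 11th centuries']
print(unique_in_order(answers))
# → ['10th and 11th centuries', 'in the 10th and 11th centuries']
```

Either approach yields the same set of unique answers; only the ordering of the resulting records may differ.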

In [5]:
# initialize list where we will place all of our data
new_squad = []

# we need to loop through groups -> paragraphs -> qa_pairs
for group in squad['data']:
    for paragraph in group['paragraphs']:
        # we pull out the context from here
        context = paragraph['context']
        for qa_pair in paragraph['qas']:
            # we pull out the question
            question = qa_pair['question']
            # now the logic to check if we have 'answers' or 'plausible_answers'
            if 'answers' in qa_pair.keys() and len(qa_pair['answers']) > 0:
                answer_list = qa_pair['answers']
            elif 'plausible_answers' in qa_pair.keys() and len(qa_pair['plausible_answers']) > 0:
                answer_list = qa_pair['plausible_answers']
            else:
                # this shouldn't happen, but just in case we set answer_list = []
                answer_list = []
            # we want to pull out the 'text' of each answer in our list of answers
            answer_list = [item['text'] for item in answer_list]
            # we can remove duplicate answers by converting our list to a set, and then back to a list
            # (note that a set does not preserve the original order of the answers)
            answer_list = list(set(answer_list))
            # we iterate through each unique answer in the answer_list
            for answer in answer_list:
                # append dictionary sample to parsed squad
                new_squad.append({
                    'question': question,
                    'answer': answer,
                    'context': context
                })

In [6]:
new_squad[:3], new_squad[-2:]

([{'question': 'In what country is Normandy located?',
   'answer': 'France',
   'context': 'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.'},
  {'question': 'When were the Normans in Normandy?',
   'answer': 'in the 10th and 11th centuries',
   'context': 'The Normans (Norman: Nourmands; French: Norman

At indices *1* and *2* we can see our new logic at work: the question *'When were the Normans in Normandy?'* now appears once for each unique answer in its `'answers'`/`'plausible_answers'` list. We can save our parsed data to a JSON file like so:

In [7]:
with open(os.path.join(squad_dir, 'dev.json'), 'w') as f:
    json.dump(new_squad, f)
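A quick way to confirm that a save like this worked is to load the file back and compare it with the in-memory list. The sketch below writes to a temporary directory rather than `squad_dir`, and uses a single hypothetical record with a shortened context, so it can run on its own:

```python
import json
import os
import tempfile

# hypothetical record in the same question/answer/context format
records = [{'question': 'In what country is Normandy located?',
            'answer': 'France',
            'context': 'a region in France'}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'dev.json')
    # write the records to disk as JSON
    with open(path, 'w') as f:
        json.dump(records, f)
    # read them back to check the round trip
    with open(path, 'r') as f:
        reloaded = json.load(f)

print(reloaded == records)
```

The same comparison against `new_squad` could be applied to the real *dev.json* file.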