# SQuAD 2.0 Dataset (Python 3.10)

Python 3.10 introduced **Structural Pattern Matching**, which is similar to the switch-case statements found in other languages (*in Python these are better described as **match-case** statements*) and allows us to parse our SQuAD data in a cleaner fashion. If you have access to Python 3.10 or later, try this method out; otherwise stick with the previous approach.

We will load in our data just as before.

In [1]:
import os
import json

squad_dir = '../../data/squad'

with open(os.path.join(squad_dir, 'train-v2.0.json'), 'rb') as f:
    squad = json.load(f)

The match-case statement logic looks like this:

![match-case logic](../../assets/images/match_case_logic.png)

Let's try applying it to a simple example first so that we can fully grasp the logic and syntax.

In [2]:
# we will be testing the value of our http_code
http_code = '418'

# we begin the match-case statement with the match keyword and the 'subject' of our statement
match http_code:
    # now we write multiple cases where if http_code matches the given pattern, we will execute the code
    case '200':
        print('OK')
    case '404':
        print('Not found')
    case '418':
        print("I'm a teapot")
    case _:
        print('HTTP code not recognized')

I'm a teapot


Because our `case '418'` pattern matches the subject `http_code`, the `print("I'm a teapot")` block is executed. In this scenario the code behaves very much like an if-elif-else statement. We even have our *else* equivalent in the `case _` pattern at the end, which acts as a *catch-all*:

In [3]:
http_code = "I'm not an HTTP code"

match http_code:
    case '200':
        print('OK')
    case '404':
        print('Not found')
    case '418':
        print("I'm a teapot")
    case _:
        print('HTTP code not recognized')

HTTP code not recognized


Great, so now we have a grasp of these new match-case statements. However, match-case does not have to check for *exact* matches, and for our use case we don't want it to. Instead, we will check whether the dictionary structure contains the values we need (e.g. does it contain a list under *'plausible_answers'*?).

It's also worth noting how empty lists are handled. Pattern matching checks *structure* rather than truthiness: a pattern such as `[{'text': answer}]` only matches a list containing exactly one dictionary with a *'text'* key. So in the case where *'answers'* exists but just contains an empty list, the pattern does not match, the respective code block is not executed, and Python moves on to the next case.

So, let's write it out.

In [4]:
# initialize list where we will place all of our data
new_squad = []

# we need to loop through groups -> paragraphs -> qa_pairs
for group in squad['data']:
    for paragraph in group['paragraphs']:
        # we pull out the context from here
        context = paragraph['context']
        for qa_pair in paragraph['qas']:
            # we pull out the question
            question = qa_pair['question']
            # now the NEW match-case logic to check if we have 'answers' or 'plausible_answers'
            match qa_pair:
                case {'answers': [{'text': answer}]}:
                    # this pattern matches IF the qa_pair dictionary contains an 'answers' key
                    # which in turn holds a list containing exactly one dictionary with a 'text' key
                    # and the value mapped to this 'text' key is assigned to the answer variable
                    pass  # because the case pattern assigns 'answer' for us, we pass
                case {'plausible_answers': [{'text': answer}]}:
                    # we perform same check but for 'plausible_answers'
                    pass
                case _:
                    # this is our catchall, we will set answer to None
                    answer = None
            # append dictionary sample to parsed squad
            new_squad.append({
                'question': question,
                'answer': answer,
                'context': context
            })

In [5]:
new_squad[:2], new_squad[-2:]

([{'question': 'When did Beyonce start becoming popular?',
   'answer': 'in the late 1990s',
   'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".'},
  {'question': 'What areas did Beyonce compete in when she was growing up?',
   'answer': 'singing and dancing',
   'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born

Perfect, we have exactly the same output as our **if-else** version in the previous notebook produced. We can save our parsed data to a JSON file as before:

In [14]:
with open(os.path.join(squad_dir, 'train.json'), 'w') as f:
    json.dump(new_squad, f)

### Process the dev dataset

In [15]:
with open(os.path.join(squad_dir, 'dev-v2.0.json'), 'rb') as f:
    squad = json.load(f)

In [16]:
squad["data"][0]["paragraphs"][0]

{'qas': [{'question': 'In what country is Normandy located?',
   'id': '56ddde6b9a695914005b9628',
   'answers': [{'text': 'France', 'answer_start': 159},
    {'text': 'France', 'answer_start': 159},
    {'text': 'France', 'answer_start': 159},
    {'text': 'France', 'answer_start': 159}],
   'is_impossible': False},
  {'question': 'When were the Normans in Normandy?',
   'id': '56ddde6b9a695914005b9629',
   'answers': [{'text': '10th and 11th centuries', 'answer_start': 94},
    {'text': 'in the 10th and 11th centuries', 'answer_start': 87},
    {'text': '10th and 11th centuries', 'answer_start': 94},
    {'text': '10th and 11th centuries', 'answer_start': 94}],
   'is_impossible': False},
  {'question': 'From which countries did the Norse originate?',
   'id': '56ddde6b9a695914005b962a',
   'answers': [{'text': 'Denmark, Iceland and Norway', 'answer_start': 256},
    {'text': 'Denmark, Iceland and Norway', 'answer_start': 256},
    {'text': 'Denmark, Iceland and Norway', 'answer_star

In [36]:
new_squad_dev = []

for group in squad["data"]:
    for paragraph in group['paragraphs']:
        # we pull out the context from here
        context = paragraph['context']
        for qa_pair in paragraph['qas']:
            # we pull out the question
            question = qa_pair['question']
            # match-case logic: take the first non-empty list of answers we find
            # (equivalent to if/elif/else checks on len(...) > 0 for each key)
            match qa_pair:
                case {'answers': [_, *_] as answer_list}:
                    # 'answers' holds a list with at least one item
                    pass
                case {'plausible_answers': [_, *_] as answer_list}:
                    # otherwise fall back to a non-empty 'plausible_answers' list
                    pass
                case _:
                    # no answers at all, so no samples are created for this question
                    answer_list = []
            answer_list = [item['text'] for item in answer_list]
            answer_list = list(set(answer_list))
            for answer in answer_list:
                # append dictionary sample to parsed squad
                new_squad_dev.append({
                    'question': question,
                    'answer': answer,
                    'context': context
                })
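
One detail about the deduplication step: `list(set(...))` removes duplicates but does not guarantee any particular order, so the order of samples for a question can differ between runs. If a stable order matters, one possible alternative (a sketch, not what the cell above does) is `dict.fromkeys`, which keeps the first occurrence of each item in its original position. The `answers` list here is a hypothetical example.

```python
# hypothetical list of answer texts containing duplicates
answers = [
    '10th and 11th centuries',
    'in the 10th and 11th centuries',
    '10th and 11th centuries',
    '10th and 11th centuries',
]

# dict keys are unique and preserve insertion order, so this dedupes in place
unique_answers = list(dict.fromkeys(answers))

print(unique_answers)
# ['10th and 11th centuries', 'in the 10th and 11th centuries']
```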

In [34]:
with open(os.path.join(squad_dir, 'dev.json'), 'w') as f:
    json.dump(new_squad_dev, f)