> **EXPLORATORY ANALYSIS**
>
> ---
>
> This notebook investigates the structure of the dataset and the issues it presents. All the code shown here will then be included as utility functions in [this repository](https://github.com/giuluck/Gangster-SQuAD), so that the models in the following notebooks can be trained directly on the cleaned data.
> 
> This is the initial notebook, in which the methodology of our work is explained. The following notebooks are instead devoted to implementing, training, and evaluating different neural models (one model per notebook) to tackle the *Question-Answering Task*, so they all share a similar structure.

# **0. Retrieve Data**

The raw dataset is retrieved from our public GitHub repository.

In [None]:
!wget https://raw.githubusercontent.com/giuluck/Gangster-SQuAD/main/data/training_set.json

--2021-01-12 09:59:28--  https://raw.githubusercontent.com/giuluck/Gangster-SQuAD/main/data/training_set.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 30288272 (29M) [text/plain]
Saving to: ‘training_set.json’


2021-01-12 09:59:29 (53.4 MB/s) - ‘training_set.json’ saved [30288272/30288272]



# **1. Dataset Inspection**

The dataset has a nested structure of lists of dictionaries:

1. The first level has two fields: `data` and `version`. We discard the latter and iterate over the former, which is a list of **442 objects**.

2. Each object in `data` contains a `title` and a list of `paragraphs`. The `title` is not useful for the question-answering task itself, but we keep it because it will be used to split the dataset into train, validation, and test sets (we want paragraphs on the same subject to end up in the same set).

3. The `paragraphs` level contains a `context` (the paragraph itself) and a list `qas` of questions with their related answers. The `context` is the string that must be analysed in order to retrieve the answer, which is identified by a span in the text, so it must be kept.

4. The `qas` level contains a `question`, which must be kept in order to be paired with the `context` as the input of the model, an `id` of the question itself, which must be kept because the system is required to pair each answer to the specific id, and a list of `answers`.

5. The `answers` level actually contains one element only. This element is made up of the answer's `text` and an `answer_start` field, which identifies the character at which the answer starts. This value must then be converted so that it identifies the correct token rather than the character, and it must be paired with an `end_token` index, which can be retrieved from the length of the answer.
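
The character-to-token conversion described in point 5 can be sketched as follows. This is only a minimal illustration that assumes plain whitespace tokenization: the helper name `char_span_to_token_span` and the toy context are hypothetical, and the tokenizer used in the later notebooks may well differ.

```python
import re

def char_span_to_token_span(context, answer_start, answer_text):
    """Map a character-level answer span to (start_token, end_token).

    Assumes whitespace tokenization; both returned indices are inclusive.
    """
    answer_end = answer_start + len(answer_text)  # exclusive character offset
    start_token, end_token = None, None
    for i, match in enumerate(re.finditer(r'\S+', context)):
        # a token belongs to the answer if it overlaps the character span
        if match.start() < answer_end and match.end() > answer_start:
            if start_token is None:
                start_token = i
            end_token = i
    return start_token, end_token

# toy context, made up for illustration (not taken from the dataset)
context = 'It is a copper statue of Christ with arms upraised.'
answer = 'a copper statue of Christ'
start = context.index(answer)
print(char_span_to_token_span(context, start, answer))  # (2, 6)
```

Matching tokens by overlap, rather than by exact boundaries, means that an answer cut in the middle of a word is still mapped to the token containing it.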

In [None]:
import json

def explore_level(level, key, tab=0):
  # general information
  print(f'{"  " * tab}{tab}. {key}', end=' ')

  # type related information
  if type(level) == list:
    print(f"<class 'list', {len(level)} objects>:")
    level = level[0]
  elif type(level) == dict:
    print("<class 'dict'>:")
  else:
    print(f'--> {type(level)}')
    return
  
  # if not returned explores next level
  for key, value in level.items():
    explore_level(value, key, tab + 1)

with open('training_set.json', 'r') as f:
  dataset = json.load(f)

explore_level(dataset, 'dataset')

0. dataset <class 'dict'>:
  1. data <class 'list', 442 objects>:
    2. title --> <class 'str'>
    2. paragraphs <class 'list', 55 objects>:
      3. context --> <class 'str'>
      3. qas <class 'list', 5 objects>:
        4. answers <class 'list', 1 objects>:
          5. answer_start --> <class 'int'>
          5. text --> <class 'str'>
        4. question --> <class 'str'>
        4. id --> <class 'str'>
  1. version --> <class 'str'>


In [None]:
import pandas as pd

samples = []
for data in dataset['data']:
  title = data['title']
  for i, paragraph in enumerate(data['paragraphs']):
    context = paragraph['context']
    for j, qas in enumerate(paragraph['qas']):
      id = qas['id']
      question = qas['question']
      assert len(qas['answers']) == 1, f'Paragraph {i}, question {j} has {len(qas["answers"])} answers'
      answer = qas['answers'][0]['text']
      start = qas['answers'][0]['answer_start']
      samples.append([id, title, context, question, answer, start])

df = pd.DataFrame(samples, columns=['id', 'title', 'context', 'question', 'answer', 'start'])
df

Unnamed: 0,id,title,context,question,answer,start
0,5733be284776f41900661182,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",To whom did the Virgin Mary allegedly appear i...,Saint Bernadette Soubirous,515
1,5733be284776f4190066117f,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",What is in front of the Notre Dame Main Building?,a copper statue of Christ,188
2,5733be284776f41900661180,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",The Basilica of the Sacred heart at Notre Dame...,the Main Building,279
3,5733be284776f41900661181,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",What is the Grotto at Notre Dame?,a Marian place of prayer and reflection,381
4,5733be284776f4190066117e,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",What sits on top of the Main Building at Notre...,a golden statue of the Virgin Mary,92
...,...,...,...,...,...,...
87594,5735d259012e2f140011a09d,Kathmandu,"Kathmandu Metropolitan City (KMC), in order to...",In what US state did Kathmandu first establish...,Oregon,229
87595,5735d259012e2f140011a09e,Kathmandu,"Kathmandu Metropolitan City (KMC), in order to...",What was Yangon previously known as?,Rangoon,414
87596,5735d259012e2f140011a09f,Kathmandu,"Kathmandu Metropolitan City (KMC), in order to...",With what Belorussian city does Kathmandu have...,Minsk,476
87597,5735d259012e2f140011a0a0,Kathmandu,"Kathmandu Metropolitan City (KMC), in order to...",In what year did Kathmandu create its initial ...,1975,199


### ***1.1. Records Cleaning***

- Some questions are **not properly defined**, e.g., they do not have a question mark. Still, this should not be a problem and, given that there are many of them, we keep them.

- Some questions, however, are meaningless and represent fake records, so they must be excluded. This does not occur for the contexts.

In [None]:
poorly_defined_questions = [q for q in df['question'] if '?' not in q]
len(poorly_defined_questions)

949

In [None]:
excluded_questions = [q for q in df['question'] if len(q) < 10]
print('Questions:', excluded_questions)

excluded_contexts = [c for c in df['context'] if len(c) < 10]
print('Contexts: ', excluded_contexts)

Questions: ['k', 'j', 'n', 'b', 'v', 'dd', 'dd', 'dd', 'dd', 'd']
Contexts:  []


In [None]:
excluded_questions += ["I couldn't could up with another question. But i need to fill this space because I can't submit the hit. "]
excluded_questions = set(excluded_questions)

df[df['question'].isin(excluded_questions)][['question', 'answer']]

Unnamed: 0,question,answer
16818,k,ks
16819,j,Ch
16820,n,n
16821,b,b
16822,v,v
38422,dd,yptian Se
38423,dd,Buddh
38424,dd,m and E
38425,dd,Buddhism
38426,d,the Gre


### ***1.2. Final Dataset***

We only keep the records with correct questions.

In [None]:
df = df[~df['question'].isin(excluded_questions)]
df.index = range(len(df))
print(len(df), 'records remained')

87583 records remained
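
Since this logic is meant to be moved into utility functions, the cleaning step could be wrapped roughly as shown here. This is only a sketch: the function name `drop_fake_questions`, its parameters, and the toy data are assumptions made for illustration, not the repository's actual API.

```python
import pandas as pd

def drop_fake_questions(df, min_length=10, extra_excluded=()):
    """Return a copy of df without questions that are shorter than
    min_length characters or listed in extra_excluded.
    The result gets a fresh 0..n-1 index."""
    too_short = df['question'].str.len() < min_length
    listed = df['question'].isin(set(extra_excluded))
    return df[~(too_short | listed)].reset_index(drop=True)

# toy frame mimicking the 'question' and 'answer' columns used above
toy = pd.DataFrame({
    'question': ['dd', 'What is the Grotto at Notre Dame?', 'k'],
    'answer': ['Buddh', 'a Marian place of prayer and reflection', 'ks'],
})
clean = drop_fake_questions(toy)
print(len(clean), 'records remained')
```

Using `reset_index(drop=True)` has the same effect as reassigning `df.index = range(len(df))`, but it returns a new frame instead of modifying a filtered view.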


# **2. Train-Val-Test Splits**

We will now slice the dataset into a ***train***, a ***validation***, and a ***test*** split. The ***train*** and the ***validation*** splits will be used in the following notebooks to train the models and to evaluate their performance, respectively. Once all the models are trained, only the one that gave the best results will be evaluated on the ***test*** set, and that will be considered our final performance on the task.

To split the dataset into *train*, *val* and *test* splits, we rely on the `title` attribute of the original data. This prevents records in different splits from sharing the same subject, so that the three splits are independent of one another. To do so, we retrieve the `split_title` found at a given index, then we split the dataset so that every record with that title or a subsequent one goes into one split, while the records with the preceding titles go into the other.

### ***2.1. Train-Val-Test Split***



In [None]:
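def title_based_split(data, split_val=0.75):
  # retrieve the title found at the requested fraction of the records
  split_title = data['title'].iloc[int(split_val * len(data))]
  # position (not index label) of the first record with that title, so that
  # the function also works on frames whose index does not start at 0
  split_index = (data['title'] == split_title).to_numpy().argmax()
  return data.iloc[:split_index], data.iloc[split_index:]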

train_df, test_df = title_based_split(df)
print('Train-Test Split:', len(train_df) / len(df))

train_df, val_df = title_based_split(train_df)
print('Train-Val Split:', len(train_df) / (len(train_df) + len(val_df)))

Train-Test Split: 0.7499058036376922
Train-Val Split: 0.7473317194232555
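
A quick sanity check for this kind of split is to verify that no title appears in more than one split. The sketch below applies the idea to a toy frame: `split_by_title` and the toy data are hypothetical stand-ins, and the approach assumes that records with the same title are contiguous, as they are in the dataset above.

```python
import pandas as pd

def split_by_title(data, split_val=0.75):
    # title found at the requested fraction of the records
    split_title = data['title'].iloc[int(split_val * len(data))]
    # position of the first record carrying that title
    split_pos = int((data['title'] == split_title).to_numpy().argmax())
    return data.iloc[:split_pos], data.iloc[split_pos:]

# toy frame: 4 records titled 'A', 3 titled 'B', 3 titled 'C'
toy = pd.DataFrame({'title': ['A'] * 4 + ['B'] * 3 + ['C'] * 3})
head, tail = split_by_title(toy)

# titles shared by the two splits; an empty set means the splits are disjoint
shared = set(head['title']) & set(tail['title'])
print(len(head), len(tail), shared)  # 7 3 set()
```

The same intersection can be computed pairwise on `train_df`, `val_df`, and `test_df` to confirm that the three splits share no title.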


In [None]:
train_df[['title', 'context']]

Unnamed: 0,title,context
0,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha..."
1,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha..."
2,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha..."
3,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha..."
4,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha..."
...,...,...
49079,Yale_University,"Yale University, one of the oldest universitie..."
49080,Yale_University,"Yale University, one of the oldest universitie..."
49081,Yale_University,"Yale University, one of the oldest universitie..."
49082,Yale_University,"Yale University, one of the oldest universitie..."


In [None]:
val_df[['title', 'context']]

Unnamed: 0,title,context
49084,Late_Middle_Ages,"Around 1300, centuries of prosperity and growt..."
49085,Late_Middle_Ages,"Around 1300, centuries of prosperity and growt..."
49086,Late_Middle_Ages,"Around 1300, centuries of prosperity and growt..."
49087,Late_Middle_Ages,"Around 1300, centuries of prosperity and growt..."
49088,Late_Middle_Ages,"Around 1300, centuries of prosperity and growt..."
...,...,...
65674,Paris,Paris and its close suburbs is home to numerou...
65675,Paris,"The most-viewed network in France, TF1, is in ..."
65676,Paris,"The most-viewed network in France, TF1, is in ..."
65677,Paris,"The most-viewed network in France, TF1, is in ..."


In [None]:
test_df[['title', 'context']]

Unnamed: 0,title,context
65679,Apollo,"Apollo (Attic, Ionic, and Homeric Greek: Ἀπόλλ..."
65680,Apollo,"Apollo (Attic, Ionic, and Homeric Greek: Ἀπόλλ..."
65681,Apollo,"Apollo (Attic, Ionic, and Homeric Greek: Ἀπόλλ..."
65682,Apollo,"Apollo (Attic, Ionic, and Homeric Greek: Ἀπόλλ..."
65683,Apollo,"Apollo (Attic, Ionic, and Homeric Greek: Ἀπόλλ..."
...,...,...
87578,Kathmandu,"Kathmandu Metropolitan City (KMC), in order to..."
87579,Kathmandu,"Kathmandu Metropolitan City (KMC), in order to..."
87580,Kathmandu,"Kathmandu Metropolitan City (KMC), in order to..."
87581,Kathmandu,"Kathmandu Metropolitan City (KMC), in order to..."
