# Data Extraction

This notebook flattens the raw dataset files into a list of question-answer pairs. Each entry has two fields: `question` and `answer`.

In [1]:
# Files to clean
file_paths = [{'input_file': 'train_raw.json', 'output_file': 'train.json'},
              {'input_file': 'val_raw.json', 'output_file': 'val.json'},
              {'input_file': 'test_raw.json', 'output_file': 'test.json'},
              {'input_file': 'devtest_raw.json', 'output_file': 'devtest.json'}]

output_dir = './output'

In [2]:
!gdown 1ACqeigUcVh1SH8fem7VqpD2j-wQP9lRj
!gdown 1SU8BCy3hknF7tybwXnDWOaeHGT-cA7DQ
!gdown 1_Cxk0YK0W4Y7Dzz3pkrfzzt7qftBzKvS
!gdown 138dYd8GvE0TcMaEQlHgUEE7Be6DWNQix

Downloading...
From: https://drive.google.com/uc?id=1ACqeigUcVh1SH8fem7VqpD2j-wQP9lRj
To: /content/train_raw.json
100% 377k/377k [00:00<00:00, 86.5MB/s]
Downloading...
From: https://drive.google.com/uc?id=1SU8BCy3hknF7tybwXnDWOaeHGT-cA7DQ
To: /content/val_raw.json
100% 102k/102k [00:00<00:00, 44.2MB/s]
Downloading...
From: https://drive.google.com/uc?id=1_Cxk0YK0W4Y7Dzz3pkrfzzt7qftBzKvS
To: /content/test_raw.json
100% 274k/274k [00:00<00:00, 20.3MB/s]
Downloading...
From: https://drive.google.com/uc?id=138dYd8GvE0TcMaEQlHgUEE7Be6DWNQix
To: /content/devtest_raw.json
100% 25.1k/25.1k [00:00<00:00, 36.6MB/s]


Berant et al.

PDF: https://www.aclweb.org/anthology/D13-1160.pdf

Dataset: https://github.com/brmson/dataset-factoid-webquestions

Year of Publication: 2013

Size: 5,810

Data Collection: Berant et al. use the Google Suggest API as the basis for generating questions. Starting from a single seed question ("Where was Barack Obama born"), they query the API with three fragments derived from the source question: the question without the phrase before the entity, the question without the entity, and the question without the phrase after the entity. Each query yields 5 candidate questions, which are added to a queue, and the process is repeated for each question in the queue until 1M questions have been generated. Because of this approach, the generated questions start with a wh-word and revolve around a single entity. 100K of these questions are given to crowdsourcing workers, who select a Freebase entity, value, or list of entities as the answer. Questions that at least two workers answer identically are included in the dataset.
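The breadth-first expansion described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' code: `make_fragments` and `expand_questions` are hypothetical names, `toy_suggest` is a lookup table standing in for the Google Suggest API, and `toy_find_entity` is a made-up entity detector.

```python
from collections import deque
from typing import Callable, List, Optional


def make_fragments(question: str, entity: str) -> List[str]:
    """Build the three query fragments for a question and its entity."""
    before, _, after = question.partition(entity)
    fragments = [
        f"{entity} {after}",   # question without the phrase before the entity
        f"{before} {after}",   # question without the entity
        f"{before} {entity}",  # question without the phrase after the entity
    ]
    # Normalize whitespace left over from joining the pieces
    return [" ".join(fragment.split()) for fragment in fragments]


def expand_questions(
    seed: str,
    find_entity: Callable[[str], Optional[str]],
    suggest: Callable[[str], List[str]],
    max_questions: int,
    per_query: int = 5,
) -> List[str]:
    """Breadth-first expansion of a seed question via a suggestion function."""
    generated = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue and len(generated) < max_questions:
        question = queue.popleft()
        entity = find_entity(question)
        if entity is None:
            continue  # no entity found, so no fragments can be built
        for fragment in make_fragments(question, entity):
            # Keep at most `per_query` candidates per fragment
            for candidate in suggest(fragment)[:per_query]:
                if candidate not in seen and len(generated) < max_questions:
                    seen.add(candidate)
                    generated.append(candidate)
                    queue.append(candidate)
    return generated


# Toy stand-ins (assumptions for the example, not real API behaviour)
KNOWN_ENTITIES = ["barack obama", "michelle obama"]
TOY_SUGGESTIONS = {
    "barack obama born": ["when was barack obama born"],
    "where was born": ["where was michelle obama born"],
}


def toy_find_entity(question: str) -> Optional[str]:
    return next((e for e in KNOWN_ENTITIES if e in question), None)


def toy_suggest(fragment: str) -> List[str]:
    return TOY_SUGGESTIONS.get(fragment, [])


questions = expand_questions(
    "where was barack obama born", toy_find_entity, toy_suggest, max_questions=10
)
print(questions)
```

The `seen` set prevents a question from being queued twice, and `max_questions` plays the role of the 1M stopping threshold mentioned in the description.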

In [3]:
import json

def data_extraction(file_path: str) -> list:
    """
    Extracts the relevant data from a JSON file and
    organizes it into two columns: questions, answers.

    Parameters:
        file_path (str): path to the JSON file

    Returns:
        list_questions_answers (list): list of dictionaries,
        one per non-empty answer, with keys 'question' and 'answer'
    """
    # Read as UTF-8 explicitly, since answers contain non-ASCII text (e.g. "Padmé")
    with open(file_path, 'r', encoding='utf-8') as file:
        data = json.load(file)

    list_questions_answers = []

    for item in data:
        question = item['qText']
        answers = item['answers']
        for answer in answers:
            if answer:
                question_answer = {
                    'question': question,
                    'answer': answer
                }
                list_questions_answers.append(question_answer)

    return list_questions_answers
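To make the one-row-per-answer behaviour concrete, the sketch below applies the same flattening logic to two records held in memory. `flatten_records` is a hypothetical helper that only mirrors the loop in `data_extraction`, and the second sample record is made up for illustration.

```python
def flatten_records(records: list) -> list:
    """Emit one {'question', 'answer'} dict per non-empty answer."""
    return [
        {"question": item["qText"], "answer": answer}
        for item in records
        for answer in item["answers"]
        if answer  # skip empty answers, as data_extraction does
    ]


sample = [
    {"qText": "what country is the grand bahama island in?",
     "answers": ["Bahamas"]},
    # Made-up record with two real answers and one empty string
    {"qText": "example question with two answers?",
     "answers": ["first answer", "", "second answer"]},
]

rows = flatten_records(sample)
for row in rows:
    print(row)
```

Because a question with several answers produces several rows, the number of rows in the output can exceed the number of questions in the input file.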

In [4]:
import os

for file_path in file_paths:
    input_file = file_path['input_file']
    output_file = file_path['output_file']
    print(f"Processing {input_file}...\n")
    list_questions_answers = data_extraction(input_file)
    print(f"Number of questions and answers in {input_file}: {len(list_questions_answers)}\n")
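    print("First 5 entries:\n")
    # Slice instead of indexing so files with fewer than 5 entries do not raise IndexError
    for i, entry in enumerate(list_questions_answers[:5], start=1):
        print(f"Question {i}: {entry['question']}")
        print(f"Answer {i}: {entry['answer']}\n")

    # Create the output directory if it does not exist yet
    os.makedirs(output_dir, exist_ok=True)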

    output_file_path = os.path.join(output_dir, output_file)

    with open(output_file_path, 'w') as file:
        json.dump(list_questions_answers, file)
    print(f"Data saved in {output_file_path}\n")
    print("-" * 20)

Processing train_raw.json...

Number of questions and answers in train_raw.json: 6668

First 5 entries:

Question 1: what character did natalie portman play in star wars?
Answer 1: Padmé Amidala

Question 2: what state does selena gomez?
Answer 2: New York City

Question 3: what country is the grand bahama island in?
Answer 3: Bahamas

Question 4: what character did john noble play in lord of the rings?
Answer 4: Denethor II

Question 5: who does joakim noah play for?
Answer 5: Chicago Bulls

Data saved in ./output/train.json

--------------------
Processing val_raw.json...

Number of questions and answers in val_raw.json: 1821

First 5 entries:

Question 1: what kind of money to take to bahamas?
Answer 1: Bahamian dollar

Question 2: how old is sacha baron cohen?
Answer 2: http://justjared.buzznet.com/2009/01/12/sacha-baron-cohen-insults-madonna-at-golden-globes/

Question 3: where is rome italy located on a map?
Answer 3: Rome

Question 4: who did the philippines gain independence fr

In [5]:
# zip the files and download
import shutil
from google.colab import files
from datetime import datetime

file_name = 'dataset' + '_' + datetime.now().strftime("%Y%m%d-%H%M%S")
shutil.make_archive(file_name, 'zip', 'output')
files.download(file_name + '.zip')

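The download step above relies on `google.colab`, which is only available inside Colab. As a rough sketch of the archiving step in any Python environment, the code below zips a directory and lists the archive contents as a sanity check. `archive_directory` is a hypothetical helper, and the temporary directory and sample file exist only to keep the example self-contained.

```python
import json
import shutil
import tempfile
import zipfile
from pathlib import Path


def archive_directory(source_dir: str, archive_stem: str):
    """Zip the contents of source_dir and return (archive path, member names)."""
    # make_archive appends the '.zip' extension to archive_stem
    archive_path = shutil.make_archive(archive_stem, "zip", root_dir=source_dir)
    with zipfile.ZipFile(archive_path) as zf:
        names = sorted(zf.namelist())
    return archive_path, names


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the notebook's ./output directory
    out = Path(tmp) / "output"
    out.mkdir()
    (out / "train.json").write_text(
        json.dumps([{"question": "q", "answer": "a"}]), encoding="utf-8"
    )

    archive_path, names = archive_directory(str(out), str(Path(tmp) / "dataset"))
    print(Path(archive_path).name, names)
```

Listing the members before distributing the archive is a cheap way to confirm that every expected split file was written.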