# Data Extraction

This script organizes the data into two columns: questions and answers.
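As a minimal sketch of the target layout, assuming each input line holds one question and one answer separated by a tab (which is how the extraction code in this notebook splits lines), a single line maps to one record as follows. The helper name `line_to_record` is hypothetical and used only for illustration; the sample line reuses the first pair printed later in this notebook.

```python
def line_to_record(line: str) -> dict:
    """Split one tab-separated line into a question/answer record."""
    # Split on the first tab only: question on the left, answer on the right
    question, answer = line.split('\t', 1)
    return {'question': question.strip(), 'answer': answer.strip()}


# Illustrative input line in the assumed "question<TAB>answer" format
sample_line = "hi, how are you doing?\ti'm fine. how about yourself?\n"
record = line_to_record(sample_line)
print(record)
```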

In [9]:
# Files to clean
file_paths = [{'input_file': 'dialogs.txt', 'output_file': 'dialogs.json'}]

output_dir = './output'

In [10]:
!gdown 1Xg7UhjpdCT_3-i_X1DVQnOSkwz_PPVsx

Downloading...
From: https://drive.google.com/uc?id=1Xg7UhjpdCT_3-i_X1DVQnOSkwz_PPVsx
To: /content/dialogs.txt
100% 244k/244k [00:00<00:00, 83.5MB/s]


Berant et al.

PDF: https://www.aclweb.org/anthology/D13-1160.pdf

Dataset: https://github.com/brmson/dataset-factoid-webquestions

Year of Publication: 2013

Size: 5,810

Data Collection: Berant et al. use the Google Suggest API as a basis for generating questions. They start with a single question ("Where was Barack Obama born") and feed the Google Suggest API with three query fragments generated from the source question: the question without the phrase before the entity, the question without the entity, and the question without the phrase after the entity. Each of these queries generates 5 candidate questions. The candidate questions are added to a queue of questions, and the process is repeated for each question in the queue. They stop after 1M questions are generated. Due to the nature of the approach, the generated questions start with a wh-word and revolve around a single entity. 100K of these questions are given to crowdsourcing workers, who select a Freebase entity, value, or list of entities as the answer. Questions that are answered identically by at least two workers are included in the dataset.
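The three query fragments described above can be sketched in a few lines. This is only an illustration of the fragment construction, not the authors' code: the function name `query_fragments` is hypothetical, the entity is assumed to be given as an exact substring of the question, and no call to the Google Suggest API is made.

```python
from typing import List


def query_fragments(question: str, entity: str) -> List[str]:
    """Build the three query fragments for a question and its entity."""
    before, found, after = question.partition(entity)
    if not found:
        raise ValueError(f"entity {entity!r} not found in question")
    before, after = before.strip(), after.strip()
    fragments = [
        f"{entity} {after}",   # question without the phrase before the entity
        f"{before} {after}",   # question without the entity
        f"{before} {entity}",  # question without the phrase after the entity
    ]
    # Collapse any doubled or trailing spaces left by empty parts
    return [' '.join(fragment.split()) for fragment in fragments]


fragments = query_fragments("Where was Barack Obama born", "Barack Obama")
print(fragments)
# → ['Barack Obama born', 'Where was born', 'Where was Barack Obama']
```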

In [11]:
import json

def data_extraction(file_path: str) -> list:
    """
    Extracts question-answer pairs from a tab-separated text file and
    organizes them into two columns: questions, answers.

    Parameters:
        file_path (str): path to the text file, with one
        "question<TAB>answer" pair per line

    Returns:
        qa_list (list): list of dictionaries
        with questions and answers
    """
    qa_list = []
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            # Skip blank lines so they do not break the unpacking below
            if not line.strip():
                continue
            # Split on the first tab only: question on the left, answer on the right
            question, answer = line.split('\t', 1)
            qa_dict = {'question': question.strip(), 'answer': answer.strip()}
            qa_list.append(qa_dict)
    return qa_list

In [12]:
import os

for file_info in file_paths:
    input_file = file_info['input_file']
    output_file = file_info['output_file']
    print(f"Processing {input_file}...\n")
    list_questions_answers = data_extraction(input_file)
    print(f"Number of questions and answers in {input_file}: {len(list_questions_answers)}\n")
    print("First 5 entries:\n")
    # Slice instead of indexing so files with fewer than 5 entries do not fail
    for i, entry in enumerate(list_questions_answers[:5], start=1):
        print(f"Question {i}: {entry['question']}")
        print(f"Answer {i}: {entry['answer']}\n")

    # Create the output directory if it does not exist yet
    os.makedirs(output_dir, exist_ok=True)

    output_file_path = os.path.join(output_dir, output_file)

    with open(output_file_path, 'w') as file:
        json.dump(list_questions_answers, file)
    print(f"Data saved in {output_file_path}\n")
    print("-" * 20)

Processing dialogs.txt...

Number of questions and answers in dialogs.txt: 3725

First 5 entries:

Question 1: hi, how are you doing?
Answer 1: i'm fine. how about yourself?

Question 2: i'm fine. how about yourself?
Answer 2: i'm pretty good. thanks for asking.

Question 3: i'm pretty good. thanks for asking.
Answer 3: no problem. so how have you been?

Question 4: no problem. so how have you been?
Answer 4: i've been great. what about you?

Question 5: i've been great. what about you?
Answer 5: i've been good. i'm in school right now.

Data saved in ./output/dialogs.json

--------------------
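A quick way to sanity-check a saved file is to load it back and confirm that every record carries both keys. The sketch below is self-contained and makes no claim about the real `./output` folder: it writes two illustrative records (copied from the printed entries above) to a temporary directory and reads them back.

```python
import json
import os
import tempfile

# Two illustrative records in the same shape as the extracted data
records = [
    {'question': 'hi, how are you doing?',
     'answer': "i'm fine. how about yourself?"},
    {'question': "i'm fine. how about yourself?",
     'answer': "i'm pretty good. thanks for asking."},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'dialogs.json')
    # Write the records the same way the notebook does, then read them back
    with open(path, 'w', encoding='utf-8') as file:
        json.dump(records, file)
    with open(path, 'r', encoding='utf-8') as file:
        loaded = json.load(file)

# Every record should expose both the 'question' and the 'answer' key
all_keys_present = all({'question', 'answer'} <= set(r) for r in loaded)
print(len(loaded), all_keys_present)
```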


In [13]:
# zip the files and download
import shutil
from google.colab import files
from datetime import datetime

file_name = 'dataset' + '_' + datetime.now().strftime("%Y%m%d-%H%M%S")
shutil.make_archive(file_name, 'zip', 'output')
files.download(file_name + '.zip')

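Outside Colab, `google.colab.files` is not available, so the download step cannot run. As a hedged sketch, the archive step alone can be reproduced with the standard library and checked by listing the archive contents. The temporary directory and the empty `dialogs.json` placeholder below are assumptions made so the example runs anywhere.

```python
import os
import shutil
import tempfile
import zipfile
from datetime import datetime

with tempfile.TemporaryDirectory() as work_dir:
    # Stand-in for the notebook's ./output folder, with a placeholder file
    out_dir = os.path.join(work_dir, 'output')
    os.makedirs(out_dir)
    with open(os.path.join(out_dir, 'dialogs.json'), 'w', encoding='utf-8') as f:
        f.write('[]')

    # Same naming scheme as the notebook: dataset_<timestamp>.zip
    base_name = os.path.join(
        work_dir, 'dataset_' + datetime.now().strftime('%Y%m%d-%H%M%S'))
    archive_path = shutil.make_archive(base_name, 'zip', out_dir)

    # List the archive contents to confirm the file was packed
    with zipfile.ZipFile(archive_path) as archive:
        archived_names = archive.namelist()

print(archived_names)
```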