# Data Extraction

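This script organizes the data into two columns: questions and answers. Each conversation contributes one pair: its first line is the question and its second line is the answer.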

In [1]:
# Files to clean
file_paths = [{'input_file': 'movie_conversations.txt', 'output_file': 'movie_dialog.json'}]
lines_file_path = 'movie_lines.txt'

output_dir = './output'

In [2]:
!gdown 1638WA6FxxfjOFtTyfg6u_eNUEg09yi5R
!gdown 1nTtZgWzb46nuOi9LLn922HzbbrNZyZUG

Downloading...
From: https://drive.google.com/uc?id=1638WA6FxxfjOFtTyfg6u_eNUEg09yi5R
To: /content/movie_lines.txt
100% 34.6M/34.6M [00:00<00:00, 106MB/s] 
Downloading...
From: https://drive.google.com/uc?id=1nTtZgWzb46nuOi9LLn922HzbbrNZyZUG
To: /content/movie_conversations.txt
100% 6.76M/6.76M [00:00<00:00, 67.8MB/s]


**Cornell Movie-Dialogs Corpus**

Distributed together with: Chameleons in Imagined Conversations.

Data and Code available in ConvoKit: a toolkit for analyzing conversations

Related corpus: Cornell Movie-Quotes Corpus

**DESCRIPTION:**
                                    
This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts:

- 220,579 conversational exchanges between 10,292 pairs of movie characters

- involves 9,035 characters from 617 movies

- in total 304,713 utterances

- movie metadata included:

    - genres

    - release year

    - IMDB rating

    - number of IMDB votes

- character metadata included:

    - gender (for 3,774 characters)

    - position on movie credits (for 3,321 characters)

- see the [documentation](https://convokit.cornell.edu/documentation/movie.html) for details

In [3]:
def data_extraction(conversation_file_path: str, lines_file_path: str) -> list:
    """
    Extract question/answer pairs from the corpus text files and
    organize them into two columns: questions, answers.

    Parameters:
        conversation_file_path (str): path to the conversation file
        lines_file_path (str): path to the lines file

    Returns:
        list_questions_answers (list): list of dictionaries
        with questions and answers
    """
    # Read movie_conversations.txt and organize the line ids into questions and answers
    with open(conversation_file_path, 'r') as file:
        conversations = file.readlines()

    list_questions_answers_id = []
    for conversation in conversations:
        conversation = conversation.strip().split(' +++$+++ ')
        conversation_sequence = conversation[-1].replace("'", "").replace("[", "").replace("]", "").split(',')
        conversation_sequence = [line_id.strip() for line_id in conversation_sequence]

        # Include all combinations of question/answer in each conversation
        # for index, line_id in enumerate(conversation_sequence):
            # if index < len(conversation_sequence) - 1:
            #     question_id = line_id
            #     answer_id = conversation_sequence[index + 1]
            #     question_answer = {
            #         'question': question_id,
            #         'answer': answer_id
            #     }
            #     list_questions_answers_id.append(question_answer)

        # Include only the first question/answer in each conversation
        if len(conversation_sequence) > 1:
            question_id = conversation_sequence[0]
            answer_id = conversation_sequence[1]
            question_answer = {
                'question': question_id,
                'answer': answer_id
            }
            list_questions_answers_id.append(question_answer)

    # Read the movie_lines.txt and replace the ids with actual dialogue
    with open(lines_file_path, 'r', errors='ignore') as file:
        lines = file.readlines()

    # Create dictionary of id and lines
    id_to_line = {}
    for line in lines:
        line = line.strip().split(' +++$+++ ')
        id_to_line[line[0]] = line[-1]

    # Replace the ids with actual dialogue
    list_questions_answers = []
    for question_answer in list_questions_answers_id:
        question = id_to_line[question_answer['question']]
        answer = id_to_line[question_answer['answer']]
        question_answer = {
            'question': question,
            'answer': answer
        }
        list_questions_answers.append(question_answer)

    return list_questions_answers
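
The parsing step above strips quotes and brackets by hand, and the commented-out loop pairs every line with the one that follows it. A minimal sketch of both ideas, assuming the last field of each conversation row is a valid Python list literal: `parse_line_ids` and `adjacent_pairs` are hypothetical helper names, not part of the notebook, and `sample_row` is an illustrative row in the corpus format.

```python
import ast

SEPARATOR = ' +++$+++ '  # field separator used by the corpus files


def parse_line_ids(conversation_row: str) -> list:
    """Return the line ids stored in the last field of a conversation row."""
    fields = conversation_row.strip().split(SEPARATOR)
    # Assumption: the last field is a Python list literal such as ['L1', 'L2']
    return ast.literal_eval(fields[-1])


def adjacent_pairs(line_ids: list) -> list:
    """Pair every line id with the line id that follows it."""
    return [
        {'question': question_id, 'answer': answer_id}
        for question_id, answer_id in zip(line_ids, line_ids[1:])
    ]


# Illustrative row in the movie_conversations.txt format
sample_row = "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']"
line_ids = parse_line_ids(sample_row)
pairs = adjacent_pairs(line_ids)
print(line_ids)
print(pairs)
```

With this variant a conversation of n lines yields n - 1 pairs, whereas the function above keeps only the first pair per conversation.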

In [4]:
import os
import json

for file_path in file_paths:
    input_file = file_path['input_file']
    output_file = file_path['output_file']
    print(f"Processing {input_file}...\n")
    list_questions_answers = data_extraction(input_file, lines_file_path)
    print(f"Number of questions and answers in {input_file}: {len(list_questions_answers)}\n")
    print("First 5 entries:\n")
    # Slice instead of indexing so that fewer than 5 entries does not raise IndexError
    for i, entry in enumerate(list_questions_answers[:5], start=1):
        print(f"Question {i}: {entry['question']}")
        print(f"Answer {i}: {entry['answer']}\n")

    os.makedirs(output_dir, exist_ok=True)

    output_file_path = os.path.join(output_dir, output_file)

    with open(output_file_path, 'w') as file:
        json.dump(list_questions_answers, file)
    print(f"Data saved in {output_file_path}\n")
    print("-" * 20)

Processing movie_conversations.txt...

Number of questions and answers in movie_conversations.txt: 83097

First 5 entries:

Question 1: Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.
Answer 1: Well, I thought we'd start with pronunciation, if that's okay with you.

Question 2: You're asking me out.  That's so cute. What's your name again?
Answer 2: Forget it.

Question 3: No, no, it's my fault -- we didn't have a proper introduction ---
Answer 3: Cameron.

Question 4: Why?
Answer 4: Unsolved mystery.  She used to be really popular when she started high school, then it was just like she got sick of it or something.

Question 5: Gosh, if only we could find Kat a boyfriend...
Answer 5: Let me see what I can do.

Data saved in ./output/movie_dialog.json

--------------------
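
A saved file can be read back to check that every entry has the two expected keys. The sketch below is one possible check, not part of the original notebook: `load_pairs` is a hypothetical helper, and the demo writes a small sample to a temporary directory instead of touching `./output`.

```python
import json
import os
import tempfile


def load_pairs(json_path: str) -> list:
    """Load question/answer pairs and verify each entry has exactly those two keys."""
    with open(json_path, 'r') as file:
        pairs = json.load(file)
    for index, pair in enumerate(pairs):
        if set(pair) != {'question', 'answer'}:
            raise ValueError(f"Entry {index} has unexpected keys: {sorted(pair)}")
    return pairs


# Demo on a temporary file with one sample pair taken from the output above
sample_pairs = [{'question': 'Why?', 'answer': 'Unsolved mystery.'}]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'movie_dialog.json')
    with open(demo_path, 'w') as file:
        json.dump(sample_pairs, file)
    reloaded = load_pairs(demo_path)

print(reloaded == sample_pairs)
```

In the notebook the same helper could be pointed at `output_file_path` after the save step.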


In [5]:
# Zip the output directory and download the archive
import shutil
from google.colab import files
from datetime import datetime

file_name = 'dataset' + '_' + datetime.now().strftime("%Y%m%d-%H%M%S")
shutil.make_archive(file_name, 'zip', output_dir)
files.download(file_name + '.zip')
