# Cornell Movies Dialogues Dataset translation

See https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html

This notebook contains code for automating the translation of the Cornell Movies Dialogues Dataset utterances to __Spanish__.
It can be easily modified to translate to any other language, using the Azure Translation API service.

Processing is done incrementally over the _movie_lines.txt_ data. Therefore, this code can be safely rerun on the same data, producing an incremental set of translations:
- Translations are optimized: texts that have already been translated are detected and reused.
- Fault tolerant: if an error occurs and the translation stops, it can be rerun at a small additional processing cost.

Translations are incrementally written to the _translation_log_ file. Therefore, __do not delete it!__ It serves as the basis for any further translation (to avoid translating the same text more than once). This file is written automatically by the _AzureTranslator_ class.
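
The incremental behaviour described above can be illustrated with a minimal sketch. It is not the code of `AzureTranslator`, whose internals are not shown in this notebook: the log format (one JSON object per line), the helper names `load_log` and `translate_incrementally`, and the `fake_translate` stand-in are all assumptions made for illustration.

```python
import json
import os
import tempfile
from typing import Callable, Dict, Iterable, List


def load_log(path: str) -> Dict[str, str]:
    """Rebuild the source -> translation dictionary from an append-only log."""
    translations: Dict[str, str] = {}
    if not os.path.exists(path):
        return translations
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Hypothetical log format: one JSON object per line
            record = json.loads(line)
            translations[record["source"]] = record["target"]
    return translations


def translate_incrementally(texts: Iterable[str],
                            translate_fn: Callable[[List[str]], List[str]],
                            log_path: str,
                            batch_size: int = 20) -> Dict[str, str]:
    """Translate only the texts missing from the log, logging results as they arrive."""
    done = load_log(log_path)
    # Deduplicate while preserving order and skip texts that are already translated
    unique_texts = dict.fromkeys(s.strip() for s in texts)
    pending = [t for t in unique_texts if t and t not in done]
    with open(log_path, "a", encoding="utf-8") as log:
        for i in range(0, len(pending), batch_size):
            batch = pending[i:i + batch_size]
            for source, target in zip(batch, translate_fn(batch)):
                record = {"source": source, "target": target}
                log.write(json.dumps(record, ensure_ascii=False) + "\n")
                done[source] = target
            log.flush()  # finished batches stay on disk if a later batch fails
    return done


# Demo with a stand-in translator that upper-cases its input
calls = []


def fake_translate(batch: List[str]) -> List[str]:
    calls.append(list(batch))
    return [t.upper() for t in batch]


log_path = os.path.join(tempfile.mkdtemp(), "translation_log_demo")
first = translate_incrementally(["They do not!", "Wow", "Wow"], fake_translate, log_path)
second = translate_incrementally(["Wow", "No"], fake_translate, log_path)
print(calls)  # the second run only sends the text that was not in the log
```

Because the log is only ever appended to, a rerun after a failure pays only for reading the log and for the texts that were still pending.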

### Preprocess

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are read from DATA_DIR (the Cornell Movie-Dialogs Corpus folder).
# Running this cell (by clicking run or pressing Shift+Enter) lists the files in that directory

import os
DATA_DIR = "../data/cornell_movie_dialogs_corpus"
PREPROCESSED_DATA_DIR = "../preprocessed_data"
print(os.listdir(DATA_DIR))

# Any results you write to the current directory are saved as output.

['movie_conversations.txt', 'raw_script_urls.txt', 'movie_lines.txt', 'README.txt', 'chameleons.pdf', 'movie_titles_metadata.txt', 'movie_characters_metadata.txt', '.DS_Store']


In [2]:
from tqdm import tqdm_notebook as tqdm
import pandas as pd
import sys
import time
from typing import Dict, Text

In [3]:
# Add src folder to PYTHONPATH
sys.path.append("../../src")

In [4]:
from datasets_preprocess import build_text_translation_dict
from datasets_preprocess.cornell_movies import movie_lines_to_dataframe

__Some exploration on original data__

In [5]:
!cat ../data/cornell_movie_dialogs_corpus/README.txt

Cornell Movie-Dialogs Corpus

Distributed together with:

"Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs"
Cristian Danescu-Niculescu-Mizil and Lillian Lee
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, ACL 2011.

(this paper is included in this zip file)

NOTE: If you have results to report on these corpora, please send email to cristian@cs.cornell.edu or llee@cs.cornell.edu so we can add you to our list of people using this data.  Thanks!


Contents of this README:

	A) Brief description
	B) Files description
	C) Details on the collection procedure
	D) Contact


A) Brief description:

This corpus contains a metadata-rich collection of fictional conversations extracted from raw movie scripts:

- 220,579 conversational exchanges between 10,292 pairs of movie characters
- involves 9,035 characters from 617 movies
- in total 304,713 utterances
- movie metadata

In [6]:
!head -n 1000 ../data/cornell_movie_dialogs_corpus/movie_lines.txt

L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!
L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!
L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.
L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?
L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.
L924 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Wow
L872 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Okay -- you're gonna need to learn how to lie.
L871 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ No
L870 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I'm kidding.  You know how sometimes you just become this "persona"?  And you don't know how to quit?
L869 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Like my fear of wearing pastels?
L868 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ The "real you".
L867 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ What good stuff?
L866 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ I figured you'd get to the good stuff eventually.
L

__Parse dialogue lines into pandas data frame__

In [7]:
CONVERT_TO_DATAFRAME = False
if CONVERT_TO_DATAFRAME:
    df_movie_lines = movie_lines_to_dataframe(DATA_DIR + "/movie_lines.txt", tqdm_call=tqdm)
    # Save to disk
    df_movie_lines.to_csv("movie_lines.csv")
    df_movie_lines.to_pickle("movie_lines.pick")
else:
    df_movie_lines = pd.read_pickle("movie_lines.pick")

In [8]:
df_movie_lines.dtypes

LINE_ID           object
CHARACTER_ID      object
MOVIE_ID          object
CHARACTER_NAME    object
UTTERANCE         object
dtype: object
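
The five columns above correspond to the five fields that each line of _movie_lines.txt_ separates with ` +++$+++ `. As a rough illustration of that parsing step, here is a minimal sketch; it is not the implementation of `movie_lines_to_dataframe`, and `parse_movie_lines` is a hypothetical name used only here.

```python
import io

import pandas as pd

SEPARATOR = " +++$+++ "
COLUMNS = ["LINE_ID", "CHARACTER_ID", "MOVIE_ID", "CHARACTER_NAME", "UTTERANCE"]


def parse_movie_lines(handle) -> pd.DataFrame:
    """Parse an open text file in the movie_lines.txt layout into a DataFrame."""
    rows = []
    for raw in handle:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # maxsplit keeps the utterance whole even if it contains the separator
        fields = line.split(SEPARATOR, maxsplit=len(COLUMNS) - 1)
        # Pad lines that have fewer fields than expected with empty strings
        fields += [""] * (len(COLUMNS) - len(fields))
        rows.append(fields)
    return pd.DataFrame(rows, columns=COLUMNS)


# Two lines copied from the head of movie_lines.txt shown earlier
sample = io.StringIO(
    "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!\n"
    "L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!\n"
)
df_sample = parse_movie_lines(sample)
print(df_sample)
```

To parse the real file, pass an open file handle instead of the `StringIO` object. The file encoding is an assumption to verify locally; `iso-8859-1` is commonly used for this corpus.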

In [9]:
#!ls -alh

## Translate to Spanish
Translate the utterances to Spanish using the Azure Translation API service.

In [10]:
# -*- coding: utf-8 -*-
import os, requests, uuid, json
from datasets_preprocess.azure_translator import AzureTranslator

In [11]:
# Read the Translator Text subscription key from an environment variable.
# If you are setting your subscription key as a string, comment out the
# if/else block and use the assignment at the bottom of the cell instead.
if 'TRANSLATOR_TEXT_KEY' in os.environ:
    subscription_key = os.environ['TRANSLATOR_TEXT_KEY']
else:
    # Raise instead of calling exit(), which would stop the notebook kernel
    raise RuntimeError('Environment variable for TRANSLATOR_TEXT_KEY is not set.')
# If you want to set your subscription key as a string, uncomment the line
# below and add your subscription key.
#subscription_key = 'put_your_key_here'

In [12]:
df_movie_lines.iloc[12]

LINE_ID                                                        L866
CHARACTER_ID                                                     u2
MOVIE_ID                                                         m0
CHARACTER_NAME                                              CAMERON
UTTERANCE         I figured you'd get to the good stuff eventually.
Name: 12, dtype: object

In [13]:
# Build translator
translator = AzureTranslator(subscription_key=subscription_key, origin_language="en",
                             destination_language="es")
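
To show what a request to the service looks like without the project's `AzureTranslator` wrapper, the sketch below builds (but does not send) a call to the public Translator Text REST API v3.0. It is not taken from `AzureTranslator`, whose internals are not shown here. `build_translate_request` and `extract_translations` are hypothetical helper names, and the sample response is hand-written to mirror the documented response shape, not captured from the service.

```python
import uuid
from typing import Dict, List, Tuple

# Global endpoint of the Translator Text API v3.0
ENDPOINT = "https://api.cognitive.microsofttranslator.com/translate"


def build_translate_request(texts: List[str], subscription_key: str,
                            from_lang: str = "en",
                            to_lang: str = "es") -> Tuple[Dict, Dict, List[Dict]]:
    """Return (query parameters, headers, JSON body) for one translate call."""
    params = {"api-version": "3.0", "from": from_lang, "to": to_lang}
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "application/json",
        "X-ClientTraceId": str(uuid.uuid4()),  # optional id for tracing a request
    }
    body = [{"text": text} for text in texts]
    return params, headers, body


def extract_translations(response_json: List[Dict]) -> List[str]:
    """Take the first translation of every item in a translate response."""
    return [item["translations"][0]["text"] for item in response_json]


params, headers, body = build_translate_request(["Let's go.", "Wow"], "put_your_key_here")

# Hand-written example of the response shape (not real service output)
sample_response = [
    {"translations": [{"text": "Vamos.", "to": "es"}]},
    {"translations": [{"text": "Guau", "to": "es"}]},
]
print(extract_translations(sample_response))
```

Sending the request would be a matter of `requests.post(ENDPOINT, params=params, headers=headers, json=body)`. Depending on the type of Azure resource, an additional `Ocp-Apim-Subscription-Region` header may be required. Translating to another language only requires changing `to_lang`, which mirrors the `destination_language` argument used above.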

In [14]:
# Build translation dictionary from log file
text_translation_dict = build_text_translation_dict("translation_log")
len(text_translation_dict)

100%|██████████| 28346/28346 [00:00<00:00, 932564.70it/s]


25476

In [15]:
# Run translation
translator.translate_with_dict(df_movie_lines.UTTERANCE.values, text_translation_dict, num_texts_per_request=20,
                            tqdm_call=tqdm, log_file="./translation_log")

HBox(children=(IntProgress(value=0, description='Number of texts to translate', max=304713, style=ProgressStyl…




TranslationExample: (400, 'Bad Request', b'{"error":{"code":400077,"message":"The maximum request size has been exceeded."}}')
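
The error above indicates that a request exceeded the size accepted by the service, which can happen when a fixed number of texts per request (20 here) includes long utterances. One possible mitigation, sketched below, is to cap each batch by total character count as well as by item count. `batch_by_size` is a hypothetical helper, not part of `AzureTranslator`, and the default limits are placeholders: the actual limits depend on the API version and should be taken from the Azure documentation.

```python
from typing import Iterable, Iterator, List


def batch_by_size(texts: Iterable[str], max_items: int = 20,
                  max_chars: int = 5000) -> Iterator[List[str]]:
    """Yield batches that respect both an item limit and a character limit."""
    batch: List[str] = []
    batch_chars = 0
    for text in texts:
        n = len(text)  # approximation: the service may count characters differently
        if n > max_chars:
            raise ValueError("A single text exceeds max_chars; split it before batching")
        # Close the current batch if adding this text would break a limit
        if batch and (len(batch) >= max_items or batch_chars + n > max_chars):
            yield batch
            batch, batch_chars = [], 0
        batch.append(text)
        batch_chars += n
    if batch:
        yield batch


# Small demo with artificially low limits
texts = ["a" * 10, "b" * 10, "c" * 10, "d" * 5]
batches = list(batch_by_size(texts, max_items=3, max_chars=25))
print([len(b) for b in batches])
```

Since finished translations are kept in the log file, rerunning the translation after adjusting the batch size would only process the texts that are still missing.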

In [16]:
# Rebuild translation dictionary from log file
text_translation_dict = build_text_translation_dict("translation_log")
len(text_translation_dict)

100%|██████████| 28346/28346 [00:00<00:00, 968552.38it/s]


25476

In [17]:
# Write translations into the dataframe
df_movie_lines["UTTERANCE_SPANISH"] = df_movie_lines["UTTERANCE"].apply(lambda s: text_translation_dict.get(s.strip(), None))

In [19]:
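# Save dataframe to disk. Separate file names (chosen here for illustration)
# avoid overwriting the raw movie_lines.txt and keep the pickle from clobbering the CSV.
os.makedirs(PREPROCESSED_DATA_DIR, exist_ok=True)
df_movie_lines.to_csv(PREPROCESSED_DATA_DIR + "/movie_lines_translated.csv")
df_movie_lines.to_pickle(PREPROCESSED_DATA_DIR + "/movie_lines_translated.pick")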

__Explore translation results__

In [28]:
# Number of translated texts
_counts = df_movie_lines["UTTERANCE_SPANISH"].isnull().value_counts()
print("Number of translated texts: {} from {}".format(_counts[False], len(df_movie_lines)))
print("{0:.2f}% translated".format(_counts[False]/len(df_movie_lines) * 100))


Number of translated texts: 52783 from 304713
17.32% translated
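
The coverage figure above can also be computed without indexing `value_counts()` by `False`, which raises a `KeyError` when no row has been translated yet. The sketch below uses a toy data frame and a made-up translation dictionary (both are illustrative, not taken from the real log) to show the dictionary lookup and the coverage calculation together.

```python
import pandas as pd

# Toy data: the real notebook uses df_movie_lines and text_translation_dict
df = pd.DataFrame({"UTTERANCE": ["They do not!", " Wow ", "No", "She okay?"]})
translations = {"They do not!": "¡No lo hacen!", "Wow": "Guau"}  # made-up cache content

# Strip whitespace before the lookup; texts missing from the dictionary become NaN
df["UTTERANCE_SPANISH"] = df["UTTERANCE"].str.strip().map(translations)

translated = df["UTTERANCE_SPANISH"].notna()
n_translated = int(translated.sum())
coverage = translated.mean() * 100
missing = df.loc[~translated, "UTTERANCE"].tolist()

print("Number of translated texts: {} from {}".format(n_translated, len(df)))
print("{0:.2f}% translated".format(coverage))
print("Still missing:", missing)
```

Listing the untranslated utterances in this way gives a quick view of what remains to be sent in the next run.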
