### Movie dialog data processing

This notebook processes the [Cornell Movie-Dialogs Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html) to reconstruct the actual dialog text of each conversation.

In [1]:
import os

files = [x for x in os.listdir() if x.endswith(".txt") and not x.lower().startswith('readme')]
print(files)

['movie_characters_metadata.txt', 'movie_conversations.txt', 'movie_lines.txt', 'movie_titles_metadata.txt', 'raw_script_urls.txt']


In [12]:
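def open_file(file_name):
    """Read a corpus file and return its lines (trailing newlines kept)."""
    # The corpus files are not valid UTF-8; ISO-8859-1 is a common choice for reading them
    with open(file_name, 'r', encoding='iso-8859-1') as f:
        data = f.readlines()
    return data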


def samples(data):
    print("There are {} lines.".format(len(data)))
    print("Get some sample:")
    print("\n".join(data[:5]))
    
# Field separator used by every file in the corpus
split_value = " +++$+++ "

- movie_conversations.txt
	- the structure of the conversations
	- fields
		- characterID of the first character involved in the conversation
		- characterID of the second character involved in the conversation
		- movieID of the movie in which the conversation occurred
		- list of the utterances that make the conversation, in chronological 
			order: ['lineID1','lineID2',...,'lineIDN']
			has to be matched with movie_lines.txt to reconstruct the actual content
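
The field layout above can be sketched as a small parser. This is a minimal illustration, not part of the original notebook: `parse_conversation` and `SEPARATOR` are hypothetical names, and the sample record follows the format of `movie_conversations.txt`. It assumes the last field is always a well-formed Python-style list literal, which lets `ast.literal_eval` read it.

```python
import ast

SEPARATOR = " +++$+++ "  # field separator used in the corpus files


def parse_conversation(record):
    """Split one movie_conversations.txt record into its four fields."""
    first_char, second_char, movie_id, line_ids = record.strip().split(SEPARATOR)
    return {
        "first_character": first_char,
        "second_character": second_char,
        "movie": movie_id,
        # Assumption: the last field is a valid list literal such as ['L194', 'L195']
        "line_ids": ast.literal_eval(line_ids),
    }


example = "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']\n"
parsed = parse_conversation(example)
print(parsed["line_ids"])
```

Parsing the list with `ast.literal_eval` avoids hand-written bracket and quote stripping, at the cost of raising an error on a malformed record.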

In [8]:
# Read movie_conversations.txt, which lists the line IDs that make up each dialog
conversation = open_file("movie_conversations.txt")

samples(conversation)

There are 83097 lines.
Get some sample:
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']

u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']

u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L200', 'L201', 'L202', 'L203']

u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L204', 'L205', 'L206']

u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L207', 'L208']



Notes: the `u` prefix marks a character (user) ID and the `m` prefix marks a movie ID. The list of strings holds the line IDs of the conversation, not the text itself. To get the actual text, look those IDs up in `movie_lines.txt`.


- movie_lines.txt
	- contains the actual text of each utterance
	- fields:
		- lineID
		- characterID (who uttered this phrase)
		- movieID
		- character name
		- text of the utterance
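
As a rough illustration of these five fields, one record can be split into named parts as below. `parse_line` and `SEPARATOR` are hypothetical names introduced for this sketch, and it assumes every record contains all four separators.

```python
SEPARATOR = " +++$+++ "  # field separator used in the corpus files


def parse_line(record):
    """Split one movie_lines.txt record into its five fields."""
    # maxsplit=4 keeps the utterance intact even if it contained separator-like text
    line_id, character_id, movie_id, name, text = record.rstrip("\n").split(SEPARATOR, 4)
    return {
        "line_id": line_id,
        "character_id": character_id,
        "movie_id": movie_id,
        "character_name": name,
        "text": text,
    }


example = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!\n"
parsed_line = parse_line(example)
print(parsed_line["line_id"], "->", parsed_line["text"])
```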


In [10]:
# open `movie_lines.txt`
movie_lines = open_file("movie_lines.txt")

samples(movie_lines)

There are 304713 lines.
Get some sample:
L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!

L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!

L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.

L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?

L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.



In [14]:
# Build a dict that maps each lineID to its utterance text
movie_line_dict = {}

for s in movie_lines:
    s_split = s.split(split_value)
    movie_line_dict[s_split[0]] = s_split[-1].replace("\n", '')
    
print(len(movie_line_dict))

304713


In [28]:
# Parse the list of line IDs from the last field of a conversation record
def get_list_from_str(c):
    ids_field = c.split(split_value)[-1].strip()
    return [x.strip().strip("'") for x in ids_field.strip("[]").split(",")]

In [30]:
# Replace the line IDs of each conversation with the actual utterance text
real_conversation_list = []

for c in conversation:
    c_list = get_list_from_str(c)

    # IDs missing from movie_line_dict become empty strings
    out_con = [movie_line_dict.get(x, '') for x in c_list]
    real_conversation_list.append(out_con)
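
Because the `.get(x, '')` lookup silently turns an unknown line ID into an empty string, it can be worth checking whether any IDs fail to resolve. The sketch below uses a tiny in-memory sample rather than the real files. `line_text`, `conversations`, and `missing` are hypothetical names, and the two unresolved IDs are left out of the toy dict on purpose.

```python
# Toy lookup table: lineID -> utterance text (only two IDs are known here)
line_text = {
    "L198": "You're asking me out.  That's so cute. What's your name again?",
    "L199": "Forget it.",
}

# Toy conversations expressed as lists of line IDs
conversations = [["L198", "L199"], ["L200", "L201"]]

# Collect every ID that the lookup table cannot resolve
missing = [
    line_id
    for ids in conversations
    for line_id in ids
    if line_id not in line_text
]
print("Unresolved line IDs:", missing)
```

An empty `missing` list would mean every conversation was fully reconstructed.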

In [34]:
print("There are {} conversations.".format(len(real_conversation_list)))

There are 83097 conversations.


In [33]:
real_conversation_list[:5]

[['Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.',
  "Well, I thought we'd start with pronunciation, if that's okay with you.",
  'Not the hacking and gagging and spitting part.  Please.',
  "Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?"],
 ["You're asking me out.  That's so cute. What's your name again?",
  'Forget it.'],
 ["No, no, it's my fault -- we didn't have a proper introduction ---",
  'Cameron.',
  "The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.",
  'Seems like she could get a date easy enough...'],
 ['Why?',
  'Unsolved mystery.  She used to be really popular when she started high school, then it was just like she got sick of it or something.',
  "That's a shame."],
 ['Gosh, if only we could find Kat a boyfriend...',
  'Let me see what I can do.']]

In [36]:
# Write the utterances to disk for later use, one utterance per line
with open("processed_movie_diaglog.txt", 'w', encoding='utf-8') as f:
    for con in real_conversation_list:
        for sen in con:
            f.write(sen + "\n")
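
Writing one utterance per line drops the boundaries between conversations. If those boundaries matter later, one possible variation (not what this notebook does) is to separate conversations with a blank line. The sketch below assumes no utterance is empty or contains a newline. The file name is hypothetical, and a temporary directory is used so the example does not touch the real output.

```python
import os
import tempfile

# Two short conversations taken from the sample output of this notebook
conversations = [
    ["Why?", "That's a shame."],
    ["Gosh, if only we could find Kat a boyfriend...", "Let me see what I can do."],
]

# Hypothetical output path inside a throwaway directory
path = os.path.join(tempfile.mkdtemp(), "dialog_with_boundaries.txt")

# One utterance per line, with a blank line after each conversation
with open(path, "w", encoding="utf-8") as f:
    for con in conversations:
        f.write("\n".join(con) + "\n\n")

# Read the file back and rebuild the nested list
with open(path, "r", encoding="utf-8") as f:
    blocks = f.read().strip().split("\n\n")
restored = [block.split("\n") for block in blocks]

print(restored == conversations)
```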