# Predicting character dialogue from The Matrix

The goal of this project is to use supervised and unsupervised machine learning to predict which character a line of dialogue belongs to. The source material is the script of the movie The Matrix.

For the unsupervised learning section, the goal is to build one cluster per major character and then identify which character each cluster represents from the terms that weigh most heavily on it. The lines spoken by each character are vectorized and then modeled with different techniques. From the top terms in each cluster, I'll try to work out which character is most likely to have said those words, based on my knowledge of the movie.

In [104]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     /home/diegofvargas/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/diegofvargas/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [10]:
# Utility function for standard text cleaning.
def text_cleaner(text):
    # Remove the scene locations, which appear in parentheses
    text = re.sub(r'\([a-zA-Z]*.[a-zA-Z]*\)', '', text)
    #text = ' '.join(text.split())
    return text
    
# Load and clean the data.
with open('The Matrix Script.txt', 'r') as The_Matrix:
    raw = The_Matrix.read()
script = text_cleaner(raw)
tokens = nltk.word_tokenize(script)
text = nltk.Text(tokens)

In [11]:
characters = []
sentences = []
for line in script.splitlines():
    if ':' in line:
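        # Split on the first colon only, so dialogue that contains a colon is kept whole
        character, sentence = line.split(':', 1)
        characters.append(character)
        sentences.append(sentence)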
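
Dialogue can itself contain a colon, so it matters where each line is split. The sketch below uses invented lines (the speakers and sentences are placeholders, not taken from the script) and a hypothetical helper, `parse_line`, to show the difference between splitting on every colon and splitting only on the first one.

```python
# Hypothetical lines in the "Character: dialogue" layout; not from the script.
sample_lines = [
    "Speaker A: Is everything ready?",
    "Speaker B: Listen: we have to go.",
    "(Interior of a room)",
]

def parse_line(line):
    """Return (character, sentence), or None if the line has no speaker prefix."""
    if ':' not in line:
        return None
    character, sentence = line.split(':', 1)  # split on the first colon only
    return character.strip(), sentence.strip()

parsed = [p for p in map(parse_line, sample_lines) if p is not None]

# Splitting on every colon and taking element 1 loses the rest of the sentence.
naive = sample_lines[1].split(':')[1]

print(parsed)
print(repr(naive))
```

Whether this matters for the real script depends on how many lines of dialogue contain a colon, which I have not counted here.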

In [12]:
script_df = pd.DataFrame(np.column_stack([characters, sentences]), columns = ['character','sentences']) 

In [13]:
script_df.head()

| | character | sentences |
|---|---|---|
| 0 | Cypher | Yeah. |
| 1 | Trinity | Is everything in place? |
| 2 | Cypher | You weren't supposed to relieve me. |
| 3 | Trinity | I know, but I want to take your shift. |
| 4 | Cypher | You like watching him, don't you? |


In [14]:
print(script_df['character'].unique())
print(len(script_df['character'].unique()))

['Cypher' 'Trinity' 'Cop' 'Agent Smith' 'Lieutenant' 'Morpheus'
 'Agent Brown' 'Agent Jones' 'Neo' 'Choi' 'DuJour' 'Mr. Rhineheart'
 'FedEx man' 'Switch' 'Apoc' 'Dozer' 'Tank' 'Mouse' 'Priestess'
 'Spoon boy' 'Oracle' 'Police' 'Guard 1' 'Guard 2' 'Soldier' 'Pilot' 'Man'
 'The One']
28


In [15]:
# Focus on the four main characters: they have the most lines, so their data is richer and better suited to analysis
main_chars_df = script_df[script_df.character.isin(['Neo', 'Trinity','Morpheus','Agent Smith'])]
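
The claim that these four characters have the most lines can be checked directly with `value_counts`. Here is a minimal sketch on a made-up DataFrame. `toy_df` and its contents are placeholders with the same two columns as `script_df`, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for script_df; the values are invented.
toy_df = pd.DataFrame({
    'character': ['A', 'B', 'A', 'C', 'A', 'B'],
    'sentences': ['s1', 's2', 's3', 's4', 's5', 's6'],
})

# Number of lines per character, sorted from most to fewest.
line_counts = toy_df['character'].value_counts()

# Keep the two most talkative characters; .copy() makes the subset an
# independent DataFrame, so adding columns later does not trigger a warning.
top_chars = line_counts.head(2).index.tolist()
subset = toy_df[toy_df['character'].isin(top_chars)].copy()

print(line_counts)
print(top_chars)
```

On the real data, `script_df['character'].value_counts().head(4)` would show whether Neo, Trinity, Morpheus and Agent Smith are in fact the top four.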

In [19]:
stop_words = stopwords.words('english')

# tokenization
tokenized_script = main_chars_df['sentences'].apply(lambda x: x.split())

# remove stop-words (the match is case-sensitive, so capitalized words such as "Is" or "I" are kept)
tokenized_script = tokenized_script.apply(lambda x: [item for item in x if item not in stop_words])

# de-tokenization
detokenized_script = []
for i in range(len(main_chars_df)):
    t = ' '.join(tokenized_script.reset_index().iloc[i]['sentences'])
    detokenized_script.append(t)

main_chars_df['clean_sentences'] = detokenized_script

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


In [341]:
main_chars_df.head()

| | character | sentences | clean_sentences |
|---|---|---|---|
| 1 | Trinity | Is everything in place? | Is everything place? |
| 3 | Trinity | I know, but I want to take your shift. | I know, I want take shift. |
| 6 | Trinity | Don't be ridiculous. | Don't ridiculous. |
| 8 | Trinity | Morpheus believes he is the one. | Morpheus believes one. |
| 10 | Trinity | It doesn't matter what I believe. | It matter I believe. |


In [363]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english',
                             max_features=5000,
                             max_df=0.5,
                             smooth_idf=True)

X = vectorizer.fit_transform(main_chars_df['clean_sentences'])

X.shape # check shape of the document-term matrix

(416, 776)
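
The shape above reads as 416 lines of dialogue by 776 distinct terms: one row per line and one column per term that survives the stop-word list and the `max_df` cutoff. The toy illustration below uses invented sentences (not from the script) to show how a term that appears in more than half of the documents is dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented sentences; "door" appears in 3 of 4 documents.
toy_lines = [
    "the red door",
    "the green door",
    "the wooden door",
    "the glass window",
]

# Same stop-word list and max_df as the notebook, applied to the toy corpus.
toy_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.5)
toy_X = toy_vectorizer.fit_transform(toy_lines)

# vocabulary_ maps each kept term to its column index.
vocab = sorted(toy_vectorizer.vocabulary_)

print(toy_X.shape)
print(vocab)
```

Here "the" is removed as a stop word and "door" is removed by `max_df=0.5` because it occurs in more than 50% of the documents, which leaves five columns.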

## Unsupervised learning

In [343]:
from sklearn.decomposition import TruncatedSVD

# Truncated SVD (latent semantic analysis) reduces the TF-IDF matrix to 4 latent components
svd_model = TruncatedSVD(n_components=4, algorithm='randomized', n_iter=1000, random_state=101)

svd_model.fit(X)

len(svd_model.components_)

4

In [346]:
terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

for i, comp in enumerate(svd_model.components_):
    terms_comp = zip(terms, comp)
    sorted_terms = sorted(terms_comp, key= lambda x:x[1], reverse=True)[:20]
    print("Character "+str(i)+": ")
    for t in sorted_terms:
        print(t[0])

Character 0: 
know
neo
morpheus
want
trinity
going
hello
yes
trying
come
tell
looking
matrix
ready
like
ve
believe
line
make
world
Character 1: 
morpheus
neo
going
don
ready
oracle
believed
told
alive
gave
believes
convinced
place
happened
make
come
sacrificed
kill
way
gotcha
Character 2: 
yes
neo
hell
beginning
old
elevator
slowly
yeah
mr
clear
perfectly
rhineheart
run
come
coming
looking
time
hello
unfortunately
watching
Character 3: 
neo
hello
come
easy
hurry
watching
like
true
matters
trust
looking
tell
run
protection
truth
necessary
say
answer
breathe
said


Based on my knowledge of the movie, I can make an educated guess about which character each cluster corresponds to.

Character 2 seems to be Neo: he is the only one who addresses Mr. Rhineheart, his boss inside the Matrix, and "hell" is prominent, a word he uses frequently to express excitement.

Character 3 seems to be Morpheus, since the top word is "neo", the character he mostly interacts with. He also says "hello" a lot, since he is introducing Neo and the audience to the real world.

Character 1 is Trinity, because her two main interactions are with Morpheus and Neo, whose names are her top two words. There are also words like "alive" and "believes" that relate to the scene in which Trinity revives Neo.

Character 0 must be Agent Smith. The top terms include "neo", "morpheus" and "trinity", who are the main culprits in his eyes. It wouldn't make sense for Morpheus or Trinity to have their own name as a top term. Only Neo plausibly would, because he says his new identity a lot as he's discovering the new world, and he is already accounted for by Character 2.
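
The reading above relies on intuition about the term lists. Since the speaker of every line is known, one way to check it is to assign each line to its strongest component and cross-tabulate that against the true character. This is a sketch, not something the notebook ran: `dominant_component_table` is a hypothetical helper, the toy scores are invented, and picking the component with the largest absolute score is only one possible convention.

```python
import numpy as np
import pandas as pd

def dominant_component_table(doc_scores, speakers):
    """Cross-tabulate each line's speaker against its strongest component."""
    doc_scores = np.asarray(doc_scores)
    # SVD scores can be negative, so compare magnitudes (an assumed convention).
    dominant = np.abs(doc_scores).argmax(axis=1)
    # Plain arrays are used so that no pandas index alignment takes place.
    return pd.crosstab(pd.Series(np.asarray(speakers), name='character'),
                       pd.Series(dominant, name='component'))

# Invented scores for five lines and two components.
toy_scores = [[0.9, 0.1], [0.8, -0.2], [0.1, 0.7], [-0.2, 0.6], [0.3, 0.5]]
toy_speakers = ['A', 'A', 'B', 'B', 'A']

table = dominant_component_table(toy_scores, toy_speakers)
print(table)
```

With the notebook's objects, the call would be `dominant_component_table(svd_model.transform(X), main_chars_df['character'])`. A component that really tracks one character should have most of its lines in a single row.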

## Non-negative matrix factorization

In [356]:
from sklearn.decomposition import NMF
nmf_model = NMF(n_components=4, init='random', random_state=101)
W = nmf_model.fit_transform(X)
H = nmf_model.components_

In [360]:
for i, comp in enumerate(H):
    terms_comp = zip(terms, comp)
    sorted_terms = sorted(terms_comp, key= lambda x:x[1], reverse=True)[:20]
    print("Character "+str(i)+": ")
    for t in sorted_terms:
        print(t[0])

Character 0: 
morpheus
going
don
believed
ready
alive
oracle
gave
told
believes
convinced
make
place
happened
kill
gotcha
great
meet
traced
sacrificed
Character 1: 
trinity
believe
help
oracle
focus
going
hit
make
base
cracked
irs
real
ve
worry
matrix
world
welcome
tank
little
beginning
Character 2: 
neo
yes
come
hello
going
like
tell
want
looking
easy
watching
true
hurry
matters
trust
run
ready
ve
trying
coming
Character 3: 
know
want
trying
hope
line
lot
dead
shift
suggest
looking
matrix
world
went
ve
exactly
hello
feel
coincidence
does
fu


NMF wasn't as effective as SVD at separating the characters. The terms in each NMF cluster are less helpful for determining which character is represented, so for this data set SVD is the more appropriate technique.
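
This comparison rests on reading the term lists by eye. One way to put a number on it, offered as a sketch rather than a result from this notebook, is cluster purity: the fraction of lines whose speaker matches the most common speaker in their assigned group. The `purity` function and the toy assignments below are illustrative assumptions.

```python
from collections import Counter

def purity(assignments, labels):
    """Fraction of items whose label is the majority label of their group."""
    groups = {}
    for group, label in zip(assignments, labels):
        groups.setdefault(group, Counter())[label] += 1
    # For each group, count the items that carry its most common label.
    majority_total = sum(c.most_common(1)[0][1] for c in groups.values())
    return majority_total / len(labels)

# Invented speakers and two invented groupings of the same six lines.
toy_labels = ['A', 'A', 'B', 'B', 'A', 'B']
grouping_one = [0, 0, 1, 1, 0, 1]  # each group holds a single speaker
grouping_two = [0, 1, 0, 1, 0, 1]  # speakers are mixed across groups

print(purity(grouping_one, toy_labels))
print(purity(grouping_two, toy_labels))
```

For the two models here, the groupings could come from `np.abs(svd_model.transform(X)).argmax(axis=1)` and `W.argmax(axis=1)`, with `main_chars_df['character']` as the labels. A higher purity would support the model that separates the characters better.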