# Movie Scripts - An Analysis
### Act 3
_"Maybe I didn't want the antidote."_

In the final act, we try to generate some dialogue using some machine learning techniques. Let's see if we can get it to sound like Hollywood quality.

### 1) Imports and data loading!


In [7]:
%matplotlib inline 

import numpy as np
import pandas as pd
import os
import re
import pickle

import matplotlib.pyplot as plt
import seaborn as sns; sns.set(style="ticks", color_codes=True)

In [9]:
movies_df = pickle.load(open("movies_df.pkl", "rb"))

In [10]:
all_thrillers = ""
for t in movies_df[movies_df.genre == "Thriller"].condensed_text:
    for line in [l.lstrip() for l in t]:
        all_thrillers += line
data = all_thrillers

Let's take all the thriller movies and see if we can write some of our own! We'll user a character based RNN/LSTM, where we take sequences of length 120 and try to predict the next character. Hopefully we get some good stuff out of it.

In [192]:
"""
Minimal character-level  RNN model. Adapted from by Andrej Karpathy (@karpathy)
"""
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print 'data has %d characters, %d unique.' % (data_size, vocab_size)
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }

# hyperparameters
hidden_size = 100 # size of hidden layer of neurons
seq_length = 25 # number of steps to unroll the RNN for
learning_rate = 1e-1

# credit michaelrzhang Char-RNN
maxlen = 120
step = 13
sentences = []
targets = []
for i in range(0,len(data)-maxlen-1, step):
    sentences.append(data[i:i+maxlen])
    targets.append(data[i+maxlen])
print('number of sequences:', len(sentences))

X = np.zeros((1000000, maxlen, len(chars)), dtype=np.bool)
y = np.zeros((1000000, len(chars)), dtype=np.bool)
for i in range(1000000):
    sentence = sentences[i]
    target = targets[i]
    y[i][char_to_ix[target]] = 1
    for j in range(maxlen):
        X[i][j][char_to_ix[sentence[j]]] = 1
        
        
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.layers import LSTM

model = Sequential()
model.add(LSTM(128, input_shape=(maxlen,len(chars))))
model.add(Dropout(0.3))
model.add(Dense(len(chars), activation = "softmax"))
model.compile(optimizer='adam',loss='categorical_crossentropy')

# train and print out test text
start = np.random.randint(0, 999998)
for i in range(50):
    print ("-" * 10) + "Epoch " + str(i) + ("-" * 10)
    model.fit(X, y, epochs=1, verbose=0)
    total_string = sentences[start]
    input_vec = X[start]
    for i in range(10):
        x = input_vec.reshape((1, input_vec.shape[0], input_vec.shape[1]))
        pred = model.predict(x)
        c = ix_to_char[pred.argmax()]
        total_string += c
        input_vec = np.vstack((input_vec[1:], [True if i == pred.argmax() else False for i in range(len(chars))]))
    


data has 51456649 characters, 98 unique.


KeyboardInterrupt: 

Unfortunately I don't have the processing power required to make this come to life - I'm running this using the CPU on my Macbook, rather than a GPU, so its taking about 4 hours per epoch and after 5 epochs the results are kinda crappy. I've included the code above so I can run it another time.

**Instead, I'm going to try a simpler, more comprehensible model - A 2-character Markov Model.**

This means that I'll be predicting what a character will be given the two preceeding it. For example, given the characters _th_, there could be a 30% chance it's followed by an _a_ and 70% chance it's followed by an _e_, so we select the next character from this probability distribution.

I'm sure there are packages to do this, but I'll write one from scratch.

In [340]:
# start with breaking the string down into ((a,b), c) pairs
from collections import defaultdict
transition_dict = defaultdict(lambda : defaultdict(int)) # create a defaultdict of defaultdict
for i in range(len(all_thrillers)-2): # for each triplet of characters
    transition_dict[(data[i], data[i+1])][data[i+2]] += 1 # create a key for (a,b) and dict for following letters
    

This is all we really need to generate the model. From here, we'll start with a random seed and then stochastically choose the subsequent characters.

In [360]:
seed_text = "IN"
total_string = seed_text
for i in range(1000):
    key = total_string[-2:] # get last two letters
    val_dict = transition_dict[(key[0],key[1])] # retrieve dictionary for values
    counts_sq = [x**2 for x in val_dict.values()]
    prob_dist = [1.*val/np.sum(counts_sq) for val in counts_sq] # get prob dist
    next_char = np.random.choice(val_dict.keys(), p=prob_dist) # choose randomly
    total_string += next_char
print total_string
    

INT.                                                                                                                 CONT'D)
Nothe pare sten ready.
MER
and the on and me pit the call be the her the as 
the ing whe liket ther ing there dow dow mand the and the pas hickes pay so got and to the ge a dow ithe a sing the there sidend loom on to his and of ther he les the of the bacre anto ing to ling seets and ing the a dow his and nothe got loses all the the side.
I dooke ther there the of the dook and se wor the withe st wholeat the ou lif the se cone st hand he frout ing withe the you and you'readin st the the firs the ple the like younto the sure cand shat ound the staread whoully and to the ing a me the cars som ne a for the withe mallower all the door and she ing of the strand yout then to sone come re re she seento ther to met knothe the a pack the and ou to to his hout ther ones the of the wall mand st the puld the but ther of the the and to a be ing the of the and he st the be one

Pretty nonsensical. Let's take up the n in n-grams.

In [297]:
n = 3
# start with breaking the string down into ((a,b), c) pairs
from collections import defaultdict
transition_dict = defaultdict(lambda : defaultdict(int)) # create a defaultdict of defaultdict
for i in range(len(all_thrillers)-n-1): # for each triplet of characters
    tup = tuple(data[i+j] for j in range(n)) # create arbitrary length tuple
    transition_dict[tup][data[i+n+1]] += 1 # create a key for (a,b,...) and dict for following letters
    

In [361]:
seed_text = "The"
total_string = seed_text
for i in range(100):
    key = total_string[-n:] # get last two letters
    tup = tuple(key[j] for j in range(n))
    val_dict = transition_dict[(tup)] # retrieve dictionary for values
    counts_sq = [x**2 for x in val_dict.values()]
    prob_dist = [1.*val/np.sum(counts_sq) for val in counts_sq] # get prob dist
    next_char = np.random.choice(val_dict.keys(), p=prob_dist) # choose randomly
    total_string += next_char
print total_string
    

ValueError: a must be non-empty

The problem is that there are so many tri-grams without anything after that the sequence just fizzle outs. Let's try out word-level ngrams instead. 

In [389]:
word_tokens = data.split(' ')

In [407]:
# start with breaking the string down into ((a,b), c) pairs
from collections import defaultdict
transition_dict_words = defaultdict(list) # create a defaultdict of defaultdict
for i in range(len(word_tokens)-2): # for each triplet of characters
    transition_dict_words[(word_tokens[i], word_tokens[i+1])].append(word_tokens[i+2])

In [434]:
import random
seed_text = list(random.choice(transition_dict_words.keys()))
total_string = seed_text
for i in range(100):
    key = total_string[-2:] # get last two letters
    val_dict = transition_dict_words[(key[0],key[1])] # retrieve dictionary for values
    next_char = np.random.choice(val_dict) # choose randomly
    total_string.append(next_char)
print ' '.join(total_string)

leaving
a small cut.
Del starts twitching, rocking back and forth.
He slams the door shut.
Then Yoko moves through the air, a moment with images
from those "VISIONS" that have nothing further to
discuss.
Bruno goes to the floor, do it if you didn't insist on calling all the 
pleasure they are not harmless misfits.
What you see me?
SOFIA
Maybe I didn't want the antidote.
VERONA
Oh, the antidote, huh?
VERONA makes eye contact going on here.
28 INTERIOR ADM- THE NEXT DAY            93
The Doctor can't face it.  I know something. (his 
voice goes up too.
Allison looks at him a picture)
Kathy's. I


I like this better than using NLTK's word tokenizer, because this way we can preserve structure like new lines and tabs, rather than just joining all word tokens with a space. For example, "here.\nCan" is a single token.

It's interesting how the thriller genre comes out pretty strong. Death, pressure, frightening themes are present. What's clear though is that we are likely to get stuck in a sequence of direct matches from scripts, if only a two-gram leads to a single term. 

------------
#### And there you have it!
This concludes the third act in a movie script analysis series. This act was particularly resource-constrained, and as such the auto-generated script didn't have much of a polished feel to it. The next iteration would explore the deep learning route - tuning the perfect LSTM to generate realistic sounding scripts. We could also try different script cuts, like other genres, or even try to get generate dialogue from specific characters (provided there's enough material). Looking forward to trying out ideas other may have using this data! 