# The Simpsons - Recurrent Neural Network

In this project we'll generate Simpsons TV scripts using RNNs.

You can find the dataset at [kaggle](https://www.kaggle.com/wcukierski/the-simpsons-by-the-data).

The dataset covers 27 seasons of The Simpsons.

In [1]:
# Import dependencies
import os
import pickle
import pandas as pd
import numpy as np
import tensorflow as tf

# Loading Data

Download the [dataset](https://www.kaggle.com/wcukierski/the-simpsons-by-the-data) and extract the CSV files into the `data` folder.

Let's import our data using `pandas.read_csv`.

In [2]:
# Loading the csv file
simpsons_lines = pd.read_csv('data/simpsons_script_lines.csv',
                             usecols=['episode_id', 'number', 'raw_text'],
                             # Skip malformed rows (pandas 1.3+ spells this on_bad_lines='skip')
                             error_bad_lines=False,
                             low_memory=False)

# Sorting the data
simpsons_lines.sort_values(['episode_id', 'number'], axis=0, ascending=True, inplace=True)

simpsons_lines.head()

|        | episode_id | number | raw_text |
|--------|------------|--------|----------|
| 148761 | 1 | 0 | (Street: ext. street - establishing - night) |
| 148762 | 1 | 1 | (Car: int. car - night) |
| 148763 | 1 | 2 | Marge Simpson: Ooo, careful, Homer. |
| 148764 | 1 | 3 | Homer Simpson: There's no time to be careful. |
| 148765 | 1 | 4 | Homer Simpson: We're late. |


In [3]:
# Gets the raw text
simpsons_lines_raw = simpsons_lines['raw_text']
simpsons_lines_raw.head()

148761     (Street: ext. street - establishing - night)
148762                          (Car: int. car - night)
148763              Marge Simpson: Ooo, careful, Homer.
148764    Homer Simpson: There's no time to be careful.
148765                       Homer Simpson: We're late.
Name: raw_text, dtype: object

## Some data info

Let's print some info about our data.

In [4]:
print('Number of episodes: {0}'.format(simpsons_lines.tail(1)['episode_id'].item()))
print('Number of lines: {0}'.format(simpsons_lines.tail(1).index.item()))

Number of episodes: 568
Number of lines: 147786


# Preprocessing data

We'll preprocess our data to avoid unnecessary computations at training time.

Our lookup table will contain two dicts, with the following structure:
```python
vocab_to_int = {
  'hello': 1
}
int_to_vocab = {
  1: 'hello'
}
```

In [5]:
def create_lookup_tables(text):
    vocab = set(text)
    vocab_to_int = {word: ii for ii, word in enumerate(vocab)}
    int_to_vocab = {ii: word for ii, word in enumerate(vocab)}
    
    return vocab_to_int, int_to_vocab
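
As a quick illustration, here is the same construction applied to a toy word list (the sample text is made up for this example). The id assigned to each word depends on set iteration order, so the sketch checks the round trip rather than specific ids.

```python
def create_lookup_tables(text):
    # Same logic as above: one integer id per distinct word
    vocab = set(text)
    vocab_to_int = {word: ii for ii, word in enumerate(vocab)}
    int_to_vocab = {ii: word for ii, word in enumerate(vocab)}
    return vocab_to_int, int_to_vocab

# Toy input, invented for illustration
words = "we're late homer we're late".split()
vocab_to_int, int_to_vocab = create_lookup_tables(words)

# Encode the words as integers, then decode them back
encoded = [vocab_to_int[w] for w in words]
decoded = [int_to_vocab[i] for i in encoded]

print(len(vocab_to_int))   # number of distinct words
print(decoded == words)    # the two dicts are inverses of each other
```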


We need to tokenize our punctuation. If we don't, the RNN will treat `'word'` and `'word!'` as two unrelated vocabulary entries.

In [6]:
def token_lookup():
    tokens = {
        '.'  : 'period',
        ','  : 'comma',
        '"'  : 'quote',
        ';'  : 'semicolon',
        '!'  : 'exclamation_mark',
        '?'  : 'question_mark',
        '('  : 'parentheses_left',
        ')'  : 'parentheses_right',
        '--' : 'dash',
        '\n' : 'return'
    }
    return {token: '||{0}||'.format(value) for token, value in tokens.items()}
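
One way such a table could be applied is sketched below: each symbol is replaced by its placeholder surrounded by spaces, so that splitting on whitespace yields the word and the punctuation as separate entries. The helper name `apply_tokens` and the shortened table are assumptions made for this illustration.

```python
def token_lookup():
    # Shortened version of the table above, enough for the demo sentence
    tokens = {
        '.': 'period',
        ',': 'comma',
        '!': 'exclamation_mark',
        '?': 'question_mark',
    }
    return {token: '||{0}||'.format(value) for token, value in tokens.items()}

def apply_tokens(text, token_dict):
    # Hypothetical helper: pad each placeholder with spaces so split() isolates it
    for symbol, token in token_dict.items():
        text = text.replace(symbol, ' {} '.format(token))
    return text.split()

words = apply_tokens("Ooo, careful, Homer.", token_lookup())
print(words)
# ['Ooo', '||comma||', 'careful', '||comma||', 'Homer', '||period||']
```

With this step, `Homer` and `Homer.` map to the same vocabulary entry, and the period becomes a word of its own.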

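The function that preprocesses and saves the data:

In [7]:
def preprocess_and_save(dataset, token_lookup, create_lookup_tables):
    # Join all script lines into one newline-terminated string
    text = ''
    for idx, line in dataset.items():
        text += line + '\n'
    
    # Punctuation table, stored with the data and used later to clean up generated text
    token_dict = token_lookup()
    # Split on whitespace and encode every word as an integer id
    text = text.split()
    vocab_to_int, int_to_vocab = create_lookup_tables(text)
    int_text = [vocab_to_int[word] for word in text]
    # Use a context manager so the file is closed after writing
    with open('preprocess.p', 'wb') as f:
        pickle.dump((int_text, vocab_to_int, int_to_vocab, token_dict), f)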

The function that loads the preprocessed data:

In [8]:
def load_preprocess():
    # Returns (int_text, vocab_to_int, int_to_vocab, token_dict)
    with open('preprocess.p', mode='rb') as f:
        return pickle.load(f)

In [9]:
# Preprocessing and saving
preprocess_and_save(simpsons_lines_raw, token_lookup, create_lookup_tables)
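
The save and load steps amount to a pickle round trip of a single tuple. A minimal sketch with toy values (the data and the temporary file name are invented for this example) shows that everything comes back in the same order:

```python
import os
import pickle
import tempfile

# Toy stand-ins for the real preprocessing output
int_text = [0, 1, 2, 0, 1]
vocab_to_int = {"we're": 0, 'late': 1, 'homer': 2}
int_to_vocab = {0: "we're", 1: 'late', 2: 'homer'}
token_dict = {'.': '||period||'}

# Write to a throwaway directory so the real preprocess.p is untouched
path = os.path.join(tempfile.mkdtemp(), 'preprocess_demo.p')

# Save everything as one tuple
with open(path, 'wb') as f:
    pickle.dump((int_text, vocab_to_int, int_to_vocab, token_dict), f)

# Load it back; the tuple unpacks in the order it was saved
with open(path, 'rb') as f:
    loaded = pickle.load(f)

print(loaded == (int_text, vocab_to_int, int_to_vocab, token_dict))
```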

# Build the Neural Network

Let's build our RNN.

We'll use LSTM (Long Short-Term Memory) cells to build it.

You can find more about LSTMs [here](http://colah.github.io/posts/2015-08-Understanding-LSTMs/).
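
To make the idea concrete, here is a minimal sketch of one LSTM step for a single hidden unit, written in plain Python. It follows the standard gate equations and is a simplification: the real cells operate on vectors and differ in details such as bias initialization. The function name `lstm_step` and all weights are invented for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a single-unit LSTM; params maps gate name -> (w_x, w_h, bias)."""
    def gate(name, activation):
        w_x, w_h, b = params[name]
        return activation(w_x * x + w_h * h_prev + b)

    i = gate('input', sigmoid)         # how much new information to write
    f = gate('forget', sigmoid)        # how much of the old cell state to keep
    o = gate('output', sigmoid)        # how much of the cell state to expose
    g = gate('candidate', math.tanh)   # the new information itself
    c = f * c_prev + i * g             # updated cell state
    h = o * math.tanh(c)               # updated hidden state
    return h, c

# Arbitrary toy weights
params = {
    'input': (0.5, 0.1, 0.0),
    'forget': (0.3, -0.2, 1.0),
    'output': (0.8, 0.4, 0.0),
    'candidate': (1.0, 0.5, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, 0.5, -1.0]:
    h, c = lstm_step(x, h, c, params)
    print(round(h, 4), round(c, 4))
```

The cell state `c` is what lets the network carry information across many time steps: when the forget gate is near 1 and the input gate is near 0, `c` passes through almost unchanged.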

In [10]:
# Loading preprocessed data
int_text, vocab_to_int, int_to_vocab, token_dict = load_preprocess()

In [11]:
# Creating input tensors
def get_inputs():
    inputs = tf.placeholder(tf.int32, shape=(None, None), name='input')
    targets = tf.placeholder(tf.int32, shape=(None, None), name='targets')
    learning_rate = tf.placeholder(tf.float32, shape=(None), name='learning_rate')

    return inputs, targets, learning_rate

In [12]:
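# Build the stacked LSTM cell and its zero initial state
def get_init_cell(lstm_cell_number, batch_size, rnn_size):
    lstm_cells = [tf.contrib.rnn.BasicLSTMCell(rnn_size) for _ in range(lstm_cell_number)]
    
    # Dropout-wrapped copies of the cells (keep probability 0.5).
    # The stack below is built from `lstm_cells`, so these wrappers are not part of the network.
    drop = [tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=0.5) for lstm in lstm_cells]
    
    cell = tf.contrib.rnn.MultiRNNCell(lstm_cells)
    
    # Name the state tensor so it can be looked up when the saved graph is reloaded
    initial_state = tf.identity(cell.zero_state(batch_size, tf.float32), name='initial_state')

    return cell, initial_state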

In [13]:
# Embedding layer: one trainable vector per word, initialized uniformly in [-1, 1)
def get_embed(input_data, vocab_size, embed_dim):
    embedding = tf.Variable(tf.random_uniform((vocab_size, embed_dim), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, input_data)
    
    return embed
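
Conceptually, an embedding lookup is row selection: each word id picks one row of a `(vocab_size, embed_dim)` matrix. The sketch below reproduces that with plain Python lists; `make_embedding` and `embedding_lookup` are hypothetical stand-ins for illustration, not the TensorFlow functions.

```python
import random

def make_embedding(vocab_size, embed_dim, seed=0):
    # One row of embed_dim values per word id, drawn uniformly between -1 and 1
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(vocab_size)]

def embedding_lookup(embedding, ids):
    # Replace every word id with its row of the embedding matrix
    return [[embedding[i] for i in row] for row in ids]

embedding = make_embedding(vocab_size=5, embed_dim=3)
batch = [[0, 4, 4], [2, 1, 0]]   # shape (batch_size=2, seq_length=3)
embedded = embedding_lookup(embedding, batch)

# The result has shape (batch_size, seq_length, embed_dim)
print(len(embedded), len(embedded[0]), len(embedded[0][0]))
# The same id always maps to the same vector
print(embedded[0][1] == embedded[0][2])
```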

In [14]:
# Build the rnn
def build_rnn(cell, inputs):
    outputs, final_state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
    final_state = tf.identity(final_state, name='final_state')
    return outputs, final_state

In [15]:
# Build the entire NN
def build_nn(cell, rnn_size, input_data, vocab_size, embed_dim):
    
    embed = get_embed(input_data, vocab_size, embed_dim)
    
    outputs, state = build_rnn(cell, embed)
    
    logits = tf.contrib.layers.fully_connected(
        outputs, num_outputs = vocab_size, activation_fn = None
    )
    
    return logits, state

In [16]:
# Generate batches to avoid OOM (out of memory).
# Each batch is an (input, target) pair of arrays with shape (batch_size, seq_length).
def get_batches(int_text, batch_size, seq_length):
    n_batches = len(int_text) // (batch_size * seq_length)

    # Drop the last few words to make only full batches
    xdata = np.array(int_text[: n_batches * batch_size * seq_length])
    # Targets are the inputs shifted by one word; the last target wraps around to the first input
    ydata = np.roll(xdata, -1)

    x_batches = np.split(xdata.reshape(batch_size, -1), n_batches, 1)
    y_batches = np.split(ydata.reshape(batch_size, -1), n_batches, 1)

    return np.array(list(zip(x_batches, y_batches)))
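
The layout that `get_batches` produces is easier to see on a tiny input. The sketch below restates the same slicing with plain Python lists (the helper `toy_batches` is written for this illustration only) and runs it on the integers 0 to 12.

```python
def toy_batches(int_text, batch_size, seq_length):
    n_batches = len(int_text) // (batch_size * seq_length)
    n = n_batches * batch_size * seq_length
    x = int_text[:n]       # drop the words that do not fill a batch
    y = x[1:] + x[:1]      # targets: inputs shifted by one, wrapping at the end
    row_len = n_batches * seq_length
    batches = []
    for b in range(n_batches):
        cols = slice(b * seq_length, (b + 1) * seq_length)
        # Row r is one contiguous stretch of the text; each batch takes the next seq_length words of it
        x_batch = [x[r * row_len:(r + 1) * row_len][cols] for r in range(batch_size)]
        y_batch = [y[r * row_len:(r + 1) * row_len][cols] for r in range(batch_size)]
        batches.append((x_batch, y_batch))
    return batches

batches = toy_batches(list(range(13)), batch_size=2, seq_length=3)
for x_batch, y_batch in batches:
    print(x_batch, y_batch)
# [[0, 1, 2], [6, 7, 8]] [[1, 2, 3], [7, 8, 9]]
# [[3, 4, 5], [9, 10, 11]] [[4, 5, 6], [10, 11, 0]]
```

Row `r` of each batch continues where row `r` of the previous batch stopped, which is what makes it meaningful to carry the RNN state from one batch to the next.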

# Training

Let's train our model.

In [26]:
### Defines hyperparameters

# Number of Epochs
num_epochs = 200
# Batch Size
batch_size = 256
# RNN Size
rnn_size = 128
# Embedding Dimension Size
embed_dim = 300
# Sequence Length
seq_length = 16
# Learning Rate
learning_rate = 0.001
# Show stats for every n number of batches
show_every_n_batches = 150
# Number of lstm cells
lstm_cell_number = 2

save_dir = './save'

In [27]:
# Build the training graph (the loss function comes from tf.contrib.seq2seq)

from tensorflow.contrib import seq2seq

train_graph = tf.Graph()
with train_graph.as_default():
    vocab_size = len(int_to_vocab)
    input_text, targets, lr = get_inputs()
    input_data_shape = tf.shape(input_text)
    cell, initial_state = get_init_cell(lstm_cell_number, input_data_shape[0], rnn_size)
    logits, final_state = build_nn(cell, rnn_size, input_text, vocab_size, embed_dim)

    # Probabilities for generating words
    probs = tf.nn.softmax(logits, name='probs')

    # Loss function
    cost = seq2seq.sequence_loss(
        logits,
        targets,
        tf.ones([input_data_shape[0], input_data_shape[1]]))

    # Optimizer
    optimizer = tf.train.AdamOptimizer(lr)

    # Gradient Clipping
    gradients = optimizer.compute_gradients(cost)
    capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
    train_op = optimizer.apply_gradients(capped_gradients)
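
With all weights set to one and the default averaging, the sequence loss is, to my understanding, the mean cross-entropy over every position in the batch. The plain-Python sketch below illustrates that quantity and the value clipping applied to the gradients; both helpers are simplified stand-ins written for this example, not the TensorFlow implementations.

```python
import math

def mean_cross_entropy(probabilities, targets):
    # Average negative log-probability assigned to the correct word
    total = 0.0
    for probs, target in zip(probabilities, targets):
        total -= math.log(probs[target])
    return total / len(targets)

def clip_by_value(gradient, low=-1.0, high=1.0):
    # Force every gradient component into [low, high]
    return [min(max(g, low), high) for g in gradient]

vocab_size = 1000   # arbitrary toy size
uniform = [1.0 / vocab_size] * vocab_size

# A model that guesses uniformly scores ln(vocab_size), whatever the targets are
loss = mean_cross_entropy([uniform, uniform], [3, 42])
print(round(loss, 3), round(math.log(vocab_size), 3))
# 6.908 6.908

print(clip_by_value([-3.2, 0.4, 1.7]))
# [-1.0, 0.4, 1.0]
```

This gives a useful reference point when reading training logs: an untrained model that predicts roughly uniformly should start near `ln(vocab_size)`, and the loss should fall from there.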

In [28]:
# Training

batches = get_batches(int_text, batch_size, seq_length)

with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())

    for epoch_i in range(num_epochs):
        state = sess.run(initial_state, {input_text: batches[0][0]})

        for batch_i, (x, y) in enumerate(batches):
            feed = {
                input_text: x,
                targets: y,
                initial_state: state,
                lr: learning_rate}
            train_loss, state, _ = sess.run([cost, final_state, train_op], feed)

            # Show every <show_every_n_batches> batches
            if (epoch_i * len(batches) + batch_i) % show_every_n_batches == 0:
                print('Epoch {:>3} Batch {:>4}/{}   train_loss = {:.3f}'.format(
                    epoch_i,
                    batch_i,
                    len(batches),
                    train_loss))

    # Save Model
    saver = tf.train.Saver()
    saver.save(sess, save_dir)
    print('Model Trained and Saved')

Epoch   0 Batch    0/450   train_loss = 11.825
Epoch   0 Batch  150/450   train_loss = 8.348
Epoch   0 Batch  300/450   train_loss = 8.311
Epoch   1 Batch    0/450   train_loss = 8.388
Epoch   1 Batch  150/450   train_loss = 8.192
Epoch   1 Batch  300/450   train_loss = 8.203
Epoch   2 Batch    0/450   train_loss = 8.319
Epoch   2 Batch  150/450   train_loss = 8.186
Epoch   2 Batch  300/450   train_loss = 8.171
Epoch   3 Batch    0/450   train_loss = 8.264
Epoch   3 Batch  150/450   train_loss = 8.145
Epoch   3 Batch  300/450   train_loss = 8.100
Epoch   4 Batch    0/450   train_loss = 7.940
Epoch   4 Batch  150/450   train_loss = 7.613
Epoch   4 Batch  300/450   train_loss = 7.353
Epoch   5 Batch    0/450   train_loss = 7.097
Epoch   5 Batch  150/450   train_loss = 6.802
Epoch   5 Batch  300/450   train_loss = 6.545
Epoch   6 Batch    0/450   train_loss = 6.449
Epoch   6 Batch  150/450   train_loss = 6.335
Epoch   6 Batch  300/450   train_loss = 6.144
Epoch   7 Batch    0/450   train_

Epoch  59 Batch  300/450   train_loss = 2.950
Epoch  60 Batch    0/450   train_loss = 2.832
Epoch  60 Batch  150/450   train_loss = 2.929
Epoch  60 Batch  300/450   train_loss = 2.924
Epoch  61 Batch    0/450   train_loss = 2.823
Epoch  61 Batch  150/450   train_loss = 2.916
Epoch  61 Batch  300/450   train_loss = 2.909
Epoch  62 Batch    0/450   train_loss = 2.803
Epoch  62 Batch  150/450   train_loss = 2.901
Epoch  62 Batch  300/450   train_loss = 2.910
Epoch  63 Batch    0/450   train_loss = 2.788
Epoch  63 Batch  150/450   train_loss = 2.898
Epoch  63 Batch  300/450   train_loss = 2.902
Epoch  64 Batch    0/450   train_loss = 2.778
Epoch  64 Batch  150/450   train_loss = 2.877
Epoch  64 Batch  300/450   train_loss = 2.894
Epoch  65 Batch    0/450   train_loss = 2.766
Epoch  65 Batch  150/450   train_loss = 2.868
Epoch  65 Batch  300/450   train_loss = 2.874
Epoch  66 Batch    0/450   train_loss = 2.741
Epoch  66 Batch  150/450   train_loss = 2.852
Epoch  66 Batch  300/450   train_l

Epoch 119 Batch  150/450   train_loss = 2.292
Epoch 119 Batch  300/450   train_loss = 2.287
Epoch 120 Batch    0/450   train_loss = 2.202
Epoch 120 Batch  150/450   train_loss = 2.288
Epoch 120 Batch  300/450   train_loss = 2.286
Epoch 121 Batch    0/450   train_loss = 2.201
Epoch 121 Batch  150/450   train_loss = 2.286
Epoch 121 Batch  300/450   train_loss = 2.263
Epoch 122 Batch    0/450   train_loss = 2.188
Epoch 122 Batch  150/450   train_loss = 2.282
Epoch 122 Batch  300/450   train_loss = 2.254
Epoch 123 Batch    0/450   train_loss = 2.175
Epoch 123 Batch  150/450   train_loss = 2.281
Epoch 123 Batch  300/450   train_loss = 2.253
Epoch 124 Batch    0/450   train_loss = 2.172
Epoch 124 Batch  150/450   train_loss = 2.273
Epoch 124 Batch  300/450   train_loss = 2.249
Epoch 125 Batch    0/450   train_loss = 2.184
Epoch 125 Batch  150/450   train_loss = 2.283
Epoch 125 Batch  300/450   train_loss = 2.245
Epoch 126 Batch    0/450   train_loss = 2.160
Epoch 126 Batch  150/450   train_l

Epoch 179 Batch    0/450   train_loss = 1.908
Epoch 179 Batch  150/450   train_loss = 1.997
Epoch 179 Batch  300/450   train_loss = 1.997
Epoch 180 Batch    0/450   train_loss = 1.893
Epoch 180 Batch  150/450   train_loss = 1.979
Epoch 180 Batch  300/450   train_loss = 2.001
Epoch 181 Batch    0/450   train_loss = 1.895
Epoch 181 Batch  150/450   train_loss = 1.978
Epoch 181 Batch  300/450   train_loss = 1.996
Epoch 182 Batch    0/450   train_loss = 1.883
Epoch 182 Batch  150/450   train_loss = 1.983
Epoch 182 Batch  300/450   train_loss = 1.993
Epoch 183 Batch    0/450   train_loss = 1.892
Epoch 183 Batch  150/450   train_loss = 1.978
Epoch 183 Batch  300/450   train_loss = 2.000
Epoch 184 Batch    0/450   train_loss = 1.889
Epoch 184 Batch  150/450   train_loss = 1.958
Epoch 184 Batch  300/450   train_loss = 1.990
Epoch 185 Batch    0/450   train_loss = 1.878
Epoch 185 Batch  150/450   train_loss = 1.962
Epoch 185 Batch  300/450   train_loss = 1.986
Epoch 186 Batch    0/450   train_l

# Generation Functions

Let's make some helper functions to generate the script.

In [136]:
import tensorflow as tf
import numpy as np
import re

_, vocab_to_int, int_to_vocab, token_dict = load_preprocess()

In [137]:
# Get the named tensors from the saved model

def get_tensors(loaded_graph):

    inputs = loaded_graph.get_tensor_by_name('input:0')
    initial_state = loaded_graph.get_tensor_by_name('initial_state:0')
    final_state = loaded_graph.get_tensor_by_name('final_state:0')
    probs = loaded_graph.get_tensor_by_name('probs:0')
    
    return inputs, initial_state, final_state, probs

In [138]:
# Sample the next word at random, weighted by the predicted probabilities

def pick_word(probabilities, int_to_vocab):

    choices = np.random.choice(len(int_to_vocab), size=1, p=probabilities)
    choice  = choices[0]
    return int_to_vocab[choice]
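
Sampling from the distribution, rather than always taking the most likely word, is what keeps the generated text varied. The sketch below contrasts the two strategies using only the standard library; `sample_word` and `greedy_word` are hypothetical helpers, and the three-word vocabulary and probabilities are made up.

```python
import random

def sample_word(probabilities, int_to_vocab, rng):
    # Draw one id with probability proportional to its weight
    ids = list(range(len(int_to_vocab)))
    return int_to_vocab[rng.choices(ids, weights=probabilities, k=1)[0]]

def greedy_word(probabilities, int_to_vocab):
    # Always take the most likely id
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return int_to_vocab[best]

int_to_vocab = {0: 'Homer', 1: 'Marge', 2: 'Bart'}
probabilities = [0.6, 0.3, 0.1]
rng = random.Random(0)   # seeded so the run is repeatable

samples = [sample_word(probabilities, int_to_vocab, rng) for _ in range(1000)]
counts = {word: samples.count(word) for word in int_to_vocab.values()}

print(counts)                                     # roughly proportional to the probabilities
print(greedy_word(probabilities, int_to_vocab))   # always 'Homer'
```

A greedy choice would return the same word every time for the same state, which tends to produce repetitive loops in generated text.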

# Generate the script

Let's generate our script.

In [142]:
gen_length = 300
# Seed word; it must exist in vocab_to_int exactly as written (case-sensitive)
prime_word = 'Homer'

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(save_dir + '.meta')
    loader.restore(sess, save_dir)

    # Get Tensors from loaded model
    input_text, initial_state, final_state, probs = get_tensors(loaded_graph)

    # Sentences generation setup
    gen_sentences = [prime_word]
    prev_state = sess.run(initial_state, {input_text: np.array([[1]])})

    # Generate sentences
    for n in range(gen_length):
        # Dynamic Input
        dyn_input = [[vocab_to_int[word] for word in gen_sentences[-seq_length:]]]
        dyn_seq_length = len(dyn_input[0])
        # Get Prediction
        probabilities, prev_state = sess.run(
            [probs, final_state],
            {input_text: dyn_input, initial_state: prev_state})

        pred_word = pick_word(probabilities[0][dyn_seq_length-1], int_to_vocab)

        gen_sentences.append(pred_word)
    
    # Replace punctuation tokens with the original symbols
    tv_script = ' '.join(gen_sentences)
    for key, token in token_dict.items():
        tv_script = tv_script.replace(' ' + token.lower(), key)
    # Remove the space that the join leaves after a newline or an opening parenthesis
    tv_script = tv_script.replace('\n ', '\n')
    tv_script = tv_script.replace('( ', '(')
        
    print(tv_script)

INFO:tensorflow:Restoring parameters from ./save
Homer Simpson: (HUMS, THEN:) Oh. (Hallway: INT. BUS Patty Bouvier: Yes, if books be careful now, Homer. I thought we were following out. Lenny Leonard: Can I tell anyone! Homer Simpson: (SWEETLY) Yeah. I'm gonna play guitar. Lisa Simpson: That's it! What are you so happy about? Mary: Lisa, wanna any stock, left on a sea issue. We hurt running against him down. Homer Simpson: (SIGHS) Great. I got your job to sting the snail than (SNEERING) (Suburban Street: EXT. country goes - LATER) Apu Nahasapeemapetilon: There for me twenty-four team. (COUGHS OFF Normally I received our community and a line. (Highway: Ext. highway - continuous) (Simpson Home: int. Simpson house - living room - continuous) Edna Krabappel-Flanders: (INTO cab on there?! I see the whole deal. All anyone who is offering for it, Mom. Marge Simpson: You got it, Marge, what's it. Marge Simpson: Well, that brought that thought. Homer Simpson: Gunderson! I can't waste a good hou
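
The token clean-up at the end of the generation cell can be tried in isolation. The sketch below assumes a generated word list that contains placeholder tokens (the words and the shortened table are invented for this example) and applies the same replacements.

```python
# Shortened token table, in the same format token_lookup() returns
token_dict = {
    '.': '||period||',
    ',': '||comma||',
    '\n': '||return||',
}

# Hypothetical model output: words and placeholder tokens mixed together
generated = ['Oh', '||comma||', 'okay', '||period||', '||return||', 'Great', '||period||']

script = ' '.join(generated)
for key, token in token_dict.items():
    # The leading space is removed too, so the symbol attaches to the previous word
    script = script.replace(' ' + token, key)
# The join leaves a space at the start of each new line
script = script.replace('\n ', '\n')

print(repr(script))
# 'Oh, okay.\nGreat.'
```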

## Formatting generated script

Homer Simpson: (HUMS, THEN:) Oh. 

(Hallway: INT. BUS)

Patty Bouvier: Yes, if books be careful now, Homer. I thought we were following out.<br>
Lenny Leonard: Can I tell anyone!<br>
Homer Simpson: (SWEETLY) Yeah. I'm gonna play guitar. <br>
Lisa Simpson: That's it! What are you so happy about? <br>
Mary: Lisa, wanna any stock, left on a sea issue. We hurt running against him down. <br>
Homer Simpson: (SIGHS) Great. I got your job to sting the snail than (SNEERING)

(Suburban Street: EXT. country goes - LATER) 

Apu Nahasapeemapetilon: There for me twenty-four team. (COUGHS OFF) Normally I received our community and a line. 

(Highway: Ext. highway - continuous)
(Simpson Home: int. Simpson house - living room - continuous)

Edna Krabappel-Flanders: (INTO cab on there?! I see the whole deal. All anyone who is offering for it, Mom. <br>
Marge Simpson: You got it, Marge, what's it. <br>
Marge Simpson: Well, that brought that thought. <br>
Homer Simpson: Gunderson! I can't waste a good house so I can be happy? <br>
Connie: It's piss it out. <br>
Ned Flanders: Oh, okay. <br>
Waylon Smithers: Well... I guess he should have been in the city. <br>
Homer Simpson: I'll save that baby I've read it. <br>

(Springfield Wax Museum: ext. capital seas it onstage of Congratulations, pray we got away my tear with Springfield tree!

Mary: (DISGUSTED SOUND) My finger student, Mr. Smithers. Back in the San pie! (SINGS) ON THE-- Int. house & (Sideshow <br> Player: Hey look! I still haven't sleeping this locked down. <br>
Marge Simpson: No more important trouble. <br>
Darcy: ... the cameraman! 

(Springfield Elementary School: INT. springfield elementary - skinner's office - later that night)

Committee #2: We've heard your life for someone who... <br>
Edna Krabappel-Flanders: (TO QUIMBY) Do it when not. As as the law from bachelor tonight, by Troy McClure.

## Conclusion

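The script as a whole does not make sense, but some individual lines come close to reading like real dialogue (for example, "Lisa Simpson: That's it! What are you so happy about?"). That is an encouraging result for a word-level model of this size.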