## Using RNNs to generate Simpsons script

Implementing RNNs in TensorFlow to generate a script for a scene from The Simpsons.

The data comes from the following Kaggle dataset:
https://www.kaggle.com/wcukierski/the-simpsons-by-the-data

We'll be using a subset of the original dataset that contains only the scenes set in Moe's Tavern. It doesn't include other versions of the tavern.

In [None]:
### Importing Libraries

In [75]:

import os
import tensorflow as tf
import numpy as np
import pickle
from tensorflow.contrib import seq2seq

In [33]:
### Download and Load the data

In [43]:
file_path = "./data/simpsons/moes_tavern_lines.txt"

def load_data(file_path):
    input_file = os.path.join(file_path)
    with open(input_file, "r") as f:
        data = f.read()
    return data

# Testing the load function
data = load_data(file_path)

In [44]:
### Basic Exploration of the data

In [45]:

view_sentence_range = (50, 60)

print('Dataset Stats')
print('Roughly the number of unique words: {}'.format(len({word: None for word in data.split()})))
scenes = data.split('\n\n')
print('Number of scenes: {}'.format(len(scenes)))
sentence_count_scene = [scene.count('\n') for scene in scenes]
print('Average number of sentences in each scene: {}'.format(np.average(sentence_count_scene)))

sentences = [sentence for scene in scenes for sentence in scene.split('\n')]
print('Number of lines: {}'.format(len(sentences)))
word_count_sentence = [len(sentence.split()) for sentence in sentences]
print('Average number of words in each line: {}'.format(np.average(word_count_sentence)))

print('\n')
print('The sentences {} to {}:'.format(*view_sentence_range))
print('\n'.join(data.split('\n')[view_sentence_range[0]:view_sentence_range[1]]))

Dataset Stats
Roughly the number of unique words: 11501
Number of scenes: 263
Average number of sentences in each scene: 15.1901140684
Number of lines: 4258
Average number of words in each line: 11.5044621888


The sentences 50 to 60:
Moe_Szyslak: Sorry, Homer.
Homer_Simpson: You know, if you tip the glass, there won't be so much foam on top.
Moe_Szyslak: Sorry, Homer.
Homer_Simpson: (LOOKING AT WATCH) Ah. Finished with fifteen seconds to spare.
Little_Man: (CONCERNED) What's the matter, buddy?
Homer_Simpson: The moron next door closed early!
Little_Man: (STIFFENING) I happen to be that moron.
Homer_Simpson: Oh, me and my trenchant mouth.
Homer_Simpson: Please, you've got to open that store.
Little_Man: Let me think about it... Eh... No.


### Preprocess the data

There are a lot of preprocessing steps we can perform on text data. Some of them are:
- Use lookup tables (converting text to integers), which helps improve efficiency
- Tokenize punctuation
- Remove stopwords
- Remove whitespace and other characters

#### Creating Lookup tables
- vocab_to_int 
- int_to_vocab

In [46]:
def create_lookup_tables(data):
    """
    Create lookup tables for vocabulary
    :param data: The text of TV scripts split into words
    :return: A tuple of dicts (vocab_to_int, int_to_vocab)
    """
    # Build the vocabulary of unique words in the dataset
    vocab = set(data)
    
    vocab_to_int = {word:index for index, word in enumerate(vocab)}
    int_to_vocab = {index:word for index, word in enumerate(vocab)}
    return (vocab_to_int, int_to_vocab)
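
As a quick illustration of what the two tables do, here is a minimal sketch on a toy word list. The word list is made up for the example, and `sorted()` is an addition that the function above does not use: it only makes the ids reproducible from run to run, since the iteration order of a set of strings can change between Python processes.

```python
# Toy word list standing in for the tokenized script (hypothetical data).
words = "moe_szyslak: hey homer ||comma|| hey".split()

# sorted() is an assumption of this sketch; it fixes the word -> id assignment.
vocab = sorted(set(words))
vocab_to_int = {word: index for index, word in enumerate(vocab)}
int_to_vocab = {index: word for word, index in vocab_to_int.items()}

# Encode the words as ids, then decode the ids back into words.
encoded = [vocab_to_int[word] for word in words]
decoded = [int_to_vocab[index] for index in encoded]

print(encoded)
print(decoded == words)
```

Repeated words map to the same id, and decoding the ids gives back the original word list.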


#### Creating lookup for punctuation
It is often helpful to convert punctuation marks into word-like tokens, so that each mark gets its own entry in the vocabulary.

Example: . becomes ||Period||



In [47]:
def token_lookup():
    """
    Generate a dict to turn punctuation into a token.
    :return: Tokenize dictionary where the key is the punctuation and the value is the token
    """
    token_dict = {
        ".": "||Period||", ",": "||Comma||",
        '"': "||Quotation_Mark||", ";": "||Semicolon||",
        "!": "||Exclamation_Mark||", "?": "||Question_mark||",
        "(": "||Left_Parentheses||", ")": "||Right_Parentheses||",
        "--": "||Dash||", "\n": "||Return||",
    }
    
    return token_dict
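
The sketch below shows the effect of these tokens on a short piece of text. It uses a reduced copy of the table and a made-up two-line sample, so treat it as an illustration only. Padding each token with spaces is what lets `split()` separate punctuation from the neighbouring word.

```python
# Reduced punctuation table (a subset chosen for illustration).
token_dict = {
    ".": "||Period||",
    ",": "||Comma||",
    "!": "||Exclamation_Mark||",
    "\n": "||Return||",
}

# Made-up sample in the same format as the script file.
line = "Moe_Szyslak: Sorry, Homer.\nHomer_Simpson: Oh!"

# Surround every punctuation token with spaces so it becomes its own word.
for symbol, token in token_dict.items():
    line = line.replace(symbol, " {} ".format(token))

tokens = line.lower().split()
print(tokens)
```

Without this step, "homer." and "homer" would be treated as two different vocabulary entries.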

### Preprocess and store data
Run the preprocessing code and store the data 

In [48]:
file_path = "./data/simpsons/moes_tavern_lines.txt"

# Load the data
data = load_data(file_path)

# Ignore notice, since we don't use it for analysing the data
data = data[81:]

# Create the lookup tables
token_dict = token_lookup()
for key, token in token_dict.items():
    data = data.replace(key, ' {} '.format(token))

# Convert to lower case and split the data
data = data.lower().split()

vocab_to_int, int_to_vocab = create_lookup_tables(data)
int_text = [vocab_to_int[word] for word in data]
pickle.dump((int_text, vocab_to_int, int_to_vocab, token_dict), open('preprocess.p', 'wb'))

In [49]:
#### Test if you can load the stored data

data = pickle.load(open('preprocess.p', mode='rb'))

### Build the Network

- Define placeholders
- Create word embeddings
- Create the RNN cells with dropout and initialize the state
- Create the RNN network
- Create the fully connected layer

In [62]:
#### Create the input, targets and learning rate placeholders
def get_inputs():
    inputs = tf.placeholder(dtype=tf.int32,shape = [None, None], name="inputs")
    targets  = tf.placeholder(tf.int32, [None, None], name ="targets")
    learning_rate = tf.placeholder(tf.float32,name ="learning_rate")
    
    return (inputs, targets, learning_rate)

#### Build and stack the RNNs cells and initialize them
Stack single or multiple RNN cells with dropout and initialize the state

In [52]:
def get_rnn_cells(rnn_size, batch_size, num_layers = 1, keep_prob=0.7):
    
    ## Define cell, dropout and multicell
    ## Each layer gets its own cell object. Reusing one object ([dropout]*num_layers)
    ## either shares weights between layers or raises an error, depending on the TF 1.x release
    def make_cell():
        cell = tf.contrib.rnn.BasicRNNCell(rnn_size)
        return tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=keep_prob)
    cell = tf.contrib.rnn.MultiRNNCell([make_cell() for _ in range(num_layers)])
    
    ## Initialize the RNN cell
    initial_state = cell.zero_state(batch_size=batch_size, dtype=tf.float32)
    ## Using tf.identity to set the name of the initial state                    
    initial_state = tf.identity(initial_state,name = "initial_state")
    
    return(cell, initial_state)
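
To make the role of the cell and its state concrete, here is a minimal NumPy sketch of the recurrence a basic (vanilla) RNN cell computes, h_t = tanh(x_t W_xh + h_{t-1} W_hh + b). The parameter names, the toy sizes and the random inputs are assumptions made for this illustration; dropout and stacking are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, input_size, rnn_size = 2, 4, 3   # toy sizes, not the notebook's

# Hypothetical parameter names, used only in this sketch.
W_xh = rng.normal(scale=0.1, size=(input_size, rnn_size))
W_hh = rng.normal(scale=0.1, size=(rnn_size, rnn_size))
b_h = np.zeros(rnn_size)

def rnn_step(x_t, h_prev):
    """One step of a vanilla RNN: h_t = tanh(x_t W_xh + h_prev W_hh + b)."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Start from a zero state, as cell.zero_state does, and unroll over 5 time steps.
h = np.zeros((batch_size, rnn_size))
sequence = rng.normal(size=(5, batch_size, input_size))
for x_t in sequence:
    h = rnn_step(x_t, h)

print(h.shape)
```

The state has one row per sequence in the batch and one column per hidden unit, which is why the zero state needs to know the batch size.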

#### Build the embedding layer
Since the vocabulary contains a large number of words, it is more efficient to use word embeddings 
than to pass in one-hot encoded vectors as input. 
Since this is a small network, we can train the embeddings as part of the same network.
In larger networks, or when the inputs are larger, it is usually far more efficient to generate the embeddings separately 
and then feed them to the network. This avoids retraining the embeddings every time we train the network.

In [53]:
def get_embedding_layer(inputs, vocab_size, embedding_size):
    embedding_weights = tf.Variable(tf.random_uniform((vocab_size, embedding_size),-1,1))
    embedding = tf.nn.embedding_lookup(embedding_weights,inputs)
    
    return embedding
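
The efficiency argument can be checked with a small NumPy sketch: looking up rows of the embedding matrix gives the same result as multiplying one-hot vectors by that matrix, without building the one-hot vectors. The sizes and word ids below are toy values chosen for the example.

```python
import numpy as np

vocab_size, embedding_size = 5, 3   # toy sizes
rng = np.random.default_rng(0)
embedding_weights = rng.uniform(-1, 1, size=(vocab_size, embedding_size))

word_ids = np.array([3, 0, 3])      # a short sequence of word ids

# Embedding lookup: just select the rows for the given ids.
looked_up = embedding_weights[word_ids]

# Equivalent one-hot formulation: (3 x vocab_size) @ (vocab_size x embedding_size).
one_hot = np.eye(vocab_size)[word_ids]
via_matmul = one_hot @ embedding_weights

print(looked_up.shape)
print(np.allclose(looked_up, via_matmul))
```

The lookup touches only the rows it needs, while the one-hot product does work proportional to the vocabulary size for every word.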

In [54]:
### Building the RNN

def build_rnn(cell, inputs):
    # No initial_state is passed here, so dynamic_rnn starts from the cell's zero state
    outputs, final_state = tf.nn.dynamic_rnn(cell,inputs, dtype=tf.float32)
    final_state = tf.identity(final_state,name = "final_state")
    return (outputs,final_state)

In [71]:
#### Building the complete network

def build_network(cell, input_data, vocab_size, rnn_size):
    # Get the embeddings, Here we are using the rnn_size as the embedding size, but we don't have to
    embed = get_embedding_layer(input_data, vocab_size, rnn_size)
    # Build the RNN passing the embeddings
    rnn_outputs, final_state = build_rnn(cell, embed)
    # Build a fully connected layer with the weights and biases initialized
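    # activation_fn=None keeps the logits linear; the default activation is ReLU, which would clip them at zero
    logits = tf.contrib.layers.fully_connected(rnn_outputs, num_outputs = vocab_size, activation_fn=None, weights_initializer=tf.truncated_normal_initializer(stddev=0.1),biases_initializer=tf.zeros_initializer())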
    return (logits,final_state)


### Generate Batches
Next we generate the batches that will be fed into the network.

The batches should be a NumPy array with the shape (number of batches, 2, batch size, sequence length). Each batch contains two elements:

- The first element is a single batch of input with the shape [batch size, sequence length]
- The second element is a single batch of targets with the shape [batch size, sequence length]

If the last batch can't be filled with enough data, it is dropped.
For example, get_batches([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 2, 3) would return a NumPy array of the following:
[
  #### First Batch
  [
    ##### Batch of Input
    [[ 1  2  3], [ 7  8  9]],
    ##### Batch of targets
    [[ 2  3  4], [ 8  9 10]]
  ],

  #### Second Batch
  [
    ##### Batch of Input
    [[ 4  5  6], [10 11 12]],
    ##### Batch of targets
    [[ 5  6  7], [11 12 13]]
  ]
]

In [72]:
def get_batches(int_text, batch_size, seq_length):
    """
    Return batches of input and target
    :param int_text: Text with the words replaced by their ids
    :param batch_size: The size of batch
    :param seq_length: The length of sequence
    :return: Batches as a Numpy array
    """
    int_text = np.array(int_text)
      
    split_size = batch_size * seq_length
    n_batches = int(len(int_text) / split_size)


    x = int_text[: n_batches*split_size]
    y = int_text[1: n_batches*split_size + 1]
    
    # Split the data into batch_size slices, then stack them into a 2D matrix 
    x = np.stack(np.split(x, batch_size))
    y = np.stack(np.split(y, batch_size))
    
    xx = [x[:, i*seq_length:i*seq_length+seq_length].tolist() for i in range(n_batches)]
    yy = [y[:, i*seq_length:i*seq_length+seq_length].tolist() for i in range(n_batches)]
 
    batches = np.array([a for a in zip(xx,yy)])
    return batches
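
The same layout can also be produced with a reshape, which makes the structure easier to see. The helper below is a hypothetical alternative written for this illustration, not a replacement for the function above. One deliberate difference: it computes the number of batches from `len(int_text) - 1`, so the last input always has a target even when the text length is an exact multiple of `batch_size * seq_length`.

```python
import numpy as np

def make_batches(int_text, batch_size, seq_length):
    """Hypothetical helper: same layout as get_batches, built with reshape."""
    int_text = np.asarray(int_text)
    # Reserve one extra element so every input has a "next word" target.
    n_batches = (len(int_text) - 1) // (batch_size * seq_length)
    n_used = n_batches * batch_size * seq_length
    x = int_text[:n_used].reshape(batch_size, -1)
    y = int_text[1:n_used + 1].reshape(batch_size, -1)
    # Cut each row into consecutive windows of seq_length columns.
    return np.array([
        [x[:, i:i + seq_length], y[:, i:i + seq_length]]
        for i in range(0, x.shape[1], seq_length)
    ])

batches = make_batches(range(1, 16), 2, 3)
print(batches.shape)
print(batches[0][0].tolist(), batches[0][1].tolist())
```

Each row of a batch continues in the same row of the next batch, which is the property that makes carrying the RNN state from one batch to the next meaningful.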



### Training the RNN

#### Setting the Hyperparams

In [79]:
# Number of Epochs
num_epochs = 10
# Batch Size
batch_size = 128
# RNN Size
rnn_size = 128
# Sequence Length
seq_length = 20
# Learning Rate
learning_rate_value = 0.01
# Show stats for every n number of batches
show_every_n_batches = 10

## Save directory
save_dir = './save'

#### Build the Graph

In [80]:
train_graph = tf.Graph()

with train_graph.as_default():
    vocab_size = len(int_to_vocab)
    input_text, targets, learning_rate = get_inputs()
    input_data_shape = tf.shape(input_text)
    cell, initial_state = get_rnn_cells(rnn_size, input_data_shape[0], num_layers = 1, keep_prob=0.7)
    logits, final_state = build_network(cell, input_text, vocab_size, rnn_size)

    # Probabilities for generating words
    probs = tf.nn.softmax(logits, name='probs')

    # Loss function
    cost = seq2seq.sequence_loss(
        logits,
        targets,
        tf.ones([input_data_shape[0], input_data_shape[1]]))

    # Optimizer
    optimizer = tf.train.AdamOptimizer(learning_rate)

    # Gradient Clipping
    gradients = optimizer.compute_gradients(cost)
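    # Skip variables that have no gradient, since tf.clip_by_value cannot handle None
    capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]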
    train_op = optimizer.apply_gradients(capped_gradients)
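
The gradient clipping step above clamps every gradient component to the range [-1, 1]. A minimal pure-Python sketch of that operation is shown below, next to clipping by norm for comparison. The gradient values are made up, and both helper functions are written for this illustration rather than taken from TensorFlow.

```python
import math

def clip_by_value(values, low, high):
    """Clamp every element to [low, high], element by element."""
    return [min(max(v, low), high) for v in values]

def clip_by_norm(values, max_norm):
    """Rescale the whole vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in values))
    if norm <= max_norm:
        return list(values)
    return [v * max_norm / norm for v in values]

gradient = [0.3, -4.0, 2.5, -0.9]           # made-up gradient values

by_value = clip_by_value(gradient, -1.0, 1.0)
by_norm = clip_by_norm(gradient, 1.0)

print(by_value)  # [0.3, -1.0, 1.0, -0.9]
print(by_norm)
```

Clipping by value leaves small components untouched but can change the direction of the gradient vector, whereas clipping by norm keeps the direction and shrinks every component by the same factor.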

#### Training step

In [84]:
# Get the batches
batches = get_batches(int_text, batch_size, seq_length)

# Initialize a session with the graph
with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())

    for epoch_i in range(num_epochs):
        state = sess.run(initial_state, {input_text: batches[0][0]})

        for batch_i, (x, y) in enumerate(batches):
            feed = {
                input_text: x,
                targets: y,
                initial_state: state,
                learning_rate: learning_rate_value}
            train_loss, state, _ = sess.run([cost, final_state, train_op], feed)

            # Show every <show_every_n_batches> batches
            if (epoch_i * len(batches) + batch_i) % show_every_n_batches == 0:
                print('Epoch {:>3} Batch {:>4}/{}   train_loss = {:.3f}'.format(
                    epoch_i,
                    batch_i,
                    len(batches),
                    train_loss))

    # Save Model
    saver = tf.train.Saver()
    saver.save(sess, save_dir)
    print('Model Trained and Saved')

Epoch   0 Batch    0/26   train_loss = 8.873
Epoch   0 Batch   10/26   train_loss = 6.069
Epoch   0 Batch   20/26   train_loss = 5.781
Epoch   1 Batch    4/26   train_loss = 5.447
Epoch   1 Batch   14/26   train_loss = 5.275
Epoch   1 Batch   24/26   train_loss = 5.335
Epoch   2 Batch    8/26   train_loss = 5.097
Epoch   2 Batch   18/26   train_loss = 5.052
Epoch   3 Batch    2/26   train_loss = 4.815
Epoch   3 Batch   12/26   train_loss = 5.017
Epoch   3 Batch   22/26   train_loss = 4.788
Epoch   4 Batch    6/26   train_loss = 4.809
Epoch   4 Batch   16/26   train_loss = 4.717
Epoch   5 Batch    0/26   train_loss = 4.529
Epoch   5 Batch   10/26   train_loss = 4.594
Epoch   5 Batch   20/26   train_loss = 4.637
Epoch   6 Batch    4/26   train_loss = 4.549
Epoch   6 Batch   14/26   train_loss = 4.473
Epoch   6 Batch   24/26   train_loss = 4.531
Epoch   7 Batch    8/26   train_loss = 4.385
Epoch   7 Batch   18/26   train_loss = 4.484
Epoch   8 Batch    2/26   train_loss = 4.167
Epoch   8 

In [87]:
#### Save params
params = (seq_length, save_dir)
pickle.dump(params, open('params.p', 'wb'))

### Generating the new script

In [88]:
### Load the preprocessed data and model params

_, vocab_to_int, int_to_vocab, token_dict = pickle.load(open('preprocess.p', mode='rb'))
seq_length, load_dir = pickle.load(open('params.p', mode='rb'))

#### Get Tensors helper function

In [95]:
def get_tensors(loaded_graph):
    """
    Get input, initial state, final state, and probabilities tensor from <loaded_graph>
    :param loaded_graph: TensorFlow graph loaded from file
    :return: Tuple (InputTensor, InitialStateTensor, FinalStateTensor, ProbsTensor)
    """
    InputTensor = loaded_graph.get_tensor_by_name("inputs:0")
    InitialStateTensor = loaded_graph.get_tensor_by_name("initial_state:0")
    FinalStateTensor = loaded_graph.get_tensor_by_name("final_state:0")
    ProbsTensor = loaded_graph.get_tensor_by_name("probs:0")
    
    return (InputTensor, InitialStateTensor, FinalStateTensor, ProbsTensor)


#### Implement pick word function

In [92]:
#### Pick most probable word
def pick_word(probabilities, int_to_vocab):
    """
    Pick the next word in the generated text
    :param probabilities: Probabilities of the next word
    :param int_to_vocab: Dictionary of word ids as the keys and words as the values
    :return: String of the predicted word
    """
    probabilities = probabilities.tolist()
    predicted_word = int_to_vocab[probabilities.index(max(probabilities))]
    return predicted_word

In [93]:
#### Pick Random Word
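def pick_random_word(probabilities, int_to_vocab):
    """
    Sample the next word in proportion to its predicted probability
    :param probabilities: Probabilities of the next word
    :param int_to_vocab: Dictionary of word ids as the keys and words as the values
    :return: String of the sampled word
    """
    t = np.cumsum(probabilities)
    # Draw a scalar in [0, total) and find the first cumulative bucket that contains it
    rand_s = t[-1] * np.random.rand()
    pred_word = int_to_vocab[int(np.searchsorted(t, rand_s))]

    return pred_word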
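
The difference between the two strategies shows up clearly on a toy distribution. The sketch below uses only the standard library; the three-word vocabulary, the probabilities and the helper names are made up for the example.

```python
import random

int_to_vocab = {0: "homer_simpson:", 1: "(sobs)", 2: "beer"}   # toy vocabulary
probabilities = [0.5, 0.3, 0.2]                                # toy distribution

def pick_greedy(probs):
    """Always return the single most probable word."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return int_to_vocab[best]

def pick_sampled(probs, rng):
    """Draw a word at random, weighted by its probability."""
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return int_to_vocab[index]

rng = random.Random(0)   # seeded so the run is repeatable
greedy_words = {pick_greedy(probabilities) for _ in range(200)}
sampled_words = {pick_sampled(probabilities, rng) for _ in range(200)}

print(greedy_words)
print(sorted(sampled_words))
```

Greedy picking returns the same word every time for a given distribution, so a generator built on it may get stuck repeating itself. Sampling trades some coherence for variety.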

### Generating the New Script

In [96]:
gen_length = 250
# homer_simpson, moe_szyslak, or Barney_Gumble
prime_word = 'moe_szyslak'
loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(load_dir + '.meta')
    loader.restore(sess, load_dir)

    # Get Tensors from loaded model
    input_text, initial_state, final_state, probs = get_tensors(loaded_graph)

    # Sentences generation setup
    gen_sentences = [prime_word + ':']
    prev_state = sess.run(initial_state, {input_text: np.array([[1]])})

    # Generate sentences
    for n in range(gen_length):
        # Dynamic Input
        dyn_input = [[vocab_to_int[word] for word in gen_sentences[-seq_length:]]]
        dyn_seq_length = len(dyn_input[0])

        # Get Prediction
        probabilities, prev_state = sess.run(
            [probs, final_state],
            {input_text: dyn_input, initial_state: prev_state})
        
        # probs has shape [batch, sequence, vocab]; use the last time step of the single batch entry
        pred_word = pick_word(probabilities[0][dyn_seq_length-1], int_to_vocab)

        gen_sentences.append(pred_word)
    
    # Remove tokens
    tv_script = ' '.join(gen_sentences)
    for key, token in token_dict.items():
        tv_script = tv_script.replace(' ' + token.lower(), key)
    tv_script = tv_script.replace('\n ', '\n')
    tv_script = tv_script.replace('( ', '(')
        
    print(tv_script)
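
The token removal step can be traced on a short example. The sketch below applies the same replacements to a hand-written list of generated words, using a reduced copy of the punctuation table; both are assumptions made for the illustration.

```python
# Reduced punctuation table (a subset chosen for illustration).
token_dict = {
    ".": "||Period||",
    ",": "||Comma||",
    "(": "||Left_Parentheses||",
    ")": "||Right_Parentheses||",
    "\n": "||Return||",
}

# Hand-written stand-in for the words a trained model might emit.
generated = [
    "moe_szyslak:", "||left_parentheses||", "sobs", "||right_parentheses||",
    "sorry", "||comma||", "homer", "||period||", "||return||",
    "homer_simpson:", "oh", "||period||",
]

script = " ".join(generated)
# Replace each token, together with the space in front of it, by its symbol.
for symbol, token in token_dict.items():
    script = script.replace(" " + token.lower(), symbol)
# Tidy the space left behind after a newline or an opening parenthesis.
script = script.replace("\n ", "\n").replace("( ", "(")

print(script)
```

Because the space before every token is removed, an opening parenthesis ends up attached to the speaker name, which matches the `moe_szyslak:(sobs)` pattern in the generated script.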

moe_szyslak:(sings) yeah, but i got a woman.
moe_szyslak:(to moe) oh, it's probably a man enough left a man in.
homer_simpson:(singing) i am.
moe_szyslak:(sobs)
homer_simpson:(to homer) i want to that.
homer_simpson:(sobs)
homer_simpson:(sobs)
homer_simpson:(sobs)
homer_simpson:(terrified noise)
homer_simpson:(") moe, i got a beer.
homer_simpson:(sobs)
homer_simpson:(sobs)
homer_simpson:(warily) oh, i can't believe it has a big company.
homer_simpson:(loud) hey, what's the springfield, please!
moe_szyslak:(laughs)
homer_simpson:(to homer, homer) i got a" love...
homer_simpson:(singing)"("")"(""" and".
homer_simpson:(sobs)
homer_simpson:(sobs)
homer_simpson:(sobs)
homer_simpson:(to camera) you know, i see...
homer_simpson:(warily) oh, that's it.
homer_simpson:(reading)" and, and you sure, and you all the end of my friend.
moe_szyslak:(sobs)
homer_simpson:(sobs)
homer_simpson:(to moe) hey, what's the world of the bar.



### Tips on Improving the Model

- Tweaking the hyperparameters can lead to better output.
- Training on more data is a good way to improve the model.
- Adding more preprocessing to the inputs might help improve the model.