# TV Script Generation

In this project, a [Seinfeld](https://en.wikipedia.org/wiki/Seinfeld) TV script is generated using RNNs. The [Seinfeld dataset](https://www.kaggle.com/thec03u5/seinfeld-chronicles#scripts.csv) of scripts from 9 seasons, available from Kaggle, is used. The neural network generates a new, "fake" TV script based on patterns it recognizes in this training data.

## Get the Data

In [1]:
# load in data
import helper
data_dir = './data/Seinfeld_Scripts.txt'
text = helper.load_data(data_dir)

## Explore the Data
The variable `view_line_range` selects which lines of the data are displayed. The data is all lowercase text, and each line of dialogue is separated by a newline character `\n`.

In [2]:
view_line_range = (0, 10)

import numpy as np

print('Dataset Stats')
print('Roughly the number of unique words: {}'.format(len({word: None for word in text.split()})))

lines = text.split('\n')
print('Number of lines: {}'.format(len(lines)))
word_count_line = [len(line.split()) for line in lines]
print('Average number of words in each line: {}'.format(np.average(word_count_line)))

print()
print('The lines {} to {}:'.format(*view_line_range))
print('\n'.join(text.split('\n')[view_line_range[0]:view_line_range[1]]))

Dataset Stats
Roughly the number of unique words: 46367
Number of lines: 109233
Average number of words in each line: 5.544240293684143

The lines 0 to 10:
jerry: do you know what this is all about? do you know, why were here? to be out, this is out...and out is one of the single most enjoyable experiences of life. people...did you ever hear people talking about we should go out? this is what theyre talking about...this whole thing, were all out now, no one is home. not one person here is home, were all out! there are people trying to find us, they dont know where we are. (on an imaginary phone) did you ring?, i cant find him. where did he go? he didnt tell me where he was going. he must have gone out. you wanna go out you get ready, you pick out the clothes, right? you take the shower, you get all ready, get the cash, get your friends, the car, the spot, the reservation...then youre standing around, what do you do? you go we gotta be getting back. once youre out, you wanna get back! y

---
## Implement Pre-processing Functions
The first step with any dataset is pre-processing. The following pre-processing functions are implemented:
- Lookup Table
- Tokenize Punctuation

### Lookup Table
To create a word embedding, the words are first transformed to ids. In this function, two dictionaries are created:
- a dictionary that maps each word to an id, called `vocab_to_int`
- a dictionary that maps each id to a word, called `int_to_vocab`

These dictionaries are returned as the **tuple** `(vocab_to_int, int_to_vocab)`.

In [3]:
import problem_unittests as tests
from collections import Counter

def create_lookup_tables(text):
    """
    Create lookup tables for vocabulary
    :param text: The text of tv scripts split into words
    :return: A tuple of dicts (vocab_to_int, int_to_vocab)
    """
    word_counts = Counter(text)
    # sorting the words from most to least frequent in text occurrence
    sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
    # create int_to_vocab dictionaries
    int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
    vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}

    return vocab_to_int, int_to_vocab

"""
Test the function
"""
tests.test_create_lookup_tables(create_lookup_tables)

Tests Passed
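
As a quick illustration of the round trip between words and ids, the following minimal sketch rebuilds the two dictionaries for a toy word list. The sample words are made up, and `build_lookup` is a hypothetical stand-in that mirrors `create_lookup_tables` so that the snippet runs on its own.

```python
from collections import Counter

def build_lookup(words):
    """Hypothetical stand-in that mirrors create_lookup_tables."""
    counts = Counter(words)
    # the most frequent words receive the smallest ids
    sorted_vocab = sorted(counts, key=counts.get, reverse=True)
    int_to_vocab = dict(enumerate(sorted_vocab))
    vocab_to_int = {word: idx for idx, word in int_to_vocab.items()}
    return vocab_to_int, int_to_vocab

# made-up sample, for illustration only
toy_words = ['no', 'soup', 'for', 'you', 'no', 'no', 'soup']
toy_vocab_to_int, toy_int_to_vocab = build_lookup(toy_words)

encoded = [toy_vocab_to_int[w] for w in toy_words]
decoded = [toy_int_to_vocab[i] for i in encoded]
print(encoded)               # [0, 1, 2, 3, 0, 0, 1]
print(decoded == toy_words)  # True
```

Because the ids are assigned by frequency, the most common word ("no" in this sample) gets id 0.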


### Tokenize Punctuation
The script is split into a word array using spaces as delimiters. However, punctuation marks such as periods and exclamation marks can create multiple ids for the same word. For example, "bye" and "bye!" would generate two different word ids.

The function `token_lookup` returns a dict that is used to tokenize symbols like "!" into `<EXCLAMATION_MARK>`. The dictionary covers the following symbols, where the symbol is the key and the token is the value:
- Period ( **.** )
- Comma ( **,** )
- Quotation Mark ( **"** )
- Semicolon ( **;** )
- Exclamation mark ( **!** )
- Question mark ( **?** )
- Left Parentheses ( **(** )
- Right Parentheses ( **)** )
- Dash ( **-** )
- Return ( **\n** )

This dictionary is used to tokenize the symbols and add the delimiter (space) around each of them. This separates each symbol into its own word, making it easier for the neural network to predict the next word. Tokens that could be confused with a word are avoided; for example, `<DASH>` is used instead of "dash".

In [4]:
def token_lookup():
    """
    Generate a dict to turn punctuation into a token.
    :return: Tokenized dictionary where the key is the punctuation and the value is the token
    """
    dictionary = {
        '.': '<PERIOD>',
        ',': '<COMMA>',
        '"': '<QUOTATION_MARK>',
        ';': '<SEMICOLON>',
        '!': '<EXCLAMATION_MARK>',
        '?': '<QUESTION_MARK>',
        '(': '<LEFT_PAREN>',
        ')': '<RIGHT_PAREN>',
        #'--': '<HYPHENS>',
        '-': '<DASH>',
        #':': '<COLON>',
        '\n': '<NEW_LINE>'}
    return dictionary

"""
Test the function
"""
tests.test_tokenize(token_lookup)

Tests Passed
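
The sketch below shows how such a dictionary can be applied to raw text. It assumes that the helper pads each token with spaces, lowercases the text, and then splits on whitespace; the `tokenize` function and the sample sentence are hypothetical and are not taken from `helper.py`.

```python
def tokenize(text, token_dict):
    """Hypothetical helper: replace each symbol with a space-padded token, then split."""
    for symbol, token in token_dict.items():
        text = text.replace(symbol, ' {} '.format(token))
    # lowercasing after the replacement also lowercases the tokens
    return text.lower().split()

# a reduced dictionary, for illustration only
sample_tokens = {'!': '<EXCLAMATION_MARK>', ',': '<COMMA>'}
sample_words = tokenize('Bye, Jerry! bye!', sample_tokens)
print(sample_words)
# ['bye', '<comma>', 'jerry', '<exclamation_mark>', 'bye', '<exclamation_mark>']
```

With the punctuation separated, both occurrences of "bye" map to the same word and therefore to the same id.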


## Pre-process all the data and save it

Running the code cell below pre-processes all the data and saves it to a file.

In [5]:
# pre-process training data
helper.preprocess_and_save_data(data_dir, token_lookup, create_lookup_tables)

# Checkpoint
This is the first checkpoint. If the notebook has to be restarted or revisited later, it can be resumed from here. The preprocessed data has been saved to disk.

In [6]:
import helper
import problem_unittests as tests

int_text, vocab_to_int, int_to_vocab, token_dict = helper.load_preprocess()

The variable `int_text` is the tokenized text. Below, it is checked that the first 10 words of `text` match the words recovered from `int_text` with the `int_to_vocab` dictionary.

In [7]:
text[0:42]

'jerry: do you know what this is all about?'

In [8]:
[int_to_vocab[i] for i in int_text[0:10]]

['jerry:',
 'do',
 'you',
 'know',
 'what',
 'this',
 'is',
 'all',
 'about',
 '<question_mark>']

## Build the Neural Network
In this section, the components necessary to build an RNN are defined: the RNN module and the forward and backpropagation functions.

### Check Access to GPU

In [9]:
import torch

# Check for a GPU
train_on_gpu = torch.cuda.is_available()
if not train_on_gpu:
    print('No GPU found. Please use a GPU to train your neural network.')

## Input
Let's start with the preprocessed input data. [TensorDataset](http://pytorch.org/docs/master/data.html#torch.utils.data.TensorDataset) is used to provide a known format to the dataset; in combination with [DataLoader](http://pytorch.org/docs/master/data.html#torch.utils.data.DataLoader), it handles batching, shuffling, and other dataset iteration functions.

### Batching
The function `batch_data` is implemented to batch `words` data into chunks of size `batch_size` using the `TensorDataset` and `DataLoader` classes. 

In [10]:
from torch.utils.data import TensorDataset, DataLoader
import torch

def batch_data(words, sequence_length, batch_size):
    """
    Batch the neural network data using DataLoader
    :param words: The word ids of the TV scripts
    :param sequence_length: The sequence length of each batch
    :param batch_size: The size of each batch; the number of sequences in a batch
    :return: DataLoader with batched data
    """
    
    feature_tensors = []
    target_tensors= []
    for i in range(sequence_length,len(words)):
        feature_tensors.append(words[i-sequence_length:i])
        target_tensors.append(words[i])

    data = TensorDataset(torch.tensor(feature_tensors), torch.tensor(target_tensors))
    data_loader = torch.utils.data.DataLoader(data, batch_size=batch_size)
    
    return data_loader

### Test the dataloader 

In [11]:
# test the code
test_text = [i for i in range(20)]
print('Sample input: ', test_text)
print()

# obtain one batch of training data
dataloader = batch_data(test_text, sequence_length = 5, batch_size = 10)
dataiter = iter(dataloader)
feature_tensors, target_tensors = next(dataiter)

print('Sample feature tensor size: ', feature_tensors.size()) # batch_size, seq_length
print('Sample feature tensor: \n', feature_tensors)
print()
print('Sample target tensor size: ', target_tensors.size()) # batch_size
print('Sample target tensor: \n', target_tensors)

Sample input:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

Sample feature tensor size:  torch.Size([10, 5])
Sample feature tensor: 
 tensor([[  0,   1,   2,   3,   4],
        [  1,   2,   3,   4,   5],
        [  2,   3,   4,   5,   6],
        [  3,   4,   5,   6,   7],
        [  4,   5,   6,   7,   8],
        [  5,   6,   7,   8,   9],
        [  6,   7,   8,   9,  10],
        [  7,   8,   9,  10,  11],
        [  8,   9,  10,  11,  12],
        [  9,  10,  11,  12,  13]])

Sample target tensor size:  torch.Size([10])
Sample target tensor: 
 tensor([  5,   6,   7,   8,   9,  10,  11,  12,  13,  14])
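
The sliding-window logic behind these tensors can be checked without PyTorch. The following minimal sketch reproduces the feature/target construction of `batch_data` with plain lists; `sliding_windows` is a hypothetical name used only for this illustration.

```python
def sliding_windows(words, sequence_length):
    """Build (features, targets) the same way as the loop in batch_data."""
    indices = range(sequence_length, len(words))
    features = [words[i - sequence_length:i] for i in indices]
    targets = [words[i] for i in indices]
    return features, targets

features, targets = sliding_windows(list(range(20)), sequence_length=5)
print(len(features))              # 15 windows, i.e. len(words) - sequence_length
print(features[0], targets[0])    # [0, 1, 2, 3, 4] 5
print(features[-1], targets[-1])  # [14, 15, 16, 17, 18] 19
```

With a batch size of 10, the 15 windows form one full batch and one partial batch of 5 samples, which is why the training loop further below only iterates over completely full batches.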


---
## Build the Neural Network
Implement an RNN using PyTorch's [Module class](http://pytorch.org/docs/master/nn.html#torch.nn.Module) with LSTM.
 
The initialize function creates the layers of the neural network and saves them to the class. The forward propagation function uses these layers to run forward propagation and generates an output and a hidden state.

**The output of this model is the *last* batch of word scores** after a complete sequence has been processed. That is, for each input sequence of words, the output is word scores for a single, most likely, next word.


In [13]:
import torch.nn as nn

class RNN(nn.Module):
    
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5):
        """
        Initialize the PyTorch RNN Module
        :param vocab_size: The number of input dimensions of the neural network (the size of the vocabulary)
        :param output_size: The number of output dimensions of the neural network
        :param embedding_dim: The size of embeddings, should you choose to use them        
        :param hidden_dim: The size of the hidden layer outputs
        :param n_layers: The number of stacked LSTM layers
        :param dropout: dropout to add in between LSTM/GRU layers
        """
        super(RNN, self).__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        # embedding and LSTM layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, 
                            dropout=dropout, batch_first=True)
        
        # dropout layer
        #self.dropout = nn.Dropout(dropout)
        
        # fully-connected output layer (produces raw word scores)
        self.fc = nn.Linear(hidden_dim, output_size)
    
    def forward(self, nn_input, hidden):
        """
        Forward propagation of the neural network
        :param nn_input: The input to the neural network
        :param hidden: The hidden state        
        :return: Two Tensors, the output of the neural network and the latest hidden state
        """
        x = nn_input
        batch_size = x.size(0)

        # embeddings and lstm_out
        x = x.long()
        embeds = self.embedding(x)
        lstm_out, hidden = self.lstm(embeds, hidden)
    
        # stack up lstm outputs
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        
        # dropout and fully-connected layer
        #output = self.dropout(lstm_out)
        output = self.fc(lstm_out)
        
        # reshape into (batch_size, seq_length, output_size)
        output = output.view(batch_size, -1, self.output_size)
        # keep only the word scores of the last time step
        output = output[:, -1]
        
        # return the last word scores and the hidden state
        return output, hidden
    
    
    def init_hidden(self, batch_size):
        '''
        Initialize the hidden state of an LSTM/GRU
        :param batch_size: The batch_size of the hidden state
        :return: hidden state of dims (n_layers, batch_size, hidden_dim)
        '''
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        
        return hidden
        

"""
Test the function
"""
tests.test_rnn(RNN, train_on_gpu)

Tests Passed
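
To make the "last time step" selection concrete, the sketch below mimics `output[:, -1]` with nested lists instead of tensors. The scores are made-up values with shape `(batch_size=2, seq_length=3, output_size=4)`.

```python
# made-up scores with shape (batch_size=2, seq_length=3, output_size=4)
scores = [
    [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.2, 0.2], [0.7, 0.1, 0.1, 0.1]],
    [[0.2, 0.2, 0.3, 0.3], [0.1, 0.6, 0.2, 0.1], [0.1, 0.1, 0.1, 0.7]],
]

# list equivalent of `output[:, -1]`: keep the final time step of every sequence
last_step = [sequence[-1] for sequence in scores]

# index of the highest-scoring word for each sequence in the batch
best_ids = [row.index(max(row)) for row in last_step]

print(last_step)  # [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]
print(best_ids)   # [0, 3]
```

The result has one row of word scores per input sequence, i.e. shape `(batch_size, output_size)`, which matches the shape of the targets produced by `batch_data`.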


### Define forward and backpropagation

The `forward_back_prop` function applies forward and back propagation to the RNN. It returns the average loss over a batch and the hidden state.

In [14]:
def forward_back_prop(rnn, optimizer, criterion, inputs, target, hidden):
    """
    Forward and backward propagation on the neural network
    :param rnn: The PyTorch Module that holds the neural network
    :param optimizer: The PyTorch optimizer for the neural network
    :param criterion: The PyTorch loss function
    :param inputs: A batch of input to the neural network
    :param target: The target output for the batch of input
    :param hidden: The hidden state from the previous batch
    :return: The loss and the latest hidden state Tensor
    """
    # move data to GPU, if available
    if(train_on_gpu):
        inputs, target = inputs.cuda(), target.cuda()
    
    # detach the hidden state from its history so that backprop does not reach into previous batches
    hidden = tuple([each.data for each in hidden])
    
    # zero accumulated gradients
    rnn.zero_grad()

    # get the output from the model
    output, hidden = rnn(inputs, hidden)

    # calculate the loss and perform backprop
    loss = criterion(output, target)
    loss.backward()
    
    clip = 5  # maximum gradient norm
    # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
    nn.utils.clip_grad_norm_(rnn.parameters(), clip)
    optimizer.step()
    
    return loss.item(), hidden

"""
Test the function
"""
tests.test_forward_back_prop(RNN, forward_back_prop, train_on_gpu)

Tests Passed
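
The idea behind norm-based gradient clipping can be sketched in plain Python. This is a simplification under stated assumptions: the gradients are represented as a flat list of floats, whereas `clip_grad_norm_` operates in place on the gradients of the parameter tensors; `clip_by_global_norm` is a hypothetical name.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients so that their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # gradients are only ever scaled down, never up
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# made-up gradient values, for illustration only
clipped, norm_before = clip_by_global_norm([30.0, 40.0], max_norm=5)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before)  # 50.0
print(norm_after)   # just under 5

# gradients that are already small are left untouched
unchanged, _ = clip_by_global_norm([0.3, 0.4], max_norm=5)
print(unchanged)    # [0.3, 0.4]
```

The direction of the gradient is preserved; only its length is limited, which keeps a single large update from destabilizing training.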


## Neural Network Training

With the structure of the network complete and data ready to be fed in the neural network, it's time to train it.

### Train Loop

The training loop is implemented in the `train_rnn` function. This function trains the network over all the batches for the number of epochs given.

In [15]:
def train_rnn(rnn, batch_size, optimizer, criterion, n_epochs, show_every_n_batches=100):
    batch_losses = []
    
    rnn.train()

    print("Training for %d epoch(s)..." % n_epochs)
    for epoch_i in range(1, n_epochs + 1):
        
        # initialize hidden state
        hidden = rnn.init_hidden(batch_size)
        
        for batch_i, (inputs, labels) in enumerate(train_loader, 1):
            
            # make sure you iterate over completely full batches, only
            n_batches = len(train_loader.dataset)//batch_size
            if(batch_i > n_batches):
                break
            
            # forward, back prop
            loss, hidden = forward_back_prop(rnn, optimizer, criterion, inputs, labels, hidden)          
            # record loss
            batch_losses.append(loss)

            # printing loss stats
            if batch_i % show_every_n_batches == 0:
                print('Epoch: {:>4}/{:<4}  Loss: {}'.format(
                    epoch_i, n_epochs, np.average(batch_losses)))
                batch_losses = []

    # returns a trained rnn
    return rnn

### Hyperparameters

The following parameters are defined:
* `sequence_length` - the length of a sequence
* `batch_size` - batch size
* `num_epochs` - number of epochs to train for
* `learning_rate` - learning rate for an Adam optimizer
* `vocab_size` - number of unique tokens in the vocabulary
* `output_size` - desired size of the output
* `embedding_dim` - embedding dimension; smaller than the vocab_size
* `hidden_dim` - hidden dimension of your RNN
* `n_layers` - number of layers/cells in your RNN
* `show_every_n_batches` - number of batches at which the neural network should print progress

In [16]:
# Data params
# Sequence Length
sequence_length = 10  # of words in a sequence
# Batch Size
batch_size = 128

# data loader - do not change
train_loader = batch_data(int_text, sequence_length, batch_size)

In [17]:
# Training parameters
# Number of Epochs
num_epochs = 10
# Learning Rate
learning_rate = 0.001

# Model parameters
# Vocab size
vocab_size = len(vocab_to_int) 
# Output size
output_size = vocab_size
# Embedding Dimension
embedding_dim = 300
# Hidden Dimension
hidden_dim = 256
# Number of RNN Layers
n_layers = 2

# Show stats for every n number of batches
show_every_n_batches = 500

### Train

The neural network is now trained on the pre-processed data.

In [18]:
# create model and move to gpu if available
rnn = RNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5)
if train_on_gpu:
    rnn.cuda()

# defining loss and optimization functions for training
optimizer = torch.optim.Adam(rnn.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()

# training the model
trained_rnn = train_rnn(rnn, batch_size, optimizer, criterion, num_epochs, show_every_n_batches)

# saving the trained model
helper.save_model('./save/trained_rnn', trained_rnn)
print('Model Trained and Saved')

Training for 10 epoch(s)...
Epoch:    1/10    Loss: 5.49014184474945
Epoch:    1/10    Loss: 4.8817426195144655
Epoch:    1/10    Loss: 4.658660507202148
Epoch:    1/10    Loss: 4.543095160484314
Epoch:    1/10    Loss: 4.53983976650238
Epoch:    1/10    Loss: 4.55903724527359
Epoch:    1/10    Loss: 4.46418013381958
Epoch:    1/10    Loss: 4.339940574645996
Epoch:    1/10    Loss: 4.3120862193107605
Epoch:    1/10    Loss: 4.257246546268463
Epoch:    1/10    Loss: 4.373244073867798
Epoch:    1/10    Loss: 4.408431221485138
Epoch:    1/10    Loss: 4.396023030757904
Epoch:    2/10    Loss: 4.1921956972195025
Epoch:    2/10    Loss: 4.010610648155213
Epoch:    2/10    Loss: 3.9120227069854736
Epoch:    2/10    Loss: 3.8620828886032106
Epoch:    2/10    Loss: 3.919403598308563
Epoch:    2/10    Loss: 4.000794438362122
Epoch:    2/10    Loss: 3.946725730895996
Epoch:    2/10    Loss: 3.8232504744529723
Epoch:    2/10    Loss: 3.8503002891540525
Epoch:    2/10    Loss: 3.795907429218292
...


Model Trained and Saved


### How were the model hyperparameters selected? 

The parameters picked initially were similar to the ones in the final iteration. The loss decreased consistently until epoch 5, when it started oscillating between 3.8 and 4.0. The model had probably reached a local minimum, with a final loss of 3.9. This behaviour did not change when the parameters were modified. Removing dropout before the final fully-connected layer resolved the issue, and the loss started decreasing further.

The sequence length was chosen to be 10 words, which seemed sufficient for this task. 

The batch size is generally chosen as a power of 2, which tends to be handled more efficiently by PyTorch when using CUDA. A batch size that is too small slows down training, while one that is too large consumes more memory and can hurt accuracy. A size of 128 was finally chosen.

Two hidden layers were selected as they converged faster than three, whilst keeping similar performance. The hidden dimension of 256 and embedding size of 300 were selected.

A learning rate of 0.001 was chosen for this task. However, the loss oscillated during training, so an adaptive learning rate could potentially be incorporated to adjust the training.
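
One possible form of such an adjustment is to reduce the learning rate when the loss stops improving. PyTorch provides `torch.optim.lr_scheduler.ReduceLROnPlateau` for this; the plain-Python sketch below only illustrates the idea and was not used in this project. The `PlateauDecay` class, its default values, and the loss values are hypothetical.

```python
class PlateauDecay:
    """Hypothetical helper: shrink the learning rate when the loss stops improving."""

    def __init__(self, lr, factor=0.5, patience=2):
        self.lr = lr
        self.factor = factor      # multiplier applied on a plateau
        self.patience = patience  # tolerated epochs without improvement
        self.best = float('inf')
        self.bad_epochs = 0

    def step(self, loss):
        if loss < self.best:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

scheduler = PlateauDecay(lr=0.001)
# made-up epoch losses that plateau around 3.9
for epoch_loss in [4.4, 4.0, 3.9, 3.95, 3.92, 3.97, 3.85]:
    print(epoch_loss, scheduler.step(epoch_loss))
```

In this made-up run the learning rate is halved once, after three consecutive epochs without a new best loss.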

---
# Checkpoint

The model is saved as `trained_rnn`. Work can be resumed by running the next cell, which loads the word:id dictionaries and the saved model.

In [23]:
import torch
import helper
import problem_unittests as tests

_, vocab_to_int, int_to_vocab, token_dict = helper.load_preprocess()
trained_rnn = helper.load_model('./save/trained_rnn')

## Generate TV Script
With the network trained and saved, a "fake" Seinfeld TV script is generated in this section.

### Generate Text
To generate the text, the network starts with a single word and repeats its predictions until it reaches a set length. This is done by the `generate` function, which takes a word id to start with, `prime_id`, and generates a set length of text, `predict_len`. It uses top-k sampling to introduce some randomness in choosing the next word, given an output set of word scores.
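
Top-k sampling on its own can be sketched with the standard library. The following is a minimal illustration, not the code used in this notebook: `sample_top_k` and the toy scores are hypothetical, and the softmax is computed by hand instead of with `F.softmax`.

```python
import math
import random

def sample_top_k(scores, k=5, rng=random):
    """Sample an index from the k highest-scoring entries of `scores`."""
    # softmax over the raw scores (shifted by the max for numerical stability)
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]

    # keep the indices of the k most probable entries
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]

    # random.choices treats the weights as relative, so they are
    # effectively renormalised over the k candidates
    return rng.choices(top, weights=[probs[i] for i in top], k=1)[0]

random.seed(0)
toy_scores = [2.0, 0.5, 1.5, -1.0, 0.0, 3.0, -2.0]
picks = [sample_top_k(toy_scores, k=3) for _ in range(1000)]
print(set(picks) <= {0, 2, 5})  # True: only the three best indices are ever drawn
```

Restricting the choice to the k best candidates avoids sampling very unlikely words, while the random draw keeps the generated text from repeating the single most likely continuation.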

In [20]:
import numpy as np
import torch.nn.functional as F

def generate(rnn, prime_id, int_to_vocab, token_dict, pad_value, predict_len=100):
    """
    Generate text using the neural network
    :param rnn: The PyTorch Module that holds the trained neural network
    :param prime_id: The word id to start the first prediction
    :param int_to_vocab: Dict of word id keys to word values
    :param token_dict: Dict of punctuation keys to token values
    :param pad_value: The value used to pad a sequence
    :param predict_len: The length of text to generate
    :return: The generated text
    """
    rnn.eval()
    
    # create a sequence (batch_size=1) with the prime_id
    current_seq = np.full((1, sequence_length), pad_value)
    current_seq[-1][-1] = prime_id
    predicted = [int_to_vocab[prime_id]]
    
    for _ in range(predict_len):
        if train_on_gpu:
            current_seq = torch.LongTensor(current_seq).cuda()
        else:
            current_seq = torch.LongTensor(current_seq)
        
        # initialize the hidden state
        hidden = rnn.init_hidden(current_seq.size(0))
        
        # get the output of the rnn
        output, _ = rnn(current_seq, hidden)
        
        # get the next word probabilities
        p = F.softmax(output, dim=1).data
        if(train_on_gpu):
            p = p.cpu() # move to cpu
         
        # use top_k sampling to get the index of the next word
        top_k = 5
        p, top_i = p.topk(top_k)
        top_i = top_i.numpy().squeeze()
        
        # select the likely next word index with some element of randomness
        p = p.numpy().squeeze()
        word_i = np.random.choice(top_i, p=p/p.sum())
        
        # retrieve that word from the dictionary
        word = int_to_vocab[word_i]
        predicted.append(word)     
        
        # the generated word becomes the next "current sequence" and the cycle can continue
        # (the tensor is moved back to the CPU first so that np.roll can operate on it)
        current_seq = np.roll(current_seq.cpu().numpy(), -1, 1)
        current_seq[-1][-1] = word_i
    
    gen_sentences = ' '.join(predicted)
    
    # Replace punctuation tokens
    for key, token in token_dict.items():
        gen_sentences = gen_sentences.replace(' ' + token.lower(), key)
    gen_sentences = gen_sentences.replace('\n ', '\n')
    gen_sentences = gen_sentences.replace('( ', '(')
    
    # return all the sentences
    return gen_sentences

### Generate a New Script
It's time to generate the text. `gen_length` is set to the length of the TV script to generate, and `prime_word` can be set to one of the following to start the prediction:
- "jerry"
- "elaine"
- "george"
- "kramer"

This can be run several times until an interesting script has been generated.

In [25]:
# run the cell multiple times to get different results!
gen_length = 400 # modify the length to your preference
prime_word = 'jerry' # name for starting the script

pad_word = helper.SPECIAL_WORDS['PADDING']
generated_script = generate(trained_rnn, vocab_to_int[prime_word + ':'], int_to_vocab, token_dict, vocab_to_int[pad_word], gen_length)
print(generated_script)



jerry: hopeless.

hoyt: i can't find it out.

jerry: what do i think?

elaine: well, i don't know.

jerry: oh no, no.

hoyt: well, i don't know what this is going to do that. i was thinking about it.

kramer: oh, no, no.

elaine: what happened?

jerry: well..

elaine: so?

kramer: no! i'm sorry.

hoyt: so, what's the matter with you?

elaine: well, i'm not going to do that.

jerry: what?

elaine: well, i was thinking about a homeless guard.

jerry: what?

george: yeah, i guess i'll see the bathrooms.

jerry: i don't know.

helen: you got a good time, you want to be able to go.

jerry: so, you know what the big pee are gonna do.

jerry: oh!

jerry: hey!

george: hey, hey, hey!

kramer: oh, come on, gentlemen.

george: what happened? i mean, this is not really good.

hoyt: so, what's going on?

kramer: well, you know what the interest is. it's a misprint, and you know what?

jerry: i don't know what they do.

george: i know how it would do about the mood.

hoyt: well, i think it's a real

#### Save the best scripts

Once an interesting script is generated, it's saved to a text file.

In [41]:
# save script to a text file
with open("generated_script_1.txt", "w") as f:
    f.write(generated_script)

# The TV Script is Not Perfect
The TV script clearly doesn't make perfect sense, but it looks like alternating lines of dialogue.

It takes quite a while to get good results. The Seinfeld dataset is about 3.4 MB, which is big enough for this purpose; for script generation, more than 1 MB of text is generally desired.