# TV Script Generation

In this project, I generate my own [Seinfeld](https://en.wikipedia.org/wiki/Seinfeld) TV scripts using a recurrent neural network (RNN). I train on part of the [Seinfeld dataset](https://www.kaggle.com/thec03u5/seinfeld-chronicles#scripts.csv) of scripts from 9 seasons. The neural network generates a new, "fake" TV script based on patterns it recognizes in this training data.

## Get the Data


In [1]:
# load in data
import helper
data_dir = './data/Seinfeld_Scripts.txt'
text = helper.load_data(data_dir)

## Explore the Data

In [2]:
view_line_range = (0, 10)

import numpy as np

print('Dataset Stats')
print('Roughly the number of unique words: {}'.format(len({word: None for word in text.split()})))

lines = text.split('\n')
print('Number of lines: {}'.format(len(lines)))
word_count_line = [len(line.split()) for line in lines]
print('Average number of words in each line: {}'.format(np.average(word_count_line)))

print()
print('The lines {} to {}:'.format(*view_line_range))
print('\n'.join(text.split('\n')[view_line_range[0]:view_line_range[1]]))

Dataset Stats
Roughly the number of unique words: 46367
Number of lines: 109233
Average number of words in each line: 5.544240293684143

The lines 0 to 10:
jerry: do you know what this is all about? do you know, why were here? to be out, this is out...and out is one of the single most enjoyable experiences of life. people...did you ever hear people talking about we should go out? this is what theyre talking about...this whole thing, were all out now, no one is home. not one person here is home, were all out! there are people trying to find us, they dont know where we are. (on an imaginary phone) did you ring?, i cant find him. where did he go? he didnt tell me where he was going. he must have gone out. you wanna go out you get ready, you pick out the clothes, right? you take the shower, you get all ready, get the cash, get your friends, the car, the spot, the reservation...then youre standing around, what do you do? you go we gotta be getting back. once youre out, you wanna get back! y

---
## Implement Pre-processing Functions
Pre-processing is the first step with any dataset. I implement the following pre-processing functions below:
- Lookup Table
- Tokenize Punctuation

### Lookup Table
To create a word embedding, the words first need to be transformed into ids. This function creates two dictionaries:
- A dictionary that maps each word to an id, called `vocab_to_int`
- A dictionary that maps each id back to its word, called `int_to_vocab`

In [3]:
import problem_unittests as tests
from collections import Counter

def create_lookup_tables(text):

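    # text is a list of words; count how often each one occurs
    words = Counter(text)
    # sort by frequency so the most common words get the smallest ids
    vocab = sorted(words, key = words.get, reverse = True)
    vocab_to_int = {word : ii for ii, word in enumerate(vocab)}
    int_to_vocab = {ii : word for ii, word in enumerate(vocab)}
    # return tuple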
    return (vocab_to_int, int_to_vocab)

tests.test_create_lookup_tables(create_lookup_tables)

Tests Passed
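
As a quick illustration of what the two dictionaries look like, here is a minimal sketch on a made-up six-word list. The function is repeated so the block runs on its own; `toy_words` is a hypothetical input, not taken from the dataset.

```python
from collections import Counter

def create_lookup_tables(text):
    # count occurrences, then order words from most to least frequent
    words = Counter(text)
    vocab = sorted(words, key=words.get, reverse=True)
    vocab_to_int = {word: ii for ii, word in enumerate(vocab)}
    int_to_vocab = {ii: word for ii, word in enumerate(vocab)}
    return vocab_to_int, int_to_vocab

# hypothetical toy input (not from the Seinfeld data)
toy_words = ['jerry', 'hello', 'jerry', 'george', 'hello', 'jerry']
vocab_to_int, int_to_vocab = create_lookup_tables(toy_words)

# encode the words as ids, then decode them back
ids = [vocab_to_int[w] for w in toy_words]
decoded = [int_to_vocab[i] for i in ids]
print(vocab_to_int)
print(ids)
```

Because the two dictionaries are inverses of each other, decoding the ids gives back the original word list.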


### Tokenize Punctuation

In [4]:
def token_lookup():
    
    # map each punctuation symbol to a token that can be treated as its own word
    punctuation = {'.': "||Period||", ',': "||Comma||", '"': "||Quotation_Mark||", ';': "||Semicolon||",
                   '!': "||Exclamation_Mark||", '?': "||Question_Mark||", '(': "||Left_Parentheses||",
                   ')': "||Right_Parentheses||", '-': "||Dash||", '\n': "||Return||"
                  }
    
    return punctuation

tests.test_tokenize(token_lookup)

Tests Passed
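
To show why each symbol gets its own token, the sketch below applies such a dictionary to a short made-up line. The body of `helper.preprocess_and_save_data` is not shown in this notebook, so `tokenize` is a hypothetical stand-in. It assumes the helper pads each token with spaces, lowercases the text, and splits on whitespace.

```python
def token_lookup():
    # shortened version of the dictionary above, enough for this example
    return {',': '||Comma||', '!': '||Exclamation_Mark||', '?': '||Question_Mark||'}

def tokenize(text, token_dict):
    # hypothetical stand-in for the helper: surround each symbol's token with spaces
    for symbol, token in token_dict.items():
        text = text.replace(symbol, ' {} '.format(token))
    # lowercase everything, then split on whitespace
    return text.lower().split()

tokens = tokenize('Hey, George!', token_lookup())
print(tokens)
```

With this scheme, "george!" and "george" share one vocabulary entry instead of two. If the helper lowercases after substitution, as assumed here, the tokens stored in the vocabulary are lowercase too, which would explain why the generation code later searches for `token.lower()`.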


## Pre-process all the data and save it

Running the code cell below will pre-process all the data and save it to file.

In [5]:
# pre-process training data
helper.preprocess_and_save_data(data_dir, token_lookup, create_lookup_tables)

In [6]:
import helper
import problem_unittests as tests

int_text, vocab_to_int, int_to_vocab, token_dict = helper.load_preprocess()

## Check for a GPU and Batch the Data


In [7]:
import torch

# Check for a GPU
train_on_gpu = torch.cuda.is_available()
if not train_on_gpu:
    print('No GPU found. Please use a GPU to train your neural network.')

In [8]:
from torch.utils.data import TensorDataset, DataLoader


def batch_data(words, sequence_length, batch_size):

    num_features = len(words) - sequence_length  # number of (sequence, next word) pairs available at this sequence length
    X_train = np.zeros((num_features, sequence_length), dtype = int) # one row per sequence
    y_train = np.zeros(num_features)  # one label per row of X_train
    
    # overwrite the zeros: row i holds words i..i+sequence_length-1, and its label is the word that follows
    for i in range(0, num_features):
        X_train[i] = words[i:i+sequence_length]
        y_train[i] = words[i+sequence_length]
    
    #changing dtype
    feature_array = np.asarray(X_train, np.int64)
    target_array = np.asarray(y_train, np.int64)
    data = TensorDataset(torch.from_numpy(feature_array), torch.from_numpy(target_array))
    dataloader = DataLoader(data, batch_size = batch_size, shuffle = True)

    return dataloader

words = np.array([99,88,77,66,55,44,33,22,11,0])
loader = batch_data(words, 5, 3)
dataiter = iter(loader)
next(dataiter)

[tensor([[ 88,  77,  66,  55,  44],
         [ 66,  55,  44,  33,  22],
         [ 55,  44,  33,  22,  11]]), tensor([ 33,  11,   0])]
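
The Python loop in `batch_data` builds one window per iteration. As an alternative sketch of the same windowing, NumPy's `sliding_window_view` (available from NumPy 1.20) can produce all windows at once. `make_windows` is a hypothetical helper name, and this is not the code that produced the output above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(words, sequence_length):
    words = np.asarray(words, dtype=np.int64)
    # every run of `sequence_length` consecutive words;
    # the last window is dropped because no word follows it
    features = sliding_window_view(words, sequence_length)[:-1]
    # the target of window i is the word right after it
    targets = words[sequence_length:]
    return features, targets

features, targets = make_windows([99, 88, 77, 66, 55, 44, 33, 22, 11, 0], 5)
print(features.shape, targets.shape)
print(features[0], targets[0])
```

`features` is a read-only view into the original array, so copy it (for example with `features.copy()`) before passing it on to code that expects a writable array.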

In [9]:
# test dataloader

test_text = range(50)
t_loader = batch_data(test_text, sequence_length=5, batch_size=10)

data_iter = iter(t_loader)
sample_x, sample_y = next(data_iter)

print(sample_x.shape)
print(sample_x)
print()
print(sample_y.shape)
print(sample_y)
print(len(t_loader))

torch.Size([10, 5])
tensor([[ 18,  19,  20,  21,  22],
        [ 44,  45,  46,  47,  48],
        [ 14,  15,  16,  17,  18],
        [ 32,  33,  34,  35,  36],
        [  0,   1,   2,   3,   4],
        [ 35,  36,  37,  38,  39],
        [ 16,  17,  18,  19,  20],
        [  1,   2,   3,   4,   5],
        [ 24,  25,  26,  27,  28],
        [ 33,  34,  35,  36,  37]])

torch.Size([10])
tensor([ 23,  49,  19,  37,   5,  40,  21,   6,  29,  38])
5


---
## Build the Neural Network

In [10]:
import torch.nn as nn

class RNN(nn.Module):
    
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5):
        
        super(RNN, self).__init__()
        # set class variables
        self.hidden_dim = hidden_dim
        self.output_size = output_size
        self.n_layers = n_layers
        
        #embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        #lstm layer
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=dropout, batch_first=True)
        
        # dropout layer
        self.dropout = nn.Dropout(p=0.2)
        #fully connected layer
        self.fc = nn.Linear(hidden_dim, output_size)
    
    
    def forward(self, nn_input, hidden):
        
        batch_size = nn_input.size(0)
        
        embeds = self.embedding(nn_input)
        lstm_output, hidden = self.lstm(embeds, hidden)
        
        # stack the LSTM outputs so every time step can pass through the fully-connected layer
        lstm_output = lstm_output.contiguous().view(-1, self.hidden_dim)
        
        # apply dropout, then feed the dropped-out activations (not the raw LSTM output) to the fc layer
        output = self.dropout(lstm_output)
        output = self.fc(output)
        output = output.view(batch_size, -1, self.output_size)
        # keep only the scores from the last time step of each sequence
        output = output[:,-1]
        
        # return one batch of output word scores and the hidden state
        return output, hidden
    
    
    def init_hidden(self, batch_size):
        #hidden state of dims (n_layers, batch_size, hidden_dim)
        
        weight = next(self.parameters()).data
        
        # initialize hidden state with zero weights, and move to GPU if available
        if train_on_gpu:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        return hidden



Tests Passed
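
To make the reshaping in `forward` easier to follow, here is a minimal shape walkthrough using standalone layers. The sizes are hypothetical and chosen only to keep the shapes readable; they are not the hyperparameters used for training.

```python
import torch
import torch.nn as nn

# hypothetical small sizes, for illustration only
vocab_size, embedding_dim, hidden_dim, n_layers = 20, 8, 16, 2
batch_size, seq_len = 3, 5

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, batch_first=True)
fc = nn.Linear(hidden_dim, vocab_size)

x = torch.randint(0, vocab_size, (batch_size, seq_len))  # random word ids
hidden = (torch.zeros(n_layers, batch_size, hidden_dim),
          torch.zeros(n_layers, batch_size, hidden_dim))

embeds = embedding(x)                                # (batch, seq_len, embedding_dim)
lstm_out, hidden = lstm(embeds, hidden)              # (batch, seq_len, hidden_dim)
flat = lstm_out.contiguous().view(-1, hidden_dim)    # (batch * seq_len, hidden_dim)
scores = fc(flat).view(batch_size, -1, vocab_size)   # (batch, seq_len, vocab_size)
last = scores[:, -1]                                 # (batch, vocab_size)

for name, t in [('embeds', embeds), ('lstm_out', lstm_out),
                ('flat', flat), ('scores', scores), ('last', last)]:
    print(name, tuple(t.shape))
```

The final slice keeps one row of word scores per sequence, which is the shape `nn.CrossEntropyLoss` expects when each sequence has a single target word.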


### Define forward and backpropagation

In [11]:
def forward_back_prop(rnn, optimizer, criterion, inp, target, hidden):
   
    # move data to GPU, if available
    if train_on_gpu:
        inp, target = inp.cuda(), target.cuda()
    
    # perform backpropagation and optimization
    
    # detach the hidden state from its history so backprop does not run through earlier batches
    hidden = tuple([i.data for i in hidden])
    
    rnn.zero_grad()
    output, hidden = rnn(inp, hidden)
    loss = criterion(output, target)
    loss.backward()
    
    # use gradient clipping to prevent the exploding-gradient problem
    nn.utils.clip_grad_norm_(rnn.parameters(), 5)
    optimizer.step()
    

    # return the loss over a batch and the hidden state produced by our model
    return loss.item(), hidden


Tests Passed
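
The call `nn.utils.clip_grad_norm_(rnn.parameters(), 5)` rescales all gradients together whenever their combined norm exceeds 5. The NumPy sketch below illustrates that idea with made-up gradients. It is a simplified illustration, not PyTorch's implementation, and `clip_by_global_norm` is a hypothetical name.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # one norm over all gradients together, not one per array
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    # shrink only when the norm exceeds the limit; never scale up
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

# hypothetical gradients with global norm sqrt(36 + 64 + 0 + 576) = 26
grads = [np.array([6.0, 8.0]), np.array([[0.0, 24.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(norm_before, norm_after)
```

Every gradient is multiplied by the same factor, so the update keeps its direction and only its length shrinks.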


## Neural Network Training

With the structure of the network complete and the data ready to be fed into it, it's time to train.

In [12]:
def train_rnn(rnn, batch_size, optimizer, criterion, n_epochs, show_every_n_batches=100):
    batch_losses = []
    
    rnn.train()

    print("Training for %d epoch(s)..." % n_epochs)
    for epoch_i in range(1, n_epochs + 1):
        
        # initialize hidden state
        hidden = rnn.init_hidden(batch_size)
        
        for batch_i, (inputs, labels) in enumerate(train_loader, 1):
            
            # making sure to iterate over completely full batches, only
            n_batches = len(train_loader.dataset)//batch_size
            if(batch_i > n_batches):
                break
            
            # forward, back prop
            loss, hidden = forward_back_prop(rnn, optimizer, criterion, inputs, labels, hidden)          
            # record loss
            batch_losses.append(loss)

            # printing loss stats
            if batch_i % show_every_n_batches == 0:
                print('Epoch: {:>4}/{:<4}  Loss: {}\n'.format(
                    epoch_i, n_epochs, np.average(batch_losses)))
                batch_losses = []

    # returns a trained rnn
    return rnn
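
The loop stops after the last full batch because the hidden state created by `init_hidden(batch_size)` has a fixed batch dimension, and a smaller final batch would not match it. A quick sketch of the arithmetic, using the sizes from the dataloader test above (45 samples, batch size 10); `count_batches` is a hypothetical helper.

```python
import math

def count_batches(n_samples, batch_size):
    # batches the loader yields (the last one may be partial)
    total = math.ceil(n_samples / batch_size)
    # batches that are completely full
    full = n_samples // batch_size
    return total, full

total, full = count_batches(n_samples=45, batch_size=10)
print(total, full)
```

PyTorch's `DataLoader` also accepts `drop_last=True`, which skips the partial batch at the loader level and would make the manual check unnecessary.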

### Hyperparameters

In [13]:
# Data params
# Sequence Length - [5, 7, 10, 15, 20, 25]
sequence_length = 15  # of words in a sequence
# Batch Size - [64, 128, 256]
batch_size = 128

# data loader - do not change
train_loader = batch_data(int_text, sequence_length, batch_size)

In [14]:
# Training parameters
# Number of Epochs
num_epochs = 25
# Learning Rate - [0.001 - 0.005, 0.01, 0.1]
learning_rate = 0.001

# Model parameters
# Vocab size
vocab_size = len(int_to_vocab)
# Output size
output_size = vocab_size
# Embedding Dimension - [200, 400, 600]
embedding_dim = 128
# Hidden Dimension - [300, 500, 1000]
hidden_dim = 500
# Number of RNN Layers - [2,3]
n_layers = 2

# Show stats for every n number of batches
show_every_n_batches = 500

### Train

In [15]:
import time
t0 = time.time()


from workspace_utils import active_session
with active_session():


    # create model and move to gpu if available
    rnn = RNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5)
    if train_on_gpu:
        rnn.cuda()

    # defining loss and optimization functions for training
    optimizer = torch.optim.Adam(rnn.parameters(), lr=learning_rate)
    criterion = nn.CrossEntropyLoss()

    # training the model
    trained_rnn = train_rnn(rnn, batch_size, optimizer, criterion, num_epochs, show_every_n_batches)

    # saving the trained model
    helper.save_model('./save/trained_rnn', trained_rnn)
    print('Model Trained and Saved')
    
    
t1 = time.time()
print('Time taken:', (t1-t0)/3600, 'hours')

Training for 25 epoch(s)...
Epoch:    1/25    Loss: 5.423726204872131

Epoch:    1/25    Loss: 4.816921296596527

Epoch:    1/25    Loss: 4.5948448047637935

Epoch:    1/25    Loss: 4.465704329490662

Epoch:    1/25    Loss: 4.394996451377868

Epoch:    1/25    Loss: 4.312475248336792

Epoch:    1/25    Loss: 4.285950607776642

Epoch:    1/25    Loss: 4.2121966876983645

Epoch:    1/25    Loss: 4.203113260269165

Epoch:    1/25    Loss: 4.1777085647583005

Epoch:    1/25    Loss: 4.095357491016388

Epoch:    1/25    Loss: 4.106398698806763

Epoch:    1/25    Loss: 4.0812316989898685

Epoch:    2/25    Loss: 3.9684305444467425

Epoch:    2/25    Loss: 3.881323283672333

Epoch:    2/25    Loss: 3.8777885551452638

Epoch:    2/25    Loss: 3.867604364871979

Epoch:    2/25    Loss: 3.85935231590271

Epoch:    2/25    Loss: 3.829647635936737

Epoch:    2/25    Loss: 3.855135422706604

Epoch:    2/25    Loss: 3.860082482814789

Epoch:    2/25    Loss: 3.8412785549163817

...


Model Trained and Saved
Time taken: 5.182297595010864 hours
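
One way to read these loss values: `nn.CrossEntropyLoss` reports the mean negative log-likelihood using the natural logarithm, so exponentiating the loss gives the perplexity. Perplexity is roughly the number of equally likely words the model is choosing between at each step. The sketch below converts two values copied from the log above.

```python
import math

# two values copied from the training log above
first_loss = 5.423726204872131
later_loss = 3.8412785549163817

perplexities = []
for loss in (first_loss, later_loss):
    perplexity = math.exp(loss)
    perplexities.append(perplexity)
    print('loss {:.3f} -> perplexity {:.1f}'.format(loss, perplexity))
```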


---
# Checkpoint

In [16]:
import torch
import helper
import problem_unittests as tests

_, vocab_to_int, int_to_vocab, token_dict = helper.load_preprocess()
trained_rnn = helper.load_model('./save/trained_rnn')

## Generate TV Script
With the network trained and saved, I can use it to generate a new, "fake" Seinfeld TV script.

### Generate Text
To generate text, the network starts from a single word and repeatedly feeds its own predictions back in until it reaches a set length. The `generate` function takes a word id to start with, `prime_id`, and produces `predict_len` further words. It uses top-k sampling to introduce some randomness: instead of always taking the single most likely next word, it samples from the `top_k` highest-scoring words.

In [17]:
import torch.nn.functional as F

def generate(rnn, prime_id, int_to_vocab, token_dict, pad_value, predict_len=100):
    
    rnn.eval()
    
    # create a sequence (batch_size=1) with the prime_id
    current_seq = np.full((1, sequence_length), pad_value)
    current_seq[-1][-1] = prime_id
    predicted = [int_to_vocab[prime_id]]
    
    for _ in range(predict_len):
        # build a tensor from the numpy sequence; current_seq itself stays a numpy array
        # so that np.roll below also works when the model runs on the GPU
        seq_tensor = torch.LongTensor(current_seq)
        if train_on_gpu:
            seq_tensor = seq_tensor.cuda()
        
        # initialize the hidden state
        hidden = rnn.init_hidden(seq_tensor.size(0))
        
        # get the output of the rnn
        output, _ = rnn(seq_tensor, hidden)
        
        # get the next word probabilities
        p = F.softmax(output, dim=1).data
        if(train_on_gpu):
            p = p.cpu() # move to cpu
         
        # use top_k sampling to get the index of the next word
        top_k = 5
        p, top_i = p.topk(top_k)
        top_i = top_i.numpy().squeeze()
        
        # select the likely next word index with some element of randomness
        p = p.numpy().squeeze()
        word_i = np.random.choice(top_i, p=p/p.sum())
        
        # retrieve that word from the dictionary
        word = int_to_vocab[word_i]
        predicted.append(word)     
        
        # the generated word becomes the next "current sequence" and the cycle can continue
        current_seq = np.roll(current_seq, -1, 1)
        current_seq[-1][-1] = word_i
    
    gen_sentences = ' '.join(predicted)
    
    # Replace punctuation tokens
    for key, token in token_dict.items():
        gen_sentences = gen_sentences.replace(' ' + token.lower(), key)
    gen_sentences = gen_sentences.replace('\n ', '\n')
    gen_sentences = gen_sentences.replace('( ', '(')
    
    # return all the sentences
    return gen_sentences
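
The sampling step inside `generate` can be looked at in isolation. The following is a minimal NumPy sketch of top-k sampling under made-up word scores; `sample_top_k` is a hypothetical helper, and the random generator is seeded only to make the sketch repeatable.

```python
import numpy as np

def sample_top_k(scores, k, rng):
    # softmax over the raw scores (shifted by the max for numerical stability)
    exp = np.exp(scores - np.max(scores))
    probs = exp / exp.sum()
    # indices of the k most probable words
    top_i = np.argsort(probs)[-k:]
    # renormalize the k probabilities so they sum to 1, then draw one index
    top_p = probs[top_i] / probs[top_i].sum()
    return int(rng.choice(top_i, p=top_p))

rng = np.random.default_rng(0)
scores = np.array([2.0, -1.0, 0.5, 3.0, 0.0, 1.0])  # hypothetical word scores
samples = [sample_top_k(scores, k=3, rng=rng) for _ in range(200)]
print(sorted(set(samples)))
```

Only the three highest-scoring indices (3, 0, and 5 here) can ever be drawn. A small `k` keeps the output close to the most likely continuation, while a larger `k` adds variety at the cost of more unlikely words.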

### Generate a New Script
It's time to generate the text.

In [18]:
gen_length = 400 
prime_word = 'jerry' # name for starting the script

pad_word = helper.SPECIAL_WORDS['PADDING']
generated_script = generate(trained_rnn, vocab_to_int[prime_word + ':'], int_to_vocab, token_dict, vocab_to_int[pad_word], gen_length)
print(generated_script)



jerry: shrubs.)

jerry: hey, george!

george: shhh!(he runs back and forth between a hug)

elaine:(to jerry) what is this? a little test?

jerry: oh, no.

elaine: i mean, why not?

jerry: well i was just admiring your skin.

kramer: hey.

jerry: hey, how's your father?

george:(muttering) i don't know..(sees the flies) it's not safe. i don't want to see the milk anymore. i'm not going to commit suicide.

jerry: i don't understand. maybe i can figure out why would i be friends to do anything...

jerry: i don't think you want to do this.

elaine: well, i was just leaving.

jerry: well if you think about this...

george: well, you know...

kramer:(quietly) i said cubans, cosmo kramer.

jerry: oh, no, not really.

george: what do you mean?

jerry: because i'm not going to punch him........ and i screamed at it.

elaine: what?!

jerry: well i just got it, too many misunderstandings. i can't take it. it's hard to be used to the cabin.

jerry:(handing over the phone) puddy, you're gonna take 

#### Save your favorite scripts

Once you have a script that you like (or find interesting), save it to a text file!

In [19]:
# save script to a text file
with open("generated_script_1.txt", "w") as f:
    f.write(generated_script)