# Story Plot Generator

## Problem Statement

**Given a text corpus of movie plots data, create a model that generates new (artificial) plots and deploy it.**

The source dataset for this project was taken from [here](<http://www.cs.cmu.edu/~ark/personas/data/MovieSummaries.tar.gz>). We will look into scraping Wikipedia to create a new dataset in a later post.

In [1]:
# importing libraries
import os
import re
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam
from torch.utils.data import DataLoader, TensorDataset
from collections import Counter

from IPython.core.debugger import set_trace

## Data Preparation

### Reading Data

First, we read in the data and combine all plots into a single corpus.

In [2]:
# preparing the corpus
# data = pd.read_csv(os.path.join('data', 'input', 'MovieSummaries', 'plot_summaries.txt'), 
#                    sep='\t', names=['id', 'plot'])
# data = data[:500]
# data = list(data['plot'])
# corpus = ' '.join(data)
with open(os.path.join('data', 'input', 'Seinfeld_Scripts.txt'), 'r') as file:
    corpus = file.read()
print(corpus[:500])

jerry: do you know what this is all about? do you know, why were here? to be out, this is out...and out is one of the single most enjoyable experiences of life. people...did you ever hear people talking about we should go out? this is what theyre talking about...this whole thing, were all out now, no one is home. not one person here is home, were all out! there are people trying to find us, they dont know where we are. (on an imaginary phone) did you ring?, i cant find him. where did he go? he d


In [3]:
class Vocabulary():
    def __init__(self, corpus):
        self.word2count = {}
        self.word2index = {}
        self.index2word = {0: '<PADDING>'}
        self.num_words = 1 # <PADDING>
        self.min_count = 3
        self.corpus = corpus
        
        # call create_vocab on object creation
        self.create_vocab(corpus)
        
    # replace punctuations
    def remove_punctuations(self, text):
        for punct in '!"#$%&().*+,-./:;<=>?@[\\]^_`{|}~\t\n\r':
            text = text.replace(punct, ' ')
        return text

    # preprocess text and return list of words
    def preprocess_text(self, text):
        text = text.strip()
        text = text.lower()
        text = self.remove_punctuations(text)
        text = re.sub(r'\s+', ' ', text).strip()
        text = text.split(' ')
        return text
    
    # add each word into the dictionaries
    def add_word(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.num_words
            self.index2word[self.num_words] = word
            self.word2count[word] = 1
            self.num_words += 1
        else:
            self.word2count[word] += 1        
    
    # preprocess data and create a list of words
    # then add each word to the dictionaries
    def create_vocab(self, text):
        # preprocess corpus
        self.corpus = self.preprocess_text(text)
        
        # add words to dictionaries
        for word in self.corpus:
            self.add_word(word)

In [4]:
# create Vocabulary object
vocab = Vocabulary(corpus)

print(f'Size of corpus = {len(corpus.split())}')
print(f'Number of Unique words in corpus = {len(vocab.word2index)}\n')

print(f'Sample Corpus: \n{vocab.corpus[:50]}')

Size of corpus = 605614
Number of Unique words in corpus = 19863

Sample Corpus: 
['jerry', 'do', 'you', 'know', 'what', 'this', 'is', 'all', 'about', 'do', 'you', 'know', 'why', 'were', 'here', 'to', 'be', 'out', 'this', 'is', 'out', 'and', 'out', 'is', 'one', 'of', 'the', 'single', 'most', 'enjoyable', 'experiences', 'of', 'life', 'people', 'did', 'you', 'ever', 'hear', 'people', 'talking', 'about', 'we', 'should', 'go', 'out', 'this', 'is', 'what', 'theyre', 'talking']


In [5]:
# convert each word to numerical token using word2index
def encode_corpus(vocab):
    encoded_corpus = []
    for word in vocab.corpus:
        encoded_corpus.append(vocab.word2index[word])
    return encoded_corpus

encoded_tokens = encode_corpus(vocab)
print(f'Encoded Tokens: \n{encoded_tokens[:50]}')

Encoded Tokens: 
[1, 2, 3, 4, 5, 6, 7, 8, 9, 2, 3, 4, 10, 11, 12, 13, 14, 15, 6, 7, 15, 16, 15, 7, 17, 18, 19, 20, 21, 22, 23, 18, 24, 25, 26, 3, 27, 28, 25, 29, 9, 30, 31, 32, 15, 6, 7, 5, 33, 29]


In [6]:
# convert the list of tokens into sequence data
# extract features and labels from the sequence data as numpy arrays
def create_training_data(encoded_tokens, sequence_length=10, batch_size=50):
    # a sequence of length 20 has 19 features and one target
    # we need to get full batches containing full sequences
    n_tokens = len(encoded_tokens)
    n_possible_sequences = n_tokens - sequence_length + 1
    n_batches = n_possible_sequences // batch_size
    n_sequences = n_batches * batch_size
    
    features = []
    labels = []
    for ii in range(n_sequences):
        feature = encoded_tokens[ii : ii+sequence_length-1]
        label = encoded_tokens[ii+sequence_length]
        features.append(feature)
        labels.append(label)
    
    # convert to numpy array
    features, labels = np.array(features), np.array(labels)
    
    # convert features and labels data to Long Tensor
    features, labels = torch.from_numpy(features), torch.from_numpy(labels)
    features, labels = features.long(), labels.long()
    
    # convert features and labels data into DataLoader for easy batching
    training_data = TensorDataset(features, labels)
    training_data = DataLoader(training_data, batch_size=batch_size)
    
    return training_data

In [7]:
# get training data
input_dataloader = create_training_data(encoded_tokens, sequence_length=10, batch_size=15)

batch_x, batch_y = next(iter(input_dataloader))
print(f'Feature Tensor: \n{batch_x}\n')
print(f'Label Tensor: \n{batch_y}')

Feature Tensor: 
tensor([[ 1,  2,  3,  4,  5,  6,  7,  8,  9],
        [ 2,  3,  4,  5,  6,  7,  8,  9,  2],
        [ 3,  4,  5,  6,  7,  8,  9,  2,  3],
        [ 4,  5,  6,  7,  8,  9,  2,  3,  4],
        [ 5,  6,  7,  8,  9,  2,  3,  4, 10],
        [ 6,  7,  8,  9,  2,  3,  4, 10, 11],
        [ 7,  8,  9,  2,  3,  4, 10, 11, 12],
        [ 8,  9,  2,  3,  4, 10, 11, 12, 13],
        [ 9,  2,  3,  4, 10, 11, 12, 13, 14],
        [ 2,  3,  4, 10, 11, 12, 13, 14, 15],
        [ 3,  4, 10, 11, 12, 13, 14, 15,  6],
        [ 4, 10, 11, 12, 13, 14, 15,  6,  7],
        [10, 11, 12, 13, 14, 15,  6,  7, 15],
        [11, 12, 13, 14, 15,  6,  7, 15, 16],
        [12, 13, 14, 15,  6,  7, 15, 16, 15]])

Label Tensor: 
tensor([ 3,  4, 10, 11, 12, 13, 14, 15,  6,  7, 15, 16, 15,  7, 17])


We have completed the data preparation phase. Lets now move on to model creation and training.

## Model Creation

In [8]:
class Net(nn.Module):
    
    def __init__(self, vocab_size, embedding_dim=100, hidden_dim=1024, n_layers=2, dropout=0.5):
        
        super(Net, self).__init__()
        
        # set class variables
        self.vocab_size = vocab_size
        self.output_size = vocab_size # output_size is the same as vocab_size
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        
        # define layers
        self.embed = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, batch_first=True, dropout=dropout)
        self.fc = nn.Linear(hidden_dim, self.output_size)
        
    def forward(self, batch_input):

        batch_size = batch_input.size(0)
        
        embed_out = self.embed(batch_input)
        lstm_out, _ = self.lstm(embed_out)
        fc_out = self.fc(lstm_out)
        
        # reshape into (batch_size, seq_length, output_size)
        fc_out = fc_out.view(batch_size, -1, self.output_size)
        
        # get last values of each sequence. 
        # size = (batch_size, output_size) 
        # the output of each sequence is of size output_size = vocab_size
        output = fc_out[:, -1]
        
        # return one batch of output word scores
        return output

In [9]:
# model summary
model = Net(vocab_size=len(vocab.word2index))
print(model)

Net(
  (embed): Embedding(19863, 100)
  (lstm): LSTM(100, 1024, num_layers=2, batch_first=True, dropout=0.5)
  (fc): Linear(in_features=1024, out_features=19863, bias=True)
)


## Model Training

In [10]:
# forward and backward pass
def train(model, epochs, optimizer, loss_funtion, device):
    for epoch in range(1, epochs+1):
        model.train()
        total_loss = 0
        
        for idx, batch in enumerate(input_dataloader):
            batch_x, batch_y = batch
            batch_x, batch_y = batch_x.to(device), batch_y.to(device)
            
            model.zero_grad()
            output = model(batch_x)
            loss = loss_funtion(output, batch_y)
            loss.backward()
            optimizer.step()
            
            total_loss += loss.data.item()
            
            if (idx+1) % 100 == 0:
                print(f'-- Epoch: {epoch}/{epochs}, Loss: {total_loss/(idx+1)}')
            
        print(f'Epoch: {epoch}, Avg. Batch Loss:{total_loss/(idx+1)}')
    
    return model

In [11]:
# set parameteres for batching
sequence_length = 10
batch_size = 128

# set parameters for training
epochs = 20
vocab_size = len(vocab.word2index)
embedding_dim = 300
hidden_dim = 1024
n_layers = 2
dropout = 0.5
learning_rate = 0.0001

# set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [12]:
# get encoded tokens
encoded_tokens = encode_corpus(vocab)

# get training data
input_dataloader = create_training_data(encoded_tokens, sequence_length, batch_size)

# instantiate model and send to model
model = Net(vocab_size, embedding_dim, hidden_dim, n_layers, dropout=dropout).to(device)

In [13]:
# set optimizer and loss function
optimizer = Adam(model.parameters(), lr=learning_rate)
loss_function = nn.CrossEntropyLoss()

In [14]:
# train model
model = train(model, epochs, optimizer, loss_function, device)

-- Epoch: 1/20, Loss: 7.779703888893128
-- Epoch: 1/20, Loss: 7.123715350627899
-- Epoch: 1/20, Loss: 6.910825576782226
-- Epoch: 1/20, Loss: 6.873457415103912
-- Epoch: 1/20, Loss: 6.81625286769867
-- Epoch: 1/20, Loss: 6.793596826394399
-- Epoch: 1/20, Loss: 6.749405933107648
-- Epoch: 1/20, Loss: 6.724829170107841
-- Epoch: 1/20, Loss: 6.696787780655755
-- Epoch: 1/20, Loss: 6.679587941169739
-- Epoch: 1/20, Loss: 6.664028927629644
-- Epoch: 1/20, Loss: 6.644272829294205
-- Epoch: 1/20, Loss: 6.639522865735567
-- Epoch: 1/20, Loss: 6.6302850931031365
-- Epoch: 1/20, Loss: 6.624685676892598
-- Epoch: 1/20, Loss: 6.624357369244098
-- Epoch: 1/20, Loss: 6.620013927571914
-- Epoch: 1/20, Loss: 6.614863878356086
-- Epoch: 1/20, Loss: 6.6145558841604934
-- Epoch: 1/20, Loss: 6.617357450723648
-- Epoch: 1/20, Loss: 6.614528046562558
-- Epoch: 1/20, Loss: 6.613967206261375
-- Epoch: 1/20, Loss: 6.614150090632231
-- Epoch: 1/20, Loss: 6.612934514085452
-- Epoch: 1/20, Loss: 6.611910195922851

-- Epoch: 5/20, Loss: 5.504362803527287
-- Epoch: 5/20, Loss: 5.5075583788553875
-- Epoch: 5/20, Loss: 5.511598843634129
-- Epoch: 5/20, Loss: 5.514033405079561
-- Epoch: 5/20, Loss: 5.515080654621125
-- Epoch: 5/20, Loss: 5.519378108727304
-- Epoch: 5/20, Loss: 5.52396782875061
-- Epoch: 5/20, Loss: 5.524882966450282
-- Epoch: 5/20, Loss: 5.526963618235155
-- Epoch: 5/20, Loss: 5.529897459693577
-- Epoch: 5/20, Loss: 5.531869138479233
-- Epoch: 5/20, Loss: 5.532787777900696
-- Epoch: 5/20, Loss: 5.534234920831827
-- Epoch: 5/20, Loss: 5.530354097507618
-- Epoch: 5/20, Loss: 5.530872511863708
-- Epoch: 5/20, Loss: 5.531938068784516
-- Epoch: 5/20, Loss: 5.52896101029714
-- Epoch: 5/20, Loss: 5.5297551664229365
-- Epoch: 5/20, Loss: 5.529952114373446
-- Epoch: 5/20, Loss: 5.53269480344021
-- Epoch: 5/20, Loss: 5.536108456499436
-- Epoch: 5/20, Loss: 5.537040718759809
-- Epoch: 5/20, Loss: 5.538418718443976
-- Epoch: 5/20, Loss: 5.5395874547958375
-- Epoch: 5/20, Loss: 5.544767344876339


KeyboardInterrupt: 

In [15]:
# save model to disk
torch.save({'state_dict': model.state_dict()}, \
           os.path.join('data', 'output', f'trained_model_bs{batch_size}_sq{sequence_length}.pt'))

In [16]:
def generate_text(model, prime_word, generate_len=100):
    
    model.eval()
    
    # start with prime word
    generated_words = [prime_word]
    
    # create new sequence with shape (1, sequence_length)
    # fill with 0 = <PADDING>
    gen_sequence = np.full(shape=(1, sequence_length), fill_value=0)
    
    # replace last value of gen_sequence with encoded prime_word
    gen_sequence[-1][-1] = vocab.word2index[prime_word]
    
    # generate new words generate_len times
    for _ in range(generate_len):
    
        # run the sequence through the model to get next word
        gen_sequence = torch.from_numpy(gen_sequence).long()
        output = model(gen_sequence)
        
        # convert raw output values into probabilities
        prob = F.softmax(output, dim=1).data
        prob = prob.cpu()

        # get top_k probable output words and index
        top_k = 5
        top_k_prob, top_k_idx = prob.topk(top_k)
        top_k_prob, top_k_idx = top_k_prob.numpy().squeeze(), top_k_idx.numpy().squeeze()

        # get the next words choosing according to probability from top_k words
        word_idx = np.random.choice(top_k_idx, p=top_k_prob/np.sum(top_k_prob))

        # roll the sequence by 1 in negative direction
        # and append the new word to end of sequence
        gen_sequence = np.roll(gen_sequence, shift=-1, axis=1)
        gen_sequence[-1][-1] = word_idx

        # append new word to generated words
        generated_words.append(vocab.index2word[word_idx])
        
    
    # convert generated_words list to text
    generated_text = ' '.join(generated_words)
    
    return generated_text

In [18]:
generate_len = 100
prime_word = 'george'
generated_text = generate_text(model.cpu(), prime_word, generate_len)
print(generated_text)

george you what i you to a jerry i i you you i you a you a george a jerry a george jerry a of george yeah a of a jerry a of george jerry what that you me you me the jerry george i i i i i a in i i you you what is about george i know you to her on phone what jerry elaine george you me i you i you what is george i i know i i i i to you i a and jerry know i a elaine i know know is to


The text generated here is complete gibberish. To get better results we need to do one or more of the following:

- **Larger input data** - We have only taken a small subset of the movie plots dataset. Training on a larger corpus of text will produce better results. But more input data will lead to more unique words which will increase the size of our model. Depending on GPU memory available, very large models may not fit in the GPU.
- **More number of epochs** - We trained our model for very few epochs. To get better results, the model need to be trained for more epochs. The downside is that as we increase the number of epochs, our model training time will also increase.
- **More complex model** - Our model was very simple with only a few layers. We can increase model complexity to capture more complex relationship between data. But increasing the model complexity will also increase the model size. This will again cause memory issues in GPU.
- **Trim rare words** - Our corpus has many words that are only used once or twice in the entire corpus. But each extra word adds mre compleity to the model, making the model bigger and causing memory issues. We can implement a process that remove the rarerely used words along with the sequences that use these rare words. That will improve our model performance significantly.
- **Initialize hidden weights** - We have used default weight initialization for our model. We can explicitly define a `init_hidden()` function in the model definition. Better weight initialization can make help the model converge faster.

There is one way we can get much better results without facing most of the above issues - **Transfer Learning**. We can use a pre-trained model to hasten the training process and get better results. We will look into Transfer Learning in a future post.