# CS4120 Final Project -- Generating *How It's Made* Scripts
*By Nathaniel Gordon*



## Contents

- [Generating the Dataset](#generating_the_dataset)
- [Character-Level Text Generation With an RNN](#rnn)
- [Text Generation with gpt-2-simple](#gpt2)

**Important Note:** The GPT-2 model was originally run in a Google Colab instance. To run it in this notebook, make sure you are using Python 3.6, as it is the latest version to support the dependencies of `gpt-2-simple` (namely, TensorFlow 1.15.2). These dependencies are not compatible with the requirements for the RNN section.

A better alternative to running the GPT-2 code in this notebook is to make a copy of [this Google Colab notebook](https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce), upload the [compiled dataset](#generating_the_dataset) to your Google Drive, and copy [my parameters](#gpt2) into the appropriate cells.

<a id='generating_the_dataset'></a>
## Generating the Dataset

The dataset I crafted for this project is included as a .zip archive of individual script files. Each script contains the text of one segment of a How It's Made episode, with the filename being the topic of focus. The cell below concatenates the files into a single text file, wrapping each script in `<BOS>` and `<EOS>` tokens.

In [None]:
# Imports
import re
import os
import random
from zipfile import ZipFile

In [None]:
# Location of the dataset
dataset_path = "../data/HIM_scripts_dataset.zip"

# Key tokens marking the beginning and end of each script
bos_token = '<BOS>'
eos_token = '<EOS>'

# Open the zip file in read mode
with ZipFile(dataset_path, 'r') as archive:
    script_files = archive.namelist()
    random.shuffle(script_files)
    data = ''

    # Iterate through each file in the archive
    for path in script_files:
        text = archive.read(path).decode("utf-8").strip()
        # Collapse runs of spaces into a single space
        text = re.sub(r"  +", " ", text)

        # Wrap the script in the "beginning of script" and "end of script" tokens
        data += bos_token + ' ' + text + ' ' + eos_token + '\n\n'

    # Write to destination file
    with open('../data/compiled_scripts.txt', 'w') as file:
        file.write(data)
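
As a quick sanity check on the compiled file, the sketch below splits text in that format back into individual scripts. It assumes the `<BOS>`/`<EOS>` delimiters written by the cell above; `split_scripts` and the example string are hypothetical and not part of the original pipeline.

```python
import re

def split_scripts(compiled, bos='<BOS>', eos='<EOS>'):
    """Return the text between each bos/eos pair, stripped of surrounding whitespace."""
    pattern = re.escape(bos) + r'(.*?)' + re.escape(eos)
    return [match.strip() for match in re.findall(pattern, compiled, flags=re.DOTALL)]

# Made-up text in the same layout as compiled_scripts.txt
example = '<BOS> First script. <EOS>\n\n<BOS> Second script. <EOS>\n\n'
scripts = split_scripts(example)
print(len(scripts), scripts)
```

Running the same helper on the real file should return one entry per file in the archive.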

<a id='rnn'></a>
## Character-Level Text Generation With an RNN


In [None]:
# Imports
from keras.layers import Embedding, LSTM, Dense, Dropout, Activation
from keras.callbacks import EarlyStopping, LambdaCallback
from keras.models import Sequential
import keras.models

import numpy as np
import string, os 
import random
import sys

import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

In [None]:
# Obtain the compiled text
with open('../data/compiled_scripts.txt', 'r') as file:
    input_text = file.read()
        
print(input_text[:500])

In [None]:
# Store all tokens present in the sample text (we will examine novel tokens later)
from nltk.tokenize import word_tokenize
all_tokens = list(set(word_tokenize(input_text)))

print(all_tokens[:50])

To perform character-level text generation, an encoding is derived from the set of characters present in the text. Input sequences are then cut from the training data at a fixed length and stride; a length of 40 characters with a stride of 3 worked fairly well for me. Each input is a sequence of encoded characters, and its label is the character that immediately follows it.

In [None]:
# Set up character encodings
chars = sorted(set(input_text))
print('corpus length: ', len(input_text))
print('total chars: ', len(chars))

char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

In [None]:
# Create input sequences

# Parameters
maxlen = 40   # length of inputs
step = 3      # offset of input vectors (overlap of maxlen - step)

sentences = []
next_chars = []
for i in range(0, len(input_text) - maxlen, step):
    sentences.append(input_text[i: i + maxlen])
    next_chars.append(input_text[i + maxlen])
print('nb sequences:', len(sentences))
print(sentences[:10])

In [None]:
# Encode inputs
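# One-hot encode each sequence; the builtin bool replaces the np.bool alias, which newer NumPy releases no longer provide
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)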
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
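
To make the windowing and one-hot encoding concrete, here is a minimal sketch of the same procedure on a toy string. The `toy_` names and the window length of 5 are illustrative assumptions chosen so the result is easy to inspect; they are not the values used for training.

```python
import numpy as np

toy_text = "how it's made"
toy_maxlen, toy_step = 5, 3   # illustrative values, much smaller than 40 and 3

# Character-to-index encoding built from the characters present in the text
toy_chars = sorted(set(toy_text))
toy_index = {c: i for i, c in enumerate(toy_chars)}

# Slide a window over the text; the label is the character right after the window
windows, targets = [], []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    windows.append(toy_text[i:i + toy_maxlen])
    targets.append(toy_text[i + toy_maxlen])

# One-hot encode the windows and their labels
toy_x = np.zeros((len(windows), toy_maxlen, len(toy_chars)), dtype=bool)
toy_y = np.zeros((len(windows), len(toy_chars)), dtype=bool)
for i, window in enumerate(windows):
    for t, char in enumerate(window):
        toy_x[i, t, toy_index[char]] = True
    toy_y[i, toy_index[targets[i]]] = True

print(list(zip(windows, targets)))
print(toy_x.shape, toy_y.shape)
```

Each row of `toy_x` holds one window with exactly one active entry per time step, which is the layout the first LSTM layer expects.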

The network architecture I settled on uses 3 LSTM layers, each followed by a dropout layer to reduce overfitting. The output layer uses a softmax activation, and the Adam optimizer is used to minimize the categorical cross-entropy loss.

In [None]:
# Set up RNN with 3 layers of LSTM nodes
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars)), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(128))
model.add(Dropout(0.2))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))
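
For a rough sense of the model's size, the sketch below counts trainable parameters for this layer stack using the standard LSTM formula (four gates, each with input weights, recurrent weights, and a bias). The vocabulary size of 60 is a hypothetical placeholder for `len(chars)`, and the helper names are my own; `model.summary()` remains the authoritative count.

```python
def lstm_params(units, input_dim):
    """Trainable parameters in a standard LSTM layer."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(units, input_dim):
    """Trainable parameters in a fully connected layer with a bias."""
    return units * input_dim + units

vocab_size = 60  # hypothetical; the real value is len(chars)

layer_counts = [
    lstm_params(128, vocab_size),   # first LSTM layer reads one-hot characters
    lstm_params(256, 128),          # second LSTM layer
    lstm_params(128, 256),          # third LSTM layer
    dense_params(vocab_size, 128),  # output layer, one unit per character
]
print(layer_counts, sum(layer_counts))
```

Dropout and activation layers add no trainable parameters, so they are left out of the count.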

In [None]:
# Compile model
model.compile(loss='categorical_crossentropy', optimizer="adam")

In [None]:
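# Text generation preview helper: draws one character index from the predicted distribution
#
# Arguments:
#   preds -- an array of probabilities
#   temperature -- sampling temperature; values below 1 make output more predictable, values above 1 more varied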
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
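
The effect of the temperature is easier to see without the random draw. The sketch below applies the same rescaling as `sample` to a made-up three-character distribution; `apply_temperature` and `toy_preds` are illustrative names, not part of the notebook.

```python
import numpy as np

def apply_temperature(preds, temperature=1.0):
    """Rescale a probability vector the same way `sample` does, without drawing."""
    logits = np.log(np.asarray(preds, dtype='float64')) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)

toy_preds = [0.6, 0.3, 0.1]   # made-up distribution over three characters
for temperature in [0.2, 1.0, 1.2]:
    print(temperature, np.round(apply_temperature(toy_preds, temperature), 3))
```

A temperature of 1.0 leaves the distribution unchanged, lower values concentrate probability on the most likely character, and higher values flatten it.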

In [None]:
# Callback function invoked at end of each epoch. Prints generated sample text at a variety of model temperatures.
def on_epoch_end(epoch, logs):

    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(input_text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)
        generated = ''
        sentence = input_text[start_index: start_index + maxlen]
        generated += sentence
        
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()
        
        print('----- Novel words')
        novel_words = []
        generated_tokens = list(set(word_tokenize(generated)))
        
        for tok in generated_tokens:
            if tok not in all_tokens:
                novel_words.append(tok)
                
        print(novel_words)
        
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

In [None]:
# Callback function for saving model checkpoint
from keras.callbacks import ModelCheckpoint
saved_weights_path = "../../Project/models/RNN/weights.hdf5"
checkpoint = ModelCheckpoint(saved_weights_path, monitor='loss',
                             verbose=1, save_best_only=True,
                             mode='min')

In [None]:
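# Callback that lowers the learning rate when the training loss stops improving.
# min_lr must sit below the starting rate (Adam defaults to 0.001) for the reduction to take effect.
from keras.callbacks import ReduceLROnPlateau
reduce_lr = ReduceLROnPlateau(monitor='loss', factor=0.2,
                              patience=1, min_lr=1e-5)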

In [None]:
# Fit the model
callbacks = [print_callback, checkpoint, reduce_lr]

# If running from a saved checkpoint:
#model = keras.models.load_model(saved_weights_path)

model.fit(x, y, batch_size=64, epochs=20, callbacks=callbacks)

In [None]:
# Use model to generate text from a random seed
#
# Arguments:
#   length -- the number of characters to generate
#   diversity -- sampling temperature; higher values make less likely characters more probable
def generate_text(length, diversity):
    generated = ''
    start_index = random.randint(0, len(input_text) - maxlen - 1)
    sentence = input_text[start_index: start_index + maxlen]
    generated += sentence
    
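    for i in range(length):
        # One-hot encode the current seed window
        x_pred = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_indices[char]] = 1.

        # Predict the next character and sample it at the given diversity
        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds, diversity)
        next_char = indices_char[next_index]

        # Append the character and slide the window forward by one
        generated += next_char
        sentence = sentence[1:] + next_char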
            
    return generated

In [None]:
# Generate some sample sentences
for i in range(10):
    
    print('----- Sample text')
    sample_text = generate_text(750, 0.7)
    print(sample_text)
    
    print('----- Novel words')
    novel_words = []
    generated_tokens = list(set(word_tokenize(sample_text)))
    
    for tok in generated_tokens:
        if tok not in all_tokens:
            novel_words.append(tok)

    print(novel_words)
    print()
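
The novel-word check above boils down to a set difference between generated tokens and corpus tokens. The sketch below shows that idea in isolation; it swaps in a simple regex tokenizer and lowercasing as stand-ins for `word_tokenize`, so its results will not match the notebook's exactly, and all names in it are hypothetical.

```python
import re

def simple_tokens(text):
    """Lowercased alphabetic tokens; a rough stand-in for nltk's word_tokenize."""
    return set(re.findall(r"[a-z']+", text.lower()))

def find_novel_words(generated, corpus):
    """Tokens that appear in the generated text but never in the corpus."""
    return sorted(simple_tokens(generated) - simple_tokens(corpus))

# Made-up sentences standing in for the training text and a generated sample
toy_corpus = "The machine cuts the steel into thin sheets."
toy_generated = "The machine cuts the steelings into thin sheets."
print(find_novel_words(toy_generated, toy_corpus))
```

Keeping the corpus tokens in a set also makes each membership test constant time, which matters once the corpus vocabulary grows large.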

<a id='gpt2'></a>
## Text Generation with gpt-2-simple

The gpt-2-simple library includes several pre-trained models and scripts for fine-tuning. Below are the configurations I used to fine-tune the model and generate text. For more information on how the library works, check out https://minimaxir.com/2019/09/howto-gpt2/.

**Note:** During my experiments, I ran this section on a Google Colab instance with a T4 GPU, which completed 1000 steps in approximately 50 minutes. Performance on my local hardware was significantly worse.

In [None]:
%%capture
# Imports
!pip install tensorflow==1.15.2
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2

In [None]:
# Download the 355M-parameter model
gpt2.download_gpt2(model_name="355M")

In [None]:
# Begin a training session with the custom dataset
sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset='../data/compiled_scripts.txt',
              model_name='355M',
              steps=2500,
              restore_from='fresh',
              run_name='run1',
              print_every=100,
              sample_every=200,
              save_every=500,
              learning_rate=1e-5
              )

In [None]:
# Load a training session from a checkpoint
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name='run1')

In [None]:
# Generate text from a saved model checkpoint
gpt2.generate(sess,
              run_name='run1',
              length=400,
              temperature=0.7,
              nsamples=10,
              batch_size=10,
              prefix="<BOS> SEED TEXT HERE",
              truncate="<EOS>",
              include_prefix=True
              )
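
Generated samples may still carry the delimiter tokens, depending on the `truncate` and `include_prefix` settings. The sketch below is one possible way to tidy them up afterwards; `clean_sample` is a hypothetical helper, and the sample strings are made up rather than real model output.

```python
def clean_sample(sample, bos='<BOS>', eos='<EOS>'):
    """Drop the delimiter tokens and anything after the first eos token."""
    text = sample.split(eos, 1)[0]
    return text.replace(bos, '').strip()

# Stand-in strings; real samples could be collected by passing
# return_as_list=True to gpt2.generate (assumed workflow, not run here).
toy_samples = [
    '<BOS> Today we look at pencils. <EOS> leftover text',
    '<BOS> Rubber bands start as raw latex.',
]
cleaned = [clean_sample(s) for s in toy_samples]
print(cleaned)
```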