# Edgar Allan Poe Text Generation Notebook
**Jeremy Chow**

7/15/2019

Goal: Generate text in the style of Edgar Allan Poe, emulating the writing in his short-story excerpts from the Kaggle "Spooky Author Identification" competition dataset: https://www.kaggle.com/c/spooky-author-identification

## Imports
Here we need general dataframe manipulation libraries, then some TensorFlow and Keras libraries for the deep learning models.

In [1]:
# Standard Data Science Libraries
import pickle
import math
import pandas as pd
import numpy as np
from numpy import array

# Neural Net Preprocessing
from sklearn.feature_extraction.text import CountVectorizer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Neural Net Layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Embedding

# Neural Net Training
from tensorflow.keras.models import load_model
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import EarlyStopping

from pickle import load


In [7]:
# Import the data
train_df = pd.read_csv('../data/train.csv')
# Selecting Edgar Allan Poe as author style to emulate
author = train_df[train_df['author'] == 'EAP']["text"]
print('Number of training sentences: ',author.shape[0])


Number of training sentences:  7900


# Preprocessing
NLP projects typically remove the following from the text so the model can focus on the words that carry meaning:
- stop words such as _"the", "a", "an"_
- punctuation

The text is then tokenized (each unique word is mapped to a unique integer). However, because the goal here is to generate fluid, human-like text, we want to preserve stop words. Instead, we rely on the `Tokenizer` class in the Keras library to perform the remaining preprocessing steps.

### Tokenize words in corpus using Keras Tokenizer.
The `Tokenizer` does the following:
1. Removes punctuation (apostrophes are kept by default)
2. Sets all text to lower case
3. Splits the text into words, then assigns a unique integer to each word
4. Replaces every instance of each word with its integer.

Tokenization is necessary to prepare the data for the embedding layer (see the model architecture section below).
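To make the four steps concrete, here is a minimal plain-Python sketch of the same idea. It is a simplified stand-in, not the Keras implementation: the helper names (`split_words`, `fit_word_index`, `to_sequences`) and the toy sentences are made up for illustration, and it assumes words are numbered from most to least frequent, starting at 1.

```python
import string
from collections import Counter

# Simplified stand-in for the Keras Tokenizer: replace punctuation (except
# apostrophes) with spaces, lowercase, split on whitespace, number by frequency.
_PUNCT_TO_SPACE = str.maketrans({c: " " for c in string.punctuation if c != "'"})

def split_words(sentence):
    return sentence.lower().translate(_PUNCT_TO_SPACE).split()

def fit_word_index(texts):
    counts = Counter(word for sentence in texts for word in split_words(sentence))
    # sorted() is stable, so equally frequent words keep their first-seen order
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: i + 1 for i, (word, _) in enumerate(ranked)}  # 0 is left for padding

def to_sequences(texts, word_index):
    return [[word_index[w] for w in split_words(s) if w in word_index] for s in texts]

toy_texts = ["The raven, the raven!", "Nevermore, quoth the raven."]
toy_index = fit_word_index(toy_texts)
toy_sequences = to_sequences(toy_texts, toy_index)
print(toy_index)      # {'the': 1, 'raven': 2, 'nevermore': 3, 'quoth': 4}
print(toy_sequences)  # [[1, 2, 1, 2], [3, 4, 1, 2]]
```

Frequent words get small integers, which matches the pattern visible in the real tokenized output, where 1, 2 and 3 appear constantly.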

In [8]:
max_words = 50000 # Max size of the dictionary
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(author.values)
sequences = tokenizer.texts_to_sequences(author.values)
print(sequences[:5])

[[19, 2397, 80, 1001, 29, 31, 177, 2, 4073, 1, 1960, 2, 11, 3024, 15, 7, 110, 157, 41, 2146, 3, 481, 4, 1, 149, 2147, 7, 393, 74, 114, 101, 439, 2, 1, 162, 32, 913, 6453, 136, 1, 380], [6, 21, 142, 150, 10, 5, 551, 2148, 319, 28, 16, 15, 20, 8999, 128, 1, 3025, 2398, 30, 171, 2, 1797, 697, 20, 180, 2148, 6454, 12, 33, 188, 2, 1, 869, 243, 522, 1264], [1, 6455, 203, 14, 19, 149, 180, 6456, 6, 1, 1357, 2, 1358, 9000, 3, 83, 2149, 10, 355, 140, 794], [1, 4074, 491, 6, 9001, 28, 11, 158], [7, 287, 9, 36, 48, 22, 73, 4, 644, 9002, 114, 101, 346, 4, 271, 2, 9003, 3, 81, 2, 1, 3026, 2, 6457, 3, 282, 53, 34, 6458, 19, 339, 22, 43, 97, 608, 7, 450, 4, 36, 133, 1191, 88, 12, 133, 71, 914, 1, 759, 3027, 2, 9, 1445, 1359, 18, 760, 12, 4973, 6, 1, 421, 9004, 9005, 7, 214, 9, 36, 48, 22, 3449, 3028, 98, 124, 1192, 4, 1, 92, 9006, 6, 3450, 3, 7, 761, 870, 9, 36, 55, 111, 32]]


In [9]:
# Flatten the list of lists resulting from the tokenization. This will reduce the list
# to one dimension, allowing us to apply the sliding window technique to predict the next word
text = [item for sublist in sequences for item in sublist]
vocab_size = len(tokenizer.word_index)

In [10]:
print('Vocabulary size in this corpus: ', vocab_size)

Vocabulary size in this corpus:  15713


In [11]:
# Training on 19 words to predict the 20th
sentence_len = 20
pred_len = 1
train_len = sentence_len - pred_len
seq = []
# Sliding window to generate train data (+1 so the final full window is included)
for i in range(len(text)-sentence_len+1):
    seq.append(text[i:i+sentence_len])
# Reverse dictionary to decode tokenized sequences back to words
reverse_word_map = dict(map(reversed, tokenizer.word_index.items()))

# Save tokenizer
# pickle.dump(tokenizer, open('tokenizer.pkl', 'wb'))

In [12]:
# Each row in seq is a 20-word window. We use the first 19 words as the input to predict the 20th word
trainX = []
trainy = []
for i in seq:
    trainX.append(i[:train_len])
    trainy.append(i[-1])
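The windowing and input/target split can be checked on a toy list. This is only an illustrative sketch: the `toy_` names and the window length of 4 are assumptions chosen to keep the output readable, while the notebook itself uses 20-token windows.

```python
# Toy stand-in for the flattened token list
toy_text = [1, 2, 3, 4, 5, 6, 7, 8]
toy_sentence_len = 4                      # 3 input tokens + 1 target token
toy_train_len = toy_sentence_len - 1

# Slide a window of toy_sentence_len tokens one position at a time;
# the + 1 makes sure the final full window is included
toy_windows = [toy_text[i:i + toy_sentence_len]
               for i in range(len(toy_text) - toy_sentence_len + 1)]

# First tokens of each window are the input, the last token is the target
toy_X = [w[:toy_train_len] for w in toy_windows]
toy_y = [w[-1] for w in toy_windows]

for x, y in zip(toy_X, toy_y):
    print(x, "->", y)
```

Each token ends up as the target of at most one window, and consecutive inputs overlap in all but one position.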

# Model Architecture:
1. Embedding layer
    - Helps the model capture the 'meaning' of words by mapping each token to a dense vector in a learned space, rather than treating it as an arbitrary integer
2. Stacked LSTM layers
    - Stacking LSTM layers adds depth, which can help more than adding cells to a single LSTM layer (see paper: https://arxiv.org/abs/1303.5778)
    - The first LSTM layer must have the `return_sequences` flag set to True so that it passes its output at every time step to the second LSTM layer instead of only its final output
3. Dense (fully connected) layer with ReLU activation
4. Dense layer with Softmax activation
    - Outputs a probability for every word in the vocabulary
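As a rough illustration of the first and last layers, the NumPy sketch below treats the embedding as a row lookup in a table and the softmax as a normalization of one score per word. The sizes, the random values and the `softmax` helper are all hypothetical; nothing here is taken from the trained models.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab, toy_dim = 10, 4        # hypothetical sizes, far smaller than the notebook's

# Embedding layer: a (vocab + 1, dim) table; row 0 is reserved for padding
embedding_table = rng.normal(size=(toy_vocab + 1, toy_dim))
tokens = np.array([0, 7, 2])      # one left-padded input sequence
embedded = embedding_table[tokens]          # shape (sequence length, dim)

def softmax(logits):
    shifted = logits - logits.max()         # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Final layer: one score per vocabulary word, turned into probabilities
logits = rng.normal(size=toy_vocab)
probs = softmax(logits)
print(embedded.shape, probs.sum())
```

The predicted next word is then the position of the largest probability, which is what the generation function later in the notebook picks.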

In [13]:
# define model
model = Sequential([
    Embedding(vocab_size+1, 50, input_length=train_len),
    LSTM(100, return_sequences=True),
    LSTM(100),
    Dense(100, activation='relu'),
    Dense(vocab_size, activation='softmax')
])

Instructions for updating:
Colocations handled automatically by placer.


In [52]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 19, 50)            785700    
_________________________________________________________________
lstm_4 (LSTM)                (None, 19, 100)           60400     
_________________________________________________________________
lstm_5 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_4 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_5 (Dense)              (None, 15713)             1587013   
Total params: 2,523,613
Trainable params: 2,523,613
Non-trainable params: 0
_________________________________________________________________
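The parameter counts in the summary can be reproduced by hand. The sketch below assumes the usual formulas for these layer types (four gates per LSTM, each with input weights, recurrent weights and a bias); the helper names are made up for illustration.

```python
vocab_size = 15713                      # value printed earlier in the notebook
embed_dim, lstm_units, dense_units = 50, 100, 100

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units

counts = {
    "embedding": (vocab_size + 1) * embed_dim,
    "lstm_1": lstm_params(embed_dim, lstm_units),
    "lstm_2": lstm_params(lstm_units, lstm_units),
    "dense_relu": dense_params(lstm_units, dense_units),
    "dense_softmax": dense_params(dense_units, vocab_size),
}
total = sum(counts.values())
print(counts)
print(total)  # 2523613
```

Most of the parameters sit in the embedding table and the final softmax layer, both of which scale with the vocabulary size.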


In [None]:
model.fit(np.asarray(trainX),
         pd.get_dummies(np.asarray(trainy)),
         epochs = 500,
         batch_size = 10240,
         callbacks = callbacks_list,
         verbose = 2)

In [56]:
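model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
# pd.get_dummies one-hot encodes the targets, one column per distinct token id in sorted order.
# Token ids start at 1, so (assuming every id occurs as a target) column j stands for token j+1.
model.fit(np.asarray(trainX), pd.get_dummies(np.asarray(trainy)), batch_size=128, epochs=100)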


Instructions for updating:
Use tf.cast instead.
Epoch 1/100
...
Epoch 74/100

<tensorflow.python.keras.callbacks.History at 0x1a43927f98>

Model 1 was trained for 100 epochs and ended with a training accuracy of .3877.
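The fit call one-hot encodes the targets with `pd.get_dummies`. The NumPy sketch below is a stand-in for that step, assuming the columns are ordered by sorted token id; the toy targets are made up. It shows why a predicted column index has to be shifted by one to recover a token id.

```python
import numpy as np

toy_targets = np.array([3, 1, 2, 3, 1])      # token ids start at 1, never 0

# np.unique returns the sorted distinct ids and, for each target, its column
classes, columns = np.unique(toy_targets, return_inverse=True)
one_hot = np.eye(len(classes), dtype=int)[columns]

# Column j stands for token id classes[j]; when every id from 1..V occurs,
# classes[j] == j + 1, so a predicted column must be shifted by one
predicted_column = one_hot[0].argmax()
predicted_token = classes[predicted_column]
print(one_hot)
print(predicted_column, predicted_token)
```

If some token id never appeared as a target, the shift-by-one shortcut would no longer hold and the `classes` lookup would be needed instead.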

In [58]:
pickle.dump(tokenizer, open('tokenizer.pkl', 'wb'))
model.save('model_weights.hdf5')

# Model Version 2
This model is similar to model 1, but we add a dropout layer to reduce overfitting. During training, the dropout layer randomly zeroes a fraction (here 10%) of the activations it receives from the previous layer, forcing the model to learn more robust features.
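What the dropout layer does during training can be sketched in a few lines of NumPy. This is an illustrative version of "inverted" dropout, where the surviving activations are rescaled so their expected value stays the same; the `dropout` helper and the batch shape are assumptions made for the example.

```python
import numpy as np

def dropout(activations, rate, rng):
    # Keep each unit with probability (1 - rate), then rescale the survivors
    # so the expected activation is unchanged (inverted dropout)
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((4, 100))          # hypothetical batch of 4 rows, 100 units
dropped = dropout(activations, rate=0.1, rng=rng)
print("fraction zeroed:", (dropped == 0).mean())
```

At prediction time no units are dropped, so the layer only affects training.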

In [19]:
# define model
model_2 = Sequential([
    Embedding(vocab_size+1, 50, input_length=train_len),
    LSTM(100, return_sequences=True),
    LSTM(100),
    Dense(100, activation='relu'),
    Dropout(0.1),
    Dense(vocab_size, activation='softmax')
])

Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


In [20]:
model_2.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 19, 50)            785700    
_________________________________________________________________
lstm_4 (LSTM)                (None, 19, 100)           60400     
_________________________________________________________________
lstm_5 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_4 (Dense)              (None, 100)               10100     
_________________________________________________________________
dropout (Dropout)            (None, 100)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 15713)             1587013   
Total params: 2,523,613
Trainable params: 2,523,613
Non-trainable params: 0
_________________________________________________________________


In [18]:
# Early stopping allows model to stop training if improvement stops.
# es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=50)
# model_2.compile(optimizer='adam',
#               loss='categorical_crossentropy',
#               metrics=['accuracy'])
# filepath = "./model_2_weights.hdf5"
# # Model checkpointing allows us to preserve progress during training if training is interrupted
# checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
# callbacks_list = [checkpoint]
# history = model_2.fit(np.asarray(trainX),
#          pd.get_dummies(np.asarray(trainy)),
#          epochs = 300,
#          batch_size = 128,
#          callbacks = callbacks_list,
#          verbose = 1)
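The patience rule behind early stopping can be sketched in plain Python. This is a simplified illustration, not the Keras callback: the `stopping_epoch` helper and the loss values are made up, and it ignores options such as a minimum improvement threshold.

```python
def stopping_epoch(losses, patience):
    '''Return the 1-based epoch at which training would stop, or None.'''
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            # New best loss: remember it and reset the counter
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

toy_losses = [2.0, 1.5, 1.2, 1.3, 1.25, 1.21, 1.1]   # made-up values
print(stopping_epoch(toy_losses, patience=3))  # 6
print(stopping_epoch(toy_losses, patience=5))  # None
```

With a patience of 3 the run stops just before the loss improves again at epoch 7, which is why a generous patience is often preferred for noisy losses.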

Model 2 was trained for 300 epochs but only reached a training accuracy of .3138.

In [21]:
model_2.load_weights('../models/model_2_weights_colab.hdf5')

At this point I moved model training to Google Colab to speed it up with a GPU.

# Model 3

Model 2 added a dropout layer, but its training accuracy fell from .3877 to .3138, a drop of about 7 percentage points (roughly 19% in relative terms).

For model 3, we'll remove the dropout layer and increase the number of units in the LSTM and hidden dense layers by 50% (from 100 to 150).

As expected, this resulted in a higher accuracy on the training set (.63 after 300 epochs).

In [14]:
# define model
model_3 = Sequential([
    Embedding(vocab_size+1, 50, input_length=train_len),
    LSTM(150, return_sequences=True),
    LSTM(150),
    Dense(150, activation='relu'),
    Dense(vocab_size, activation='softmax')
])

Model 3 was trained for 300 epochs and reached a training accuracy of .63.

In [16]:
# Loaded after being trained remotely using Google Colab
model_3.load_weights('../models/model_3_weights_colab.hdf5')

# Testing the generation models
For most other types of projects, a good way to quantify a model's success would be a **train-test split to measure the test accuracy**, that is, using the model to predict data it was not trained on and has never seen before.

However, the goal of text generation isn't quite to maximize accuracy: a model that predicted the training text perfectly would simply be regurgitating quotes, which amounts to overfitting. Instead, we'll give every model the same input strings and compare their outputs.

In [26]:

def gen(model,seq,max_len = 20):
    ''' Generates max_len additional words after the input string seq using the specified model'''
    # Tokenize the input string
    tokenized_sent = tokenizer.texts_to_sequences([seq])
    max_len = max_len+len(tokenized_sent[0])
    while len(tokenized_sent[0]) < max_len:
        # The model expects exactly train_len (19) input tokens. `pad_sequences` left-pads shorter
        # sequences with zeroes and, for longer ones, keeps only the last 19 tokens.
        padded_sentence = pad_sequences(tokenized_sent,maxlen=train_len)
        op = model.predict(np.asarray(padded_sentence).reshape(1,-1))
        # Greedy choice of the most likely word; the one-hot targets start at token 1,
        # so output column j is taken to be token id j+1
        tokenized_sent[0].append(op.argmax()+1)

    return " ".join(map(lambda x : reverse_word_map[x],tokenized_sent[0]))
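The padding and truncation step can be sketched without Keras. The `left_pad` helper below is hypothetical and handles a single sequence; it mimics what `pad_sequences` does with its default settings, which pad and truncate on the left.

```python
def left_pad(tokens, maxlen, pad_value=0):
    '''Hypothetical helper: keep the last maxlen tokens, left-padding with pad_value.'''
    recent = tokens[-maxlen:]
    return [pad_value] * (maxlen - len(recent)) + recent

print(left_pad([5, 9, 2], maxlen=5))           # [0, 0, 5, 9, 2]
print(left_pad(list(range(1, 8)), maxlen=5))   # [3, 4, 5, 6, 7]
```

Keeping the most recent tokens matters during generation: once the sequence grows past the input length, only the latest context is fed back to the model.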

In [27]:
def test_models(test_string,sequence_length= 50, models = None):
    '''Generates sequence_length additional words from each model given input test_string'''
    # Resolve the default at call time: model_list is only defined in the next cell
    if models is None:
        models = model_list
    print('Input String: ', test_string)
    for counter,model in enumerate(models):
        print("Model ", counter+1, ":")
        print(gen(model,test_string,sequence_length))

In [28]:
model_list = [model,model_2,model_3]

In [29]:
test_models('This process however afforded me', 10)

Input String:  This process however afforded me
Model  1 :
this process however afforded me bearing pernicious pernicious pernicious mattock mattock thousands thousands thousands disdain
Model  2 :
this process however afforded me most well echoed one who had a very idea which
Model  3 :
this process however afforded me good difficulty whatever long vaulting are no more subject to


In [30]:
test_models(author.iloc[3709])

Input String:  what avails the vigilance against the Destiny of man?
Model  1 :
what avails the vigilance against the destiny of man d'indaginé hippocratian hippocratian hippocratian miseries disclosed disclosed caligula worthy worthy engulfed shaved shaved shaved deviate deviate deviate hoggishly hoggishly hoggishly odd odd odd odd odd odd odd recitative recitative recitative succumbed evinced evinced evinced evinced definition definition definition definition definition definition caligula caligula caligula caligula fillagree fillagree fillagree fillagree fillagree
Model  2 :
what avails the vigilance against the destiny of man varnished with the most officer diameter to indebted at the same hour that looked up all ah we can say that the most officer which i had observed that so around and the whole hand of the most absurd lilies within the same time the most oil of shadows of
Model  3 :
what avails the vigilance against the destiny of man to their nationality that von jung them in u

In [31]:
test_models(author.iloc[58],50)

Input String:  By these means for they were ignorant men I found little difficulty in gaining them over to my purpose.
Model  1 :
by these means for they were ignorant men i found little difficulty in gaining them over to my purpose worthy worthy worthy deviate perpetual perpetual perpetual velocities broadway hoggishly hoggishly pernicious persuaded glanced amazing generalities generalities pocket carrying carrying carrying pernicious jonas sign tavern tavern hideous hideous hideous fxwl type type torches torches indeterminate indeterminate 'und 'und 'und resume resume calf 'und calf calf 'und thus thus buries buries
Model  2 :
by these means for they were ignorant men i found little difficulty in gaining them over to my purpose in fact a very very appearance i have poh alluded that g he carries her idea and then forests in fortune however i am not a very angel book the same agitation recess i was made was more and more properly had been interested and as unreal what has a
Model  3 :


In [32]:
test_models(author.iloc[70])

Input String:  In the meantime it was folly to grieve, or to think.
Model  1 :
in the meantime it was folly to grieve or to think deviate deviate deviate hoggishly hoggishly odd odd odd odd odd odd recitative recitative recitative succumbed succumbed evinced evinced aorta glassy exaggeration exaggeration exaggeration year greatest greatest greatest hers hers phosphoric phosphoric phosphoric phosphoric dogmaticians dogmaticians landlady duty recorded recorded discontinuance stir recorded officer officer relieves flayed gleaming gleaming gleaming gleaming
Model  2 :
in the meantime it was folly to grieve or to think that the whole hand was intolerably folded in bob handkerchiefs which expired with a fiend i made in the most officer proprietor as so was at the whole of the oil of bob so a very capital height that caused upon the most oil of water had been obtained to
Model  3 :
in the meantime it was folly to grieve or to think the passions of the son of the automaton was in the first ins

In [33]:
test_models(author.iloc[7800])

Input String:  First of all I dismembered the corpse.
Model  1 :
first of all i dismembered the corpse eder eder hippocratian hippocratian ladies ladies ladies amazing amazing amazing amazing pernicious pernicious pernicious pernicious fillagree fillagree fillagree fillagree or or mattock mattock strived strived pernicious pernicious pernicious canoe canoe canoe diverted discourage discourage discourage discourage humoredly humoredly humoredly humoredly humoredly humoredly proceedings infirmity infirmity infirmity characters characters dodona study
Model  2 :
first of all i dismembered the corpse telescope delighted and accordingly the whole idea was intended of course and the most oil of mankind nor the whole flight of us at all events had been mistaken the most idea of a very very old lady had been more so much a thousand channels which not at west
Model  3 :
first of all i dismembered the corpse which is profound since in the first place he appeared at first so suddenly as any matte

In [34]:
test_models(author.iloc[7120])

Input String:  Here, however, are Moissart, Voissart, Croissart, and Froissart, all in the direct line of descent.
Model  1 :
here however are moissart voissart croissart and froissart all in the direct line of descent 'oppodeldoc d'indaginé d'indaginé emissaries emissaries 'oppodeldoc 'oppodeldoc emissaries fell suspecting suspecting suspecting completed completed completed entreaties entreaties fossillus fossillus fossillus suspended suspended officer' smells concert payment payment ellison's choctaws concert thrusting juniper pure exaggeration snatches snatches concert security bearing happened relieves emerging fiftieth recorded supplying discontinuance butcheries gleaming unhurried unhurried
Model  2 :
here however are moissart voissart croissart and froissart all in the direct line of descent the most officer which the most oil of water had no the very natural man had been more more than the most officer which at the same time a little fellow in fact was well as that the idea was

In [35]:
test_models(author.iloc[5121])

Input String:  "There are two windows in the chamber.
Model  1 :
there are two windows in the chamber 'found 'found 'found relieves relieves recitative disclosed disclosed disclosed engulfed engulfed engulfed hers hers hers shakespeare shakespeare 'prince thousands thousands thousands thousands involving sworn ellipsoid ellipsoid disdain imbued imbued pocket pocket museums museums museums memory memory senty senty senty senty senty confused confused confused confused holies holies holies riveted riveted
Model  2 :
there are two windows in the chamber had been more properly in fact that the idea was most much and the most most oil of water had been a very very old lady have been more than more more than the most oil of water the most oil of water the very surface of sighs wherein the
Model  3 :
there are two windows in the chamber of caravaggio of the amenity of albano to the renewed cares of the variable particulars of the monk i had enough although i am well the whole head spirits wit

# Possible Improvements
1. The model vocabulary is based only on the training corpus, so any word in the input string that the tokenizer does not recognize is silently dropped.
2. To make a more robust model with a wider vocabulary, a pretrained embedding model such as GloVe or fastText could be used.
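The first point can be illustrated with a toy vocabulary. The sketch below contrasts dropping unknown words with mapping them to a single reserved id; the vocabulary, the helper names and the choice of id 4 are all made up for the example. The Keras `Tokenizer` exposes an `oov_token` argument that serves a similar purpose, although this notebook does not use it.

```python
toy_word_index = {"the": 1, "raven": 2, "nevermore": 3}   # made-up vocabulary

def encode_dropping_unknown(words, word_index):
    # Mirrors the behavior described in item 1: unknown words vanish
    return [word_index[w] for w in words if w in word_index]

def encode_with_oov(words, word_index, oov_id):
    # Alternative: map every unknown word to one reserved id
    return [word_index.get(w, oov_id) for w in words]

prompt = "the spaceship raven".split()
print(encode_dropping_unknown(prompt, toy_word_index))   # [1, 2]
print(encode_with_oov(prompt, toy_word_index, oov_id=4)) # [1, 4, 2]
```

Mapping to a reserved id keeps the positions of the known words intact, but the model would need to be trained with that id to make any use of it.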