# 02: LSTM Model Build

In this notebook we'll look in detail at the steps used to build one of our LSTM models for evaluation.

Because of the models' size and resource requirements, the actual models are trained on an Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instance. This lets us build and train several versions of the model while decreasing the time to complete one epoch from ~1 hour to xxxxx (my personal computer does not have a GPU). Please refer to the AWS script folder (and modify the scripts as needed).

Also, for the sake of clarity, the functions referenced in each step are included in this notebook to make the process easier to follow. In the AWS script, the functions are separated out to streamline the creation process.

* [Section A: Load File Containing the Generated Sequences From Our Corpus](#load)
* [Section B: Use `Tokenizer` from the Keras API to Encode our Sequences](#tokenize)
* [Section C: Preparing Data for Training (Defining X and y)](#xandy)
* [Section D: Define, Fit, and Save LSTM Model With Embedding Layer](#model)

In [2]:
import numpy as np
import os
from contextlib import redirect_stdout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras import utils
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM, Embedding
from tensorflow.keras.callbacks import ModelCheckpoint

# setting a random seed for reproducibility
np.random.seed(2021)

### <a name="load"></a>Section A: Load File Containing the Generated Sequences From Our Corpus

In [25]:
def load_sequences(path_and_filename):
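    # read the whole file; the context manager closes the handle afterwards
    with open(path_and_filename) as f:
        sequence_data = f.read()
    # one sequence per line
    sequences = sequence_data.split('\n')
    
    # each sequence ends with its output (target) token, so subtract one
    words_in_seq = len(sequences[0].split()) - 1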
    
    print(f'{len(sequences)} sequences have been loaded.')
    print(f'Each sequence has {words_in_seq} word token(s) plus an output token.')
    return sequences, words_in_seq

In [26]:
sequence_list, seq_length = load_sequences('../data/Poe_NLG/03_Text_files_for_models/cleaned_poe_tot_seq_len_26.txt')

480044 sequences have been loaded.
Each sequence has 25 word token(s) plus an output token.


In [27]:
sequence_list[:5]

['Upon my return to the United States a few months ago , after the extraordinary series of adventure in the South Seas and elsewhere , of',
 'my return to the United States a few months ago , after the extraordinary series of adventure in the South Seas and elsewhere , of which',
 'return to the United States a few months ago , after the extraordinary series of adventure in the South Seas and elsewhere , of which an',
 'to the United States a few months ago , after the extraordinary series of adventure in the South Seas and elsewhere , of which an account',
 'the United States a few months ago , after the extraordinary series of adventure in the South Seas and elsewhere , of which an account is']
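
The five sequences above overlap because each one starts one token later than the one before it. The file itself was generated before this notebook, so the following is only a minimal sketch of that sliding-window idea; `make_sequences` is a hypothetical helper, not the function that produced the file.

```python
def make_sequences(tokens, total_len):
    """Return every run of `total_len` consecutive tokens, joined by spaces."""
    return [
        ' '.join(tokens[i:i + total_len])
        for i in range(len(tokens) - total_len + 1)
    ]

# a short stand-in for the corpus, using a window of 4 instead of 26
tokens = 'Upon my return to the United States a few months ago'.split()
windows = make_sequences(tokens, total_len=4)

for window in windows[:3]:
    print(window)
```

With a window of 26 tokens, the first 25 become the model input and the last one becomes the output token.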

### <a name="tokenize"></a>Section B: Use `Tokenizer` from the Keras API to Encode our Sequences

In [28]:
# map words to integers for each sequence
def tokenize_words(sequence_list, filter_string='', lower_case=True):
     
    tokenizer = Tokenizer(filters=filter_string, lower=lower_case)
    
    tokenizer.fit_on_texts(sequence_list)
    
    sequences = tokenizer.texts_to_sequences(sequence_list)
    
    # word_index maps each word to an integer starting at 1 (0 is reserved)
    vocabulary_size = len(tokenizer.word_index)
    
    print('Sequences have been tokenized using Keras API Tokenizer.')
    print(f'Vocabulary size is {vocabulary_size}')
    
    return tokenizer, sequences, vocabulary_size

In [29]:
tokenizer, sequences, vocab_size = tokenize_words(sequence_list, filter_string='', lower_case=False)

Sequences have been tokenized using Keras API Tokenizer.
Vocabulary size is 22466
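
To make the word-to-integer mapping concrete, here is a pure-Python approximation of what the tokenizer does with `filters=''` and `lower=False`. It is a sketch rather than the Keras implementation: `build_word_index` and `encode` are hypothetical names, and it assumes plain whitespace splitting with more frequent words receiving smaller indices, starting at 1.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency and number them from 1 (0 stays unused)."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i for i, word in enumerate(ranked, start=1)}

def encode(texts, word_index):
    """Replace every word with its integer index."""
    return [[word_index[w] for w in text.split()] for text in texts]

texts = ['the raven , the bust', 'the Raven']
word_index = build_word_index(texts)
encoded = encode(texts, word_index)

print(word_index)
print(encoded)
```

Because nothing is filtered or lowercased, the comma is a token of its own and `raven` and `Raven` receive different indices.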


### <a name="xandy"></a>Section C: Preparing Data for Training (Defining X and y)

In [30]:
def input_and_output_sequences(sequences, vocab_size):
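    sequences = np.array(sequences)
    # inputs are all tokens but the last; the target is the last token of each sequence
    X, y = sequences[:, :-1], sequences[:, -1]
    # plus one because Tokenizer indices start at 1, so index 0 is an unused extra class
    y = utils.to_categorical(y, num_classes=vocab_size + 1)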
    return X, y

In [31]:
X, y = input_and_output_sequences(sequences, vocab_size)

In [32]:
X.shape

(480044, 25)

In [33]:
y.shape

(480044, 22467)
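
The shapes above follow directly from the slicing and the one-hot encoding. The toy example below reproduces both steps with NumPy only, using `np.eye` as a stand-in for `utils.to_categorical`; the toy data and variable names are made up for illustration. It also estimates how much memory the real `y` needs, assuming 4 bytes per value (`to_categorical` returns `float32` by default).

```python
import numpy as np

# toy data: 3 encoded sequences of 4 tokens, vocabulary of 5 words (indices 1..5)
toy_sequences = np.array([[1, 2, 3, 4],
                          [2, 3, 4, 5],
                          [3, 4, 5, 1]])
toy_vocab_size = 5

# all tokens but the last are inputs; the last token is the target
X_toy, y_idx = toy_sequences[:, :-1], toy_sequences[:, -1]
# one row per sample, one column per class index 0..vocab_size
y_toy = np.eye(toy_vocab_size + 1, dtype='float32')[y_idx]

print(X_toy.shape, y_toy.shape)

# rough size of the real one-hot y at 4 bytes per float32 value
n_samples, n_classes = 480044, 22467
y_gb = n_samples * n_classes * 4 / 1e9
print(f'{y_gb:.1f} GB')
```

Column 0 of the one-hot matrix is never set because no word has index 0. The estimate comes to roughly 43 GB, which is one reason the model is trained on a larger machine. If memory becomes a constraint, Keras also offers the `sparse_categorical_crossentropy` loss, which takes the integer targets directly; that is an alternative, not what this notebook uses.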

### <a name='model'></a>Section D: Define, Fit, and Save LSTM Model With Embedding Layer

In [8]:
def build_LSTM_model(vocab_size, seq_length, layer_size=256, embedding=True, embedding_vector_space=128, dropout=True, dropout_rate=0.2):

    model = Sequential()
    
    if embedding:
        model.add(Embedding(input_dim=vocab_size+1, output_dim=embedding_vector_space, input_length=seq_length))
        model.add(LSTM(layer_size, return_sequences=True))
    else:
        model.add(LSTM(layer_size, input_shape = (seq_length, vocab_size+1), return_sequences=True))
    
    if dropout:
        model.add(Dropout(dropout_rate))
    
    model.add(LSTM(layer_size))
    
    if dropout:
        model.add(Dropout(dropout_rate))
    
    model.add(Dense(layer_size, activation='relu'))

    model.add(Dense(vocab_size+1, activation='softmax'))
    
    print("Model has been created.\n\nHere's a summary:")
    print('----------------------')
    # summary() prints the table itself and returns None, so it is not wrapped in print()
    model.summary()
    
    model_name = f'{seq_length}_seqlen_LSTM_model_'
    
    
    return model, model_name

In [36]:
model, model_name = build_LSTM_model(vocab_size, seq_length, layer_size=512)

Model has been created.

Here's a summary:


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 128)           2875776   
_________________________________________________________________
lstm_2 (LSTM)                (None, 25, 512)           1312768   
_________________________________________________________________
dropout_2 (Dropout)          (None, 25, 512)           0         
_________________________________________________________________
lstm_3 (LSTM)                (None, 512)               2099200   
_________________________________________________________________
dropout_3 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 512)               262656    
_________________________________________________________________
dense_3 (
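
The parameter counts in the visible rows of the summary can be reproduced by hand. The sketch below assumes the standard layer layouts (an LSTM has four gates, each with input weights, recurrent weights, and a bias); the helper names are illustrative only.

```python
def embedding_params(input_dim, output_dim):
    # one vector of length output_dim per index
    return input_dim * output_dim

def lstm_params(input_size, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_size + units) * units + units)

def dense_params(input_size, units):
    # weight matrix plus one bias per unit
    return input_size * units + units

vocab_size, embedding_dim, layer_size = 22466, 128, 512

print('embedding_1:', embedding_params(vocab_size + 1, embedding_dim))
print('lstm_2:     ', lstm_params(embedding_dim, layer_size))
print('lstm_3:     ', lstm_params(layer_size, layer_size))
print('dense_2:    ', dense_params(layer_size, layer_size))
```

The second LSTM layer has more parameters than the first because its input is the 512-dimensional output of the previous layer rather than the 128-dimensional embedding.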

In [None]:
# create checkpoints to save model weights (if an improvement) at each epoch
# makedirs also creates the parent ./Model_weights folder if it is missing
os.makedirs(f'./Model_weights/{model_name}', exist_ok=True)
    
checkpoint_path = f'./Model_weights/{model_name}/{model_name}_weights' + '-improvement-{epoch:02d}-{loss:.4f}-acc{accuracy:.4f}.hdf5'
checkpoint = ModelCheckpoint(checkpoint_path, monitor='loss', verbose=1, save_best_only=True, mode='min')
callback_list = [checkpoint]
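
The placeholders in `checkpoint_path` are ordinary Python format fields that Keras fills in with the epoch number and the logged values whenever it saves a checkpoint. The sketch below shows what a resulting file name looks like; the shortened template and the numbers are made up for illustration and are not real training results.

```python
# same placeholder pattern as checkpoint_path; folder and model name omitted
template = 'weights-improvement-{epoch:02d}-{loss:.4f}-acc{accuracy:.4f}.hdf5'

# illustrative values only
example_name = template.format(epoch=3, loss=6.25, accuracy=0.125)
print(example_name)
```

The `{accuracy}` field can only be filled in if accuracy is tracked during training, which is why it has to be listed under `metrics` when the model is compiled.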

Next we'll compile the model. 

The accuracy metric is included purely as a point of reference (and personal curiosity). We don't want our accuracy to be too high (we don't want to exactly reproduce the training text), but we do want the model to learn how the words in our vocabulary relate to each other.

In [37]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
# callbacks are passed to fit, not compile
model.fit(X, y, batch_size=64, epochs=100, callbacks=callback_list)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10

In [None]:
# save model and model summary
# makedirs also creates the parent ./Models folder if it is missing
os.makedirs(f'./Models/{model_name}', exist_ok=True)

with open(f'./Models/{model_name}/{model_name}_summary.txt', 'w') as f:
    with redirect_stdout(f):
        model.summary()

model.save(f'./Models/{model_name}/{model_name}_word_model.h5')