### New Notes

In the first three models, I was a bit unclear about what I wanted to do; in reality, I should have been combining lessons from all of them. In the first, I was trying to emulate Liam Ge's preprocessing (not realizing that he completely ignores Keras in his explanation). In the second, I was emulating Jason Brownlee's LSTM, but all that managed to do was create something that predicts sequences. In the third, I was just experimenting with how to build an MLP without any regard for what was actually going into it.

I might still be able to use Brownlee's model if I can figure out whether its `history` object is the same as the one in my TensorFlow classification notes, and how evaluation works. Right now I feel like I'm dealing with three different types of models: an LSTM (not what I want), a classifier (also not what I want), and an MLP (what I want), but I only know how to evaluate the first two.

It might be helpful to build an MLP from scratch like my LSTM, replace the parameters that I understand from the Bengio paper, and use the simple `tf.exp(history['loss'])` function I found on Stack Exchange to calculate perplexity. Brownlee says that `model.evaluate` can calculate the loss values for input data, so maybe I should just build a basic MLP, run it on a separate validation dataset and test dataset, and call it a day.

My main issue is that I don't really know how to do the preprocessing for an MLP. The first hypothesis is that I can copy the sequences idea from Brownlee's LSTM, but only run it on half the training corpus if Colab can't handle it. I also need to figure out how to optimize the model to leverage a TPU or GPU, because running on half the training corpus might not yield the results that I want, but this can be a secondary task.

Current plan: 
0. Search out MLPs from scratch [MERP]
1. Use Brownlee's LSTM to guide preprocessing dataset and building model 
2. Run `model.evaluate()` on validation and testing. 
3. Use Tutorial's visualization and loss grabbing to get values I need. 
4. Add in extra features if you're not bored 

In [1]:
import os
import tensorflow as tf

import numpy
from numpy import array

from random import randint

from pickle import load
from pickle import dump 

from tensorflow import keras
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences 

Using TensorFlow backend.


## Helper Functions

In [2]:
# loads the doc into memory
def load_doc(filename):
    # opens the file as read only
    file = open(filename, 'r')
    # reads all the text
    text = file.read()
    # closes the file
    file.close()
    return text

import string

# turns a document into clean tokens
def clean_doc(doc):
    # replaces "--" with a space ' '
    doc = doc.replace('--', ' ')
    # splits into tokens by white space 
    tokens = doc.split()
    # removes punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # removes remaining tokens that aren't alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # makes everything lower case
    tokens = [word.lower() for word in tokens]
    return tokens

# saves tokens to file, one dialog per line:
def save_doc(lines, filename):
    data = '\n'.join(lines)
    file = open((filename + '_sequences.txt'), 'w')
    file.write(data)
    file.close()
    
def preprocessing_train(document):
    # loads the document
    in_filename = document
    doc = load_doc(in_filename)
    # sanity check, uncomment to see first 200 characters
    # print(doc[:200])
    
    # FYI, not 100% sure about how variables get returned. Keep in mind for DB
    
    # cleans the document
    tokens = clean_doc(doc)
    # sanity checks, uncomment to see first 200 tokens
    print(tokens[:200])
    print('Total Tokens: %d' % len(tokens))
    print('Unique Tokens: %d' % len(set(tokens)))
    
    # organize into sequences of tokens
    length = 50 + 1
    sequences = list()
    for i in range(length, len(tokens)):
        # select sequence of tokens
        seq = tokens[i-length:i]
        # converts into a line
        line = ' '.join(seq)
        # store
        sequences.append(line)
    print('Total Sequences: %d' % len(sequences))
    
    # save sequences to file
    out_filename = 'example'
    save_doc(sequences, out_filename)
    
def preprocessing_val(document):
    # loads the document
    in_filename = document
    doc = load_doc(in_filename)
    # sanity check, uncomment to see first 200 characters
    # print(doc[:200])
    
    # FYI, not 100% sure about how variables get returned. Keep in mind for DB
    
    # cleans the document
    tokens = clean_doc(doc)
    # sanity checks, uncomment to see first 200 tokens
    print(tokens[:200])
    print('Total Tokens: %d' % len(tokens))
    print('Unique Tokens: %d' % len(set(tokens)))
    
    # organize into sequences of tokens
    length = 50 + 1
    sequences = list()
    for i in range(length, len(tokens)):
        # select sequence of tokens
        seq = tokens[i-length:i]
        # converts into a line
        line = ' '.join(seq)
        # store
        sequences.append(line)
    print('Total Sequences: %d' % len(sequences))
    
    # save sequences to file
    out_filename = 'example2'
    save_doc(sequences, out_filename)
    
def preprocessing_test(document):
    # loads the document
    in_filename = document
    doc = load_doc(in_filename)
    # sanity check, uncomment to see first 200 characters
    # print(doc[:200])
    
    # FYI, not 100% sure about how variables get returned. Keep in mind for DB
    
    # cleans the document
    tokens = clean_doc(doc)
    # sanity checks, uncomment to see first 200 tokens
    print(tokens[:200])
    print('Total Tokens: %d' % len(tokens))
    print('Unique Tokens: %d' % len(set(tokens)))
    
    # organize into sequences of tokens
    length = 50 + 1
    sequences = list()
    for i in range(length, len(tokens)):
        # select sequence of tokens
        seq = tokens[i-length:i]
        # converts into a line
        line = ' '.join(seq)
        # store
        sequences.append(line)
    print('Total Sequences: %d' % len(sequences))
    
    # save sequences to file
    out_filename = 'example3'
    save_doc(sequences, out_filename)
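
The three `preprocessing_*` functions above are identical apart from the output filename, so the windowing step could live in one shared helper. The following is a minimal sketch of that step only; `build_sequences` and its `context` parameter are hypothetical names, not part of the notebook. One assumption to flag: the loops above use `range(length, len(tokens))`, which stops one window short of the end of the corpus, while this sketch includes the final window.

```python
def build_sequences(tokens, context=50):
    """Return space-joined windows of `context` input words plus 1 target word.

    Hypothetical helper: same sliding-window idea as the preprocessing
    functions above, but parameterised so train/val/test can share it.
    """
    length = context + 1
    sequences = []
    # stop at len(tokens) + 1 so the window ending on the last token is kept
    for i in range(length, len(tokens) + 1):
        sequences.append(' '.join(tokens[i - length:i]))
    return sequences


# toy example: 2 context words + 1 target word per line
demo = build_sequences(['a', 'b', 'c', 'd', 'e'], context=2)
print(demo)  # ['a b c', 'b c d', 'c d e']
```

Each preprocessing function could then reduce to loading, cleaning, calling the helper, and saving under its own filename.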

## Training the Language Model

Unlike in my second series of notes, I will use the embedding layer to learn the representations of words, and then two Dense layers with the requisite activation functions and see what happens. 

### Loading the Sequences

In [3]:
document = 'data/brown-train.txt'
# preprocesses the document and saves to file
preprocessing_train(document)

# loads doc into memory
in_filename = "example_sequences.txt"
doc = load_doc(in_filename)
lines = doc.split('\n')

['the', 'fulton', 'county', 'grand', 'jury', 'said', 'friday', 'an', 'investigation', 'of', 'atlantas', 'recent', 'primary', 'election', 'produced', 'no', 'evidence', 'that', 'any', 'irregularities', 'took', 'place', 'the', 'jury', 'further', 'said', 'in', 'termend', 'presentments', 'that', 'the', 'city', 'executive', 'committee', 'which', 'had', 'overall', 'charge', 'of', 'the', 'election', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'city', 'of', 'atlanta', 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', 'the', 'septemberoctober', 'term', 'jury', 'had', 'been', 'charged', 'by', 'fulton', 'superior', 'court', 'judge', 'durwood', 'pye', 'to', 'investigate', 'reports', 'of', 'possible', 'irregularities', 'in', 'the', 'hardfought', 'primary', 'which', 'was', 'won', 'by', 'mayornominate', 'ivan', 'allen', 'jr', 'only', 'a', 'relative', 'handful', 'of', 'such', 'reports', 'was', 'received', 'the', 'jury', 'said', 'considering', 'the', 'widesprea
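
The token list above contains merged words such as `septemberoctober` and `hardfought`. That appears to follow from how `clean_doc` works: it splits on whitespace first and only then deletes punctuation, so a hyphen or apostrophe inside a word is removed without leaving a gap. The sketch below repeats the same steps on a made-up sample sentence (the sentence is an assumption for illustration, not text from the corpus file).

```python
import string


def clean_doc(doc):
    # same steps as the helper above: split, strip punctuation, keep alphabetic, lowercase
    doc = doc.replace('--', ' ')
    tokens = doc.split()
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    tokens = [w for w in tokens if w.isalpha()]
    return [w.lower() for w in tokens]


# made-up sample sentence, only to show the effect on hyphens and apostrophes
sample = "The September-October term jury -- Atlanta's hard-fought primary, 1961."
print(clean_doc(sample))
# ['the', 'septemberoctober', 'term', 'jury', 'atlantas', 'hardfought', 'primary']
```

If the merged tokens turn out to matter, one option would be to replace single hyphens with spaces before splitting, at the cost of a slightly different vocabulary.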

In [4]:
# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)

# vocabulary size
vocab_size = len(tokenizer.word_index) + 1

In [5]:
# separate into input and output
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]
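
One likely reason the full corpus strains memory is the one-hot `y`: `to_categorical` produces one float per vocabulary entry for every sequence. The sketch below is a rough back-of-the-envelope estimate. The vocabulary size of 22463 is taken from the model summary further down; the sequence count of one million is a hypothetical round number, since the real count is not shown in the output, and 4 bytes per value assumes the default `float32`.

```python
def one_hot_bytes(num_sequences, vocab_size, bytes_per_value=4):
    """Memory for a dense one-hot target matrix (float32 by default)."""
    return num_sequences * vocab_size * bytes_per_value


def integer_label_bytes(num_sequences, bytes_per_value=4):
    """Memory if each target is kept as a single integer class id."""
    return num_sequences * bytes_per_value


num_sequences = 1_000_000  # hypothetical: the real count is not shown in the output
vocab_size = 22463         # from the model summary (dense_2 output size)

print(one_hot_bytes(num_sequences, vocab_size) / 1e9, 'GB one-hot')
print(integer_label_bytes(num_sequences) / 1e6, 'MB integer labels')
```

Under those assumptions the one-hot targets come to roughly 90 GB, against about 4 MB for integer labels. Keras provides a `sparse_categorical_crossentropy` loss that accepts integer targets directly, which might make halving the corpus unnecessary.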

### What I Changed

Here, I replaced the Embedding vector space with the dimensions that I thought Bengio noted for his MLPs (as the second argument for `Embedding()`), I removed the LSTMs, and changed the activation for the first dense layer to tanh. I also changed the first argument of `Dense` to 50, because I think that's the number of hidden units.

I'm winging it on the epochs. I can change it to something higher if the accuracy I get makes it seem like I lowballed it.

If it takes a while to train, I'll cut the batch size in half. Likewise, I might increase it if I can.

I don't really know how to implement the mix that he mentions, or the order, or the direct connections, so this is as good as it's going to get for a bit. I can always ask David about that later.
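
To make the intended architecture concrete, here is a minimal pure-Python sketch of the forward pass of a Bengio-style MLP language model: look up an embedding for each context word, concatenate them, apply a tanh hidden layer, and finish with a softmax over the vocabulary. The names `C`, `H`, `d`, `U`, and `b` follow the notation in the Bengio paper; the toy sizes and random weights are assumptions for illustration only, and the direct connections and the mix are left out.

```python
import math
import random


def mlp_lm_forward(context_ids, C, H, d, U, b):
    """Embeddings -> concatenate -> tanh hidden layer -> softmax over the vocabulary."""
    # look up and concatenate the embedding of each context word
    x = [value for word_id in context_ids for value in C[word_id]]
    # hidden layer: tanh(H x + d)
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x)) + bias)
              for row, bias in zip(H, d)]
    # output scores: U hidden + b (no direct connections in this sketch)
    scores = [sum(w * h for w, h in zip(row, hidden)) + bias
              for row, bias in zip(U, b)]
    # softmax, subtracting the max score for numerical stability
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
vocab, emb_dim, context, hidden_units = 10, 4, 3, 5  # toy sizes, not the real ones
C = random_matrix(vocab, emb_dim, rng)                 # word embeddings
H = random_matrix(hidden_units, context * emb_dim, rng)
d = [0.0] * hidden_units
U = random_matrix(vocab, hidden_units, rng)
b = [0.0] * vocab

probs = mlp_lm_forward([1, 4, 7], C, H, d, U, b)
print(len(probs), round(sum(probs), 6))
```

In Keras terms, the concatenation step would presumably correspond to a `Flatten` layer between `Embedding` and the first `Dense` layer.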

In [None]:
# define the model
model = Sequential()
model.add(Embedding(vocab_size, 60, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(50, activation='tanh'))
model.add(Dense(vocab_size, activation='softmax'))

# sanity check 
print(model.summary())
    
# compiles the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    
# fits the model
history = model.fit(X, y, batch_size=128, epochs=25)

Instructions for updating:
Colocations handled automatically by placer.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 50, 60)            1347780   
_________________________________________________________________
lstm_1 (LSTM)                (None, 50, 100)           64400     
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_1 (Dense)              (None, 50)                5050      
_________________________________________________________________
dense_2 (Dense)              (None, 22463)             1145613   
Total params: 2,643,243
Trainable params: 2,643,243
Non-trainable params: 0
_________________________________________________________________
None
Instructions for updating:
Use tf.cast instead.
Epoch 1
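
The summary above still lists `lstm_1` and `lstm_2`, so the run it records includes the LSTM layers. The parameter counts can be reproduced by hand, and the same arithmetic gives a rough idea of what an MLP variant would cost. This is a sketch: the flatten-then-dense layout used for the MLP variant is an assumption about how the LSTMs would be replaced, not something recorded in these notes.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim


def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    return input_dim * units + units


vocab_size, emb_dim, seq_length = 22463, 60, 50

recorded = (embedding_params(vocab_size, emb_dim)   # embedding_1
            + lstm_params(emb_dim, 100)             # lstm_1
            + lstm_params(100, 100)                 # lstm_2
            + dense_params(100, 50)                 # dense_1
            + dense_params(50, vocab_size))         # dense_2
print(recorded)  # 2643243, matching "Total params" in the summary

# assumed MLP variant: flatten the 50 x 60 embeddings, then Dense(50), then softmax
mlp_variant = (embedding_params(vocab_size, emb_dim)
               + dense_params(seq_length * emb_dim, 50)
               + dense_params(50, vocab_size))
print(mlp_variant)  # 2643443
```

Either way, almost all of the parameters sit in the embedding and the output layer, so the vocabulary size dominates the model size.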

In [None]:
# saves the model
model.save('model.h5')

# save the tokenizer
dump(tokenizer, open('tokenizer.pkl', 'wb'))

### Measurements Taken from TF Tutorial 

I'm just now noticing that the model I did in the tutorials explicitly separates the training and validation data and reports the accuracy and loss for both of them while training. I might see if I can play around with that after I finish the first hypothesis. 

To get the history, I assigned the return value of `model.fit()` to a variable (`history`). TBD if this interferes with anything else.

It also looks like I might have to format the validation data and the test data in the same way as I formatted the training data in order to get it to work properly. Keep that in mind below.
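
Formatting the data "in the same way" includes using the same word-to-integer mapping: if a new mapping is fitted on the validation or test text, the same word gets a different integer and the model's predictions no longer line up with the targets. The sketch below shows the idea in pure Python with a reserved index for unseen words. The function names and the `<unk>` token are hypothetical, and unlike the Keras `Tokenizer` this toy version numbers words in order of first appearance rather than by frequency. The Keras `Tokenizer` has an `oov_token` argument that serves a similar purpose.

```python
def fit_word_index(train_tokens, oov_token='<unk>'):
    """Map each training word to an integer; 0 is left for padding, 1 for unknowns."""
    word_index = {oov_token: 1}
    for word in train_tokens:
        if word not in word_index:
            word_index[word] = len(word_index) + 1
    return word_index


def encode(tokens, word_index, oov_token='<unk>'):
    """Encode tokens with an existing mapping; unseen words share the OOV index."""
    return [word_index.get(word, word_index[oov_token]) for word in tokens]


# fit once on the training tokens...
train_tokens = ['the', 'jury', 'said', 'the', 'election']
word_index = fit_word_index(train_tokens)

# ...then reuse the same mapping for validation/test text
print(encode(['the', 'mayor', 'said'], word_index))  # [2, 1, 4]
```

Because unseen words keep their position, every encoded line stays the same length as in training.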

#### Validation Dataset

In [None]:
document = 'data/brown-val.txt'
# preprocesses the document and saves to file
preprocessing_val(document)

# loads doc into memory
in_filename = "example2_sequences.txt"
doc = load_doc(in_filename)
lines = doc.split('\n')

In [None]:
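# integer encode sequences of words
# reuse the tokenizer fitted on the training data so the word indices
# (and vocab_size) match what the model was trained on; refitting here
# would assign different integers to the same words
sequences = tokenizer.texts_to_sequences(lines)
# rough workaround: words unseen in training are dropped by the tokenizer,
# so pad shortened lines back to the training length (inputs + 1 target)
sequences = pad_sequences(sequences, maxlen=seq_length + 1)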

In [None]:
# separate into input and output
sequences = array(sequences)
vX, vy = sequences[:,:-1], sequences[:,-1]
vy = to_categorical(vy, num_classes=vocab_size)
seq_length = vX.shape[1]

In [None]:
val_results = model.evaluate(vX, vy)

#### Test Dataset

In [None]:
document = 'data/brown-test.txt'
# preprocesses the document and saves to file
preprocessing_test(document)

# loads doc into memory
in_filename = "example3_sequences.txt"
doc = load_doc(in_filename)
lines = doc.split('\n')

In [None]:
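# integer encode sequences of words
# reuse the tokenizer fitted on the training data so the word indices
# (and vocab_size) match what the model was trained on; refitting here
# would assign different integers to the same words
sequences = tokenizer.texts_to_sequences(lines)
# rough workaround: words unseen in training are dropped by the tokenizer,
# so pad shortened lines back to the training length (inputs + 1 target)
sequences = pad_sequences(sequences, maxlen=seq_length + 1)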

In [None]:
# separate into input and output
sequences = array(sequences)
tX, ty = sequences[:,:-1], sequences[:,-1]
ty = to_categorical(ty, num_classes=vocab_size)
seq_length = tX.shape[1]

In [None]:
test_results = model.evaluate(tX, ty)

### Graphing and Calculating Perplexity 

After I run `model.fit()`, it returns a `history` object that contains a dictionary with everything that happened during training. There are currently only two entries, one for each monitored metric. The following code makes sure that these are present:

In [None]:
history_dict = history.history
history_dict.keys()

In [None]:
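# model.evaluate() returns a plain list of scalars ([loss, acc]), not a
# History object, so pair the values with the model's metric names
val_dict = dict(zip(model.metrics_names, val_results))
val_dict.keys()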

In [None]:
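# model.evaluate() returns a plain list of scalars ([loss, acc]), not a
# History object, so pair the values with the model's metric names
test_dict = dict(zip(model.metrics_names, test_results))
test_dict.keys()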

In [None]:
import matplotlib.pyplot as plt

acc = history_dict['acc']
loss = history_dict['loss']
# model.evaluate() returns single [loss, acc] values, not per-epoch lists
val_loss, val_acc = val_results
test_loss, test_acc = test_results

epochs = range(1, len(acc) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training Loss')
# the validation and test losses are single numbers measured after training,
# so draw them as horizontal reference lines rather than per-epoch curves
plt.axhline(val_loss, color='b', label='Validation Loss')
plt.axhline(test_loss, color='r', linestyle='--', label='Test Loss')
plt.title('Training, Validation, and Test Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

In [None]:
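# perplexity is exp(cross-entropy loss); the dictionary key is the string 'loss'
# numpy.exp is used here because it returns plain numbers, whereas tf.exp
# returns a tensor that may need to be evaluated before printing
tppl = numpy.exp(history_dict['loss'])  # one value per training epoch

# model.evaluate() returns [loss, acc], so the loss is the first element
vppl = numpy.exp(val_results[0])

sppl = numpy.exp(test_results[0])

print(tppl, vppl, sppl)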
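
For reference, the reason exponentiating the loss gives perplexity: Keras' categorical cross-entropy is the mean negative natural-log probability assigned to the correct next word, and perplexity is the exponential of that mean. The sketch below works this through in pure Python with made-up probabilities; the function names are hypothetical. A model that guesses uniformly over the vocabulary has a perplexity equal to the vocabulary size, which gives a baseline to compare against.

```python
import math


def cross_entropy(target_probabilities):
    """Mean negative log-likelihood (in nats) of the correct next words."""
    return -sum(math.log(p) for p in target_probabilities) / len(target_probabilities)


def perplexity(loss):
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(loss)


vocab_size = 22463  # from the model summary above

# baseline: a model that assigns 1 / vocab_size to every correct word
uniform = [1 / vocab_size] * 5
print(perplexity(cross_entropy(uniform)))  # approximately 22463

# made-up probabilities for a model that does better than chance
better = [0.1, 0.02, 0.3, 0.05, 0.01]
print(perplexity(cross_entropy(better)))
```

Anything the trained model scores below the vocabulary size is an improvement over uniform guessing.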