# Bonus Lab - Neural Language Model
A language model predicts the next word in a sequence based on the words that have come before it.

It is also possible to develop language models at the character level using neural networks. The benefit of character-based language models is their small vocabulary and flexibility in handling any words, punctuation, and other document structure. This comes at the cost of requiring larger models that are slower to train.

Nevertheless, in the field of neural language models, character-based models offer a lot of promise for a general, flexible and powerful approach to language modeling.

As a prerequisite for the lab, make sure to pip install:
- keras
- tensorflow
- h5py

# Source Text Creation

To start, we'll use a simple nursery rhyme. It's quite short, so we can train something on a CPU and still see relatively interesting results. Please copy and paste the following text into a text file and save it as "rhymes.txt" (the file name used by the code below), in the same directory as this Jupyter notebook. Alternatively, run the cell below that writes the file for you:

In [0]:
!pip install tensorflow
!pip install keras
!pip install h5py

In [0]:
s='Sing a song of sixpence,\
A pocket full of rye.\
Four and twenty blackbirds,\
Baked in a pie.\
When the pie was opened\
The birds began to sing;\
Wasn’t that a dainty dish,\
To set before the king.\
The king was in his counting house,\
Counting out his money;\
The queen was in the parlour,\
Eating bread and honey.\
The maid was in the garden,\
Hanging out the clothes,\
When down came a blackbird\
And pecked off her nose.'

with open('rhymes.txt','w') as f:
  f.write(s)

    Sing a song of sixpence,
    A pocket full of rye.
    Four and twenty blackbirds,
    Baked in a pie.

    When the pie was opened
    The birds began to sing;
    Wasn’t that a dainty dish,
    To set before the king.

    The king was in his counting house,
    Counting out his money;
    The queen was in the parlour,
    Eating bread and honey.

    The maid was in the garden,
    Hanging out the clothes,
    When down came a blackbird
    And pecked off her nose.

# Importing Modules

In [0]:
import numpy
from numpy import array
from pickle import dump
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from pickle import load
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.layers import Embedding

# Sequence Generation

A language model must be trained on the text, and in the case of a character-based language model, the input and output sequences must be characters.

The number of characters used as input will also define the number of characters that will need to be provided to the model in order to elicit the first predicted character.

After the first character has been generated, it can be appended to the input sequence and used as input for the model to generate the next character.

Longer sequences offer more context for the model to learn what character to output next but take longer to train and impose more burden on seeding the model when generating text.

We will use an arbitrary length of 10 characters for this model.

There is not a lot of text, and 10 characters is a few words.
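
As a minimal sketch of what "input and output sequences of characters" means, the snippet below slides a window over a short toy string. The helper name `make_windows` and the window length of 4 are illustrative assumptions, not part of the lab code; the lab itself uses a length of 10.

```python
def make_windows(text, length):
    """Return every substring of length + 1: `length` input chars plus 1 target char."""
    return [text[i - length:i + 1] for i in range(length, len(text))]

toy_text = "Sing a song"   # 11 characters
windows = make_windows(toy_text, 4)

# Show each window as (input characters) -> (character to predict)
for window in windows:
    print(repr(window[:-1]), '->', repr(window[-1]))
```

Each window holds `length` input characters plus one target character, so a text of n characters yields n - length windows.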

We can now transform the raw text into a form that our model can learn; specifically, input and output sequences of characters.

In [0]:
# load doc into memory
def load_doc(filename):
    # open the file as read only and read all text;
    # the with-block closes the file automatically
    with open(filename, 'r') as file:
        text = file.read()
    return text

# save sequences to file, one sequence per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    with open(filename, 'w') as file:
        file.write(data)

In [0]:
#load text
raw_text = load_doc('rhymes.txt')
print(raw_text)

# clean
tokens = raw_text.split()
raw_text = ' '.join(tokens)

print(raw_text)

# organize into sequences of characters
length = 10
sequences = list()
for i in range(length, len(raw_text)):
    # select sequence of tokens
    seq = raw_text[i-length:i+1]
    # store
    sequences.append(seq)
print('Total Sequences: %d' % len(sequences))

Sing a song of sixpence,A pocket full of rye.Four and twenty blackbirds,Baked in a pie.When the pie was openedThe birds began to sing;Wasn’t that a dainty dish,To set before the king.The king was in his counting house,Counting out his money;The queen was in the parlour,Eating bread and honey.The maid was in the garden,Hanging out the clothes,When down came a blackbirdAnd pecked off her nose.
Sing a song of sixpence,A pocket full of rye.Four and twenty blackbirds,Baked in a pie.When the pie was openedThe birds began to sing;Wasn’t that a dainty dish,To set before the king.The king was in his counting house,Counting out his money;The queen was in the parlour,Eating bread and honey.The maid was in the garden,Hanging out the clothes,When down came a blackbirdAnd pecked off her nose.
Total Sequences: 384


In [0]:
# save sequences to file
out_filename = 'char_sequences.txt'
save_doc(sequences, out_filename)

# Train a Model
In this section, we will develop a neural language model for the prepared sequence data.

The model will read encoded characters and predict the next character in the sequence. A Long Short-Term Memory recurrent neural network hidden layer will be used to learn the context from the input sequence in order to make the predictions.

In [0]:
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

In [0]:
# load
in_filename = 'char_sequences.txt'
raw_text = load_doc(in_filename)
lines = raw_text.split('\n')

The sequences of characters must be encoded as integers. This means that each unique character will be assigned a specific integer value and each sequence of characters will be encoded as a sequence of integers. We can create the mapping given a sorted set of unique characters in the raw input data. The mapping is a dictionary of character values to integer values.

Next, we can process each sequence of characters one at a time and use the dictionary mapping to look up the integer value for each character. The result is a list of integer lists.

We need to know the size of the vocabulary later. We can retrieve this as the size of the dictionary mapping.
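
The encoding steps described above can be sketched on a toy example. The two toy lines are made up for illustration, and `np.eye` indexing is used here as a stand-in for Keras's `to_categorical` so the snippet runs with NumPy alone.

```python
import numpy as np

toy_lines = ["abca", "bcab"]   # hypothetical sequences, not the lab data

# Map each unique character to an integer, in sorted order
chars = sorted(set("".join(toy_lines)))
mapping = {c: i for i, c in enumerate(chars)}

# Encode every line as a list of integers
encoded = [[mapping[c] for c in line] for line in toy_lines]
vocab_size = len(mapping)

# One hot encode: row k of the identity matrix is the vector for integer k
one_hot = np.eye(vocab_size)[np.array(encoded)]

print(mapping)
print(encoded)
print(one_hot.shape)   # (number of lines, characters per line, vocab_size)
```

The resulting array has the same three-dimensional layout (samples, time steps, features) that the LSTM layer expects.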

In [0]:
# integer encode sequences of characters
chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars))
sequences = list()
for line in lines:
    # integer encode line
    encoded_seq = [mapping[char] for char in line]
    # store
    sequences.append(encoded_seq)

# vocabulary size
vocab_size = len(mapping)
print('Vocabulary Size: %d' % vocab_size)

# separate into input and output
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
sequences = [to_categorical(x, num_classes=vocab_size) for x in X]
X = array(sequences)
y = to_categorical(y, num_classes=vocab_size)

Vocabulary Size: 38


The model is defined with an input layer that takes sequences that have 10 time steps and 38 features for the one hot encoded input sequences. Rather than specify these numbers, we use the second and third dimensions on the X input data. This is so that if we change the length of the sequences or size of the vocabulary, we do not need to change the model definition.

The model has a single LSTM hidden layer with 75 memory cells. The model has a fully connected output layer that outputs one vector with a probability distribution across all characters in the vocabulary. A softmax activation function is used on the output layer to ensure the output has the properties of a probability distribution.

The model is learning a multi-class classification problem, therefore we use the categorical log loss intended for this type of problem. The efficient Adam implementation of gradient descent is used to optimize the model and accuracy is reported at the end of each batch update. The model is fit for 50 training epochs.
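
The size of the model follows from the layer shapes. As a rough check, assuming the standard Keras LSTM layout (four gates, each with input weights, recurrent weights, and a bias) and a Dense layer with bias, the parameter counts can be computed by hand. The helper names below are hypothetical.

```python
def lstm_params(units, features):
    # four gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (units * (features + units) + units)

def dense_params(inputs, outputs):
    # one weight per input-output pair plus one bias per output
    return inputs * outputs + outputs

features = 38   # one hot vocabulary size reported above
for units in (75, 500, 1000):
    lstm = lstm_params(units, features)
    dense = dense_params(units, features)
    print(units, lstm, dense, lstm + dense)
```

For 75 units this gives 34,200 LSTM parameters and 2,888 Dense parameters, which agrees with the `model.summary()` output in this notebook.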

In [0]:
# define model
model = Sequential()
model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history=model.fit(X, y, epochs=50)

# save the model to file
model.save('model_75units.h5')
# save the mapping
dump(mapping, open('mapping_75units.pkl', 'wb'))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_25 (LSTM)               (None, 75)                34200     
_________________________________________________________________
dense_24 (Dense)             (None, 38)                2888      
Total params: 37,088
Trainable params: 37,088
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/50
...

# Generating Text

We must provide sequences of 10 characters as input to the model in order to start the generation process. We will pick these manually. A given input sequence will need to be prepared in the same way as preparing the training data for the model. 
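
The generation loop itself does not depend on Keras. As a sketch, assuming a hypothetical `predict_next` callable that stands in for the trained model, the loop keeps only the most recent `seq_length` characters, predicts one character, appends it, and repeats. The lookup table used as the "model" here is a toy built from a made-up string.

```python
def generate(predict_next, seed_text, seq_length, n_chars):
    """Greedy generation: repeatedly predict one character from the last window."""
    text = seed_text
    for _ in range(n_chars):
        window = text[-seq_length:]     # keep only the most recent characters
        text += predict_next(window)    # append the predicted character
    return text

# Toy stand-in for a trained model: a table that memorizes which
# character followed each 3-character window in a made-up source string.
source = "abcabcabc"
table = {source[i - 3:i]: source[i] for i in range(3, len(source))}

def toy_predict(window):
    return table.get(window, '?')       # '?' for windows never seen

result = generate(toy_predict, "abc", 3, 4)
unseen = generate(toy_predict, "xyz", 3, 2)
print(result)
print(unseen)
```

A seed longer than `seq_length` is simply truncated to its last `seq_length` characters, which is also what `pad_sequences(..., truncating='pre')` does in the lab code.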

In [0]:
# generate a sequence of characters with a language model
def generate_seq(model, mapping, seq_length, seed_text, n_chars):
    in_text = seed_text
    # reverse lookup table: integer index -> character
    reverse_mapping = {index: char for char, index in mapping.items()}
    # generate a fixed number of characters
    for _ in range(n_chars):
        # encode the characters as integers
        encoded = [mapping[char] for char in in_text]
        # keep only the last seq_length integers (pads on the left if shorter)
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # one hot encode; the result has shape (1, seq_length, vocab_size)
        encoded = to_categorical(encoded, num_classes=len(mapping))
        # predict the most likely next character; predict_classes() was removed
        # in TensorFlow 2.6, so take the argmax of the predicted probabilities
        yhat = numpy.argmax(model.predict(encoded, verbose=0), axis=-1)[0]
        # reverse map integer to character and append to input
        in_text += reverse_mapping[int(yhat)]
    return in_text

# load the model
model_75 = load_model('model_75units.h5')
# load the mapping
mapping_75 = load(open('mapping_75units.pkl', 'rb'))

The generation tests later in this notebook use three kinds of seed text.

The first is a test to see how the model does at starting from the beginning of the rhyme. The second is a test to see how well it does at beginning in the middle of a line. The final example is a test to see how well it does with a sequence of characters never seen before.

If the results aren't satisfactory, try out the suggestions above or these below:
- Padding. Update the example to provide sequences line by line only and use padding to fill out each sequence to the maximum line length.
- Sequence Length. Experiment with different sequence lengths and see how they impact the behavior of the model.
- Tune Model. Experiment with different model configurations, such as the number of memory cells and epochs, and try to develop a better model for fewer resources.



# To Do:
- Try different numbers of memory cells
- Try different types and amounts of recurrent and fully connected layers
- Try different lengths of training epochs
- Try different sequence lengths and pre-processing of data
- Try regularization techniques such as Dropout

# Deliverables to receive credit
The following deliverables will receive increasing amounts of bonus credit:

1. Optimize the cells above so that the model works reasonably well on the generated sentences above. Again, language models require a lot of computation, so this toy problem is great for rapid experimentation with different aspects of deep learning language models.
2. Write a function to split the text corpus file into training and validation sets and pass the validation data to the model.fit() function so that you can track validation error per epoch. Look up the Keras documentation to see how this is handled.
3. Write a summary (methods and results) in the cells below of the different things you applied. You must include your intuitions about what worked well and what did not.
4. Do something even more interesting. Try a different source text. Train a word-level model. We'll leave it up to your creativity to explore and write a summary of your methods and results.
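
For the second deliverable, one possible approach (a sketch, not the required solution) is to shuffle the encoded sequences and hold out a fraction for validation. The function name `train_val_split`, the 20% fraction, and the zero-filled demo arrays are assumptions for illustration; `validation_data` is an existing argument of Keras `model.fit()`.

```python
import numpy as np

def train_val_split(X, y, val_fraction=0.2, seed=0):
    """Shuffle the samples and split them into training and validation parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_val = int(len(X) * val_fraction)
    val_idx, train_idx = order[:n_val], order[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

# Demo arrays with the same shapes as X and y in this notebook
X_demo = np.zeros((384, 10, 38))
y_demo = np.zeros((384, 38))
X_tr, y_tr, X_val, y_val = train_val_split(X_demo, y_demo)
print(X_tr.shape, X_val.shape)

# With a compiled model, validation loss and accuracy are then reported per epoch:
# history = model.fit(X_tr, y_tr, epochs=50, validation_data=(X_val, y_val))
```

Because the training windows overlap heavily, a random split like this leaves validation sequences that share most of their characters with training sequences, so the validation error is likely to be optimistic.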


In [0]:
# 1. Optimizing the Current Model

### The model can be optimized by increasing the number of memory units. Let's try training with 500 and 1000 units and comparing the results at the end:

In [0]:
# define model
model = Sequential()
model.add(LSTM(500, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history=model.fit(X, y, epochs=50)

# save the model to file
model.save('model_500units.h5')
# save the mapping
dump(mapping, open('mapping_500units.pkl', 'wb'))

# load the model
model_500 = load_model('model_500units.h5')
# load the mapping
mapping_500 = load(open('mapping_500units.pkl', 'rb'))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_26 (LSTM)               (None, 500)               1078000   
_________________________________________________________________
dense_25 (Dense)             (None, 38)                19038     
Total params: 1,097,038
Trainable params: 1,097,038
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/50
...

### We can observe that the model actually gets better. If we increase the memory units to 1000, then:

In [0]:
# define model
model = Sequential()
model.add(LSTM(1000, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history=model.fit(X, y, epochs=50)

# save the model to file
model.save('model_1000units.h5')
# save the mapping
dump(mapping, open('mapping_1000units.pkl', 'wb'))

# load the model
model_1000 = load_model('model_1000units.h5')
# load the mapping
mapping_1000 = load(open('mapping_1000units.pkl', 'rb'))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_12 (LSTM)               (None, 1000)              4156000   
_________________________________________________________________
dense_12 (Dense)             (None, 38)                38038     
Total params: 4,194,038
Trainable params: 4,194,038
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/50
...

### And the results are:

In [0]:
# test start of rhyme 75 units
print(generate_seq(model_75, mapping_75, 10, 'Sing a son', 100))
# test mid-line
print(generate_seq(model_75, mapping_75, 10, 'king was i', 100))
# test mid-line
print(generate_seq(model_75, mapping_75, 10, 'Hanging out', 100))
# test not in original
print(generate_seq(model_75, mapping_75, 10, 'hello world', 40))

Sing a song hen oue oue ooue oun ooue ouue onn ooue hhe onne.The biid wss nntee borrd aend wnn cae bais in  on
king was int he in.The iin wws in  ae in.he pie was onn yhe larrrdn,danngtoe the  arse.anng ing ous theenothe 
Hanging out the clotre,,ouen  poee thee ooe .The  ine.wTe  i  ie .Whe onng;The unte wan tia id isn peee.he oun 
hello worlde.eaednng win  hee aas onng oue out he i


In [0]:
# test start of rhyme 500 units
print(generate_seq(model_500, mapping_500, 10, 'Sing a son', 100))
# test mid-line
print(generate_seq(model_500, mapping_500, 10, 'king was i', 100))
# test mid-line
print(generate_seq(model_500, mapping_500, 10, 'Hanging out', 100))
# test not in original
print(generate_seq(model_500, mapping_500, 10, 'hello world', 40))

Sing a song of sixpence,A pocket full of rye.Four and twenty blackbirds,Baked in a pie.When the pie was opened
king was in his counting house,Counting out his money;The queen was in the garden,Hanging out the clothes,When
Hanging out the clothes,When down came a blackbirds,Baked in a pie.When the pie was openedThe birds began to si
hello world.Furaattttttidddaaad n tthinyWWha ,’nttt


In [0]:
# test start of rhyme 1000 units
print(generate_seq(model_1000, mapping_1000, 10, 'Sing a son', 100))
print(' ')
# test mid-line
print(generate_seq(model_1000, mapping_1000, 10, 'king was i', 100))
# test mid-line
print(generate_seq(model_1000, mapping_1000, 10, 'Hanging out', 100))
# test not in original
print(generate_seq(model_1000, mapping_1000, 10, 'hello world', 40))

Sing a song of sixpence,A pocket full of rye.Four and twenty blackbirdAnd pecked off her nose.,eenbin  wgkked 
 
king was in his counting house,Counting out his money;The queen was in the garden,Hanging out the clothes,When
Hanging out the clothes,When down came a blackbirdAnd pecked off her nose.,eenbin  wgkked Wae iir hasper anntho
hello world.Foe  nng twr  oe ga  r  riygyTaa mnedt 


### We can confirm that increasing the number of units does not necessarily mean the model will perform better overall! Looking closely, the 500-unit model performs better at the beginning of the rhyme, although it does not accurately output the correct words at the end. The 1000-unit model predicts accurately at the end but not at the beginning.
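
Comparing outputs by eye is error-prone. One possible way to quantify the comparison, sketched here with a hypothetical helper `matching_prefix_length`, is to count how many generated characters after the seed agree with the source text before the first mismatch. The two strings below are fragments copied from the rhyme and from the generated output above.

```python
def matching_prefix_length(source, generated, seed_length):
    """Count generated characters after the seed that match the source text."""
    seed = generated[:seed_length]
    start = source.find(seed)
    if start == -1:
        return 0                      # seed does not occur in the source
    expected = source[start + seed_length:]
    produced = generated[seed_length:]
    count = 0
    for want, got in zip(expected, produced):
        if want != got:
            break
        count += 1
    return count

source = "Four and twenty blackbirds,Baked in a pie."
generated = "Four and twenty blackbirdAnd pecked off her nose."
score = matching_prefix_length(source, generated, 10)
print(score)
```

A larger count means the model reproduced more of the rhyme before drifting, which gives a single number to compare across the 75-, 500-, and 1000-unit models.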
### ------------------------------------------------------------------------------------------------------------------------------------------------------
### Let us stick with the 500-unit model and see whether changing the number of epochs gives results similar to those of the 1000-unit model:

In [0]:
# define model
model = Sequential()
model.add(LSTM(500, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history=model.fit(X, y, epochs=500)

# save the model to file
model.save('model_500units_500epochs.h5')
# save the mapping
dump(mapping, open('mapping_500units_500epochs.pkl', 'wb'))

# load the model
model_500_500 = load_model('model_500units_500epochs.h5')
# load the mapping
mapping_500_500 = load(open('mapping_500units_500epochs.pkl', 'rb'))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_14 (LSTM)               (None, 500)               1078000   
_________________________________________________________________
dense_14 (Dense)             (None, 38)                19038     
Total params: 1,097,038
Trainable params: 1,097,038
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/500
...
Epoch 500/500


500 units, 500 epochs:

In [0]:
# test start of rhyme 500 units _ 500 epochs
print(generate_seq(model_500_500, mapping_500_500, 10, 'Sing a son', 100))
# test mid-line
print(generate_seq(model_500_500, mapping_500_500, 10, 'king was i', 100))
# test mid-line
print(generate_seq(model_500_500, mapping_500_500, 10, 'Hanging out', 100))
# test not in original
print(generate_seq(model_500_500, mapping_500_500, 10, 'hello world', 40))

Sing a song of sixpence,A pocket full of rye.Four and twenty blackbirdAnd pecked off her nose..ere tat pigtbar
king was in his counting house,Counting out his money;The queen was in the parlour,Eating bread and honey.The 
Hanging out the clothes,When down came a blackbirdAnd pecked off her nose..ere tat pigtbardaa ddd itetyyella cb
hello world,For iin,wastii  oa  hisWWonnnttht tallo


500 units, 50 epochs:

In [0]:
# test start of rhyme 500 units 50 epochs
print(generate_seq(model_500, mapping_500, 10, 'Sing a son', 100))
# test mid-line
print(generate_seq(model_500, mapping_500, 10, 'king was i', 100))
# test mid-line
print(generate_seq(model_500, mapping_500, 10, 'Hanging out', 100))
# test not in original
print(generate_seq(model_500, mapping_500, 10, 'hello world', 40))

Sing a song of sixpence,A pocket full of rye.Four and twenty blackbirds,Baked in a pie.When the pie was opened
king was in his counting house,Counting out his money;The queen was in the garden,Hanging out the clothes,When
Hanging out the clothes,When down came a blackbirds,Baked in a pie.When the pie was openedThe birds began to si
hello world.Furaattttttidddaaad n tthinyWWha ,’nttt


### The results are interesting! A higher number of epochs does give better performance overall, but the model does not retain some of the accurate knowledge of the lower-epoch model. For example, "Four and twenty blackbirds,Baked in a pie" is reproduced correctly by the 50-epoch model, but the 500-epoch model cannot reproduce it accurately. Instead it outputs "Four and twenty blackbirdAnd pecked off her nose..". Maybe adding more layers with fewer units would help retain previous knowledge, but higher epochs still lead to better performance overall!

### ------------------------------------------------------------------------------------------------------------------------------------------------------

### Let us try to specify different levels of dropout and see the performance:

In [0]:
# define model
model = Sequential()
model.add(LSTM(500, dropout=0.3, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history=model.fit(X, y, epochs=50)

# save the model to file
model.save('model_500units_30drop.h5')
# save the mapping
dump(mapping, open('mapping_500units_30drop.pkl', 'wb'))

# load the model
model_500_30 = load_model('model_500units_30drop.h5')
# load the mapping
mapping_500_30 = load(open('mapping_500units_30drop.pkl', 'rb'))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_17 (LSTM)               (None, 500)               1078000   
_________________________________________________________________
dense_16 (Dense)             (None, 38)                19038     
Total params: 1,097,038
Trainable params: 1,097,038
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/50
...

In [0]:
# define model
model = Sequential()
model.add(LSTM(500, dropout=0.9, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history=model.fit(X, y, epochs=50)

# save the model to file
model.save('model_500units_90drop.h5')
# save the mapping
dump(mapping, open('mapping_500units_90drop.pkl', 'wb'))

# load the model
model_500_90 = load_model('model_500units_90drop.h5')
# load the mapping
mapping_500_90 = load(open('mapping_500units_90drop.pkl', 'rb'))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_18 (LSTM)               (None, 500)               1078000   
_________________________________________________________________
dense_17 (Dense)             (None, 38)                19038     
Total params: 1,097,038
Trainable params: 1,097,038
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/50
...

### Results for dropout=0.3:

In [0]:
# test start of rhyme 500 units 50 epochs, dropout=0.3
print(generate_seq(model_500_30, mapping_500_30, 10, 'Sing a son', 100))
# test mid-line
print(generate_seq(model_500_30, mapping_500_30, 10, 'king was i', 100))
# test mid-line
print(generate_seq(model_500_30, mapping_500_30, 10, 'Hanging out', 100))
# test not in original
print(generate_seq(model_500_30, mapping_500_30, 10, 'hello world', 40))

Sing a song of sixpence,A pocket full of rye.Four and twenty blackbirdAnd pecked off her nose.Haun ing ouusing
king was in his counting house,Counting ouu his coneeing hos ing oun ing oouse,Counting ouu his coneeing hos i
Hanging out the clothes,When town came a blackbirdAnd pecked off her nose.Haun ing ouusing ouuting oouse,Counti
hello worldeeerardanbboggtt d ing y  sis Wounting o


### Results for dropout=0.9:

In [0]:
# test start of rhyme 500 units 50 epochs, dropout=0.9
print(generate_seq(model_500_90, mapping_500_90, 10, 'Sing a son', 100))
# test mid-line
print(generate_seq(model_500_90, mapping_500_90, 10, 'king was i', 100))
# test mid-line
print(generate_seq(model_500_90, mapping_500_90, 10, 'Hanging out', 100))
# test not in original
print(generate_seq(model_500_90, mapping_500_90, 10, 'hello world', 40))

Sing a son  oe  in  an  in  an  in  an  in  an  in  an  in  an  in  an  in  an  in  an  in  an  in  an  in  an
king was in  an  in  an  in  an  in  an  in  an  in  an  in  an  in  an  in  an  in  an  in  an  in  an  in  a
Hanging out in  an  in  an  in  an  in  an  in  an  in  an  in  an  in  an  in  an  in  an  in  an  in  an  in 
hello world o s in  an  in  an  in  an  in  an  in 


### According to MLmastery: "Dropout is a regularization method where input and recurrent connections to LSTM units are probabilistically excluded from activation and weight updates while training a network. This has the effect of reducing overfitting and improving model performance." Increasing the dropout therefore leads to a less overfit model, but in this case also to a worse performer, because the goal here is to memorize the poem. In a sense, we want to overfit the model so that it reproduces the entire poem, so a dropout of 0 works better. The problem with high dropout is clear from the results for dropout=0.9, where the output collapses into a repeating " in  an " pattern.
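
As a rough illustration of what a dropout rate means, the sketch below applies "inverted" dropout to a list of ones in plain Python. It is not Keras's implementation; `apply_dropout` and its arguments are hypothetical names chosen for this example. In Keras, the `dropout` argument of `LSTM` applies to the layer inputs and `recurrent_dropout` to the recurrent state, so the cells above drop inputs only.

```python
import random

def apply_dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`
    and rescale the survivors by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_prob = 1.0 - rate
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(0)  # fixed seed so the run is repeatable
inputs = [1.0] * 10000
for rate in (0.0, 0.3, 0.9):
    dropped = apply_dropout(inputs, rate, rng)
    kept_fraction = sum(1 for v in dropped if v != 0.0) / len(dropped)
    print(f"dropout={rate}: about {kept_fraction:.0%} of inputs survive")
```

With a rate of 0.9, only about one input in ten reaches the LSTM at each training step, which is consistent with the collapsed output shown above.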

In [0]:
# 4.1 Training at a Word Level

In [0]:
# encode the text as integers
tokenizer = Tokenizer()
tokenizer.fit_on_texts([s])
encoded = tokenizer.texts_to_sequences([s])[0]

# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)

Vocabulary Size: 51


In [0]:
# create word -> word sequences
sequences = list()
for i in range(1, len(encoded)):
    sequence = encoded[i-1:i+1]
    sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))

Total Sequences: 76


In [0]:
# split into X and y elements
sequences = np.array(sequences)
X, y = sequences[:,0],sequences[:,1]

# one hot encode outputs
y = to_categorical(y, num_classes=vocab_size)

In [0]:
model = Sequential() # container for a linear stack of layers
model.add(Embedding(vocab_size, 10, input_length=1)) # map each word index to a 10-dimensional vector
model.add(LSTM(500)) # number of hidden units (memory cells)
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=500, verbose=2)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 1, 10)             510       
_________________________________________________________________
lstm_22 (LSTM)               (None, 500)               1022000   
_________________________________________________________________
dense_21 (Dense)             (None, 51)                25551     
Total params: 1,048,061
Trainable params: 1,048,061
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/500
 - 5s - loss: 3.9317 - acc: 0.0526
Epoch 2/500
 - 0s - loss: 3.9277 - acc: 0.1053
Epoch 3/500
 - 0s - loss: 3.9242 - acc: 0.1053
Epoch 4/500
 - 0s - loss: 3.9206 - acc: 0.1053
Epoch 5/500
 - 0s - loss: 3.9162 - acc: 0.1053
Epoch 6/500
 - 0s - loss: 3.9119 - acc: 0.1053
Epoch 7/500
 - 0s - loss: 3.9067 - acc: 0.1053
Epoch 8/500
 - 0s - loss: 3.9008 - acc: 0.1053
Epoch 9/500
 - 

 - 0s - loss: 0.6877 - acc: 0.6974
Epoch 156/500
 - 0s - loss: 0.6854 - acc: 0.6974
Epoch 157/500
 - 0s - loss: 0.6824 - acc: 0.6974
Epoch 158/500
 - 0s - loss: 0.6798 - acc: 0.6974
Epoch 159/500
 - 0s - loss: 0.6761 - acc: 0.6974
Epoch 160/500
 - 0s - loss: 0.6742 - acc: 0.6974
Epoch 161/500
 - 0s - loss: 0.6713 - acc: 0.6842
Epoch 162/500
 - 0s - loss: 0.6692 - acc: 0.6711
Epoch 163/500
 - 0s - loss: 0.6661 - acc: 0.6974
Epoch 164/500
 - 0s - loss: 0.6649 - acc: 0.6842
Epoch 165/500
 - 0s - loss: 0.6635 - acc: 0.6974
Epoch 166/500
 - 0s - loss: 0.6628 - acc: 0.6711
Epoch 167/500
 - 0s - loss: 0.6599 - acc: 0.6974
Epoch 168/500
 - 0s - loss: 0.6576 - acc: 0.6974
Epoch 169/500
 - 0s - loss: 0.6539 - acc: 0.6974
Epoch 170/500
 - 0s - loss: 0.6526 - acc: 0.7105
Epoch 171/500
 - 0s - loss: 0.6493 - acc: 0.7105
Epoch 172/500
 - 0s - loss: 0.6486 - acc: 0.6842
Epoch 173/500
 - 0s - loss: 0.6488 - acc: 0.6974
Epoch 174/500
 - 0s - loss: 0.6497 - acc: 0.6974
Epoch 175/500
 - 0s - loss: 0.6486

Epoch 323/500
 - 0s - loss: 0.5920 - acc: 0.7105
Epoch 324/500
 - 0s - loss: 0.5910 - acc: 0.6842
Epoch 325/500
 - 0s - loss: 0.5920 - acc: 0.6842
Epoch 326/500
 - 0s - loss: 0.5920 - acc: 0.6974
Epoch 327/500
 - 0s - loss: 0.5895 - acc: 0.6974
Epoch 328/500
 - 0s - loss: 0.5895 - acc: 0.6974
Epoch 329/500
 - 0s - loss: 0.5921 - acc: 0.6842
Epoch 330/500
 - 0s - loss: 0.5904 - acc: 0.6974
Epoch 331/500
 - 0s - loss: 0.5903 - acc: 0.6974
Epoch 332/500
 - 0s - loss: 0.5919 - acc: 0.6974
Epoch 333/500
 - 0s - loss: 0.5924 - acc: 0.6974
Epoch 334/500
 - 0s - loss: 0.5940 - acc: 0.6974
Epoch 335/500
 - 0s - loss: 0.5935 - acc: 0.6974
Epoch 336/500
 - 0s - loss: 0.5929 - acc: 0.6842
Epoch 337/500
 - 0s - loss: 0.5906 - acc: 0.6974
Epoch 338/500
 - 0s - loss: 0.5911 - acc: 0.6711
Epoch 339/500
 - 0s - loss: 0.5915 - acc: 0.6974
Epoch 340/500
 - 0s - loss: 0.5893 - acc: 0.6974
Epoch 341/500
 - 0s - loss: 0.5900 - acc: 0.6974
Epoch 342/500
 - 0s - loss: 0.5894 - acc: 0.6974
Epoch 343/500
 - 0s 

 - 0s - loss: 0.5843 - acc: 0.6974
Epoch 491/500
 - 0s - loss: 0.5838 - acc: 0.6974
Epoch 492/500
 - 0s - loss: 0.5854 - acc: 0.6974
Epoch 493/500
 - 0s - loss: 0.5851 - acc: 0.6842
Epoch 494/500
 - 0s - loss: 0.5852 - acc: 0.7105
Epoch 495/500
 - 0s - loss: 0.5846 - acc: 0.6974
Epoch 496/500
 - 0s - loss: 0.5837 - acc: 0.6974
Epoch 497/500
 - 0s - loss: 0.5838 - acc: 0.6974
Epoch 498/500
 - 0s - loss: 0.5846 - acc: 0.6974
Epoch 499/500
 - 0s - loss: 0.5843 - acc: 0.6974
Epoch 500/500
 - 0s - loss: 0.5835 - acc: 0.6974


<keras.callbacks.History at 0x247ef8cfb38>

In [0]:
# evaluate
in_text = 'Sing'
print('Current word: ', in_text)
encoded = tokenizer.texts_to_sequences([in_text])[0]
encoded = np.array(encoded)
yhat = model.predict_classes(encoded, verbose=0)
for word, index in tokenizer.word_index.items():
    if index == yhat:
        print('Next word: ', word)

Current word:  Sing
Next word:  a


In [0]:
# generate a sequence from the model
def generate_seq(model, tokenizer, seed_text, n_words):
    in_text, result = seed_text, seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        encoded = np.array(encoded)
        # predict a word in the vocabulary
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text, result = out_word, result + ' ' + out_word
    return result

In [0]:
print(generate_seq(model, tokenizer, 'Sing', 10))
print(generate_seq(model, tokenizer, 'king', 10))
print(generate_seq(model, tokenizer, 'Hanging', 10))

Sing a pocket full of rye four and honey the king
king the king the king the king the king the king
Hanging out his counting house counting house counting house counting house


### Training by word gives a worse model than the 500-unit, 500-epoch, 10-character model tested previously, using the same parameters. The accuracy of this model is lower at the same number of epochs than that of the 10-character model. Training for more epochs is unlikely to help, since the accuracy is already stuck at about 0.70 from around epoch 155 through epoch 500. This makes sense in theory: the more context the model is given, the better it can predict, and a single word is often ambiguous (in the rhyme, "the" is followed by several different words). Let us test this idea by training with two words as input instead of one, keeping a one-word output so that the two models can be compared.
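
The limit of a single word of context can be estimated without any neural network. The sketch below counts, for every word, how often its most frequent successor follows it; no one-word-in predictor can beat that score on this text. The `tokenize` helper is only a rough stand-in for the Keras `Tokenizer`, and `one_word_ceiling` is a hypothetical name. The lines are joined without separators on the assumption that this mirrors how `s` was built.

```python
import re
from collections import Counter, defaultdict

# The rhyme joined without separators, mirroring the notebook's `s` string,
# so "opened" + "The" and "blackbird" + "And" fuse into single tokens.
LINES = [
    "Sing a song of sixpence,", "A pocket full of rye.",
    "Four and twenty blackbirds,", "Baked in a pie.",
    "When the pie was opened", "The birds began to sing;",
    "Wasn’t that a dainty dish,", "To set before the king.",
    "The king was in his counting house,", "Counting out his money;",
    "The queen was in the parlour,", "Eating bread and honey.",
    "The maid was in the garden,", "Hanging out the clothes,",
    "When down came a blackbird", "And pecked off her nose.",
]

def tokenize(text):
    """Rough stand-in for the Keras Tokenizer: lowercase, drop , . ; and split."""
    return re.findall(r"[^\s,.;]+", text.lower())

def one_word_ceiling(tokens):
    """Return (correct, total) for the best possible one-word-in lookup table."""
    followers = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        followers[current][nxt] += 1
    # For each word, the best guess is its most frequent successor.
    correct = sum(max(counts.values()) for counts in followers.values())
    return correct, len(tokens) - 1

tokens = tokenize("".join(LINES))
correct, total = one_word_ceiling(tokens)
print(f"{correct}/{total} = {correct / total:.4f}")  # 53/76 = 0.6974
```

Under these assumptions the ceiling equals the 0.6974 accuracy at which the one-word model levels off, which suggests the network has already learned everything a single word of context allows.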

# 4.2 Two Words in, One word out

In [0]:
data="""Sing a song of sixpence,\n
        A pocket full of rye.\n
        Four and twenty blackbirds,\n
        Baked in a pie.\n
        When the pie was opened\n
        The birds began to sing;\n
        Wasn’t that a dainty dish,\n
        To set before the king.\n
        The king was in his counting house,\n
        Counting out his money;\n
        The queen was in the parlour,\n
        Eating bread and honey.\n
        The maid was in the garden,\n
        Hanging out the clothes,\n
        When down came a blackbird\n
        And pecked off her nose."""

In [0]:
# generate a sequence from a language model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pre-pad sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
        # predict probabilities for each word
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
    return in_text

In [0]:
# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]
# retrieve vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# encode 2 words -> 1 word
sequences = list()
for i in range(2, len(encoded)):
    sequence = encoded[i-2:i+1]
    sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))

# pad sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print('Max Sequence Length: %d' % max_length)
# split into input and output elements
sequences = array(sequences)
X, y = sequences[:,:-1],sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)

Vocabulary Size: 51
Total Sequences: 77
Max Sequence Length: 3


In [0]:
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=500, verbose=2)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 2, 10)             510       
_________________________________________________________________
lstm_23 (LSTM)               (None, 50)                12200     
_________________________________________________________________
dense_22 (Dense)             (None, 51)                2601      
Total params: 15,311
Trainable params: 15,311
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/500
 - 4s - loss: 3.9318 - acc: 0.0390
Epoch 2/500
 - 0s - loss: 3.9293 - acc: 0.1169
Epoch 3/500
 - 0s - loss: 3.9271 - acc: 0.1169
Epoch 4/500
 - 0s - loss: 3.9252 - acc: 0.1169
Epoch 5/500
 - 0s - loss: 3.9230 - acc: 0.1169
Epoch 6/500
 - 0s - loss: 3.9207 - acc: 0.1169
Epoch 7/500
 - 0s - loss: 3.9183 - acc: 0.1169
Epoch 8/500
 - 0s - loss: 3.9159 - acc: 0.1169
Epoch 9/500
 - 0s - l

Epoch 156/500
 - 0s - loss: 1.0326 - acc: 0.7792
Epoch 157/500
 - 0s - loss: 1.0166 - acc: 0.7922
Epoch 158/500
 - 0s - loss: 1.0000 - acc: 0.7922
Epoch 159/500
 - 0s - loss: 0.9839 - acc: 0.7922
Epoch 160/500
 - 0s - loss: 0.9693 - acc: 0.7922
Epoch 161/500
 - 0s - loss: 0.9533 - acc: 0.8052
Epoch 162/500
 - 0s - loss: 0.9390 - acc: 0.8052
Epoch 163/500
 - 0s - loss: 0.9244 - acc: 0.8052
Epoch 164/500
 - 0s - loss: 0.9098 - acc: 0.8052
Epoch 165/500
 - 0s - loss: 0.8955 - acc: 0.8182
Epoch 166/500
 - 0s - loss: 0.8824 - acc: 0.8182
Epoch 167/500
 - 0s - loss: 0.8688 - acc: 0.8312
Epoch 168/500
 - 0s - loss: 0.8556 - acc: 0.8312
Epoch 169/500
 - 0s - loss: 0.8429 - acc: 0.8571
Epoch 170/500
 - 0s - loss: 0.8293 - acc: 0.8701
Epoch 171/500
 - 0s - loss: 0.8179 - acc: 0.8701
Epoch 172/500
 - 0s - loss: 0.8038 - acc: 0.8701
Epoch 173/500
 - 0s - loss: 0.7919 - acc: 0.8701
Epoch 174/500
 - 0s - loss: 0.7798 - acc: 0.8831
Epoch 175/500
 - 0s - loss: 0.7684 - acc: 0.8831
Epoch 176/500
 - 0s 

Epoch 324/500
 - 0s - loss: 0.1586 - acc: 0.9610
Epoch 325/500
 - 0s - loss: 0.1577 - acc: 0.9610
Epoch 326/500
 - 0s - loss: 0.1566 - acc: 0.9610
Epoch 327/500
 - 0s - loss: 0.1554 - acc: 0.9610
Epoch 328/500
 - 0s - loss: 0.1546 - acc: 0.9610
Epoch 329/500
 - 0s - loss: 0.1536 - acc: 0.9610
Epoch 330/500
 - 0s - loss: 0.1530 - acc: 0.9610
Epoch 331/500
 - 0s - loss: 0.1521 - acc: 0.9610
Epoch 332/500
 - 0s - loss: 0.1518 - acc: 0.9610
Epoch 333/500
 - 0s - loss: 0.1510 - acc: 0.9610
Epoch 334/500
 - 0s - loss: 0.1504 - acc: 0.9610
Epoch 335/500
 - 0s - loss: 0.1495 - acc: 0.9610
Epoch 336/500
 - 0s - loss: 0.1484 - acc: 0.9610
Epoch 337/500
 - 0s - loss: 0.1474 - acc: 0.9610
Epoch 338/500
 - 0s - loss: 0.1469 - acc: 0.9610
Epoch 339/500
 - 0s - loss: 0.1459 - acc: 0.9610
Epoch 340/500
 - 0s - loss: 0.1457 - acc: 0.9610
Epoch 341/500
 - 0s - loss: 0.1446 - acc: 0.9610
Epoch 342/500
 - 0s - loss: 0.1441 - acc: 0.9610
Epoch 343/500
 - 0s - loss: 0.1432 - acc: 0.9610
Epoch 344/500
 - 0s 

Epoch 492/500
 - 0s - loss: 0.0916 - acc: 0.9610
Epoch 493/500
 - 0s - loss: 0.0912 - acc: 0.9610
Epoch 494/500
 - 0s - loss: 0.0913 - acc: 0.9481
Epoch 495/500
 - 0s - loss: 0.0911 - acc: 0.9610
Epoch 496/500
 - 0s - loss: 0.0913 - acc: 0.9610
Epoch 497/500
 - 0s - loss: 0.0914 - acc: 0.9610
Epoch 498/500
 - 0s - loss: 0.0907 - acc: 0.9610
Epoch 499/500
 - 0s - loss: 0.0910 - acc: 0.9481
Epoch 500/500
 - 0s - loss: 0.0901 - acc: 0.9610


<keras.callbacks.History at 0x24782f98128>

In [0]:
# evaluate model
print(generate_seq(model, tokenizer, max_length-1, 'Sing a', 10))
print(generate_seq(model, tokenizer, max_length-1, 'The king', 10))
print(generate_seq(model, tokenizer, max_length-1, 'Hanging out', 10))

Sing a song of sixpence a pocket full of rye four and
The king was in the parlour eating bread and honey the maid
Hanging out the clothes when down came a blackbird and pecked off


### The two-word-input, one-word-output model is a lot better than the one-word model. This makes sense, since each training example carries more context. Still, with the number of epochs fixed at 500, the 500-unit model trained on 10 characters does a lot better than the models trained per word and per two words; neither of them reaches as high an accuracy at the same number of epochs. Note that the two-word cell above uses LSTM(50) rather than 500 units, so the comparison is not exactly like for like.

# 4.3 Training Line by Line 

In [0]:
# generate a sequence from a language model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pre-pad sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
        # predict probabilities for each word
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
    return in_text

# prepare the tokenizer on the source text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# create line-based sequences
sequences = list()
for line in data.split('\n'):
    encoded = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(encoded)):
        sequence = encoded[:i+1]
        sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))
# pad input sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print('Max Sequence Length: %d' % max_length)
# split into input and output elements
sequences = array(sequences)
X, y = sequences[:,:-1],sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)

Vocabulary Size: 51
Total Sequences: 63
Max Sequence Length: 7
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_8 (Embedding)      (None, 6, 10)             510       
_________________________________________________________________
lstm_24 (LSTM)               (None, 50)                12200     
_________________________________________________________________
dense_23 (Dense)             (None, 51)                2601      
Total params: 15,311
Trainable params: 15,311
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/500
 - 4s - loss: 3.9315 - acc: 0.0159
Epoch 2/500
 - 0s - loss: 3.9291 - acc: 0.0317
Epoch 3/500
 - 0s - loss: 3.9270 - acc: 0.0476
Epoch 4/500
 - 0s - loss: 3.9249 - acc: 0.0635
Epoch 5/500
 - 0s - loss: 3.9226 - acc: 0.0476
Epoch 6/500
 - 0s - loss: 3.9204 - acc: 0.0952
Epoch 7/500
 - 0s - loss: 3.9179 - acc: 0.1270
Epoch

Epoch 155/500
 - 0s - loss: 1.9092 - acc: 0.5238
Epoch 156/500
 - 0s - loss: 1.8963 - acc: 0.5397
Epoch 157/500
 - 0s - loss: 1.8853 - acc: 0.5556
Epoch 158/500
 - 0s - loss: 1.8748 - acc: 0.5397
Epoch 159/500
 - 0s - loss: 1.8623 - acc: 0.5397
Epoch 160/500
 - 0s - loss: 1.8513 - acc: 0.5397
Epoch 161/500
 - 0s - loss: 1.8411 - acc: 0.5397
Epoch 162/500
 - 0s - loss: 1.8299 - acc: 0.5556
Epoch 163/500
 - 0s - loss: 1.8197 - acc: 0.5714
Epoch 164/500
 - 0s - loss: 1.8094 - acc: 0.5714
Epoch 165/500
 - 0s - loss: 1.7999 - acc: 0.5714
Epoch 166/500
 - 0s - loss: 1.7889 - acc: 0.6032
Epoch 167/500
 - 0s - loss: 1.7778 - acc: 0.6032
Epoch 168/500
 - 0s - loss: 1.7672 - acc: 0.6190
Epoch 169/500
 - 0s - loss: 1.7567 - acc: 0.5873
Epoch 170/500
 - 0s - loss: 1.7465 - acc: 0.5873
Epoch 171/500
 - 0s - loss: 1.7372 - acc: 0.6349
Epoch 172/500
 - 0s - loss: 1.7296 - acc: 0.6349
Epoch 173/500
 - 0s - loss: 1.7170 - acc: 0.6190
Epoch 174/500
 - 0s - loss: 1.7078 - acc: 0.6349
Epoch 175/500
 - 0s 

 - 0s - loss: 0.7683 - acc: 0.8571
Epoch 323/500
 - 0s - loss: 0.7638 - acc: 0.8571
Epoch 324/500
 - 0s - loss: 0.7594 - acc: 0.8571
Epoch 325/500
 - 0s - loss: 0.7553 - acc: 0.8571
Epoch 326/500
 - 0s - loss: 0.7526 - acc: 0.8571
Epoch 327/500
 - 0s - loss: 0.7483 - acc: 0.8571
Epoch 328/500
 - 0s - loss: 0.7440 - acc: 0.8571
Epoch 329/500
 - 0s - loss: 0.7400 - acc: 0.8571
Epoch 330/500
 - 0s - loss: 0.7364 - acc: 0.8571
Epoch 331/500
 - 0s - loss: 0.7330 - acc: 0.8571
Epoch 332/500
 - 0s - loss: 0.7296 - acc: 0.8571
Epoch 333/500
 - 0s - loss: 0.7257 - acc: 0.8571
Epoch 334/500
 - 0s - loss: 0.7232 - acc: 0.8571
Epoch 335/500
 - 0s - loss: 0.7182 - acc: 0.8571
Epoch 336/500
 - 0s - loss: 0.7140 - acc: 0.8571
Epoch 337/500
 - 0s - loss: 0.7112 - acc: 0.8571
Epoch 338/500
 - 0s - loss: 0.7067 - acc: 0.8571
Epoch 339/500
 - 0s - loss: 0.7041 - acc: 0.8571
Epoch 340/500
 - 0s - loss: 0.6994 - acc: 0.8730
Epoch 341/500
 - 0s - loss: 0.6962 - acc: 0.8571
Epoch 342/500
 - 0s - loss: 0.6927

Epoch 490/500
 - 0s - loss: 0.3427 - acc: 0.9365
Epoch 491/500
 - 0s - loss: 0.3410 - acc: 0.9365
Epoch 492/500
 - 0s - loss: 0.3399 - acc: 0.9365
Epoch 493/500
 - 0s - loss: 0.3385 - acc: 0.9365
Epoch 494/500
 - 0s - loss: 0.3370 - acc: 0.9365
Epoch 495/500
 - 0s - loss: 0.3359 - acc: 0.9365
Epoch 496/500
 - 0s - loss: 0.3349 - acc: 0.9365
Epoch 497/500
 - 0s - loss: 0.3336 - acc: 0.9365
Epoch 498/500
 - 0s - loss: 0.3313 - acc: 0.9365
Epoch 499/500
 - 0s - loss: 0.3302 - acc: 0.9365
Epoch 500/500
 - 0s - loss: 0.3293 - acc: 0.9365


<keras.callbacks.History at 0x247f88169e8>

In [0]:
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=500, verbose=2)

In [0]:
# evaluate model
print(generate_seq(model, tokenizer, max_length-1, 'Sing', 4))
print(generate_seq(model, tokenizer, max_length-1, 'A pocket', 4))
print(generate_seq(model, tokenizer, max_length-1, 'Four', 4))

Sing a song of sixpence
A pocket full of rye house
Four and twenty blackbirds nose
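
The line-by-line preparation in 4.3 can be illustrated on a single line with plain Python lists. This is a minimal sketch: `prefix_sequences` and `pre_pad` are hypothetical helpers that imitate the loop and the `pad_sequences(..., padding='pre')` call above, and the word indices are made up.

```python
def prefix_sequences(encoded_line):
    """All prefixes of length >= 2: every word after the first is a target once."""
    return [encoded_line[:i + 1] for i in range(1, len(encoded_line))]

def pre_pad(sequence, max_length, pad_value=0):
    """Left-pad with `pad_value`, like pad_sequences(..., padding='pre')."""
    return [pad_value] * (max_length - len(sequence)) + sequence

encoded_line = [11, 3, 27, 8]  # hypothetical word indices for a four-word line
sequences = prefix_sequences(encoded_line)
max_length = max(len(seq) for seq in sequences)
padded = [pre_pad(seq, max_length) for seq in sequences]

# Split each padded row into the input words and the word to predict.
X = [row[:-1] for row in padded]
y = [row[-1] for row in padded]
for inputs, target in zip(X, y):
    print(inputs, "->", target)
# [0, 0, 11] -> 3
# [0, 11, 3] -> 27
# [11, 3, 27] -> 8
```

No training example ever crosses a line boundary, and there is no end-of-line token. That is one plausible reason why asking for four more words after 'A pocket' or 'Four', which leaves only three words in the line, produces an unrelated extra word ("house", "nose") in the output above.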


# Creating a Train and Test set to visualize validation error per epoch

In order to visualize the validation error per epoch, we simply need to add the validation_split parameter to the model.fit() call. For the sake of demonstration, I only run the model for 10 epochs, since printing this notebook will be long enough already. The point of this cell is to show that the validation loss and validation accuracy are indeed reported per epoch:

In [0]:
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X_train, y_train, epochs=10, verbose=2, validation_split=0.2)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_13 (Embedding)     (None, 6, 10)             510       
_________________________________________________________________
lstm_31 (LSTM)               (None, 50)                12200     
_________________________________________________________________
dense_30 (Dense)             (None, 51)                2601      
Total params: 15,311
Trainable params: 15,311
Non-trainable params: 0
_________________________________________________________________
None
Train on 33 samples, validate on 9 samples
Epoch 1/10
 - 6s - loss: 3.9323 - acc: 0.0303 - val_loss: 3.9325 - val_acc: 0.1111
Epoch 2/10
 - 0s - loss: 3.9290 - acc: 0.0303 - val_loss: 3.9324 - val_acc: 0.1111
Epoch 3/10
 - 0s - loss: 3.9266 - acc: 0.0606 - val_loss: 3.9318 - val_acc: 0.1111
Epoch 4/10
 - 0s - loss: 3.9244 - acc: 0.0909 - val_loss: 3.9318 - val_acc: 0.1111
Epoch 5/10
 - 0s - los

<keras.callbacks.History at 0x248140326d8>
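
The log line "Train on 33 samples, validate on 9 samples" follows from how `validation_split` is documented to work: Keras holds out the last fraction of the arrays passed to `fit()`, before any shuffling. The sketch below imitates that slicing in plain Python. `holdout_split` is a hypothetical helper, and the 42 samples are inferred from 33 + 9 in the log rather than read from `X_train`.

```python
def holdout_split(samples, validation_split):
    """Imitate the documented behaviour of `validation_split`: the last
    fraction of the samples, taken before shuffling, is held out."""
    split_at = int(len(samples) * (1.0 - validation_split))
    return samples[:split_at], samples[split_at:]

samples = list(range(42))  # stands in for the 42 rows passed to fit() above
train, val = holdout_split(samples, 0.2)
print(len(train), len(val))  # 33 9
print(val)                   # the held-out rows are simply the tail
```

Because the held-out block is the tail of the array, the validation rows here come from whichever sequences happen to be last. Shuffling the rows beforehand, or passing an explicit `validation_data=(X_val, y_val)` to `fit()`, gives more control over what is validated.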

# SUMMARY: METHODS AND RESULTS 

# Deliverables to receive credit RESULTS:
The following deliverables will receive increasing amount of bonus credit:

1. Optimize the following cells above to get the model to work reasonably well on the above generated sentences. Again, this is a toy problem as language models require a lot of computation... so this toy problem is great for rapid experimentation to explore different aspects of deep learning language models.

The model was indeed optimized for better accuracy by increasing the number of epochs and the number of units. Increasing the number of units by itself increased the accuracy of the model, and adding more epochs improved it further. The best of all the models trained in this notebook was the LSTM with 500 units and 500 epochs, trained on 10-character sequences.

2. Write a function to split the text corpus file into training and validation and pipe the validation data into the model.fit() function to be able to track validation error per epoch. Lookup Keras documentation to see how this is handled.

This was done by adding the validation_split parameter to the model.fit() function. The output of that cell shows the validation loss and the validation accuracy per epoch.

3. Write a summary (methods and results) in the below cells of the different things you applied. You must include your intuitions behind what did work and what did not work well

In summary, various models were trained:

- Trained on 10-character sequences
- Trained per word
- Trained per two words
- Trained by line

In general, the 10-character model was found to be better when we fixed the number of epochs. The line-by-line model could probably perform better given the right number of memory units and epochs, but for the sake of comparison we fixed the number of epochs and units so that the different training methodologies could be compared easily. The models were generally found to have trouble predicting the next word after a word that is repeated throughout the text, even when they had correctly predicted the preceding context. When the problem was fixed in one place, new problems arose in other parts of the prediction.

4. Do something even more interesting. Try a different source text. Train a word-level model. We'll leave it up to your creativity to explore and write a summary of your methods and results. 

As previously stated, we successfully trained at the word level, the two-word level, and the line-by-line level.


# To Do RESULTS:
- Try different numbers of memory cells (DONE)
We tried different numbers of cells, including 500 and 1000. Training time increased considerably with a higher number of units. The model with more units performs better overall, but in some parts of the text the model with fewer units performed better. Therefore, training with more units does not necessarily mean that the model will perform better in every part of the text. Adding more layers might help with this problem, since what the smaller model gets right could be preserved.

- Try different lengths of training epochs 
Adding more epochs led to a better model overall, since the model goes over the data more times. This held regardless of the training strategy. In almost all cases here, increasing the number of epochs resulted in a better fit to the training text.

- Try different sequence lengths and pre-processing of data
As mentioned previously, we tried training on 10-character sequences, word by word, two words in and one word out, and line by line.

- Try regularization techniques such as Dropout.
Different levels of dropout were tested. Introducing dropout led to a worse model in this case. According to MLmastery: "Dropout is a regularization method where input and recurrent connections to LSTM units are probabilistically excluded from activation and weight updates while training a network. This has the effect of reducing overfitting and improving model performance." Increasing the dropout therefore leads to a less overfit model, but in this case also to a worse performer, because the goal here is to memorize the poem. In a sense, we want to overfit the model so that it reproduces the entire poem, so a dropout of 0 works better. The problem with high dropout is clear from the results for dropout=0.9.
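
The growth in training time with more units, noted in the first To Do item, is consistent with how the size of an LSTM layer scales. A standard LSTM layer with biases has four gates, each with an input kernel, a recurrent kernel and a bias vector, so its parameter count grows roughly with the square of the number of units. The sketch below checks this formula against the `model.summary()` outputs printed earlier; `lstm_param_count` is a hypothetical helper, not a Keras function.

```python
def lstm_param_count(units, input_dim):
    """Parameters of a standard LSTM layer with biases:
    4 gates x (input kernel + recurrent kernel + bias)."""
    return 4 * (units * (input_dim + units) + units)

# (units, input_dim) pairs taken from the summaries above:
# 38 one-hot characters, or a 10-dimensional word embedding.
for units, input_dim in [(500, 38), (500, 10), (50, 10)]:
    print(units, input_dim, lstm_param_count(units, input_dim))
# 500 38 1078000
# 500 10 1022000
# 50 10 12200

# Doubling the units on the character model roughly quadruples the layer.
print(lstm_param_count(1000, 38))  # 4156000
```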