## How to Use Word Embedding Layers for Deep Learning with Keras

Bengio describes his network as having one hidden layer beyond the word features mapping, plus direct connections from the word features to the output. Counting the mapping itself, that gives two hidden layers: the shared word features layer _C_ and the hyperbolic tangent hidden layer.
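
To make that description concrete, here is a minimal NumPy sketch of one forward pass through a network of that shape. The layer sizes, the random weights, and the function name `nplm_forward` are all made up for illustration; this is not the model trained in this notebook.

```python
import numpy as np

def nplm_forward(context_ids, C, H, d, U, W, b):
    # Look up the feature vector of each context word in C and concatenate them.
    x = C[context_ids].reshape(-1)
    # Hyperbolic tangent hidden layer.
    hidden = np.tanh(d + H @ x)
    # Output scores: direct connections (W) plus the hidden-layer path (U).
    scores = b + W @ x + U @ hidden
    # Softmax over the vocabulary (shifted by the max for numerical stability).
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()

# Hypothetical sizes: vocabulary, feature size, context length, hidden units.
V, m, n_context, h = 10, 4, 3, 5
rng = np.random.default_rng(0)
C = rng.normal(size=(V, m))              # shared word features
H = rng.normal(size=(h, n_context * m))  # features -> hidden
d = np.zeros(h)
U = rng.normal(size=(V, h))              # hidden -> output
W = rng.normal(size=(V, n_context * m))  # direct features -> output
b = np.zeros(V)

probs = nplm_forward([1, 5, 7], C, H, d, U, W, b)
print(probs.shape, probs.sum())
```

The result is one probability per vocabulary entry, and the probabilities sum to 1.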

In [13]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential 
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

In this preprocessing step, I load the text and convert it to lowercase, which reduces the vocabulary that the network must learn.

In [7]:
# load ascii text and convert it to lowercase
filename = "data/brown-train-alt.txt"
# use a context manager so the file is closed after reading
with open(filename) as f:
    raw_text = f.read()
raw_text = raw_text.lower()

Now I need to prepare the data for modeling in the neural network - I can't model the characters directly, so I need to convert them to integers.

I can do this by first creating a set of all the distinct characters in the text and then mapping each character to a unique integer.

In [8]:
# create mapping of unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
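
To see what this mapping looks like, here is the same idea on a toy string. The `toy_` names and the reverse mapping are additions for illustration only; the notebook itself only builds `char_to_int`.

```python
toy_text = "hello"

# Distinct characters, sorted so the integer assignment is deterministic.
toy_chars = sorted(set(toy_text))
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}
# Reverse mapping, handy for turning predictions back into characters.
toy_int_to_char = {i: c for c, i in toy_char_to_int.items()}

encoded = [toy_char_to_int[c] for c in toy_text]
decoded = "".join(toy_int_to_char[i] for i in encoded)

print(toy_chars)  # ['e', 'h', 'l', 'o']
print(encoded)    # [1, 0, 2, 2, 3]
print(decoded)    # hello
```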

In [10]:
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

Total Characters:  578691
Total Vocab:  51


In [11]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

Total Patterns:  578591
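
The pattern count is simply `n_chars - seq_length` (578691 - 100 = 578591), because the window slides one character at a time. A toy version of the same loop, with a made-up string and a window of 3, shows the input/output pairs it produces:

```python
toy_text = "abcdef"
toy_seq_length = 3

# Each pair is (window of toy_seq_length characters, the character that follows it).
pairs = [
    (toy_text[i:i + toy_seq_length], toy_text[i + toy_seq_length])
    for i in range(len(toy_text) - toy_seq_length)
]

print(pairs)       # [('abc', 'd'), ('bcd', 'e'), ('cde', 'f')]
print(len(pairs))  # 3
```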


Now that I've prepped my training data, I need to transform it so it's suitable for use with Keras.

First, I need to transform the list of input sequences into the form [samples, time steps, features] expected by an LSTM recurrent neural network. 

Next, I'll rescale the integers to the range 0 to 1, which makes the patterns easier for the LSTM network to learn. In Keras, the LSTM layer by default uses a sigmoid for its gates and tanh for its main activation. __Note:__ Bengio's model uses a tanh activation function, and I'm not sure whether it needs a similar prep stage.

Finally, I'll convert the output patterns into a one-hot encoding, so the network can be configured to predict the probability of each of the different characters in the vocabulary (51, according to the output above).

In [15]:
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, 1))
# normalize
X = X / float(n_vocab)
# one hot encode the output variable
y = np_utils.to_categorical(dataY)
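
The three transformations are easier to follow on a tiny example. This sketch uses made-up data and plain NumPy; `np.eye` stands in for `to_categorical` here so the snippet runs without Keras, and the `toy_` names are hypothetical.

```python
import numpy as np

toy_X = [[0, 1, 2], [1, 2, 3]]  # two integer-encoded sequences of length 3
toy_y = [3, 0]                  # the integer-encoded next character for each
toy_vocab = 4

# [samples, time steps, features], with one feature per time step.
X_toy = np.reshape(toy_X, (len(toy_X), 3, 1))
# Rescale the integer codes into the range [0, 1).
X_toy = X_toy / float(toy_vocab)
# One-hot encode: row k of the identity matrix is the encoding of class k.
y_toy = np.eye(toy_vocab)[toy_y]

print(X_toy.shape)  # (2, 3, 1)
print(y_toy)
```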

Here, I'll define my model:

The first layer is an LSTM with 256 memory units, standing in here for the shared word features layer; its internal activation is tanh. A Dropout layer with a rate of 0.2 sits between it and a Dense output layer, which uses the softmax activation function to produce a probability between 0 and 1 for each of the 51 characters.
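
As a rough sanity check on the model's size, the number of trainable parameters can be worked out by hand. This sketch assumes the standard LSTM formulation (four gates, each with input weights, recurrent weights, and one bias vector), which is my understanding of the Keras default; the variable names are only for illustration.

```python
units = 256      # LSTM memory units
input_dim = 1    # one feature per time step
n_classes = 51   # vocabulary size

# Four gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (units * (input_dim + units) + units)
# Dense layer: one weight per (unit, class) pair plus one bias per class.
dense_params = units * n_classes + n_classes
# Dropout has no trainable parameters.
total_params = lstm_params + dense_params

print(lstm_params)   # 264192
print(dense_params)  # 13107
print(total_params)  # 277299
```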

In [16]:
# my model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')



In [17]:
# define the checkpoint
filepath = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
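
The placeholders in the file name are filled in with the epoch number and the logged loss each time a checkpoint is written. The sketch below imitates that with plain `str.format`; the epoch and loss values are invented for illustration.

```python
template = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"

# Made-up values: the third epoch finishing with a loss of 2.5.
example_name = template.format(epoch=3, loss=2.5)

print(example_name)  # weights-improvement-03-2.5000.hdf5
```

Because `save_best_only=True` and `mode='min'` are set, a file is only written when the monitored loss improves on the best value seen so far.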

In [18]:
model.fit(X, y, epochs=20, batch_size=128, callbacks=callbacks_list)

Epoch 1/20

KeyboardInterrupt: 
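
For a sense of how much work each of the 20 epochs involves, the number of batches per epoch follows directly from the pattern count and the batch size used above:

```python
import math

n_patterns = 578591  # from the "Total Patterns" output above
batch_size = 128
epochs = 20

# The last batch of each epoch is smaller than the others, hence the ceiling.
steps_per_epoch = math.ceil(n_patterns / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)  # 4521
print(total_steps)      # 90420
```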