# Bengio's Model Implementation

In this model, I need to achieve three things:
- associate each word in the given vocabulary with a distributed word feature vector.
- express the joint probability function of word sequences in terms of the feature vectors of the words in each sequence (i.e. generate sequences like the LSTM model).
- learn the word feature vectors and the parameters of the probability function through a multi-layer perceptron.
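
To make the three goals concrete, here is a minimal pure-Python sketch of the forward pass of such a model. It is an illustration under stated assumptions, not the implementation used in this notebook: the names `C`, `H`, `U`, the helper functions, and the toy sizes are all chosen for the example, bias terms and direct input-to-output connections are left out, and the weights are random rather than learned.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def init_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a real model would learn these by training."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]


def next_word_probs(context_ids, C, H, U):
    """P(next word | context) for every word in the vocabulary."""
    # Goal 1: look up each context word's feature vector and concatenate.
    x = [value for word_id in context_ids for value in C[word_id]]
    # Goal 3: a single tanh hidden layer (the multi-layer perceptron).
    hidden = [math.tanh(a) for a in matvec(H, x)]
    # Softmax over the vocabulary gives the conditional distribution.
    return softmax(matvec(U, hidden))


def sequence_log_prob(word_ids, context_size, C, H, U):
    """Goal 2: log joint probability as a sum of conditional log-probs.

    Assumption: words before position `context_size` are not scored,
    because this sketch does not pad the start of the sequence.
    """
    total = 0.0
    for t in range(context_size, len(word_ids)):
        probs = next_word_probs(word_ids[t - context_size:t], C, H, U)
        total += math.log(probs[word_ids[t]])
    return total


# Toy sizes, chosen only for illustration.
rng = random.Random(0)
vocab_size, feature_dim, context_size, hidden_units = 6, 4, 2, 5
C = init_matrix(vocab_size, feature_dim, rng)                  # word features
H = init_matrix(hidden_units, context_size * feature_dim, rng)  # input -> hidden
U = init_matrix(vocab_size, hidden_units, rng)                 # hidden -> output

probs = next_word_probs([1, 3], C, H, U)
log_p = sequence_log_prob([1, 3, 0, 2, 5], context_size, C, H, U)
```

In a Keras version, `C` would roughly correspond to an `Embedding` layer and `H` and `U` to two `Dense` layers, with all three trained together.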

I also need to learn how to set up a model to train on TPUs or GPUs so that training time is reduced, and to work out a method for calculating perplexity.
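
For the perplexity part, one common definition is the exponential of the average negative log-probability the model assigns to each actual next token. The sketch below assumes those per-token probabilities have already been collected into a list; the function name `perplexity` and its input format are choices made for this example.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability per token).

    `token_probs` holds the probability the model assigned to each
    token that actually occurred (assumed input format).
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# A model that guesses uniformly among 4 words has perplexity 4
# (up to floating-point rounding).
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

If the model is trained with a categorical cross-entropy loss measured in nats, the same quantity can be obtained as `exp(loss)` averaged over tokens.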

__An overview of the TPU conversion workflow__:
1. Build a Keras model for training with the functional API and a static input `batch_size`.
2. Convert the Keras model to a TPU model.
3. Train the TPU model with a static batch size of `batch_size * 8` and save the weights to a file.
4. Build a Keras model for inference with the same structure but a variable input batch size.
5. Load the saved weights into the inference model.
6. Predict with the inference model.
7. To run the TPU steps in Colab, upload the project directory to Drive, mount it, and activate the TPU runtime.

In [1]:
import tensorflow as tf
# Import from the public tf.keras namespace rather than the private tensorflow.python path.
from tensorflow.keras.layers import Input, LSTM, Bidirectional, Dense, Embedding

## Static Input Batch Size

Input pipelines running on CPU and GPU are mostly free from static shape requirements, whereas a TPU environment requires static shapes and batch sizes.

The TPU is not fully utilized unless all 8 TPU cores are used. To get the full speed-up in training, I can choose a larger batch size than I would use to train the same model on a single GPU. A batch size of 1024 (i.e. 128 per core) is generally a good starting point.
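
The relationship between the global batch size and the per-core batch size is plain division across the 8 cores. The helpers below are a small sketch of that arithmetic; the function names are invented for this example and are not part of any TPU API.

```python
def per_core_batch_size(global_batch_size, num_cores=8):
    """Split a global batch evenly across TPU cores."""
    if global_batch_size % num_cores != 0:
        raise ValueError("global batch size must be divisible by num_cores")
    return global_batch_size // num_cores


def global_batch_size(per_core, num_cores=8):
    """Total number of examples consumed per training step."""
    return per_core * num_cores


per_core = per_core_batch_size(1024)  # 128 examples on each of the 8 cores
total = global_batch_size(128)        # 1024 examples per step
```

Checking divisibility up front avoids ending up with a batch that cannot be sharded evenly across the cores.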

__In Keras__, to define a static batch size, I use the functional API and specify the `batch_size` parameter of the input layer. The model is built inside a function that takes a `batch_size` parameter, so I can come back later and build another model for inference on CPU or GPU that accepts variable batch sizes.

In [2]:
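# Placeholder values: the notebook never defines these two names (hence the
# NameError when the cell was first run). The numbers below are assumptions;
# replace them with the real dataset settings.
maxlen = 500          # assumed number of tokens per input sequence
max_features = 10000  # assumed vocabulary size


def make_model(batch_size=None):
    # batch_size=None leaves the batch dimension variable (CPU/GPU inference);
    # an integer fixes it, which TPU training requires.
    source = Input(shape=(maxlen,), batch_size=batch_size,
                   dtype=tf.int32, name='Input')
    embedding = Embedding(input_dim=max_features,
                          output_dim=128, name='Embedding')(source)
    lstm = LSTM(32, name='LSTM')(embedding)
    predicted_var = Dense(1, activation='sigmoid', name='Output')(lstm)
    model = tf.keras.Model(inputs=[source], outputs=[predicted_var])
    model.compile(
        # TF 1.x optimizer API, kept to match the TPU conversion workflow above.
        optimizer=tf.train.RMSPropOptimizer(learning_rate=0.01),
        loss='binary_crossentropy',
        metrics=['acc'])
    return model


# Static batch size for TPU training (128 per core).
training_model = make_model(batch_size=128)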