In this notebook I create a Recurrent Neural Network model based on the Long Short-Term Memory unit to train and benchmark on the Penn Treebank dataset. 

<b>Language Modelling</b> is a very relevant task that is the cornerstone of many different linguistic problems such as <b>Speech Recognition, Machine Translation and Image Captioning</b>. For this, I will be using the Penn Treebank dataset, which is an often-used dataset for benchmarking Language Modelling models.


***


I need <b><code>numpy</code></b> and <b><code>tensorflow</code></b>. The recurrent layers come from <b><code>tf.keras.layers</code></b>, and the input data is read with <b><code>reader</code></b>, a helper module in the style of <b><code>tensorflow.models.rnn.ptb.reader</code></b> that is downloaded as a standalone <code>reader.py</code> file in the cells below.



In [None]:
!pip install tensorflow==2.2.0rc0
!pip install numpy


Collecting tensorflow==2.2.0rc0
  Downloading tensorflow-2.2.0rc0-cp37-cp37m-manylinux2010_x86_64.whl (515.9 MB)
Collecting h5py<2.11.0,>=2.10.0
  Downloading h5py-2.10.0-cp37-cp37m-manylinux1_x86_64.whl (2.9 MB)
Collecting gast==0.3.3
  Downloading gast-0.3.3-py2.py3-none-any.whl (9.7 kB)
Collecting tensorboard<2.2.0,>=2.1.0
  Downloading tensorboard-2.1.1-py3-none-any.whl (3.8 MB)
Collecting tensorflow-estimator<2.2.0,>=2.1.0
  Downloading tensorflow_estimator-2.1.0-py2.py3-none-any.whl (448 kB)
Installing collected packages: tensorflow-estimator, tensorboard, h5py, gast, tensorflow
  Attempting uninstall: tensorflow-estimator
    Found existing installation: tensorflow-estimator 2.7.0
    Uninstalling tensorflow-estimator-2.7.0:
      Successf

In [None]:
import time
import numpy as np
import tensorflow as tf
if tf.__version__ != '2.2.0-rc0':
    print(tf.__version__)
    raise ValueError('please upgrade to TensorFlow 2.2.0-rc0, or restart your Kernel (Kernel->Restart & Clear Output)')

In [None]:
!mkdir -p data/ptb
!wget -q -O data/ptb/reader.py https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DL0120EN-SkillsNetwork/labs/Week3/data/ptb/reader.py
!cp data/ptb/reader.py . 



In [None]:
import reader

<h2>Building the LSTM model for Language Modeling</h2>

I start building the model using TensorFlow. The very first thing to do is download and extract the <code>simple-examples</code> dataset, which can be done by executing the code cell below.


In [None]:
!wget http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz 
!tar xzf simple-examples.tgz -C data/

--2022-01-16 14:31:07--  http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
Resolving www.fit.vutbr.cz (www.fit.vutbr.cz)... 147.229.9.23, 2001:67c:1220:809::93e5:917
Connecting to www.fit.vutbr.cz (www.fit.vutbr.cz)|147.229.9.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 34869662 (33M) [application/x-gtar]
Saving to: ‘simple-examples.tgz’


2022-01-16 14:31:12 (7.70 MB/s) - ‘simple-examples.tgz’ saved [34869662/34869662]



In [None]:
#Initial weight scale
init_scale = 0.1
#Initial learning rate
learning_rate = 1.0
#Maximum permissible norm for the gradient (For gradient clipping -- another measure against Exploding Gradients)
max_grad_norm = 5
#The number of layers in our model
num_layers = 2
#The total number of recurrence steps, also known as the number of layers when our RNN is "unfolded"
num_steps = 20
#The number of processing units (neurons) in the hidden layers
hidden_size_l1 = 256
hidden_size_l2 = 128
#The maximum number of epochs trained with the initial learning rate
max_epoch_decay_lr = 4
#The total number of epochs in training
max_epoch = 15
#The probability for keeping data in the Dropout Layer (This is an optimization, but is outside our scope for this notebook!)
#At 1, we ignore the Dropout Layer wrapping.
keep_prob = 1
#The decay for the learning rate
decay = 0.5
#The size for each batch of data
batch_size = 30
#The size of our vocabulary
vocab_size = 10000
#The size of each word embedding vector
embeding_vector_size = 200
#Training flag to separate training from testing
is_training = 1
#Data directory for our dataset
data_dir = "data/simple-examples/data/"
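The way <code>learning_rate</code>, <code>decay</code> and <code>max_epoch_decay_lr</code> interact is easier to see when tabulated. The sketch below reuses the formula from the training loop later in this notebook; the helper name <code>lr_for_epoch</code> is only an illustration and not part of the notebook.

```python
# Values copied from the hyperparameter cell above.
learning_rate = 1.0
decay = 0.5
max_epoch_decay_lr = 4
max_epoch = 15

def lr_for_epoch(i):
    """Learning rate for the zero-based epoch index i (hypothetical helper)."""
    return learning_rate * decay ** max(i - max_epoch_decay_lr, 0)

schedule = [lr_for_epoch(i) for i in range(max_epoch)]
for epoch, lr_value in enumerate(schedule, start=1):
    print("epoch %2d: lr = %.6f" % (epoch, lr_value))
```

With these values the exponent stays at zero for epoch indices 0 to 4, so the first five epochs run at the initial rate and the rate is halved on every epoch after that.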

In [None]:
# Reads the data and separates it into training data, validation data and testing data
raw_data = reader.ptb_raw_data(data_dir)
train_data, valid_data, test_data, vocab, word_to_id = raw_data

In [None]:
len(train_data)

929589

In [None]:
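def id_to_word(id_list):
    # Build the inverse vocabulary once instead of scanning word_to_id for every id
    inverse_vocab = {wid: word for word, wid in word_to_id.items()}
    # Ids that are not in the vocabulary are skipped
    return [inverse_vocab[w] for w in id_list if w in inverse_vocab]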
                

print(id_to_word(train_data[0:100]))

['aer', 'banknote', 'berlitz', 'calloway', 'centrust', 'cluett', 'fromstein', 'gitano', 'guterman', 'hydro-quebec', 'ipo', 'kia', 'memotec', 'mlx', 'nahb', 'punts', 'rake', 'regatta', 'rubens', 'sim', 'snack-food', 'ssangyong', 'swapo', 'wachter', '<eos>', 'pierre', '<unk>', 'N', 'years', 'old', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'nov.', 'N', '<eos>', 'mr.', '<unk>', 'is', 'chairman', 'of', '<unk>', 'n.v.', 'the', 'dutch', 'publishing', 'group', '<eos>', 'rudolph', '<unk>', 'N', 'years', 'old', 'and', 'former', 'chairman', 'of', 'consolidated', 'gold', 'fields', 'plc', 'was', 'named', 'a', 'nonexecutive', 'director', 'of', 'this', 'british', 'industrial', 'conglomerate', '<eos>', 'a', 'form', 'of', 'asbestos', 'once', 'used', 'to', 'make', 'kent', 'cigarette', 'filters', 'has', 'caused', 'a', 'high', 'percentage', 'of', 'cancer', 'deaths', 'among', 'a', 'group', 'of']
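To make the word/id mapping concrete, here is a minimal sketch of how such a vocabulary could be built from a token stream. It assumes ids are assigned by descending word frequency; whether <code>reader.py</code> does exactly this is an assumption, and <code>build_vocab</code> and the toy sentence are hypothetical.

```python
from collections import Counter

def build_vocab(words):
    """Map each distinct word to an integer id, most frequent word first."""
    counts = Counter(words)
    # Sort by descending count, then alphabetically so that ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: idx for idx, (word, _) in enumerate(ordered)}

toy_words = "the cat sat on the mat <eos> the dog sat <eos>".split()
toy_word_to_id = build_vocab(toy_words)
toy_id_to_word = {idx: word for word, idx in toy_word_to_id.items()}

# Encode the stream as ids and decode it back again.
toy_ids = [toy_word_to_id[w] for w in toy_words]
decoded = [toy_id_to_word[i] for i in toy_ids]
print(toy_ids)
print(decoded)
```

Decoding is just a lookup in the inverted dictionary, which is why a round trip gives back the original tokens.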


Let's just read one mini-batch now and feed it to our network:


In [None]:
itera = reader.ptb_iterator(train_data, batch_size, num_steps)
# Take the first (inputs, targets) pair from the iterator
first_tuple = next(itera)
_input_data = first_tuple[0]
_targets = first_tuple[1]

In [None]:
_input_data.shape

(30, 20)

In [None]:
_targets.shape

(30, 20)
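The inputs and targets have the same shape because each target row is the input row shifted by one token. The sketch below shows one way such batches could be cut from a token stream; it is an assumption about how <code>reader.ptb_iterator</code> behaves, not a copy of it, and <code>toy_ptb_iterator</code> is a hypothetical name.

```python
def toy_ptb_iterator(raw_data, batch_size, num_steps):
    """Yield (inputs, targets) pairs; targets are the inputs shifted by one token."""
    batch_len = len(raw_data) // batch_size
    # Split the token stream into batch_size contiguous rows.
    rows = [raw_data[r * batch_len:(r + 1) * batch_len] for r in range(batch_size)]
    epoch_size = (batch_len - 1) // num_steps
    for i in range(epoch_size):
        start = i * num_steps
        x = [row[start:start + num_steps] for row in rows]
        y = [row[start + 1:start + num_steps + 1] for row in rows]
        yield x, y

toy_stream = list(range(20))
toy_batches = list(toy_ptb_iterator(toy_stream, batch_size=2, num_steps=3))
first_x, first_y = toy_batches[0]
print(first_x)
print(first_y)
```

Each row of a batch continues where the same row of the previous batch stopped, which is what makes a stateful RNN meaningful here.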

Let's look at 3 sentences of our input x:


In [None]:
_input_data[0:3]

array([[9970, 9971, 9972, 9974, 9975, 9976, 9980, 9981, 9982, 9983, 9984,
        9986, 9987, 9988, 9989, 9991, 9992, 9993, 9994, 9995],
       [2654,    6,  334, 2886,    4,    1,  233,  711,  834,   11,  130,
         123,    7,  514,    2,   63,   10,  514,    8,  605],
       [   0, 1071,    4,    0,  185,   24,  368,   20,   31, 3109,  954,
          12,    3,   21,    2, 2915,    2,   12,    3,   21]],
      dtype=int32)

In [None]:
print(id_to_word(_input_data[0,:]))

['aer', 'banknote', 'berlitz', 'calloway', 'centrust', 'cluett', 'fromstein', 'gitano', 'guterman', 'hydro-quebec', 'ipo', 'kia', 'memotec', 'mlx', 'nahb', 'punts', 'rake', 'regatta', 'rubens', 'sim']


The <b>Embedding</b> layer finds the embedded values for our batch of 30x20 words. It goes to each row of <code>input_data</code> and, for each word in the row/sentence, finds the corresponding vector in its embedding matrix. <br>
It creates a [30x20x200] tensor, so the first element of <b>inputs</b> (the first sentence) is a 20x200 matrix in which each row is a vector representing a word in the sentence.
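The lookup described above can be sketched in plain Python with nested lists. The toy sizes and the random initial values are assumptions chosen for illustration; the real layer holds a [10000x200] matrix.

```python
import random

def embedding_lookup(table, id_batch):
    """Replace every id in a [batch x steps] grid by its row of the table."""
    return [[table[word_id] for word_id in sentence] for sentence in id_batch]

random.seed(0)
toy_vocab_size, toy_embed_dim = 6, 4
# One small random vector per vocabulary entry (untrained values, for illustration).
toy_table = [[random.uniform(-0.05, 0.05) for _ in range(toy_embed_dim)]
             for _ in range(toy_vocab_size)]

toy_id_batch = [[0, 3, 5], [2, 2, 1]]  # 2 sentences x 3 steps
toy_embedded = embedding_lookup(toy_table, toy_id_batch)
toy_shape = (len(toy_embedded), len(toy_embedded[0]), len(toy_embedded[0][0]))
print(toy_shape)
```

The same id always maps to the same vector, so repeated words in a sentence share one row of the table.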


In [None]:
embedding_layer = tf.keras.layers.Embedding(vocab_size, embeding_vector_size,batch_input_shape=(batch_size, num_steps),trainable=True,name="embedding_vocab")  

In [None]:
# Define where to get the data for our embeddings from
inputs = embedding_layer(_input_data)
inputs

<tf.Tensor: shape=(30, 20, 200), dtype=float32, numpy=
array([[[ 0.04870662, -0.01193376,  0.00658649, ..., -0.04968027,
         -0.03812311,  0.0402422 ],
        [-0.03306731,  0.00457491,  0.03667506, ..., -0.00831813,
         -0.01256456,  0.03503459],
        [-0.00496637, -0.0070735 ,  0.03331048, ..., -0.00319834,
         -0.00926016, -0.03706694],
        ...,
        [ 0.0372239 , -0.02372563,  0.00439869, ...,  0.04839167,
          0.03670845,  0.02530028],
        [ 0.00545583, -0.0073446 ,  0.0075757 , ...,  0.01225203,
          0.01210945,  0.04445219],
        [ 0.00871379, -0.0168138 , -0.03219406, ...,  0.02777426,
          0.02916456,  0.01563765]],

       [[-0.02266669, -0.04095539, -0.04280273, ..., -0.02549719,
         -0.02029875, -0.02476766],
        [-0.03118582,  0.00155712, -0.04534843, ...,  0.01529891,
         -0.03374769,  0.02878617],
        [ 0.01350128,  0.00672488, -0.02636375, ..., -0.02916452,
         -0.02539345, -0.03887729],
        ...,

<h3>Constructing Recurrent Neural Networks</h3>


In this step, I create the stacked LSTM using <b>tf.keras.layers.StackedRNNCells</b>, which gives a 2-layer LSTM network:


In [None]:
lstm_cell_l1 = tf.keras.layers.LSTMCell(hidden_size_l1)
lstm_cell_l2 = tf.keras.layers.LSTMCell(hidden_size_l2)

In [None]:
stacked_lstm = tf.keras.layers.StackedRNNCells([lstm_cell_l1, lstm_cell_l2])

<b>tf.keras.layers.RNN</b> creates a recurrent neural network using <b>stacked_lstm</b>.

The input should be a Tensor of shape [batch_size, max_time, embedding_vector_size]; in our case that is (30, 20, 200).


In [None]:
# return_sequences=True returns the output of every time step, not only the last one
layer = tf.keras.layers.RNN(stacked_lstm, return_sequences=True, return_state=False, stateful=True, trainable=True)

Also, we initialize the states of the network:

<h4>_initial_state</h4>

For each LSTM layer, there are 2 state matrices, c_state and m_state, which represent the "Cell State" and the "Memory State" (the hidden output). Each state matrix has one row per sequence in the batch and one column per hidden unit, so a layer with n hidden units keeps matrices of size [30 x n]. The variable created below has size [30x200].
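To show what the two states do, here is a minimal sketch of one LSTM time step for a single sequence, written with plain lists. It follows the standard LSTM equations; the gate names, the toy sizes and the constant weights are assumptions for illustration and do not mirror how Keras stores its weights.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step for a single sequence.

    W[k], U[k], b[k] hold the input weights, recurrent weights and bias of
    gate k, where k is "i" (input), "f" (forget), "o" (output) or "g" (candidate).
    """
    def affine(gate):
        return [sum(W[gate][j][k] * x[k] for k in range(len(x)))
                + sum(U[gate][j][k] * h_prev[k] for k in range(len(h_prev)))
                + b[gate][j]
                for j in range(len(h_prev))]

    i = [sigmoid(z) for z in affine("i")]
    f = [sigmoid(z) for z in affine("f")]
    o = [sigmoid(z) for z in affine("o")]
    g = [math.tanh(z) for z in affine("g")]
    # New cell state: keep part of the old memory and add part of the candidate.
    c = [f[j] * c_prev[j] + i[j] * g[j] for j in range(len(c_prev))]
    # New hidden ("memory") state: a gated view of the squashed cell state.
    h = [o[j] * math.tanh(c[j]) for j in range(len(c))]
    return h, c

# Hypothetical toy sizes: 3 input features, 2 hidden units, every weight 0.1.
gate_names = ("i", "f", "o", "g")
toy_W = {k: [[0.1] * 3 for _ in range(2)] for k in gate_names}
toy_U = {k: [[0.1] * 2 for _ in range(2)] for k in gate_names}
toy_b = {k: [0.0] * 2 for k in gate_names}

toy_h, toy_c = lstm_step([1.0, 0.5, -0.5], [0.0, 0.0], [0.0, 0.0], toy_W, toy_U, toy_b)
print(toy_h, toy_c)
```

Both returned vectors have one entry per hidden unit; stacking them for every sequence in a batch gives the [batch_size x hidden_units] state matrices.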


In [None]:
init_state = tf.Variable(tf.zeros([batch_size,embeding_vector_size]),trainable=False)

In [None]:
layer.inital_state = init_state

In [None]:
layer.inital_state

<tf.Variable 'Variable:0' shape=(30, 200) dtype=float32, numpy=
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)>

The output of the stacked LSTM comes from the second layer, which has 128 hidden units, and one output vector is produced at each of the 20 time steps. Calling the layer on the embedded inputs therefore gives a tensor of shape [30x20x128]:


In [None]:
outputs = layer(inputs)

In [None]:
outputs

<tf.Tensor: shape=(30, 20, 128), dtype=float32, numpy=
array([[[-9.6469698e-04, -8.2536967e-04, -4.6314072e-04, ...,
         -4.3411154e-04,  6.1145984e-04, -5.8975833e-04],
        [-1.2868774e-03, -1.3076189e-03, -7.4472441e-04, ...,
         -1.0691351e-03,  1.5305374e-04, -2.1255064e-04],
        [-1.1594163e-03, -2.4171208e-03, -9.4795099e-04, ...,
          4.2319567e-05, -1.7404248e-04,  1.2326690e-05],
        ...,
        [-1.7939245e-03, -5.9895050e-03, -3.9084491e-04, ...,
          8.1310160e-03, -2.2896552e-03, -8.5368368e-04],
        [-2.8794527e-03, -6.1599035e-03,  1.0960505e-03, ...,
          8.1292819e-03, -8.2695618e-04, -1.8090757e-05],
        [-2.6664443e-03, -5.6500221e-03,  3.3125945e-03, ...,
          8.2190223e-03,  4.2107164e-05, -4.8991543e-04]],

       [[ 1.6534826e-04, -1.3182656e-03, -4.1052344e-04, ...,
         -7.0256654e-05, -1.1013820e-03,  7.9689402e-04],
        [ 1.9702839e-04, -2.3519869e-03, -1.7266523e-03, ...,
         -8.0936600e-04, -4.

<h2>Dense layer</h2>

Now we create a densely-connected neural network layer that projects the outputs tensor from [30 x 20 x 128] to [30 x 20 x 10000].


In [None]:
dense = tf.keras.layers.Dense(vocab_size)

In [None]:
logits_outputs  = dense(outputs)

In [None]:
print("shape of the output from dense layer: ", logits_outputs.shape) #(batch_size, sequence_length, vocab_size)

shape of the output from dense layer:  (30, 20, 10000)


<h2>Activation layer</h2>

A softmax activation layer is then applied to derive the probability of the output being each of the possible classes (10000 in this case).


In [None]:
activation = tf.keras.layers.Activation('softmax')

In [None]:
output_words_prob = activation(logits_outputs)

In [None]:
print("shape of the output from the activation layer: ", output_words_prob.shape) #(batch_size, sequence_length, vocab_size)

shape of the output from the activation layer:  (30, 20, 10000)
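As a reminder of what the softmax does to a vector of logits, here is a small stand-alone sketch. It subtracts the maximum before exponentiating, a common trick for numerical stability; the function and variable names are illustrative only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_probs = softmax([2.0, 1.0, 0.1])
# Identical logits give a uniform distribution: 1/10000 for each of 10000 classes.
uniform_probs = softmax([0.0] * 10000)
print(toy_probs)
print(uniform_probs[0])
```

An untrained network produces nearly identical logits for every word, so its probabilities sit close to 1/10000 = 1e-4 each.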


Let's look at the probability of observing words for t=0 to t=19 (the 20 time steps of the first sentence):


In [None]:
print("The probability of observing words in t=0 to t=20", output_words_prob[0,0:num_steps])

The probability of observing words in t=0 to t=20 tf.Tensor(
[[1.00001831e-04 1.00005440e-04 1.00015423e-04 ... 1.00008110e-04
  1.00021971e-04 1.00010133e-04]
 [1.00012483e-04 1.00029851e-04 1.00026606e-04 ... 1.00023928e-04
  1.00014156e-04 1.00024445e-04]
 [1.00023426e-04 1.00038822e-04 1.00024554e-04 ... 1.00030513e-04
  1.00016150e-04 9.99973490e-05]
 ...
 [1.00049219e-04 9.99556869e-05 9.99562690e-05 ... 9.99481854e-05
  1.00094643e-04 9.99026670e-05]
 [1.00058285e-04 9.99316617e-05 9.99727054e-05 ... 9.99444965e-05
  1.00088459e-04 9.99178956e-05]
 [1.00047902e-04 9.99193653e-05 9.99946933e-05 ... 9.99361509e-05
  1.00080040e-04 9.99256590e-05]], shape=(20, 10000), dtype=float32)


<h3>Prediction</h3>
Which word corresponds to the probability output? Let's use the word with the maximum probability:


In [None]:
np.argmax(output_words_prob[0,0:num_steps], axis=1)

array([6765, 4494, 4494, 8060, 1065, 1065, 2144, 9904, 9904, 5606, 6568,
        352, 6982, 6982, 6982, 6982, 6982, 4316, 4316, 4316])

So, what is the ground truth for the first word of the first sentence? You can get it from the target tensor:


In [None]:
_targets[0]

array([9971, 9972, 9974, 9975, 9976, 9980, 9981, 9982, 9983, 9984, 9986,
       9987, 9988, 9989, 9991, 9992, 9993, 9994, 9995, 9996], dtype=int32)

In [None]:
def crossentropy(y_true, y_pred):
    return tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred)

In [None]:
loss  = crossentropy(_targets, output_words_prob)

Let's look at the first 10 values of loss:


In [None]:
loss[0,:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=
array([9.21046 , 9.210031, 9.210526, 9.210196, 9.210417, 9.20994 ,
       9.210451, 9.21033 , 9.209973, 9.210232], dtype=float32)>

Now, we define the cost as the loss summed over the time steps and averaged over the batch:


In [None]:
cost = tf.reduce_sum(loss / batch_size)
cost

<tf.Tensor: shape=(), dtype=float32, numpy=184.20805>
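The numbers above can be sanity-checked by hand. Sparse categorical cross-entropy is the negative log of the probability given to the correct word, so a model that spreads its probability evenly over 10000 words pays ln(10000) per word. The sketch below is a plain-Python illustration with hypothetical names.

```python
import math

def sparse_cross_entropy(target_id, probs):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target_id])

toy_vocab = 10000
uniform = [1.0 / toy_vocab] * toy_vocab
per_word_loss = sparse_cross_entropy(9971, uniform)  # equals ln(10000)
sequence_cost = 20 * per_word_loss                   # summed over 20 time steps
print(per_word_loss, sequence_cost)
```

Both values are close to the per-word losses of about 9.21 and the cost of about 184.2 shown above, which is what one would expect from a network that has not been trained yet.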

<h3>Training</h3>

To do training for our network, we have to take the following steps:

<ol>
    <li>Define the optimizer.</li>
    <li>Assemble layers to build model.</li>
    <li>Calculate the gradients based on the loss function.</li>
    <li>Apply the optimizer to the variables/gradients tuple.</li>
</ol>


<h4>1. Define Optimizer</h4>


In [None]:
# Create a variable for the learning rate
lr = tf.Variable(0.0, trainable=False)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr, clipnorm=max_grad_norm)

<h4>2. Assemble layers to build model.</h4>


In [None]:
model = tf.keras.Sequential()
model.add(embedding_layer)
model.add(layer)
model.add(dense)
model.add(activation)
model.compile(loss=crossentropy, optimizer=optimizer)
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_vocab (Embedding)  (30, 20, 200)             2000000   
_________________________________________________________________
rnn (RNN)                    (30, 20, 128)             671088    
_________________________________________________________________
dense (Dense)                (30, 20, 10000)           1290000   
_________________________________________________________________
activation (Activation)      (30, 20, 10000)           0         
Total params: 3,961,088
Trainable params: 3,955,088
Non-trainable params: 6,000
_________________________________________________________________


<h4>Trainable Variables</h4>


When a variable is created with <i>trainable=True</i>, TensorFlow marks it as a variable to be updated during training. In TensorFlow 1.x such variables were added to the graph collection <b>GraphKeys.TRAINABLE_VARIABLES</b> and retrieved with <i>tf.trainable_variables()</i>; with a Keras model in TensorFlow 2.x, <i>model.trainable_variables</i> returns all of the model's variables created with <b>trainable=True</b>.


In [None]:
# Get all of the model's variables marked as "trainable" (this excludes lr, which was created with trainable=False)
tvars = model.trainable_variables

Note: we can find the name and scope of all variables:


In [None]:
[v.name for v in tvars] 

['embedding_vocab/embeddings:0',
 'rnn/stacked_rnn_cells/lstm_cell/kernel:0',
 'rnn/stacked_rnn_cells/lstm_cell/recurrent_kernel:0',
 'rnn/stacked_rnn_cells/lstm_cell/bias:0',
 'rnn/stacked_rnn_cells/lstm_cell_1/kernel:0',
 'rnn/stacked_rnn_cells/lstm_cell_1/recurrent_kernel:0',
 'rnn/stacked_rnn_cells/lstm_cell_1/bias:0',
 'dense/kernel:0',
 'dense/bias:0']

<h4>3. Calculate the gradients based on the loss function</h4>


In [None]:
x = tf.constant(1.0)
y =  tf.constant(2.0)
with tf.GradientTape(persistent=True) as g:
    g.watch(x)
    g.watch(y)
    func_test = 2 * x * x + 3 * x * y

In [None]:
var_grad = g.gradient(func_test, x) # Will compute to 10.0
print(var_grad)

tf.Tensor(10.0, shape=(), dtype=float32)


In [None]:
var_grad = g.gradient(func_test, y) # Will compute to 3.0
print(var_grad)

tf.Tensor(3.0, shape=(), dtype=float32)
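The two gradients can be cross-checked without TensorFlow by using central finite differences. This is only a numerical approximation for illustration; <code>toy_func</code> and <code>central_difference</code> are hypothetical names.

```python
def toy_func(a, b):
    """Same function as func_test above: 2*a*a + 3*a*b."""
    return 2 * a * a + 3 * a * b

def central_difference(func, a, b, wrt, eps=1e-6):
    """Approximate a partial derivative of func at the point (a, b)."""
    if wrt == "a":
        return (func(a + eps, b) - func(a - eps, b)) / (2 * eps)
    return (func(a, b + eps) - func(a, b - eps)) / (2 * eps)

d_da = central_difference(toy_func, 1.0, 2.0, "a")  # analytic value: 4a + 3b = 10
d_db = central_difference(toy_func, 1.0, 2.0, "b")  # analytic value: 3a = 3
print(d_da, d_db)
```

The analytic derivatives are 4x + 3y with respect to x and 3x with respect to y, which at (1, 2) give the 10.0 and 3.0 printed by the gradient tape.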


Now, we can look at the gradients w.r.t. all variables:


In [None]:
with tf.GradientTape() as tape:
    # Forward pass.
    output_words_prob = model(_input_data)
    # Loss value for this batch.
    loss  = crossentropy(_targets, output_words_prob)
    cost = tf.reduce_sum(loss,axis=0) / batch_size

In [None]:
# Get gradients of loss wrt the trainable variables.
grad_t_list = tape.gradient(cost, tvars)

In [None]:
print(grad_t_list)

[<tensorflow.python.framework.indexed_slices.IndexedSlices object at 0x7f6c309e4ed0>, <tf.Tensor: shape=(200, 1024), dtype=float32, numpy=
array([[ 4.2663459e-07, -6.1952989e-07, -9.2177302e-08, ...,
        -2.2093963e-07,  3.9344943e-07, -1.8651519e-08],
       [ 5.6239850e-07,  4.0840618e-08, -1.0229735e-07, ...,
        -3.0300029e-07,  2.0135602e-07,  2.7928004e-07],
       [ 7.6816309e-07,  8.3982201e-07, -6.2940444e-08, ...,
         3.6399160e-07,  1.4516289e-08, -6.1554104e-09],
       ...,
       [-8.2077321e-07,  7.8317845e-07, -3.3058512e-07, ...,
         6.0170936e-07,  1.1440124e-07,  1.4874279e-07],
       [ 8.5152135e-07,  1.0094560e-06,  5.5223086e-07, ...,
         7.2387923e-07,  2.6708335e-08,  6.9982612e-08],
       [-2.4198414e-07,  7.6622075e-07,  1.8744362e-07, ...,
        -7.3234764e-07,  2.9273892e-07,  3.4699632e-08]], dtype=float32)>, <tf.Tensor: shape=(256, 1024), dtype=float32, numpy=
array([[-2.9210300e-08, -5.5433045e-08, -3.6456406e-08, ...,
         

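Now, we have a list of gradient tensors, <i>t-list</i>. We can use it to find the clipped tensors. <b>clip_by_global_norm</b> clips the values of multiple tensors by the ratio of the sum of their norms: if the global norm of all tensors together exceeds the threshold, every tensor is scaled down by the same factor.

<b>clip_by_global_norm</b> takes <i>t-list</i> as input and returns 2 things: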

<ul>
    <li>a list of clipped tensors, so called <i>list_clipped</i></li> 
    <li>the global norm (global_norm) of all tensors in t_list</li> 
</ul>
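The clipping rule can be written out in a few lines of plain Python. The sketch below works on flat lists instead of tensors and is an illustration of the rule, not TensorFlow's implementation; the name <code>clip_by_global_norm_sketch</code> is hypothetical.

```python
import math

def clip_by_global_norm_sketch(tensors, clip_norm):
    """Scale a list of flat gradient lists so their joint L2 norm is at most clip_norm."""
    global_norm = math.sqrt(sum(v * v for t in tensors for v in t))
    # The scale is 1.0 when the global norm is already within the threshold.
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[v * scale for v in t] for t in tensors]
    return clipped, global_norm

big_grads = [[3.0, 4.0], [12.0]]  # global norm = sqrt(9 + 16 + 144) = 13
big_clipped, big_norm = clip_by_global_norm_sketch(big_grads, clip_norm=5.0)
small_clipped, small_norm = clip_by_global_norm_sketch([[0.3, 0.4]], clip_norm=5.0)
print(big_norm, big_clipped)
print(small_norm, small_clipped)
```

Because every tensor is multiplied by the same factor, the direction of the overall gradient is preserved and only its length changes.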


In [None]:
# Define the gradient clipping threshold
grads, _ = tf.clip_by_global_norm(grad_t_list, max_grad_norm)
grads

[<tensorflow.python.framework.indexed_slices.IndexedSlices at 0x7f6c3099f150>,
 <tf.Tensor: shape=(200, 1024), dtype=float32, numpy=
 array([[ 4.2663459e-07, -6.1952989e-07, -9.2177302e-08, ...,
         -2.2093963e-07,  3.9344943e-07, -1.8651519e-08],
        [ 5.6239850e-07,  4.0840618e-08, -1.0229735e-07, ...,
         -3.0300029e-07,  2.0135602e-07,  2.7928004e-07],
        [ 7.6816309e-07,  8.3982201e-07, -6.2940444e-08, ...,
          3.6399160e-07,  1.4516289e-08, -6.1554104e-09],
        ...,
        [-8.2077321e-07,  7.8317845e-07, -3.3058512e-07, ...,
          6.0170936e-07,  1.1440124e-07,  1.4874279e-07],
        [ 8.5152135e-07,  1.0094560e-06,  5.5223086e-07, ...,
          7.2387923e-07,  2.6708335e-08,  6.9982612e-08],
        [-2.4198414e-07,  7.6622075e-07,  1.8744362e-07, ...,
         -7.3234764e-07,  2.9273892e-07,  3.4699632e-08]], dtype=float32)>,
 <tf.Tensor: shape=(256, 1024), dtype=float32, numpy=
 array([[-2.9210300e-08, -5.5433045e-08, -3.6456406e-08, ...,


<h4>4. Apply the optimizer to the variables/gradients tuple.</h4>


In [None]:
# Apply the clipped gradients to the trainable variables through our optimizer
train_op = optimizer.apply_gradients(zip(grads, tvars))
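For plain SGD without momentum, applying the gradients amounts to subtracting the learning rate times the gradient from each variable. A minimal sketch on flat lists, with hypothetical names and toy values:

```python
def sgd_step(variables, grads, step_size):
    """Return updated copies of the variables: v <- v - step_size * g."""
    return [[v - step_size * g for v, g in zip(var, grad)]
            for var, grad in zip(variables, grads)]

sgd_vars = [[1.0, -2.0], [0.5]]
sgd_grads = [[0.5, 0.5], [-1.0]]
sgd_updated = sgd_step(sgd_vars, sgd_grads, step_size=1.0)
print(sgd_updated)
```

Note that the optimizer in this notebook was created with a learning rate variable of 0.0, so the call above leaves the weights unchanged until the rate is set to a positive value.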

<a id="ltsm"></a>

<h2>LSTM</h2>


Let's then create a class that represents our model. This class needs a few things:

<ul>
    <li>We have to create the model in accordance with our defined hyperparameters</li>
    <li>We have to create the LSTM cell structure and connect them with our RNN structure</li>
    <li>We have to create the word embeddings and point them to the input data</li>
    <li>We have to create the input structure for our RNN</li>
    <li>We need to create a softmax output layer to return the probability of our words</li>
    <li>We need to create the loss and cost functions for our optimizer to work, and then create the optimizer</li>
    <li>And finally, we need to create a training operation that can be run to actually train our model</li>
</ul>


In [None]:
class PTBModel(object):


    def __init__(self):
        ######################################
        # Setting parameters for ease of use #
        ######################################
        self.batch_size = batch_size
        self.num_steps = num_steps
        self.hidden_size_l1 = hidden_size_l1
        self.hidden_size_l2 = hidden_size_l2
        self.vocab_size = vocab_size
        self.embeding_vector_size = embeding_vector_size
        # Create a variable for the learning rate
        self._lr = 1.0
        
        ###############################################################################
        # Initializing the model using keras Sequential API  #
        ###############################################################################
        
        self._model = tf.keras.models.Sequential()
        
        ####################################################################
        # Creating the word embeddings layer and adding it to the sequence #
        ####################################################################
        with tf.device("/cpu:0"):
            # Create the embeddings for our input data. Each word is mapped to a vector of size embeding_vector_size.
            self._embedding_layer = tf.keras.layers.Embedding(self.vocab_size, self.embeding_vector_size,batch_input_shape=(self.batch_size, self.num_steps),trainable=True,name="embedding_vocab")  #[10000x200]
            self._model.add(self._embedding_layer)
            

        ##########################################################################
        # Creating the LSTM cell structure and connect it with the RNN structure #
        ##########################################################################
        # Create the LSTM Cells. 
        # This creates only the structure for the LSTM and has to be associated with a RNN unit still.
        # The argument  of LSTMCell is size of hidden layer, that is, the number of hidden units of the LSTM (inside A). 
        # LSTM cell processes one word at a time and computes probabilities of the possible continuations of the sentence.
        lstm_cell_l1 = tf.keras.layers.LSTMCell(hidden_size_l1)
        lstm_cell_l2 = tf.keras.layers.LSTMCell(hidden_size_l2)
        

        
        # StackedRNNCells takes the LSTM cells as a list and wraps them so that they behave as a single cell.
        # The result is an RNN cell composed sequentially of the stacked simple cells.
        stacked_lstm = tf.keras.layers.StackedRNNCells([lstm_cell_l1, lstm_cell_l2])


        

        ############################################
        # Creating the input structure for our RNN #
        ############################################
        # Input structure is 20x[30x200]
        # Considering each word is represented by a 200-dimensional vector and we have 30 sentences in a batch, each time step receives 30 word vectors, i.e. a matrix of size [30x200]
        # The input structure is fed from the embeddings, which are filled in by the input data
        # Feeding a batch of b sentences to a RNN:
        # In step 1,  first word of each of the b sentences (in a batch) is input in parallel.  
        # In step 2,  second word of each of the b sentences is input in parallel. 
        # The parallelism is only for efficiency.  
        # Each sentence in a batch is handled in parallel, but the network sees one word of a sentence at a time and does the computations accordingly. 
        # All the computations involving the words of all sentences in a batch at a given time step are done in parallel. 

        #######################################################################################################################
        # Instantiating our RNN layer and setting stateful to True to carry the final state of a batch over to the next batch #
        #######################################################################################################################
        
        # return_sequences=True returns the output of every time step, not only the last one
        self._RNNlayer = tf.keras.layers.RNN(stacked_lstm, return_sequences=True, return_state=False, stateful=True, trainable=True)
        
        # Define the initial state, i.e., the model state for the very first data point
        # It initialize the state of the LSTM memory. The memory state of the network is initialized with a vector of zeros and gets updated after reading each word.
        self._initial_state = tf.Variable(tf.zeros([batch_size,embeding_vector_size]),trainable=False)
        self._RNNlayer.inital_state = self._initial_state
    
        ############################################
        # Adding RNN layer to keras sequential API #
        ############################################        
        self._model.add(self._RNNlayer)
        
        #self._model.add(tf.keras.layers.LSTM(hidden_size_l1,return_sequences=True,stateful=True))
        #self._model.add(tf.keras.layers.LSTM(hidden_size_l2,return_sequences=True))
        
        
        ####################################################################################################
        # Instantiating a Dense layer that connects the output to the vocab_size  and adding layer to model#
        ####################################################################################################
        self._dense = tf.keras.layers.Dense(self.vocab_size)
        self._model.add(self._dense)
 
        
        ####################################################################################################
        # Adding softmax activation layer and deriving probability to each class and adding layer to model #
        ####################################################################################################
        self._activation = tf.keras.layers.Activation('softmax')
        self._model.add(self._activation)

        ##########################################################
        # Instantiating the stochastic gradient decent optimizer #
        ########################################################## 
        self._optimizer = tf.keras.optimizers.SGD(learning_rate=self._lr, clipnorm=max_grad_norm)
        
        
        ##############################################################################
        # Compiling and summarizing the model stacked using the keras sequential API #
        ##############################################################################
        self._model.compile(loss=self.crossentropy, optimizer=self._optimizer)
        self._model.summary()


    def crossentropy(self,y_true, y_pred):
        return tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred)

    def train_batch(self,_input_data,_targets):
        #################################################
        # Creating the Training Operation for our Model #
        #################################################
        # Get all of the model's variables marked as "trainable"
        tvars = self._model.trainable_variables
        # Record the forward pass on a gradient tape so that gradients can be computed afterwards
        with tf.GradientTape() as tape:
            # Forward pass.
            output_words_prob = self._model(_input_data)
            # Loss value for this batch.
            loss  = self.crossentropy(_targets, output_words_prob)
            # average across batch and reduce sum
            cost = tf.reduce_sum(loss/ self.batch_size)
        # Get gradients of loss wrt the trainable variables.
        grad_t_list = tape.gradient(cost, tvars)
        # Clip the gradients so that their global norm does not exceed max_grad_norm
        grads, _ = tf.clip_by_global_norm(grad_t_list, max_grad_norm)
        # Apply the clipped gradients to the trainable variables through our optimizer
        self._optimizer.apply_gradients(zip(grads, tvars))
        return cost
        
    def test_batch(self,_input_data,_targets):
        #################################################
        # Creating the Testing Operation for our Model #
        #################################################
        output_words_prob = self._model(_input_data)
        loss  = self.crossentropy(_targets, output_words_prob)
        # average across batch and reduce sum
        cost = tf.reduce_sum(loss/ self.batch_size)

        return cost
    @classmethod
    def instance(cls):
        return cls()

With that, the actual structure of our Recurrent Neural Network with Long Short-Term Memory is finished. What remains for us to do is to actually create the methods to run through time: the <code>run_one_epoch</code> method to be run at each epoch and a <code>main</code> script which ties all of this together.

What our <code>run_one_epoch</code> method should do is take our input data and feed it to the relevant operations. It accumulates the cost over all batches and returns the resulting perplexity.


In [None]:

########################################################################################################################
# run_one_epoch takes as parameters  the model instance, the data to be fed, training or testing mode and verbose info #
########################################################################################################################
def run_one_epoch(m, data,is_training=True,verbose=False):

    #Define the epoch size based on the length of the data, batch size and the number of steps
    epoch_size = ((len(data) // m.batch_size) - 1) // m.num_steps
    start_time = time.time()
    costs = 0.
    iters = 0
    
    m._model.reset_states()
    
    #For each step and data point
    for step, (x, y) in enumerate(reader.ptb_iterator(data, m.batch_size, m.num_steps)):
        
        #Run one training or evaluation step on this batch and get back its cost
        #y = tf.keras.utils.to_categorical(y, num_classes=vocab_size)
        if is_training : 
            loss=  m.train_batch(x, y)
        else :
            loss = m.test_batch(x, y)
                                   

        #Add returned cost to costs (which keeps track of the total costs for this epoch)
        costs += loss
        
        #Add number of steps to iteration counter
        iters += m.num_steps

        if verbose and step % (epoch_size // 10) == 10:
            print("Itr %d of %d, perplexity: %.3f speed: %.0f wps" % (step , epoch_size, np.exp(costs / iters), iters * m.batch_size / (time.time() - start_time)))
        


    # Returns the Perplexity rating for us to keep track of how the model is evolving
    return np.exp(costs / iters)
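The two formulas used in this function can be checked with plain arithmetic. Perplexity is the exponential of the average per-word cross-entropy, so a model that guesses uniformly over 10000 words has a perplexity of 10000. The names below are hypothetical, and the token count is the training set length printed earlier.

```python
import math

def perplexity(total_cost, total_steps):
    """Exponential of the average per-word cross-entropy (in nats)."""
    return math.exp(total_cost / total_steps)

# A uniform guess over 10000 words costs ln(10000) per word.
uniform_cost_per_step = math.log(10000)
uniform_ppl = perplexity(uniform_cost_per_step * 20 * 5, 20 * 5)  # 5 batches of 20 steps

# Number of batches per epoch for the training split (929589 tokens).
toy_epoch_size = ((929589 // 30) - 1) // 20
print(uniform_ppl, toy_epoch_size)
```

The epoch size of 1549 matches the iteration count shown in the training log, and the uniform perplexity of 10000 gives a reference point for the values reported during the first iterations.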


Now, we create the <code>main</code> script to tie everything together. The code here reads the data from the directory using the <code>reader</code> helper module, trains the model on the training data, and evaluates it on the validation and test subsets.


In [None]:
# Reads the data and separates it into training data, validation data and testing data
raw_data = reader.ptb_raw_data(data_dir)
train_data, valid_data, test_data, _, _ = raw_data

In [None]:
# Instantiates the PTBModel class
m=PTBModel.instance()   
K = tf.keras.backend 
for i in range(max_epoch):
    # Define the decay for this epoch
    lr_decay = decay ** max(i - max_epoch_decay_lr, 0.0)
    dcr = learning_rate * lr_decay
    m._lr = dcr
    K.set_value(m._model.optimizer.learning_rate,m._lr)
    print("Epoch %d : Learning rate: %.3f" % (i + 1, m._model.optimizer.learning_rate))
    # Run the loop for this epoch in the training mode
    train_perplexity = run_one_epoch(m, train_data,is_training=True,verbose=True)
    print("Epoch %d : Train Perplexity: %.3f" % (i + 1, train_perplexity))
        
    # Run the loop for this epoch in the validation mode
    valid_perplexity = run_one_epoch(m, valid_data,is_training=False,verbose=False)
    print("Epoch %d : Valid Perplexity: %.3f" % (i + 1, valid_perplexity))
    
# Run the loop in the testing mode to see how effective was our training
test_perplexity = run_one_epoch(m, test_data,is_training=False,verbose=False)
print("Test Perplexity: %.3f" % test_perplexity)



Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_vocab (Embedding)  (30, 20, 200)             2000000   
_________________________________________________________________
rnn_1 (RNN)                  (30, 20, 128)             671088    
_________________________________________________________________
dense_1 (Dense)              (30, 20, 10000)           1290000   
_________________________________________________________________
activation_1 (Activation)    (30, 20, 10000)           0         
Total params: 3,961,088
Trainable params: 3,955,088
Non-trainable params: 6,000
_________________________________________________________________
Epoch 1 : Learning rate: 1.000
Itr 10 of 1549, perplexity: 4722.261 speed: 963 wps
Itr 164 of 1549, perplexity: 1092.543 speed: 988 wps
Itr 318 of 1549, perplexity: 850.528 speed: 993 wps
Itr 472 of 1549, perplexity: 702.228 speed: 992 wp

As you can see, the model's perplexity drops very quickly within the first few hundred iterations. <b>Lower perplexity means that the model is more certain about its predictions</b>, so this steady drop suggests that the model is learning.
