# Applying LSTM for Language Modelling
In this notebook, we will go over what Language Modelling is and build a Recurrent Neural Network model, based on the Long Short-Term Memory (LSTM) unit, that is trained and benchmarked on the Penn Treebank dataset.

## The Objective
By now, you should have an understanding of how Recurrent Networks work - a specialized model to process sequential data by keeping track of the "state" or context. In this notebook, we go over a TensorFlow code snippet for creating a model focused on **Language Modelling** - a very relevant task that is the cornerstone of many different linguistic problems such as **Speech Recognition, Machine Translation and Image Captioning**. For this, we will be using the Penn Treebank, which is an often-used dataset for benchmarking Language Modelling models.

## What exactly is Language Modelling?
Language Modelling, to put it simply, **is the task of assigning probabilities to sequences of words**. This means that, given a context of one or a few words in the language the model was trained on, the model should know which words or sequences of words are most likely to come next. Language Modelling is one of the most important tasks in Natural Language Processing.

<img src="https://ibm.box.com/shared/static/1d1i5gub6wljby2vani2vzxp0xsph702.png" width="768"/>
<center>*Example of a sentence being predicted*</center>

In this example, one can see the predictions for the next word of a sentence, given the context "This is an". As you can see, this boils down to a sequential data analysis task - you are given a word or a sequence of words (the input data), and, given the context (the state), you need to find out what is the next word (the prediction). This kind of analysis is very important for language-related tasks such as **Speech Recognition, Machine Translation, Image Captioning, Text Correction** and many other very relevant problems. 

<img src="https://ibm.box.com/shared/static/az39idf9ipfdpc5ugifpgxnydelhyf3i.png" width="1080"/>
<center>*The above example schematized as an RNN in execution*</center>

As the above image shows, Recurrent Network models fit this problem like a glove. Alongside LSTM and its capacity to maintain the model's state for over one thousand time steps, we have all the tools we need to undertake this problem. The goal is to create a model that can reach **low levels of perplexity** on our desired dataset.

For Language Modelling problems, **perplexity** is the standard way to gauge model quality. Perplexity measures how well a probabilistic model predicts a sample: it is the exponential of the average negative log-probability that the model assigns to the words that actually occur. Put more simply, **low perplexity means a higher degree of trust in the predictions the model makes**. Therefore, the lower the perplexity, the better.
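
To make the idea of perplexity concrete, here is a minimal sketch in plain Python. The function name `perplexity` and the example probabilities are made up for illustration; the only input is the probability the model assigned to each word that actually occurred.

```python
import math

def perplexity(target_probs):
    """Perplexity from the probabilities assigned to the words that actually occurred."""
    avg_neg_log_prob = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(avg_neg_log_prob)

# A model that spreads its belief uniformly over a 10000-word vocabulary.
uniform = perplexity([1.0 / 10000] * 5)

# A model that is fairly confident about the correct words.
confident = perplexity([0.5, 0.25, 0.5, 0.25])
```

A model that guesses uniformly over 10000 words has a perplexity of about 10000, which is a useful reference point: any trained model should end up well below the vocabulary size.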

## The Penn Treebank dataset
Historically, datasets big enough for Natural Language Processing have been hard to come by. This is in part because the sentences need to be broken down and tagged with a certain degree of correctness - otherwise, models trained on them cannot be expected to be correct either. This means that we need a **large amount of data, annotated by or at least corrected by humans**. This is, of course, not an easy task at all.

The Penn Treebank, or PTB for short, is a dataset maintained by the University of Pennsylvania. It is *huge* - there are over **four million and eight hundred thousand** annotated words in it, all corrected by humans. It is composed of many different sources, from abstracts of Department of Energy papers to texts from the Library of America. Since it is verifiably correct and of such a huge size, the Penn Treebank has been used time and time again as a benchmark dataset for Language Modelling.

The dataset is divided into different kinds of annotations, such as Part-of-Speech, Syntactic and Semantic skeletons. For this example, we will simply use a sample of clean, non-annotated words (with the exception of one tag - `<unk>`, which is used for rare words such as uncommon proper nouns) for our model. This means that we just want to predict what the next words would be, not what they mean in context or what their classes are in a given sentence. 
<br/>
<div class="alert alert-block alert-info">
<center>the percentage of lung cancer deaths among the workers at the west `<unk>` mass. paper factory appears to be the highest for any asbestos workers studied in western industrialized countries he said 
 the plant which is owned by `<unk>` & `<unk>` co. was under contract with `<unk>` to make the cigarette filters 
 the finding probably will support those who argue that the u.s. should regulate the class of asbestos including `<unk>` more `<unk>` than the common kind of asbestos `<unk>` found in most schools and other buildings dr. `<unk>` said
    
</center>
</div>
<center>*Example of text from the dataset we are going to use, `ptb.train`*</center>

<h2>Word Embeddings</h2><br/>

For better processing, in this example, we will make use of [**word embeddings**](https://www.tensorflow.org/tutorials/word2vec/), which are **a way of representing sentence structures or words as n-dimensional vectors (where n is a reasonably high number, such as 200 or 500) of real numbers**. Basically, we will assign each word a randomly-initialized vector, and input those into the network to be processed. After a number of iterations, these vectors are expected to assume values that help the network to correctly predict what it needs to - in our case, the probable next word in the sentence. This has been shown to be very effective in Natural Language Processing tasks, and is a commonplace practice.
<br/><br/>
<font size = 4>
    <strong>
$$Vec("Example") = [0.02, 0.00, 0.00, 0.92, 0.30,...]$$
    </strong>
</font>
<br/>
Word embeddings tend to place similarly used words *reasonably* close together in the vector space. For example, if we use t-SNE (a dimensionality-reduction algorithm for visualization) to flatten the dimensions of our vectors into a 2-dimensional space and use the words these vectors represent as their labels, we might see something like this:

<img src="https://ibm.box.com/shared/static/bqhc5dg879gcoabzhxra1w8rkg3od1cu.png" width="800"/>
<center>*T-SNE Mockup with clusters marked for easier visualization*</center>

As you can see, words that are frequently used together, or in place of each other, tend to be grouped together, and the stronger these correlations are, the closer the words sit. For example, "None" is semantically close to "Zero", while a phrase that uses "Italy" could probably use "Germany" instead with little damage to the sentence structure. This kind of vector "closeness" between similar words is a good indicator of a well-built model.

---

We need to import the necessary modules for our code. We need **`numpy` and `tensorflow`**, obviously. Additionally, we import **`reader`** from **`resources.ptb`**, which is the helper module for getting the input data from the dataset.

In [2]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import time
import numpy as np
import tensorflow as tf
from resources.ptb import reader

## Building the LSTM model for Language Modeling
Now that we know exactly what we are doing, we can start building our model using TensorFlow. Additionally, for the sake of making it easy to play around with the model's hyperparameters, we can declare them beforehand.

In [2]:
# Initial weight scale.
init_scale = 0.1
# Initial learning rate.
learning_rate = 1.0
# Maximum permissible norm for the gradient (For gradient clipping - another measure against Exploding Gradients).
max_grad_norm = 5
# The number of layers in our model.
num_layers = 2
# The total number of recurrence steps, also known as the number of layers when our RNN is "unfolded".
num_steps = 20
# The number of processing units (neurons) in the hidden layers.
hidden_size = 200
# The maximum number of epochs trained with the initial learning rate.
max_epoch = 4
# The total number of epochs in training.
max_max_epoch = 13
# The probability of keeping units in the Dropout Layer (dropout is a regularization technique).
# At 1, we ignore the Dropout Layer wrapping.
keep_prob = 1
# The decay for the learning rate.
decay = 0.5
# The size for each batch of data.
batch_size = 30
# The size of our vocabulary.
vocab_size = 10000
# Training flag to separate training from testing.
is_training = 1
# Data directory for our dataset.
data_dir = "./resources/data/simple-examples/data/"
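
The way `learning_rate`, `max_epoch`, `max_max_epoch` and `decay` interact can be sketched as a simple schedule. This is an assumption based on the comments above (initial rate for the first `max_epoch` epochs, then multiplied by `decay` once per epoch); the helper `learning_rate_for_epoch` is hypothetical and not part of the notebook's code.

```python
learning_rate = 1.0
max_epoch = 4
max_max_epoch = 13
decay = 0.5

def learning_rate_for_epoch(epoch):
    """Assumed schedule for a zero-based epoch index: constant at first, then exponential decay."""
    exponent = max(epoch + 1 - max_epoch, 0)
    return learning_rate * decay ** exponent

schedule = [learning_rate_for_epoch(epoch) for epoch in range(max_max_epoch)]
```

Under this assumption the first four epochs run at the initial rate of 1.0, and every later epoch halves the rate.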

Some clarifications of the LSTM architecture based on these arguments:

Network structure:
- In this network, the number of LSTM layers is 2. To give the model more expressive power, we can stack multiple LSTM layers to process the data. The output of the first layer becomes the input of the second, and so on.
- The number of recurrence steps is 20, that is, when our RNN is "unfolded", it spans 20 time steps.
- The structure is like:
     - 200 input units -> [200x200] weight matrix -> 200 hidden units (first layer) -> [200x200] weight matrix -> 200 hidden units (second layer) -> [200] weight matrix -> 200-unit output

Hidden layer:
- Each LSTM layer has 200 hidden units, which is equal to the dimensionality of the word embeddings and of the output.

Input layer:
- The network has 200 input units.
- Suppose each word is represented by an embedding vector of dimensionality e = 200. The input layer of each cell will have 200 linear units. These e = 200 linear units are connected to each of the h = 200 LSTM units in the hidden layer (assuming there is only one hidden layer, though our case has 2 layers).
- The input shape is [batch_size, num_steps], that is [30x20]. It will turn into [30x20x200] after embedding, and then into 20 slices of shape [30x200], one per time step.




This code is adapted from the PTBModel example bundled with the TensorFlow source code.  
#### Train data
The story starts from data: 
- Train data is a list of words, represented by numbers - N = 929589 numbers, e.g. [9971, 9972, 9974, 9975,...]
- We read the data in mini-batches of size b = 30. Assume each sequence is 20 words long (num_steps = 20). Then it takes roughly N / (b × num_steps) = 929589 / 600 ≈ 1549 iterations for the learner to go through the whole training set once.
- Each mini-batch therefore holds 600 word ids from the training set, arranged with shape [30x20]
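
The batching described above can be sketched in NumPy. This is an illustration only: it assumes the local `reader` module follows the same logic as the `ptb_iterator` helper from the TensorFlow PTB tutorial, and the name `batch_iterator` and the `toy_corpus` stand-in are hypothetical.

```python
import numpy as np

def batch_iterator(raw_data, batch_size, num_steps):
    """Yield (x, y) pairs of shape [batch_size, num_steps]; y is x shifted one word ahead."""
    raw_data = np.asarray(raw_data, dtype=np.int32)
    batch_len = len(raw_data) // batch_size
    # Lay the corpus out as batch_size parallel streams of batch_len words each.
    data = raw_data[:batch_size * batch_len].reshape(batch_size, batch_len)
    # One word is held back so that every input word has a "next word" target.
    epoch_size = (batch_len - 1) // num_steps
    for i in range(epoch_size):
        x = data[:, i * num_steps:(i + 1) * num_steps]
        y = data[:, i * num_steps + 1:(i + 1) * num_steps + 1]
        yield x, y

toy_corpus = np.arange(929589)  # stand-in for the real list of word ids
batches = list(batch_iterator(toy_corpus, batch_size=30, num_steps=20))
first_x, first_y = batches[0]
```

Each row of a batch is a contiguous slice of one of the 30 parallel streams, which is why consecutive batches continue the same 30 pieces of text and the LSTM state can be carried over between them.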

First we start an interactive session:

In [3]:
session = tf.InteractiveSession()

In [4]:
# Reads the data and separates it into training data, validation data and testing data.
raw_data = reader.ptb_raw_data(data_dir)
train_data, valid_data, test_data, _, _ = raw_data

Let's just read one mini-batch now and feed it to our network:

In [5]:
itera = reader.ptb_iterator(train_data, batch_size, num_steps)
first_tuple = next(itera)
x = first_tuple[0]
y = first_tuple[1]

In [6]:
x.shape

(30, 20)

Let's look at 3 sequences of our input x:

In [7]:
x[0:3]

array([[9970, 9971, 9972, 9974, 9975, 9976, 9980, 9981, 9982, 9983, 9984,
        9986, 9987, 9988, 9989, 9991, 9992, 9993, 9994, 9995],
       [2654,    6,  334, 2886,    4,    1,  233,  711,  834,   11,  130,
         123,    7,  514,    2,   63,   10,  514,    8,  605],
       [   0, 1071,    4,    0,  185,   24,  368,   20,   31, 3109,  954,
          12,    3,   21,    2, 2915,    2,   12,    3,   21]],
      dtype=int32)

In [8]:
size = hidden_size

We define 2 placeholders to be fed with mini-batches, that is, x and y:

In [9]:
_input_data = tf.placeholder(tf.int32, [batch_size, num_steps])  # [30x20]
_targets = tf.placeholder(tf.int32, [batch_size, num_steps])  # [30x20]

Let's define a dictionary, and use it later to feed the placeholders with our first mini-batch:

In [10]:
feed_dict = {_input_data: x, _targets: y}

For example, we can use it to feed _input_data:

In [11]:
session.run(_input_data, feed_dict)

array([[9970, 9971, 9972, 9974, 9975, 9976, 9980, 9981, 9982, 9983, 9984,
        9986, 9987, 9988, 9989, 9991, 9992, 9993, 9994, 9995],
       [2654,    6,  334, 2886,    4,    1,  233,  711,  834,   11,  130,
         123,    7,  514,    2,   63,   10,  514,    8,  605],
       [   0, 1071,    4,    0,  185,   24,  368,   20,   31, 3109,  954,
          12,    3,   21,    2, 2915,    2,   12,    3,   21],
       [   3,   71,    4,   27,  246,   60,   11,  215,    4,    1, 1846,
           9,    3,   71,  546,    2, 6505,  162,    6,  104],
       [  93,   25,    6,  261,  681,  251,    0,  278, 3246,   13,  200,
           1,    8,  105, 3360,    1,    4,    0,  536,    4],
       [  20,    6,  954,   12,    3,   21,   78,   14,  977,  726,    0,
          37,   42,   34,    5,  437,  116,  206,  927,    2],
       [  18,  296,    7,  201,   76,    4,  182,  560, 3836,   17,  974,
         975,    6,  942,    4,  156,   77, 1570,  288,  644],
       [  23, 1238,  899,    5,   25,  20

In this step, we create the stacked LSTM, which is a 2 layer LSTM network:

In [12]:
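# Note: the same cell object is reused for every layer, so both layers share one set of weights
# (only `cell_0` appears among the trainable variables listed later in this notebook).
lstm_cell = tf.contrib.rnn.BasicLSTMCell(hidden_size, forget_bias=0.0)
stacked_lstm = tf.contrib.rnn.MultiRNNCell([lstm_cell] * num_layers)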

Also, we initialize the states of the network:

#### _initial_state

For each LSTM layer, there are 2 state matrices, `c` and `h` (named `c_state` and `m_state` in some TensorFlow code), which hold the cell state and the hidden (output) state respectively. Each state has one row per sequence in the mini-batch, so with a batch size of 30 and 200 hidden units in each LSTM, each state is a matrix of size [30x200]

In [13]:
_initial_state = stacked_lstm.zero_state(batch_size, tf.float32)
_initial_state

(LSTMStateTuple(c=<tf.Tensor 'MultiRNNCellZeroState/BasicLSTMCellZeroState/zeros:0' shape=(30, 200) dtype=float32>, h=<tf.Tensor 'MultiRNNCellZeroState/BasicLSTMCellZeroState/zeros_1:0' shape=(30, 200) dtype=float32>),
 LSTMStateTuple(c=<tf.Tensor 'MultiRNNCellZeroState/BasicLSTMCellZeroState_1/zeros:0' shape=(30, 200) dtype=float32>, h=<tf.Tensor 'MultiRNNCellZeroState/BasicLSTMCellZeroState_1/zeros_1:0' shape=(30, 200) dtype=float32>))
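
To show what the `c` and `h` states are used for, here is a minimal NumPy sketch of a single LSTM time step. It is an illustration under stated assumptions, not TensorFlow's implementation: it assumes the standard LSTM equations with one combined weight matrix split into input gate, candidate values, forget gate and output gate, and all names (`lstm_step`, `kernel`, `x_t`, ...) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, c, h, kernel, bias, forget_bias=0.0):
    """One LSTM time step for a whole mini-batch; returns the new (c, h) pair."""
    # Concatenate the input with the previous hidden state: [batch, input_size + hidden_size].
    gates = np.concatenate([x, h], axis=1) @ kernel + bias
    # Split into input gate, candidate values, forget gate and output gate.
    i, j, f, o = np.split(gates, 4, axis=1)
    new_c = c * sigmoid(f + forget_bias) + sigmoid(i) * np.tanh(j)
    new_h = np.tanh(new_c) * sigmoid(o)
    return new_c, new_h

batch, hidden = 30, 200
rng = np.random.default_rng(0)
kernel = rng.uniform(-0.1, 0.1, size=(2 * hidden, 4 * hidden))  # [400x800]
bias = np.zeros(4 * hidden)
c0 = np.zeros((batch, hidden))  # zero cell state, [30x200]
h0 = np.zeros((batch, hidden))  # zero hidden state, [30x200]
x_t = rng.uniform(-0.1, 0.1, size=(batch, hidden))  # stand-in for 30 embedded words
c1, h1 = lstm_step(x_t, c0, h0, kernel, bias)
```

The cell state `c` is the long-term memory that the gates add to or erase from, while `h` is the output passed both to the next time step and to the next layer.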

Let's look at the states, though they are all zero for now:

In [14]:
session.run(_initial_state, feed_dict)

(LSTMStateTuple(c=array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], dtype=float32), h=array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)),
 LSTMStateTuple(c=array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], dtype=float32), h=array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
  

### Embeddings
We create the embeddings for our input data. The embedding is a lookup table of shape [10000x200]: one 200-dimensional vector for each of the 10000 unique words.

In [15]:
embedding = tf.get_variable("embedding", [vocab_size, hidden_size])  # [10000x200]

In [16]:
session.run(tf.global_variables_initializer())
session.run(embedding, feed_dict)

array([[ 0.01063205,  0.00814733, -0.00844954, ...,  0.00378895,
        -0.02073141,  0.01221155],
       [ 0.01041644,  0.0137022 ,  0.00560501, ...,  0.01056347,
         0.00471439, -0.02278037],
       [-0.02003521,  0.00571358, -0.02156209, ..., -0.00082284,
         0.01038924,  0.02189862],
       ...,
       [-0.00145212,  0.02051926,  0.01961245, ...,  0.01894623,
         0.0170134 ,  0.01944283],
       [ 0.01519434,  0.02213032,  0.01244638, ..., -0.00379353,
        -0.00215455,  0.01070447],
       [ 0.01711179, -0.01450142, -0.01781657, ...,  0.00271924,
        -0.0067168 ,  0.01134633]], dtype=float32)

`embedding_lookup` goes through each row of `input_data` and, for each word in the row (sequence), finds the corresponding vector in `embedding`.
It creates a [30x20x200] tensor, so the first element of __inputs__ (the first sequence) is a matrix of size 20x200 in which each row is the vector representing one word of the sequence.

In [17]:
# Define where to get the data for our embeddings.
inputs = tf.nn.embedding_lookup(embedding, _input_data)

In [18]:
inputs

<tf.Tensor 'embedding_lookup:0' shape=(30, 20, 200) dtype=float32>

In [19]:
session.run(inputs[0], feed_dict)

array([[-0.0134263 ,  0.0212361 ,  0.02112414, ..., -0.00959834,
         0.01385364,  0.00844003],
       [-0.02212243, -0.01237901,  0.00419767, ..., -0.01935913,
         0.0139115 , -0.00408517],
       [ 0.01915921, -0.00510946, -0.00119747, ..., -0.01553603,
         0.00980788, -0.00401851],
       ...,
       [ 0.01955495, -0.01637472, -0.00495487, ...,  0.02245566,
         0.0200689 ,  0.01397086],
       [-0.023112  , -0.00590596,  0.00579223, ..., -0.01317282,
         0.00883655,  0.01173041],
       [ 0.00109089, -0.01057535,  0.01938266, ..., -0.00067376,
        -0.00868236,  0.02187211]], dtype=float32)
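
Conceptually, an embedding lookup is just row selection from a table. The following NumPy sketch uses toy sizes and hypothetical names (`embedding_table`, `word_ids`) to show how a batch of word ids turns into a batch of vectors.

```python
import numpy as np

vocab, dim = 10, 4  # toy sizes instead of 10000 and 200
embedding_table = np.arange(vocab * dim, dtype=np.float32).reshape(vocab, dim)
word_ids = np.array([[1, 3, 3],
                     [0, 9, 2]])  # [batch_size=2, num_steps=3]

# Integer-array indexing picks one table row per word id.
embedded = embedding_table[word_ids]  # shape [2, 3, 4]
```

The same word id always maps to the same row, so repeated words in a sequence get identical vectors, and training the table changes the representation of a word everywhere it appears.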

### Constructing Recurrent Neural Networks
`tf.nn.dynamic_rnn()` creates a recurrent neural network using `stacked_lstm`, which is an instance of RNNCell. 

The input should be a Tensor of shape: [batch_size, max_time, ...], in our case it would be (30, 20, 200)

This method returns a pair (outputs, new_state) where:
- outputs is a Tensor of shape [batch_size, max_time, output_size] holding the output at every time step, in our case (30, 20, 200)
- new_state is the final state


In [20]:
outputs, new_state =  tf.nn.dynamic_rnn(stacked_lstm, inputs, initial_state=_initial_state)

So, let's look at the outputs. At each of the 20 time steps, the stacked LSTM emits a vector of 200 values (one per hidden unit of the top layer) for each of the 30 sequences. Later, we use a linear layer to map these 200 values to one score per vocabulary word, giving a [?x10000] matrix.

In [21]:
outputs

<tf.Tensor 'rnn/transpose_1:0' shape=(30, 20, 200) dtype=float32>

In [22]:
session.run(tf.global_variables_initializer())
session.run(outputs[0], feed_dict)

array([[ 3.4254440e-04,  3.1593619e-05,  1.6462205e-04, ...,
        -2.7911790e-04,  3.4073930e-06, -1.8114239e-04],
       [ 2.8872013e-04,  2.4893074e-04,  1.4808765e-04, ...,
        -2.0268549e-04, -3.9944818e-04, -2.7534069e-04],
       [ 2.3651251e-04,  8.9221931e-04, -2.2316233e-05, ...,
         2.0748454e-04, -6.9108745e-04, -6.5613375e-04],
       ...,
       [-7.5111107e-05, -5.8198889e-04, -3.6036398e-04, ...,
        -6.7492179e-04, -3.3249514e-04, -5.5762048e-05],
       [-5.7119832e-05, -3.1587936e-04, -2.2311052e-04, ...,
        -5.4390036e-04, -4.4383504e-04, -3.3675591e-04],
       [-6.3285333e-05, -5.0682033e-04,  1.4555137e-04, ...,
        -3.8899272e-04, -5.3006998e-04, -2.6523112e-04]], dtype=float32)

Let's reshape the output tensor from [30 x 20 x 200] to [600 x 200]:

In [23]:
output = tf.reshape(outputs, [-1, size])
output

<tf.Tensor 'Reshape:0' shape=(600, 200) dtype=float32>

In [24]:
session.run(output[0], feed_dict)

array([ 3.42544401e-04,  3.15936195e-05,  1.64622048e-04,  4.51625470e-04,
        2.50978279e-04,  4.32280300e-04, -7.55842775e-05,  1.37098934e-04,
       -1.32023357e-04, -1.63846184e-04, -1.71776890e-04, -1.21824203e-04,
        2.14190208e-04, -3.66728491e-04,  3.40574450e-04, -3.29786068e-04,
       -3.72775539e-05, -3.53801115e-05, -2.95415757e-05,  2.87478993e-04,
       -1.29926004e-04, -1.48746592e-04,  3.20694089e-04, -1.97542555e-04,
        2.74078629e-04, -6.54401883e-05, -6.13756885e-04,  1.65539430e-04,
        7.61668853e-05,  1.37133844e-04,  5.19918918e-04, -1.45887507e-05,
       -1.81402691e-04, -3.14831459e-06,  7.46607257e-05, -2.23519935e-04,
        2.44077659e-04,  1.27249819e-04,  2.24282667e-05, -6.26288820e-05,
       -1.86371675e-04,  1.79491180e-04,  1.23375103e-05,  4.26117651e-04,
       -2.39729525e-05, -9.59906247e-05,  1.57951319e-04,  1.60361713e-04,
        3.05357418e-04,  1.13533315e-04,  3.21410160e-04,  8.96532787e-04,
       -2.55986350e-04, -

### Logistic unit
Now, we create a logistic unit to return the probability of the output word. That is, we map each of the 600 output vectors (30 sequences x 20 steps) to one score (logit) per word in the vocabulary.

Logits = [600 x 200] * [200 x 10000] + [1 x 10000] -> [600 x 10000]

In [25]:
softmax_w = tf.get_variable("softmax_w", [size, vocab_size])  # [200x10000]
softmax_b = tf.get_variable("softmax_b", [vocab_size])  # [1x10000]
logits = tf.matmul(output, softmax_w) + softmax_b

In [26]:
session.run(tf.global_variables_initializer())
logi = session.run(logits, feed_dict)
logi.shape

(600, 10000)

In [27]:
first_word_output_probablity = logi[0]
first_word_output_probablity.shape

(10000,)
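
Strictly speaking, the logits are unnormalized scores; a softmax turns each row into a probability distribution over the vocabulary. Here is a minimal NumPy sketch with toy sizes. All `toy_*` names are hypothetical stand-ins for the tensors above.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum keeps exp() from overflowing."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
toy_output = rng.normal(size=(6, 4))     # stand-in for the [600x200] output
toy_w = rng.normal(size=(4, 10))         # stand-in for softmax_w [200x10000]
toy_b = np.zeros(10)                     # stand-in for softmax_b
toy_logits = toy_output @ toy_w + toy_b  # [6x10]
probs = softmax(toy_logits)
```

Because softmax preserves the ordering of the scores within a row, taking the `argmax` of the logits picks the same word as taking the `argmax` of the probabilities.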

### Prediction
The predicted word is the one with the maximum score, which is also the word with the maximum probability.

In [28]:
embedding_array = session.run(embedding, feed_dict)
np.argmax(first_word_output_probablity)

4654

So, what is the ground truth for the first word of the first sentence? 

In [29]:
y[0][0]

9971

Also, you can get it from the target tensor, if you want to find the embedding vector:

In [30]:
_targets

<tf.Tensor 'Placeholder_1:0' shape=(30, 20) dtype=int32>

It is time to compare the logits with the targets:

In [31]:
targ = session.run(tf.reshape(_targets, [-1]), feed_dict) 

In [32]:
first_word_target_code = targ[0]
first_word_target_code

9971

In [33]:
first_word_target_vec = session.run(tf.nn.embedding_lookup(embedding, targ[0]))
first_word_target_vec

array([ 1.45170279e-02, -1.79491118e-02,  1.78318582e-02, -2.89530307e-03,
       -6.23417646e-03,  7.49334320e-03,  1.64727047e-02, -3.48045118e-03,
       -2.36215871e-02,  2.42397189e-04,  1.93719529e-02, -4.46389243e-03,
       -1.97104029e-02,  8.94280896e-03,  7.62885809e-03, -2.08076779e-02,
       -4.36184928e-04, -1.12350555e-02,  1.85768344e-02, -2.57953256e-03,
        6.81799464e-03,  1.92548819e-02,  7.84593821e-03,  7.38065690e-03,
       -1.41855124e-02, -1.25131011e-03, -1.14028119e-02,  2.93873623e-03,
        4.14362177e-04,  2.34870352e-02,  2.21861750e-02,  7.54712150e-03,
       -1.03686983e-02,  3.60180996e-03, -4.58486378e-03,  2.27417387e-02,
        1.08637847e-02,  1.07932389e-02, -8.71755835e-03,  1.91723183e-02,
       -2.61896849e-03,  1.02394409e-02, -1.27013242e-02,  1.84333883e-03,
       -8.36285390e-03, -1.93770174e-02,  2.02140585e-02,  2.42355429e-02,
       -1.84150636e-02,  1.14948452e-02,  1.55534223e-02,  8.85790586e-03,
       -1.31806647e-02, -

#### Objective function

Now we want to define our objective function. Our objective is to minimize the loss function, that is, to minimize the average negative log probability of the target words:

$$\text{loss} = -\frac{1}{N}\sum_{i=1}^{N} \ln p_{\text{target}_i}$$

This function is already implemented and available in TensorFlow through `sequence_loss_by_example`, so we can just use it here. `sequence_loss_by_example` is a weighted cross-entropy loss for a sequence of logits (per example).  

Its arguments:  

logits: List of 2D Tensors of shape [batch_size x num_decoder_symbols].  
targets: List of 1D batch-sized int32 Tensors of the same length as logits.  
weights: List of 1D batch-sized float Tensors of the same length as logits.  

In [34]:
loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example([logits], [tf.reshape(_targets, [-1])],[tf.ones([batch_size * num_steps])])

`loss` is a 1D float Tensor with 600 entries (batch_size x num_steps): the negative log-probability of the target word at each position.

In [35]:
session.run(loss, feed_dict)

array([9.199925 , 9.208916 , 9.204049 , 9.219676 , 9.212233 , 9.225138 ,
       9.220055 , 9.201657 , 9.196995 , 9.206326 , 9.217281 , 9.195156 ,
       9.19926  , 9.215224 , 9.194069 , 9.195595 , 9.202955 , 9.220181 ,
       9.204887 , 9.213215 , 9.221146 , 9.214842 , 9.214809 , 9.210042 ,
       9.204665 , 9.1955595, 9.216201 , 9.196068 , 9.201439 , 9.199709 ,
       9.21013  , 9.201667 , 9.206893 , 9.204727 , 9.218736 , 9.202456 ,
       9.206758 , 9.208283 , 9.214172 , 9.210835 , 9.223533 , 9.209924 ,
       9.195504 , 9.198727 , 9.2158985, 9.20256  , 9.193649 , 9.219003 ,
       9.208118 , 9.222098 , 9.210141 , 9.197668 , 9.208061 , 9.20483  ,
       9.207626 , 9.204768 , 9.210172 , 9.197681 , 9.207999 , 9.2099695,
       9.225169 , 9.210031 , 9.225235 , 9.197767 , 9.207822 , 9.201268 ,
       9.215479 , 9.20993  , 9.204683 , 9.222193 , 9.222103 , 9.19782  ,
       9.225204 , 9.221833 , 9.204607 , 9.206292 , 9.223461 , 9.221262 ,
       9.196561 , 9.212295 , 9.21061  , 9.221179 , 

In [36]:
cost = tf.reduce_sum(loss) / batch_size

session.run(tf.global_variables_initializer())
session.run(cost, feed_dict)

184.2402
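
The loss values near 9.2 and the cost near 184 can be sanity-checked by hand. The sketch below assumes an untrained model whose predictions are exactly uniform over the vocabulary (the real logits are only approximately uniform), and the function name `cross_entropy_per_example` is hypothetical.

```python
import math
import numpy as np

def cross_entropy_per_example(logits, targets):
    """Negative log-probability of each target id under softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets]

batch_size, num_steps, vocab_size = 30, 20, 10000
# Assumption: all logits are equal, i.e. the model predicts a uniform distribution.
uniform_logits = np.zeros((batch_size * num_steps, vocab_size), dtype=np.float32)
targets = np.zeros(batch_size * num_steps, dtype=np.int64)

loss = cross_entropy_per_example(uniform_logits, targets)  # 600 values
cost = loss.sum() / batch_size
expected_loss = math.log(vocab_size)
```

Under this assumption every entry of `loss` equals ln(10000), roughly 9.21, and `cost` equals 20 times that, roughly 184.2, which is in line with the values printed above for the randomly initialized network.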

Now, let's store the new state as the final state.

In [37]:
final_state = new_state

### Training

To do gradient clipping in TensorFlow we have to take the following steps:

1. Define the optimizer.
2. Extract variables that are trainable.
3. Calculate the gradients based on the loss function.
4. Apply the optimizer to the variables / gradients tuple.

#### 1. Define Optimizer

`GradientDescentOptimizer` constructs a new gradient descent optimizer. Later, we use the constructed `optimizer` to compute gradients for a loss and apply the gradients to variables.

In [38]:
# Create a variable for the learning rate.
lr = tf.Variable(0.0, trainable=False)
# Create the gradient descent optimizer with our learning rate.
optimizer = tf.train.GradientDescentOptimizer(lr)


#### 2. Trainable Variables

When defining a variable, if you pass `trainable=True` (the default), the `Variable()` constructor automatically adds the new variable to the graph collection `GraphKeys.TRAINABLE_VARIABLES`. Now, using `tf.trainable_variables()` you can get all variables created with `trainable=True`.

In [39]:
# Get all TensorFlow variables marked as "trainable" (i.e. all of them except lr, which we just created).
tvars = tf.trainable_variables()
tvars

[<tf.Variable 'embedding:0' shape=(10000, 200) dtype=float32_ref>,
 <tf.Variable 'rnn/multi_rnn_cell/cell_0/basic_lstm_cell/kernel:0' shape=(400, 800) dtype=float32_ref>,
 <tf.Variable 'rnn/multi_rnn_cell/cell_0/basic_lstm_cell/bias:0' shape=(800,) dtype=float32_ref>,
 <tf.Variable 'softmax_w:0' shape=(200, 10000) dtype=float32_ref>,
 <tf.Variable 'softmax_b:0' shape=(10000,) dtype=float32_ref>]

For the rest of this walkthrough, we keep only the softmax variables (from index 3 onwards) and list their names:

In [40]:
tvars = tvars[3:]

In [41]:
[v.name for v in tvars]

['softmax_w:0', 'softmax_b:0']

#### 3. Calculate the gradients based on the loss function

In [42]:
cost

<tf.Tensor 'truediv:0' shape=() dtype=float32>

In [43]:
tvars

[<tf.Variable 'softmax_w:0' shape=(200, 10000) dtype=float32_ref>,
 <tf.Variable 'softmax_b:0' shape=(10000,) dtype=float32_ref>]

#### Gradient:
The gradient of a function generalizes the slope of a line: it describes the rate of change of the function with respect to each of its inputs. It is a vector that points in the direction of greatest increase of the function, and its components are the function's __partial derivatives__.

First, let's recall how gradients work using a toy example:
$$ z=\left(2x^2+3xy\right)$$

In [44]:
var_x = tf.placeholder(tf.float32)
var_y = tf.placeholder(tf.float32) 
func_test = 2.0 * var_x * var_x + 3.0 * var_x * var_y
session.run(tf.global_variables_initializer())
feed = {var_x: 1.0, var_y: 2.0}
session.run(func_test, feed)

8.0

The `tf.gradients()` function allows you to compute the symbolic gradient of one tensor with respect to one or more other tensors - including variables. `tf.gradients(func, xs)` constructs symbolic partial derivatives of sum of `func` w.r.t. x in `xs`. 

Now, let's look at the derivative w.r.t. `var_x`:
$$ \frac{\partial \:}{\partial \:x}\left(2x^2+3xy\right)=4x+3y $$


In [45]:
var_grad = tf.gradients(func_test, [var_x])
session.run(var_grad, feed)

[10.0]

The derivative w.r.t. `var_y`:
$$ \frac{\partial \:}{\partial \:y}\left(2x^2+3xy\right)=3x $$

In [46]:
var_grad = tf.gradients(func_test, [var_y])
session.run(var_grad, feed)

[3.0]
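
The two symbolic results above can be cross-checked numerically without TensorFlow. This is a small sketch using central finite differences; the helper `central_difference` is hypothetical and only approximates the derivative.

```python
def z(x, y):
    return 2.0 * x * x + 3.0 * x * y

def central_difference(f, x, y, wrt, eps=1e-6):
    """Approximate a partial derivative of f at (x, y) with a symmetric difference."""
    if wrt == "x":
        return (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    return (f(x, y + eps) - f(x, y - eps)) / (2 * eps)

dz_dx = central_difference(z, 1.0, 2.0, wrt="x")  # analytic value: 4x + 3y
dz_dy = central_difference(z, 1.0, 2.0, wrt="y")  # analytic value: 3x
```

At x = 1 and y = 2 the approximations should land very close to the analytic values 10 and 3 returned by `tf.gradients`.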

Now, we can look at the gradients w.r.t. all variables:

In [47]:
tf.gradients(cost, tvars)

[<tf.Tensor 'gradients_2/MatMul_grad/MatMul_1:0' shape=(200, 10000) dtype=float32>,
 <tf.Tensor 'gradients_2/add_grad/Reshape_1:0' shape=(10000,) dtype=float32>]

In [48]:
grad_t_list = tf.gradients(cost, tvars)
session.run(grad_t_list, feed_dict)

[array([[-8.33650629e-05, -3.26001864e-05, -1.04524006e-04, ...,
          2.46637796e-07,  2.41907458e-07,  2.43596673e-07],
        [-5.81361091e-05, -9.65743893e-05, -7.65056029e-05, ...,
          1.70635587e-07,  1.67399520e-07,  1.68530349e-07],
        [-6.67524073e-05, -1.40866468e-04, -2.93421210e-04, ...,
          2.45405346e-07,  2.40726251e-07,  2.42392417e-07],
        ...,
        [-1.30252956e-04, -2.94189464e-04, -1.86949241e-04, ...,
          3.69254451e-07,  3.62220675e-07,  3.64666676e-07],
        [ 2.63405673e-05, -1.14045048e-04, -1.10268797e-04, ...,
          2.12287162e-07,  2.08215198e-07,  2.09683805e-07],
        [ 1.53462388e-05, -1.35425507e-05, -2.78567495e-06, ...,
          3.43967343e-09,  3.38012640e-09,  3.40703110e-09]], dtype=float32),
 array([-0.79802614, -1.0313637 , -1.031363  , ...,  0.00202788,
         0.00198919,  0.00200279], dtype=float32)]

Now, we have a list of tensors, t-list. We can use it to find clipped tensors. `clip_by_global_norm` clips the values of multiple tensors by the ratio of the sum of their norms: if the combined (global) norm of all tensors exceeds the threshold, every tensor is scaled down by the same factor.

`clip_by_global_norm` gets _t-list_ as input and returns 2 things:
 - a list of clipped tensors, so called __list_clipped__ 
 - the global norm (global_norm) of all tensors in t_list
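
The clipping rule can be sketched in a few lines of NumPy. This is an illustration of the idea rather than TensorFlow's code; the function name `clip_by_global_norm_np` and the toy gradients are hypothetical.

```python
import numpy as np

def clip_by_global_norm_np(tensors, clip_norm):
    """Scale all tensors by one shared factor so their combined L2 norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(np.square(t)) for t in tensors))
    # The factor is 1 when the global norm is already within the threshold.
    scale = clip_norm / max(global_norm, clip_norm)
    return [t * scale for t in tensors], global_norm

toy_grads = [np.array([3.0, 4.0]), np.array([12.0])]  # global norm = 13
clipped, norm = clip_by_global_norm_np(toy_grads, clip_norm=5.0)
unchanged, _ = clip_by_global_norm_np(toy_grads, clip_norm=20.0)
```

Because every tensor is multiplied by the same factor, the direction of the overall gradient is preserved and only its length is limited, which is what makes this a safeguard against exploding gradients.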

In [49]:
max_grad_norm

5

In [50]:
# Define the gradient clipping threshold.
grads, _ = tf.clip_by_global_norm(grad_t_list, max_grad_norm)
grads

[<tf.Tensor 'clip_by_global_norm/clip_by_global_norm/_0:0' shape=(200, 10000) dtype=float32>,
 <tf.Tensor 'clip_by_global_norm/clip_by_global_norm/_1:0' shape=(10000,) dtype=float32>]

In [51]:
session.run(grads, feed_dict)

[array([[-8.33650629e-05, -3.26001864e-05, -1.04524006e-04, ...,
          2.46637796e-07,  2.41907458e-07,  2.43596673e-07],
        [-5.81361091e-05, -9.65743893e-05, -7.65056029e-05, ...,
          1.70635587e-07,  1.67399520e-07,  1.68530349e-07],
        [-6.67524073e-05, -1.40866468e-04, -2.93421210e-04, ...,
          2.45405346e-07,  2.40726251e-07,  2.42392417e-07],
        ...,
        [-1.30252956e-04, -2.94189464e-04, -1.86949241e-04, ...,
          3.69254451e-07,  3.62220675e-07,  3.64666676e-07],
        [ 2.63405673e-05, -1.14045048e-04, -1.10268797e-04, ...,
          2.12287162e-07,  2.08215198e-07,  2.09683805e-07],
        [ 1.53462388e-05, -1.35425507e-05, -2.78567495e-06, ...,
          3.43967343e-09,  3.38012640e-09,  3.40703110e-09]], dtype=float32),
 array([-0.79802614, -1.0313637 , -1.031363  , ...,  0.00202788,
         0.00198919,  0.00200279], dtype=float32)]

#### 4. Apply the optimizer to the variables / gradients tuple.

In [52]:
# Create the training TensorFlow Operation through the optimizer.
train_op = optimizer.apply_gradients(zip(grads, tvars))

In [53]:
session.run(tf.global_variables_initializer())
session.run(train_op, feed_dict)

We learned how the model is built step by step. Now, let's create a class that represents our model. This class needs a few things:
- We have to create the model in accordance with our defined hyperparameters
- We have to create the placeholders for our input data and expected outputs (the real data)
- We have to create the LSTM cell structure and connect them with our RNN structure
- We have to create the word embeddings and point them to the input data
- We have to create the input structure for our RNN
- We have to instantiate our RNN model and retrieve the variable in which we should expect our outputs to appear
- We need to create a logistic structure to return the probability of our words
- We need to create the loss and cost functions for our optimizer to work and then create the optimizer
- And finally, we need to create a training operation that can be run to actually train our model


In [54]:
class PTBModel(object):
    def __init__(self, is_training):
        # Setting parameters for ease of use.
        self.batch_size = batch_size
        self.num_steps = num_steps
        size = hidden_size
        self.vocab_size = vocab_size
        
        # Creating placeholders for our input data and expected outputs (target data).
        self._input_data = tf.placeholder(tf.int32, [batch_size, num_steps])  # [30x20]
        self._targets = tf.placeholder(tf.int32, [batch_size, num_steps])  # [30x20]

        # Creating the LSTM cell structure and connect it with the RNN structure.
        # Create the LSTM unit.
        # This creates only the structure for the LSTM; it still has to be wrapped in an RNN.
        # The argument num_units (size=200) of BasicLSTMCell is the size of the hidden layer, that is,
        # the number of hidden units of the LSTM. forget_bias=0.0 means no bias is added to the forget gate.
        # The LSTM cell processes one word at a time; its output is later used to compute the
        # probabilities of the possible continuations of the sentence.
        lstm_cell = tf.contrib.rnn.BasicLSTMCell(size, forget_bias=0.0)
        
        # Unless you changed keep_prob, this won't actually execute - this is a dropout wrapper for our LSTM unit.
        # Dropout on the LSTM output is a regularization technique and is optional.
        if is_training and keep_prob < 1:
            lstm_cell = tf.contrib.rnn.DropoutWrapper(lstm_cell, output_keep_prob=keep_prob)
        
        # MultiRNNCell takes a list of LSTM cells and stacks them into a single multi-layer RNN cell,
        # i.e. an RNN cell composed sequentially of multiple simple cells.
        stacked_lstm = tf.contrib.rnn.MultiRNNCell([lstm_cell] * num_layers)

        # Define the initial state, i.e. the model state for the very first data point.
        # This initializes the state of the LSTM memory. The memory state of the network is initialized
        # with a vector of zeros and gets updated after reading each word.
        self._initial_state = stacked_lstm.zero_state(batch_size, tf.float32)

        # Creating the word embeddings and pointing them to the input data.
        with tf.device("/cpu:0"):
            # Create the embeddings for our input data. Size is hidden size.
            embedding = tf.get_variable("embedding", [vocab_size, size])  # [10000x200]
            # Define where to get the data for our embeddings from.
            inputs = tf.nn.embedding_lookup(embedding, self._input_data)

        # Unless you changed keep_prob, this won't actually execute - this is a dropout addition for our inputs.
        # Dropout on the inputs is a regularization technique and is optional.
        if is_training and keep_prob < 1:
            inputs = tf.nn.dropout(inputs, keep_prob)

        # Creating the input structure for our RNN.
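        # Input structure is 20x[30x200].
        # Each word is represented by a 200-dimensional vector and each batch holds 30 sequences
        # of 20 words, so the input is a [30x20x200] tensor, i.e. 20 time steps of shape [30x200].
        # tf.nn.dynamic_rnn consumes this tensor directly, so the explicit split below is not needed.
        #inputs = [tf.squeeze(input_, [1]) for input_ in tf.split(1, num_steps, inputs)]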
        # The input structure is fed from the embeddings, which are filled in by the input data.
        # Feeding a batch of b sentences to a RNN:
        # In step 1, first word of each of the b sentences (in a batch) is input in parallel.  
        # In step 2, second word of each of the b sentences is input in parallel. 
        # The parallelism is only for efficiency.  
        # Each sentence in a batch is handled in parallel, but the network sees one word of a sentence
        # at a time and does the computations accordingly. 
        # All the computations involving the words of all sentences in a batch at a given time step
        # are done in parallel. 

        # Instantiating our RNN model and retrieving the structure for returning the outputs and the state.
        outputs, state = tf.nn.dynamic_rnn(stacked_lstm, inputs, initial_state=self._initial_state)

        # Creating a softmax layer to return the probability of the output word.
        output = tf.reshape(outputs, [-1, size])
        softmax_w = tf.get_variable("softmax_w", [size, vocab_size])  # [200x10000]
        softmax_b = tf.get_variable("softmax_b", [vocab_size])  # [1x10000]
        logits = tf.matmul(output, softmax_w) + softmax_b

        # Defining the loss and cost functions for the model's learning to work.
        loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example([logits], [tf.reshape(self._targets, [-1])],
                                                      [tf.ones([batch_size * num_steps])])
        self._cost = cost = tf.reduce_sum(loss) / batch_size

        # Store the final state.
        self._final_state = state

        # Everything after this point is relevant only for training.
        if not is_training:
            return

        # Creating the Training Operation for the Model.
        # Create a variable for the learning rate.
        self._lr = tf.Variable(0.0, trainable=False)
        # Get all TensorFlow variables marked as "trainable" (i.e. all of them except _lr).
        tvars = tf.trainable_variables()
        # Compute the gradients of the cost and clip them by their global norm (threshold: max_grad_norm).
        grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars), max_grad_norm)
        # Create the gradient descent optimizer with the learning rate.
        optimizer = tf.train.GradientDescentOptimizer(self._lr)
        # Create the training TensorFlow Operation through the optimizer.
        self._train_op = optimizer.apply_gradients(zip(grads, tvars))

    # Helper functions for the LSTM RNN class

    def assign_lr(self, session, lr_value):
        """Assign the learning rate for the model."""
        session.run(tf.assign(self._lr, lr_value))

    @property
    def input_data(self):
        """Returns the input data for the model."""
        return self._input_data

    @property
    def targets(self):
        """Returns the targets for the model."""
        return self._targets

    @property
    def initial_state(self):
        """Returns the initial state for the model."""
        return self._initial_state

    @property
    def cost(self):
        """Returns the defined cost."""
        return self._cost

    @property
    def final_state(self):
        """Returns the final state for the model."""
        return self._final_state

    @property
    def lr(self):
        """Returns the current learning rate for the model."""
        return self._lr

    @property
    def train_op(self):
        """Returns the training operation defined for the model."""
        return self._train_op
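
To follow the tensor shapes through the class, the sketch below reproduces the embedding lookup, the reshape, the softmax projection and the cost in plain NumPy. It is only an illustration under stated assumptions: the sizes are toy values (not the notebook's 30/20/200/10000), the weights are random, and the LSTM is replaced by a simple `tanh` stand-in because only the shapes matter here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sizes chosen for readability; the notebook uses 30, 20, 200 and 10000.
batch_size, num_steps, hidden_size, vocab_size = 3, 4, 5, 7

input_ids = rng.integers(0, vocab_size, size=(batch_size, num_steps))
targets = rng.integers(0, vocab_size, size=(batch_size, num_steps))

# Embedding lookup: one hidden_size vector per word id.
embedding = rng.normal(size=(vocab_size, hidden_size))
inputs = embedding[input_ids]                 # [batch, steps, hidden]

# Stand-in for the stacked LSTM (assumption): keeps the shape, nothing more.
outputs = np.tanh(inputs)                     # [batch, steps, hidden]

# Flatten time and batch, then project onto the vocabulary.
output = outputs.reshape(-1, hidden_size)     # [batch * steps, hidden]
softmax_w = rng.normal(size=(hidden_size, vocab_size))
softmax_b = np.zeros(vocab_size)
logits = output @ softmax_w + softmax_b       # [batch * steps, vocab]

# Numerically stable log-softmax, then the cross-entropy of each target word.
shifted = logits - logits.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(batch_size * num_steps), targets.reshape(-1)]

# Same normalization as the class: sum over all words, divided by the batch size.
cost = loss.sum() / batch_size
```

Dividing the summed loss by `batch_size` only means that `cost` is the per-sequence loss summed over `num_steps` words, which is why the epoch loop later divides by the number of steps to get a per-word average.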

With that, the actual structure of the Recurrent Neural Network with Long Short-Term Memory is finished. What remains is to actually create the methods to run through time - that is, the `run_epoch` method to be run at each epoch and a `main` script which ties all of this together.

What the `run_epoch` method should do is take the input data and feed it to the relevant operations. This will return at the very least the current result for the cost function.
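
The data fed at each step is a pair of word-id arrays of shape `[batch_size, num_steps]`, where the targets are the inputs shifted by one word. The sketch below shows one way such batches can be produced. It assumes that `reader.ptb_iterator` follows the same scheme as the reader from the original TensorFlow PTB tutorial; the function name `batch_iterator` is hypothetical and the notebook's own `reader` module was not checked.

```python
import numpy as np

def batch_iterator(raw_data, batch_size, num_steps):
    """Yield (x, y) pairs of shape [batch_size, num_steps]; y is x shifted by one word."""
    raw_data = np.asarray(raw_data, dtype=np.int32)
    # Cut the corpus into batch_size parallel streams of equal length.
    batch_len = len(raw_data) // batch_size
    data = raw_data[:batch_size * batch_len].reshape(batch_size, batch_len)
    # One word is reserved at the end of each stream for the last target.
    epoch_size = (batch_len - 1) // num_steps
    for i in range(epoch_size):
        x = data[:, i * num_steps:(i + 1) * num_steps]
        y = data[:, i * num_steps + 1:(i + 1) * num_steps + 1]
        yield x, y

# Toy corpus of 20 "word ids" (0..19), split into 2 streams read 3 words at a time.
batches = list(batch_iterator(range(20), batch_size=2, num_steps=3))
```

Under this scheme each row of a batch continues exactly where the same row of the previous batch stopped, which is why it makes sense to feed the final state of one step back in as the initial state of the next.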

In [55]:
"""run_epoch takes as parameters the current session, the model instance, the data to be
fed, and the operation to be run."""
def run_epoch(session, m, data, eval_op, verbose=False):
    # Define the epoch size based on the length of the data, batch size and the number of steps.
    epoch_size = ((len(data) // m.batch_size) - 1) // m.num_steps
    start_time = time.time()
    costs = 0.0
    iters = 0
    # Fetch the zero initial state as concrete values so it can be fed back in at each step.
    state = session.run(m.initial_state)
    
    # For each step and data point.
    for step, (x, y) in enumerate(reader.ptb_iterator(data, m.batch_size, m.num_steps)):
        # Evaluate and return cost and state by running cost, final_state and the function passed as parameter.
        cost, state, _ = session.run([m.cost, m.final_state, eval_op],
                                     {m.input_data: x,
                                      m.targets: y,
                                      m.initial_state: state})
        
        # Add returned cost to costs (which keeps track of the total costs for this epoch).
        costs += cost
        
        # Add number of steps to iteration counter.
        iters += m.num_steps

        if verbose and step % (epoch_size // 10) == 10:
            print("%.3f perplexity: %.3f speed: %.0f wps" % (step * 1.0 / epoch_size, np.exp(costs / iters),
              iters * m.batch_size / (time.time() - start_time)))

    # Returns the Perplexity rating to keep track of how the model is evolving.
    return np.exp(costs / iters)
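
The returned value is the exponential of the average per-word loss: each batch cost is the loss summed over `num_steps` words (averaged over the batch), and `iters` counts those steps. A small worked example, using a hypothetical `perplexity` helper and a made-up model that assigns equal probability to every word, shows why a perplexity close to the vocabulary size means the model knows nothing yet:

```python
import math

def perplexity(total_cost, total_steps):
    """Perplexity is the exponential of the average per-word negative log-likelihood."""
    return math.exp(total_cost / total_steps)

vocab_size = 10000
num_steps = 20

# A uniform model gives every word probability 1 / vocab_size, so each word
# costs log(vocab_size) and each batch costs num_steps times that.
per_batch_cost = num_steps * math.log(vocab_size)

costs, iters = 0.0, 0
for _ in range(5):          # accumulate exactly as the epoch loop does
    costs += per_batch_cost
    iters += num_steps

uniform_ppl = perplexity(costs, iters)
```

Here `uniform_ppl` comes out at approximately the vocabulary size, so a perplexity of a few hundred corresponds to the model being, on average, as uncertain as if it were choosing among a few hundred equally likely words.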

Now, we create the `main` method to tie everything together. The code here reads the data from the directory, using the `reader` helper module, and then trains the model, evaluating it on the validation subset after every epoch and on the test subset once training is finished.

In [56]:
# Reads the data and separates it into training data, validation data and testing data.
raw_data = reader.ptb_raw_data(data_dir)
train_data, valid_data, test_data, _, _ = raw_data

In [None]:
# Initializes the Execution Graph and the Session.
with tf.Graph().as_default(), tf.Session() as session:
    initializer = tf.random_uniform_initializer(-init_scale, init_scale)
    
    # Instantiates the model for training.
    # tf.variable_scope adds a prefix to the variables created with tf.get_variable.
    with tf.variable_scope("model", reuse=None, initializer=initializer):
        m = PTBModel(is_training=True)
        
    # Reuses the trained parameters for the validation and testing models.
    # They are different instances but share the same variables for weights and biases;
    # these models only read the variables and never update them.
    with tf.variable_scope("model", reuse=True, initializer=initializer):
        mvalid = PTBModel(is_training=False)
        mtest = PTBModel(is_training=False)

    # Initialize all variables.
    tf.global_variables_initializer().run()

    for i in range(max_max_epoch):
        # Define the decay for this epoch.
        lr_decay = decay ** max(i - max_epoch, 0.0)
        
        # Set the decayed learning rate as the learning rate for this epoch.
        m.assign_lr(session, learning_rate * lr_decay)

        print("Epoch %d : Learning rate: %.3f" % (i + 1, session.run(m.lr)))
        
        # Run the loop for this epoch in the training model.
        train_perplexity = run_epoch(session, m, train_data, m.train_op,
                                   verbose=True)
        print("Epoch %d : Train Perplexity: %.3f" % (i + 1, train_perplexity))
        
        # Run the loop for this epoch in the validation model.
        valid_perplexity = run_epoch(session, mvalid, valid_data, tf.no_op())
        print("Epoch %d : Valid Perplexity: %.3f" % (i + 1, valid_perplexity))
    
    # Run the loop in the testing model to see how effective the training was.
    test_perplexity = run_epoch(session, mtest, test_data, tf.no_op())
    
    print("Test Perplexity: %.3f" % test_perplexity)

Epoch 1 : Learning rate: 1.000
0.006 perplexity: 5891.393 speed: 1667 wps
0.106 perplexity: 1084.938 speed: 1933 wps
0.205 perplexity: 816.106 speed: 1924 wps
0.305 perplexity: 661.687 speed: 1914 wps
0.404 perplexity: 555.329 speed: 1920 wps
0.504 perplexity: 490.278 speed: 1926 wps
0.603 perplexity: 441.656 speed: 1929 wps
0.702 perplexity: 405.473 speed: 1930 wps
0.802 perplexity: 377.890 speed: 1932 wps
0.901 perplexity: 352.301 speed: 1933 wps
Epoch 1 : Train Perplexity: 332.073
Epoch 1 : Valid Perplexity: 201.592
Epoch 2 : Learning rate: 1.000
0.006 perplexity: 222.417 speed: 1904 wps
0.106 perplexity: 197.852 speed: 1933 wps
0.205 perplexity: 189.711 speed: 1938 wps
0.305 perplexity: 181.570 speed: 1937 wps
0.404 perplexity: 173.281 speed: 1911 wps
0.504 perplexity: 169.721 speed: 1890 wps
0.603 perplexity: 166.098 speed: 1889 wps
0.702 perplexity: 162.890 speed: 1892 wps
0.802 perplexity: 160.616 speed: 1868 wps
0.901 perplexity: 156.890 speed: 1846 wps
Epoch 2 : Train Perplexi

As you can see, the model's perplexity rating drops very quickly after a few iterations. As was elaborated before, **lower perplexity means that the model is more certain about its prediction**.
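
The training loop above also lowers the learning rate over time through `decay ** max(i - max_epoch, 0.0)`. The sketch below isolates that schedule in a hypothetical helper, `learning_rate_for_epoch`. The values `learning_rate=1.0`, `decay=0.5` and `max_epoch=4` are assumptions chosen for illustration; the notebook's actual hyperparameters are defined earlier and may differ.

```python
def learning_rate_for_epoch(i, learning_rate, decay, max_epoch):
    """Keep the base rate while i <= max_epoch, then multiply by decay once per epoch."""
    return learning_rate * decay ** max(i - max_epoch, 0.0)

# Illustrative values only (assumed, not read from the notebook).
schedule = [
    learning_rate_for_epoch(i, learning_rate=1.0, decay=0.5, max_epoch=4)
    for i in range(7)
]
```

With these values the rate stays at 1.0 for epoch indices 0 to 4 and is then halved at every following epoch, which lets the model take large steps early and smaller, more careful steps later.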