Here, we'll make use of word embeddings -- a way of representing sentence structures or words as n-dimensional vectors of real numbers
- So we pretty much assign each word a randomly-initialized vector, and input those into the network to be processed

And after iterating through our model, the vectors assume values that help the network correctly predict what it needs to (the probable next word in the sentence)
- It will group words similar to this picture below: 
<img src="https://ibm.box.com/shared/static/bqhc5dg879gcoabzhxra1w8rkg3od1cu.png" width="500">
<i>Source: IBM </i>
<br>
So words that are frequently used in similar contexts end up grouped close together
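
To make the idea of randomly initialized word vectors concrete, here is a minimal NumPy sketch. The toy vocabulary, the 8-dimensional size, and the helper names (`cosine_similarity`, `most_similar`) are assumptions for illustration only and are not part of the notebook. Before any training the vectors are random, so the "nearest" word is arbitrary; training is what moves related words close together.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; the real PTB vocabulary has 10,000 words.
toy_vocab = ["the", "board", "director", "chairman", "cigarette"]
word_to_row = {word: i for i, word in enumerate(toy_vocab)}

# One randomly initialized n-dimensional vector per word (n = 8 here).
embedding_dim = 8
embeddings = rng.uniform(-0.1, 0.1, size=(len(toy_vocab), embedding_dim))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1 mean similar direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(word):
    """Return the other vocabulary word whose vector is closest by cosine similarity."""
    query = embeddings[word_to_row[word]]
    scores = {
        other: cosine_similarity(query, embeddings[row])
        for other, row in word_to_row.items()
        if other != word
    }
    return max(scores, key=scores.get)

# Arbitrary result before training, because the vectors are still random.
print(most_similar("director"))
```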

In [12]:
import time
import numpy as np
import tensorflow as tf

<h3> Get Data: </h3>

Download the Penn Treebank dataset from IBM:

In [7]:
!mkdir data
!wget -q -O data/ptb.zip https://ibm.box.com/shared/static/z2yvmhbskc45xd2a9a4kkn6hg4g4kj5r.zip
!unzip -o data/ptb.zip -d data
!cp data/ptb/reader.py .

import reader

mkdir: cannot create directory ‘data’: File exists
Archive:  data/ptb.zip
  inflating: data/ptb/reader.py      
  inflating: data/__MACOSX/ptb/._reader.py  
  inflating: data/__MACOSX/._ptb     


Download simple examples dataset:

In [6]:
!wget http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz 
!tar xzf simple-examples.tgz -C data/

--2019-03-06 20:42:22--  http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
Resolving www.fit.vutbr.cz (www.fit.vutbr.cz)... 147.229.9.23, 2001:67c:1220:809::93e5:917
Connecting to www.fit.vutbr.cz (www.fit.vutbr.cz)|147.229.9.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 34869662 (33M) [application/x-gtar]
Saving to: ‘simple-examples.tgz’


2019-03-06 20:42:33 (3.16 MB/s) - ‘simple-examples.tgz’ saved [34869662/34869662]



<h3> Building the LSTM Model: </h3>

Here, we define the model's hyperparameters so that we can practice playing around with them:

In [9]:
init_scale = 0.1                  # initial weight scale
learning_rate = 1.0               # initial learning rate
max_grad_norm = 5                 # max permissible norm for the gradient -- for Gradient Clipping
num_layers = 2                    # number of layers in our model
num_steps = 20                    # total number of recurrence steps 

hidden_size_l1 = 256              # number of neurons (processing units) in the hidden layers
hidden_size_l2 = 128

max_epoch_decay_lr = 4            # max number of epochs trained with the initial learning rate
max_epoch = 15                    # total epochs in training

keep_prob = 1                     # probability of keeping data in the Dropout layer
decay = 0.5                       # the decay for the learning rate
batch_size = 60                   # size for each batch of data
vocab_size = 10000                # vocab size
embedding_vector_size = 200       # size of each word embedding vector

is_training = 1                   # training flag to separate training from testing
data_dir = "data/simple-examples/data/" # data directory
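
As a rough illustration of how `learning_rate`, `decay`, and `max_epoch_decay_lr` could interact, the sketch below computes a per-epoch learning rate. The exact schedule is an assumption (the training loop is not shown in this section): the rate stays at its initial value for the first `max_epoch_decay_lr` epochs and is then multiplied by `decay` once per epoch. The function name `learning_rate_for_epoch` is hypothetical, and the real loop may differ by one epoch.

```python
# Values copied from the hyperparameter cell.
learning_rate = 1.0
decay = 0.5
max_epoch_decay_lr = 4
max_epoch = 15

def learning_rate_for_epoch(epoch):
    """Assumed schedule: constant for the first max_epoch_decay_lr epochs (0-indexed),
    then multiplied by `decay` once for every further epoch."""
    exponent = max(epoch + 1 - max_epoch_decay_lr, 0)
    return learning_rate * decay ** exponent

schedule = [learning_rate_for_epoch(epoch) for epoch in range(max_epoch)]
print(schedule[:7])  # [1.0, 1.0, 1.0, 1.0, 0.5, 0.25, 0.125]
```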

- The structure looks like this:
    - 200 input units -> [200x256] weight matrix -> 256 hidden units (first layer) -> [256x128] weight matrix -> 128 hidden units (second layer) -> [128x200] weight matrix -> 200 unit output
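
The arrows above are a simplified view. As far as I understand the usual LSTM layout (which `BasicLSTMCell` is assumed to follow here), each layer multiplies the concatenation of its input and its previous hidden state by a single kernel that produces all four gates at once. The sketch below only does the shape arithmetic under that assumption; the helper names are hypothetical.

```python
embedding_vector_size = 200
hidden_size_l1 = 256
hidden_size_l2 = 128

def lstm_layer_shapes(input_size, hidden_size):
    """Kernel and bias shapes for one LSTM layer whose four gates share one matrix."""
    kernel = (input_size + hidden_size, 4 * hidden_size)
    bias = (4 * hidden_size,)
    return kernel, bias

def parameter_count(input_size, hidden_size):
    """Total number of trainable values in the layer's kernel and bias."""
    kernel, bias = lstm_layer_shapes(input_size, hidden_size)
    return kernel[0] * kernel[1] + bias[0]

layer1 = parameter_count(embedding_vector_size, hidden_size_l1)
layer2 = parameter_count(hidden_size_l1, hidden_size_l2)

print(lstm_layer_shapes(embedding_vector_size, hidden_size_l1))  # ((456, 1024), (1024,))
print(lstm_layer_shapes(hidden_size_l1, hidden_size_l2))         # ((384, 512), (512,))
print(layer1, layer2)                                            # 467968 197120
```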

<h4> Train Data: </h4>
The training data is a list of 929589 words, each represented by an integer ID, e.g. [9970, 9971, 9972, 9974, ...]

Start an interactive session:

In [13]:
session = tf.InteractiveSession()

In [15]:
# reads the data and separates it into train, validation, and test datasets
raw_data = reader.ptb_raw_data(data_dir)
train_data, valid_data, test_data, vocab, word_to_id = raw_data

In [32]:
print("Length of training data: ", len(train_data))
print("Length of validation data: ", len(valid_data))
print("Length of test data: ", len(test_data))
print("Length of vocab: ", vocab)

Length of training data:  929589
Length of validation data:  73760
Length of test data:  82430
Length of vocab:  10000


<h5>Define a function to translate IDs back to their respective words:</h5>

In [29]:
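def id_to_word(id_list):
    # Invert word_to_id once so each lookup is O(1) instead of a scan over the vocabulary
    id_to_word_map = {wid: word for word, wid in word_to_id.items()}
    # IDs that are not in the vocabulary are skipped
    return [id_to_word_map[w] for w in id_list if w in id_to_word_map]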

In [34]:
print(train_data[0:100])

[9970, 9971, 9972, 9974, 9975, 9976, 9980, 9981, 9982, 9983, 9984, 9986, 9987, 9988, 9989, 9991, 9992, 9993, 9994, 9995, 9996, 9997, 9998, 9999, 2, 9256, 1, 3, 72, 393, 33, 2133, 0, 146, 19, 6, 9207, 276, 407, 3, 2, 23, 1, 13, 141, 4, 1, 5465, 0, 3081, 1596, 96, 2, 7682, 1, 3, 72, 393, 8, 337, 141, 4, 2477, 657, 2170, 955, 24, 521, 6, 9207, 276, 4, 39, 303, 438, 3684, 2, 6, 942, 4, 3150, 496, 263, 5, 138, 6092, 4241, 6036, 30, 988, 6, 241, 760, 4, 1015, 2786, 211, 6, 96, 4]


In [35]:
print(id_to_word(train_data[0:100]))

['aer', 'banknote', 'berlitz', 'calloway', 'centrust', 'cluett', 'fromstein', 'gitano', 'guterman', 'hydro-quebec', 'ipo', 'kia', 'memotec', 'mlx', 'nahb', 'punts', 'rake', 'regatta', 'rubens', 'sim', 'snack-food', 'ssangyong', 'swapo', 'wachter', '<eos>', 'pierre', '<unk>', 'N', 'years', 'old', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'nov.', 'N', '<eos>', 'mr.', '<unk>', 'is', 'chairman', 'of', '<unk>', 'n.v.', 'the', 'dutch', 'publishing', 'group', '<eos>', 'rudolph', '<unk>', 'N', 'years', 'old', 'and', 'former', 'chairman', 'of', 'consolidated', 'gold', 'fields', 'plc', 'was', 'named', 'a', 'nonexecutive', 'director', 'of', 'this', 'british', 'industrial', 'conglomerate', '<eos>', 'a', 'form', 'of', 'asbestos', 'once', 'used', 'to', 'make', 'kent', 'cigarette', 'filters', 'has', 'caused', 'a', 'high', 'percentage', 'of', 'cancer', 'deaths', 'among', 'a', 'group', 'of']


<h5> Reading one mini-batch and feeding our network: </h5>

In [37]:
itera = reader.ptb_iterator(train_data, batch_size, num_steps)
first_tuple = next(itera)
x = first_tuple[0]
y = first_tuple[1]

In [46]:
x.shape

(60, 20)

Looking at the first 3 rows of our input x (each row is a 20-word sequence, not necessarily a whole sentence):

In [49]:
x[0:3]

array([[9970, 9971, 9972, 9974, 9975, 9976, 9980, 9981, 9982, 9983, 9984,
        9986, 9987, 9988, 9989, 9991, 9992, 9993, 9994, 9995],
       [ 901,   33, 3361,    8, 1279,  437,  597,    6,  261, 4276, 1089,
           8, 2836,    2,  269,    4, 5526,  241,   13, 2420],
       [2654,    6,  334, 2886,    4,    1,  233,  711,  834,   11,  130,
         123,    7,  514,    2,   63,   10,  514,    8,  605]],
      dtype=int32)

Looking at the same 3 rows in words:

In [48]:
print(id_to_word(x[0]))
print(id_to_word(x[1]))
print(id_to_word(x[2]))

['aer', 'banknote', 'berlitz', 'calloway', 'centrust', 'cluett', 'fromstein', 'gitano', 'guterman', 'hydro-quebec', 'ipo', 'kia', 'memotec', 'mlx', 'nahb', 'punts', 'rake', 'regatta', 'rubens', 'sim']
['test', 'will', 'concentrate', 'and', 'sometimes', 'give', 'away', 'a', 'few', 'exact', 'questions', 'and', 'answers', '<eos>', 'use', 'of', 'scoring', 'high', 'is', 'widespread']
['color', 'a', 'second', 'round', 'of', '<unk>', 'economic', 'talks', 'scheduled', 'for', 'next', 'week', 'in', 'washington', '<eos>', 'not', 'that', 'washington', 'and', 'tokyo']
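
The mini-batches shown above could be produced along the following lines. This is a sketch of what `reader.ptb_iterator` is assumed to do, not a copy of it: the corpus is cut into `batch_size` contiguous rows, and each step yields a window of `num_steps` words (`x`) together with the same window shifted by one word (`y`), so that `y` holds the "next word" targets. The name `batch_iterator` and the toy data are hypothetical.

```python
import numpy as np

def batch_iterator(raw_data, batch_size, num_steps):
    """Yield (x, y) pairs of shape [batch_size, num_steps]; y is x shifted by one word."""
    raw_data = np.asarray(raw_data, dtype=np.int32)
    batch_len = len(raw_data) // batch_size
    # Cut the corpus into batch_size contiguous rows, dropping the leftover tail.
    data = raw_data[: batch_size * batch_len].reshape(batch_size, batch_len)
    epoch_size = (batch_len - 1) // num_steps
    if epoch_size == 0:
        raise ValueError("epoch_size == 0, decrease batch_size or num_steps")
    for i in range(epoch_size):
        x = data[:, i * num_steps : (i + 1) * num_steps]
        y = data[:, i * num_steps + 1 : (i + 1) * num_steps + 1]
        yield x, y

toy_ids = np.arange(100)  # stand-in for train_data
x, y = next(batch_iterator(toy_ids, batch_size=4, num_steps=5))
print(x.shape, y.shape)   # (4, 5) (4, 5)
print(x[0], y[0])         # [0 1 2 3 4] [1 2 3 4 5]
```

This layout would also explain why the second row of the real `x` starts somewhere in the middle of the corpus rather than right after the first 20 words.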


<h5> Define 2 placeholders that we will feed with mini-batches (x and y): </h5>

In [50]:
_input_data = tf.placeholder(tf.int32, [batch_size, num_steps]) # [60, 20]
_targets = tf.placeholder(tf.int32, [batch_size, num_steps]) # [60, 20]

Define a dictionary to feed the placeholders with our first mini-batch

In [51]:
feed_dict = {_input_data:x, _targets:y}


For example, we can use it to feed <code>\_input\_data</code>:

In [52]:
session.run(_input_data, feed_dict)

array([[9970, 9971, 9972, ..., 9993, 9994, 9995],
       [ 901,   33, 3361, ...,  241,   13, 2420],
       [2654,    6,  334, ...,  514,    8,  605],
       ...,
       [7831,   36, 1678, ...,    4, 4558,  157],
       [  59, 2070, 2433, ...,  400,    1, 1173],
       [2097,    3,    2, ..., 2043,   23,    1]], dtype=int32)

<h3> Create the Stacked LSTM: </h3>

In [53]:
lstm_cell_l1 = tf.contrib.rnn.BasicLSTMCell(hidden_size_l1, forget_bias=0.0)
lstm_cell_l2 = tf.contrib.rnn.BasicLSTMCell(hidden_size_l2, forget_bias=0.0)
stacked_lstm = tf.contrib.rnn.MultiRNNCell([lstm_cell_l1, lstm_cell_l2])

<h4> Initialize the states of the network: </h4>

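Each LSTM layer has 2 state matrices:
<ul>
    <li> Memory State: m_state (the hidden state, shown as <code>h</code> in the output below) </li>
    <li> Cell State: c_state (shown as <code>c</code> in the output below) </li>
</ul>
Each layer keeps one state vector per sequence in the batch, and the batch size is 60. So for the first layer's 256 hidden units each state is a matrix of size [60x256], and for the second layer's 128 hidden units it is [60x128]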

In [55]:
_initial_state = stacked_lstm.zero_state(batch_size, tf.float32)
_initial_state

(LSTMStateTuple(c=<tf.Tensor 'MultiRNNCellZeroState_1/BasicLSTMCellZeroState/zeros:0' shape=(60, 256) dtype=float32>, h=<tf.Tensor 'MultiRNNCellZeroState_1/BasicLSTMCellZeroState/zeros_1:0' shape=(60, 256) dtype=float32>),
 LSTMStateTuple(c=<tf.Tensor 'MultiRNNCellZeroState_1/BasicLSTMCellZeroState_1/zeros:0' shape=(60, 128) dtype=float32>, h=<tf.Tensor 'MultiRNNCellZeroState_1/BasicLSTMCellZeroState_1/zeros_1:0' shape=(60, 128) dtype=float32>))
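
To show how the `c` and `h` states are used, here is a minimal NumPy sketch of a single LSTM time step for the first layer. It assumes the common formulation in which one kernel produces the input gate, candidate values, forget gate, and output gate in that order; the weights are random stand-ins and `lstm_step` is a hypothetical helper, not the TensorFlow implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, c, h, kernel, bias, forget_bias=0.0):
    """One LSTM time step; returns the new cell state c and hidden state h."""
    gates = np.concatenate([x, h], axis=1) @ kernel + bias
    # Assumed gate order: input gate, candidate values, forget gate, output gate.
    i, j, f, o = np.split(gates, 4, axis=1)
    new_c = c * sigmoid(f + forget_bias) + sigmoid(i) * np.tanh(j)
    new_h = np.tanh(new_c) * sigmoid(o)
    return new_c, new_h

batch_size, input_size, hidden_size = 60, 200, 256
rng = np.random.default_rng(0)
kernel = rng.uniform(-0.1, 0.1, size=(input_size + hidden_size, 4 * hidden_size))
bias = np.zeros(4 * hidden_size)

# Zero initial state: one row per sequence in the batch, one column per hidden unit.
c0 = np.zeros((batch_size, hidden_size))
h0 = np.zeros((batch_size, hidden_size))
x_t = rng.uniform(-0.1, 0.1, size=(batch_size, input_size))

c1, h1 = lstm_step(x_t, c0, h0, kernel, bias)
print(c1.shape, h1.shape)  # (60, 256) (60, 256)
```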

Looking at the states, even though they're all 0 for now:

In [56]:
session.run(_initial_state, feed_dict)

(LSTMStateTuple(c=array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], dtype=float32), h=array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)),
 LSTMStateTuple(c=array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], dtype=float32), h=array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
  

<h3> Creating Word Embeddings </h3>

A typical approach is to use one-hot encoding to convert the words in our dataset to vectors of numbers. However, this creates a high-dimensional sparse dataset, which is inefficient when working with a vocabulary this large. 
<br>
<br>
Thus, we'll use a word2vec-style approach instead: an embedding layer in front of our LSTM converts each word ID into a dense vector before it is fed into the LSTM 
<br> 
<br>
<b>Note:</b> The embedding vectors will also get updated during the training process of the deep neural network
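
The relationship between one-hot encoding and a dense embedding can be checked with a few lines of NumPy. In this sketch (tiny stand-in sizes, hypothetical variable names), multiplying a one-hot matrix by the embedding matrix gives exactly the same result as indexing the embedding rows directly, which is why the sparse one-hot matrix never needs to be built.

```python
import numpy as np

vocab_size, embedding_vector_size = 10, 4   # tiny stand-ins for 10000 and 200
rng = np.random.default_rng(0)
embedding_matrix = rng.uniform(-0.1, 0.1, size=(vocab_size, embedding_vector_size))

word_ids = np.array([7, 2, 7])

# One-hot route: a [3, vocab_size] matrix that is zero except for a single 1 per row.
one_hot = np.eye(vocab_size)[word_ids]
via_matmul = one_hot @ embedding_matrix

# Lookup route: index the rows directly, without materializing the sparse matrix.
via_lookup = embedding_matrix[word_ids]

print(np.allclose(via_matmul, via_lookup))  # True
```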

In [57]:
embedding_vocab = tf.get_variable("embedding_vocab", [vocab_size, embedding_vector_size])

Initializing <code>embedding_vocab</code> with random values: 

In [58]:
session.run(tf.global_variables_initializer())
session.run(embedding_vocab)

array([[ 0.0105087 , -0.00547711, -0.02412317, ...,  0.0218011 ,
         0.01151476, -0.00269401],
       [ 0.00557042, -0.01931978,  0.01293854, ...,  0.01300764,
        -0.0143984 , -0.01703923],
       [-0.01553622,  0.00341475, -0.02288882, ...,  0.00634491,
         0.01816645,  0.01964622],
       ...,
       [ 0.01514607,  0.00448869,  0.01775977, ...,  0.00507732,
         0.00423872,  0.01946887],
       [ 0.00766245,  0.00756617,  0.00355108, ..., -0.02027432,
         0.02383254,  0.01404527],
       [-0.01999212, -0.02049918,  0.00030354, ..., -0.00645145,
         0.0080325 ,  0.01551213]], dtype=float32)

Below, <code>embedding_lookup()</code> finds the embedded values for our batch of 60x20 words.
<br>
<br>
It goes to each row of <code>_input_data</code>, and for each word in the row it finds the corresponding vector in <code>embedding_vocab</code>
<br>
<br>
It creates a [60,20,200] tensor, i.e. the first element of <b>inputs</b> (the first row of the batch) is a matrix of [20x200], where each row is a vector representing one word in the sequence

In [60]:
inputs = tf.nn.embedding_lookup(embedding_vocab, _input_data) #shape = (60, 20, 200)
inputs

<tf.Tensor 'embedding_lookup_1:0' shape=(60, 20, 200) dtype=float32>
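
The row-by-row, word-by-word lookup described above can be mimicked in NumPy. This is only an analogy for what `tf.nn.embedding_lookup` computes, with random stand-ins for the embedding matrix and the word IDs; the explicit loops mirror the description, and integer-array indexing gives the same result in one step.

```python
import numpy as np

batch_size, num_steps = 60, 20
vocab_size, embedding_vector_size = 10000, 200

rng = np.random.default_rng(0)
embedding_matrix = rng.uniform(-0.1, 0.1, size=(vocab_size, embedding_vector_size))
word_ids = rng.integers(0, vocab_size, size=(batch_size, num_steps))  # stand-in for x

# Explicit version: for every row, for every word, fetch that word's vector.
looked_up = np.empty((batch_size, num_steps, embedding_vector_size))
for b in range(batch_size):
    for t in range(num_steps):
        looked_up[b, t] = embedding_matrix[word_ids[b, t]]

# NumPy integer-array indexing performs the same lookup in one step.
vectorized = embedding_matrix[word_ids]

print(looked_up.shape)                        # (60, 20, 200)
print(np.array_equal(looked_up, vectorized))  # True
```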

In [61]:
session.run(inputs[0], feed_dict)

array([[ 0.01591928,  0.01368844, -0.01573621, ..., -0.01691541,
        -0.02147604,  0.01647602],
       [-0.01626554,  0.00250316,  0.0025011 , ..., -0.02000877,
        -0.02241979,  0.0008825 ],
       [-0.01340975, -0.00129013, -0.01743571, ..., -0.02258467,
         0.00187787, -0.01269052],
       ...,
       [-0.02416496,  0.0054648 , -0.01665604, ...,  0.00693947,
        -0.00316988,  0.00745784],
       [-0.01260721, -0.01141733, -0.0157061 , ..., -0.02046303,
         0.02371893,  0.01127747],
       [ 0.01806679,  0.00390237,  0.01104726, ...,  0.01748979,
        -0.01281851,  0.0220017 ]], dtype=float32)

<h3> Constructing the Recurrent Neural Network: </h3>

<code>tf.nn.dynamic_rnn()</code> creates an RNN using <code>stacked_lstm</code>
<br>
The input should be a tensor of shape: [batch_size, max_time, embedding_vector_size] -- here it's (60, 20, 200)
<br>
<br>
This method returns a pair (outputs, new_state) where:
- <b>outputs:</b> is a tensor of shape [batch_size, max_time, output_size] holding the top layer's output at every time step -- here it's (60, 20, 128)
- <b>new_state:</b> is the final state

In [62]:
outputs, new_state = tf.nn.dynamic_rnn(stacked_lstm, inputs, initial_state=_initial_state)

Looking at the outputs:

In [63]:
outputs

<tf.Tensor 'rnn/transpose_1:0' shape=(60, 20, 128) dtype=float32>
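
To see where the (60, 20, 128) shape comes from, the sketch below unrolls a two-layer recurrent stack over the time axis in plain NumPy. It is an assumption-laden illustration: `simple_cell` is a plain tanh RNN cell used as a stand-in for the LSTM cells, the weights are random, and `unroll` is a hypothetical helper rather than what `tf.nn.dynamic_rnn` does internally. What it shares with the real call is the data flow: at each time step the first layer's output feeds the second layer, the top layer's outputs are collected per step, and the states after the last step are returned.

```python
import numpy as np

def simple_cell(x, h, w_x, w_h):
    """Stand-in recurrent cell (plain tanh RNN, not an LSTM) to keep the sketch short."""
    return np.tanh(x @ w_x + h @ w_h)

def unroll(inputs, layer_weights, initial_states):
    """Run stacked cells over the time axis of inputs [batch, time, features]."""
    states = list(initial_states)
    step_outputs = []
    for t in range(inputs.shape[1]):
        layer_input = inputs[:, t, :]
        for layer, (w_x, w_h) in enumerate(layer_weights):
            states[layer] = simple_cell(layer_input, states[layer], w_x, w_h)
            layer_input = states[layer]       # this layer's output feeds the next layer
        step_outputs.append(layer_input)      # top-layer output at time t
    outputs = np.stack(step_outputs, axis=1)  # [batch, time, top_hidden_size]
    return outputs, states

batch_size, num_steps, sizes = 60, 20, [200, 256, 128]
rng = np.random.default_rng(0)
layer_weights = [
    (rng.uniform(-0.1, 0.1, (n_in, n_out)), rng.uniform(-0.1, 0.1, (n_out, n_out)))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]
initial_states = [np.zeros((batch_size, n)) for n in sizes[1:]]
inputs = rng.uniform(-0.1, 0.1, (batch_size, num_steps, sizes[0]))

outputs, new_state = unroll(inputs, layer_weights, initial_states)
print(outputs.shape)                 # (60, 20, 128)
print([s.shape for s in new_state])  # [(60, 256), (60, 128)]
```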

In [64]:
session.run(tf.global_variables_initializer())
session.run(outputs[0], feed_dict)

array([[ 7.0622109e-04,  1.6270705e-04, -4.4020417e-04, ...,
        -4.0015270e-04,  4.6334666e-04, -4.8935020e-05],
       [ 3.5085747e-04,  3.9049544e-04, -8.2564383e-04, ...,
        -3.9553904e-04,  5.6036044e-04, -3.6975081e-04],
       [ 2.5471963e-04,  6.6098431e-04, -2.4576910e-04, ...,
        -2.7519374e-05,  9.9058985e-04, -1.6818022e-04],
       ...,
       [-2.0592793e-06,  6.7708013e-04, -1.3280376e-04, ...,
         1.6114897e-04,  9.1858098e-04,  5.6003584e-05],
       [ 2.9415407e-04,  5.3887622e-04,  3.7562539e-04, ...,
         3.5242076e-04,  1.0768790e-03, -5.0518272e-04],
       [ 3.6447361e-04,  1.1660187e-05, -8.9612447e-05, ...,
        -7.8243669e-05,  4.7572015e-04, -9.3728211e-04]], dtype=float32)

And we need to flatten the outputs so we can connect them to our softmax layer. Let's reshape them from [60 x 20 x 128] to [1200 x 128]
<br>
<br>
<b>To do this:</b> Imagine our output is a 3-d tensor as follows (of course each <code>sen_x_word_y</code> is itself a 128-dimensional output vector): 
<ul>
    <li>sentence 1: [[sen1word1], [sen1word2], [sen1word3], ..., [sen1word20]]</li> 
    <li>sentence 2: [[sen2word1], [sen2word2], [sen2word3], ..., [sen2word20]]</li>   
    <li>sentence 3: [[sen3word1], [sen3word2], [sen3word3], ..., [sen3word20]]</li>  
    <li>...  </li>
    <li>sentence 60: [[sen60word1], [sen60word2], [sen60word3], ..., [sen60word20]]</li>   
</ul>
Now, the flatten would convert this 3-dim tensor to:

[ [sen1word1], [sen1word2], [sen1word3], ..., [sen1word20],[sen2word1], [sen2word2], [sen2word3], ..., [sen2word20], ..., [sen60word20] ]

In [65]:
output = tf.reshape(outputs, [-1, hidden_size_l2])
output

<tf.Tensor 'Reshape:0' shape=(1200, 128) dtype=float32>
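
The ordering of rows after the flatten can be verified with a NumPy stand-in for the RNN outputs (random values, same shapes). Like `tf.reshape`, NumPy's `reshape` uses row-major order, so the vector for word `t` of sequence `b` ends up in row `b * num_steps + t` of the flattened matrix.

```python
import numpy as np

batch_size, num_steps, hidden_size_l2 = 60, 20, 128
rng = np.random.default_rng(0)
# Random stand-in for the RNN outputs tensor.
outputs = rng.standard_normal((batch_size, num_steps, hidden_size_l2))

flat = outputs.reshape(-1, hidden_size_l2)
print(flat.shape)  # (1200, 128)

# Row-major order: sequence b, word t lands in row b * num_steps + t.
b, t = 2, 5
print(np.array_equal(flat[b * num_steps + t], outputs[b, t]))  # True
```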