<h3> LSTM Model - Penn Treebank Dataset </h3>

Here, we break the model into its individual steps to better understand how each one works. Later, these steps are combined into a single LSTM model class, located in a separate file, "WordEmbeddings_LSTM".

<br>
<br>
We'll make use of word embeddings -- a way of representing words or sentence structures as n-dimensional vectors of real numbers
- We assign each word a randomly initialized vector and feed those vectors into the network to be processed

As training iterates, the vectors take on values that help the network correctly predict its target (here, the probable next word in the sentence)
- Words that are frequently used together end up grouped together, in a similar fashion to the picture below: <br>
<img src="https://ibm.box.com/shared/static/bqhc5dg879gcoabzhxra1w8rkg3od1cu.png" width="500">
<i>Source: IBM </i>
<br>
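To make "grouped together" concrete, here is a minimal sketch that compares word vectors by cosine similarity. The three 3-dimensional vectors are made-up values chosen for illustration, not embeddings learned by this model, and `cosine_similarity` is a helper defined only for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (made-up values, for illustration only)
toy_embeddings = {
    "stocks": [0.9, 0.1, 0.0],
    "bonds":  [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(toy_embeddings["stocks"], toy_embeddings["bonds"])
sim_unrelated = cosine_similarity(toy_embeddings["stocks"], toy_embeddings["banana"])

# Vectors pointing in similar directions score close to 1
print(sim_related > sim_unrelated)
```

In a trained model we would hope that words appearing in similar contexts score higher than unrelated ones, but that depends on the data and the training run.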

In [1]:
import time
import numpy as np
import tensorflow as tf

<h3> Get Data: </h3>

Download the Penn Treebank dataset from IBM:

In [2]:
!mkdir data
!wget -q -O data/ptb.zip https://ibm.box.com/shared/static/z2yvmhbskc45xd2a9a4kkn6hg4g4kj5r.zip
!unzip -o data/ptb.zip -d data
!cp data/ptb/reader.py .

import reader

mkdir: cannot create directory ‘data’: File exists
Archive:  data/ptb.zip
  inflating: data/ptb/reader.py      
  inflating: data/__MACOSX/ptb/._reader.py  
  inflating: data/__MACOSX/._ptb     


Download simple examples dataset:

In [3]:
!wget http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz 
!tar xzf simple-examples.tgz -C data/

--2019-03-08 15:26:43--  http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
Resolving www.fit.vutbr.cz (www.fit.vutbr.cz)... 147.229.9.23, 2001:67c:1220:809::93e5:917
Connecting to www.fit.vutbr.cz (www.fit.vutbr.cz)|147.229.9.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 34869662 (33M) [application/x-gtar]
Saving to: ‘simple-examples.tgz.4’


2019-03-08 15:26:54 (3.28 MB/s) - ‘simple-examples.tgz.4’ saved [34869662/34869662]



<h3> Building the LSTM Model: </h3>

Here, we define the model's hyperparameters so that we can practice playing around with them:

In [4]:
init_scale = 0.1                  # initial weight scale
learning_rate = 1.0               # initial learning rate
max_grad_norm = 5                 # max permissible norm for the gradient -- for Gradient Clipping
num_layers = 2                    # number of layers in our model
num_steps = 20                    # number of time steps the LSTM is unrolled for (words per sequence)

hidden_size_l1 = 256              # number of neurons (processing units) in the hidden layers
hidden_size_l2 = 128

max_epoch_decay_lr = 4            # max number of epochs trained with the initial learning rate
max_epoch = 15                    # total epochs in training

keep_prob = 1                     # probability of keeping data in the Dropout layer
decay = 0.5                       # the decay for the learning rate
batch_size = 60                   # size for each batch of data
vocab_size = 10000                # vocab size
embedding_vector_size = 200       # size of each word embedding vector

is_training = 1                   # training flag to separate training from testing
data_dir = "data/simple-examples/data/" # data directory

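- The structure is roughly (simplified: each LSTM layer also has recurrent weights and four gates):
    - 200 input units (embedding) -> [200x256] weight matrix -> 256 hidden units (first layer) -> [256x128] weight matrix -> 128 hidden units (second layer) -> [128x10000] weight matrix -> 10000-unit output (one per vocabulary word)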

<h4> Train Data: </h4>
The training data is a list of 929589 words, each represented by its integer ID, e.g. [9970, 9971, 9972, 9974, ...]

Start an interactive session:

In [5]:
session = tf.InteractiveSession()

In [6]:
# reads the data and separates it into train, validation, and test datasets
raw_data = reader.ptb_raw_data(data_dir)
train_data, valid_data, test_data, vocab, word_to_id = raw_data

In [7]:
print("Length of training data: ", len(train_data))
print("Length of validation data: ", len(valid_data))
print("Length of test data: ", len(test_data))
print("Length of vocab: ", vocab)

Length of training data:  929589
Length of validation data:  73760
Length of test data:  82430
Length of vocab:  10000


<h5>Define a function to translate IDs back to their respective words:</h5>

In [8]:
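def id_to_word(id_list):
    # invert the word -> id mapping once instead of scanning it for every id
    id_to_word_map = {wid: word for word, wid in word_to_id.items()}
    # ids missing from the vocabulary are skipped, as before
    return [id_to_word_map[w] for w in id_list if w in id_to_word_map]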

In [9]:
print(train_data[0:100])

[9970, 9971, 9972, 9974, 9975, 9976, 9980, 9981, 9982, 9983, 9984, 9986, 9987, 9988, 9989, 9991, 9992, 9993, 9994, 9995, 9996, 9997, 9998, 9999, 2, 9256, 1, 3, 72, 393, 33, 2133, 0, 146, 19, 6, 9207, 276, 407, 3, 2, 23, 1, 13, 141, 4, 1, 5465, 0, 3081, 1596, 96, 2, 7682, 1, 3, 72, 393, 8, 337, 141, 4, 2477, 657, 2170, 955, 24, 521, 6, 9207, 276, 4, 39, 303, 438, 3684, 2, 6, 942, 4, 3150, 496, 263, 5, 138, 6092, 4241, 6036, 30, 988, 6, 241, 760, 4, 1015, 2786, 211, 6, 96, 4]


In [10]:
print(id_to_word(train_data[0:100]))

['aer', 'banknote', 'berlitz', 'calloway', 'centrust', 'cluett', 'fromstein', 'gitano', 'guterman', 'hydro-quebec', 'ipo', 'kia', 'memotec', 'mlx', 'nahb', 'punts', 'rake', 'regatta', 'rubens', 'sim', 'snack-food', 'ssangyong', 'swapo', 'wachter', '<eos>', 'pierre', '<unk>', 'N', 'years', 'old', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'nov.', 'N', '<eos>', 'mr.', '<unk>', 'is', 'chairman', 'of', '<unk>', 'n.v.', 'the', 'dutch', 'publishing', 'group', '<eos>', 'rudolph', '<unk>', 'N', 'years', 'old', 'and', 'former', 'chairman', 'of', 'consolidated', 'gold', 'fields', 'plc', 'was', 'named', 'a', 'nonexecutive', 'director', 'of', 'this', 'british', 'industrial', 'conglomerate', '<eos>', 'a', 'form', 'of', 'asbestos', 'once', 'used', 'to', 'make', 'kent', 'cigarette', 'filters', 'has', 'caused', 'a', 'high', 'percentage', 'of', 'cancer', 'deaths', 'among', 'a', 'group', 'of']


<h5> Reading one mini-batch and feeding our network: </h5>

In [11]:
itera = reader.ptb_iterator(train_data, batch_size, num_steps)
first_tuple = next(itera)
x = first_tuple[0]
y = first_tuple[1]

In [12]:
x.shape

(60, 20)

Looking at the first 3 sequences of our input x (each row is a 20-word window of the text, not necessarily a whole sentence):

In [13]:
x[0:3]

array([[9970, 9971, 9972, 9974, 9975, 9976, 9980, 9981, 9982, 9983, 9984,
        9986, 9987, 9988, 9989, 9991, 9992, 9993, 9994, 9995],
       [ 901,   33, 3361,    8, 1279,  437,  597,    6,  261, 4276, 1089,
           8, 2836,    2,  269,    4, 5526,  241,   13, 2420],
       [2654,    6,  334, 2886,    4,    1,  233,  711,  834,   11,  130,
         123,    7,  514,    2,   63,   10,  514,    8,  605]],
      dtype=int32)

Looking at the same 3 sequences in words:

In [14]:
print(id_to_word(x[0]))
print(id_to_word(x[1]))
print(id_to_word(x[2]))

['aer', 'banknote', 'berlitz', 'calloway', 'centrust', 'cluett', 'fromstein', 'gitano', 'guterman', 'hydro-quebec', 'ipo', 'kia', 'memotec', 'mlx', 'nahb', 'punts', 'rake', 'regatta', 'rubens', 'sim']
['test', 'will', 'concentrate', 'and', 'sometimes', 'give', 'away', 'a', 'few', 'exact', 'questions', 'and', 'answers', '<eos>', 'use', 'of', 'scoring', 'high', 'is', 'widespread']
['color', 'a', 'second', 'round', 'of', '<unk>', 'economic', 'talks', 'scheduled', 'for', 'next', 'week', 'in', 'washington', '<eos>', 'not', 'that', 'washington', 'and', 'tokyo']
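The outputs above suggest how the mini-batches are formed: the word stream is cut into `batch_size` contiguous rows, and each batch takes the next `num_steps` words from every row, with the targets shifted one word ahead. The sketch below reproduces that behaviour on a toy list. It is an assumption modelled on the observed output, not the code in `reader.py`, and `toy_ptb_iterator` is a name made up for this example.

```python
def toy_ptb_iterator(raw_data, batch_size, num_steps):
    """Yield (x, y) pairs where y is x shifted one word ahead.

    Assumed behaviour modelled on the notebook output;
    this is not the implementation from reader.py.
    """
    batch_len = len(raw_data) // batch_size
    # Split the word stream into batch_size contiguous rows
    rows = [raw_data[i * batch_len:(i + 1) * batch_len] for i in range(batch_size)]
    # One word is reserved at the end of each row for the last target
    epoch_size = (batch_len - 1) // num_steps
    for i in range(epoch_size):
        start = i * num_steps
        x = [row[start:start + num_steps] for row in rows]
        y = [row[start + 1:start + num_steps + 1] for row in rows]
        yield x, y

toy_data = list(range(20))  # stand-in for a list of word ids
toy_x, toy_y = next(toy_ptb_iterator(toy_data, batch_size=2, num_steps=3))
print(toy_x)  # inputs
print(toy_y)  # targets: the same windows shifted by one position
```

Under this assumption, row 1 of `x` starts far into the text rather than right after row 0, which matches the unrelated sequences printed above.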


<h5> Define 2 placeholders to feed them with mini-batches (x and y): </h5>

In [15]:
_input_data = tf.placeholder(tf.int32, [batch_size, num_steps]) # [60, 20]
_targets = tf.placeholder(tf.int32, [batch_size, num_steps]) # [60, 20]

Define a dictionary to feed the placeholders with our first mini-batch

In [16]:
feed_dict = {_input_data:x, _targets:y}


For example, we can use it to feed <code>_input_data</code>:

In [17]:
session.run(_input_data, feed_dict)

array([[9970, 9971, 9972, ..., 9993, 9994, 9995],
       [ 901,   33, 3361, ...,  241,   13, 2420],
       [2654,    6,  334, ...,  514,    8,  605],
       ...,
       [7831,   36, 1678, ...,    4, 4558,  157],
       [  59, 2070, 2433, ...,  400,    1, 1173],
       [2097,    3,    2, ..., 2043,   23,    1]], dtype=int32)

<h3> Create the Stacked LSTM: </h3>

In [18]:
lstm_cell_l1 = tf.contrib.rnn.BasicLSTMCell(hidden_size_l1, forget_bias=0.0)
lstm_cell_l2 = tf.contrib.rnn.BasicLSTMCell(hidden_size_l2, forget_bias=0.0)
stacked_lstm = tf.contrib.rnn.MultiRNNCell([lstm_cell_l1, lstm_cell_l2])

<h4> Initialize the states of the network: </h4>

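There are 2 state matrices per LSTM layer:
<ul>
    <li> Hidden state (the cell's output, sometimes called the memory state): <code>h</code> </li>
    <li> Cell state: <code>c</code> </li>
</ul>
Each state holds one row per sequence in the batch (60 rows). So for the first layer's 256 hidden units each state is a matrix of size [60x256], and for the second layer's 128 hidden units it is [60x128]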
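To show how the two states are used, here is a minimal pure-Python sketch of one LSTM step for a single example. The gate ordering (input, candidate, forget, output) and the fused `[input + hidden, 4 * hidden]` kernel layout are assumptions made for this sketch, and all names and weight values are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, kernel, bias, forget_bias=0.0):
    """One LSTM step for a single example (lists of floats).

    kernel has len(x) + len(h_prev) rows and 4 * hidden columns; the
    columns are assumed to be ordered: input gate, candidate, forget
    gate, output gate.
    """
    hidden = len(h_prev)
    xh = x + h_prev  # concatenate the input with the previous hidden state
    # Affine transform: z = xh @ kernel + bias
    z = [sum(xh[r] * kernel[r][col] for r in range(len(xh))) + bias[col]
         for col in range(4 * hidden)]
    i, j, f, o = (z[k * hidden:(k + 1) * hidden] for k in range(4))
    # Cell state: keep part of the old memory, add part of the candidate
    c_new = [c_prev[n] * sigmoid(f[n] + forget_bias)
             + sigmoid(i[n]) * math.tanh(j[n]) for n in range(hidden)]
    # Hidden state: a gated view of the new cell state
    h_new = [math.tanh(c_new[n]) * sigmoid(o[n]) for n in range(hidden)]
    return h_new, c_new

toy_x = [1.0, 0.5]
toy_hidden = 3
toy_kernel = [[0.1] * (4 * toy_hidden) for _ in range(len(toy_x) + toy_hidden)]
toy_bias = [0.0] * (4 * toy_hidden)
h0 = [0.0] * toy_hidden  # zero initial hidden state
c0 = [0.0] * toy_hidden  # zero initial cell state
h1, c1 = lstm_step(toy_x, h0, c0, toy_kernel, toy_bias)
print(h1, c1)
```

In the batched model each of `h` and `c` gets one such row per sequence, which is where the [60x256] and [60x128] state shapes come from.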

In [19]:
_initial_state = stacked_lstm.zero_state(batch_size, tf.float32)
_initial_state

(LSTMStateTuple(c=<tf.Tensor 'MultiRNNCellZeroState/BasicLSTMCellZeroState/zeros:0' shape=(60, 256) dtype=float32>, h=<tf.Tensor 'MultiRNNCellZeroState/BasicLSTMCellZeroState/zeros_1:0' shape=(60, 256) dtype=float32>),
 LSTMStateTuple(c=<tf.Tensor 'MultiRNNCellZeroState/BasicLSTMCellZeroState_1/zeros:0' shape=(60, 128) dtype=float32>, h=<tf.Tensor 'MultiRNNCellZeroState/BasicLSTMCellZeroState_1/zeros_1:0' shape=(60, 128) dtype=float32>))

Looking at the states, even though they're all 0 for now:

In [20]:
session.run(_initial_state, feed_dict)

(LSTMStateTuple(c=array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], dtype=float32), h=array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)),
 LSTMStateTuple(c=array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], dtype=float32), h=array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
  

<h3> Creating Word Embeddings </h3>

A typical approach is to use one-hot encoding to convert the words in our dataset to vectors of numbers. This creates a high-dimensional, sparse representation (one 10000-dimensional vector per word here), which is inefficient when working with such large datasets. 
<br>
<br>
Thus, we'll use a dense embedding in the spirit of word2vec. We represent it as a layer in front of the LSTM, where each word ID is mapped to a dense vector before being fed into the LSTM 
<br> 
<br>
<b>Note:</b> The embedding vectors will also get updated during the training process of the deep neural network
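As a small illustration of why a lookup is preferable to one-hot vectors, the sketch below shows that multiplying a one-hot vector by an embedding table selects exactly one row of the table. The sizes and values are toy assumptions (the notebook uses 10000 words and 200 dimensions), and the helper names are made up for this example.

```python
toy_vocab_size = 5  # toy vocabulary size (the notebook uses 10000)
toy_dim = 3         # toy embedding size (the notebook uses 200)

# Toy embedding table: row i is the vector for word id i
table = [[float(i * toy_dim + j) for j in range(toy_dim)]
         for i in range(toy_vocab_size)]

def one_hot(word_id, size):
    return [1.0 if k == word_id else 0.0 for k in range(size)]

def vec_matmul(vec, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    cols = len(matrix[0])
    return [sum(vec[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(cols)]

word_id = 3
via_one_hot = vec_matmul(one_hot(word_id, toy_vocab_size), table)  # vocab * dim multiplications
via_lookup = table[word_id]                                        # a single row read

print(via_one_hot == via_lookup)
```

Both paths give the same vector, but the lookup avoids building and multiplying a mostly-zero vector for every word.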

In [21]:
embedding_vocab = tf.get_variable("embedding_vocab", [vocab_size, embedding_vector_size])

Initializing <code>embedding_vocab</code> with random values: 

In [22]:
session.run(tf.global_variables_initializer())
session.run(embedding_vocab)

array([[-0.0213737 , -0.00334409, -0.01018565, ..., -0.01020918,
         0.02334081,  0.00289541],
       [-0.00548434, -0.02135967,  0.01319274, ..., -0.00279438,
         0.01150737, -0.00271155],
       [-0.00017232, -0.01461577, -0.00689847, ...,  0.00439154,
         0.00612313, -0.0164778 ],
       ...,
       [-0.0209833 ,  0.00340537,  0.01616802, ...,  0.02202382,
        -0.00369697, -0.02364882],
       [-0.00075741,  0.00768025,  0.01728018, ...,  0.01072704,
        -0.00530361, -0.01803417],
       [-0.00375612, -0.022914  ,  0.00630276, ...,  0.01526984,
        -0.01173498,  0.00300352]], dtype=float32)

Below, <code>embedding_lookup()</code> finds the embedded values for our batch of 60x20 words.
<br>
<br>
It goes to each row of <code>_input_data</code> and, for each word in the row, finds the corresponding vector in <code>embedding_vocab</code>
<br>
<br>
It creates a [60,20,200] tensor, i.e. the first element of <b>inputs</b> (the first sequence) is a matrix of [20x200], where each row is the vector representing one word of the sequence

In [23]:
inputs = tf.nn.embedding_lookup(embedding_vocab, _input_data) #shape = (60, 20, 200)
inputs

<tf.Tensor 'embedding_lookup:0' shape=(60, 20, 200) dtype=float32>

In [24]:
session.run(inputs[0], feed_dict)

array([[-0.01636498,  0.01951164, -0.01602272, ..., -0.02367425,
         0.01455395, -0.01347792],
       [ 0.00582112, -0.01481521, -0.00915285, ..., -0.00859324,
        -0.0011811 ,  0.01254718],
       [-0.01520363, -0.01240561, -0.00730615, ...,  0.0145461 ,
        -0.0126853 ,  0.02235372],
       ...,
       [ 0.00772456, -0.023672  , -0.01346414, ..., -0.01951801,
         0.01067849, -0.00712226],
       [-0.00237579,  0.00305569,  0.01819591, ...,  0.00439961,
        -0.0165371 ,  0.01986329],
       [-0.02240466, -0.01714158,  0.02011445, ..., -0.01531839,
         0.0086587 ,  0.00743859]], dtype=float32)

<h3> Constructing the Recurrent Neural Network: </h3>

<code>tf.nn.dynamic_rnn()</code> creates an RNN using <code>stacked_lstm</code>
<br>
The input should be a tensor of shape [batch_size, max_time, embedding_vector_size] -- here it's (60, 20, 200)
<br>
<br>
This method returns a pair (outputs, new_state) where:
- <b>outputs:</b> is a tensor of shape [batch_size, max_time, output_size] holding the top layer's output at every time step -- here (60, 20, 128)
- <b>new_state:</b> is the final state of every layer after the last time step

In [25]:
outputs, new_state = tf.nn.dynamic_rnn(stacked_lstm, inputs, initial_state=_initial_state)

Looking at the outputs:

In [26]:
outputs

<tf.Tensor 'rnn/transpose_1:0' shape=(60, 20, 128) dtype=float32>

In [27]:
session.run(tf.global_variables_initializer())
session.run(outputs[0], feed_dict)

array([[ 7.9406323e-05,  3.9944524e-04,  6.3427855e-05, ...,
        -7.7575125e-04, -3.1275125e-04,  6.1279177e-05],
       [ 5.5325805e-04,  4.5826728e-04,  2.3342554e-04, ...,
        -4.8883003e-04, -7.5739459e-04,  2.6636335e-04],
       [ 4.2421496e-04,  8.5593783e-04,  4.4784762e-04, ...,
        -5.8149017e-04, -4.1002306e-04,  2.9800067e-04],
       ...,
       [-3.0399158e-04,  9.3412941e-04,  9.2715595e-04, ...,
        -1.3110086e-03,  7.7368328e-05,  9.3800321e-05],
       [ 3.0067988e-04,  2.1566251e-04,  7.6333305e-04, ...,
        -6.6103722e-04,  3.9553220e-04,  1.4844591e-04],
       [ 8.5739687e-04, -4.5345520e-04,  7.0111483e-04, ...,
        -8.6387061e-04,  1.3594154e-03, -1.5523579e-04]], dtype=float32)

Next, we need to flatten the outputs so we can connect them to our softmax layer. Let's reshape them from [60 x 20 x 128] to [1200 x 128]
<br>
<br>
<b>To do this:</b> Imagine our output is a 3-d tensor as follows (each <code>sen_x_word_y</code> is itself a 128-dimensional output vector): 
<ul>
    <li>sentence 1: [[sen1word1], [sen1word2], [sen1word3], ..., [sen1word20]]</li> 
    <li>sentence 2: [[sen2word1], [sen2word2], [sen2word3], ..., [sen2word20]]</li>   
    <li>sentence 3: [[sen3word1], [sen3word2], [sen3word3], ..., [sen3word20]]</li>  
    <li>...  </li>
    <li>sentence 60: [[sen60word1], [sen60word2], [sen60word3], ..., [sen60word20]]</li>   
</ul>
Now, the flatten converts this 3-d tensor to:

[ [sen1word1], [sen1word2], [sen1word3], ..., [sen1word20],[sen2word1], [sen2word2], [sen2word3], ..., [sen2word20], ..., [sen60word20] ]
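The flattening described above can be sketched with nested Python lists. The sizes are toy values (the notebook uses 60, 20 and 128), and the filler numbers only encode where each vector came from.

```python
batch, steps, hidden = 2, 3, 4  # toy sizes; the notebook uses 60, 20, 128

# outputs_3d[b][t] is the output vector for sequence b at time step t
outputs_3d = [[[b * 100 + t * 10 + k for k in range(hidden)]
               for t in range(steps)]
              for b in range(batch)]

# Row-major flatten of the first two axes: all time steps of sequence 0,
# then all time steps of sequence 1, and so on
flat = [outputs_3d[b][t] for b in range(batch) for t in range(steps)]

print(len(flat), len(flat[0]))
# Row b * steps + t of the flat matrix is sequence b, time step t
print(flat[1 * steps + 2] == outputs_3d[1][2])
```

With the notebook's sizes this means rows 0 to 19 of the flattened matrix belong to the first sequence, rows 20 to 39 to the second, and so on.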

In [28]:
output = tf.reshape(outputs, [-1, hidden_size_l2])
output

<tf.Tensor 'Reshape:0' shape=(1200, 128) dtype=float32>

<h3> Creating Logistic Unit </h3>
- this softmax layer returns, for each position, a probability for every word in our vocabulary of 10,000 words

In [29]:
softmax_w = tf.get_variable("softmax_w", [hidden_size_l2, vocab_size]) #[128x10,000]
softmax_b = tf.get_variable("softmax_b", [vocab_size]) #[1x10,000]
logits = tf.matmul(output, softmax_w) + softmax_b
prob = tf.nn.softmax(logits)

Looking at the predicted word probabilities for the 20 time steps (t=0 to t=19) of the first sequence:

In [30]:
session.run(tf.global_variables_initializer())
output_words_prob = session.run(prob, feed_dict)
print("shape of the output: ", output_words_prob.shape)
print("probability of observing words in the range t=0 to t=20: ", output_words_prob[0:20])

shape of the output:  (1200, 10000)
probability of observing words in the range t=0 to t=20:  [[9.84885410e-05 9.83528225e-05 9.99198164e-05 ... 9.86349187e-05
  1.01478494e-04 9.95689697e-05]
 [9.84855214e-05 9.83500722e-05 9.99271215e-05 ... 9.86392624e-05
  1.01486527e-04 9.95785231e-05]
 [9.84893923e-05 9.83557402e-05 9.99320910e-05 ... 9.86375089e-05
  1.01487371e-04 9.95713781e-05]
 ...
 [9.84974104e-05 9.83587524e-05 9.99250260e-05 ... 9.86395607e-05
  1.01486046e-04 9.95751252e-05]
 [9.84975923e-05 9.83535210e-05 9.99231052e-05 ... 9.86348750e-05
  1.01490121e-04 9.95790542e-05]
 [9.84945582e-05 9.83526697e-05 9.99238327e-05 ... 9.86357627e-05
  1.01485304e-04 9.95770024e-05]]
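The probabilities above are all close to 1e-4, which is what an untrained model should give: with nearly identical logits, softmax spreads the probability almost evenly over the 10000 words. Here is a minimal pure-Python sketch of softmax (with the usual max-subtraction for numerical stability); the example logits are made up.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits: larger logits receive larger probabilities
probs = softmax([2.0, 1.0, 0.1])
print(probs)

# With identical logits every word gets probability 1 / vocab_size
uniform = softmax([0.0] * 10000)
print(uniform[0])
```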


<h3> Prediction </h3>
- Which word has the maximum probability at each time step?

In [31]:
np.argmax(output_words_prob[0:20], axis=1)

array([3261, 3744, 3261, 6397, 9649, 9649, 9649, 6947, 4475,  445, 6397,
       6397,  780, 6397, 8792,  780, 2971, 8792, 8792, 8792])

What is the ground truth for the 20 words of the first sequence?

In [32]:
y[0]

array([9971, 9972, 9974, 9975, 9976, 9980, 9981, 9982, 9983, 9984, 9986,
       9987, 9988, 9989, 9991, 9992, 9993, 9994, 9995, 9996], dtype=int32)

And you can get that same ground truth using the <code>_targets</code> tensor:

In [33]:
targ = session.run(_targets, feed_dict)
targ[0]

array([9971, 9972, 9974, 9975, 9976, 9980, 9981, 9982, 9983, 9984, 9986,
       9987, 9988, 9989, 9991, 9992, 9993, 9994, 9995, 9996], dtype=int32)

<h3> Defining the Objective Function </h3>
- We minimize the loss function -- the average negative log probability of the target words:
$$\text{loss} = -\frac{1}{N}\sum_{i=1}^{N} \ln p_{\text{target}_i}$$
This can be implemented in TensorFlow through <code>sequence_loss_by_example</code>, which calculates the weighted cross-entropy loss for each position of a <b>logits</b> and <b>targets</b> sequence. It returns the individual losses, so the averaging is done in a separate step
<br>
<br>
The arguments of this function are:
<ul>
    <li>logits: List of 2D Tensors of shape [batch_size x num_decoder_symbols].</li>  
    <li>targets: List of 1D batch-sized int32 Tensors of the same length as logits.</li>   
    <li>weights: List of 1D batch-sized float-Tensors of the same length as logits.</li> 
</ul>
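As a plain-Python illustration of the loss formula above, the sketch below averages the negative log probability that each prediction assigns to its target word. The probabilities and targets are made-up toy values, and `average_negative_log_prob` is a helper written for this example, not a TensorFlow function.

```python
import math

def average_negative_log_prob(prob_rows, targets):
    """Mean of -ln p[target] over all positions."""
    losses = [-math.log(row[t]) for row, t in zip(prob_rows, targets)]
    return sum(losses) / len(losses)

# Toy predictions over a 3-word vocabulary (made-up numbers)
toy_probs = [
    [0.7, 0.2, 0.1],  # position 0: target word 0 gets probability 0.7
    [0.1, 0.8, 0.1],  # position 1: target word 1 gets probability 0.8
]
toy_targets = [0, 1]

toy_loss = average_negative_log_prob(toy_probs, toy_targets)
print(toy_loss)
```

The loss shrinks towards 0 as the probability assigned to each target word approaches 1.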

In [34]:
loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example([logits], [tf.reshape(_targets, [-1])],[tf.ones([batch_size * num_steps])])

Looking at the first 10 values for loss:

In [35]:
session.run(loss, feed_dict)[:10]

array([9.219284, 9.21556 , 9.201283, 9.207876, 9.213242, 9.21275 ,
       9.194984, 9.193851, 9.211199, 9.226957], dtype=float32)

Define the cost as the sum of all per-word losses divided by the batch size, i.e. the loss per sequence summed over its 20 time steps:

In [36]:
cost = tf.reduce_sum(loss) / batch_size
session.run(tf.global_variables_initializer())
session.run(cost, feed_dict)

184.19221
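A quick sanity check on this number: an untrained model assigns roughly equal probability to all 10000 words, so each word should cost about ln(10000), and a 20-step sequence about 20 times that. The sketch below reproduces the arithmetic using the hyperparameter values from this notebook.

```python
import math

vocab_size = 10000
num_steps = 20
batch_size = 60

# Loss per word if every word were equally likely
per_word_loss = -math.log(1.0 / vocab_size)

# cost = sum of all per-word losses / batch_size
expected_cost = per_word_loss * batch_size * num_steps / batch_size

print(round(per_word_loss, 4))
print(round(expected_cost, 2))
```

Both values are close to the per-word losses (about 9.2) and the cost (about 184.2) observed above, which suggests the untrained network starts out near a uniform guess.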

<h3>Training our Model</h3>

We take the following steps to train our model:
<ol>
    <li>Define the optimizer.</li>
    <li>Extract variables that are trainable.</li>
    <li>Calculate the gradients based on the loss function.</li>
    <li>Apply the optimizer to the variables/gradients tuple.</li>
</ol>

<h5> 1. Define the Optimizer </h5>
- Here we'll use the gradient descent optimizer

In [37]:
# Create a variable for the learning rate
lr = tf.Variable(0.0, trainable=False)

# Create optimizer with our learning rate
optimizer = tf.train.GradientDescentOptimizer(lr)

<h5> 2. Extract Trainable Variables </h5>
- If a variable is created with <code>trainable=True</code> (the default), the variable constructor automatically adds it to the graph collection <b>GraphKeys.TRAINABLE_VARIABLES</b>. <code>tf.trainable_variables()</code> returns all the variables in that collection

In [38]:
tvars = tf.trainable_variables()
tvars

[<tf.Variable 'embedding_vocab:0' shape=(10000, 200) dtype=float32_ref>,
 <tf.Variable 'rnn/multi_rnn_cell/cell_0/basic_lstm_cell/kernel:0' shape=(456, 1024) dtype=float32_ref>,
 <tf.Variable 'rnn/multi_rnn_cell/cell_0/basic_lstm_cell/bias:0' shape=(1024,) dtype=float32_ref>,
 <tf.Variable 'rnn/multi_rnn_cell/cell_1/basic_lstm_cell/kernel:0' shape=(384, 512) dtype=float32_ref>,
 <tf.Variable 'rnn/multi_rnn_cell/cell_1/basic_lstm_cell/bias:0' shape=(512,) dtype=float32_ref>,
 <tf.Variable 'softmax_w:0' shape=(128, 10000) dtype=float32_ref>,
 <tf.Variable 'softmax_b:0' shape=(10000,) dtype=float32_ref>]

In [39]:
[v.name for v in tvars]

['embedding_vocab:0',
 'rnn/multi_rnn_cell/cell_0/basic_lstm_cell/kernel:0',
 'rnn/multi_rnn_cell/cell_0/basic_lstm_cell/bias:0',
 'rnn/multi_rnn_cell/cell_1/basic_lstm_cell/kernel:0',
 'rnn/multi_rnn_cell/cell_1/basic_lstm_cell/bias:0',
 'softmax_w:0',
 'softmax_b:0']
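The LSTM kernel shapes listed above, (456, 1024) and (384, 512), can be reproduced by assuming each layer uses one fused weight matrix: its rows cover the layer input concatenated with the previous hidden state, and its columns cover the four gates. The helper below is a sketch under that assumption, not a TensorFlow function.

```python
def lstm_kernel_shape(input_size, hidden_size):
    """Shape of a fused LSTM kernel: [input + hidden, 4 * hidden]."""
    return (input_size + hidden_size, 4 * hidden_size)

embedding_vector_size = 200
hidden_size_l1 = 256
hidden_size_l2 = 128

# Layer 1 receives the 200-dimensional embeddings
layer1 = lstm_kernel_shape(embedding_vector_size, hidden_size_l1)
# Layer 2 receives the 256-dimensional output of layer 1
layer2 = lstm_kernel_shape(hidden_size_l1, hidden_size_l2)

print(layer1, layer2)
```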

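<h5> 3. Calculate Gradients based on Loss Function </h5>
- The gradient is the vector of partial derivatives of the cost with respect to each trainable variable; <code>tf.gradients</code> returns one gradient tensor per variable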
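To connect the gradient with ordinary derivatives, here is a minimal sketch that estimates partial derivatives numerically with central differences. TensorFlow computes gradients symbolically through the graph rather than this way; the toy cost function and helper name are assumptions made for the example.

```python
def numerical_gradient(f, w, eps=1e-6):
    """Central-difference estimate of the partial derivatives of f at w."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        down = list(w)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def toy_cost(w):
    # f(w) = w0^2 + 3 * w1, so df/dw0 = 2 * w0 and df/dw1 = 3
    return w[0] ** 2 + 3 * w[1]

toy_grad = numerical_gradient(toy_cost, [2.0, -1.0])
print(toy_grad)  # approximately [4.0, 3.0]
```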

In [40]:
tf.gradients(cost, tvars)

[<tensorflow.python.framework.ops.IndexedSlices at 0x7f2b800b9c50>,
 <tf.Tensor 'gradients/rnn/while/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/MatMul/Enter_grad/b_acc_3:0' shape=(456, 1024) dtype=float32>,
 <tf.Tensor 'gradients/rnn/while/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/BiasAdd/Enter_grad/b_acc_3:0' shape=(1024,) dtype=float32>,
 <tf.Tensor 'gradients/rnn/while/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/MatMul/Enter_grad/b_acc_3:0' shape=(384, 512) dtype=float32>,
 <tf.Tensor 'gradients/rnn/while/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/BiasAdd/Enter_grad/b_acc_3:0' shape=(512,) dtype=float32>,
 <tf.Tensor 'gradients/MatMul_grad/MatMul_1:0' shape=(128, 10000) dtype=float32>,
 <tf.Tensor 'gradients/add_grad/Reshape_1:0' shape=(10000,) dtype=float32>]

In [43]:
grad_t_list = tf.gradients(cost, tvars)
#session.run(grad_t_list, feed_dict)

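Now we have a list of tensors, <code>grad_t_list</code>, which we can use to compute clipped gradients. <code>clip_by_global_norm</code> rescales all tensors in the list by the same factor so that their combined (global) norm does not exceed the threshold; the global norm is the square root of the sum of the squared L2 norms of the individual tensors.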
<br>
<br>
<code>clip_by_global_norm</code> takes t-list as input and returns 2 things:
- a list of clipped tensors -- called <i>list_clipped</i>
- the global norm of all tensors in t-list -- called <i>global_norm</i>
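A minimal pure-Python sketch of global-norm clipping follows. Tensors are represented as flat lists of floats, and `toy_clip_by_global_norm` is a stand-in written for this example rather than the TensorFlow function itself; the gradient values are made up.

```python
import math

def toy_clip_by_global_norm(tensors, clip_norm):
    """Scale every tensor by clip_norm / max(global_norm, clip_norm)."""
    # Global norm: square root of the sum of squares over all tensors
    global_norm = math.sqrt(sum(v * v for t in tensors for v in t))
    # The scale is 1 when the global norm is already within the threshold
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[v * scale for v in t] for t in tensors]
    return clipped, global_norm

toy_grads = [[3.0, 4.0], [12.0]]  # global norm = sqrt(9 + 16 + 144) = 13
list_clipped, global_norm = toy_clip_by_global_norm(toy_grads, clip_norm=5.0)

print(global_norm)
print(list_clipped)
```

Because every tensor is scaled by the same factor, the direction of the overall gradient is preserved and only its length is capped.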

In [44]:
# clip the gradients so that their global norm does not exceed max_grad_norm:
grads, _ = tf.clip_by_global_norm(grad_t_list, max_grad_norm)
grads

[<tensorflow.python.framework.ops.IndexedSlices at 0x7f2b800ad0f0>,
 <tf.Tensor 'clip_by_global_norm/clip_by_global_norm/_1:0' shape=(456, 1024) dtype=float32>,
 <tf.Tensor 'clip_by_global_norm/clip_by_global_norm/_2:0' shape=(1024,) dtype=float32>,
 <tf.Tensor 'clip_by_global_norm/clip_by_global_norm/_3:0' shape=(384, 512) dtype=float32>,
 <tf.Tensor 'clip_by_global_norm/clip_by_global_norm/_4:0' shape=(512,) dtype=float32>,
 <tf.Tensor 'clip_by_global_norm/clip_by_global_norm/_5:0' shape=(128, 10000) dtype=float32>,
 <tf.Tensor 'clip_by_global_norm/clip_by_global_norm/_6:0' shape=(10000,) dtype=float32>]

In [45]:
session.run(grads, feed_dict)

[IndexedSlicesValue(values=array([[ 2.5460706e-06, -1.5599641e-06, -5.0178919e-06, ...,
         -5.1370648e-06,  1.2168175e-06, -8.8229208e-06],
        [ 3.5357975e-06, -5.6158678e-06, -1.0667203e-06, ...,
          4.7822709e-07,  7.7584300e-06, -8.2193346e-06],
        [ 1.1310232e-06, -4.9726486e-06,  9.2283102e-08, ...,
         -1.9307936e-06, -9.7034885e-08, -7.8395706e-06],
        ...,
        [-4.5345719e-06, -1.0879848e-05, -1.1152408e-05, ...,
         -8.4349831e-06, -6.2263553e-06,  9.7209504e-06],
        [-2.4112430e-06, -8.7081917e-06, -9.6059503e-06, ...,
         -5.9034192e-06, -4.0739587e-06,  1.1668343e-05],
        [-1.5840376e-06,  1.0775825e-06, -5.2606206e-06, ...,
         -1.6369095e-06, -8.3876294e-06,  1.0226038e-06]], dtype=float32), indices=array([9970, 9971, 9972, ..., 2043,   23,    1], dtype=int32), dense_shape=array([10000,   200], dtype=int32)),
 array([[-8.5344025e-09, -1.0975511e-08,  4.0865434e-08, ...,
          4.1109509e-08, -3.0001356e-08,  

<h5> 4. Apply the optimizer to the variables/gradients tuple</h5>

In [46]:
# create the TensorFlow training operation through our optimizer
train_op = optimizer.apply_gradients(zip(grads, tvars))

In [47]:
session.run(tf.global_variables_initializer())
session.run(train_op, feed_dict)
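Note that <code>lr</code> was created with the value 0.0, so running <code>train_op</code> leaves the weights unchanged until a learning rate is assigned to it. The hyperparameters <code>learning_rate</code>, <code>decay</code> and <code>max_epoch_decay_lr</code> defined earlier suggest a schedule along the following lines. This is a sketch of one plausible schedule, not necessarily the one used in the "WordEmbeddings_LSTM" class, and `learning_rate_for_epoch` is a hypothetical helper.

```python
learning_rate = 1.0     # initial learning rate
decay = 0.5             # multiplicative decay factor
max_epoch_decay_lr = 4  # number of epochs trained at the initial rate
max_epoch = 15          # total epochs in training

def learning_rate_for_epoch(epoch):
    """Hypothetical schedule: constant at first, then multiplied by decay each epoch."""
    return learning_rate * decay ** max(epoch + 1 - max_epoch_decay_lr, 0)

# Epochs are numbered from 0
schedule = [learning_rate_for_epoch(e) for e in range(max_epoch)]
print(schedule[:7])
```

In the graph, each epoch's value would then be written into <code>lr</code> before training on that epoch's batches, for example with an assign operation run in the session.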