# NLP seq2seq with TensorFlow part01
In this notebook we start building an NLP seq2seq model with TensorFlow, covering the following steps:

* word embeddings
* sequence encoding with rnn

The goal of this notebook is to introduce some helper functions provided by TensorFlow (version 1.0.1):

* [`tf.contrib.layers.embed_sequence`](https://www.tensorflow.org/api_docs/python/tf/contrib/layers/embed_sequence) to convert a sparse input (a sequence of word ids) into a dense representation (word vectors; see [word2vec](https://www.tensorflow.org/tutorials/word2vec) for more detail)
* [`tf.contrib.rnn.BasicRNNCell`](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/BasicRNNCell) to model a basic RNN cell
* [`tf.contrib.rnn.BasicLSTMCell`](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/BasicLSTMCell) to model a Long-Short-Term-Memory cell
* [`tf.contrib.rnn.GRUCell`](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/GRUCell) to model a Gated-Recurrent-Unit cell
* [`tf.nn.dynamic_rnn`](https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn) to perform fully dynamic unrolling of our RNN, i.e. to step through the sequence and compute the final state of our RNN

In [1]:
import tensorflow as tf
import numpy as np

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

import copy, sys, time
if '../common' not in sys.path:
    sys.path.insert(0, '../common')

import helper
from gradient_check import rel_error
source_path = '../common/data/small_vocab_en'
target_path = '../common/data/small_vocab_fr'
source_text = helper.load_data(source_path)
target_text = helper.load_data(target_path)


## Preprocessing data
The first step is to create lookup tables that map each word to an integer id and vice versa. Note that we always add some special tokens to the dictionary, e.g.
~~~~
CODES = {'<PAD>': 0, '<EOS>': 1, '<UNK>': 2, '<GO>': 3 }
~~~~

In [2]:
def create_lookup_tables(text, special_codes):
    vocab_to_int = copy.copy(special_codes)
    vocab = set(text.split())
    
    # start numbering after the special codes passed in (not the global CODES)
    for v_i, v in enumerate(vocab, len(special_codes)):
        vocab_to_int[v] = v_i

    int_to_vocab = {v_i: v for v, v_i in vocab_to_int.items()}
    return vocab_to_int, int_to_vocab

CODES = {'<PAD>': 0, '<EOS>': 1, '<UNK>': 2, '<GO>': 3 }
src_vocab_to_int, src_int_to_vocab = create_lookup_tables(source_text, CODES)
des_vocab_to_int, des_int_to_vocab = create_lookup_tables(target_text, CODES)
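
As a quick sanity check, here is a minimal sketch of the same construction on a made-up toy corpus. The sentence and the `build_tables` / `toy_` names are illustrative only and not part of the dataset. Sorting the vocabulary is an addition assumed here so that ids are reproducible, since iteration order over a `set` of strings can change between runs.

```python
import copy

CODES = {'<PAD>': 0, '<EOS>': 1, '<UNK>': 2, '<GO>': 3}

def build_tables(text, special_codes):
    # start from the special tokens so they keep ids 0..3
    vocab_to_int = copy.copy(special_codes)
    # sorted() is an assumption added here to get reproducible ids
    for idx, word in enumerate(sorted(set(text.split())), len(special_codes)):
        vocab_to_int[word] = idx
    int_to_vocab = {idx: word for word, idx in vocab_to_int.items()}
    return vocab_to_int, int_to_vocab

# hypothetical two-sentence corpus
toy_text = 'new jersey is sometimes quiet\nparis is quiet'
toy_v2i, toy_i2v = build_tables(toy_text, CODES)
print(toy_v2i)
```

Regular words start at id 4, right after the four special tokens, and the two tables are inverses of each other.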

Given the lookup tables, we need to convert the text into ids.

In [3]:
def text_to_ids(text, vocab_to_int, append_eos = False):
    eos = []
    if append_eos:
        eos = [vocab_to_int['<EOS>']]
    
    sequence_ids = []
    for sent in text.split('\n'):
        sent_ids = [vocab_to_int[w] for w in sent.split()]
        if len(sent_ids) > 0:
            sequence_ids.append(sent_ids + eos)
    return sequence_ids

src_seq_ids = text_to_ids(source_text, src_vocab_to_int)
des_seq_ids = text_to_ids(target_text, des_vocab_to_int, append_eos=True)

i_max = np.argmax([len(s) for s in src_seq_ids])
i_min = np.argmin([len(s) for s in src_seq_ids])
print ('max len {:2d} at {}'.format(len(src_seq_ids[i_max]), i_max))
print ('min len {:2d} at {}'.format(len(src_seq_ids[i_min]), i_min))

max len 17 at 1
min len  3 at 5057
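
The conversion above raises a `KeyError` for any word missing from the vocabulary, which cannot happen here because the tables are built from the same text. For unseen text, one possible variant (a sketch, not what this notebook uses) falls back to the `<UNK>` id. The function name `text_to_ids_unk` and the tiny vocabulary are hypothetical.

```python
CODES = {'<PAD>': 0, '<EOS>': 1, '<UNK>': 2, '<GO>': 3}
# hypothetical tiny vocabulary: special codes plus two words
toy_vocab = dict(CODES, paris=4, quiet=5)

def text_to_ids_unk(text, vocab_to_int, append_eos=False):
    unk = vocab_to_int['<UNK>']
    eos = [vocab_to_int['<EOS>']] if append_eos else []
    sequences = []
    for sent in text.split('\n'):
        # unknown words map to <UNK> instead of raising KeyError
        ids = [vocab_to_int.get(w, unk) for w in sent.split()]
        if ids:  # skip empty lines
            sequences.append(ids + eos)
    return sequences

toy_ids = text_to_ids_unk('paris is quiet\n\nquiet paris', toy_vocab, append_eos=True)
print(toy_ids)
```

Here `is` is not in the toy vocabulary, so it becomes id 2, and every sentence ends with the `<EOS>` id 1.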


## Try word embedding with RNN
In this section, we implement the encoder part of the following diagram:
<img src="images/encoder_decoder.png" width="600"/>

We will use the following helper functions
* `helper.pad_sentence_batch` to pad all sentences in a batch to the same length
* [`tf.contrib.layers.embed_sequence`](https://www.tensorflow.org/api_docs/python/tf/contrib/layers/embed_sequence) to map every word id in a sequence to its embedding vector (the RNN itself is applied later)
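
The source of `helper.pad_sentence_batch` is not shown in this notebook. A minimal sketch of what such a helper presumably does, assuming it right-pads with the `<PAD>` id 0 up to the longest sentence in the batch (the name `pad_batch` is hypothetical):

```python
def pad_batch(batch, pad_id=0):
    # assumption: pad on the right with the <PAD> id up to the longest sentence
    max_len = max(len(sent) for sent in batch)
    return [sent + [pad_id] * (max_len - len(sent)) for sent in batch]

padded = pad_batch([[4, 7, 9], [5, 6]])
print(padded)
```

After padding, the batch is rectangular and can be converted to a NumPy array and fed to a placeholder of shape `[None, None]`.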

In [16]:
tf.reset_default_graph()

# create interactive session 
sess = tf.InteractiveSession()

# create data
input_data = tf.placeholder(tf.int32, shape = [None, None])
src_vocab_size = len(src_vocab_to_int)
src_embed_dim = 2

print ('source vocab-size: {}'.format(src_vocab_size))

# we create an initializer so we can control how the embedding weights are initialized
embed_weights = np.linspace(0.0, 1.0, src_vocab_size * src_embed_dim, dtype=np.float32)
embed_weights = embed_weights.reshape(src_vocab_size, src_embed_dim)


embed_init = tf.constant_initializer(embed_weights)

# we create embedding
embed_input = tf.contrib.layers.embed_sequence(input_data, src_vocab_size, src_embed_dim, initializer=embed_init)

source vocab-size: 231


## Check embed layer
We now run the embedding layer and expect the **embedding outputs** to match the corresponding rows of **embed_weights**. We only test two batches with different sequence lengths.

In [23]:
# we initialize our variables; another way is to use tf.assign
sess.run(tf.global_variables_initializer())

batch_size = 2
indices = [1, 5057]
batch_datas = []
for idx in indices:
    test_batch = np.array(helper.pad_sentence_batch(src_seq_ids[idx:idx+batch_size]))
    print (test_batch.shape)
    batch_datas.append(test_batch)
    embed_vals = sess.run(embed_input, feed_dict={input_data:test_batch})
    seq_len = test_batch.shape[1]
    w = 0
    while (w==0): 
        i = np.random.randint(batch_size)
        j = np.random.randint(seq_len)
        w = test_batch[i,j]
    print ('word[{},{}] = {}'.format(i, j, test_batch[i,j]))
    print ('embed_vals[{},{}] = {}'.format(i, j, embed_vals[i,j]))
    print ('embed_weight[{}] = {}'.format(test_batch[i,j], embed_weights[test_batch[i,j]]))
    print ('rel-err {:e}\n'.format(rel_error(embed_vals[i,j], embed_weights[test_batch[i,j]])))

(2, 17)
word[1,7] = 127
embed_vals[1,7] = [ 0.55097616  0.55314535]
embed_weight[127] = [ 0.55097616  0.55314535]
rel-err 0.000000e+00

(2, 9)
word[1,5] = 113
embed_vals[1,5] = [ 0.49023861  0.4924078 ]
embed_weight[113] = [ 0.49023861  0.4924078 ]
rel-err 0.000000e+00
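
The check above works because an embedding lookup is just row indexing into the weight matrix. A minimal NumPy sketch of the same idea, reusing the vocabulary size 231 and embedding dimension 2 from this notebook (the `ids` batch is made up for illustration):

```python
import numpy as np

vocab_size, embed_dim = 231, 2
weights = np.linspace(0.0, 1.0, vocab_size * embed_dim, dtype=np.float32)
weights = weights.reshape(vocab_size, embed_dim)

# hypothetical batch of 2 sequences with 2 word ids each (0 is <PAD>)
ids = np.array([[127, 0], [113, 5]])

# fancy indexing picks one row of `weights` per id: shape (2, 2, embed_dim)
embedded = weights[ids]
print(embedded.shape)
print(embedded[0, 0])
```

Note that the `<PAD>` id 0 also gets an embedding row; padding positions are not masked at this stage.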



# Implement encoder layer  
Given embed_input ($w_1,...,w_n$), we are ready to pass it through an RNN encoder. Since the sequence length is variable, we will use:

* [`tf.nn.dynamic_rnn`](https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn) to unroll the RNN encoder dynamically
* [`tf.contrib.rnn.BasicRNNCell`](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/BasicRNNCell) or [`tf.contrib.rnn.BasicLSTMCell`](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/BasicLSTMCell) to model a cell in our RNN

In [None]:
rnn_size = 4

enc_cell = tf.contrib.rnn.BasicRNNCell(rnn_size)
_, enc_state = tf.nn.dynamic_rnn(enc_cell, embed_input, dtype=tf.float32)

In [29]:
# print the names of all global variables
tvars = tf.global_variables()
sess.run(tf.global_variables_initializer())

for var in tvars:
    print(var.name)  # prints only the variable's name, not its value

EmbedSequence/embeddings:0
rnn/basic_rnn_cell/weights:0
rnn/basic_rnn_cell/biases:0


## Inspect variables
We look at our trainable variables:
* embedding-weights variable: **EmbedSequence/embeddings:0**
* rnn-weights variable: **rnn/basic_rnn_cell/weights:0**
* rnn-biases variable: **rnn/basic_rnn_cell/biases:0**

In [27]:
rnn_ew = [var for var in tvars if var.name == 'EmbedSequence/embeddings:0'][0]
rnn_w  = [var for var in tvars if var.name == 'rnn/basic_rnn_cell/weights:0'][0]
rnn_b  = [var for var in tvars if var.name == 'rnn/basic_rnn_cell/biases:0'][0]

# we should expect rnn_ew.shape = (vocab_size = 231, embed_dim = 2)
print ('rnn_ew has shape {}'.format(rnn_ew.get_shape().as_list()))

# we should expect rnn_w.shape = (embed_dim + rnn_size, rnn_size)
print (rnn_w.get_shape().as_list())

# we should expect rnn_b.shape = (rnn_size)
print (rnn_b.get_shape().as_list())

rnn_ew has shape [231, 2]
[6, 4]
[4]


## RNN encoder
Let's run the RNN encoder on some input data to verify that it follows these dynamics:
$$
h_0 = (0,\ldots,0) \in \mathbb{R}^H, x_t \in \mathbb{R}^D, W \in \mathbb{R}^{(D+H)\times H}, b \in \mathbb{R}^H
$$
with update rule
$$
h_t = \tanh\left( x_{t} \times W[0:D,:] +  h_{t-1}\times W[D:,:] + b\right)
$$

Below we implement this in two ways:
* naive: apply the formula above as written, with $W$ split into its input block and its recurrent block
* vectorized: loop over the time steps and use a single matrix product per step, by noticing that
$$
x_{t} \times W[0:D,:] +  h_{t-1}\times W[D:,:] = \mathrm{concat}(x_{t}, h_{t-1}) \times W
$$
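
The identity above is ordinary block-matrix multiplication. A small NumPy sketch with random values (the shapes follow this notebook, $D=2$ and $H=4$; the data is random and purely illustrative) confirms it numerically:

```python
import numpy as np

rng = np.random.RandomState(0)
N, D, H = 2, 2, 4                 # batch size, embedding dim, rnn size
x = rng.randn(N, D)               # input at one time step
h_prev = rng.randn(N, H)          # previous hidden state
W = rng.randn(D + H, H)           # stacked weights: input block on top

# split form: separate products for the input and recurrent blocks
split_form = x.dot(W[:D]) + h_prev.dot(W[D:])
# fused form: concatenate along the feature axis, one product
fused_form = np.concatenate((x, h_prev), axis=1).dot(W)

print(np.allclose(split_form, fused_form))
```

The two forms agree up to floating-point rounding, which is why a single stacked weight matrix of shape $(D+H)\times H$ is enough.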

In [98]:
# let's run the rnn now; we keep only the first 2 time steps so it is easier to verify
seq_in = batch_datas[0][:,0:2]
enc_in  = sess.run(embed_input, feed_dict={input_data : seq_in})
enc_out = sess.run(enc_state, feed_dict={input_data : seq_in})
print ('encoder input:  {}'.format(enc_in.shape))
print ('encoder output: {}'.format(enc_out.shape))

w_v = rnn_w.eval()
b_v = rnn_b.eval()
print ('rnn_w has shape         {}'.format(w_v.shape))
print ('encoder input has shape {}'.format(enc_in.shape))
print ('\nencoder output\n{}'.format(enc_out))

D = src_embed_dim
H = rnn_size

# naive implementation
h0 = np.zeros((batch_size, H), dtype=np.float32)
h1 = np.tanh(enc_in[:,0,:].dot(w_v[0:D,:]) + h0.dot(w_v[D:,:]) + b_v)
h2 = np.tanh(enc_in[:,1,:].dot(w_v[0:D,:]) + h1.dot(w_v[D:,:]) + b_v)
print ('\nnaive compute rnn\n{}'.format(h2))
print ('\nrel-error: {:e}'.format(rel_error(enc_out, h2)))

# vectorized implementation
trans_enc_in = np.transpose(enc_in, [1,0,2])
seq_len = trans_enc_in.shape[0]
h = np.zeros((batch_size, H), dtype=np.float32)
for i in range(seq_len):
    x_h = np.concatenate((trans_enc_in[i], h), axis=1)
    h = np.tanh(x_h.dot(w_v) + b_v)
    #h = np.tanh(trans_enc_in[i].dot(w_v[0:D,:]) + h.dot(w_v[D:,:]) + b_v)
    
print ('\nvectorized compute rnn\n{}'.format(h))
print ('\nrel-error: {:e}'.format(rel_error(enc_out, h)))

encoder input:  (2, 2, 2)
encoder output: (2, 4)
rnn_w has shape         (6, 4)
encoder input has shape (2, 2, 2)

encoder output
[[ 0.37476987  0.06455565  0.06910897 -0.32439587]
 [ 0.41001096  0.11829546  0.11236352 -0.36842257]]

naive compute rnn
[[ 0.37476984  0.06455564  0.06910896 -0.32439587]
 [ 0.41001099  0.11829546  0.11236351 -0.36842257]]

rel-error: 1.078092e-07

vectorized compute rnn
[[ 0.3747699   0.06455563  0.06910896 -0.32439584]
 [ 0.41001099  0.11829551  0.11236353 -0.36842257]]

rel-error: 1.889484e-07


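We can see that the RNN works as expected. The small relative errors (around $10^{-7}$) are at the level of float32 rounding; they probably arise because TensorFlow uses a different math backend (Eigen) than NumPy (e.g. MKL), so operations may be carried out in a different order.

The vectorized result also differs slightly from the naive implementation. We suspect this is again due to floating-point rounding, which depends on the order of operations.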
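
To illustrate why the order of operations matters, here is a small sketch showing that float32 addition is not associative, together with the float32 machine epsilon. The numbers are chosen for illustration and are unrelated to the network above.

```python
import numpy as np

a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1.0)

left = (a + b) + c    # a and b cancel exactly first, then 1 is added
right = a + (b + c)   # the 1 is lost: float32 spacing near 1e8 is 8

eps = np.finfo(np.float32).eps  # gap between 1.0 and the next float32
print(left, right, eps)
```

The two groupings give different answers, and the float32 epsilon is about $1.2\times 10^{-7}$, the same order of magnitude as the relative errors reported above.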

Now let's look at LSTM.

# RNN encoder with LSTM cell
Let's recall the update rule for the LSTM:
$$
\begin{aligned}
i_{t}&=\mathrm{sigm}(W_{i}x_{t}+U_{i}h_{t-1}+b_{i})\\
g_{t}&=\tanh(W_{g}x_{t}+U_{g}h_{t-1}+b_{g})\\
f_{t}&=\mathrm{sigm}(W_{f}x_{t}+U_{f}h_{t-1}+b_{f})\\
o_{t}&=\mathrm{sigm}(W_{o}x_{t}+U_{o}h_{t-1}+b_{o})\\
c_{t}&=f_{t}\circ c_{t-1}+i_{t}\circ g_{t}\\
h_{t}&=o_{t}\circ \tanh(c_{t})
\end{aligned}
$$
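
The four gates can be computed with one matrix product by stacking their parameters. The sketch below uses random per-gate matrices in the row-vector convention (`x.dot(W)` rather than $Wx$) and checks that the fused kernel of shape $(D+H)\times 4H$ gives the same result. The gate order `i, g, f, o` is an assumption made for this sketch, and no forget bias is added here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
N, D, H = 2, 2, 4
x, h_prev, c_prev = rng.randn(N, D), rng.randn(N, H), rng.randn(N, H)

gates = ['i', 'g', 'f', 'o']          # assumed gate order
W = {k: rng.randn(D, H) for k in gates}   # input weights per gate
U = {k: rng.randn(H, H) for k in gates}   # recurrent weights per gate
b = {k: rng.randn(H) for k in gates}

# reference: one gate at a time, following the equations
pre = {k: x.dot(W[k]) + h_prev.dot(U[k]) + b[k] for k in gates}
c_ref = sigmoid(pre['f']) * c_prev + sigmoid(pre['i']) * np.tanh(pre['g'])
h_ref = sigmoid(pre['o']) * np.tanh(c_ref)

# fused: stack [W; U] per gate, then place the gates side by side
kernel = np.concatenate(
    [np.concatenate((W[k], U[k]), axis=0) for k in gates], axis=1)
bias = np.concatenate([b[k] for k in gates])
fused = np.concatenate((x, h_prev), axis=1).dot(kernel) + bias
i, g, f, o = np.split(fused, 4, axis=1)
c_new = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
h_new = sigmoid(o) * np.tanh(c_new)

print(kernel.shape, np.allclose(c_new, c_ref), np.allclose(h_new, h_ref))
```

With $D=2$ and $H=4$ the fused kernel has shape `(6, 16)` and the bias has length 16.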

In [62]:
rnn_size = 4

lstm_cell = tf.contrib.rnn.BasicLSTMCell(rnn_size)
_, lstm_enc_state = tf.nn.dynamic_rnn(lstm_cell, embed_input, dtype=tf.float32)

In [124]:
# print the names of all global variables
tvars = tf.global_variables()
sess.run(tf.global_variables_initializer())

for var in tvars:
    print(var.name)  # prints only the variable's name, not its value
    
lstm_w  = [var for var in tvars if var.name == 'rnn/basic_lstm_cell/weights:0'][0]
lstm_b  = [var for var in tvars if var.name == 'rnn/basic_lstm_cell/biases:0'][0]

# we should expect lstm_w.shape = (embed_dim + rnn_size, 4*rnn_size)
print (lstm_w.get_shape().as_list())

# we should expect lstm_b.shape = (4*rnn_size)
print (lstm_b.get_shape().as_list())

EmbedSequence/embeddings:0
rnn/basic_rnn_cell/weights:0
rnn/basic_rnn_cell/biases:0
rnn/basic_lstm_cell/weights:0
rnn/basic_lstm_cell/biases:0
rnn/gru_cell/gates/weights:0
rnn/gru_cell/gates/biases:0
rnn/gru_cell/candidate/weights:0
rnn/gru_cell/candidate/biases:0
[6, 16]
[16]


## Run encoder with LSTM cell
We truncate the input to a few time steps and pass it through the LSTM encoder.

In [118]:
seq_in = batch_datas[0][:,0:3]
enc_in  = sess.run(embed_input, feed_dict={input_data : seq_in})
enc_out = sess.run(lstm_enc_state, feed_dict={input_data : seq_in})
print ('encoder input:  {}'.format(enc_in.shape))

print ('encoder output.c: {}'.format(enc_out.c.shape))
print ('encoder output.h: {}'.format(enc_out.h.shape))

w_v = lstm_w.eval()
b_v = lstm_b.eval()
print (w_v.shape)
print (enc_in[0,0,:]) 
print ('\nencoder output.c\n{}'.format(enc_out.c))
print ('\nencoder output.h\n{}'.format(enc_out.h))

encoder input:  (2, 3, 2)
encoder output.c: (2, 4)
encoder output.h: (2, 4)
(6, 16)
[ 0.29934925  0.30151844]

encoder output.c
[[ 0.0770378   0.27204043 -0.00417575  0.33137965]
 [ 0.02871538  0.14283675  0.01768154  0.19223367]]

encoder output.h
[[ 0.03677867  0.13295256 -0.00200626  0.16160902]
 [ 0.01427472  0.07151565  0.00893414  0.09909989]]


## Re-implement LSTM
Let's verify that the LSTM follows the dynamics above by re-implementing the update rule.

In [121]:
def sigmoid(x):
    return 1.0/(1.0 + np.exp(-x))

# vectorized implementation
trans_enc_in = np.transpose(enc_in, [1,0,2])

seq_len = trans_enc_in.shape[0]
h = np.zeros((batch_size, H), dtype=np.float32)
c = np.zeros((batch_size, H), dtype=np.float32)

# forget_bias is added by TensorFlow's BasicLSTMCell (default 1.0) to reduce forgetting at the beginning of training
forget_bias = 1.0
# loop over time steps; use t so the index is not shadowed by the input gate i
for t in range(seq_len):
    x_h = np.concatenate((trans_enc_in[t], h), axis=1)
    i_g_f_o = x_h.dot(w_v) + b_v
    i,g,f,o = np.split(i_g_f_o, 4, axis=1)
    c = sigmoid(f + forget_bias)*c + sigmoid(i)*np.tanh(g)
    h = sigmoid(o)*np.tanh(c)
    
print ('\nre-compute c\n{}\n\nrel-error = {:e}'.format(c, rel_error(c, enc_out.c)))
print ('\nre-compute h\n{}\n\nrel-error = {:e}'.format(h, rel_error(h, enc_out.h)))


re-compute c
[[ 0.0770378   0.27204043 -0.00417575  0.33137965]
 [ 0.02871538  0.14283675  0.01768156  0.19223371]]

rel-error = 6.320640e-07

re-compute h
[[ 0.03677867  0.13295257 -0.00200626  0.16160905]
 [ 0.01427472  0.07151565  0.00893415  0.09909992]]

rel-error = 6.254585e-07


# RNN with GRU cell
An LSTM cell usually handles long sequences better than a simple RNN cell, because its cell state retains memory across time steps and mitigates the vanishing-gradient problem. However, an LSTM requires more computation and memory, since we need to compute both $h$ and $c$. The GRU cell has become popular because it offers similar gating behavior to the LSTM at a lower computational cost. Let's look at the GRU cell.

$$
\begin{aligned}
r_t &= \mathrm{sigm}\left(W_rx_t+U_rh_{t-1}+b_r\right)\\
u_t &= \mathrm{sigm}\left(W_ux_t+U_uh_{t-1}+b_u\right)\\
c_t &= \tanh\left(W_cx_t + U_c(r_t\circ h_{t-1}) + b_c\right)\\
h_t &= u_t \circ h_{t-1} + (1-u_t)\circ c_t
\end{aligned}
$$
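
A single GRU step following these equations can be sketched in NumPy as below. The weight layout (reset and update gates side by side in one gate matrix, and a separate candidate matrix) is an assumption of this sketch, and `gru_step` is a hypothetical name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, w_gate, b_gate, w_cand, b_cand):
    # reset gate r and update gate u from the concatenated [x, h_prev]
    x_h = np.concatenate((x, h_prev), axis=1)
    r, u = np.split(sigmoid(x_h.dot(w_gate) + b_gate), 2, axis=1)
    # candidate state uses the reset-scaled previous state
    x_rh = np.concatenate((x, r * h_prev), axis=1)
    c = np.tanh(x_rh.dot(w_cand) + b_cand)
    # interpolate between the previous state and the candidate
    return u * h_prev + (1.0 - u) * c

N, D, H = 2, 2, 4
x, h_prev = np.ones((N, D)), np.ones((N, H))
# with all-zero parameters: r = u = 0.5 and c = tanh(0) = 0
h_new = gru_step(x, h_prev,
                 np.zeros((D + H, 2 * H)), np.zeros(2 * H),
                 np.zeros((D + H, H)), np.zeros(H))
print(h_new)
```

With all parameters set to zero, both gates equal 0.5 and the candidate is 0, so the new state is exactly half of the previous state.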

In [122]:
rnn_size = 4

gru_cell = tf.contrib.rnn.GRUCell(rnn_size)
_, gru_enc_state = tf.nn.dynamic_rnn(gru_cell, embed_input, dtype=tf.float32)

In [126]:
# print the names of all global variables
tvars = tf.global_variables()
sess.run(tf.global_variables_initializer())

for var in tvars:
    print(var.name)  # prints only the variable's name, not its value
    
gru_w_gate  = [var for var in tvars if var.name == 'rnn/gru_cell/gates/weights:0'][0]
gru_b_gate  = [var for var in tvars if var.name == 'rnn/gru_cell/gates/biases:0'][0]

gru_w_cand  = [var for var in tvars if var.name == 'rnn/gru_cell/candidate/weights:0'][0]
gru_b_cand  = [var for var in tvars if var.name == 'rnn/gru_cell/candidate/biases:0'][0]

# should have shape (embed_dim + rnn_size, 2*rnn_size)
print (gru_w_gate.get_shape().as_list())

# should have shape (2*rnn_size)
print (gru_b_gate.get_shape().as_list())

# should have shape (embed_dim + rnn_size, rnn_size)
print (gru_w_cand.get_shape().as_list())

# should have shape (rnn_size)
print (gru_b_cand.get_shape().as_list())

EmbedSequence/embeddings:0
rnn/basic_rnn_cell/weights:0
rnn/basic_rnn_cell/biases:0
rnn/basic_lstm_cell/weights:0
rnn/basic_lstm_cell/biases:0
rnn/gru_cell/gates/weights:0
rnn/gru_cell/gates/biases:0
rnn/gru_cell/candidate/weights:0
rnn/gru_cell/candidate/biases:0
[6, 8]
[8]
[6, 4]
[4]
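
The printed shapes also let us compare the cost of the three cells. A rough sketch that counts weights and biases for $D=2$ and $H=4$, treating each cell as a number of blocks of size $(D+H)\times H$ plus a bias of size $H$ (the helper name `cell_params` is hypothetical):

```python
D, H = 2, 4  # embedding dim and rnn size used in this notebook

def cell_params(num_blocks):
    # each block has a (D + H) x H weight matrix and a bias of length H
    return num_blocks * ((D + H) * H + H)

rnn_params = cell_params(1)    # one tanh layer
lstm_params = cell_params(4)   # gates i, g, f, o
gru_params = cell_params(3)    # gates r, u and the candidate
print(rnn_params, lstm_params, gru_params)
```

The GRU needs three blocks against four for the LSTM, which is consistent with the shapes `[6, 8]` + `[8]` + `[6, 4]` + `[4]` versus `[6, 16]` + `[16]`.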


In [132]:
seq_in = batch_datas[0][:,0:10]
enc_in  = sess.run(embed_input, feed_dict={input_data : seq_in})
enc_out = sess.run(gru_enc_state, feed_dict={input_data : seq_in})
print ('encoder input:  {}'.format(enc_in.shape))
print ('\nencoder output:\n{}'.format(enc_out))

# get value
w_gate = gru_w_gate.eval()
b_gate = gru_b_gate.eval()

w_cand = gru_w_cand.eval()
b_cand = gru_b_cand.eval()

# vectorized implementation
trans_enc_in = np.transpose(enc_in, [1,0,2])

seq_len = trans_enc_in.shape[0]
h = np.zeros((batch_size, H), dtype=np.float32)
for i in range(seq_len):
    x_h = np.concatenate((trans_enc_in[i], h), axis=1)
    r_u = sigmoid(x_h.dot(w_gate) + b_gate)
    r,u = np.split(r_u, 2, axis=1)
    x_r_h = np.concatenate((trans_enc_in[i], r*h), axis=1)
    c = np.tanh(x_r_h.dot(w_cand) + b_cand)
    h = u * h + (1.0 - u) * c

print('\nre-compute gru output:\n{}'.format(h))
print('\nrel-error: {:e}'.format(rel_error(enc_out, h)))

encoder input:  (2, 10, 2)

encoder output:
[[ 0.10027966 -0.64120585 -0.19857292 -0.67132753]
 [ 0.15945557 -0.7119996  -0.18807514 -0.70002747]]

re-compute gru output:
[[ 0.10027965 -0.64120579 -0.19857292 -0.67132741]
 [ 0.1594556  -0.71199954 -0.18807516 -0.70002747]]

rel-error: 9.345023e-08


# Conclusion
We have gone through the encoder layer in TensorFlow using RNN, LSTM, and GRU cells. We re-implemented each computation in NumPy to verify and understand TensorFlow's implementation.