# NLP with TensorFlow part01
This notebook shows how to approach NLP tasks with TensorFlow, covering the following steps:
* word embeddings
* language model with an RNN


In [45]:
import tensorflow as tf
import numpy as np

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

import copy, sys, time
if '../common' not in sys.path:
    sys.path.insert(0, '../common')

import helper
from gradient_check import rel_error
source_path = '../common/data/small_vocab_en'
target_path = '../common/data/small_vocab_fr'
source_text = helper.load_data(source_path)
target_text = helper.load_data(target_path)


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Preprocessing data
The first step is to create lookup tables that map each word to an integer id and vice versa. We always add a few special tokens to the dictionary, e.g.
~~~~
CODES = {'<PAD>': 0, '<EOS>': 1, '<UNK>': 2, '<GO>': 3 }
~~~~

In [8]:
def create_lookup_tables(text, special_codes):
    vocab_to_int = copy.copy(special_codes)
    vocab = set(text.split())
    
    # start numbering after the special codes passed in (not the global CODES)
    for v_i, v in enumerate(vocab, len(special_codes)):
        vocab_to_int[v] = v_i

    int_to_vocab = {v_i: v for v, v_i in vocab_to_int.items()}
    return vocab_to_int, int_to_vocab

CODES = {'<PAD>': 0, '<EOS>': 1, '<UNK>': 2, '<GO>': 3 }
src_vocab_to_int, src_int_to_vocab = create_lookup_tables(source_text, CODES)
des_vocab_to_int, des_int_to_vocab = create_lookup_tables(target_text, CODES)
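
As a quick illustration of what these tables look like, here is a minimal self-contained sketch on a toy corpus. The toy text, the function name `create_lookup_tables_sorted`, and the `sorted(...)` call are assumptions made for this example and are not part of the notebook. Sorting only makes the ids reproducible, since iterating over a Python `set` of strings can give a different order from one run to the next.

```python
import copy

CODES = {'<PAD>': 0, '<EOS>': 1, '<UNK>': 2, '<GO>': 3}

def create_lookup_tables_sorted(text, special_codes):
    # Hypothetical variant of create_lookup_tables: same logic,
    # but the vocabulary is sorted so that ids are stable across runs.
    vocab_to_int = copy.copy(special_codes)
    vocab = sorted(set(text.split()))
    # Regular words are numbered after the special codes.
    for v_i, v in enumerate(vocab, len(special_codes)):
        vocab_to_int[v] = v_i
    int_to_vocab = {v_i: v for v, v_i in vocab_to_int.items()}
    return vocab_to_int, int_to_vocab

toy_text = 'new jersey is cold\nparis is warm'  # made-up corpus
toy_vocab_to_int, toy_int_to_vocab = create_lookup_tables_sorted(toy_text, CODES)
print(toy_vocab_to_int)
```

The four special codes keep ids 0 to 3, and the six distinct words receive ids 4 to 9 in alphabetical order.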

Given the lookup tables, we need to convert the text into ids.

In [19]:
def text_to_ids(text, vocab_to_int, append_eos = False):
    eos = []
    if append_eos:
        eos = [vocab_to_int['<EOS>']]
    
    sequence_ids = []
    for sent in text.split('\n'):
        sent_ids = [vocab_to_int[w] for w in sent.split()]
        if len(sent_ids) > 0:
            sequence_ids.append(sent_ids + eos)
    return sequence_ids

src_seq_ids = text_to_ids(source_text, src_vocab_to_int)
des_seq_ids = text_to_ids(target_text, des_vocab_to_int, append_eos=True)

i_max = np.argmax([len(s) for s in src_seq_ids])
i_min = np.argmin([len(s) for s in src_seq_ids])
print ('max len {:2d} at {}'.format(len(src_seq_ids[i_max]), i_max))
print ('min len {:2d} at {}'.format(len(src_seq_ids[i_min]), i_min))

max len 17 at 1
min len  3 at 5057
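
The `CODES` dictionary reserves an `<UNK>` entry, but `text_to_ids` as written raises a `KeyError` for any word missing from the table. That cannot happen here because the tables are built from the same text, but it would matter for unseen input. The sketch below is one possible way to handle it. The name `text_to_ids_with_unk` and the toy vocabulary are hypothetical and are not part of the notebook or of `helper`.

```python
# Toy lookup table (made up for this example); ids 0-3 are the special codes.
toy_vocab_to_int = {'<PAD>': 0, '<EOS>': 1, '<UNK>': 2, '<GO>': 3,
                    'paris': 4, 'is': 5, 'warm': 6, 'cold': 7}

def text_to_ids_with_unk(text, vocab_to_int, append_eos=False):
    # Hypothetical variant of text_to_ids: words that are not in the
    # table are mapped to <UNK> instead of raising KeyError.
    unk = vocab_to_int['<UNK>']
    eos = [vocab_to_int['<EOS>']] if append_eos else []
    sequence_ids = []
    for sent in text.split('\n'):
        sent_ids = [vocab_to_int.get(w, unk) for w in sent.split()]
        if sent_ids:  # skip empty lines, as the original does
            sequence_ids.append(sent_ids + eos)
    return sequence_ids

toy_ids = text_to_ids_with_unk('paris is warm\n\nberlin is cold',
                               toy_vocab_to_int, append_eos=True)
print(toy_ids)
```

Here `berlin` is not in the toy table, so it is encoded as id 2 (`<UNK>`), and the empty line in the middle is dropped.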


## Try word embedding with RNN
In this section, we implement the encoder part of the following schema:
<img src="images/encoder_decoder.png" width="600"/>

We will use the following helper functions:
* `helper.pad_sentence_batch` to pad all sentences in a batch to the same length
* [`tf.contrib.layers.embed_sequence`](https://www.tensorflow.org/api_docs/python/tf/contrib/layers/embed_sequence) to map every word id in a sequence to its embedding vector (this is only a lookup; no RNN is run at this step)
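
The source of `helper.pad_sentence_batch` is not shown in this notebook, so the following is only a sketch of what such a padding step might look like. The function `pad_batch` is a hypothetical stand-in, and it assumes that the padding id is 0 (`<PAD>` in `CODES`) and that each sentence is padded on the right up to the longest sentence in the batch.

```python
import numpy as np

def pad_batch(sentence_batch, pad_id=0):
    # Hypothetical stand-in for helper.pad_sentence_batch:
    # right-pad every sentence with pad_id up to the longest one.
    max_len = max(len(s) for s in sentence_batch)
    return [s + [pad_id] * (max_len - len(s)) for s in sentence_batch]

batch = pad_batch([[4, 5, 6], [7, 8]])
padded = np.array(batch)  # rectangular now, so it converts to a 2-D array
print(padded.shape)
print(padded)
```

After padding, the batch can be fed to a placeholder of shape `[None, None]` because every row has the same length.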

In [78]:
tf.reset_default_graph()

# create interactive session 
sess = tf.InteractiveSession()

# create data
input_data = tf.placeholder(tf.int32, shape = [None, None])
src_vocab_size = len(src_vocab_to_int)
src_embed_dim = 2

print ('source vocab-size: {}'.format(src_vocab_size))

# we create an initializer so we can control the initial embedding weights
embed_weights = np.linspace(0.0, 1.0, src_vocab_size * src_embed_dim,
                            dtype=np.float32).reshape(src_vocab_size, src_embed_dim)


embed_init = tf.constant_initializer(embed_weights)

# we create embedding
embed_input = tf.contrib.layers.embed_sequence(input_data, src_vocab_size, src_embed_dim, initializer=embed_init)

source vocab-size: 231


## Check embed layer
We now run the embedding layer and expect the **embed-outputs** to match the corresponding rows of **embed_weights**. We test only two batches with different sequence lengths.
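
The expectation stated above amounts to saying that the embedding layer is a row lookup in the weight matrix. A minimal NumPy sketch of that lookup, using toy sizes and a made-up padded batch rather than the notebook's data, could look like this:

```python
import numpy as np

vocab_size, embed_dim = 10, 2  # toy sizes, not the notebook's vocabulary
weights = np.linspace(0.0, 1.0, vocab_size * embed_dim,
                      dtype=np.float32).reshape(vocab_size, embed_dim)

# Made-up padded batch of word ids, shape (batch, seq_len); 0 is padding.
batch = np.array([[4, 5, 6], [7, 8, 0]])

# Integer-array indexing picks one row of `weights` per word id,
# giving shape (batch, seq_len, embed_dim).
embedded = weights[batch]
print(embedded.shape)
```

Each `embedded[i, j]` is exactly `weights[batch[i, j]]`, which is the property the cell below checks on the TensorFlow layer through `rel_error`.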

In [79]:
sess.run(tf.global_variables_initializer())

batch_size = 2
indices = [1, 5057]
for idx in indices:
    test_batch = np.array(helper.pad_sentence_batch(src_seq_ids[idx:idx+batch_size]))
    print (test_batch.shape)
    embed_vals = sess.run(embed_input, feed_dict={input_data:test_batch})
    seq_len = test_batch.shape[1]
    # draw random positions until we hit a non-padding word (id 0 is padding)
    w = 0
    while (w==0): 
        i = np.random.randint(batch_size)
        j = np.random.randint(seq_len)
        w = test_batch[i,j]
    print ('word[{},{}] = {}'.format(i, j, test_batch[i,j]))
    print ('embed_vals[{},{}] = {}'.format(i, j, embed_vals[i,j]))
    print ('embed_weight[{}] = {}'.format(test_batch[i,j], embed_weights[test_batch[i,j]]))
    print ('rel-err {:e}'.format(rel_error(embed_vals[i,j], embed_weights[test_batch[i,j]])))

(2, 17)
word[0,1] = 201
embed_vals[0,1] = [ 0.87201732  0.87418658]
embed_weight[201] = [ 0.87201732  0.87418658]
rel-err 0.000000e+00
(2, 9)
word[1,7] = 73
embed_vals[1,7] = [ 0.31670281  0.318872  ]
embed_weight[73] = [ 0.31670281  0.318872  ]
rel-err 0.000000e+00


## Implement encoder layer  
Given embed_input ($w_1,...,w_n$), we can now pass it through an RNN encoder. Since the sequence length is variable, we will use

* [`tf.nn.dynamic_rnn`](https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn) to unroll the RNN encoder over the sequence
* [`tf.contrib.rnn.BasicRNNCell`](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/BasicRNNCell) or [`tf.contrib.rnn.BasicLSTMCell`](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/BasicLSTMCell) to model a cell in our RNN
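
To make the recurrence concrete, here is a NumPy sketch of what a basic RNN encoder computes. It assumes, as I understand the TF 1.x `BasicRNNCell`, a `tanh` activation applied to a linear map of the concatenated `[input, previous state]`, starting from a zero state. The function `rnn_final_state` and the random toy tensors are hypothetical, and the sketch is not meant to reproduce TensorFlow's values.

```python
import numpy as np

def rnn_final_state(inputs, W, b):
    # inputs: (batch, seq_len, embed_dim)
    # W:      (embed_dim + rnn_size, rnn_size)
    # b:      (rnn_size,)
    batch_size, seq_len = inputs.shape[0], inputs.shape[1]
    h = np.zeros((batch_size, b.shape[0]), dtype=inputs.dtype)  # zero initial state
    for t in range(seq_len):
        # Concatenate the current input with the previous state, then
        # apply one shared linear map followed by tanh.
        xh = np.concatenate([inputs[:, t, :], h], axis=1)
        h = np.tanh(xh @ W + b)
    return h

rng = np.random.RandomState(0)
embed_dim, rnn_size = 2, 3
x = rng.randn(2, 4, embed_dim).astype(np.float32)  # toy embedded batch
W = rng.uniform(-0.5, 0.5, size=(embed_dim + rnn_size, rnn_size)).astype(np.float32)
b = np.zeros(rnn_size, dtype=np.float32)

h_final = rnn_final_state(x, W, b)
print(h_final.shape)
```

With `src_embed_dim = 2` and `rnn_size = 3`, this layout gives a weight matrix of shape `(5, 3)` and a bias of shape `(3,)`, which is consistent with the variables printed at the end of this section. Note also that the call to `tf.nn.dynamic_rnn` in the next cell does not pass `sequence_length`, so as far as I can tell padded positions are processed like ordinary words.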

In [80]:
rnn_size = 3

enc_cell = tf.contrib.rnn.BasicRNNCell(rnn_size)
_, enc_state = tf.nn.dynamic_rnn(enc_cell, embed_input, dtype=tf.float32)

tvars = tf.trainable_variables()
sess.run(tf.global_variables_initializer())

for var in tvars:
    print(var.name)  # prints only the variable's name, not its value

EmbedSequence/embeddings:0
rnn/basic_rnn_cell/weights:0
rnn/basic_rnn_cell/biases:0


In [85]:
rnn_w = [var for var in tvars if var.name == 'rnn/basic_rnn_cell/weights:0']
rnn_b = [var for var in tvars if var.name == 'rnn/basic_rnn_cell/biases:0']
print (sess.run(rnn_w[0]))
print (sess.run(rnn_b[0]))

[[-0.09072495  0.36186045  0.51128739]
 [ 0.09670514  0.46966249  0.58714062]
 [ 0.1749981   0.34577197  0.19017464]
 [-0.38420555 -0.26265156 -0.20659775]
 [-0.42786956  0.74334031  0.68960029]]
[ 0.  0.  0.]
