<a href="https://colab.research.google.com/github/nianlonggu/Tensorflow-Notebooks/blob/master/Tensorflow_Natural_Language_Processing_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this example, we learn how to use tensorflow to implement a simple RNN network for the task of movie comment analysis.<br>
A typical RNN project for text processing includes folllowing steps:<br>
**1. Tokenization**<br/>
**2. Padding & Truncation**<br/>
**3. Contruct RNNs**<br/>  &ensp;  3.1 Word embedding  <br/>
&ensp; 3.2 Add recurrent units <br/>
&ensp; 3.3 Config the output and define loss function <br/>
**4. Model Training & Testing** <br/>

In the next, the whole procedure is illustrated.

### Import necessary libraraies

In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer  # used for integer-tokenization of raw text
from tensorflow.keras.preprocessing.sequence import pad_sequences   #used for padding short sequences
from keras.datasets import imdb
import numpy as np

Using TensorFlow backend.


### Load data

In [2]:
try:
  (x_train, y_train ), ( x_test, y_test) = imdb.load_data()
except:
  print("numpy version doesn't fit, use np.load manually!")
  imdb_data = np.load("/root/.keras/datasets/imdb.npz", allow_pickle = True)
  x_train, y_train, x_test, y_test = imdb_data["x_train"], imdb_data["y_train"],\
                                      imdb_data["x_test"], imdb_data["y_test"]

y_train = np.expand_dims(y_train, -1)
y_test = np.expand_dims(y_test, -1)

Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz
numpy version doesn't fit, use np.load manually!


In [3]:
print(type(x_train))
print(type(x_train[0]))
print(x_train.shape)
print(y_train.shape)
print("x_train examples:")
print(x_train[0][:10])

<class 'numpy.ndarray'>
<class 'list'>
(25000,)
(25000, 1)
x_train examples:
[23022, 309, 6, 3, 1069, 209, 9, 2175, 30, 1]


### Tokenization

We notice that x_train is an ndarray of list, each list contains integer-tokenized value. We can use imdb.load_word_index() to get the index (integer representation) of each word, so we can convert tokenized sequence back into text sequence.

In [4]:
word_index = imdb.get_word_index()
## we get the inverse_word_index
inverse_word_index = {}
for key in word_index.keys():
  inverse_word_index[word_index[key]] = key
  

# define a helper function
def sequences_to_texts(token_sequences):
  # word indices is an ndarray of integer token list or a nested token list
  text_list = []
  for i in range(len(token_sequences)):
    text_list.append( [ inverse_word_index[ids] for ids in token_sequences[i] ]   )
  text_list = np.array(text_list)
  return text_list


x_train_text = sequences_to_texts(x_train)
x_test_text =  sequences_to_texts(x_test)


print("Train examples:\n", ' '.join(x_train_text[0]) )
print("Test examples:\n", " ".join(x_test_text[0]))

Downloading data from https://s3.amazonaws.com/text-datasets/imdb_word_index.json
Train examples:
 bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my 35 years in the teaching profession lead me to believe that bromwell high's satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers' pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled at high a classic line inspector i'm here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it isn't
Test examples:
 i went and saw this movie last night after being coaxed to by a few friends of mine i'll admit that i was reluctant to s

Suppose that we get the raw text first and we want to convert texts to integer token sequences, we can use Tokenizer class.

In [0]:
x_all_text = np.concatenate( [x_train_text, x_test_text] , axis =0)   # It is reasonable that we need to consider both training and test dataset when tokenization
tokenizer = Tokenizer(num_words = None)
tokenizer.fit_on_texts(x_all_text)    ## input can be ndarray or nested list
x_train_sequence = tokenizer.texts_to_sequences( x_train_text )
x_test_sequence = tokenizer.texts_to_sequences(x_test_text)

In [6]:

## we can look several examples of the word_index of this tokenizer. Note that this tokenizer.word_index is different from the imdb.get_word_index()
print("Tokenizer word index:\n",tokenizer.word_index )

Tokenizer word index:


In the following we still use x_train, y_train, x_test, y_test as the default sequences

### Padding and Truncation

Padding and truncation is to make sure every sequence in the training or test dataset has the same time length, so that during training we can fed a batch of data instead of single sample. <br/>
The common stragtegies of select length include:<br/>
&ensp; 1. select length as the maximum sequence length; <br/>
&ensp; 2. select length as a length which can cover the majority of the sequence lengths, for those senquences whose lengths exceed that threshold, just truncate them;

Here we try on the 2nd method.

During padding and truncation there are also 2 ways: "pre" or "post". Here we choose "pre", which means add 0s or truncating from the head of each sequence.


In [7]:
x_all_data = np.concatenate( [x_train, x_test], axis = 0 ).tolist()
sequence_len_list = np.array([len(seq) for seq in x_all_data ])
len_mean, len_std = np.mean(sequence_len_list), np.std(sequence_len_list)
print("Sequence length mean: ", len_mean)
print("Sequence length std: ", len_std)

Sequence length mean:  233.75892
Sequence length std:  172.91149458735703


The mean of sequence length $\mu=234$ and standard deviation $\sigma=173$.  If we assume that the distribution of length is Gaussian, then the confidence levels of $\mu\pm\sigma$, $\mu\pm2\sigma$, $\mu\pm3\sigma$ are 68.26%, 95.44%, and 99.74% respectively. Therefore, we can select the uniform sequence length as $\mu+2\sigma$, which is above around (95.44+100)/2 % = 97.72% sequences 

In [8]:
sequence_len = int(len_mean+2*len_std)
percentage_below_sequence_length = np.sum( sequence_len_list < sequence_len  )/len(sequence_len_list)
print("%.2f%%sequences have shorter length than %d"%(percentage_below_sequence_length, sequence_len))

0.95%sequences have shorter length than 579


After select the sequence_length, we use pad_sequences to pad and truncate all sequence from x_train and x_test

In [9]:
pad = "pre"
x_train_pad = pad_sequences( x_train.tolist(), maxlen = sequence_len, padding = pad, truncating = pad )  ## the input can be ndarray of lists or nested lists
x_test_pad = pad_sequences( x_test, maxlen = sequence_len, padding = pad, truncating =pad )
print(type(x_train_pad))
print(x_train_pad.shape)
print(x_train_pad[0])

<class 'numpy.ndarray'>
(25000, 579)
[    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0  

##Implement using tensorflow functions

###Construct RNNs

Both tensorflow and keras can be used to contruct RNNs. Although keras method is much simpler, tensorflow version can better reveal the network structure. So we first use tensorflow version to construct networks, and latter try keras version.

#### Word Embedding
As we showed above, each entry in x_train is a 579 vector. Each element in this vector is an integer token (word inddex) which corresponds to a single word. The task of find word embedding is to convert each integer into a float type vector. Ideally the euclidean/cosine distance between two embeddings can represent the similarity of two words, even if their integer tokens vary a lot. <br/>
To achieve this we can define a tensor which is a look-up table. Given each integer, we can fetch a unique feature vector for this integer token. Note: **This look-up table will be also optimzed during training.**

In [10]:
# first we need to define placeholder X and y
X= tf.placeholder( tf.int32, [None, sequence_len ]  )
y = tf.placeholder( tf.float32, [None,1] )

vocabulary_size, embedding_size =  len(word_index), 8
# This is a look-up table, the first dim is the vocabulary size, the second dim is the embedding dimension
word_embeddings = tf.get_variable("word_embeddings", [ vocabulary_size, embedding_size  ]  )  

# then we use the embedding_lookup method to get the word embedding given a batch of X
X_embedded = tf.nn.embedding_lookup( word_embeddings, X )

print(X_embedded)

Instructions for updating:
Colocations handled automatically by placer.
Tensor("embedding_lookup/Identity:0", shape=(?, 579, 20), dtype=float32)


So we notice that X_embedded has a shape of [batch_size,  sequence_length,  input_dimension ]. In some cases where we do not need to do embedding first, our placeholder should also have the sahpe of [None, sequence_length, input_dimension]

#### Add Recurrent Units
Here  we add two RNN units. This represents the deep RNN structure where multiple Recurrent Units are concatenated. The first RU takes X_embedded as the inout sequence, while the second RU takes the first RU's output as input sequence. It is like two units are concatented along the network layer  but are parallel along the time axis.


A further introduction for the  outputs, state =  tf.nn.dynamic_rnn( RNN_Cell, input_seq, dtype  )
for BasicRNN cell, the hidden state h  equals the last (non-zero) ouput. This hidden state is just the notation "a" in Andrew's course
for LSTMCell, the state is tuple,  state = ( c, h ) where the first varaible is the internal memory cell variable "c", and h is the hidden state varaible "a"


Moreover, since hidden state h always contains the last non-zero ouput, while outputs[-1] cound be 0, especially for the case where we pad 0's to the tail of short sequences. Therefore, "h" may represent the "real" last output of different sequences. That's why it is safer to **use state[-1] as the final output of the RNN cell** for further processing.

In [11]:
# num_units is the dimension of the output (the output "a" and the state variable "c")
basic_unit_1 = tf.nn.rnn_cell.BasicLSTMCell( num_units = 8, name = "ru1" )
basic_unit_2 = tf.nn.rnn_cell.BasicLSTMCell( num_units = 4, name = "ru2" )


# the outputs are a sequence of outputs at each time step, while the state is a tuple which contains the last output (a) and the state info (c)
outputs , state = tf.nn.dynamic_rnn( basic_unit_1, X_embedded, dtype = tf.float32 )
outputs , state = tf.nn.dynamic_rnn( basic_unit_2, outputs, dtype = tf.float32 )

print(outputs, state)

Instructions for updating:
This class is equivalent as tf.keras.layers.LSTMCell, and will be replaced by that in Tensorflow 2.0.
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
Tensor("rnn_1/transpose_1:0", shape=(?, 579, 8), dtype=float32) LSTMStateTuple(c=<tf.Tensor 'rnn_1/while/Exit_3:0' shape=(?, 8) dtype=float32>, h=<tf.Tensor 'rnn_1/while/Exit_4:0' shape=(?, 8) dtype=float32>)


#### Config the output and define loss function

In [19]:
logits = tf.layers.dense( state[-1], 1 )
pred_y = tf.nn.sigmoid( logits )
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits( labels = y, logits = logits ))

accuracy = tf.reduce_mean( tf.cast( tf.equal( tf.cast( pred_y > 0.5, tf.int32), tf.cast( y, tf.int32) ), tf.float32))

optimizer = tf.train.AdamOptimizer(learning_rate=1E-3).minimize(loss)

print(loss)

Tensor("Mean_2:0", shape=(), dtype=float32)


### Training

First we provide some util functions

In [0]:
def get_next_batch(x,y, batch_size):
  data_length = x.shape[0]
  selected_index = np.random.choice(  data_length, batch_size, replace = False )
  return x[selected_index], y[selected_index]
  

In [21]:
batch_size = 64
num_batches = x_train.shape[0] // batch_size
max_epochs = 20

# print(num_batches)

init = tf.global_variables_initializer()
with tf.Session() as sess:
  sess.run(init)
  
  for epoch in range(max_epochs):
    for batch in range( num_batches):
      x_train_batch, y_train_batch = get_next_batch( x_train_pad, y_train, batch_size )
      x_test_batch, y_test_batch = get_next_batch( x_test_pad, y_test, batch_size )
      sess.run( optimizer, feed_dict={ X:x_train_batch, y:y_train_batch }  )
      
      if batch%10 == 0:
        loss_train = loss.eval( feed_dict={X:x_train_batch, y:y_train_batch} )
        acc_train = accuracy.eval(feed_dict={ X:x_train_batch, y:y_train_batch })
        acc_test = accuracy.eval( feed_dict={ X:x_test_batch, y:y_test_batch } )
#         pred_y_test = pred_y.eval(feed_dict={ X:x_test_batch, y:y_test_batch })
#         pred_y_test = (pred_y_test>0.5).astype(np.int64)
#         real_y_test = y_test_batch.astype(np.int64)
#         acc = np.mean( pred_y_test == real_y_test )
    
        print("Epoch %.2f, Training loss: %.2f, Train Accuracy: %.2f, Test Accuracy: %.2f"%(epoch+batch/num_batches, loss_train, acc_train*100, acc_test*100))  
        
      
    
    
    
    

Epoch 0.00, Training loss: 0.69, Train Accuracy: 50.00, Test Accuracy: 48.44
Epoch 0.03, Training loss: 0.69, Train Accuracy: 50.00, Test Accuracy: 54.69
Epoch 0.05, Training loss: 0.69, Train Accuracy: 67.19, Test Accuracy: 59.38
Epoch 0.08, Training loss: 0.68, Train Accuracy: 79.69, Test Accuracy: 67.19
Epoch 0.10, Training loss: 0.66, Train Accuracy: 67.19, Test Accuracy: 50.00
Epoch 0.13, Training loss: 0.59, Train Accuracy: 70.31, Test Accuracy: 79.69
Epoch 0.15, Training loss: 0.53, Train Accuracy: 85.94, Test Accuracy: 73.44
Epoch 0.18, Training loss: 0.58, Train Accuracy: 76.56, Test Accuracy: 64.06
Epoch 0.21, Training loss: 0.52, Train Accuracy: 84.38, Test Accuracy: 78.12
Epoch 0.23, Training loss: 0.49, Train Accuracy: 79.69, Test Accuracy: 81.25
Epoch 0.26, Training loss: 0.40, Train Accuracy: 87.50, Test Accuracy: 79.69
Epoch 0.28, Training loss: 0.36, Train Accuracy: 87.50, Test Accuracy: 85.94
Epoch 0.31, Training loss: 0.40, Train Accuracy: 84.38, Test Accuracy: 81.25

KeyboardInterrupt: ignored

##Implement using Keras model
The procedure of constructing RNNs and Training can be easily implemented by Keras package

In [0]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense
from tensorflow.keras.optimizers import Adam

In [26]:
embedding_size = 8

# contruct RNNs
model = Sequential()
model.add( Embedding( input_dim =vocabulary_size, output_dim= embedding_size, input_length = sequence_len, name = "layer_embedding" )  )
model.add( GRU( units = 16, return_sequences = True  ) )
model.add( GRU( units = 8, return_sequences = True ) )
model.add( GRU( units = 4, return_sequences = False  ) )
model.add( Dense( 1, activation="sigmoid" ) )

# config training and testing
model.compile(  loss = "binary_crossentropy", optimizer = Adam(lr=1e-3), metrics = ["accuracy"] )
model.summary() ## not necessary

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
layer_embedding (Embedding)  (None, 579, 8)            708672    
_________________________________________________________________
gru_3 (GRU)                  (None, 579, 16)           1200      
_________________________________________________________________
gru_4 (GRU)                  (None, 579, 8)            600       
_________________________________________________________________
gru_5 (GRU)                  (None, 4)                 156       
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 5         
Total params: 710,633
Trainable params: 710,633
Non-trainable params: 0
_________________________________________________________________


In [27]:
model.fit(  x_train_pad, y_train, validation_split=0.05, epochs=3, batch_size= 64   )

Train on 23750 samples, validate on 1250 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7fa68d43a780>

In [29]:
## model.evaluate(x,y)  # evaluate on the test dataset using the provided metrics 
## model.predict(x)  # predict a batch of x given a batch of y 
results = model.evaluate( x_test_pad[:2000], y_test[:2000] )



In [32]:
results

[0.2573547387123108, 0.8985]

In [87]:
example_x=[
    
    "This film is the worst one I have ever seen!",
    "I cannot forget how many people are laughing when the actors are dancing",
    "Mediocre!",
    "Greatest ever!",
    "Wow what a great surprise this was. I was told by a friend this was good but it\'s been awhile since I liked a Keanu movie so I was hesitant to try it. Retired hit-man John Wick (Keanu Reeves) loses his wife to cancer. After her funeral he receives a puppy she left him. A few days later some thugs, led by the son of a Russian gangster John used to work for, break into John's house. They beat him up, take the keys to his beloved car, and kill the puppy. They did this not knowing who he was; they just wanted the car. Now John Wick is out for revenge and the Russian gangster is trying to save his son's life by sending killers after John. Keanu\'s great here. Glad to see him doing something watchable again. Willem Dafoe, Alfie Allen, Ian McShane, and Lance Reddick lead a good supporting cast. Michael Nyqvist was made to play villains. Even Adrianne Palicki was good. Oh and hey the beat-up guy from the Allstate commercials is in this. The stuff with the hotel for assassins and the way they all know each other was pretty funny. About the only problem I had with it was the unrealistic scene where the bad guy finally gets the upper hand on the 'hero' and doesn't kill him. This sort of thing is common in movies but it's always unbelievable and reminds me of the old James Bond villains. This is easily the best action movie this year. Possibly the best straight action movie since the first Taken. English-speaking action movies, that is. It doesn't reinvent the genre or anything but it's entertaining.",
    ]

my_tokenizer = Tokenizer()
my_tokenizer.word_index = word_index
example_x_tokens = my_tokenizer.texts_to_sequences( example_x )
example_x_tokens_pad = pad_sequences( example_x_tokens, maxlen = sequence_len, padding = pad, truncating= pad  )

example_pred_y = model.predict(example_x_tokens_pad)
print(example_pred_y)

[[0.08563638]
 [0.5451344 ]
 [0.64223486]
 [0.9524087 ]
 [0.9731012 ]]


#### Analysis of the word embedding

In [64]:
embedding_layer = model.get_layer("layer_embedding")
word_embeddings = embedding_layer.get_weights()[0]  # there could be multiple weight matrixs

print(word_embeddings.shape)

(88584, 8)


We can check the vocabulary size and value of tokens

In [65]:
token_list =[]
for key in word_index.keys():
  token_list.append(word_index[key])

print(min(token_list))
print(max(token_list))
print(len(token_list))

1
88584
88584


So word embeddings is a ndarray of the shape [ vocabulary_size, embedding_dimension ] (look-up matrix). We also notice that in the word_index the tokens are consecutively from 1 to vocabulary_size. Therefore, if we have a word "good" with a token=100, then the word embedding of "good" should be word_embeddings[100-1]

In [79]:
from scipy.spatial.distance import cdist

def parse_distance_of_word(word, k=10):
  
  embedding_of_current_word = word_embeddings[ word_index[word] -1]
  distances = cdist( word_embeddings, [embedding_of_current_word], metric = "cosine"  )[:,0].T.tolist()
  token_index = (np.arange( len(distances) )+1).tolist()
  sorted_token_index = [x for _,x in sorted( zip( distances, token_index ), key= lambda pair: pair[0]  )  ]
  sorted_word_list = [inverse_word_index[ids] for ids in sorted_token_index ]
  
  print("The %d closest words:"%(k)  )
  print(sorted_word_list[:k])
  
  print("The %d most different words:"%(k)  )
  print(sorted_word_list[-k:])
  

parse_distance_of_word("great")

The 10 closest words:
['great', 'fillums', "margheriti's", 'intimacies', 'onion', 'helumis', 'nandani', 'colada', 'farfella', 'testosterone']
The 10 most different words:
['subsequences', 'jaimie', '0000000000001', 'suoi', "'acting", 'bunch', 'forgery', "'futurama'", 'jarada', 'frescorts']
