## Recurrent neural networks for sequence classification with TensorFlow Eager
----

In this tutorial, we are going to build a recurrent neural network (RNN) for sequence classification. 

The goal of the RNN will be to classify the sentiment of a movie review. We will be training this model using the IMDB sentiment analysis dataset ([dataset source](http://ai.stanford.edu/~amaas/data/sentiment/)).

To download the dataset, you just have to run the **get_datasets.sh** script in the **datasets** folder, from your terminal. This script will automatically download and extract the zipped file.

> **chmod +x datasets/get_datasets.sh**

> **datasets/get_datasets.sh**

This dataset comes with 25,000 movie reviews for training and 25,000 reviews for testing. Let's jump into the data analysis :)!

### Import useful libraries
----

In [1]:
# Data processing and pathnames handling
import pandas as pd
import glob

# Libraries for text processing
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
import re

# Import function to write data to tfrecords
from data_utils import text2tfrecords

# Import TensorFlow and TensorFlow Eager
import tensorflow as tf
import tensorflow.contrib.eager as tfe

[nltk_data] Downloading package punkt to /home/mada/nltk_data...
[nltk_data]   Package punkt is already up-to-date!



In [2]:
# Enable eager mode. Once activated it cannot be reversed! Run just once.
tfe.enable_eager_execution()

### Exploratory data analysis on the IMDB dataset
----

In [3]:
# Get the filenames of the positive/negative reviews we will use for training the RNN
train_pos_files = glob.glob('datasets/aclImdb/train/pos/*')
train_neg_files = glob.glob('datasets/aclImdb/train/neg/*')

# Concatenate both positive and negative reviews filenames
train_files = train_pos_files + train_neg_files

In [4]:
print('Number of positive reviews used for training: ', len(train_pos_files))
print('Number of negative reviews used for training: ', len(train_neg_files))

Number of positive reviews used for training:  12500
Number of negative reviews used for training:  12500


In [5]:
# Read a positive review
print('Example of a positive review:\n\n', open(train_pos_files[5],'r').read())

Example of a positive review:

 Rented the movie as a joke. My friends and I had so much fun laughing at it that I went and found a used copy and bought it for myself. Now when all my friends are looking for a funny movie I give them Sasquatch Hunters. It needs to be said though there is a rule that was made that made the movie that much better. No talking is allowed while the movie is on unless the words are Sasquatch repeated in a chant. I loved the credit at the end of the movie as well. "Thanks for the Jeep, Tom!" Whoever Tom is I say thank you because without your Jeep the movie may not have been made. In short a great movie if you are looking for something to laugh at. If you want a good movie maybe look for something else but if you don't mind a laugh at the expense of a man in a monkey suit grab yourself a copy.


In [6]:
# Read a negative review
print('Example of a negative review:\n\n', open(train_neg_files[7070],'r').read())

Example of a negative review:

 The Good: I liked this movie because it was the first horror movie I've seen in a long time that actually scared me. The acting wasn't too bad, and the "Cupid" killer was believable and disturbing.<br /><br />The Bad: The story line and plot of this movie is incredibly weak. There just wasn't much to it. The ways the killer killed his victims was very horrifying and disgusting. I do not recommend this movie to anyone who can not handle gore.<br /><br />Overall: A good scare, but a bad story.<br /><br />** out of *****


### Create a word vocabulary
----
As you can see in the example above, the reviews contain HTML tags, which we will have to remove during text processing. We will perform the following data cleaning tasks:

* Strip any HTML tags in the review
* Use the word tokenizer to extract the words from the review. Example:
    > **word_tokenize("I can't believe I wasted my time with this movie.")** &rarr; *['I', 'ca', "n't", 'believe', 'I', 'wasted', 'my', 'time', 'with', 'this', 'movie', '.']*
* Create a list of all the words in the reviews
* Replace words that appear fewer times than the minimum frequency with the **< Unknown >** token
* Add the **< Start >** and **< End >** tokens to the vocabulary
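
To make these steps concrete, here is a minimal sketch on two toy reviews. The `toy_reviews` list and the `simple_tokenize` helper are made up for illustration; the regex tokenizer is only a stand-in for NLTK's `word_tokenize` so that the snippet runs without extra downloads, and it splits contractions differently.

```python
import re
from collections import Counter

# Two toy "reviews" (hypothetical); the second contains the HTML line
# breaks that show up in the real dataset.
toy_reviews = [
    "A good scare, but a bad story.",
    "Good scare.<br /><br />Bad story, bad plot.",
]

def simple_tokenize(text):
    # Stand-in tokenizer (assumption): runs of word characters,
    # plus each punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

# Strip HTML tags, then tokenize and flatten into one list of words.
cleaned = [re.sub(r"<[^>]+>", " ", review) for review in toy_reviews]
words = [word for review in cleaned for word in simple_tokenize(review)]

# Keep only the words that appear at least `min_frequency` times.
min_frequency = 2
counts = Counter(words)
toy_vocabulary = [word for word, count in counts.items() if count >= min_frequency]

# Add the special tokens at the end of the vocabulary.
toy_vocabulary += ["Unknown_token", "Start_token", "End_token"]

print(toy_vocabulary)
# ['scare', ',', 'bad', 'story', '.', 'Unknown_token', 'Start_token', 'End_token']
```

Note that the counting is case-sensitive: `Good` and `good` are different words here, so neither reaches the minimum frequency.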

In [7]:
# List with all the reviews in the train dataset
reviews = []
for filename in train_files:
    with open(filename, 'r') as f:
        reviews.append(f.read())

# Remove HTML tags
reviews = [re.sub(r'<[^>]+>', ' ', review) for review in reviews]

# Tokenize each review
reviews = [word_tokenize(review) for review in reviews]

# Flatten nested list
reviews = [word for review in reviews for word in review]

In [8]:
# Compute the frequency of each word
word_frequency = pd.Series(reviews).value_counts()

# Keep only words that appear at least min_frequency times
min_frequency = 5
vocabulary = word_frequency[word_frequency >= min_frequency].index.tolist()

# Add the Unknown, Start and End tokens.
extra_tokens = ['Unknown_token', 'Start_token', 'End_token']
vocabulary += extra_tokens

print('Number of words in the vocabulary: ', len(vocabulary))

Number of words in the vocabulary:  34819


In [9]:
# Create a word2idx dictionary that maps each word to its index in the vocabulary
word2idx = {word: idx for idx, word in enumerate(vocabulary)}
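
To show how such a dictionary can be used, here is a small sketch that turns a tokenized review into a list of integer ids. The `encode_review` helper and the toy vocabulary are hypothetical; how `text2tfrecords` actually encodes the reviews is not shown in this section, so treat this as an assumption about the general idea rather than its implementation.

```python
def encode_review(tokens, word2idx):
    """Map tokens to integer ids and wrap the review in Start/End tokens."""
    unknown_id = word2idx["Unknown_token"]
    ids = [word2idx["Start_token"]]
    # Words missing from the vocabulary fall back to the Unknown token.
    ids += [word2idx.get(token, unknown_id) for token in tokens]
    ids.append(word2idx["End_token"])
    return ids

# Tiny made-up vocabulary, with the extra tokens at the end as above.
toy_vocabulary = ["movie", "great", ".", "Unknown_token", "Start_token", "End_token"]
toy_word2idx = {word: idx for idx, word in enumerate(toy_vocabulary)}

encoded = encode_review(["great", "Sasquatch", "movie", "."], toy_word2idx)
print(encoded)  # [4, 1, 3, 0, 2, 5]
```

Here `Sasquatch` is not in the toy vocabulary, so it is mapped to the id of `Unknown_token`.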

### Write data to TFRecords
----
I am lucky enough to have a computer with 32 GB of RAM, so holding the train and test datasets in memory is not a problem for me. However, I realize that some of you might have less RAM available, so I am going to make it easier for you :).

We are going to create two TFRecords files: one for training and one for testing. Then, we are going to read the data from disk in batches using the Dataset iterator.

This gives us two main advantages:
* the dataset does not have to fit in your RAM
* the variable-length sequences can be padded on the fly, within each batch

Honestly, many real-world datasets are too big to fit into memory, so I believe it is good practice to learn how to deal with such scenarios :).
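
If padding within a batch is new to you, here is a minimal pure-Python sketch of the idea. The `pad_batch` helper and the padding value `0` are assumptions made for illustration; in TensorFlow, `tf.data.Dataset` offers a `padded_batch` method that does this kind of padding for you.

```python
def pad_batch(sequences, pad_value=0):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    # Keep the original lengths so the padded positions can be ignored later.
    lengths = [len(seq) for seq in sequences]
    return padded, lengths

# Three encoded reviews of different lengths (made-up ids).
batch = [[4, 1, 3, 5], [4, 0, 5], [4, 1, 0, 2, 2, 5]]
padded, lengths = pad_batch(batch)

print(padded)   # [[4, 1, 3, 5, 0, 0], [4, 0, 5, 0, 0, 0], [4, 1, 0, 2, 2, 5]]
print(lengths)  # [4, 3, 6]
```

Each batch is only padded to the length of its own longest review, not to the longest review in the whole dataset. Since the padding value can collide with a real word index, keeping the original lengths is one way to tell padding apart from actual words.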

In [None]:
# Write train data to tfrecords. This might take a while (~10 minutes)
train_writer = tf.python_io.TFRecordWriter('datasets/aclImdb/train.tfrecords')
text2tfrecords(train_files, train_writer, word2idx, vocabulary)

In [12]:
# Get the filenames of the reviews we will use for testing the RNN 
test_pos_files = glob.glob('datasets/aclImdb/test/pos/*')
test_neg_files = glob.glob('datasets/aclImdb/test/neg/*')
test_files = test_pos_files + test_neg_files

# Write test data to tfrecords (~10 minutes)
test_writer = tf.python_io.TFRecordWriter('datasets/aclImdb/test.tfrecords')
text2tfrecords(test_files, test_writer, word2idx, vocabulary)

### TFDataset Iterator
----

In [13]:
train_dataset = tf.data.TFRecordDataset('datasets/aclImdb/train.tfrecords')