## Sentiment Analysis of Reviews using RNNs in TensorFlow

Modified from original code here: https://github.com/adeshpande3/LSTM-Sentiment-Analysis/blob/master/Oriole%20LSTM.ipynb

#### Some imports to make code compatible with Python 2 as well as 3

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

In [2]:
import collections
import math
import os
import random
import tarfile
import re

In [3]:
from six.moves import urllib

In [4]:
import numpy as np
import matplotlib as mp
import matplotlib.pyplot as plt
import tensorflow as tf

In [5]:
print(np.__version__)
print(mp.__version__)
print(tf.__version__)

1.13.1
2.0.2
1.3.0


#### Download, unzip and untar files in an automated way

In [6]:
DOWNLOADED_FILENAME = 'ImdbReviews.tar.gz'

def download_file(url_path):
    if not os.path.exists(DOWNLOADED_FILENAME):
        filename, _ = urllib.request.urlretrieve(url_path, DOWNLOADED_FILENAME)

    print('Found and verified file from this path: ', url_path)
    print('Downloaded file: ', DOWNLOADED_FILENAME)

### Extract reviews and the corresponding positive and negative labels from the dataset

In [7]:
TOKEN_REGEX = re.compile("[^A-Za-z0-9 ]+")


def get_reviews(dirname, positive=True):
    label = 1 if positive else 0

    reviews = []
    labels = []
    for filename in os.listdir(dirname):
        if filename.endswith(".txt"):
            with open(os.path.join(dirname, filename), 'rb') as f:
                # Read raw bytes and decode explicitly so this works on Python 2 and 3
                review = f.read().decode('utf-8')
                review = review.lower().replace("<br />", " ")
                review = re.sub(TOKEN_REGEX, '', review)
                
                reviews.append(review)
                labels.append(label)
    
    return reviews, labels           

def extract_labels_data():
    # If the file has not already been extracted
    if not os.path.exists('aclImdb'):
        # The with-block closes the archive automatically
        with tarfile.open(DOWNLOADED_FILENAME) as tar:
            tar.extractall()
        
    positive_reviews, positive_labels = get_reviews("aclImdb/train/pos/", positive=True)
    negative_reviews, negative_labels = get_reviews("aclImdb/train/neg/", positive=False)

    data = positive_reviews + negative_reviews
    labels = positive_labels + negative_labels

    return labels, data

In [8]:
URL_PATH = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'

download_file(URL_PATH)

Found and verified file from this path:  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Downloaded file:  ImdbReviews.tar.gz


In [9]:
labels, data = extract_labels_data()

In [10]:
labels[:5]

[1, 1, 1, 1, 1]

In [11]:
data[:5]

[u'bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my 35 years in the teaching profession lead me to believe that bromwell highs satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled  at  high a classic line inspector im here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it isnt',
 u'homelessness or houselessness as george carlin stated has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school work or vote for the matter mo

In [12]:
len(labels), len(data)

(25000, 25000)

In [13]:
max_document_length = max([len(x.split(" ")) for x in data])
print(max_document_length)

2470


### How many words to consider in each review?

The majority of the reviews are shorter than 250 words. This is a number we chose based on some analysis of the data:

* Count the number of words in each file and divide by number of files to get an average i.e. **avg_words_per_file = total_words / num_files**
* Plot the words per file with matplotlib and try to find a number which covers a majority of the files
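
The two steps above can be sketched in plain Python. The toy reviews and the `fraction_within` helper are hypothetical stand-ins for the `data` list; this only illustrates the kind of analysis described and is not the code that produced the 250 figure.

```python
def word_counts(reviews):
    """Number of whitespace-separated tokens in each review."""
    return [len(review.split()) for review in reviews]


def fraction_within(counts, limit):
    """Share of reviews whose word count is at most `limit`."""
    return sum(1 for c in counts if c <= limit) / float(len(counts))


# Hypothetical toy reviews standing in for the IMDB `data` list.
sample_reviews = [
    "a short review",
    "this one is a little bit longer than the first",
    "great",
    "not good not bad just fine",
]

counts = word_counts(sample_reviews)
avg_words_per_file = sum(counts) / float(len(counts))
coverage = fraction_within(counts, 6)
print(avg_words_per_file, coverage)
```

On the real dataset, one would raise the limit until the coverage is acceptably high while keeping the sequences short enough to train quickly.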

All word embeddings have the same dimensionality, which you can specify. A document is represented as a sequence of word embeddings (one IMDB review is a document in this case).

* Each document should be of the **same length**; documents longer than MAX_SEQUENCE_LENGTH are truncated to this length
* Shorter documents are **padded** with a special symbol up to the same length
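
A minimal sketch of the truncate-or-pad rule, assuming documents have already been converted to lists of integer ids. `pad_or_truncate` and `PAD_ID` are hypothetical names used for illustration; the notebook itself relies on the vocabulary processor to do this.

```python
PAD_ID = 0  # assumed padding id; the encoded arrays in this notebook are padded with 0


def pad_or_truncate(ids, max_length, pad_id=PAD_ID):
    """Return a list of exactly `max_length` ids."""
    clipped = list(ids[:max_length])
    return clipped + [pad_id] * (max_length - len(clipped))


short_doc = pad_or_truncate([5, 8, 2], 5)
long_doc = pad_or_truncate([5, 8, 2, 9, 4, 7], 5)
print(short_doc)  # [5, 8, 2, 0, 0]
print(long_doc)   # [5, 8, 2, 9, 4]
```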

In [14]:
MAX_SEQUENCE_LENGTH = 250

### Vocabulary processor
 
http://tflearn.org/data_utils/
 
Library to map every word which occurs in our dataset to a unique identifier. If there are 10023 words, each will be assigned a unique id from 1 to 10023. Id 0 is used for padding, as the encoded arrays further down show.
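
The idea of the mapping can be sketched in a few lines of plain Python. `build_vocabulary` and `encode` are hypothetical helpers written for illustration, not the library's implementation; they assume ids are handed out in order of first appearance and that 0 is kept for padding and unseen words.

```python
def build_vocabulary(documents):
    """Assign ids 1, 2, 3, ... to words in order of first appearance."""
    vocabulary = {}
    for document in documents:
        for word in document.split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary) + 1  # id 0 stays free
    return vocabulary


def encode(document, vocabulary, unknown_id=0):
    """Replace every word by its id; unseen words map to `unknown_id`."""
    return [vocabulary.get(word, unknown_id) for word in document.split()]


docs = ["the film is good", "the plot is thin"]
vocab = build_vocabulary(docs)
encoded = encode("the plot is good", vocab)
print(encoded)  # [1, 5, 3, 4]
```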

In [15]:
vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(MAX_SEQUENCE_LENGTH)

#### Transform every word to a representation using unique ids

In [16]:
x_data = np.array(list(vocab_processor.fit_transform(data)))
y_output = np.array(labels)

vocabulary_size = len(vocab_processor.vocabulary_)
print(vocabulary_size)

111526


In [17]:
data[3:5]

[u'this is easily the most underrated film inn the brooks cannon sure its flawed it does not give a realistic view of homelessness unlike say how citizen kane gave a realistic view of lounge singers or titanic gave a realistic view of italians you idiots many of the jokes fall flat but still this film is very lovable in a way many comedies are not and to pull that off in a story about some of the most traditionally reviled members of society is truly impressive its not the fisher king but its not crap either my only complaint is that brooks should have cast someone else in the lead i love mel as a director and writer not so much as a lead',
 u'this is not the typical mel brooks film it was much less slapstick than most of his movies and actually had a plot that was followable leslie ann warren made the movie she is such a fantastic underrated actress there were some moments that could have been fleshed out a bit more and some scenes that could probably have been cut to make the room to

In [18]:
x_data[3:5]

array([[290,   3, 364,  10, 121, 365, 291, 366,  10, 168, 367, 368, 162,
        369,   7, 370, 243, 286,   4, 371, 372,  53,  92, 373, 374, 375,
        376, 377, 378,   4, 371, 372,  53, 379, 380,  93, 381, 378,   4,
        371, 372,  53, 382, 146, 383,  83,  53,  10, 384, 385, 386, 103,
        387, 290, 291,   3, 388, 389,  25,   4, 390,  83, 391, 238, 243,
         61,  30, 392,  32, 206,  25,   4, 393,  17,  14,  53,  10, 121,
        394, 395, 396,  53, 397,   3, 398, 399, 162, 243,  10, 400, 401,
        103, 162, 243, 402, 403,  22, 404, 405,   3,  32, 168, 285, 301,
        406, 407, 408,  25,  10,  28,  59, 252, 167,  13,   4, 409,  61,
        410, 243, 411,  35,  13,   4,  28,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0

In [19]:
x_data[:2]

array([[  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
         14,  15,  16,  17,  18,  19,  20,  13,  21,  22,  23,  24,  25,
         10,  26,  27,  28,  29,  30,  31,  32,   1,  33,  34,   3,  35,
         36,  30,  37,  38,   3,  21,  10,  39,  30,  40,  41,  10,  42,
         43,  44,  45,  46,  47,  48,  49,  50,  21,  51,  10,  52,  53,
         10,  54,  55,  56,  57,  29,  53,  10,  58,  59,  60,  61,  49,
         43,  62,  59,  63,  10,  64,  25,  65,   4,  66,  67,  68,  30,
         69,  70,  10,  18,  59,  71,  72,   9,   2,   4,  73,  74,  75,
         76,  77,  30,  78,  79,  53,  80,  21,  66,  81,  30,   1,   2,
         59,  82,  32,  83,  84,  53,  22,  85,  86,  32,   1,   2,   3,
         87,  88,  89,   4,  90,  32,   7,  91,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0

In [20]:
y_output[:5]

array([1, 1, 1, 1, 1])

#### Shuffle the data so the training instances are randomly fed to the RNN

In [21]:
np.random.seed(22)
shuffle_indices = np.random.permutation(np.arange(len(x_data)))

x_shuffled = x_data[shuffle_indices]
y_shuffled = y_output[shuffle_indices]

In [22]:
TRAIN_DATA = 5000
TOTAL_DATA = 6000

train_data = x_shuffled[:TRAIN_DATA]
train_target = y_shuffled[:TRAIN_DATA]

test_data = x_shuffled[TRAIN_DATA:TOTAL_DATA]
test_target = y_shuffled[TRAIN_DATA:TOTAL_DATA]

In [23]:
tf.reset_default_graph()

x = tf.placeholder(tf.int32, [None, MAX_SEQUENCE_LENGTH])
y = tf.placeholder(tf.int32, [None])

In [24]:
num_epochs = 20
batch_size = 25
embedding_size = 50
max_label = 2

### Embeddings to represent words

These embeddings are generated as a part of the training process of the RNN. The embeddings are trained using the reviews in the training dataset.

* *embedding_matrix*: a matrix which holds the embeddings for every word in the vocabulary. The values are learned during the training process
* *embeddings*: the embeddings for the words which are input as a part of one training batch
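
An embedding lookup is just row selection from the matrix. The NumPy sketch below uses toy sizes (assumptions, far smaller than the notebook's) to show how a batch of word ids becomes a batch of vectors with shape (batch, sequence length, embedding size).

```python
import numpy as np

vocabulary_size, embedding_size = 6, 4  # toy sizes chosen for illustration
rng = np.random.RandomState(0)
embedding_matrix = rng.uniform(-1.0, 1.0, size=(vocabulary_size, embedding_size))

# A batch of 2 documents, each 3 word ids long (0 is padding).
batch = np.array([[1, 2, 0],
                  [3, 3, 5]])

# Integer-array indexing selects one row of the matrix per word id.
embeddings = embedding_matrix[batch]
print(embeddings.shape)  # (2, 3, 4)
```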

In [25]:
embedding_matrix = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embeddings = tf.nn.embedding_lookup(embedding_matrix, x)

In [26]:
embedding_matrix

<tf.Variable 'Variable:0' shape=(111526, 50) dtype=float32_ref>

In [27]:
embeddings

<tf.Tensor 'embedding_lookup:0' shape=(?, 250, 50) dtype=float32>

In [28]:
lstmCell = tf.contrib.rnn.BasicLSTMCell(embedding_size)
lstmCell = tf.contrib.rnn.DropoutWrapper(cell=lstmCell, output_keep_prob=0.75)

### Results from an RNN of LSTM cells

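(output, (**final_state**, other_state_info))

We're interested in the final state of this RNN because that is the encoding we feed into the prediction layer of our neural network. For an LSTM cell the state is a pair (c, h) of cell state and hidden state, so unpacking the first element of the pair selects the cell state c.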
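
To make the relationship between the per-step outputs and the final state concrete, here is a simplified NumPy LSTM. It is a sketch under stated assumptions (a single sequence, random toy weights, no forget bias, no dropout) and not TensorFlow's implementation; `run_lstm` is a hypothetical helper.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def run_lstm(inputs, weights, bias, num_units):
    """Run one LSTM layer over `inputs` of shape (time, input_size).

    Returns the per-step outputs and the final (c, h) state pair.
    """
    c = np.zeros(num_units)
    h = np.zeros(num_units)
    outputs = []
    for x_t in inputs:
        gates = np.dot(np.concatenate([x_t, h]), weights) + bias
        # input gate, candidate values, forget gate, output gate
        i, j, f, o = np.split(gates, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(j)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    return np.array(outputs), (c, h)


rng = np.random.RandomState(0)
time_steps, input_size, num_units = 5, 3, 4  # toy sizes
inputs = rng.randn(time_steps, input_size)
weights = rng.randn(input_size + num_units, 4 * num_units) * 0.1
bias = np.zeros(4 * num_units)

outputs, (final_c, final_h) = run_lstm(inputs, weights, bias, num_units)
print(outputs.shape, final_c.shape, final_h.shape)  # (5, 4) (4,) (4,)
```

The last row of `outputs` equals `final_h`, while `final_c` is the internal memory that produced it. Either vector has one entry per LSTM unit and can serve as a fixed-size encoding of the whole sequence.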

In [32]:
_, (encoding, _) = tf.nn.dynamic_rnn(lstmCell, embeddings, dtype=tf.float32)

In [33]:
encoding

<tf.Tensor 'rnn_1/while/Exit_2:0' shape=(?, 50) dtype=float32>

#### A densely connected prediction layer

* *activation=None* because the softmax is applied inside tf.nn.sparse_softmax_cross_entropy_with_logits, which expects raw logits
* *cross_entropy* is the loss function used to compare the predicted probability distribution with the true label
* *max_label* is the number of outputs of the prediction layer; here it is 2, positive or negative
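
What the loss computes for a single example can be sketched in plain Python: a softmax over the raw logits followed by the negative log-probability of the true class. `sparse_softmax_cross_entropy` here is a hypothetical stand-alone function for illustration, not the TensorFlow op.

```python
import math


def sparse_softmax_cross_entropy(logits, label):
    """Cross-entropy of one example given raw logits and an integer class label."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits))
    return log_sum_exp - logits[label]


undecided = sparse_softmax_cross_entropy([0.0, 0.0], 1)
confident_right = sparse_softmax_cross_entropy([-2.0, 3.0], 1)
confident_wrong = sparse_softmax_cross_entropy([-2.0, 3.0], 0)
print(round(undecided, 3), round(confident_right, 3), round(confident_wrong, 3))
```

A model that cannot tell the two classes apart has a loss of ln 2, about 0.69, which is consistent with the test loss reported after the first epoch of training. Confident wrong predictions are penalised far more heavily than confident right ones are rewarded.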

In [34]:
logits = tf.layers.dense(encoding, max_label, activation=None)

In [35]:
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y)
loss = tf.reduce_mean(cross_entropy)

#### Find the output with the highest probability and compare against the true label

In [36]:
prediction = tf.equal(tf.argmax(logits, 1), tf.cast(y, tf.int64))
accuracy = tf.reduce_mean(tf.cast(prediction, tf.float32))
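
The same accuracy computation in plain Python, as a small sketch with made-up logits: take the index of the largest logit in each row, compare it with the true label, and average the matches. `batch_accuracy` is a hypothetical helper.

```python
def batch_accuracy(batch_logits, labels):
    """Fraction of rows whose largest logit sits at the true label's index."""
    predictions = [row.index(max(row)) for row in batch_logits]
    correct = [int(p == t) for p, t in zip(predictions, labels)]
    return sum(correct) / float(len(correct))


# Made-up logits for four reviews and their true labels.
toy_logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, 0.1], [1.2, 1.1]]
toy_labels = [0, 1, 0, 0]
toy_accuracy = batch_accuracy(toy_logits, toy_labels)
print(toy_accuracy)  # 0.75
```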

In [37]:
optimizer = tf.train.AdamOptimizer(0.01)
train_step = optimizer.minimize(loss)

In [38]:
init = tf.global_variables_initializer()

In [39]:
with tf.Session() as session:
    init.run()
    
    for epoch in range(num_epochs):
        
        # Ceiling division, so no empty batch is fed when batch_size divides the data evenly
        num_batches = (len(train_data) + batch_size - 1) // batch_size
        
        for i in range(num_batches):
            # Select train data
            min_ix = i * batch_size
            max_ix = np.min([len(train_data), ((i+1) * batch_size)])

            x_train_batch = train_data[min_ix:max_ix]
            y_train_batch = train_target[min_ix:max_ix]
            
            train_dict = {x: x_train_batch, y: y_train_batch}
            session.run(train_step, feed_dict=train_dict)
            
            train_loss, train_acc = session.run([loss, accuracy], feed_dict=train_dict)

        test_dict = {x: test_data, y: test_target}
        test_loss, test_acc = session.run([loss, accuracy], feed_dict=test_dict)    
        print('Epoch: {}, Test Loss: {:.2}, Test Acc: {:.5}'.format(epoch + 1, test_loss, test_acc)) 

Epoch: 1, Test Loss: 0.69, Test Acc: 0.49
Epoch: 2, Test Loss: 0.8, Test Acc: 0.505
Epoch: 3, Test Loss: 0.83, Test Acc: 0.602
Epoch: 4, Test Loss: 0.8, Test Acc: 0.731
Epoch: 5, Test Loss: 1.1, Test Acc: 0.759
Epoch: 6, Test Loss: 1.3, Test Acc: 0.774
Epoch: 7, Test Loss: 1.3, Test Acc: 0.796
Epoch: 8, Test Loss: 1.3, Test Acc: 0.797
Epoch: 9, Test Loss: 1.4, Test Acc: 0.799
Epoch: 10, Test Loss: 1.5, Test Acc: 0.809
Epoch: 11, Test Loss: 1.5, Test Acc: 0.813
Epoch: 12, Test Loss: 1.5, Test Acc: 0.813
Epoch: 13, Test Loss: 1.6, Test Acc: 0.813
Epoch: 14, Test Loss: 1.6, Test Acc: 0.814
Epoch: 15, Test Loss: 1.6, Test Acc: 0.819
Epoch: 16, Test Loss: 1.7, Test Acc: 0.82
Epoch: 17, Test Loss: 1.7, Test Acc: 0.82
Epoch: 18, Test Loss: 1.8, Test Acc: 0.82
Epoch: 19, Test Loss: 1.8, Test Acc: 0.818
Epoch: 20, Test Loss: 1.9, Test Acc: 0.819
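
The mini-batch indexing used in the training loop can be sketched on its own. `batch_ranges` is a hypothetical helper; it uses ceiling division so that a final partial batch is included, and no empty batch is produced when the batch size divides the number of examples evenly.

```python
def batch_ranges(num_examples, batch_size):
    """Yield (start, end) index pairs that cover all examples exactly once."""
    num_batches = (num_examples + batch_size - 1) // batch_size  # ceiling division
    for i in range(num_batches):
        start = i * batch_size
        end = min(num_examples, start + batch_size)
        yield start, end


small = list(batch_ranges(10, 4))
print(small)                               # [(0, 4), (4, 8), (8, 10)]
print(len(list(batch_ranges(5000, 25))))   # 200
```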
