# Recurrent Neural Network for Sentiment Analysis

Adapted from http://nbviewer.jupyter.org/github/rasbt/pattern_classification/blob/master/machine_learning/scikit-learn/outofcore_modelpersistence.ipynb

## The IMDb Movie Review Dataset

In this section, we will train a recurrent neural network to classify movie reviews from the 50k IMDb review dataset collected by Maas et al.

> AL Maas, RE Daly, PT Pham, D Huang, AY Ng, and C Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.

[Source: http://ai.stanford.edu/~amaas/data/sentiment/]

The dataset consists of 50,000 movie reviews drawn from the original "train" and "test" subdirectories. The class labels are binary (1 = positive, 0 = negative), with 25,000 positive and 25,000 negative reviews.
For simplicity, I assembled the reviews in a single CSV file.


In [1]:
import re
import collections
import sys
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

In [2]:
df = pd.read_csv('shuffled_movie_data.csv')
df.tail()

Unnamed: 0,review,sentiment
49995,"OK, lets start with the best. the building. al...",0
49996,The British 'heritage film' industry is out of...,0
49997,I don't even know where to begin on this one. ...,0
49998,Richard Tyler is a little boy who is scared of...,0
49999,I waited long to watch this movie. Also becaus...,1


Let us inspect the first row of the data frame:

In [3]:
df.head().values[0]

array(['In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />"Murder in Greenwich" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and ric

## Generator

First, we define a generator that returns the document body and the corresponding class label:

In [4]:
def stream_docs(path):
    with open(path, 'r') as csv:
        next(csv) # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label

To confirm that the `stream_docs` function fetches the documents as intended, we can read the first document from the stream before implementing the `get_minibatch` function.
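
Such a check might look like the following sketch. It runs on a small temporary CSV file whose two rows are made up for illustration, and it repeats the `stream_docs` definition so that it is self-contained. On the real data one would pass `'shuffled_movie_data.csv'` instead.

```python
import os
import tempfile

def stream_docs(path):
    """Yield (text, label) pairs from a CSV whose lines end in ',<label>'."""
    with open(path, 'r') as csv:
        next(csv)  # skip header
        for line in csv:
            # the last three characters are ',', the label digit and the newline
            text, label = line[:-3], int(line[-2])
            yield text, label

# Hypothetical two-row file standing in for 'shuffled_movie_data.csv'
tmp = tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False)
tmp.write('review,sentiment\n')
tmp.write('"A wonderful film",1\n')
tmp.write('"Dull and far too long",0\n')
tmp.close()

# Exhaust the generator so the file is closed before it is removed
docs = list(stream_docs(tmp.name))
os.remove(tmp.name)

first_doc = docs[0]
print(first_doc)
```

Note that the text keeps its surrounding quotation marks, because the function slices the raw line instead of parsing the CSV.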

Having confirmed that the `stream_docs` function works, we now implement a `get_minibatch` function that fetches a specified number (`size`) of documents:

In [5]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    for _ in range(size):
        text, label = next(doc_stream)
        docs.append(text)
        y.append(label)
    return docs, y
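
As a quick illustration, the sketch below feeds `get_minibatch` from a small in-memory iterator, a hypothetical stand-in for `stream_docs`. It also shows that the function raises `StopIteration` when the stream holds fewer than `size` remaining documents, which a caller would need to handle.

```python
def get_minibatch(doc_stream, size):
    docs, y = [], []
    for _ in range(size):
        text, label = next(doc_stream)
        docs.append(text)
        y.append(label)
    return docs, y

# Hypothetical in-memory stream with three made-up documents
toy_stream = iter([('good', 1), ('bad', 0), ('fine', 1)])

docs, y = get_minibatch(toy_stream, size=2)
print(docs, y)

# Only one document is left, so asking for two more exhausts the stream
try:
    get_minibatch(toy_stream, size=2)
    exhausted = False
except StopIteration:
    exhausted = True
print('stream exhausted:', exhausted)
```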

## Preprocessing Text Data

Now, let us define a simple `tokenizer` that splits the text into individual word tokens. It uses a few regular expressions to remove HTML markup, expand common English contractions (for example, "don't" becomes "do not"), convert the text to lower case, and strip all non-word characters, which also removes punctuation and emoticons. Stop-word removal and stemming are not applied; a WordNet lemmatizer is left commented out in the code.

In [6]:
#from nltk.stem import WordNetLemmatizer
#wordnet_lemmatizer = WordNetLemmatizer()

def tokenizer(text):
    # Strip HTML tags such as <br />
    text = re.sub(r'<[^>]*>', '', text)

    # Expand common English contractions
    text = re.sub(r"it's", " it is", text)
    text = re.sub(r"that's", " that is", text)
    text = re.sub(r"'s", " 's", text)
    text = re.sub(r"'ve", " have", text)
    text = re.sub(r"won't", " will not", text)
    text = re.sub(r"don't", " do not", text)
    text = re.sub(r"can't", " can not", text)
    text = re.sub(r"cannot", " can not", text)
    # Any remaining "n't" is split off; its apostrophe is removed below,
    # so "didn't" ends up as the tokens "did", "n", "t"
    text = re.sub(r"n't", " n't", text)
    text = re.sub(r"'re", " are", text)
    text = re.sub(r"'d", " would", text)
    text = re.sub(r"'ll", " will", text)
    text = re.sub(r"'m", " am", text)

    # Lowercase and replace every run of non-word characters with a space
    text = re.sub(r'\W+', ' ', text.lower())

    #tokenized = [wordnet_lemmatizer.lemmatize(w) for w in text.split()]
    #return tokenized
    return text.split()

Let's give it a try:

In [7]:
tokenizer("This :) is a <br /> test! :-) and I'm not sure what will happens</br>")

['this',
 'is',
 'a',
 'test',
 'and',
 'i',
 'am',
 'not',
 'sure',
 'what',
 'will',
 'happens']

In [8]:
sentences = []
labels    = []
lenghts   = []

doc_stream = stream_docs('shuffled_movie_data.csv')

for idx, review in enumerate(doc_stream):
    list_of_words = tokenizer(review[0])
    sentences.append(list_of_words)
    labels.append(review[1])
    lenghts.append(len(list_of_words))
    sys.stdout.write('\r{:5.2f}%'.format(100*(idx+1)/50000))
sys.stdout.write('\rDone     \n\n')  

Done     



In [9]:
MAXLEN = max(lenghts)
print('Maximum number of words in a review :', MAXLEN)

assert len(sentences) == len(labels) == 50000

Maximum number of words in a review : 2507


In [10]:
MEAN_LEN = int(sum(lenghts)/len(lenghts))
STD_LEN  = (sum((x - MEAN_LEN)**2 for x in lenghts)/ len(lenghts))**0.5
print('MEAN LEN = ', MEAN_LEN)
print('STD  LEN = ', STD_LEN)

MEAN LEN =  236
STD  LEN =  174.83702022169103


In [11]:
from collections import Counter

def bag_words(reviews, vocabulary):
    """Build word -> index and index -> word dictionaries for the most frequent words."""
    all_words = []
    for review in reviews:
        all_words += review
    
    # Index 0 is reserved for out-of-vocabulary words
    count  = [('UNKNOWN', -1)]
    count += Counter(all_words).most_common(vocabulary - 1)
    
    word_dict = {word: i for i, (word, _) in enumerate(count)}
    
    return word_dict, dict(zip(word_dict.values(), word_dict.keys()))

In [12]:
def make_index_sentences(reviews, dictionary, max_len):
    """Map each review to a list of exactly max_len word indices."""
    ID_sentences = [] 
    for review in reviews:
        # Pad with 0, the index that is also used for unknown words
        ID_sentence = [0] * max_len
        # Reviews longer than max_len are truncated
        for lsen, word in enumerate(review[:max_len]):
            ID_sentence[lsen] = dictionary.get(word, 0)
        ID_sentences.append(ID_sentence)
    return ID_sentences

In [13]:
voc_size         = 10000

In [14]:
word_index, id_toWord = bag_words(sentences, voc_size)

In [15]:
MAX_LEN   = 400

In [16]:
IDreviews = make_index_sentences(sentences, word_index, MAX_LEN)
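
To make the encoding concrete, here is a compact sketch of the same two steps on toy data. `build_vocab` and `encode` are hypothetical, condensed stand-ins for `bag_words` and `make_index_sentences`, and the toy reviews are made up.

```python
from collections import Counter

def build_vocab(reviews, vocabulary):
    # Index 0 is reserved for 'UNKNOWN'; the other indices follow word frequency
    counts = Counter(w for review in reviews for w in review)
    words = ['UNKNOWN'] + [w for w, _ in counts.most_common(vocabulary - 1)]
    return {w: i for i, w in enumerate(words)}

def encode(review, word_dict, max_len):
    # Unknown words map to 0; short reviews are right-padded with 0,
    # long reviews are truncated to max_len
    ids = [word_dict.get(w, 0) for w in review[:max_len]]
    return ids + [0] * (max_len - len(ids))

toy_reviews = [['good', 'movie'],
               ['bad', 'movie', 'bad', 'movie', 'plot']]

vocab = build_vocab(toy_reviews, vocabulary=3)
encoded = [encode(r, vocab, max_len=4) for r in toy_reviews]

print(vocab)    # {'UNKNOWN': 0, 'movie': 1, 'bad': 2}
print(encoded)  # [[0, 1, 0, 0], [2, 1, 2, 1]]
```

In the first encoded review the leading 0 stands for the out-of-vocabulary word "good", while the trailing zeros are padding. The two cases share the same index, which is why the true sequence lengths are tracked separately.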

## Generator for training 

In [17]:
def get_training_set(sentences, labels, lenghts, batch_size):
    N = len(sentences)
    for i in range(0, N, batch_size):
        embeddings = np.array(sentences[i : i + batch_size], dtype = np.int32)
        batch_lebl = np.reshape(np.array(labels[i: i + batch_size] , dtype = np.int32), (-1, 1))
        seq_lenght = np.array(lenghts[i: i + batch_size], dtype = np.int32)
        # Reviews were truncated to the padded width, so cap the reported lengths as well
        seq_lenght = np.minimum(seq_lenght, embeddings.shape[1])
        yield embeddings, batch_lebl, seq_lenght

In [18]:
batch_size = 10
gen = get_training_set(IDreviews, labels, lenghts, batch_size)
batch1, batch2, batch3 = next(gen)
print('review_shape : ', batch1.shape, ', label_shape: ', batch2.shape,', seq_shape : ', batch3.shape)

review_shape :  (10, 400) , label_shape:  (10, 1) , seq_shape :  (10,)


## Recurrent Neural Network

In [None]:
import tensorflow as tf

def getWeights(shape):
    # Weight matrix drawn from a truncated normal distribution
    return tf.get_variable('W',
                            dtype = tf.float32,
                            shape = shape,
                            initializer = tf.truncated_normal_initializer(stddev=0.01))

def getBiases(shape):
    # Bias vector initialized to zero
    initVar = tf.constant(0.0, shape = shape, dtype = tf.float32)
    return tf.get_variable('b',
                            dtype = tf.float32,
                            initializer = initVar)

def RNN(input_rev, vocabulary_size, emb_size, n_hidden, batch_size, seq_max_len, seq_len, num_layers):  
    # Trainable word embeddings, looked up by word index
    embedding = tf.Variable(tf.random_uniform((vocabulary_size, emb_size), -1, 1))
    embed     = tf.nn.embedding_lookup(embedding, input_rev)
    
    lstms_fw = [tf.contrib.cudnn_rnn.CudnnCompatibleLSTMCell(n_hidden) for _ in range(num_layers)]
    cell_fw  = tf.contrib.rnn.MultiRNNCell(lstms_fw)

    lstms_bw = [tf.contrib.cudnn_rnn.CudnnCompatibleLSTMCell(n_hidden) for _ in range(num_layers)]
    cell_bw  = tf.contrib.rnn.MultiRNNCell(lstms_bw)
    
    # Note: dropout is part of the graph unconditionally, so it is also active at evaluation time
    cell_fw  = tf.nn.rnn_cell.DropoutWrapper(cell_fw, output_keep_prob=0.5)
    cell_bw  = tf.nn.rnn_cell.DropoutWrapper(cell_bw, output_keep_prob=0.5)
    
    outputs, states = tf.nn.bidirectional_dynamic_rnn(cell_fw, 
                                                      cell_bw,
                                                      embed,
                                                      sequence_length=seq_len,
                                                      dtype=tf.float32)
    
    # `outputs` is a (forward, backward) pair; only the forward outputs are used below
    output_fw, output_bw = outputs
    
    # Select the last valid time step of every sequence. The batch size is read
    # from the input tensor (the `batch_size` argument is kept only for
    # compatibility), so batches of any size can be fed.
    n_batch = tf.shape(input_rev)[0]
    index   = tf.range(0, n_batch) * seq_max_len + (seq_len - 1)
    outputs = tf.gather(tf.reshape(output_fw, [-1, n_hidden]), index)
    out     = tf.layers.dense(inputs=outputs, units=40)
    out     = tf.nn.dropout(out, keep_prob = 0.5)
    out     = tf.layers.dense(inputs=out, units=1)
    res     = tf.sigmoid(out, 'sigmoid')
    return res   
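
The index arithmetic that picks the last valid time step of each sequence can be checked in isolation. The NumPy sketch below uses made-up shapes and values instead of real LSTM outputs, and compares the flattened lookup with direct (batch, time) indexing.

```python
import numpy as np

batch_size, seq_max_len, n_hidden = 2, 4, 3

# Hypothetical RNN outputs of shape (batch, time, hidden), filled with 0..23
outputs = np.arange(batch_size * seq_max_len * n_hidden)
outputs = outputs.reshape(batch_size, seq_max_len, n_hidden)

# True (unpadded) length of each sequence
seq_len = np.array([2, 4])

# After flattening to (batch * time, hidden), row b * seq_max_len + (seq_len[b] - 1)
# holds the last valid time step of sequence b
index = np.arange(batch_size) * seq_max_len + (seq_len - 1)
last = outputs.reshape(-1, n_hidden)[index]

# The same rows, selected with direct (batch, time) indexing
direct = outputs[np.arange(batch_size), seq_len - 1]

print(index)  # [1 7]
print(last)   # [[ 3  4  5]
              #  [21 22 23]]
```

The lookup only stays inside the right sequence if every entry of `seq_len` lies between 1 and `seq_max_len`; a length larger than the padded width would point into the next sequence or past the end of the array.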

### Parameters

In [None]:
emb_size         = 100
num_hidden_units = 40
out_dim          = 1
number_of_layers = 2
seq_max_len      = MAX_LEN

### Hyperparameters

In [None]:
learning_rate  = 0.0025
batch_size     = 200
display_freq   = 10
training_steps = 20

### Graph

In [None]:
#X       = tf.placeholder(tf.float32, [None, seq_max_len, input_dim], name = 'input')
X       = tf.placeholder(tf.int32,   [None, seq_max_len], name='input')
seqLen  = tf.placeholder(tf.int32  , [None], name = 'seq_len')
y       = tf.placeholder(tf.int32,   [None, 1], name = 'labels')

with tf.variable_scope("RNN", reuse=tf.AUTO_REUSE):
    pred_out = RNN(X, voc_size, emb_size, num_hidden_units, batch_size, seq_max_len, seqLen, number_of_layers)

In [None]:
with tf.variable_scope("Train", reuse=tf.AUTO_REUSE):
    #cost       = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels = y, logits = pred_out))
    #assert y.shape == pred_out.shape
    
    cost       = tf.losses.mean_squared_error(y, pred_out)
    train_op   = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost)
    
    pred_class = tf.greater(pred_out,0.5)
    acc_mes    = tf.equal(pred_class, tf.equal(y,1), name = 'correct_pred')
    acc        = tf.reduce_mean(tf.cast(acc_mes, tf.float32), name='accuracy')
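
For intuition, the cost and accuracy defined above correspond to the following NumPy computation. The predictions and labels are invented for illustration, and the variable names simply mirror the graph nodes.

```python
import numpy as np

# Made-up sigmoid outputs and binary labels for a batch of four reviews
pred_out = np.array([[0.9], [0.2], [0.6], [0.4]])
y        = np.array([[1], [0], [0], [1]])

# Mean squared error between labels and predicted probabilities
cost = np.mean((y - pred_out) ** 2)

# A review counts as positive when the predicted probability exceeds 0.5
pred_class = pred_out > 0.5

# Fraction of reviews whose predicted class matches the label
acc = np.mean(pred_class == (y == 1))

print(cost, acc)
```

Here two of the four predictions fall on the wrong side of the 0.5 threshold, so the accuracy is 0.5, while the cost also reflects how far each probability is from its label.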

### Training

In [None]:
init     = tf.global_variables_initializer()

N = len(sentences)

cv_div  = 40000

X_train = IDreviews[:cv_div]
y_train = labels[:cv_div]
l_train = lenghts[:cv_div]

X_test  = IDreviews[cv_div:]
y_test  = labels[cv_div:]
l_test  = lenghts[cv_div:]

assert len(X_train) == len(y_train) == len(l_train) == cv_div
assert len(X_test)  == len(y_test)  == len(l_test)  == 50000 - cv_div

train_loss = []
test_acc   = []

fmt = 'epoch : {:4d}, training loss = {:4.3f}, testing accuracy = {:4.3f}'
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(init)
    #print('-------Training-------\n')
    for ep in range(1, training_steps + 1):
        gen = get_training_set(X_train, y_train, l_train, batch_size)
        loss_t = []
        for i in range(1, len(X_train) // batch_size + 1):
            x_batch , y_batch, seq_len_batch = next(gen)
            _, loss = sess.run([train_op, cost], feed_dict={X:x_batch, y:y_batch, seqLen: seq_len_batch})
            loss_t.append(loss)
        
        train_loss.append(sum(loss_t)/len(loss_t))
        
        gen_test = get_training_set(X_test, y_test, l_test, batch_size)
        acc_t    = []
        
        for i in range(len(X_test) // batch_size):
            x_batch , y_batch, seq_len_batch = next(gen_test)
            accuracy = sess.run([acc], feed_dict={X:x_batch, y:y_batch, seqLen: seq_len_batch})
            acc_t.append(accuracy[0])
        
        test_acc.append(sum(acc_t)/len(acc_t))
        print(fmt.format(ep, sum(loss_t)/len(loss_t), sum(acc_t)/len(acc_t)))

In [None]:
# Plot the training loss for each epoch

ep  = np.arange(1, training_steps + 1, 1)
fig, ax   = plt.subplots(figsize=(14, 8))
l1        = ax.plot(ep, train_loss)
ax.set(xlabel='Epoch', ylabel='Cost', title='Recurrent Neural Network - Training Loss')
ax.axis([0.0, training_steps + 0.5, 0.1, train_loss[0]+0.01])
#plt.legend([l1, l2],["Training","Validation"])
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(14, 8))
l2      = ax.plot( ep, test_acc)
ax.set(xlabel='Epoch', ylabel='Accuracy', title='Recurrent Neural Network - Testing Accuracy')
ax.axis([0.8, training_steps + 0.5, test_acc[0]-0.01, test_acc[-1]+0.01])
#plt.legend([l1, l2],["Training","Validation"])
plt.show()

## Testing

In [None]:
def predict_sentiment(sess, text):
    # Helper: tokenize one review, map it to padded word indices and run the network
    m = tokenizer(text)
    X_sam = np.array(make_index_sentences([m], word_index, MAX_LEN), dtype=np.int32)
    seq_len_sam = np.array([min(len(m), MAX_LEN)], dtype=np.int32)
    return sess.run(pred_out, feed_dict={X: X_sam, seqLen: seq_len_sam})[0, 0]

# NOTE: the training session was closed when its `with` block ended, so the
# trained weights are not available here. This session starts from freshly
# initialized weights; restore a checkpoint of the trained model to get
# meaningful predictions.
sess = tf.Session(graph=tf.get_default_graph())
sess.run(tf.global_variables_initializer())

predict_sentiment(sess, 'I loved this movie')

In [None]:
predict_sentiment(sess, 'This movie was great!')

In [None]:
predict_sentiment(sess, "I didn't like this movie")

In [None]:
predict_sentiment(sess, 'I did not like this movie')

In [None]:
predict_sentiment(sess, "I don't like this movie")

For complex sentences, the result is not correct:

In [None]:
predict_sentiment(sess, "I love the actor but the history was the worst, I don't recommend this one")