# Sentiment Analysis with an RNN

In this notebook, I implement a recurrent neural network that performs sentiment analysis. The architecture for this network is described as follows.

We'll pass in words to an embedding layer and let the network learn the embedding table on its own. From the embedding layer, the new representations will be passed to two stacked recurrent layers (the code below uses GRU cells, sized by `lstm_size`). These add recurrent connections to the network so we can include information about the sequence of words in the data. Finally, the recurrent output feeds an output layer with a single unit and a sigmoid activation function.

We only care about the sigmoid output at the very last time step, so we can ignore the rest. We'll calculate the cost from the output of the last step and the training label.

In [1]:
import numpy as np
import tensorflow as tf



In [2]:
with open('reviews.txt', 'r') as f:
    reviews = f.read()
with open('labels.txt', 'r') as f:
    labels = f.read()

In [3]:
reviews[:2000]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   \nstory of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is tu

## Data preprocessing

We'll want to get rid of those periods. Also, you might notice that the reviews are delimited with newlines `\n`. To deal with those, I'm going to split the text into individual reviews using `\n` as the delimiter. Then I can combine all the reviews back together into one big string.

First, let's remove all punctuation. Then get all the text without the newlines and split it into individual words.

In [4]:
from string import punctuation
all_text = ''.join([c for c in reviews if c not in punctuation])
reviews = all_text.split('\n')

all_text = ' '.join(reviews)
words = all_text.split()
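
To make the effect of this cell concrete, here is a minimal sketch that runs the same steps on a tiny input. The `sample` string and the `sample_*` names are hypothetical and only stand in for the contents of `reviews.txt`.

```python
from string import punctuation

# Hypothetical two-review sample in the same newline-delimited format
sample = "a great film . loved it !\nnot good , not bad .\n"

# Drop every punctuation character, then split into one string per review
no_punct = ''.join(c for c in sample if c not in punctuation)
sample_reviews = no_punct.split('\n')

# Rejoin with spaces and split on whitespace to get a flat word list
sample_words = ' '.join(sample_reviews).split()

print(sample_reviews)
print(sample_words)
```

Note that a trailing newline leaves an empty string as the last entry of `sample_reviews`, which is one way a zero-length review can end up in the data.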

In [5]:
all_text[:2000]

'bromwell high is a cartoon comedy  it ran at the same time as some other programs about school life  such as  teachers   my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers   the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students  when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled          at           high  a classic line inspector i  m here to sack one of your teachers  student welcome to bromwell high  i expect that many adults of my age think that bromwell high is far fetched  what a pity that it isn  t    story of a man who has unnatural feelings for a pig  starts out with a opening scene that is a terrific example of absurd comedy  a formal orchestra audience is turned into an insane  violent m

### Encoding the words

The embedding lookup requires that we pass in integers to our network. The easiest way to do this is to create dictionaries that map the words in the vocabulary to integers. Then we can convert each of our reviews into integers so they can be passed into the network.

In [6]:
from collections import Counter
counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}

reviews_ints = []
for each in reviews:
    reviews_ints.append([vocab_to_int[word] for word in each.split()])
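
As a rough illustration of the mapping built above, the sketch below applies the same logic to a toy corpus. The `toy_*` names and the three short reviews are made up for the example.

```python
from collections import Counter

# Hypothetical toy corpus standing in for the full review text
toy_reviews = ["good good film", "bad film", "good"]
toy_words = ' '.join(toy_reviews).split()

# Most frequent words get the smallest integers; numbering starts at 1
toy_counts = Counter(toy_words)
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)
toy_vocab_to_int = {word: ii for ii, word in enumerate(toy_vocab, 1)}

# Convert each review into a list of integers
toy_ints = [[toy_vocab_to_int[w] for w in r.split()] for r in toy_reviews]

print(toy_vocab_to_int)
print(toy_ints)
```

Starting the numbering at 1 leaves 0 unused by any word, so it stays free for the zero padding applied to the feature matrix later in the notebook.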

### Encoding the labels

Our labels are "positive" or "negative". To use these labels in our network, we need to convert them to 0 and 1.

In [7]:
labels = labels.split('\n')
labels = np.array([1 if each == 'positive' else 0 for each in labels])

In [8]:
review_lens = Counter([len(x) for x in reviews_ints])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))

Zero-length reviews: 1
Maximum review length: 2514


First, remove the review with zero length from the `reviews_ints` list.

In [9]:
non_zero_idx = [ii for ii, review in enumerate(reviews_ints) if len(review) != 0]
len(non_zero_idx)

25000

In [10]:
reviews_ints = [reviews_ints[ii] for ii in non_zero_idx]
labels = np.array([labels[ii] for ii in non_zero_idx])

In [11]:
seq_len = 200
features = np.zeros((len(reviews_ints), seq_len), dtype=int)
for i, row in enumerate(reviews_ints):
    features[i, -len(row):] = np.array(row)[:seq_len]
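
The cell above turns the variable-length reviews into a fixed-width matrix: reviews shorter than `seq_len` are left-padded with zeros, and longer ones are cut to their first `seq_len` words. A small sketch of the same indexing, using hypothetical `toy_*` values and a window of 5 instead of 200:

```python
import numpy as np

# Hypothetical encoded reviews: one shorter and one longer than the window
toy_ints = [[5, 3, 8], [1, 2, 3, 4, 5, 6, 7]]
toy_seq_len = 5

toy_features = np.zeros((len(toy_ints), toy_seq_len), dtype=int)
for i, row in enumerate(toy_ints):
    # Short rows are right-aligned, leaving zeros on the left;
    # long rows keep only their first toy_seq_len entries
    toy_features[i, -len(row):] = np.array(row)[:toy_seq_len]

print(toy_features)
```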

## Training, Validation, Test



In [12]:
split_frac = 0.8
split_idx = int(len(features)*split_frac)
train_x, val_x = features[:split_idx], features[split_idx:]
train_y, val_y = labels[:split_idx], labels[split_idx:]

test_idx = int(len(val_x)*0.5)
val_x, test_x = val_x[:test_idx], val_x[test_idx:]
val_y, test_y = val_y[:test_idx], val_y[test_idx:]

print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))

			Feature Shapes:
Train set: 		(20000, 200) 
Validation set: 	(2500, 200) 
Test set: 		(2500, 200)


With train, validation, and test fractions of 0.8, 0.1, 0.1, the final shapes should look like:
```
                    Feature Shapes:
Train set: 		 (20000, 200) 
Validation set: 	(2500, 200) 
Test set: 		  (2500, 200)
```

## Build the graph

Here I list the parameters for training the network:

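* `lstm_size`: Number of units in each recurrent layer. The name is kept from the LSTM version, but the cells built below are GRU cells.
* `batch_size`: The number of reviews to feed the network in one training pass. Typically this should be set as high as you can go without running out of memory.
* `learning_rate`: Step size used by the optimizer.
* `embed_size`: Length of the embedding vector learned for each word.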

In [13]:
lstm_size = 200
batch_size = 250
learning_rate = 0.001
embed_size = 200 

The words are passed to an embedding layer, and the network learns the embedding table on its own. From the embedding layer, the new representations are passed to two stacked recurrent layers (GRU cells in the code below). These add recurrent connections to the network so we can include information about the sequence of words in the data. Finally, the recurrent output feeds an output layer with a single unit and a sigmoid activation function.

We only care about the sigmoid output at the very last time step, so we can ignore the rest. We'll calculate the cost from the output of the last step and the training label.
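
To show what the embedding layer does with integer inputs, here is a minimal NumPy sketch. It is not the TensorFlow implementation, only the same row-lookup idea, and the sizes and `toy_*` names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 6 vocabulary rows (row 0 is padding), 4-dim vectors
toy_n_words, toy_embed_size = 6, 4
toy_embedding = rng.uniform(-1, 1, size=(toy_n_words, toy_embed_size))

# A batch of 2 reviews, each 3 tokens long, already integer-encoded
toy_inputs = np.array([[0, 2, 5],
                       [1, 1, 3]])

# Integer indexing pulls one embedding row per token
toy_embed = toy_embedding[toy_inputs]

print(toy_embed.shape)  # (batch, sequence length, embedding size)
```

Repeated tokens, such as the two 1s in the second review, map to the same row, which is what lets the network share what it learns about a word across every position where it appears.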

In [14]:
n_words = len(vocab_to_int) + 1
graph = tf.Graph()

with tf.device('/gpu:0'):
    with graph.as_default():
        inputs_ = tf.placeholder(tf.int32, [None, None], name='inputs')
        labels_ = tf.placeholder(tf.int32, [None, None], name='labels')
        keep_prob = tf.placeholder(tf.float32, name='keep_prob')
        
        embedding = tf.Variable(tf.random_uniform((n_words, embed_size), -1, 1))
        embed = tf.nn.embedding_lookup(embedding, inputs_)


        gru_layer_1 = tf.contrib.rnn.GRUCell(lstm_size)        
        dropout_layer_1 = tf.contrib.rnn.DropoutWrapper(gru_layer_1, output_keep_prob=keep_prob)
        
        gru_layer_2 = tf.contrib.rnn.GRUCell(lstm_size)
        dropout_layer_2 = tf.contrib.rnn.DropoutWrapper(gru_layer_2, output_keep_prob=keep_prob)
        

        # Stack the two GRU layers into one multi-layer cell
        stacked_all_layers = tf.contrib.rnn.MultiRNNCell([dropout_layer_1, dropout_layer_2])

        outputs, final_state = tf.nn.dynamic_rnn(stacked_all_layers, embed, dtype=tf.float32)

        predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)
        cost = tf.losses.mean_squared_error(labels_, predictions)

        optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

        correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels_)
        accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
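
The last few lines of the graph keep only the final time step, squash it through one sigmoid unit, and score it with mean squared error. The NumPy sketch below mirrors those steps on made-up numbers; the weights `w`, bias `b`, and all `toy_*` values are assumptions for illustration, not values from the trained network.

```python
import numpy as np

# Hypothetical RNN outputs: batch of 2, 3 time steps, 4 hidden units
toy_outputs = np.arange(24, dtype=float).reshape(2, 3, 4) / 24.0
toy_labels = np.array([[1], [0]])

# Keep only the final time step of each sequence: shape (2, 4)
last_step = toy_outputs[:, -1]

# Hypothetical dense weights for the single sigmoid unit
w = np.full((4, 1), 0.1)
b = 0.0
logits = last_step @ w + b
toy_predictions = 1.0 / (1.0 + np.exp(-logits))

# Mean squared error against the labels, and accuracy after rounding
toy_cost = np.mean((toy_labels - toy_predictions) ** 2)
toy_accuracy = np.mean(np.round(toy_predictions).astype(int) == toy_labels)

print(last_step.shape, toy_predictions.shape)
print(toy_cost, toy_accuracy)
```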

### Batching

This is a simple function for returning batches from our data. First it removes data such that we only have full batches. Then it iterates through the `x` and `y` arrays and returns slices out of those arrays with size `[batch_size]`.

In [15]:
def get_batches(x, y, batch_size=100):
    
    n_batches = len(x)//batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]
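
A quick way to see the "full batches only" behaviour is to run the generator on a small input. The function is repeated so the sketch runs on its own; the `toy_x` and `toy_y` lists are hypothetical.

```python
def get_batches(x, y, batch_size=100):
    # Trim the data so that only full batches remain
    n_batches = len(x) // batch_size
    x, y = x[:n_batches * batch_size], y[:n_batches * batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii + batch_size], y[ii:ii + batch_size]

# Hypothetical toy data: 10 samples with a batch size of 4
toy_x = list(range(10))
toy_y = [v % 2 for v in toy_x]

batches = list(get_batches(toy_x, toy_y, batch_size=4))
for bx, by in batches:
    print(bx, by)
```

With 10 samples and a batch size of 4, the last two samples are dropped and two batches are returned.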

## Training

Below is the typical training code. If you'd rather write the training loop yourself, feel free to delete this code and implement your own. Before you run this, make sure the `checkpoints` directory exists.
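
One simple way to make sure the directory is there, assuming the notebook runs from the folder where `checkpoints` should live:

```python
import os

# Create the directory used when saving the model;
# exist_ok=True means no error is raised if it already exists
os.makedirs("checkpoints", exist_ok=True)
```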

In [16]:
epochs = 5

with graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    iteration = 1
    for e in range(epochs):
        #state = sess.run(initial_state)
        
        for ii, (x, y) in enumerate(get_batches(train_x, train_y, batch_size), 1):
            feed = {inputs_: x,
                    labels_: y[:, None],
                    keep_prob: 0.7}
            loss, state, _ = sess.run([cost, final_state, optimizer], feed_dict=feed)
            
            if iteration%10==0:
                print("Epoch: {}/{}".format(e, epochs),
                      "Iteration: {}".format(iteration),
                      "Train loss: {:.3f}".format(loss))
            iteration +=1
            
        val_acc = []
        val_state = sess.run(stacked_all_layers.zero_state(batch_size, tf.float32))
        for x, y in get_batches(val_x, val_y, batch_size):
            feed = {inputs_: x,
                    labels_: y[:, None],
                    keep_prob: 1}
            batch_acc, val_state = sess.run([accuracy, final_state], feed_dict=feed)
            val_acc.append(batch_acc)
        print("Val acc: {:.3f}".format(np.mean(val_acc)))
    saver.save(sess, "checkpoints/sentiment.ckpt")

Epoch: 0/5 Iteration: 10 Train loss: 0.237
Epoch: 0/5 Iteration: 20 Train loss: 0.219
Epoch: 0/5 Iteration: 30 Train loss: 0.228
Epoch: 0/5 Iteration: 40 Train loss: 0.211
Epoch: 0/5 Iteration: 50 Train loss: 0.208
Epoch: 0/5 Iteration: 60 Train loss: 0.183
Epoch: 0/5 Iteration: 70 Train loss: 0.194
Epoch: 0/5 Iteration: 80 Train loss: 0.160
Val acc: 0.768
Epoch: 1/5 Iteration: 90 Train loss: 0.150
Epoch: 1/5 Iteration: 100 Train loss: 0.153
Epoch: 1/5 Iteration: 110 Train loss: 0.156
Epoch: 1/5 Iteration: 120 Train loss: 0.151
Epoch: 1/5 Iteration: 130 Train loss: 0.119
Epoch: 1/5 Iteration: 140 Train loss: 0.114
Epoch: 1/5 Iteration: 150 Train loss: 0.128
Epoch: 1/5 Iteration: 160 Train loss: 0.100
Val acc: 0.815
Epoch: 2/5 Iteration: 170 Train loss: 0.139
Epoch: 2/5 Iteration: 180 Train loss: 0.097
Epoch: 2/5 Iteration: 190 Train loss: 0.096
Epoch: 2/5 Iteration: 200 Train loss: 0.148
Epoch: 2/5 Iteration: 210 Train loss: 0.070
Epoch: 2/5 Iteration: 220 Train loss: 0.162
Epoch: 2/5 

## Testing

In [17]:
test_acc = []
with tf.Session(graph=graph) as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    test_state = sess.run(stacked_all_layers.zero_state(batch_size, tf.float32))
    for ii, (x, y) in enumerate(get_batches(test_x, test_y, batch_size), 1):
        feed = {inputs_: x,
                labels_: y[:, None],
                keep_prob: 1}
                #initial_state: test_state}
        batch_acc, test_state = sess.run([accuracy, final_state], feed_dict=feed)
        test_acc.append(batch_acc)
    print("Test accuracy: {:.3f}".format(np.mean(test_acc)))

INFO:tensorflow:Restoring parameters from checkpoints/sentiment.ckpt
Test accuracy: 0.821
