## Sentiment RNN

Dataset: IMDB movie reviews and labels (pos/neg).

An RNN that predicts the sentiment of a review from its words. The model uses an embedding layer to map each word in the review to an embedding vector. Those embeddings are passed to several LSTM cells, whose output feeds a sigmoid unit that predicts the sentiment.

Input: word -> one_hot_encoding_of_word -> embedding -> LSTM Recurrent Cells -> Output: Sigmoid
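The one-hot step in this pipeline is conceptual: multiplying a one-hot vector by the embedding matrix just selects one row, which is why an embedding lookup can take integer ids directly. A minimal NumPy sketch of that equivalence, with toy sizes and a made-up `word_id` chosen only for illustration:

```python
import numpy as np

# Toy sizes for illustration only (the notebook uses ~74K words and 300 dims).
vocab_size, embed_dim = 6, 3
rng = np.random.default_rng(0)
embedding_matrix = rng.uniform(-1.0, 1.0, size=(vocab_size, embed_dim))

word_id = 4  # hypothetical integer id of some word

# Explicit route: build the one-hot vector and multiply by the matrix.
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0
via_matmul = one_hot @ embedding_matrix

# Shortcut route: index the row directly, which is what a lookup does.
via_lookup = embedding_matrix[word_id]

print(np.allclose(via_matmul, via_lookup))  # True
```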

----------

<img src="assets/network_diagram.png" width=400px>


In [1]:
import numpy as np
import tensorflow as tf

In [2]:
with open('./reviews.txt', 'r') as f:
    reviews = f.read()

In [3]:
with open('./labels.txt', 'r') as f:
    labels = f.read().split('\n')

In [4]:
reviews[:1000]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   \nstory of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is tu

In [5]:
from string import punctuation
reviews = ''.join([c for c in reviews if c not in punctuation])
reviews = reviews.split('\n') #get rid of the \n character

word_set = set([word for word in (' '.join(reviews)).split()])


In [6]:
print(reviews[0])

bromwell high is a cartoon comedy  it ran at the same time as some other programs about school life  such as  teachers   my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers   the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students  when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled          at           high  a classic line inspector i  m here to sack one of your teachers  student welcome to bromwell high  i expect that many adults of my age think that bromwell high is far fetched  what a pity that it isn  t   


In [7]:
print(list(word_set)[:25])

['grouping', 'novellas', 'osiris', 'ruling', 'diagonal', 'entails', 'swanks', 'romulus', 'morissey', 'brillent', 'artworks', 'juries', 'teleprinter', 'adage', 'eschew', 'enclosed', 'teta', 'devotee', 'lune', 'weixler', 'section', 'silences', 'whistlestop', 'malozzie', 'bloss']


In [8]:
for review in reviews:
    if 'sequencesthe' in review:
        print(review)
        break

when i come on imdb boards  i  m always fed up when i see a  the worst movie ever  post  after watching this movie  i think that i am soon going to create my own post    br    br   the opening titlesgreat  some kind of lame zoom on a gas oven  yeah  focus on the fireexplosionsgreat action packed movie     br    br   the actorsi think that ice t is a cool rapper  even a nice actor  sometimes  i insist   sometimes   but the steven seagal like policeman he plays is    beyond the words  the rest of the cast is    well i don  t know where those actors were hired but jeez   i bet my dog would have been a much better actor than them    br    br   the plothijacking  original isn  t it    br    br   the action sequencesthe first shot of the movie is an explosion  i told myself  well  cool   at least there will be some nice pyrotechnics    i was dead wrong  the rest of the movie is mostly filled with low rent stock shots taken from the air force     br    br   the dialogs are hilarious  the musi

# Encoding words

The embedding lookup requires that we pass in integers to our network. The easiest way to do this is to create dictionaries that map the words in the vocabulary to integers. Then we can convert each of our reviews into integers so they can be passed into the network.

<b>Exercise:</b> Now you're going to encode the words with integers. Build a dictionary that maps words to integers. Later we're going to pad our input vectors with zeros, so make sure the integers start at 1, not 0. Also, convert the reviews to integers and store the reviews in a new list called reviews_ints.
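As a small illustration of the idea, here is a sketch on a toy corpus (the `toy_*` names and data are made up for this example). It sorts the vocabulary before numbering, which is an assumption on my part rather than a requirement of the exercise: iterating over a plain set of strings can give a different order in each Python session, so sorting keeps the mapping reproducible.

```python
# Toy corpus standing in for the real reviews (hypothetical data).
toy_reviews = ["best movie ever", "worst movie ever made"]

# Sort the vocabulary so the word -> id mapping is the same on every run.
toy_vocab = sorted(set(" ".join(toy_reviews).split()))

# Start ids at 1 so that 0 stays free for padding.
toy_vocab_to_int = {word: idx for idx, word in enumerate(toy_vocab, start=1)}

# Convert each review into a list of integer ids.
toy_reviews_ints = [[toy_vocab_to_int[w] for w in r.split()] for r in toy_reviews]

print(toy_vocab_to_int)   # {'best': 1, 'ever': 2, 'made': 3, 'movie': 4, 'worst': 5}
print(toy_reviews_ints)   # [[1, 4, 2], [5, 4, 2, 3]]
```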

In [9]:
#saving mapping table
import json
import os
if os.path.exists('vocab_to_int.json'):
    #reuse the saved mapping so ids stay consistent between sessions
    with open('vocab_to_int.json', 'r') as f:
        vocab_to_int = json.load(f)
    print('loaded vocab_to_int,  size: {}'.format(len(vocab_to_int)))
else:
    #ids start at 1 because 0 is reserved for padding
    vocab_to_int = {word: idx + 1 for idx, word in enumerate(word_set)}
    with open('vocab_to_int.json', 'w') as f:
        json.dump(vocab_to_int, f)

loaded vocab_to_int,  size: 74072


In [10]:
reviews_ints = [[vocab_to_int[word] for word in review.split()] for review in reviews]

In [11]:
labels = [1 if label == 'positive' else 0 for label in labels if len(label) > 0]

In [12]:
from collections import Counter
review_lens = Counter([len(x) for x in reviews_ints])
print("Zero length reviews: {}".format(review_lens[0]))
print("Maximum length reviews: {}".format(max(review_lens)))

Zero length reviews: 1
Maximum length reviews: 2514



Okay, a couple of issues here. We seem to have one review with zero length. And the maximum review length is way too many steps for our RNN. Let's truncate to 200 steps. For reviews shorter than 200 words, we'll pad with 0s. For reviews longer than 200 words, we can truncate them to the first 200 words.

Exercise: First, remove the review with zero length from the reviews_ints list.

In [13]:
reviews_ints = [rev for rev in reviews_ints if len(rev) > 0]


<b>Exercise:</b> Now, create an array features that contains the data we'll pass to the network. The data should come from reviews_ints, since we want to feed integers to the network. Each row should be 200 elements long. For reviews shorter than 200 words, left pad with 0s. That is, if the review is ['best', 'movie', 'ever'], [117, 18, 128] as integers, the row will look like [0, 0, 0, ..., 0, 117, 18, 128]. For reviews longer than 200 words, use only the first 200 words as the feature vector.

This isn't trivial and there are a bunch of ways to do this. But, if you're going to be building your own deep learning networks, you're going to have to get used to preparing your data.
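To make the padding rule concrete, here is a small sketch with a sequence length of 5 instead of 200. The helper name `left_pad` and the toy rows are made up for this example; it is one of several ways to do it, not the only one.

```python
import numpy as np

def left_pad(rows, seq_len):
    """Left-pad (or truncate) each integer list to exactly seq_len entries."""
    out = np.zeros((len(rows), seq_len), dtype=int)
    for i, row in enumerate(rows):
        kept = row[:seq_len]          # keep only the first seq_len ids
        if kept:                      # an empty row simply stays all zeros
            out[i, -len(kept):] = kept
    return out

toy = [[117, 18, 128], [5, 6, 7, 8, 9, 10, 11]]
padded = left_pad(toy, seq_len=5)
print(padded)
# [[  0   0 117  18 128]
#  [  5   6   7   8   9]]
```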

In [14]:
#not sure i buy why only 200 words makes sense, but following along with the tutorial
seq_len = 200
features = np.zeros((len(reviews_ints), seq_len), dtype=int)

In [15]:
for idx, review in enumerate(reviews_ints):
    trunc_rev = review[:seq_len]
    start_idx = (seq_len - len(trunc_rev))
    features[idx, start_idx:] = trunc_rev

In [16]:
features[:10, :100]

array([[    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0, 42386, 22123, 33565,
        70291, 38773, 11818,  1901, 72811, 10903, 72356, 25002, 11449,
        35077, 41756, 48831, 13275, 69853, 45314, 20860, 70669, 35077,
        47666, 60116, 34992, 21234, 72356, 15004, 31694, 44112, 19702,
        55762, 72405, 39159, 42386, 22123, 39374, 34820, 33565, 45612,
        24616],
       [    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     

# Training, Validation, Test

With our data in nice shape, we'll split it into training, validation, and test sets.

Exercise: Create the training, validation, and test sets here. You'll need to create sets for the features and the labels, train_x and train_y for example. Define a split fraction, split_frac as the fraction of data to keep in the training set. Usually this is set to 0.8 or 0.9. The rest of the data will be split in half to create the validation and testing data.
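The arithmetic of the split is easy to check on a tiny array first. The sketch below uses a hypothetical helper `split_data` and 10 toy rows; with `split_frac = 0.8` that gives 8 training rows and 1 row each for validation and test.

```python
import numpy as np

def split_data(x, y, split_frac=0.8):
    """Split into train / validation / test; the remainder is halved."""
    n_train = int(len(x) * split_frac)
    n_val = (len(x) - n_train) // 2
    train = (x[:n_train], y[:n_train])
    val = (x[n_train:n_train + n_val], y[n_train:n_train + n_val])
    test = (x[n_train + n_val:], y[n_train + n_val:])
    return train, val, test

toy_x = np.arange(20).reshape(10, 2)   # 10 toy "reviews" with 2 features each
toy_y = np.arange(10).reshape(10, 1)   # 10 toy labels

(tr_x, tr_y), (va_x, va_y), (te_x, te_y) = split_data(toy_x, toy_y)
print(tr_x.shape, va_x.shape, te_x.shape)  # (8, 2) (1, 2) (1, 2)
```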

In [17]:
labels = np.array(labels)

In [18]:
labels = labels.reshape(len(labels), -1)

In [19]:
split_frac = 0.8

r,c = features.shape

train_x, val_x = features[:int(r*split_frac)], features[int(r*split_frac):]
train_y, val_y = labels[:int(r*split_frac)], labels[int(r*split_frac):]

r_val, _ = val_x.shape

val_x, test_x = val_x[:int(0.5 * r_val)], val_x[int(0.5 * r_val):]
val_y, test_y = val_y[:int(0.5 * r_val)], val_y[int(0.5 * r_val):]


print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}, {}".format(train_x.shape, train_y.shape), 
      "\nValidation set: \t{}, {}".format(val_x.shape, val_y.shape),
      "\nTest set: \t\t{}, {}".format(test_x.shape, test_y.shape))

			Feature Shapes:
Train set: 		(20000, 200), (20000, 1) 
Validation set: 	(2500, 200), (2500, 1) 
Test set: 		(2500, 200), (2500, 1)



# Build the graph

Here, we'll build the graph. First up, defining the hyperparameters.
 - lstm_size: Number of units in the hidden layers in the LSTM cells. Usually larger is better performance wise. Common values are 128, 256, 512, etc.
 - lstm_layers: Number of LSTM layers in the network. I'd start with 1, then add more if I'm underfitting.
 - batch_size: The number of reviews to feed the network in one training pass. Typically this should be set as high as you can go without running out of memory.
 - learning_rate: Learning rate

In [20]:
lstm_size = 256
lstm_layers = 2 #tutorial advice: start with 1 layer and add more if underfitting; using 2 here
batch_size = 500
learning_rate = 0.001
embed_size = 300


For the network itself, we'll be passing in our 200 element long review vectors. Each batch will be batch_size vectors. We'll also be using dropout on the LSTM layer, so we'll make a placeholder for the keep probability.

Exercise: Create the inputs_, labels_, and drop out keep_prob placeholders using tf.placeholder. labels_ needs to be two-dimensional to work with some functions later. Since keep_prob is a scalar (a 0-dimensional tensor), you shouldn't provide a size to tf.placeholder.

In [21]:
n_words = len(word_set)
print('n_words: {}'.format(n_words))

tf.reset_default_graph()

graph = tf.Graph()

with graph.as_default():
    inputs_ = tf.placeholder(tf.int32, shape=(batch_size, seq_len), name='inputs_')
    labels_ = tf.placeholder(tf.int32, shape=(batch_size, 1), name='labels_') #this might be wrong
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')

n_words: 74072


# Embedding Layer

Add an embedding layer that maps each word in the ~74K-word vocabulary to a 300-dimensional feature vector.

In [22]:
with graph.as_default():
    with tf.name_scope('embedding_layer'):
        #added 1 to embedding since 0th row is used for padding
        embedding = tf.Variable(tf.random_uniform((n_words + 1, embed_size), -1.0, 1.0), name='embedding') 
        embed = tf.nn.embedding_lookup(embedding, inputs_)
        tf.summary.histogram('embedding', embedding)
    

# LSTM Cell


In [23]:
with graph.as_default():
    with tf.name_scope('lstm'):
        #creates an lstm cell with num_units = lstm_size
        lstm = tf.contrib.rnn.BasicLSTMCell(num_units = lstm_size)

        #wraps lstm cell in a cell that applies dropout to the output of the lstm cell
        drop = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob) 

## Note from tutorial

Most of the time, your network will have better performance with more layers. That's sort of the magic of deep learning, adding more layers allows the network to learn really complex relationships. Again, there is a simple way to create multiple layers of LSTM cells with tf.contrib.rnn.MultiRNNCell.

Here, [drop] * lstm_layers creates a list of cells (drop) that is lstm_layers long. The MultiRNNCell wrapper builds this into multiple layers of RNN cells, one for each cell in the list.

So the final cell you're using in the network is actually multiple (or just one) LSTM cells with dropout. But it all works the same from an architectural viewpoint, just a more complicated graph in the cell.
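One Python detail worth keeping in mind: multiplying a list repeats references to the same object, it does not copy it. The plain-Python sketch below uses a made-up `ToyCell` class as a stand-in for an RNN cell. Depending on the TensorFlow version, stacking one shared cell object may be rejected or may behave differently from separate cells, so building one cell per layer with a list comprehension is often the safer pattern.

```python
class ToyCell:
    """Hypothetical stand-in for an RNN cell; it only holds a weights list."""
    def __init__(self):
        self.weights = []

n_layers = 2
shared = [ToyCell()] * n_layers                   # two references to ONE object
separate = [ToyCell() for _ in range(n_layers)]   # two distinct objects

print(shared[0] is shared[1])      # True
print(separate[0] is separate[1])  # False

# A change made through one reference shows up in the "other" layer too.
shared[0].weights.append(1.0)
print(shared[1].weights)           # [1.0]
```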

In [24]:
with graph.as_default():
    with tf.name_scope('lstm_layer'):
        cell = tf.contrib.rnn.MultiRNNCell([drop] * lstm_layers) 
        initial_state = cell.zero_state(batch_size=batch_size, dtype=tf.float32)

## RNN forward pass

Now we need to actually run the data through the RNN nodes. You can use tf.nn.dynamic_rnn to do this. You'd pass in the RNN cell you created (our multiple layered LSTM cell for instance), and the inputs to the network.

`outputs, final_state = tf.nn.dynamic_rnn(cell, inputs, initial_state=initial_state)`

Above I created an initial state, initial_state, to pass to the RNN. This is the cell state that is passed between the hidden layers in successive time steps. tf.nn.dynamic_rnn takes care of most of the work for us. We pass in our cell and the input to the cell, then it does the unrolling and everything else for us. It returns outputs for each time step and the final_state of the hidden layer.
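To see what "unrolling" means, here is a NumPy sketch of the loop over time steps. It uses a plain tanh RNN rather than an LSTM to keep it short, and the function name `simple_rnn` and all sizes are assumptions for illustration. It mirrors the shape of the result: one output per time step, plus the final state.

```python
import numpy as np

def simple_rnn(inputs, initial_state, w_x, w_h, b):
    """Unroll a plain tanh RNN over time.

    inputs: (batch, time, features); initial_state: (batch, hidden).
    Returns outputs for every step (batch, time, hidden) and the final state.
    """
    state = initial_state
    outputs = []
    for t in range(inputs.shape[1]):
        # The state from step t-1 is fed back in at step t.
        state = np.tanh(inputs[:, t] @ w_x + state @ w_h + b)
        outputs.append(state)
    return np.stack(outputs, axis=1), state

rng = np.random.default_rng(0)
batch, steps, features, hidden = 2, 4, 3, 5
x = rng.normal(size=(batch, steps, features))
w_x = rng.normal(size=(features, hidden))
w_h = rng.normal(size=(hidden, hidden))
b = np.zeros(hidden)

outputs, final_state = simple_rnn(x, np.zeros((batch, hidden)), w_x, w_h, b)
print(outputs.shape, final_state.shape)  # (2, 4, 5) (2, 5)
```

For this simple cell the last output and the final state are the same array; an LSTM additionally carries a separate cell state.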

In [25]:
with graph.as_default():
    with tf.name_scope('dynamic_rnn'):
        outputs, final_state = tf.nn.dynamic_rnn(cell = cell, inputs = embed, initial_state=initial_state)
        tf.summary.histogram('outputs', outputs)
        tf.summary.histogram('final_state', final_state)

## outputs

We only care about the final output, `outputs[:, -1]`, which we'll compare to `labels_` to determine the cost of the RNN.
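The slicing and the prediction step can be sketched in NumPy. Everything here is toy data: `outputs` is random, and the dense-layer weights `w` and `b` are hypothetical stand-ins for what a fully connected layer would learn.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
batch, steps, hidden = 3, 4, 5
outputs = rng.normal(size=(batch, steps, hidden))  # stand-in for the RNN outputs

last_step = outputs[:, -1]                # shape (batch, hidden): final time step only
w = rng.normal(size=(hidden, 1))          # hypothetical dense-layer weights
b = np.zeros(1)
predictions = sigmoid(last_step @ w + b)  # shape (batch, 1), values in (0, 1)

labels = np.array([[1], [0], [1]])
cost = np.mean((labels - predictions) ** 2)  # mean squared error against the labels
print(last_step.shape, predictions.shape)    # (3, 5) (3, 1)
```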

In [26]:
with graph.as_default():
    with tf.name_scope('prediction'):
        predictions = tf.contrib.layers.fully_connected(outputs[:, -1], num_outputs=1, activation_fn=tf.sigmoid) 
        
    with tf.name_scope('cost'):
        cost = tf.losses.mean_squared_error(labels=labels_, predictions=predictions)
        tf.summary.scalar('cost', cost)
    
    with tf.name_scope('optimizer'):
        opt = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

## validation accuracy

Add nodes to calculate validation accuracy.
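The accuracy calculation is just "round the sigmoid output, compare to the label, average". A NumPy sketch with made-up predictions and labels:

```python
import numpy as np

predictions = np.array([[0.91], [0.20], [0.55], [0.49]])  # toy sigmoid outputs
labels = np.array([[1], [0], [0], [0]])                   # toy ground truth

# Round to 0 or 1, then compare element-wise with the labels.
correct = np.round(predictions).astype(int) == labels
accuracy = correct.astype(float).mean()
print(accuracy)  # 0.75
```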

In [27]:
with graph.as_default():
    with tf.name_scope('accuracy'):
        correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels_)
        accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
        tf.summary.scalar('accuracy', accuracy)

In [28]:
def get_batches(x, y, batch_size = 100):
    n_batches = len(x)//batch_size
    x, y = x[:n_batches * batch_size], y[:n_batches * batch_size]
    for idx in range(0, len(x), batch_size):
        yield x[idx : idx + batch_size], y[idx : idx + batch_size]
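A quick check of how this generator behaves on toy data. The input placeholder has a fixed shape of (batch_size, seq_len), so only full batches can be fed; any leftover rows at the end are dropped. The function is repeated here so the sketch runs on its own, and the `toy_*` arrays are made up.

```python
import numpy as np

def get_batches(x, y, batch_size=100):
    # Same logic as the cell above: keep only full batches.
    n_batches = len(x) // batch_size
    x, y = x[:n_batches * batch_size], y[:n_batches * batch_size]
    for idx in range(0, len(x), batch_size):
        yield x[idx:idx + batch_size], y[idx:idx + batch_size]

toy_x = np.arange(10).reshape(10, 1)
toy_y = np.arange(10)
batches = list(get_batches(toy_x, toy_y, batch_size=4))

print(len(batches))                    # 2
print([len(bx) for bx, _ in batches])  # [4, 4]; rows 8 and 9 are dropped
```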

In [29]:
with graph.as_default():
    saver = tf.train.Saver()
    merged = tf.summary.merge_all()

# write out graph

## Training
### Before running, create a checkpoints directory and a summary directory

In [30]:

TRAINING = False

if TRAINING:
    epochs = 15
    print("lstm_size: {}, lstm_layers: {}, batch_size: {}, learning_rate: {}".format(lstm_size, lstm_layers, batch_size, learning_rate))
    with tf.Session(graph = graph) as sess:
        sess.run(tf.global_variables_initializer())
        graph_writer = tf.summary.FileWriter('./logs/graph', sess.graph)
        train_writer = tf.summary.FileWriter('./logs/perf/train')
        valid_writer = tf.summary.FileWriter('./logs/perf/valid')
        iteration = 1

        for e in range(epochs):
            #print(e)
            state = sess.run(initial_state)
            for x,y in get_batches(train_x, train_y, batch_size):
                #print(x,y)
                feed = {inputs_: x,
                        labels_: y,
                        keep_prob: 0.5,
                        initial_state: state}
                loss, state, _, summary, acc = sess.run([cost, final_state, opt, merged, accuracy], feed_dict=feed)

                if iteration%5 == 0:
                    train_writer.add_summary(summary, iteration)
                    print("Epoch: {}/{}".format(e, epochs),
                          "Iteration: {}".format(iteration),
                          "Train Loss: {:.3f}".format(loss),
                          "Accuracy: {:.3f}".format(acc)
                         )
                if iteration%25 == 0:
                    val_acc = []
                    val_state = sess.run(cell.zero_state(batch_size, tf.float32))
                    for vx, vy in get_batches(val_x, val_y, batch_size):
                        feed = {inputs_: vx,
                                labels_: vy,
                                keep_prob: 1.0,
                                initial_state: val_state
                               }
                        batch_acc, val_state, summary = sess.run([accuracy, final_state, merged], feed_dict=feed)
                        val_acc.append(batch_acc)
                    valid_writer.add_summary(summary, iteration)
                    print('Validation Acc: {:.3f}'.format(np.mean(val_acc)))

                iteration += 1
            saver.save(sess, "./checkpoints/sentiment.ckpt", e)

In [31]:
test_acc = []
with tf.Session(graph=graph) as sess:
    saver.restore(sess, tf.train.latest_checkpoint("./checkpoints"))
    test_state = sess.run(cell.zero_state(batch_size, tf.float32))
    for x,y in get_batches(test_x, test_y, batch_size):
        feed_dict = {
            inputs_: x,
            labels_: y,
            keep_prob: 1.0,
            initial_state: test_state
        }
        batch_acc, test_state = sess.run([accuracy, final_state], feed_dict=feed_dict)
        test_acc.append(batch_acc)
    print("Test Accuracy: {:.3f}".format(np.mean(test_acc)))

Test Accuracy: 0.809


In [36]:
#evidence of overfitting past the 11th ckpt

test_acc = []
with tf.Session(graph=graph) as sess:
    saver.restore(sess, './checkpoints/sentiment.ckpt-11')
    test_state = sess.run(cell.zero_state(batch_size, tf.float32))
    for x,y in get_batches(test_x, test_y, batch_size):
        feed_dict = {
            inputs_: x,
            labels_: y,
            keep_prob: 1.0,
            initial_state: test_state
        }
        batch_acc, test_state = sess.run([accuracy, final_state], feed_dict=feed_dict)
        test_acc.append(batch_acc)
    print("Test Accuracy: {:.3f}".format(np.mean(test_acc)))

Test Accuracy: 0.821


In [48]:
sent1 = 'i watched this movie yesterday and thought it was great hamburger'

In [49]:
words = sent1.split()

In [50]:
sub_sents = []

for i in range(2, len(words) + 1):
    sub_sents.append(words[:i])
print(sub_sents)

[['i', 'watched'], ['i', 'watched', 'this'], ['i', 'watched', 'this', 'movie'], ['i', 'watched', 'this', 'movie', 'yesterday'], ['i', 'watched', 'this', 'movie', 'yesterday', 'and'], ['i', 'watched', 'this', 'movie', 'yesterday', 'and', 'thought'], ['i', 'watched', 'this', 'movie', 'yesterday', 'and', 'thought', 'it'], ['i', 'watched', 'this', 'movie', 'yesterday', 'and', 'thought', 'it', 'was'], ['i', 'watched', 'this', 'movie', 'yesterday', 'and', 'thought', 'it', 'was', 'great'], ['i', 'watched', 'this', 'movie', 'yesterday', 'and', 'thought', 'it', 'was', 'great', 'hamburger']]


In [51]:
sub_sents_ints = [[vocab_to_int[word] for word in sent] for sent in sub_sents]

In [53]:
def zero200_padding(inp_list):
    #left pad a list of ints with zeros up to length 200
    return [0] * (200 - len(inp_list)) + inp_list