In [None]:
# Make new conda environment
#!conda create -n srnn
#!source activate srnn
#!while read -r line; do conda install $line -y; done < requirements.txt

# Sentiment Analysis with an RNN

In this notebook, you'll implement a recurrent neural network that performs sentiment analysis. Using an RNN rather than a feedforward network can be more accurate because the network can use information about the *sequence* of words. Here we'll use a dataset of movie reviews, accompanied by labels.

The architecture for this network is shown below.

<img src="assets/network_diagram.png" width=400px>

Here, we'll pass in words to an embedding layer. We need an embedding layer because we have tens of thousands of words, so we'll need a more efficient representation for our input data than one-hot encoded vectors. You should have seen this before from the word2vec lesson. You can actually train up an embedding with word2vec and use it here. But it's good enough to just have an embedding layer and let the network learn the embedding table on its own.

From the embedding layer, the new representations will be passed to LSTM cells. These will add recurrent connections to the network so we can include information about the sequence of words in the data. Finally, the LSTM cells will go to a sigmoid output layer here. We're using the sigmoid because we're trying to predict if this text has positive or negative sentiment. The output layer will just be a single unit then, with a sigmoid activation function.

We only care about the sigmoid output from the very last step, so we can ignore the rest. We'll calculate the cost from the output of the last step and the training label.

In [1]:
import numpy as np
import tensorflow as tf

In [2]:
with open('../sentiment-network/reviews.txt', 'r') as f:
    reviews = f.read()
with open('../sentiment-network/labels.txt', 'r') as f:
    labels = f.read()

## Data preprocessing

The first step when building a neural network model is getting your data into the proper form to feed into the network. Since we're using embedding layers, we'll need to encode each word with an integer. We'll also want to clean it up a bit.

You can see an example of the reviews data above. We'll want to get rid of those periods. Also, you might notice that the reviews are delimited with newlines `\n`. To deal with those, I'm going to split the text into each review using `\n` as the delimiter. Then I can combine all the reviews back together into one big string.

First, let's remove all punctuation. Then get all the text without the newlines and split it into individual words.

In [3]:
# 1. Clean Up Punctuation
from string import punctuation
all_text = ''.join([c for c in reviews if c not in punctuation])

# 2. Split all_text into separate reviews (separated by newline chars)
reviews = all_text.split('\n')

# 3. Define text as explicit sequence of words
words = ' '.join(reviews).split()
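As a quick sanity check, the same three steps can be run on a tiny made-up string. The `sample` text below is hypothetical and not taken from the dataset; it only mimics the newline-delimited layout described above.

```python
from string import punctuation

# Hypothetical two-review sample, newline-delimited like the reviews file
sample = "great movie , loved it .\nnot good at all !\n"

# 1. Drop every punctuation character
clean = ''.join(c for c in sample if c not in punctuation)

# 2. Split into reviews; the trailing newline leaves an empty last entry
sample_reviews = clean.split('\n')

# 3. Flatten into one list of words
sample_words = ' '.join(sample_reviews).split()

print(len(sample_reviews))  # 3 (two reviews plus one empty string)
print(sample_words)
```

The empty last entry is the reason the later cells filter out zero-length reviews and labels.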

### Encoding the words

The embedding lookup requires that we pass in integers to our network. The easiest way to do this is to create dictionaries that map the words in the vocabulary to integers. Then we can convert each of our reviews into integers so they can be passed into the network.

> **Exercise:** Now you're going to encode the words with integers. Build a dictionary that maps words to integers. Later we're going to pad our input vectors with zeros, so make sure the integers **start at 1, not 0**.
> Also, convert the reviews to integers and store the reviews in a new list called `int_reviews`. 

In [4]:
# 4. Define your vocab
vocab = set(words)

# 5. Create your dictionary that maps vocab words to integers here
vocab_to_int = {w: i+1 for i,w in enumerate(vocab)}

# 6. Convert the reviews to integers, same shape as reviews list, but with integers
#  -- make sure to filter out the last review with 0 length 
int_reviews = [[vocab_to_int[word] for word in rev.split()]
                for rev in reviews if len(rev) > 0]
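Because the iteration order of a `set` of strings can differ between Python sessions, the integer assigned to each word may change from run to run. Below is a sketch of one deterministic alternative, assuming you want more frequent words to receive smaller integers. The `toy_words` list and the `toy_` names are made up for illustration.

```python
from collections import Counter

# Hypothetical word list standing in for `words`
toy_words = ['the', 'movie', 'was', 'the', 'best', 'movie', 'the']

counts = Counter(toy_words)

# Sort by descending count, breaking ties alphabetically, so the order is reproducible
ordered = sorted(counts, key=lambda w: (-counts[w], w))

# Start at 1 because 0 is reserved for padding
toy_vocab_to_int = {w: i for i, w in enumerate(ordered, 1)}

toy_ints = [toy_vocab_to_int[w] for w in toy_words]
print(toy_vocab_to_int)
print(toy_ints)
```

A stable mapping matters mainly when a saved checkpoint is reloaded in a fresh session, since the embedding rows are indexed by these integers.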

### Encoding the labels

Our labels are "positive" or "negative". To use these labels in our network, we need to convert them to 0 and 1.

> **Exercise:** Convert labels from `positive` and `negative` to 1 and 0, respectively.

In [5]:
# Convert labels to 1s and 0s for 'positive' and 'negative'
#  -- make sure to filter out the last label with 0 length 
labels = np.array([1 if lab=='positive' else 0 for lab in labels.split('\n') if len(lab) > 0])
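To see why the zero-length filter matters, here is a small sketch on a hypothetical labels string (`raw_labels` is made up, assuming the file ends with a newline). Without the filter, the trailing empty string would be silently encoded as a negative label.

```python
# Hypothetical file contents, ending with a newline
raw_labels = "positive\nnegative\npositive\n"

pieces = raw_labels.split('\n')  # the trailing newline yields a final ''

# With the length filter: one label per real line
filtered = [1 if lab == 'positive' else 0 for lab in pieces if len(lab) > 0]

# Without the filter: the empty string becomes an extra 0
unfiltered = [1 if lab == 'positive' else 0 for lab in pieces]

print(filtered)    # [1, 0, 1]
print(unfiltered)  # [1, 0, 1, 0]
```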

If you built `int_reviews` correctly, the next cell should report no zero-length reviews and a maximum review length of 2514.

In [6]:
from collections import Counter
review_lens = Counter([len(x) for x in int_reviews])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))

Zero-length reviews: 0
Maximum review length: 2514


Okay, the maximum review length is way too many steps for our RNN. Let's truncate to 200 steps. For reviews shorter than 200 words, we'll pad with 0s. For reviews longer than 200 words, we can truncate them to the first 200 words.

> **Exercise:** Now, create an array `features` that contains the data we'll pass to the network. The data should come from `int_reviews`, since we want to feed integers to the network. Each row should be 200 elements long. For reviews shorter than 200 words, left pad with 0s. That is, if the review is `['best', 'movie', 'ever']`, `[117, 18, 128]` as integers, the row will look like `[0, 0, 0, ..., 0, 117, 18, 128]`. For reviews longer than 200, use only the first 200 words as the feature vector.

This isn't trivial and there are a bunch of ways to do this. But, if you're going to be building your own deep learning networks, you're going to have to get used to preparing your data.



In [7]:
seq_len = 200
# Truncate long reviews to the first seq_len words; left-pad short ones with 0s
features = np.array([rev[:seq_len] if len(rev) > seq_len
                     else [0]*(seq_len - len(rev)) + rev
                     for rev in int_reviews])

In [8]:
features.shape

(25000, 200)
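The padding rule can also be written as a small helper and checked on toy inputs. This is a minimal sketch; `left_pad` is a hypothetical name, and it uses plain lists rather than NumPy arrays.

```python
def left_pad(review, seq_len, pad_value=0):
    """Return exactly seq_len integers: truncate long reviews, left-pad short ones."""
    truncated = review[:seq_len]
    return [pad_value] * (seq_len - len(truncated)) + truncated

short_review = [117, 18, 128]
long_review = list(range(1, 9))  # 8 "words"

print(left_pad(short_review, 5))  # [0, 0, 117, 18, 128]
print(left_pad(long_review, 5))   # [1, 2, 3, 4, 5]
```

Left padding keeps the real words at the end of the sequence, next to the final time step whose output is used for the prediction.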

## Training, Validation, Test



With our data in nice shape, we'll split it into training, validation, and test sets.

> **Exercise:** Create the training, validation, and test sets here. You'll need to create sets for the features and the labels, `train_x` and `train_y` for example. Define a split fraction, `split_frac` as the fraction of data to keep in the training set. Usually this is set to 0.8 or 0.9. The rest of the data will be split in half to create the validation and testing data.

In [53]:
split_frac = 0.8
n_data = int(features.shape[0])
np.random.seed(42)  # Make the permutation reproducible
random_indices = np.random.permutation(n_data)

trn_vt_bdry = int(np.ceil(split_frac * n_data))
n_vt = n_data - trn_vt_bdry
val_tst_bdry = int(trn_vt_bdry + np.ceil(0.5 * n_vt))

trn_idx = random_indices[:trn_vt_bdry]
val_idx = random_indices[trn_vt_bdry:val_tst_bdry]
tst_idx = random_indices[val_tst_bdry:]

trn_x, trn_y = features[trn_idx,:], np.reshape(labels[trn_idx], (len(labels[trn_idx]),1))
val_x, val_y = features[val_idx,:], np.reshape(labels[val_idx], (len(labels[val_idx]),1))
tst_x, tst_y = features[tst_idx,:], np.reshape(labels[tst_idx], (len(labels[tst_idx]),1))

print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(trn_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(tst_x.shape))

			Feature Shapes:
Train set: 		(20000, 200) 
Validation set: 	(2500, 200) 
Test set: 		(2500, 200)
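The boundary arithmetic used above can be checked in isolation. The sketch below follows the same ceiling-based rule; `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math

def split_sizes(n_data, split_frac):
    """Sizes of the train, validation, and test sets under the ceiling rule."""
    n_train = math.ceil(split_frac * n_data)
    n_rest = n_data - n_train
    n_val = math.ceil(0.5 * n_rest)   # validation gets the extra item if n_rest is odd
    n_test = n_rest - n_val
    return n_train, n_val, n_test

print(split_sizes(25000, 0.8))  # (20000, 2500, 2500)
print(split_sizes(103, 0.8))    # (83, 10, 10)
```

The three sizes always sum to `n_data`, so no review is dropped or used twice.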


## Build the graph

Here, we'll build the graph. First up, defining the hyperparameters.

* `lstm_width`: Number of units in the hidden layers in the LSTM cells. Larger values usually perform better, at the cost of more parameters to train. Common values are 128, 256, 512, etc.
    - If we use a 300-dim embedding space, then 128 would act as a further dimensionality reduction, while 512 would leave room for more nonlinear feature exploration.
    - This is the "layer width." However, LSTM cells have four internal layers (3 sigmoid layers and 1 tanh layer), and the Udacity instructor says that this number applies to each layer, i.e., 256 specifies 4*256=1024 units.
    - You can think of this as specifying how wide you want a regular hidden layer to be, though an LSTM cell is more complicated.
    - Called `lstm_size` in the original Udacity file.
* `lstm_depth`: Number of LSTM layers in the network. I'd start with 1, then add more if I'm underfitting.
    - If we prescribe 256-unit hidden layers, then, e.g., 2 LSTM layers put 512 nodes into the network.
    - Just watch out: one can easily have too many free parameters and overfit! A good approach is to start at 1 and tune up from there, while using a hefty amount of regularization.
    - Called `lstm_layers` in the original Udacity file.
* `num_reviews_per_batch`: The number of reviews to feed the network in one training pass. Typically this should be set as high as you can go without running out of memory, to take advantage of TensorFlow's optimized matrix operations and the parallel processing capacity of your CPU and/or GPU.
    - Called `batch_size` in the original Udacity file.
* `learning_rate`: Learning rate
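To get a feel for how `lstm_width` and `lstm_depth` drive the number of free parameters, here is a rough count. It assumes the standard LSTM formulation without peephole connections, where each of the four internal layers is fully connected to the concatenated input and previous hidden state and has its own bias. `lstm_param_count` is a hypothetical helper, and the 300-dimensional input is the embedding size chosen later in this notebook.

```python
def lstm_param_count(input_size, num_units):
    """Weights plus biases for one LSTM layer with four internal layers."""
    per_internal_layer = (input_size + num_units) * num_units + num_units
    return 4 * per_internal_layer

first_layer = lstm_param_count(input_size=300, num_units=256)
second_layer = lstm_param_count(input_size=256, num_units=256)  # fed by the first layer's output

print(first_layer)                 # 570368
print(first_layer + second_layer)  # 1095680
```

Going from one layer to two nearly doubles the LSTM parameters, which is why it makes sense to start with `lstm_depth = 1`.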

In [54]:
lstm_width = 256
lstm_depth = 1
num_reviews_per_batch = 500
learning_rate = 0.001

For the network itself, we'll be passing in our 200 element long review vectors. Each batch will include `num_reviews_per_batch` vectors. We'll also be using dropout on the LSTM layer, so we'll make a placeholder for the keep probability.

> **Exercise:** Create the `inputs_`, `labels_`, and drop out `keep_prob` placeholders using `tf.placeholder`. `labels_` needs to be two-dimensional to work with some functions later.  Since `keep_prob` is a scalar (a 0-dimensional tensor), you shouldn't provide a size to `tf.placeholder`.

In [55]:
# Create the graph object
graph = tf.Graph()

# Add nodes to the graph
with graph.as_default(), tf.name_scope("model_inputs"):
    inputs_ = tf.placeholder(tf.int32, shape=[None, 200], name='inputs')
    labels_ = tf.placeholder(tf.int32, shape=[None,1], name='labels')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')

### Embedding

One-hot encoding is great, right? Not for extremely large vector spaces!

There are about 74,000 words in our vocabulary, which means that it is massively inefficient to one-hot encode our words here. You should remember dealing with this problem from the word2vec lesson. Instead of one-hot encoding, we can have an embedding layer and use that layer as a lookup table. You could train an embedding layer using word2vec, then load it here. But, it's fine to just make a new layer and let the network learn the weights.

> **Exercise:** Create the embedding lookup matrix as a `tf.Variable`. Use that embedding matrix to get the embedded vectors to pass to the LSTM cell with [`tf.nn.embedding_lookup`](https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup). This function takes the embedding matrix and an input tensor, such as the review vectors. Then, it'll return another tensor with the embedded vectors. So, if the embedding layer has 300 units, the function will return a tensor with size [num_reviews_per_batch, 200, 300]: one 300-dimensional vector for each of the 200 words in each review.



In [56]:
# Dim Reduction:  Vocab Vectors --> Embedding Vectors 
#   -- the embedding vector dimensionality is the number of 
#      units in the embedding layer
embedding_size = 300 
vocab_size = len(vocab)  # ~74k

# We start with a number (num_reviews_per_batch) of reviews, each having 200 
#   "sequence dimensions" (inputs_.shape: (None, 200)) where each dimension 
#    can range over the integers 1:74k.
#
# The integers in each dimension have been arbitrarily assigned to words.
#   Thus, the integers themselves give the false impression that there
#   is a logical ordering to the categorical values.  This is why we usually
#   one-hot encode, where each word would be an orthogonal vector in a 74k-dimensional
#   word space.  However, one-hot encoding would render each 200-element sequence 
#   into a 200x74k=14.8M-dimensional input vector.  14.8M features?  Nice try. 
#   To solve the "categorical issue" problem, we instead assume that words are not
#   likely best represented as orthogonal vectors anyway.  Orthogonality implies
#   a complete dissimilarity, but many words hold similar meanings.  With this in
#   mind, a better assumption might be to position words in a much lower-dimensional
#   space where two words might be orthogonal, but more often are not.  In the one-hot
#   space, a word's vector representation is all 0's except for the axis which 
#   represents the word.  In an embedding space, the components of the vector 
#   representation might take on any value, e.g., any number between 0 and 1. 
#   Words with similar meanings likely have a lot in common, so should likely lie
#   near each other (i.e., their normalized dot product is close to 1).
#
# Anyway, the point is, instead of having a 200-element sequence of 74k-dimensional
#   vectors, the embedding representation will allow us to use a 200-element sequence
#   of, say, 300-dimensional vectors.  This reduces the feature dimensionality of a review
#   from 14.8M to 60k (a reduction by more than 200x).
#

# vocab_size+1: Accounts for the 0-word we included for padding.
with graph.as_default(), tf.name_scope("embedding"):
    embedding = tf.Variable(
        tf.random_uniform([vocab_size+1, embedding_size], -1.0, 1.0),
        name="W")
    embed = tf.nn.embedding_lookup(embedding, inputs_)
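Conceptually, the lookup just replaces every integer in the batch with the matching row of the embedding table. The sketch below mimics that with plain Python lists and toy sizes; it is an illustration of the indexing, not the TensorFlow implementation, and all `toy_` names are made up.

```python
import random

random.seed(0)
toy_vocab_size, toy_embedding_size = 5, 3

# One row per integer 0..toy_vocab_size (row 0 belongs to the padding "word")
table = [[random.uniform(-1.0, 1.0) for _ in range(toy_embedding_size)]
         for _ in range(toy_vocab_size + 1)]

# Batch of 2 reviews, each 4 steps long, left-padded with 0s
batch = [[0, 0, 3, 1],
         [0, 5, 2, 2]]

# The lookup: swap each integer for its row in the table
embedded = [[table[idx] for idx in review] for review in batch]

shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)  # (2, 4, 3)
```

The result has one extra dimension compared with the input: (reviews, steps) becomes (reviews, steps, embedding size).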

### LSTM cell

<img src="assets/network_diagram.png" width=400px>

Next, we'll create our LSTM cells to use in the recurrent network ([TensorFlow documentation](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn)). Here we are just defining what the cells look like. This isn't actually building the graph, just defining the type of cells we want in our graph.

To create a basic LSTM cell for the graph, you'll want to use `tf.contrib.rnn.BasicLSTMCell`. Looking at the function documentation:

```
tf.contrib.rnn.BasicLSTMCell(num_units, forget_bias=1.0, input_size=None, state_is_tuple=True, activation=tf.tanh)
```

you can see it takes a parameter called `num_units`, the number of units in the cell, called `lstm_width` in this code (called lstm_size in original Udacity file). So then, you can write something like 

```
lstm = tf.contrib.rnn.BasicLSTMCell(num_units)
```

to create an LSTM cell with `num_units`. Next, you can add dropout to the cell with `tf.contrib.rnn.DropoutWrapper`. This just wraps the cell in another cell, but with dropout added to the inputs and/or outputs. It's a really convenient way to make your network better with almost no effort! So you'd do something like

```
drop = tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=keep_prob)
```

More layers often improve performance, because added depth lets the network learn more complex relationships, though it also adds parameters that can overfit. Again, there is a simple way to create multiple layers of LSTM cells with `tf.contrib.rnn.MultiRNNCell`:

```
cell = tf.contrib.rnn.MultiRNNCell([drop] * lstm_depth)
```

Here, `[drop] * lstm_depth` creates a list that is `lstm_depth` long, and the `MultiRNNCell` wrapper builds this into multiple layers of RNN cells, one for each entry in the list. Note that every entry refers to the same `drop` object. That is harmless for `lstm_depth = 1`, but from TensorFlow 1.2 on, a deeper stack built this way shares one set of weights across layers (and fails to build when the layer input sizes differ), so for `lstm_depth > 1` create a separate cell for each layer instead.

So the final cell you're using in the network is actually multiple (or just one) LSTM cells with dropout. But it all works the same from an architectural viewpoint, just with a more complicated graph in the cell.

> **Exercise:** Below, use `tf.contrib.rnn.BasicLSTMCell` to create an LSTM cell. Then, add drop out to it with `tf.contrib.rnn.DropoutWrapper`. Finally, create multiple LSTM layers with `tf.contrib.rnn.MultiRNNCell`.

Here is [a tutorial on building RNNs](https://www.tensorflow.org/tutorials/recurrent) that will help you out.
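A plain-Python detail is worth keeping in mind when building the list of cells for `MultiRNNCell`: `[obj] * n` repeats one object rather than copying it. The sketch below uses a hypothetical `ToyCell` class (a stand-in, not a TensorFlow class) to show the difference from a list comprehension.

```python
class ToyCell:
    """Stand-in for an RNN cell object that owns its own weights."""
    def __init__(self):
        self.weights = [0.0]

shared = [ToyCell()] * 3                  # three references to ONE object
separate = [ToyCell() for _ in range(3)]  # three independent objects

# Change the weights of the first entry in each list
shared[0].weights[0] = 1.0
separate[0].weights[0] = 1.0

print([c.weights[0] for c in shared])    # [1.0, 1.0, 1.0]
print([c.weights[0] for c in separate])  # [1.0, 0.0, 0.0]
```

With a single layer the two forms are equivalent; the distinction only matters once `lstm_depth` is greater than 1.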


In [57]:
with graph.as_default():
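    def build_cell():
        # Your basic LSTM cell
        lstm = tf.contrib.rnn.BasicLSTMCell(lstm_width)
        # Add dropout to the cell
        return tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)

    # Stack up multiple LSTM layers, for deep learning.
    # Each layer gets its own cell object, so layers never share one cell
    # when lstm_depth > 1.
    cell = tf.contrib.rnn.MultiRNNCell([build_cell() for _ in range(lstm_depth)])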
    
    # Getting an initial state of all zeros
    initial_state = cell.zero_state(num_reviews_per_batch, tf.float32)

### RNN forward pass

<img src="assets/network_diagram.png" width=400px>

Now we need to actually run the data through the RNN nodes. You can use [`tf.nn.dynamic_rnn`](https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn) to do this. You'd pass in the RNN cell you created (our multiple layered LSTM `cell` for instance), and the inputs to the network.

```
outputs, final_state = tf.nn.dynamic_rnn(cell, inputs, initial_state=initial_state)
```

Above I created an initial state, `initial_state`, to pass to the RNN. This is the cell state that is passed between the hidden layers in successive time steps. `tf.nn.dynamic_rnn` takes care of most of the work for us. We pass in our cell and the input to the cell, then it does the unrolling and everything else for us. It returns outputs for each time step and the final_state of the hidden layer.
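To make the unrolling concrete, here is a toy forward pass through a recurrent step, written in plain Python. It is a deliberately simplified sketch: a scalar tanh RNN rather than an LSTM, with made-up weights `w_in` and `w_rec`. For an LSTM the final state holds both a cell state and a hidden state, which this sketch collapses into one number.

```python
import math

def toy_rnn(inputs, initial_state, w_in=0.5, w_rec=0.8):
    """Unroll a scalar tanh RNN; returns (output at every step, final state)."""
    state = initial_state
    outputs = []
    for x in inputs:
        # The new state depends on the current input AND the previous state
        state = math.tanh(w_in * x + w_rec * state)
        outputs.append(state)
    return outputs, state

toy_outputs, toy_final_state = toy_rnn([1.0, 0.0, -1.0], initial_state=0.0)
print(len(toy_outputs))                    # 3, one output per time step
print(toy_outputs[-1] == toy_final_state)  # True
```

The same pattern applies here: one output per time step, plus a final state that can be fed back in as the next initial state.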

> **Exercise:** Use `tf.nn.dynamic_rnn` to add the forward pass through the RNN. Remember that we're actually passing in vectors from the embedding layer, `embed`.



In [59]:
with graph.as_default():
    # swap_memory: Transparently swap the tensors produced in forward inference
    #    but needed for back prop from GPU to CPU.  This allows training RNNs
    #    which would typically not fit on a single GPU, with very minimal (or no)
    #    performance penalty.
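    # outputs: output of the top LSTM layer at every time step,
    #    shape (num_reviews_per_batch, seq_len, lstm_width)
    # final_state: LSTM state after the last time step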
    outputs, final_state = tf.nn.dynamic_rnn(
        cell, 
        embed, 
        initial_state=initial_state,
        swap_memory=True)

### Output

We only care about the final output, which we'll use as our sentiment prediction. So we need to grab the last output with `outputs[:, -1]`, then calculate the cost from that and `labels_`.

Why do we use all rows but only the last column in `outputs`? `outputs` has one row per review and one column per time step, with each entry being an `lstm_width`-element vector. The implementation of our RNN is unrolled (see figure above), so every earlier column holds an output that was produced before the network had seen the whole `seq_len`-element sequence. We only care about the final prediction/decision/output of the unrolled RNN.

In [60]:
with graph.as_default():
    predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)
    cost = tf.losses.mean_squared_error(labels_, predictions)
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
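The same three ideas (keep the last step, squash with a sigmoid, compare with the label using mean squared error) can be traced by hand on toy numbers. This sketch assumes each time step has already been reduced to a single number, which skips the fully connected layer; the `logits` values and `toy_` names are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical pre-activation values: 2 reviews x 3 time steps
logits = [[0.1, -0.3, 2.0],
          [0.4, 0.2, -1.5]]
toy_targets = [1, 0]

# Keep only the final time step of every review
last_step = [seq[-1] for seq in logits]

# Sigmoid maps each value into (0, 1), read as "probability of positive"
toy_preds = [sigmoid(z) for z in last_step]

# Mean squared error between predictions and labels
toy_cost = sum((p - t) ** 2 for p, t in zip(toy_preds, toy_targets)) / len(toy_targets)

print(last_step)  # [2.0, -1.5]
print(toy_cost)
```

Earlier steps such as `0.1` and `0.4` never enter the cost, which mirrors ignoring all but the last column of `outputs`.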

### Validation accuracy

Here we can add a few nodes to calculate the accuracy which we'll use in the validation pass.

Note that accuracy is a reasonable metric here only because, across a collection of many movies and many reviewer personality types, one can assume fairly balanced positive and negative reviews. It would be less informative if, for example, 97% of the reviews were negative (0) and only 3% were positive. In that case, a model could just guess negative for every review and still reach 97% accuracy.

In [61]:
with graph.as_default():
    correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels_)
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
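The imbalance caveat is easy to reproduce with toy labels. In this sketch (all names are made up), a "model" that always predicts negative scores 97% accuracy while finding none of the positive reviews.

```python
# Hypothetical imbalanced labels: 97 negative, 3 positive
toy_labels = [0] * 97 + [1] * 3

# A "model" that always guesses negative
always_negative = [0] * len(toy_labels)

n_correct = sum(p == t for p, t in zip(always_negative, toy_labels))
toy_accuracy = n_correct / len(toy_labels)

# Fraction of the positive reviews that were actually found
n_found = sum(p == 1 and t == 1 for p, t in zip(always_negative, toy_labels))
positive_recall = n_found / sum(toy_labels)

print(toy_accuracy)     # 0.97
print(positive_recall)  # 0.0
```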

### Batching

This is a simple function for returning batches from our data. First it removes data such that we only have full batches. Then it iterates through the `x` and `y` arrays and returns slices out of those arrays with size `[batch_size]`. We'll pass `num_reviews_per_batch` as the batch size when calling it.

In [62]:
def get_batches(x, y, batch_size=100):
    
    n_batches = len(x)//batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]
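A quick usage check on toy data shows the "full batches only" behavior. The function is repeated so the snippet runs on its own; `toy_x` and `toy_y` are made-up lists standing in for the feature and label arrays.

```python
def get_batches(x, y, batch_size=100):
    n_batches = len(x) // batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]

toy_x = list(range(7))
toy_y = [v % 2 for v in toy_x]

toy_batches = list(get_batches(toy_x, toy_y, batch_size=3))
print(len(toy_batches))  # 2 (the 7th item is dropped)
print(toy_batches[0])    # ([0, 1, 2], [0, 1, 0])
```

Dropping the remainder matters here because `initial_state` is created for exactly `num_reviews_per_batch` reviews, so a smaller final batch would not match its shape.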

## Training

Below is the typical training code. If you want to do this yourself, feel free to delete all this code and implement it yourself. Before you run this, make sure the `checkpoints` directory exists.
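One way to make sure the directory exists is to create it from Python before training. This is a small sketch using the standard library; the directory name matches the path passed to `saver.save` in the training cell.

```python
import os

checkpoint_dir = 'checkpoints'

# Create the directory if it is missing; do nothing if it already exists
os.makedirs(checkpoint_dir, exist_ok=True)

print(os.path.isdir(checkpoint_dir))  # True
```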

In [63]:
epochs = 10

with graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    iteration = 1
    for e in range(epochs):
        state = sess.run(initial_state)
        
        for ii, (x, y) in enumerate(get_batches(trn_x, trn_y, num_reviews_per_batch), 1):
            feed = {inputs_: x,
                    labels_: y,
                    keep_prob: 0.5,
                    initial_state: state}
            loss, state, _ = sess.run([cost, final_state, optimizer], feed_dict=feed)
            
            if iteration%5==0:
                print("Epoch: {}/{}".format(e, epochs),
                      "Iteration: {}".format(iteration),
                      "Train loss: {:.3f}".format(loss))

            if iteration%25==0:
                val_acc = []
                val_state = sess.run(cell.zero_state(num_reviews_per_batch, tf.float32))
                for x, y in get_batches(val_x, val_y, num_reviews_per_batch):
                    feed = {inputs_: x,
                            labels_: y,
                            keep_prob: 1,
                            initial_state: val_state}
                    batch_acc, val_state = sess.run([accuracy, final_state], feed_dict=feed)
                    val_acc.append(batch_acc)
                print("Val acc: {:.3f}".format(np.mean(val_acc)))
            iteration +=1
    saver.save(sess, "checkpoints/sentiment.ckpt")

Epoch: 0/10 Iteration: 5 Train loss: 0.249
Epoch: 0/10 Iteration: 10 Train loss: 0.239
Epoch: 0/10 Iteration: 15 Train loss: 0.238
Epoch: 0/10 Iteration: 20 Train loss: 0.234
Epoch: 0/10 Iteration: 25 Train loss: 0.228
Val acc: 0.624
Epoch: 0/10 Iteration: 30 Train loss: 0.239
Epoch: 0/10 Iteration: 35 Train loss: 0.224
Epoch: 0/10 Iteration: 40 Train loss: 0.184
Epoch: 1/10 Iteration: 45 Train loss: 0.177
Epoch: 1/10 Iteration: 50 Train loss: 0.181
Val acc: 0.730
Epoch: 1/10 Iteration: 55 Train loss: 0.184
Epoch: 1/10 Iteration: 60 Train loss: 0.165
Epoch: 1/10 Iteration: 65 Train loss: 0.162
Epoch: 1/10 Iteration: 70 Train loss: 0.178
Epoch: 1/10 Iteration: 75 Train loss: 0.147
Val acc: 0.788
Epoch: 1/10 Iteration: 80 Train loss: 0.139
Epoch: 2/10 Iteration: 85 Train loss: 0.124
Epoch: 2/10 Iteration: 90 Train loss: 0.134
Epoch: 2/10 Iteration: 95 Train loss: 0.108
Epoch: 2/10 Iteration: 100 Train loss: 0.123
Val acc: 0.794
Epoch: 2/10 Iteration: 105 Train loss: 0.113
Epoch: 2/10 Ite

## Testing

In [66]:
test_acc = []
with tf.Session(graph=graph) as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    test_state = sess.run(cell.zero_state(num_reviews_per_batch, tf.float32))
    for ii, (x, y) in enumerate(get_batches(tst_x, tst_y, num_reviews_per_batch), 1):
        feed = {inputs_: x,
                labels_: y,
                keep_prob: 1,
                initial_state: test_state}
        batch_acc, test_state = sess.run([accuracy, final_state], feed_dict=feed)
        test_acc.append(batch_acc)
    print("Test accuracy: {:.3f}".format(np.mean(test_acc)))

Test accuracy: 0.819
