# Lab3: Sentiment, but slower!

In this assignment, you'll implement an **RNN-based sentence classifier**. Plain ol' RNNs aren't very good at sentiment classification, and they're very picky about things like learning rates. However, they're the foundation for things like LSTMs, which we'll learn about next week, and which *are* quite useful.

## Setup

First, let's load the data as before.

In [7]:
sst_home = '../trees'

import re
import random

# Let's do 2-way positive/negative classification instead of 5-way
easy_label_map = {0:0, 1:0, 2:None, 3:1, 4:1}

def load_sst_data(path):
    data = []
    with open(path) as f:
        for line in f:
            example = {}
            example['label'] = easy_label_map[int(line[1])]
            if example['label'] is None:
                continue
            
            # Strip out the parse information and the phrase labels---we don't need those here
            text = re.sub(r'\s*(\(\d)|(\))\s*', '', line)
            example['text'] = text[1:]
            data.append(example)

    random.seed(1)
    random.shuffle(data)
    return data
     
training_set = load_sst_data(sst_home + '/train.txt')
dev_set = load_sst_data(sst_home + '/dev.txt')
test_set = load_sst_data(sst_home + '/test.txt')

print('Training size: {}'.format(len(training_set)))
print('Dev size: {}'.format(len(dev_set)))
print('Test size: {}'.format(len(test_set)))

Training size: 6920
Dev size: 872
Test size: 1821


Next, we'll convert the data to __index vectors__.

To simplify your implementation, we'll use a __fixed unrolling length of 20__. In the conversion process, we'll drop excess words from the start (left end) of long sentences, pad short sentences on the left with a special word symbol `<PAD>`, and mark out-of-vocabulary words with `<UNK>`, for unknown. As in the previous assignment, we'll use a very small vocabulary for this assignment, so you'll see `<UNK>` often.
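The padding and truncation rule just described can be sketched on a toy example. This is only an illustration: the helper name `pad_or_truncate`, the toy vocabulary, and the sequence length of 5 are assumptions made for readability, not part of the lab code.

```python
import numpy as np

PADDING, UNKNOWN = "<PAD>", "<UNK>"

def pad_or_truncate(tokens, word_indices, seq_len):
    """Left-pad short sequences and drop leading tokens from long ones."""
    kept = tokens[-seq_len:]  # keep only the last seq_len tokens
    ids = [word_indices.get(tok, word_indices[UNKNOWN]) for tok in kept]
    pad = [word_indices[PADDING]] * (seq_len - len(ids))
    return np.array(pad + ids, dtype=np.int32)

# Hypothetical toy vocabulary, for illustration only
toy_indices = {PADDING: 0, UNKNOWN: 1, "a": 2, "good": 3, "movie": 4}

short = pad_or_truncate("a good movie".split(), toy_indices, 5)
long = pad_or_truncate("a very good movie indeed , really".split(), toy_indices, 5)
print(short)  # [0 0 2 3 4]  -> two <PAD> symbols on the left
print(long)   # [3 4 1 1 1]  -> first two words dropped, unknown words become <UNK>
```

Padding on the left keeps the real words next to the final hidden state, which is the only state the classifier reads.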

In [8]:
import collections
import numpy as np

def sentence_to_padded_index_sequence(datasets):
    '''Annotates datasets with feature vectors.'''
    
    PADDING = "<PAD>"
    UNKNOWN = "<UNK>"
    SEQ_LEN = 20
    
    # Extract vocabulary
    def tokenize(string):
        return string.lower().split()
    
    word_counter = collections.Counter()
    for example in datasets[0]:
        word_counter.update(tokenize(example['text']))
    
    vocabulary = set([word for word in word_counter if word_counter[word] > 10])
    vocabulary = list(vocabulary)
    vocabulary = [PADDING, UNKNOWN] + vocabulary
        
    word_indices = dict(zip(vocabulary, range(len(vocabulary))))
    indices_to_words = {v: k for k, v in word_indices.items()}
        
    for dataset in datasets:
        for example in dataset:
            example['index_sequence'] = np.zeros((SEQ_LEN), dtype=np.int32)
            
            token_sequence = tokenize(example['text'])
            padding = SEQ_LEN - len(token_sequence)
            
            for i in range(SEQ_LEN):
                if i >= padding:
                    if token_sequence[i - padding] in word_indices:
                        index = word_indices[token_sequence[i - padding]]
                    else:
                        index = word_indices[UNKNOWN]
                else:
                    index = word_indices[PADDING]
                example['index_sequence'][i] = index
    return indices_to_words, word_indices
    
indices_to_words, word_indices = sentence_to_padded_index_sequence([training_set, dev_set, test_set])

In [9]:
print(training_set[18])
print(len(word_indices))

{'text': 'As the dominant Christine , Sylvie Testud is icily brilliant .', 'index_sequence': array([  0,   0,   0,   0,   0,   0,   0,   0,   0, 422, 958,   1,   1,
       682,   1,   1, 318,   1, 363, 509], dtype=int32), 'label': 1}
1250


In [10]:
def evaluate_classifier(classifier, eval_set):
    correct = 0
    hypotheses = classifier(eval_set)
    for i, example in enumerate(eval_set):
        hypothesis = hypotheses[i]
        if hypothesis == example['label']:
            correct += 1        
    return correct / float(len(eval_set))

## Assignments: Building the RNN

Replace the TODOs in the code below to make the RNN work. If it's set up properly, it should reach a dev set accuracy of about 0.7 within 500 epochs with the given hyperparameters.

You will find 3 TODOs in the code.

### TODO 1:

- You have to define the RNN parameters (the attribute *self.dim* sets the dimension of the hidden state). 

- (Hint) The parameters take the input's embedding (*self.embedding_dim*) and the previous hidden state (*self.dim*) and produce the current hidden state (*self.dim*).
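As a rough check on the shapes this hint implies, here is a NumPy sketch. It assumes a single weight matrix applied to the concatenation of the previous hidden state and the embedding, plus a bias vector. The sizes 8 and 24 are the ones set in the classifier's hyperparameters.

```python
import numpy as np

embedding_dim, dim = 8, 24  # values used by the classifier in this lab

rng = np.random.default_rng(0)
# One matrix maps the concatenated vector (dim + embedding_dim) to the new state (dim)
W_rnn = rng.normal(scale=0.1, size=(dim + embedding_dim, dim))
b_rnn = rng.normal(scale=0.1, size=(dim,))

n_params = W_rnn.size + b_rnn.size
print(W_rnn.shape, b_rnn.shape, n_params)  # (32, 24) (24,) 792
```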

### TODO 2:

- Write a (very short) Python function that defines one step of an RNN. (Hint) Each step takes the current input and the previous hidden state. 

- Recall from the slides: $f(h_{t-1}, p_t) = \tanh(W[h_{t-1};p_t])$. Note that the input $x_t$ at time step $t$ is first *translated* to its embedding representation $p_t$. 

![](./rnn2.png)
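A single step of this recurrence can be sketched in NumPy as follows. The function name `rnn_step` is hypothetical. The sketch stores one example per row, so it computes $[h_{t-1};p_t]W$ rather than $W[h_{t-1};p_t]$. It also adds a bias term, which the slide formula omits.

```python
import numpy as np

def rnn_step(p_t, h_prev, W, b):
    """One RNN step: h_t = tanh([h_{t-1}; p_t] W + b), batched over rows."""
    concat = np.concatenate([h_prev, p_t], axis=1)  # [batch, dim + embedding_dim]
    return np.tanh(concat @ W + b)                  # [batch, dim]

batch, embedding_dim, dim = 4, 8, 24
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(dim + embedding_dim, dim))
b = np.zeros(dim)                                   # bias is an assumption, see above
h_prev = np.zeros((batch, dim))                     # previous hidden state
p_t = rng.normal(size=(batch, embedding_dim))       # embeddings of the current words

h_t = rnn_step(p_t, h_prev, W, b)
print(h_t.shape)  # (4, 24)
```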


### TODO 3:

- Unroll the RNN using a *for* loop, and obtain the sentence representation with the final hidden state.

- (Hint) Note that we are vectorizing the whole minibatch. That is, in each step we are processing all the examples in the batch together in one go. Try to understand the following two code lines:

   $\rightarrow$ ``self.x_slices = tf.split(self.x, self.sequence_length, 1)``
   
   $\rightarrow$ ``self.h_zero = tf.zeros([self.batch_size, self.dim])``
   
- (Hint) It might be a good idea to reshape (`tf.reshape`) the embedding tensor at step $t$ into a single 2-D tensor of shape `[batch_size, embedding_dim]`. 
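The splitting, reshaping, and unrolling hinted at above can be tried out in NumPy, where `np.split` plays the role of `tf.split` and indexing into `E` plays the role of an embedding lookup. The small sizes and random parameters below are illustrative assumptions, not the lab's values.

```python
import numpy as np

batch, seq_len, vocab, embedding_dim, dim = 3, 5, 10, 4, 6
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(vocab, embedding_dim))
W = rng.normal(scale=0.1, size=(dim + embedding_dim, dim))
b = np.zeros(dim)

x = rng.integers(0, vocab, size=(batch, seq_len))  # word indices, one row per example
x_slices = np.split(x, seq_len, axis=1)            # seq_len arrays of shape [batch, 1]

h = np.zeros((batch, dim))                         # analogue of h_zero
for x_t in x_slices:
    p_t = E[x_t]                                   # [batch, 1, embedding_dim]
    p_t = p_t.reshape(batch, embedding_dim)        # drop the length-1 axis
    h = np.tanh(np.concatenate([h, p_t], axis=1) @ W + b)

sentence_representation = h                        # final hidden state
print(x_slices[0].shape, sentence_representation.shape)  # (3, 1) (3, 6)
```

Each slice holds the word at one position for every example in the minibatch, so one pass through the loop advances all sentences by one time step.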

In [11]:
import tensorflow as tf

In [27]:
class RNNSentimentClassifier:
    def __init__(self, vocab_size, sequence_length):
        # Define the hyperparameters
        self.learning_rate = 0.2  # Should be about right
        self.training_epochs = 500  # How long to train for - chosen to fit within class time
        self.display_epoch_freq = 5  # How often to test and print out statistics
        self.dim = 24  # The dimension of the hidden state of the RNN
        self.embedding_dim = 8  # The dimension of the learned word embeddings
        self.batch_size = 256  # Somewhat arbitrary - can be tuned, but often tune for speed, not accuracy
        self.vocab_size = vocab_size  # Defined by the file reader above
        self.sequence_length = sequence_length  # Defined by the file reader above
        self.l2_lambda = 0.001
        
        # Define the parameters
        self.E = tf.Variable(tf.random_normal([self.vocab_size, self.embedding_dim], stddev=0.1))
        
        self.W_cl = tf.Variable(tf.random_normal([self.dim, 2], stddev=0.1))
        self.b_cl = tf.Variable(tf.random_normal([2], stddev=0.1))
        
        # TODO 1: Define the RNN parameters
        self.W_rnn = tf.Variable(tf.random_normal([self.embedding_dim+self.dim, self.dim], stddev=0.1))
        self.b_rnn = tf.Variable(tf.random_normal([self.dim], stddev=0.1))
        
        # Define the placeholders
        self.x = tf.placeholder(tf.int32, [None, self.sequence_length])
        self.y = tf.placeholder(tf.int32, [None])
        
        # Split up the inputs into individual tensors
        self.x_slices = tf.split(self.x, self.sequence_length, 1)
    
        # Define the start state of the RNN
        self.h_zero = tf.zeros([self.batch_size, self.dim])
        
        # TODO 2: Write a (very short) Python function that defines one step of an RNN
        def step(x, h_prev):
            # Add your code here 
            pt=tf.nn.embedding_lookup(self.E,x)            
            concat=tf.concat([h_prev,tf.reshape(pt,[self.batch_size,self.embedding_dim])],1)
            h = tf.tanh(tf.matmul(concat, self.W_rnn) + self.b_rnn)  # Broadcasted addition            
            return h
        
        # TODO 3: Unroll the RNN using a for loop, and obtain the sentence representation with the final hidden state
        current_h=self.h_zero
        for i in range(self.sequence_length):
            current_h = step(self.x_slices[i], current_h)
        sentence_representation = current_h

        # Compute the logits using one last linear layer
        self.logits = tf.matmul(sentence_representation, self.W_cl) + self.b_cl


        # Define the L2 cost
        self.l2_cost = self.l2_lambda * (tf.reduce_sum(tf.square(self.W_rnn)) +
                                         tf.reduce_sum(tf.square(self.W_cl)))
        
        # Define the cost function (here, the softmax exp and sum are built in)
        self.total_cost = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=self.y, logits=self.logits)+self.l2_cost)
        
        # This performs the main SGD update equation with gradient clipping
        optimizer_obj = tf.train.GradientDescentOptimizer(learning_rate=self.learning_rate)
        gvs = optimizer_obj.compute_gradients(self.total_cost)
        capped_gvs = [(tf.clip_by_norm(grad, 5.0), var) for grad, var in gvs if grad is not None]
        self.optimizer = optimizer_obj.apply_gradients(capped_gvs)
        
        # Create an operation to initialize all variables (E, W_cl, b_cl, W_rnn, b_rnn)
        self.init = tf.initialize_all_variables()
        
        # Create a placeholder for the session that will be shared between training and evaluation
        self.sess = None
        
    def train(self, training_set, dev_set):
        def get_minibatch(dataset, start_index, end_index):
            indices = range(start_index, end_index)
            vectors = np.vstack([dataset[i]['index_sequence'] for i in indices])
            labels = [dataset[i]['label'] for i in indices]
            return vectors, labels
        
        self.sess = tf.Session()
        
        self.sess.run(self.init)
        print('Training.')

        # Training cycle
        for epoch in range(self.training_epochs):
            random.shuffle(training_set)
            avg_cost = 0.
            total_batch = int(len(training_set) / self.batch_size)
            
            # Loop over all batches in epoch
            for i in range(total_batch):
                # Assemble a minibatch of the next B examples
                minibatch_vectors, minibatch_labels = get_minibatch(training_set, 
                                                                    self.batch_size * i, 
                                                                    self.batch_size * (i + 1))

                # Run the optimizer to take a gradient step, and also fetch the value of the 
                # cost function for logging
                _, c = self.sess.run([self.optimizer, self.total_cost], 
                                     feed_dict={self.x: minibatch_vectors,
                                                self.y: minibatch_labels})
                                                                    
                # Compute average loss
                avg_cost += c / total_batch
                
            # Display some statistics about the epoch
            # Evaluate on exactly one batch of data: h_zero has a fixed batch_size,
            # so classify() expects batch_size examples at a time
            if (epoch+1) % self.display_epoch_freq == 0:
                print("Epoch:", (epoch+1), "Cost:", avg_cost,
                      "Dev acc:", evaluate_classifier(self.classify, dev_set[0:self.batch_size]),
                      "Train acc:", evaluate_classifier(self.classify, training_set[0:self.batch_size]))
    
    def classify(self, examples):
        # This classifies a list of examples
        vectors = np.vstack([example['index_sequence'] for example in examples])
        logits = self.sess.run(self.logits, feed_dict={self.x: vectors})
        return np.argmax(logits, axis=1)
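The optimizer in the class above clips each gradient tensor separately with `tf.clip_by_norm(grad, 5.0)`. The NumPy function below is a sketch of that rule as I understand it, not TensorFlow's implementation: a gradient whose L2 norm exceeds the threshold is rescaled to have exactly that norm, and smaller gradients pass through unchanged.

```python
import numpy as np

def clip_by_norm(grad, clip_norm):
    """Rescale grad so its L2 norm is at most clip_norm; the direction is unchanged."""
    norm = np.linalg.norm(grad)
    if norm <= clip_norm:
        return grad
    return grad * (clip_norm / norm)

small = clip_by_norm(np.array([3.0, 4.0]), 5.0)    # norm 5: left as is
large = clip_by_norm(np.array([30.0, 40.0]), 5.0)  # norm 50: scaled down to norm 5
print(np.linalg.norm(small), np.linalg.norm(large))
```

Clipping of this kind helps keep an occasional very large gradient from derailing training, which matters for plain RNNs given how sensitive they are to the learning rate.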

In [30]:
classifier = RNNSentimentClassifier(len(word_indices), 20)
classifier.train(training_set, dev_set)

Instructions for updating:
Use `tf.global_variables_initializer` instead.
Training.
Epoch: 5 Cost: 0.700139891218 Dev acc: 0.5546875 Train acc: 0.5078125
Epoch: 10 Cost: 0.69915288466 Dev acc: 0.5546875 Train acc: 0.484375
Epoch: 15 Cost: 0.69836516292 Dev acc: 0.5546875 Train acc: 0.5546875
Epoch: 20 Cost: 0.697663667025 Dev acc: 0.5546875 Train acc: 0.48828125
Epoch: 25 Cost: 0.69692420518 Dev acc: 0.5546875 Train acc: 0.53515625
Epoch: 30 Cost: 0.696329328749 Dev acc: 0.56640625 Train acc: 0.52734375
Epoch: 35 Cost: 0.695711184431 Dev acc: 0.5546875 Train acc: 0.44921875
Epoch: 40 Cost: 0.69532820472 Dev acc: 0.56640625 Train acc: 0.4609375
Epoch: 45 Cost: 0.695023874442 Dev acc: 0.56640625 Train acc: 0.54296875
Epoch: 50 Cost: 0.694452219539 Dev acc: 0.546875 Train acc: 0.484375
Epoch: 55 Cost: 0.694087096938 Dev acc: 0.55078125 Train acc: 0.53515625
Epoch: 60 Cost: 0.69383293611 Dev acc: 0.546875 Train acc: 0.49609375
Epoch: 65 Cost: 0.693075096166 Dev acc: 0.546875 Train acc: 0.5

# Attribution:
Adapted by Oier Lopez de Lacalle, based on a notebook by Sam Bowman at NYU