# Sentiment Classification with RNNs (LSTMs) 

In this assignment you will train and evaluate sentiment classification models that use recurrent neural networks (RNNs) implemented in PyTorch. You will need to install the <a href="https://pytorch.org/">PyTorch</a> package by following the instructions below (installation with <a href="https://www.anaconda.com/">conda</a> is recommended):

https://pytorch.org/get-started/locally

## Write Your Name Here:

# <font color="blue"> Submission Instructions</font>

1. Click the Save button at the top of the Jupyter Notebook.
2. Please make sure to have entered your name above.
3. Select Cell -> All Output -> Clear. This will clear all the outputs from all cells (but will keep the content of all cells).
4. Select Cell -> Run All. This will run all the cells in order, and will take several minutes.
5. Once you've rerun everything, select File -> Download as -> PDF via LaTeX and download a PDF version *lstm-sentiment.pdf* showing the code and the output of all cells, and save it in the same folder that contains the notebook file *lstm-sentiment.ipynb*.
6. Look at the PDF file and make sure all your solutions are there, displayed correctly. The PDF is the only thing we will see when grading!
7. Submit **both** your PDF and notebook on Canvas. Make sure the PDF and notebook show the outputs of the training and evaluation procedures. Also upload the **output** on the test datasets.

In [1]:
from models import *
from sentiment_data import *

import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from typing import List, NamedTuple

class HyperParams(NamedTuple):
    lstm_size: int
    hidden_size: int
    lstm_layers: int
    drop_out: float
    num_epochs: int
    batch_size: int
    seq_max_len: int

# LSTM-based training and evaluation procedures

We will use the `RNNet` class defined in `models.py`, which uses LSTMs implemented in PyTorch. Depending on the options, this class runs one LSTM (forward) or two LSTMs (bidirectional, forward-backward) on the padded input text. The last state (or the concatenated last states), or the average of the states, is used as input to a fully connected network with 3 hidden layers, with a final sigmoid output node computing the probability of the positive class.
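
As a rough picture of this architecture, here is a minimal sketch of the one-LSTM, last-state variant. It is not the `RNNet` code from `models.py`: the class name `TinyLSTMClassifier`, its arguments, and the choice of ReLU activations are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class TinyLSTMClassifier(nn.Module):
    """Hypothetical stand-in for RNNet: one LSTM + fully connected head."""

    def __init__(self, embed_size, lstm_size, hidden_size, drop_out):
        super().__init__()
        self.lstm = nn.LSTM(embed_size, lstm_size, batch_first=True)
        # Three hidden layers followed by a single sigmoid output node.
        self.head = nn.Sequential(
            nn.Linear(lstm_size, hidden_size), nn.ReLU(), nn.Dropout(drop_out),
            nn.Linear(hidden_size, hidden_size), nn.ReLU(), nn.Dropout(drop_out),
            nn.Linear(hidden_size, hidden_size), nn.ReLU(), nn.Dropout(drop_out),
            nn.Linear(hidden_size, 1), nn.Sigmoid(),
        )

    def forward(self, x, seq_lens):
        # x: (batch, seq_max_len, embed_size); seq_lens: (batch,) true lengths.
        # Lengths are clamped because a sentence may be longer than the padded width.
        lens = seq_lens.clamp(min=1, max=x.size(1)).cpu()
        packed = nn.utils.rnn.pack_padded_sequence(
            x, lens, batch_first=True, enforce_sorted=False)
        # h_n holds the state at the last real token of every sequence.
        _, (h_n, _) = self.lstm(packed)
        return self.head(h_n[-1])  # (batch, 1) probabilities


model = TinyLSTMClassifier(embed_size=300, lstm_size=50, hidden_size=50, drop_out=0.5)
model.eval()  # disable dropout for this demonstration
probs = model(torch.randn(4, 60, 300), torch.tensor([5, 60, 12, 75]))
print(probs.shape)  # torch.Size([4, 1])
```

Packing the padded batch lets the LSTM stop at each sentence's true length, so the returned final state is not contaminated by padding positions.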

In [2]:
# Training procedure for LSTM-based models
def train_model(hp: HyperParams,
                train_exs: List[SentimentExample],
                dev_exs: List[SentimentExample],
                test_exs: List[SentimentExample], 
                word_vectors: WordEmbeddings,
                use_average, bidirectional):
    train_size = len(train_exs)
    class_num = 1
    
    # Train on the GPU when one is available; set use_gpu = False to force CPU training.
    use_gpu = torch.cuda.is_available()
    if use_gpu: # Set tensor type when using GPU
        float_type = torch.cuda.FloatTensor
    else: # Set tensor type when using CPU
        float_type = torch.FloatTensor
        
    # Pad each training example to hp.seq_max_len word indices so the inputs form a rectangular matrix.
    train_mat = np.asarray([pad_to_length(np.array(ex.indexed_words), hp.seq_max_len) for ex in train_exs])
    # Also store the actual sequence lengths.
    train_seq_lens = np.array([len(ex.indexed_words) for ex in train_exs])
    
    # Training input reversed, useful when using a bidirectional LSTM.
    train_mat_rev = np.asarray([pad_to_length(np.array(ex.get_indexed_words_reversed()), hp.seq_max_len) for ex in train_exs])

    # Extract labels.
    train_labels_arr = np.array([ex.label for ex in train_exs])
    targets = train_labels_arr
    
    # Extract embedding vectors.
    embed_size = word_vectors.get_embedding_length()
    embeddings_vec = np.array(word_vectors.vectors).astype(float)
    
    # Create RNN model.
    rnnModel = RNNet(hp.lstm_size, hp.hidden_size, hp.lstm_layers, hp.drop_out,
                     class_num, word_vectors, 
                     use_average, bidirectional,
                     use_gpu = use_gpu)
    
    # If GPU is available, then run experiments on GPU
    if use_gpu:
        rnnModel.cuda()
    
    # Specify optimizer.
    optimizer = optim.Adam(filter(lambda p: p.requires_grad, rnnModel.parameters()), 
                           lr = 5e-3, weight_decay = 5e-3, betas = (0.9, 0.9))
    
    # Define loss function: Binary Cross Entropy loss for logistic regression (binary classification).
    criterion = nn.BCELoss()
    
    
    # Get embeddings of words for forward and reverse sentence: (num_ex * seq_max_len * embedding_size)
    x = np.zeros((train_size, hp.seq_max_len, embed_size))
    x_rev = np.zeros((train_size, hp.seq_max_len, embed_size))
    for i in range(train_size):
        x[i] = embeddings_vec[train_mat[i].astype(int)]
        x_rev[i] = embeddings_vec[train_mat_rev[i].astype(int)]
    
    # Train the RNN model, gradient descent loop over minibatches.
    for epoch in range(hp.num_epochs):
        rnnModel.train()
        
        ex_idxs = [i for i in range(train_size)]
        random.shuffle(ex_idxs)
        
        total_loss = 0.0
        start = 0
        while start < train_size:
            end = min(start + hp.batch_size, train_size)
            
            # Slice the current minibatch: forward and reversed inputs, labels, and sequence lengths.
            x_batch = form_input(x[ex_idxs[start:end]]).type(float_type)
            x_batch_rev = form_input(x_rev[ex_idxs[start:end]]).type(float_type)
            y_batch = form_input(targets[ex_idxs[start:end]]).type(float_type)
            seq_lens_batch = train_seq_lens[ex_idxs[start:end]]
            
            # Compute output probabilities over all examples in minibatch.
            probs = rnnModel(x_batch, x_batch_rev, seq_lens_batch).flatten()
            
            # Compute loss over all examples in minibatch.
            loss = criterion(probs, y_batch)
            total_loss += loss.item()
            
            # Zero gradients, perform a backward pass, and update the weights.
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            start = end
            
        print("Loss on epoch %i: %f" % (epoch, total_loss))
        
        # Print accuracy on training and development data.
        if epoch % 10 == 0:
            acc = eval_model(rnnModel, train_exs, embeddings_vec, hp.seq_max_len)
            print('Epoch', epoch, ': Accuracy on training set:', acc)
            acc = eval_model(rnnModel, dev_exs, embeddings_vec, hp.seq_max_len)
            print('Epoch', epoch, ': Accuracy on development set:', acc)
            
    # Evaluate model on the training dataset.
    acc = eval_model(rnnModel, train_exs, embeddings_vec, hp.seq_max_len)
    print('Accuracy on training set:', acc)
    
    # Evaluate model on the development dataset.
    acc = eval_model(rnnModel, dev_exs, embeddings_vec, hp.seq_max_len)
    print('Accuracy on development set:', acc)
    
    return rnnModel

Here is the testing (evaluation) procedure.

In [3]:
# Evaluate the trained model on test examples and return predicted labels or accuracy.
def eval_model(model, exs, embeddings_vec, seq_max_len, pred_only = False):
    # Put model in evaluation mode.
    model.eval()
    
    # Extract size of word embedding.
    embed_size = len(embeddings_vec[0])
    
    # Pad the indexed words (forward and reversed) and store the actual sequence lengths.
    exs_mat = np.asarray([pad_to_length(np.array(ex.indexed_words), seq_max_len) for ex in exs])
    exs_mat_rev = np.asarray([pad_to_length(np.array(ex.get_indexed_words_reversed()), seq_max_len) for ex in exs])
    exs_seq_lens = np.array([len(ex.indexed_words) for ex in exs])
    
    # Get embeddings of words for forward and reverse sentence: (num_ex * seq_max_len * embedding_size)
    x = np.zeros((len(exs), seq_max_len, embed_size))
    x_rev = np.zeros((len(exs), seq_max_len, embed_size))
    for i,ex in enumerate(exs):
        x[i] = embeddings_vec[exs_mat[i].astype(int)]
        x_rev[i] = embeddings_vec[exs_mat_rev[i].astype(int)]
        
    x = form_input(x)
    x_rev = form_input(x_rev)
    
    # Run the model on the examples without tracking gradients, which saves memory.
    with torch.no_grad():
        preds = model(x, x_rev, exs_seq_lens).cpu().numpy().flatten()
    preds[preds >= 0.5] = 1
    preds[preds < 0.5] = 0
    
    if pred_only:
        return preds
    else:
        targets = np.array([ex.label for ex in exs])
        return np.mean(preds == targets)

# Experimental evaluations on the Rotten Tomatoes dataset.

First, code for reading the examples and the corresponding GloVe word embeddings.

In [4]:
random.seed(1)
np.random.seed(1)
torch.manual_seed(1)

word_vecs_path = '../data/glove.6B.300d-relativized.txt'

train_path = '../data/rt/train.txt'
dev_path = '../data/rt/dev.txt'
blind_test_path = '../data/rt/test-blind.txt'
test_output_path = 'test-blind.output.txt'

word_vectors = read_word_embeddings(word_vecs_path)
word_indexer = word_vectors.word_indexer

train_exs = read_and_index_sentiment_examples(train_path, word_indexer)
dev_exs = read_and_index_sentiment_examples(dev_path, word_indexer)
test_exs = read_and_index_sentiment_examples(blind_test_path, word_indexer)

print(repr(len(train_exs)) + " / " + 
      repr(len(dev_exs)) + " / " + 
      repr(len(test_exs)) + " train / dev / test examples")

Read in 30135 vectors of size 300
8530 / 1066 / 1066 train / dev / test examples


## Use only the last state from one LSTM

Evaluate one LSTM + fully connected network, using the last hidden state of the LSTM. If the evaluation takes more than 1 hour on your computer, try reducing `lstm_size`, `hidden_size`, `batch_size` and even `num_epochs`.

Our accuracy on development data is 75.98%
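
When a run is too slow, a smaller configuration can be derived from an existing `HyperParams` value instead of retyping every field. The sketch below repeats the `HyperParams` definition so that it runs on its own; the reduced values (25 units, 20 epochs) are arbitrary illustrations, not recommended settings.

```python
from typing import NamedTuple


class HyperParams(NamedTuple):
    lstm_size: int
    hidden_size: int
    lstm_layers: int
    drop_out: float
    num_epochs: int
    batch_size: int
    seq_max_len: int


hp = HyperParams(lstm_size=50, hidden_size=50, lstm_layers=1, drop_out=0.5,
                 num_epochs=50, batch_size=1024, seq_max_len=60)

# NamedTuples are immutable; _replace returns a modified copy and leaves hp unchanged.
hp_small = hp._replace(lstm_size=25, hidden_size=25, num_epochs=20)
print(hp_small.lstm_size, hp_small.num_epochs, hp.num_epochs)  # 25 20 50
```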

In [12]:
random.seed(1)
np.random.seed(1)
torch.manual_seed(1)

hp = HyperParams(lstm_size = 50, # hidden units in lstm
                 hidden_size = 50, # hidden size of fully-connected layer
                 lstm_layers = 1, # layers in lstm
                 drop_out = 0.5, # dropout rate
                 num_epochs = 50, # number of epochs for SGD-based procedure
                 batch_size = 1024, # examples in a minibatch
                 seq_max_len = 60) # maximum length of an example sequence
use_average = False
bidirectional = False

# Train RNN model.
model1 = train_model(hp, train_exs, dev_exs, test_exs, word_vectors, use_average, bidirectional)

# Generate RNN model predictions for test set.
embeddings_vec = np.array(word_vectors.vectors).astype(float)
test_exs_predicted = eval_model(model1, test_exs, embeddings_vec, hp.seq_max_len, pred_only = True)

# Write the test set output
for i, ex in enumerate(test_exs):
    ex.label = int(test_exs_predicted[i])
write_sentiment_examples(test_exs, test_output_path, word_indexer)

print("Prediction written to file for Rotten Tomatoes dataset.")

Loss on epoch 0: 6.267474
Epoch 0 : Accuracy on training set: 0.5847596717467761
Epoch 0 : Accuracy on development set: 0.5956848030018762
Loss on epoch 1: 5.798876
Loss on epoch 2: 5.336525
Loss on epoch 3: 4.944349
Loss on epoch 4: 4.594477
Loss on epoch 5: 4.402339
Loss on epoch 6: 4.373571
Loss on epoch 7: 4.300267
Loss on epoch 8: 4.347034
Loss on epoch 9: 4.273001
Loss on epoch 10: 4.303786
Epoch 10 : Accuracy on training set: 0.793200468933177
Epoch 10 : Accuracy on development set: 0.7560975609756098
Loss on epoch 11: 4.128220
Loss on epoch 12: 4.127317
Loss on epoch 13: 4.068804
Loss on epoch 14: 3.973914
Loss on epoch 15: 4.045157
Loss on epoch 16: 3.977415
Loss on epoch 17: 4.071261
Loss on epoch 18: 3.851947
Loss on epoch 19: 3.817534
Loss on epoch 20: 3.910577
Epoch 20 : Accuracy on training set: 0.8055099648300117
Epoch 20 : Accuracy on development set: 0.7495309568480301
Loss on epoch 21: 3.820059
Loss on epoch 22: 3.769975
Loss on epoch 23: 3.690644
Loss on epoch 24: 3.

## Use the average of all states from one LSTM

Evaluate one LSTM + fully connected network, using the average of all states of the LSTM.

Our accuracy on development data is 77.67%
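
How `RNNet` averages states is defined in `models.py` and is not shown here. One common way to average over a padded batch, sketched below as an assumption rather than a description of `RNNet`, is to mask out padded positions so that only the real tokens of each example contribute. The helper name `masked_mean` is hypothetical.

```python
import torch


def masked_mean(states, seq_lens):
    """Average LSTM states over real tokens only, ignoring padded positions."""
    batch, max_len, _ = states.shape
    lens = seq_lens.to(states.device).clamp(min=1, max=max_len)
    # mask[i, t] is 1.0 when position t is a real token of example i.
    positions = torch.arange(max_len, device=states.device).unsqueeze(0)
    mask = (positions < lens.unsqueeze(1)).float()
    summed = (states * mask.unsqueeze(-1)).sum(dim=1)
    return summed / lens.unsqueeze(1).float()


# Two examples padded to length 3; the first has only 2 real tokens.
states = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],
                       [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]])
avg = masked_mean(states, torch.tensor([2, 3]))
print(avg)  # rows: [2., 3.] and [4., 4.]
```

Without the mask, the padded state `[9., 9.]` of the first example would be included in its average.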

In [13]:
random.seed(1)
np.random.seed(1)
torch.manual_seed(1)

## YOUR CODE HERE

hp = HyperParams(lstm_size = 50, # hidden units in lstm
                 hidden_size = 50, # hidden size of fully-connected layer
                 lstm_layers = 1, # layers in lstm
                 drop_out = 0.5, # dropout rate
                 num_epochs = 50, # number of epochs for SGD-based procedure
                 batch_size = 1024, # examples in a minibatch
                 seq_max_len = 60) # maximum length of an example sequence
use_average = True
bidirectional = False

# Train RNN model.
model1 = train_model(hp, train_exs, dev_exs, test_exs, word_vectors, use_average, bidirectional)

# Generate RNN model predictions for test set.
embeddings_vec = np.array(word_vectors.vectors).astype(float)
test_exs_predicted = eval_model(model1, test_exs, embeddings_vec, hp.seq_max_len, pred_only = True)

# Write the test set output
for i, ex in enumerate(test_exs):
    ex.label = int(test_exs_predicted[i])
write_sentiment_examples(test_exs, test_output_path, word_indexer)

print("Prediction written to file for Rotten Tomatoes dataset.")

Loss on epoch 0: 6.157734
Epoch 0 : Accuracy on training set: 0.6492379835873388
Epoch 0 : Accuracy on development set: 0.6622889305816135
Loss on epoch 1: 5.381994
Loss on epoch 2: 4.778000
Loss on epoch 3: 4.582939
Loss on epoch 4: 4.454709
Loss on epoch 5: 4.404754
Loss on epoch 6: 4.420320
Loss on epoch 7: 4.258039
Loss on epoch 8: 4.284607
Loss on epoch 9: 4.263848
Loss on epoch 10: 4.221387
Epoch 10 : Accuracy on training set: 0.7749120750293084
Epoch 10 : Accuracy on development set: 0.7439024390243902
Loss on epoch 11: 4.174233
Loss on epoch 12: 4.122141
Loss on epoch 13: 4.154859
Loss on epoch 14: 4.099349
Loss on epoch 15: 4.126765
Loss on epoch 16: 4.133695
Loss on epoch 17: 4.122970
Loss on epoch 18: 4.016438
Loss on epoch 19: 4.001831
Loss on epoch 20: 3.994933
Epoch 20 : Accuracy on training set: 0.7994138335287222
Epoch 20 : Accuracy on development set: 0.7607879924953096
Loss on epoch 21: 3.870770
Loss on epoch 22: 3.868393
Loss on epoch 23: 3.862161
Loss on epoch 24: 3

## Use a bidirectional LSTM, concatenate last states

Evaluate Two LSTMs (bidirectional) + fully connected network, concatenate their last states.

Our accuracy on development data is 76.83%
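
The training code passes both the sentence and its reversed copy to the model, which suggests that the backward direction is a second LSTM reading the reversed input. The sketch below illustrates that idea under several assumptions: padding is appended at the end of each sequence, the helper `last_states` is hypothetical, and none of this is taken from `models.py`.

```python
import torch
import torch.nn as nn


def last_states(lstm, x, seq_lens):
    """Run `lstm` over padded input and return the state at each true last token."""
    states, _ = lstm(x)  # (batch, seq_max_len, lstm_size)
    last = (seq_lens.clamp(min=1, max=x.size(1)) - 1).long()
    return states[torch.arange(x.size(0)), last]


embed_size, lstm_size = 300, 50
fwd_lstm = nn.LSTM(embed_size, lstm_size, batch_first=True)
bwd_lstm = nn.LSTM(embed_size, lstm_size, batch_first=True)

x = torch.randn(4, 60, embed_size)  # stand-in for padded forward embeddings
seq_lens = torch.tensor([5, 60, 12, 30])

# Reverse only the real tokens so that padding stays at the end.
x_rev = x.clone()
for i, n in enumerate(seq_lens.tolist()):
    x_rev[i, :n] = x[i, :n].flip(0)

# Concatenate the two last states into one feature vector per example.
features = torch.cat([last_states(fwd_lstm, x, seq_lens),
                      last_states(bwd_lstm, x_rev, seq_lens)], dim=1)
print(features.shape)  # torch.Size([4, 100])
```

The concatenated vector has twice the LSTM size, so the first fully connected layer of a bidirectional model needs a correspondingly wider input.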

In [5]:
random.seed(1)
np.random.seed(1)
torch.manual_seed(1)

## YOUR CODE HERE

hp = HyperParams(lstm_size = 50, # hidden units in lstm
                 hidden_size = 50, # hidden size of fully-connected layer
                 lstm_layers = 1, # layers in lstm
                 drop_out = 0.5, # dropout rate
                 num_epochs = 50, # number of epochs for SGD-based procedure
                 batch_size = 1024, # examples in a minibatch
                 seq_max_len = 60) # maximum length of an example sequence
use_average = False
bidirectional = True

# Train RNN model.
model1 = train_model(hp, train_exs, dev_exs, test_exs, word_vectors, use_average, bidirectional)

# Generate RNN model predictions for test set.
embeddings_vec = np.array(word_vectors.vectors).astype(float)
test_exs_predicted = eval_model(model1, test_exs, embeddings_vec, hp.seq_max_len, pred_only = True)

# Write the test set output
for i, ex in enumerate(test_exs):
    ex.label = int(test_exs_predicted[i])
write_sentiment_examples(test_exs, test_output_path, word_indexer)

print("Prediction written to file for Rotten Tomatoes dataset.")

Loss on epoch 0: 6.474929
Epoch 0 : Accuracy on training set: 0.5371629542790153
Epoch 0 : Accuracy on development set: 0.5403377110694184
Loss on epoch 1: 6.134594
Loss on epoch 2: 5.732855
Loss on epoch 3: 4.986666
Loss on epoch 4: 4.877936
Loss on epoch 5: 4.496234
Loss on epoch 6: 4.401465
Loss on epoch 7: 4.166438
Loss on epoch 8: 4.170232
Loss on epoch 9: 4.128644
Loss on epoch 10: 4.159239


RuntimeError: CUDA out of memory. Tried to allocate 614.00 MiB (GPU 0; 4.00 GiB total capacity; 1.97 GiB already allocated; 85.09 MiB free; 2.61 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
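
The traceback above appears right after the loss for epoch 10 is printed, which is the point where `train_model` calls `eval_model`, and `eval_model` pushes the whole dataset through the model in a single forward pass. One possible workaround, sketched below under the assumption that the model can be called on arbitrary slices of its inputs, is to predict in smaller batches with gradient tracking disabled. The helper `predict_in_batches` and the toy model are hypothetical and not part of the assignment code.

```python
import numpy as np
import torch


def predict_in_batches(model_fn, x, x_rev, seq_lens, batch_size=256):
    """Call `model_fn` on slices of the inputs and stitch the outputs together."""
    outputs = []
    with torch.no_grad():  # no autograd graph is kept, which saves memory
        for start in range(0, len(x), batch_size):
            end = start + batch_size
            probs = model_fn(x[start:end], x_rev[start:end], seq_lens[start:end])
            outputs.append(probs.cpu().numpy().flatten())
    return np.concatenate(outputs)


# Toy stand-in for a trained model: squashes the mean of one embedding dimension.
def toy_model(x, x_rev, seq_lens):
    return torch.sigmoid(x[:, :, 0].mean(dim=1))


x = torch.randn(1000, 60, 8)
preds = predict_in_batches(toy_model, x, x.flip(1), np.full(1000, 60))
print(preds.shape)  # (1000,)
```

Only one batch of activations lives on the device at a time, so peak memory depends on `batch_size` rather than on the dataset size.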

## Use a bidirectional LSTM, concatenate the averages of their states

Evaluate Two LSTMs (bidirectional) + fully connected network, concatenate the averages of their states.

Our accuracy on development data is 77.39%

In [None]:
random.seed(1)
np.random.seed(1)
torch.manual_seed(1)

## YOUR CODE HERE
hp = HyperParams(lstm_size = 50, # hidden units in lstm
                 hidden_size = 50, # hidden size of fully-connected layer
                 lstm_layers = 1, # layers in lstm
                 drop_out = 0.5, # dropout rate
                 num_epochs = 50, # number of epochs for SGD-based procedure
                 batch_size = 1024, # examples in a minibatch
                 seq_max_len = 60) # maximum length of an example sequence
use_average = True
bidirectional = True

# Train RNN model.
model1 = train_model(hp, train_exs, dev_exs, test_exs, word_vectors, use_average, bidirectional)

# Generate RNN model predictions for test set.
embeddings_vec = np.array(word_vectors.vectors).astype(float)
test_exs_predicted = eval_model(model1, test_exs, embeddings_vec, hp.seq_max_len, pred_only = True)

# Write the test set output
for i, ex in enumerate(test_exs):
    ex.label = int(test_exs_predicted[i])
write_sentiment_examples(test_exs, test_output_path, word_indexer)

print("Prediction written to file for Rotten Tomatoes dataset.")

# Experimental evaluations on the IMDB dataset.

Run the same 4 evaluations on the IMDB dataset.

In [None]:
train_path = '../data/imdb/train.txt'
dev_path = '../data/imdb/dev.txt'
test_path = '../data/imdb/test.txt'

test_output_path = 'test-imdb.output.txt'

## YOUR CODE HERE
word_vecs_path = '../data/glove.6B.300d-relativized.txt'

word_vectors = read_word_embeddings(word_vecs_path)
word_indexer = word_vectors.word_indexer

train_exs = read_and_index_sentiment_examples(train_path, word_indexer)
dev_exs = read_and_index_sentiment_examples(dev_path, word_indexer)
test_exs = read_and_index_sentiment_examples(test_path, word_indexer)

print(repr(len(train_exs)) + " / " + 
      repr(len(dev_exs)) + " / " + 
      repr(len(test_exs)) + " train / dev / test examples")


In [None]:
random.seed(1)
np.random.seed(1)
torch.manual_seed(1)
## Using output state of only 1 LSTM
hp = HyperParams(lstm_size = 50, # hidden units in lstm
                 hidden_size = 50, # hidden size of fully-connected layer
                 lstm_layers = 1, # layers in lstm
                 drop_out = 0.5, # dropout rate
                 num_epochs = 50, # number of epochs for SGD-based procedure
                 batch_size = 1024, # examples in a minibatch
                 seq_max_len = 60) # maximum length of an example sequence
use_average = False
bidirectional = False

# Train RNN model.
model1 = train_model(hp, train_exs, dev_exs, test_exs, word_vectors, use_average, bidirectional)

# Generate RNN model predictions for test set.
embeddings_vec = np.array(word_vectors.vectors).astype(float)
test_exs_predicted = eval_model(model1, test_exs, embeddings_vec, hp.seq_max_len, pred_only = True)

# Write the test set output
for i, ex in enumerate(test_exs):
    ex.label = int(test_exs_predicted[i])
write_sentiment_examples(test_exs, test_output_path, word_indexer)

print("Prediction written to file for IMDb dataset.")

In [None]:
random.seed(1)
np.random.seed(1)
torch.manual_seed(1)

## YOUR CODE HERE
## Using output states of 1 LSTM with averaging
hp = HyperParams(lstm_size = 50, # hidden units in lstm
                 hidden_size = 50, # hidden size of fully-connected layer
                 lstm_layers = 1, # layers in lstm
                 drop_out = 0.5, # dropout rate
                 num_epochs = 50, # number of epochs for SGD-based procedure
                 batch_size = 1024, # examples in a minibatch
                 seq_max_len = 60) # maximum length of an example sequence
use_average = True
bidirectional = False

# Train RNN model.
model1 = train_model(hp, train_exs, dev_exs, test_exs, word_vectors, use_average, bidirectional)

# Generate RNN model predictions for test set.
embeddings_vec = np.array(word_vectors.vectors).astype(float)
test_exs_predicted = eval_model(model1, test_exs, embeddings_vec, hp.seq_max_len, pred_only = True)

# Write the test set output
for i, ex in enumerate(test_exs):
    ex.label = int(test_exs_predicted[i])
write_sentiment_examples(test_exs, test_output_path, word_indexer)

print("Prediction written to file for IMDb dataset.")

In [None]:
random.seed(1)
np.random.seed(1)
torch.manual_seed(1)

## YOUR CODE HERE
## Using bidirectional LSTM without averaging state output
hp = HyperParams(lstm_size = 50, # hidden units in lstm
                 hidden_size = 50, # hidden size of fully-connected layer
                 lstm_layers = 1, # layers in lstm
                 drop_out = 0.5, # dropout rate
                 num_epochs = 50, # number of epochs for SGD-based procedure
                 batch_size = 1024, # examples in a minibatch
                 seq_max_len = 60) # maximum length of an example sequence
use_average = False
bidirectional = True

# Train RNN model.
model1 = train_model(hp, train_exs, dev_exs, test_exs, word_vectors, use_average, bidirectional)

# Generate RNN model predictions for test set.
embeddings_vec = np.array(word_vectors.vectors).astype(float)
test_exs_predicted = eval_model(model1, test_exs, embeddings_vec, hp.seq_max_len, pred_only = True)

# Write the test set output
for i, ex in enumerate(test_exs):
    ex.label = int(test_exs_predicted[i])
write_sentiment_examples(test_exs, test_output_path, word_indexer)

print("Prediction written to file for IMDb dataset.")

In [None]:
random.seed(1)
np.random.seed(1)
torch.manual_seed(1)

## YOUR CODE HERE
## Using bidirectional LSTM with averaging state output
hp = HyperParams(lstm_size = 50, # hidden units in lstm
                 hidden_size = 50, # hidden size of fully-connected layer
                 lstm_layers = 1, # layers in lstm
                 drop_out = 0.5, # dropout rate
                 num_epochs = 50, # number of epochs for SGD-based procedure
                 batch_size = 1024, # examples in a minibatch
                 seq_max_len = 60) # maximum length of an example sequence
use_average = True
bidirectional = True

# Train RNN model.
model1 = train_model(hp, train_exs, dev_exs, test_exs, word_vectors, use_average, bidirectional)

# Generate RNN model predictions for test set.
embeddings_vec = np.array(word_vectors.vectors).astype(float)
test_exs_predicted = eval_model(model1, test_exs, embeddings_vec, hp.seq_max_len, pred_only = True)

# Write the test set output
for i, ex in enumerate(test_exs):
    ex.label = int(test_exs_predicted[i])
write_sentiment_examples(test_exs, test_output_path, word_indexer)

print("Prediction written to file for IMDb dataset.")

## Cross-domain performance

Compare the performance of the Bidirectional LSTM with state averaging on the IMDB test set in two scenarios:

1. The model is trained on the IMDB training data.

2. The model is trained on the Rotten Tomatoes data.

## Bonus points ##
Anything extra goes here. For example:
- Train and evaluate each model 10 times, from different random initializations (different seeds). Average the accuracy over the 10 runs and compare the performance of the 4 models on the Rotten Tomatoes dataset.
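
The multi-seed experiment can be organized as in the sketch below. `average_over_seeds` and `fake_run` are hypothetical names, and `fake_run` only simulates an accuracy so that the example runs without training; in the notebook it would be replaced by a function that seeds all generators, trains a model, and returns its development accuracy.

```python
import random
import statistics


def average_over_seeds(run_fn, seeds):
    """Call `run_fn(seed)` for each seed and return (mean, std, all accuracies)."""
    accs = [run_fn(seed) for seed in seeds]
    std = statistics.stdev(accs) if len(accs) > 1 else 0.0
    return statistics.mean(accs), std, accs


def fake_run(seed):
    # In the notebook this would call random.seed(seed), np.random.seed(seed)
    # and torch.manual_seed(seed), then train_model and eval_model on the dev set.
    rng = random.Random(seed)
    return 0.76 + rng.uniform(-0.01, 0.01)  # simulated accuracy, not a real result


mean_acc, std_acc, accs = average_over_seeds(fake_run, seeds=range(1, 11))
print("dev accuracy: %.4f +/- %.4f over %d runs" % (mean_acc, std_acc, len(accs)))
```

Reporting the standard deviation alongside the mean helps judge whether differences between the 4 models exceed the run-to-run variation.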

## Analysis ##

Include an analysis of the results that you obtained in the experiments above. Also compare with the sentiment classification performance from previous assignments. Show the results using table(s).
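
One way to produce such a table is to generate Markdown from the collected numbers and paste it into a markdown cell. The helper `markdown_table` below is a hypothetical sketch; the accuracies it prints are the reference development accuracies quoted earlier in this notebook for Rotten Tomatoes and should be replaced with the results of your own runs.

```python
def markdown_table(headers, rows):
    """Render rows as a Markdown table suitable for a notebook markdown cell."""
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


# Reference development accuracies quoted in this notebook (Rotten Tomatoes).
rows = [
    ("One LSTM, last state", "75.98"),
    ("One LSTM, average of states", "77.67"),
    ("Bidirectional, last states", "76.83"),
    ("Bidirectional, averaged states", "77.39"),
]
table = markdown_table(["Model", "Dev accuracy (%)"], rows)
print(table)
```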