# Sentiment Analysis with RNN

In this notebook, I build a recurrent neural network (RNN) with PyTorch to perform sentiment analysis on movie reviews.
The dataset is taken from IMDB reviews. Each review is accompanied by a sentiment label: positive or negative.
A simple feedforward network could also be used for sentiment analysis, but such a model only considers individual words when predicting the sentiment. An RNN can also use information about the sequence of the words, so the model considers not only the individual words but also the order in which they appear. This is expected to make the predictions more accurate.

### Model Architecture

I consider a model with the following architecture. First, I pass the words from the review to an embedding layer. The resulting embeddings are then passed to LSTM cells, which add recurrent connections to the network and let the model use information about the sequence of words in the movie review. Finally, I pass the LSTM outputs to a sigmoid output layer. I use a sigmoid because it outputs predicted sentiment values between 0 and 1.
 
### Data Loading

In [12]:
import numpy as np
from string import punctuation

with open('data/reviews.txt', 'r') as f:
    reviews = f.read()
with open('data/labels.txt', 'r') as f:
    labels = f.read()

In [13]:
print(reviews[:2000])
print()
print(labels[:20])

bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   
story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turn

# Data Preprocessing

First, I put the data into the proper form. I clean it up by converting the text to lowercase and removing periods and other punctuation.

Since the reviews are delimited with newline characters `\n`, I can split the text into individual reviews using `\n` as the delimiter.

I then join the reviews back together and split again to collect all the individual words used in the reviews.

In [14]:
from collections import Counter
 
reviews = reviews.lower()  
all_text = ''.join([c for c in reviews if c not in punctuation])

# split reviews
reviews_split = all_text.split('\n')

# collect words used in the reviews
all_text = ' '.join(reviews_split)
words = all_text.split()

In [15]:
words[:15]

['bromwell',
 'high',
 'is',
 'a',
 'cartoon',
 'comedy',
 'it',
 'ran',
 'at',
 'the',
 'same',
 'time',
 'as',
 'some',
 'other']
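
To make the cleaning steps easier to follow, here is a minimal sketch on a made-up two-review string (the text and the `toy_*` names are assumptions for illustration, not part of the dataset). It shows that the newline delimiter survives punctuation removal, because `\n` is not in `string.punctuation`.

```python
from string import punctuation

# Toy text standing in for reviews.txt: two reviews separated by a newline.
toy_text = "Great movie, loved it!\nTerrible plot; would not watch again."

toy_text = toy_text.lower()
# Drop every punctuation character; '\n' is not punctuation, so the
# boundary between the two reviews is preserved.
cleaned = ''.join(c for c in toy_text if c not in punctuation)

toy_reviews = cleaned.split('\n')          # one string per review
toy_words = ' '.join(toy_reviews).split()  # flat list of all words

print(toy_reviews)  # ['great movie loved it', 'terrible plot would not watch again']
print(len(toy_words))  # 10
```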

### Integer Encoding

The model uses an embedding layer, which requires integer inputs, so each word in the vocabulary must be encoded as an integer. Below, I build a dictionary that maps words to integers, starting at 1 so that 0 remains free for padding. I then convert the reviews to integers and store them in a new list called `reviews_ints`.

In [16]:
counts = Counter(words)
vocab = sorted(counts, key = counts.get, reverse=True)
vocab_to_int = {word: ii for ii, word in enumerate(vocab,1)}

# tokenize each review and store the result in reviews_ints
reviews_ints = []
for review in reviews_split:
    reviews_ints.append([vocab_to_int[word] for word in review.split()])

To test the dictionary, I print the contents of the first tokenized review.

In [17]:
# print tokens in first review
print('Tokenized review: \n', reviews_ints[:1])

Tokenized review: 
 [[21025, 308, 6, 3, 1050, 207, 8, 2138, 32, 1, 171, 57, 15, 49, 81, 5785, 44, 382, 110, 140, 15, 5194, 60, 154, 9, 1, 4975, 5852, 475, 71, 5, 260, 12, 21025, 308, 13, 1978, 6, 74, 2395, 5, 613, 73, 6, 5194, 1, 24103, 5, 1983, 10166, 1, 5786, 1499, 36, 51, 66, 204, 145, 67, 1199, 5194, 19869, 1, 37442, 4, 1, 221, 883, 31, 2988, 71, 4, 1, 5787, 10, 686, 2, 67, 1499, 54, 10, 216, 1, 383, 9, 62, 3, 1406, 3686, 783, 5, 3483, 180, 1, 382, 10, 1212, 13583, 32, 308, 3, 349, 341, 2913, 10, 143, 127, 5, 7690, 30, 4, 129, 5194, 1406, 2326, 5, 21025, 308, 10, 528, 12, 109, 1448, 4, 60, 543, 102, 12, 21025, 308, 6, 227, 4146, 48, 3, 2211, 12, 8, 215, 23]]
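
The same encoding can be traced by hand on a tiny corpus. The sketch below uses two made-up reviews (an assumption for illustration) and the same `Counter` plus `sorted` recipe, so that the most frequent word receives index 1 and no word is ever mapped to 0.

```python
from collections import Counter

# Two made-up reviews, already cleaned and lowercased.
toy_reviews = ["good movie good acting", "bad movie"]
toy_words = ' '.join(toy_reviews).split()

counts = Counter(toy_words)
# Most frequent word first; sorted() is stable, so ties keep first-seen order.
toy_vocab = sorted(counts, key=counts.get, reverse=True)
# Start numbering at 1 so that 0 stays available as the padding value.
toy_vocab_to_int = {word: ii for ii, word in enumerate(toy_vocab, 1)}

toy_ints = [[toy_vocab_to_int[w] for w in review.split()] for review in toy_reviews]

print(toy_vocab_to_int)  # {'good': 1, 'movie': 2, 'acting': 3, 'bad': 4}
print(toy_ints)          # [[1, 2, 1, 3], [4, 2]]
```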


I also convert the labels to the integers 0 and 1 (1 for positive, 0 for anything else) and store the encoded labels in the array `encoded_labels`.

In [18]:
all_labels = labels.split('\n') 
encoded_labels = np.array([1 if c == 'positive' else 0 for c in all_labels] ) 

### Outliers 

To make sure the reviews are in good shape for standard processing, I look for outliers.
Below, I check whether the data contains extremely long or extremely short reviews.


In [19]:
review_lens = Counter([len(x) for x in reviews_ints])
print("Number of reviews with 0 length: {}".format(review_lens[0]))
print("The length of the longest reviews: {}".format(max(review_lens)))

Number of reviews with 0 length: 1
The length of the longest reviews: 2514


Since there is an empty review, I simply remove it together with its label.


In [20]:
# remove empty reviews
index = [ii for ii, review in enumerate(reviews_ints) if len(review) != 0]
reviews_ints = [reviews_ints[ii] for ii in index]
encoded_labels = np.array([encoded_labels[ii] for ii in index])

### Data Padding and Truncating 
Note that the longest review is far too long for the model. To handle both short and very long reviews, I shape every review to a fixed length, which I define as `seq_length`. Reviews shorter than `seq_length` are padded with 0s at the beginning, e.g. if the review is `['best', 'movie', 'ever']`, or `[117, 18, 128]` as integers, the padded review looks like `[0, 0, 0, ..., 0, 117, 18, 128]`. Reviews longer than `seq_length` are simply truncated to the first `seq_length` words. A good `seq_length`, in this case, is 200.

The following function returns an array `features` that contains the padded data in a standard size. I will then pass this array to the model.

  

In [21]:
# pad reviews by adding 0s in the beginning so the length of each review is equal to seq_length
def pad_features(reviews_ints, seq_length): 
    features = np.zeros((len(reviews_ints), seq_length), dtype=int)
 
    for i, row in enumerate(reviews_ints):
        features[i, -len(row):] = np.array(row)[:seq_length] 
    
    return features

seq_length = 200

features = pad_features(reviews_ints, seq_length=seq_length)
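
To see both cases (padding and truncation) on small inputs, here is a pure-Python sketch with a short `seq_length`. The helper name `pad_or_truncate` is hypothetical and works on plain lists rather than NumPy arrays, but it follows the same rule as `pad_features`: left-pad with 0s, or keep the first `seq_length` tokens.

```python
def pad_or_truncate(tokens, seq_length):
    """Left-pad with 0s, or keep only the first seq_length tokens."""
    kept = tokens[:seq_length]
    return [0] * (seq_length - len(kept)) + kept

short_review = [117, 18, 128]     # shorter than seq_length: gets padded
long_review = list(range(1, 9))   # 8 tokens: gets truncated

print(pad_or_truncate(short_review, 5))  # [0, 0, 117, 18, 128]
print(pad_or_truncate(long_review, 5))   # [1, 2, 3, 4, 5]
```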

# Splitting Data into Training, Validation, and Testing Sets

Now the data is in good shape. I split it into training, validation, and test sets.
I keep 80% of the data for the training set and split the remainder in half to create the validation and test sets.

In [22]:
split_frac = 0.8

## split data into training, validation, and test data  
n = int(split_frac * len(features))
train_x, rem_x = features[:n], features[n:]
train_y, rem_y = encoded_labels[:n], encoded_labels[n:]

ns = int(0.5 * len(rem_x))
test_x, val_x = rem_x[:ns], rem_x[ns:]
test_y, val_y = rem_y[:ns], rem_y[ns:]

## print out the shapes of the data
print("Train set: \t\t{}".format(train_x.shape),
      "\nVal set: \t\t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))

Train set: 		(20000, 200) 
Val set: 		(2500, 200) 
Test set: 		(2500, 200)
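
The split arithmetic can be checked independently of the data. The sketch below (with a hypothetical helper `split_sizes`) mirrors the slicing above: the training set takes `int(split_frac * n)` rows, the test set takes the first half of the remainder, and the validation set takes whatever is left.

```python
def split_sizes(n_samples, split_frac=0.8):
    """Return (train, val, test) sizes using the same slicing rule as above."""
    n_train = int(split_frac * n_samples)
    n_rem = n_samples - n_train
    n_test = int(0.5 * n_rem)   # first half of the remainder
    n_val = n_rem - n_test      # second half (gets the extra row if odd)
    return n_train, n_val, n_test

print(split_sizes(25000))  # (20000, 2500, 2500)
```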


## DataLoaders and Batching

I create DataLoaders for this data. First, using [TensorDataset](https://pytorch.org/docs/stable/data.html#), I create a known format for accessing the data. A TensorDataset takes an input set of data and a target set of data with the same first dimension and creates a dataset from them.
I then create DataLoaders that batch the training, validation, and test Tensor datasets.
 

In [23]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# create Tensor datasets
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
valid_data = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))

# dataloaders
batch_size = 50

# shuffle the data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)
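
One detail worth checking here: the model's `init_hidden(batch_size)` (defined in the model section) builds a hidden state for a fixed batch size, while a DataLoader yields a smaller final batch when the dataset size is not a multiple of `batch_size`. The sketch below, with a hypothetical helper `batch_summary`, verifies that the split sizes printed above divide evenly by 50.

```python
def batch_summary(n_samples, batch_size):
    """Return (number of full batches, size of the leftover partial batch)."""
    return divmod(n_samples, batch_size)

for name, n in [("train", 20000), ("val", 2500), ("test", 2500)]:
    full, leftover = batch_summary(n, 50)
    print(name, full, leftover)
# train 400 0
# val 50 0
# test 50 0
```

With a different dataset size or batch size, passing `drop_last=True` to `DataLoader` would be one way to avoid a partial final batch.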

# Defining the Model with PyTorch


I consider a model that consists of an [embedding layer](https://pytorch.org/docs/stable/nn.html#embedding) that converts the word tokens (integers) into embeddings of a specific size, an [LSTM layer](https://pytorch.org/docs/stable/nn.html#lstm) defined by a hidden state size and a number of layers, a fully-connected output layer that maps the LSTM layer outputs to a desired `output_size`, and a sigmoid activation layer that turns all outputs into values between 0 and 1. Note that the model returns only the last sigmoid output.

### The Embedding Layer

An embedding layer is needed because there are more than 74000 words in the review vocabulary, and one-hot encoding that many classes is simply not efficient. Instead of one-hot encoding, I use an embedding layer as a lookup table. It is fine to simply create a new layer, since it is used only for dimensionality reduction and the model learns its weights.
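
The "lookup table" idea can be illustrated without PyTorch. In the sketch below, the table values and helper names are made up for illustration. It shows that multiplying a one-hot vector by the table gives the same result as reading one row, which is why the lookup avoids building a 74000-wide vector per word.

```python
# Hypothetical tiny embedding table: 4 tokens, 3 dimensions, made-up values.
embedding_table = [
    [0.9, 0.1, 0.5],
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def one_hot_times_table(index, table):
    """Multiply a one-hot row vector by the table (the expensive way)."""
    vec = one_hot(index, len(table))
    n_dims = len(table[0])
    return [sum(vec[r] * table[r][d] for r in range(len(table)))
            for d in range(n_dims)]

def lookup(index, table):
    """Embedding lookup: read the row for this token directly."""
    return table[index]

token = 2
print(one_hot_times_table(token, embedding_table) == lookup(token, embedding_table))  # True
```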


### The LSTM Layers

I create an LSTM to use in the model. It takes an `input_size`, a `hidden_dim`, a number of layers, a dropout probability (for dropout between multiple layers), and a `batch_first` parameter.

In [24]:

import torch.nn as nn
class SentimentRNN(nn.Module):
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        # setting up the layers
        super(SentimentRNN, self).__init__()
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=drop_prob, batch_first= True)        
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()

    def forward(self, x, hidden):
        # return last sigmoid output and hidden state
        embeds = self.embedding(x)
        r_output, hidden = self.lstm(embeds, hidden)
        r_output = r_output.contiguous().view(-1, self.hidden_dim)
        out = self.dropout(r_output)
        out = self.fc(out)
        sig_out = self.sig(out)
        
        batch_size = x.size(0)
        sig_out = sig_out.view(batch_size,-1)
        sig_out = sig_out[:,-1]
        
        return sig_out, hidden
    
    
    def init_hidden(self, batch_size): 
        # create two new tensors: n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
         
        return hidden
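
The last lines of `forward` are easy to misread, so here is a pure-Python sketch of what `sig_out.view(batch_size, -1)[:, -1]` selects, assuming `output_size` is 1 and the flattened outputs are in batch-first order. The helper name and the numbers are made up for illustration.

```python
def last_step_outputs(flat_outputs, batch_size):
    """Mimic sig_out.view(batch_size, -1)[:, -1] on a flat, batch-first list."""
    seq_len = len(flat_outputs) // batch_size
    # Regroup the flat list into one row of seq_len values per review.
    rows = [flat_outputs[i * seq_len:(i + 1) * seq_len] for i in range(batch_size)]
    # Keep only the value from the final time step of each review.
    return [row[-1] for row in rows]

flat = [0.1, 0.2, 0.9,   # review 0, time steps 0..2
        0.6, 0.4, 0.3]   # review 1, time steps 0..2
print(last_step_outputs(flat, batch_size=2))  # [0.9, 0.3]
```

Because reviews are padded on the left, the final time step always corresponds to a real word (or the truncation point), never to padding.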

# Instantiate the model

Before training the model, I need to instantiate it. I define the following hyperparameters:
`vocab_size`, the size of the vocabulary (plus one for the 0 padding token);
`output_size`, the size of the desired output, here a single value that scores the review as positive or negative;
`embedding_dim`, the number of columns in the embedding lookup table, i.e. the size of the embeddings;
`hidden_dim`, the number of units in the hidden layers of the LSTM cells; and
`n_layers`, the number of LSTM layers in the network.

I also need to define the learning rate, the loss function, and the optimizer.

In [25]:
vocab_size = len(vocab_to_int)+1
output_size = 1
embedding_dim = 400
hidden_dim = 200
n_layers = 2

net = SentimentRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)

In [26]:
# loss and optimization functions
lr=0.001

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)

# Training

Now the model is ready for training. To make the computation faster, I use a GPU if one is available.
 

For the loss I use Binary Cross Entropy Loss ([BCELoss](https://pytorch.org/docs/stable/nn.html#bceloss)), a kind of cross entropy loss designed to work with a single sigmoid output. It applies cross entropy loss to a single value between 0 and 1.
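
As a worked example, binary cross entropy for a prediction `p` and label `y` is `-[y*log(p) + (1-y)*log(1-p)]`, averaged over the batch. The sketch below is a plain-Python illustration of that formula, not PyTorch's implementation (it does not guard against `p` being exactly 0 or 1).

```python
import math

def binary_cross_entropy(probs, targets):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the batch."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for p, y in zip(probs, targets)]
    return sum(losses) / len(losses)

# A model that always predicts 0.5 scores ln(2) regardless of the labels.
print(round(binary_cross_entropy([0.5, 0.5], [1, 0]), 4))  # 0.6931
```

This gives a useful reference point: a loss near 0.69 means the model is doing no better than guessing 0.5 for every review.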

I train for 4 epochs, i.e. I iterate through the training dataset four times. To prevent exploding gradients, I set `clip` to 5, so the overall gradient norm is clipped to a maximum of 5.
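
Clipping by norm rescales the whole gradient when its L2 norm exceeds the threshold, rather than capping each value separately. The sketch below is a simplified pure-Python illustration on a flat list of gradient values. The helper name is hypothetical, and PyTorch's `clip_grad_norm_` operates on all parameter tensors and differs in small numerical details.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradient values so that their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm   # already small enough: unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([6.0, 8.0], max_norm=5)
print(norm_before)  # 10.0
print(clipped)      # [3.0, 4.0]
```

The direction of the gradient is preserved; only its length shrinks to the threshold.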

In [27]:
# First I check if GPU is available
train_on_gpu=torch.cuda.is_available()

# training params
epochs = 4 # 
counter = 0
print_every = 100
clip=5 # gradient clipping

# move model to GPU  if available
if(train_on_gpu):
    net.cuda()

net.train()
# train for some number of epochs
for e in range(epochs):
    # initialize hidden state
    h = net.init_hidden(batch_size)

    # batch loop
    for inputs, labels in train_loader:
        counter += 1

        if(train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()

        # creating new variables for the hidden state, otherwise
        # I would backprop through the entire training history
        h = tuple([each.data for each in h])

        # zero accumulated gradients
        net.zero_grad()

        # get the output from the model
        output, h = net(inputs, h)

        # calculate the loss and perform backprop
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs or LSTMs.
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()

        # loss stats
        if counter % print_every == 0:
            # validation loss
            val_h = net.init_hidden(batch_size)
            val_losses = []
            net.eval()
            for inputs, labels in valid_loader:

                # create new variables for the hidden state, otherwise
                # I would backprop through the entire training history
                val_h = tuple([each.data for each in val_h])

                if(train_on_gpu):
                    inputs, labels = inputs.cuda(), labels.cuda()

                output, val_h = net(inputs, val_h)
                val_loss = criterion(output.squeeze(), labels.float())

                val_losses.append(val_loss.item())

            net.train()
            print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))

Epoch: 1/4... Step: 100... Loss: 0.667548... Val Loss: 0.669033
Epoch: 1/4... Step: 200... Loss: 0.774982... Val Loss: 0.738713
Epoch: 1/4... Step: 300... Loss: 0.525721... Val Loss: 0.540313
Epoch: 1/4... Step: 400... Loss: 0.456010... Val Loss: 0.538258
Epoch: 2/4... Step: 500... Loss: 0.431597... Val Loss: 0.505338
Epoch: 2/4... Step: 600... Loss: 0.261225... Val Loss: 0.466534
Epoch: 2/4... Step: 700... Loss: 0.433863... Val Loss: 0.462178
Epoch: 2/4... Step: 800... Loss: 0.289795... Val Loss: 0.443104
Epoch: 3/4... Step: 900... Loss: 0.175254... Val Loss: 0.482436
Epoch: 3/4... Step: 1000... Loss: 0.384781... Val Loss: 0.459099
Epoch: 3/4... Step: 1100... Loss: 0.192630... Val Loss: 0.446066
Epoch: 3/4... Step: 1200... Loss: 0.419111... Val Loss: 0.532251
Epoch: 4/4... Step: 1300... Loss: 0.244968... Val Loss: 0.535243
Epoch: 4/4... Step: 1400... Loss: 0.254958... Val Loss: 0.445546
Epoch: 4/4... Step: 1500... Loss: 0.259188... Val Loss: 0.528085
Epoch: 4/4... Step: 1600... Loss: 

# Testing

Now I want to test the model and see how it performs on `test_data` after training. I check the average loss and the accuracy over the test data.

In [28]:
# Get test data loss and accuracy

test_losses = [] # track loss
num_correct = 0

# init hidden state
h = net.init_hidden(batch_size)

net.eval()
# iterate over test data
for inputs, labels in test_loader:

    # Creating new variables for the hidden state, otherwise
    # we'd backprop through the entire training history
    h = tuple([each.data for each in h])

    if(train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()
    
    # get predicted outputs
    output, h = net(inputs, h)
    
    # calculate loss
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze())  # rounds to the nearest integer
    
    # compare predictions to true label
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)


# -- stats! -- ##
# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))

# accuracy over all test data
test_acc = num_correct/len(test_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))

Test loss: 0.504
Test accuracy: 0.800
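
The accuracy computation reduces to rounding each probability to a class and counting matches. Here is a small pure-Python sketch with made-up probabilities and labels. For values between 0 and 1, the `p > 0.5` test gives the same class as `torch.round`, which sends an exact tie at 0.5 to 0.

```python
def accuracy(probs, labels):
    """Fraction of predictions whose rounded class matches the label."""
    preds = [1 if p > 0.5 else 0 for p in probs]
    num_correct = sum(int(pred == y) for pred, y in zip(preds, labels))
    return num_correct / len(labels)

toy_probs = [0.9, 0.2, 0.6, 0.4]   # made-up model outputs
toy_labels = [1, 0, 0, 0]          # made-up true labels
print(accuracy(toy_probs, toy_labels))  # 0.75
```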


# Inference

So the model has 80% accuracy on the test set. Now I want to input just one example review at a time, without a label, and see what sentiment the model predicts.

I write a `predict` function that takes a trained model, a plain-text review, and a sequence length, and prints whether a positive or negative review is detected.


In [38]:
def predict(net, test_review, sequence_length=200):  
    test_review = test_review.lower()
    test_text = ''.join([c for c in test_review if c not in punctuation])
    test_words = test_text.split()
    test_ints = []
    test_ints.append([vocab_to_int[word] for word in test_words])
    
    features = pad_features(test_ints, sequence_length)
    feature_tensor = torch.from_numpy(features)
    
    batch_size = feature_tensor.size(0)
    h = net.init_hidden(batch_size)
    if(train_on_gpu):
        feature_tensor = feature_tensor.cuda()
    net.eval()
    output, h = net(feature_tensor, h)
    pred = torch.round(output.squeeze()) 
    if pred.item() == 1 :
        print('POSITIVE review is detected!')
    else:
        print('NEGATIVE review is detected!')
    
    
        

In [39]:
# negative test review
test_review_neg = 'The worst movie I have seen; acting was terrible and I want my money back. This movie had bad acting and the dialogue was slow.'
# positive test review
test_review_pos = 'This movie had the best acting and the dialogue was so good. I loved it.'

In [40]:
# call function 
seq_length=200
predict(net, test_review_neg, seq_length) 

NEGATIVE review is detected!
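
One caveat for free-form input: `predict` looks up every word in `vocab_to_int`, so a word that never appeared in the training reviews raises a `KeyError`. A possible workaround, sketched below with a hypothetical helper `tokenize_review` and a toy vocabulary, is to skip unknown words during tokenization.

```python
from string import punctuation

def tokenize_review(text, vocab_to_int):
    """Lowercase, strip punctuation, and map known words to integers."""
    text = text.lower()
    text = ''.join(c for c in text if c not in punctuation)
    # Skip words the vocabulary has never seen instead of raising KeyError.
    return [vocab_to_int[w] for w in text.split() if w in vocab_to_int]

toy_vocab = {'the': 1, 'movie': 2, 'was': 3, 'good': 4}   # toy vocabulary
print(tokenize_review("The movie was GOOD, truly!", toy_vocab))  # [1, 2, 3, 4]
```

If every word in a review is unknown, the result is an empty list, which would need separate handling before padding.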
