<a href="https://colab.research.google.com/github/purvasingh96/Deep-learning-with-neural-networks/blob/master/Deep-learning-with-pytorch/3.%20Recurrent%20Neural%20Networks/Sentiment_analysis_via_RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis with an RNN
In this notebook I implement an RNN that performs sentiment analysis. <br>
An RNN is used instead of a strictly feedforward network because it can also take the *sequence* of words into account.

### Network Architecture
The architecture of the sentiment analysis model is shown below - <br>
<img src="https://github.com/purvasingh96/Deep-learning-with-neural-networks/blob/master/Deep-learning-with-pytorch/3.%20Recurrent%20Neural%20Networks/images/network_diagram.png?raw=1"></img>

**Notes -**
1. One-hot encoded vectors are an inefficient way to represent a large vocabulary, so an *embedding layer is used for dimensionality reduction.*
2. The embeddings are passed to LSTM cells, whose recurrent connections let the network *use information about the sequence of words.*
3. The final LSTM outputs go to a *sigmoid output layer.*
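
To make note 1 concrete, here is a minimal sketch in plain Python (no PyTorch). The 4-word vocabulary and the 2-dimensional table are made up for illustration. It checks that multiplying a one-hot vector by an embedding table gives the same result as reading a single row of the table, which is why a lookup can replace the much larger one-hot representation.

```python
# Hypothetical toy embedding table: 4 indices, 2 dimensions each.
embedding_table = [
    [0.0, 0.0],  # index 0
    [0.1, 0.9],  # index 1
    [0.7, 0.3],  # index 2
    [0.5, 0.5],  # index 3
]

def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def one_hot_times_table(index, table):
    """Multiply a one-hot row vector by the table (vector-matrix product)."""
    vec = one_hot(index, len(table))
    n_cols = len(table[0])
    return [sum(vec[r] * table[r][c] for r in range(len(table)))
            for c in range(n_cols)]

def lookup(index, table):
    """What an embedding layer does conceptually: read one row."""
    return table[index]

word_index = 2
same = one_hot_times_table(word_index, embedding_table) == lookup(word_index, embedding_table)
print(same)
```

The lookup touches one row, while the one-hot product touches every row of the table. That cost difference grows with the vocabulary size.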

### Load in and visualize the data

In [0]:
import numpy as np

with open('labels.txt', 'r') as f:
  labels = f.read()
with open('reviews.txt', 'r') as f:
  reviews = f.read()

In [63]:
print(reviews[:200])

bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  


## Data pre-processing
### Getting rid of punctuation
1. Lowercase the text and remove all punctuation marks.
2. Reviews are delimited by `\n`. Use `\n` as the delimiter to split the text into individual reviews.
3. Join the reviews from step 2 back into one big string, then split that string into words.

In [64]:
from string import punctuation
'''
bromwell high is a cartoon comedy  \n
it ran at the same time as some other programs about school life  such as  teachers  \n
 my   years in the teaching profession lead me to believe that bromwell high  
'''
reviews = reviews.lower()
all_text = ''.join([c for c in reviews if c not in punctuation])

reviews_split = all_text.split('\n')
all_text = ' '.join(reviews_split)
words = all_text.split()
print(words[:10])

['bromwell', 'high', 'is', 'a', 'cartoon', 'comedy', 'it', 'ran', 'at', 'the']
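
The character-by-character filter above works. A common alternative is `str.translate` with a deletion table, sketched here on a made-up two-review string (`toy_reviews` is hypothetical, not part of the dataset). It is a swapped-in technique, not what the notebook itself uses.

```python
from string import punctuation

# Hypothetical toy corpus: two reviews separated by a newline.
toy_reviews = "Great movie, loved it!\nBad plot... boring."

# Translation table that deletes every punctuation character.
strip_punct = str.maketrans('', '', punctuation)
toy_text = toy_reviews.lower().translate(strip_punct)

toy_reviews_split = toy_text.split('\n')          # one string per review
toy_words = ' '.join(toy_reviews_split).split()   # flat list of all words

print(toy_reviews_split)
print(toy_words)
```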


### Encoding reviews
Create an array containing an integer-encoded version of the words in the reviews. The more frequent a word, the smaller its integer; for example, if *the* is the most frequent word, assign *'the' : 1*. Integers start at 1 so that 0 can be reserved for padding.

In [0]:
from collections import Counter

counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word:ii for ii, word in enumerate(vocab, 1)}
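
As a quick check of the frequency-based encoding, the sketch below applies the same three lines to a made-up word list (`toy_words` is hypothetical). The most frequent word gets 1, and no word is mapped to 0.

```python
from collections import Counter

toy_words = ['the', 'movie', 'was', 'the', 'best', 'movie', 'the']

toy_counts = Counter(toy_words)  # the: 3, movie: 2, was: 1, best: 1
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)
toy_vocab_to_int = {word: ii for ii, word in enumerate(toy_vocab, 1)}

print(toy_vocab_to_int['the'], toy_vocab_to_int['movie'])
```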


### Encoding labels
Each review is first converted into a list of integers using `vocab_to_int`. The labels are then encoded as 1 for a positive review and 0 for a negative one.

In [66]:
reviews_int = []
'''
reviews_split contains multiple reviews 
reviews_int will be 2-D array
'''
for review in reviews_split:
  reviews_int.append([vocab_to_int[word] for word in review.split()])
print(len(vocab_to_int))
print(reviews_int[:10])

74072
[[21025, 308, 6, 3, 1050, 207, 8, 2138, 32, 1, 171, 57, 15, 49, 81, 5785, 44, 382, 110, 140, 15, 5194, 60, 154, 9, 1, 4975, 5852, 475, 71, 5, 260, 12, 21025, 308, 13, 1978, 6, 74, 2395, 5, 613, 73, 6, 5194, 1, 24103, 5, 1983, 10166, 1, 5786, 1499, 36, 51, 66, 204, 145, 67, 1199, 5194, 19869, 1, 37442, 4, 1, 221, 883, 31, 2988, 71, 4, 1, 5787, 10, 686, 2, 67, 1499, 54, 10, 216, 1, 383, 9, 62, 3, 1406, 3686, 783, 5, 3483, 180, 1, 382, 10, 1212, 13583, 32, 308, 3, 349, 341, 2913, 10, 143, 127, 5, 7690, 30, 4, 129, 5194, 1406, 2326, 5, 21025, 308, 10, 528, 12, 109, 1448, 4, 60, 543, 102, 12, 21025, 308, 6, 227, 4146, 48, 3, 2211, 12, 8, 215, 23], [63, 4, 3, 125, 36, 47, 7472, 1395, 16, 3, 4181, 505, 45, 17, 3, 622, 134, 12, 6, 3, 1279, 457, 4, 1721, 207, 3, 10624, 7373, 300, 6, 667, 83, 35, 2116, 1086, 2989, 34, 1, 898, 46417, 4, 8, 13, 5096, 464, 8, 2656, 1721, 1, 221, 57, 17, 58, 794, 1297, 832, 228, 8, 43, 98, 123, 1469, 59, 147, 38, 1, 963, 142, 29, 667, 123, 1, 13584, 410, 61, 9

In [0]:
labels_split = labels.split('\n')
labels_to_int = np.array([1 if label=='positive' else 0 for label in labels_split])

In [68]:
review_lens = Counter([len(x) for x in reviews_int])  # review length -> number of reviews
print(max(review_lens))  # length of the longest review

2514


### Removing Outliers
This step involves - 
1. Getting rid of extremely long/short reviews
2. Padding/truncating the remaining data to maintain a constant review length.

<img src="https://github.com/purvasingh96/Deep-learning-with-neural-networks/blob/master/Deep-learning-with-pytorch/3.%20Recurrent%20Neural%20Networks/images/outliers_padding_ex.png?raw=1"></img>

In [0]:
non_zero_idx = [ii for ii, review in enumerate(reviews_int) if len(review)!=0]
reviews_int = [reviews_int[ii] for ii in non_zero_idx]
encoded_labels = np.array([labels_to_int[ii] for ii in non_zero_idx])

In [0]:
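def pad_features(reviews_int, seq_length):
  # Left-pad each review with 0s (or truncate it) to exactly seq_length tokens.
  features = np.zeros((len(reviews_int), seq_length), dtype=int)
  for i, row in enumerate(reviews_int):
    if len(row) == 0:
      continue  # nothing to copy; the row stays all zeros
    features[i, -len(row):] = np.array(row)[:seq_length]
  
  return features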
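
The padding rule in `pad_features` can be checked on a single row. This is a minimal sketch with plain lists; `pad_or_truncate` and the two sample reviews are hypothetical names introduced only for illustration. Short reviews get zeros on the left, and long reviews keep only their first `seq_length` tokens.

```python
def pad_or_truncate(row, seq_length):
    """Left-pad `row` with zeros, or keep only its first `seq_length` items."""
    row = row[:seq_length]
    return [0] * (seq_length - len(row)) + row

short_review = [117, 18, 128]
long_review = [1, 2, 3, 4, 5, 6, 7]

print(pad_or_truncate(short_review, 5))
print(pad_or_truncate(long_review, 5))
```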

In [71]:
seq_length = 200
features = pad_features(reviews_int, seq_length)
print(features[:30, :10])

[[    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [22382    42 46418    15   706 17139  3389    47    77    35]
 [ 4505   505    15     3  3342   162  8312  1652     6  4819]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [   54    10    14   116    60   798   552    71   364     5]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    1   330   578    34     3   162   748  2731     9   325]
 [    9    11 10171  5305  1946   689   444    22   280   673]
 [    0     0     0     0     0     0     0     0     0

# Training, Testing and Validating 

In [0]:
split_frac = 0.8

split_idx = int(len(features)*split_frac)

train_x, remaining_x = features[:split_idx], features[split_idx:]
train_y, remaining_y = encoded_labels[:split_idx], encoded_labels[split_idx:]

test_idx = int(len(remaining_x)*0.5)
val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:]
val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]
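
The split above keeps 80% of the data for training and divides the rest equally between validation and test. The sketch below reproduces only the index arithmetic. `split_sizes` is a hypothetical helper, and the sample count of 25000 is an assumption for illustration, not a figure stated in this notebook.

```python
def split_sizes(n_samples, split_frac=0.8):
    """Return (train, validation, test) sizes for an 80/10/10-style split."""
    train = int(n_samples * split_frac)
    remaining = n_samples - train
    val = int(remaining * 0.5)
    test = remaining - val
    return train, val, test

print(split_sizes(25000))
```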

# DataLoaders and Batching

A neat way to create data-loaders and batch our training, validation and test Tensor datasets is as follows -<br>
```python
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
train_loader = DataLoader(train_data, batch_size=batch_size)

```
This is an alternative to creating a generator function for batching our data into full batches.
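
For comparison, a generator-based batcher could look roughly like the sketch below. `get_batches` is a hypothetical helper, not code from this notebook. It drops the final partial batch so that every batch has exactly `batch_size` items.

```python
def get_batches(x, y, batch_size):
    """Yield (x_batch, y_batch) pairs, dropping any final partial batch."""
    n_full = len(x) // batch_size
    for start in range(0, n_full * batch_size, batch_size):
        yield x[start:start + batch_size], y[start:start + batch_size]

toy_x = list(range(7))
toy_y = list(range(7))
batches = list(get_batches(toy_x, toy_y, 3))
print(len(batches))
```

`DataLoader` keeps a smaller last batch by default. Passing `drop_last=True` gives the same full-batch behavior as the generator, which matters when the hidden state is created for a fixed `batch_size`.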

In [0]:
import torch
from torch.utils.data import TensorDataset, DataLoader

train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))
valid_data = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))

batch_size=50

train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)

In [74]:

# First checking if GPU is available
import torch
train_on_gpu=torch.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')

Training on GPU.


# Sentiment Network with PyTorch

The RNN that performs sentiment analysis consists of the following layers - 

1. An embedding layer that converts our word tokens (integers) into embeddings of a specific size.
2. An LSTM layer defined by a hidden_state size and number of layers.
3. A fully-connected output layer that maps the LSTM layer outputs to a desired output_size.
4. A sigmoid activation layer that squashes every output to a value between 0 and 1; only the last sigmoid output of each sequence is returned as the network's output.
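
Step 4 is the least obvious part: the network produces one sigmoid value per time step, and only the final one per review is kept. The sketch below mimics that regrouping with plain lists instead of tensors. `last_step_outputs` and the sample values are hypothetical, and it assumes the flat outputs are laid out one review after another.

```python
def last_step_outputs(flat_outputs, batch_size):
    """Regroup a flat list of per-step outputs into one row per sequence
    and return the final value of every row."""
    seq_len = len(flat_outputs) // batch_size
    rows = [flat_outputs[i * seq_len:(i + 1) * seq_len] for i in range(batch_size)]
    return [row[-1] for row in rows]

# Hypothetical sigmoid outputs for 2 reviews of 3 time steps each,
# flattened review by review.
flat = [0.2, 0.4, 0.9, 0.6, 0.3, 0.1]
print(last_step_outputs(flat, batch_size=2))
```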

In [0]:
import torch.nn as nn

class SentimentRNN(nn.Module):
  def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
    super(SentimentRNN, self).__init__()

    self.output_size = output_size
    self.n_layers = n_layers
    self.hidden_dim = hidden_dim

    self.embedding = nn.Embedding(vocab_size, embedding_dim)
    self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=drop_prob, batch_first=True)

    self.dropout = nn.Dropout(0.3)
    self.fc = nn.Linear(hidden_dim, output_size)
    self.sig = nn.Sigmoid()

  def forward(self, x, hidden):
    batch_size = x.size(0)
    x = x.long()
    embeds = self.embedding(x)
    lstm_out, hidden = self.lstm(embeds, hidden)
    lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
    out = self.dropout(lstm_out)
    out = self.fc(out)
    sig_out = self.sig(out)

    sig_out = sig_out.view(batch_size, -1)
    sig_out = sig_out[:, -1]

    return sig_out, hidden
  
  def init_hidden(self, batch_size):
    weight = next(self.parameters()).data
    if(train_on_gpu):
      hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(), 
                weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
    else:
      hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(), 
                weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
      
    return hidden



# Instantiate the network
Here, I will define the model hyper-parameters - 
1. `vocab_size`: Size of our vocabulary or the range of values for our input, word tokens.
2. `output_size`: Size of our desired output; the number of class scores we want to output (pos/neg).
3. `embedding_dim`: Number of columns in the embedding lookup table; size of our embeddings.
4. `hidden_dim`: Number of units in the hidden layers of our LSTM cells. Larger values usually improve performance, at the cost of more computation. Common values are 128, 256, 512, etc.
5. `n_layers`: Number of LSTM layers in the network. Typically between 1 and 3.

In [76]:
# Instantiate the model w/ hyperparams
vocab_size = len(vocab_to_int)+1 # +1 for the 0 padding + our word tokens
output_size = 1
embedding_dim = 400
hidden_dim = 256
n_layers = 2

net = SentimentRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)

print(net)

SentimentRNN(
  (embedding): Embedding(74073, 400)
  (lstm): LSTM(400, 256, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sig): Sigmoid()
)
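
To get a feel for where the model's capacity sits, the sketch below counts trainable parameters from the layer sizes printed above. `lstm_param_count` is a hypothetical helper. It assumes the usual PyTorch LSTM layout: per layer, an input-to-hidden and a hidden-to-hidden weight matrix plus two bias vectors, each covering the four gates.

```python
def lstm_param_count(input_size, hidden_size, num_layers):
    """Parameters of a stacked, unidirectional LSTM with biases.
    Per layer: W_ih (4H x in), W_hh (4H x H), b_ih (4H), b_hh (4H)."""
    total = 0
    for layer in range(num_layers):
        layer_input = input_size if layer == 0 else hidden_size
        total += 4 * hidden_size * (layer_input + hidden_size) + 2 * 4 * hidden_size
    return total

embedding_params = 74073 * 400            # one 400-d vector per vocabulary index
lstm_params = lstm_param_count(400, 256, 2)
fc_params = 256 * 1 + 1                   # weights plus bias
total_params = embedding_params + lstm_params + fc_params

print(embedding_params, lstm_params, fc_params, total_params)
```

Under these assumptions the embedding table accounts for the vast majority of the parameters, so `vocab_size` and `embedding_dim` dominate the model size.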


In [0]:
# loss and optimization functions
lr=0.001

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
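
The `nn.BCELoss` criterion computes binary cross-entropy between the sigmoid outputs and the 0/1 labels. This is a minimal plain-Python sketch of the mean-reduced formula; `binary_cross_entropy` is a hypothetical helper, and it leaves out any numerical safeguards a library implementation may apply.

```python
import math

def binary_cross_entropy(probs, targets):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for p, y in zip(probs, targets)]
    return sum(losses) / len(losses)

# A model that always outputs 0.5 scores ln(2), whatever the labels are.
chance_loss = binary_cross_entropy([0.5, 0.5], [1, 0])
print(round(chance_loss, 3))
```

The value ln(2) ≈ 0.693 is a useful reference point: a loss close to it suggests the model is doing little better than guessing.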

In [78]:
epochs = 4
counter = 0 
print_every = 100
clip = 5
if(train_on_gpu):
  net.cuda()

net.train()
for e in range(epochs):
  h = net.init_hidden(batch_size)
  for inputs, labels in train_loader:
    counter += 1
    if(train_on_gpu):
      inputs, labels = inputs.cuda(), labels.cuda()
    h = tuple([each.data for each in h])
    net.zero_grad()
    output, h = net(inputs, h)
    loss = criterion(output.squeeze(), labels.float())
    loss.backward()
    # clip_grad_norm_ (trailing underscore) is the in-place, non-deprecated version
    nn.utils.clip_grad_norm_(net.parameters(), clip)
    optimizer.step()

    if counter % print_every == 0:
      val_h = net.init_hidden(batch_size)
      val_losses = []
      net.eval()
      for inputs, labels in valid_loader:
        val_h = tuple([each.data for each in val_h])
        if(train_on_gpu):
          inputs, labels = inputs.cuda(), labels.cuda()
        output, val_h = net(inputs, val_h)
        val_loss = criterion(output.squeeze(), labels.float())
        val_losses.append(val_loss.item())
      net.train()
      print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))





Epoch: 1/4... Step: 100... Loss: 0.652403... Val Loss: 0.654631
Epoch: 1/4... Step: 200... Loss: 0.675539... Val Loss: 0.696417
Epoch: 1/4... Step: 300... Loss: 0.711188... Val Loss: 0.694983
Epoch: 1/4... Step: 400... Loss: 0.689480... Val Loss: 0.692555
Epoch: 2/4... Step: 500... Loss: 0.693494... Val Loss: 0.688481
Epoch: 2/4... Step: 600... Loss: 0.650585... Val Loss: 0.697146
Epoch: 2/4... Step: 700... Loss: 0.494658... Val Loss: 0.695837
Epoch: 2/4... Step: 800... Loss: 0.663788... Val Loss: 0.710846
Epoch: 3/4... Step: 900... Loss: 0.589776... Val Loss: 0.557277
Epoch: 3/4... Step: 1000... Loss: 0.548341... Val Loss: 0.523456
Epoch: 3/4... Step: 1100... Loss: 0.464825... Val Loss: 0.521520
Epoch: 3/4... Step: 1200... Loss: 0.471691... Val Loss: 0.476805
Epoch: 4/4... Step: 1300... Loss: 0.318779... Val Loss: 0.522014
Epoch: 4/4... Step: 1400... Loss: 0.400367... Val Loss: 0.463238
Epoch: 4/4... Step: 1500... Loss: 0.407411... Val Loss: 0.508995
Epoch: 4/4... Step: 1600... Loss: 
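
The training loop clips gradients by their overall norm (`clip = 5`) to guard against exploding gradients in the LSTM. The idea can be sketched on a flat list of gradient values. `clip_by_norm` is a hypothetical helper that shows the principle and is not a reimplementation of the PyTorch utility.

```python
import math

def clip_by_norm(grads, max_norm):
    """Scale `grads` so that their overall L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)          # already small enough, leave unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads]

print(clip_by_norm([3.0, 4.0], 5))   # norm 5, unchanged
print(clip_by_norm([6.0, 8.0], 5))   # norm 10, scaled down by half
```

Clipping by norm rescales every gradient by the same factor, so the update direction is preserved and only its length shrinks.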

In [79]:
test_losses = []
num_correct = 0

# np.squeeze() removes single-dimensional entries from an array's shape

h = net.init_hidden(50)
net.eval()

for inputs, labels in test_loader:
  h = tuple([each.data for each in h])
  if(train_on_gpu):
    inputs, labels = inputs.cuda(), labels.cuda()
  
  output, h = net(inputs, h)
  test_loss = criterion(output.squeeze(), labels.float())
  test_losses.append(test_loss.item())
  pred = torch.round(output.squeeze())
  correct_tensor = pred.eq(labels.float().view_as(pred))
  correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
  num_correct += np.sum(correct)

# -- stats! -- ##
# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))

# accuracy over all test data
test_acc = num_correct/len(test_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))


Test loss: 0.471
Test accuracy: 0.796
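
The accuracy figure comes from rounding each sigmoid output to 0 or 1 and comparing it with the label. The sketch below shows the same calculation with plain lists. `accuracy` and the sample values are hypothetical, and it uses an explicit 0.5 threshold, which can differ from `torch.round` only for outputs that are exactly 0.5.

```python
def accuracy(probs, labels, threshold=0.5):
    """Fraction of predictions that match the labels after thresholding."""
    preds = [1 if p >= threshold else 0 for p in probs]
    num_correct = sum(int(p == y) for p, y in zip(preds, labels))
    return num_correct / len(labels)

print(accuracy([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 0]))
```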


In [0]:
from string import punctuation

def tokenize_movie_review(test_review):
  test_review = test_review.lower()
  test_text = ''.join([c for c in test_review if c not in punctuation])
  test_words = test_text.split()
  test_ints = []
  # skip words that are not in the training vocabulary to avoid a KeyError
  test_ints.append([vocab_to_int[word] for word in test_words if word in vocab_to_int])
  return test_ints

In [54]:
test_review_neg = "It was a very bad movie. Terrible acting."
tokenized_review = tokenize_movie_review(test_review_neg)
print(tokenize_movie_review(test_review_neg))

[[8, 14, 3, 55, 76, 18, 388, 113]]


In [81]:
seq_length = 200
features = pad_features(tokenized_review, seq_length)
print(features)

[[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   8  14   3  55  76  18
  388 113]]


In [82]:
feature_tensor = torch.from_numpy(features)
print(feature_tensor.size(0))

1


In [0]:
def predict(net, test_review, seq_length=200):
  net.eval()
  test_ints = tokenize_movie_review(test_review)
  seq_length=seq_length
  features = pad_features(test_ints, seq_length)
  feature_tensor = torch.from_numpy(features)
  batch_size = feature_tensor.size(0)
  h = net.init_hidden(batch_size)
  if(train_on_gpu):
    feature_tensor=feature_tensor.cuda()
  output, h = net(feature_tensor, h)
  pred = torch.round(output.squeeze())
  print('Prediction value, pre-rounding: {:.6f}'.format(output.item()))
  if(pred.item()==1):
    print("Positive")
  else:
    print("Negative")

In [85]:

# a longer negative test review (defined here, but not used in the call below)
test_review_neg_long = 'It was a very bad movie. Terrible acting. I will not recommend it at all. Full money waste.'
seq_length=200 # good to use the length that was trained on

predict(net, test_review_neg, seq_length)

Prediction value, pre-rounding: 0.040038
Negative
