<a href="https://colab.research.google.com/github/purvasingh96/Deep-learning-with-neural-networks/blob/master/Deep-learning-with-pytorch/3.%20Recurrent%20Neural%20Networks/Sentiment_analysis_via_RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis with an RNN
In this notebook I implement an RNN that performs sentiment analysis. <br>
An RNN is used instead of a strictly feedforward network because it can also take the *sequence* of words into account.

### Network Architecture
The architecture of the sentiment analysis model is shown below - <br>
<img src="https://github.com/purvasingh96/Deep-learning-with-neural-networks/blob/master/Deep-learning-with-pytorch/3.%20Recurrent%20Neural%20Networks/images/network_diagram.png?raw=1"></img>

**Notes -**
1. One-hot encoded vectors are an inefficient representation for a vocabulary of this size, so the model uses an *embedding layer for dimensionality reduction.*
2. The embeddings are passed to LSTM cells, which add recurrent connections and the ability to *include information about the sequence of words.*
3. The final LSTM outputs go to a *sigmoid output layer.*
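
To illustrate note 1, the sketch below shows that multiplying a one-hot vector by a weight matrix simply selects one row of that matrix, which is why an embedding layer can act as a lookup table. The sizes and the `embedding_weights` name are made up for illustration and are not the values used later in this notebook.

```python
import numpy as np

vocab_size, embedding_dim = 6, 3          # toy sizes, chosen for illustration only
rng = np.random.default_rng(0)
embedding_weights = rng.normal(size=(vocab_size, embedding_dim))

word_idx = 4                              # integer token for some word
one_hot = np.zeros(vocab_size)
one_hot[word_idx] = 1.0

via_matmul = one_hot @ embedding_weights  # one-hot vector times weight matrix
via_lookup = embedding_weights[word_idx]  # direct row lookup

print(np.allclose(via_matmul, via_lookup))
```

The lookup avoids building the one-hot vector at all, so the cost per word depends on the embedding size rather than the vocabulary size.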

### Load in and visualize the data

In [0]:
import numpy as np

In [0]:
with open('reviews.txt', 'r') as f:
  reviews = f.read()
with open('labels.txt', 'r') as f:
  labels = f.read()

In [6]:
print(reviews[:100])
print(labels[:100])

bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life
positive
negative
positive
negative
positive
negative
positive
negative
positive
negative
positive
n


## Data pre-processing
### Getting rid of punctuation
1. Remove punctuation marks.
2. Reviews are delimited by `\n`; split the text on `\n` to get the individual reviews.
3. Join the reviews from step 2 back into one big string.

In [0]:
from string import punctuation

# get rid of punctuation
reviews = reviews.lower()
all_text = ''.join([c for c in reviews if c not in punctuation])

# split into reviews on newlines, then re-join them with spaces
reviews_split = all_text.split('\n')
all_text = ' '.join(reviews_split)

# create a list of words
words = all_text.split()

In [8]:
words[:20]

['bromwell',
 'high',
 'is',
 'a',
 'cartoon',
 'comedy',
 'it',
 'ran',
 'at',
 'the',
 'same',
 'time',
 'as',
 'some',
 'other',
 'programs',
 'about',
 'school',
 'life',
 'such']
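
The same cleaning steps can be tried on a small made-up string. This sketch uses `str.translate` with a deletion table instead of the character-by-character list comprehension; it is an alternative, not the method used above, and the `sample` text is invented for the example.

```python
from string import punctuation

sample = "Bromwell High is a cartoon comedy. It ran at the same time!\nGreat, funny show."

# a translation table that deletes every punctuation character in one pass
table = str.maketrans('', '', punctuation)
cleaned = sample.lower().translate(table)

sample_reviews = cleaned.split('\n')             # one entry per review
sample_words = ' '.join(sample_reviews).split()  # flat list of words

print(sample_reviews)
print(sample_words[:5])
```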

### Encoding reviews
Create a list containing the integer-encoded version of the words in each review. The most frequent word gets the smallest integer: for example, if *the* appears most often, assign *'the' : 1*. Integers start at 1 so that 0 stays free for padding later.

In [9]:
from collections import Counter

counts = Counter(words)
'''
counts = Counter({'bromwell': 5,
                  'high': 742,
                  'is': 39879,
                  'a': 60733
                  })

vocabulary_to_int = {'the': 1,
                      'and': 2,
                      'a': 3,
                      'of': 4,
                      'to': 5,
                      'is': 6
                    }
'''
vocabulary = sorted(counts, key=counts.get, reverse=True)
vocabulary_to_int = {word:ii for ii, word in enumerate(vocabulary, 1)}
reviews_int = []
for review in reviews_split:
  reviews_int.append([vocabulary_to_int[word] for word in review.split()])
print(reviews_int[:1])


[[7428, 322, 6, 2, 1631, 192, 8, 1920, 33, 1, 168, 56, 16, 50, 84, 7429, 42, 540, 124, 141, 16, 2829, 55, 146, 10, 1, 5782, 5243, 426, 73, 5, 245, 12, 7428, 322, 13, 2276, 6, 74, 2585, 5, 720, 83, 6, 2829, 1, 14610, 5, 2200, 4757, 1, 4758, 1589, 35, 51, 68, 198, 143, 64, 1188, 2829, 14611, 1, 14612, 4, 1, 224, 650, 32, 2460, 73, 4, 1, 8769, 9, 854, 3, 64, 1589, 54, 9, 209, 1, 317, 10, 67, 2, 1489, 3793, 768, 5, 2830, 186, 1, 540, 9, 1243, 7430, 33, 322, 2, 384, 302, 5783, 9, 133, 135, 5, 7431, 28, 4, 139, 2829, 1489, 1993, 5, 7428, 322, 9, 502, 12, 116, 1781, 4, 55, 669, 101, 12, 7428, 322, 6, 226, 3794, 45, 2, 2201, 12, 8, 229, 21]]
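
To see the frequency ranking on something small enough to check by hand, here is the same idea applied to a made-up sentence. The `toy_` names are illustrative only.

```python
from collections import Counter

toy_words = "the cat sat on the mat the cat".split()
toy_counts = Counter(toy_words)

# most frequent word first; ties keep their first-seen order because the sort is stable
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)
toy_vocab_to_int = {word: ii for ii, word in enumerate(toy_vocab, 1)}

encoded = [toy_vocab_to_int[w] for w in "the cat sat".split()]
print(toy_vocab_to_int)
print(encoded)
```

No word maps to 0 here, which leaves that value available as a padding token.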


### Encoding labels
If a review is positive, the corresponding label is 1; otherwise it is 0.

In [0]:
labels_split = labels.split('\n')
encoded_labels = np.array([1 if label == 'positive' else 0 for label in labels_split])
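
One thing worth checking with this kind of encoding is a trailing newline in the label file. The sketch below uses a made-up two-line string (not the real `labels.txt`) to show that `split('\n')` then yields an extra empty entry, which would be encoded as 0, whereas `splitlines()` does not.

```python
toy_labels_text = "positive\nnegative\n"   # made-up content ending in a newline

with_split = toy_labels_text.split('\n')       # keeps a trailing empty string
with_splitlines = toy_labels_text.splitlines() # no trailing empty string

encode = lambda items: [1 if label == 'positive' else 0 for label in items]

print(with_split, encode(with_split))
print(with_splitlines, encode(with_splitlines))
```

Whether this matters for a given dataset depends on how the file ends, so comparing the number of labels with the number of reviews is a cheap sanity check.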


### Removing Outliers
This step involves - 
1. Getting rid of extremely long/short reviews
2. Padding/truncating remaining data to maintain constant review length.

<img src="https://github.com/purvasingh96/Deep-learning-with-neural-networks/blob/master/Deep-learning-with-pytorch/3.%20Recurrent%20Neural%20Networks/images/outliers_padding_ex.png?raw=1"></img>

In [11]:
# removing reviews of length 0
print('Number of reviews before removing outliers: ', len(reviews_int))
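# a list (not a set) keeps the indices in their original order
non_zero_idx = [ii for ii, review in enumerate(reviews_int) if len(review) != 0]

reviews_int = [reviews_int[ii] for ii in non_zero_idx]
# index the labels with the same positions so they stay aligned with the reviews
encoded_labels = np.array([encoded_labels[ii] for ii in non_zero_idx])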


print('Number of reviews after removing outliers: ', len(reviews_int))

Number of reviews before removing outliers:  2446
Number of reviews after removing outliers:  2446
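
When reviews are dropped, the labels have to be filtered with the same indices or the two lists fall out of step. A minimal sketch on made-up data (the `toy_` names are not part of the notebook):

```python
toy_reviews = [[5, 2], [], [7], [1, 3, 4]]
toy_labels = [1, 0, 0, 1]

# positions of the reviews that contain at least one token
keep = [i for i, review in enumerate(toy_reviews) if len(review) != 0]

filtered_reviews = [toy_reviews[i] for i in keep]
filtered_labels = [toy_labels[i] for i in keep]

print(keep)
print(filtered_reviews)
print(filtered_labels)
```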


In [0]:
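def pad_features(reviews_int, seq_length):
  '''Left-pad each review with 0s, or truncate it, to exactly seq_length tokens.'''
  features = np.zeros((len(reviews_int), seq_length), dtype=int)
  for i, row in enumerate(reviews_int):
    # short reviews fill the right-hand side; long reviews keep their first seq_length tokens
    features[i, -len(row):] = np.array(row)[:seq_length]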
  
  return features
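
The effect of left-padding and truncation is easiest to see on a few short rows. The helper below is a hypothetical variant written for this illustration; it follows the same idea as `pad_features` but also leaves empty rows as all zeros.

```python
import numpy as np

def pad_or_truncate(rows, seq_length):
    # hypothetical helper: left-pad with 0s or keep the first seq_length tokens
    out = np.zeros((len(rows), seq_length), dtype=int)
    for i, row in enumerate(rows):
        kept = row[:seq_length]
        if kept:  # an empty row stays all zeros
            out[i, seq_length - len(kept):] = kept
    return out

toy = [[1, 2, 3], [4, 5, 6, 7, 8, 9], []]
padded = pad_or_truncate(toy, seq_length=5)
print(padded)
```

The short row ends up right-aligned behind two zeros, and the long row loses its last token.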

# Training, Testing and Validating 

In [13]:
seq_length = 200
features = pad_features(reviews_int, seq_length=seq_length)
split_frac = 0.8

split_idx = int(len(features)*split_frac)
train_x, remaining_x = features[:split_idx], features[split_idx:]
train_y, remaining_y = encoded_labels[:split_idx], encoded_labels[split_idx:]

test_idx = int(len(remaining_x)*0.5)
val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:]
val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]

print(features[:30, :10])

[[    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [10874    40 14618    16   749 14619  4073    48    78    36]
 [ 2991   550    16     2  4399   165  4400   796     6  3555]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [   54     9    14   105    55   991   492    73   309     5]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    1   279   575    34     2   165   702  2997    10   316]
 [   10    11  6492  5257  1927   590   440    23   251   619]
 [    0     0     0     0     0     0     0     0     0
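
The 80/10/10 split comes down to a little index arithmetic. As a rough check, the sketch below repeats it for the 2446 reviews reported earlier and also counts how many full batches of 50 (the batch size used for the data loaders) fit into the training set; the variable names are illustrative.

```python
n_reviews = 2446          # count reported after removing outliers
split_frac = 0.8
batch_size = 50

n_train = int(n_reviews * split_frac)
n_remaining = n_reviews - n_train
n_val = int(n_remaining * 0.5)
n_test = n_remaining - n_val

# full batches in the training set, plus whatever is left over
full_batches, leftover = divmod(n_train, batch_size)

print(n_train, n_val, n_test)
print(full_batches, leftover)
```

A non-zero leftover means the last training batch is smaller than `batch_size`, which matters for any code that assumes a fixed batch size.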

# DataLoaders and Batching

A neat way to create data-loaders and batch our training, validation and test Tensor datasets is as follows -<br>
```python
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
train_loader = DataLoader(train_data, batch_size=batch_size)

```
This is an alternative to creating a generator function for batching our data into full batches.
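
For comparison, a hand-written generator that yields only full batches might look roughly like this. It is a sketch on plain Python lists with a hypothetical `full_batches` name, not code from this notebook.

```python
def full_batches(features, labels, batch_size):
    # hypothetical generator: yields complete batches and drops any remainder
    n_batches = len(features) // batch_size
    for b in range(n_batches):
        start = b * batch_size
        yield features[start:start + batch_size], labels[start:start + batch_size]

toy_x = [[i, i + 1] for i in range(7)]   # 7 made-up feature rows
toy_y = [i % 2 for i in range(7)]        # 7 made-up labels

batches = list(full_batches(toy_x, toy_y, batch_size=3))
print(len(batches))
print(batches[0])
```

Unlike this generator, `DataLoader` keeps the final partial batch by default and can also shuffle the data each epoch.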

In [0]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# create Tensor dataset
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
valid_data = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))
#test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))


# creating batch size
batch_size = 50

# creating data loader
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
#test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)


In [18]:
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)
print('sample_x :', sample_x)
print('sample_y', sample_y)


sample_x : tensor([[ 118,    2,  911,  ...,  959,   14, 3060],
        [   0,    0,    0,  ...,    2, 4304, 4701],
        [   0,    0,    0,  ...,    8,   13,  337],
        ...,
        [   0,    0,    0,  ...,    3, 3655, 7741],
        [  11,   15,    6,  ...,   40, 1950,  629],
        [   0,    0,    0,  ...,    1, 2999,   39]])
sample_y tensor([0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1,
        0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0,
        0, 1])


In [19]:
train_on_gpu = torch.cuda.is_available()
if(train_on_gpu):
  print('Training on GPU')
else:
  print('No GPU available.')

Training on GPU


# Sentiment Network with PyTorch

The RNN that performs sentiment analysis is made up of the following layers - 

1. An embedding layer that converts our word tokens (integers) into embeddings of a specific size.
2. An LSTM layer defined by a hidden_state size and number of layers
3. A fully-connected output layer that maps the LSTM layer outputs to a desired output_size
4. A sigmoid activation layer which turns all outputs into a value 0-1; return only the last sigmoid output as the output of this network.
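
Step 4 relies on reshaping the flattened per-time-step outputs back to one row per review and then keeping the last column. The sketch below mimics that with NumPy's `reshape` standing in for PyTorch's `view` (both use row-major order); the numbers are placeholders, not real sigmoid outputs.

```python
import numpy as np

batch_size, seq_length = 2, 4
# stand-in for per-time-step outputs, laid out as (batch_size * seq_length, 1)
flat = np.arange(batch_size * seq_length, dtype=float).reshape(-1, 1)

per_review = flat.reshape(batch_size, -1)   # back to (batch_size, seq_length)
last_step = per_review[:, -1]               # value at the final time step of each review

print(per_review)
print(last_step)
```

Because the rows are recovered from the batch size, the reshape only works if the batch size passed in matches the actual number of reviews in the batch.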

In [0]:
import torch.nn as nn

class SentimentRNN(nn.Module):
  def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5, lr=0.001):
    super(SentimentRNN, self).__init__()
    self.output_size = output_size
    self.n_layers = n_layers
    self.hidden_dim = hidden_dim

    # embedding layer and LSTM layer
    # torch.nn.Embedding(num_embeddings, embedding_dim, padding_idx=None, max_norm=None, 
    # norm_type=2.0, scale_grad_by_freq=False, sparse=False, _weight=None)
    self.embedding = nn.Embedding(vocab_size, embedding_dim)
    self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, batch_first=True, dropout=drop_prob)

    # Dropout layer
    self.dropout = nn.Dropout(0.3)

    # Fully connected layers
    self.fc = nn.Linear(hidden_dim, output_size)
    self.sig = nn.Sigmoid()

  def forward(self, x, hidden):
    # read the batch size from the input instead of hard-coding it
    batch_size = x.size(0)
    x = x.long()
    embeds = self.embedding(x)
    lstm_out, hidden = self.lstm(embeds, hidden)

    # stack the LSTM outputs so the linear layer sees one row per time step
    lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)

    out = self.dropout(lstm_out)
    out = self.fc(out)
    sig_out = self.sig(out)

    # reshape to (batch_size, seq_length) and keep only the last time step
    sig_out = sig_out.view(batch_size, -1)
    sig_out = sig_out[:, -1]

    return sig_out, hidden

  def init_hidden(self, batch_size):
    # zero-initialised hidden and cell states of shape (n_layers, batch_size, hidden_dim)
    weight = next(self.parameters()).data
    if(train_on_gpu):
      hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
    else:
      hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())

    return hidden





# Instantiate the network
Here, I will define the model hyper-parameters - 
1. `vocab_size`: Size of our vocabulary or the range of values for our input, word tokens.
2. `output_size`: Size of our desired output; the number of class scores we want to output (pos/neg).
3. `embedding_dim`: Number of columns in the embedding lookup table; size of our embeddings.
4. `hidden_dim`: Number of units in the hidden layers of our LSTM cells. Larger values usually perform better, at the cost of more parameters to train. Common values are 128, 256, 512, etc.
5. `n_layers`: Number of LSTM layers in the network. Typically between 1-3
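
To get a feel for how these hyper-parameters drive model size, the sketch below counts trainable parameters by hand for the values used in this notebook (a 24985-entry vocabulary including the padding index, 400-dimensional embeddings, 256 hidden units, 2 LSTM layers, 1 output). It assumes the usual PyTorch `nn.LSTM` layout of two weight matrices and two bias vectors per layer, each covering four gates; the helper name is hypothetical.

```python
vocab_size, embedding_dim, hidden_dim, n_layers, output_size = 24985, 400, 256, 2, 1

def lstm_layer_params(input_dim, hidden_dim):
    # 4 gates, each with input weights, recurrent weights and two bias vectors
    return 4 * hidden_dim * (input_dim + hidden_dim + 2)

embedding_params = vocab_size * embedding_dim
lstm_params = lstm_layer_params(embedding_dim, hidden_dim)               # first layer
lstm_params += (n_layers - 1) * lstm_layer_params(hidden_dim, hidden_dim)  # stacked layers
fc_params = hidden_dim * output_size + output_size

total_params = embedding_params + lstm_params + fc_params
print(embedding_params, lstm_params, fc_params, total_params)
```

Under these assumptions the embedding table accounts for most of the parameters, so `vocab_size` and `embedding_dim` dominate the model size.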

In [43]:
# Instantiate model with hyperparams

vocab_size = len(vocabulary_to_int)+1
output_size = 1
embedding_dim = 400
hidden_dim = 256
n_layers = 2

net = SentimentRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)
print(net)

SentimentRNN(
  (embedding): Embedding(24985, 400)
  (lstm): LSTM(400, 256, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sig): Sigmoid()
)


In [0]:
# loss and optimization functions
lr = 0.001

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
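
Binary cross-entropy, the quantity `nn.BCELoss` averages over a batch, can be worked out by hand for a couple of made-up predictions. This is a plain-Python sketch of the formula, not PyTorch's implementation, and the function name is illustrative.

```python
import math

def binary_cross_entropy(probs, targets):
    # mean of -(y*log(p) + (1-y)*log(1-p)) over the batch
    total = 0.0
    for p, y in zip(probs, targets):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

confident_right = binary_cross_entropy([0.9, 0.1], [1, 0])
confident_wrong = binary_cross_entropy([0.1, 0.9], [1, 0])

print(confident_right, confident_wrong)
```

Confident wrong predictions are penalised far more heavily than confident correct ones are rewarded, which is what pushes the sigmoid outputs towards the right label.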


In [45]:
# training params
epochs = 4
counter = 0
print_every = 100
clip = 5  # gradient clipping threshold

if(train_on_gpu):
  net.cuda()

net.train()
for e in range(epochs):
  h = net.init_hidden(batch_size)

  for inputs, labels in train_loader:
    # the hidden state is sized for full batches, so skip a smaller final batch
    if inputs.size(0) != batch_size:
      continue
    counter += 1
    if(train_on_gpu):
      inputs, labels = inputs.cuda(), labels.cuda()

    # detach the hidden state so we don't backpropagate through earlier batches
    h = tuple([each.data for each in h])
    net.zero_grad()

    output, h = net(inputs, h)
    loss = criterion(output.squeeze(), labels.float())
    loss.backward()
    # clip gradients to guard against exploding gradients in the LSTM
    nn.utils.clip_grad_norm_(net.parameters(), clip)
    optimizer.step()

    if counter % print_every == 0:
      val_h = net.init_hidden(batch_size)
      val_losses = []
      net.eval()
      for inputs, labels in valid_loader:
        if inputs.size(0) != batch_size:
          continue
        val_h = tuple([each.data for each in val_h])
        if(train_on_gpu):
          inputs, labels = inputs.cuda(), labels.cuda()
        output, val_h = net(inputs, val_h)
        val_loss = criterion(output.squeeze(), labels.float())
        val_losses.append(val_loss.item())
      net.train()
      print("Epoch: {}/{}...".format(e+1, epochs),
            "Step: {}...".format(counter),
            "Loss: {:.6f}...".format(loss.item()),
            "Val Loss: {:.6f}".format(np.mean(val_losses)))