<a href="https://colab.research.google.com/github/graviraja/100-Days-of-NLP/blob/applications%2Fsentiment/applications/classification/Improved%20Sentiment%20Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis

Sentiment analysis refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.



## IMDB - Dataset of 50K Movie Reviews

This is a dataset for binary sentiment classification containing a set of 25,000 highly polar movie reviews for training and 25,000 for testing.

In the [previous notebook](https://github.com/graviraja/100-Days-of-NLP/blob/master/applications/classification/Simple%20Sentiment%20Analysis.ipynb), we explored basic sentiment analysis using an RNN. The `test_accuracy` there was less than **`50%`**.

Here we will explore a few techniques that can help increase the accuracy of the model.

We will try the following:

- packed padded sequences
- pre-trained word embeddings
- different RNN architecture
- bidirectional RNN
- multi-layer RNN
- regularization
- a different optimizer

If you are not familiar with any of these:

- packed padded sequences: check out this [snippet](https://github.com/graviraja/pytorch-sample-codes/blob/master/pad_sequences.py)
- embeddings: have a look at [this](https://github.com/graviraja/100-Days-of-NLP/tree/master/embeddings)
- how a multi-layer or bidirectional RNN works: check out my explanation [here](https://github.com/graviraja/100-Days-of-NLP/blob/master/architectures/RNN.ipynb)
- dropout: see this [Andrew Ng tutorial](https://www.youtube.com/watch?v=ARq74QuavAo)

Reference:
- [Ben Trevett Notebook](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/2%20-%20Upgraded%20Sentiment%20Analysis.ipynb)

This is the overview of the model:

![arch](https://drive.google.com/uc?id=1yevVkm4nVt19aW36YQiehZWbJJwb7-7D)

With that said, let's get into the code.

# Code

## Imports

In [0]:
import time
import torch
import random
import torch.nn as nn
import torch.optim as optim

from torchtext import data
from torchtext import datasets
from tqdm import tqdm

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

## Data Processing

For packed padded sequences to work, the RNN needs the sequence lengths along with the sequences. By default, a `Field` does not return the lengths, so we have to ask for them explicitly by setting **`include_lengths = True`** when declaring the field. This causes `batch.text` to be a tuple, with the first element being our sentences (a numericalized tensor that has been padded) and the second element being the actual lengths of our sentences.

In [0]:
TEXT = data.Field(tokenize = 'spacy', include_lengths = True)
LABEL = data.LabelField(dtype = torch.float)
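
To make the role of the lengths concrete, here is a minimal sketch in plain PyTorch, independent of TorchText. The toy token ids and the pad index of 1 are assumptions made up for illustration; only `pad_sequence` and `pack_padded_sequence` are real library calls.

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

PAD_IDX = 1  # assumed pad index for this toy example

# Three toy "sentences" of token ids, already sorted by decreasing length
sentences = [torch.tensor([4, 5, 6, 7]), torch.tensor([8, 9]), torch.tensor([10])]
lengths = torch.tensor([len(s) for s in sentences])

# Pad to a common length: shape [seq_len, batch_size] = [4, 3]
padded = pad_sequence(sentences, padding_value=PAD_IDX)

# Packing uses the lengths to drop the pad positions
packed = pack_padded_sequence(padded, lengths)

print(padded.shape)        # torch.Size([4, 3])
print(packed.data)         # tensor([ 4,  8, 10,  5,  9,  6,  7])
print(packed.batch_sizes)  # tensor([3, 2, 1, 1])
```

The packed data holds only the 7 real tokens, ordered time step by time step, and `batch_sizes` records how many sequences are still active at each step. Without the lengths, the RNN would have no way to tell the pad positions from real tokens.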

## IMDB Dataset

In [3]:
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:06<00:00, 12.1MB/s]


In [0]:
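# random.seed() returns None, so seed first and pass the generator state explicitly
random.seed(SEED)
train_data, valid_data = train_data.split(random_state = random.getstate())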

In [5]:

print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 17500
Number of validation examples: 7500
Number of testing examples: 25000


In [6]:
print(vars(train_data.examples[0]))

{'text': ['Before', 'the', 'release', 'of', 'George', 'Romero', "'s", 'genre', '-', 'defining', 'Night', 'of', 'the', 'Living', 'Dead', ',', 'zombies', 'were', 'relatively', 'well', '-', 'behaved', 'creatures', '.', 'They', 'certainly', 'had', 'much', 'better', 'table', '-', 'manners', 'in', 'the', 'old', 'days', '.', 'But', 'social', 'etiquette', 'aside', 'what', 'thrills', 'did', 'these', 'early', 'zombies', 'offer', 'to', 'the', 'movie', '-', 'going', 'public', '?', 'Judging', 'by', 'this', 'film', ',', 'none', 'whatsoever.<br', '/><br', '/>The', 'story', 'is', 'about', 'an', 'expedition', 'to', 'Cambodia', ',', 'whose', 'purpose', 'is', 'to', 'find', 'and', 'destroy', 'the', 'secret', 'of', 'zombiefication', '.', 'One', 'of', 'the', 'party', 'discovers', 'the', 'secrets', 'on', 'his', 'own', 'and', 'sets', 'about', 'building', 'his', 'zombie', 'army.<br', '/><br', '/>This', 'film', 'is', 'basically', 'a', 'love', 'triangle', 'with', 'zombies', '.', 'But', 'seeing', 'as', 'this', 'i

## Vocabulary


Using pre-trained word embeddings is one of the most important steps in many state-of-the-art models.

#### Loading the Pre-trained Word Embeddings
We get these vectors simply by specifying which vectors we want and passing them as an argument to `build_vocab`. TorchText handles downloading the vectors and associating them with the correct words in our vocabulary.

By default, TorchText initializes words that are in your vocabulary but not in your pre-trained embeddings to zero. We don't want this; instead, we initialize them randomly by setting `unk_init` to `torch.Tensor.normal_`, which draws their vectors from a Gaussian distribution.

In [7]:
MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(train_data, 
                 max_size = MAX_VOCAB_SIZE, 
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)

LABEL.build_vocab(train_data)

.vector_cache/glove.6B.zip: 862MB [06:30, 2.21MB/s]                           
100%|█████████▉| 399853/400000 [00:21<00:00, 18813.42it/s]
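
The effect of `unk_init` can be sketched in plain Python. This is only an illustration of the behaviour described above, not TorchText's actual implementation; the tiny embedding dimension, the `pretrained` dictionary and the `build_vectors` helper are all hypothetical.

```python
import random

random.seed(0)
EMB_DIM = 4  # tiny dimension for illustration (the notebook uses 100)

# Hypothetical pre-trained vectors: only two words are covered
pretrained = {"film": [0.1, 0.2, 0.3, 0.4], "good": [0.5, 0.6, 0.7, 0.8]}
vocab = ["<unk>", "<pad>", "film", "good", "zombiefication"]

def zeros(dim):
    """Default behaviour: missing words get an all-zero vector."""
    return [0.0] * dim

def gaussian(dim):
    """Alternative: missing words get a vector drawn from N(0, 1)."""
    return [random.gauss(0.0, 1.0) for _ in range(dim)]

def build_vectors(vocab, pretrained, unk_init):
    """Copy pre-trained vectors where available, otherwise call unk_init."""
    return [list(pretrained[w]) if w in pretrained else unk_init(EMB_DIM)
            for w in vocab]

default_vectors = build_vectors(vocab, pretrained, zeros)
random_vectors = build_vectors(vocab, pretrained, gaussian)

print(default_vectors[4])  # [0.0, 0.0, 0.0, 0.0]
print(random_vectors[4])   # four random values instead of zeros
```

Words covered by the pre-trained vectors come out the same in both cases; only the uncovered ones differ.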

## Iterators

Let’s create the iterators for our data.

These can be iterated on to return a batch of data, which has a `text` attribute (the PyTorch tensors containing a batch of numericalized movie reviews) and a `label` attribute (the PyTorch tensors containing the sentiment of each review in the batch).

The words also need to be replaced by their indexes from the vocabulary, since a model only takes numbers as input.

We use a `BucketIterator` instead of the standard `Iterator` because it creates batches in a way that minimizes the amount of padding.

In addition, for packed padded sequences, all of the sequences within a batch need to be sorted by their lengths. The iterator handles this when we set `sort_within_batch = True`.

In [0]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    sort_within_batch = True,
    device = device)
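
To see why grouping reviews of similar length reduces padding, here is a simplified sketch in plain Python. The review lengths and the `padding_needed` helper are hypothetical, and a full sort stands in for the bucketing, so this shows the idea rather than how `BucketIterator` is actually implemented.

```python
def padding_needed(lengths, batch_size):
    """Count pad tokens when consecutive examples are grouped into batches."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # every example is padded up to the longest one in its batch
        total += sum(max(batch) - n for n in batch)
    return total

# Hypothetical review lengths (in tokens), short and long reviews interleaved
lengths = [12, 250, 14, 240, 11, 260, 13, 255]

unsorted_padding = padding_needed(lengths, batch_size=4)
bucketed_padding = padding_needed(sorted(lengths), batch_size=4)

print(unsorted_padding)  # 985
print(bucketed_padding)  # 41
```

With mixed batches, every short review is padded up to the longest review in its batch. Once similar lengths are batched together, almost no padding is left.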

## Model

The model is a multi-layer bidirectional LSTM.

![arch](https://drive.google.com/uc?id=1yevVkm4nVt19aW36YQiehZWbJJwb7-7D)

In [0]:
class RNN(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout, pad_idx):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        self.rnn = nn.LSTM(
                        emb_dim,
                        hidden_dim,
                        num_layers=n_layers,
                        bidirectional=bidirectional,
                        dropout=dropout)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, input, input_lengths):
        # input => [seq_len, batch_size]
        # input_lengths => [batch_size]
        batch_size = input.shape[1]
        embedded = self.embedding(input)
        embedded = self.dropout(embedded)
        # embedded => [seq_len, batch_size, emb_dim]

        # packing the sequence (recent PyTorch versions expect the lengths on the CPU)
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, input_lengths.cpu())

        packed_output, (hidden, cell) = self.rnn(packed_embedded)

        # unpacking the sequence
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)

        # output => [seq_len, batch_size, hidden_dim * num_dir]
        # hidden => [num_layers * num_dir, batch_size, hidden_dim]
        # cell => [num_layers * num_dir, batch_size, hidden_dim]

        hidden = hidden.view(self.n_layers, -1 , batch_size, self.hidden_dim)
        # [1, 2, b, h] => single layer bi dir   
        #              => final layer forward hidden [-1][0][:] => [batch_size, hidden_dim]
        #              => final layer backward hidden [-1][1][:] => [batch_size, hidden_dim]
        # [2, 2, b, h] => multi layer bi dir    => [-1][0]  [-1][1]
        #              => final layer forward hidden [-1][0][:] => [batch_size, hidden_dim]
        #              => final layer backward hidden [-1][1][:] => [batch_size, hidden_dim]

        # concatenating final forward and final backward hidden states
        hidden = torch.cat((hidden[-1][0][:], hidden[-1][1][:]), dim=1)
        # hidden => [batch_size, hidden_dim * num_dir(2)]

        hidden = self.dropout(hidden)

        logits = self.fc(hidden)
        # logits => [batch_size, output_dim]
        
        return logits
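
The hidden-state indexing used in `forward` can be checked on a small standalone LSTM. This is a sketch with made-up sizes and random inputs (no packing, no embedding layer); it only confirms that the last two rows of `hidden` belong to the top layer's forward and backward directions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N_LAYERS, BATCH, EMB_DIM, HIDDEN_DIM = 2, 3, 5, 4  # toy sizes

lstm = nn.LSTM(EMB_DIM, HIDDEN_DIM, num_layers=N_LAYERS, bidirectional=True)
x = torch.randn(6, BATCH, EMB_DIM)  # [seq_len, batch_size, emb_dim]

output, (hidden, cell) = lstm(x)
# hidden => [num_layers * num_dir, batch_size, hidden_dim] = [4, 3, 4]

by_layer = hidden.view(N_LAYERS, 2, BATCH, HIDDEN_DIM)
final_forward, final_backward = by_layer[-1][0], by_layer[-1][1]

# Same tensors via flat indexing: the last two rows are the top layer
assert torch.equal(final_forward, hidden[-2])
assert torch.equal(final_backward, hidden[-1])

# For full-length, unpacked inputs the forward state is the last output step
# and the backward state is the first output step
assert torch.allclose(output[-1, :, :HIDDEN_DIM], final_forward)
assert torch.allclose(output[0, :, HIDDEN_DIM:], final_backward)

features = torch.cat((final_forward, final_backward), dim=1)
print(features.shape)  # torch.Size([3, 8])
```

The concatenated tensor has `hidden_dim * 2` features per example, which is why the linear layer in the model takes `hidden_dim * 2` inputs.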

In [12]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX)
model

RNN(
  (embedding): Embedding(25002, 100, padding_idx=1)
  (rnn): LSTM(100, 256, num_layers=2, dropout=0.5, bidirectional=True)
  (fc): Linear(in_features=512, out_features=1, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)

In [13]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model)} trainable parameters')

The model has 4810857 trainable parameters


In [14]:
pretrained_embeddings = TEXT.vocab.vectors

print(pretrained_embeddings.shape)

torch.Size([25002, 100])


In [15]:
model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[-0.1117, -0.4966,  0.1631,  ...,  1.2647, -0.2753, -0.1325],
        [-0.8555, -0.7208,  1.3755,  ...,  0.0825, -1.1314,  0.3997],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [-0.4414,  0.4593,  0.6607,  ..., -0.2775,  0.0806, -0.4131],
        [-0.0608, -1.0925,  0.3700,  ..., -0.7628,  0.1420, -0.4692],
        [ 0.1362, -0.9169,  0.1200,  ..., -0.0716,  0.8561, -0.9624]])

As our `<unk>` and `<pad>` tokens aren't in the pre-trained vocabulary, they were initialized using `unk_init` (an $\mathcal{N}(0,1)$ distribution) when building our vocab.

It is preferable to initialize them both to all zeros to explicitly tell our model that, initially, they are irrelevant for determining sentiment.

In [16]:

UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

print(model.embedding.weight.data)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [-0.4414,  0.4593,  0.6607,  ..., -0.2775,  0.0806, -0.4131],
        [-0.0608, -1.0925,  0.3700,  ..., -0.7628,  0.1420, -0.4692],
        [ 0.1362, -0.9169,  0.1200,  ..., -0.0716,  0.8561, -0.9624]])


## Optimizer & Criterion

The optimizer is used to update the parameters of the model. Here, we'll use **`Adam`**. Its first argument is the parameters that the optimizer will update; the second is the learning rate, i.e. how much we change the parameters by on each update. We will use the default learning rate.

Next, we'll define our loss function. In PyTorch this is commonly called a criterion.

The loss function here is binary cross entropy with logits.

In [0]:
optimizer = optim.Adam(model.parameters())

In [0]:
criterion = nn.BCEWithLogitsLoss()

model = model.to(device)
criterion = criterion.to(device)
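
For intuition, binary cross entropy with logits can be written out for a single example in plain Python. This is a sketch of the underlying formula, not PyTorch's implementation; the helper names and the sample logits are made up. The stable form follows from substituting the sigmoid into the usual cross-entropy expression.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_naive(logit, target):
    """Apply the sigmoid first, then the usual cross-entropy formula."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def bce_with_logits(logit, target):
    """Stable form: max(x, 0) - x * y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# Hypothetical (logit, label) pairs
for logit, target in [(2.0, 1.0), (-1.5, 0.0), (0.3, 0.0)]:
    naive = bce_naive(logit, target)
    stable = bce_with_logits(logit, target)
    print(f'{logit:5.1f} {target:.0f} {naive:.4f} {stable:.4f}')
```

Both versions agree for moderate logits. For a very confident wrong prediction such as a logit of 100 with label 0, the naive version fails because the sigmoid rounds to exactly 1 and the logarithm of 0 is undefined, while the stable form simply returns a loss of about 100. With its default settings, `nn.BCEWithLogitsLoss` averages this per-example loss over the batch.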

## Accuracy

Since the labels are binary (0 and 1), applying a sigmoid to the logits maps them to the 0-1 range, and rounding then gives a value of 0 or 1. Comparing these predictions with the ground truth labels gives the accuracy.

In [0]:
def binary_accuracy(preds, y):
    predicted = torch.round(torch.sigmoid(preds))
    correct = (predicted == y).float()
    acc = correct.sum() / len(correct)
    return acc
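
The same computation can be traced by hand on a tiny batch. The logits and labels below are hypothetical, and plain Python lists stand in for tensors.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

logits = [2.0, -1.0, 0.3, -0.2]  # hypothetical raw model outputs
labels = [1.0, 0.0, 0.0, 0.0]    # hypothetical ground truth

probs = [sigmoid(z) for z in logits]          # roughly 0.88, 0.27, 0.57, 0.45
predicted = [float(round(p)) for p in probs]  # threshold at 0.5
correct = [float(p == y) for p, y in zip(predicted, labels)]
accuracy = sum(correct) / len(correct)

print(predicted)  # [1.0, 0.0, 1.0, 0.0]
print(accuracy)   # 0.75
```

Only the third example is misclassified, so the accuracy is 3 out of 4. Rounding the sigmoid output at 0.5 is equivalent to checking whether the logit is positive.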


## Training

In training the following steps are performed:

- Put the model in training mode
- Iterate over batches
- In each batch do the following:

    - Zero the gradients with `optimizer.zero_grad()`. PyTorch does not automatically remove (or "zero") the gradients calculated from the last gradient calculation, so they must be manually zeroed.
    - Do the forward pass
    - Calculate loss using the criterion
    - Calculate the accuracy
    - Perform the backward pass
    - Update the parameters of the model
    - Update the loss, accuracy of the epoch
- Return the loss and accuracy of the epoch

In [0]:
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0

    model.train()
    
    for batch in iterator:
        optimizer.zero_grad()

        input, input_lengths = batch.text
        preds = model(input, input_lengths)
        preds = preds.squeeze(1)

        loss = criterion(preds, batch.label)
        acc = binary_accuracy(preds, batch.label)
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

## Evaluating

In evaluation the following steps are performed:

- Put the model in evaluation mode
- Gradients do not need to be calculated during evaluation
- Iterate over batches
- In each batch do the following:
    - Do the forward pass
    - Calculate loss using the criterion
    - Calculate the accuracy
    - Update the loss, accuracy of the epoch
- Return the loss and accuracy of the epoch

In [0]:
def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0

    model.eval()
    
    # gradients are not needed during evaluation
    with torch.no_grad():
        for batch in iterator:
            input, input_lengths = batch.text
            preds = model(input, input_lengths)
            preds = preds.squeeze(1)
            loss = criterion(preds, batch.label)
            acc = binary_accuracy(preds, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)


In [0]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs


## Running

`N_EPOCHS`: the number of epochs (full passes over the training dataset) to train the model for.

After each epoch, save the model if its validation loss is the lowest seen so far.

Finally, calculate the accuracy on `test_data` using the best saved model.

In [26]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    start_time = time.time()

    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)

    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'model.pt')
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

model.load_state_dict(torch.load('model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')
    

Epoch: 01 | Epoch Time: 1m 38s
	Train Loss: 0.258 | Train Acc: 89.76%
	 Val. Loss: 0.508 |  Val. Acc: 82.96%
Epoch: 02 | Epoch Time: 1m 38s
	Train Loss: 0.262 | Train Acc: 89.24%
	 Val. Loss: 0.288 |  Val. Acc: 88.87%
Epoch: 03 | Epoch Time: 1m 38s
	Train Loss: 0.183 | Train Acc: 93.09%
	 Val. Loss: 0.327 |  Val. Acc: 88.55%
Epoch: 04 | Epoch Time: 1m 38s
	Train Loss: 0.166 | Train Acc: 93.91%
	 Val. Loss: 0.283 |  Val. Acc: 89.81%
Epoch: 05 | Epoch Time: 1m 37s
	Train Loss: 0.139 | Train Acc: 94.83%
	 Val. Loss: 0.281 |  Val. Acc: 90.32%
Test Loss: 0.300 | Test Acc: 89.22%
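
As a quick check of the checkpointing logic, the validation losses from the log above can be replayed in plain Python to see at which epochs `model.pt` would have been overwritten. The `saved_at` list is a hypothetical helper for illustration only.

```python
# Validation losses copied from the training log above
valid_losses = [0.508, 0.288, 0.327, 0.283, 0.281]

best_valid_loss = float('inf')
saved_at = []  # epochs at which the checkpoint would be overwritten

for epoch, valid_loss in enumerate(valid_losses, start=1):
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        saved_at.append(epoch)

print(saved_at)         # [1, 2, 4, 5]
print(best_valid_loss)  # 0.281
```

Epoch 3 is skipped because its validation loss went up, so the weights loaded for testing in this run are those from epoch 5.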


## Inference

We can now use our model to predict the sentiment of any sentence we give it. As it has been trained on movie reviews, the sentences provided should also be movie reviews.

Our `inference` function does a few things:

- sets the model to evaluation mode
- tokenizes the sentence, i.e. splits it from a raw string into a list of tokens
- indexes the tokens by converting them into their integer representation from our vocabulary
- gets the length of our sequence
- converts the indexes, which are a Python list, into a PyTorch tensor
- adds a batch dimension by unsqueezing
- converts the length into a tensor
- squashes the output prediction from a real number to a value between 0 and 1 with the sigmoid function
- converts the tensor holding a single value into a Python float with the item() method

We are expecting reviews with a negative sentiment to return a value close to 0 and positive reviews to return a value close to 1.

In [0]:
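import spacy
# 'en' is a shortcut link from older spaCy releases; on spaCy 3+ load 'en_core_web_sm' instead
nlp = spacy.load('en')

def inference(model, sentence):
    model.eval()
    # tokenize and map each token to its vocabulary index
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    length = [len(indexed)]
    # shape [seq_len] => [seq_len, 1] to add the batch dimension
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)
    length_tensor = torch.LongTensor(length)
    # no gradients are needed for a single prediction
    with torch.no_grad():
        prediction = torch.sigmoid(model(tensor, length_tensor))
    return prediction.item()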
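
One detail worth spelling out is what happens to words the vocabulary has never seen. The sketch below imitates the lookup with a toy `defaultdict`; the vocabulary entries are hypothetical, and the indices 0 and 1 for `<unk>` and `<pad>` are assumed to match the embedding rows zeroed earlier.

```python
from collections import defaultdict

UNK_IDX, PAD_IDX = 0, 1  # assumed indices for <unk> and <pad>

# Hypothetical vocabulary; unseen tokens fall back to UNK_IDX
stoi = defaultdict(lambda: UNK_IDX)
stoi.update({"<unk>": UNK_IDX, "<pad>": PAD_IDX,
             "This": 2, "film": 3, "is": 4, "terrible": 5})

tokens = "This film is zombiefication".split()
indexed = [stoi[t] for t in tokens]

print(indexed)  # [2, 3, 4, 0]
```

Because unknown words all map to the same index, a review made up mostly of out-of-vocabulary tokens gives the model very little to work with.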

Example of a negative review:

In [28]:
inference(model, "This film is terrible")

0.012477142736315727

Example of a positive review:

In [29]:
inference(model, "I love this film")

0.7550132274627686