<a href="https://colab.research.google.com/github/graviraja/100-Days-of-NLP/blob/applications%2Fsentiment/applications/classification/Simple%20Sentiment%20Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis

Sentiment analysis refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.

## IMDB - Dataset of 50K Movie Reviews

This is a dataset for binary sentiment classification containing a set of 25,000 highly polar movie reviews for training and 25,000 for testing. 

This notebook covers the basic workflow. We'll learn how to: load data, create train/test/validation splits, build a vocabulary, create data iterators, define a model and implement the train/evaluate/test loop.

The model used is a simple RNN.

![arch](https://drive.google.com/uc?id=1LcoUUyg3S7JBII4buOHVVj46LU7nLCPt)

Resources:

- [IMDB dataset](http://ai.stanford.edu/~amaas/data/sentiment/)
- [Ben Trevett Sentiment analysis](https://github.com/bentrevett/pytorch-sentiment-analysis)

## Introduction

We'll be using a recurrent neural network (RNN) for analysing the text. If you are not familiar with RNNs, check out my notebook on [RNN](https://github.com/graviraja/100-Days-of-NLP/blob/master/architectures/RNN.ipynb).

Let's first look at the basic equations of an RNN.

The input to the RNN is a sequence $X = \{x_1, x_2, \dots, x_T\}$, and the hidden states $H = \{h_1, h_2, \dots, h_T\}$ are calculated using the following equation:

$$h_t = RNN(x_t, h_{t-1})$$

Once the final hidden state $h_T$ is calculated, it is fed through a linear layer $f$ to predict the sentiment: $\hat{y} = f(h_T)$
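
To make the recurrence concrete, here is a minimal framework-free sketch. It assumes the common Elman form $h_t = \tanh(w_{xh} x_t + w_{hh} h_{t-1} + b)$ with scalar inputs and a scalar hidden state. The weight values and the names `rnn_step`, `run_rnn` and `predict` are made up for illustration; PyTorch's `nn.RNN` applies the same kind of update with learned weight matrices.

```python
import math

def rnn_step(x_t, h_prev, w_xh=0.5, w_hh=0.8, b=0.0):
    """One recurrence step: h_t = tanh(w_xh * x_t + w_hh * h_prev + b)."""
    return math.tanh(w_xh * x_t + w_hh * h_prev + b)

def run_rnn(xs, h0=0.0):
    """Apply rnn_step over the whole sequence and return every hidden state."""
    hidden_states = []
    h = h0
    for x_t in xs:
        h = rnn_step(x_t, h)
        hidden_states.append(h)
    return hidden_states

def predict(h_final, w=2.0, b=0.0):
    """Linear layer f applied to the final hidden state (a single logit)."""
    return w * h_final + b

xs = [1.0, -0.5, 2.0]          # toy input sequence (hypothetical values)
hs = run_rnn(xs)               # one hidden state per input
y_hat = predict(hs[-1])        # only the last hidden state is used
print(hs, y_hat)
```

Only the last element of `hs` reaches the linear layer, which mirrors how the model in this notebook uses the final hidden state.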

## Code

In [0]:
import time
import random
import torch

import torch.nn as nn
import torch.optim as optim

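# NOTE: Field, LabelField and BucketIterator belong to the legacy torchtext API
# (moved to torchtext.legacy in 0.9 and removed in 0.12), so an older torchtext is needed
from torchtext import data, datasets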

import numpy as np
import pandas as pd

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

## Field
A `Field` defines how a piece of data should be processed; its parameters control the details, such as tokenization.

We use the TEXT field to define how the review should be processed, and the LABEL field to process the sentiment.

Our TEXT field is created with `tokenize='spacy'`, which means that tokenization (the act of splitting the string into discrete "tokens") is done using the spaCy tokenizer.
If no `tokenize` argument is passed, the default is to simply split the string on whitespace.

LABEL is defined by a LabelField, a subclass of the Field class specifically used for handling labels.

For more on Fields, go [here](https://pytorch.org/text/data.html#field).
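
To see why the choice of tokenizer matters, here is a small sketch comparing plain whitespace splitting with a punctuation-aware tokenizer. The regex tokenizer `simple_tokenize` is only a stand-in written for this example; it is not spaCy and does not follow spaCy's rules (for instance, spaCy keeps `'s` as one token), but it shows the general idea of separating punctuation from words.

```python
import re

review = "This film's plot isn't great, but I loved it."

# Default behaviour when no tokenizer is given: split on whitespace.
whitespace_tokens = review.split()

def simple_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

rule_tokens = simple_tokenize(review)

print(whitespace_tokens)  # punctuation stays glued to words, e.g. 'great,' and 'it.'
print(rule_tokens)        # punctuation becomes separate tokens
```

With whitespace splitting, `great,` and `great` would count as two different vocabulary entries, which is one reason a proper tokenizer is usually preferred.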

In [0]:
# check the code documentation for more parameters
data.Field??

In [0]:
TEXT = data.Field(tokenize = 'spacy')
LABEL = data.LabelField(dtype = torch.float)

## IMDB dataset

`TorchText` supports various datasets used in NLP. Check out all the supported datasets [here](https://pytorch.org/text/datasets.html).

In [11]:
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

aclImdb_v1.tar.gz:   0%|          | 0.00/84.1M [00:00<?, ?B/s]

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:03<00:00, 25.3MB/s]


In [12]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 25000
Number of testing examples: 25000


In [0]:
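# random.seed() returns None, so seed the global RNG first and pass its state explicitly
random.seed(SEED)
train_data, valid_data = train_data.split(split_ratio = 0.7, random_state = random.getstate())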

In [14]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 17500
Number of validation examples: 7500
Number of testing examples: 25000
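
The 17,500 / 7,500 sizes follow from a 70/30 split of the 25,000 training reviews. The sketch below reproduces that arithmetic with a seeded shuffle of indices. The helper `split_indices` is hypothetical and is not how torchtext implements `split()`; it only illustrates the idea.

```python
import random

def split_indices(n_examples, split_ratio=0.7, seed=1234):
    """Shuffle example indices reproducibly and cut them into two parts."""
    rng = random.Random(seed)           # private RNG, so the global state is untouched
    indices = list(range(n_examples))
    rng.shuffle(indices)
    n_train = round(n_examples * split_ratio)
    return indices[:n_train], indices[n_train:]

train_idx, valid_idx = split_indices(25_000)
print(len(train_idx), len(valid_idx))   # 17500 7500
```

Using the same seed always yields the same split, which keeps validation results comparable between runs.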


In [15]:
print(vars(train_data.examples[0]))

{'text': ['Before', 'the', 'release', 'of', 'George', 'Romero', "'s", 'genre', '-', 'defining', 'Night', 'of', 'the', 'Living', 'Dead', ',', 'zombies', 'were', 'relatively', 'well', '-', 'behaved', 'creatures', '.', 'They', 'certainly', 'had', 'much', 'better', 'table', '-', 'manners', 'in', 'the', 'old', 'days', '.', 'But', 'social', 'etiquette', 'aside', 'what', 'thrills', 'did', 'these', 'early', 'zombies', 'offer', 'to', 'the', 'movie', '-', 'going', 'public', '?', 'Judging', 'by', 'this', 'film', ',', 'none', 'whatsoever.<br', '/><br', '/>The', 'story', 'is', 'about', 'an', 'expedition', 'to', 'Cambodia', ',', 'whose', 'purpose', 'is', 'to', 'find', 'and', 'destroy', 'the', 'secret', 'of', 'zombiefication', '.', 'One', 'of', 'the', 'party', 'discovers', 'the', 'secrets', 'on', 'his', 'own', 'and', 'sets', 'about', 'building', 'his', 'zombie', 'army.<br', '/><br', '/>This', 'film', 'is', 'basically', 'a', 'love', 'triangle', 'with', 'zombies', '.', 'But', 'seeing', 'as', 'this', 'i

## Vocabulary

Next, we have to build a vocabulary. 

With TorchText’s Field that is extremely simple. We don’t need to worry about creating dicts, mapping words to indices, mapping indices to words, counting the words, etc. All of this is done by the Field for us.

The `build_vocab` method of the Field offers two ways to limit the vocabulary: `min_freq` sets the minimum number of times a token must appear, and `max_size` keeps only the most frequent tokens. We use `max_size` here. Tokens that do not make it into the vocabulary are converted into an `<unk>` (unknown) token.

*Note : We will use only training data for creating the vocabulary*

In [0]:
MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(train_data, max_size = MAX_VOCAB_SIZE)
LABEL.build_vocab(train_data)

In [17]:
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}")


Unique tokens in TEXT vocabulary: 25002
Unique tokens in LABEL vocabulary: 2


Why is the vocab size 25002 and not 25000? One of the additional tokens is the `<unk>` token and the other is a `<pad>` token.
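
The following sketch shows roughly what building a vocabulary involves and why the final size is `max_size + 2`. The functions `build_vocab` and `numericalize` are simplified stand-ins written for this example, not torchtext's implementation (the tie-breaking order, for instance, is an assumption).

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size=None, min_freq=1,
                specials=("<unk>", "<pad>")):
    """Return (itos, stoi): index-to-token list and token-to-index dict."""
    counter = Counter(tok for text in tokenized_texts for tok in text)
    # Most frequent first; ties broken alphabetically to stay deterministic.
    candidates = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [tok for tok, freq in candidates if freq >= min_freq]
    if max_size is not None:
        kept = kept[:max_size]
    itos = list(specials) + kept        # special tokens come on top of max_size
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to indices; unknown tokens map to the <unk> index."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokens]

texts = [["the", "film", "was", "great"],
         ["the", "film", "was", "bad"],
         ["the", "end"]]
itos, stoi = build_vocab(texts, max_size=3)
print(itos)                                                # ['<unk>', '<pad>', 'the', 'film', 'was']
print(numericalize(["the", "plot", "was", "great"], stoi)) # [2, 0, 4, 0]
```

With `max_size=3` the vocabulary has 5 entries, just as `max_size = 25_000` gives 25,002 in the notebook.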

In [20]:
print(TEXT.vocab.freqs.most_common(10))

[('the', 202394), (',', 192899), ('.', 165582), ('and', 109516), ('a', 109451), ('of', 100959), ('to', 93453), ('is', 76282), ('in', 61204), ('I', 54045)]


In [22]:
print(TEXT.vocab.itos[:10])

['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is']


## Iterators

Let’s create the iterators for our data.

These can be iterated over to return batches of data. Each batch has a `text` attribute (a PyTorch tensor containing a batch of numericalized movie reviews) and a `label` attribute (a PyTorch tensor containing the sentiment of each review in the batch).

Because a model only takes numbers as input, each word is also replaced with its index in the vocabulary.

We use a BucketIterator instead of the standard Iterator because it groups reviews of similar length into the same batch, which minimizes the amount of padding.

This can be done as follows:

In [0]:
BATCH_SIZE = 64

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    device = device)
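
To get a feel for why bucketing helps, the sketch below counts the padding needed when batches are formed in random order versus after sorting by length. It is a simplification under stated assumptions: the sequence lengths are randomly generated, the helpers `pad_batch` and `count_padding` are hypothetical, and the real `BucketIterator` also shuffles batches, which this example ignores.

```python
import random

def pad_batch(batch, pad_idx=1):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_idx] * (max_len - len(seq)) for seq in batch]

def count_padding(sequences, batch_size):
    """Total number of pad tokens needed when batching in the given order."""
    total = 0
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        max_len = max(len(seq) for seq in batch)
        total += sum(max_len - len(seq) for seq in batch)
    return total

rng = random.Random(0)
# 640 fake "reviews" of random length; token index 2 is just a placeholder.
sequences = [[2] * rng.randint(5, 200) for _ in range(640)]

padding_random = count_padding(sequences, 64)
padding_bucketed = count_padding(sorted(sequences, key=len), 64)
print(padding_random, padding_bucketed)
print(pad_batch([[5, 6, 7], [8]]))      # [[5, 6, 7], [8, 1, 1]]
```

Less padding means fewer wasted computations per batch, since the RNN still processes every pad token.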

## Model

We will be using a simple RNN model.

In [0]:
class RNN(nn.Module):
    def __init__(self, input_dim, emb_dim, hidden_dim,  output_dim):
        super().__init__()

        # embedding layer
        self.embedding = nn.Embedding(input_dim, emb_dim)
        
        # rnn layer
        self.rnn = nn.RNN(emb_dim, hidden_dim)

        # prediction layer
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, data):
        # data => [seq_len, batch_size]

        embedded = self.embedding(data)
        # embedded => [seq_len, batch_size, emb_dim]

        output, hidden = self.rnn(embedded)
        # output => [seq_len, batch_size, hid_dim]
        # hidden => [1, batch_size, hid_dim]

        logits = self.fc(hidden.squeeze(0))
        # logits => [batch_size, output_dim] = [batch_size, 1]

        return logits

In [0]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1

model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)


In [25]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 2,592,105 trainable parameters
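
The parameter count can be checked by hand. The sketch below adds up the weights and biases of each layer, assuming the layer shapes used above: an embedding table, a single-layer `nn.RNN` with two weight matrices and two bias vectors, and a linear layer with one bias.

```python
vocab_size, emb_dim, hidden_dim, output_dim = 25_002, 100, 256, 1

# Embedding: one emb_dim vector per vocabulary entry.
embedding_params = vocab_size * emb_dim

# RNN: input-to-hidden weights, hidden-to-hidden weights, and two bias vectors.
rnn_params = emb_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim

# Linear layer: weights plus one bias per output.
fc_params = hidden_dim * output_dim + output_dim

total = embedding_params + rnn_params + fc_params
print(f"{embedding_params:,} + {rnn_params:,} + {fc_params:,} = {total:,}")
# 2,500,200 + 91,648 + 257 = 2,592,105
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size dominates the model size.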


## Optimizer & Criterion

The optimizer is used to update the parameters of the model. Here, we'll use stochastic gradient descent (SGD). The first argument is the set of parameters that the optimizer will update; the second is the learning rate, i.e. how much we change the parameters by in each update.

Next, we'll define our loss function. In PyTorch this is commonly called a criterion.

The loss function here is binary cross entropy with logits.

In [0]:
optimizer = optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()

model = model.to(device)
criterion = criterion.to(device)
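
As a rough illustration of what these two pieces compute, here is a scalar sketch of binary cross entropy on a logit and of a single SGD update. The functions `bce_with_logits` and `sgd_step` are hypothetical helpers for this example, written with the usual numerically stable formula; they are not PyTorch's code, and treating the logit itself as the parameter being updated is a simplification.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits(logit, target):
    """Stable form of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(logit)."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def sgd_step(param, grad, lr=1e-3):
    """Plain SGD update: move against the gradient, scaled by the learning rate."""
    return param - lr * grad

# A confident correct prediction is cheap, a confident wrong one is expensive.
print(bce_with_logits(3.0, 1.0), bce_with_logits(3.0, 0.0))

# One SGD step on the logit itself; d(loss)/d(logit) = sigmoid(logit) - target.
logit, target = 0.0, 1.0
grad = sigmoid(logit) - target
new_logit = sgd_step(logit, grad)
print(bce_with_logits(logit, target), bce_with_logits(new_logit, target))
```

With a learning rate of `1e-3` each step changes the loss only slightly, which is worth keeping in mind when reading the training curves later.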

## Accuracy

The labels are binary, `0` and `1`. Applying a sigmoid to the `logits` maps them to the range `0` to `1`, and rounding then gives a prediction of `0` or `1`. Comparing these predictions with the ground-truth labels gives the accuracy.

In [0]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc
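
Here is the same computation on a tiny hand-made batch, written without PyTorch so each step is visible. The logits and labels are invented for illustration, and `binary_accuracy_py` is a hypothetical plain-Python counterpart of the function above.

```python
import math

def binary_accuracy_py(logits, labels):
    """Sigmoid, round to 0/1, then compare with the labels."""
    preds = [round(1.0 / (1.0 + math.exp(-z))) for z in logits]
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    return correct / len(labels)

logits = [2.0, -1.0, 0.3, -0.2]   # raw model outputs (made up)
labels = [1, 0, 0, 0]             # ground truth (made up)

acc = binary_accuracy_py(logits, labels)
print(acc)  # 0.75

# Rounding the sigmoid is the same as checking the sign of the logit.
threshold_preds = [int(z > 0) for z in logits]
print(threshold_preds)  # [1, 0, 1, 0]
```

The third example has a small positive logit, so it is predicted as `1` and counted as wrong.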


## Training

In training the following steps are performed:

- Put the model in training mode with `model.train()` (this has no effect here, since the model has no dropout or batch normalization layers, but it is good practice)
- Iterate over batches
- In each batch do the following:
    - PyTorch does not automatically remove (or "zero") the gradients calculated from the previous backward pass, so they must be zeroed manually with `optimizer.zero_grad()`
    - Do the forward pass 
    - Calculate loss using the `criterion`
    - Calculate the accuracy 
    - Perform the `backward` pass
    - Update the parameters of the model
    - Update the loss, accuracy of the epoch

- Return the loss and accuracy of the epoch


In [0]:
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0

    model.train()

    for batch in iterator:
        optimizer.zero_grad()
        predictions = model(batch.text).squeeze(1)
        loss = criterion(predictions, batch.label)
        acc = binary_accuracy(predictions, batch.label)

        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

## Evaluating

In evaluation the following steps are performed:

- Put the model in evaluation mode with `model.eval()` (this has no effect here but is good practice)
- Gradients are not needed during evaluation, so the loop runs inside `torch.no_grad()`
- Iterate over batches
- In each batch do the following:
    - Do the forward pass 
    - Calculate loss using the `criterion`
    - Calculate the accuracy 
    - Update the loss, accuracy of the epoch

- Return the loss and accuracy of the epoch

In [0]:
def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0

    model.eval()
    
    with torch.no_grad():
        for batch in iterator:
            predictions = model(batch.text).squeeze(1)
            loss = criterion(predictions, batch.label)
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
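
One subtlety: both `train` and `evaluate` average the per-batch values by dividing by `len(iterator)`. When the last batch is smaller than the others, this is slightly different from averaging over individual examples. The sketch below shows the size of the effect on invented per-batch accuracies; the two helper functions are hypothetical.

```python
def mean_of_batch_means(batch_accs):
    """What dividing the summed batch accuracies by len(iterator) computes."""
    return sum(batch_accs) / len(batch_accs)

def example_weighted_mean(batch_accs, batch_sizes):
    """Accuracy over individual examples, weighting each batch by its size."""
    total_correct = sum(a * n for a, n in zip(batch_accs, batch_sizes))
    return total_correct / sum(batch_sizes)

n_examples, batch_size = 7_500, 64
full_batches, remainder = divmod(n_examples, batch_size)
print(full_batches, remainder)  # 117 12

# Made-up scenario: every full batch is 50% correct, the small last batch is 100% correct.
batch_accs = [0.5] * full_batches + [1.0]
batch_sizes = [batch_size] * full_batches + [remainder]

simple = mean_of_batch_means(batch_accs)
weighted = example_weighted_mean(batch_accs, batch_sizes)
print(simple, weighted)
```

The gap is small for a dataset of this size, so the simpler average is a reasonable approximation here.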


In [0]:
# for calculating the time taken to run each epoch
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
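
A quick usage example of `epoch_time`, repeated here so the snippet runs on its own. The variant `epoch_time_divmod` is an alternative sketch using `divmod`, not part of the original notebook; it gives the same result for non-negative durations.

```python
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

def epoch_time_divmod(start_time, end_time):
    """Same result via divmod on the whole number of elapsed seconds."""
    return divmod(int(end_time - start_time), 60)

print(epoch_time(0.0, 78.6))         # (1, 18)
print(epoch_time_divmod(0.0, 78.6))  # (1, 18)
```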


## Running

`N_EPOCHS`: the number of epochs (passes over the complete training dataset) to train the model for

After each epoch, save the model if its validation loss is the lowest seen so far.

Finally, calculate the accuracy on `test_data` using the saved model with the best validation loss.

In [31]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

model.load_state_dict(torch.load('tut1-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')


Epoch: 01 | Epoch Time: 0m 18s
	Train Loss: 0.694 | Train Acc: 50.41%
	 Val. Loss: 0.697 |  Val. Acc: 50.21%
Epoch: 02 | Epoch Time: 0m 18s
	Train Loss: 0.693 | Train Acc: 50.04%
	 Val. Loss: 0.697 |  Val. Acc: 49.95%
Epoch: 03 | Epoch Time: 0m 18s
	Train Loss: 0.693 | Train Acc: 49.93%
	 Val. Loss: 0.697 |  Val. Acc: 50.72%
Epoch: 04 | Epoch Time: 0m 18s
	Train Loss: 0.693 | Train Acc: 49.84%
	 Val. Loss: 0.697 |  Val. Acc: 50.16%
Epoch: 05 | Epoch Time: 0m 18s
	Train Loss: 0.693 | Train Acc: 50.13%
	 Val. Loss: 0.697 |  Val. Acc: 50.59%
Test Loss: 0.710 | Test Acc: 47.48%
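
The training loss of about 0.693 is worth a closer look. A model that outputs a probability of 0.5 for every review has a binary cross entropy of $-\ln(0.5) = \ln 2$, so the numbers above suggest the model is doing little more than guessing. The short check below compares that chance-level loss with the training losses copied from the output above.

```python
import math

# Loss of a classifier that always predicts probability 0.5.
chance_loss = -math.log(0.5)

# Training losses copied from the epoch log above.
reported_train_losses = [0.694, 0.693, 0.693, 0.693, 0.693]
gaps = [abs(loss - chance_loss) for loss in reported_train_losses]

print(f"chance-level loss: {chance_loss:.3f}")
print(f"largest gap to reported losses: {max(gaps):.4f}")
```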


## Next Steps

As we can see, the `accuracy` of the model on `test_data` is less than `50%`, which is no better than random guessing on this balanced dataset. Since the model we used is pretty basic, there are a lot of tweaks we can try. In the later notebooks we will try the following:

- packed padded sequences
- pre-trained word embeddings
- different RNN architecture
- bidirectional RNN
- multi-layer RNN
- regularization
- a different optimizer