# Sentiment Classification using LSTM networks

This notebook is adapted from one of the excellent LSTM examples in [this GitHub repository](https://github.com/bentrevett/pytorch-sentiment-analysis).

#######################################################################

MIT License

Copyright (c) 2017 Ben Trevett

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

#######################################################################


In this notebook, we perform sentiment classification using LSTM networks. The input to the network is a piece of text (a sentence or a short paragraph with a variable number of words) and the output is a number indicating the sentiment of the input text. A score close to 0 indicates a strongly negative sentiment and a score close to 1 indicates a strongly positive sentiment. Values in between indicate a sentiment between the two extremes.

In [6]:
# Necessary imports 

import spacy
import random
import time

import torch
# `legacy` holds the Field-based API used below; it ships with torchtext 0.9 to 0.11
from torchtext import data, datasets, legacy
import torch.nn as nn
import torch.optim as optim

In [7]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
     ---------------------------------------- 13.9/13.9 MB 1.1 MB/s eta 0:00:00

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [8]:
# Setting a random seed for reproducibility
SEED = 1234
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

# Define the preprocessing for the data and labels

# The input data for the LSTM consists of raw strings.
# The parameters of a `Field` specify how the input data is to be processed
# The data is tokenized from a raw string into a list of tokens using the `spacy` tokenizer
# If no tokenizer is specified, the raw string is simply split on whitespace
# `include_lengths = True` makes each batch also return the unpadded length of every sequence
TEXT = legacy.data.Field(tokenize="spacy", tokenizer_language="en_core_web_sm", include_lengths = True)

# Target labels are processed as floats
LABEL = legacy.data.LabelField(dtype = torch.float)



In [9]:
# We will train the LSTM network using the IMDB dataset (~84 MB)
# This consists of movie reviews in text form and their labeled sentiments
# The dataset is split into a train set and a test set
# We provide TEXT and LABEL as input for preprocessing the data
# The Field-based `IMDB` dataset class lives in `legacy.datasets` (torchtext 0.9 to 0.11)
train_data, test_data = legacy.datasets.IMDB.splits(TEXT, LABEL, root='./data')

In [None]:
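# The train set is further split into a train set and a validation set
# The default split ratio is 0.7, i.e. 70% train and 30% validation
random.seed(SEED)  # seed Python's RNG so that the split is reproducible
train_data, valid_data = train_data.split(random_state = random.getstate())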

print('Size of train set: {}'.format(len(train_data)))
print('Size of validation set: {}'.format(len(valid_data)))
print('Size of test set: {}'.format(len(test_data)))
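
The idea behind a seeded 70/30 split can be sketched in plain Python. This is only an illustration and not torchtext's implementation; `split_examples` and the toy `examples` list are hypothetical names introduced here.

```python
import random

def split_examples(examples, split_ratio=0.7, seed=1234):
    """Shuffle a copy of `examples` with a fixed seed and cut it in two."""
    rng = random.Random(seed)        # private RNG, the global state is untouched
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(round(split_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

examples = list(range(10))           # stand-ins for the labelled reviews
train_part, valid_part = split_examples(examples)
print(len(train_part), len(valid_part))
```

Because the generator is seeded, calling `split_examples` again with the same seed returns the same two parts.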

In [None]:
# Let's view a train data sample
print(vars(train_data.examples[0]))

In [None]:
# Create a vocabulary, where each unique word in the corpus has a unique index
# Every word could then be represented as a one-hot vector using these indexes
# However, the number of unique words can be very large, which would make the one-hot vectors very long
# To remedy this, we define an upper limit on the vocabulary size and keep only the most frequent words
# All words outside this set are mapped to the index of the unknown token `<unk>`
MAX_VOCAB_SIZE = 25000

# Here, we use pre-trained `GloVe` vectors as the input word embeddings
# `glove.6B.100d` provides 100-dimensional vectors trained on 6B tokens
# Other options are glove.6B.50d, glove.6B.200d or glove.6B.300d
# GloVe is trained such that the vector embeddings of semantically similar words are also similar
# Vocabulary words without a pre-trained vector are initialized by `unk_init`, here from a normal distribution
# This downloads the necessary files (>800MB)
TEXT.build_vocab(train_data, 
                 max_size = MAX_VOCAB_SIZE, 
                 vectors = 'glove.6B.100d', 
                 unk_init = torch.Tensor.normal_
                )

LABEL.build_vocab(train_data)
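
A capped vocabulary with an unknown-token fallback can be sketched with the standard library alone. This is a simplified illustration rather than torchtext's code; `build_vocab`, `numericalize` and the toy corpus are hypothetical names, and the special tokens `<unk>` and `<pad>` are assumed to occupy the first two indices.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size, specials=("<unk>", "<pad>")):
    """Keep the `max_size` most frequent tokens, after the special tokens."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Sort by decreasing frequency, then alphabetically to break ties
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:max_size]
    itos = list(specials) + [tok for tok, _ in ranked]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

def numericalize(tokens, stoi, unk_token="<unk>"):
    """Map tokens to indices; out-of-vocabulary tokens share the <unk> index."""
    unk_idx = stoi[unk_token]
    return [stoi.get(tok, unk_idx) for tok in tokens]

corpus = [["great", "movie"], ["bad", "movie"],
          ["great", "acting", "great", "plot"]]
stoi, itos = build_vocab(corpus, max_size=2)
ids = numericalize(["great", "movie", "plot"], stoi)
print(itos, ids)
```

The vocabulary ends up with `max_size` words plus the special tokens, which is why `len(TEXT.vocab)` in the notebook can exceed `MAX_VOCAB_SIZE` by two.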

In [None]:
# Use a batch size for training
BATCH_SIZE = 64

# Use a GPU if possible
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Create the iterators for the train, valid and test sets
# `sort_within_batch = True` sorts each batch by decreasing length, as required for packing the sequences
train_iterator, valid_iterator, test_iterator = legacy.data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size = BATCH_SIZE,
    sort_within_batch = True,
    device = device)
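
What such an iterator does to one batch can be sketched as follows: the sequences are sorted by decreasing length and padded to the length of the longest one. This is a plain-Python illustration under assumptions; `make_batch` and the padding index `PAD = 1` are hypothetical, and the real iterator returns tensors of shape `[seq_len, batch_size]` (the transpose of the nested list built here).

```python
def make_batch(sequences, pad_idx):
    """Sort by decreasing length and pad every sequence to the longest one."""
    ordered = sorted(sequences, key=len, reverse=True)
    lengths = [len(seq) for seq in ordered]
    max_len = lengths[0]
    padded = [seq + [pad_idx] * (max_len - len(seq)) for seq in ordered]
    return padded, lengths

PAD = 1  # assumed padding index for this toy example
batch, lengths = make_batch([[5, 6], [7, 8, 9, 4], [3]], pad_idx=PAD)
print(batch)
print(lengths)
```

Grouping reviews of similar length into the same batch keeps the amount of padding small, which is the purpose of a bucketing iterator.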

In [None]:
# Define the RNN model

class RNN(nn.Module):
    def __init__(self, 
                 vocab_size, 
                 embedding_dim, 
                 hidden_dim, 
                 output_dim, 
                 n_layers, 
                 bidirectional, 
                 dropout, 
                 pad_idx):
        
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
        
        # Define the (multi-layer) LSTM
        self.rnn = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers=n_layers, 
                           bidirectional=bidirectional, 
                           dropout=dropout)
        
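        # The final fully connected layer that produces the output
        # Its input size is hidden_dim * 2 because the forward and backward hidden states
        # are concatenated (this assumes bidirectional = True)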
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text, text_lengths):
        
        # Convert the input word indices to their embeddings
        embedded = self.dropout(self.embedding(text))
             
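        # Pack the padded sequences so that the LSTM skips the padding tokens
        # The lengths must be on the CPU (required by PyTorch 1.7 and later)
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.to('cpu'))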
        
        packed_output, (hidden, cell) = self.rnn(packed_embedded)
        
        # Unpack sequence
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
       
        # Concatenate the final forward (hidden[-2,:,:]) and backward (hidden[-1,:,:])
        # hidden states of the last layer and apply dropout
        
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
                            
        return self.fc(hidden)
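
Why `hidden[-2]` and `hidden[-1]` are the right rows follows from the layout of `hidden`, which has shape `[num_layers * num_directions, batch_size, hidden_dim]` and is ordered layer by layer, with the forward direction before the backward one. The index arithmetic can be checked with a small sketch; `hidden_index` is a hypothetical helper, not part of PyTorch.

```python
def hidden_index(layer, direction, num_directions):
    """Row of `hidden` holding the final state of `layer` (0-based)
    for `direction` (0 = forward, 1 = backward)."""
    return layer * num_directions + direction

n_layers, num_directions = 2, 2            # the configuration used in this notebook
total_rows = n_layers * num_directions
last_forward = hidden_index(n_layers - 1, 0, num_directions)
last_backward = hidden_index(n_layers - 1, 1, num_directions)
print(last_forward - total_rows, last_backward - total_rows)
```

The two printed offsets are the negative indices used in `forward`.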

In [None]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

# Instantiate the network
model = RNN(INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM, 
            N_LAYERS, 
            BIDIRECTIONAL, 
            DROPOUT, 
            PAD_IDX)

In [None]:
# Replace the initial weights of the embedding layer with the pre-trained embeddings
pretrained_embeddings = TEXT.vocab.vectors
model.embedding.weight.data.copy_(pretrained_embeddings)

In [None]:
# Zero the embeddings of the unknown and padding tokens
# We do this by manually setting their rows in the embedding weights matrix to zeros
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

In [None]:
# Select the optimization algorithm (here we use Adam with the default hyperparameters)
optimizer = optim.Adam(model.parameters())

In [None]:
# Since this is a binary classification problem, we use the binary cross-entropy loss
# `BCEWithLogitsLoss` applies the sigmoid internally, so the model outputs raw scores (logits)
criterion = nn.BCEWithLogitsLoss()
model = model.to(device)
criterion = criterion.to(device)
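
For a single example with logit `x` and label `y`, the binary cross-entropy on logits can be written in a numerically stable form as `max(x, 0) - x*y + log(1 + exp(-|x|))`. The sketch below compares it with the textbook formula; it illustrates the per-example math only and is not PyTorch's implementation, and the function names are hypothetical. With its default settings, `nn.BCEWithLogitsLoss` averages this per-example loss over the batch.

```python
import math

def bce_with_logits(logit, target):
    """Stable binary cross-entropy for one logit and one label in {0, 1}."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def bce_naive(logit, target):
    """Textbook form: apply the sigmoid first, then take logs.
    Breaks down for large |logit| because the probability saturates."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

print(bce_with_logits(2.0, 1.0), bce_naive(2.0, 1.0))
print(bce_with_logits(1000.0, 0.0))   # stays finite where the naive form fails
```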

In [None]:
# Function for calculating the accuracy of a batch of data
def binary_accuracy(preds, y):

    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()
    return correct.sum() / len(correct)
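
The same computation can be followed by hand on a tiny batch. The sketch below mirrors `binary_accuracy` with plain Python lists; `binary_accuracy_py` and the numbers are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binary_accuracy_py(logits, labels):
    """Fraction of examples whose rounded sigmoid output equals the label."""
    preds = [round(sigmoid(z)) for z in logits]        # 0 or 1 per example
    correct = [float(p == y) for p, y in zip(preds, labels)]
    return sum(correct) / len(correct)

logits = [2.0, -1.0, 0.5, -3.0]   # raw model outputs (illustrative values)
labels = [1.0, 0.0, 0.0, 0.0]
acc = binary_accuracy_py(logits, labels)
print(acc)
```

Rounding the sigmoid output amounts to predicting the positive class whenever the logit is greater than zero.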

In [None]:
# The training loop
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()  # Makes sure that network is in train mode, so dropout can be used
    
    # Process one batch at a time
    for batch in iterator:
        
        optimizer.zero_grad()
        text, text_lengths = batch.text
        predictions = model(text, text_lengths).squeeze(1)
        loss = criterion(predictions, batch.label)
        acc = binary_accuracy(predictions, batch.label)
        
        # Backpropagate to compute gradients
        loss.backward()

        # Perform gradient descent update
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
# Evaluate the model's performance without updating its parameters
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()  # Set the network to eval mode so that dropout is not used
    
    with torch.no_grad():  # Do not compute gradients during evaluation
    
        for batch in iterator:

            text, text_lengths = batch.text
            predictions = model(text, text_lengths).squeeze(1)
            loss = criterion(predictions, batch.label)
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
# Compute the time taken by each epoch
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [None]:
N_EPOCHS = 5  # number of epochs

best_valid_loss = float('inf')  # Initial loss

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'lstm-model.pt')  # Save the best model
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')
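
The checkpointing rule in the loop above keeps the weights from the epoch with the lowest validation loss, which is not necessarily the last epoch. A minimal sketch of that selection, with a hypothetical helper `best_epoch` and made-up loss values rather than real results:

```python
def best_epoch(valid_losses):
    """Return (epoch, loss) for the lowest validation loss, 1-based epochs."""
    best_loss = float('inf')
    best = None
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best_loss:      # same test as in the training loop
            best_loss = loss
            best = epoch          # the notebook saves the model at this point
    return best, best_loss

made_up_losses = [0.61, 0.48, 0.35, 0.39, 0.42]   # illustrative values only
print(best_epoch(made_up_losses))
```

With these values the saved checkpoint would come from the third epoch, even though training continues for two more.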

In [None]:
# Load the saved model
model.load_state_dict(torch.load('lstm-model.pt'))

# Evaluate it on the test set
test_loss, test_acc = evaluate(model, test_iterator, criterion)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

In [None]:
nlp = spacy.load('en_core_web_sm')

# Function to predict sentiment using trained model for any input text
def predict_sentiment(model, sentence):
    model.eval()
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    length = [len(indexed)]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)
    length_tensor = torch.LongTensor(length)
    prediction = torch.sigmoid(model(tensor, length_tensor))
    return prediction.item()

In [None]:
# Try out your own sentences
# Since the model is trained on movie reviews, use sentences about movies
predict_sentiment(model, "Star wars is such an amazing movie!")

In [None]:
predict_sentiment(model, "Return of the crazy cowboy is the worst movie ever made!")