# Sentiment analysis using LSTM

## Topic

In this notebook I will be working with and LSTM network to predict the sentiment in a sentence or paragraph. This is a supervised learning task seen as my sentences and paragraphs will be labeled. My dataset is in txt format, and my goal is to preprocess the textual data to make it machine learning ready, then feed it to LSTM cells that will take into account every word in the text to finally produce an answer (whether the input is positive or negative). So let's get started!

## Objectives

- Process the textual data to make it machine learning ready
- Predict the sentiment using a trained LSTM

## Summary

- Importing libraries
- The Dataset
- Data pre-processing
- Removing outliers
- Padding features
- Train/Validation/Test splits
- Creating the data loaders
- Defining the model
- Training the model
- Testing the model
- Inference
- Conclusion

### Importing libraries

In [63]:
import numpy as np
import pandas as pd
from string import punctuation
from collections import Counter
import torch
from torch.utils.data import TensorDataset, DataLoader
import torch.nn as nn
import matplotlib.pyplot as plt

### The Dataset

In [64]:
with open('data/reviews.txt', 'r') as f:
    messages = f.read()
with open('data/labels.txt', 'r') as f:
    labels = f.read()

In [65]:
messages[:100]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life'

In [66]:
labels[:50]

'positive\nnegative\npositive\nnegative\npositive\nnegat'

So the dataset is two txt files one containing the sentences and paragraphs I need to classify and the second containing the labels corresponding to them.

### Data Pre-processing

In [67]:
messages = messages.lower()

In [68]:
text = "".join([x for x in messages if x not in punctuation])

In [69]:
messages_splitted = text.split("\n")
all_text = ' '.join(messages_splitted)
words = all_text.split()

In [70]:
words[:10]

['bromwell', 'high', 'is', 'a', 'cartoon', 'comedy', 'it', 'ran', 'at', 'the']

The first thing I did is to convert the text to lower characters then removed the punctuation. Then I splitted the text into individual paragraphs and then seperated each word on its own.

In [71]:
count = Counter(words)
vocab = sorted(count, key=count.get, reverse=True)
vocab_to_int = {word:ii for ii,word in enumerate(vocab,1)}

In [72]:
messages_int = []
for message in messages_splitted:
    messages_int.append([vocab_to_int[w] for w in message.split()])

In [73]:
messages_int[0]

[21025,
 308,
 6,
 3,
 1050,
 207,
 8,
 2138,
 32,
 1,
 171,
 57,
 15,
 49,
 81,
 5785,
 44,
 382,
 110,
 140,
 15,
 5194,
 60,
 154,
 9,
 1,
 4975,
 5852,
 475,
 71,
 5,
 260,
 12,
 21025,
 308,
 13,
 1978,
 6,
 74,
 2395,
 5,
 613,
 73,
 6,
 5194,
 1,
 24103,
 5,
 1983,
 10166,
 1,
 5786,
 1499,
 36,
 51,
 66,
 204,
 145,
 67,
 1199,
 5194,
 19869,
 1,
 37442,
 4,
 1,
 221,
 883,
 31,
 2988,
 71,
 4,
 1,
 5787,
 10,
 686,
 2,
 67,
 1499,
 54,
 10,
 216,
 1,
 383,
 9,
 62,
 3,
 1406,
 3686,
 783,
 5,
 3483,
 180,
 1,
 382,
 10,
 1212,
 13583,
 32,
 308,
 3,
 349,
 341,
 2913,
 10,
 143,
 127,
 5,
 7690,
 30,
 4,
 129,
 5194,
 1406,
 2326,
 5,
 21025,
 308,
 10,
 528,
 12,
 109,
 1448,
 4,
 60,
 543,
 102,
 12,
 21025,
 308,
 6,
 227,
 4146,
 48,
 3,
 2211,
 12,
 8,
 215,
 23]

In [74]:
print(len(vocab_to_int))

74072


In [75]:
labels_splitted = labels.split("\n")
labels_encoded = np.array([1 if label == "positive" else 0 for label in labels_splitted])

Next I created a dictionary to hold unique words with  unique numbers each corresponding to a word, and transformed my texual reviews into lists of integers where each integer refers to a word. Then I encoded the labels two 1 for positive and 0 for negative. 

### Removing outliers

In [76]:
messages_lens = Counter([len(x) for x in messages_int])
print("shortest review ", messages_lens[0])
print("Longest review", max(messages_lens))

shortest review  1
Longest review 2514


In [77]:
print("Number of reviews", len(messages_int))
non_zero = [ii for ii, r in enumerate(messages_int) if len(r) !=0]

Number of reviews 25001


In [78]:
messages_int = [messages_int[ii] for ii in non_zero]
labels_encoded = np.array([labels_encoded[ii] for ii in non_zero])
print("Number of reviews after removing zero len reviews", len(messages_int))

Number of reviews after removing zero len reviews 25000


The next thing I did is to remove the reviews that are 0 length which won't help the model in anything, and checked for the longest review which will help me decide on the sequence length later.

In [79]:
len(labels_encoded)

25000

### Padding features

In [80]:
def pad_features(messages_int, seq_length):
    features = np.zeros((len(messages_int), seq_length), dtype=int)
    for i, row in enumerate(messages_int):
        features[i, -len(row):] = np.array(row)[:seq_length]
    return features

In [81]:
seq_length = 200
features = pad_features(messages_int, seq_length)
print(len(features))

25000


In [82]:
print(features[:1])

[[    0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
  21025   308     6     3  1050   207     8  2138    32     1   171    57
     15    49    81  5785    44   382   110   140    15  5194    60   154
      9     1  4975  5852   475    71     5   260    12 21025   308    13
   1978     6    74  2395     5   613    73     6  5194     1 24103     5
   1983 10166     1  5786  1499    36    51    66   204   145    67  1199
   5194 19869     1 37442     4     1   221   883    31  2988    71     4
      1  5787    10   686     2    67  1499    54    10   216     1   383
      9    62     3  1406  3686   783     5  3483   180     1   382    10
   1212 13583    32   308     3   349 

In the above I chose a sequence length of 200, padded the reviews that are less than the sequence length long with 0s and trancated the reviews that are longer than the sequence length to be only 200 words long.

### Train/Validation/Test splits

In [83]:
split_frac = 0.8
split_idx = int(len(features)*split_frac)
train_x, remaining_x = features[:split_idx], features[split_idx:]
train_y, remaining_y = labels_encoded[:split_idx], labels_encoded[split_idx:]
test_idx = int(len(remaining_x)*0.5)
val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:]
val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]
print("Training ", train_x.shape)
print("Validation", val_x.shape)
print("Test ", test_x.shape)

Training  (20000, 200)
Validation (2500, 200)
Test  (2500, 200)


### Creating the data loaders

In [84]:
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
val_data = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))

batch_size = 30
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size, drop_last=True)
val_loader = DataLoader(val_data, shuffle=True, batch_size=batch_size, drop_last=True)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size, drop_last=True)

Next I splitted my data into training, validation and testing sets, made tensor datasets out of those sets and created dataloaders for them.

### Definening the model

In [85]:
class LSTM(nn.Module):
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        super(LSTM, self).__init__()
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers ,batch_first=True, dropout=drop_prob)
        self.dropout = nn.Dropout(drop_prob)
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()
        
    def forward(self, x, hidden):
        batch_size = x.size(0)
        em = self.embedding(x)
        output, hidden = self.lstm(em, hidden)
        output = output.reshape(-1, self.hidden_dim)
        out = self.dropout(output)
        out = self.fc(out)
        out_sig = self.sig(out)
        out_sig = out_sig.view(batch_size, -1)
        out_sig = out_sig[:, -1]
        return out_sig, hidden
    
    def init_hidden(self, batch_size):
        weight = next(self.parameters()).data
        hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                  weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        return hidden
        

In [86]:
vocab_size = len(vocab_to_int)+1
output_size = 1
embedding_dim = 200
hidden_dim = 256
n_layers = 2
net = LSTM(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)
print(net)

LSTM(
  (embedding): Embedding(74073, 200)
  (lstm): LSTM(200, 256, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.5, inplace=False)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sig): Sigmoid()
)


Here I defined my model, I chose an lstm with 2 layers, and added an embedding layer to map the words into my integers into continuous vectors, added a dropout layers to avoid overfitting and chose a Sigmoid activation because I have only two possible outputs.

### Training the model

In [87]:
lr = 0.001
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)

In [88]:
epochs = 4
count = 0
print_every = 100
clip = 5

net.train()
for epoch in range(epochs):
    h = net.init_hidden(batch_size)

    for inputs, labels in train_loader:
        count += 1
        h = tuple([each.data for each in h])
        net.zero_grad()
        output, h = net(inputs, h)
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()
        
        if count % print_every == 0:
            net.eval()
            val_h = net.init_hidden(batch_size)
            val_losses =[]
            for x,y in val_loader:
                val_h = tuple([each.data for each in val_h])
                out, val_h = net(x,h)
                val_loss = criterion(out.squeeze(), y.float())
                val_losses.append(val_loss.item())
            net.train()
            
            print("Epoch: {}/{}...".format(epoch+1, epochs),
                  "Step: {}...".format(count),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))
            
        
        

Epoch: 1/4... Step: 100... Loss: 0.635072... Val Loss: 0.672856
Epoch: 1/4... Step: 200... Loss: 0.700416... Val Loss: 0.668398
Epoch: 1/4... Step: 300... Loss: 0.637585... Val Loss: 0.660066
Epoch: 1/4... Step: 400... Loss: 0.713489... Val Loss: 0.608056
Epoch: 1/4... Step: 500... Loss: 0.810142... Val Loss: 0.606783
Epoch: 1/4... Step: 600... Loss: 0.716677... Val Loss: 0.565104
Epoch: 2/4... Step: 700... Loss: 0.694538... Val Loss: 0.535692
Epoch: 2/4... Step: 800... Loss: 0.513789... Val Loss: 0.608465
Epoch: 2/4... Step: 900... Loss: 0.337506... Val Loss: 0.501231
Epoch: 2/4... Step: 1000... Loss: 0.314367... Val Loss: 0.498360
Epoch: 2/4... Step: 1100... Loss: 0.503584... Val Loss: 0.486660
Epoch: 2/4... Step: 1200... Loss: 0.486876... Val Loss: 0.519690
Epoch: 2/4... Step: 1300... Loss: 0.287363... Val Loss: 0.433715
Epoch: 3/4... Step: 1400... Loss: 0.271825... Val Loss: 0.461903
Epoch: 3/4... Step: 1500... Loss: 0.424831... Val Loss: 0.459054
Epoch: 3/4... Step: 1600... Loss: 

I chose the train my model over 4 epochs, during each one I initiated the hidden state and cell state, passed the input through the model, calculated the loss and used backpropagation to update the weights. After each training loop I did a validation loop and printed the validation error to compare it with the training error.

### Testing the model

In [89]:
test_losses = []
n_correct = 0
net.eval()
h = net.init_hidden(batch_size)
for x, y in test_loader:
    h = tuple([each.data for each in h])
    out, h = net(x,h)
    test_loss = criterion(out.squeeze(), y.float())
    test_losses.append(test_loss.item())
    
    pred = torch.round(out.squeeze())
    correct_tensor = pred.eq(y.float().view_as(pred))
    correct = np.squeeze(correct_tensor.numpy())
    n_correct += np.sum(correct)
print("Test loss", np.mean(test_losses))
test_acc = n_correct/len(test_loader.dataset)
print("Test accuracy", test_acc)

Test loss 0.47859233851174277
Test accuracy 0.7964


In testing I followed the same steps as validation, plus I kept track of the number of correct answers my model got. Finally I printed the test loss and test accuracy which seems alright (almost 0.8).

### Inference

In [90]:
def tokenize(test_message):
    lower = test_message.lower()
    no_pun = " ".join([x for x in lower if x not in punctuation])
    test_words = no_pun.split()
    encoded = []
    encoded.append([vocab_to_int[w] for w in test_words])
    return encoded

In [91]:
def predict(net, test_message, seq_length=200):
    tokenized = tokenize(test_message)
    padded = pad_features(tokenized, seq_length)
    padded_t = torch.from_numpy(padded)
    h = net.init_hidden(1)
    h = tuple([each.data for each in h])
    net.eval()
    out, h = net(padded_t, h)
    out = torch.round(out.squeeze())
    if out == 1:
        print("positive")
    else:
        print("negative")

In the above I simply created two functions: one to preprocess the input in the same way I did before (lower case, remove punctuation and encode the words), and the second to pad the input, pass it through the model and produce an answer.

In [92]:
test_1 = "It was one of the worst movies I've ever seen, I want my money back"
predict(net, test_1)

negative


In [93]:
test_1 = "It was a great experience, absolutely delightful, one of the best movies I have ever seen"
predict(net, test_1)

positive


Finally I tested my model on unseen data (two sentences that I made up) and got accurate answers !

### Conclusion

In this notebook I had the opportunity to experiment with lstm for sentiment analysis. My input was raw textual data (movie reviews) that I preprocessed and fed to my model. The results were as great as can be expected, I got 0.8 almost in accuracy and an error as low as 0.4 and during inference I tested my model on new sentences and got accurate answers.