<a href="https://colab.research.google.com/github/nolll77/LSTM_multiclass_text_classification/blob/master/LSTM_multiclass_text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LSTM in Pytorch

![Predicting item ratings based on customer reviews](https://drive.google.com/uc?id=1fi3BRAmlN7ljSnZLz2TuZmz9qZstZPxi)


Human language is filled with ambiguity: the same phrase can often have multiple interpretations depending on the context, and can confuse even humans. Such challenges make natural language processing an interesting but hard problem to solve. However, we’ve seen a lot of advancement in NLP in the past couple of years, and it’s fascinating to explore the various techniques being used. This article covers one such deep learning technique using Pytorch: Long Short-Term Memory (LSTM) models.

If you’re new to NLP or need an in-depth read on preprocessing and word embeddings, you can check out the following [article](https://towardsdatascience.com/getting-started-with-natural-language-processing-nlp-2c482420cc05).

## Gentle Intro to RNNs and LSTMs

What sets language models apart from conventional neural networks is their dependency on context. Conventional feed-forward networks assume inputs to be independent of one another. For NLP, we need a mechanism to be able to use sequential information from previous inputs to determine the current output. Recurrent Neural Networks (RNNs) tackle this problem by having loops, allowing information to persist through the network.

![An unrolled Recurrent Neural Network](https://drive.google.com/uc?id=1vFo5VFVHYGM4PTrKNVH3--iXJ-bNOD7u)

However, conventional RNNs suffer from exploding and vanishing gradients, which makes them poor at processing long sequences: in practice they only have a short-term memory.

Long Short Term Memory networks (LSTM) are a special kind of RNN, which are capable of learning long-term dependencies. They do so by maintaining an internal memory state called the “cell state” and have regulators called “gates” to control the flow of information inside each LSTM unit. 

[Here](https://towardsdatascience.com/getting-started-with-natural-language-processing-nlp-2c482420cc05)’s an excellent source explaining the specifics of LSTMs.


![Structure of an LSTM cell. (source : Varsamopoulos, Savvas & Bertels, Koen & Almudever, Carmen. (2018). Designing neural network based decoders for surface codes.)
](https://drive.google.com/uc?id=1bXZ4F4OporY_NeanJ1m0sZMVMSBf8WJL)
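
To make the gates and the cell state concrete, here is a minimal sketch of a single LSTM time step written out by hand and compared with PyTorch's `nn.LSTMCell`. The sizes (4 input features, 3 hidden units) and the helper name `manual_lstm_step` are illustrative assumptions and are not part of the original notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.LSTMCell(input_size=4, hidden_size=3)

def manual_lstm_step(x, h, c, cell):
    """One LSTM step using the weights stored inside an nn.LSTMCell."""
    # PyTorch stacks the four gate blocks in the order:
    # input gate, forget gate, candidate (g), output gate.
    gates = x @ cell.weight_ih.T + cell.bias_ih + h @ cell.weight_hh.T + cell.bias_hh
    i, f, g, o = gates.chunk(4, dim=1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)
    c_next = f * c + i * g           # forget part of the old memory, write new content
    h_next = o * torch.tanh(c_next)  # expose a gated view of the cell state
    return h_next, c_next

x = torch.randn(2, 4)   # batch of 2 inputs
h0 = torch.zeros(2, 3)  # initial hidden state
c0 = torch.zeros(2, 3)  # initial cell state

h_manual, c_manual = manual_lstm_step(x, h0, c0, cell)
h_ref, c_ref = cell(x, (h0, c0))
print(torch.allclose(h_manual, h_ref, atol=1e-6))
```

The forget gate `f` scales the old cell state and the input gate `i` scales the new candidate, which is how the cell state can carry information across many time steps.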

In [3]:
#library imports
import torch
import torch.nn as nn
import pandas as pd
import numpy as np
import re
import spacy
from collections import Counter
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F
import string
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from sklearn.metrics import mean_squared_error

## Basic LSTM in Pytorch with random numbers

Before we jump into the main problem, let’s take a look at the basic structure of an LSTM in Pytorch, using a random input. This is a useful step to perform before getting into complex inputs because it helps us learn how to debug the model better, check if dimensions add up and ensure that our model is working as expected.

Even though we’re going to be dealing with text, since our model can only work with numbers, we convert the input into a sequence of numbers where each number represents a particular word (more on this in the next section).

In [4]:
#input
x = torch.tensor([[1,2, 12,34, 56,78, 90,80],
                 [12,45, 99,67, 6,23, 77,82],
                 [3,24, 6,99, 12,56, 21,22]])

#### using two different models

We first pass the input (3×8) through an [embedding](https://en.wikipedia.org/wiki/Word_embedding) layer, because word embeddings are better at capturing context and are more space-efficient than one-hot vector representations.

In Pytorch, we can use the nn.Embedding module to create this layer, which takes the vocabulary size and desired word-vector length as input. You can optionally provide a padding index, to indicate the index of the padding element in the embedding matrix.

In the following example, our vocabulary consists of 100 words, so the inputs to the embedding layer must be indices from 0 to 99. The layer holds a 100×7 embedding matrix, with index 0 representing our padding element.
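
As a quick sanity check of those statements, the sketch below inspects an embedding layer with the same sizes. The variable names are illustrative; the behaviour shown (a zero padding row and an `IndexError` for an out-of-range index on CPU) is standard `nn.Embedding` behaviour.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=100, embedding_dim=7, padding_idx=0)

# The layer stores one 7-dimensional vector per vocabulary index.
print(emb.weight.shape)

# The row at padding_idx is initialised to zeros and receives no gradient updates.
padding_row_is_zero = bool((emb.weight[0] == 0).all())
print(padding_row_is_zero)

# Valid indices are 0..99; each index is replaced by its vector.
ids = torch.tensor([[0, 5, 99]])
print(emb(ids).shape)

# Index 100 is outside the vocabulary and is rejected.
try:
    emb(torch.tensor([100]))
    rejected = False
except IndexError:
    rejected = True
print(rejected)
```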

In [5]:
model1 = nn.Embedding(100, 7, padding_idx=0)
model2 = nn.LSTM(input_size=7, hidden_size=3, num_layers=1, batch_first=True)

In [6]:
out1 = model1(x)
out2 = model2(out1)

In [7]:
print(out1.shape)
print(out1)

torch.Size([3, 8, 7])
tensor([[[-1.0969,  0.3833, -0.0982, -0.0817, -0.1877, -0.5058,  0.9958],
         [ 2.1139,  0.5521, -0.9124,  0.0139, -0.1538, -0.3513, -2.5576],
         [-1.2045, -1.2763,  0.2394, -2.0269, -0.2634, -1.7427, -0.3318],
         [-1.5396, -0.7897,  1.3016,  0.3635, -0.7474,  0.6954,  0.8867],
         [ 1.6468,  0.2826, -0.2182,  1.1096,  0.4718,  0.5952,  0.5073],
         [-0.1784,  2.5837, -0.0215,  1.0938, -0.7102, -1.2993,  0.5261],
         [ 1.3587, -1.9472, -0.3211,  0.2783, -1.2317,  0.6844, -1.1047],
         [ 0.3260,  1.5871,  0.8937,  0.1045,  1.1027, -0.0351, -0.8258]],

        [[-1.2045, -1.2763,  0.2394, -2.0269, -0.2634, -1.7427, -0.3318],
         [-0.1222,  0.1632, -0.5621,  0.7660,  0.9702, -0.5147, -0.0697],
         [ 0.0515,  0.1854, -1.1795,  1.9484, -0.2502, -0.5535, -0.2956],
         [ 0.4359,  2.7316,  0.6009, -0.6831,  0.5884, -0.1225, -1.1036],
         [ 0.6661,  0.0205, -0.6748, -0.0934, -1.6146,  2.3368, -0.4889],
         [ 0.3

We pass the embedding layer’s output into an LSTM layer (created using nn.LSTM), which takes as input the word-vector length, the length of the hidden state vector and the number of layers. Additionally, if the first dimension of our input is the batch size, we can specify batch_first=True.

The LSTM layer outputs three things:

*   The consolidated output — of all hidden states in the sequence
*   Hidden state of the last LSTM unit — the final output
*   Cell state


We can verify that after passing through all layers, our output has the expected dimensions:

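3×8 -> embedding -> 3×8×7 -> LSTM (with hidden size=3) -> 3×8×3 for the consolidated output, and 1×3×3 for the final hidden state (one layer, batch of 3, hidden size 3)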

In [8]:
out, (ht, ct) = model2(out1)
print(ht)

tensor([[[-0.0455, -0.0326, -0.1335],
         [-0.1509,  0.2473, -0.0080],
         [-0.2018,  0.2954,  0.0275]]], grad_fn=<StackBackward>)
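
The shapes of all three outputs can be checked directly. The sketch below rebuilds the same two layers on a random integer input (an assumption, used so the block runs on its own) and also confirms that, for a single-layer unidirectional LSTM, the last time step of the consolidated output equals the final hidden state.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randint(1, 100, (3, 8))  # 3 sequences of 8 word indices
embedding = nn.Embedding(100, 7, padding_idx=0)
lstm = nn.LSTM(input_size=7, hidden_size=3, num_layers=1, batch_first=True)

out, (ht, ct) = lstm(embedding(x))

print(out.shape)  # (batch, seq_len, hidden_size)
print(ht.shape)   # (num_layers, batch, hidden_size)
print(ct.shape)   # (num_layers, batch, hidden_size)

# The last step of `out` is the same tensor as the final hidden state.
same_last_step = torch.allclose(out[:, -1, :], ht[-1])
print(same_last_step)
```

This is why the models later in the notebook can feed `ht[-1]` straight into a linear layer.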


#### using nn.Sequential

In [9]:
model3 = nn.Sequential(nn.Embedding(100, 7, padding_idx=0),
                        nn.LSTM(input_size=7, hidden_size=3, num_layers=1, batch_first=True))

In [10]:
out, (ht, ct) = model3(x)
print(out)

tensor([[[-0.0903,  0.3432, -0.0756],
         [-0.0602,  0.6366, -0.0986],
         [-0.0996,  0.3600, -0.0648],
         [-0.0509,  0.2888, -0.0256],
         [-0.0213, -0.2174, -0.0161],
         [ 0.0167,  0.1775,  0.0022],
         [-0.2140,  0.4243, -0.0795],
         [-0.0702,  0.3409, -0.1100]],

        [[-0.0644,  0.0896, -0.0424],
         [-0.1845,  0.4240, -0.0694],
         [-0.3029,  0.6818, -0.0105],
         [-0.2619,  0.3583, -0.0103],
         [-0.1965,  0.3919, -0.0990],
         [-0.0777,  0.1169, -0.1686],
         [ 0.0638, -0.1781, -0.0332],
         [ 0.0261,  0.0061, -0.0348]],

        [[ 0.0533,  0.2313,  0.2094],
         [-0.0817,  0.1106, -0.0394],
         [-0.1622,  0.3708, -0.0891],
         [-0.2874,  0.6930, -0.0095],
         [-0.1387,  0.3535, -0.0616],
         [-0.0428, -0.2579, -0.0285],
         [ 0.0901, -0.4416,  0.0039],
         [ 0.0175,  0.1434,  0.0339]]], grad_fn=<TransposeBackward0>)


## Multiclass Text Classification - Predicting ratings from review comments

Let’s now look at an application of LSTMs.

Problem Statement: Given an item’s review comment, predict the rating (an integer from 1 to 5, 1 being worst and 5 being best).

We are going to predict item ratings from customer reviews, using this dataset from Kaggle:

[Here](https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews) or [here](https://towardsdatascience.com/getting-started-with-natural-language-processing-nlp-2c482420cc05)

In [22]:
# Auth from my Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [23]:
#loading the data
path = "/content/drive/My Drive/Dataset/Jovian/Womens Clothing E-Commerce Reviews.csv"
reviews = pd.read_csv(path)
print(reviews.shape)
reviews.head()

(23486, 11)


| | Unnamed: 0 | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 767 | 33 | | Absolutely wonderful - silky and sexy and comf... | 4 | 1 | 0 | Initmates | Intimate | Intimates |
| 1 | 1 | 1080 | 34 | | Love this dress! it's sooo pretty. i happene... | 5 | 1 | 4 | General | Dresses | Dresses |
| 2 | 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses |
| 3 | 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants |
| 4 | 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses |


In [24]:
reviews['Title'] = reviews['Title'].fillna('')
reviews['Review Text'] = reviews['Review Text'].fillna('')
reviews['review'] = reviews['Title'] + ' ' + reviews['Review Text']

In [25]:
#keeping only relevant columns and calculating sentence lengths
reviews = reviews[['review', 'Rating']].copy()  # copy so that adding columns below does not trigger SettingWithCopyWarning
reviews.columns = ['review', 'rating']
reviews['review_length'] = reviews['review'].apply(lambda x: len(x.split()))
reviews.head()

| | review | rating | review_length |
|---|---|---|---|
| 0 | Absolutely wonderful - silky and sexy and com... | 4 | 8 |
| 1 | Love this dress! it's sooo pretty. i happen... | 5 | 62 |
| 2 | Some major design flaws I had such high hopes ... | 3 | 102 |
| 3 | My favorite buy! I love, love, love this jumps... | 5 | 25 |
| 4 | Flattering shirt This shirt is very flattering... | 5 | 38 |


In [26]:
#changing ratings to 0-numbering
zero_numbering = {1:0, 2:1, 3:2, 4:3, 5:4}
reviews['rating'] = reviews['rating'].apply(lambda x: zero_numbering[x])

In [27]:
#mean sentence length
np.mean(reviews['review_length'])

60.832921740611425

#### Metric
We usually take accuracy as the metric for classification problems. Ratings, however, are ordered: if the actual value is 5 and the model predicts a 4, that is not as bad as predicting a 1. Hence, instead of going with accuracy, we choose RMSE (root mean squared error) as our North Star metric. Rating prediction is also a pretty hard problem, even for humans, so a prediction that is off by 1 point or less is considered pretty good.
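
A small made-up example shows the difference between the two metrics. Both prediction sets below get half of the ratings exactly right, so accuracy cannot tell them apart, but RMSE penalises the set whose mistakes are far from the truth. The arrays are invented for illustration only.

```python
import numpy as np

y_true = np.array([5, 5, 5, 5])
near_miss = np.array([4, 4, 5, 5])  # wrong by one star, twice
far_miss = np.array([1, 1, 5, 5])   # wrong by four stars, twice

def accuracy(y, pred):
    return float(np.mean(y == pred))

def rmse(y, pred):
    return float(np.sqrt(np.mean((y - pred) ** 2)))

print(accuracy(y_true, near_miss), accuracy(y_true, far_miss))  # identical accuracies
print(rmse(y_true, near_miss), rmse(y_true, far_miss))          # very different errors
```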

#### Preprocessing
As mentioned earlier, we need to convert our text into a numerical form that can be fed to our model as input. I’ve used spacy for tokenization after removing punctuation, special characters, and lower casing the text:

In [28]:
#tokenization
tok = spacy.load('en_core_web_sm')  # the 'en' shortcut was removed in spaCy 3; load the model by its full name
def tokenize (text):
    text = re.sub(r"[^\x00-\x7F]+", " ", text)
    regex = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]') # remove punctuation and numbers
    nopunct = regex.sub(" ", text.lower())
    return [token.text for token in tok.tokenizer(nopunct)]
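
To see what the two regular expressions do, here is a dependency-free sketch of the cleaning step. It reuses the same patterns but, as a simplifying assumption, splits on whitespace instead of calling the spaCy tokenizer, so the helper name `clean_and_split` and its exact output are illustrative rather than identical to `tokenize`.

```python
import re
import string

# Same character class as in the notebook: punctuation, digits and control whitespace.
PUNCT_DIGITS = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]')

def clean_and_split(text):
    text = re.sub(r"[^\x00-\x7F]+", " ", text)   # drop non-ASCII characters
    text = PUNCT_DIGITS.sub(" ", text.lower())   # lower-case, blank out punctuation and digits
    return text.split()                          # stand-in for the spaCy tokenizer

tokens = clean_and_split("Love it!! Fits great, 10/10.")
print(tokens)  # ['love', 'it', 'fits', 'great']
```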

We count the number of occurrences of each token in our corpus and get rid of the ones that occur only once:

In [29]:
#count number of occurrences of each word
counts = Counter()
for index, row in reviews.iterrows():
    counts.update(tokenize(row['review']))

Removing these rare words costs us about 6,000 words: the vocabulary drops from 14,138 to 8,263, as printed by the cell below. This is expected because our corpus is quite small, fewer than 25k reviews, so many words appear only once.

We then create a vocabulary to index mapping and encode our review text using this mapping. I’ve chosen the maximum length of any review to be 70 words because the average length of reviews was around 60.

In [30]:
#deleting infrequent words
print("num_words before:",len(counts.keys()))
for word in list(counts):
    if counts[word] < 2:
        del counts[word]
print("num_words after:",len(counts.keys()))

num_words before: 14138
num_words after: 8263


In [31]:
#creating vocabulary
vocab2index = {"":0, "UNK":1}
words = ["", "UNK"]
for word in counts:
    vocab2index[word] = len(words)
    words.append(word)

In [32]:
def encode_sentence(text, vocab2index, N=70):
    tokenized = tokenize(text)
    encoded = np.zeros(N, dtype=int)
    enc1 = np.array([vocab2index.get(word, vocab2index["UNK"]) for word in tokenized])
    length = min(N, len(enc1))
    encoded[:length] = enc1[:length]
    return encoded, length
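
The following toy example walks through the three cases the encoder has to handle: unknown words, padding and truncation. The vocabulary `toy_vocab`, the helper `encode_tokens` and the maximum length of 5 are assumptions made for illustration; the logic mirrors `encode_sentence` but takes an already tokenized list.

```python
import numpy as np

toy_vocab = {"": 0, "UNK": 1, "love": 2, "this": 3, "dress": 4}

def encode_tokens(tokens, vocab, N=5):
    encoded = np.zeros(N, dtype=int)  # index 0 is the padding token
    ids = np.array([vocab.get(tok, vocab["UNK"]) for tok in tokens], dtype=int)
    length = min(N, len(ids))         # true length, capped at N
    encoded[:length] = ids[:length]   # truncate anything beyond N
    return encoded, length

short_ids, short_len = encode_tokens(["love", "this", "jumpsuit"], toy_vocab)
long_ids, long_len = encode_tokens(["love"] * 7, toy_vocab)

print(short_ids, short_len)  # [2 3 1 0 0] 3  -> "jumpsuit" becomes UNK, then padding
print(long_ids, long_len)    # [2 2 2 2 2] 5  -> truncated to 5 tokens
```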

In [33]:
reviews['encoded'] = reviews['review'].apply(lambda x: np.array(encode_sentence(x, vocab2index), dtype=object))  # dtype=object: the (ids, length) pair is ragged
reviews.head()

| | review | rating | review_length | encoded |
|---|---|---|---|---|
| 0 | Absolutely wonderful - silky and sexy and com... | 3 | 8 | [[2, 3, 4, 5, 6, 7, 8, 7, 9, 0, 0, 0, 0, 0, 0,... |
| 1 | Love this dress! it's sooo pretty. i happen... | 4 | 62 | [[2, 10, 11, 12, 5, 13, 14, 15, 16, 5, 17, 18,... |
| 2 | Some major design flaws I had such high hopes ... | 2 | 102 | [[54, 55, 56, 57, 17, 58, 59, 60, 61, 62, 11, ... |
| 3 | My favorite buy! I love, love, love this jumps... | 4 | 25 | [[68, 109, 110, 2, 17, 10, 2, 10, 2, 10, 11, 1... |
| 4 | Flattering shirt This shirt is very flattering... | 4 | 38 | [[122, 123, 11, 123, 52, 92, 122, 19, 124, 125... |


In [34]:
#check how balanced the dataset is
Counter(reviews['rating'])

Counter({0: 842, 1: 1565, 2: 2871, 3: 5077, 4: 13131})
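
The classes are heavily skewed towards the top rating, so it is worth knowing what a trivial model would score. The sketch below uses the counts printed above to compute the accuracy and RMSE of always predicting the majority class. Note that it is computed on the full dataset rather than the validation split, so treat it as an approximate reference point.

```python
import numpy as np

class_counts = {0: 842, 1: 1565, 2: 2871, 3: 5077, 4: 13131}
total = sum(class_counts.values())

# Always predict the most frequent class.
majority = max(class_counts, key=class_counts.get)
baseline_acc = class_counts[majority] / total

# Squared error of that constant prediction, summed over every review.
sq_err = sum(n * (label - majority) ** 2 for label, n in class_counts.items())
baseline_rmse = float(np.sqrt(sq_err / total))

print(majority, round(baseline_acc, 3), round(baseline_rmse, 3))
```

This gives an accuracy of roughly 0.56 and an RMSE of roughly 1.37. The training logs later in the notebook start near these values, which suggests the models initially predict the majority class before learning anything more useful.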

In [35]:
X = list(reviews['encoded'])
y = list(reviews['rating'])
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)

#### Pytorch Dataset

The dataset is quite straightforward because we’ve already stored our encodings in the input dataframe. We also output the length of the input sequence in each case, because we can have LSTMs that take variable-length sequences.

In [36]:
class ReviewsDataset(Dataset):
    def __init__(self, X, Y):
        self.X = X
        self.y = Y
        
    def __len__(self):
        return len(self.y)
    
    def __getitem__(self, idx):
        return torch.from_numpy(self.X[idx][0].astype(np.int32)), self.y[idx], self.X[idx][1]

In [37]:
train_ds = ReviewsDataset(X_train, y_train)
valid_ds = ReviewsDataset(X_valid, y_valid)
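
Before training, it helps to confirm what one batch looks like. The sketch below uses a stand-in dataset with the same `__getitem__` contract (token ids, label, length) on three hand-written examples; `ToyReviewsDataset` and the toy data are assumptions so that the block runs on its own.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class ToyReviewsDataset(Dataset):
    def __init__(self, X, Y):
        self.X = X
        self.y = Y

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        ids, length = self.X[idx]
        return torch.from_numpy(ids.astype(np.int32)), self.y[idx], length

# Each item is (padded token ids, true length); labels are 0-based ratings.
toy_X = [(np.array([2, 3, 4, 0, 0]), 3),
         (np.array([5, 6, 0, 0, 0]), 2),
         (np.array([7, 8, 9, 10, 11]), 5)]
toy_y = [4, 0, 2]

toy_dl = DataLoader(ToyReviewsDataset(toy_X, toy_y), batch_size=2)
x, y, l = next(iter(toy_dl))

print(x.shape, x.dtype)  # padded ids, stacked into (batch, max_len)
print(y)                 # labels collated into a tensor
print(l)                 # true lengths, later passed to the model
```

The ids arrive as `int32`, which is why the training loop calls `x.long()` before the embedding lookup.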

#### Pytorch training loop
The training loop is pretty standard. I’ve used the Adam optimizer and cross-entropy loss.



In [38]:
def train_model(model, epochs=10, lr=0.001):
    parameters = filter(lambda p: p.requires_grad, model.parameters())
    optimizer = torch.optim.Adam(parameters, lr=lr)
    for i in range(epochs):
        model.train()
        sum_loss = 0.0
        total = 0
        for x, y, l in train_dl:
            x = x.long()
            y = y.long()
            y_pred = model(x, l)
            optimizer.zero_grad()
            loss = F.cross_entropy(y_pred, y)
            loss.backward()
            optimizer.step()
            sum_loss += loss.item()*y.shape[0]
            total += y.shape[0]
        val_loss, val_acc, val_rmse = validation_metrics(model, val_dl)
        if i % 5 == 1:
            print("train loss %.3f, val loss %.3f, val accuracy %.3f, and val rmse %.3f" % (sum_loss/total, val_loss, val_acc, val_rmse))

def validation_metrics (model, valid_dl):
    model.eval()
    correct = 0
    total = 0
    sum_loss = 0.0
    sum_rmse = 0.0
    for x, y, l in valid_dl:
        x = x.long()
        y = y.long()
        y_hat = model(x, l)
        loss = F.cross_entropy(y_hat, y)
        pred = torch.max(y_hat, 1)[1]
        correct += (pred == y).float().sum()
        total += y.shape[0]
        sum_loss += loss.item()*y.shape[0]
        sum_rmse += np.sqrt(mean_squared_error(pred, y.unsqueeze(-1)))*y.shape[0]
    return sum_loss/total, correct/total, sum_rmse/total
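
For reference, `F.cross_entropy` expects raw, unnormalised scores (logits) of shape (batch, classes) and integer class labels, which is why the models below end in a plain linear layer without a softmax. The following sketch, on made-up logits, checks that against a manual log-softmax computation and shows the argmax used for the predicted class.

```python
import torch
import torch.nn.functional as F

# Two examples, five rating classes; values are invented for illustration.
logits = torch.tensor([[2.0, 0.5, 0.1, 0.1, 0.1],
                       [0.1, 0.1, 0.2, 1.5, 3.0]])
targets = torch.tensor([0, 3])

loss = F.cross_entropy(logits, targets)

# Equivalent manual form: mean negative log-probability of the true class.
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(2), targets].mean()

# Predicted class = index of the largest logit, as in validation_metrics.
pred = torch.max(logits, 1)[1]

print(loss.item(), manual.item())
print(pred)  # tensor([0, 4])
```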

In [39]:
batch_size = 5000
vocab_size = len(words)
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
val_dl = DataLoader(valid_ds, batch_size=batch_size)

## LSTM Model
I’ve used 3 variations for the model:

### 1- LSTM with fixed length input

This has much the same structure as the basic LSTM we saw earlier, with the addition of a dropout layer to reduce overfitting. Since we have a classification problem, we add a final linear layer with 5 outputs. This implementation works best among the classification LSTMs, reaching an accuracy of about 63% and a root-mean-squared error of roughly 0.82 in the run logged below.

In [40]:
class LSTM_fixed_len(torch.nn.Module) :
    def __init__(self, vocab_size, embedding_dim, hidden_dim) :
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.linear = nn.Linear(hidden_dim, 5)
        self.dropout = nn.Dropout(0.2)
        
    def forward(self, x, l):
        x = self.embeddings(x)
        x = self.dropout(x)
        lstm_out, (ht, ct) = self.lstm(x)
        return self.linear(ht[-1])

In [41]:
model_fixed =  LSTM_fixed_len(vocab_size, 50, 50)

In [42]:
train_model(model_fixed, epochs=30, lr=0.01)

train loss 1.277, val loss 1.216, val accuracy 0.560, and val rmse 1.360
train loss 1.191, val loss 1.200, val accuracy 0.560, and val rmse 1.357
train loss 1.146, val loss 1.176, val accuracy 0.554, and val rmse 1.315
train loss 1.080, val loss 1.088, val accuracy 0.562, and val rmse 1.298
train loss 0.974, val loss 1.048, val accuracy 0.562, and val rmse 1.005
train loss 0.925, val loss 1.030, val accuracy 0.588, and val rmse 0.982


In [43]:
train_model(model_fixed, epochs=30, lr=0.01)

train loss 0.927, val loss 1.075, val accuracy 0.581, and val rmse 1.160
train loss 0.830, val loss 1.013, val accuracy 0.595, and val rmse 1.002
train loss 0.808, val loss 1.009, val accuracy 0.608, and val rmse 0.910
train loss 0.757, val loss 0.987, val accuracy 0.610, and val rmse 0.885
train loss 0.692, val loss 0.972, val accuracy 0.617, and val rmse 0.865
train loss 0.643, val loss 0.992, val accuracy 0.621, and val rmse 0.854


In [44]:
train_model(model_fixed, epochs=30, lr=0.01)

train loss 0.643, val loss 0.977, val accuracy 0.622, and val rmse 0.838
train loss 0.584, val loss 1.033, val accuracy 0.625, and val rmse 0.846
train loss 0.562, val loss 1.024, val accuracy 0.631, and val rmse 0.813
train loss 0.518, val loss 1.060, val accuracy 0.627, and val rmse 0.820
train loss 0.486, val loss 1.083, val accuracy 0.633, and val rmse 0.821
train loss 0.464, val loss 1.126, val accuracy 0.628, and val rmse 0.834


### 2- LSTM with variable length input

We can modify our model a bit to make it accept variable-length inputs. This ends up increasing the training time, though, because of the pack_padded_sequence call, which packs the padded batch into a PackedSequence so that the LSTM skips the padding positions of each sequence.

In [47]:
class LSTM_variable_input(torch.nn.Module) :
    def __init__(self, vocab_size, embedding_dim, hidden_dim) :
        super().__init__()
        self.hidden_dim = hidden_dim
        self.dropout = nn.Dropout(0.3)
        self.embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.linear = nn.Linear(hidden_dim, 5)
        
    def forward(self, x, s):
        x = self.embeddings(x)
        x = self.dropout(x)
        x_pack = pack_padded_sequence(x, s, batch_first=True, enforce_sorted=False)
        out_pack, (ht, ct) = self.lstm(x_pack)
        out = self.linear(ht[-1])
        return out
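
The effect of packing can be verified on a tiny example. In the sketch below (random inputs and small sizes, chosen purely for illustration), the second sequence has only 3 real steps out of 5. With packing, its final hidden state matches what the LSTM produces on the 3 real steps alone; without packing, the LSTM keeps updating its state on the padding.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=3, batch_first=True)

x = torch.randn(2, 5, 4)         # batch of 2, padded to 5 steps
lengths = torch.tensor([5, 3])   # true lengths (CPU tensor)
x[1, 3:] = 0.0                   # the last 2 steps of sequence 1 are padding

with torch.no_grad():
    packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
    _, (ht_packed, _) = lstm(packed)     # padding is skipped
    _, (ht_padded, _) = lstm(x)          # padding is processed like real input
    _, (ht_short, _) = lstm(x[1:2, :3])  # reference: only the 3 real steps

matches_reference = torch.allclose(ht_packed[-1][1], ht_short[-1][0], atol=1e-6)
padded_differs = not torch.allclose(ht_padded[-1][1], ht_short[-1][0], atol=1e-6)
print(matches_reference, padded_differs)
```

`enforce_sorted=False` lets the batch stay in its original order; PyTorch sorts by length internally and restores the order of the returned hidden states.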

In [48]:
model = LSTM_variable_input(vocab_size, 50, 50)

In [49]:
train_model(model, epochs=30, lr=0.1)

train loss 1.285, val loss 1.248, val accuracy 0.534, and val rmse 1.340
train loss 1.098, val loss 1.133, val accuracy 0.566, and val rmse 1.253
train loss 0.917, val loss 0.962, val accuracy 0.600, and val rmse 0.943
train loss 0.821, val loss 0.916, val accuracy 0.617, and val rmse 0.843
train loss 0.767, val loss 0.915, val accuracy 0.632, and val rmse 0.862
train loss 0.748, val loss 0.928, val accuracy 0.616, and val rmse 0.829


In [50]:
train_model(model, epochs=30, lr=0.05)

train loss 0.771, val loss 0.962, val accuracy 0.619, and val rmse 0.838
train loss 0.705, val loss 0.939, val accuracy 0.629, and val rmse 0.827
train loss 0.676, val loss 0.949, val accuracy 0.619, and val rmse 0.830
train loss 0.671, val loss 0.955, val accuracy 0.622, and val rmse 0.838
train loss 0.657, val loss 0.959, val accuracy 0.620, and val rmse 0.828
train loss 0.649, val loss 0.958, val accuracy 0.622, and val rmse 0.833


In [52]:
train_model(model, epochs=30, lr=0.05)

train loss 0.647, val loss 1.034, val accuracy 0.619, and val rmse 0.862
train loss 0.614, val loss 0.996, val accuracy 0.618, and val rmse 0.828
train loss 0.610, val loss 0.998, val accuracy 0.615, and val rmse 0.830
train loss 0.596, val loss 1.009, val accuracy 0.622, and val rmse 0.836
train loss 0.598, val loss 1.018, val accuracy 0.607, and val rmse 0.868
train loss 0.592, val loss 1.025, val accuracy 0.611, and val rmse 0.849


### 3- LSTM with pretrained GloVe word embeddings

Instead of training our own word embeddings, we can use pre-trained GloVe word vectors, which were trained on a massive corpus and probably capture context better. For our problem, however, this doesn’t seem to help much.

Download weights [here](https://nlp.stanford.edu/projects/glove/)

In [59]:
def load_glove_vectors(glove_file="/content/drive/My Drive/Dataset/Jovian/Glove6B/glove.6B.50d.txt"):
    """Load the glove word vectors"""
    word_vectors = {}
    with open(glove_file, encoding="utf-8") as f:  # GloVe files are UTF-8 text
        for line in f:
            split = line.split()
            word_vectors[split[0]] = np.array([float(x) for x in split[1:]])
    return word_vectors

In [60]:
def get_emb_matrix(pretrained, word_counts, emb_size = 50):
    """ Creates embedding matrix from word vectors"""
    vocab_size = len(word_counts) + 2
    vocab_to_idx = {}
    vocab = ["", "UNK"]
    W = np.zeros((vocab_size, emb_size), dtype="float32")
    W[0] = np.zeros(emb_size, dtype='float32') # adding a vector for padding
    W[1] = np.random.uniform(-0.25, 0.25, emb_size) # adding a vector for unknown words 
    vocab_to_idx[""] = 0 # keep the padding entry, as in the earlier vocab2index
    vocab_to_idx["UNK"] = 1
    i = 2
    for word in word_counts:
        # use the `pretrained` argument rather than the global word_vecs
        if word in pretrained:
            W[i] = pretrained[word]
        else:
            # words missing from GloVe get a small random vector
            W[i] = np.random.uniform(-0.25,0.25, emb_size)
        vocab_to_idx[word] = i
        vocab.append(word)
        i += 1   
    return W, np.array(vocab), vocab_to_idx

In [61]:
word_vecs = load_glove_vectors()
pretrained_weights, vocab, vocab2index = get_emb_matrix(word_vecs, counts)

In [62]:
class LSTM_glove_vecs(torch.nn.Module) :
    def __init__(self, vocab_size, embedding_dim, hidden_dim, glove_weights) :
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.embeddings.weight.data.copy_(torch.from_numpy(glove_weights))
        self.embeddings.weight.requires_grad = False ## freeze embeddings
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.linear = nn.Linear(hidden_dim, 5)
        self.dropout = nn.Dropout(0.2)
        
    def forward(self, x, l):
        x = self.embeddings(x)
        x = self.dropout(x)
        lstm_out, (ht, ct) = self.lstm(x)
        return self.linear(ht[-1])
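
As an aside, PyTorch also provides `nn.Embedding.from_pretrained`, which builds a frozen embedding layer from a weight matrix in one call. The sketch below is an alternative to the copy-and-freeze lines in the class above, not the notebook's own method, and it uses a small random matrix (`toy_weights`) in place of the real GloVe weights.

```python
import numpy as np
import torch
import torch.nn as nn

# Stand-in for the (vocab_size, 50) matrix returned by get_emb_matrix.
rng = np.random.default_rng(0)
toy_weights = rng.uniform(-0.25, 0.25, (6, 50)).astype("float32")
toy_weights[0] = 0.0  # padding row

emb = nn.Embedding.from_pretrained(torch.from_numpy(toy_weights),
                                   freeze=True, padding_idx=0)

# Frozen weights are excluded by the requires_grad filter used in train_model.
trainable = [p for p in emb.parameters() if p.requires_grad]

print(emb.weight.requires_grad)  # False
print(len(trainable))            # 0
print(emb(torch.tensor([[1, 2, 0]])).shape)
```

Setting `freeze=False` instead would fine-tune the pretrained vectors along with the rest of the model.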

In [63]:
model = LSTM_glove_vecs(vocab_size, 50, 50, pretrained_weights)

In [64]:
train_model(model, epochs=30, lr=0.1)

train loss 1.272, val loss 1.245, val accuracy 0.560, and val rmse 1.360
train loss 1.211, val loss 1.208, val accuracy 0.560, and val rmse 1.360
train loss 1.206, val loss 1.204, val accuracy 0.560, and val rmse 1.360
train loss 1.203, val loss 1.206, val accuracy 0.560, and val rmse 1.360
train loss 1.201, val loss 1.207, val accuracy 0.559, and val rmse 1.359
train loss 1.197, val loss 1.206, val accuracy 0.560, and val rmse 1.357


In [65]:
train_model(model, epochs=30, lr=0.05)

train loss 1.219, val loss 1.223, val accuracy 0.559, and val rmse 1.360
train loss 1.194, val loss 1.210, val accuracy 0.558, and val rmse 1.358
train loss 1.189, val loss 1.209, val accuracy 0.558, and val rmse 1.357
train loss 1.173, val loss 1.192, val accuracy 0.554, and val rmse 1.355
train loss 1.144, val loss 1.157, val accuracy 0.555, and val rmse 1.350
train loss 1.119, val loss 1.140, val accuracy 0.561, and val rmse 1.334


In [66]:
train_model(model, epochs=30, lr=0.05)

train loss 1.132, val loss 1.113, val accuracy 0.560, and val rmse 1.328
train loss 1.089, val loss 1.109, val accuracy 0.559, and val rmse 1.096
train loss 1.059, val loss 1.078, val accuracy 0.576, and val rmse 1.173
train loss 1.029, val loss 1.045, val accuracy 0.579, and val rmse 1.135
train loss 0.999, val loss 1.013, val accuracy 0.589, and val rmse 1.014
train loss 0.989, val loss 1.006, val accuracy 0.588, and val rmse 0.963


## Predicting ratings using regression instead of classification

Since ratings have an order, and a prediction of 3.6 might be better than rounding off to 4 in many cases, it is helpful to explore this as a regression problem. Not surprisingly, this approach gives us the lowest error, a validation RMSE of about 0.78 to 0.80 in the run logged below, because we don’t have just integer predictions anymore.

The only change to our model is that the final layer has just one output instead of 5. The training loop changes a bit too: we use MSE loss, and we no longer need to take the argmax to get the final prediction.

In [67]:
def train_model_regr(model, epochs=10, lr=0.001):
    parameters = filter(lambda p: p.requires_grad, model.parameters())
    optimizer = torch.optim.Adam(parameters, lr=lr)
    for i in range(epochs):
        model.train()
        sum_loss = 0.0
        total = 0
        for x, y, l in train_dl:
            x = x.long()
            y = y.float()
            y_pred = model(x, l)
            optimizer.zero_grad()
            loss = F.mse_loss(y_pred, y.unsqueeze(-1))
            loss.backward()
            optimizer.step()
            sum_loss += loss.item()*y.shape[0]
            total += y.shape[0]
        val_loss = validation_metrics_regr(model, val_dl)
        if i % 5 == 1:
            print("train mse %.3f val rmse %.3f" % (sum_loss/total, val_loss))

def validation_metrics_regr (model, valid_dl):
    model.eval()
    total = 0
    sum_loss = 0.0
    for x, y, l in valid_dl:
        x = x.long()
        y = y.float()
        y_hat = model(x, l)
        # per-batch RMSE, weighted by batch size below
        rmse = np.sqrt(F.mse_loss(y_hat, y.unsqueeze(-1)).item())
        total += y.shape[0]
        sum_loss += rmse*y.shape[0]
    return sum_loss/total
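
Two details of the regression setup are easy to get wrong, so here is a small sketch with made-up numbers. First, the model output has shape (batch, 1) while the targets have shape (batch,), which is why the loops call `y.unsqueeze(-1)`; without it the two tensors broadcast to (batch, batch) and the loss is silently wrong. Second, if a star rating is needed, the continuous output can be rounded and clamped back to the 1 to 5 scale (this mapping is an assumption, since the notebook only reports RMSE).

```python
import warnings
import torch
import torch.nn.functional as F

y_pred = torch.tensor([[0.0], [2.0], [4.0]])  # model output, shape (3, 1)
y = torch.tensor([0.0, 2.0, 4.0])             # targets, shape (3,)

aligned = F.mse_loss(y_pred, y.unsqueeze(-1))  # shapes match: perfect predictions, loss 0

with warnings.catch_warnings():
    warnings.simplefilter("ignore")            # PyTorch warns about the size mismatch
    broadcast = F.mse_loss(y_pred, y)          # (3, 1) vs (3,) broadcasts to (3, 3)

print(aligned.item(), broadcast.item())

# Map continuous 0-based predictions back to star ratings 1..5.
raw = torch.tensor([[3.6], [-0.2], [4.4]])
stars = torch.clamp(torch.round(raw), 0, 4).long().squeeze(-1) + 1
print(stars)  # tensor([5, 1, 5])
```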

In [68]:
class LSTM_regr(torch.nn.Module) :
    def __init__(self, vocab_size, embedding_dim, hidden_dim) :
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.linear = nn.Linear(hidden_dim, 1)
        self.dropout = nn.Dropout(0.2)
        
    def forward(self, x, l):
        x = self.embeddings(x)
        x = self.dropout(x)
        lstm_out, (ht, ct) = self.lstm(x)
        return self.linear(ht[-1])

In [69]:
model =  LSTM_regr(vocab_size, 50, 50)

In [70]:
train_model_regr(model, epochs=30, lr=0.05)

train mse 1.590 val rmse 1.196
train mse 1.215 val rmse 1.115
train mse 1.143 val rmse 1.123
train mse 1.103 val rmse 1.127
train mse 1.080 val rmse 1.129
train mse 1.056 val rmse 1.130


In [71]:
train_model_regr(model, epochs=30, lr=0.05)

train mse 1.481 val rmse 1.151
train mse 1.023 val rmse 1.077
train mse 0.622 val rmse 0.855
train mse 0.466 val rmse 0.795
train mse 0.400 val rmse 0.782
train mse 0.376 val rmse 0.789


## Conclusion:
LSTMs may look theoretically involved, but their Pytorch implementation is pretty straightforward. Also, when looking at any problem, it is very important to choose the right metric. In our case, judged by accuracy the model seems to be doing a very bad job, but the RMSE shows that it is off by less than 1 rating point, which is comparable to human performance!

References:

https://www.usfca.edu/data-institute/certificates/deep-learning-part-one

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

http://web.stanford.edu/class/cs224n/

https://www.jovian.ml/blog/multiclass-text-classification-using-lstm-in-pytorch