# PyTorch LSTM Explained for Text Classification

### 1. Let's first explore functionality of Embedding Layer and LSTM layer

In [57]:
import torch
from torch import nn
import torch.nn.functional as F
from torchtext.vocab import Vocab
from torch.utils.data import Dataset, DataLoader
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np
import spacy
import re
import string

In [2]:
#!spacy download en_core_web_sm

In [3]:
#input 
# [[vector of word indices in sentence1], 
#  [vector of word indices in sentence2], 
#  [vector of word indices in sentence3]]
X = torch.tensor([[1,2, 12,34, 56,78, 90,80],
                 [12,45, 99,67, 6,23, 77,82],
                 [3,24, 6,99, 12,56, 21,22]])

We first pass the input (3x8) through an embedding layer, 
because word embeddings are better at capturing context and 
are spatially more efficient than one-hot vector representations.

In Pytorch, we can use the nn.Embedding module to create this layer, which takes the vocabulary size and desired word-vector length as input. You can optionally provide a padding index, to indicate the index of the padding element in the embedding matrix.

In the following example, our vocabulary consists of 100 words, so our input to the embedding layer can only be from 0–100, and it returns us a 100x7 embedding matrix, with the 0th index representing our padding element.

In [4]:
emb_layer = nn.Embedding(100, 7, padding_idx=0)
emb_layer.weight.shape

torch.Size([100, 7])

In [5]:
# In nn.LSTM 
# input_size: The number of expected features in the input x_n
# hidden_size: The number of features in the hidden state h_n
# num_layers: Number of recurrent layers.
# batch_first: If True, then the input and output tensors are provided as (batch, seq, feature). Default: False
lstm_layer = nn.LSTM(input_size=7, hidden_size=3, num_layers=1, batch_first=True)

In [6]:
embedded = emb_layer(X)
embedded.shape

torch.Size([3, 8, 7])

In [7]:
output, (h_n, c_n) = lstm_layer(embedded) # Inputs: input, (h_0, c_0); input shape is (batch, seq, feature)
# Outputs: output, (h_n, c_n); 
# output shape is (batch, seq, feature) and it contains of all hidden states in the sequence; 
# h_n shape is (num_layers * num_directions, batch, hidden_size) and it contains the final output
# c_n shape is (num_layers * num_directions, batch, hidden_size) and it contains Cell state

In [8]:
output.shape

torch.Size([3, 8, 3])

In [9]:
h_n.shape

torch.Size([1, 3, 3])

In [10]:
c_n.shape

torch.Size([1, 3, 3])

In [11]:
h_n

tensor([[[-0.2647,  0.0256,  0.5710],
         [-0.2057, -0.0083,  0.0182],
         [ 0.1438, -0.0867,  0.2076]]], grad_fn=<StackBackward>)

### 2. Implement Multiclass Text Classification — Predicting ratings from review comments

Problem Statement: Given an item’s review comment, predict the rating ( takes integer values from 1 to 5, 1 being worst and 5 being best)

Dataset: I’ve used the following dataset from Kaggle:
https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews

In [12]:
#loading the data
df_reviews = pd.read_csv("Womens Clothing E-Commerce Reviews.csv")
print(df_reviews.shape)
df_reviews.head()

(23486, 11)


Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


In [13]:
df_reviews = df_reviews[df_reviews['Review Text'].notna()]

#### Metric

We usually take accuracy as our metric for most classification problems, however, ratings are ordered. If the actual value is 5 but the model predicts a 4, it is not considered as bad as predicting a 1. Hence, instead of going with accuracy, we choose RMSE — root mean squared error as our North Star metric. Also, rating prediction is a pretty hard problem, even for humans, so a prediction of being off by just 1 point or lesser is considered pretty good.

#### Preprocessing

As mentioned earlier, we need to convert our text into a numerical form that can be fed to our model as input. I’ve used spacy for tokenization after removing punctuation, special characters, and lower casing the text:

In [14]:
#tokenization
tok = spacy.load('en_core_web_sm')
def tokenize (text):
    text = re.sub(r"[^\x00-\x7F]+", " ", text)
    regex = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]') # remove punctuation and numbers
    nopunct = regex.sub(" ", text.lower())
    return [token.text for token in tok.tokenizer(nopunct)]

#### vocabulary to index mapping
We count the number of occurrences of each token in our corpus and get rid of the ones that don’t occur too frequently.

We then create a vocabulary to index mapping and encode our review text using this mapping. I’ve chosen the maximum length of any review to be 70 words because the average length of reviews was around 60. 

In [15]:
#count number of occurences of each word
counts = Counter()
for index, row in df_reviews.iterrows():
    counts.update(tokenize(row['Review Text']))

In [16]:
vocab = Vocab(counts, min_freq=2)

In [17]:
[vocab[token] for token in ['here', 'is', 'an', 'example']]

[545, 8, 67, 2343]

In [18]:
def encode_sentence(text, N=70):
    tokenized = tokenize(text)
    encoded = np.zeros(N, dtype=int)
    enc1 = np.array([vocab[word] for word in tokenized])
    length = min(N, len(enc1))
    encoded[:length] = enc1[:length]
    return encoded, length

In [19]:
encode_sentence('here is the an example')[0]

array([ 545,    8,    3,   67, 2343,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0])

In [20]:
X = df_reviews['Review Text'].apply(lambda x: np.array(encode_sentence(x))).values

  """Entry point for launching an IPython kernel.


In [21]:
X[0]

array([array([254, 523,  12, 889,   6, 648,   6,  73,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0]),
       8], dtype=object)

In [68]:
df_reviews['Rating'].unique()

array([4, 5, 3, 2, 1])

In [40]:
y = df_reviews['Rating'].apply(lambda x: int(x) - 1).values

In [41]:
y

array([3, 4, 2, ..., 2, 2, 4])

In [42]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)

In [43]:
X_train

array([array([array([ 491,    2,    9,   95,   10,  154,   70,    2,    4,   43,  367,
        124,    6,    9,   85,    8,  136,  121,   14,   37,    2,   28,
        121,    2,    6,    5,   24,   19,  277,    5,   24,  289,    7,
       2639,  128,    2,    3,  251, 1025,    8,   12,   24,   12,   24,
        140,  671,   85,    2,   87,    8,  917,   71,    3,   50,    8,
         78,    2,    3,  258,    8,   75,   11,  168,    2,  107,    3,
        441, 1216,   32,   40]),
       70], dtype=object),
       array([array([   4,   57,  349,    9,   21,  144,    6,    4,  170,    5,   11,
        168,    2,    3,  235,    8,  246,    2,    3,  101,    8,    0,
          2,  115,    8,  316,    2,   95,    7,   28,  388,  192,    2,
        161,    7,  578,  204,    2,  139,   84,    2,    4,   35,  113,
          7,   12,   11,  240,    6,   67,   12,   11,  567,    2,    4,
         35,  302,   24,   39,  323,    6,  265,   12,   18, 3303,   41,
        127,    2,   27,   12]),
   

#### Pytorch Dataset and DataLoader

The dataset is quite straightforward because we’ve already stored our encodings in the input dataframe. We also output the length of the input sequence in each case, because we can have LSTMs that take variable-length sequences.

DataLoader uses dataset to interate through batches. 

In [44]:
class ReviewsDataset(Dataset):
    def __init__(self, X, Y):
        self.X = X
        self.y = Y
        
    def __len__(self):
        return len(self.y)
    
    def __getitem__(self, idx):
        return torch.from_numpy(self.X[idx][0].astype(np.int32)), self.y[idx], self.X[idx][1]

In [45]:
train_ds = ReviewsDataset(X_train, y_train)
valid_ds = ReviewsDataset(X_valid, y_valid)

In [59]:
train_dl = DataLoader(train_ds, batch_size=5000, shuffle=True, num_workers=0)
valid_dl = DataLoader(valid_ds, batch_size=5000, shuffle=True, num_workers=0)

### Pytorch training loop

The training loop is pretty standard. I’ve used Adam optimizer and cross-entropy loss.

In [60]:
def train_model(model, epochs=10, lr=0.001):
    parameters = filter(lambda p: p.requires_grad, model.parameters())
    optimizer = torch.optim.Adam(parameters, lr=lr)
    model.train()
    for epoch in range(epochs):
        sum_loss = 0.0
        total = 0
        for i, (x, y, l) in enumerate(train_dl):
            x = x.long()
            y = y.long()
            y_pred = model(x, l)
            optimizer.zero_grad()
            loss = F.cross_entropy(y_pred, y)
            loss.backward()
            optimizer.step()
            sum_loss += loss.item()*y.shape[0]
            total += y.shape[0]
            val_loss, val_acc, val_rmse = validation_metrics(model, valid_dl)
            if i % 5 == 1:
                print("train loss %.3f, val loss %.3f, val accuracy %.3f, and val rmse %.3f" % (sum_loss/total, val_loss, val_acc, val_rmse))
                
def validation_metrics (model, valid_dl):
    model.eval()
    correct = 0
    total = 0
    sum_loss = 0.0
    sum_rmse = 0.0
    for x, y, l in valid_dl:
        X = x.long()
        y = y.long()
        y_hat = model(x, l)
        loss = F.cross_entropy(y_hat, y)
        pred = torch.max(y_hat, 1)[1]
        correct += (pred == y).float().sum()
        total += y.shape[0]
        sum_loss += loss.item()*y.shape[0]
        sum_rmse += np.sqrt(mean_squared_error(pred, y.unsqueeze(-1)))*y.shape[0]
    return sum_loss/total, correct/total, sum_rmse/total

In [61]:
class LSTM_fixed_len(torch.nn.Module) :
    def __init__(self, vocab_size, embedding_dim, hidden_dim) :
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.linear = nn.Linear(hidden_dim, 5)
        self.dropout = nn.Dropout(0.2)
        
    def forward(self, x, l):
        x = self.embeddings(x)
        x = self.dropout(x)
        lstm_out, (ht, ct) = self.lstm(x)
        return self.linear(ht[-1])

In [62]:
class LSTM_fixed_len(nn.Module) :
    def __init__(self, vocab_size, embedding_dim, hidden_dim) :
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.linear = nn.Linear(hidden_dim, 5)
        self.dropout = nn.Dropout(0.2)
        
    def forward(self, x, l):
        x = self.embeddings(x)
        x = self.dropout(x)
        lstm_out, (hn, cn) = self.lstm(x)
        return self.linear(hn[-1])

In [63]:
vocab_size = len(vocab)
model_fixed =  LSTM_fixed_len(vocab_size, 50, 50)

In [64]:
train_model(model_fixed, epochs=30, lr=0.01)

train loss 1.592, val loss 1.467, val accuracy 0.537, and val rmse 1.370
train loss 1.271, val loss 1.264, val accuracy 0.552, and val rmse 1.381
train loss 1.231, val loss 1.244, val accuracy 0.552, and val rmse 1.381
train loss 1.232, val loss 1.222, val accuracy 0.552, and val rmse 1.381
train loss 1.201, val loss 1.229, val accuracy 0.552, and val rmse 1.381
train loss 1.209, val loss 1.219, val accuracy 0.552, and val rmse 1.381
train loss 1.205, val loss 1.222, val accuracy 0.552, and val rmse 1.379
train loss 1.189, val loss 1.217, val accuracy 0.552, and val rmse 1.378
train loss 1.185, val loss 1.218, val accuracy 0.551, and val rmse 1.377
train loss 1.174, val loss 1.217, val accuracy 0.548, and val rmse 1.370
train loss 1.164, val loss 1.218, val accuracy 0.542, and val rmse 1.355
train loss 1.144, val loss 1.209, val accuracy 0.546, and val rmse 1.333
train loss 1.083, val loss 1.167, val accuracy 0.559, and val rmse 1.202
train loss 1.063, val loss 1.133, val accuracy 0.56

In [65]:
train_model(model_fixed, epochs=30, lr=0.01)

train loss 1.000, val loss 1.141, val accuracy 0.575, and val rmse 1.079
train loss 0.898, val loss 1.106, val accuracy 0.558, and val rmse 1.030
train loss 0.887, val loss 1.131, val accuracy 0.563, and val rmse 1.107
train loss 0.831, val loss 1.100, val accuracy 0.571, and val rmse 1.029
train loss 0.811, val loss 1.107, val accuracy 0.567, and val rmse 1.021
train loss 0.782, val loss 1.114, val accuracy 0.579, and val rmse 1.002
train loss 0.778, val loss 1.108, val accuracy 0.571, and val rmse 0.982
train loss 0.762, val loss 1.134, val accuracy 0.576, and val rmse 1.010
train loss 0.761, val loss 1.142, val accuracy 0.544, and val rmse 0.965
train loss 0.744, val loss 1.121, val accuracy 0.588, and val rmse 0.969
train loss 0.746, val loss 1.106, val accuracy 0.582, and val rmse 0.967
train loss 0.753, val loss 1.105, val accuracy 0.580, and val rmse 0.966
train loss 0.713, val loss 1.145, val accuracy 0.584, and val rmse 1.022
train loss 0.702, val loss 1.124, val accuracy 0.57

In [66]:
train_model(model_fixed, epochs=30, lr=0.01)

train loss 0.827, val loss 1.262, val accuracy 0.582, and val rmse 1.033
train loss 0.660, val loss 1.220, val accuracy 0.571, and val rmse 1.002
train loss 0.618, val loss 1.239, val accuracy 0.581, and val rmse 1.002
train loss 0.562, val loss 1.261, val accuracy 0.564, and val rmse 0.956
train loss 0.546, val loss 1.263, val accuracy 0.564, and val rmse 0.959
train loss 0.541, val loss 1.279, val accuracy 0.577, and val rmse 0.969
train loss 0.521, val loss 1.295, val accuracy 0.565, and val rmse 0.950
train loss 0.494, val loss 1.311, val accuracy 0.569, and val rmse 0.954
train loss 0.483, val loss 1.328, val accuracy 0.576, and val rmse 0.948
train loss 0.478, val loss 1.341, val accuracy 0.574, and val rmse 0.964
train loss 0.464, val loss 1.352, val accuracy 0.580, and val rmse 0.960
train loss 0.465, val loss 1.384, val accuracy 0.561, and val rmse 0.972
train loss 0.467, val loss 1.370, val accuracy 0.578, and val rmse 0.979
train loss 0.454, val loss 1.417, val accuracy 0.55

Reference:
https://towardsdatascience.com/multiclass-text-classification-using-lstm-in-pytorch-eac56baed8df