[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/khetansarvesh/NLP/blob/main/unitask_downstream_nlp/Sentence-Level-Classification/Joint_Training_Movie_Review_Classification.ipynb)

In [None]:
import numpy as np
import pandas as pd

## **Reading Data**

In [None]:
# download the dataset (?raw=true fetches the CSV itself rather than the GitHub HTML page)
!wget -O SST_Dataset.csv "https://github.com/khetansarvesh/NLP/blob/main/Sentence-Level-Classification/SST_Dataset.csv?raw=true"

In [None]:
# reading the dataset
df = pd.read_csv("SST_Dataset.csv", encoding = "ISO-8859-1")
df.dropna(inplace=True)
df

Unnamed: 0,review,label
0,bromwell high is a cartoon comedy . it ran at ...,1
1,story of a man who has unnatural feelings for ...,0
2,homelessness or houselessness as george carli...,1
3,airport starts as a brand new luxury pla...,0
4,brilliant over acting by lesley ann warren . ...,1
...,...,...
24995,i saw descent last night at the stockholm fi...,0
24996,a christmas together actually came before my t...,1
24997,some films that you pick up for a pound turn o...,0
24998,working class romantic drama from director ma...,1


The dataset we use in this example is [SST2](https://nlp.stanford.edu/sentiment/index.html), which contains text from movie reviews, each labeled as either positive (value 1) or negative (value 0).

In [None]:
df["label"].value_counts()/df.shape[0] # share of each label: the dataset is perfectly balanced

1    0.5
0    0.5
Name: label, dtype: float64

## **Data Preprocessing**

### **Cleaning Text Features**


Cleaning steps such as removing stop words and punctuation, and performing stemming.

In [None]:
import re
import string

from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS # or use: from nltk.corpus import stopwords

stopwords = ENGLISH_STOP_WORDS
ps = PorterStemmer()

def clean(doc): # doc is a string of text
    doc = re.sub(r"</?br\s*/?>", " ", doc) # replace HTML line-break tags (<br>, <br/>, <br />, </br>) with a space
    doc = "".join([char for char in doc if char not in string.punctuation and not char.isdigit()]) # remove punctuation and digits
    doc = doc.lower() # lowercase all characters
    doc = " ".join([ps.stem(token) for token in doc.split() if token not in stopwords]) # remove stop words, then stem
    return doc



In [None]:
df["review"] = df["review"].apply(clean) # put the cleaned text back into the dataframe

df

Unnamed: 0,review,label
0,bromwel high cartoon comedi ran time program s...,1
1,stori man unnatur feel pig start open scene te...,0
2,homeless houseless georg carlin state issu yea...,1
3,airport start brand new luxuri plane load valu...,0
4,brilliant act lesley ann warren best dramat ho...,1
...,...,...
24995,saw descent night stockholm film festiv huge d...,0
24996,christma actual came time ve rais john denver ...,1
24997,film pick pound turn good rd centuri film rele...,0
24998,work class romant drama director martin ritt u...,1


### **One Hot Encoding each review**

In [None]:
all_text = ' '.join([sent for sent in df['review']])
#all_text

In [None]:
words = all_text.split()
words

['bromwel',
 'high',
 'cartoon',
 'comedi',
 'ran',
 'time',
 'program',
 'school',
 'life',
 'teacher',
 'year',
 'teach',
 'profess',
 'lead',
 'believ',
 'bromwel',
 'high',
 's',
 'satir',
 'closer',
 'realiti',
 'teacher',
 'scrambl',
 'surviv',
 'financi',
 'insight',
 'student',
 'right',
 'pathet',
 'teacher',
 'pomp',
 'petti',
 'situat',
 'remind',
 'school',
 'knew',
 'student',
 'saw',
 'episod',
 'student',
 'repeatedli',
 'tri',
 'burn',
 'school',
 'immedi',
 'recal',
 'high',
 'classic',
 'line',
 'inspector',
 'm',
 'sack',
 'teacher',
 'student',
 'welcom',
 'bromwel',
 'high',
 'expect',
 'adult',
 'age',
 'think',
 'bromwel',
 'high',
 'far',
 'fetch',
 'piti',
 'isn',
 't',
 'stori',
 'man',
 'unnatur',
 'feel',
 'pig',
 'start',
 'open',
 'scene',
 'terrif',
 'exampl',
 'absurd',
 'comedi',
 'formal',
 'orchestra',
 'audienc',
 'turn',
 'insan',
 'violent',
 'mob',
 'crazi',
 'chant',
 's',
 'singer',
 'unfortun',
 'stay',
 'absurd',
 'time',
 'gener',
 'narr',
 ...]

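One Hot Encoding (OHE) words: each word is mapped to an integer index, which identifies the position of the 1 in its one-hot vector.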

In [None]:
from collections import Counter

## Build a dictionary that maps words to integers
counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: ii for ii, word in enumerate(vocab,1)}
vocab_to_int

{'br': 1,
 's': 2,
 'movi': 3,
 'film': 4,
 't': 5,
 'like': 6,
 'just': 7,
 'time': 8,
 'good': 9,
 'make': 10,
 'charact': 11,
 'watch': 12,
 'stori': 13,
 'realli': 14,
 'scene': 15,
 'look': 16,
 'end': 17,
 'peopl': 18,
 'bad': 19,
 'great': 20,
 'love': 21,
 'think': 22,
 'way': 23,
 'don': 24,
 'act': 25,
 'play': 26,
 'thing': 27,
 'know': 28,
 'say': 29,
 'work': 30,
 'plot': 31,
 'year': 32,
 'actor': 33,
 'come': 34,
 'seen': 35,
 'want': 36,
 'life': 37,
 'littl': 38,
 'best': 39,
 'tri': 40,
 'did': 41,
 'man': 42,
 'doe': 43,
 'better': 44,
 'perform': 45,
 'feel': 46,
 've': 47,
 'use': 48,
 'director': 49,
 'actual': 50,
 'm': 51,
 'get': 52,
 'lot': 53,
 'real': 54,
 'old': 55,
 'cast': 56,
 'doesn': 57,
 'live': 58,
 'star': 59,
 'enjoy': 60,
 'guy': 61,
 'didn': 62,
 'new': 63,
 'role': 64,
 'funni': 65,
 'music': 66,
 'point': 67,
 'start': 68,
 'go': 69,
 'set': 70,
 'girl': 71,
 'origin': 72,
 'day': 73,
 'world': 74,
 'believ': 75,
 'turn': 76,
 'interest': 77,
 ...}

In [None]:
len(vocab_to_int)

50352

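`'cast': 56` means that *cast* can be viewed as a one-hot (OHE) vector with a 1 at index 56 and 0s everywhere else. Because the indices run from 1 to 50352 and index 0 is reserved for padding, that vector has 50353 entries. In practice we store only the integer index, not the full vector.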
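
To make the link between the integer index and the one-hot vector concrete, here is a minimal sketch on a toy corpus. The toy words and the `one_hot` helper are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

# Toy corpus standing in for the cleaned reviews (hypothetical data).
toy_words = "good movi good act bad plot good".split()

# Same recipe as the notebook: the most frequent word gets index 1,
# and index 0 stays free for padding.
toy_counts = Counter(toy_words)
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)
toy_vocab_to_int = {word: i for i, word in enumerate(toy_vocab, 1)}

def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

# Indices run from 1 to len(vocab), so the vector needs len(vocab) + 1 slots.
vector_size = len(toy_vocab_to_int) + 1
good_vec = one_hot(toy_vocab_to_int["good"], vector_size)
print(toy_vocab_to_int["good"], good_vec)
```

The integer alone carries the same information as the vector, which is why only the integers are kept in `reviews_ints`.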


In [None]:
## use the dictionary above to convert each review into a list of integers, collected in reviews_ints
reviews_ints = []
for review in df['review']:
  reviews_ints.append([vocab_to_int[word] for word in review.split()])

df['review'] = reviews_ints
df

Unnamed: 0,review,label
0,"[14889, 195, 662, 86, 1829, 8, 1244, 268, 37, ...",1
1,"[13, 42, 5222, 46, 2649, 68, 206, 15, 1061, 30...",0
2,"[2621, 30969, 535, 12308, 453, 693, 32, 590, 1...",1
3,"[3553, 68, 2608, 63, 4681, 1242, 1456, 3515, 8...",0
4,"[401, 25, 11812, 907, 3074, 39, 626, 11354, 42...",1
...,...,...
24995,"[109, 3693, 173, 14695, 4, 1035, 493, 241, 241...",0
24996,"[796, 50, 283, 8, 47, 962, 186, 6719, 236, 181...",1
24997,"[4, 444, 2496, 76, 9, 3061, 846, 4, 233, 1510,...",0
24998,"[30, 488, 574, 310, 49, 1251, 26109, 874, 34, ...",1


### **Padding / Truncating each review**

As an additional pre-processing step, we want to make sure that our reviews are in good shape for standard processing. That is, our network will expect a standard input text size, and so, we'll want to shape our reviews into a specific length. We'll approach this task in two main steps:

1. Getting rid of extremely long or short reviews; the outliers
2. Padding/truncating the remaining data so that we have reviews of the same length.


#### 1. Getting rid of extremely long or short reviews

In [None]:
# Before we pad our review text, we should check for reviews of extremely short or long lengths; outliers that may mess with our training
# outlier review stats
review_lens = Counter([len(x) for x in reviews_ints])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))

Zero-length reviews: 0
Maximum review length: 1358


A couple of observations here. No review has zero length, which is good. The maximum review length, however, is far too many time steps for our RNN, so we'll truncate very long reviews (and would remove any zero-length ones if they existed). Handling these outliers should allow our model to train more efficiently.

In [None]:
# If any review had zero length, we would have to remove it together with its label first,
# but that is not required here because this dataset has no zero-length reviews
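
If zero-length reviews did show up, the removal could look like the sketch below. The toy lists are hypothetical; the point is that reviews and labels are filtered with the same positions so the pairs stay aligned.

```python
# Hypothetical toy data: the second review became empty after cleaning.
toy_reviews_ints = [[4, 9, 3], [], [19, 31]]
toy_labels = [1, 0, 0]

# Keep only the positions whose review has at least one token,
# and apply the same positions to the labels.
keep = [i for i, review in enumerate(toy_reviews_ints) if len(review) > 0]
toy_reviews_ints = [toy_reviews_ints[i] for i in keep]
toy_labels = [toy_labels[i] for i in keep]

print(toy_reviews_ints, toy_labels)
```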

#### 2. Padding / Truncating the remaining data so that we have reviews of same length



To deal with both short and very long reviews, we'll pad very short reviews and truncate very long reviews to a specific length.

___

For reviews shorter than some `seq_length`, we'll **left pad** with 0s.
As a small example, if `seq_length=10` and an input review is:
```
['best', 'movie', 'ever']  ->  [117, 18, 128] as integers
```
the resulting padded sequence should be:

```
[0, 0, 0, 0, 0, 0, 0, 117, 18, 128]
```
(You can also pad on the right instead of the left; either works.)

___

For reviews longer than `seq_length`, we truncate them to the **first** `seq_length` words.

___

A good `seq_length`, in this case, is 200.






**Your final `features` array should be a 2D array, with as many rows as there are reviews, and as many columns as the specified `seq_length`.**


In [None]:
# Define a function that returns an array `features` that contains the padded data, of a standard size, that we'll pass to the network.
# The data should come from `reviews_ints`, since we want to feed integers to the network.
# Each row should be `seq_length` elements long.

def pad_features(reviews_ints, seq_length):
    ''' Return features of reviews_ints, where each review is left-padded with 0's
        or truncated to the input seq_length.
    '''
    ## getting the correct rows x cols shape
    features = np.zeros((len(reviews_ints), seq_length), dtype=int)

    ## left-pad short reviews with 0s; keep only the first seq_length tokens of long ones
    for i, row in enumerate(reviews_ints):
      features[i, -len(row):] = np.array(row)[:seq_length]

    return features
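
As a quick sanity check of the padding rule, here is the same logic restated with plain Python lists so it runs on its own. `pad_left` is a hypothetical helper, not the notebook's `pad_features`.

```python
def pad_left(tokens, seq_length):
    """Left-pad with 0s, or keep only the first `seq_length` tokens."""
    tokens = tokens[:seq_length]
    return [0] * (seq_length - len(tokens)) + tokens

short = pad_left([117, 18, 128], 10)      # shorter than seq_length: padded
long = pad_left(list(range(1, 15)), 10)   # longer than seq_length: truncated
empty = pad_left([], 10)                  # empty review: all padding

print(short)
print(long)
print(empty)
```

The list version also covers an empty review. The NumPy slice `-len(row):` would select the whole row in that case, which is one more reason to check for zero-length reviews beforehand.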

In [None]:
seq_length = 200

features = pad_features(reviews_ints, seq_length=seq_length)
features

array([[    0,     0,     0, ...,  1403,   110,     5],
       [    0,     0,     0, ...,  5286,    35,  2770],
       [ 2621, 30969,   535, ...,   162,   164,  2621],
       ...,
       [    0,     0,     0, ..., 12394, 12054,   503],
       [    0,     0,     0, ...,    14,  1151,    23],
       [    0,     0,     0, ...,   184,     8,   423]])

In [None]:
print(features[:30,:10])

[[    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [ 2621 30969   535 12308   453   693    32   590   131   513]
 [ 3553    68  2608    63  4681  1242  1456  3515   897  1574]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [  448  3056  1263   468   290   166   372  3076  1757   528]
 ...]

In [None]:
temp = features.tolist() # one plain Python list per padded review

In [None]:
df['review'] = temp
df

Unnamed: 0,review,label
0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",1
1,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0
2,"[2621, 30969, 535, 12308, 453, 693, 32, 590, 1...",1
3,"[3553, 68, 2608, 63, 4681, 1242, 1456, 3515, 8...",0
4,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",1
...,...,...
24995,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0
24996,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",1
24997,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0
24998,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",1


### **Dependent and independent features split**

In [None]:
X = features
Y = df.label

In [None]:
X

array([[    0,     0,     0, ...,  1403,   110,     5],
       [    0,     0,     0, ...,  5286,    35,  2770],
       [ 2621, 30969,   535, ...,   162,   164,  2621],
       ...,
       [    0,     0,     0, ..., 12394, 12054,   503],
       [    0,     0,     0, ...,    14,  1151,    23],
       [    0,     0,     0, ...,   184,     8,   423]])

In [None]:
Y

0        1
1        0
2        1
3        0
4        1
        ..
24995    0
24996    1
24997    0
24998    1
24999    0
Name: label, Length: 25000, dtype: int64

### **Train Validation Test Split**

In [None]:
# Manual split (sklearn's train_test_split was crashing the notebook):
# the first 80% of rows for training, the next 10% for validation and the last 10% for testing
train_end = int(len(X)*0.8)
valid_end = int(len(X)*0.9)

X_train = X[:train_end]
X_validation = X[train_end:valid_end]
X_test = X[valid_end:]

Y_train = Y[:train_end]
Y_validation = Y[train_end:valid_end]
Y_test = Y[valid_end:]

In [None]:
print(X_train.shape,X_validation.shape,X_test.shape, Y_train.shape,Y_validation.shape,Y_test.shape)

(20000, 200) (2500, 200) (2500, 200) (20000,) (2500,) (2500,)
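
The sequential split relies on the rows not being ordered by label. If that were in doubt, a shuffled split over row indices is one alternative; the sketch below uses only the standard library, and `shuffled_split` is a hypothetical helper.

```python
import random

def shuffled_split(n_rows, train_frac=0.8, valid_frac=0.1, seed=0):
    """Return three disjoint lists of row indices: train, validation, test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    train_end = int(n_rows * train_frac)
    valid_end = int(n_rows * (train_frac + valid_frac))
    return indices[:train_end], indices[train_end:valid_end], indices[valid_end:]

train_idx, valid_idx, test_idx = shuffled_split(25000)
print(len(train_idx), len(valid_idx), len(test_idx))
```

The index lists could then be applied as `X[train_idx]` for the NumPy array and `Y.iloc[train_idx]` for the pandas Series.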


### **DataLoaders**
DataLoaders for this data can be created by following two steps:

###### 1. Create a known format for accessing our data, using [TensorDataset](https://pytorch.org/docs/stable/data.html#) which takes in an input set of data and a target set of data with the same first dimension, and creates a dataset.


In [None]:
import torch
from torch.utils.data import TensorDataset, DataLoader

In [None]:
# create Tensor datasets
train_data = TensorDataset(torch.from_numpy(np.array(X_train)), torch.from_numpy(np.array(Y_train)))
valid_data = TensorDataset(torch.from_numpy(np.array(X_validation)), torch.from_numpy(np.array(Y_validation)))
test_data = TensorDataset(torch.from_numpy(np.array(X_test)), torch.from_numpy(np.array(Y_test)))

###### 2. Create DataLoaders and batch our training, validation, and test Tensor datasets.


In [None]:
# dataloaders
batch_size = 50

# make sure to SHUFFLE your data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)
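
Conceptually, a DataLoader with `shuffle=True` visits the rows in a random order and hands them out in chunks of `batch_size`. The sketch below mimics that with plain lists; it is an illustration of the idea under that assumption, not PyTorch's implementation, and `iterate_batches` is a hypothetical name.

```python
import random

def iterate_batches(features, labels, batch_size, shuffle=True, seed=0):
    """Yield (feature_batch, label_batch) pairs in chunks of batch_size."""
    order = list(range(len(features)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        chunk = order[start:start + batch_size]
        yield [features[i] for i in chunk], [labels[i] for i in chunk]

toy_features = [[0, 0, 5], [0, 3, 7], [2, 4, 6], [0, 0, 1], [9, 8, 7]]
toy_labels = [1, 0, 1, 0, 1]
batches = list(iterate_batches(toy_features, toy_labels, batch_size=2))
print([len(x) for x, _ in batches])
```

The last batch is smaller when the number of rows is not a multiple of `batch_size`. In this notebook 20000 and 2500 are both multiples of 50, so every batch is full; otherwise the `drop_last=True` argument of `DataLoader` is one way to avoid a short final batch.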

### **Batching**

In [None]:
# obtain one batch of training data
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)

print('Sample input size: ', sample_x.size()) # batch_size, seq_length
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size()) # batch_size
print('Sample label: \n', sample_y)

Sample input size:  torch.Size([50, 200])
Sample input: 
 tensor([[    0,     0,     0,  ...,   303,   127,    99],
        [    0,     0,     0,  ...,  1388,   233, 10101],
        [    6,   241,    72,  ...,    28,   428,   882],
        ...,
        [ 2110,   499,  1218,  ...,  5625,   984,  1069],
        [    0,     0,     0,  ...,    18,   503,   486],
        [  227,   411,    89,  ...,    31,   287,   437]])

Sample label size:  torch.Size([50])
Sample label: 
 tensor([0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1,
        1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0,
        1, 1])


## **Training and Predicting**

### **Stacked LSTM RNN Model**

In [None]:
# First checking if GPU is available or not
import torch

train_on_gpu=torch.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')

Training on GPU.


#### Defining the model

In [None]:
# Defining our model, which will perform sentiment analysis

import torch.nn as nn

class Sentiment_Stacked_LSTM_RNN(nn.Module):

  def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):

    super(Sentiment_Stacked_LSTM_RNN, self).__init__()

    self.output_size = output_size
    self.n_layers = n_layers
    self.hidden_dim = hidden_dim

    # ----------------------------------------------------input layer-------------------------------------------------
    """An [embedding layer](https://pytorch.org/docs/stable/nn.html#embedding) that converts our word tokens (integers) into dense embedding vectors of a specific size"""
    self.embedding = nn.Embedding(vocab_size, embedding_dim)

    # ----------------------------------------------------hidden layer-------------------------------------------------
    """
    An [LSTM layer](https://pytorch.org/docs/stable/nn.html#lstm) defined by a hidden_state size and number of layers.
    We'll create an LSTM to use in our recurrent network, which takes in an input_size, a hidden_dim, a number of layers,
    a dropout probability (for dropout between multiple layers), and a batch_first parameter. Often a network performs better
    with 2-3 stacked layers, since the extra layers allow it to learn more complex relationships.
    """
    self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers,dropout=drop_prob, batch_first=True, bidirectional = False )
    self.dropout = nn.Dropout(0.3)

    # --------------------------------------------------output layer - linear + sigmoid layer--------------------------------------
    """ A fully-connected output layer that maps the LSTM layer outputs to a desired output_size"""
    self.fc = nn.Linear(hidden_dim, output_size)
    """ A sigmoid activation layer which turns all outputs into a value 0-1; return **only the last sigmoid output** as the output of this network."""
    self.sig = nn.Sigmoid()

  def forward(self, x, hidden):
    """
    Perform a forward pass of our model on some input and hidden state.
    """
    batch_size = x.size(0)

    # embeddings and lstm_out
    embeds = self.embedding(x) # look up a learned (static, non-contextual) embedding vector for each token
    lstm_out, hidden = self.lstm(embeds, hidden)

    # stack up lstm outputs
    lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)

    # dropout and fully connected layer
    out = self.dropout(lstm_out)
    out = self.fc(out)

    # sigmoid function
    sig_out = self.sig(out)

    # reshape to be batch_size first
    sig_out = sig_out.view(batch_size, -1)
    sig_out = sig_out[:, -1] # keep only the output at the last time step of each sequence

    # return last sigmoid output and hidden state
    return sig_out, hidden


  def init_hidden(self, batch_size):
    ''' Initializes hidden state '''
    # Create two new tensors with sizes n_layers x batch_size x hidden_dim,initialized to zero, for hidden state and cell state of LSTM
    weight = next(self.parameters()).data

    if(train_on_gpu):
      hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
    else:
      hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())

    return hidden


In [None]:
# Instantiate the model with hyperparameters

# vocab_size: Size of our vocabulary or the range of values for our input, word tokens.
vocab_size = len(vocab_to_int) + 1 # +1 for zero padding + our word tokens

# output_size: Size of our desired output; the number of class scores we want to output (pos/neg).
output_size = 1

# embedding_dim: Number of columns in the embedding lookup table; size of our embeddings.
embedding_dim = 400

# hidden_dim: Number of units in the hidden layers of our LSTM cells. Usually larger is better performance wise. Common values are 128, 256, 512, etc.
hidden_dim = 256

# n_layers: Number of LSTM layers in the network. Typically between 1-3
n_layers = 2

# learning rate for optimizer
lr=0.001

# loss function - binary cross entropy, [BCELoss](https://pytorch.org/docs/stable/nn.html#bceloss) is designed to work with a single sigmoid output
criterion = nn.BCELoss()

net = Sentiment_Stacked_LSTM_RNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)
print(net)

# optimizer
optimizer = torch.optim.Adam(net.parameters(), lr=lr)

Sentiment_Stacked_LSTM_RNN(
  (embedding): Embedding(50353, 400)
  (lstm): LSTM(400, 256, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sig): Sigmoid()
)
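
For a predicted probability p and a label y, binary cross-entropy is -[y·log(p) + (1-y)·log(1-p)], averaged over the batch. The pure-Python sketch below works through three cases to show how the loss behaves; `binary_cross_entropy` is an illustrative helper, not the PyTorch implementation.

```python
import math

def binary_cross_entropy(probs, targets):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for p, y in zip(probs, targets)]
    return sum(losses) / len(losses)

confident_right = binary_cross_entropy([0.9, 0.1], [1, 0])
confident_wrong = binary_cross_entropy([0.1, 0.9], [1, 0])
coin_flip = binary_cross_entropy([0.5, 0.5], [1, 0])
print(round(confident_right, 4), round(confident_wrong, 4), round(coin_flip, 4))
```

A loss close to log(2) ≈ 0.693 therefore means the model is doing no better than guessing 0.5 for every review.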


#### Training the Model

In [None]:
# training parameters

# Epochs - No of times to iterate through the training dataset - 3-4 is approx where I noticed the validation loss stop decreasing
epochs = 4

counter = 0

print_every = 100

# gradient clipping - the maximum gradient norm to clip at (to prevent exploding gradients).
clip=5
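
Clipping by norm rescales all gradients by one common factor whenever their combined L2 norm exceeds the threshold, so the direction of the update is preserved. A minimal sketch of the idea with plain floats follows; `clip_by_global_norm` is a hypothetical helper and leaves out the small epsilon a library implementation would typically add.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([6.0, 8.0], max_norm=5)
untouched, _ = clip_by_global_norm([0.3, 0.4], max_norm=5)
print(norm_before, clipped, untouched)
```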

In [None]:
# move model to GPU, if available
if(train_on_gpu):
  net.cuda()

In [None]:
net.train()

# Training for some number of epochs

for e in range(epochs):
  # initialize hidden state
  h = net.init_hidden(batch_size)

  # batch loop
  for inputs, labels in train_loader:
    counter += 1

    if(train_on_gpu):
      inputs, labels = inputs.cuda(), labels.cuda()

    # Creating new variables for the hidden state, otherwise we'd backprop through the entire training history
    h = tuple([each.data for each in h])

    # zero accumulated gradients
    net.zero_grad()

    # get the output from the model
    output, h = net(inputs, h)

    # calculate the loss and perform backprop
    loss = criterion(output.squeeze(), labels.float())
    loss.backward()

    # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
    nn.utils.clip_grad_norm_(net.parameters(), clip)
    optimizer.step()

    # loss stats
    if counter % print_every == 0:
      # Get validation loss
      val_h = net.init_hidden(batch_size)
      val_losses = []
      net.eval()
      for inputs, labels in valid_loader:

        # Creating new variables for the hidden state, otherwise we'd backprop through the entire training history
        val_h = tuple([each.data for each in val_h])

        if(train_on_gpu):
          inputs, labels = inputs.cuda(), labels.cuda()

        output, val_h = net(inputs, val_h)
        val_loss = criterion(output.squeeze(), labels.float())
        val_losses.append(val_loss.item())

      net.train()
      print("Epoch: {}/{}...".format(e+1, epochs),"Step: {}...".format(counter),"Loss: {:.6f}...".format(loss.item()),"Val Loss: {:.6f}".format(np.mean(val_losses)))

Epoch: 1/4... Step: 100... Loss: 0.651117... Val Loss: 0.610166
Epoch: 1/4... Step: 200... Loss: 0.573071... Val Loss: 0.595148
Epoch: 1/4... Step: 300... Loss: 0.477250... Val Loss: 0.500567
Epoch: 1/4... Step: 400... Loss: 0.676299... Val Loss: 0.677923
Epoch: 2/4... Step: 500... Loss: 0.512277... Val Loss: 0.560526
Epoch: 2/4... Step: 600... Loss: 0.307610... Val Loss: 0.438921
Epoch: 2/4... Step: 700... Loss: 0.257671... Val Loss: 0.406977
Epoch: 2/4... Step: 800... Loss: 0.317289... Val Loss: 0.423925
Epoch: 3/4... Step: 900... Loss: 0.145008... Val Loss: 0.466749
Epoch: 3/4... Step: 1000... Loss: 0.174054... Val Loss: 0.459306
Epoch: 3/4... Step: 1100... Loss: 0.076392... Val Loss: 0.401823
Epoch: 3/4... Step: 1200... Loss: 0.316865... Val Loss: 0.514169
Epoch: 4/4... Step: 1300... Loss: 0.052432... Val Loss: 0.476855
Epoch: 4/4... Step: 1400... Loss: 0.070502... Val Loss: 0.527440
Epoch: 4/4... Step: 1500... Loss: 0.102899... Val Loss: 0.486298
Epoch: 4/4... Step: 1600... Loss: 

#### Testing the Model



There are a few ways to test your network.

* **Test data performance:** First, we'll see how our trained model performs on all of our defined test_data, above. We'll calculate the average loss and accuracy over the test data.

* **Inference on user-generated data:** Second, we'll see if we can input just one example review at a time (without a label), and see what the trained model predicts. Looking at new, user input data like this, and predicting an output label, is called **inference**.
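
For the first of these, accuracy boils down to thresholding each sigmoid output at 0.5 and counting matches with the labels. Here is a small sketch with made-up probabilities; `accuracy` is an illustrative helper.

```python
def accuracy(probs, labels, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the label."""
    preds = [1 if p > threshold else 0 for p in probs]
    correct = sum(int(pred == label) for pred, label in zip(preds, labels))
    return correct / len(labels)

toy_probs = [0.92, 0.08, 0.61, 0.45]  # hypothetical model outputs
toy_labels = [1, 0, 0, 1]
print(accuracy(toy_probs, toy_labels))
```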

In [None]:
# Get test data loss and accuracy

test_losses = [] # track loss
num_correct = 0

# init hidden state
h = net.init_hidden(batch_size)

net.eval()
# iterate over test data
for inputs, labels in test_loader:

    # Creating new variables for the hidden state, otherwise
    # we'd backprop through the entire training history
    h = tuple([each.data for each in h])

    if(train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()

    # get predicted outputs
    output, h = net(inputs, h)

    # calculate loss
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())

    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze())  # rounds to the nearest integer

    # compare predictions to true label
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)


# -- stats! -- ##
# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))

# accuracy over all test data
test_acc = num_correct/len(test_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))

Test loss: 0.550
Test accuracy: 0.816


#### Inference on a test review




You can change this test_review to any text that you want. Read it and think: is it pos or neg? Then see if your model predicts correctly!
    
> **Exercise:** Write a `predict` function that takes in a trained net, a plain text_review, and a sequence length, and prints out a custom statement for a positive or negative review!
* You can use any functions that you've already defined or define any helper functions you want to complete `predict`, but it should just take in a trained net, a text review, and a sequence length.

In [None]:
# negative test review
test_review_neg = 'The worst movie I have seen; acting was terrible and I want my money back. This movie had bad acting and the dialogue was slow.'

In [None]:
import string

def tokenize_review(test_review):
    test_review = test_review.lower() # lowercase

    test_text = "".join([char for char in test_review if char not in string.punctuation and not char.isdigit()]) # remove punctuation and numbers

    test_text = " ".join([ps.stem(token) for token in test_text.split() if token not in stopwords]) # remove stop words, then stem

    # splitting by spaces
    test_words = test_text.split()

    # tokens: skip words that are not in the training vocabulary (a plain lookup would raise KeyError)
    test_ints = []
    test_ints.append([vocab_to_int[word] for word in test_words if word in vocab_to_int])

    return test_ints

# test code and generate tokenized review
test_ints = tokenize_review(test_review_neg)
print(test_ints)

[[138, 3, 35, 25, 235, 36, 162, 3, 19, 25, 287, 397]]
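
A review typed by a user can contain words that never appeared in the training vocabulary, and a plain dictionary lookup raises `KeyError` for them. One simple way to cope, shown here on a hypothetical slice of the vocabulary, is to skip unknown words; reserving a dedicated "unknown" index would be another option.

```python
# Hypothetical slice of the vocabulary, for illustration only.
toy_vocab_to_int = {"worst": 138, "movi": 3, "bad": 19}

def encode_known_words(words, vocab):
    """Map words to integers, skipping any word missing from `vocab`."""
    return [vocab[word] for word in words if word in vocab]

encoded = encode_known_words(["worst", "movi", "zzzunseen", "bad"], toy_vocab_to_int)
print(encoded)
```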


In [None]:
# test sequence padding
seq_length = 200
features = pad_features(test_ints, seq_length)

print(features)

[[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0 138   3  35  25 235  36 162   3  19  25
  287 397]]


In [None]:
# test conversion to tensor and pass it to model
feature_tensor = torch.from_numpy(features)
print(feature_tensor.size())

torch.Size([1, 200])


In [None]:
def predict(net, test_review, sequence_length=200):
    ''' Prints out whether a given review is predicted to be
        positive or negative in sentiment, using a trained model.

        params:
        net - A trained net
        test_review - a review made of normal text and punctuation
        sequence_length - the padded length of a review
        '''

    net.eval()

    # tokenize review
    test_ints = tokenize_review(test_review)

    # pad tokenized sequence
    seq_length = sequence_length
    features = pad_features(test_ints, seq_length)

    # convert to tensor to pass to model
    feature_tensor = torch.from_numpy(features)

    batch_size = feature_tensor.size(0)

    # initialize hidden state
    h = net.init_hidden(batch_size)

    if(train_on_gpu):
      feature_tensor = feature_tensor.cuda()

    # get the output from the model
    output, h = net(feature_tensor, h)

    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze())
    # printing output value, before rounding
    print('Prediction value, pre-rounding: {:.6f}'.format(output.item()))

    # print custom response based on whether test_review is pos/neg
    if(pred.item()==1):
      print('Positive review detected!')
    else:
      print('Negative review detected!')



In [None]:
# positive test review
test_review_pos = 'This movie had the best acting and the dialogue was so good. I loved it.'


In [None]:
# call function
# try negative and positive reviews!
seq_length=200
predict(net, test_review_neg, seq_length)
predict(net, test_review_pos, seq_length)

Prediction value, pre-rounding: 0.005493
Negative review detected!
Prediction value, pre-rounding: 0.921936
Positive review detected!


### **Using PreTrained Model - BERT**
Refer to the next notebook.