

# Objective

The objective of this notebook is to build a PyTorch model that automatically predicts the tags of a StackExchange question from the text of the question.
![alt text](https://cdn.sstatic.net/Sites/stackoverflow/company/img/logos/se/se-logo.svg?v=d29f0785ebb7)

__Dataset Specs__: Over 85,000 questions and over 1300 unique tags

[Download Link](https://www.kaggle.com/stackoverflow/statsquestions#Questions.csv)


# Steps To Follow


1. Load Data and Import Libraries

2. Dataset Preparation

      2.1 Merge Tags with Questions

      2.2 Filter Questions with respect to Top-10 Tags
      
3. Text Preprocessing

      3.1 Text Cleaning

      3.2 Text Representation

4. Model Building

      4.1 Model Architecture

      4.2 Model Training

5. Model Evaluation

      5.1 Check Performance

      5.2 Show Inference

6. Model Building for LSTM

7. Model Evaluation for LSTM

# 1. Load Data and Import Libraries

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# extract data from the ZIP file
!unzip '/content/drive/My Drive/statsquestions.zip'

Archive:  /content/drive/My Drive/statsquestions.zip
  inflating: Answers.csv             
  inflating: Questions.csv           
  inflating: Tags.csv                
  inflating: database.sqlite         


In [None]:
#string matching
import re 

#reading files
import pandas as pd
## change display width of pandas dataframe
pd.set_option('display.max_colwidth', 200)

#array processing
import numpy as np

#handling html data
from bs4 import BeautifulSoup

#visualization
import matplotlib.pyplot as plt  

#for metrics
from sklearn import metrics

#for seed
import random

# to one hot encode labels
from sklearn.preprocessing import MultiLabelBinarizer

#defining tensors
import torch

#layers
from torch import nn

#layers and wrappers
from torch.nn import Sequential, Linear,  ReLU, Sigmoid, Dropout, BCELoss, Embedding, RNN, LSTM

#handling text data
from torchtext import data 

In [None]:
# load the stackoverflow questions dataset
questions_df = pd.read_csv('Questions.csv',encoding='latin-1')

# load the tags dataset
tags_df = pd.read_csv('Tags.csv')

In [None]:
#Glance at the first 5 rows
questions_df.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,Score,Title,Body
0,6,5.0,2010-07-19T19:14:44Z,272,The Two Cultures: statistics vs. machine learning?,"<p>Last year, I read a blog post from <a href=""http://anyall.org/"">Brendan O'Connor</a> entitled <a href=""http://anyall.org/blog/2008/12/statistics-vs-machine-learning-fight/"">""Statistics vs. Mach..."
1,21,59.0,2010-07-19T19:24:36Z,4,Forecasting demographic census,<p>What are some of the ways to forecast demographic census with some validation and calibration techniques?</p>\n\n<p>Some of the concerns:</p>\n\n<ul>\n<li>Census blocks vary in sizes as rural\n...
2,22,66.0,2010-07-19T19:25:39Z,208,Bayesian and frequentist reasoning in plain English,<p>How would you describe in plain English the characteristics that distinguish Bayesian from Frequentist reasoning?</p>\n
3,31,13.0,2010-07-19T19:28:44Z,138,What is the meaning of p values and t values in statistical tests?,"<p>After taking a statistics course and then trying to help fellow students, I noticed one subject that inspires much head-desk banging is interpreting the results of statistical hypothesis tests...."
4,36,8.0,2010-07-19T19:31:47Z,58,Examples for teaching: Correlation does not mean causation,"<p>There is an old saying: ""Correlation does not mean causation"". When I teach, I tend to use the following standard examples to illustrate this point:</p>\n\n<ol>\n<li>number of storks and birth ..."


In [None]:
#shape of the dataset
questions_df.shape

(85085, 6)

In [None]:
#Take a glance at first 5 rows
tags_df.head()

Unnamed: 0,Id,Tag
0,1,bayesian
1,1,prior
2,1,elicitation
3,2,distributions
4,2,normality


In [None]:
# No. of unique tags
len(tags_df['Tag'].unique())

1315

# 2. Dataset Preparation

## 2.1 Merge Tags with Questions

In [None]:
# remove "-" from the tags
tags_df['Tag'] = tags_df['Tag'].apply(lambda x:re.sub("-"," ",x))

In [None]:
# group tags Id wise
tags_df = tags_df.groupby('Id').apply(lambda x:x['Tag'].values).reset_index(name='tags')
tags_df.head()

Unnamed: 0,Id,tags
0,1,"[bayesian, prior, elicitation]"
1,2,"[distributions, normality]"
2,3,"[software, open source]"
3,4,"[distributions, statistical significance]"
4,6,[machine learning]


In [None]:
# merge tags and questions
df = pd.merge(questions_df,tags_df,how='inner',on='Id')

In [None]:
# fetch required columns
df = df[['Id','Body','tags']]

In [None]:
#first 5 rows
df.head()

Unnamed: 0,Id,Body,tags
0,6,"<p>Last year, I read a blog post from <a href=""http://anyall.org/"">Brendan O'Connor</a> entitled <a href=""http://anyall.org/blog/2008/12/statistics-vs-machine-learning-fight/"">""Statistics vs. Mach...",[machine learning]
1,21,<p>What are some of the ways to forecast demographic census with some validation and calibration techniques?</p>\n\n<p>Some of the concerns:</p>\n\n<ul>\n<li>Census blocks vary in sizes as rural\n...,"[forecasting, population, census]"
2,22,<p>How would you describe in plain English the characteristics that distinguish Bayesian from Frequentist reasoning?</p>\n,"[bayesian, frequentist]"
3,31,"<p>After taking a statistics course and then trying to help fellow students, I noticed one subject that inspires much head-desk banging is interpreting the results of statistical hypothesis tests....","[hypothesis testing, t test, p value, interpretation, intuition]"
4,36,"<p>There is an old saying: ""Correlation does not mean causation"". When I teach, I tend to use the following standard examples to illustrate this point:</p>\n\n<ol>\n<li>number of storks and birth ...","[correlation, teaching]"


In [None]:
#shape of the dataset
df.shape

(85085, 3)
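The grouping and merge steps above can be checked on a toy example. The sketch below uses made-up rows (not rows from the dataset) and assumes pandas is installed; it groups tags per `Id` into lists and then keeps only the questions that have tags, as the inner merge does.

```python
import pandas as pd

# Hypothetical miniature versions of Tags.csv and Questions.csv
tags = pd.DataFrame({
    "Id": [1, 1, 2, 3],
    "Tag": ["bayesian", "prior", "normality", "open-source"],
})
questions = pd.DataFrame({"Id": [1, 2, 4], "Body": ["q1", "q2", "q4"]})

# Replace "-" with a space, as done for the real tags
tags["Tag"] = tags["Tag"].str.replace("-", " ", regex=False)

# Collect all tags of a question into one list per Id
grouped = tags.groupby("Id")["Tag"].apply(list).reset_index(name="tags")

# An inner merge keeps only the Ids present in both frames (1 and 2 here)
merged = pd.merge(questions, grouped, how="inner", on="Id")
print(merged)
```

Question 4 has no tags and tag group 3 has no question, so neither survives the inner merge.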

## 2.2 Filter Questions with respect to Top-10 Tags

In [None]:
# check occurrence of each tag
freq={}
for i in df['tags']:
  for j in i:
    if j in freq.keys():
      freq[j] = freq[j] + 1
    else:
      freq[j] = 1

In [None]:
# sort the dictionary in descending order
freq = dict(sorted(freq.items(), key=lambda x:x[1],reverse=True))

In [None]:
# Top 10 most frequent tags
common_tags = list(freq.keys())[:10]
print(common_tags)

['r', 'regression', 'machine learning', 'time series', 'probability', 'hypothesis testing', 'self study', 'distributions', 'logistic', 'classification']
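The counting, sorting, and slicing steps above can also be written with `collections.Counter` from the standard library. This is a minimal sketch on made-up tag lists (the `tag_lists` variable is a stand-in for `df['tags']`), not a replacement for the cells above.

```python
from collections import Counter

# Hypothetical tag lists standing in for df['tags']
tag_lists = [
    ["r", "regression"],
    ["r", "time series"],
    ["regression", "r"],
    ["probability"],
]

# Count every tag across all questions in one pass
freq = Counter(tag for tags in tag_lists for tag in tags)

# most_common(n) returns the n most frequent (tag, count) pairs, highest first
top_tags = [tag for tag, _ in freq.most_common(2)]
print(top_tags)
```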


We will use only those questions/queries that carry more than one of the top 10 tags.

In [None]:
#finding queries associated with common tags
x=[]
y=[]

for i in range(len(df['tags'])):  

  temp=[]
  for j in df['tags'][i]:
    if j in common_tags:
      temp.append(j)
  
  #if common tags are more than 1
  if(len(temp)>1):
    x.append(df['Body'][i])
    y.append(temp)

In [None]:
# number of questions left
len(x)

11106

In [None]:
#first 5 tags
y[:5]

[['r', 'time series'],
 ['regression', 'distributions'],
 ['distributions', 'probability', 'hypothesis testing'],
 ['hypothesis testing', 'self study'],
 ['r', 'regression', 'time series']]

In [None]:
#combining the labels into a comma-separated string
y = [ ",".join([str(j) for j in i ]) for i in y]

In [None]:
#labels after converting to string
y[:5]

['r,time series',
 'regression,distributions',
 'distributions,probability,hypothesis testing',
 'hypothesis testing,self study',
 'r,regression,time series']

In [None]:
#save to dataframe
dframe = pd.DataFrame({'query':x,'tags':y})

In [None]:
#first 5 rows
dframe.head()

Unnamed: 0,query,tags
0,"<p>I recently started working for a tuberculosis clinic. We meet periodically to discuss the number of TB cases we're currently treating, the number of tests administered, etc. I'd like to start...","r,time series"
1,"<p>Am I looking for a better behaved distribution for the independent variable in question, or to reduce the effect of outliers, or something else?</p>\n","regression,distributions"
2,<p>There are many ways to measure how similar two probability distributions are. Among methods which are popular (in different circles) are:</p>\n\n<ol>\n<li><p>the Kolmogorov distance: the sup-d...,"distributions,probability,hypothesis testing"
3,<blockquote>\n <p>A Lab has been asked to evaluate the claim that drinking water in a\n local restaurant has a lead concentration of 6 parts per billion\n (ppb). Repeated measurements follow a ...,"hypothesis testing,self study"
4,<p>How would we measure the predictive power of predictors in time series models. For e.g. in linear regression we have the magnitude and direction of the regression co-efficients and their p-valu...,"r,regression,time series"


In [None]:
#save to csv
dframe.to_csv('stack.csv',index=False)

# 3. Text Preprocessing

Now, we will look at one of the most important libraries for handling text data in PyTorch: TorchText.

**TorchText** is a Natural Language Processing (NLP) library for PyTorch. It provides utilities for preprocessing text, along with loaders for a few popular NLP datasets on which those utilities can be tried out.

TorchText operates on text data through Field objects, which define the steps for text preprocessing.

There are 2 different types of field objects: **Field** and **LabelField**.

* **Field**: specifies the preprocessing steps for a column in the dataset.

* **LabelField**: a special case of Field that is used only for preprocessing the label column.

Before we use Field, let us look at its main parameters and what they are used for.

**Parameters of Field**:

* **tokenize**: The function used to split a sentence into tokens (words). By default, it splits on whitespace

    * *Note*: A custom preprocessing function can be passed here instead.

* **lower**: Whether to convert the text to lowercase

* **batch_first**: Whether to produce tensors with the batch size as the first dimension

* **fix_length**: The fixed length to which every sentence is padded or truncated

* **unk_token**: The string token used to represent out-of-vocabulary (OOV) words. By default, this value is "<unk>"
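To make the effect of these parameters concrete, here is a plain-Python emulation of lowercasing, whitespace tokenization, and fixed-length padding. It is a sketch of the behaviour under the default settings (truncate at the end, pad at the end), not TorchText's own code; the `preprocess` helper is introduced here for illustration.

```python
def preprocess(sentence, fix_length, pad_token="<pad>"):
    """Emulate lower=True, whitespace tokenization, and fix_length."""
    # lower + default whitespace tokenization
    tokens = sentence.lower().split()
    # sentences longer than fix_length are cut off at the end
    tokens = tokens[:fix_length]
    # shorter sentences are filled up with the padding token
    tokens = tokens + [pad_token] * (fix_length - len(tokens))
    return tokens

print(preprocess("How do I fit", fix_length=6))
print(preprocess("a b c d", fix_length=2))
```

Every output has exactly `fix_length` tokens, which is what allows a batch of sentences to be stored in one rectangular tensor.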

## 3.1 Text Cleaning

In [None]:
def cleaner(text):

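  # strip HTML tags (naming the parser explicitly avoids a bs4 warning)
  text = BeautifulSoup(text, "html.parser").get_text()
  
  # replace every non-alphabetic character with a space
  text = re.sub("[^a-zA-Z]", " ", text)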

  # convert text to lower case
  text = text.lower()

  # split text into tokens to remove whitespaces
  tokens = text.split()
  
  return tokens

In [None]:
#define field object for query
max_len = 100
TEXT = data.Field(tokenize=cleaner, batch_first=True, fix_length=max_len)

In [None]:
#define field object for label
LABEL = data.LabelField(batch_first=True)

Next, we create a list of tuples in which the first value of each tuple is a column name and the second is the corresponding field object. The tuples must be arranged in the same order as the columns of the CSV file.

Let us read only required columns – query and tags

In [None]:
#define a list of tuple with field objects
fields = [('query',TEXT),('tags', LABEL)]

Now, we will load the custom dataset using this list of tuples. For this we use the TabularDataset class.

**Parameters of TabularDataset**:

* **path**: the path of the dataset file

* **format**: the format of the file, e.g. 'csv'
    * **Note**: TorchText accepts only a limited number of formats. Read the docs for more details

* **fields**: the list of (column name, field object) tuples describing the columns of the data

* **skip_header**: boolean value; if set to True, the first line of the data file is ignored

In [None]:
#reading the dataset
training_data = data.TabularDataset(path = 'stack.csv', format = 'csv', fields = fields, skip_header = True)

Let us check whether printing the dataset object shows the examples of the training data.

In [None]:
print(training_data)

<torchtext.data.dataset.TabularDataset object at 0x7fc080cb2e10>


Now, we will see how to print the examples of training data

In [None]:
#print preprocessed text
print(vars(training_data.examples[0]))

{'query': ['i', 'recently', 'started', 'working', 'for', 'a', 'tuberculosis', 'clinic', 'we', 'meet', 'periodically', 'to', 'discuss', 'the', 'number', 'of', 'tb', 'cases', 'we', 're', 'currently', 'treating', 'the', 'number', 'of', 'tests', 'administered', 'etc', 'i', 'd', 'like', 'to', 'start', 'modeling', 'these', 'counts', 'so', 'that', 'we', 're', 'not', 'just', 'guessing', 'whether', 'something', 'is', 'unusual', 'or', 'not', 'unfortunately', 'i', 've', 'had', 'very', 'little', 'training', 'in', 'time', 'series', 'and', 'most', 'of', 'my', 'exposure', 'has', 'been', 'to', 'models', 'for', 'very', 'continuous', 'data', 'stock', 'prices', 'or', 'very', 'large', 'numbers', 'of', 'counts', 'influenza', 'but', 'we', 'deal', 'with', 'cases', 'per', 'month', 'mean', 'median', 'var', 'which', 'are', 'distributed', 'like', 'this', 'image', 'lost', 'to', 'the', 'mists', 'of', 'time', 'image', 'eaten', 'by', 'a', 'grue', 'i', 've', 'found', 'a', 'few', 'articles', 'that', 'address', 'models

As you can see here, the output is the cleaned text

**Note**: *cleaning is done based on the field object defined* 

Split the dataset into training and validation now

In [None]:
# random.seed() returns None, so seed first and then pass the RNG state explicitly
random.seed(32)
train_data, valid_data = training_data.split(split_ratio=0.8, random_state=random.getstate())

## 3.2 Text Representation

The next step is to build the vocabulary for the text. For this we call the *build_vocab* method on the field object, which constructs a vocab object for the field.

Below are the important parameters of build_vocab:

* **Dataset object**: the data from which the vocabulary is built

* **min_freq**: the minimum frequency a word needs in order to enter the vocabulary. Words that occur fewer than min_freq times are left out and mapped to the unknown token.

In [None]:
#preparing the vocabulary for the text
TEXT.build_vocab(train_data, min_freq=3)

In [None]:
#No. of unique words
len(TEXT.vocab)

12483

In [None]:
#word index
list(TEXT.vocab.stoi.items())[:10]

[('<unk>', 0),
 ('<pad>', 1),
 ('the', 2),
 ('i', 3),
 ('to', 4),
 ('a', 5),
 ('of', 6),
 ('is', 7),
 ('and', 8),
 ('in', 9)]


***Note***: Two special tokens known as unknown and padding will be added to the vocabulary by default

* **Unknown token** is used to handle Out Of Vocabulary words. By default, the index of unknown token is 0
* **Padding token** is used to make input sequences of same length. By default, the padding token is added at index 1
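The interplay of min_freq and the two special tokens can be illustrated without TorchText. The sketch below is a simplified emulation on a made-up corpus; `build_vocab` and `numericalize` here are illustrative helpers, not the library's functions.

```python
from collections import Counter

def build_vocab(token_lists, min_freq, specials=("<unk>", "<pad>")):
    """Map tokens to indices; special tokens take the first indices."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = list(specials)
    # most frequent tokens first, ties broken alphabetically
    for tok, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n >= min_freq:          # tokens seen fewer than min_freq times are dropped
            itos.append(tok)
    return {tok: i for i, tok in enumerate(itos)}

def numericalize(tokens, stoi):
    """Replace each token by its index; unseen tokens fall back to <unk>."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

corpus = [["the", "model", "fits"], ["the", "model", "fails"], ["the", "data"]]
stoi = build_vocab(corpus, min_freq=2)
print(stoi)
print(numericalize(["the", "data", "<pad>"], stoi))
```

With `min_freq=2`, the words "fits", "fails" and "data" occur only once, so they are not in the vocabulary and are encoded as index 0.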

In [None]:
def fetch_text(examples):

  text=[]
  for example in examples:
    query = vars(example)['query']
    text.append(query)
    
  return text

In [None]:
train_text = fetch_text(train_data)
valid_text = fetch_text(valid_data)

In [None]:
def convert2seq(text):
  
  #padding
  text = TEXT.pad(text)
  
  #converting to numbers
  text = TEXT.numericalize(text)
  
  return text

In [None]:
X_train = convert2seq(train_text)
X_valid = convert2seq(valid_text)

In [None]:
X_train[0]

tensor([    3,    17,     2,    15,    98,    33,    86,     3,   250,     4,
          520,     2,    94,    40,    21,    50,   193,    74,   294,   635,
           65,     9, 12271,     3,    17,   266,   283,    78,    26,  2481,
         3593,    33,    13,     7,  1877,    12,   103,    21,  2365,  1163,
           15,   276,   841,     6,    62,  9865,   460,     0,   460,     0,
          460,     0,   460,     0,   460,     0,   927,  4131,   460, 12271,
          460,     0,   460,     0,   460,     0,   460,     0,  8616,   286,
          286,   286,   286,   286,   286,     0,   460, 10035,   460,   329,
            7,    65,  1163,   286,    24,    15,   117,    32,  1374,    66,
           65,    69,    70,   598,   681,   104,   681,  1163, 10035,     0])

In [None]:
X_train.shape, X_valid.shape

(torch.Size([8885, 100]), torch.Size([2221, 100]))

In [None]:
def fetch_tags(data):
  tags=[]
  for example in data.examples:
    tags.append(vars(example)['tags'])
  return tags

In [None]:
train_tags = fetch_tags(train_data)
valid_tags = fetch_tags(valid_data)

In [None]:
train_tags[:5]

['r,logistic',
 'machine learning,classification',
 'r,time series',
 'r,time series',
 'probability,distributions']

In [None]:
#preparing the output labels 
train_tags_list=[i.split(",") for i in train_tags]
valid_tags_list=[i.split(",") for i in valid_tags]

In [None]:
mlb= MultiLabelBinarizer()
mlb.fit(train_tags_list)

MultiLabelBinarizer(classes=None, sparse_output=False)

In [None]:
mlb.classes_

array(['classification', 'distributions', 'hypothesis testing',
       'logistic', 'machine learning', 'probability', 'r', 'regression',
       'self study', 'time series'], dtype=object)

In [None]:
y_train  = mlb.transform(train_tags_list)
y_valid  = mlb.transform(valid_tags_list)

In [None]:
y_train.shape, y_valid.shape

((8885, 10), (2221, 10))
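Each row of `y_train` is a multi-hot vector with one position per tag. The plain-Python sketch below shows what such an encoding looks like on two made-up label lists. It mirrors the alphabetical class order seen in `mlb.classes_`, but it is only an illustration, not scikit-learn's implementation; all three helper names are introduced here.

```python
def fit_classes(label_lists):
    """Collect the distinct labels in sorted order."""
    return sorted({label for labels in label_lists for label in labels})

def to_multi_hot(label_lists, classes):
    """Encode each label list as a 0/1 row with one column per class."""
    index = {c: i for i, c in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[index[label]] = 1
        rows.append(row)
    return rows

def from_multi_hot(rows, classes):
    """Decode 0/1 rows back into tuples of labels."""
    return [tuple(c for c, flag in zip(classes, row) if flag) for row in rows]

labels = [["r", "logistic"], ["r", "time series"]]
classes = fit_classes(labels)
encoded = to_multi_hot(labels, classes)
print(classes)
print(encoded)
print(from_multi_hot(encoded, classes))
```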

In [None]:
type(y_train)

numpy.ndarray

In [None]:
y_train = torch.FloatTensor(y_train)
y_valid = torch.FloatTensor(y_valid)

In [None]:
type(y_train)

torch.Tensor

# 4. Model Building

 ## 4.1 Model Architecture

Before defining an RNN architecture, let us look at how the RNN layer is defined in PyTorch and what its input and output shapes are.

As you might remember, the preprocessed text is first passed through an Embedding layer, and the output of the embedding layer is then passed through the RNN layer.

Let's see what the parameters of the Embedding layer are:

* **num_embeddings**: Size of the vocabulary, i.e. the number of distinct token indices the layer can look up
* **embedding_dim**: Number of embedding dimensions; this is set by the user

In [None]:
# define embedding layer
emb = Embedding(num_embeddings=len(TEXT.vocab), embedding_dim=50)

In [None]:
X_train[:1].shape

torch.Size([1, 100])

In [None]:
# check sample input
sample_embedding = emb(X_train[:1])

In [None]:
sample_embedding.shape

torch.Size([1, 100, 50])

In PyTorch, you can define an RNN layer using the RNN module of torch.nn.

Parameters of the RNN layer:

* **input_size**: Number of features in the input at each timestep (here, the embedding dimension)
* **hidden_size**: Number of features in the hidden state, i.e. the number of neurons in the RNN layer
* **batch_first**: If True, the input and output tensors have the batch size as their first dimension
* **nonlinearity**: Activation function of the RNN layer ('tanh' or 'relu')

In [None]:
#define a rnn
rnn = RNN(input_size=50, hidden_size=128, batch_first=True, nonlinearity='relu')

In [None]:
#pass the input to rnn
hidden_states,last_hidden_state = rnn(sample_embedding)

In [None]:
#Hidden state of every timestep (Batch, seq_len, no. of hidden neurons)
hidden_states.shape

torch.Size([1, 100, 128])

In [None]:
#shape of the final hidden state (num_layers, batch, hidden neurons); batch_first does not apply here
last_hidden_state.shape

torch.Size([1, 1, 128])
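To see where these two outputs come from, the recurrence can be recomputed by hand from the layer's own weights. The sketch below uses a small randomly initialised RNN on random input (the sizes are made up for readability) and assumes a single-layer, unidirectional `nn.RNN` with a ReLU nonlinearity, as in the cell above.

```python
import torch
from torch import nn

torch.manual_seed(0)

rnn = nn.RNN(input_size=4, hidden_size=3, batch_first=True, nonlinearity='relu')
x = torch.randn(2, 5, 4)   # (batch, seq_len, features)

with torch.no_grad():
    hidden_states, last_hidden_state = rnn(x)

    # h_t = relu(x_t W_ih^T + b_ih + h_{t-1} W_hh^T + b_hh), starting from h_0 = 0
    h = torch.zeros(2, 3)
    manual = []
    for t in range(x.size(1)):
        h = torch.relu(x[:, t] @ rnn.weight_ih_l0.T + rnn.bias_ih_l0
                       + h @ rnn.weight_hh_l0.T + rnn.bias_hh_l0)
        manual.append(h)
    manual = torch.stack(manual, dim=1)   # (batch, seq_len, hidden)

# the manual loop reproduces the layer's output
print(torch.allclose(manual, hidden_states, atol=1e-5))
# the final hidden state is the last timestep of hidden_states
print(torch.allclose(hidden_states[:, -1], last_hidden_state[0], atol=1e-5))
```

So `hidden_states` holds the hidden state of every timestep, and `last_hidden_state` is simply its last slice with an extra leading layer dimension.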

In [None]:
#reshaping the hidden states
reshaped = hidden_states.reshape(hidden_states.size(0),-1)
reshaped.shape

torch.Size([1, 12800])

In [None]:
# Define Model Architecture

# Input
# Embedding(embedding_dim=50)
# RNN(128)
# Linear(128, 'relu')
# Linear(10, 'sigmoid')

class Net(nn.Module):
    
    #define all the layers used in model
    def __init__(self):
        
        #Constructor
        super(Net, self).__init__()   
        
        self.rnn_layer = nn.Sequential(
            
            #embedding layer [batch_size,100,50]
            Embedding(num_embeddings=len(TEXT.vocab), embedding_dim=50),
        
            #rnn layer [batch_size,100,128]
            RNN(input_size=50, hidden_size=128, nonlinearity='relu',batch_first=True)
          
            )

        self.dense_layer = nn.Sequential(
            
            #[batch_size,100*128]
            Linear(12800, 128),

            ReLU(),

            #[batch_size,128]
            Linear(128,10),
            
            #[batch_size,10]
            Sigmoid()

        )

    def forward(self, x):
        
        #rnn layer
        hidden_states, last_hidden_state = self.rnn_layer(x)

        #reshaping
        hidden_states = hidden_states.reshape(hidden_states.size(0),-1)

        #dense layer
        outputs=self.dense_layer(hidden_states)
        
        return outputs

In [None]:
#define the model
model = Net()

In [None]:
#model layers
model

Net(
  (rnn_layer): Sequential(
    (0): Embedding(12483, 50)
    (1): RNN(50, 128, batch_first=True)
  )
  (dense_layer): Sequential(
    (0): Linear(in_features=12800, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=10, bias=True)
    (3): Sigmoid()
  )
)
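The printed layer sizes make it possible to count the trainable parameters by hand. The arithmetic below is a rough sketch that assumes the default single-layer `nn.RNN` layout (one input weight matrix, one recurrent weight matrix, and two bias vectors); the variable names are introduced here for illustration.

```python
vocab_size, emb_dim = 12483, 50
hidden, seq_len, dense_units, n_tags = 128, 100, 128, 10

# Embedding: one emb_dim-sized vector per vocabulary entry
embedding = vocab_size * emb_dim

# RNN: input weights + recurrent weights + two bias vectors
rnn = emb_dim * hidden + hidden * hidden + 2 * hidden

# First dense layer takes all 100 hidden states flattened (100 * 128 inputs)
dense1 = (seq_len * hidden) * dense_units + dense_units

# Output layer: one unit per tag
dense2 = dense_units * n_tags + n_tags

total = embedding + rnn + dense1 + dense2
print(embedding, rnn, dense1, dense2, total)
```

Well over two thirds of the parameters sit in the first dense layer, a consequence of flattening all 100 hidden states before passing them on.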

In [None]:
#pass a text to the model to understand the output
#deactivates autograd
with torch.no_grad():
  pred = model(X_train[:1])
  print(pred)

tensor([[0.5343, 0.4600, 0.4879, 0.4852, 0.5176, 0.5682, 0.5123, 0.4828, 0.5103,
         0.5097]])


In [None]:
#define optimizer and loss
optimizer = torch.optim.Adam(model.parameters())
criterion = BCELoss()

# checking if GPU is available
if torch.cuda.is_available():
    model = model.cuda()
    criterion = criterion.cuda()
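BCELoss treats each of the 10 tag outputs as an independent yes/no prediction and, with its default settings, averages the binary cross-entropy over all elements. The sketch below computes the same quantity by hand on made-up probabilities; the `bce` helper is illustrative and not part of PyTorch.

```python
import math

def bce(probs, targets):
    """Mean binary cross-entropy over every (sample, tag) pair."""
    total, count = 0.0, 0
    for p_row, t_row in zip(probs, targets):
        for p, t in zip(p_row, t_row):
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            count += 1
    return total / count

# One sample with three tags: confident and right, right, and wrong
probs = [[0.9, 0.2, 0.6]]
targets = [[1.0, 0.0, 0.0]]
print(round(bce(probs, targets), 4))

# A model that outputs 0.5 everywhere scores ln(2), whatever the targets are
print(round(bce([[0.5, 0.5, 0.5]], [[1.0, 0.0, 1.0]]), 4))
```

The second value, about 0.6931, is a useful baseline: the untrained model above outputs probabilities near 0.5, so its loss starts close to it.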

## 4.2 Model Training

In [None]:
# define training function
def train(X,y,batch_size):

  #activate training phase
  model.train()
  
  #initialization
  epoch_loss= 0
  no_of_batches = 0

  #randomly create indices
  indices= torch.randperm(len(X))
  
  #loading in batches
  for i in range(0,len(indices),batch_size):
    
    #indices for a batch
    ind = indices[i:i+batch_size]
  
    #batch  
    batch_x=X[ind]
    batch_y=y[ind]
    
    #push to cuda
    if torch.cuda.is_available():
        batch_x, batch_y = batch_x.cuda(), batch_y.cuda()

    #clear gradients
    optimizer.zero_grad()
          
    #forward pass
    outputs = model(batch_x)

    #drop size-1 dimensions (leaves a [batch_size, 10] output unchanged when batch_size > 1)
    outputs = outputs.squeeze()

    #calculate loss
    loss = criterion(outputs, batch_y)
    
    #Backward pass
    loss.backward()
    
    #Update weights
    optimizer.step()

    #Keep track of the loss of an epoch
    epoch_loss = epoch_loss + loss.item()

    #No. of batches
    no_of_batches = no_of_batches+1

  return epoch_loss/no_of_batches

In [None]:
# define evaluation function
def evaluate(X,y,batch_size):

  #deactivate training phase
  model.eval()

  #initialization
  epoch_loss = 0
  no_of_batches = 0

  #randomly create indices
  indices= torch.randperm(len(X))

  #deactivates autograd
  with torch.no_grad():
    
    #loading in batches
    for i in range(0,len(indices),batch_size):
      
      #indices for a batch
      ind = indices[i:i+batch_size]
  
      #batch  
      batch_x= X[ind]
      batch_y= y[ind]

      #push to cuda
      if torch.cuda.is_available():
          batch_x, batch_y = batch_x.cuda(), batch_y.cuda()
        
      #Forward pass
      outputs = model(batch_x)

      #drop size-1 dimensions (leaves a [batch_size, 10] output unchanged when batch_size > 1)
      outputs = outputs.squeeze()

      # Calculate loss
      loss = criterion(outputs, batch_y)
      
      #keep track of the loss of an epoch
      epoch_loss = epoch_loss + loss.item()

      #no. of batches
      no_of_batches = no_of_batches + 1

    return epoch_loss/no_of_batches

In [None]:
# define prediction function
def predict(X,batch_size):
  
  #deactivate training phase
  model.eval()

  # initialization 
  predictions = []

  # create indices
  indices = torch.arange(len(X))

  #deactivates autograd
  with torch.no_grad():
      
      for i in range(0, len(X), batch_size):
        
        #indices for a batch
        ind = indices[i:i+batch_size]

        # batch
        batch_x = X[ind]

        #push to cuda
        if torch.cuda.is_available():
            batch_x = batch_x.cuda()

        #Forward pass
        outputs = model(batch_x)

        #drop size-1 dimensions (leaves a [batch_size, 10] output unchanged when batch_size > 1)
        outputs = outputs.squeeze()

        # convert to numpy array
        prediction = outputs.data.cpu().numpy()
        predictions.append(prediction)
    
  # convert to single numpy array
  predictions = np.concatenate(predictions, axis=0)
    
  return predictions

In [None]:
N_EPOCHS = 10
batch_size = 32

# initialization
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
     
    #train the model
    train_loss   = train(X_train, y_train, batch_size)
    
    #evaluate the model
    valid_loss   = evaluate(X_valid, y_valid, batch_size)

    print('\nEpoch :',epoch,
          'Training loss:',round(train_loss,4),
          '\tValidation loss:',round(valid_loss,4))

    #save the best model
    if best_valid_loss >= valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'saved_weights.pt') 
        print("\n----------------------------------------------------Saved best model------------------------------------------------------------------")   


Epoch : 0 Training loss: 0.4235 	Validation loss: 0.3702

----------------------------------------------------Saved best model------------------------------------------------------------------

Epoch : 1 Training loss: 0.29 	Validation loss: 0.3337

----------------------------------------------------Saved best model------------------------------------------------------------------

Epoch : 2 Training loss: 0.1907 	Validation loss: 0.3631

Epoch : 3 Training loss: 0.1023 	Validation loss: 0.4459

Epoch : 4 Training loss: 0.0429 	Validation loss: 0.5535

Epoch : 5 Training loss: 0.0154 	Validation loss: 0.6735

Epoch : 6 Training loss: 0.0052 	Validation loss: 0.764

Epoch : 7 Training loss: 0.0021 	Validation loss: 0.8268

Epoch : 8 Training loss: 0.0012 	Validation loss: 0.8796

Epoch : 9 Training loss: 0.0011 	Validation loss: 0.9282
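The checkpoint rule in the loop above can be replayed on the logged validation losses to see which epochs trigger a save. This sketch only reuses the numbers printed in the log; `saved_epochs` is a name introduced here.

```python
# Validation losses copied from the training log above
valid_losses = [0.3702, 0.3337, 0.3631, 0.4459, 0.5535,
                0.6735, 0.764, 0.8268, 0.8796, 0.9282]

best_valid_loss = float('inf')
saved_epochs = []

for epoch, valid_loss in enumerate(valid_losses):
    # same condition as in the training loop
    if best_valid_loss >= valid_loss:
        best_valid_loss = valid_loss
        saved_epochs.append(epoch)

print(saved_epochs, best_valid_loss)
```

Only epochs 0 and 1 trigger a save, so the weights on disk are those from epoch 1. From epoch 2 onwards the validation loss rises while the training loss keeps falling, a typical sign of overfitting.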


# 5. Model Evaluation

## 5.1 Check Performance

Load the weights of the best model; the model is then ready to make predictions.

In [None]:
#load weights of best model
path='saved_weights.pt'
model.load_state_dict(torch.load(path))

<All keys matched successfully>

In [None]:
#predict probabilities
y_pred_prob = predict(X_valid, batch_size)

In [None]:
y_pred_prob[0]

array([7.0765009e-04, 3.4424686e-01, 8.5326970e-02, 1.3579858e-02,
       1.0562687e-02, 2.2845514e-01, 4.2070836e-01, 2.7319089e-01,
       7.2962564e-01, 4.7954172e-03], dtype=float32)

In [None]:
#actual tags
y_true = y_valid.cpu().numpy()

The predictions are in terms of probabilities for each of the 10 tags. Hence we need to have a threshold value to convert these probabilities to 0 or 1. Let's specify a set of candidate threshold values. We will select the threshold value that performs the best for the validation set.

In [None]:
#define candidate threshold values
threshold  = np.arange(0,0.5,0.01)
print(threshold)

[0.   0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1  0.11 0.12 0.13
 0.14 0.15 0.16 0.17 0.18 0.19 0.2  0.21 0.22 0.23 0.24 0.25 0.26 0.27
 0.28 0.29 0.3  0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4  0.41
 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49]


Let's define a function that takes a threshold value and uses it to convert probabilities into 1 or 0.

In [None]:
# convert probabilities into classes or tags based on a threshold value
def classify(y_pred_prob, thresh):
  
  y_pred = []

  for i in y_pred_prob:
    temp=[]
      
    for j in i:
      if j>=thresh:
        temp.append(1)
      else:
        temp.append(0)
    
    y_pred.append(temp)

  return np.array(y_pred)
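For NumPy arrays, the same thresholding can be written as a single vectorized comparison. The sketch below is an equivalent alternative on made-up probabilities, assuming the input is a NumPy array or nested list; `classify_vectorized` is a name introduced here.

```python
import numpy as np

def classify_vectorized(y_pred_prob, thresh):
    """Return 1 where the probability reaches the threshold, else 0."""
    return (np.asarray(y_pred_prob) >= thresh).astype(int)

probs = np.array([[0.10, 0.34, 0.80],
                  [0.50, 0.05, 0.33]])
print(classify_vectorized(probs, 0.34))
```

As in the loop version, a probability exactly equal to the threshold counts as a positive.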

We will evaluate the performance of the model for each candidate threshold.

In [None]:
score=[]

for thresh in threshold:
    
    #classes for each threshold
    y_pred = classify(y_pred_prob, thresh) 

    #convert to 1d array
    y_pred_1d    =  y_pred.ravel()
    y_true_1d    =  y_true.ravel()
 
    score.append(metrics.f1_score(y_true_1d, y_pred_1d))

In [None]:
# find the optimal threshold
opt = threshold[score.index(max(score))]
print(opt)

0.34
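Flattening the label matrix before computing F1 means every (question, tag) pair counts as one binary decision, which corresponds to a micro-averaged F1 over the tags. The sketch below works this out by hand on a made-up example; `f1_binary` is an illustrative helper, not the scikit-learn function.

```python
def f1_binary(y_true, y_pred):
    """F1 score of the positive class for two flat 0/1 sequences."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Two questions, three tags each
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 0], [0, 1, 0]]

# Flatten row by row, as ravel() does
flat_true = [v for row in y_true for v in row]
flat_pred = [v for row in y_pred for v in row]

print(round(f1_binary(flat_true, flat_pred), 4))
```

Here there are 2 true positives, 1 false positive and 1 false negative, so precision and recall are both 2/3 and so is the F1 score.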


In [None]:
#predictions for optimal threshold
y_pred = classify(y_pred_prob, opt)

In [None]:
#converting to 1D
y_pred_1d = y_pred.ravel()

#Classification report
print(metrics.classification_report(y_true_1d, y_pred_1d))

              precision    recall  f1-score   support

         0.0       0.91      0.90      0.91     17478
         1.0       0.65      0.68      0.66      4732

    accuracy                           0.85     22210
   macro avg       0.78      0.79      0.78     22210
weighted avg       0.86      0.85      0.85     22210



In [None]:
#convert back to tags
y_pred_label = mlb.inverse_transform(np.array(y_pred))
y_true_label = mlb.inverse_transform(np.array(y_true))

# get all validation text
queries = [" ".join(i) for i in valid_text]

# create a dataframe to show the data and prediction side by side
df = pd.DataFrame({'Questions':queries,'Actual Tags':y_true_label,'Predicted Tags':y_pred_label})

# print first five rows
df.head()

Unnamed: 0,Questions,Actual Tags,Predicted Tags
0,consider the following model y if g x beta u and y otherwise where u is iid according to some distribution function f i want to recover the distribution f without making too many assumptions that ...,"(logistic, regression)","(distributions, r, self study)"
1,i am encountering the following problems and i don t really know which model a should pick all model selection criteria indicate that i should take the model with lag after building the var model ...,"(hypothesis testing, time series)","(hypothesis testing, regression)"
2,basically i m attempting to recreate the results of an example from class in r what i m trying to do is decide whether it s best to use a single regression line for an entire data set or two lines...,"(r, regression)","(r, regression)"
3,in general i standardize my independent variables in regressions in order to properly compare the coefficients this way they have the same units standard deviations however with panel longitudinal...,"(r, regression)","(r, regression)"
4,let v to be forecasted value for periods through t and v t be its forecasted value at time t we express v t as the sum of two terms its mean at time t and its deviation from the mean at time t eps...,"(r, time series)","(self study, time series)"


## 5.2 Show Inference

In [None]:
#raw text
text = "For example, in the case of logistic regression, the learning function is a Sigmoid function that tries to separate the 2 classes"

In [None]:
#cleaning text
tokens = cleaner(text)
tokens[:5]

['for', 'example', 'in', 'the', 'case']

In [None]:
#first argument to the model is no. of samples
tokens = np.array(tokens).reshape(-1,len(tokens))
tokens.shape

(1, 21)

In [None]:
#converting text to integer sequences
seq = convert2seq(tokens)
seq

tensor([[  12,  107,    9,    2,  151,    6,   94,   40,    2,  226,   74,    7,
            5, 1570,   74,   13, 2927,    4,  960,    2,  373,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1]])

In [None]:
#predictions
with torch.no_grad():
  if torch.cuda.is_available():
    seq = seq.cuda()
  pred_prob= model(seq)
  print(pred_prob)

tensor([[0.1419, 0.0319, 0.0397, 0.8632, 0.1818, 0.1252, 0.1738, 0.6588, 0.0417,
         0.0118]], device='cuda:0')


In [None]:
#convert probabilities to 0/1 labels using the optimal threshold (opt)
pred = classify(pred_prob,opt)
pred

array([[0, 0, 0, 1, 0, 0, 0, 1, 0, 0]])

In [None]:
#map the 0/1 labels back to tag names
tags  = mlb.inverse_transform(pred)[0]
tags

('logistic', 'regression')
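The last two steps turn per-tag probabilities into tag names: each probability is compared with a threshold, and the positions marked 1 are mapped back to labels. The sketch below mimics this with NumPy only. `classify_sketch`, `labels_to_tags`, the four-label list, and the threshold 0.5 are hypothetical stand-ins; the notebook itself uses its own `classify`, the fitted `mlb`, and the tuned threshold `opt`.

```python
import numpy as np

# hypothetical label order; in the notebook the real order comes from the fitted mlb
classes = ["logistic", "r", "regression", "time series"]

def classify_sketch(pred_prob, thresh):
    """Return a 0/1 array: 1 where the probability reaches the threshold."""
    return (np.asarray(pred_prob) >= thresh).astype(int)

def labels_to_tags(pred_row, classes):
    """Keep the class names whose indicator is 1."""
    return tuple(c for c, flag in zip(classes, pred_row) if flag == 1)

probs = np.array([[0.86, 0.12, 0.66, 0.04]])   # made-up probabilities
pred = classify_sketch(probs, thresh=0.5)
tags = labels_to_tags(pred[0], classes)
print(pred, tags)
```

Because every tag is thresholded independently, a question can receive zero, one, or several tags.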

In [None]:
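def predict_tags(text):

  #clean the raw text and split it into tokens
  tokens = cleaner(text)

  #add the batch dimension: (1, no. of tokens)
  tokens = np.array(tokens).reshape(-1,len(tokens))

  #convert tokens to a padded integer sequence
  seq = convert2seq(tokens)

  #run the forward pass without tracking gradients
  with torch.no_grad():
    if torch.cuda.is_available():
      seq = seq.cuda()
    pred_prob = model(seq)

  #apply the optimal threshold and map the labels back to tag names
  pred = classify(pred_prob,opt)
  tags  = mlb.inverse_transform(pred)[0]

  return tags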

In [None]:
text = "For example, in the case of logistic regression, the learning function is a Sigmoid function that tries to separate the 2 classes"

tags = predict_tags(text)
print("Query: ", text)
print("Predicted tags:",tags)

Query:  For example, in the case of logistic regression, the learning function is a Sigmoid function that tries to separate the 2 classes
Predicted tags: ('logistic', 'regression')


# 6. Model Building for LSTM

In PyTorch, an LSTM layer can be defined with the `LSTM` module of `torch.nn`.

Parameters of the LSTM layer:

* **input_size**: Number of features in the input at each timestep (here, the embedding dimension)
* **hidden_size**: Number of features in the hidden state
* **batch_first**: If `True`, input and output tensors have the shape (batch, seq_len, features)

In [None]:
sample_embedding.shape

torch.Size([1, 100, 50])

In [None]:
#define an LSTM
lstm_layer = LSTM(input_size=50, hidden_size=128, batch_first=True)

In [None]:
#pass the input to LSTM
hidden_states, (last_hidden_state,last_cell_state) = lstm_layer(sample_embedding)

In [None]:
#Hidden state of every timestep (Batch, seq_len, no. of hidden neurons)
hidden_states.shape

torch.Size([1, 100, 128])

In [None]:
#hidden state at the last timestep (no. of layers, Batch, no. of hidden neurons)
last_hidden_state.shape

torch.Size([1, 1, 128])

In [None]:
#cell state at the last timestep (no. of layers, Batch, no. of hidden neurons)
last_cell_state.shape

torch.Size([1, 1, 128])
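For a single-layer, unidirectional LSTM, the last hidden state returned separately is the same vector as the final timestep of the per-timestep hidden states. The sketch below checks this on a random tensor that stands in for `sample_embedding`; the random input and the seed are assumptions made for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# same configuration as the layer defined above
lstm = nn.LSTM(input_size=50, hidden_size=128, batch_first=True)

# random stand-in for sample_embedding: (batch, seq_len, embedding_dim)
x = torch.randn(1, 100, 50)

with torch.no_grad():
    hidden_states, (h_n, c_n) = lstm(x)

# hidden_states[:, -1, :] is the last timestep; h_n[0] is the only layer's final state
same = torch.allclose(hidden_states[:, -1, :], h_n[0])
print(hidden_states.shape, h_n.shape, same)
```

The cell state has the same shape as the hidden state but is not part of `hidden_states`; it is only returned for the final timestep.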

In [None]:
#reshaping the hidden states
reshaped = hidden_states.reshape(hidden_states.size(0),-1)
reshaped.shape

torch.Size([1, 12800])

In [None]:
# Define Model Architecture

# Input
# Embedding(embedding_dim=100)
# LSTM(128)
# Linear(128, 'relu')
# Linear(10, 'sigmoid')

class Net(nn.Module):
    
    #Constructor
    def __init__(self):

        #initialize the parent class (nn.Module)
        super(Net, self).__init__()   
  
        #lstm block (embedding followed by LSTM)
        self.lstm_layer = Sequential(
            
            #embedding layer
            Embedding(num_embeddings=len(TEXT.vocab), embedding_dim=100),
        
            #lstm layer
            LSTM(input_size=100, hidden_size=128, batch_first=True)
          
            )

        #dense block
        self.dense_layer = Sequential(
            
            Linear(12800,128),

            ReLU(),

            Linear(128,10),
            
            Sigmoid()

        )
    
    #forward pass
    def forward(self, x):
        
        #lstm block: returns the hidden states of all timesteps and the final (hidden, cell) states
        hidden_states, (last_hidden_state,last_cell_state) = self.lstm_layer(x)

        #flattening
        hidden_states = hidden_states.reshape(hidden_states.size(0),-1)
        
        #dense layer
        outputs=self.dense_layer(hidden_states)
        
        return outputs

In [None]:
#define the model
model = Net()

In [None]:
#layers of the model
model

Net(
  (lstm_layer): Sequential(
    (0): Embedding(12483, 100)
    (1): LSTM(100, 128, batch_first=True)
  )
  (dense_layer): Sequential(
    (0): Linear(in_features=12800, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=10, bias=True)
    (3): Sigmoid()
  )
)
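The first `Linear` layer expects 12800 input features because the hidden states of all 100 timesteps (128 values each) are flattened into one vector. The sketch below reproduces only the embedding and LSTM block with a made-up vocabulary size and random token ids, to show where that number comes from; `vocab_size = 500` and the batch of 4 are assumptions made for illustration.

```python
import torch
from torch import nn

vocab_size = 500   # hypothetical; the notebook uses len(TEXT.vocab)
seq_len = 100      # padded sequence length used throughout the notebook

# nn.Sequential works here because the LSTM is the last module,
# so its tuple output is returned unchanged
block = nn.Sequential(
    nn.Embedding(num_embeddings=vocab_size, embedding_dim=100),
    nn.LSTM(input_size=100, hidden_size=128, batch_first=True),
)

# random integer sequences standing in for a batch of 4 questions
x = torch.randint(0, vocab_size, (4, seq_len))

with torch.no_grad():
    hidden_states, (h_n, c_n) = block(x)

# flatten (batch, seq_len, hidden) -> (batch, seq_len * hidden)
flat = hidden_states.reshape(hidden_states.size(0), -1)
print(hidden_states.shape, flat.shape)
```

One consequence of this design is that the model only accepts sequences padded or truncated to exactly 100 tokens.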

In [None]:
#pass one sample to the untrained model to inspect the output

#deactivates autograd
with torch.no_grad():
  pred = model(X_train[:1])
  print(pred)

tensor([[0.4939, 0.4825, 0.4995, 0.4991, 0.4977, 0.4955, 0.4906, 0.4846, 0.5005,
         0.4998]])


In [None]:
#define optimizer and loss
optimizer = torch.optim.Adam(model.parameters())
criterion = BCELoss()

# checking if GPU is available
if torch.cuda.is_available():
    model = model.cuda()
    criterion = criterion.cuda()
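`BCELoss` treats each of the 10 sigmoid outputs as an independent binary prediction and averages the binary cross-entropy over all of them. As a quick sanity check, a model that outputs 0.5 for every tag (close to the untrained outputs shown above) should give a loss of about ln 2 ≈ 0.693. The tensors below are made up for illustration.

```python
import math
import torch
from torch import nn

criterion = nn.BCELoss()

# two samples, ten tags, every probability exactly 0.5
probs = torch.full((2, 10), 0.5)

# multi-hot float targets (BCELoss expects floats, not integer class ids)
targets = torch.zeros(2, 10)
targets[0, 3] = 1.0
targets[1, 7] = 1.0

loss = criterion(probs, targets).item()
print(round(loss, 4), round(math.log(2), 4))
```

A first-epoch training loss well below this value is a simple sign that the model has started to learn.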

In [None]:
N_EPOCHS = 10
batch_size = 32

# initialization
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
     
    #train the model
    train_loss   = train(X_train, y_train, batch_size)
    
    #evaluate the model
    valid_loss   = evaluate(X_valid, y_valid, batch_size)

    print('\nEpoch :',epoch,
          'Training loss:',round(train_loss,4),
          '\tValidation loss:',round(valid_loss,4))

    #save the best model 
    if best_valid_loss >= valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'saved_weights.pt') 
        print("\n----------------------------------------------------Saved best model------------------------------------------------------------------")   




Epoch : 0 Training loss: 0.3733 	Validation loss: 0.3186

----------------------------------------------------Saved best model------------------------------------------------------------------

Epoch : 1 Training loss: 0.2379 	Validation loss: 0.3272

Epoch : 2 Training loss: 0.1268 	Validation loss: 0.3728

Epoch : 3 Training loss: 0.0383 	Validation loss: 0.4867

Epoch : 4 Training loss: 0.0103 	Validation loss: 0.6054

Epoch : 5 Training loss: 0.0031 	Validation loss: 0.6885

Epoch : 6 Training loss: 0.0014 	Validation loss: 0.7525

Epoch : 7 Training loss: 0.0009 	Validation loss: 0.8011

Epoch : 8 Training loss: 0.001 	Validation loss: 0.8371

Epoch : 9 Training loss: 0.0005 	Validation loss: 0.8525
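In the log above, the validation loss is lowest at epoch 0 and rises in every later epoch while the training loss keeps falling, so the saved checkpoint comes from epoch 0 and the remaining epochs only overfit. One optional way to avoid the extra epochs is to stop once the validation loss has not improved for a few epochs. The sketch below applies that rule to the logged validation losses; the function name and the patience of 2 are assumptions, and this is not part of the original training loop.

```python
def best_epoch_with_patience(valid_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a simple early-stopping rule."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(valid_losses):
        if loss < best_loss:
            # new best: remember it and reset the counter
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(valid_losses) - 1

# validation losses copied from the training log above
logged = [0.3186, 0.3272, 0.3728, 0.4867, 0.6054,
          0.6885, 0.7525, 0.8011, 0.8371, 0.8525]

best, stopped = best_epoch_with_patience(logged, patience=2)
print(best, stopped)
```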


# 7. Model Evaluation for LSTM

Load the weights of the best model; the model is then ready to make predictions.

In [None]:
#load weights of best model
path='saved_weights.pt'
model.load_state_dict(torch.load(path))

<All keys matched successfully>

In [None]:
#predict probabilities
y_pred_prob = predict(X_valid, batch_size)

In [None]:
y_pred_prob[0]

array([0.00569856, 0.04902776, 0.09942988, 0.02161415, 0.09411245,
       0.04785799, 0.19031535, 0.76762176, 0.50701326, 0.05482058],
      dtype=float32)

In [None]:
score=[]

for thresh in threshold:
    
    #classes for each threshold
    y_pred = classify(y_pred_prob, thresh) 

    #convert to 1d array
    y_pred_1d    =  y_pred.ravel()
    y_true_1d    =  y_true.ravel()
 
    score.append(metrics.f1_score(y_true_1d, y_pred_1d))

In [None]:
# find the optimal threshold
opt = threshold[score.index(max(score))]
print(opt)

0.33
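The threshold search above flattens the label matrix before scoring, so every (question, tag) pair counts as one binary decision; this corresponds to a micro-averaged F1. The sketch below repeats the same sweep on a tiny made-up example with a hand-written F1 function. The data, the threshold grid, and the helper `f1_binary` are assumptions made for illustration; the notebook uses `metrics.f1_score` and its own `threshold` list.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 for flat 0/1 arrays: 2*TP / (2*TP + FP + FN)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# made-up labels and probabilities: 2 questions, 3 tags
y_true = np.array([[0, 1, 1], [1, 0, 0]])
y_prob = np.array([[0.22, 0.41, 0.90], [0.37, 0.12, 0.28]])

# hypothetical grid of candidate thresholds
thresholds = np.round(np.arange(0.1, 1.0, 0.1), 1)

scores = [
    f1_binary(y_true.ravel(), (y_prob >= t).astype(int).ravel())
    for t in thresholds
]
best = thresholds[int(np.argmax(scores))]
print(best, max(scores))
```

Tuning the threshold on the validation set, as done here, means the reported validation scores are slightly optimistic; a separate held-out set would give a cleaner estimate.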


In [None]:
#predictions for optimal threshold
y_pred = classify(y_pred_prob, opt)

In [None]:
#converting to 1D
y_pred_1d = y_pred.ravel()

#Classification report
print(metrics.classification_report(y_true_1d, y_pred_1d))

              precision    recall  f1-score   support

         0.0       0.92      0.91      0.91     17478
         1.0       0.67      0.70      0.68      4732

    accuracy                           0.86     22210
   macro avg       0.79      0.80      0.80     22210
weighted avg       0.86      0.86      0.86     22210



In [None]:
y_pred_label = mlb.inverse_transform(np.array(y_pred))

In [None]:
df = pd.DataFrame({'comment':queries,'actual':y_true_label,'predictions':y_pred_label})

In [None]:
df.head()

| | comment | actual | predictions |
|---|---|---|---|
| 0 | consider the following model y if g x beta u and y otherwise where u is iid according to some distribution function f i want to recover the distribution f without making too many assumptions that ... | (logistic, regression) | (regression, self study) |
| 1 | i am encountering the following problems and i don t really know which model a should pick all model selection criteria indicate that i should take the model with lag after building the var model ... | (hypothesis testing, time series) | (hypothesis testing, r, regression) |
| 2 | basically i m attempting to recreate the results of an example from class in r what i m trying to do is decide whether it s best to use a single regression line for an entire data set or two lines... | (r, regression) | (r, regression) |
| 3 | in general i standardize my independent variables in regressions in order to properly compare the coefficients this way they have the same units standard deviations however with panel longitudinal... | (r, regression) | (r, regression) |
| 4 | let v to be forecasted value for periods through t and v t be its forecasted value at time t we express v t as the sum of two terms its mean at time t and its deviation from the mean at time t eps... | (r, time series) | (time series,) |


In [None]:
text = "For example, in the case of logistic regression, the learning function is a Sigmoid function that tries to separate the 2 classes"

tags = predict_tags(text)
print("Query: ",text)
print("Predicted tags:",tags)

Query:  For example, in the case of logistic regression, the learning function is a Sigmoid function that tries to separate the 2 classes
Predicted tags: ('logistic', 'r', 'regression')
