# Contents

In this notebook I will start with the very basics of RNN's and build all the way up to the latest deep learning architectures for solving NLP problems. It will cover the following:
* Simple RNN's
* Word Embeddings : Definition and How to get them
* LSTM's
* GRU's
* BI-Directional RNN's
* Encoder-Decoder Models (Seq2Seq Models)
* Attention Models
* Transformers - Attention is all you need
* BERT

I will divide every Topic into four subsections:
* Basic Overview
* In-Depth Understanding : Here I will attach links to articles and videos for learning about the topic in depth
* Code-Implementation
* Code Explanation

This is a comprehensive kernel, and if you follow along till the end, I promise you will learn all of these techniques.

Note that the aim of this notebook is not to get a high LB score but to present a beginner's guide to the deep learning techniques used for NLP. After discussing all of these ideas, I will also present a starter solution for this competition.

In [4]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm
from sklearn.model_selection import train_test_split
import tensorflow as tf
from keras.models import Sequential
from keras.layers import LSTM, GRU,SimpleRNN
from keras.layers import Dense, Activation, Dropout
from keras.layers import Embedding
from keras.layers import BatchNormalization
# from keras.utils import np_utils
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D
from keras.preprocessing import sequence, text
from keras.callbacks import EarlyStopping


In [5]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from plotly import graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff

In [6]:
# Detect hardware, return appropriate distribution strategy
try:
    # TPU detection. No parameters necessary if TPU_NAME environment variable is
    # set: this is always the case on Kaggle.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    # Default distribution strategy in Tensorflow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)

REPLICAS:  1


In [7]:
train = pd.read_csv('jigsaw-toxic-comment-train.csv')
validation = pd.read_csv('validation.csv')
test = pd.read_csv('test.csv')

In [8]:
train.shape

(223549, 8)

In [9]:
train.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [10]:
test.head()

Unnamed: 0,id,content,lang
0,0,Doctor Who adlı viki başlığına 12. doctor olar...,tr
1,1,"Вполне возможно, но я пока не вижу необходимо...",ru
2,2,"Quindi tu sei uno di quelli conservativi , ...",it
3,3,Malesef gerçekleştirilmedi ancak şöyle bir şey...,tr
4,4,:Resim:Seldabagcan.jpg resminde kaynak sorunu ...,tr


In [11]:
validation.head()

Unnamed: 0,id,comment_text,lang,toxic
0,0,Este usuario ni siquiera llega al rango de ...,es,0
1,1,Il testo di questa voce pare esser scopiazzato...,it,0
2,2,Vale. Sólo expongo mi pasado. Todo tiempo pasa...,es,1
3,3,Bu maddenin alt başlığı olarak uluslararası i...,tr,0
4,4,Belçika nın şehirlerinin yanında ilçe ve belde...,tr,0


We will drop the other label columns and approach this as a binary classification problem. To make the models easier to train, we will also do the exercise on a small subset of the dataset (the first 12,001 rows, because .loc[:12000] includes its end label).

In [12]:
train.drop(['severe_toxic','obscene','threat','insult','identity_hate'],axis=1,inplace=True)
train.head()

Unnamed: 0,id,comment_text,toxic
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0


In [13]:
train = train.loc[:12000,:]
train.shape

(12001, 3)

We will check the maximum number of words present in a comment; this will help us with padding later.

In [15]:
train['comment_text'].apply(lambda x:len(str(x).split()))

0         43
1         17
2         42
3        113
4         13
        ... 
11996     51
11997      4
11998    132
11999     10
12000    195
Name: comment_text, Length: 12001, dtype: int64

In [16]:
train['comment_text'].apply(lambda x:len(str(x).split())).max()

1403

Next we write a function that returns the AUC score on the validation data.

In [17]:
def roc_auc(predictions, target):
    '''
    Return the ROC AUC score for the given predictions and labels.
    '''
    # roc_curve returns the false/true positive rates at every threshold
    fpr, tpr, thresholds = metrics.roc_curve(target, predictions)
    # the AUC is the area under that curve
    return metrics.auc(fpr, tpr)
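
To build some intuition for what this score measures: the AUC can be read as the probability that a randomly chosen toxic comment gets a higher score than a randomly chosen non-toxic one. The sketch below computes it that way on a toy example. The function name pairwise_auc and the toy scores are my own illustration, not part of sklearn, and the double loop is only meant for small inputs.

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy example (made-up scores, not taken from the competition data)
toy_scores = [0.9, 0.8, 0.35, 0.3, 0.1]
toy_labels = [1, 0, 1, 0, 0]
toy_auc = pairwise_auc(toy_scores, toy_labels)
print(round(toy_auc, 4))  # 5 of the 6 positive/negative pairs are ordered correctly
```

Because only the ordering of the scores matters, the AUC is less sensitive to class imbalance than plain accuracy.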

### Data Preparation

In [19]:
xtrain, xvalid, ytrain, yvalid = train_test_split(train.comment_text.values, train.toxic.values, 
                                                  stratify=train.toxic.values, 
                                                  random_state=42, 
                                                  test_size=0.2, shuffle=True)

## Here train.comment_text.values is our independent variable (the input text) and train.toxic.values is the dependent or target variable.
## We stratify the split on the target variable because the data is imbalanced.
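
To see what stratifying does, here is a minimal pure-Python sketch of a stratified split. It is not sklearn's implementation; the helper name stratified_split and the toy labels are assumptions made for illustration. The idea is simply to split each class separately so that both parts keep the same share of toxic comments.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, valid_idx) with roughly the same class shares in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, valid_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                          # shuffle within each class
        n_valid = round(len(indices) * test_size)     # same fraction taken from every class
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return sorted(train_idx), sorted(valid_idx)


# Toy imbalanced labels: 10% positive, 90% negative
toy_labels = [1] * 10 + [0] * 90
train_idx, valid_idx = stratified_split(toy_labels)
train_rate = sum(toy_labels[i] for i in train_idx) / len(train_idx)
valid_rate = sum(toy_labels[i] for i in valid_idx) / len(valid_idx)
print(len(train_idx), len(valid_idx), train_rate, valid_rate)
```

Without stratification, a small validation set drawn from imbalanced data could end up with very few toxic examples, which would make the validation AUC noisy.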

# Simple RNN

## Basic Overview

What is a RNN?

A Recurrent Neural Network (RNN) is a type of neural network in which the output from the previous step is fed as input to the current step. In traditional neural networks, all the inputs and outputs are independent of each other, but in cases such as predicting the next word of a sentence, the previous words are required, and hence there is a need to remember them. Thus RNN's came into existence, which solve this issue with the help of a hidden state that is carried from one step to the next.

Why RNN's? 
- They are built for sequential data (e.g., time series, natural language)
    - The input is consumed one step at a time, so there is no need to cut it into fixed-size frames or to process all positions in parallel
    
How does an RNN work?

1. The words of a sentence are fed to the network one at a time, in order. At each timestep the network combines the current word with a hidden state that summarizes the words seen so far. This hidden state acts as a "memory" of the previously seen words.
2. The updated hidden state is passed on to the next timestep along with the next word. This allows the network to capture dependencies between words.
3. Each word in the sentence is treated as a separate "token" and is processed at its own timestep, with the same weights reused at every timestep.

https://www.quora.com/Why-do-we-use-an-RNN-instead-of-a-simple-neural-network
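
The steps above can be sketched in a few lines of NumPy. This is a minimal vanilla RNN forward pass with random toy weights, not the Keras SimpleRNN implementation; the names rnn_forward, W_xh, W_hh and b_h are my own and the sizes are arbitrary.

```python
import numpy as np


def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over one sequence and return the hidden state at every timestep.

    inputs: array of shape (timesteps, input_dim), one row per word vector
    """
    hidden = np.zeros(W_hh.shape[0])      # the initial "memory" is all zeros
    states = []
    for x_t in inputs:                    # one word vector per timestep
        # new memory = tanh(current input + previous memory), same weights at every step
        hidden = np.tanh(x_t @ W_xh + hidden @ W_hh + b_h)
        states.append(hidden)
    return np.stack(states)


rng = np.random.default_rng(0)
timesteps, input_dim, units = 5, 8, 4     # toy sizes
sequence = rng.normal(size=(timesteps, input_dim))
W_xh = rng.normal(scale=0.1, size=(input_dim, units))
W_hh = rng.normal(scale=0.1, size=(units, units))
b_h = np.zeros(units)

states = rnn_forward(sequence, W_xh, W_hh, b_h)
print(states.shape)  # (5, 4): one hidden state per timestep
```

For classification we would normally keep only the last hidden state, states[-1], and feed it to a dense layer.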

## In-Depth Understanding

* https://medium.com/mindorks/understanding-the-recurrent-neural-network-44d593f112a2
* https://www.youtube.com/watch?v=2E65LDnM2cA&list=PL1F3ABbhcqa3BBWo170U4Ev2wfsF7FN8l
* https://www.d2l.ai/chapter_recurrent-neural-networks/rnn.html

## Code Implementation

So first I will implement the model, and then I will explain the code step by step.

In [20]:
# using keras tokenizer here
token = text.Tokenizer(num_words=None)
token

## Here we create a tokenizer object using the Keras library for text preprocessing.
## num_words=None means that all the words present in the data are kept.

<keras.preprocessing.text.Tokenizer at 0x189f03981f0>

In [22]:
xtrain.shape, xvalid.shape, len(list(xtrain) + list(xvalid))

((9600,), (2401,), 12001)

In [23]:
max_len = 1500 
# Sequences produced by the tokenizer will be padded or truncated to max_len = 1500 tokens,
# which is a little above the longest comment measured above (1403 whitespace-separated words).
token.fit_on_texts(list(xtrain) + list(xvalid))
# The fit_on_texts method is used to update the internal vocabulary based on the text data provided

In [26]:
xtrain_seq = token.texts_to_sequences(xtrain)
xvalid_seq = token.texts_to_sequences(xvalid)

## The texts_to_sequences method is used to convert each text in the input data (xtrain and xvalid) 
## into a sequence of integers, where each integer corresponds to the index of a word in the tokenizer's internal vocabulary.
## This step essentially transforms the raw text data into sequences of numerical values,
# which can be used as input to neural networks or other machine learning models.

In [29]:
#zero pad the sequences 
xtrain_pad = tf.keras.utils.pad_sequences(xtrain_seq, maxlen=max_len)
xvalid_pad = tf.keras.utils.pad_sequences(xvalid_seq, maxlen=max_len)
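
To make the padding step concrete, here is a small pure-Python sketch that mimics what pad_sequences does with its default settings (padding='pre' and truncating='pre'): zeros are added at the front of short sequences, and long sequences keep only their last maxlen tokens. The helper pad_pre is a simplified stand-in I wrote for illustration; the real function returns a NumPy array.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value` and keep only the last `maxlen` tokens of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                            # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)   # pad at the front
    return padded


toy = [[5, 3], [7, 1, 4, 9, 2, 8]]
toy_padded = pad_pre(toy, maxlen=4)
print(toy_padded)  # [[0, 0, 5, 3], [4, 9, 2, 8]]
```

Padding at the front means the real words sit at the end of the sequence, closest to the final hidden state that the RNN hands to the dense layer.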

In [30]:
word_index = token.word_index
word_index # getting the dict of the vocabulary

{'the': 1,
 'to': 2,
 'of': 3,
 'and': 4,
 'a': 5,
 'you': 6,
 'i': 7,
 'is': 8,
 'that': 9,
 'in': 10,
 'it': 11,
 'for': 12,
 'this': 13,
 'not': 14,
 'on': 15,
 'be': 16,
 'as': 17,
 'are': 18,
 'have': 19,
 'with': 20,
 'your': 21,
 'if': 22,
 'was': 23,
 'article': 24,
 'or': 25,
 'but': 26,
 'page': 27,
 'my': 28,
 'an': 29,
 'wikipedia': 30,
 'by': 31,
 'from': 32,
 'do': 33,
 'at': 34,
 'about': 35,
 'me': 36,
 'so': 37,
 'talk': 38,
 'can': 39,
 'what': 40,
 'there': 41,
 'all': 42,
 'has': 43,
 'no': 44,
 'will': 45,
 'one': 46,
 'would': 47,
 'like': 48,
 'please': 49,
 'he': 50,
 'just': 51,
 'they': 52,
 'any': 53,
 'which': 54,
 'been': 55,
 'more': 56,
 'other': 57,
 'we': 58,
 "don't": 59,
 'his': 60,
 'should': 61,
 'some': 62,
 'here': 63,
 'see': 64,
 'who': 65,
 'also': 66,
 'because': 67,
 'know': 68,
 'am': 69,
 'think': 70,
 "i'm": 71,
 'edit': 72,
 'how': 73,
 'up': 74,
 'why': 75,
 'out': 76,
 "it's": 77,
 'then': 78,
 'people': 79,
 'use': 80,
 'only': 81,
 ...

In [31]:
%%time
with strategy.scope():
    # A simpleRNN without any pretrained embeddings and one dense layer
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,
                     300,
                     input_length=max_len))
    model.add(SimpleRNN(100))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 1500, 300)         13049100  
                                                                 
 simple_rnn (SimpleRNN)      (None, 100)               40100     
                                                                 
 dense (Dense)               (None, 1)                 101       
                                                                 
Total params: 13,089,301
Trainable params: 13,089,301
Non-trainable params: 0
_________________________________________________________________
CPU times: total: 828 ms
Wall time: 738 ms


model = Sequential(): This creates a sequential model, which is a linear stack of layers.

model.add(Embedding(len(word_index) + 1, 300, input_length=max_len)): This adds an Embedding layer to the model. The embedding layer converts integer-encoded words (represented by indices) into dense vectors of fixed size. It is common in NLP tasks to use pre-trained word embeddings, but here the embeddings are learned from scratch.

* len(word_index) + 1 is the input dimension, i.e. the size of the vocabulary plus one, because index 0 is reserved for padding.
* 300 is the size of the dense embedding vector for each word.
* input_length=max_len specifies the length of the input sequences.

So each word in a comment is represented by a 300-dimensional vector, and each comment occupies max_len (i.e. 1500) positions: longer comments are truncated and shorter ones are padded with zeros.

model.add(SimpleRNN(100)): This adds a Simple RNN (Recurrent Neural Network) layer with 100 units to the model. RNNs are capable of processing sequences of data.

model.add(Dense(1, activation='sigmoid')): This adds a dense layer with a single unit and a sigmoid activation function. The output is binary, and sigmoid is commonly used for binary classification problems.

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']): This compiles the model. It specifies the loss function (binary_crossentropy for binary classification), the optimizer (adam), and the evaluation metric (accuracy).
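
As a quick sanity check, the parameter counts printed by model.summary() can be reproduced by hand. The vocabulary size of 43,496 words is taken from the progress bar printed when the embedding matrix is built later in this notebook; the formulas are the standard ones for an embedding table, a simple recurrent layer and a dense layer.

```python
vocab_size = 43496 + 1      # len(word_index) + 1, index 0 is reserved for padding
embedding_dim = 300
rnn_units = 100

# Embedding: one 300-d vector per vocabulary entry
embedding_params = vocab_size * embedding_dim
# SimpleRNN: input weights + recurrent weights + one bias per unit
rnn_params = rnn_units * (embedding_dim + rnn_units + 1)
# Dense(1): one weight per RNN unit + one bias
dense_params = rnn_units * 1 + 1

total_params = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total_params)
# 13049100 40100 101 13089301
```

Almost all of the parameters sit in the embedding table, which is one reason why reusing pretrained embeddings is attractive on a small training set.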

In [34]:
model.fit(xtrain_pad,ytrain,epochs=5, batch_size=64)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x18988f46b50>

In [35]:
scores = model.predict(xvalid_pad)
print("AUC: %.2f" % (roc_auc(scores,yvalid)))  # AUC is a value between 0 and 1, not a percentage

AUC: 0.73


In [36]:
scores_model = []
scores_model.append({'Model': 'SimpleRNN','AUC_Score': roc_auc(scores,yvalid)})

## Code Explanation
* Tokenization<br><br>
 If you have watched the videos and referred to the links, you will know that an RNN reads a sentence word by word. Conceptually, every word is represented as a one-hot vector of dimension: number of words in the vocabulary + 1. <br>
  The Keras Tokenizer takes all the unique words in the corpus and forms a dictionary with words as keys and their numbers of occurrences as values. It then sorts the dictionary in descending order of counts and assigns index 1 to the first word, 2 to the second, and so on. So if the word 'the' occurs most often in the corpus, it is assigned index 1, and the vector representing 'the' is a one-hot vector with value 1 at position 1 and zeros elsewhere. In practice the Embedding layer works directly from these integer indices, so the one-hot vectors are never built explicitly.<br>
  Try printing the first element of xtrain_seq: you will see that every word is now represented by an integer
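
The frequency-ranked indexing described above can be sketched in pure Python. This is a simplified stand-in for the Keras Tokenizer, not its actual code: the helper names build_word_index and to_sequence and the regular expression used to split words are my own assumptions, and the real Tokenizer filters punctuation somewhat differently.

```python
import re
from collections import Counter


def build_word_index(texts):
    """Map each word to a 1-based rank, with the most frequent word first."""
    counts = Counter()
    for t in texts:
        counts.update(re.findall(r"[a-z']+", t.lower()))
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequence(sentence, word_index):
    """Replace every known word by its index and drop words outside the vocabulary."""
    return [word_index[w] for w in re.findall(r"[a-z']+", sentence.lower()) if w in word_index]


toy_corpus = ["the cat sat on the mat", "the dog sat"]
toy_index = build_word_index(toy_corpus)
toy_seq = to_sequence("the dog sat on the sofa", toy_index)
print(toy_index)  # {'the': 1, 'sat': 2, 'cat': 3, 'on': 4, 'mat': 5, 'dog': 6}
print(toy_seq)    # [1, 6, 2, 4, 1]  ('sofa' is not in the vocabulary, so it is dropped)
```

Index 0 is never assigned to a word, which is why it can safely be used as the padding value.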

In [39]:
xtrain_seq[:1], len(xtrain_seq[:1][0])

([[664,
   65,
   7,
   19,
   2262,
   14102,
   5,
   2262,
   20439,
   6071,
   4,
   71,
   32,
   20440,
   6620,
   39,
   6,
   664,
   65,
   11,
   8,
   20441,
   1502,
   38,
   6072]],
 25)

# Word Embeddings

While building our simple RNN model we talked about using word embeddings. So what are word embeddings, and how do we get them?
Here is the answer:
* https://www.coursera.org/learn/nlp-sequence-models/lecture/6Oq70/word-representation
* https://machinelearningmastery.com/what-are-word-embeddings/
<br> <br>
A common way of getting word embeddings is to use pretrained GloVe or fastText vectors. Without going into too much detail, I will explain how to create sentence vectors and how we can build a machine learning model on top of them. I am a fan of GloVe, word2vec and fastText vectors; in this notebook I'll be using the GloVe vectors. You can download the GloVe vectors from here http://www-nlp.stanford.edu/data/glove.840B.300d.zip or you can search for GloVe in datasets on Kaggle and add the file. Note that the code below loads the smaller 100-dimensional file glove.6B.100d.txt rather than the 840B 300-dimensional one.

In [40]:
# load the GloVe vectors in a dictionary:

embeddings_index = {}
# raw string so the backslashes in the Windows path are not treated as escape sequences
with open(r'D:\kiran\RNN\glove.6B.100d.txt', 'r', encoding='utf-8') as f:
    for line in tqdm(f):
        values = line.rstrip().split(' ')
        word = values[0]                                  # first field is the word itself
        coefs = np.asarray(values[1:], dtype='float32')   # remaining fields are the vector
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

400000it [00:21, 18792.97it/s]

Found 400000 word vectors.
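
One simple way to turn word vectors into a sentence vector is to average the vectors of the words in the sentence. The sketch below shows this baseline on a toy dictionary with made-up 3-dimensional vectors; with the real data you would pass embeddings_index and dim=100 instead. The helper name sentence_vector and the whitespace tokenization are assumptions for illustration, and averaging is only one of several possible choices.

```python
import numpy as np


def sentence_vector(sentence, embeddings, dim):
    """Average the vectors of the words found in `embeddings`; return zeros if none are found."""
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


# Toy "embeddings" with made-up values (real GloVe vectors have 50 to 300 dimensions)
toy_embeddings = {
    "good": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([3.0, 2.0, 0.0]),
}
toy_vec = sentence_vector("Good movie indeed", toy_embeddings, dim=3)
print(toy_vec)  # [2. 1. 1.]
```

Such fixed-size sentence vectors can be fed to any classical classifier, but they ignore word order, which is exactly what the recurrent models in this notebook try to exploit.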





# LSTM's

## Basic Overview

Simple RNN's handle sequences more naturally than classical ML algorithms, but they fail to capture the long-term dependencies present in sentences, largely because of vanishing gradients. LSTM's (Long Short-Term Memory networks) were introduced by Hochreiter and Schmidhuber in 1997 to counter these drawbacks.

## In-Depth Understanding

Why LSTM's?
* https://www.coursera.org/learn/nlp-sequence-models/lecture/PKMRR/vanishing-gradients-with-rnns
* https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/

What are LSTM's?
* https://www.coursera.org/learn/nlp-sequence-models/lecture/KXoay/long-short-term-memory-lstm
* https://distill.pub/2019/memorization-in-rnns/
* https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21

## Code Implementation

We have already tokenized and padded our text for input to LSTM's

In [42]:
# create an embedding matrix for the words we have in the dataset
embedding_matrix = np.zeros((len(word_index) + 1, 100))
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

  0%|          | 0/43496 [00:00<?, ?it/s]

100%|██████████| 43496/43496 [00:00<00:00, 309782.83it/s]


In [43]:
%%time
with strategy.scope():
    
    # A simple LSTM with glove embeddings and one dense layer
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,
                     100,
                     weights=[embedding_matrix],
                     input_length=max_len,
                     trainable=False))

    model.add(LSTM(100, dropout=0.3, recurrent_dropout=0.3))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam',metrics=['accuracy'])
    
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 1500, 100)         4349700   
                                                                 
 lstm (LSTM)                 (None, 100)               80400     
                                                                 
 dense_1 (Dense)             (None, 1)                 101       
                                                                 
Total params: 4,430,201
Trainable params: 80,501
Non-trainable params: 4,349,700
_________________________________________________________________
CPU times: total: 594 ms
Wall time: 400 ms


In [44]:
model.fit(xtrain_pad, ytrain, epochs=2, batch_size=64)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x1899228ca90>

In [45]:
scores = model.predict(xvalid_pad)
print("AUC: %.2f" % (roc_auc(scores,yvalid)))  # AUC is a value between 0 and 1, not a percentage

AUC: 0.94


In [46]:
scores_model.append({'Model': 'LSTM','AUC_Score': roc_auc(scores,yvalid)})

In [47]:
scores_model

[{'Model': 'SimpleRNN', 'AUC_Score': 0.7277841044948511},
 {'Model': 'LSTM', 'AUC_Score': 0.9443928850775484}]

## Code Explanation

As a first step we build the embedding matrix for our vocabulary from the pretrained GloVe vectors. Then, while building the Embedding layer, we pass the embedding matrix as the layer's weights instead of learning them from our data, and we set trainable=False so that they stay fixed during training.
The rest of the model is the same as before, except that we have replaced the SimpleRNN layer with an LSTM layer.
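
To make the embedding-matrix step concrete, here is a toy version with a three-word vocabulary and made-up 2-dimensional vectors. The dictionaries toy_word_index and toy_pretrained are hypothetical stand-ins for word_index and embeddings_index. The sketch shows that row 0 (the padding index) and the rows of words without a pretrained vector stay all zeros.

```python
import numpy as np

toy_word_index = {"the": 1, "cat": 2, "zzyzx": 3}    # hypothetical tokenizer vocabulary
toy_pretrained = {                                   # hypothetical 2-d "GloVe" vectors
    "the": np.array([0.1, 0.2]),
    "cat": np.array([0.3, 0.4]),
}

# +1 row because index 0 is the padding id and never belongs to a word
toy_matrix = np.zeros((len(toy_word_index) + 1, 2))
for word, i in toy_word_index.items():
    vector = toy_pretrained.get(word)
    if vector is not None:           # words without a pretrained vector keep a zero row
        toy_matrix[i] = vector

covered = sum(word in toy_pretrained for word in toy_word_index)
print(toy_matrix)
print(f"{covered}/{len(toy_word_index)} words have a pretrained vector")
```

Running the same coverage count on the real word_index is a quick way to see how much of the vocabulary the frozen embedding layer can actually represent.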

* Comments on the Model

The model achieves an AUC score of 0.94 on the validation split, which is quite commendable and a large improvement over the SimpleRNN's 0.73.
In this case we used dropout (dropout=0.3 and recurrent_dropout=0.3 in the LSTM layer) to reduce the risk of overfitting the data.

# GRU's

## Basic Overview

Introduced by Cho et al. in 2014, the GRU (Gated Recurrent Unit) aims to solve the vanishing gradient problem that comes with a standard recurrent neural network. GRU's can be seen as a variation on the LSTM: both use gates to control what is kept in memory, but the GRU was designed to be simpler and faster. In many cases the two produce equally good results, so there is no clear winner.
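
To show how the gates work, here is a minimal NumPy sketch of a single GRU timestep with random toy weights. It follows the common textbook formulation with an update gate z and a reset gate r; the parameter names (W_z, U_z, b_z and so on) are my own, and the Keras GRU layer differs in some implementation details such as how the biases are arranged.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_step(x_t, h_prev, p):
    """One GRU timestep; `p` is a dict holding the weight matrices and biases."""
    z = sigmoid(x_t @ p["W_z"] + h_prev @ p["U_z"] + p["b_z"])   # update gate
    r = sigmoid(x_t @ p["W_r"] + h_prev @ p["U_r"] + p["b_r"])   # reset gate
    # candidate state: the reset gate decides how much of the old state is used
    h_cand = np.tanh(x_t @ p["W_h"] + (r * h_prev) @ p["U_h"] + p["b_h"])
    # convention used here: z close to 1 keeps the old state, z close to 0 takes the candidate
    return z * h_prev + (1.0 - z) * h_cand


rng = np.random.default_rng(0)
input_dim, units = 6, 3                       # toy sizes
params = {}
for gate in ("z", "r", "h"):
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(input_dim, units))
    params[f"U_{gate}"] = rng.normal(scale=0.1, size=(units, units))
    params[f"b_{gate}"] = np.zeros(units)

h = np.zeros(units)
for x_t in rng.normal(size=(4, input_dim)):   # a toy sequence of 4 timesteps
    h = gru_step(x_t, h, params)
print(h.shape)  # (3,)
```

Compared with an LSTM, there is no separate cell state and there are only two gates, which is where the savings in parameters and computation come from.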

## In-Depth Explanation

* https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be
* https://www.coursera.org/learn/nlp-sequence-models/lecture/agZiL/gated-recurrent-unit-gru
* https://www.geeksforgeeks.org/gated-recurrent-unit-networks/

## Code Implementation

In [49]:
%%time
with strategy.scope():
    # GRU with glove embeddings and one dense layer
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,
                     100,
                     weights=[embedding_matrix],
                     input_length=max_len,
                     trainable=False))
    model.add(SpatialDropout1D(0.3))
    model.add(GRU(300))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(loss='binary_crossentropy', optimizer='adam',metrics=['accuracy'])
    
model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 1500, 100)         4349700   
                                                                 
 spatial_dropout1d (SpatialD  (None, 1500, 100)        0         
 ropout1D)                                                       
                                                                 
 gru (GRU)                   (None, 300)               361800    
                                                                 
 dense_2 (Dense)             (None, 1)                 301       
                                                                 
Total params: 4,711,801
Trainable params: 362,101
Non-trainable params: 4,349,700
_________________________________________________________________
CPU times: total: 750 ms
Wall time: 687 ms


In [50]:
model.fit(xtrain_pad, ytrain, epochs=1, batch_size=64)



<keras.callbacks.History at 0x18a0612f790>

In [51]:
scores = model.predict(xvalid_pad)
print("AUC: %.2f" % (roc_auc(scores,yvalid)))  # AUC is a value between 0 and 1, not a percentage

AUC: 0.94


In [52]:
scores_model.append({'Model': 'GRU','AUC_Score': roc_auc(scores,yvalid)})

In [53]:
scores_model

[{'Model': 'SimpleRNN', 'AUC_Score': 0.7277841044948511},
 {'Model': 'LSTM', 'AUC_Score': 0.9443928850775484},
 {'Model': 'GRU', 'AUC_Score': 0.9371851557655756}]

# Bi-Directional RNN's

## In-Depth Explanation

* https://www.coursera.org/learn/nlp-sequence-models/lecture/fyXnn/bidirectional-rnn
* https://towardsdatascience.com/understanding-bidirectional-rnn-in-pytorch-5bd25a5dd66
* https://d2l.ai/chapter_recurrent-modern/bi-rnn.html

## Code Implementation

In [54]:
%%time
with strategy.scope():
    # A simple bidirectional LSTM with glove embeddings and one dense layer
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,
                     100,
                     weights=[embedding_matrix],
                     input_length=max_len,
                     trainable=False))
    model.add(Bidirectional(LSTM(100, dropout=0.3, recurrent_dropout=0.3)))

    model.add(Dense(1,activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam',metrics=['accuracy'])
    
    
model.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (None, 1500, 100)         4349700   
                                                                 
 bidirectional (Bidirectiona  (None, 200)              160800    
 l)                                                              
                                                                 
 dense_3 (Dense)             (None, 1)                 201       
                                                                 
Total params: 4,510,701
Trainable params: 161,001
Non-trainable params: 4,349,700
_________________________________________________________________
CPU times: total: 609 ms
Wall time: 1.03 s
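
The (None, 200) output shape of the Bidirectional layer in the summary above comes from running two separate recurrent layers, one reading the sequence left to right and one reading it right to left, and concatenating their final states (concatenation is the default merge mode of the Keras Bidirectional wrapper). The sketch below shows the idea with a plain tanh RNN and random toy weights instead of an LSTM; the helper last_state and all sizes are assumptions for illustration.

```python
import numpy as np


def last_state(inputs, W_xh, W_hh):
    """Final hidden state of a vanilla RNN run over `inputs` of shape (timesteps, input_dim)."""
    h = np.zeros(W_hh.shape[0])
    for x_t in inputs:
        h = np.tanh(x_t @ W_xh + h @ W_hh)
    return h


rng = np.random.default_rng(0)
timesteps, input_dim, units = 6, 5, 4         # toy sizes
sequence = rng.normal(size=(timesteps, input_dim))

# Each direction has its own weights (toy random values)
fwd = (rng.normal(scale=0.1, size=(input_dim, units)), rng.normal(scale=0.1, size=(units, units)))
bwd = (rng.normal(scale=0.1, size=(input_dim, units)), rng.normal(scale=0.1, size=(units, units)))

h_forward = last_state(sequence, *fwd)          # reads the sequence left to right
h_backward = last_state(sequence[::-1], *bwd)   # reads the sequence right to left
bi_output = np.concatenate([h_forward, h_backward])
print(bi_output.shape)  # (8,) = 2 * units
```

With 100 LSTM units per direction, the same concatenation gives the 200 features that feed the final dense layer, and the two directions together double the LSTM parameter count.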


In [55]:
model.fit(xtrain_pad, ytrain, epochs=1, batch_size=64)

  5/150 [>.............................] - ETA: 47:53:06 - loss: 0.4574 - accuracy: 0.8906

KeyboardInterrupt: 