This is part 3 of the 'Comprehensive NLP Tutorial' series. In [Part-1](https://www.kaggle.com/kksienc/comprehensive-nlp-tutorial-1-ml-perspective) we performed text processing using ML techniques, and in [Part-2](https://www.kaggle.com/kksienc/comprehensive-nlp-tutorial-2-dl-perspective) we used word embeddings and deep learning algorithms. In this part we will mainly look into the state-of-the-art 'BERT' technique.

<a class="kk" id="0.1"></a>
## Contents

1. [BERT Introduction](#1)
1. [Salient BERT Features](#2)
1. [BERT Implementation](#3)
    1. [BERT Pretrained Layer](#3.1)
    1. [BERT Encoding](#3.2)
        1. [Encoding Sample Example](#3.2.1)
        1. [Encoding The Dataset](#3.2.2)
    1. [BERT Modeling](#3.3)
        
 

# 1. BERT  Introduction  <a class="kk" id="1"></a>
[Back to Contents](#0.1)

[BERT](https://github.com/google-research/bert) stands for <B> Bidirectional Encoder Representations from Transformers </B>. It was created and published in 2018 by Jacob Devlin and his colleagues from Google. The BERT [paper](https://arxiv.org/pdf/1810.04805.pdf) shows that a bidirectionally trained language model can develop a deeper sense of language context and flow than single-direction language models. Since its release, BERT has become a go-to model for NLP and has inspired many NLP architectures such as RoBERTa and XLNet.


<img src="https://storage.googleapis.com/kagglesdsdata/datasets/598303/1088431/bert.png?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1587371932&Signature=L%2FPKC4s3HSp8wLRL2n6%2BgBEpPewTYSJlaBXkO3CyJmMfv%2BlIpYNq7YmK0W57RxmlCkbWcScAU%2BIGUMDO7dwKAwL0s%2F7345AA4suyA3BbQ61bLDJx5HKOz27lnNfQBPP1dpZQ4brnKDxAzcCsS8hOFArf0iZS1%2BHrTHCdWWm2%2Bah9LKmWy8%2F3NDgUJazdZAw76LSP4ULeG4IYNUbkkET%2F4e6zDQw2YGVeQhBamnpmQ3p8FjfMzRNNlyZZfWKBIdtNjl3iomyImAdq8gt0UgTOuHT5%2F2CoZsMLOjM%2B5AIerEf45KZ3FZl5j285%2FqD6pq3IcN4KLvTxyshIrpWYSK2u8Q%3D%3D" width="250">


Let's understand BERT by breaking down its name:

- <B>Bidirectional</B> : BERT takes a whole text passage as input and reads it in both directions to understand the meaning of each word.


- <B>Transformers</B> : BERT is based on a deep Transformer network. A Transformer is a type of network that can process long texts efficiently by using attention. Attention is a mechanism that learns contextual relations between words (or sub-words) in a text.


- <B>Encoder Representation</B> : The original Transformer includes two separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT’s goal is to generate a language model, only the encoder mechanism is necessary, hence 'encoder representation'.
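
To make the attention idea above more concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. It is an illustration only: a real Transformer layer first projects the tokens into separate query, key and value matrices with learned weights and uses several heads, whereas this sketch (with the hypothetical helper `scaled_dot_product_attention` and random toy vectors) reuses the same vectors for all three roles.

```python
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    """Return (output, weights) for single-head scaled dot-product attention."""
    d_k = queries.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = queries @ keys.T / np.sqrt(d_k)
    # Softmax over the key axis (subtract the row max for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted mix of all value vectors.
    return weights @ values, weights

rng = np.random.default_rng(0)
token_vectors = rng.normal(size=(4, 8))  # 4 toy tokens, 8 dimensions each
output, weights = scaled_dot_product_attention(token_vectors, token_vectors, token_vectors)
print(output.shape)          # (4, 8)
print(weights.sum(axis=1))   # every row of attention weights sums to 1
```

Because every token attends to every other token, to its left and to its right, the output vector of a word depends on the whole passage, which is what the 'bidirectional' part of the name refers to.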
 



<B> How does BERT perform bidirectional training? </B>

BERT is trained on the following two prediction tasks simultaneously, with the goal of minimizing the combined loss of the two strategies:

1. <B>Masked Language Model </B>:  Before feeding word sequences into BERT, 15% of the tokens in each sequence are selected for prediction. Most of the selected tokens (80%) are replaced with a [MASK] token, 10% are replaced with a random token, and 10% are left unchanged. The model then attempts to predict the original value of the selected tokens, based on the context provided by the other tokens in the sequence.


2. <B>Next Sentence Prediction </B> :  The model receives pairs of sentences as input and learns to predict if the second sentence in the pair is the subsequent sentence in the original document. During training, 50% of the inputs are a pair in which the second sentence is the subsequent sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence. The assumption is that the random sentence will be disconnected from the first sentence.
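
The two pre-training tasks can be sketched in plain Python. This is a simplified illustration, not the original pre-training code: `mask_tokens` and `make_nsp_pair` are hypothetical helper names, the vocabulary and sentences are toy data, and the 80% / 10% / 10% replacement split for selected tokens follows the description in the BERT paper.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels hold the original token at selected positions, else None."""
    rng = random.Random(seed)
    masked, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                       # position not selected for prediction
        labels[i] = token                  # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels

def make_nsp_pair(sentences, index, rng):
    """Return (sentence_a, sentence_b, is_next) for next sentence prediction."""
    sentence_a = sentences[index]
    if rng.random() < 0.5:
        return sentence_a, sentences[index + 1], 1  # the true next sentence
    # Simplification: pick any other sentence; a real pipeline samples from the corpus.
    candidates = [s for j, s in enumerate(sentences) if j not in (index, index + 1)]
    return sentence_a, rng.choice(candidates), 0

tokens = "the cat sat on the mat and looked at the dog".split()
masked, labels = mask_tokens(tokens, vocab=sorted(set(tokens)))
print(masked)
print(labels)

sentences = ["the sky is dark", "a storm is coming",
             "buy two get one free", "the match starts at noon"]
print(make_nsp_pair(sentences, 0, random.Random(1)))
```

The `labels` list plays the role of the training target for the masked language model, and `is_next` is the binary target for next sentence prediction.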


# 2. Salient BERT Features  <a class="kk" id="2"></a>
[Back to Contents](#0.1)


1. <B>Dynamic Word Embedding </B>: Unlike the word embedding techniques from [Part-2](https://www.kaggle.com/kksienc/comprehensive-nlp-tutorial-2-dl-perspective) (Word2Vec etc.), where each word has a static vector representation, BERT produces word representations that are not fixed but dynamically informed by the words around them.
2. <B>Pre trained Model</B> : BERT uses transfer learning, i.e. it starts from the knowledge of a pretrained model and then fine-tunes it. BERT is pre-trained on a large corpus of unlabelled text, including English Wikipedia and BooksCorpus. The pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models. BERT is not meant to be trained from scratch on a small task-specific dataset.
3. <B>Word Segmentation </B>: Rather than mapping most out-of-vocabulary words to a single placeholder such as 'OOV' or 'UNK', BERT decomposes them into meaningful sub-word and character tokens and then generates embeddings for those pieces.

4. <B> Multilanguage Support </B>: BERT has been adopted by Google Search for over 70 languages.
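
The word segmentation feature can be illustrated with a small sketch of greedy longest-match-first sub-word splitting, the strategy used by BERT's WordPiece tokenizer. The helper name `wordpiece_tokenize` and the tiny vocabulary are made up for this example; the real tokenizer works on a vocabulary of roughly 30,000 pieces and also handles casing and punctuation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                          # try a shorter piece
        if piece is None:
            return [unk_token]                # no piece fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"em", "##bed", "##ding", "##s", "play", "##ing"}
print(wordpiece_tokenize("embeddings", toy_vocab))  # ['em', '##bed', '##ding', '##s']
print(wordpiece_tokenize("playing", toy_vocab))     # ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```

A word only falls back to the unknown token when no combination of pieces from the vocabulary can cover it, which is rare with a full-size vocabulary.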


 

# 3.  BERT  Implementation  <a class="kk" id="3"></a>
[Back to Contents](#0.1)

We will use the [bert-for-tf2](https://pypi.org/project/bert-for-tf2/) package together with Google's TensorFlow for our BERT modeling. The model itself is built with the Keras API.
 

## 3.1 BERT Pretrained Layer  <a class="kk" id="3.1"></a>

First we will fetch our pretrained BERT layer and load the tokenizer.


In [2]:
# to install the TensorFlow BERT package
#!pip install bert-for-tf2

import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras import layers
import bert

# Loading the pretrained BERT layer
BertTokenizer = bert.bert_tokenization.FullTokenizer
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                            trainable=False)


# Loading tokenizer from the bert layer
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = BertTokenizer(vocab_file, do_lower_case)


## 3.2 BERT  Encoding  <a class="kk" id="3.2"></a>


-  Each sentence is first split into tokens.
-  A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.
-  Each token is looked up in BERT's fixed vocabulary, and the sequence is described by the following 3 arrays:
    1. Token IDs - the unique id of each token in BERT’s vocabulary.
    2. Padding IDs (mask ids) - indicate which elements in the sequence are real tokens (1) and which are padding (0).
    3. Segment IDs - distinguish the different sentences in the input.

Let's encode a sample text.
 

### 3.2.1 BERT Text Encoding : Sample Example  <a class="kk" id="3.2.1"></a>
 

In [3]:
text = 'It may have worked better with more examples'
# tokenize
tokens_list = tokenizer.tokenize(text)
print('Text after tokenization')
print(tokens_list)

# initialize the maximum sequence length
max_len = 25
text = tokens_list[:max_len-2]
input_sequence = ["[CLS]"] + text + ["[SEP]"]
print("After adding [CLS] and [SEP]: ")
print(input_sequence)


tokens = tokenizer.convert_tokens_to_ids(input_sequence )
print("tokens to id ")
print(tokens)

pad_len = max_len -len(input_sequence)
tokens += [0] * pad_len
print("tokens: ")
print(tokens)

print(pad_len)
pad_masks = [1] * len(input_sequence) + [0] * pad_len
print("Pad Masking: ")
print(pad_masks)

segment_ids = [0] * max_len
print("Segment Ids: ")
print(segment_ids)

Text after tokenization
['it', 'may', 'have', 'worked', 'better', 'with', 'more', 'examples']
After adding [CLS] and [SEP]: 
['[CLS]', 'it', 'may', 'have', 'worked', 'better', 'with', 'more', 'examples', '[SEP]']
tokens to id 
[101, 2009, 2089, 2031, 2499, 2488, 2007, 2062, 4973, 102]
tokens: 
[101, 2009, 2089, 2031, 2499, 2488, 2007, 2062, 4973, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
15
Pad Masking: 
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Segment Ids: 
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
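
The sample above contains a single sentence, so every segment id is 0. The segment ids only become informative for a sentence pair. The following sketch shows how such a pair could be encoded; `encode_pair` is a hypothetical helper, and the small `toy_ids` dictionary stands in for the tokenizer's vocabulary, reusing the ids printed in the output above.

```python
def encode_pair(tokens_a, tokens_b, token_to_id, max_len):
    """Encode two tokenized sentences as (token_ids, pad_mask, segment_ids)."""
    sequence = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(sequence) > max_len:
        raise ValueError("pair is longer than max_len; truncate the inputs first")
    # Segment 0 covers [CLS] + sentence A + first [SEP]; segment 1 covers sentence B + last [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    pad_len = max_len - len(sequence)
    token_ids = [token_to_id[t] for t in sequence] + [0] * pad_len
    pad_mask = [1] * len(sequence) + [0] * pad_len
    segment_ids += [0] * pad_len              # padding positions get segment 0
    return token_ids, pad_mask, segment_ids

# Toy vocabulary: ids copied from the sample output above.
toy_ids = {"[CLS]": 101, "[SEP]": 102, "it": 2009, "worked": 2499,
           "with": 2007, "more": 2062, "examples": 4973}

ids, mask, segments = encode_pair(["it", "worked"], ["with", "more", "examples"],
                                  toy_ids, max_len=10)
print(ids)       # [101, 2009, 2499, 102, 2007, 2062, 4973, 102, 0, 0]
print(mask)      # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
print(segments)  # [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
```

The dataset used in this tutorial has one text per row, so the rest of the notebook keeps all segment ids at 0.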


### 3.2.2 Encoding Dataset  <a class="kk" id="3.2.2"></a>
 
Now let's fetch and encode our dataset.

In [4]:
# fetch and clean the dataset
import pandas as pd
from nltk.corpus import stopwords 
from nltk.corpus import wordnet
from spellchecker import SpellChecker
from nltk.stem import WordNetLemmatizer 
import nltk 
import re

train_df = pd.read_csv("nlp-getting-started/train.csv")
test_df = pd.read_csv("nlp-getting-started/test.csv")


def convert_to_antonym(sentence):
    words = nltk.word_tokenize(sentence)
    new_words = []
    temp_word = ''
    for word in words:
        antonyms = []
        if word == 'not':
            temp_word = 'not_'
        elif temp_word == 'not_':
            for syn in wordnet.synsets(word):
                for s in syn.lemmas():
                    for a in s.antonyms():
                        antonyms.append(a.name())
            if len(antonyms) >= 1:
                word = antonyms[0]
            else:
                word = temp_word + word # when antonym is not found, it will
                                    # remain not_happy
            
            temp_word = ''
        if word != 'not':
            new_words.append(word)
    return ' '.join(new_words)


def correct_spellings(text):
    spell = SpellChecker()
    corrected_words = []
    misspelled_words = spell.unknown(text.split())
    for word in text.split():
        if word in misspelled_words:
            corrected_words.append(spell.correction(word))
        else:
            corrected_words.append(word)
    return " ".join(corrected_words)
        

 
 
def clean_text(text):
    """
    text: a string

    return: the cleaned string
    """
    text = text.lower()  # lowercase text
    # remove URLs first, while ':' and '/' are still present for the pattern to match
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    text = re.sub(r'[^\w\s#]', ' ', text)  # keep only word characters, whitespace and '#'
    text = re.sub(r'[0-9]', ' ', text)  # replace digits with spaces
    # text = correct_spellings(text)
    text = convert_to_antonym(text)
    text = re.sub(' +', ' ', text)  # collapse repeated spaces into a single space
    return text


train_df['text'] = train_df['text'].apply(clean_text)
test_df['text'] = test_df['text'].apply(clean_text)

sentences= pd.DataFrame(columns=['text'])
sentences['text']= pd.concat([train_df["text"], test_df["text"]])



In [5]:
# function to encode the text into tokens, masks, and segment flags
import numpy as np
def bert_encode(texts, tokenizer, max_len=512):
    all_tokens = []
    all_masks = []
    all_segments = []
    
    for text in texts:
        text = tokenizer.tokenize(text)
            
        text = text[:max_len-2]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)
        
        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        tokens += [0] * pad_len
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        segment_ids = [0] * max_len
        
        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)
    
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)

MAX_LEN = 64

# encode train set
train_input = bert_encode(train_df.text.values, tokenizer, max_len=MAX_LEN)
# encode test set
test_input = bert_encode(test_df.text.values, tokenizer, max_len=MAX_LEN)
train_labels = train_df.target.values

In [6]:
# let's look at the encoded train set
train_input

(array([[  101,  2256, 15616, ...,     0,     0,     0],
        [  101,  3224,  2543, ...,     0,     0,     0],
        [  101,  2035,  3901, ...,     0,     0,     0],
        ...,
        [  101,  1049, 11396, ...,     0,     0,     0],
        [  101,  2610, 11538, ...,     0,     0,     0],
        [  101,  1996,  6745, ...,     0,     0,     0]]),
 array([[1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        ...,
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0]]),
 array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]]))

The input is a tuple of 3 arrays: the first holds the token ids, the second the mask ids and the third the segment ids.
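
Before training, it can be useful to check that the three encoded arrays are consistent with each other. The sketch below uses a hypothetical helper, `check_bert_inputs`, and toy arrays. It assumes, as in the outputs above, that the [CLS] id is 101 and the padding id is 0.

```python
import numpy as np

def check_bert_inputs(token_ids, masks, segments, max_len):
    """Validate the encoded arrays and return the number of real tokens per row."""
    assert token_ids.shape == masks.shape == segments.shape
    assert token_ids.shape[1] == max_len
    assert np.all(token_ids[:, 0] == 101)  # every row starts with [CLS]
    # The mask must be 1 exactly where a real (non-padding) token sits.
    assert np.array_equal(masks, (token_ids != 0).astype(masks.dtype))
    return masks.sum(axis=1)

toy_tokens = np.array([[101, 2009, 2499, 102, 0, 0],
                       [101, 2062, 4973, 2488, 2007, 102]])
toy_masks = (toy_tokens != 0).astype(int)
toy_segments = np.zeros_like(toy_tokens)

lengths = check_bert_inputs(toy_tokens, toy_masks, toy_segments, max_len=6)
print(lengths)  # [4 6]
```

With the real data the call would be `check_bert_inputs(*train_input, max_len=MAX_LEN)`. Rows whose length equals `MAX_LEN` may have been truncated, which gives a rough hint whether the chosen maximum length is large enough.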

## 3.3 BERT Modeling  <a class="kk" id="3.3"></a>

Let's build and train a basic BERT model.

In [7]:
# first define input for token, mask and segment id  
from tensorflow.keras.layers import  Input
input_word_ids = Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_word_ids")
input_mask = Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_mask")
segment_ids = Input(shape=(MAX_LEN,), dtype=tf.int32, name="segment_ids")

#  output  
from tensorflow.keras.layers import Dense
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])  
clf_output = sequence_output[:, 0, :]
out = Dense(1, activation='sigmoid')(clf_output)   

# initialize model
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
# `learning_rate` replaces the deprecated `lr` argument
model.compile(Adam(learning_rate=2e-6), loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

# train
train_history = model.fit(
    train_input, train_labels,
    validation_split=0.2,
    epochs=2,
    batch_size=32
)

model.save('model.h5')


Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_word_ids (InputLayer)     [(None, 64)]         0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 64)]         0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 64)]         0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 768), (None, 109482241   input_word_ids[0][0]             
                                                                 input_mask[0][0]             
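
The classification head of the model is small: `sequence_output[:, 0, :]` keeps only the hidden vector at position 0 (the [CLS] token), a single dense unit with a sigmoid turns it into a probability, and binary cross-entropy compares that probability with the label. The NumPy sketch below reproduces these three steps outside TensorFlow. The random `sequence_output`, the weights and the labels are stand-ins for illustration, not values from the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, hidden = 4, 64, 768
# Stand-in for BERT's per-token output: one 768-dim vector per token position.
sequence_output = rng.normal(size=(batch_size, seq_len, hidden))
labels = np.array([0, 1, 1, 0])

clf_output = sequence_output[:, 0, :]        # hidden state of the [CLS] token
weights = rng.normal(scale=0.01, size=(hidden, 1))
bias = np.zeros(1)

logits = clf_output @ weights + bias         # shape (batch_size, 1)
probs = 1.0 / (1.0 + np.exp(-logits))        # sigmoid, as in Dense(1, activation='sigmoid')

# Binary cross-entropy, clipped to avoid log(0).
p = np.clip(probs[:, 0], 1e-7, 1 - 1e-7)
bce = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

print(clf_output.shape, probs.shape)  # (4, 768) (4, 1)
print(bce)
```

Since the BERT layer was loaded with `trainable=False`, only the weights of this dense layer are updated during `model.fit`.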

#### Predict Output 

In [8]:
test_pred = model.predict(test_input)

In [9]:
preds = test_pred.round().astype(int)
preds

array([[0],
       [0],
       [0],
       ...,
       [0],
       [1],
       [0]])
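
`test_pred.round()` applies an implicit threshold of 0.5 to the predicted probabilities. An explicit threshold makes that choice visible and easy to tune on the validation split. The helper `to_labels` below is a hypothetical sketch, shown on made-up probabilities.

```python
import numpy as np

def to_labels(probabilities, threshold=0.5):
    """Convert predicted probabilities of shape (n, 1) or (n,) into 0/1 labels."""
    flat = np.asarray(probabilities).reshape(-1)
    return (flat >= threshold).astype(int)

toy_probs = [[0.12], [0.5], [0.93]]
print(to_labels(toy_probs))                 # [0 1 1]
print(to_labels(toy_probs, threshold=0.6))  # [0 0 1]
```

Note that NumPy rounds exact halves to the nearest even number, so `round()` maps a probability of exactly 0.5 to 0, while the sketch above maps it to 1.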

## References

- https://www.kaggle.com/xhlulu/disaster-nlp-keras-bert-using-tfhub
- https://peltarion.com/knowledge-center/tutorials/bert-movie-review-sentiment-analysis
- https://stackabuse.com/text-classification-with-bert-tokenizer-and-tf-2-0-in-python/
- https://github.com/imgarylai/bert-embedding
- http://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/
- http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
- https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270
- https://androidkt.com/simple-text-classification-using-bert-in-tensorflow-keras-2-0/
- https://www.kaggle.com/ratan123/in-depth-guide-to-google-s-bert
