#### What is BERT?
BERT, short for **Bidirectional Encoder Representations from Transformers**, is a machine learning (ML) model for natural language processing. It was developed in 2018 by researchers at Google AI Language and serves as a Swiss Army knife for 11+ of the most common language tasks, such as sentiment analysis and named entity recognition.

Language has historically been difficult for computers to ‘understand’. Computers can collect, store, and read text inputs, but they lack basic language context.

So, along came Natural Language Processing (NLP): the field of artificial intelligence aiming for computers to read, analyze, interpret and derive meaning from text and spoken words. This practice combines linguistics, statistics, and Machine Learning to assist computers in ‘understanding’ human language.

Individual NLP tasks have traditionally been solved by individual models created for each specific task. That is, until BERT.

BERT revolutionized the NLP space by handling 11+ of the most common NLP tasks with a single pre-trained model, and by outperforming previous models on them, making it the jack of all NLP trades.

BERT is a neural network that uses the Transformer encoder architecture to pre-train deep bidirectional representations of text. Training proceeds in two stages:

Pre-training: BERT is first trained on a large corpus of unlabeled text using two pre-training objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).

Fine-tuning: After pre-training, the BERT model is fine-tuned on a downstream task, such as text classification, question answering, or named entity recognition.
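
To make the Masked Language Modeling objective more concrete, here is a minimal sketch of how input tokens might be corrupted before pre-training. The 15% selection rate and the 80/10/10 replacement split follow the original BERT paper; the toy vocabulary, the function name `mask_tokens`, and the inflated `mask_prob=0.5` in the demo call are illustrative assumptions, not part of any library.

```python
import random

MASK = "[MASK]"
VOCAB = ["movie", "plot", "great", "boring", "acting"]  # toy vocabulary (assumption)

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Return (corrupted tokens, labels); a label is None where no prediction is needed."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                       # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(VOCAB))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: keep the token unchanged
        else:
            corrupted.append(tok)                    # not selected: left untouched
            labels.append(None)
    return corrupted, labels

tokens = "the plot was boring but the acting was great".split()
# mask_prob is raised to 0.5 only so that the short demo sentence shows some masking.
corrupted, labels = mask_tokens(tokens, random.Random(0), mask_prob=0.5)
print(corrupted)
print(labels)
```

The model is trained to predict the original token only at the positions where a label is present, which forces it to use context on both sides of the gap.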

The BERT architecture consists of a stack of Transformer encoder layers, each with a multi-head self-attention mechanism and a feedforward network. The model comes in different sizes: BERT-base, for example, has 12 layers, a hidden size of 768, and 12 attention heads, which corresponds to the `uncased_L-12_H-768_A-12` checkpoint downloaded later in this notebook.

**Here is the high-level architecture of BERT:**

**Input embeddings**: Each input token is mapped to a token (WordPiece) embedding, which is summed with a position embedding that indicates the token's position in the sequence and a segment embedding that indicates which sentence the token belongs to.

**Encoder**: The input embeddings are passed through multiple layers of Transformer encoders. Each encoder layer has two sub-layers: a multi-head attention mechanism and a feedforward network.

**Multi-Head Attention**: The multi-head attention mechanism computes attention scores between each pair of tokens in the sequence, allowing the model to capture the relationships between all tokens in the sequence. This is done using multiple attention heads, each of which learns to attend to different parts of the input sequence.

**Feedforward Network**: The output of the multi-head attention mechanism is passed through a feedforward network, which applies a non-linear activation function and produces a new representation of the input.

**Output Layer**: The final output layer takes the output from the last encoder layer and produces a prediction for the downstream task.
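
The components listed above can be sketched end to end in plain Python. This is a toy illustration under stated assumptions, not BERT's actual implementation: the weights are random rather than learned, there is a single encoder layer, layer normalization and segment embeddings are omitted, ReLU stands in for BERT's GELU activation, and all sizes and helper names are made up for the example.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    """Small random weights standing in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def attention_head(x, wq, wk, wv):
    """Scaled dot-product self-attention for one head; returns (output, weights)."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v), weights

def encoder_layer(x, heads, wo, w1, w2):
    """Multi-head attention, then a feedforward network, each with a residual connection."""
    head_outputs = [attention_head(x, *h)[0] for h in heads]
    # Concatenate the heads token by token, then project back to the hidden size.
    concat = [sum((ho[i] for ho in head_outputs), []) for i in range(len(x))]
    x = add(x, matmul(concat, wo))
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, w1)]  # ReLU here, GELU in BERT
    return add(x, matmul(hidden, w2))

rng = random.Random(0)
seq_len, hidden_size, n_heads, n_classes = 4, 8, 2, 2
head_dim = hidden_size // n_heads

# Input embeddings: token embedding plus position embedding for each of the 4 tokens.
x = add(rand_matrix(seq_len, hidden_size, rng), rand_matrix(seq_len, hidden_size, rng))

heads = [tuple(rand_matrix(hidden_size, head_dim, rng) for _ in range(3))
         for _ in range(n_heads)]
wo = rand_matrix(hidden_size, hidden_size, rng)
w1 = rand_matrix(hidden_size, 4 * hidden_size, rng)
w2 = rand_matrix(4 * hidden_size, hidden_size, rng)
encoded = encoder_layer(x, heads, wo, w1, w2)
first_head_weights = attention_head(x, *heads[0])[1]

# Output layer: classify from the first token's vector, as BERT does with [CLS].
w_out = rand_matrix(hidden_size, n_classes, rng)
probs = softmax(matmul([encoded[0]], w_out)[0])
print(len(encoded), len(encoded[0]), probs)
```

Each row of `first_head_weights` is a probability distribution over all tokens in the sequence, which is what lets every token attend to context on both its left and its right.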

Overall, BERT is a powerful architecture that achieved state-of-the-art results on a wide range of natural language processing tasks at the time of its release.


### Install the ktrain library

In [1]:
!pip install ktrain

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting ktrain
  Downloading ktrain-0.33.2.tar.gz (25.3 MB)
  Preparing metadata (setup.py) ... done
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Collecting cchardet
  Downloading cchardet-2.1.7-cp39-cp39-manylinux2010_x86_64.whl (265 kB)
Collecting syntok>1.3.3
  Downloading syntok-1.4.4-py3-none-any.whl (24 kB)
Collecting transformers>=4.17.0
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)

In [2]:
import tensorflow as tf


In [3]:
# check the TensorFlow version
tf.__version__


'2.11.0'

In [4]:
# load the IMDB movie reviews dataset
!git clone https://github.com/kaushik-prasad-dey/nlp/IMDB-Movie-Reviews-Large-Dataset-50k.git

Cloning into 'IMDB-Movie-Reviews-Large-Dataset-50k'...
remote: Enumerating objects: 10, done.
remote: Counting objects: 100% (10/10), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 10 (delta 1), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (10/10), 25.78 MiB | 13.30 MiB/s, done.


In [5]:
import pandas as pd
import numpy as np
import ktrain
from ktrain import text
import tensorflow as tf

In [6]:
# load the training dataset
trained_data=pd.read_excel('/content/IMDB-Movie-Reviews-Large-Dataset-50k/train.xlsx', dtype = str)

In [7]:
# load the test dataset
test_data=pd.read_excel('/content/IMDB-Movie-Reviews-Large-Dataset-50k/test.xlsx', dtype = str)

In [8]:
# check the first five rows of the training data
trained_data.head()

Unnamed: 0,Reviews,Sentiment
0,"When I first tuned in on this morning news, I ...",neg
1,"Mere thoughts of ""Going Overboard"" (aka ""Babes...",neg
2,Why does this movie fall WELL below standards?...,neg
3,Wow and I thought that any Steven Segal movie ...,neg
4,"The story is seen before, but that does'n matt...",neg


In [9]:
# check the last five rows of the training data
trained_data.tail()

Unnamed: 0,Reviews,Sentiment
24995,Everyone plays their part pretty well in this ...,pos
24996,It happened with Assault on Prescient 13 in 20...,neg
24997,My God. This movie was awful. I can't complain...,neg
24998,"When I first popped in Happy Birthday to Me, I...",neg
24999,"So why does this show suck? Unfortunately, tha...",neg


In [10]:
#check first five rows of test data
test_data.head()

Unnamed: 0,Reviews,Sentiment
0,Who would have thought that a movie about a ma...,pos
1,After realizing what is going on around us ......,pos
2,I grew up watching the original Disney Cindere...,neg
3,David Mamet wrote the screenplay and made his ...,pos
4,"Admittedly, I didn't have high expectations of...",neg


In [11]:
#check last five rows of test data
test_data.tail()

Unnamed: 0,Reviews,Sentiment
24995,This fanciful horror flick has Vincent Price p...,neg
24996,"The Intruder (L'Intrus), a film directed by Fr...",pos
24997,Holy crap. This was the worst film I have seen...,neg
24998,Clocking in at an interminable three hours and...,neg
24999,Rented and watched this short (< 90 minutes) w...,pos


In [12]:
#check the rows & columns in train & test data set
trained_data.shape, test_data.shape

((25000, 2), (25000, 2))

In [13]:
# preprocess the reviews and labels for BERT with ktrain; the test set is used for validation
(X_train, y_train), (X_test, y_test), preproc = text.texts_from_df(train_df=trained_data,
                                                                   text_column = 'Reviews',
                                                                   label_columns = 'Sentiment',
                                                                   val_df = test_data,
                                                                   maxlen = 500,
                                                                   preprocess_mode = 'bert')

['neg', 'pos']
   neg  pos
0  1.0  0.0
1  1.0  0.0
2  1.0  0.0
3  1.0  0.0
4  1.0  0.0
['neg', 'pos']
   neg  pos
0  0.0  1.0
1  0.0  1.0
2  1.0  0.0
3  0.0  1.0
4  1.0  0.0
downloading pretrained BERT model (uncased_L-12_H-768_A-12.zip)...
[██████████████████████████████████████████████████]
extracting pretrained BERT model...
done.

cleanup downloaded zip...
done.

preprocessing train...
language: en


Is Multi-Label? False
preprocessing test...
language: en
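
The label tables printed above are one-hot encodings of the `Sentiment` column. As a minimal sketch of that encoding in plain Python (the helper name `one_hot` is illustrative and not part of ktrain; the sample labels are the first five rows of the test data shown earlier):

```python
def one_hot(labels):
    """Map string labels to one-hot rows, with classes in sorted order."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    rows = [[1.0 if index[label] == j else 0.0 for j in range(len(classes))]
            for label in labels]
    return classes, rows

classes, rows = one_hot(["pos", "pos", "neg", "pos", "neg"])
print(classes)  # ['neg', 'pos']
for row in rows:
    print(row)
```

Each row has a 1.0 in the column of its class, so `[0.0, 1.0]` means `pos` and `[1.0, 0.0]` means `neg`.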


### Text Classifier

In [15]:
# build a BERT text classifier
# preproc carries the BERT preprocessing that the model must be paired with
model = text.text_classifier(name = 'bert', train_data = (X_train, y_train), preproc = preproc)

Is Multi-Label? False
maxlen is 500




done.


In [16]:
# wrap the model, training data, and validation data in a ktrain Learner
learner_data = ktrain.get_learner(model=model, train_data=(X_train, y_train), val_data = (X_test, y_test), batch_size = 6)
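
As a rough sense of scale, the batch size determines how many gradient updates each epoch takes. A small sketch of that arithmetic, assuming the 25,000 training rows reported by `trained_data.shape` and that a final partial batch counts as one step:

```python
import math

train_rows = 25000  # from trained_data.shape
batch_size = 6      # as passed to ktrain.get_learner
epochs = 2          # as used in the training cell

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 4167 8334
```

A small batch size keeps memory use down at the cost of more steps per epoch.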

In [17]:
# train with the 1cycle learning-rate policy (fit_onecycle); this step is memory-intensive
# lr = 2e-5 (2 * 10^-5) is the maximum learning rate reached during the cycle
learner_data.fit_onecycle(lr = 2e-5, epochs = 2)



begin training using onecycle policy with max lr of 2e-05...
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f0ac0242160>
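
To illustrate what a one-cycle policy does with the learning rate, here is a simplified triangular schedule: the rate ramps linearly up to the maximum at the midpoint of training and then back down. This is a sketch of the general idea only. The function name, the `start_div` parameter, and the starting rate of `max_lr / 10` are assumptions, and ktrain's own implementation may differ in its details.

```python
def one_cycle_lr(step, total_steps, max_lr, start_div=10.0):
    """Triangular schedule: rise linearly to max_lr at the midpoint, then fall back."""
    min_lr = max_lr / start_div
    half = total_steps / 2
    if step <= half:
        frac = step / half                  # warm-up phase: 0 -> 1
    else:
        frac = (total_steps - step) / half  # cool-down phase: 1 -> 0
    return min_lr + (max_lr - min_lr) * frac

total_steps = 10
schedule = [one_cycle_lr(s, total_steps, max_lr=2e-5) for s in range(total_steps + 1)]
for step, lr in enumerate(schedule):
    print(step, f"{lr:.2e}")
```

The peak of the schedule corresponds to the `lr` argument passed to `fit_onecycle`, which is why the training log reports it as the "max lr".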

In [18]:
predictor = ktrain.get_predictor(learner_data.model, preproc)

In [19]:
# try the model on a few hand-written reviews
data = ['this movie was horrible, the plot was really boring. acting was not so good',
        'the fild is really sucked. there is not plot and acting was very bad',
        'what a great movie. great plot. acting was good. will see it again']

In [21]:
# predict class probabilities for each review (one row per review, one column per class)
predictor.predict(data, return_proba=True)



array([[0.9987639 , 0.00123613],
       [0.9978115 , 0.00218859],
       [0.00347453, 0.99652547]], dtype=float32)
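
Each row of the probability array can be turned into a label by taking the column with the highest value. A minimal sketch in plain Python, assuming the columns are ordered `neg`, `pos` (the helper name `to_labels` is illustrative; the numbers are copied from the output above):

```python
classes = ["neg", "pos"]
probs = [
    [0.9987639, 0.00123613],
    [0.9978115, 0.00218859],
    [0.00347453, 0.99652547],
]

def to_labels(probs, classes):
    """Pick the class with the highest probability in each row."""
    return [classes[max(range(len(row)), key=row.__getitem__)] for row in probs]

labels = to_labels(probs, classes)
print(labels)  # ['neg', 'neg', 'pos']
```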

In [22]:
# list the class labels, in the order of the probability columns
predictor.get_classes()

['neg', 'pos']

### Saving & Loading the BERT Model

In [23]:
# save the predictor (model weights plus preprocessing) to a folder
predictor.save('/content/IMDB-Movie-Reviews-Large-Dataset-50k/Save_bert_model')

In [24]:
# zip the saved model folder so it can be downloaded
!zip -r /content/IMDB-Movie-Reviews-Large-Dataset-50k/Save_bert_model/bertModel.zip /content/IMDB-Movie-Reviews-Large-Dataset-50k/Save_bert_model

  adding: content/IMDB-Movie-Reviews-Large-Dataset-50k/Save_bert_model/ (stored 0%)
  adding: content/IMDB-Movie-Reviews-Large-Dataset-50k/Save_bert_model/tf_model.preproc (deflated 48%)
  adding: content/IMDB-Movie-Reviews-Large-Dataset-50k/Save_bert_model/tf_model.h5 (deflated 11%)
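
The same archive can also be built from Python with the standard library. The sketch below writes the zip file next to the model folder rather than inside it, so the archive never tries to include itself. The helper name `archive_model` is an assumption, and the demo uses a throwaway directory with a placeholder file instead of a real saved model.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

def archive_model(model_dir, archive_base):
    """Zip model_dir into archive_base + '.zip' and return the archive path."""
    model_dir = Path(model_dir)
    return shutil.make_archive(
        str(archive_base), "zip",
        root_dir=str(model_dir.parent),  # paths in the archive are relative to this
        base_dir=model_dir.name,         # only this folder is archived
    )

# Demo on a temporary directory standing in for the saved model folder.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "Save_bert_model"
    model_dir.mkdir()
    (model_dir / "tf_model.h5").write_text("placeholder")
    archive = archive_model(model_dir, Path(tmp) / "bertModel")
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
print(Path(archive).name, names)
```

In the notebook, `model_dir` would be the folder passed to `predictor.save` and `archive_base` any path outside that folder.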


In [25]:
# load the saved predictor from disk
predictor_load_model=ktrain.load_predictor('/content/IMDB-Movie-Reviews-Large-Dataset-50k/Save_bert_model')



In [26]:
# confirm that the loaded predictor has the same classes
predictor_load_model.get_classes()

['neg', 'pos']

In [28]:
# run the same reviews through the loaded predictor
predictor_load_model.predict(data)



['neg', 'neg', 'pos']

In [29]:
predictor_load_model.predict(data,return_proba=True)



array([[0.9987639 , 0.00123613],
       [0.9978115 , 0.00218859],
       [0.00347453, 0.99652547]], dtype=float32)