# This notebook illustrates how to create a Sentiment Analyzer with BERT using ktrain 🚆

For more information, check out [ktrain](https://github.com/amaiya/ktrain)

<img src="https://www.aitimes.kr/news/photo/201901/13117_13465_1541.jpg" width="360" height="220"> 

The [BERT paper](https://arxiv.org/abs/1810.04805) was released along with the [source code](https://github.com/google-research/bert) and pre-trained models.

Thanks to [amaiya](https://github.com/amaiya) for contributing ktrain to the community.

## **Setup**
Install the required packages and set up the imports:

In [None]:
!pip install ktrain

I have uploaded the preprocessed IMDB sentiment dataset to my Google Drive. Let's download it using gdown:

In [None]:
!gdown --id 1dI1d0-7pd99scl0oHzwaPn6f8qlw9nI2
!gdown --id 1I4Dtp4-MvZst_Bda4530CX5UVttxp1Ee  

In [4]:
import numpy as np
import pandas as pd
import tensorflow as tf
import ktrain
from ktrain import text

In [6]:
data_train = pd.read_excel('/content/train.xlsx', dtype=str)
data_train 

Unnamed: 0,Reviews,Sentiment
0,"When I first tuned in on this morning news, I ...",neg
1,"Mere thoughts of ""Going Overboard"" (aka ""Babes...",neg
2,Why does this movie fall WELL below standards?...,neg
3,Wow and I thought that any Steven Segal movie ...,neg
4,"The story is seen before, but that does'n matt...",neg
...,...,...
24995,Everyone plays their part pretty well in this ...,pos
24996,It happened with Assault on Prescient 13 in 20...,neg
24997,My God. This movie was awful. I can't complain...,neg
24998,"When I first popped in Happy Birthday to Me, I...",neg


In [7]:
data_test = pd.read_excel('/content/test.xlsx', dtype=str)
data_test

Unnamed: 0,Reviews,Sentiment
0,Who would have thought that a movie about a ma...,pos
1,After realizing what is going on around us ......,pos
2,I grew up watching the original Disney Cindere...,neg
3,David Mamet wrote the screenplay and made his ...,pos
4,"Admittedly, I didn't have high expectations of...",neg
...,...,...
24995,This fanciful horror flick has Vincent Price p...,neg
24996,"The Intruder (L'Intrus), a film directed by Fr...",pos
24997,Holy crap. This was the worst film I have seen...,neg
24998,Clocking in at an interminable three hours and...,neg
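
Before preprocessing, it can help to confirm that both splits contain the expected labels in roughly balanced proportions. The following is a minimal sketch: `label_counts` is a hypothetical helper (not part of ktrain or pandas), and the toy list merely stands in for the `Sentiment` column.

```python
from collections import Counter

def label_counts(labels):
    """Return a dict mapping each label to its number of occurrences."""
    return dict(Counter(labels))

# Toy stand-in for data_train['Sentiment']; the real column is much longer.
toy_labels = ['neg', 'neg', 'pos', 'neg', 'pos']
counts = label_counts(toy_labels)
print(counts)
```

In the notebook, calling `label_counts(data_train['Sentiment'])` should give the same kind of summary for the full column, since iterating over a pandas Series yields its values.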


The `texts_from_df` helper below plays a role similar to the [encode_plus()](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.encode_plus) method provided by Hugging Face: it converts the raw review text into the fixed-length inputs that BERT expects.

All the heavy lifting is done by this one simple function call, and that's amazing!

## Preprocessing with ktrain

In [12]:
(x_train, y_train), (x_test, y_test), preprocess = text.texts_from_df(
    train_df = data_train, 
    text_column = 'Reviews',
    label_columns = 'Sentiment',
    val_df = data_test,
    maxlen = 400,
    preprocess_mode = 'bert'
) 

downloading pretrained BERT model (uncased_L-12_H-768_A-12.zip)...
[██████████████████████████████████████████████████]
extracting pretrained BERT model...
done.

cleanup downloaded zip...
done.

preprocessing train...
language: en


Is Multi-Label? False
preprocessing test...
language: en


In [14]:
x_train[0].shape

(25000, 400)
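
The second dimension of 400 comes from `maxlen`: every review is cut or padded to the same length so that all reviews can be stacked into one array. Here is a rough sketch of that idea, not ktrain's actual implementation. The function name `pad_or_truncate` and the token ids are made up for illustration, and padding with 0 at the end is an assumption.

```python
def pad_or_truncate(token_ids, maxlen, pad_id=0):
    """Return a list of exactly `maxlen` ids: cut long inputs, pad short ones."""
    clipped = token_ids[:maxlen]
    return clipped + [pad_id] * (maxlen - len(clipped))

short_review = [101, 2023, 3185, 102]   # hypothetical token ids
long_review = list(range(1, 11))        # ten hypothetical token ids

fixed_short = pad_or_truncate(short_review, maxlen=6)
fixed_long = pad_or_truncate(long_review, maxlen=6)
print(fixed_short)
print(fixed_long)
```

Real BERT preprocessing also takes care of special tokens such as `[CLS]` and `[SEP]`, which this sketch ignores.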

Let's define the model:

In [15]:
model = text.text_classifier(name='bert', 
                             train_data=(x_train, y_train),
                             preproc = preprocess)

Is Multi-Label? False
maxlen is 400
done.


Let's create the Learner object, which wraps the model and the data and is used to train the model:

In [16]:
learner = ktrain.get_learner(model = model,
                            train_data = (x_train, y_train),
                            val_data = (x_test, y_test),
                            batch_size = 6)
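
With `batch_size = 6`, the number of weight updates per epoch follows from simple arithmetic. The sketch below uses the 25,000 training samples of this dataset and assumes that a final, smaller batch is still processed; the variable names are illustrative.

```python
import math

n_samples = 25000   # training reviews
batch_size = 6      # value passed to get_learner above

steps_per_epoch = math.ceil(n_samples / batch_size)
full_batches = n_samples // batch_size
samples_in_full_batches = full_batches * batch_size
leftover = n_samples - samples_in_full_batches
print(steps_per_epoch, samples_in_full_batches, leftover)
```

So a progress counter would read 24996 after the last full batch, with 4 samples left over for one final, smaller batch.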

# `THE CODE BELOW SHOULD ONLY BE EXECUTED TO FIND THE OPTIMAL LEARNING RATE`

The authors of ktrain found that the optimal learning rate for IMDB sentiment analysis is around 2e-5.

If you are working on a different problem, you can run the cell below to estimate a good learning rate yourself.

# `NOTE: THIS WILL TAKE A WHOLE DAY TO RUN!!!`

In [None]:
learner.lr_find()
learner.lr_plot() 
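
Conceptually, a learning-rate finder trains briefly while raising the learning rate step by step and records the loss, so that you can pick a rate at which the loss is still falling steeply. The sketch below only generates such an exponentially increasing schedule. The start and end values and the function name are illustrative assumptions, not ktrain's defaults.

```python
def lr_range_schedule(start_lr, end_lr, num_steps):
    """Return `num_steps` learning rates growing geometrically from start_lr to end_lr."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    ratio = (end_lr / start_lr) ** (1.0 / (num_steps - 1))
    return [start_lr * ratio ** i for i in range(num_steps)]

schedule = lr_range_schedule(start_lr=1e-7, end_lr=1e-1, num_steps=7)
for step, lr in enumerate(schedule):
    print(step, f"{lr:.0e}")
```

Each step multiplies the previous rate by the same factor, which is why such plots usually use a logarithmic axis for the learning rate.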

There seems to be a minor bug in this fit method: training stalls at 24996/25000 samples. If that happens, just stop the execution of the cell and you'll be fine. The output shows that the model has trained on practically all the samples, so we are good to go.

In [17]:
learner.fit_onecycle(lr=2e-5, epochs=1)  



begin training using onecycle policy with max lr of 2e-05...
Train on 25000 samples, validate on 25000 samples


KeyboardInterrupt: ignored
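
The onecycle policy named in the log varies the learning rate during training instead of keeping it fixed: it rises to the maximum (here 2e-5) and then falls again. Below is a simplified triangular sketch of that idea. The starting value of `max_lr / 10`, the linear shape, and the step count are assumptions for illustration, not a statement of ktrain's exact schedule.

```python
def triangular_lr(step, total_steps, max_lr, start_div=10.0):
    """Learning rate at `step` for a linear warm-up followed by a linear decay."""
    base_lr = max_lr / start_div
    half = total_steps / 2.0
    # Distance from the midpoint, scaled to 0 (at the peak) .. 1 (at either end).
    distance = abs(step - half) / half
    return base_lr + (max_lr - base_lr) * (1.0 - distance)

total_steps = 1000  # hypothetical number of batches
for step in (0, total_steps // 2, total_steps):
    print(step, triangular_lr(step, total_steps, max_lr=2e-5))
```

The rate passed as `lr` is therefore best read as the peak of the cycle rather than a constant value.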

DO NOT INCREASE THE NUMBER OF EPOCHS BEYOND 3, AS THAT WILL OVERFIT THE MODEL.

Let's see how good our model is doing on unseen examples.

In [18]:
predictor = ktrain.get_predictor(learner.model, preprocess)

In [19]:
data = ['This movie was shit! The plot was Boring As Hell! Acting was Okay',
        'The film sucked! there is no plot and acting was horrible!',
        'what a beautiful move, great plot, will see again fosho!']

In [20]:
predictor.predict(data) 

['neg', 'neg', 'pos']

As you can see, our model is doing a good job on these examples. It has an accuracy of 89%, which is quite good for a BERT model trained for a single epoch.
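
Accuracy is simply the fraction of predictions that match the true labels. As a minimal sketch, the helper below (a hypothetical function, not part of ktrain) scores the three predictions above against labels that I assigned by hand for illustration.

```python
def accuracy(predictions, labels):
    """Fraction of positions where the prediction equals the true label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)

predicted = ['neg', 'neg', 'pos']    # output of predictor.predict(data)
hand_labels = ['neg', 'neg', 'pos']  # assumed true labels for the three reviews
score = accuracy(predicted, hand_labels)
print(score)
```

Three hand-picked sentences are of course no substitute for the test set. The same function could be applied to predictions for `data_test['Reviews']` compared against `data_test['Sentiment']`, although predicting 25,000 reviews takes a while.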

In [21]:
predictor.save('/content/bert') 
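
A saved predictor can later be restored without retraining. The sketch below wraps this in a small function. `reload_and_predict` is a hypothetical helper name, the path in the example call is the one used in the save call above, and ktrain is imported inside the function so that defining it does not require the library.

```python
def reload_and_predict(path, texts):
    """Load a predictor saved with predictor.save(path) and classify `texts`."""
    import ktrain  # imported lazily; requires ktrain to be installed

    reloaded = ktrain.load_predictor(path)
    return reloaded.predict(texts)

# Example call (commented out because it needs the saved files and ktrain):
# reload_and_predict('/content/bert', ['A wonderful film with a great cast.'])
```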

## Summary

You have now learned to:

* Understand intuitively what ktrain is
* Evaluate the model on validation data
* Predict sentiment on unseen text with a model trained on IMDB reviews

## References 
- [ktrain](https://github.com/amaiya/ktrain)
- [BERT](https://huggingface.co/transformers/model_doc/bert.html)
