<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#IMDB-dataset" data-toc-modified-id="IMDB-dataset-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>IMDB dataset</a></span><ul class="toc-item"><li><span><a href="#Load-dataset" data-toc-modified-id="Load-dataset-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Load dataset</a></span></li><li><span><a href="#Create-labels" data-toc-modified-id="Create-labels-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Create labels</a></span></li><li><span><a href="#Rename-columns" data-toc-modified-id="Rename-columns-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Rename columns</a></span></li></ul></li><li><span><a href="#Bert-training" data-toc-modified-id="Bert-training-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Bert training</a></span><ul class="toc-item"><li><span><a href="#Config" data-toc-modified-id="Config-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Config</a></span></li><li><span><a href="#Training" data-toc-modified-id="Training-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Calculate-validation-accuracy" data-toc-modified-id="Calculate-validation-accuracy-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Calculate validation accuracy</a></span></li></ul></li></ul></div>

# BERT IMDB training
> Train BERT on the IMDB dataset for sentiment classification

We use the `simpletransformers` library to train BERT (large) for sentiment classification on the IMDB dataset.

In [1]:
import sys
sys.path.append('../../')
import pandas as pd
from sklearn.model_selection import train_test_split
from simpletransformers.classification import ClassificationModel

## IMDB dataset
The IMDB dataset contains 50k movie reviews, each annotated as "positive" or "negative" to indicate its sentiment. It can be downloaded from Kaggle ([link](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)).

### Load dataset

In [2]:
df = pd.read_csv('../data/imdb-dataset.csv')

In [3]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


### Create labels

In [4]:
df['label'] = (df['sentiment']=='positive').astype(int)

In [5]:
df.head()

| | review | sentiment | label |
|---|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive | 1 |
| 1 | A wonderful little production. <br /><br />The... | positive | 1 |
| 2 | I thought this was a wonderful way to spend ti... | positive | 1 |
| 3 | Basically there's a family where a little boy ... | negative | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive | 1 |


### Rename columns

In [6]:
df.rename({'review': 'text'}, axis=1, inplace=True)
df.drop('sentiment', axis=1, inplace=True)

In [7]:
df.head()

| | text | label |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | 1 |
| 1 | A wonderful little production. <br /><br />The... | 1 |
| 2 | I thought this was a wonderful way to spend ti... | 1 |
| 3 | Basically there's a family where a little boy ... | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 |
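As far as we can tell from the `simpletransformers` source, `ClassificationModel` looks for columns named `text` and `labels`; when either is missing it falls back to column order and prints the "Dataframe headers not specified" warning seen during training. The sketch below shows one way to make the column names explicit. The frame `toy` and its two rows are made up for illustration and are not part of the IMDB data.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the IMDB data.
toy = pd.DataFrame({
    'review': ['Loved it', 'Terrible plot'],
    'sentiment': ['positive', 'negative'],
})

# Encode the sentiment as 1 (positive) / 0 (negative).
toy['labels'] = (toy['sentiment'] == 'positive').astype(int)

# Keep only the two columns, named `text` and `labels`.
toy = toy.rename(columns={'review': 'text'})[['text', 'labels']]

print(list(toy.columns))
```

With the notebook's current `text`/`label` naming the fallback still picks the right columns, because `text` is column 0 and `label` is column 1.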


Split the data into training (80%) and validation (20%) sets:

In [8]:
df_train, df_valid = train_test_split(df, test_size=0.2)
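The split above is reshuffled on every run and does not enforce the class ratio. A minimal sketch of a reproducible, stratified variant follows, using the `random_state` and `stratify` parameters of `train_test_split`. The toy frame and the seed value 42 are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy frame: 10 positive and 10 negative rows.
toy = pd.DataFrame({
    'text': [f'review {i}' for i in range(20)],
    'label': [1, 0] * 10,
})

# random_state fixes the shuffle; stratify keeps the 50/50 class ratio
# in both parts. The seed value is arbitrary.
toy_train, toy_valid = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy['label']
)

print(len(toy_train), len(toy_valid))
print(toy_valid['label'].value_counts().to_dict())
```

Fixing the seed makes validation numbers comparable between runs; whether stratification matters here depends on how balanced the random split already is.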

## Bert training

### Config

In [9]:
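args = {
    'fp16': False,                   # disable mixed-precision training
    'wandb_project': 'bert-imdb',    # log metrics to this Weights & Biases project
    'num_train_epochs': 3,
    'overwrite_output_dir': True,    # allow reusing an existing output directory
    'learning_rate': 1e-5,
}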
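To get a feel for how long this configuration trains, the number of optimizer steps can be estimated from the dataset size. The sketch below assumes `train_batch_size` is 8, which we believe is the `simpletransformers` default since the config above does not set it, and assumes no gradient accumulation.

```python
import math

n_reviews = 50_000
n_valid = n_reviews * 20 // 100    # test_size=0.2 in the split
n_train = n_reviews - n_valid

num_train_epochs = 3               # from the config above
train_batch_size = 8               # assumption: library default, not set in args

steps_per_epoch = math.ceil(n_train / train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(n_train, steps_per_epoch, total_steps)
```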

### Training

In [10]:
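# Load the pretrained bert-large-cased checkpoint; use_cuda=False runs on the CPU
model = ClassificationModel('bert', 'bert-large-cased', use_cuda=False, args=args)
model.train_model(df_train, output_dir='bert-imdb')
# Evaluate on the held-out validation set
result, model_outputs, wrong_predictions = model.eval_model(df_valid)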

  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."


Converting to features started. Cache is not used.


[34m[1mwandb[0m: Wandb version 0.10.21 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade


Running loss: 0.466722

	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, *, Number alpha) (Triggered internally at  ../torch/csrc/utils/python_arg_parser.cpp:882.)
  exp_avg.mul_(beta1).add_(1.0 - beta1, grad)


Running loss: 0.760077

KeyboardInterrupt: 

### Calculate validation accuracy

In [None]:
# Accuracy = correct predictions / all predictions, from the confusion-matrix counts
(result['tp'] + result['tn']) / (result['tp'] + result['tn'] + result['fp'] + result['fn'])
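The same computation can be wrapped in a small helper that works on any dict with `tp`, `tn`, `fp` and `fn` keys. This is a sketch: `accuracy_from_counts` is a hypothetical name, and the counts below are made up for illustration and are not results from this notebook.

```python
def accuracy_from_counts(counts):
    """Return (tp + tn) / (tp + tn + fp + fn) for a dict of confusion-matrix counts."""
    correct = counts['tp'] + counts['tn']
    total = correct + counts['fp'] + counts['fn']
    if total == 0:
        raise ValueError('no predictions to score')
    return correct / total

# Made-up counts for illustration only.
example = {'tp': 40, 'tn': 45, 'fp': 5, 'fn': 10}
print(accuracy_from_counts(example))
```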

In [None]:
model.predict(['The movie was really good'])
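`model.predict` returns a pair: the predicted class ids and the raw model outputs (logits) for each input. The sketch below shows how such an output could be turned into a sentiment name and a probability. The logit values are hypothetical, and the helper `softmax` and the mapping `ID_TO_SENTIMENT` are names introduced here for illustration; the mapping follows the label encoding used above (1 = positive, 0 = negative).

```python
import math

# Hypothetical raw outputs (logits) for one review: [negative, positive].
raw_output = [-1.2, 2.3]

def softmax(logits):
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

ID_TO_SENTIMENT = {0: 'negative', 1: 'positive'}

probs = softmax(raw_output)
pred = probs.index(max(probs))   # class id with the highest probability
print(ID_TO_SENTIMENT[pred], round(probs[pred], 3))
```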