# BERT IMDB training
> Train BERT on the IMDB dataset for sentiment classification

We use the `simpletransformers` library to train BERT (large) for sentiment classification on the IMDB dataset.

In [None]:
import sys
sys.path.append('../../')
import pandas as pd
from sklearn.model_selection import train_test_split
from simpletransformers.classification import ClassificationModel

## IMDB dataset
The IMDB dataset contains 50k movie reviews, each annotated as "positive" or "negative" to indicate its sentiment. It can be downloaded from Kaggle ([link](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)).

### Load dataset

In [None]:
df = pd.read_csv('../data/imdb-dataset.csv')

In [None]:
df.head()

| Unnamed: 0 | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


### Create labels

In [None]:
# Map "positive" to 1 and "negative" to 0
df['label'] = (df['sentiment'] == 'positive').astype(int)

In [None]:
df.head()

| Unnamed: 0 | review | sentiment | label |
|---|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive | 1 |
| 1 | A wonderful little production. <br /><br />The... | positive | 1 |
| 2 | I thought this was a wonderful way to spend ti... | positive | 1 |
| 3 | Basically there's a family where a little boy ... | negative | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive | 1 |


### Rename columns

In [None]:
df.rename({'review': 'text'}, axis=1, inplace=True)
df.drop('sentiment', axis=1, inplace=True)

In [None]:
df.head()

| Unnamed: 0 | text | label |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | 1 |
| 1 | A wonderful little production. <br /><br />The... | 1 |
| 2 | I thought this was a wonderful way to spend ti... | 1 |
| 3 | Basically there's a family where a little boy ... | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 |


### Train/test split

In [None]:
df_train, df_valid = train_test_split(df, test_size=0.2)

## BERT training

### Config

In [None]:
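args = {
    'fp16': False,                  # train in full precision
    'wandb_project': 'bert-imdb',   # log the run to Weights & Biases
    'num_train_epochs': 3,
    'overwrite_output_dir': True,   # reuse the output directory across runs
    'learning_rate': 1e-5,
}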
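
As a rough sanity check on training length, the number of optimizer steps implied by this configuration can be estimated as follows. This is only a sketch under assumptions the notebook does not state: a training batch size of 8 (assumed here to be the `simpletransformers` default for `train_batch_size`) and no gradient accumulation. The helper name `estimate_training_steps` is hypothetical.

```python
import math

def estimate_training_steps(num_reviews, valid_percent, batch_size, epochs):
    """Estimate optimizer steps for a run with no gradient accumulation."""
    # Integer arithmetic avoids floating-point surprises in the split size.
    num_train = num_reviews * (100 - valid_percent) // 100
    steps_per_epoch = math.ceil(num_train / batch_size)
    return steps_per_epoch * epochs

# 50k reviews, 20% held out, 3 epochs (from the notebook);
# batch size 8 is an assumption, not a value set in the config.
total_steps = estimate_training_steps(50_000, 20, 8, 3)
print(total_steps)
```

If the batch size is changed in `args`, the estimate scales inversely with it.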

### Training

In [None]:
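# Load the pretrained bert-large-cased checkpoint on the GPU
model = ClassificationModel('bert', 'bert-large-cased', use_cuda=True, args=args)
# Fine-tune on the training split; checkpoints are written to 'bert-imdb'
model.train_model(df_train, output_dir='bert-imdb')
# Evaluate on the held-out split; `result` holds the confusion-matrix counts
result, model_outputs, wrong_predictions = model.eval_model(df_valid)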

### Calculate validation accuracy

In [None]:
(result['tp']+result['tn'])/(result['tp']+result['tn']+result['fp']+result['fn'])

0.9098
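
The expression above is the usual accuracy formula over confusion-matrix counts, (TP + TN) / (TP + TN + FP + FN). A small helper makes that explicit. The function name `accuracy_from_counts` and the counts in the usage example are hypothetical and chosen for illustration; they are not the counts from this run.

```python
def accuracy_from_counts(result):
    """Return (TP + TN) / (TP + TN + FP + FN) from a dict of counts."""
    correct = result['tp'] + result['tn']
    total = correct + result['fp'] + result['fn']
    if total == 0:
        raise ValueError('no evaluated examples')
    return correct / total

# Hypothetical counts for illustration only (not from the run above).
example_result = {'tp': 45, 'tn': 46, 'fp': 4, 'fn': 5}
print(accuracy_from_counts(example_result))
```

Because the helper only reads the `tp`, `tn`, `fp`, and `fn` keys, it should work on the `result` dictionary directly, assuming those keys are present as they are in the cell above.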

In [None]:
model.predict(['The movie was really good'])

Converting to features started. Cache is not used.


(array([1]), array([[-3.6786678,  4.1541786]], dtype=float32))
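
The first array is the predicted label (1, i.e. positive), and the second holds the raw model outputs (logits), one per class. If probabilities are wanted, a softmax can be applied to the logits. The sketch below does this in plain Python on the values printed above; the helper name `softmax` is introduced here for illustration and is not part of the notebook.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the prediction output above, ordered [negative, positive].
probs = softmax([-3.6786678, 4.1541786])
print(probs)
```

The large gap between the two logits means nearly all of the probability mass falls on the positive class.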