# BERT: Bidirectional Encoder Representations from Transformers

The pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

<img src="https://i.imgur.com/O7ps2Hl.jpg" alt="ensemble" width="800px"/>

**Reference**
* Hugging Face Models: [link](https://huggingface.co/models)
* Bert-base-uncased: [link](https://huggingface.co/bert-base-uncased)
* Hugging Face BERT Docs: [link](https://huggingface.co/transformers/model_doc/bert.html)
* BERT Paper: [link](https://arxiv.org/abs/1810.04805)

## Install transformers

In [None]:
!pip install transformers==3

Collecting transformers==3
  Downloading transformers-3.0.0-py3-none-any.whl (754 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting tokenizers==0.8.0-rc4
  Downloading tokenizers-0.8.0rc4-cp37-cp37m-manylinux1_x86_64.whl (3.0 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Installing collected packages: tokenizers, sentencepiece, sacremoses, transformers
Successfully installed sacremoses-0.0.45 sentencepiece-0.1.96 tokenizers-0.8.0rc4 transformers-3.0.0


## Import

In [None]:
from transformers import BertTokenizer, BertModel

import shutil, sys  
import numpy as np
import pandas as pd

import torch
from torch.utils.data import DataLoader
from torch.utils.data import Dataset

root_path = 'Sentiment Classification on Movie Reviews/'

In [None]:
torch.cuda.empty_cache()

In [None]:
!nvidia-smi

Thu Aug 26 09:48:35 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   69C    P8    32W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## BERT Tokenizer

The `tokenizer.encode_plus` function combines multiple steps for us:

1. Split the sentence into tokens.
2. Add the special `[CLS]` and `[SEP]` tokens.
3. Map the tokens to their IDs.
4. Pad or truncate all sentences to the same length.
5. Create the attention mask, which explicitly differentiates real tokens from `[PAD]` tokens.


**Reference**
* Utilities for Tokenizers `encode_plus()`: [Docs](https://huggingface.co/transformers/internal/tokenization_utils.html#transformers.tokenization_utils_base.PreTrainedTokenizerBase.encode_plus)
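
The five steps listed above can be sketched in plain Python. This is only a toy illustration, not the library's implementation: `TOY_VOCAB` and `toy_encode_plus` are hypothetical names, the vocabulary IDs are made up, and whitespace splitting stands in for the WordPiece tokenization that the real BERT tokenizer performs.

```python
# Toy vocabulary: hypothetical IDs, not the real bert-base-uncased vocabulary.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "a": 4, "wonderful": 5, "little": 6, "production": 7}


def toy_encode_plus(sentence, max_length):
    """Mimic the five steps on whitespace tokens (real BERT uses WordPiece)."""
    # Step 1: split the sentence into tokens.
    tokens = sentence.lower().split()
    # Step 4 (truncate): leave room for the two special tokens.
    tokens = tokens[: max_length - 2]
    # Step 2: add the special [CLS] and [SEP] tokens.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Step 3: map tokens to IDs, falling back to [UNK] for unknown words.
    input_ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    # Step 5: real tokens get mask 1 ...
    attention_mask = [1] * len(input_ids)
    # Step 4 (pad): ... and padding positions get the [PAD] ID with mask 0.
    n_pad = max_length - len(input_ids)
    input_ids += [TOY_VOCAB["[PAD]"]] * n_pad
    attention_mask += [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = toy_encode_plus("A wonderful little production", max_length=8)
print(encoded["input_ids"])       # [2, 4, 5, 6, 7, 3, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 1, 0, 0]
```

The mask has a `1` wherever a real token (including `[CLS]` and `[SEP]`) sits and a `0` over the padding, which is how the model knows which positions to ignore.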


In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

## Set the Device

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

## IMDB Dataset

* **IMDB Dataset**: [Kaggle](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?select=IMDB+Dataset.csv)

### Preprocessing

In [None]:
df = pd.read_csv(root_path + 'IMDB Dataset.csv')
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [None]:
remove_chr = ['<br>', '<br />']
df['review'].str.contains('|'.join(remove_chr))

0         True
1         True
2         True
3         True
4         True
         ...  
49995     True
49996     True
49997     True
49998    False
49999     True
Name: review, Length: 50000, dtype: bool

In [None]:
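# match individual <br>, <br/> or <br /> tags; a greedy '<.*/>' would also delete the text between two tags
df['review'] = df['review'].str.replace(r'<br\s*/?>', '', regex=True)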
df['review']

0        One of the other reviewers has mentioned that ...
1        A wonderful little production. The realism rea...
2        I thought this was a wonderful way to spend ti...
3        Basically there's a family where a little boy ...
4        Petter Mattei's "Love in the Time of Money" is...
                               ...                        
49995    I thought this movie did a down right good job...
49996    Bad plot, bad dialogue, bad acting, idiotic di...
49997    I am a Catholic taught in parochial elementary...
49998    I'm going to have to disagree with the previou...
49999    No one expects the Star Trek movies to be high...
Name: review, Length: 50000, dtype: object
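
When a review contains several `<br />` tags, the choice of pattern matters. The sketch below uses only the standard-library `re` module and a made-up review string (an assumption for illustration, not a row from the dataset) to compare a greedy pattern with one that targets `<br>` tags specifically.

```python
import re

# Hypothetical review text with several line-break tags.
review = "Great cast.<br /><br />The plot drags.<br />Still worth a look."

# Greedy: `.*` runs from the first "<" to the LAST "/>" in the string,
# so the sentence between the tags disappears as well.
greedy = re.sub(r"<.*/>", "", review)

# Targeted: match only individual <br>, <br/> or <br /> tags,
# then collapse the leftover whitespace.
targeted = re.sub(r"<br\s*/?>", " ", review)
targeted = re.sub(r"\s+", " ", targeted).strip()

print(greedy)    # Great cast.Still worth a look.
print(targeted)  # Great cast. The plot drags. Still worth a look.
```

Replacing each tag with a space, rather than an empty string, also keeps the neighbouring sentences from being glued together.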

In [None]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. The realism rea... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [None]:
# show labels
df['sentiment'].unique()

array(['positive', 'negative'], dtype=object)

In [None]:
df.review.isnull().sum()

0

### Custom Dataset Class

In [None]:
class IMDBDataset(Dataset):
    def __init__(self, mode, filepath, tokenizer, max_len=256):
        assert mode in ['train', 'val']

        self.mode = mode
        # self.df = pd.read_csv(filepath).sample(frac=0.1)  # 10% sample for a quick run
        self.df = pd.read_csv(filepath)  # full dataset
        # strip HTML line breaks once here rather than on every __getitem__ call
        self.df['review'] = self.df['review'].str.replace(r'<br\s*/?>', '', regex=True)
        self.tokenizer = tokenizer
        self.max_len = max_len

        # label to index
        self.label_map = {
            'positive': 1,
            'negative': 0
        }

        # positional 80/20 split: first 80% of rows for training, the rest for validation
        self.len = len(self.df)
        self.train_len = int(self.len * 0.8)
        if mode == 'train':
            self.df = self.df[: self.train_len]
            print('train size:', len(self.df))
        else:
            self.df = self.df[self.train_len:]
            print('validation size:', len(self.df))

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        text = self.df.review.iloc[idx]
        label_str = self.df.sentiment.iloc[idx]
        label = self.label_map[label_str]

        inputs = self.tokenizer.encode_plus(
            text=text,
            text_pair=None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            truncation=True
        )
        
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        token_type_ids = inputs['token_type_ids']

        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            'targets': torch.tensor(label, dtype=torch.float)
        }

### Datasets & DataLoader

In [None]:
# datasets
train_dataset = IMDBDataset('train', root_path + 'IMDB Dataset.csv', tokenizer)
test_dataset = IMDBDataset('val', root_path + 'IMDB Dataset.csv', tokenizer)

# dataloaders (evaluation order does not affect the loss, so no shuffling for validation)
train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=16, shuffle=False)

train size: 4000
validation size: 1000
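
The split inside `IMDBDataset` is purely positional: the first 80% of rows go to training and the remainder to validation. A minimal sketch of that arithmetic follows, with `split_bounds` as a hypothetical helper and a plain list standing in for the DataFrame. The sizes printed above (4000 and 1000) correspond to 5,000 rows, which would be consistent with the 10% sample line rather than the full 50,000-row file.

```python
def split_bounds(n_rows, train_frac=0.8):
    """Return (train_slice, val_slice) for a positional split of n_rows rows."""
    train_len = int(n_rows * train_frac)
    return slice(0, train_len), slice(train_len, n_rows)


rows = list(range(5000))  # stand-in for 5,000 DataFrame rows
train_slice, val_slice = split_bounds(len(rows))
print(len(rows[train_slice]), len(rows[val_slice]))  # 4000 1000
```

Because no shuffling happens before the split, this approach relies on the rows in the file not being ordered by label.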


## Fine-tune BERT

### Model

In [None]:
class FineTuneBERT(torch.nn.Module):
    def __init__(self, dropout_p=0.3):
        super(FineTuneBERT, self).__init__()

        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.dropout = torch.nn.Dropout(dropout_p)
        self.classifier = torch.nn.Linear(768, 1)

    def forward(self, ids, mask, token_type_ids):
        # index 1 is the pooled [CLS] output; indexing works for tuple and ModelOutput returns alike
        output = self.bert(ids, attention_mask=mask, token_type_ids=token_type_ids)[1]
        output = self.dropout(output)
        output = self.classifier(output)
        return output
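
The classifier head returns one raw logit per review, not a probability. The notebook's training code never converts logits to labels, so the following is only a sketch of how that could be done at inference time, assuming a sigmoid followed by a 0.5 threshold. `logit_to_label` is a hypothetical helper, and the label order follows the `label_map` defined earlier (`positive` is 1, `negative` is 0).

```python
import math


def logit_to_label(logit, threshold=0.5):
    """Map a raw logit to (probability of 'positive', predicted label)."""
    prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    label = "positive" if prob >= threshold else "negative"
    return prob, label


for logit in (-2.0, 0.0, 3.0):
    prob, label = logit_to_label(logit)
    print(f"logit {logit:+.1f} -> p(positive) = {prob:.3f} -> {label}")
```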

In [None]:
model = FineTuneBERT()
model.to(device)

Downloading:   0%|          | 0.00/433 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/440M [00:00<?, ?B/s]

FineTuneBERT(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)

### Loss Function & Optimizer

In [None]:
loss_func = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(
    model.parameters(), 
    lr=1e-4  # note: the BERT paper fine-tunes with Adam learning rates of 5e-5, 3e-5, or 2e-5
)
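
`BCEWithLogitsLoss` applies a sigmoid to the raw logit and then computes binary cross-entropy in a single, numerically stable step. The sketch below reproduces that formula for one example in plain Python; `bce_with_logits` is a hypothetical helper written for illustration, not the PyTorch implementation.

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit (single example).

    Equivalent to -[t * log(sigmoid(x)) + (1 - t) * log(1 - sigmoid(x))].
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


print(round(bce_with_logits(0.0, 1.0), 4))  # 0.6931 (ln 2: an uninformed guess)
print(round(bce_with_logits(4.0, 1.0), 4))  # 0.0181 (confident and correct)
print(round(bce_with_logits(4.0, 0.0), 4))  # 4.0181 (confident and wrong)
```

A per-example loss near ln 2 ≈ 0.693 is therefore the level a model reaches when it outputs logits close to zero regardless of the input, which makes it a useful baseline when reading training logs.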

## Save & Load Checkpoint

### Load Checkpoint

In [None]:
def load_ckp(checkpoint_fpath, model, optimizer):
    """
    checkpoint_fpath: path to load the checkpoint from
    model: model that we want to load checkpoint parameters into
    optimizer: optimizer we defined in previous training
    """
    # load checkpoint
    checkpoint = torch.load(checkpoint_fpath)
    # restore model weights from the checkpoint
    model.load_state_dict(checkpoint['state_dict'])
    # restore optimizer state from the checkpoint
    optimizer.load_state_dict(checkpoint['optimizer'])
    # minimum validation loss recorded when the checkpoint was saved
    valid_loss_min = checkpoint['valid_loss_min']
    # valid_loss_min is stored as a plain Python float, so no .item() call is needed
    return model, optimizer, checkpoint['epoch'], valid_loss_min

### Save Checkpoint

In [None]:
def save_ckp(state, is_best, checkpoint_path, best_model_path):
    """
    state: checkpoint we want to save
    is_best: is this the best checkpoint; min validation loss
    checkpoint_path: path to save checkpoint
    best_model_path: path to save best model
    """
    f_path = checkpoint_path
    # save checkpoint data to the path given, checkpoint_path
    torch.save(state, f_path)
    # if it is a best model, min validation loss
    if is_best:
        best_fpath = best_model_path
        # copy that checkpoint file to best path given, best_model_path
        shutil.copyfile(f_path, best_fpath)
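
The save-then-copy pattern used by `save_ckp` can be tried without a model. In this sketch `pickle` stands in for `torch.save` and `torch.load` (a deliberate substitution so the example runs with the standard library only), and `save_ckp_sketch` and `load_ckp_sketch` are hypothetical names.

```python
import os
import pickle
import shutil
import tempfile


def save_ckp_sketch(state, is_best, checkpoint_path, best_model_path):
    """Write the latest state, then copy it to the best-model path if it is the best."""
    with open(checkpoint_path, "wb") as f:
        pickle.dump(state, f)  # stand-in for torch.save(state, checkpoint_path)
    if is_best:
        shutil.copyfile(checkpoint_path, best_model_path)


def load_ckp_sketch(path):
    with open(path, "rb") as f:
        return pickle.load(f)  # stand-in for torch.load(path)


workdir = tempfile.mkdtemp()
current = os.path.join(workdir, "current_checkpoint.pt")
best = os.path.join(workdir, "best_model.pt")

save_ckp_sketch({"epoch": 1, "valid_loss_min": 0.70}, True, current, best)
save_ckp_sketch({"epoch": 2, "valid_loss_min": 0.70}, False, current, best)

latest_epoch = load_ckp_sketch(current)["epoch"]
best_epoch = load_ckp_sketch(best)["epoch"]
print(latest_epoch)  # 2: the current checkpoint is overwritten on every call
print(best_epoch)    # 1: the best copy changes only when is_best is True
shutil.rmtree(workdir)
```

Keeping two files means training can resume from the most recent state while the best-performing weights stay available for evaluation.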

## Training Function

In [None]:
def train(model, train_loader, optimizer, loss_func, device):
    train_loss = 0
    model.train()
    for b_idx, data in enumerate(train_loader):
        ids = data['ids'].to(device)
        mask = data['mask'].to(device)
        token_type_ids = data['token_type_ids'].to(device)
        targets = data['targets'].to(device)

        outputs = model(ids, mask, token_type_ids)
        
        optimizer.zero_grad()
        loss = loss_func(outputs, targets.unsqueeze(1))
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
    
    # loss.item() is already the mean over a batch, so average over batches, not samples
    return train_loss / len(train_loader)

## Validation Function

In [None]:
def validation(model, valid_loader, optimizer, loss_func, device):
    valid_loss = 0
    model.eval()  # disable dropout during evaluation
    with torch.no_grad():
        for b_idx, data in enumerate(valid_loader):
            ids = data['ids'].to(device)
            mask = data['mask'].to(device)
            token_type_ids = data['token_type_ids'].to(device)
            targets = data['targets'].to(device)
            
            outputs = model(ids, mask, token_type_ids)

            loss = loss_func(outputs, targets.unsqueeze(1))
            valid_loss += loss.item()
    
    # average over batches, matching train()
    return valid_loss / len(valid_loader)

## Train & Validation
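
The loop in this section saves a checkpoint only when the validation loss improves on the best value seen so far. That bookkeeping can be sketched on its own, using the validation losses for epochs 1 to 9 copied from this notebook's training log. Starting the best loss at infinity is an assumption of the sketch, chosen so that the first epoch always counts as an improvement.

```python
# Validation losses for epochs 1-9, copied from this notebook's training log.
valid_losses = [0.044097, 0.044197, 0.044278, 0.043745, 0.043603,
                0.044891, 0.043902, 0.043752, 0.043713]

best_loss = float("inf")  # assumption: no best loss recorded before epoch 1
saved_epochs = []
for epoch, loss in enumerate(valid_losses, start=1):
    if loss < best_loss:
        best_loss = loss
        saved_epochs.append(epoch)  # the real loop would call save_ckp here

print(saved_epochs)  # [1, 4, 5]
print(best_loss)     # 0.043603
```

Epochs 1, 4 and 5 are the ones marked "Validation loss decreased" in the log; later epochs never beat the epoch 5 value, so no further checkpoint would be written.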

In [None]:
min_valid_loss = None
checkpoint_path = './current_checkpoint.pt'
best_model_path = './best_model.pt'

for epoch in range(10):
    train_loss = train(model, train_dataloader, optimizer, loss_func, device)
    valid_loss = validation(model, test_dataloader, optimizer, loss_func, device)
    
    print('Epoch: {} \n\t - Average Training Loss: {:.6f} \n\t - Average Validation Loss: {:.6f}'.format(
            epoch + 1, 
            train_loss,
            valid_loss
    ))

    if min_valid_loss is None:
        # first epoch: nothing to compare against yet, so any validation loss is an improvement
        min_valid_loss = float('inf')
    
    if valid_loss < min_valid_loss:
        # create checkpoint variable and add important data
        checkpoint = {
                'epoch': epoch + 1,
                'valid_loss_min': valid_loss,
                'state_dict': model.state_dict(),
                'optimizer': optimizer.state_dict()
        }
        print('** Validation loss decreased ({:.6f} --> {:.6f}).  Saving model ...'.format(min_valid_loss, valid_loss))
        
        # save checkpoint as best model
        save_ckp(checkpoint, True, checkpoint_path, best_model_path)
        min_valid_loss = valid_loss

Epoch: 1 
	 - Avgerage Training Loss: 0.044130 
	 - Average Validation Loss: 0.044097
** Validation loss decreased (0.044130 --> 0.044097).  Saving model ...
Epoch: 2 
	 - Avgerage Training Loss: 0.044208 
	 - Average Validation Loss: 0.044197
Epoch: 3 
	 - Avgerage Training Loss: 0.044302 
	 - Average Validation Loss: 0.044278
Epoch: 4 
	 - Avgerage Training Loss: 0.043956 
	 - Average Validation Loss: 0.043745
** Validation loss decreased (0.044097 --> 0.043745).  Saving model ...
Epoch: 5 
	 - Avgerage Training Loss: 0.044041 
	 - Average Validation Loss: 0.043603
** Validation loss decreased (0.043745 --> 0.043603).  Saving model ...
Epoch: 6 
	 - Avgerage Training Loss: 0.043927 
	 - Average Validation Loss: 0.044891
Epoch: 7 
	 - Avgerage Training Loss: 0.043853 
	 - Average Validation Loss: 0.043902
Epoch: 8 
	 - Avgerage Training Loss: 0.043868 
	 - Average Validation Loss: 0.043752
Epoch: 9 
	 - Avgerage Training Loss: 0.043948 
	 - Average Validation Loss: 0.043713
Epoch: 10 