This notebook follows:
* [Building text classifier with Differential Privacy](https://github.com/pytorch/opacus/blob/main/tutorials/building_text_classifier.ipynb)
* [Fine-tuning with custom datasets](https://huggingface.co/transformers/v3.4.0/custom_datasets.html#seq-imdb)

# Libraries
See the Hugging Face [training guide](https://huggingface.co/docs/transformers/training).

## Install

In [1]:
!pip install datasets
import datasets


## Import

In [2]:
from tqdm.auto import tqdm
from transformers import AutoModelForSequenceClassification
from torch.optim import AdamW
import torch
from torch.utils.data import DataLoader

import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import gc

pd.set_option('display.max_columns', None)

In [3]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/tokenize-ucberkeley-bert-uncased/__results__.html
/kaggle/input/tokenize-ucberkeley-bert-uncased/validation.pkl
/kaggle/input/tokenize-ucberkeley-bert-uncased/train.pkl
/kaggle/input/tokenize-ucberkeley-bert-uncased/test.pkl
/kaggle/input/tokenize-ucberkeley-bert-uncased/__notebook__.ipynb
/kaggle/input/tokenize-ucberkeley-bert-uncased/__output__.json
/kaggle/input/tokenize-ucberkeley-bert-uncased/custom.css


In [4]:
import random

def seed_torch(seed=7):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    

global_seed = 2022
seed_torch(global_seed)
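
As a small illustration of what seeding buys, the sketch below uses only Python's `random` module rather than the full `seed_torch` function: reseeding with the same value reproduces the same draws. The helper name `draw` is made up for this example. Seeding alone may still not make GPU training bit-for-bit reproducible.

```python
import random


def draw(seed, n=3):
    """Hypothetical helper: reseed, then return n pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


first = draw(2022)
second = draw(2022)      # same seed, same sequence
different = draw(2023)   # different seed, different sequence

print(first == second)
print(first == different)
```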

## Get device

In [5]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(device)

cuda


# Load tokenized data

The tokenized data comes from my [other notebook](https://www.kaggle.com/code/khairulislam/tokenize-jigsaw-comments). It is tokenized from the [Jigsaw competition](https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification) data and [all_data.csv](https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification/data?select=all_data.csv).

In [6]:
text = 'text'
target = 'labels'
root = '/kaggle/input/tokenize-ucberkeley-bert-uncased/'

In [7]:
import pickle
    
# the `with` block closes each file automatically, so no explicit close() is needed
with open(root + 'train.pkl', 'rb') as input_file:
    train_all_tokenized = pickle.load(input_file)

with open(root + 'validation.pkl', 'rb') as input_file:
    validation_all_tokenized = pickle.load(input_file)

with open(root + 'test.pkl', 'rb') as input_file:
    test_all_tokenized = pickle.load(input_file)

In [8]:
train_all_tokenized

Dataset({
    features: ['comment_id', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 26994
})

In [9]:
id_column = 'comment_id'
train_tokenized = train_all_tokenized.remove_columns(id_column)
test_tokenized = test_all_tokenized.remove_columns(id_column)
validation_tokenized = validation_all_tokenized.remove_columns(id_column)

# Model

BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art approach to various NLP tasks. It uses a Transformer architecture and relies heavily on pre-training.

We'll use a pre-trained BERT-base model from the Hugging Face [transformers](https://github.com/huggingface/transformers) repo. It provides a PyTorch implementation of the classic BERT architecture, along with a tokenizer and weights pre-trained on public English corpora (BookCorpus and English Wikipedia).

Please follow these [installation instructions](https://github.com/huggingface/transformers#installation) before proceeding.

In [10]:
# https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertForSequenceClassification
from transformers import AutoModelForSequenceClassification

def load_pretrained_model(model_name, num_labels):
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

    trainable_layers = [model.bert.encoder.layer[-1], model.bert.pooler, model.classifier]
    total_params = 0
    trainable_params = 0

    for p in model.parameters():
        p.requires_grad = False
        total_params += p.numel()

    for layer in trainable_layers:
        for p in layer.parameters():
            p.requires_grad = True
            trainable_params += p.numel()

    print(f"Total parameters count: {total_params}") # ~109M
    print(f"Trainable parameters count: {trainable_params}") # ~7.7M

    return model

In [11]:
num_labels = 3
model_name = "bert-base-uncased"

# Training

## Utils

### Results

In [12]:
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score

# https://huggingface.co/docs/datasets/metrics
def calculate_result(labels, probs, threshold=0.5):
    preds = np.where(probs >= threshold, 1, 0)
    return {
        'accuracy': np.round(accuracy_score(labels, preds), 4),
        'f1': np.round(f1_score(labels, preds), 4),
        'auc': np.round(roc_auc_score(labels, probs), 4)
    }

def dump_results(result_dir):
    train_df = pd.DataFrame({'id':train_all_tokenized[id_column], 'labels':train_all_tokenized[target], 
      'probs': train_probs, 'split':['train']* len(train_all_tokenized)
    })
    val_df = pd.DataFrame({'id':validation_all_tokenized[id_column], 'labels':validation_all_tokenized[target], 
      'probs': val_probs, 'split':['validation']* len(validation_all_tokenized)
    })
    test_df = pd.DataFrame({'id':test_all_tokenized[id_column], 'labels':test_all_tokenized[target], 
      'probs': test_probs, 'split':['test']* len(test_all_tokenized)
    })

    total_df = pd.concat([train_df, val_df, test_df],ignore_index=True)

    total_df.to_csv(f'{result_dir}/results.csv', index=False)
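
`calculate_result` binarizes the probabilities at a threshold before computing accuracy and F1, while AUC is computed from the raw scores. The sketch below shows, in plain Python and on made-up labels and probabilities, how the threshold changes the predictions and therefore F1. The helpers `binarize` and `f1_binary` are hypothetical stand-ins written for this illustration, not the scikit-learn functions used in the notebook.

```python
def binarize(probs, threshold=0.5):
    """Turn probabilities into 0/1 predictions, as np.where(probs >= threshold, 1, 0) does."""
    return [1 if p >= threshold else 0 for p in probs]


def f1_binary(labels, preds):
    """Binary F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Toy data, invented for this example.
labels = [0, 0, 1, 1, 0, 1]
probs = [0.2, 0.6, 0.7, 0.4, 0.55, 0.9]

f1_at_050 = f1_binary(labels, binarize(probs, threshold=0.5))
f1_at_065 = f1_binary(labels, binarize(probs, threshold=0.65))
print(round(f1_at_050, 4), round(f1_at_065, 4))
```

Raising the threshold here removes two false positives, which is why F1 goes up on this toy data; on real data the effect depends on the score distribution.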

### Train, test

In [13]:
from tqdm.notebook import tqdm

sigmoid = torch.nn.Sigmoid()

def evaluate(model, test_dataloader, epoch, data_type='Test'):    
    model.eval()

    losses, total_labels = [], []
    total_probs = torch.tensor([], dtype=torch.float32)
    progress_bar = tqdm(range(len(test_dataloader)), desc=f'Epoch {epoch} ({data_type})')
    
    for batch in test_dataloader:
        inputs = {k: v.to(device) for k, v in batch.items()}

        with torch.no_grad():
            outputs = model(**inputs)
            
        loss = outputs[0]
        
        probs = sigmoid(outputs.logits.detach().cpu())[:, 1]
        labels = inputs[target].detach().cpu().numpy()
        
        losses.append(loss.item())
        total_probs = torch.cat((total_probs, probs), dim=0)
        total_labels.extend(labels)
        
        progress_bar.update(1)
        
        progress_bar.set_postfix(
            loss=np.round(np.mean(losses), 4), 
            f1=np.round(f1_score(total_labels, total_probs>=0.5), 4)
        )
    
    model.train()
    test_result = calculate_result(total_labels, total_probs)
    return np.mean(losses), test_result, total_probs

def train(model, train_dataloader, optimizer, epoch):
    model.train()
    
    losses, total_labels = [], []
    total_probs = torch.tensor([], dtype=torch.float32)
    progress_bar = tqdm(range(len(train_dataloader)), desc=f'Epoch {epoch} (Train)')

    for step, data in enumerate(train_dataloader):
        optimizer.zero_grad()

        inputs = {k: v.to(device) for k, v in data.items()}
        outputs = model(**inputs) # output = loss, logits, hidden_states, attentions

        # targets = data[target].to(device, dtype = torch.long)
        # loss = loss_function(outputs.logits, targets)
        loss = outputs[0]

        loss.backward()
        optimizer.step()

        losses.append(loss.item())

        # preds = np.argmax(outputs.logits.detach().cpu().numpy(), axis=1)
        probs = sigmoid(outputs.logits.detach().cpu())[:, 1]
        labels = inputs[target].detach().cpu().numpy()
        
        total_probs = torch.cat((total_probs, probs), dim=0)
        total_labels.extend(labels)

        progress_bar.update(1)
        progress_bar.set_postfix(
            loss=np.round(np.mean(losses), 4), 
            f1=np.round(f1_score(total_labels, total_probs>=0.5), 4)
        )


    train_loss = np.mean(losses)
    train_result = calculate_result(np.array(total_labels), np.array(total_probs))

    return train_loss, train_result, total_probs
    # return preds_df
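
Both `train` and `evaluate` score each example by applying `sigmoid` to the logit at index 1. With `num_labels = 3` the model emits three logits per example, and a softmax over all of them is another common way to obtain class probabilities. The sketch below compares the two on hypothetical logits, using only the `math` module. It makes no claim about which choice suits this dataset better.

```python
import math


def sigmoid(x):
    """Logistic function applied to a single logit."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    largest = max(logits)
    exps = [math.exp(v - largest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for classes 0, 1 and 2 of one example.
logits = [2.0, 1.0, -1.0]

score_sigmoid = sigmoid(logits[1])   # depends only on the class-1 logit
score_softmax = softmax(logits)[1]   # depends on all three logits

print(round(score_sigmoid, 4), round(score_softmax, 4))
```

In this example the sigmoid score exceeds 0.5 even though class 0 has the largest logit, whereas the softmax probability of class 1 stays below 0.5. Thresholded metrics such as accuracy and F1 can therefore differ between the two scoring rules.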

### Early stop, save and load model
* https://pytorch.org/tutorials/beginner/saving_loading_models.html
* https://debuggercafe.com/saving-and-loading-the-best-model-in-pytorch/
* https://debuggercafe.com/using-learning-rate-scheduler-and-early-stopping-with-pytorch/

In [14]:
class EarlyStopping:
    """
    Early stopping to stop the training when the loss does not improve after
    certain epochs.
    """
    def __init__(self, patience=5, min_delta=0):
        """
        :param patience: how many epochs to wait before stopping when loss is
               not improving
        :param min_delta: minimum difference between new loss and old loss for
               new loss to be considered as an improvement
        """
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_loss = None
        self.early_stop = False
    def __call__(self, val_loss):
        if self.best_loss is None:
            self.best_loss = val_loss
        elif self.best_loss - val_loss > self.min_delta:
            self.best_loss = val_loss
            # reset counter if validation loss improves
            self.counter = 0
        else:
            # no improvement larger than min_delta (ties count as no improvement)
            self.counter += 1
            print(f"Early stopping counter {self.counter} of {self.patience}")
            if self.counter >= self.patience:
                print('Early stopping..')
                self.early_stop = True

class ModelCheckPoint:
    """
    Class to save the best model while training. If the current epoch's 
    loss is less than the previous lowest loss, then save the
    model state.
    """
    def __init__(self, best_loss=float('inf'), filepath='best_model.pt'):
        self.best_loss = best_loss
        self.filepath = filepath
        
    def __call__(self, model, optimizer, lr_scheduler, epoch, current_loss):
        if current_loss >= self.best_loss:
            return
        print(f"\nLoss improved from {self.best_loss:.3f} to {current_loss:.3f}. Saving model.")
        self.best_loss = current_loss
        save_model(model, optimizer, lr_scheduler, epoch, self.filepath)
            
def save_model(model, optimizer, lr_scheduler, epoch, filepath='model.pt'):
    """
    Function to save the trained model to disk.
    """
    torch.save(
      {
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'lr_scheduler_state_dict':lr_scheduler.state_dict()
      }, 
      filepath
    )

def load_model(model, optimizer, lr_scheduler, device, filepath='model.pt'):
    """
    Function to load the trained model from disk.
    """
    checkpoint = torch.load(filepath, map_location=device) # Choose whatever GPU device number you want  
    epoch = checkpoint['epoch']
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    lr_scheduler.load_state_dict(checkpoint['lr_scheduler_state_dict'])
    model.load_state_dict(checkpoint['model_state_dict'])
    print(f'Loaded best model from epoch {epoch}')
    
    return model, optimizer, lr_scheduler, epoch

## Data loader

In [15]:
BATCH_SIZE = 64

train_dataloader = DataLoader(train_tokenized, batch_size=BATCH_SIZE)
validation_dataloader = DataLoader(validation_tokenized, batch_size=BATCH_SIZE)
test_dataloader = DataLoader(test_tokenized, batch_size=BATCH_SIZE)
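
A quick arithmetic check of how many batches the training loader yields, assuming the 26994 training rows reported in the `Dataset` output earlier and the `DataLoader` default of keeping the final partial batch (`drop_last=False`):

```python
import math

train_rows = 26994   # num_rows of the training Dataset shown earlier
batch_size = 64      # BATCH_SIZE in this notebook

num_batches = math.ceil(train_rows / batch_size)
last_batch_size = train_rows - (num_batches - 1) * batch_size

print(num_batches, last_batch_size)
```

The last batch is smaller than the others, which is harmless here but worth remembering when averaging per-batch losses.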

## Model, hyper-parameters and callbacks

In [16]:
model = load_pretrained_model(model_name, num_labels)

# Define optimizer
LEARNING_RATE = 1e-3
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE, eps=1e-8)
EPOCHS = 10

# https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.ReduceLROnPlateau.html
lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=1) 

# Alternative: ExponentialLR(optimizer, gamma=0.8) decays the learning rate of each parameter group by gamma every epoch.

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Total parameters count: 109484547
Trainable parameters count: 7680771
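
The two counts printed above can be reproduced by hand. The sketch below assumes the standard `bert-base-uncased` dimensions (hidden size 768, intermediate size 3072, vocabulary of 30522 tokens, 512 positions, 2 token types, 12 encoder layers) and `num_labels = 3`. The helper names are made up for this example.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm_params(hidden):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * hidden


HIDDEN, INTERMEDIATE, NUM_LABELS = 768, 3072, 3
VOCAB, MAX_POSITIONS, TOKEN_TYPES, NUM_LAYERS = 30522, 512, 2, 12

encoder_layer = (
    4 * linear_params(HIDDEN, HIDDEN)        # query, key, value, attention output
    + layer_norm_params(HIDDEN)              # LayerNorm after attention
    + linear_params(HIDDEN, INTERMEDIATE)    # feed-forward expansion
    + linear_params(INTERMEDIATE, HIDDEN)    # feed-forward projection
    + layer_norm_params(HIDDEN)              # LayerNorm after feed-forward
)
pooler = linear_params(HIDDEN, HIDDEN)
classifier = linear_params(HIDDEN, NUM_LABELS)
embeddings = (VOCAB + MAX_POSITIONS + TOKEN_TYPES) * HIDDEN + layer_norm_params(HIDDEN)

# Trainable part: last encoder layer, pooler and classifier head.
trainable = encoder_layer + pooler + classifier
total = embeddings + NUM_LAYERS * encoder_layer + pooler + classifier

print(total, trainable)
```

About 7% of the parameters are updated during fine-tuning; the embeddings and the first eleven encoder layers stay frozen.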


In [17]:
model_type = model_name.split(r'/')[-1]
result_dir = f'models/{model_type}'
best_model_path = f'{result_dir}/model.pt'

os.makedirs(result_dir, exist_ok=True)

check_point = ModelCheckPoint(filepath=best_model_path)
early_stopping = EarlyStopping(patience=3, min_delta=0)

## Loop

In [18]:
start_epoch = 1
# load a previous model if there is any
# model, optimizer, lr_scheduler, start_epoch = load_model(model, optimizer, lr_scheduler, device, filepath=best_model_path)
model = model.to(device)

for epoch in range(start_epoch, EPOCHS+1):
    gc.collect()
    
    train_loss, train_result, train_probs = train(model, train_dataloader, optimizer, epoch)
    val_loss, val_result, val_probs = evaluate(model, validation_dataloader, epoch, 'Validation')

    print(
      f"Epoch: {epoch} | "
      f"Train loss: {train_loss:.3f} | "
      f"Train result: {train_result} |\n"
      f"Validation loss: {val_loss:.3f} | "
      f"Validation result: {val_result} | "
    )
    
    # monitor the negative validation F1, so that "lower is better" holds for the callbacks below
    loss = -val_result['f1']
    lr_scheduler.step(loss)
    check_point(model, optimizer, lr_scheduler, epoch, loss)
    
    early_stopping(loss)
    if early_stopping.early_stop:
        break
    print()

Epoch 1 (Train):   0%|          | 0/422 [00:00<?, ?it/s]

Epoch 1 (Validation):   0%|          | 0/91 [00:00<?, ?it/s]

Epoch: 1 | Train loss: 0.393 | Train result: {'accuracy': 0.3783, 'f1': 0.4272, 'auc': 0.8302} |
Validation loss: 0.371 | Validation result: {'accuracy': 0.2375, 'f1': 0.3839, 'auc': 0.8743} | 

Loss improved from inf to -0.384. Saving model.



Epoch: 2 | Train loss: 0.364 | Train result: {'accuracy': 0.4741, 'f1': 0.4679, 'auc': 0.8656} |
Validation loss: 0.366 | Validation result: {'accuracy': 0.4194, 'f1': 0.4493, 'auc': 0.8743} | 

Loss improved from -0.384 to -0.449. Saving model.



Epoch: 3 | Train loss: 0.356 | Train result: {'accuracy': 0.5692, 'f1': 0.5132, 'auc': 0.8732} |
Validation loss: 0.361 | Validation result: {'accuracy': 0.4946, 'f1': 0.4819, 'auc': 0.877} | 

Loss improved from -0.449 to -0.482. Saving model.



Epoch: 4 | Train loss: 0.352 | Train result: {'accuracy': 0.6268, 'f1': 0.5442, 'auc': 0.8754} |
Validation loss: 0.356 | Validation result: {'accuracy': 0.6271, 'f1': 0.5485, 'auc': 0.8794} | 

Loss improved from -0.482 to -0.548. Saving model.



Epoch: 5 | Train loss: 0.349 | Train result: {'accuracy': 0.698, 'f1': 0.5892, 'auc': 0.8834} |
Validation loss: 0.363 | Validation result: {'accuracy': 0.5955, 'f1': 0.5284, 'auc': 0.8754} | 
Early stopping counter 1 of 3



Epoch: 6 | Train loss: 0.342 | Train result: {'accuracy': 0.7314, 'f1': 0.6117, 'auc': 0.8872} |
Validation loss: 0.360 | Validation result: {'accuracy': 0.6927, 'f1': 0.5848, 'auc': 0.877} | 

Loss improved from -0.548 to -0.585. Saving model.



Epoch: 7 | Train loss: 0.339 | Train result: {'accuracy': 0.7499, 'f1': 0.6265, 'auc': 0.8892} |
Validation loss: 0.360 | Validation result: {'accuracy': 0.712, 'f1': 0.5956, 'auc': 0.8767} | 

Loss improved from -0.585 to -0.596. Saving model.



Epoch: 8 | Train loss: 0.339 | Train result: {'accuracy': 0.7669, 'f1': 0.6374, 'auc': 0.8896} |
Validation loss: 0.367 | Validation result: {'accuracy': 0.777, 'f1': 0.6372, 'auc': 0.8747} | 

Loss improved from -0.596 to -0.637. Saving model.



Epoch: 9 | Train loss: 0.337 | Train result: {'accuracy': 0.7864, 'f1': 0.6518, 'auc': 0.8922} |
Validation loss: 0.367 | Validation result: {'accuracy': 0.7582, 'f1': 0.6254, 'auc': 0.8759} | 
Early stopping counter 1 of 3



Epoch: 10 | Train loss: 0.331 | Train result: {'accuracy': 0.8011, 'f1': 0.6629, 'auc': 0.8965} |
Validation loss: 0.369 | Validation result: {'accuracy': 0.7838, 'f1': 0.6402, 'auc': 0.875} | 

Loss improved from -0.637 to -0.640. Saving model.
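
The log above can be replayed without a model. The sketch below feeds the validation F1 values from the log, negated as in the loop, into a simplified patience tracker and a best-value check. `PatienceTracker` is a hypothetical stand-in for `EarlyStopping`, written for this illustration, and it treats ties as no improvement.

```python
class PatienceTracker:
    """Hypothetical stand-in for the EarlyStopping logic: lower is better."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = None
        self.counter = 0
        self.stop = False

    def update(self, value):
        if self.best is None or self.best - value > self.min_delta:
            self.best = value
            self.counter = 0          # improvement resets the counter
        else:
            self.counter += 1         # no improvement (ties included)
            self.stop = self.counter >= self.patience
        return self.counter


# Validation F1 per epoch, copied from the training log above.
val_f1 = [0.3839, 0.4493, 0.4819, 0.5485, 0.5284,
          0.5848, 0.5956, 0.6372, 0.6254, 0.6402]

tracker = PatienceTracker(patience=3)
best_checkpoint = float("inf")
saved_epochs, counters = [], []

for epoch, f1 in enumerate(val_f1, start=1):
    monitored = -f1                   # higher F1 means lower monitored value
    if monitored < best_checkpoint:   # same rule as ModelCheckPoint
        best_checkpoint = monitored
        saved_epochs.append(epoch)
    counters.append(tracker.update(monitored))
    if tracker.stop:
        break

print(saved_epochs)
print(counters)
```

The counter only reaches 1 (at epochs 5 and 9) before an improvement resets it, so with `patience=3` training runs for all ten epochs and the last checkpoint is written at epoch 10.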



In [19]:
# load the best model
model, _, _, best_epoch = load_model(model, optimizer, lr_scheduler, device, filepath=best_model_path)

train_loss, train_result, train_probs = evaluate(model, train_dataloader, best_epoch, 'Train')
# no need to re-evaluate the validation set if the last model is the best one
if best_epoch != epoch:
    val_loss, val_result, val_probs = evaluate(model, validation_dataloader, best_epoch, 'Validation')
test_loss, test_result, test_probs = evaluate(model, test_dataloader, best_epoch, 'Test')

Loaded best model from epoch 10


Epoch 10 (Train):   0%|          | 0/422 [00:00<?, ?it/s]

Epoch 10 (Test):   0%|          | 0/91 [00:00<?, ?it/s]

## Dump results and others

In [20]:
dump_results(result_dir)

In [21]:
import json

config = {
    "model_name": model_name,
    "undersample": False,
    "seed": global_seed,

    "epochs": EPOCHS,
    "learning_rate": LEARNING_RATE,

    "batch_size": BATCH_SIZE,
    "max_sequence_length": 128
}

with open(f'{result_dir}/config.json', 'w') as output:
    json.dump(config, output, indent=4)
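
To reuse these settings later, the file can be read back with `json.load`. The sketch below writes the same keys to a temporary directory (an assumption made so the example does not touch `result_dir`) and checks that the round trip preserves the values.

```python
import json
import os
import tempfile

# Same keys and values as the config cell above.
config = {
    "model_name": "bert-base-uncased",
    "undersample": False,
    "seed": 2022,
    "epochs": 10,
    "learning_rate": 1e-3,
    "batch_size": 64,
    "max_sequence_length": 128,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "config.json")
    with open(path, "w") as output:
        json.dump(config, output, indent=4)
    with open(path) as source:
        restored = json.load(source)

print(restored == config)
```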