This notebook follows:
* [Building text classifier with Differential Privacy](https://github.com/pytorch/opacus/blob/main/tutorials/building_text_classifier.ipynb)
* [Fine-tuning with custom datasets](https://huggingface.co/transformers/v3.4.0/custom_datasets.html#seq-imdb)

# Libraries
https://huggingface.co/docs/transformers/training

## Install

In [1]:
!pip install datasets
import datasets

In [2]:
!pip install opacus

Collecting opacus
  Downloading opacus-1.1.2-py3-none-any.whl (176 kB)
Installing collected packages: opacus
Successfully installed opacus-1.1.2

In [3]:
# !pip install transformers

## Import

In [4]:
from tqdm.auto import tqdm
from transformers import AutoModelForSequenceClassification
from torch.optim import AdamW
import torch
from torch.utils.data import DataLoader

from opacus.utils.batch_memory_manager import BatchMemoryManager

import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import gc

pd.set_option('display.max_columns', None)

In [5]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/tokenize-ucberkeley-distilbert/__results__.html
/kaggle/input/tokenize-ucberkeley-distilbert/validation.pkl
/kaggle/input/tokenize-ucberkeley-distilbert/train.pkl
/kaggle/input/tokenize-ucberkeley-distilbert/test.pkl
/kaggle/input/tokenize-ucberkeley-distilbert/__notebook__.ipynb
/kaggle/input/tokenize-ucberkeley-distilbert/__output__.json
/kaggle/input/tokenize-ucberkeley-distilbert/custom.css


In [6]:
import random

def seed_torch(seed=7):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    

global_seed = 2022
seed_torch(global_seed)

## Get device

In [7]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(device)

cuda


# Load tokenized data

The tokenized data comes from my [other notebook](https://www.kaggle.com/code/khairulislam/tokenize-jigsaw-comments), which tokenizes [all_data.csv](https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification/data?select=all_data.csv) from the [Jigsaw competition](https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification).

In [8]:
text = 'text'
target = 'labels'
root = '/kaggle/input/tokenize-ucberkeley-distilbert/'

In [9]:
import pickle

# the `with` block closes each file automatically
with open(root + 'train.pkl', 'rb') as input_file:
    train_all_tokenized = pickle.load(input_file)

with open(root + 'validation.pkl', 'rb') as input_file:
    validation_all_tokenized = pickle.load(input_file)

with open(root + 'test.pkl', 'rb') as input_file:
    test_all_tokenized = pickle.load(input_file)

In [10]:
id_column = 'comment_id'
train_tokenized = train_all_tokenized.remove_columns(id_column)
test_tokenized = test_all_tokenized.remove_columns(id_column)
validation_tokenized = validation_all_tokenized.remove_columns(id_column)

# Model

BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art approach to various NLP tasks. It uses a Transformer architecture and relies heavily on pre-training.

We'll use a pre-trained DistilBERT model (`distilbert-base-uncased`), a smaller distilled version of BERT, provided in the Hugging Face [transformers](https://github.com/huggingface/transformers) repo. It gives us a PyTorch implementation of the architecture, as well as a tokenizer and weights pre-trained on a public English corpus (including Wikipedia).

Please follow these [installation instructions](https://github.com/huggingface/transformers#installation) before proceeding.

In [11]:
# https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertForSequenceClassification
from transformers import AutoModelForSequenceClassification

def load_pretrained_model(model_name, num_labels):
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

    trainable_layers = [model.distilbert.transformer.layer[-1], model.pre_classifier, model.classifier]
    total_params = 0
    trainable_params = 0

    for p in model.parameters():
        p.requires_grad = False
        total_params += p.numel()

    for layer in trainable_layers:
        for p in layer.parameters():
            p.requires_grad = True
            trainable_params += p.numel()

    print(f"Total parameters count: {total_params}") # ~67M for distilbert-base-uncased
    print(f"Trainable parameters count: {trainable_params}, percent {(trainable_params*100/total_params):.3f}") # ~7.7M

    return model

In [12]:
num_labels = 3
model_name = "distilbert-base-uncased"

# Private Training

## Utils

### Results

In [13]:
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score

# https://huggingface.co/docs/datasets/metrics
def calculate_result(labels, probs, threshold=0.5):
    preds = np.where(probs >= threshold, 1, 0)
    return {
        'accuracy': np.round(accuracy_score(labels, preds), 4),
        'f1': np.round(f1_score(labels, preds), 4),
        'auc': np.round(roc_auc_score(labels, probs), 4)
    }

def dump_results(result_dir):
    train_df = pd.DataFrame({'id':train_all_tokenized[id_column], 'labels':train_all_tokenized[target], 
      'probs': train_probs, 'split':['train']* len(train_all_tokenized)
    })
    val_df = pd.DataFrame({'id':validation_all_tokenized[id_column], 'labels':validation_all_tokenized[target], 
      'probs': val_probs, 'split':['validation']* len(validation_all_tokenized)
    })
    test_df = pd.DataFrame({'id':test_all_tokenized[id_column], 'labels':test_all_tokenized[target], 
      'probs': test_probs, 'split':['test']* len(test_all_tokenized)
    })

    total_df = pd.concat([train_df, val_df, test_df],ignore_index=True)

    total_df.to_csv(result_dir + 'results.csv', index=False)
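
`calculate_result` thresholds the probabilities at 0.5 before computing accuracy and F1, so the reported F1 depends on that cut-off while AUC does not. As a rough illustration, the dependency-free sketch below recomputes binary F1 by hand on made-up labels and probabilities. `f1_at_threshold` is a hypothetical helper, not part of scikit-learn, and the toy data is not taken from the Jigsaw dataset.

```python
def f1_at_threshold(labels, probs, threshold=0.5):
    """Binary F1 after thresholding probabilities (illustrative only)."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        # no true positives: precision and recall are both zero
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# made-up toy data
toy_labels = [0, 0, 1, 1, 1]
toy_probs = [0.1, 0.6, 0.4, 0.7, 0.9]

f1_default = f1_at_threshold(toy_labels, toy_probs)        # threshold 0.5
f1_lower = f1_at_threshold(toy_labels, toy_probs, 0.3)     # threshold 0.3
print(round(f1_default, 4), round(f1_lower, 4))
```

Lowering the threshold turns one false negative into a true positive in this toy example, so F1 rises. On an imbalanced dataset the best threshold may well differ from 0.5.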

### Train, test

In [14]:
from tqdm.notebook import tqdm


sigmoid = torch.nn.Sigmoid()


def evaluate(model, test_dataloader, epoch, data_type='Test'):    
    model.eval()

    losses, total_labels = [], []
    total_probs = torch.tensor([], dtype=torch.float32)
    progress_bar = tqdm(range(len(test_dataloader)), desc=f'Epoch {epoch} ({data_type})')
    
    for batch in test_dataloader:
        inputs = {k: v.to(device) for k, v in batch.items()}

        with torch.no_grad():
            outputs = model(**inputs)
            
        loss = outputs[0]
        
        probs = sigmoid(outputs.logits.detach().cpu())[:, 1]
        labels = inputs[target].detach().cpu().numpy()
        
        losses.append(loss.item())
        total_probs = torch.cat((total_probs, probs), dim=0)
        total_labels.extend(labels)
        
        progress_bar.update(1)
        progress_bar.set_postfix(
            loss=np.round(np.mean(losses), 4), 
            f1=np.round(f1_score(total_labels, total_probs>=0.5), 4)
        )
    
    model.train()
    test_result = calculate_result(total_labels, total_probs)
    return np.mean(losses), test_result, total_probs

def dp_train(model, train_dataloader, optimizer, epoch):
    losses, total_labels = [], []
    total_probs = torch.tensor([], dtype=torch.float32)

    with BatchMemoryManager(
        data_loader=train_dataloader, 
        max_physical_batch_size=MAX_PHYSICAL_BATCH_SIZE, 
        optimizer=optimizer
    ) as memory_safe_data_loader:
        progress_bar = tqdm(range(len(memory_safe_data_loader)), desc=f'Epoch {epoch} (Train)')

        for step, data in enumerate(memory_safe_data_loader):
            optimizer.zero_grad()

            inputs = {k: v.to(device) for k, v in data.items()}
            outputs = model(**inputs) # output = loss, logits, hidden_states, attentions

            # loss = loss_function(outputs.logits, targets)
            loss = outputs[0]

            loss.backward()
            optimizer.step()

            losses.append(loss.item())

            # preds = np.argmax(outputs.logits.detach().cpu().numpy(), axis=1)
            probs = sigmoid(outputs.logits.detach().cpu())[:, 1]
            labels = inputs[target].detach().cpu().numpy()
            
            total_probs = torch.cat((total_probs, probs), dim=0)
            total_labels.extend(labels)

            progress_bar.update(1)
            progress_bar.set_postfix(
                loss=np.round(np.mean(losses), 4), 
                f1=np.round(f1_score(total_labels, total_probs>=0.5), 4)
            )

    train_loss = np.mean(losses)
    train_result = calculate_result(np.array(total_labels), np.array(total_probs))

    return train_loss, train_result, total_probs

### Early stop, save and load model
* https://pytorch.org/tutorials/beginner/saving_loading_models.html
* https://debuggercafe.com/saving-and-loading-the-best-model-in-pytorch/
* https://debuggercafe.com/using-learning-rate-scheduler-and-early-stopping-with-pytorch/

In [15]:
class EarlyStopping:
    """
    Early stopping to stop the training when the loss does not improve after
    certain epochs.
    """
    def __init__(self, patience=5, min_delta=0):
        """
        :param patience: how many epochs to wait before stopping when loss is
               not improving
        :param min_delta: minimum difference between new loss and old loss for
               new loss to be considered as an improvement
        """
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_loss = None
        self.early_stop = False
    def __call__(self, val_loss):
        if self.best_loss is None:
            self.best_loss = val_loss
        elif self.best_loss - val_loss > self.min_delta:
            self.best_loss = val_loss
            # reset counter if validation loss improves
            self.counter = 0
        else:
            # no improvement beyond min_delta (ties included)
            self.counter += 1
            print(f"Early stopping counter {self.counter} of {self.patience}")
            if self.counter >= self.patience:
                print('Early stopping..')
                self.early_stop = True

class ModelCheckPoint:
    """
    Class to save the best model while training. If the current epoch's 
    loss is lower than the lowest loss seen so far, then save the
    model state.
    """
    def __init__(self, best_loss=float('inf'), filepath='best_model.pt'):
        self.best_loss = best_loss
        self.filepath = filepath
        
    def __call__(self, model, optimizer, lr_scheduler, epoch, current_loss):
        if current_loss >= self.best_loss:
            return
        print(f"\nLoss improved from {self.best_loss:.3f} to {current_loss:.3f}. Saving model.")
        self.best_loss = current_loss
        save_model(model, optimizer, lr_scheduler, epoch, self.filepath)
            
def save_model(model, optimizer, lr_scheduler, epoch, filepath='model.pt'):
    """
    Function to save the trained model to disk.
    """
    torch.save(
      {
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'lr_scheduler_state_dict':lr_scheduler.state_dict()
      }, 
      filepath
    )

def load_model(model, optimizer, lr_scheduler, device, filepath='model.pt'):
    """
    Function to load the trained model from disk.
    """
    checkpoint = torch.load(filepath, map_location=device) # Choose whatever GPU device number you want  
    epoch = checkpoint['epoch']
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])

    lr_scheduler.load_state_dict(checkpoint['lr_scheduler_state_dict'])
    model.load_state_dict(checkpoint['model_state_dict'])
    print(f'Loaded best model from epoch {epoch}')

    return model, optimizer, lr_scheduler, epoch
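
To see how the early-stopping counter behaves, the sketch below replays the same rule on a list of losses. It is a compact stand-in for the `EarlyStopping` class above, not a replacement: `early_stop_trace` is a hypothetical helper, and it treats a tie as "no improvement". The inputs are the validation F1 scores from the training log further down, negated because the training loop monitors `-f1`.

```python
def early_stop_trace(losses, patience=3, min_delta=0.0):
    """Return the counter value after each epoch, stopping once patience is hit."""
    best = None
    counter = 0
    trace = []
    for loss in losses:
        if best is None or best - loss > min_delta:
            # improvement: remember the new best and reset the counter
            best = loss
            counter = 0
        else:
            counter += 1
        trace.append(counter)
        if counter >= patience:
            break
    return trace


# validation F1 per epoch, as printed in the training log
val_f1 = [0.3215, 0.5142, 0.5662, 0.5904, 0.5718,
          0.5782, 0.594, 0.5984, 0.6002, 0.6006]
trace = early_stop_trace([-f for f in val_f1], patience=3)
print(trace)  # [0, 0, 0, 0, 1, 2, 0, 0, 0, 0]
```

The counter reaches 2 at epoch 6 and resets at epoch 7, so with `patience=3` the run never stops early.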
    

## Data loader

[How to choose batch size in DP](https://github.com/pytorch/opacus/blob/main/tutorials/building_text_classifier.ipynb)

In [16]:
BATCH_SIZE = 128

# needed for DP training
MAX_PHYSICAL_BATCH_SIZE = 64

train_dataloader = DataLoader(train_tokenized, batch_size=BATCH_SIZE)
validation_dataloader = DataLoader(validation_tokenized, batch_size=BATCH_SIZE)
test_dataloader = DataLoader(test_tokenized, batch_size=BATCH_SIZE)
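
The two batch sizes above play different roles: `BATCH_SIZE` is the logical batch over which one noisy optimizer step is taken, while `MAX_PHYSICAL_BATCH_SIZE` caps how many examples are pushed through the model at once inside `BatchMemoryManager`. The sketch below is only illustrative arithmetic, assuming fixed-size batches (no Poisson sampling) and a hypothetical dataset of 1,000 examples. `batch_counts` is a made-up helper, not part of Opacus.

```python
import math


def batch_counts(num_examples, logical_batch_size, max_physical_batch_size):
    """Return (logical_steps, physical_steps) for fixed-size batching."""
    logical_steps = math.ceil(num_examples / logical_batch_size)
    physical_steps = 0
    remaining = num_examples
    for _ in range(logical_steps):
        # the last logical batch may be smaller than the others
        size = min(logical_batch_size, remaining)
        physical_steps += math.ceil(size / max_physical_batch_size)
        remaining -= size
    return logical_steps, physical_steps


# hypothetical dataset size, chosen only for illustration
print(batch_counts(1000, 128, 64))  # (8, 16)
```

With 128 and 64, each logical batch is processed as two physical batches. That is consistent with the training progress bar showing 422 steps while evaluating on the same loader shows 211.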

## Model and optimizer

In [17]:
EPOCHS = 10
delta_list = [5e-2, 1e-3, 1e-5]
NOISE_MULTIPLIER = 0.45
LEARNING_RATE = 1e-3
MAX_GRAD_NORM = 1

In [18]:
# load a fresh model each time
model = load_pretrained_model(model_name, num_labels)

# Set the model to train mode (HuggingFace models load in eval mode)
model = model.train().to(device)

# Define optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE, eps=1e-8)

# https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
# https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.ReduceLROnPlateau.html
lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=1, verbose=True)
# lr_scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.8)

# to load a previous checkpoint
# model, optimizer, lr_scheduler, epoch = load_model(model, optimizer, lr_scheduler, device, filepath='model.pt')

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'pre_classifi

Total parameters count: 66955779
Trainable parameters count: 7680771, percent 11.471


In [19]:
result_dir = ''
best_model_path = result_dir + 'dp_model.pt'

if result_dir != '':
    os.makedirs(result_dir, exist_ok=True)

check_point = ModelCheckPoint(filepath=best_model_path)
early_stopping = EarlyStopping(patience=3, min_delta=0)

## Privacy Engine

In [20]:
from opacus import PrivacyEngine

privacy_engine = PrivacyEngine()

In [21]:
model, optimizer, train_dataloader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_dataloader,
    noise_multiplier=NOISE_MULTIPLIER,
    max_grad_norm=MAX_GRAD_NORM,
    poisson_sampling=False,
)
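
`make_private` wraps the optimizer so that each step clips every per-sample gradient to `max_grad_norm` and adds Gaussian noise with standard deviation `noise_multiplier * max_grad_norm` to the summed gradient before averaging. The pure-Python sketch below illustrates that idea on plain lists. It is not Opacus code, and the helper names `clip_gradient` and `dp_average_gradient` are made up for illustration.

```python
import math
import random


def clip_gradient(grad, max_grad_norm):
    """Scale one per-sample gradient so its L2 norm is at most max_grad_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_grad_norm / (norm + 1e-6))
    return [g * scale for g in grad]


def dp_average_gradient(per_sample_grads, max_grad_norm, noise_multiplier, rng):
    """Clip each gradient, sum them, add Gaussian noise, then average."""
    clipped = [clip_gradient(g, max_grad_norm) for g in per_sample_grads]
    dim = len(per_sample_grads[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    std = noise_multiplier * max_grad_norm
    noisy = [s + rng.gauss(0.0, std) for s in summed]
    return [x / len(per_sample_grads) for x in noisy]


# toy per-sample gradients: the first has norm 5, the second norm 0.5
toy_grads = [[3.0, 4.0], [0.3, 0.4]]
rng = random.Random(0)
print(dp_average_gradient(toy_grads, max_grad_norm=1.0, noise_multiplier=0.45, rng=rng))
```

A larger `noise_multiplier` gives a smaller ε for the same number of steps, at the cost of noisier updates. `max_grad_norm` bounds how much any single example can move the model.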

## Loop

In [22]:
start_epoch = 1
# load a previous model if there is any
# model, optimizer, lr_scheduler, start_epoch = load_model(model, optimizer, lr_scheduler, device, filepath=best_model_path)

for epoch in range(start_epoch, EPOCHS+1):
    gc.collect()
    
    train_loss, train_result, train_probs = dp_train(model, train_dataloader, optimizer, epoch)
    val_loss, val_result, val_probs = evaluate(model, validation_dataloader, epoch, 'Validation')
    
    epsilons = []
    for delta in delta_list:
        epsilons.append(privacy_engine.get_epsilon(delta))

    print(
      f"Epoch: {epoch} | "
      f"ɛ: {np.round(epsilons, 2)} |"
      f"Train loss: {train_loss:.3f} | "
      f"Train result: {train_result} |\n"
      f"Validation loss: {val_loss:.3f} | "
      f"Validation result: {val_result} | "
    )
    
    loss = -val_result['f1']
    lr_scheduler.step(loss)
    check_point(model, optimizer, lr_scheduler, epoch, loss)
    
    early_stopping(loss)
    if early_stopping.early_stop:
        break
    print()

Epoch 1 (Train):   0%|          | 0/422 [00:00<?, ?it/s]

Epoch 1 (Validation):   0%|          | 0/46 [00:00<?, ?it/s]

Epoch: 1 | ɛ: [2.21 5.45 8.88] |Train loss: 0.969 | Train result: {'accuracy': 0.7633, 'f1': 0.1081, 'auc': 0.6451} |
Validation loss: 1.006 | Validation result: {'accuracy': 0.7789, 'f1': 0.3215, 'auc': 0.7778} | 

Loss improved from inf to -0.322. Saving model.



Epoch 2 (Train):   0%|          | 0/422 [00:00<?, ?it/s]

Epoch 2 (Validation):   0%|          | 0/46 [00:00<?, ?it/s]

Epoch: 2 | ɛ: [ 2.92  6.55 10.33] |Train loss: 0.860 | Train result: {'accuracy': 0.7862, 'f1': 0.4894, 'auc': 0.7918} |
Validation loss: 0.918 | Validation result: {'accuracy': 0.7863, 'f1': 0.5142, 'auc': 0.8095} | 

Loss improved from -0.322 to -0.514. Saving model.



Epoch 3 (Train):   0%|          | 0/422 [00:00<?, ?it/s]

Epoch 3 (Validation):   0%|          | 0/46 [00:00<?, ?it/s]

Epoch: 3 | ɛ: [ 3.49  7.39 11.42] |Train loss: 0.768 | Train result: {'accuracy': 0.7881, 'f1': 0.5604, 'auc': 0.8145} |
Validation loss: 0.870 | Validation result: {'accuracy': 0.801, 'f1': 0.5662, 'auc': 0.8245} | 

Loss improved from -0.514 to -0.566. Saving model.



Epoch 4 (Train):   0%|          | 0/422 [00:00<?, ?it/s]

Epoch 4 (Validation):   0%|          | 0/46 [00:00<?, ?it/s]

Epoch: 4 | ɛ: [ 4.01  8.14 12.39] |Train loss: 0.760 | Train result: {'accuracy': 0.7983, 'f1': 0.5902, 'auc': 0.8326} |
Validation loss: 0.865 | Validation result: {'accuracy': 0.8131, 'f1': 0.5904, 'auc': 0.8413} | 

Loss improved from -0.566 to -0.590. Saving model.



Epoch 5 (Train):   0%|          | 0/422 [00:00<?, ?it/s]

Epoch 5 (Validation):   0%|          | 0/46 [00:00<?, ?it/s]

Epoch: 5 | ɛ: [ 4.46  8.8  13.2 ] |Train loss: 0.754 | Train result: {'accuracy': 0.8134, 'f1': 0.6006, 'auc': 0.8419} |
Validation loss: 0.914 | Validation result: {'accuracy': 0.8247, 'f1': 0.5718, 'auc': 0.845} | 
Early stopping counter 1 of 3



Epoch 6 (Train):   0%|          | 0/422 [00:00<?, ?it/s]

Epoch 6 (Validation):   0%|          | 0/46 [00:00<?, ?it/s]

Epoch: 6 | ɛ: [ 4.91  9.45 14.01] |Train loss: 0.771 | Train result: {'accuracy': 0.8224, 'f1': 0.6061, 'auc': 0.844} |
Validation loss: 0.923 | Validation result: {'accuracy': 0.8264, 'f1': 0.5782, 'auc': 0.8468} | 
Epoch 00006: reducing learning rate of group 0 to 1.0000e-04.
Early stopping counter 2 of 3



Epoch 7 (Train):   0%|          | 0/422 [00:00<?, ?it/s]

Epoch 7 (Validation):   0%|          | 0/46 [00:00<?, ?it/s]

Epoch: 7 | ɛ: [ 5.32 10.   14.72] |Train loss: 0.763 | Train result: {'accuracy': 0.8256, 'f1': 0.609, 'auc': 0.8484} |
Validation loss: 0.881 | Validation result: {'accuracy': 0.8223, 'f1': 0.594, 'auc': 0.8486} | 

Loss improved from -0.590 to -0.594. Saving model.



Epoch 8 (Train):   0%|          | 0/422 [00:00<?, ?it/s]

Epoch 8 (Validation):   0%|          | 0/46 [00:00<?, ?it/s]

Epoch: 8 | ɛ: [ 5.7  10.54 15.37] |Train loss: 0.757 | Train result: {'accuracy': 0.8248, 'f1': 0.6162, 'auc': 0.8506} |
Validation loss: 0.885 | Validation result: {'accuracy': 0.8232, 'f1': 0.5984, 'auc': 0.8492} | 

Loss improved from -0.594 to -0.598. Saving model.



Epoch 9 (Train):   0%|          | 0/422 [00:00<?, ?it/s]

Epoch 9 (Validation):   0%|          | 0/46 [00:00<?, ?it/s]

Epoch: 9 | ɛ: [ 6.09 11.08 16.03] |Train loss: 0.756 | Train result: {'accuracy': 0.8255, 'f1': 0.6182, 'auc': 0.851} |
Validation loss: 0.884 | Validation result: {'accuracy': 0.8239, 'f1': 0.6002, 'auc': 0.8496} | 

Loss improved from -0.598 to -0.600. Saving model.



Epoch 10 (Train):   0%|          | 0/422 [00:00<?, ?it/s]

Epoch 10 (Validation):   0%|          | 0/46 [00:00<?, ?it/s]

Epoch: 10 | ɛ: [ 6.47 11.62 16.68] |Train loss: 0.760 | Train result: {'accuracy': 0.8253, 'f1': 0.6163, 'auc': 0.8507} |
Validation loss: 0.894 | Validation result: {'accuracy': 0.8245, 'f1': 0.6006, 'auc': 0.8503} | 

Loss improved from -0.600 to -0.601. Saving model.



In [23]:
# load the best model
model, _, _, best_epoch = load_model(model, optimizer, lr_scheduler, device, filepath=best_model_path)

train_loss, train_result, train_probs = evaluate(model, train_dataloader, best_epoch, 'Train')
# no need to re-evaluate on the validation set if the last model is the best one
if best_epoch != epoch:
    val_loss, val_result, val_probs = evaluate(model, validation_dataloader, best_epoch, 'Validation')
test_loss, test_result, test_probs = evaluate(model, test_dataloader, best_epoch, 'Test')

Loaded best model from epoch 10


Epoch 10 (Train):   0%|          | 0/211 [00:00<?, ?it/s]

Epoch 10 (Test):   0%|          | 0/46 [00:00<?, ?it/s]

## Dump results and others

In [24]:
dump_results(result_dir)

In [25]:
import json

config = {
    "model_name": model_name,
    "undersample": False,
    "seed": global_seed,

    "epochs": EPOCHS,
    "noise_multiplier":NOISE_MULTIPLIER,
    "learning_rate": LEARNING_RATE,

    "delta": delta_list[0],
    "max_grad_norm": MAX_GRAD_NORM,

    "batch_size": BATCH_SIZE,
    "max_physical_batch_size": MAX_PHYSICAL_BATCH_SIZE,
    "max_sequence_length": 128
}


with open(result_dir + 'config.json', 'w') as output:
    json.dump(config, output, indent=4)