This notebook follows these tutorials:
* [Building text classifier with Differential Privacy](https://github.com/pytorch/opacus/blob/main/tutorials/building_text_classifier.ipynb)
* [Fine-tuning with custom datasets](https://huggingface.co/transformers/v3.4.0/custom_datasets.html#seq-imdb)

# Libraries
Reference: https://huggingface.co/docs/transformers/training

## Install

In [1]:
!pip install datasets
!pip install transformers
!pip install opacus

Collecting datasets
  Downloading datasets-2.1.0-py3-none-any.whl (325 kB)

## Import

In [2]:
import gc
import warnings

import torch
from opacus.utils.batch_memory_manager import BatchMemoryManager
from torch.nn.utils.rnn import pad_sequence
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from tqdm.auto import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_scheduler

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

## [Check GPU footprint](https://stackoverflow.com/questions/59789059/gpu-out-of-memory-error-message-on-google-colab)

In [4]:
# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil

import psutil
import humanize
import os
import GPUtil as GPU

GPUs = GPU.getGPUs()
# Colab provides at most one GPU, and getting one is not guaranteed
gpu = GPUs[0]
def printm():
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available), " |     Proc size: " + humanize.naturalsize(process.memory_info().rss))
    print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total     {3:.0f}MB".format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))
printm()

Collecting gputil
  Downloading GPUtil-1.4.0.tar.gz (5.5 kB)
Building wheels for collected packages: gputil
  Building wheel for gputil (setup.py) ... done
  Created wheel for gputil: filename=GPUtil-1.4.0-py3-none-any.whl size=7411 sha256=c36cb6e89066c37c989e9e0fdad73ec30a1403a92478337411b74fae77ad4926
  Stored in directory: /root/.cache/pip/wheels/6e/f8/83/534c52482d6da64622ddbf72cd93c35d2ef2881b78fd08ff0c
Successfully built gputil
Installing collected packages: gputil
Successfully installed gputil-1.4.0
Gen RAM Free: 12.0 GB  |     Proc size: 1.6 GB
GPU RAM Free: 11441MB | Used: 0MB | Util   0% | Total     11441MB


## Get device

In [5]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(device)

cuda


# Dataset

## Download

First, we need to download the dataset.

In [6]:
from datasets import load_dataset

# dataset = load_dataset("yelp_review_full")
imdb_dataset = load_dataset("imdb")

for key in imdb_dataset.keys():
  print(key, imdb_dataset[key].shape)

# positive or negative review
num_labels = 2

imdb_dataset["train"][100]

Downloading and preparing dataset imdb/plain_text (download: 80.23 MiB, generated: 127.02 MiB, post-processed: Unknown size, total: 207.25 MiB) to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1...


Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1. Subsequent calls will reuse this data.


train (25000, 2)
test (25000, 2)
unsupervised (50000, 2)


{'label': 0,
 'text': "Terrible movie. Nuff Said.<br /><br />These Lines are Just Filler. The movie was bad. Why I have to expand on that I don't know. This is already a waste of my time. I just wanted to warn others. Avoid this movie. The acting sucks and the writing is just moronic. Bad in every way. The only nice thing about the movie are Deniz Akkaya's breasts. Even that was ruined though by a terrible and unneeded rape scene. The movie is a poorly contrived and totally unbelievable piece of garbage.<br /><br />OK now I am just going to rag on IMDb for this stupid rule of 10 lines of text minimum. First I waste my time watching this offal. Then feeling compelled to warn others I create an account with IMDb only to discover that I have to write a friggen essay on the film just to express how bad I think it is. Totally unnecessary."}

In [7]:
lengths = []
for i in ['train', 'test']:
  for item in imdb_dataset[i]:
    lengths.append(len(item['text']))

In [8]:
import pandas as pd
df = pd.DataFrame({'Lengths':lengths})
df.describe()

| Statistic | Lengths (characters) |
|---|---|
| count | 50000.0 |
| mean | 1309.43102 |
| std | 989.728014 |
| min | 32.0 |
| 25% | 699.0 |
| 50% | 970.0 |
| 75% | 1590.25 |
| max | 13704.0 |


## Tokenizer

In [30]:
from transformers import BertConfig, BertTokenizer

model_name = "bert-base-cased"
config = BertConfig.from_pretrained(
    model_name,
    num_labels=2,
)
tokenizer = BertTokenizer.from_pretrained(
    model_name,
    do_lower_case=False,
)

## Prepare the data
Before we begin training, we need to preprocess the data and convert it to the format our model expects.

(Note: it'll take 5-10 minutes to run on a laptop)

In [10]:
MAX_SEQ_LENGTH = 512

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", max_length=MAX_SEQ_LENGTH, truncation=True)

tokenized_datasets = imdb_dataset.map(tokenize_function, batched=True)


In [53]:
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

In [54]:
# select a smaller subset for faster debugging
small_train_dataset = tokenized_datasets["train"].shuffle(seed=2022).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=2022).select(range(1000))
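
The `padding="max_length"` and `truncation=True` arguments in `tokenize_function` make every example exactly `MAX_SEQ_LENGTH` tokens long. Here is a minimal pure-Python sketch of that behaviour. The helper `pad_or_truncate` is hypothetical (not part of `transformers`) and the pad id of 0 is an assumption. The real tokenizer also inserts special tokens, which this sketch does not model.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_length.

    Hypothetical helper for illustration only; pad_id=0 is an assumption.
    """
    kept = token_ids[:max_length]                    # truncation: drop the tail
    n_pad = max_length - len(kept)                   # padding: fill up to max_length
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `MAX_SEQ_LENGTH = 512`, reviews longer than 512 tokens lose their tail, and shorter ones are filled with padding that the attention mask tells the model to ignore.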

# Model

BERT (Bidirectional Encoder Representations from Transformers) is a widely used pre-trained language model for NLP tasks such as text classification. It uses a Transformer encoder architecture and relies heavily on pre-training.

We'll use a pre-trained BERT-base model from the Hugging Face [transformers](https://github.com/huggingface/transformers) library. It provides a PyTorch implementation of the classic BERT architecture, along with a tokenizer and weights pre-trained on public English corpora (BooksCorpus and English Wikipedia).

Please follow these [installation instructions](https://github.com/huggingface/transformers#installation) before proceeding.

In [34]:
# https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertForSequenceClassification
from transformers import BertForSequenceClassification

def load_pretrained_model(model_name, config):
    model = BertForSequenceClassification.from_pretrained(model_name, config=config)

    trainable_layers = [model.bert.encoder.layer[-1], model.bert.pooler, model.classifier]
    total_params = 0
    trainable_params = 0

    for p in model.parameters():
      p.requires_grad = False
      total_params += p.numel()

    for layer in trainable_layers:
      for p in layer.parameters():
          p.requires_grad = True
          trainable_params += p.numel()

    print(f"Total parameters count: {total_params}") # ~108M
    print(f"Trainable parameters count: {trainable_params}") # ~7M

    return model

In [103]:
model = load_pretrained_model(model_name, config)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

Total parameters count: 108311810
Trainable parameters count: 7680002
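
The trainable count can be reproduced by hand from the layer sizes. The sketch below assumes the standard BERT-base dimensions and a `bert-base-cased` vocabulary of 28,996 entries. These constants are assumptions written out for illustration, not values read from the checkpoint. It adds up the last encoder layer, the pooler and the classifier (the three entries of `trainable_layers`), and also the total, which agrees with the `~108M` noted in the code comment.

```python
# Standard BERT-base sizes (assumed, not read from the checkpoint).
HIDDEN = 768
INTERMEDIATE = 3072
NUM_LAYERS = 12
VOCAB_SIZE = 28996       # bert-base-cased vocabulary size (assumption)
MAX_POSITIONS = 512
TYPE_VOCAB = 2
NUM_LABELS = 2


def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # scale and shift vectors

embeddings = (VOCAB_SIZE + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN + layer_norm
encoder_layer = (
    4 * linear(HIDDEN, HIDDEN)        # query, key, value, attention output
    + layer_norm
    + linear(HIDDEN, INTERMEDIATE)    # feed-forward expansion
    + linear(INTERMEDIATE, HIDDEN)    # feed-forward projection
    + layer_norm
)
pooler = linear(HIDDEN, HIDDEN)
classifier = linear(HIDDEN, NUM_LABELS)

trainable = encoder_layer + pooler + classifier
total = embeddings + NUM_LAYERS * encoder_layer + pooler + classifier

print(f"trainable: {trainable:,}")
print(f"total:     {total:,}")
print(f"trainable share: {trainable / total:.1%}")
```

Under these assumptions, roughly 7% of the weights receive gradients. This matters for differentially private training because per-sample gradients are only needed for trainable parameters.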


# Data loader

In [76]:
BATCH_SIZE = 16
MAX_PHYSICAL_BATCH_SIZE = 2

In [56]:
# train_dataloader = get_dataloader(small_train_dataset, BATCH_SIZE)
# test_dataloader = get_dataloader(small_eval_dataset, BATCH_SIZE)

train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=BATCH_SIZE)
test_dataloader = DataLoader(small_eval_dataset, batch_size=BATCH_SIZE)
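
`BATCH_SIZE` is the logical batch, meaning one optimizer update, while `MAX_PHYSICAL_BATCH_SIZE` caps how many examples go through the model at once when `BatchMemoryManager` is used later in the notebook. The arithmetic below is a sketch of how the two interact for the 1000-example training subset. It assumes fixed-size batches (the notebook later passes `poisson_sampling=False`) and that each logical batch is cut into `ceil(size / MAX_PHYSICAL_BATCH_SIZE)` chunks.

```python
import math

DATASET_SIZE = 1000            # size of small_train_dataset
BATCH_SIZE = 16                # logical batch: one optimizer update
MAX_PHYSICAL_BATCH_SIZE = 2    # examples sent through the model at once

full_batches, remainder = divmod(DATASET_SIZE, BATCH_SIZE)
logical_batch_sizes = [BATCH_SIZE] * full_batches + ([remainder] if remainder else [])

# Assumption: each logical batch is cut into ceil(size / max_physical) chunks.
physical_steps = sum(
    math.ceil(size / MAX_PHYSICAL_BATCH_SIZE) for size in logical_batch_sizes
)

print("logical batches per epoch:", len(logical_batch_sizes))
print("physical steps per epoch: ", physical_steps)
```

A smaller physical batch lowers peak GPU memory at the cost of more forward and backward passes per update. The privacy accounting depends on the logical batch only.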

# Training

In [104]:
EPOCHS = 3
EPSILON = 7.5
# Parameter for privacy accounting: probability of not achieving the privacy guarantee.
# Note: len(train_dataloader) is the number of batches, not the number of examples.
DELTA = 1 / len(train_dataloader)
NOISE_MULTIPLIER = 0.1
LEARNING_RATE = 1e-3
MAX_GRAD_NORM = 1

In [105]:
model = model.to(device)

# Set the model to train mode (HuggingFace models load in eval mode)
model = model.train()
# Define optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE, eps=1e-8)

# criterion
loss_function = torch.nn.CrossEntropyLoss()
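
The training loop below reads the loss from `outputs[0]`, which the Hugging Face model computes itself when `labels` are passed. For single-label classification this is a cross-entropy over the logits, the same quantity `loss_function` would give. Here is a minimal pure-Python sketch of that loss for one example. The helper `cross_entropy` is illustrative and not taken from either library.

```python
import math


def cross_entropy(logits, label):
    """Negative log-softmax probability of the true class for one example."""
    m = max(logits)                                   # subtract max for stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


uninformed = cross_entropy([0.0, 0.0], label=1)       # both classes equally likely
confident_right = cross_entropy([-2.0, 2.0], label=1)
confident_wrong = cross_entropy([-2.0, 2.0], label=0)

print(f"uninformed:      {uninformed:.3f}")
print(f"confident right: {confident_right:.3f}")
print(f"confident wrong: {confident_wrong:.3f}")
```

A two-class model that always outputs equal logits scores ln 2 ≈ 0.693. An average loss above that value therefore indicates confident mistakes rather than mere uncertainty.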

## Evaluation cycle

In [91]:
import numpy as np
from tqdm.notebook import tqdm
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score

# https://huggingface.co/docs/datasets/metrics
def calculate_result(labels, preds):
    return {
        'accuracy': np.round(accuracy_score(labels, preds), 4),
        'f1': np.round(f1_score(labels, preds), 4),
        'auc': np.round(roc_auc_score(labels, preds), 4)
    }

def evaluate(model):    
    model.eval()

    losses, total_preds, total_labels = [], [], []
    
    for batch in test_dataloader:
        inputs = {k: v.to(device) for k, v in batch.items()}

        with torch.no_grad():
            outputs = model(**inputs)
            
        loss = outputs[0]
        
        preds = np.argmax(outputs.logits.detach().cpu().numpy(), axis=1)
        labels = inputs['labels'].detach().cpu().numpy()
        
        losses.append(loss.item())
        total_preds.extend(preds)
        total_labels.extend(labels)
    
    model.train()
    return np.mean(losses), calculate_result(total_labels, total_preds)
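
`calculate_result` passes hard 0/1 predictions to `roc_auc_score`, so the reported `auc` is the mean of the true-positive and true-negative rates (balanced accuracy) rather than a ranking metric over scores. The sketch below recomputes the three metrics from confusion counts with a hypothetical helper `binary_metrics`. The example data is an assumption: 543 positive and 457 negative reviews, with a model that predicts positive every time.

```python
def binary_metrics(labels, preds):
    """Accuracy, F1 and hard-label AUC from 0/1 labels and 0/1 predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)

    accuracy = (tp + tn) / len(labels)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0   # recall on positives
    tnr = tn / (tn + fp) if (tn + fp) else 0.0   # recall on negatives
    auc = (tpr + tnr) / 2                        # ROC area with a single operating point
    return {"accuracy": round(accuracy, 4), "f1": round(f1, 4), "auc": round(auc, 4)}


# Assumed split: 543 positive and 457 negative reviews, all predicted positive.
labels = [1] * 543 + [0] * 457
always_positive = [1] * 1000
result = binary_metrics(labels, always_positive)
print(result)
```

These values match the epoch-1 evaluation result recorded later in the notebook, which suggests the model starts out predicting a single class. A high F1 alone can hide that, so it helps to read it together with `auc`.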

## Privacy Engine

In [66]:
from opacus import PrivacyEngine

privacy_engine = PrivacyEngine()

In [106]:
# model, optimizer, train_dataloader = privacy_engine.make_private_with_epsilon(
#     module=model,
#     optimizer=optimizer,
#     data_loader=train_dataloader,
#     target_delta=DELTA,
#     target_epsilon=EPSILON, 
#     epochs=EPOCHS,
#     max_grad_norm=MAX_GRAD_NORM,
# )

model, optimizer, train_dataloader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_dataloader,
    noise_multiplier=NOISE_MULTIPLIER,
    max_grad_norm=MAX_GRAD_NORM,
    poisson_sampling=False,
)
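
Conceptually, the wrapped optimizer changes each update in two ways: every per-sample gradient is clipped to norm `max_grad_norm`, and Gaussian noise with standard deviation `noise_multiplier * max_grad_norm` is added to the sum before averaging. The pure-Python sketch below illustrates that DP-SGD step on toy two-dimensional gradients. The function `dp_sgd_gradient` is a hypothetical illustration under these assumptions, not the Opacus implementation.

```python
import math
import random


def dp_sgd_gradient(per_sample_grads, max_grad_norm, noise_multiplier, rng):
    """Clip each per-sample gradient, sum, add Gaussian noise, then average."""
    dim = len(per_sample_grads[0])
    summed = [0.0] * dim
    for grad in per_sample_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, max_grad_norm / (norm + 1e-6))   # shrink only if too large
        for i, g in enumerate(grad):
            summed[i] += g * scale
    noise_std = noise_multiplier * max_grad_norm
    noisy = [s + rng.gauss(0.0, noise_std) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]          # norms 5.0 and 0.5
no_noise = dp_sgd_gradient(grads, max_grad_norm=1.0, noise_multiplier=0.0, rng=rng)
with_noise = dp_sgd_gradient(grads, max_grad_norm=1.0, noise_multiplier=0.1, rng=rng)
print(no_noise)
print(with_noise)
```

The first gradient is scaled down to norm 1 while the second is left unchanged. With a noise multiplier as small as 0.1, the added noise is small relative to the clipping bound, which is consistent with the very large ɛ values the accountant reports during training.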

## Train

In [107]:
import gc
gc.collect()

2885

In [108]:
for epoch in range(1, EPOCHS+1):
    losses, total_preds, total_labels = [], [], []

    with BatchMemoryManager(
        data_loader=train_dataloader, 
        max_physical_batch_size=MAX_PHYSICAL_BATCH_SIZE, 
        optimizer=optimizer
    ) as memory_safe_data_loader:
        for step, data in enumerate(tqdm(memory_safe_data_loader)):
            optimizer.zero_grad()

            inputs = {k: v.to(device) for k, v in data.items()}
            outputs = model(**inputs) # output = loss, logits, hidden_states, attentions

            targets = data['labels'].to(device, dtype = torch.long)
            # loss = loss_function(outputs.logits, targets)
            loss = outputs[0]

            loss.backward()
            optimizer.step()

            losses.append(loss.item())

            preds = np.argmax(outputs.logits.detach().cpu().numpy(), axis=1)
            labels = targets.detach().cpu().numpy()
            total_preds.extend(preds)
            total_labels.extend(labels)
           

    train_loss = np.mean(losses)
    train_result = calculate_result(np.array(total_labels), np.array(total_preds))

    eps = privacy_engine.get_epsilon(DELTA)
    eval_loss, eval_result = evaluate(model)

    print(
      f"Epoch: {epoch} | "
      f"ɛ: {eps:.2f} |"
      f"Train loss: {train_loss:.3f} | "
      f"Train result: {train_result} |\n"
      f"Eval loss: {eval_loss:.3f} | "
      f"Eval result: {eval_result} | "
    )

Epoch: 1 | ɛ: 14859.97 |Train loss: 1.471 | Train result: {'accuracy': 0.544, 'f1': 0.7035, 'auc': 0.5014} |
Eval loss: 1.298 | Eval result: {'accuracy': 0.543, 'f1': 0.7038, 'auc': 0.5} | 


Epoch: 2 | ɛ: 15845.02 |Train loss: 1.037 | Train result: {'accuracy': 0.613, 'f1': 0.7034, 'auc': 0.5911} |
Eval loss: 0.925 | Eval result: {'accuracy': 0.699, 'f1': 0.7097, 'auc': 0.701} | 


Epoch: 3 | ɛ: 16830.06 |Train loss: 1.081 | Train result: {'accuracy': 0.692, 'f1': 0.727, 'auc': 0.6861} |
Eval loss: 1.007 | Eval result: {'accuracy': 0.732, 'f1': 0.7637, 'auc': 0.7258} | 


In [102]:
# Sanity check: compare predicted positives with actual positives to see whether the model collapses to one class
sum(total_preds), sum(total_labels), len(total_labels)

(909, 543, 1000)