<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Finetuning-Pre-trained-BERT-Model-on-Text-Classification-Task-And-Inferencing-with-ONNX-Runtime" data-toc-modified-id="Finetuning-Pre-trained-BERT-Model-on-Text-Classification-Task-And-Inferencing-with-ONNX-Runtime-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Finetuning Pre-trained BERT Model on Text Classification Task And Inferencing with ONNX Runtime</a></span><ul class="toc-item"><li><span><a href="#Tokenizer" data-toc-modified-id="Tokenizer-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Tokenizer</a></span></li><li><span><a href="#Model-FineTuning" data-toc-modified-id="Model-FineTuning-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Model FineTuning</a></span></li></ul></li><li><span><a href="#Reference" data-toc-modified-id="Reference-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Reference</a></span></li></ul></div>

In [1]:
import os
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


# Finetuning Pre-trained BERT Model on Text Classification Task And Inferencing with ONNX Runtime

In this article, we'll go over two main things:

- Finetuning a pre-trained BERT model on a text classification task, more specifically the [Quora Question Pairs](https://www.kaggle.com/c/quora-question-pairs/data) challenge.
- Converting our model into [ONNX](https://onnx.ai/) format and benchmarking inference with [ONNX runtime](https://www.onnxruntime.ai/).

In [2]:
from datasets import load_dataset, DatasetDict, Dataset

dataset_dict = load_dataset("quora")
dataset_dict

DatasetDict({
    train: Dataset({
        features: ['questions', 'is_duplicate'],
        num_rows: 404290
    })
})

In [3]:
dataset_dict['train'][0]

{'questions': {'id': [1, 2],
  'text': ['What is the step by step guide to invest in share market in india?',
   'What is the step by step guide to invest in share market?']},
 'is_duplicate': False}

In [4]:
test_size = 0.1
val_size = 0.1
dataset_dict_test = dataset_dict['train'].train_test_split(test_size=test_size)
dataset_dict_train_val = dataset_dict_test['train'].train_test_split(test_size=val_size)

dataset_dict = DatasetDict({
    "train": dataset_dict_train_val["train"],
    "val": dataset_dict_train_val["test"],
    "test": dataset_dict_test["test"]
})
dataset_dict

DatasetDict({
    train: Dataset({
        features: ['questions', 'is_duplicate'],
        num_rows: 327474
    })
    val: Dataset({
        features: ['questions', 'is_duplicate'],
        num_rows: 36387
    })
    test: Dataset({
        features: ['questions', 'is_duplicate'],
        num_rows: 40429
    })
})
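The row counts above follow from applying the two splits in sequence. Here is a minimal sketch of the arithmetic, assuming the held-out portion is rounded up to a whole row. That assumption is consistent with the counts shown, but it is not a statement about the library's internals. `split_sizes` is a hypothetical helper, not part of `datasets`.

```python
import math

def split_sizes(num_rows: int, test_size: float) -> tuple[int, int]:
    """Return (kept, held_out) row counts for a fractional split.

    Assumption: the held-out count is rounded up to a whole row.
    """
    held_out = math.ceil(num_rows * test_size)
    return num_rows - held_out, held_out

total = 404290
train_val, test = split_sizes(total, 0.1)  # first split: carve out the test set
train, val = split_sizes(train_val, 0.1)   # second split: carve out the validation set
print(train, val, test)
```

Because the validation set is taken from the remaining 90%, it ends up at roughly 9% of the original data rather than 10%.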

## Tokenizer

We won't go over the details of the pre-trained tokenizer or model here; we simply load one that is available from the Hugging Face model repository.

In [5]:
from transformers import AutoTokenizer

# https://huggingface.co/transformers/model_doc/distilbert.html
pretrained_model_name_or_path = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path)
tokenizer

PreTrainedTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

We can feed our tokenizer directly with a pair of sentences.

In [6]:
encoded_input = tokenizer(
    'What is the step by step guide to invest in share market in india?',
    'What is the step by step guide to invest in share market?'
)
encoded_input

{'input_ids': [101, 2054, 2003, 1996, 3357, 2011, 3357, 5009, 2000, 15697, 1999, 3745, 3006, 1999, 2634, 1029, 102, 2054, 2003, 1996, 3357, 2011, 3357, 5009, 2000, 15697, 1999, 3745, 3006, 1029, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Decoding the tokenized inputs shows that this model's tokenizer adds special tokens such as `[CLS]` and `[SEP]`. The `[SEP]` token marks the end of each segment, indicating which tokens belong to which sentence of the pair.

In [7]:
tokenizer.decode(encoded_input["input_ids"])

'[CLS] what is the step by step guide to invest in share market in india? [SEP] what is the step by step guide to invest in share market? [SEP]'
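The layout of the decoded string can be sketched in plain Python. This is only a toy on pre-split word lists, not the WordPiece tokenizer itself, and `build_pair` is a hypothetical helper introduced for illustration.

```python
def build_pair(tokens_a, tokens_b, cls="[CLS]", sep="[SEP]"):
    """Lay out two token lists the way the decoded output above shows."""
    return [cls] + tokens_a + [sep] + tokens_b + [sep]

pair = build_pair(["how", "are", "you", "?"], ["are", "you", "ok", "?"])
print(" ".join(pair))

# every position holds a real token (no padding yet), so the mask is all ones
attention_mask = [1] * len(pair)
print(attention_mask)
```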

The preprocessing step is task specific; if we use another dataset, this function needs to be modified accordingly.

In [8]:
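def tokenize_fn(examples):
    # with batched=True, `examples` maps each column name to a list of values
    labels = [int(label) for label in examples['is_duplicate']]
    texts = [question['text'] for question in examples['questions']]
    # split each question pair into its first and second sentence
    texts1 = [text[0] for text in texts]
    texts2 = [text[1] for text in texts]
    # truncate pairs that exceed the model's maximum length (512 tokens)
    tokenized_examples = tokenizer(texts1, texts2, truncation=True)
    tokenized_examples['labels'] = labels
    return tokenized_examples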

In [9]:
dataset_dict_tokenized = dataset_dict.map(
    tokenize_fn,
    batched=True,
    num_proc=8,
    remove_columns=['is_duplicate', 'questions']
)
dataset_dict_tokenized

                

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 327474
    })
    val: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 36387
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 40429
    })
})
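With `batched=True`, the function passed to `map` receives a dictionary of columns rather than a single row, which is why `tokenize_fn` iterates over lists. The sketch below reproduces only the column extraction on a made-up two-row batch; the question texts are invented for illustration, and no tokenizer is involved.

```python
# toy batch shaped like what `map(batched=True)` hands to the function:
# a dict mapping each column name to a list of values
examples = {
    "questions": [
        {"id": [1, 2], "text": ["first question a?", "first question b?"]},
        {"id": [3, 4], "text": ["second question a?", "second question b?"]},
    ],
    "is_duplicate": [False, True],
}

# booleans become integer class labels
labels = [int(label) for label in examples["is_duplicate"]]
# first and second sentence of every pair, as two parallel lists
texts1 = [question["text"][0] for question in examples["questions"]]
texts2 = [question["text"][1] for question in examples["questions"]]
print(labels, texts1, texts2)
```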

In [10]:
dataset_dict_tokenized['train'][0]

{'input_ids': [101, 2054, 2024, 2070, 1997, 2115, 2047, 2095, 1005, 1055, 18853, 2005, 2418, 1029, 102, 2054, 2003, 2115, 2047, 2095, 1005, 1055, 18853, 2005, 2418, 1029, 102],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'labels': 1}

## Model FineTuning

Having preprocessed our raw dataset, we use the `AutoModelForSequenceClassification` class to load the pre-trained model for our text classification task. The only other argument we need to specify is the number of classes/labels the task has. Upon instantiating the model for the first time, we'll see some warnings telling us that the newly initialized classification head should be fine-tuned on our downstream task before the model is used.

In [11]:
model_checkpoint = 'text_classification'
num_labels = 2

In [12]:
from transformers import (
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)

# we'll save the model after fine tuning it once, so we can skip the fine tuning part during
# the second round if we detect that we already have one available
if os.path.isdir(model_checkpoint):
    model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)
else:
    model = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path, num_labels=num_labels)

print('# of parameters: ', model.num_parameters())
model

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'pre_classi

# of parameters:  66955010


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

In [13]:
data_collator = DataCollatorWithPadding(tokenizer, padding=True)
data_collator

DataCollatorWithPadding(tokenizer=PreTrainedTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}), padding=True, max_length=None, pad_to_multiple_of=None, return_tensors='pt')
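The collator pads each batch dynamically, so every batch is only as long as its longest example. Below is a minimal pure-Python sketch of that idea, assuming right-side padding with token id 0 (matching `padding_side='right'` and `padding_idx=0` in the outputs above). `pad_batch` is a hypothetical helper, not the library's implementation.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 2054, 102], [101, 2054, 2003, 1996, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch instead of to a fixed global length avoids spending compute on padding tokens when most examples are short.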

We can perform all sorts of hyperparameter tuning during the fine-tuning step; here we pick some default parameters for illustration purposes.

In [14]:
batch_size = 128
args = TrainingArguments(
    "quora",
    evaluation_strategy="epoch",
    # load_best_model_at_end requires the save strategy to match the
    # evaluation strategy, otherwise a ValueError is raised
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=2,
    weight_decay=0.01,
    load_best_model_at_end=True
)

trainer = Trainer(
    model,
    args,
    data_collator=data_collator,
    train_dataset=dataset_dict_tokenized["train"],
    eval_dataset=dataset_dict_tokenized['val']
)

In [None]:
if not os.path.isdir(model_checkpoint):
    trainer.train()
    model.save_pretrained(model_checkpoint)
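To get a feel for how long this training run is, we can estimate the number of optimizer steps from the train split size and the batch size used above. This is a rough sketch that assumes a single device and no gradient accumulation; `num_devices` is an assumption introduced here, not a setting from the notebook.

```python
import math

num_train_rows = 327474  # size of the train split shown earlier
batch_size = 128         # per-device train batch size
num_train_epochs = 2
num_devices = 1          # assumption: one device, no gradient accumulation

# the last, partially filled batch still counts as a step
steps_per_epoch = math.ceil(num_train_rows / (batch_size * num_devices))
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)
```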

The next few code cells perform batch inference on our test set and report standard binary classification evaluation metrics.

In [None]:
import torch
import torch.nn.functional as F

def predict(model, example, round_digits: int = 5):
    input_ids = example['input_ids'].to(model.device)
    attention_mask = example['attention_mask'].to(model.device)
    batch_labels = example['labels'].detach().cpu().numpy().tolist()
    model.eval()
    with torch.no_grad():
        batch_output = model(input_ids, attention_mask)

    batch_scores = F.softmax(batch_output.logits, dim=-1)[:, 1]
    batch_scores = np.round(batch_scores.detach().cpu().numpy(), round_digits).tolist()
    return batch_scores, batch_labels
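The score returned by `predict` is the softmax probability of label 1, i.e. the "is duplicate" class. The following standalone sketch shows that computation on hand-picked logits using only the standard library; `positive_class_score` is a hypothetical helper, and the logits are made up.

```python
import math

def positive_class_score(logits):
    """Softmax over a two-element list of logits; return the probability of label 1."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    return exps[1] / sum(exps)

batch_logits = [[0.0, 0.0], [2.0, -1.0]]
scores = [positive_class_score(row) for row in batch_logits]
print(scores)
```

Equal logits give a score of 0.5, and the score shrinks toward 0 as the logit for label 0 grows relative to the logit for label 1.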

In [None]:
from torch.utils.data import DataLoader

def predict_data_loader(model, data_loader: DataLoader) -> pd.DataFrame:
    scores = []
    labels = []
    for example in data_loader:
        batch_scores, batch_labels = predict(model, example)
        scores += batch_scores
        labels += batch_labels

    df_predictions = pd.DataFrame.from_dict({'scores': scores, 'labels': labels})
    return df_predictions

In [None]:
data_loader = DataLoader(dataset_dict_tokenized['test'], collate_fn=data_collator, batch_size=64)
start = time.time()
df_predictions = predict_data_loader(model, data_loader)
end = time.time()
print('elapsed: ', end - start)
print(df_predictions.shape)
df_predictions.head()

In [None]:
import sklearn.metrics as metrics

def compute_binary_classification_metrics(y_true, y_score, round_digits: int = 3):
    auc = metrics.roc_auc_score(y_true, y_score)
    log_loss = metrics.log_loss(y_true, y_score)

    precision, recall, threshold = metrics.precision_recall_curve(y_true, y_score)
    f1 = 2 * (precision * recall) / (precision + recall)

    mask = ~np.isnan(f1)
    f1 = f1[mask]
    precision = precision[mask]
    recall = recall[mask]

    best_index = np.argmax(f1)
    precision = precision[best_index]
    recall = recall[best_index]
    f1 = f1[best_index]
    return {
        'auc': round(auc, round_digits),
        'precision': round(precision, round_digits),
        'recall': round(recall, round_digits),
        'f1': round(f1, round_digits),
        'log_loss': round(log_loss, round_digits)
    }

In [None]:
compute_binary_classification_metrics(df_predictions['labels'], df_predictions['scores'])
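The precision, recall and F1 reported by the function above are taken at the score threshold that maximizes F1. The idea can be sketched without scikit-learn by sweeping each distinct score as a threshold. `best_f1` is a hypothetical helper on toy data, and it assumes an example is predicted positive when its score is at or above the threshold.

```python
def best_f1(y_true, y_score):
    """Sweep every distinct score as a threshold and keep the best F1."""
    best = {"threshold": None, "precision": 0.0, "recall": 0.0, "f1": 0.0}
    for threshold in sorted(set(y_score)):
        preds = [score >= threshold for score in y_score]
        tp = sum(p and t == 1 for p, t in zip(preds, y_true))
        fp = sum(p and t == 0 for p, t in zip(preds, y_true))
        fn = sum((not p) and t == 1 for p, t in zip(preds, y_true))
        if tp == 0:
            continue  # F1 is undefined (0 / 0) without any true positive
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best["f1"]:
            best = {"threshold": threshold, "precision": precision,
                    "recall": recall, "f1": f1}
    return best

result = best_f1([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(result)
```

Since the threshold is chosen on the same data the metrics are reported on, these numbers are an optimistic view of the operating point; AUC and log loss do not depend on this choice.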

# Reference

- [Jupyter Notebook: Fine-tuning a model on a text classification task](https://nbviewer.org/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb)
- [Blog: Faster and smaller quantized NLP with Hugging Face and ONNX Runtime](https://medium.com/microsoftazure/faster-and-smaller-quantized-nlp-with-hugging-face-and-onnx-runtime-ec5525473bb7)
- [PyTorch Documentation: Dynamic Quantization](https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html)
- [Finetuning Pre-trained BERT Model on Text Classification Task](https://github.com/ethen8181/machine-learning/blob/master/model_deployment/transformers/text_classification_onnxruntime.ipynb)