## Fine-tuning BERT on GoEmotions for Emotion Classification

This notebook fine-tunes a BERT model (`bert-base-uncased`) for multi-label emotion classification on the `GoEmotions` dataset, then exports the trained model to ONNX and runs a sample inference with ONNX Runtime.

## Set-up and imports

In [1]:
!pip install transformers datasets torch scikit-learn --quiet

In [3]:
import torch
import numpy as np
from datasets import load_dataset
from transformers import (
    BertTokenizerFast,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding
)
from sklearn.metrics import f1_score

## Loading and preprocessing dataset

In [4]:
dataset = load_dataset("go_emotions")
num_labels = dataset["train"].features["labels"].feature.num_classes


### Tokenizing

Tokenize the text with `BertTokenizerFast`, padding or truncating every sequence to a fixed length of 128 tokens.

In [5]:
model_name = "bert-base-uncased"
tokenizer = BertTokenizerFast.from_pretrained(model_name)

def tokenize(batch):
    return tokenizer(
        batch["text"],
        padding="max_length",
        truncation=True,
        max_length=128
    )

dataset = dataset.map(tokenize, batched=True)
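
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a minimal pure-Python sketch of the idea. It is not the tokenizer's actual implementation: the helper `pad_or_truncate` and the token ids are hypothetical, and the real tokenizer also keeps the closing `[SEP]` token when it truncates, which this sketch does not reproduce.

```python
def pad_or_truncate(ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = ids[:max_length]                 # cut sequences that are too long
    n_pad = max_length - len(ids)          # padding needed for short sequences
    attention_mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, attention_mask

# Illustrative token ids only; real ids come from the tokenizer vocabulary.
short_ids, short_mask = pad_or_truncate([101, 2023, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask marks real tokens with 1 and padding with 0, so the model can ignore the padded positions.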

Map:   0%|          | 0/43410 [00:00<?, ? examples/s]

Map:   0%|          | 0/5426 [00:00<?, ? examples/s]

Map:   0%|          | 0/5427 [00:00<?, ? examples/s]

### Multi-Label Preparation

Convert the dataset labels to multi-hot vectors, set the dataset format so that it returns PyTorch tensors, and define a custom collator that keeps the label tensors as floats during batching.

In [6]:
def to_multihot(batch):
    multi_hot = []
    for labels in batch["labels"]:
        vec = [0.0] * num_labels
        for l in labels:
            vec[l] = 1.0
        multi_hot.append(vec)
    batch["labels"] = multi_hot
    return batch

dataset = dataset.map(to_multihot, batched=True)

dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "labels"]
)
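
As a quick illustration of what the multi-hot conversion produces, the sketch below applies the same idea to a toy batch. The names `multihot`, `toy_num_labels`, and `toy_examples` are made up for this example; the notebook itself uses the real number of labels from the dataset.

```python
def multihot(label_ids, n_labels):
    """Turn a list of label indices into a 0/1 float vector of length n_labels."""
    vec = [0.0] * n_labels
    for i in label_ids:
        vec[i] = 1.0
    return vec

toy_num_labels = 4                   # toy value, not the real label count
toy_examples = [[0, 2], [1], []]     # each inner list holds the labels of one text

toy_vectors = [multihot(ids, toy_num_labels) for ids in toy_examples]
for ids, vec in zip(toy_examples, toy_vectors):
    print(ids, "->", vec)
```

A text with several labels gets several ones in its vector, which is what makes the task multi-label rather than multi-class.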

In [7]:
class MultiLabelCollator(DataCollatorWithPadding):
    def __call__(self, features):
        batch = super().__call__(features)
        # Ensure float targets for the multi-label (binary cross-entropy) loss
        batch["labels"] = batch["labels"].float()
        return batch

data_collator = MultiLabelCollator(tokenizer=tokenizer)
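
The float cast matters because, with `problem_type="multi_label_classification"`, `BertForSequenceClassification` uses a binary cross-entropy loss on the logits (`BCEWithLogitsLoss` in PyTorch), which expects float targets. The following pure-Python sketch shows that loss on a toy example; `bce_with_logits`, `toy_logits`, and `toy_targets` are hypothetical names used only for illustration.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all labels, computed from raw logits."""
    total = 0.0
    for x, y in zip(logits, targets):
        # Numerically stable form of -[y*log(s(x)) + (1-y)*log(1-s(x))],
        # where s is the sigmoid function.
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

toy_logits = [2.0, -1.0, 0.0]     # one score per label
toy_targets = [1.0, 0.0, 1.0]     # multi-hot targets as floats

toy_loss = bce_with_logits(toy_logits, toy_targets)
print(round(toy_loss, 4))
```

Each label contributes its own independent binary term, so any number of labels can be active for one text.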

## Model

### Model initialization

In [7]:
model = BertForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    problem_type="multi_label_classification"
)

BertForSequenceClassification LOAD REPORT from: bert-base-uncased
Key                                        | Status     | 
-------------------------------------------+------------+-
cls.predictions.transform.dense.bias       | UNEXPECTED | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED | 
cls.predictions.bias                       | UNEXPECTED | 
cls.seq_relationship.weight                | UNEXPECTED | 
cls.seq_relationship.bias                  | UNEXPECTED | 
cls.predictions.transform.dense.weight     | UNEXPECTED | 
classifier.bias                            | MISSING    | 
classifier.weight                          | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


### Defining F1-macro metric for evaluation

In [8]:
def compute_metrics(pred):
    logits = torch.tensor(pred.predictions)
    probs = torch.sigmoid(logits)
    y_pred = (probs > 0.5).int().numpy()
    y_true = pred.label_ids
    return {
        "f1_macro": f1_score(y_true, y_pred, average="macro")
    }
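
To show what the metric measures, here is a small pure-Python sketch of the same pipeline: sigmoid, threshold at 0.5, then an unweighted average of per-label F1 scores. The helpers and toy data are hypothetical, and the notebook itself relies on `sklearn.metrics.f1_score`. A label that is never predicted correctly contributes 0 to the average.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1; a label with no true positives scores 0."""
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_labels

toy_logits = [[2.0, -1.0, 0.5], [-3.0, 1.5, -0.2], [0.3, -2.0, -1.0]]
toy_true = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]

# Same decision rule as compute_metrics: probability above 0.5 means "label present".
toy_pred = [[int(sigmoid(x) > 0.5) for x in row] for row in toy_logits]
toy_f1 = macro_f1(toy_true, toy_pred)
print(toy_pred, round(toy_f1, 4))
```

Because every label counts equally, labels the model rarely or never predicts correctly pull the macro average down regardless of how few examples they have.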

### Training arguments

In [10]:
training_args = TrainingArguments(
    output_dir="./bert_goemotions",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_steps=100,
    load_best_model_at_end=True,
)
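
As a rough sanity check on these settings, the arithmetic below estimates how many optimizer steps the run takes. It is a sketch under stated assumptions: a single device, no gradient accumulation, and the last partial batch kept. The figure of 43,410 training examples is the size reported when the train split was tokenized.

```python
import math

train_examples = 43410    # size of the train split reported during tokenization
batch_size = 16           # per_device_train_batch_size
epochs = 3                # num_train_epochs
logging_steps = 100       # logging_steps

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
log_events = total_steps // logging_steps

print(steps_per_epoch, total_steps, log_events)
```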

### Trainer set-up and training

In [11]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    compute_metrics=compute_metrics,
    data_collator=data_collator
)

In [12]:
trainer.train()
trainer.evaluate(dataset["test"])

| Epoch | Training Loss | Validation Loss | F1 Macro |
|-------|---------------|-----------------|----------|
| 1 | 0.092583 | 0.09241 | 0.245512 |
| 2 | 0.082855 | 0.085603 | 0.38172 |
| 3 | 0.073822 | 0.085399 | 0.400494 |


There were missing keys in the checkpoint model loaded: ['bert.embeddings.LayerNorm.weight', 'bert.embeddings.LayerNorm.bias', 'bert.encoder.layer.0.attention.output.LayerNorm.weight', 'bert.encoder.layer.0.attention.output.LayerNorm.bias', 'bert.encoder.layer.0.output.LayerNorm.weight', 'bert.encoder.layer.0.output.LayerNorm.bias', 'bert.encoder.layer.1.attention.output.LayerNorm.weight', 'bert.encoder.layer.1.attention.output.LayerNorm.bias', 'bert.encoder.layer.1.output.LayerNorm.weight', 'bert.encoder.layer.1.output.LayerNorm.bias', 'bert.encoder.layer.2.attention.output.LayerNorm.weight', 'bert.encoder.layer.2.attention.output.LayerNorm.bias', 'bert.encoder.layer.2.output.LayerNorm.weight', 'bert.encoder.layer.2.output.LayerNorm.bias', 'bert.encoder.layer.3.attention.output.LayerNorm.weight', 'bert.encoder.layer.3.attention.output.LayerNorm.bias', 'bert.encoder.layer.3.output.LayerNorm.weight', 'bert.encoder.layer.3.output.LayerNorm.bias', 'bert.encoder.layer.4.attention.output.La

{'eval_loss': 0.08445675671100616,
 'eval_f1_macro': 0.4156366580553526,
 'eval_runtime': 34.6601,
 'eval_samples_per_second': 156.578,
 'eval_steps_per_second': 4.905,
 'epoch': 3.0}

In [13]:
trainer.save_model("./bert_goemotions_model")
tokenizer.save_pretrained("./bert_goemotions_model")

('./bert_goemotions_model/tokenizer_config.json',
 './bert_goemotions_model/tokenizer.json')

In [15]:
!pip install "optimum[onnxruntime]" --quiet

## Saving and exporting the model

In [9]:

from optimum.onnxruntime import ORTModelForSequenceClassification

# Load the fine-tuned checkpoint and convert it to ONNX while loading
ort_model = ORTModelForSequenceClassification.from_pretrained(
    "./bert_goemotions_model",
    export=True
)

# Save the exported model; the next cell loads onnx_model/model.onnx
ort_model.save_pretrained("./onnx_model")


In [11]:
import onnxruntime as ort
import numpy as np
import torch

sample = tokenizer(
    "I am extremely happy today!",
    return_tensors="np",
    padding="max_length",
    truncation=True,
    max_length=128
)

session = ort.InferenceSession("onnx_model/model.onnx")

inputs = {
    "input_ids": sample["input_ids"],
    "attention_mask": sample["attention_mask"],
    "token_type_ids": sample["token_type_ids"]
}

outputs = session.run(None, inputs)

logits = torch.tensor(outputs[0])
probs = torch.sigmoid(logits)

print("Probabilities:", probs)

Probabilities: tensor([[0.0437, 0.0147, 0.0042, 0.0076, 0.0358, 0.0219, 0.0083, 0.0057, 0.0082,
         0.0144, 0.0125, 0.0038, 0.0036, 0.0776, 0.0043, 0.0607, 0.0031, 0.8185,
         0.0188, 0.0032, 0.0144, 0.0094, 0.0233, 0.0256, 0.0030, 0.0110, 0.0073,
         0.0233]])
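
To turn a probability vector like the one above into predictions, the same 0.5 threshold used during evaluation can be applied. The sketch below copies the printed probabilities into a plain list and uses placeholder names (`label_0`, `label_1`, ...), which are assumptions for illustration; in the notebook the real names should be available from `dataset["train"].features["labels"].feature.names`.

```python
# Probabilities copied from the output above (28 values, one per label).
toy_probs = [
    0.0437, 0.0147, 0.0042, 0.0076, 0.0358, 0.0219, 0.0083, 0.0057, 0.0082,
    0.0144, 0.0125, 0.0038, 0.0036, 0.0776, 0.0043, 0.0607, 0.0031, 0.8185,
    0.0188, 0.0032, 0.0144, 0.0094, 0.0233, 0.0256, 0.0030, 0.0110, 0.0073,
    0.0233,
]

# Placeholder names; replace with the dataset's real label names.
placeholder_names = [f"label_{i}" for i in range(len(toy_probs))]

def labels_above(probs, names, threshold=0.5):
    """Return (name, probability) pairs above the threshold, highest first."""
    picked = [(n, p) for n, p in zip(names, probs) if p > threshold]
    return sorted(picked, key=lambda pair: pair[1], reverse=True)

predicted = labels_above(toy_probs, placeholder_names)
print(predicted)
```

Only the label at index 17 exceeds the threshold for this sentence; every other label stays below 0.08.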
