## Demo of BERT fine-tuned on the BRIGHTER dataset


This notebook demonstrates inference using a BERT model fine-tuned on the Swedish subset of the [BRIGHTER dataset](https://huggingface.co/datasets/brighter-dataset/BRIGHTER-emotion-categories). 

**Base model:** [KB-BERT](https://huggingface.co/KBLab/bert-base-swedish-cased)

### Dataset Information

**Dataset size:** 3,963 examples

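**Label distribution** (percentages are shares of the 3,904 total label assignments, not of the 3,963 examples; an example may carry several labels or none):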
- Joy: 1,584 (40.6%)
- Anger: 813 (20.8%)
- Disgust: 748 (19.2%)
- Sadness: 470 (12.0%)
- Surprise: 189 (4.8%)
- Fear: 100 (2.6%)

The original dataset splits were reorganized to use 80% for training, 10% for validation, and 10% for testing.
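As a rough illustration of an 80/10/10 split over 3,963 examples, here is a minimal sketch. The helper `split_indices`, the fixed seed, and the plain random shuffle are assumptions for illustration only; the source does not say whether the actual split was random or stratified.

```python
import random

def split_indices(n_examples, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle example indices and cut them into train/validation/test parts."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary, assumed value
    n_train = int(n_examples * train_frac)
    n_val = int(n_examples * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder goes to the test part
    return train, val, test

train_idx, val_idx, test_idx = split_indices(3963)
print(len(train_idx), len(val_idx), len(test_idx))  # 3170 396 397
```

Because the fractions do not divide 3,963 evenly, the leftover example ends up in the test part in this sketch.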

### Handling Class Imbalance

To account for the imbalanced distribution, two techniques were used:
1. **Upsampling** of rare labels (fear ×4, surprise ×2)
2. **Weighted loss** (`BCEWithLogitsLoss` with `pos_weight`)
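The two techniques can be sketched in plain Python as follows. The upsampling factors come from the list above; everything else is an assumption. In particular, the `negatives / positives` formula for `pos_weight` is a common heuristic and not necessarily the one used for this model, the counts are those of the full dataset and not of the training split, and the `upsample` helper with its toy examples is hypothetical.

```python
# Label counts from the distribution listed above (full dataset).
n_examples = 3963
label_counts = {
    "anger": 813, "disgust": 748, "fear": 100,
    "joy": 1584, "sadness": 470, "surprise": 189,
}

# Assumed heuristic: pos_weight = negatives / positives for each label.
pos_weight = {
    label: (n_examples - count) / count
    for label, count in label_counts.items()
}

# Upsampling factors as stated above; other labels keep a factor of 1.
upsample_factor = {"fear": 4, "surprise": 2}

def upsample(examples):
    """Repeat each example by the largest factor among its labels."""
    out = []
    for text, labels in examples:
        factor = max((upsample_factor.get(l, 1) for l in labels), default=1)
        out.extend([(text, labels)] * factor)
    return out

# Hypothetical toy examples, not taken from the dataset.
toy = [
    ("text a", ["joy"]),
    ("text b", ["fear", "sadness"]),
    ("text c", ["surprise"]),
    ("text d", []),
]
print(len(upsample(toy)))  # 1 + 4 + 2 + 1 = 8

for label, weight in pos_weight.items():
    print(f"{label:<9} {weight:.2f}")

# With PyTorch, the weights would be passed in label order, for example:
# torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor(list(pos_weight.values())))
```

Under this heuristic, rare labels such as fear get a much larger weight on their positive examples than frequent labels such as joy.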

### Performance

**Overall metrics:**
- F1 micro: 0.77
- F1 macro: 0.67
- Accuracy: 0.68

**Per-label F1 scores:**
- Joy: 0.95
- Anger: 0.73
- Disgust: 0.72
- Sadness: 0.72
- Fear: 0.57
- Surprise: 0.35
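To make the gap between micro and macro F1 concrete, the sketch below first checks that the macro score is the unweighted mean of the rounded per-label scores listed above, then contrasts both averages on toy counts. The `f1` helper and the `toy_counts` values are hypothetical and only serve the illustration.

```python
# Rounded per-label F1 scores as listed above.
per_label_f1 = {
    "joy": 0.95, "anger": 0.73, "disgust": 0.72,
    "sadness": 0.72, "fear": 0.57, "surprise": 0.35,
}

# Macro F1 is the unweighted mean over labels.
macro_f1 = sum(per_label_f1.values()) / len(per_label_f1)
print(round(macro_f1, 2))  # 0.67

def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical (tp, fp, fn) counts for a frequent and a rare label.
toy_counts = {"frequent": (90, 5, 5), "rare": (2, 4, 4)}

# Macro: average the per-label F1 scores.
toy_macro = sum(f1(*c) for c in toy_counts.values()) / len(toy_counts)

# Micro: pool the counts over all labels, then compute a single F1.
tp, fp, fn = (sum(c[i] for c in toy_counts.values()) for i in range(3))
toy_micro = f1(tp, fp, fn)

print(f"macro={toy_macro:.2f} micro={toy_micro:.2f}")  # macro=0.64 micro=0.91
```

Micro F1 is dominated by frequent labels, while macro F1 weighs every label equally, so weak scores on rare labels such as surprise and fear pull the macro figure down more.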

The original dataset splits were not kept because the training split was smaller than the others; instead, the majority of the data was used for training.

Results on the test split after fine-tuning, to four decimal places:

Per-label F1 score (test):
- anger: 0.7253
- disgust: 0.7168
- fear: 0.5714
- joy: 0.9467
- sadness: 0.7200
- surprise: 0.3492

Test results:
- eval_f1_micro: 0.7724
- eval_f1_macro: 0.6716
- eval_accuracy: 0.6801

In [4]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch


### Inference

In [5]:
# Load model
model_path = 'sbx/KB-bert-base-swedish-cased_emotions_brighter'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)


In [None]:
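def get_preds_single_sent(sent, tokenizer, model, threshold=0.1):
    """Return the sentence and its predicted labels as a '|label,score|...' string."""

    # Label order must match the order of the model's output logits
    label_list = ['anger', 'disgust', 'fear', 'joy', 'sadness', 'surprise']

    # A single sentence needs no padding, only truncation to the 512-token limit
    inputs = tokenizer(
        sent,
        return_tensors="pt",
        max_length=512,
        truncation=True,
    )

    model.eval()

    # Multi-label setup: an independent sigmoid per label, not a softmax over labels
    with torch.no_grad():
        logits = model(**inputs).logits
        probs = torch.sigmoid(logits).squeeze(0).tolist()

    label_scores = list(zip(label_list, probs))

    # Keep labels scoring above the threshold, highest score first
    predicted_labels = sorted(
        [(label, round(score, 2)) for label, score in label_scores if score > threshold],
        key=lambda x: x[1],
        reverse=True)

    # Format as '|label,score|label,score|', or '|' if no label passes the threshold
    labels_str = (
        "|"
        if not predicted_labels
        else "|" + "|".join(f"{lbl},{score:.2f}" for lbl, score in predicted_labels) + "|")

    return sent, labels_str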

In [7]:
# Try the model on a sentence
sent_to_classify = 'Den här produkten lever inte alls upp till mina förväntningar, skäms!'
print(get_preds_single_sent(sent_to_classify,tokenizer,model))

('Den här produkten lever inte alls upp till mina förväntningar, skäms!', '|anger,0.97|disgust,0.97|surprise,0.35|sadness,0.27|fear,0.16|')
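The thresholding step that produces this output string can be looked at in isolation, without loading the model. The sketch below is a plain-Python approximation: the `format_predictions` helper and the example logits are hypothetical, and unlike the notebook function it sorts on unrounded scores.

```python
import math

LABELS = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]

def sigmoid(x):
    """Map a logit to an independent per-label probability."""
    return 1.0 / (1.0 + math.exp(-x))

def format_predictions(logits, labels=LABELS, threshold=0.1):
    """Keep labels scoring above the threshold, highest first, as '|label,score|...'."""
    scores = [(label, sigmoid(z)) for label, z in zip(labels, logits)]
    kept = sorted(
        (pair for pair in scores if pair[1] > threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return "|" + "".join(f"{label},{score:.2f}|" for label, score in kept)

# Hypothetical logits, not produced by the model.
example_logits = [3.0, 2.0, -1.0, -4.0, 0.0, -3.0]
print(format_predictions(example_logits))
# |anger,0.95|disgust,0.88|sadness,0.50|fear,0.27|
```

Because each label gets its own sigmoid, the scores do not sum to 1 and several labels can pass the threshold at once, as with anger and disgust in the example sentence. A low threshold such as 0.1 favours recall; raising it would return fewer, more confident labels.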
