# **Teaching a Transformer to Feel: Emotion Detection in English Text**
##### - Osman Türkmen (h12310665)

### **Project Idea and Goal**
The goal of this project is to explore whether a pre-trained Transformer model can accurately recognize human emotions expressed in English text. Modern NLP models such as **DistilRoBERTa** already capture a wide range of linguistic patterns, but their performance heavily depends on the data they were trained on.

In this project, we **first evaluate the baseline performance of an existing emotion-classification model** on a clean, well-established benchmark dataset. After analyzing the typical **errors** of the baseline model, we **fine-tune** it on a larger and more diverse emotion dataset containing all seven target classes (anger, disgust, fear, joy, neutral, sadness, surprise).

The main objective is to investigate whether fine-tuning on a richer dataset leads to measurable **improvements in performance**. By comparing baseline and fine-tuned performance on the same test set, we aim to demonstrate how additional supervised training enables a Transformer to better “understand” emotional nuances in text.

### **Motivation**

Understanding **human emotions** in text is essential for many modern applications such as customer feedback analysis, mental-health monitoring, social-media moderation, and conversational AI.
However, pre-trained neural models, although powerful, often struggle to generalize across different datasets, domains, or subtle emotional nuances.

By exploring how well an existing emotion-classification model performs before and after fine-tuning on new training data, we can better understand both the **strengths and the limitations** of transformer-based NLP systems. This project therefore aims to investigate **whether targeted fine-tuning can significantly improve an emotion model’s performance** and make it more reliable for real-world use cases.

#### **Loading packages**
We start by installing and loading the required packages.

In [1]:
#install dependencies first so that the imports below succeed
!pip install transformers[torch]
!pip install datasets
!pip install scikit-learn

import pandas as pd
import torch

from sklearn.metrics import accuracy_score, classification_report, f1_score
from sklearn.model_selection import train_test_split

from datasets import load_dataset, Dataset

from transformers import (AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding, TrainingArguments, Trainer)






#### **Run the model**
We run the pretrained emotion model from the Hugging Face Hub on a single example sentence using the `pipeline` helper.

In [2]:
from transformers import pipeline

classifier = pipeline(
    task="text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",
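    return_all_scores=True   #return scores for all seven labels; newer transformers releases deprecate this in favor of top_k=None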
)

classifier("I love this!")


Device set to use cpu


[[{'label': 'anger', 'score': 0.004419783595949411},
  {'label': 'disgust', 'score': 0.0016119909705594182},
  {'label': 'fear', 'score': 0.00041385178337804973},
  {'label': 'joy', 'score': 0.9771687984466553},
  {'label': 'neutral', 'score': 0.005764589179307222},
  {'label': 'sadness', 'score': 0.0020923891570419073},
  {'label': 'surprise', 'score': 0.00852868054062128}]]
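
To reduce this nested output to a single prediction, one can pick the entry with the highest score. The snippet below is a minimal sketch that assumes the list-of-lists shape shown above; `top_label` is a hypothetical helper, and the scores are rounded copies of the output above.

```python
# Rounded copy of the pipeline output above: one inner list per input text.
scores = [[
    {"label": "anger", "score": 0.0044},
    {"label": "disgust", "score": 0.0016},
    {"label": "fear", "score": 0.0004},
    {"label": "joy", "score": 0.9772},
    {"label": "neutral", "score": 0.0058},
    {"label": "sadness", "score": 0.0021},
    {"label": "surprise", "score": 0.0085},
]]

def top_label(per_text_scores):
    """Return (label, score) of the highest-scoring emotion for one text."""
    best = max(per_text_scores, key=lambda entry: entry["score"])
    return best["label"], best["score"]

label, score = top_label(scores[0])
print(label, score)
```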

In [3]:
#HuggingFace model & tokenizer loading

model_name = "j-hartmann/emotion-english-distilroberta-base"  #specify which pretrained model we want to use

tokenizer = AutoTokenizer.from_pretrained(model_name) #load the tokenizer associated with the model
model = AutoModelForSequenceClassification.from_pretrained(model_name)  #load the pretrained model weights

trainer = Trainer(model=model)   #initialize a Trainer object using only the model (no training yet)


In [4]:
#Dataset wrapper
class SimpleDataset:
    def __init__(self, tokenized):        #store the already-tokenized batch
        self.tokenized = tokenized

    def __len__(self):                   #returns number of samples in the dataset
        return len(self.tokenized["input_ids"])

    def __getitem__(self, idx):          #returns a single sample (one row)
        return {k: v[idx] for k, v in self.tokenized.items()}


#### **Load the first dataset**
This dataset serves as the fixed test set throughout the project: we first measure the baseline model on it and later compare those results with the fine-tuned model.

In [5]:
#get dair-ai dataset from hugging face
ds = load_dataset("dair-ai/emotion", "split")

# create a dataframe using ONLY the test split (text + label)
df_dair = pd.DataFrame({
    "text": ds["test"]["text"],
    "label": ds["test"]["label"]
})


In [6]:
#change the labels from dair-ai to text
id2label = {
    0: "sadness",
    1: "joy",
    2: "love",
    3: "anger",
    4: "fear",
    5: "surprise"
}
#replace numbers in dataframe with text labels
df_dair["label"] = df_dair["label"].map(id2label)


In [7]:
# Tokenize

texts = df_dair["text"].astype(str).tolist()       #extract text samples as a Python list

tokenized = tokenizer(
    texts,
    truncation=True,          #cut off texts that are too long for the model’s max sequence length
    padding=True             #pad shorter texts so all sequences have the same length
)

pred_dataset = SimpleDataset(tokenized) #wrap the tokenized output into our SimpleDataset class so that trainer can use it



In [8]:
#use Hugging Face Trainer to run model on our tokenized test dataset...
predictions = trainer.predict(pred_dataset) #returns model outputs (logits)
logits = predictions.predictions #extract logits from the predictions output




In [9]:
softmax = torch.nn.functional.softmax(torch.tensor(logits), dim=-1).numpy()  #convert logits to class probabilities using softmax

pred_ids = softmax.argmax(axis=1)   #get the index of the highest probability for each sample
pred_labels = [model.config.id2label[i] for i in pred_ids]      #convert class IDs to human-readable emotion labels

df_dair["baseline_pred"] = pred_labels #store baseline predictions in dataframe
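
Softmax preserves the ordering of the logits, so the predicted class is the same whether `argmax` is applied to the logits or to the probabilities. The softmax step is only needed when the probabilities themselves are of interest. The following is a minimal pure-Python sketch of that property. The logits are made up for illustration, and `softmax` and `argmax` are hypothetical helpers, not the functions used in the cell above.

```python
import math

def softmax(logits):
    """Numerically stable softmax for one row of logits."""
    shift = max(logits)                      # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

# Made-up logits for the seven classes, for illustration only.
example_logits = [-1.2, -2.5, -3.0, 4.1, 0.3, -0.8, 0.9]
probs = softmax(example_logits)

# The winning index is identical for logits and probabilities.
print(argmax(example_logits), argmax(probs))
print(round(sum(probs), 6))
```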


#### **Baseline Performance on the DAIR-AI Dataset (Before Fine-Tuning)**

The baseline DistilRoBERTa model reaches 83.9% accuracy, which is already strong given that it has not been fine-tuned on this dataset. However, per-class performance varies considerably:

* **Anger, fear, joy, and sadness** achieve high precision and recall (≈0.83–0.93), showing that the model already handles these categories well.

* **"Love"** gets 0% precision and recall because the model's label set has no "love" class, so all 159 "love" examples are necessarily misclassified. Conversely, **"disgust" and "neutral"** are model outputs that never occur as gold labels in dair-ai (support 0), so every such prediction counts as an error.

* Surprise has moderate results (precision 0.59, recall 0.80), suggesting that the model over-predicts surprise for related emotions.

Overall, the baseline performs well on the frequent emotions that both label sets share but struggles with surprise and with labels outside its schema, which motivates the fine-tuning experiment below.
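
The label mismatch can be made explicit with a few set operations. This is a small sketch using the label names and test-split support values that appear in this notebook. The variable names are chosen for illustration. It also computes the best accuracy any model restricted to the seven-label schema could reach on this test set.

```python
# Label sets copied from the notebook: the model's id2label and the dair-ai mapping.
model_labels = {"anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"}
dair_labels = {"sadness", "joy", "love", "anger", "fear", "surprise"}

# Test-split support per gold label, taken from the classification report.
support = {"anger": 275, "fear": 224, "joy": 695, "love": 159, "sadness": 581, "surprise": 66}

unreachable = dair_labels - model_labels   # gold labels the model can never output
unused = model_labels - dair_labels        # model outputs that are always wrong here

total = sum(support.values())
lost = sum(support[label] for label in unreachable)
accuracy_ceiling = (total - lost) / total  # best possible accuracy without a "love" class

print(sorted(unreachable), sorted(unused), accuracy_ceiling)
```

With 159 of 2000 test examples labeled "love", accuracy is capped at roughly 92% for this label schema, which puts the 83.9% baseline in context.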

In [10]:
#Extract true labels from the dataset
y_true = df_dair["label"] #true emotion labels from the dataset
y_pred = df_dair["baseline_pred"] #predicted labels from the baseline (not yet fine-tuned) model

#compute overall accuracy
acc = accuracy_score(y_true, y_pred) #percentage of correct predictions
print("Baseline Accuracy:", acc)

#print a full classification report (including precision, recall, F1-score)
print("\nClassification Report:")
print(classification_report(y_true, y_pred))


Baseline Accuracy: 0.839

Classification Report:
              precision    recall  f1-score   support

       anger       0.83      0.91      0.87       275
     disgust       0.00      0.00      0.00         0
        fear       0.85      0.87      0.86       224
         joy       0.83      0.93      0.88       695
        love       0.00      0.00      0.00       159
     neutral       0.00      0.00      0.00         0
     sadness       0.90      0.91      0.91       581
    surprise       0.59      0.80      0.68        66

    accuracy                           0.84      2000
   macro avg       0.50      0.55      0.52      2000
weighted avg       0.78      0.84      0.81      2000





In [11]:
#Error Analysis from the baseline-model (dair-ai)

df_dair_errors = df_dair[df_dair["label"] != df_dair["baseline_pred"]].copy()    #select all misclassified samples

print(f"Number of errors: {len(df_dair_errors)} of {len(df_dair)} examples") #show how many predictions were wrong
print("\nExample misclassifications:")
print(df_dair_errors[["text", "label", "baseline_pred"]].head(20))

#analyze where model makes systematic mistakes (true label vs. predicted label)
print("\nError confusion matrix (only misclassified rows):")
print(pd.crosstab(df_dair_errors["label"], df_dair_errors["baseline_pred"]))        #counts of mistake types


Number of errors: 322 of 2000 examples

Example misclassifications:
                                                  text     label baseline_pred
10                  i don t feel particularly agitated      fear         anger
14   i find myself in the odd position of feeling s...      love           joy
55   i have tried to see what it would be like if i...   sadness           joy
62   i spent wandering around still kinda dazed and...       joy      surprise
71   i feel like a naughty school girl because i am...      love       sadness
72   i am right handed however i play billiards lef...  surprise          fear
74   i were to go overseas or cross the border then...      love       sadness
79        i want each of you to feel my gentle embrace      love           joy
93   i was feeling weird the other day and it went ...      fear      surprise
94           when a friend dropped a frog down my neck     anger       disgust
96   i love neglecting this blog but sometimes i fe...      

#### **Load the second dataset**
Next, we load the second dataset, which we use to fine-tune the model. After training on it, we evaluate the model again on the same test set (the dair-ai dataset).

In [12]:
#load boltuix dataset from hugging face
ds_bolt = load_dataset("boltuix/emotions-dataset")

#convert to dataframe
df_bolt_raw = ds_bolt["train"].to_pandas()

print(df_bolt_raw.head())
print(df_bolt_raw["Label"].value_counts()) #show how many samples exist per emotion class


                                            Sentence      Label
0  Unfortunately later died from eating tainted m...  happiness
1  Last time I saw was loooong ago. Basically bef...    neutral
2  You mean by number of military personnel? Beca...    neutral
3  Need to go middle of the road no NAME is going...    sadness
4           feel melty miserable enough imagine must    sadness
Label
happiness    31205
sadness      17809
neutral      15733
anger        13341
love         10512
fear          8795
disgust       8407
confusion     8209
surprise      4560
shame         4248
guilt         3470
sarcasm       2534
desire        2483
Name: count, dtype: int64


In [13]:
#label mapping: convert the diverse boltuix labels into our 7-emotion schema
label_map_bolt = {
    "happiness": "joy",
    "sadness": "sadness",
    "neutral": "neutral",
    "anger": "anger",
    "love": None,
    "fear": "fear",
    "disgust": "disgust",
    "confusion": None,         #drop classes we don't want (mapped to None)
    "surprise": "surprise",
    "shame": None,
    "guilt": None,
    "sarcasm": None,
    "desire": None
}

#create new mapped label column
df_bolt_raw["label_mapped"] = df_bolt_raw["Label"].map(label_map_bolt)

#keep only rows that belong to our final 7 emotions
df_bolt = df_bolt_raw[df_bolt_raw["label_mapped"].notna()].copy()

#rename columns to a "standard" format
df_bolt["text"] = df_bolt["Sentence"]
df_bolt["label"] = df_bolt["label_mapped"]

#keep only relevant columns
df_bolt = df_bolt[["text", "label"]]

print(df_bolt["label"].value_counts())


label
joy         31205
sadness     17809
neutral     15733
anger       13341
fear         8795
disgust      8407
surprise     4560
Name: count, dtype: int64


In [14]:
#get label dictionaries directly from the model (mapping IDs <-> emotion names)

print(model.config.id2label) #shows something like {0:'anger', 1:'disgust', ...}
print(model.config.label2id) #reverse mapping: {'anger':0, 'disgust':1, ...}

label2id = model.config.label2id  #store the model's mapping (string → numeric ID)
id2label = model.config.id2label  #store the reverse model's mapping

#map our text emotion labels in the boltuix dataset into numeric IDs expected by the model
df_bolt["label_id"] = df_bolt["label"].map(label2id)

#we can check whether some labels werent mapped
print("Number of unmapped labels:", df_bolt["label_id"].isna().sum())
print(df_bolt[["label", "label_id"]].head())


{0: 'anger', 1: 'disgust', 2: 'fear', 3: 'joy', 4: 'neutral', 5: 'sadness', 6: 'surprise'}
{'anger': 0, 'disgust': 1, 'fear': 2, 'joy': 3, 'neutral': 4, 'sadness': 5, 'surprise': 6}
Number of unmapped labels: 0
     label  label_id
0      joy         3
1  neutral         4
2  neutral         4
3  sadness         5
4  sadness         5


In [15]:
#Train / validation split on boltuix dataset (7 emotions)

#use a sample of n=30000, otherwise "trainer_ft.train" takes too long later
df_bolt_train_all = df_bolt.sample(n=30000, random_state=42).copy()


#split into 80% train and 20% validation
df_train, df_val = train_test_split(
    df_bolt_train_all,
    test_size=0.2,
    stratify=df_bolt_train_all["label"],  #make sure each emotion is represented proportionally
    random_state=42,         #fix random seed for reproducibility
)

print("Train size:", len(df_train)) #print how many examples are in the training set
print("Val size:", len(df_val))      #print how many examples are in the validation set

print("\nLabel distribution in the training set:")
print(df_train["label"].value_counts())


Train size: 24000
Val size: 6000

Label distribution in the training set:
label
joy         7439
sadness     4341
neutral     3819
anger       3227
fear        2071
disgust     2049
surprise    1054
Name: count, dtype: int64


In [16]:
#Convert Pandas DataFrames into HuggingFace Dataset objects (train + val)
ds_train = Dataset.from_pandas(df_train[["text", "label_id"]]) #create HF Dataset for training
ds_val = Dataset.from_pandas(df_val[["text", "label_id"]])  #same for validation


#tokenizer function used to convert raw text into model input IDs
def tokenize_batch(batch):
    return tokenizer(batch["text"], truncation=True)  #truncate long text to model’s max length

#apply tokenization to the whole dataset (batched for speed)
ds_train_tok = ds_train.map(tokenize_batch, batched=True)   #tokenize full training split
ds_val_tok = ds_val.map(tokenize_batch, batched=True)  #same for validation

# Rename label column to "labels" → required by Trainer API
ds_train_tok = ds_train_tok.rename_column("label_id", "labels")  #Trainer expects column name "labels"
ds_val_tok = ds_val_tok.rename_column("label_id", "labels")   #same for validation dataset

#remove unused columns (text + Pandas index)
ds_train_tok = ds_train_tok.remove_columns(["text", "__index_level_0__"]) #keep only numeric tensors
ds_val_tok = ds_val_tok.remove_columns(["text", "__index_level_0__"])   #clean validation split

#set PyTorch tensor format so Trainer returns torch tensors
ds_train_tok.set_format("torch") #enable torch tensors for training
ds_val_tok.set_format("torch")  #same for validation

#datacollator pads sequences dynamically during training
data_collator = DataCollatorWithPadding(tokenizer=tokenizer) #handles dynamic padding inside batches


Map: 100%|██████████| 24000/24000 [00:00<00:00, 101281.31 examples/s]
Map: 100%|██████████| 6000/6000 [00:00<00:00, 100766.88 examples/s]
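
Because the tokenizer is called without `padding=True` here, sequences keep their own lengths and are padded per batch by the collator. The sketch below illustrates that idea in plain Python. It is not the library's implementation: `pad_batch` is a hypothetical helper, and the token IDs and the pad ID of 1 are assumptions for illustration.

```python
def pad_batch(batch, pad_id):
    """Pad every sequence in one batch to that batch's longest sequence."""
    width = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in batch]
    return input_ids, attention_mask

# Made-up token-id sequences of lengths 3, 5, 8, and 3.
sequences = [[0, 11, 2], [0, 12, 13, 14, 2], [0, 15, 16, 17, 18, 19, 20, 2], [0, 21, 2]]
PAD_ID = 1  # assumed pad token id, for illustration only

# Static padding: all sequences are padded to the global maximum length.
static_ids, _ = pad_batch(sequences, PAD_ID)
static_cells = sum(len(row) for row in static_ids)

# Dynamic padding: each batch of two is padded only to its own maximum length.
batches = [sequences[i:i + 2] for i in range(0, len(sequences), 2)]
dynamic_cells = 0
for batch in batches:
    ids, mask = pad_batch(batch, PAD_ID)
    dynamic_cells += sum(len(row) for row in ids)

print(static_cells, dynamic_cells)
```

Fewer padded positions mean less wasted computation per batch, which is the reason for padding dynamically during training.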


In [17]:
#Trainer

def compute_metrics(eval_pred):
    logits, labels = eval_pred            #unpack model outputs and true labels
    preds = logits.argmax(axis=-1)         #get predicted class IDs
    acc = accuracy_score(labels, preds)     #compute accuracy
    f1_macro = f1_score(labels, preds, average="macro")  #macro-F1 across all classes
    return {"accuracy": acc, "f1_macro": f1_macro}  #return metric dictionary


training_args = TrainingArguments(
    output_dir="./emotion-finetuned",          #where to save model outputs
    num_train_epochs=2,                        #number of epochs (same as lecture)
    per_device_train_batch_size=16,            #batch size during training
    per_device_eval_batch_size=32,             #batch size during evaluation
    weight_decay=0.01,                         #regularization value
)

trainer_ft = Trainer(
    model=model,                               #the pretrained j-hartmann model
    args=training_args,                        #training configuration
    train_dataset=ds_train_tok,                #tokenized training set
    eval_dataset=ds_val_tok,                   #tokenized validation set
    data_collator=data_collator,               #pads batches dynamically
    tokenizer=tokenizer,                       #tokenizer for decoding/logging
    compute_metrics=compute_metrics,           #evaluation metrics callback
)

trainer_ft.train()                #fine-tune the model


#show validation performance after training
val_metrics = trainer_ft.evaluate(ds_val_tok)
print("Validation metrics:", val_metrics)




Step    Training Loss
 500    0.8836
1000    0.8154
1500    0.7373
2000    0.5780
2500    0.5667
3000    0.5420




Validation metrics: {'eval_loss': 0.6998604536056519, 'eval_accuracy': 0.7515, 'eval_f1_macro': 0.6992774275556101, 'eval_runtime': 47.7922, 'eval_samples_per_second': 125.544, 'eval_steps_per_second': 3.934, 'epoch': 2.0}
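
The step counts in the training log follow from the dataset and batch sizes. Here is a short worked check, assuming a single device and no gradient accumulation (neither is stated explicitly in the notebook):

```python
import math

train_size, val_size = 24000, 6000          # from the train/validation split
train_batch, eval_batch = 16, 32            # per_device_*_batch_size values
epochs = 2

# Assumes a single device and no gradient accumulation.
steps_per_epoch = math.ceil(train_size / train_batch)
total_steps = steps_per_epoch * epochs
eval_steps = math.ceil(val_size / eval_batch)

eval_runtime = 47.7922                       # seconds, from the validation metrics
print(total_steps, eval_steps)
print(eval_steps / eval_runtime, val_size / eval_runtime)
```

The total of 3000 training steps matches the last logged step, and the derived evaluation throughput agrees with the reported `eval_steps_per_second` and `eval_samples_per_second` up to rounding.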


In [18]:
#re-evaluate dair-ai test set with the fine-tuned model
texts_dair = df_dair["text"].astype(str).tolist()                        #get list of raw texts from dair-ai dataframe
tokenized_dair = tokenizer(texts_dair, truncation=True, padding=True)    #tokenize texts using the same tokenizer as in training
pred_dataset_dair = SimpleDataset(tokenized_dair)                        #wrap tokenized inputs in our SimpleDataset

#run fine-tuned model on dair-ai data
predictions_dair_ft = trainer_ft.predict(pred_dataset_dair)              #use fine-tuned Trainer to get predictions
logits_dair_ft = predictions_dair_ft.predictions                         #extract raw logits from prediction output

#apply softmax to convert logits into probabilities
softmax_dair_ft = torch.nn.functional.softmax(
    torch.tensor(logits_dair_ft), dim=-1
).numpy()

#get predicted class ids and map them back to string labels
pred_ids_dair_ft = softmax_dair_ft.argmax(axis=1)                        #take index of highest probability for each example
pred_labels_dair_ft = [model.config.id2label[int(i)] for i in pred_ids_dair_ft]  #map indices to emotion labels

#store fine-tuned predictions in the dair-ai dataframe
df_dair["pred_finetuned"] = pred_labels_dair_ft                          #add new column with fine-tuned predictions


#compare baseline vs fine-tuned accuracy on dair-ai
acc_dair_base = accuracy_score(df_dair["label"], df_dair["baseline_pred"])      #accuracy before fine-tuning
acc_dair_ft = accuracy_score(df_dair["label"], df_dair["pred_finetuned"])       #accuracy after fine-tuning

print("Baseline Accuracy (dair-ai):", acc_dair_base)                     #print baseline accuracy
print("Finetuned Accuracy (dair-ai):", acc_dair_ft)                      #print fine-tuned accuracy

print("\nClassification report (finetuned, dair-ai):")
print(classification_report(df_dair["label"], df_dair["pred_finetuned"])) #show precision/recall/F1 per class




Baseline Accuracy (dair-ai): 0.839
Finetuned Accuracy (dair-ai): 0.8675

Classification report (finetuned, dair-ai):
              precision    recall  f1-score   support

       anger       0.92      0.91      0.91       275
     disgust       0.00      0.00      0.00         0
        fear       0.87      0.90      0.88       224
         joy       0.81      0.99      0.89       695
        love       0.00      0.00      0.00       159
     sadness       0.95      0.96      0.95       581
    surprise       0.76      0.67      0.71        66

    accuracy                           0.87      2000
   macro avg       0.61      0.63      0.62      2000
weighted avg       0.80      0.87      0.83      2000





# Results
Evaluating the pre-trained (baseline) model on the dair-ai test split yielded an accuracy of **83.9%**, with strong performance on the frequent classes **joy, fear, anger, and sadness**. The zero-score rows for **love, neutral, and disgust** reflect a label mismatch rather than a learning failure: the model has no "love" class, and dair-ai has no "neutral" or "disgust" labels. Because the macro average counts these rows, the macro F1 drops to ≈ 0.52.
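
The effect of the zero-score rows on the averages can be reproduced by hand. This sketch uses the rounded per-class F1 values and supports from the baseline classification report, so its results can differ from the reported averages in the last digit. `macro_f1` and `weighted_f1` are illustrative helpers.

```python
# Per-class (F1, support) copied from the baseline classification report.
baseline = {
    "anger":    (0.87, 275),
    "disgust":  (0.00, 0),
    "fear":     (0.86, 224),
    "joy":      (0.88, 695),
    "love":     (0.00, 159),
    "neutral":  (0.00, 0),
    "sadness":  (0.91, 581),
    "surprise": (0.68, 66),
}

def macro_f1(rows):
    """Unweighted mean of per-class F1."""
    return sum(f1 for f1, _ in rows.values()) / len(rows)

def weighted_f1(rows):
    """Support-weighted mean of per-class F1."""
    total = sum(n for _, n in rows.values())
    return sum(f1 * n for f1, n in rows.values()) / total

# Restrict to the five labels shared by the model and the test set.
shared = {k: v for k, v in baseline.items() if k not in {"disgust", "neutral", "love"}}

print(round(macro_f1(baseline), 3))     # all eight rows of the report
print(round(weighted_f1(baseline), 3))  # zero-support rows carry no weight
print(round(macro_f1(shared), 3))       # only the five shared labels
```

On the five shared labels alone, the macro F1 is about 0.84, which shows how strongly the three mismatched rows pull the reported macro average down.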

After fine-tuning the model on the boltuix dataset, reduced to the model's 7 emotions, performance on the dair-ai benchmark improved. The fine-tuned model reached an accuracy of **86.75%**, an improvement of **2.85 percentage points** over the baseline. F1 increased for all five shared classes (**anger, fear, joy, sadness, and surprise**), although precision dropped slightly for joy (0.83 → 0.81) and recall dropped for surprise (0.80 → 0.67). "Love" still scores zero because it was dropped from the training labels and is not part of the model's output space.

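Overall, fine-tuning led to **moderate but measurable gains** on this benchmark, suggesting that additional supervised training data can make a general emotion model more robust on unseen test samples. The result comes from a single training run and one test set, so it is indicative rather than conclusive.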
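
To get a rough sense of how much sampling noise a 2000-example test set carries, one can compute approximate 95% intervals for both accuracies. This is a sketch using the normal (Wald) approximation. It treats the two accuracies as independent, which is a simplifying assumption, since both models are scored on the same sentences. A paired test such as McNemar's would be more appropriate but requires the per-sample predictions.

```python
import math

def wald_interval(accuracy, n, z=1.96):
    """Approximate 95% normal (Wald) interval for an accuracy measured on n samples."""
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width

n_test = 2000
base_low, base_high = wald_interval(0.839, n_test)   # baseline accuracy
ft_low, ft_high = wald_interval(0.8675, n_test)      # fine-tuned accuracy

print(round(base_low, 4), round(base_high, 4))
print(round(ft_low, 4), round(ft_high, 4))
print("intervals overlap:", ft_low < base_high)
```

Under this approximation the two intervals overlap slightly, so the gain is best confirmed with a paired test or repeated runs.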

# Future work
While fine-tuning led to measurable improvements, several avenues remain to further strengthen model performance. First, aligning the label sets would remove the model's current blind spot: "love" could be kept as an additional class during fine-tuning, and evaluation could use a benchmark that also contains "neutral" and "disgust", which would make macro-level metrics more meaningful. A more balanced training set, or targeted data augmentation, could help reduce class imbalance effects. Second, experimenting with more advanced architectures (e.g., RoBERTa-large, DeBERTa-v3) or optimization strategies (longer training, learning-rate tuning, or layer-freezing) may yield additional gains. Finally, evaluating the model on more diverse, real-world text sources and applying techniques like calibration or error-based retraining would support building a more robust and generalizable emotion classifier.
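
One simple way to counter class imbalance is to weight the loss by inverse class frequency. The sketch below computes such weights from the training-set label counts reported earlier, using the common "balanced" heuristic. It is only an illustration and was not part of the experiments in this notebook.

```python
# Training-set label counts copied from the train/validation split output.
train_counts = {
    "joy": 7439, "sadness": 4341, "neutral": 3819, "anger": 3227,
    "fear": 2071, "disgust": 2049, "surprise": 1054,
}

n_samples = sum(train_counts.values())
n_classes = len(train_counts)

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c).
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in train_counts.items()
}

# Print from the most down-weighted to the most up-weighted class.
for label, weight in sorted(class_weights.items(), key=lambda item: item[1]):
    print(f"{label:<9}{weight:.3f}")
```

Such weights could, for example, be passed to the `weight` argument of `torch.nn.CrossEntropyLoss` inside a custom loss. Whether this helps on the dair-ai benchmark is untested here.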

## LICENSES

**Dataset 1 "Dair-ai"**:
* @inproceedings{saravia-etal-2018-carer,
    title = "{CARER}: Contextualized Affect Representations for Emotion Recognition",
    author = "Saravia, Elvis  and
      Liu, Hsien-Chi Toby  and
      Huang, Yen-Hao  and
      Wu, Junlin  and
      Chen, Yi-Shin",
    booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
    month = oct # "-" # nov,
    year = "2018",
    address = "Brussels, Belgium",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D18-1404",
    doi = "10.18653/v1/D18-1404",
    pages = "3687--3697",
    abstract = "Emotions are expressed in nuanced ways, which varies by collective or individual experiences, knowledge, and beliefs. Therefore, to understand emotion, as conveyed through text, a robust mechanism capable of capturing and modeling different linguistic nuances and phenomena is needed. We propose a semi-supervised, graph-based algorithm to produce rich structural descriptors which serve as the building blocks for constructing contextualized affect representations from text. The pattern-based representations are further enriched with word embeddings and evaluated through several emotion recognition tasks. Our experimental results demonstrate that the proposed method outperforms state-of-the-art techniques on emotion recognition tasks.",
}
* link: https://huggingface.co/datasets/dair-ai/emotion

**Dataset 2 "boltuix"**:
* The boltuix/emotions-dataset is released under the MIT License, which permits free use, modification, and redistribution, provided the license notice is retained.
* link: https://huggingface.co/datasets/boltuix/emotions-dataset


**Model**:
* Jochen Hartmann, "Emotion English DistilRoBERTa-base", 2022.
* link: https://huggingface.co/j-hartmann/emotion-english-distilroberta-base
