# Emotion Dataset

This notebook performs a lightweight exploration of the `Empathetic Dialogues` dataset. Its purpose is to understand the data schema, emotion labels, and text characteristics of the `situation` field before training.

It intentionally excludes any model training or evaluation.

## Import Libraries

In [30]:
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [31]:
%pip install -q datasets transformers pandas

from datasets import load_dataset
from transformers import AutoTokenizer

Note: you may need to restart the kernel to use updated packages.


## Load & Inspect Dataset

In [32]:
# Load the emotion dataset
dataset = load_dataset("bdotloh/empathetic-dialogues-contexts")
print(dataset)

# Display first 5 samples from the training set
for i in range(5):
    print(f"{i + 1}.", dataset["train"][i])

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'situation', 'emotion'],
        num_rows: 19209
    })
    validation: Dataset({
        features: ['Unnamed: 0', 'situation', 'emotion'],
        num_rows: 2756
    })
    test: Dataset({
        features: ['Unnamed: 0', 'situation', 'emotion'],
        num_rows: 2542
    })
})
1. {'Unnamed: 0': 0, 'situation': 'I remember going to the fireworks with my best friend. There was a lot of people, but it only felt like us in the world.', 'emotion': 'sentimental'}
2. {'Unnamed: 0': 1, 'situation': ' i used to scare for darkness', 'emotion': 'afraid'}
3. {'Unnamed: 0': 2, 'situation': 'I showed a guy how to run a good bead in welding class and he caught on quick.', 'emotion': 'proud'}
4. {'Unnamed: 0': 3, 'situation': 'I have always been loyal to my wife.', 'emotion': 'faithful'}
5. {'Unnamed: 0': 4, 'situation': "A recent job interview that I had made me feel very anxious because I felt like I didn't come prepared.", 'emot

## Label Names & Distribution

In [33]:
df = dataset["train"].to_pandas()

label_names = sorted(df["emotion"].unique())
print("Label names:", label_names)

label_counts = df["emotion"].value_counts().sort_index()
label_counts

Label names: ['afraid', 'angry', 'annoyed', 'anticipating', 'anxious', 'apprehensive', 'ashamed', 'caring', 'confident', 'content', 'devastated', 'disappointed', 'disgusted', 'embarrassed', 'excited', 'faithful', 'furious', 'grateful', 'guilty', 'hopeful', 'impressed', 'jealous', 'joyful', 'lonely', 'nostalgic', 'prepared', 'proud', 'sad', 'sentimental', 'surprised', 'terrified', 'trusting']


emotion
afraid          619
angry           677
annoyed         657
anticipating    597
anxious         611
apprehensive    460
ashamed         485
caring          504
confident       610
content         568
devastated      553
disappointed    593
disgusted       611
embarrassed     560
excited         731
faithful        372
furious         593
grateful        637
guilty          611
hopeful         615
impressed       612
jealous         579
joyful          596
lonely          633
nostalgic       595
prepared        584
proud           666
sad             660
sentimental     512
surprised       997
terrified       614
trusting        497
Name: count, dtype: int64

In [34]:
# Calculate and display the proportion of each label in the training set
# (reuses `df` from the previous cell)
label_proportions = df["emotion"].value_counts(normalize=True).sort_values(ascending=False)
label_proportions

emotion
surprised       0.051903
excited         0.038055
angry           0.035244
proud           0.034671
sad             0.034359
annoyed         0.034203
grateful        0.033162
lonely          0.032953
afraid          0.032224
hopeful         0.032016
terrified       0.031964
impressed       0.031860
anxious         0.031808
guilty          0.031808
disgusted       0.031808
confident       0.031756
anticipating    0.031079
joyful          0.031027
nostalgic       0.030975
furious         0.030871
disappointed    0.030871
prepared        0.030402
jealous         0.030142
content         0.029569
embarrassed     0.029153
devastated      0.028789
sentimental     0.026654
caring          0.026238
trusting        0.025873
ashamed         0.025249
apprehensive    0.023947
faithful        0.019366
Name: proportion, dtype: float64

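The label set contains 32 unique emotions and is moderately imbalanced: the most common class is about 2.7 times as frequent as the least common.

- Most common: surprised (997 samples, ~5.19%)
- Least common: faithful (372 samples, ~1.94%)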
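One way to put a number on the imbalance is the ratio between the largest and smallest class, compared against the share each class would have under a perfectly uniform split. The sketch below hardcodes the two extreme counts from the `value_counts()` output above; the helper name `imbalance_ratio` is hypothetical and not part of the notebook.

```python
def imbalance_ratio(counts: dict[str, int]) -> float:
    """Return the largest class count divided by the smallest class count."""
    return max(counts.values()) / min(counts.values())


# Extreme classes copied from the training-set counts shown above.
extremes = {"surprised": 997, "faithful": 372}
ratio = imbalance_ratio(extremes)

# Share each class would have if all 32 labels were perfectly balanced.
uniform_share = 1 / 32

print(f"imbalance ratio (max / min): {ratio:.2f}")
print(f"uniform share per class:     {uniform_share:.4f}")
```

In the notebook itself, `label_counts.to_dict()` could be passed in place of the hardcoded dictionary.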

For model evaluation, macro-averaged metrics such as macro-F1 are more informative than accuracy alone, because they weight each of the 32 classes equally regardless of how often it occurs.
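A minimal sketch of why accuracy can mislead on imbalanced labels, using made-up toy predictions rather than real model output. The helper names (`per_class_f1`, `macro_f1`, `accuracy`) are hypothetical and written in plain Python for transparency.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 score for one label, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, label) for label in labels) / len(labels)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (made up): a classifier that always predicts the majority class.
y_true = ["surprised"] * 9 + ["faithful"]
y_pred = ["surprised"] * 10

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy: {acc:.3f}")
print(f"macro-F1: {mf1:.3f}")
```

Accuracy rewards the majority-class guess, while macro-F1 is pulled down by the class that is never predicted. In practice, scikit-learn's `f1_score(y_true, y_pred, average="macro")` computes the same quantity.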

## Token Length

In [35]:
# Calculate token lengths using DistilBERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased", use_fast=True)

# tokenize() returns word pieces only; an encoded input adds [CLS] and [SEP] (+2 tokens)
df["token_length"] = df["situation"].apply(lambda x: len(tokenizer.tokenize(x)))

df["token_length"].describe()

count    19209.000000
mean        20.803165
std         11.782897
min          1.000000
25%         13.000000
50%         18.000000
75%         26.000000
max        124.000000
Name: token_length, dtype: float64

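**Token lengths (using DistilBERT tokenizer as a baseline, excluding special tokens):**
- 25% of samples have at most 13 tokens
- 50% have at most 18 tokens (mean ~20.8)
- 75% have at most 26 tokens
- Max: 124 tokens

A `max_length` of 32 covers at least 75% of samples while limiting padding for typical inputs; longer samples would need to be truncated. Covering the longest sample (124 tokens, 126 with `[CLS]` and `[SEP]`) requires a `max_length` of about 128.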
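The trade-off between coverage and padding can be estimated directly from the token lengths. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, two special tokens (`[CLS]` and `[SEP]`) are assumed per input, every sample is assumed to be padded or truncated to a fixed `max_length`, and the five summary values from `describe()` above stand in for real samples.

```python
def coverage(token_lengths, max_length, num_special_tokens=2):
    """Fraction of samples that fit in max_length without truncation."""
    budget = max_length - num_special_tokens
    return sum(n <= budget for n in token_lengths) / len(token_lengths)


def padding_fraction(token_lengths, max_length, num_special_tokens=2):
    """Fraction of positions that are padding when every sample is
    padded (or truncated) to exactly max_length."""
    used = sum(min(n + num_special_tokens, max_length) for n in token_lengths)
    return 1 - used / (max_length * len(token_lengths))


# Stand-in samples: min, quartiles, and max from the describe() output.
summary_lengths = [1, 13, 18, 26, 124]

for max_length in (32, 128):
    cov = coverage(summary_lengths, max_length)
    pad = padding_fraction(summary_lengths, max_length)
    print(f"max_length={max_length}: coverage={cov:.2f}, padding={pad:.2f}")
```

In the notebook, `df["token_length"].tolist()` could be passed in place of `summary_lengths` to get figures for the full training set.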

## Insights

- Use a transformer-based sentence classifier due to short, context-dependent text.
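- Limit maximum sequence length to ~128, which covers the observed maximum (124 tokens, 126 with special tokens) without truncation. Most samples are far shorter (median 18 tokens), so padding each batch to its longest sample rather than to a fixed 128 would keep padding low.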
- Account for class imbalance during evaluation using macro-averaged metrics.

These insights were implemented in `ml/training/train.py`.