# Emotion Dataset

This notebook is a lightweight exploration of the `Lemotif` emotion dataset. Its purpose is to understand the data schema, emotion labels, and text characteristics of the `Answer` field before training.

Model training and evaluation are intentionally excluded from this notebook.

## Import Libraries

In [8]:
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [9]:
%pip install -q datasets transformers pandas

from datasets import load_dataset
from transformers import AutoTokenizer

Note: you may need to restart the kernel to use updated packages.


## Load & Inspect Dataset

In [10]:
# Load the emotion dataset
url = (
    "https://raw.githubusercontent.com/xaliceli/lemotif/"
    "refs/heads/master/assets/data/lemotif-data-cleaned-flat.csv"
)

dataset = load_dataset("csv", data_files=url, split="train")
print(dataset)

# Display first 5 samples from the training set
for i in range(5):
    print(f"{i + 1}.", dataset[i])

Dataset({
    features: ['Answer', 'Answer.f1.afraid.raw', 'Answer.f1.angry.raw', 'Answer.f1.anxious.raw', 'Answer.f1.ashamed.raw', 'Answer.f1.awkward.raw', 'Answer.f1.bored.raw', 'Answer.f1.calm.raw', 'Answer.f1.confused.raw', 'Answer.f1.disgusted.raw', 'Answer.f1.excited.raw', 'Answer.f1.frustrated.raw', 'Answer.f1.happy.raw', 'Answer.f1.jealous.raw', 'Answer.f1.nostalgic.raw', 'Answer.f1.proud.raw', 'Answer.f1.sad.raw', 'Answer.f1.satisfied.raw', 'Answer.f1.surprised.raw', 'Answer.t1.exercise.raw', 'Answer.t1.family.raw', 'Answer.t1.food.raw', 'Answer.t1.friends.raw', 'Answer.t1.god.raw', 'Answer.t1.health.raw', 'Answer.t1.love.raw', 'Answer.t1.recreation.raw', 'Answer.t1.school.raw', 'Answer.t1.sleep.raw', 'Answer.t1.work.raw'],
    num_rows: 1473
})
1. {'Answer': 'My family was the most salient part of my day, since most days the care of my 2 children occupies the majority of my time. They are 2 years old and 7 months and I love them, but they also require so much attention that m

## Label Names & Distribution

In [11]:
df = dataset.to_pandas()

# Extract emotion column names (those starting with 'Answer.f1.')
emotion_cols = [col for col in df.columns if col.startswith('Answer.f1.')]
label_names = [col.replace('Answer.f1.', '').replace('.raw', '') for col in sorted(emotion_cols)]
print("Label names:", label_names)
print(f"Number of emotion labels: {len(label_names)}")

# Count label occurrences
label_counts = df[emotion_cols].astype(int).sum().sort_values(ascending=False)
label_counts

Label names: ['afraid', 'angry', 'anxious', 'ashamed', 'awkward', 'bored', 'calm', 'confused', 'disgusted', 'excited', 'frustrated', 'happy', 'jealous', 'nostalgic', 'proud', 'sad', 'satisfied', 'surprised']
Number of emotion labels: 18


Answer.f1.happy.raw         730
Answer.f1.satisfied.raw     591
Answer.f1.calm.raw          368
Answer.f1.proud.raw         337
Answer.f1.excited.raw       251
Answer.f1.frustrated.raw    141
Answer.f1.anxious.raw       125
Answer.f1.surprised.raw      64
Answer.f1.nostalgic.raw      61
Answer.f1.bored.raw          49
Answer.f1.sad.raw            43
Answer.f1.angry.raw          28
Answer.f1.confused.raw       28
Answer.f1.disgusted.raw      22
Answer.f1.afraid.raw         18
Answer.f1.ashamed.raw        17
Answer.f1.awkward.raw        15
Answer.f1.jealous.raw         3
dtype: int64

In [12]:
# Fraction of samples in which each emotion label is present

emotion_cols = [col for col in df.columns if col.startswith('Answer.f1.')]
label_proportions = df[emotion_cols].astype(int).sum() / len(df)
label_proportions = label_proportions.sort_values(ascending=False)
label_proportions

Answer.f1.happy.raw         0.495587
Answer.f1.satisfied.raw     0.401222
Answer.f1.calm.raw          0.249830
Answer.f1.proud.raw         0.228785
Answer.f1.excited.raw       0.170401
Answer.f1.frustrated.raw    0.095723
Answer.f1.anxious.raw       0.084861
Answer.f1.surprised.raw     0.043449
Answer.f1.nostalgic.raw     0.041412
Answer.f1.bored.raw         0.033265
Answer.f1.sad.raw           0.029192
Answer.f1.angry.raw         0.019009
Answer.f1.confused.raw      0.019009
Answer.f1.disgusted.raw     0.014936
Answer.f1.afraid.raw        0.012220
Answer.f1.ashamed.raw       0.011541
Answer.f1.awkward.raw       0.010183
Answer.f1.jealous.raw       0.002037
dtype: float64

The dataset contains 18 emotion labels in a multi-label format:

- Text column `Answer` contains narrative descriptions of personal experiences
- Emotion labels `Answer.f1.*.raw` are binary columns indicating the presence of each emotion
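
To make the multi-label format concrete, here is a minimal sketch of turning one row into a multi-hot vector. The helper names `emotion_columns` and `to_multi_hot` are hypothetical, the example row is made up, and the sketch assumes the `.raw` values can be cast with `int()` (as the `astype(int)` calls above suggest).

```python
# Hypothetical helpers, not part of the notebook or the Lemotif repository.
EMOTION_PREFIX = "Answer.f1."
EMOTION_SUFFIX = ".raw"


def emotion_columns(row):
    """Return the emotion column names found in a row, sorted alphabetically."""
    return sorted(
        col for col in row
        if col.startswith(EMOTION_PREFIX) and col.endswith(EMOTION_SUFFIX)
    )


def to_multi_hot(row):
    """Map a row (dict) to a list of 0/1 ints, one entry per emotion column."""
    return [int(row[col]) for col in emotion_columns(row)]


# Toy row with three of the 18 emotion columns; the values are invented.
example_row = {
    "Answer": "Spent the evening cooking with my family.",
    "Answer.f1.happy.raw": 1,
    "Answer.f1.calm.raw": 1,
    "Answer.f1.angry.raw": 0,
    "Answer.t1.family.raw": 1,  # topic column, ignored by the helpers
}

print(emotion_columns(example_row))
print(to_multi_hot(example_row))
```

Sorting the column names keeps the vector order aligned with `label_names`, which is also built from the sorted emotion columns.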

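For model evaluation, use multi-label metrics such as macro-averaged ROC AUC (Macro-AUC), which averages the per-label AUC so that every emotion counts equally regardless of how often it occurs.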
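
As an illustration of what a macro-averaged AUC computes, the sketch below implements it in plain Python on made-up scores. The function names and toy data are assumptions for illustration only; in practice a library routine such as scikit-learn's `roc_auc_score` with `average="macro"` would normally be used. The sketch skips labels that have no positives or no negatives in the evaluated split, which can plausibly happen here for rare labels such as "jealous" (3 positives in total).

```python
def binary_auc(y_true, y_score):
    """ROC AUC for one label via pairwise comparison.

    Returns None when the label has no positives or no negatives,
    because the AUC is undefined in that case.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        return None
    # A positive ranked above a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(Y_true, Y_score):
    """Unweighted mean of the per-label AUCs, skipping undefined labels."""
    n_labels = len(Y_true[0])
    aucs = []
    for j in range(n_labels):
        auc = binary_auc([row[j] for row in Y_true], [row[j] for row in Y_score])
        if auc is not None:
            aucs.append(auc)
    return sum(aucs) / len(aucs)


# Toy data: 4 samples, 2 labels (values are invented).
Y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y_score = [[0.9, 0.2], [0.3, 0.4], [0.8, 0.7], [0.1, 0.6]]

print(macro_auc(Y_true, Y_score))  # label 0 scores 1.0, label 1 scores 0.75
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch but not how a library would implement it.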

## Token Length (using DistilBERT tokenizer as a baseline)

In [15]:
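# Count DistilBERT tokens per answer.
# Note: tokenize() does not add the [CLS] and [SEP] special tokens,
# so the encoded model input is 2 tokens longer than these counts.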
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased", use_fast=True)

df["token_length"] = df["Answer"].apply(lambda x: len(tokenizer.tokenize(x)))

df["token_length"].describe()

count    1473.000000
mean       38.467753
std        24.963625
min         4.000000
25%        21.000000
50%        33.000000
75%        49.000000
max       235.000000
Name: token_length, dtype: float64

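The median length is 33 tokens and the 75th percentile is 49, so a `max_length` of 64 would cover more than three quarters of the samples and 128 would cover more still, leaving only the long tail (up to 235 tokens) truncated. A `max_length` of 32 would truncate at least half of the entries. These counts exclude the two special tokens, `[CLS]` and `[SEP]`, that the tokenizer adds when encoding.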
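
One way to check a candidate `max_length` is to measure the fraction of samples that fit within it. The sketch below is a hypothetical helper (`coverage` is not part of the notebook) and is run on a toy list built from the five summary values above, not on the real lengths. It assumes two special tokens per sequence.

```python
def coverage(lengths, max_length, n_special_tokens=2):
    """Fraction of samples that fit in max_length once special tokens are added."""
    budget = max_length - n_special_tokens
    return sum(length <= budget for length in lengths) / len(lengths)


# Toy input: min, quartiles, and max from the describe() output above.
toy_lengths = [4, 21, 33, 49, 235]

for max_length in (32, 64, 128):
    print(max_length, coverage(toy_lengths, max_length))
```

On the real data, the same function could be called as `coverage(df["token_length"], 64)`.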

## Dataset Imbalance and Implications

The dataset is heavily imbalanced: "happy" appears in 730 of 1,473 entries (about 50%) and "satisfied" in 591 (about 40%), while "awkward" appears in 15 (about 1%) and "jealous" in only 3 (about 0.2%). This imbalance is likely to hurt model performance on the minority classes, because a model trained with an unweighted loss tends to favor the majority classes.

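To address this, it may be necessary to:

- Augment the dataset by generating more samples for underrepresented emotions.
- Stratify train/test splits so that every emotion is represented in each split. Because an entry can carry several labels, single-label stratification does not apply directly; a multi-label method such as iterative stratification is needed.
- Apply class weighting in the loss, or oversample entries with rare emotions.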
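
As a sketch of the class-weighting option, the snippet below derives a per-label positive weight as the ratio of negatives to positives, using a few of the counts reported above. The helper `pos_weights` and its `cap` parameter are hypothetical; capping is an assumption added here to keep extremely rare labels such as "jealous" from dominating the loss, not something the source prescribes. Weights of this form are the kind of values passed to, for example, the `pos_weight` argument of PyTorch's `BCEWithLogitsLoss`.

```python
n_rows = 1473

# Subset of the label counts reported earlier in this notebook.
label_counts = {
    "happy": 730,
    "satisfied": 591,
    "frustrated": 141,
    "sad": 43,
    "jealous": 3,
}


def pos_weights(counts, n_rows, cap=None):
    """Per-label weight = negatives / positives, optionally capped (hypothetical helper)."""
    weights = {}
    for label, n_pos in counts.items():
        weight = (n_rows - n_pos) / n_pos
        weights[label] = min(weight, cap) if cap is not None else weight
    return weights


for label, weight in pos_weights(label_counts, n_rows, cap=50.0).items():
    print(f"{label:<12}{weight:.2f}")
```

Without a cap, "jealous" would receive a weight of 490, which shows how little signal three positive examples provide.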

Addressing the imbalance can help the model perform more evenly across all emotions, which makes its predictions more useful in real-world scenarios.