# Emotion Dataset

This notebook performs a lightweight exploration of the `dair-ai/emotion` dataset. Its purpose is to understand the data schema, labels, and text characteristics before training.

This notebook intentionally excludes any model training or evaluation.

## Import Libraries

In [28]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [29]:
%pip install -q datasets transformers pandas

from datasets import load_dataset
from transformers import AutoTokenizer

Note: you may need to restart the kernel to use updated packages.


## Load & Inspect Dataset

In [30]:
# Load the emotion dataset
dataset = load_dataset("dair-ai/emotion")
print(dataset)

# Display first 5 samples from the training set
for i in range(5):
    print(f"{i+1}.", dataset["train"][i])

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})
1. {'text': 'i didnt feel humiliated', 'label': 0}
2. {'text': 'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake', 'label': 0}
3. {'text': 'im grabbing a minute to post i feel greedy wrong', 'label': 3}
4. {'text': 'i am ever feeling nostalgic about the fireplace i will know that it is still on the property', 'label': 2}
5. {'text': 'i am feeling grouchy', 'label': 3}


## Label Names & Distribution

In [31]:
label_names = dataset["train"].features["label"].names
print("Label names:", label_names)

df = dataset["train"].to_pandas()

label_counts = df["label"].value_counts().sort_index()
label_counts.index = [label_names[i] for i in label_counts.index]

label_counts

Label names: ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']


sadness     4666
joy         5362
love        1304
anger       2159
fear        1937
surprise     572
Name: count, dtype: int64

In [32]:
# Calculate and display the proportion of each label in the training set
label_proportions = df["label"].value_counts(normalize=True)

label_proportions.index = [label_names[i] for i in label_proportions.index]
label_proportions.sort_values(ascending=False)

joy         0.335125
sadness     0.291625
anger       0.134937
fear        0.121063
love        0.081500
surprise    0.035750
Name: proportion, dtype: float64

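The training split is imbalanced, which is common for emotion datasets: joy and sadness together account for about 63% of samples, while surprise makes up under 4% (roughly a 9:1 ratio between the largest and smallest classes).
Because accuracy is dominated by the frequent classes, metrics such as macro-F1 are more informative than accuracy alone for model evaluation.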
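
To make the accuracy versus macro-F1 point concrete, here is a minimal sketch in plain Python. It uses the training-set label counts printed above and a hypothetical baseline that always predicts `joy`. The helper `macro_f1` and the names `train_counts`, `y_true`, and `y_pred` are illustrative and are not part of the notebook or the training code.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1; a class with no TP, FP, or FN scores 0."""
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(labels)


# Training-set label counts as reported in the output above.
train_counts = {
    "sadness": 4666,
    "joy": 5362,
    "love": 1304,
    "anger": 2159,
    "fear": 1937,
    "surprise": 572,
}

# Hypothetical baseline (an assumption for illustration): always predict "joy".
y_true = [label for label, n in train_counts.items() for _ in range(n)]
y_pred = ["joy"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = macro_f1(y_true, y_pred, list(train_counts))

print(f"accuracy: {accuracy:.3f}")
print(f"macro-F1: {macro:.3f}")
```

The baseline reaches about a third accuracy simply by predicting the majority class, while its macro-F1 is far lower because five of the six classes score zero. In practice, scikit-learn's `f1_score(y_true, y_pred, average="macro")` computes the same quantity.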

## Token Length

In [33]:
# Calculate token lengths using the DistilBERT tokenizer
# (tokenize() does not add the [CLS] and [SEP] special tokens)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased", use_fast=True)

df["token_length"] = df["text"].apply(
    lambda x: len(tokenizer.tokenize(x))
)

df["token_length"].describe()

count    16000.000000
mean        20.259500
std         11.575801
min          2.000000
25%         11.000000
50%         18.000000
75%         27.000000
max         85.000000
Name: token_length, dtype: float64

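Most samples are short:
- 75% of samples have 27 tokens or fewer
- The longest sample has 85 tokens

These counts exclude the `[CLS]` and `[SEP]` special tokens, so the longest encoded sequence is 87 tokens. A `max_length` of 128 therefore covers every training sample without truncation, while 64 truncates only samples in the long tail of the distribution and reduces padding further.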
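
One way to choose between candidate `max_length` values is to measure how many samples each would truncate. The sketch below is a hedged illustration: `truncation_rate` is a hypothetical helper, `example_lengths` is made-up data rather than the real distribution, and the default of two special tokens assumes a BERT-style tokenizer that adds `[CLS]` and `[SEP]` to single sentences.

```python
def truncation_rate(token_lengths, max_length, num_special_tokens=2):
    """Fraction of samples whose tokens do not fit into max_length.

    token_lengths: per-sample token counts WITHOUT special tokens,
                   e.g. the values of df["token_length"].
    num_special_tokens: tokens the tokenizer adds per sequence
                        (assumed to be 2: [CLS] and [SEP]).
    """
    lengths = list(token_lengths)
    budget = max_length - num_special_tokens  # room left for real tokens
    truncated = sum(1 for n in lengths if n > budget)
    return truncated / len(lengths)


# Made-up lengths for illustration only; not the real dataset.
example_lengths = [2, 11, 18, 27, 63, 85]

for max_length in (64, 128):
    rate = truncation_rate(example_lengths, max_length)
    print(f"max_length={max_length}: {rate:.1%} of samples truncated")
```

In this notebook, the same helper could be called as `truncation_rate(df["token_length"], 64)` to see how much of the training split a `max_length` of 64 would actually cut off.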

## Insights

- Use a transformer-based sentence classifier due to short, context-dependent text.
- Limit maximum sequence length to reduce padding and improve efficiency.
- Account for class imbalance during evaluation using macro-averaged metrics.

These insights were implemented in `ml/training/train.py`.
