# 1. Environment & Data Loading

This notebook reproduces the EMGSD baseline (ALBERT) from the HEARTS repository.
We first check the environment and load the EMGSD dataset from Hugging Face.


In [1]:
import sys
import torch
import transformers
from datasets import load_dataset

print("Python version:", sys.version)
print("PyTorch version:", torch.__version__)
print("Transformers version:", transformers.__version__)

# Load EMGSD from Hugging Face
ds = load_dataset("holistic-ai/EMGSD")
print(ds)

# Inspect one training example
example = ds["train"][0]
for k, v in example.items():
    print(f"{k}: {v}")




ModuleNotFoundError: No module named 'transformers'
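
The `ModuleNotFoundError` above means the environment check stopped before any version was printed. A minimal sketch of a pre-flight check, assuming the three packages imported in the first cell (`torch`, `transformers`, `datasets`) are the only requirements; the helper name `missing_packages` is hypothetical and not part of the HEARTS repository:

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumption: these are the packages the first cell needs.
required = ["torch", "transformers", "datasets"]
missing = missing_packages(required)

if missing:
    print("Missing packages:", ", ".join(missing))
    print("Install them first, e.g.: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```

Running this before the import cell reports every missing package in one pass instead of failing on the first one.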

# 2. Model & Training Configuration

We specify the model backbone (ALBERT), the number of labels, and the key training hyperparameters.
These values are chosen to be consistent with the original HEARTS repository.


In [5]:
from collections import Counter


# Count how many distinct labels we have in EMGSD
label_list = sorted(set(ds["train"]["label"]))
print("Number of unique labels:", len(label_list))
print("Label examples:", label_list[:10])

# Map string labels -> integer ids for training
label2id = {lab: i for i, lab in enumerate(label_list)}
id2label = {i: lab for lab, i in label2id.items()}
print("Example mapping:", list(label2id.items())[:5])

MODEL_NAME = "albert-base-v2"
NUM_LABELS = len(label_list)
BATCH_SIZE = 16
LEARNING_RATE = 2e-5
NUM_EPOCHS = 3

print("MODEL_NAME:", MODEL_NAME)
print("NUM_LABELS:", NUM_LABELS)
print("BATCH_SIZE:", BATCH_SIZE)
print("LEARNING_RATE:", LEARNING_RATE)
print("NUM_EPOCHS:", NUM_EPOCHS)


Number of unique labels: 13
Label examples: ['neutral_gender', 'neutral_lgbtq+', 'neutral_nationality', 'neutral_profession', 'neutral_race', 'neutral_religion', 'stereotype_gender', 'stereotype_lgbtq+', 'stereotype_nationality', 'stereotype_profession']
Example mapping: [('neutral_gender', 0), ('neutral_lgbtq+', 1), ('neutral_nationality', 2), ('neutral_profession', 3), ('neutral_race', 4)]
MODEL_NAME: albert-base-v2
NUM_LABELS: 13
BATCH_SIZE: 16
LEARNING_RATE: 2e-05
NUM_EPOCHS: 3
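
The `label2id` / `id2label` mappings above can be applied to the string labels before training. The sketch below shows one way to do that on a toy list, assuming labels are plain strings and that an unseen label should raise an error. The helpers `build_label_maps` and `encode_labels` and the toy data are hypothetical and are not taken from the HEARTS code:

```python
def build_label_maps(labels):
    """Return (label2id, id2label), assigning ids in sorted label order."""
    label_list = sorted(set(labels))
    label2id = {lab: i for i, lab in enumerate(label_list)}
    id2label = {i: lab for lab, i in label2id.items()}
    return label2id, id2label


def encode_labels(labels, label2id):
    """Map string labels to integer ids, failing loudly on unknown labels."""
    try:
        return [label2id[lab] for lab in labels]
    except KeyError as err:
        raise ValueError(f"Unknown label: {err.args[0]!r}") from err


# Toy data for illustration only (not real EMGSD rows).
toy_labels = ["stereotype_gender", "neutral_gender", "stereotype_gender"]
label2id, id2label = build_label_maps(toy_labels)

encoded = encode_labels(toy_labels, label2id)
decoded = [id2label[i] for i in encoded]

print(encoded)  # [1, 0, 1]
print(decoded == toy_labels)  # True
```

Because ids follow sorted order, the mapping is deterministic across runs as long as the set of labels does not change.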


# 3. Training Script Call

We call a separate training script (`cw2/src/train_emgsd_albert.py`) so that
the whole baseline can be reproduced from the command line and from this notebook.


In [None]:
!python ../src/train_emgsd_albert.py \
    --model_name albert-base-v2 \
    --output_dir ../results/emgsd_baseline \
    --batch_size 16 \
    --learning_rate 2e-5 \
    --num_epochs 3
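
The flags in the command above imply a small command-line interface on the script side. The following is a sketch of how such a script might parse them with `argparse`. It is an assumption about the interface, not the actual contents of `train_emgsd_albert.py`; the defaults simply mirror the values used in this notebook:

```python
import argparse


def build_parser():
    """Build a parser for the five flags used in the notebook command."""
    parser = argparse.ArgumentParser(description="Fine-tune ALBERT on EMGSD (sketch).")
    parser.add_argument("--model_name", default="albert-base-v2")
    parser.add_argument("--output_dir", required=True)
    parser.add_argument("--batch_size", type=int, default=16)
    parser.add_argument("--learning_rate", type=float, default=2e-5)
    parser.add_argument("--num_epochs", type=int, default=3)
    return parser


# Parse the same arguments the notebook passes on the command line.
args = build_parser().parse_args([
    "--model_name", "albert-base-v2",
    "--output_dir", "../results/emgsd_baseline",
    "--batch_size", "16",
    "--learning_rate", "2e-5",
    "--num_epochs", "3",
])
print(vars(args))
```

Declaring `type=int` and `type=float` means a mistyped value fails at startup rather than partway through training.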


# 4. Evaluation & Comparison

After training, we load the saved metrics and compare our EMGSD performance
against the values reported in the HEARTS paper (±5% tolerance on macro-F1).


In [None]:
import json
from pathlib import Path

metrics_path = Path("../results/emgsd_baseline/metrics.json")

if metrics_path.exists():
    with open(metrics_path) as f:
        metrics = json.load(f)
    print("Loaded metrics:", metrics)
else:
    print("metrics.json not found yet. Run the training cell above first.")
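
The ±5% comparison described above can be made explicit once the metrics are loaded. The sketch below assumes the tolerance is an absolute difference of 0.05 on a 0-1 macro-F1 scale (a relative tolerance would divide by the reported value instead) and that the metrics file stores the score under a key named `macro_f1`. Both are assumptions, and the numbers are placeholders rather than results from the paper or from this notebook:

```python
def within_tolerance(ours, reported, tol=0.05):
    """Return True if our score is within an absolute tolerance of the reported one."""
    return abs(ours - reported) <= tol


def check_metrics(metrics, reported, key="macro_f1", tol=0.05):
    """Compare metrics[key] with the reported value; return None if the key is absent."""
    if key not in metrics:
        return None
    return within_tolerance(metrics[key], reported, tol)


# Placeholder values for illustration only; replace them with the real
# metrics.json contents and the value reported in the HEARTS paper.
example_metrics = {"macro_f1": 0.80}
example_reported = 0.82

print(check_metrics(example_metrics, example_reported))  # True
```

Returning `None` for a missing key keeps "metric not found" distinct from "metric outside tolerance".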
