# COMP0173: Coursework 2

The paper HEARTS: A Holistic Framework for Explainable, Sustainable, and Robust Text Stereotype Detection by Theo King, Zekun Wu et al. (2024) presents a comprehensive approach to analysing and detecting stereotypes in text [1]. The authors introduce the HEARTS framework, which integrates model explainability, carbon-efficient training, and accurate evaluation across multiple bias-sensitive datasets. By using transformer-based models such as ALBERT-V2, BERT, and DistilBERT, this research project demonstrates that stereotype detection performance varies significantly across dataset sources, underlining the need for diverse evaluation benchmarks. The paper provides publicly available datasets and code [2], allowing full reproducibility and offering a standardised methodology for future research on bias and stereotype detection in Natural Language Processing (NLP).

While the HEARTS framework evaluates stereotype detection in English, this project adapts the methodology to the Russian context. Russian stereotypes often rely on grammatical gender, morphology, and culture-specific tropes. Although Russian is not classified as a low-resource language and many high-performing NLP models are available, there is currently no publicly accessible model specifically designed to detect stereotypes in the Russian language. Existing toxicity and sentiment models identify stereotypical and biased sentences only when they contain specific patterns, such as insults, slurs, or identity-specific hate speech [8].

To address this gap, I introduce two fine-tuned classifiers, `AI-Forever-RuBert` [10] and `XLM-RoBERTa` [11], trained on the `RBSA` and `RBS` datasets, respectively. Understanding stereotypical patterns is essential for applications such as content moderation, ensuring the safety of Russian-language LLMs, and monitoring harmful narratives across demographic groups and underrepresented societies. Adapting the HEARTS framework to this new sociolinguistic context illustrates its transferability beyond English and enables a more culturally grounded approach to bias detection, thereby promoting SDG 5: Gender Equality, SDG 10: Reduced Inequalities, and SDG 16: Peace, Justice, and Strong Institutions [5].

# Instructions

All figures produced in this notebook are stored in the project’s `COMP0173_Figures` directory.
The corresponding LaTeX-formatted performance comparison tables and Jupyter notebooks are stored in `COMP0173_PDF`.
The compiled documents are available as `COMP0173-CW2-TABLES.pdf` and `COMP0173_PDF/COMP0173-CW2-NOTEBOOK-XX.pdf`.
All prompts used for data augmentation are stored in `COMP0173_Prompts`, and the manually collected stereotypes (with English translations) are provided in `COMP0173_Stereotypes`.
The datasets used for model training and evaluation are stored in `COMP0173_Data`, which contains:

- rubias.tsv — RuBias dataset [6, 7]
- ruster.csv — RuSter dataset (see Part 2 of the notebook for source websites)
- rubist.csv — RBS dataset: RuBias + RuSter augmented with LLM-generated samples (Claude Sonnet), using a zero-shot prompt with examples
- rubist_second.csv — RBSA dataset: RuBias + RuSter augmented with LLM-generated samples using a second prompt version without examples

The notebooks `COMP0173_PDF/COMP0173-CW2-NOTEBOOK-P3.pdf` and `COMP0173_PDF/COMP0173-CW2-NOTEBOOK-P5.pdf` are replications of `COMP0173_PDF/COMP0173-CW2-NOTEBOOK-P2.pdf` and `COMP0173_PDF/COMP0173-CW2-NOTEBOOK-P4.pdf`, where P2 provides the new `RBSA` dataset built with the second prompt (without examples) and P5 demonstrates the model running on GPU (the saved results are from GPU fine-tuning).

# Technical Implementation (70%)

In [None]:
# Import libraries
# Standard library
import importlib.util
import os
import pathlib
import platform
import random
import sys
import warnings
from importlib import reload
from importlib.machinery import SourceFileLoader
from pathlib import Path

# Third-party packages
import numpy as np
import pandas as pd
import spacy
import torch
import transformers
from datasets import load_dataset
from IPython.display import display

In [None]:
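# Check the GPU host (UCL access)
print("CUDA available:", torch.cuda.is_available())
# Query the device name only when CUDA is present; the call raises otherwise
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))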

CUDA available: True
Device: NVIDIA GeForce RTX 3090 Ti


In [None]:
# Path
import os
os.chdir("/tmp/HEARTS-Text-Stereotype-Detection")
os.getcwd()

'/tmp/HEARTS-Text-Stereotype-Detection'

## Part 1: Replicate the baseline AI methodology using the open dataset 

The HEARTS framework evaluates stereotype detection using four publicly available text datasets [3]: the Multi-Grain Stereotype Dataset (MGSD), Augmented WinoQueer (AWinoQueer), Augmented SeeGULL (ASeeGULL), and the Expanded Multi-Grain Stereotype Dataset (EMGSD). The last of these, EMGSD, includes labelled stereotypical and non-stereotypical statements covering gender, profession, nationality, race, religion, and LGBTQ+ stereotypes. All datasets referenced in the paper are openly accessible through the associated GitHub repository [2] and Kaggle [3].
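
As a rough illustration of the tabular layout, the training logs printed later in this notebook show the columns `group`, `text`, `label`, and `data_name`. The sketch below builds a tiny in-memory sample with those columns and counts labels per group. The rows are placeholders written for this example, not records from any of the datasets.

```python
import io

import pandas as pd

# Placeholder rows (hypothetical), using the column names seen in the training logs
csv_text = """group,text,label,data_name
race,placeholder sentence 1,1,MGSD
gender,placeholder sentence 2,1,MGSD
race,placeholder sentence 3,0,MGSD
"""

sample = pd.read_csv(io.StringIO(csv_text))

# Count how many rows each group has per label value
label_counts = sample.groupby(["group", "label"]).size().unstack(fill_value=0)
print(label_counts)
```

The same `groupby` call could be applied to a real file loaded with `pd.read_csv`, provided it has the same column names.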

### Helper Functions

In [None]:
def collect_summary(project_root: Path) -> pd.DataFrame:
    
    """
    Collect summary metrics from all model output folders.

    Parameters
    ----------
    project_root : Path
        Path to the root directory containing result_output_* folders.

    Returns
    -------
    pd.DataFrame
        A table containing model type, training dataset, evaluation dataset,
        accuracy, macro precision, macro recall, and macro F1 score.
    """
    
    rows = []

    # Loop through each model type directory
    for model_dir in [
        "result_output_albertv2",
        "result_output_bert",
        "result_output_distilbert",
    ]:
        base = project_root / model_dir
        if not base.exists():
            continue

        # Standardise model name
        model_name = model_dir.replace("result_output_", "")

        # Iterate through training subsets
        for train_dir in sorted(base.iterdir()):
            if not train_dir.is_dir():
                continue
            train_name = train_dir.name

            # Iterate through evaluation subsets
            for eval_dir in sorted(train_dir.iterdir()):
                if not eval_dir.is_dir():
                    continue
                eval_name = eval_dir.name

                report_path = eval_dir / "classification_report.csv"
                if not report_path.exists():
                    continue

                # Load sklearn classification_report CSV
                rep = pd.read_csv(report_path, index_col=0)

                # Extract accuracy and macro-level metrics
                acc = rep.loc["accuracy", "precision"]  # sklearn stores accuracy here
                macro_f1 = rep.loc["macro avg", "f1-score"]
                macro_prec = rep.loc["macro avg", "precision"]
                macro_rec = rep.loc["macro avg", "recall"]

                rows.append(
                    {
                        "model": model_name,
                        "train_on": train_name,
                        "eval_on": eval_name,
                        "accuracy": acc,
                        "macro_precision": macro_prec,
                        "macro_recall": macro_rec,
                        "macro_f1": macro_f1,
                    }
                )

    return pd.DataFrame(rows)
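
The lookup `rep.loc["accuracy", "precision"]` relies on a particular CSV layout. A minimal sketch of that layout follows, assuming the report was saved from scikit-learn's dictionary-style classification report converted to a transposed DataFrame, in which case the `accuracy` row repeats the same value in every column. The metric values and the temporary folder are made up for illustration; only the folder pattern `result_output_<model>/<train_dir>/<eval_dir>/classification_report.csv` comes from the function above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical metric values, arranged like a transposed classification report
toy_report = pd.DataFrame(
    {
        "precision": [0.80, 0.70, 0.75, 0.75, 0.75],
        "recall": [0.70, 0.80, 0.75, 0.75, 0.75],
        "f1-score": [0.75, 0.75, 0.75, 0.75, 0.75],
        "support": [10.0, 10.0, 0.75, 20.0, 20.0],
    },
    index=["0", "1", "accuracy", "macro avg", "weighted avg"],
)

with tempfile.TemporaryDirectory() as tmp:
    # Mirror the folder layout that collect_summary walks
    eval_dir = Path(tmp) / "result_output_bert" / "mgsd_trained" / "mgsd"
    eval_dir.mkdir(parents=True)
    toy_report.to_csv(eval_dir / "classification_report.csv")

    # Read the report back the same way collect_summary does
    rep = pd.read_csv(eval_dir / "classification_report.csv", index_col=0)

# Accuracy sits in the "accuracy" row; any metric column holds the same value
acc = rep.loc["accuracy", "precision"]
macro_f1 = rep.loc["macro avg", "f1-score"]
print(f"accuracy={acc:.2f}, macro_f1={macro_f1:.2f}")
```

If the reports were written in a different layout, the `.loc` lookups in `collect_summary` would need to be adjusted accordingly.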

In [None]:
def build_table(summary_df: pd.DataFrame) -> pd.DataFrame:
    
    """
    Construct a pivot table matching the layout of the HEARTS paper.

    Parameters
    ----------
    summary_df : pd.DataFrame
        DataFrame containing model, training set, evaluation set, and metrics.

    Returns
    -------
    pd.DataFrame
        Pivot table indexed by (model, training dataset short name) with
        evaluation datasets as columns and macro-F1 scores as values.
    """

    # Add short labels using mapping dictionaries
    df = summary_df.assign(
        train_short=lambda d: d["train_on"].map(train_map),
        eval_short=lambda d: d["eval_on"].map(eval_map),
    )

    # Remove rows where mapping failed
    df = df.dropna(subset=["train_short", "eval_short"])

    # Create pivot table: rows = model x train_set, cols = eval_set
    table = df.pivot_table(
        index=["model", "train_short"], columns="eval_short", values="macro_f1"
    ).sort_index()

    return table
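
`build_table` reads two module-level dictionaries, `train_map` and `eval_map`, which must exist before it is called. The sketch below repeats the same mapping and pivot steps inline on toy data. The short labels and the macro-F1 values are hypothetical; only the folder names are taken from the pipeline description in this notebook.

```python
import pandas as pd

# Hypothetical short labels; the notebook is expected to define its own mappings
train_map = {
    "mgsd_trained": "MGSD",
    "winoqueer_gpt_augmentation_trained": "AWinoQueer",
}
eval_map = {
    "mgsd": "MGSD",
    "winoqueer_gpt_augmentation": "AWinoQueer",
}

# Made-up macro-F1 values, for illustration only
summary_df = pd.DataFrame(
    [
        {"model": "bert", "train_on": "mgsd_trained",
         "eval_on": "mgsd", "macro_f1": 0.80},
        {"model": "bert", "train_on": "mgsd_trained",
         "eval_on": "winoqueer_gpt_augmentation", "macro_f1": 0.60},
        {"model": "bert", "train_on": "winoqueer_gpt_augmentation_trained",
         "eval_on": "mgsd", "macro_f1": 0.55},
        {"model": "bert", "train_on": "winoqueer_gpt_augmentation_trained",
         "eval_on": "winoqueer_gpt_augmentation", "macro_f1": 0.90},
        # No entry in eval_map for this folder name, so the row is dropped
        {"model": "bert", "train_on": "mgsd_trained",
         "eval_on": "unmapped_eval_set", "macro_f1": 0.10},
    ]
)

# Map folder names to short labels and drop rows where a mapping is missing
df = summary_df.assign(
    train_short=summary_df["train_on"].map(train_map),
    eval_short=summary_df["eval_on"].map(eval_map),
).dropna(subset=["train_short", "eval_short"])

# Rows = (model, training set), columns = evaluation set, values = macro-F1
table = df.pivot_table(
    index=["model", "train_short"], columns="eval_short", values="macro_f1"
).sort_index()
print(table.round(2))
```

Because unmapped folder names are silently dropped, a missing key in either dictionary removes results from the table rather than raising an error, so it is worth checking the mappings against the folder names on disk.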

### $\color{pink}{Question\ 1:}$ Clone the original repository successfully 

The code for HEARTS was obtained by cloning a fork of the official repository [2]. The upstream repository, authored by Holistic AI, was then linked to provide access to the original implementation. 

In [2]:
# # Clone fork
# git clone https://github.com/n1nt3nd0sw1tch/HEARTS-Text-Stereotype-Detection.git
# cd HEARTS-Text-Stereotype-Detection

# # Link to the original repo
# git remote add upstream https://github.com/holistic-ai/HEARTS-Text-Stereotype-Detection.git
# git remote -v   

### $\color{pink}{Question\ 2:}$ Document all dependencies and environment setup

A virtual environment was created to isolate the HEARTS experimentation setup and ensure reproducibility. In the GPU-powered environment, the repository and data paths were set up, and the system and library versions were confirmed [4]. All four datasets required by the HEARTS pipeline were confirmed to be present in the configured data directory.

**Runtime configuration on the UCL GPU node:**

- Working directory: /tmp/HEARTS-Text-Stereotype-Detection

- Python: 3.9.21

- Platform: Linux (x86_64, glibc 2.34)

- PyTorch: 2.8.0+cu128

- CUDA: available (NVIDIA GeForce RTX 3090 Ti or NVIDIA Tesla T4)

- Transformers: 4.57.3

In [None]:
# # Create and activate conda environment
# conda create -n hearts python=3.10 -y
# conda activate hearts

# # Register environment as a Jupyter kernel
# python -m pip install ipykernel
# python -m ipykernel install --user

# # Install repository dependencies
# pip install -r requirements.txt
# pip install --no-cache-dir -r requirements.txt
# pip install --no-cache-dir spacy
# pip install --no-cache-dir scikit-learn
# pip install codecarbon
# python -m spacy download en_core_web_lg

In [None]:
# accelerate==1.1.1
# aiohappyeyeballs==2.4.3
# aiohttp==3.11.8
# aiosignal==1.3.1
# annotated-types==0.7.0
# anyio==4.6.2.post1
# appnope==0.1.4
# arrow==1.3.0
# asttokens==3.0.1
# async-timeout==5.0.1
# attrs==24.2.0
# beautifulsoup4==4.14.3
# blis==1.0.1
# Bottleneck==1.4.2
# catalogue==2.0.10
# certifi==2024.8.30
# cffi==1.17.1
# charset-normalizer==3.4.0
# click==8.1.7
# cloudpathlib==0.16.0
# cloudpickle==3.1.0
# codecarbon==2.8.0
# colorama==0.4.6
# comm==0.2.3
# confection==0.1.5
# contourpy==1.3.1
# cryptography==44.0.0
# cycler==0.12.1
# cymem==2.0.10
# datasets==3.1.0
# DAWG-Python==0.7.2
# DAWG2-Python==0.9.0
# debugpy==1.8.16
# decorator==5.2.1
# deep-translator==1.11.4
# dill==0.3.8
# distro==1.9.0
# docopt==0.6.2
# dotenv==0.9.9
# en_core_web_lg==3.8.0
# et_xmlfile==2.0.0
# exceptiongroup==1.3.1
# executing==2.2.1
# fief-client==0.20.0
# filelock==3.16.1
# fonttools==4.55.0
# frozenlist==1.5.0
# fsspec==2024.9.0
# h11==0.14.0
# hf-xet==1.2.0
# httpcore==1.0.7
# httpx==0.27.2
# huggingface-hub==0.26.3
# idna==3.10
# imageio==2.36.1
# importlib_metadata==8.7.0
# intervaltree==3.1.0
# ipykernel==7.1.0
# ipymarkup==0.9.0
# ipython==8.37.0
# jedi==0.19.2
# Jinja2==3.1.4
# jiter==0.12.0
# joblib==1.4.2
# jupyter_client==8.6.3
# jupyter_core==5.9.1
# jwcrypto==1.5.6
# kiwisolver==1.4.7
# langcodes==3.5.0
# language_data==1.3.0
# lazy_loader==0.4
# lime==0.2.0.1
# llvmlite==0.43.0
# logging==0.4.9.6
# marisa-trie==1.2.1
# markdown-it-py==3.0.0
# MarkupSafe==3.0.2
# matplotlib==3.9.2
# matplotlib-inline==0.2.1
# mdurl==0.1.2
# mpmath==1.3.0
# multidict==6.1.0
# multiprocess==0.70.16
# murmurhash==1.0.11
# natasha==1.6.0
# navec==0.10.0
# nest_asyncio==1.6.0
# networkx==3.4.2
# nltk==3.9.2
# numba==0.60.0
# numexpr==2.14.1
# numpy==2.0.2
# openai==2.9.0
# openpyxl==3.1.5
# packaging==24.2
# pandas==2.3.3
# parso==0.8.5
# pexpect==4.9.0
# pickleshare==0.7.5
# pillow==11.0.0
# pip==25.3
# platformdirs==4.5.1
# preshed==3.0.9
# prometheus_client==0.21.0
# prompt-toolkit==3.0.36
# propcache==0.2.0
# psutil==6.1.0
# ptyprocess==0.7.0
# pure_eval==0.2.3
# py-cpuinfo==9.0.0
# pyarrow==18.1.0
# pycparser==2.22
# pydantic==2.10.2
# pydantic_core==2.27.1
# Pygments==2.18.0
# pymorphy2==0.9.1
# pymorphy2-dicts-ru==2.4.417127.4579844
# pymorphy3==2.0.6
# pymorphy3-dicts-ru==2.4.417150.4580142
# pynvml==11.5.3
# pyparsing==3.2.0
# python-dateutil==2.9.0.post0
# python-dotenv==1.2.1
# pytz==2024.2
# PyYAML==6.0.2
# pyzmq==27.1.0
# questionary==2.0.1
# RapidFuzz==3.10.1
# razdel==0.5.0
# regex==2024.11.6
# requests==2.32.3
# rich==13.9.4
# ru_core_news_lg==3.8.0
# russian-paraphrasers==0.0.3
# safetensors==0.4.5
# scikit-image==0.24.0
# scikit-learn==1.6.0rc1
# scipy==1.14.1
# seaborn==0.13.2
# sentence-transformers==0.4.0
# sentencepiece==0.2.1
# setuptools==75.6.0
# shap==0.46.0
# shellingham==1.5.4
# six==1.16.0
# slicer==0.0.8
# slovnet==0.6.0
# smart-open==6.4.0
# sniffio==1.3.1
# sortedcontainers==2.4.0
# soupsieve==2.8
# spacy==3.8.2
# spacy-legacy==3.0.12
# spacy-loggers==1.0.5
# srsly==2.4.8
# stack_data==0.6.3
# sympy==1.13.1
# termcolor==2.3.0
# thinc==8.3.2
# threadpoolctl==3.5.0
# tifffile==2024.9.20
# tokenizers==0.20.3
# torch==2.5.1
# torchvision==0.20.1
# tornado==6.5.1
# tqdm==4.67.1
# traitlets==5.14.3
# transformers==4.46.3
# typer==0.9.4
# types-python-dateutil==2.9.0.20241003
# typing_extensions==4.12.2
# tzdata==2024.2
# urllib3==2.2.3
# wasabi==1.1.3
# wcwidth==0.2.13
# weasel==0.3.4
# wheel==0.45.1
# wordcloud==1.9.4
# xxhash==3.5.0
# yargy==0.16.0
# yarl==1.18.0
# yaspin==3.1.0
# zipp==3.23.0

In [None]:
# Path
REPO_DIR = Path("/tmp/HEARTS-Text-Stereotype-Detection").resolve()
DATA_DIR = REPO_DIR / "Model Training and Evaluation"

# Change working directory to the repo
os.chdir(REPO_DIR)

In [23]:
print("Current working dir:", Path.cwd())
print("Repository directory:", REPO_DIR)
print("Data directory:", DATA_DIR)

print("Python version:", sys.version)
print("Platform:", platform.platform())

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA device:", torch.cuda.get_device_name(0))
print("Transformers version:", transformers.__version__)

Current working dir: /tmp/HEARTS-Text-Stereotype-Detection
Repository directory: /tmp/HEARTS-Text-Stereotype-Detection
Data directory: /tmp/HEARTS-Text-Stereotype-Detection/Model Training and Evaluation
Python version: 3.9.21 (main, Aug 19 2025, 00:00:00) 
[GCC 11.5.0 20240719 (Red Hat 11.5.0-5)]
Platform: Linux-5.14.0-570.58.1.el9_6.x86_64-x86_64-with-glibc2.34
PyTorch version: 2.8.0+cu128
CUDA available: True
CUDA device: NVIDIA GeForce RTX 3090 Ti
Transformers version: 4.57.3


In [24]:
# Check the csv files
csv_files = [
    "MGSD.csv",
    "Winoqueer - GPT Augmentation.csv",
    "SeeGULL - GPT Augmentation.csv",

    # MGSD Expanded is located elsewhere
    "../Exploratory Data Analysis/MGSD - Expanded.csv",
]

print("\nChecking dataset files:\n")
for f in csv_files:
    path = (DATA_DIR / f).resolve()
    print(f"{f:45} -> {'FOUND' if path.exists() else 'MISSING'}")


Checking dataset files:

MGSD.csv                                      -> FOUND
Winoqueer - GPT Augmentation.csv              -> FOUND
SeeGULL - GPT Augmentation.csv                -> FOUND
../Exploratory Data Analysis/MGSD - Expanded.csv -> FOUND


### $\color{pink}{Question\ 3:}$ Reproduce baseline results within ±5% of original paper metrics 

To maintain consistency with the HEARTS baseline, all experiments were conducted using a fixed random seed across Python, NumPy, and PyTorch (both CPU and CUDA). A dedicated module, `BERT_Models_Fine_Tuning_Replication.py`, was created as a copy of the original `BERT_Models_Fine_Tuning.py`, with minimal fixes to file paths, training arguments, and compatibility issues, plus a single entry-point function.

The `run_full_hearts_pipeline()` function:

1. Trains ALBERT, BERT, and DistilBERT on each of the four training settings (`mgsd_trained`, `winoqueer_gpt_augmentation_trained`, `seegull_gpt_augmentation_trained`, `merged_winoqueer_seegull_gpt_augmentation_trained`), and

2. Evaluates each model on the four evaluation sets (`mgsd`, `winoqueer_gpt_augmentation`, `seegull_gpt_augmentation`, `merged_winoqueer_seegull_gpt_augmentation`).

For each configuration, the accuracy, macro-precision, macro-recall, and macro-F1 scores are recorded in a CSV file and summarised in a compact results table. The accuracy and macro-F1 scores achieved in this study for all three model families and most dataset combinations are within approximately $±5$ percentage points of the metrics reported in the HEARTS paper, thereby meeting the baseline replication requirement.
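
The ±5 percentage point criterion can be checked mechanically once the replicated and reported scores are side by side. The sketch below is one possible way to do so, assuming scores on a 0 to 1 scale. The function name `within_tolerance`, the configuration names, and all values are hypothetical; none of them are taken from the HEARTS paper or from the results of this notebook.

```python
def within_tolerance(replicated: float, reported: float, tol: float = 0.05) -> bool:
    """Return True if the absolute gap is at most `tol` (0.05 = 5 percentage points)."""
    # A tiny epsilon absorbs floating-point rounding at the boundary
    return abs(replicated - reported) <= tol + 1e-12


# Made-up (replicated, reported) score pairs, for illustration only
pairs = {
    "config_a": (0.79, 0.82),
    "config_b": (0.60, 0.70),
}

checks = {
    name: within_tolerance(replicated, reported)
    for name, (replicated, reported) in pairs.items()
}
print(checks)
```

The comparison is on absolute differences, which matches a tolerance stated in percentage points rather than as a relative percentage of the reported score.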

In [25]:
# Set a fixed seed
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

In [None]:
# Path to the replication script (its directory name contains spaces, so it is loaded by file path)
module_path = pathlib.Path("Model Training and Evaluation/BERT_Models_Fine_Tuning_Replication.py")

# Load the script as a module named "hearts_models"
spec = importlib.util.spec_from_file_location("hearts_models", module_path)
hearts_models = importlib.util.module_from_spec(spec)
spec.loader.exec_module(hearts_models)

# Run the full training and evaluation pipeline
hearts_models.run_full_hearts_pipeline()

Loading: /tmp/HEARTS-Text-Stereotype-Detection/Model Training and Evaluation/MGSD.csv
First few examples from the training data:
        group                                               text  label  \
38518    race  Syria is a well known area nowadays. There is ...      1   
12979    race                   The people of germany are angry.      1   
35494    race  Lebanon is a small country in the Middle East ...      1   
42213    race  We have a french tutor. She is very humble and...      0   
16293  gender                              Every male acts dumb.      1   

      data_name  
38518      MGSD  
12979      MGSD  
35494      MGSD  
42213      MGSD  
16293      MGSD  
First few examples from the testing data:
            group                                               text  label  \
17851        race      When i see an arab, i think they are a happy.      0   
37449        race  People in India dress differently. The types o...      0   
23634      gender    Everyone has

loading configuration file config.json from cache at /cs/student/msc/aisd/2025/bekenova/.cache/huggingface/hub/models--albert--albert-base-v2/snapshots/8e2f239c5f8a2c0f253781ca60135db913e5c80c/config.json
Model config AlbertConfig {
  "architectures": [
    "AlbertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "net_structure_type": 0,
  "num_attention_heads": 12,
  "num_hidden_groups": 1,
  "num_hidden_layers": 12,
  "num_memory_blocks": 0,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.57.3",
  "type_vocab_size": 2,
  "vocab_size": 30000
}

loading

Sample tokenized input from train: {'group': 'religion', 'text': 'Have you ever met someone named Mohammed? I need a new pair of pajamas.', 'label': 0, 'data_name': 'MGSD', '__index_level_0__': 36953, 'input_ids': [2, 57, 42, 462, 798, 737, 377, 13479, 60, 31, 376, 21, 78, 2146, 16, 23358, 18, 9, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


Map: 100%|████████████████████████| 8299/8299 [00:00<00:00, 25066.94 examples/s]
Map: 100%|████████████████████████| 8299/8299 [00:00<00:00, 25612.81 examples/s]
PyTorch: setting up devices
  trainer = Trainer(


Sample tokenized input from validation: {'group': 'race', 'text': 'My friend is German and sometimes we hand out. He gets very angry sometimes.', 'label': 1, 'data_name': 'MGSD', '__index_level_0__': 36651, 'input_ids': [2, 51, 860, 25, 548, 17, 1030, 95, 224, 70, 9, 24, 3049, 253, 3365, 1030, 9, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 1}


The following columns in the Training set don't have a corresponding argument in `AlbertForSequenceClassification.forward` and have been ignored: group, __index_level_0__, data_name, text. If group, __index_level_0__, data_name, text are not expected by `AlbertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 33,194
  Num Epochs = 6
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 3,114
  Number of trainable parameters = 11,685,122


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Balanced Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.4359 | 0.421646 | 0.762911 | 0.75982 | 0.761298 | 0.75982 |
| 2 | 0.3794 | 0.391999 | 0.806277 | 0.771857 | 0.783901 | 0.771857 |
| 3 | 0.2602 | 0.417022 | 0.807062 | 0.786549 | 0.794795 | 0.786549 |
| 4 | 0.1704 | 0.474694 | 0.804454 | 0.798714 | 0.801402 | 0.798714 |
| 5 | 0.0907 | 0.675297 | 0.803132 | 0.781201 | 0.789853 | 0.781201 |
| 6 | 0.032 | 0.788018 | 0.801895 | 0.795548 | 0.798498 | 0.795548 |


The following columns in the Evaluation set don't have a corresponding argument in `AlbertForSequenceClassification.forward` and have been ignored: group, __index_level_0__, data_name, text. If group, __index_level_0__, data_name, text are not expected by `AlbertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 8299
  Batch size = 64
Saving model checkpoint to model_output_albertv2/mgsd_trained/checkpoint-519
Configuration saved in model_output_albertv2/mgsd_trained/checkpoint-519/config.json
Model weights saved in model_output_albertv2/mgsd_trained/checkpoint-519/model.safetensors
tokenizer config file saved in model_output_albertv2/mgsd_trained/checkpoint-519/tokenizer_config.json
Special tokens file saved in model_output_albertv2/mgsd_trained/checkpoint-519/special_tokens_map.json
The following columns in the Evaluation set don't have a corresponding argument in `AlbertForSequenceClassification.forward` and have

Saved fine-tuned model to: model_output_albertv2/mgsd_trained
Number of unique labels: 2


Map: 100%|██████████████████████████| 653/653 [00:00<00:00, 32979.49 examples/s]
Map: 100%|██████████████████████████| 653/653 [00:00<00:00, 32939.82 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'most people who are Gay are kites', 'label': 0, 'group': 'Gay', 'data_name': 'Winoqueer - GPT Augmentation', '__index_level_0__': 3236, 'input_ids': [2, 127, 148, 72, 50, 3398, 50, 19196, 18, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_albertv2/mgsd_trained/config.json
Model config AlbertConfig {
  "architectures": [
    "AlbertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "dtype": "float32",
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "net_structure_type": 0,
  "num_attention_heads": 12,
  "num_hidden_groups": 1,
  "num_hidden_layers": 12,
  "num_memory_blocks": 0,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "transformers_version": "4.57.3",
  "type_vocab_size": 2,
  "vocab_size": 30000
}

loading weights file model_output_albertv2/mgsd_train

Saved full results to: result_output_albertv2/mgsd_trained/winoqueer_gpt_augmentation/full_results.csv
Saved classification report to: result_output_albertv2/mgsd_trained/winoqueer_gpt_augmentation/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████████| 414/414 [00:00<00:00, 26623.15 examples/s]
Map: 100%|██████████████████████████| 414/414 [00:00<00:00, 30635.88 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'The Somali refugees were labeled as unacquainted by some', 'label': 0, 'group': 'Somalis', 'data_name': 'SeeGULL - GPT Augmentation', '__index_level_0__': 840, 'input_ids': [2, 14, 16330, 8790, 46, 14348, 28, 367, 26692, 108, 1427, 34, 109, 3, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0], 'labels': 0}


Saved full results to: result_output_albertv2/mgsd_trained/seegull_gpt_augmentation/full_results.csv
Saved classification report to: result_output_albertv2/mgsd_trained/seegull_gpt_augmentation/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████| 10374/10374 [00:00<00:00, 25262.18 examples/s]
Map: 100%|██████████████████████| 10374/10374 [00:00<00:00, 27014.61 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'group': 'race', 'text': 'When i see an arab, i think they are a happy.', 'label': 0, 'data_name': 'MGSD', '__index_level_0__': 17851, 'input_ids': [2, 76, 31, 196, 40, 3666, 15, 31, 277, 59, 50, 21, 1700, 9, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


Saved full results to: result_output_albertv2/mgsd_trained/mgsd/full_results.csv
Saved classification report to: result_output_albertv2/mgsd_trained/mgsd/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████| 11441/11441 [00:00<00:00, 25754.99 examples/s]
Map: 100%|██████████████████████| 11441/11441 [00:00<00:00, 29745.51 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'The Somali refugees were labeled as unacquainted by some', 'label': 0, 'group': 'Somalis', 'data_name': 'SeeGULL - GPT Augmentation', 'input_ids': [2, 14, 16330, 8790, 46, 14348, 28, 367, 26692, 108, 1427, 34, 109, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


Saved full results to: result_output_albertv2/mgsd_trained/merged_winoqueer_seegull_gpt_augmentation/full_results.csv
Saved classification report to: result_output_albertv2/mgsd_trained/merged_winoqueer_seegull_gpt_augmentation/classification_report.csv
Number of unique labels: 2


Some weights of the model checkpoint at albert/albert-base-v2 were not used when initializing AlbertForSequenceClassification: ['predictions.LayerNorm.bias', 'predictions.LayerNorm.weight', 'predictions.bias', 'predictions.decoder.bias', 'predictions.dense.bias', 'predictions.dense.weight']
- This IS expected if you are initializing AlbertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert/albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRA

Sample tokenized input from train: {'text': 'we expect that nonbinary people are out to share their views.', 'label': 0, 'group': 'NB', 'data_name': 'Winoqueer - GPT Augmentation', '__index_level_0__': 1788, 'input_ids': [2, 95, 4186, 30, 538, 4772, 1857, 148, 50, 70, 20, 1891, 66, 4146, 9, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}
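
The logged sample shows how padding and the attention mask line up: every position that holds the pad token id (0, matching `pad_token_id` in the config) has mask value 0. The snippet below is a minimal sketch of that relationship, not the tokenizer's own implementation. The helper name `attention_mask_from_ids` is hypothetical, and the approach assumes no real token shares the pad id.

```python
def attention_mask_from_ids(input_ids, pad_token_id=0):
    """Return 1 for real tokens and 0 for padding positions."""
    return [0 if tok == pad_token_id else 1 for tok in input_ids]


# input_ids copied from the logged training sample above
sample_ids = [2, 95, 4186, 30, 538, 4772, 1857, 148, 50, 70, 20, 1891, 66,
              4146, 9, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

mask = attention_mask_from_ids(sample_ids)
real_tokens = sum(mask)  # number of non-padding positions
print(mask, real_tokens)
```

The derived mask should agree with the `attention_mask` field printed in the sample.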


Map: 100%|██████████████████████████| 523/523 [00:00<00:00, 33515.47 examples/s]
Map: 100%|██████████████████████████| 523/523 [00:00<00:00, 31945.78 examples/s]
PyTorch: setting up devices
  trainer = Trainer(


Sample tokenized input from validation: {'text': 'all lesbians are studious.', 'label': 0, 'group': 'Lesbian', 'data_name': 'Winoqueer - GPT Augmentation', '__index_level_0__': 2716, 'input_ids': [2, 65, 10564, 18, 50, 1120, 267, 9, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


The following columns in the Training set don't have a corresponding argument in `AlbertForSequenceClassification.forward` and have been ignored: group, __index_level_0__, data_name, text. If group, __index_level_0__, data_name, text are not expected by `AlbertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 2,088
  Num Epochs = 6
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 198
  Number of trainable parameters = 11,685,122
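
The trainable-parameter count reported above can be reproduced from the values in the logged `AlbertConfig`. The sketch below assumes the usual ALBERT layout (factorised embeddings, one transformer layer shared across all 12 hidden layers, a pooler, and a 2-label classification head). That layout is an assumption about the architecture, not something the log prints, and the helper names are illustrative.

```python
def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Parameters of a LayerNorm: scale plus shift."""
    return 2 * n


# Values taken from the logged AlbertConfig
vocab_size, embedding_size, hidden_size = 30000, 128, 768
max_positions, type_vocab_size, intermediate_size = 512, 2, 3072
num_labels = 2  # "Number of unique labels: 2"

# Word, position and token-type embeddings, followed by a LayerNorm
embeddings = (vocab_size + max_positions + type_vocab_size) * embedding_size
embeddings += layer_norm(embedding_size)

# Projection from the small embedding size up to the hidden size
projection = linear(embedding_size, hidden_size)

# One transformer layer, counted once because its weights are shared
shared_layer = (
    4 * linear(hidden_size, hidden_size)      # query, key, value, output
    + layer_norm(hidden_size)                 # attention LayerNorm
    + linear(hidden_size, intermediate_size)  # feed-forward in
    + linear(intermediate_size, hidden_size)  # feed-forward out
    + layer_norm(hidden_size)                 # final LayerNorm
)

pooler = linear(hidden_size, hidden_size)
classifier = linear(hidden_size, num_labels)

total = embeddings + projection + shared_layer + pooler + classifier
print(f"{total:,}")
```

Because the layer weights are shared, `num_hidden_layers` does not multiply the count, which is why the total stays near 11.7 million.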


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Balanced Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 0.406159 | 0.871846 | 0.795361 | 0.819554 | 0.795361 |
| 2 | 0.507000 | 0.200735 | 0.925505 | 0.899788 | 0.911097 | 0.899788 |
| 3 | 0.507000 | 0.078651 | 0.97494 | 0.971805 | 0.973351 | 0.971805 |
| 4 | 0.127600 | 0.054762 | 0.979424 | 0.98556 | 0.982406 | 0.98556 |
| 5 | 0.040100 | 0.057458 | 0.982292 | 0.982292 | 0.982292 | 0.982292 |
| 6 | 0.040100 | 0.056016 | 0.980816 | 0.983926 | 0.98235 | 0.983926 |
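
In every row of the table above, Recall and Balanced Accuracy are identical. That is consistent with recall being macro-averaged, since balanced accuracy is defined as the mean of the per-class recalls. The averaging mode is inferred from the numbers, not stated in the log. The sketch below illustrates the definition on made-up toy labels; the function names and data are hypothetical.

```python
def per_class_recall(y_true, y_pred):
    """Recall for each class: correct predictions / true members."""
    recalls = {}
    for cls in sorted(set(y_true)):
        members = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in members if y_pred[i] == cls)
        recalls[cls] = hits / len(members)
    return recalls


def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall (equals balanced accuracy)."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


def accuracy(y_true, y_pred):
    """Plain accuracy, which is dominated by the majority class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, imbalanced labels (illustrative only, not from the datasets)
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0]

print(per_class_recall(y_true, y_pred))
print(macro_recall(y_true, y_pred), accuracy(y_true, y_pred))
```

On imbalanced data the two summaries diverge, which is the usual reason to report balanced accuracy alongside or in place of plain accuracy.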


The following columns in the Evaluation set don't have a corresponding argument in `AlbertForSequenceClassification.forward` and have been ignored: group, __index_level_0__, data_name, text. If group, __index_level_0__, data_name, text are not expected by `AlbertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 523
  Batch size = 64
Saving model checkpoint to model_output_albertv2/winoqueer_gpt_augmentation_trained/checkpoint-33
Configuration saved in model_output_albertv2/winoqueer_gpt_augmentation_trained/checkpoint-33/config.json
Model weights saved in model_output_albertv2/winoqueer_gpt_augmentation_trained/checkpoint-33/model.safetensors
tokenizer config file saved in model_output_albertv2/winoqueer_gpt_augmentation_trained/checkpoint-33/tokenizer_config.json
Special tokens file saved in model_output_albertv2/winoqueer_gpt_augmentation_trained/checkpoint-33/special_tokens_map.json
The following columns in the 

Saved fine-tuned model to: model_output_albertv2/winoqueer_gpt_augmentation_trained
Number of unique labels: 2


Map: 100%|██████████████████████████| 653/653 [00:00<00:00, 32616.59 examples/s]
Map: 100%|██████████████████████████| 653/653 [00:00<00:00, 32063.69 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'most people who are Gay are kites', 'label': 0, 'group': 'Gay', 'data_name': 'Winoqueer - GPT Augmentation', '__index_level_0__': 3236, 'input_ids': [2, 127, 148, 72, 50, 3398, 50, 19196, 18, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_albertv2/winoqueer_gpt_augmentation_trained/config.json
Model config AlbertConfig {
  "architectures": [
    "AlbertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "dtype": "float32",
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "net_structure_type": 0,
  "num_attention_heads": 12,
  "num_hidden_groups": 1,
  "num_hidden_layers": 12,
  "num_memory_blocks": 0,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "transformers_version": "4.57.3",
  "type_vocab_size": 2,
  "vocab_size": 30000
}

loading weights file model_outp

Saved full results to: result_output_albertv2/winoqueer_gpt_augmentation_trained/winoqueer_gpt_augmentation/full_results.csv
Saved classification report to: result_output_albertv2/winoqueer_gpt_augmentation_trained/winoqueer_gpt_augmentation/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████████| 414/414 [00:00<00:00, 26737.94 examples/s]
Map: 100%|██████████████████████████| 414/414 [00:00<00:00, 30124.07 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'The Somali refugees were labeled as unacquainted by some', 'label': 0, 'group': 'Somalis', 'data_name': 'SeeGULL - GPT Augmentation', '__index_level_0__': 840, 'input_ids': [2, 14, 16330, 8790, 46, 14348, 28, 367, 26692, 108, 1427, 34, 109, 3, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_albertv2/winoqueer_gpt_augmentation_trained/config.json
Model config AlbertConfig {
  "architectures": [
    "AlbertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "dtype": "float32",
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "net_structure_type": 0,
  "num_attention_heads": 12,
  "num_hidden_groups": 1,
  "num_hidden_layers": 12,
  "num_memory_blocks": 0,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "transformers_version": "4.57.3",
  "type_vocab_size": 2,
  "vocab_size": 30000
}

loading weights file model_outp

Saved full results to: result_output_albertv2/winoqueer_gpt_augmentation_trained/seegull_gpt_augmentation/full_results.csv
Saved classification report to: result_output_albertv2/winoqueer_gpt_augmentation_trained/seegull_gpt_augmentation/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████| 10374/10374 [00:00<00:00, 24882.03 examples/s]
Map: 100%|██████████████████████| 10374/10374 [00:00<00:00, 26299.02 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'group': 'race', 'text': 'When i see an arab, i think they are a happy.', 'label': 0, 'data_name': 'MGSD', '__index_level_0__': 17851, 'input_ids': [2, 76, 31, 196, 40, 3666, 15, 31, 277, 59, 50, 21, 1700, 9, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_albertv2/winoqueer_gpt_augmentation_trained/config.json
Model config AlbertConfig {
  "architectures": [
    "AlbertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "dtype": "float32",
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "net_structure_type": 0,
  "num_attention_heads": 12,
  "num_hidden_groups": 1,
  "num_hidden_layers": 12,
  "num_memory_blocks": 0,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "transformers_version": "4.57.3",
  "type_vocab_size": 2,
  "vocab_size": 30000
}

loading weights file model_outp

Saved full results to: result_output_albertv2/winoqueer_gpt_augmentation_trained/mgsd/full_results.csv
Saved classification report to: result_output_albertv2/winoqueer_gpt_augmentation_trained/mgsd/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████| 11441/11441 [00:00<00:00, 26019.70 examples/s]
Map: 100%|██████████████████████| 11441/11441 [00:00<00:00, 29392.62 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'The Somali refugees were labeled as unacquainted by some', 'label': 0, 'group': 'Somalis', 'data_name': 'SeeGULL - GPT Augmentation', 'input_ids': [2, 14, 16330, 8790, 46, 14348, 28, 367, 26692, 108, 1427, 34, 109, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file config.json from cache at /cs/student/msc/aisd/2025/bekenova/.cache/huggingface/hub/models--albert--albert-base-v2/snapshots/8e2f239c5f8a2c0f253781ca60135db913e5c80c/config.json
Model config AlbertConfig {
  "architectures": [
    "AlbertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "net_structure_type": 0,
  "num_attention_heads": 12,
  "num_hidden_groups": 1,
  "num_hidden_layers": 12,
  "num_memory_blocks": 0,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.57.3",
  "type_vocab_size": 2,
  "vocab_size": 30000
}

loading

Saved full results to: result_output_albertv2/winoqueer_gpt_augmentation_trained/merged_winoqueer_seegull_gpt_augmentation/full_results.csv
Saved classification report to: result_output_albertv2/winoqueer_gpt_augmentation_trained/merged_winoqueer_seegull_gpt_augmentation/classification_report.csv
Number of unique labels: 2


Sample tokenized input from train: {'text': 'The average Chadian income is modest compared to developed countries', 'label': 0, 'group': 'Chadian', 'data_name': 'SeeGULL - GPT Augmentation', '__index_level_0__': 1249, 'input_ids': [2, 14, 862, 8708, 806, 1587, 25, 8904, 2428, 20, 885, 1166, 3, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


Map: 100%|██████████████████████████| 332/332 [00:00<00:00, 28725.74 examples/s]
Map: 100%|██████████████████████████| 332/332 [00:00<00:00, 29558.04 examples/s]
PyTorch: setting up devices
  trainer = Trainer(


Sample tokenized input from validation: {'text': 'The Ethiopian woman was short and petite', 'label': 1, 'group': 'Ethiopian', 'data_name': 'SeeGULL - GPT Augmentation', '__index_level_0__': 609, 'input_ids': [2, 14, 16659, 524, 23, 502, 17, 22471, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 1}


The following columns in the Training set don't have a corresponding argument in `AlbertForSequenceClassification.forward` and have been ignored: group, __index_level_0__, data_name, text. If group, __index_level_0__, data_name, text are not expected by `AlbertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1,324
  Num Epochs = 6
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 126
  Number of trainable parameters = 11,685,122
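
The "Total optimization steps" figure follows from the other values in this block. The sketch below assumes the final partial batch of each epoch is kept and that a single device is used, as the matching per-device and total batch sizes suggest. The function names are illustrative.

```python
import math


def steps_per_epoch(num_examples, batch_size, grad_accum=1):
    """Optimizer updates in one epoch, keeping the last partial batch."""
    batches = math.ceil(num_examples / batch_size)
    return math.ceil(batches / grad_accum)


def total_steps(num_examples, batch_size, epochs, grad_accum=1):
    """Optimizer updates over the whole training run."""
    return steps_per_epoch(num_examples, batch_size, grad_accum) * epochs


# Values from the three "Running training" blocks in this log
for n in (2088, 1324, 36608):
    print(n, steps_per_epoch(n, 64), total_steps(n, 64, epochs=6))
```

For this run, 1,324 examples at batch size 64 give 21 steps per epoch and 126 steps over 6 epochs. The `checkpoint-21` directory that appears in the log therefore corresponds to the end of the first epoch.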


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Balanced Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 0.496939 | 0.810158 | 0.65085 | 0.658717 | 0.65085 |
| 2 | No log | 0.278642 | 0.872588 | 0.869186 | 0.870844 | 0.869186 |
| 3 | 0.429700 | 0.291309 | 0.867118 | 0.896091 | 0.876893 | 0.896091 |
| 4 | 0.429700 | 0.279223 | 0.883646 | 0.900677 | 0.890901 | 0.900677 |
| 5 | 0.137800 | 0.261838 | 0.898946 | 0.907484 | 0.902943 | 0.907484 |
| 6 | 0.137800 | 0.265918 | 0.89517 | 0.905222 | 0.899807 | 0.905222 |


The following columns in the Evaluation set don't have a corresponding argument in `AlbertForSequenceClassification.forward` and have been ignored: group, __index_level_0__, data_name, text. If group, __index_level_0__, data_name, text are not expected by `AlbertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 332
  Batch size = 64
Saving model checkpoint to model_output_albertv2/seegull_gpt_augmentation_trained/checkpoint-21
Configuration saved in model_output_albertv2/seegull_gpt_augmentation_trained/checkpoint-21/config.json
Model weights saved in model_output_albertv2/seegull_gpt_augmentation_trained/checkpoint-21/model.safetensors
tokenizer config file saved in model_output_albertv2/seegull_gpt_augmentation_trained/checkpoint-21/tokenizer_config.json
Special tokens file saved in model_output_albertv2/seegull_gpt_augmentation_trained/checkpoint-21/special_tokens_map.json
The following columns in the Evaluation

Saved fine-tuned model to: model_output_albertv2/seegull_gpt_augmentation_trained
Number of unique labels: 2


Map: 100%|██████████████████████████| 653/653 [00:00<00:00, 32752.72 examples/s]
Map: 100%|██████████████████████████| 653/653 [00:00<00:00, 32542.19 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'most people who are Gay are kites', 'label': 0, 'group': 'Gay', 'data_name': 'Winoqueer - GPT Augmentation', '__index_level_0__': 3236, 'input_ids': [2, 127, 148, 72, 50, 3398, 50, 19196, 18, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_albertv2/seegull_gpt_augmentation_trained/config.json
Model config AlbertConfig {
  "architectures": [
    "AlbertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "dtype": "float32",
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "net_structure_type": 0,
  "num_attention_heads": 12,
  "num_hidden_groups": 1,
  "num_hidden_layers": 12,
  "num_memory_blocks": 0,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "transformers_version": "4.57.3",
  "type_vocab_size": 2,
  "vocab_size": 30000
}

loading weights file model_output

Saved full results to: result_output_albertv2/seegull_gpt_augmentation_trained/winoqueer_gpt_augmentation/full_results.csv
Saved classification report to: result_output_albertv2/seegull_gpt_augmentation_trained/winoqueer_gpt_augmentation/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████████| 414/414 [00:00<00:00, 26535.68 examples/s]
Map: 100%|██████████████████████████| 414/414 [00:00<00:00, 31447.60 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'The Somali refugees were labeled as unacquainted by some', 'label': 0, 'group': 'Somalis', 'data_name': 'SeeGULL - GPT Augmentation', '__index_level_0__': 840, 'input_ids': [2, 14, 16330, 8790, 46, 14348, 28, 367, 26692, 108, 1427, 34, 109, 3, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_albertv2/seegull_gpt_augmentation_trained/config.json
Model config AlbertConfig {
  "architectures": [
    "AlbertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "dtype": "float32",
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "net_structure_type": 0,
  "num_attention_heads": 12,
  "num_hidden_groups": 1,
  "num_hidden_layers": 12,
  "num_memory_blocks": 0,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "transformers_version": "4.57.3",
  "type_vocab_size": 2,
  "vocab_size": 30000
}

loading weights file model_output

Saved full results to: result_output_albertv2/seegull_gpt_augmentation_trained/seegull_gpt_augmentation/full_results.csv
Saved classification report to: result_output_albertv2/seegull_gpt_augmentation_trained/seegull_gpt_augmentation/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████| 10374/10374 [00:00<00:00, 25123.87 examples/s]
Map: 100%|██████████████████████| 10374/10374 [00:00<00:00, 26863.03 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'group': 'race', 'text': 'When i see an arab, i think they are a happy.', 'label': 0, 'data_name': 'MGSD', '__index_level_0__': 17851, 'input_ids': [2, 76, 31, 196, 40, 3666, 15, 31, 277, 59, 50, 21, 1700, 9, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_albertv2/seegull_gpt_augmentation_trained/config.json
Model config AlbertConfig {
  "architectures": [
    "AlbertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "dtype": "float32",
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "net_structure_type": 0,
  "num_attention_heads": 12,
  "num_hidden_groups": 1,
  "num_hidden_layers": 12,
  "num_memory_blocks": 0,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "transformers_version": "4.57.3",
  "type_vocab_size": 2,
  "vocab_size": 30000
}

loading weights file model_output

Saved full results to: result_output_albertv2/seegull_gpt_augmentation_trained/mgsd/full_results.csv
Saved classification report to: result_output_albertv2/seegull_gpt_augmentation_trained/mgsd/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████| 11441/11441 [00:00<00:00, 25824.60 examples/s]
Map: 100%|██████████████████████| 11441/11441 [00:00<00:00, 29088.25 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'The Somali refugees were labeled as unacquainted by some', 'label': 0, 'group': 'Somalis', 'data_name': 'SeeGULL - GPT Augmentation', 'input_ids': [2, 14, 16330, 8790, 46, 14348, 28, 367, 26692, 108, 1427, 34, 109, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file config.json from cache at /cs/student/msc/aisd/2025/bekenova/.cache/huggingface/hub/models--albert--albert-base-v2/snapshots/8e2f239c5f8a2c0f253781ca60135db913e5c80c/config.json
Model config AlbertConfig {
  "architectures": [
    "AlbertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "net_structure_type": 0,
  "num_attention_heads": 12,
  "num_hidden_groups": 1,
  "num_hidden_layers": 12,
  "num_memory_blocks": 0,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.57.3",
  "type_vocab_size": 2,
  "vocab_size": 30000
}

loading

Saved full results to: result_output_albertv2/seegull_gpt_augmentation_trained/merged_winoqueer_seegull_gpt_augmentation/full_results.csv
Saved classification report to: result_output_albertv2/seegull_gpt_augmentation_trained/merged_winoqueer_seegull_gpt_augmentation/classification_report.csv
Number of unique labels: 2


Sample tokenized input from train: {'text': 'The commander could tell the soldier was disciplined.', 'label': 1, 'group': 'profession', 'data_name': 'MGSD', '__index_level_0__': 7380, 'input_ids': [2, 14, 1647, 110, 494, 14, 3994, 23, 7927, 43, 9, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 1}


Map: 100%|████████████████████████| 9152/9152 [00:00<00:00, 25961.68 examples/s]
Map: 100%|████████████████████████| 9152/9152 [00:00<00:00, 26573.65 examples/s]
PyTorch: setting up devices
  trainer = Trainer(
The following columns in the Training set don't have a corresponding argument in `AlbertForSequenceClassification.forward` and have been ignored: group, __index_level_0__, data_name, text. If group, __index_level_0__, data_name, text are not expected by `AlbertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 36,608
  Num Epochs = 6
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 3,432
  Number of trainable parameters = 11,685,122


Sample tokenized input from validation: {'text': 'we expect that Queer people are artistic.', 'label': 0, 'group': 'Queer', 'data_name': 'Winoqueer - GPT Augmentation', '__index_level_0__': 3705, 'input_ids': [2, 95, 4186, 30, 20242, 148, 50, 4917, 9, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Balanced Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.3926 | 0.404256 | 0.781526 | 0.756578 | 0.765872 | 0.756578 |
| 2 | 0.3483 | 0.373967 | 0.808086 | 0.800762 | 0.804144 | 0.800762 |
| 3 | 0.2564 | 0.388171 | 0.821823 | 0.807886 | 0.813955 | 0.807886 |
| 4 | 0.1782 | 0.438449 | 0.81938 | 0.810596 | 0.81461 | 0.810596 |
| 5 | 0.0779 | 0.606645 | 0.809638 | 0.812734 | 0.811131 | 0.812734 |
| 6 | 0.0399 | 0.738903 | 0.808658 | 0.808987 | 0.808822 | 0.808987 |
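
In this run the training loss keeps falling while the validation loss rises after the second epoch, a pattern commonly read as overfitting. The sketch below shows one way to pick an epoch from such a table, using the logged values. Which checkpoint the run actually keeps is not visible in this log, so the selection criteria here are assumptions for illustration.

```python
import csv
import io

# Per-epoch metrics copied from the training log above
LOG = """Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Balanced Accuracy
1,0.3926,0.404256,0.781526,0.756578,0.765872,0.756578
2,0.3483,0.373967,0.808086,0.800762,0.804144,0.800762
3,0.2564,0.388171,0.821823,0.807886,0.813955,0.807886
4,0.1782,0.438449,0.81938,0.810596,0.81461,0.810596
5,0.0779,0.606645,0.809638,0.812734,0.811131,0.812734
6,0.0399,0.738903,0.808658,0.808987,0.808822,0.808987
"""

rows = list(csv.DictReader(io.StringIO(LOG)))

# Epoch with the lowest validation loss
best_by_loss = min(rows, key=lambda r: float(r["Validation Loss"]))
# Epoch with the highest F1
best_by_f1 = max(rows, key=lambda r: float(r["F1"]))

print("lowest validation loss: epoch", best_by_loss["Epoch"])
print("highest F1: epoch", best_by_f1["Epoch"])
```

The two criteria disagree here (epoch 2 by validation loss, epoch 4 by F1), so the choice of selection metric affects which checkpoint would be evaluated.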


The following columns in the Evaluation set don't have a corresponding argument in `AlbertForSequenceClassification.forward` and have been ignored: group, __index_level_0__, data_name, text. If group, __index_level_0__, data_name, text are not expected by `AlbertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 9152
  Batch size = 64
Saving model checkpoint to model_output_albertv2/merged_winoqueer_seegull_gpt_augmentation_trained/checkpoint-572
Configuration saved in model_output_albertv2/merged_winoqueer_seegull_gpt_augmentation_trained/checkpoint-572/config.json
Model weights saved in model_output_albertv2/merged_winoqueer_seegull_gpt_augmentation_trained/checkpoint-572/model.safetensors
tokenizer config file saved in model_output_albertv2/merged_winoqueer_seegull_gpt_augmentation_trained/checkpoint-572/tokenizer_config.json
Special tokens file saved in model_output_albertv2/merged_winoqueer_seegull_gpt_augmenta

Saved fine-tuned model to: model_output_albertv2/merged_winoqueer_seegull_gpt_augmentation_trained
Number of unique labels: 2


Map: 100%|██████████████████████████| 653/653 [00:00<00:00, 32820.62 examples/s]
Map: 100%|██████████████████████████| 653/653 [00:00<00:00, 33051.52 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'most people who are Gay are kites', 'label': 0, 'group': 'Gay', 'data_name': 'Winoqueer - GPT Augmentation', '__index_level_0__': 3236, 'input_ids': [2, 127, 148, 72, 50, 3398, 50, 19196, 18, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_albertv2/merged_winoqueer_seegull_gpt_augmentation_trained/config.json
Model config AlbertConfig {
  "architectures": [
    "AlbertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "dtype": "float32",
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "net_structure_type": 0,
  "num_attention_heads": 12,
  "num_hidden_groups": 1,
  "num_hidden_layers": 12,
  "num_memory_blocks": 0,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "transformers_version": "4.57.3",
  "type_vocab_size": 2,
  "vocab_size": 30000
}

loading weights 

Saved full results to: result_output_albertv2/merged_winoqueer_seegull_gpt_augmentation_trained/winoqueer_gpt_augmentation/full_results.csv
Saved classification report to: result_output_albertv2/merged_winoqueer_seegull_gpt_augmentation_trained/winoqueer_gpt_augmentation/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████████| 414/414 [00:00<00:00, 26342.45 examples/s]
Map: 100%|██████████████████████████| 414/414 [00:00<00:00, 30666.18 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'The Somali refugees were labeled as unacquainted by some', 'label': 0, 'group': 'Somalis', 'data_name': 'SeeGULL - GPT Augmentation', '__index_level_0__': 840, 'input_ids': [2, 14, 16330, 8790, 46, 14348, 28, 367, 26692, 108, 1427, 34, 109, 3, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_albertv2/merged_winoqueer_seegull_gpt_augmentation_trained/config.json
Model config AlbertConfig {
  "architectures": [
    "AlbertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "dtype": "float32",
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "net_structure_type": 0,
  "num_attention_heads": 12,
  "num_hidden_groups": 1,
  "num_hidden_layers": 12,
  "num_memory_blocks": 0,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "transformers_version": "4.57.3",
  "type_vocab_size": 2,
  "vocab_size": 30000
}

loading weights 

Saved full results to: result_output_albertv2/merged_winoqueer_seegull_gpt_augmentation_trained/seegull_gpt_augmentation/full_results.csv
Saved classification report to: result_output_albertv2/merged_winoqueer_seegull_gpt_augmentation_trained/seegull_gpt_augmentation/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████| 10374/10374 [00:00<00:00, 25022.22 examples/s]
Map: 100%|██████████████████████| 10374/10374 [00:00<00:00, 26777.21 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'group': 'race', 'text': 'When i see an arab, i think they are a happy.', 'label': 0, 'data_name': 'MGSD', '__index_level_0__': 17851, 'input_ids': [2, 76, 31, 196, 40, 3666, 15, 31, 277, 59, 50, 21, 1700, 9, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_albertv2/merged_winoqueer_seegull_gpt_augmentation_trained/config.json

loading weights 

Saved full results to: result_output_albertv2/merged_winoqueer_seegull_gpt_augmentation_trained/mgsd/full_results.csv
Saved classification report to: result_output_albertv2/merged_winoqueer_seegull_gpt_augmentation_trained/mgsd/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████| 11441/11441 [00:00<00:00, 25863.35 examples/s]
Map: 100%|██████████████████████| 11441/11441 [00:00<00:00, 29440.71 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'The Somali refugees were labeled as unacquainted by some', 'label': 0, 'group': 'Somalis', 'data_name': 'SeeGULL - GPT Augmentation', 'input_ids': [2, 14, 16330, 8790, 46, 14348, 28, 367, 26692, 108, 1427, 34, 109, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}
Saved full results to: result_output_albertv2/merged_winoqueer_seegull_gpt_augmentation_trained/merged_winoqueer_seegull_gpt_augmentation/full_results.csv
Saved classification report to: result_output_albertv2/merged_winoqueer_seegull_gpt_augmentation_trained/merged_winoqueer_seegull_gpt_augmentation/classification_report.csv
Number of unique labels: 2


loading configuration file config.json from cache at /cs/student/msc/aisd/2025/bekenova/.cache/huggingface/hub/models--distilbert--distilbert-base-uncased/snapshots/12040accade4e8a0f71eabdb258fecc2e7e948be/config.json
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.57.3",
  "vocab_size": 30522
}

loading weights file model.safetensors from cache at /cs/student/msc/aisd/2025/bekenova/.cache/huggingface/hub/models--distilbert--distilbert-base-uncased/snapshots/12040accade4e8a0f71eabdb258fecc2e7e948be/model.safetensors
Some weights of the model checkpoint at distilber

Sample tokenized input from train: {'group': 'religion', 'text': 'Have you ever met someone named Mohammed? I need a new pair of pajamas.', 'label': 0, 'data_name': 'MGSD', '__index_level_0__': 36953, 'input_ids': [101, 2031, 2017, 2412, 2777, 2619, 2315, 12619, 1029, 1045, 2342, 1037, 2047, 3940, 1997, 27621, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


Map: 100%|████████████████████████| 8299/8299 [00:00<00:00, 31421.08 examples/s]
Map: 100%|████████████████████████| 8299/8299 [00:00<00:00, 26846.98 examples/s]
PyTorch: setting up devices
  trainer = Trainer(


Sample tokenized input from validation: {'group': 'race', 'text': 'My friend is German and sometimes we hand out. He gets very angry sometimes.', 'label': 1, 'data_name': 'MGSD', '__index_level_0__': 36651, 'input_ids': [101, 2026, 2767, 2003, 2446, 1998, 2823, 2057, 2192, 2041, 1012, 2002, 4152, 2200, 4854, 2823, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 1}


The following columns in the Training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: group, __index_level_0__, data_name, text. If group, __index_level_0__, data_name, text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 33,194
  Num Epochs = 6
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 3,114
  Number of trainable parameters = 66,955,010
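
The step counts in the banner above follow from the example count, batch size, and epoch count. The sketch below reproduces that arithmetic. It assumes the last partial batch of each epoch is kept rather than dropped and that gradient accumulation is 1, as the log reports. The function name `total_optimization_steps` is illustrative and not part of any library.

```python
import math


def total_optimization_steps(num_examples: int, batch_size: int, num_epochs: int) -> int:
    """Optimizer steps for a run with gradient accumulation of 1.

    Assumes the final, smaller batch of each epoch still counts as one step.
    """
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# Values taken from the training banner: 33,194 examples, batch size 64, 6 epochs.
steps = total_optimization_steps(33_194, 64, 6)
print(steps)       # 3114
print(steps // 6)  # 519 steps per epoch
```

The per-epoch figure of 519 also matches the `checkpoint-519` directory name saved after the first evaluation in this run.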


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Balanced Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.4392 | 0.419942 | 0.768544 | 0.768677 | 0.76861 | 0.768677 |
| 2 | 0.3641 | 0.394085 | 0.79547 | 0.785017 | 0.789643 | 0.785017 |
| 3 | 0.2615 | 0.422996 | 0.792217 | 0.786708 | 0.789284 | 0.786708 |
| 4 | 0.2122 | 0.45562 | 0.801655 | 0.796885 | 0.799141 | 0.796885 |
| 5 | 0.1607 | 0.503743 | 0.802036 | 0.80128 | 0.801655 | 0.80128 |
| 6 | 0.1337 | 0.528433 | 0.802224 | 0.806487 | 0.804241 | 0.806487 |
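
In the per-epoch table above, validation loss and F1 disagree on which epoch is best: training loss falls throughout, validation loss is lowest early on, and F1 keeps rising. The sketch below parses the logged rows and reports both choices. Which checkpoint the run actually kept is not visible in this log. It would depend on settings such as `metric_for_best_model` in `TrainingArguments`, so the comparison is illustrative only.

```python
import csv
import io

# Rows copied from the DistilBERT / MGSD training log above.
LOG_TABLE = """Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Balanced Accuracy
1,0.4392,0.419942,0.768544,0.768677,0.76861,0.768677
2,0.3641,0.394085,0.79547,0.785017,0.789643,0.785017
3,0.2615,0.422996,0.792217,0.786708,0.789284,0.786708
4,0.2122,0.45562,0.801655,0.796885,0.799141,0.796885
5,0.1607,0.503743,0.802036,0.80128,0.801655,0.80128
6,0.1337,0.528433,0.802224,0.806487,0.804241,0.806487
"""

rows = list(csv.DictReader(io.StringIO(LOG_TABLE)))

# Epoch with the lowest validation loss versus the epoch with the highest F1.
best_by_loss = min(rows, key=lambda r: float(r["Validation Loss"]))
best_by_f1 = max(rows, key=lambda r: float(r["F1"]))

print("lowest validation loss: epoch", best_by_loss["Epoch"])  # epoch 2
print("highest F1: epoch", best_by_f1["Epoch"])                # epoch 6
```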


The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: group, __index_level_0__, data_name, text. If group, __index_level_0__, data_name, text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 8299
  Batch size = 64
Saving model checkpoint to model_output_distilbert/mgsd_trained/checkpoint-519
Configuration saved in model_output_distilbert/mgsd_trained/checkpoint-519/config.json
Model weights saved in model_output_distilbert/mgsd_trained/checkpoint-519/model.safetensors
tokenizer config file saved in model_output_distilbert/mgsd_trained/checkpoint-519/tokenizer_config.json
Special tokens file saved in model_output_distilbert/mgsd_trained/checkpoint-519/special_tokens_map.json
The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassifica

Saved fine-tuned model to: model_output_distilbert/mgsd_trained
Number of unique labels: 2


Map: 100%|██████████████████████████| 653/653 [00:00<00:00, 36817.36 examples/s]
Map: 100%|██████████████████████████| 653/653 [00:00<00:00, 36790.66 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'most people who are Gay are kites', 'label': 0, 'group': 'Gay', 'data_name': 'Winoqueer - GPT Augmentation', '__index_level_0__': 3236, 'input_ids': [101, 2087, 2111, 2040, 2024, 5637, 2024, 20497, 2015, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_distilbert/mgsd_trained/config.json
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "dtype": "float32",
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.57.3",
  "vocab_size": 30522
}

loading weights file model_output_distilbert/mgsd_trained/model.safetensors
loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading file chat_template.jinja


Saved full results to: result_output_distilbert/mgsd_trained/winoqueer_gpt_augmentation/full_results.csv
Saved classification report to: result_output_distilbert/mgsd_trained/winoqueer_gpt_augmentation/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████████| 414/414 [00:00<00:00, 33828.33 examples/s]
Map: 100%|██████████████████████████| 414/414 [00:00<00:00, 34445.69 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'The Somali refugees were labeled as unacquainted by some', 'label': 0, 'group': 'Somalis', 'data_name': 'SeeGULL - GPT Augmentation', '__index_level_0__': 840, 'input_ids': [101, 1996, 16831, 8711, 2020, 12599, 2004, 14477, 2278, 16211, 18447, 2098, 2011, 2070, 102, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_distilbert/mgsd_trained/config.json

loading weights file model_output_distilbert/mgsd_trained/model.safetensors
loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading file chat_template.jinja


Saved full results to: result_output_distilbert/mgsd_trained/seegull_gpt_augmentation/full_results.csv
Saved classification report to: result_output_distilbert/mgsd_trained/seegull_gpt_augmentation/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████| 10374/10374 [00:00<00:00, 31346.42 examples/s]
Map: 100%|██████████████████████| 10374/10374 [00:00<00:00, 39870.79 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'group': 'race', 'text': 'When i see an arab, i think they are a happy.', 'label': 0, 'data_name': 'MGSD', '__index_level_0__': 17851, 'input_ids': [101, 2043, 1045, 2156, 2019, 5424, 1010, 1045, 2228, 2027, 2024, 1037, 3407, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_distilbert/mgsd_trained/config.json

loading weights file model_output_distilbert/mgsd_trained/model.safetensors
loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading file chat_template.jinja


Saved full results to: result_output_distilbert/mgsd_trained/mgsd/full_results.csv
Saved classification report to: result_output_distilbert/mgsd_trained/mgsd/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████| 11441/11441 [00:00<00:00, 32204.35 examples/s]
Map: 100%|██████████████████████| 11441/11441 [00:00<00:00, 45824.66 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'The Somali refugees were labeled as unacquainted by some', 'label': 0, 'group': 'Somalis', 'data_name': 'SeeGULL - GPT Augmentation', 'input_ids': [101, 1996, 16831, 8711, 2020, 12599, 2004, 14477, 2278, 16211, 18447, 2098, 2011, 2070, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file config.json from cache at /cs/student/msc/aisd/2025/bekenova/.cache/huggingface/hub/models--distilbert--distilbert-base-uncased/snapshots/12040accade4e8a0f71eabdb258fecc2e7e948be/config.json

loading weights file model.safetensors from cache at /cs/student/msc/aisd/2025/bekenova/.cache/huggingface/hub/models--distilbert--distilbert-base-uncased/snapshots/12040accade4e8a0f71eabdb258fecc2e7e948be/model.safetensors


Saved full results to: result_output_distilbert/mgsd_trained/merged_winoqueer_seegull_gpt_augmentation/full_results.csv
Saved classification report to: result_output_distilbert/mgsd_trained/merged_winoqueer_seegull_gpt_augmentation/classification_report.csv
Number of unique labels: 2


Some weights of the model checkpoint at distilbert/distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bia

Sample tokenized input from train: {'text': 'we expect that nonbinary people are out to share their views.', 'label': 0, 'group': 'NB', 'data_name': 'Winoqueer - GPT Augmentation', '__index_level_0__': 1788, 'input_ids': [101, 2057, 5987, 2008, 2512, 21114, 2854, 2111, 2024, 2041, 2000, 3745, 2037, 5328, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


Map: 100%|██████████████████████████| 523/523 [00:00<00:00, 23435.62 examples/s]
Map: 100%|██████████████████████████| 523/523 [00:00<00:00, 33046.41 examples/s]
PyTorch: setting up devices
  trainer = Trainer(


Sample tokenized input from validation: {'text': 'all lesbians are studious.', 'label': 0, 'group': 'Lesbian', 'data_name': 'Winoqueer - GPT Augmentation', '__index_level_0__': 2716, 'input_ids': [101, 2035, 11690, 2015, 2024, 2996, 2271, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


The following columns in the Training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: group, __index_level_0__, data_name, text. If group, __index_level_0__, data_name, text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 2,088
  Num Epochs = 6
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 198
  Number of trainable parameters = 66,955,010
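
The trainable-parameter count in the banner can be rebuilt from the `DistilBertConfig` values printed in this log (`vocab_size`, `max_position_embeddings`, `dim`, `hidden_dim`, `n_layers`) together with a two-label classification head. The sketch below does this by hand. It assumes the usual DistilBERT layout: word and position embeddings followed by a LayerNorm, four attention projections and a two-layer feed-forward block per layer, each followed by a LayerNorm, and a `pre_classifier` plus `classifier` head. The function names are illustrative only.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def distilbert_classifier_params(
    vocab_size: int = 30522,
    max_positions: int = 512,
    dim: int = 768,
    hidden_dim: int = 3072,
    n_layers: int = 6,
    num_labels: int = 2,
) -> int:
    layer_norm = 2 * dim  # scale and bias vectors
    embeddings = vocab_size * dim + max_positions * dim + layer_norm
    attention = 4 * linear_params(dim, dim)  # query, key, value, output projections
    feed_forward = linear_params(dim, hidden_dim) + linear_params(hidden_dim, dim)
    layer = attention + feed_forward + 2 * layer_norm
    head = linear_params(dim, dim) + linear_params(dim, num_labels)  # pre_classifier + classifier
    return embeddings + n_layers * layer + head


print(f"{distilbert_classifier_params():,}")  # 66,955,010
```

Under these assumptions the total agrees with the 66,955,010 reported in the log.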


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Balanced Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 0.255498 | 0.898562 | 0.891883 | 0.895099 | 0.891883 |
| 2 | 0.405700 | 0.130537 | 0.953198 | 0.958998 | 0.956015 | 0.958998 |
| 3 | 0.405700 | 0.098093 | 0.958824 | 0.961791 | 0.960287 | 0.961791 |
| 4 | 0.106500 | 0.071885 | 0.970682 | 0.976706 | 0.973609 | 0.976706 |
| 5 | 0.047400 | 0.067363 | 0.970682 | 0.976706 | 0.973609 | 0.976706 |
| 6 | 0.047400 | 0.063389 | 0.970682 | 0.976706 | 0.973609 | 0.976706 |


The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: group, __index_level_0__, data_name, text. If group, __index_level_0__, data_name, text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 523
  Batch size = 64
Saving model checkpoint to model_output_distilbert/winoqueer_gpt_augmentation_trained/checkpoint-33
Configuration saved in model_output_distilbert/winoqueer_gpt_augmentation_trained/checkpoint-33/config.json
Model weights saved in model_output_distilbert/winoqueer_gpt_augmentation_trained/checkpoint-33/model.safetensors
tokenizer config file saved in model_output_distilbert/winoqueer_gpt_augmentation_trained/checkpoint-33/tokenizer_config.json
Special tokens file saved in model_output_distilbert/winoqueer_gpt_augmentation_trained/checkpoint-33/special_tokens_map.json
The followi

Saved fine-tuned model to: model_output_distilbert/winoqueer_gpt_augmentation_trained
Number of unique labels: 2


Map: 100%|██████████████████████████| 653/653 [00:00<00:00, 37391.37 examples/s]
Map: 100%|██████████████████████████| 653/653 [00:00<00:00, 37079.54 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'most people who are Gay are kites', 'label': 0, 'group': 'Gay', 'data_name': 'Winoqueer - GPT Augmentation', '__index_level_0__': 3236, 'input_ids': [101, 2087, 2111, 2040, 2024, 5637, 2024, 20497, 2015, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_distilbert/winoqueer_gpt_augmentation_trained/config.json
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "dtype": "float32",
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.57.3",
  "vocab_size": 30522
}

loading weights file model_output_distilbert/winoqueer_gpt_augmentation_trained/model.safetensors
loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading file chat_template.jinja


Saved full results to: result_output_distilbert/winoqueer_gpt_augmentation_trained/winoqueer_gpt_augmentation/full_results.csv
Saved classification report to: result_output_distilbert/winoqueer_gpt_augmentation_trained/winoqueer_gpt_augmentation/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████████| 414/414 [00:00<00:00, 34003.21 examples/s]
Map: 100%|██████████████████████████| 414/414 [00:00<00:00, 33467.13 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'The Somali refugees were labeled as unacquainted by some', 'label': 0, 'group': 'Somalis', 'data_name': 'SeeGULL - GPT Augmentation', '__index_level_0__': 840, 'input_ids': [101, 1996, 16831, 8711, 2020, 12599, 2004, 14477, 2278, 16211, 18447, 2098, 2011, 2070, 102, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_distilbert/winoqueer_gpt_augmentation_trained/config.json

loading weights file model_output_distilbert/winoqueer_gpt_augmentation_trained/model.safetensors
loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading file chat_template.jinja


Saved full results to: result_output_distilbert/winoqueer_gpt_augmentation_trained/seegull_gpt_augmentation/full_results.csv
Saved classification report to: result_output_distilbert/winoqueer_gpt_augmentation_trained/seegull_gpt_augmentation/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████| 10374/10374 [00:00<00:00, 31307.60 examples/s]
Map: 100%|██████████████████████| 10374/10374 [00:00<00:00, 40814.33 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'group': 'race', 'text': 'When i see an arab, i think they are a happy.', 'label': 0, 'data_name': 'MGSD', '__index_level_0__': 17851, 'input_ids': [101, 2043, 1045, 2156, 2019, 5424, 1010, 1045, 2228, 2027, 2024, 1037, 3407, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_distilbert/winoqueer_gpt_augmentation_trained/config.json

loading weights file model_output_distilbert/winoqueer_gpt_augmentation_trained/model.safetensors
loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading file chat_template.jinja


Saved full results to: result_output_distilbert/winoqueer_gpt_augmentation_trained/mgsd/full_results.csv
Saved classification report to: result_output_distilbert/winoqueer_gpt_augmentation_trained/mgsd/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████| 11441/11441 [00:00<00:00, 24977.79 examples/s]
Map: 100%|██████████████████████| 11441/11441 [00:00<00:00, 45280.82 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'The Somali refugees were labeled as unacquainted by some', 'label': 0, 'group': 'Somalis', 'data_name': 'SeeGULL - GPT Augmentation', 'input_ids': [101, 1996, 16831, 8711, 2020, 12599, 2004, 14477, 2278, 16211, 18447, 2098, 2011, 2070, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file config.json from cache at /cs/student/msc/aisd/2025/bekenova/.cache/huggingface/hub/models--distilbert--distilbert-base-uncased/snapshots/12040accade4e8a0f71eabdb258fecc2e7e948be/config.json

loading weights file model.safetensors from cache at /cs/student/msc/aisd/2025/bekenova/.cache/huggingface/hub/models--distilbert--distilbert-base-uncased/snapshots/12040accade4e8a0f71eabdb258fecc2e7e948be/model.safetensors
Some weights of the model checkpoint at distilber

Saved full results to: result_output_distilbert/winoqueer_gpt_augmentation_trained/merged_winoqueer_seegull_gpt_augmentation/full_results.csv
Saved classification report to: result_output_distilbert/winoqueer_gpt_augmentation_trained/merged_winoqueer_seegull_gpt_augmentation/classification_report.csv
Number of unique labels: 2


loading configuration file config.json from cache at /cs/student/msc/aisd/2025/bekenova/.cache/huggingface/hub/models--distilbert--distilbert-base-uncased/snapshots/12040accade4e8a0f71eabdb258fecc2e7e948be/config.json

loading file vocab.txt from cache at /cs/student/msc/aisd/2025/bekenova/.cache/huggingface/hub/models--distilbert--distilbert-base-uncased/snapshots/12040accade4e8a0f71eabdb258fecc2e7e948be/vocab.txt
loading file tokenizer.json from cache at /cs/student/msc/aisd/2025/beken

Sample tokenized input from train: {'text': 'The average Chadian income is modest compared to developed countries', 'label': 0, 'group': 'Chadian', 'data_name': 'SeeGULL - GPT Augmentation', '__index_level_0__': 1249, 'input_ids': [101, 1996, 2779, 9796, 2937, 3318, 2003, 10754, 4102, 2000, 2764, 3032, 102, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


Map: 100%|██████████████████████████| 332/332 [00:00<00:00, 35049.31 examples/s]
Map: 100%|██████████████████████████| 332/332 [00:00<00:00, 33797.12 examples/s]
PyTorch: setting up devices
  trainer = Trainer(


Sample tokenized input from validation: {'text': 'The Ethiopian woman was short and petite', 'label': 1, 'group': 'Ethiopian', 'data_name': 'SeeGULL - GPT Augmentation', '__index_level_0__': 609, 'input_ids': [101, 1996, 15101, 2450, 2001, 2460, 1998, 20146, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 1}


The following columns in the Training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: group, __index_level_0__, data_name, text. If group, __index_level_0__, data_name, text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1,324
  Num Epochs = 6
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 126
  Number of trainable parameters = 66,955,010


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Balanced Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 0.568993 | 0.332831 | 0.5 | 0.399638 | 0.5 |
| 2 | No log | 0.321311 | 0.847768 | 0.866822 | 0.855378 | 0.866822 |
| 3 | 0.484200 | 0.277298 | 0.864496 | 0.89607 | 0.87429 | 0.89607 |
| 4 | 0.484200 | 0.276468 | 0.865551 | 0.900554 | 0.875095 | 0.900554 |
| 5 | 0.142700 | 0.227934 | 0.902528 | 0.92322 | 0.911085 | 0.92322 |
| 6 | 0.142700 | 0.241385 | 0.89545 | 0.920937 | 0.905213 | 0.920937 |
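
The first-epoch row above (recall and balanced accuracy of exactly 0.5, precision near 0.33) is what macro-averaged metrics look like when a binary classifier predicts a single class for every validation example. The sketch below reproduces the numbers under two assumptions that the log does not state: the metrics are macro-averaged (Recall equals Balanced Accuracy in every row, which is consistent with this), and the 332 validation examples split 221/111 between the two classes. That split is inferred from the logged precision. Which label is the majority is not shown, so label 0 is used arbitrarily.

```python
def macro_metrics(y_true, y_pred, labels=(0, 1)):
    """Macro-averaged precision, recall and F1; undefined ratios count as 0."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0  # class never predicted
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical split: 221 majority-class and 111 minority-class examples.
y_true = [0] * 221 + [1] * 111
y_pred = [0] * 332  # the model predicts the majority class for everything

precision, recall, f1 = macro_metrics(y_true, y_pred)
print(round(precision, 6), round(recall, 6), round(f1, 6))  # 0.332831 0.5 0.399638
```

scikit-learn's default behaviour in this situation is to set the undefined precision to 0 and emit a warning, which would account for the `_warn_prf` fragment that appears in this run's log.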


The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: group, __index_level_0__, data_name, text. If group, __index_level_0__, data_name, text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 332
  Batch size = 64
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
Saving model checkpoint to model_output_distilbert/seegull_gpt_augmentation_trained/checkpoint-21
Configuration saved in model_output_distilbert/seegull_gpt_augmentation_trained/checkpoint-21/config.json
Model weights saved in model_output_distilbert/seegull_gpt_augmentation_trained/checkpoint-21/model.safetensors
tokenizer config file saved in model_output_distilbert/seegull_gpt_augmentation_trained/checkpoint-21/tokenizer_config.json
Special tokens file saved in model_output_distilbert/seegull_gpt_augment

Saved fine-tuned model to: model_output_distilbert/seegull_gpt_augmentation_trained
Number of unique labels: 2


Map: 100%|██████████████████████████| 653/653 [00:00<00:00, 36814.89 examples/s]
Map: 100%|██████████████████████████| 653/653 [00:00<00:00, 36601.37 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'most people who are Gay are kites', 'label': 0, 'group': 'Gay', 'data_name': 'Winoqueer - GPT Augmentation', '__index_level_0__': 3236, 'input_ids': [101, 2087, 2111, 2040, 2024, 5637, 2024, 20497, 2015, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}
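
The sample above shows 10 real token IDs followed by 17 padding IDs (`0`, matching `pad_token_id` in the model config), with an attention mask of 1 for real tokens and 0 for padding. The sketch below illustrates that relationship in plain Python. It is not the tokenizer's implementation: `pad_to_length` is a hypothetical helper, and the target length of 27 is simply read off the sample, since the log does not show how the original script chooses the padded length.

```python
def pad_to_length(input_ids, target_len, pad_id=0):
    """Right-pad a list of token IDs and build the matching attention mask."""
    if len(input_ids) > target_len:
        raise ValueError("sequence is longer than the target length")
    n_pad = target_len - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
    }


# Unpadded token IDs copied from the logged sample.
ids = [101, 2087, 2111, 2040, 2024, 5637, 2024, 20497, 2015, 102]
encoded = pad_to_length(ids, 27)
print(len(encoded["input_ids"]), sum(encoded["attention_mask"]))
```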


loading configuration file model_output_distilbert/seegull_gpt_augmentation_trained/config.json
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "dtype": "float32",
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.57.3",
  "vocab_size": 30522
}

loading weights file model_output_distilbert/seegull_gpt_augmentation_trained/model.safetensors
loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading file chat_template.jinja


Saved full results to: result_output_distilbert/seegull_gpt_augmentation_trained/winoqueer_gpt_augmentation/full_results.csv
Saved classification report to: result_output_distilbert/seegull_gpt_augmentation_trained/winoqueer_gpt_augmentation/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████████| 414/414 [00:00<00:00, 33272.18 examples/s]
Map: 100%|██████████████████████████| 414/414 [00:00<00:00, 33910.91 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'The Somali refugees were labeled as unacquainted by some', 'label': 0, 'group': 'Somalis', 'data_name': 'SeeGULL - GPT Augmentation', '__index_level_0__': 840, 'input_ids': [101, 1996, 16831, 8711, 2020, 12599, 2004, 14477, 2278, 16211, 18447, 2098, 2011, 2070, 102, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_distilbert/seegull_gpt_augmentation_trained/config.json

loading weights file model_output_distilbert/seegull_gpt_augmentation_trained/model.safetensors
loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading file chat_template.jinja


Saved full results to: result_output_distilbert/seegull_gpt_augmentation_trained/seegull_gpt_augmentation/full_results.csv
Saved classification report to: result_output_distilbert/seegull_gpt_augmentation_trained/seegull_gpt_augmentation/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████| 10374/10374 [00:00<00:00, 31216.77 examples/s]
Map: 100%|██████████████████████| 10374/10374 [00:00<00:00, 41061.68 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'group': 'race', 'text': 'When i see an arab, i think they are a happy.', 'label': 0, 'data_name': 'MGSD', '__index_level_0__': 17851, 'input_ids': [101, 2043, 1045, 2156, 2019, 5424, 1010, 1045, 2228, 2027, 2024, 1037, 3407, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_distilbert/seegull_gpt_augmentation_trained/config.json

loading weights file model_output_distilbert/seegull_gpt_augmentation_trained/model.safetensors
loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading file chat_template.jinja


Saved full results to: result_output_distilbert/seegull_gpt_augmentation_trained/mgsd/full_results.csv
Saved classification report to: result_output_distilbert/seegull_gpt_augmentation_trained/mgsd/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████| 11441/11441 [00:00<00:00, 24997.10 examples/s]
Map: 100%|██████████████████████| 11441/11441 [00:00<00:00, 44625.57 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'The Somali refugees were labeled as unacquainted by some', 'label': 0, 'group': 'Somalis', 'data_name': 'SeeGULL - GPT Augmentation', 'input_ids': [101, 1996, 16831, 8711, 2020, 12599, 2004, 14477, 2278, 16211, 18447, 2098, 2011, 2070, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file config.json from cache at /cs/student/msc/aisd/2025/bekenova/.cache/huggingface/hub/models--distilbert--distilbert-base-uncased/snapshots/12040accade4e8a0f71eabdb258fecc2e7e948be/config.json
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.57.3",
  "vocab_size": 30522
}

loading weights file model.safetensors from cache at /cs/student/msc/aisd/2025/bekenova/.cache/huggingface/hub/models--distilbert--distilbert-base-uncased/snapshots/12040accade4e8a0f71eabdb258fecc2e7e948be/model.safetensors
Some weights of the model checkpoint at distilber

Saved full results to: result_output_distilbert/seegull_gpt_augmentation_trained/merged_winoqueer_seegull_gpt_augmentation/full_results.csv
Saved classification report to: result_output_distilbert/seegull_gpt_augmentation_trained/merged_winoqueer_seegull_gpt_augmentation/classification_report.csv
Number of unique labels: 2


loading configuration file config.json from cache at /cs/student/msc/aisd/2025/bekenova/.cache/huggingface/hub/models--distilbert--distilbert-base-uncased/snapshots/12040accade4e8a0f71eabdb258fecc2e7e948be/config.json

loading file vocab.txt from cache at /cs/student/msc/aisd/2025/bekenova/.cache/huggingface/hub/models--distilbert--distilbert-base-uncased/snapshots/12040accade4e8a0f71eabdb258fecc2e7e948be/vocab.txt
loading file tokenizer.json from cache at /cs/student/msc/aisd/2025/beken

Sample tokenized input from train: {'text': 'The commander could tell the soldier was disciplined.', 'label': 1, 'group': 'profession', 'data_name': 'MGSD', '__index_level_0__': 7380, 'input_ids': [101, 1996, 3474, 2071, 2425, 1996, 5268, 2001, 28675, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 1}


Map: 100%|████████████████████████| 9152/9152 [00:00<00:00, 31388.93 examples/s]
Map: 100%|████████████████████████| 9152/9152 [00:00<00:00, 28018.19 examples/s]
PyTorch: setting up devices
  trainer = Trainer(


Sample tokenized input from validation: {'text': 'we expect that Queer people are artistic.', 'label': 0, 'group': 'Queer', 'data_name': 'Winoqueer - GPT Augmentation', '__index_level_0__': 3705, 'input_ids': [101, 2057, 5987, 2008, 19483, 2111, 2024, 6018, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


The following columns in the Training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: group, __index_level_0__, data_name, text. If group, __index_level_0__, data_name, text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 36,608
  Num Epochs = 6
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 3,432
  Number of trainable parameters = 66,955,010


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Balanced Accuracy
1,0.4016,0.402048,0.775252,0.789526,0.780651,0.789526
2,0.3401,0.37161,0.805155,0.811742,0.808178,0.811742
3,0.2466,0.392711,0.806723,0.818871,0.811804,0.818871
4,0.199,0.434471,0.811132,0.820265,0.815176,0.820265
5,0.1492,0.480236,0.812676,0.824637,0.817728,0.824637
6,0.1191,0.513283,0.814352,0.821738,0.817718,0.821738
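
In the table above, training loss falls at every epoch while validation loss rises after epoch 2, yet F1 continues to improve slightly until epoch 5. Which epoch counts as the best therefore depends on the selection criterion. The log does not show which criterion the original script uses, so the sketch below only compares two common choices, using the rows exactly as logged.

```python
import csv
import io

# Metric rows copied verbatim from the log above.
LOG_TABLE = """Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Balanced Accuracy
1,0.4016,0.402048,0.775252,0.789526,0.780651,0.789526
2,0.3401,0.37161,0.805155,0.811742,0.808178,0.811742
3,0.2466,0.392711,0.806723,0.818871,0.811804,0.818871
4,0.199,0.434471,0.811132,0.820265,0.815176,0.820265
5,0.1492,0.480236,0.812676,0.824637,0.817728,0.824637
6,0.1191,0.513283,0.814352,0.821738,0.817718,0.821738
"""

rows = list(csv.DictReader(io.StringIO(LOG_TABLE)))

# Criterion A: lowest validation loss.
best_by_loss = min(rows, key=lambda r: float(r["Validation Loss"]))
# Criterion B: highest F1.
best_by_f1 = max(rows, key=lambda r: float(r["F1"]))

print("best by validation loss: epoch", best_by_loss["Epoch"])
print("best by F1: epoch", best_by_f1["Epoch"])
```

The two criteria disagree here (epoch 2 versus epoch 5), so any comparison between models should state which one was used to pick the checkpoint.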


The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: group, __index_level_0__, data_name, text. If group, __index_level_0__, data_name, text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 9152
  Batch size = 64
Saving model checkpoint to model_output_distilbert/merged_winoqueer_seegull_gpt_augmentation_trained/checkpoint-572
Configuration saved in model_output_distilbert/merged_winoqueer_seegull_gpt_augmentation_trained/checkpoint-572/config.json
Model weights saved in model_output_distilbert/merged_winoqueer_seegull_gpt_augmentation_trained/checkpoint-572/model.safetensors
tokenizer config file saved in model_output_distilbert/merged_winoqueer_seegull_gpt_augmentation_trained/checkpoint-572/tokenizer_config.json
Special tokens file saved in model_output_distilbert/merged_winoqueer_se

Saved fine-tuned model to: model_output_distilbert/merged_winoqueer_seegull_gpt_augmentation_trained
Number of unique labels: 2


Map: 100%|██████████████████████████| 653/653 [00:00<00:00, 38347.86 examples/s]
Map: 100%|██████████████████████████| 653/653 [00:00<00:00, 35948.50 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'most people who are Gay are kites', 'label': 0, 'group': 'Gay', 'data_name': 'Winoqueer - GPT Augmentation', '__index_level_0__': 3236, 'input_ids': [101, 2087, 2111, 2040, 2024, 5637, 2024, 20497, 2015, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_distilbert/merged_winoqueer_seegull_gpt_augmentation_trained/config.json
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "dtype": "float32",
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.57.3",
  "vocab_size": 30522
}

loading weights file model_output_distilbert/merged_winoqueer_seegull_gpt_augmentation_trained/model.safetensors
loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading file chat_template

Saved full results to: result_output_distilbert/merged_winoqueer_seegull_gpt_augmentation_trained/winoqueer_gpt_augmentation/full_results.csv
Saved classification report to: result_output_distilbert/merged_winoqueer_seegull_gpt_augmentation_trained/winoqueer_gpt_augmentation/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████████| 414/414 [00:00<00:00, 34410.20 examples/s]
Map: 100%|██████████████████████████| 414/414 [00:00<00:00, 34705.93 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'The Somali refugees were labeled as unacquainted by some', 'label': 0, 'group': 'Somalis', 'data_name': 'SeeGULL - GPT Augmentation', '__index_level_0__': 840, 'input_ids': [101, 1996, 16831, 8711, 2020, 12599, 2004, 14477, 2278, 16211, 18447, 2098, 2011, 2070, 102, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_distilbert/merged_winoqueer_seegull_gpt_augmentation_trained/config.json

loading weights file model_output_distilbert/merged_winoqueer_seegull_gpt_augmentation_trained/model.safetensors
loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading file chat_template

Saved full results to: result_output_distilbert/merged_winoqueer_seegull_gpt_augmentation_trained/seegull_gpt_augmentation/full_results.csv
Saved classification report to: result_output_distilbert/merged_winoqueer_seegull_gpt_augmentation_trained/seegull_gpt_augmentation/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████| 10374/10374 [00:00<00:00, 31747.00 examples/s]
Map: 100%|██████████████████████| 10374/10374 [00:00<00:00, 39195.75 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'group': 'race', 'text': 'When i see an arab, i think they are a happy.', 'label': 0, 'data_name': 'MGSD', '__index_level_0__': 17851, 'input_ids': [101, 2043, 1045, 2156, 2019, 5424, 1010, 1045, 2228, 2027, 2024, 1037, 3407, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_distilbert/merged_winoqueer_seegull_gpt_augmentation_trained/config.json

loading weights file model_output_distilbert/merged_winoqueer_seegull_gpt_augmentation_trained/model.safetensors
loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading file chat_template

Saved full results to: result_output_distilbert/merged_winoqueer_seegull_gpt_augmentation_trained/mgsd/full_results.csv
Saved classification report to: result_output_distilbert/merged_winoqueer_seegull_gpt_augmentation_trained/mgsd/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████| 11441/11441 [00:00<00:00, 32604.98 examples/s]
Map: 100%|██████████████████████| 11441/11441 [00:00<00:00, 44820.72 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'The Somali refugees were labeled as unacquainted by some', 'label': 0, 'group': 'Somalis', 'data_name': 'SeeGULL - GPT Augmentation', 'input_ids': [101, 1996, 16831, 8711, 2020, 12599, 2004, 14477, 2278, 16211, 18447, 2098, 2011, 2070, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}
Saved full results to: result_output_distilbert/merged_winoqueer_seegull_gpt_augmentation_trained/merged_winoqueer_seegull_gpt_augmentation/full_results.csv
Saved classification report to: result_output_distilbert/merged_winoqueer_seegull_gpt_augmentation_trained/merged_winoqueer_seegull_gpt_augmentation/classification_report.csv
Number of unique labels: 2


loading configuration file config.json from cache at /cs/student/msc/aisd/2025/bekenova/.cache/huggingface/hub/models--google-bert--bert-base-uncased/snapshots/86b5e0934494bd15c9632b12f734a8a67f723594/config.json
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.57.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file model.safetensors from cache at /cs/student/msc/aisd/2025/bekenova/.cache/huggingface/hub/models--google-bert--bert-base-uncased/snapshots/86b5e0

Sample tokenized input from train: {'group': 'religion', 'text': 'Have you ever met someone named Mohammed? I need a new pair of pajamas.', 'label': 0, 'data_name': 'MGSD', '__index_level_0__': 36953, 'input_ids': [101, 2031, 2017, 2412, 2777, 2619, 2315, 12619, 1029, 1045, 2342, 1037, 2047, 3940, 1997, 27621, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


Map: 100%|████████████████████████| 8299/8299 [00:00<00:00, 28170.47 examples/s]
Map: 100%|████████████████████████| 8299/8299 [00:00<00:00, 25554.07 examples/s]
PyTorch: setting up devices
  trainer = Trainer(


Sample tokenized input from validation: {'group': 'race', 'text': 'My friend is German and sometimes we hand out. He gets very angry sometimes.', 'label': 1, 'data_name': 'MGSD', '__index_level_0__': 36651, 'input_ids': [101, 2026, 2767, 2003, 2446, 1998, 2823, 2057, 2192, 2041, 1012, 2002, 4152, 2200, 4854, 2823, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 1}


The following columns in the Training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: group, __index_level_0__, data_name, text. If group, __index_level_0__, data_name, text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 33,194
  Num Epochs = 6
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 3,114
  Number of trainable parameters = 109,483,778


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Balanced Accuracy
1,0.4065,0.38637,0.786611,0.78623,0.78642,0.78623
2,0.3317,0.364428,0.818632,0.803181,0.809778,0.803181
3,0.2079,0.415161,0.822286,0.811744,0.816476,0.811744
4,0.14,0.46241,0.814637,0.825875,0.819431,0.825875
5,0.09,0.546776,0.818686,0.827222,0.822507,0.827222
6,0.0719,0.591007,0.814922,0.823312,0.818678,0.823312


The following columns in the Evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: group, __index_level_0__, data_name, text. If group, __index_level_0__, data_name, text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 8299
  Batch size = 64
Saving model checkpoint to model_output_bert/mgsd_trained/checkpoint-519
Configuration saved in model_output_bert/mgsd_trained/checkpoint-519/config.json
Model weights saved in model_output_bert/mgsd_trained/checkpoint-519/model.safetensors
tokenizer config file saved in model_output_bert/mgsd_trained/checkpoint-519/tokenizer_config.json
Special tokens file saved in model_output_bert/mgsd_trained/checkpoint-519/special_tokens_map.json
The following columns in the Evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: group, __in

Saved fine-tuned model to: model_output_bert/mgsd_trained
Number of unique labels: 2


Map: 100%|██████████████████████████| 653/653 [00:00<00:00, 33534.30 examples/s]
Map: 100%|██████████████████████████| 653/653 [00:00<00:00, 32467.73 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'most people who are Gay are kites', 'label': 0, 'group': 'Gay', 'data_name': 'Winoqueer - GPT Augmentation', '__index_level_0__': 3236, 'input_ids': [101, 2087, 2111, 2040, 2024, 5637, 2024, 20497, 2015, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_bert/mgsd_trained/config.json
Model config BertConfig {
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "dtype": "float32",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "transformers_version": "4.57.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file model_output_bert/mgsd_trained/model.safetensors
loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json


Saved full results to: result_output_bert/mgsd_trained/winoqueer_gpt_augmentation/full_results.csv
Saved classification report to: result_output_bert/mgsd_trained/winoqueer_gpt_augmentation/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████████| 414/414 [00:00<00:00, 30932.22 examples/s]
Map: 100%|██████████████████████████| 414/414 [00:00<00:00, 30669.97 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'The Somali refugees were labeled as unacquainted by some', 'label': 0, 'group': 'Somalis', 'data_name': 'SeeGULL - GPT Augmentation', '__index_level_0__': 840, 'input_ids': [101, 1996, 16831, 8711, 2020, 12599, 2004, 14477, 2278, 16211, 18447, 2098, 2011, 2070, 102, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_bert/mgsd_trained/config.json

loading weights file model_output_bert/mgsd_trained/model.safetensors
loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json


Saved full results to: result_output_bert/mgsd_trained/seegull_gpt_augmentation/full_results.csv
Saved classification report to: result_output_bert/mgsd_trained/seegull_gpt_augmentation/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████| 10374/10374 [00:00<00:00, 27824.77 examples/s]
Map: 100%|██████████████████████| 10374/10374 [00:00<00:00, 27138.07 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'group': 'race', 'text': 'When i see an arab, i think they are a happy.', 'label': 0, 'data_name': 'MGSD', '__index_level_0__': 17851, 'input_ids': [101, 2043, 1045, 2156, 2019, 5424, 1010, 1045, 2228, 2027, 2024, 1037, 3407, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_bert/mgsd_trained/config.json

loading weights file model_output_bert/mgsd_trained/model.safetensors
loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json


Saved full results to: result_output_bert/mgsd_trained/mgsd/full_results.csv
Saved classification report to: result_output_bert/mgsd_trained/mgsd/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████| 11441/11441 [00:00<00:00, 29222.57 examples/s]
Map: 100%|██████████████████████| 11441/11441 [00:00<00:00, 29499.71 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'The Somali refugees were labeled as unacquainted by some', 'label': 0, 'group': 'Somalis', 'data_name': 'SeeGULL - GPT Augmentation', 'input_ids': [101, 1996, 16831, 8711, 2020, 12599, 2004, 14477, 2278, 16211, 18447, 2098, 2011, 2070, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}
Saved full results to: result_output_bert/mgsd_trained/merged_winoqueer_seegull_gpt_augmentation/full_results.csv
Saved classification report to: result_output_bert/mgsd_trained/merged_winoqueer_seegull_gpt_augmentation/classification_report.csv
Number of unique labels: 2


loading configuration file config.json from cache at /cs/student/msc/aisd/2025/bekenova/.cache/huggingface/hub/models--google-bert--bert-base-uncased/snapshots/86b5e0934494bd15c9632b12f734a8a67f723594/config.json

loading weights file model.safetensors from cache at /cs/student/msc/aisd/2025/bekenova/.cache/huggingface/hub/models--google-bert--bert-base-uncased/snapshots/86b5e0

Sample tokenized input from train: {'text': 'we expect that nonbinary people are out to share their views.', 'label': 0, 'group': 'NB', 'data_name': 'Winoqueer - GPT Augmentation', '__index_level_0__': 1788, 'input_ids': [101, 2057, 5987, 2008, 2512, 21114, 2854, 2111, 2024, 2041, 2000, 3745, 2037, 5328, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


Map: 100%|██████████████████████████| 523/523 [00:00<00:00, 35062.03 examples/s]
Map: 100%|██████████████████████████| 523/523 [00:00<00:00, 32251.54 examples/s]
PyTorch: setting up devices


Sample tokenized input from validation: {'text': 'all lesbians are studious.', 'label': 0, 'group': 'Lesbian', 'data_name': 'Winoqueer - GPT Augmentation', '__index_level_0__': 2716, 'input_ids': [101, 2035, 11690, 2015, 2024, 2996, 2271, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


The following columns in the Training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: group, __index_level_0__, data_name, text. If group, __index_level_0__, data_name, text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 2,088
  Num Epochs = 6
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 198
  Number of trainable parameters = 109,483,778


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Balanced Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 0.253598 | 0.91786 | 0.907957 | 0.912658 | 0.907957 |
| 2 | 0.398300 | 0.096847 | 0.967516 | 0.965981 | 0.966743 | 0.965981 |
| 3 | 0.398300 | 0.050443 | 0.982302 | 0.986956 | 0.984581 | 0.986956 |
| 4 | 0.081300 | 0.033483 | 0.985294 | 0.993017 | 0.989021 | 0.993017 |
| 5 | 0.027200 | 0.03884 | 0.985294 | 0.993017 | 0.989021 | 0.993017 |
| 6 | 0.027200 | 0.037719 | 0.985294 | 0.993017 | 0.989021 | 0.993017 |

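The step count reported in the training header for this run can be checked by hand. The following is a minimal sketch, assuming the final partial batch of each epoch is kept; the helper name `total_optimization_steps` is illustrative only and is not part of the `transformers` API.

```python
import math


def total_optimization_steps(num_examples, batch_size, num_epochs, grad_accum=1):
    """Estimate optimizer steps, assuming the last partial batch is not dropped."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * num_epochs


# Figures taken from the training header: 2,088 examples, batch size 64, 6 epochs.
steps = total_optimization_steps(2088, 64, 6)
print(steps)       # 198
print(steps // 6)  # 33 steps per epoch
```

The per-epoch figure of 33 agrees with the `checkpoint-33` directory written after the first evaluation.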

The following columns in the Evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: group, __index_level_0__, data_name, text. If group, __index_level_0__, data_name, text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 523
  Batch size = 64
Saving model checkpoint to model_output_bert/winoqueer_gpt_augmentation_trained/checkpoint-33
Configuration saved in model_output_bert/winoqueer_gpt_augmentation_trained/checkpoint-33/config.json
Model weights saved in model_output_bert/winoqueer_gpt_augmentation_trained/checkpoint-33/model.safetensors
tokenizer config file saved in model_output_bert/winoqueer_gpt_augmentation_trained/checkpoint-33/tokenizer_config.json
Special tokens file saved in model_output_bert/winoqueer_gpt_augmentation_trained/checkpoint-33/special_tokens_map.json
The following columns in the Evaluation set don't hav

Saved fine-tuned model to: model_output_bert/winoqueer_gpt_augmentation_trained
Number of unique labels: 2


Map: 100%|██████████████████████████| 653/653 [00:00<00:00, 36305.42 examples/s]
Map: 100%|██████████████████████████| 653/653 [00:00<00:00, 33955.45 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'most people who are Gay are kites', 'label': 0, 'group': 'Gay', 'data_name': 'Winoqueer - GPT Augmentation', '__index_level_0__': 3236, 'input_ids': [101, 2087, 2111, 2040, 2024, 5637, 2024, 20497, 2015, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_bert/winoqueer_gpt_augmentation_trained/config.json
Model config BertConfig {
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "dtype": "float32",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "transformers_version": "4.57.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file model_output_bert/winoqueer_gpt_augmentation_trained/model.safetensors
loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_

Saved full results to: result_output_bert/winoqueer_gpt_augmentation_trained/winoqueer_gpt_augmentation/full_results.csv
Saved classification report to: result_output_bert/winoqueer_gpt_augmentation_trained/winoqueer_gpt_augmentation/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████████| 414/414 [00:00<00:00, 31383.94 examples/s]
Map: 100%|██████████████████████████| 414/414 [00:00<00:00, 31320.54 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'The Somali refugees were labeled as unacquainted by some', 'label': 0, 'group': 'Somalis', 'data_name': 'SeeGULL - GPT Augmentation', '__index_level_0__': 840, 'input_ids': [101, 1996, 16831, 8711, 2020, 12599, 2004, 14477, 2278, 16211, 18447, 2098, 2011, 2070, 102, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_bert/winoqueer_gpt_augmentation_trained/config.json
Model config BertConfig {
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "dtype": "float32",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "transformers_version": "4.57.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file model_output_bert/winoqueer_gpt_augmentation_trained/model.safetensors
loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_

Saved full results to: result_output_bert/winoqueer_gpt_augmentation_trained/seegull_gpt_augmentation/full_results.csv
Saved classification report to: result_output_bert/winoqueer_gpt_augmentation_trained/seegull_gpt_augmentation/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████| 10374/10374 [00:00<00:00, 28386.71 examples/s]
Map: 100%|██████████████████████| 10374/10374 [00:00<00:00, 27367.28 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'group': 'race', 'text': 'When i see an arab, i think they are a happy.', 'label': 0, 'data_name': 'MGSD', '__index_level_0__': 17851, 'input_ids': [101, 2043, 1045, 2156, 2019, 5424, 1010, 1045, 2228, 2027, 2024, 1037, 3407, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_bert/winoqueer_gpt_augmentation_trained/config.json
Model config BertConfig {
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "dtype": "float32",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "transformers_version": "4.57.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file model_output_bert/winoqueer_gpt_augmentation_trained/model.safetensors
loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_

Saved full results to: result_output_bert/winoqueer_gpt_augmentation_trained/mgsd/full_results.csv
Saved classification report to: result_output_bert/winoqueer_gpt_augmentation_trained/mgsd/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████| 11441/11441 [00:00<00:00, 29391.16 examples/s]
Map: 100%|██████████████████████| 11441/11441 [00:00<00:00, 29863.43 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'The Somali refugees were labeled as unacquainted by some', 'label': 0, 'group': 'Somalis', 'data_name': 'SeeGULL - GPT Augmentation', 'input_ids': [101, 1996, 16831, 8711, 2020, 12599, 2004, 14477, 2278, 16211, 18447, 2098, 2011, 2070, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file config.json from cache at /cs/student/msc/aisd/2025/bekenova/.cache/huggingface/hub/models--google-bert--bert-base-uncased/snapshots/86b5e0934494bd15c9632b12f734a8a67f723594/config.json
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.57.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file model.safetensors from cache at /cs/student/msc/aisd/2025/bekenova/.cache/huggingface/hub/models--google-bert--bert-base-uncased/snapshots/86b5e0

Saved full results to: result_output_bert/winoqueer_gpt_augmentation_trained/merged_winoqueer_seegull_gpt_augmentation/full_results.csv
Saved classification report to: result_output_bert/winoqueer_gpt_augmentation_trained/merged_winoqueer_seegull_gpt_augmentation/classification_report.csv
Number of unique labels: 2


loading configuration file config.json from cache at /cs/student/msc/aisd/2025/bekenova/.cache/huggingface/hub/models--google-bert--bert-base-uncased/snapshots/86b5e0934494bd15c9632b12f734a8a67f723594/config.json
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.57.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading file vocab.txt from cache at /cs/student/msc/aisd/2025/bekenova/.cache/huggingface/hub/models--google-bert--bert-base-uncased/snapshots/86b5e0934494bd15c9632b

Sample tokenized input from train: {'text': 'The average Chadian income is modest compared to developed countries', 'label': 0, 'group': 'Chadian', 'data_name': 'SeeGULL - GPT Augmentation', '__index_level_0__': 1249, 'input_ids': [101, 1996, 2779, 9796, 2937, 3318, 2003, 10754, 4102, 2000, 2764, 3032, 102, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


Map: 100%|██████████████████████████| 332/332 [00:00<00:00, 32066.99 examples/s]
Map: 100%|██████████████████████████| 332/332 [00:00<00:00, 27773.52 examples/s]
PyTorch: setting up devices


Sample tokenized input from validation: {'text': 'The Ethiopian woman was short and petite', 'label': 1, 'group': 'Ethiopian', 'data_name': 'SeeGULL - GPT Augmentation', '__index_level_0__': 609, 'input_ids': [101, 1996, 15101, 2450, 2001, 2460, 1998, 20146, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 1}


The following columns in the Training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: group, __index_level_0__, data_name, text. If group, __index_level_0__, data_name, text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1,324
  Num Epochs = 6
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 126
  Number of trainable parameters = 109,483,778


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Balanced Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 0.627826 | 0.332831 | 0.5 | 0.399638 | 0.5 |
| 2 | No log | 0.505917 | 0.810927 | 0.614834 | 0.608315 | 0.614834 |
| 3 | 0.606300 | 0.355087 | 0.84346 | 0.880213 | 0.850821 | 0.880213 |
| 4 | 0.606300 | 0.261053 | 0.87664 | 0.907362 | 0.886805 | 0.907362 |
| 5 | 0.240100 | 0.23396 | 0.889153 | 0.920896 | 0.899742 | 0.920896 |
| 6 | 0.240100 | 0.221348 | 0.898477 | 0.929925 | 0.90929 | 0.929925 |

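The epoch 1 row of the SeeGULL run (precision 0.332831, recall 0.5, F1 0.399638) is what macro-averaged metrics look like when a binary classifier predicts a single class for every validation example. The sketch below reproduces those numbers in plain Python. The 221/111 class split is a hypothetical choice that happens to be consistent with the reported values across 332 validation examples; the real class counts, and which label is the majority, are not shown in the log.

```python
def macro_metrics(y_true, y_pred, labels=(0, 1)):
    """Macro-averaged precision, recall and F1; undefined ratios count as 0."""
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical split: 221 examples of one class, 111 of the other (332 in total).
y_true = [0] * 221 + [1] * 111
y_pred = [0] * 332  # degenerate model: always predicts the majority class

p, r, f = macro_metrics(y_true, y_pred)
print(round(p, 6), round(r, 6), round(f, 6))  # 0.332831 0.5 0.399638
```

Macro recall over the classes is the same quantity as balanced accuracy, which is why the Recall and Balanced Accuracy columns agree in every row.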

The following columns in the Evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: group, __index_level_0__, data_name, text. If group, __index_level_0__, data_name, text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 332
  Batch size = 64
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
Saving model checkpoint to model_output_bert/seegull_gpt_augmentation_trained/checkpoint-21
Configuration saved in model_output_bert/seegull_gpt_augmentation_trained/checkpoint-21/config.json
Model weights saved in model_output_bert/seegull_gpt_augmentation_trained/checkpoint-21/model.safetensors
tokenizer config file saved in model_output_bert/seegull_gpt_augmentation_trained/checkpoint-21/tokenizer_config.json
Special tokens file saved in model_output_bert/seegull_gpt_augmentation_trained/checkpoint-21/special_tokens

Saved fine-tuned model to: model_output_bert/seegull_gpt_augmentation_trained
Number of unique labels: 2


Map: 100%|██████████████████████████| 653/653 [00:00<00:00, 34310.20 examples/s]
Map: 100%|██████████████████████████| 653/653 [00:00<00:00, 33772.90 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'most people who are Gay are kites', 'label': 0, 'group': 'Gay', 'data_name': 'Winoqueer - GPT Augmentation', '__index_level_0__': 3236, 'input_ids': [101, 2087, 2111, 2040, 2024, 5637, 2024, 20497, 2015, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_bert/seegull_gpt_augmentation_trained/config.json
Model config BertConfig {
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "dtype": "float32",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "transformers_version": "4.57.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file model_output_bert/seegull_gpt_augmentation_trained/model.safetensors
loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.

Saved full results to: result_output_bert/seegull_gpt_augmentation_trained/winoqueer_gpt_augmentation/full_results.csv
Saved classification report to: result_output_bert/seegull_gpt_augmentation_trained/winoqueer_gpt_augmentation/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████████| 414/414 [00:00<00:00, 31931.63 examples/s]
Map: 100%|██████████████████████████| 414/414 [00:00<00:00, 30476.19 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'The Somali refugees were labeled as unacquainted by some', 'label': 0, 'group': 'Somalis', 'data_name': 'SeeGULL - GPT Augmentation', '__index_level_0__': 840, 'input_ids': [101, 1996, 16831, 8711, 2020, 12599, 2004, 14477, 2278, 16211, 18447, 2098, 2011, 2070, 102, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_bert/seegull_gpt_augmentation_trained/config.json
Model config BertConfig {
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "dtype": "float32",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "transformers_version": "4.57.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file model_output_bert/seegull_gpt_augmentation_trained/model.safetensors
loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.

Saved full results to: result_output_bert/seegull_gpt_augmentation_trained/seegull_gpt_augmentation/full_results.csv
Saved classification report to: result_output_bert/seegull_gpt_augmentation_trained/seegull_gpt_augmentation/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████| 10374/10374 [00:00<00:00, 28378.76 examples/s]
Map: 100%|██████████████████████| 10374/10374 [00:00<00:00, 27341.39 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'group': 'race', 'text': 'When i see an arab, i think they are a happy.', 'label': 0, 'data_name': 'MGSD', '__index_level_0__': 17851, 'input_ids': [101, 2043, 1045, 2156, 2019, 5424, 1010, 1045, 2228, 2027, 2024, 1037, 3407, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_bert/seegull_gpt_augmentation_trained/config.json
Model config BertConfig {
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "dtype": "float32",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "transformers_version": "4.57.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file model_output_bert/seegull_gpt_augmentation_trained/model.safetensors
loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.

Saved full results to: result_output_bert/seegull_gpt_augmentation_trained/mgsd/full_results.csv
Saved classification report to: result_output_bert/seegull_gpt_augmentation_trained/mgsd/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████| 11441/11441 [00:00<00:00, 29226.24 examples/s]
Map: 100%|██████████████████████| 11441/11441 [00:00<00:00, 30288.71 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'The Somali refugees were labeled as unacquainted by some', 'label': 0, 'group': 'Somalis', 'data_name': 'SeeGULL - GPT Augmentation', 'input_ids': [101, 1996, 16831, 8711, 2020, 12599, 2004, 14477, 2278, 16211, 18447, 2098, 2011, 2070, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file config.json from cache at /cs/student/msc/aisd/2025/bekenova/.cache/huggingface/hub/models--google-bert--bert-base-uncased/snapshots/86b5e0934494bd15c9632b12f734a8a67f723594/config.json
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.57.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file model.safetensors from cache at /cs/student/msc/aisd/2025/bekenova/.cache/huggingface/hub/models--google-bert--bert-base-uncased/snapshots/86b5e0

Saved full results to: result_output_bert/seegull_gpt_augmentation_trained/merged_winoqueer_seegull_gpt_augmentation/full_results.csv
Saved classification report to: result_output_bert/seegull_gpt_augmentation_trained/merged_winoqueer_seegull_gpt_augmentation/classification_report.csv
Number of unique labels: 2


loading configuration file config.json from cache at /cs/student/msc/aisd/2025/bekenova/.cache/huggingface/hub/models--google-bert--bert-base-uncased/snapshots/86b5e0934494bd15c9632b12f734a8a67f723594/config.json
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.57.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading file vocab.txt from cache at /cs/student/msc/aisd/2025/bekenova/.cache/huggingface/hub/models--google-bert--bert-base-uncased/snapshots/86b5e0934494bd15c9632b

Sample tokenized input from train: {'text': 'The commander could tell the soldier was disciplined.', 'label': 1, 'group': 'profession', 'data_name': 'MGSD', '__index_level_0__': 7380, 'input_ids': [101, 1996, 3474, 2071, 2425, 1996, 5268, 2001, 28675, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 1}


Map: 100%|████████████████████████| 9152/9152 [00:00<00:00, 21037.02 examples/s]
Map: 100%|████████████████████████| 9152/9152 [00:00<00:00, 37789.98 examples/s]
PyTorch: setting up devices


Sample tokenized input from validation: {'text': 'we expect that Queer people are artistic.', 'label': 0, 'group': 'Queer', 'data_name': 'Winoqueer - GPT Augmentation', '__index_level_0__': 3705, 'input_ids': [101, 2057, 5987, 2008, 19483, 2111, 2024, 6018, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


The following columns in the Training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: group, __index_level_0__, data_name, text. If group, __index_level_0__, data_name, text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 36,608
  Num Epochs = 6
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 3,432
  Number of trainable parameters = 109,483,778


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Balanced Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.3791 | 0.378198 | 0.792885 | 0.812937 | 0.799424 | 0.812937 |
| 2 | 0.3018 | 0.346956 | 0.822737 | 0.828 | 0.825214 | 0.828 |
| 3 | 0.2141 | 0.376471 | 0.822978 | 0.834852 | 0.828051 | 0.834852 |
| 4 | 0.1332 | 0.468415 | 0.827566 | 0.825896 | 0.826717 | 0.825896 |
| 5 | 0.0889 | 0.542588 | 0.824201 | 0.837982 | 0.8299 | 0.837982 |
| 6 | 0.0613 | 0.590092 | 0.826006 | 0.835319 | 0.830158 | 0.835319 |

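In the merged-dataset run, training loss falls every epoch while validation loss reaches its minimum early and then rises, so the epoch with the lowest validation loss and the epoch with the highest F1 need not coincide. The sketch below reads the per-epoch values copied from the log above and reports both. How the original training script selects its final checkpoint is not shown in this log, so this is only one way of reading the numbers.

```python
import csv
import io

# Per-epoch metrics copied from the merged-dataset training log.
LOG = """Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Balanced Accuracy
1,0.3791,0.378198,0.792885,0.812937,0.799424,0.812937
2,0.3018,0.346956,0.822737,0.828,0.825214,0.828
3,0.2141,0.376471,0.822978,0.834852,0.828051,0.834852
4,0.1332,0.468415,0.827566,0.825896,0.826717,0.825896
5,0.0889,0.542588,0.824201,0.837982,0.8299,0.837982
6,0.0613,0.590092,0.826006,0.835319,0.830158,0.835319
"""

rows = list(csv.DictReader(io.StringIO(LOG)))

# Epoch with the lowest validation loss and epoch with the highest macro F1.
best_by_loss = min(rows, key=lambda row: float(row["Validation Loss"]))
best_by_f1 = max(rows, key=lambda row: float(row["F1"]))

print(best_by_loss["Epoch"], best_by_f1["Epoch"])  # 2 6
```

The F1 gain between epoch 2 and epoch 6 is under 0.005, while validation loss rises from 0.346956 to 0.590092 over the same span.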

The following columns in the Evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: group, __index_level_0__, data_name, text. If group, __index_level_0__, data_name, text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 9152
  Batch size = 64
Saving model checkpoint to model_output_bert/merged_winoqueer_seegull_gpt_augmentation_trained/checkpoint-572
Configuration saved in model_output_bert/merged_winoqueer_seegull_gpt_augmentation_trained/checkpoint-572/config.json
Model weights saved in model_output_bert/merged_winoqueer_seegull_gpt_augmentation_trained/checkpoint-572/model.safetensors
tokenizer config file saved in model_output_bert/merged_winoqueer_seegull_gpt_augmentation_trained/checkpoint-572/tokenizer_config.json
Special tokens file saved in model_output_bert/merged_winoqueer_seegull_gpt_augmentation_trained/checkpoint-

Saved fine-tuned model to: model_output_bert/merged_winoqueer_seegull_gpt_augmentation_trained
Number of unique labels: 2


Map: 100%|██████████████████████████| 653/653 [00:00<00:00, 35283.48 examples/s]
Map: 100%|██████████████████████████| 653/653 [00:00<00:00, 34169.38 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'most people who are Gay are kites', 'label': 0, 'group': 'Gay', 'data_name': 'Winoqueer - GPT Augmentation', '__index_level_0__': 3236, 'input_ids': [101, 2087, 2111, 2040, 2024, 5637, 2024, 20497, 2015, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_bert/merged_winoqueer_seegull_gpt_augmentation_trained/config.json
Model config BertConfig {
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "dtype": "float32",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "transformers_version": "4.57.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file model_output_bert/merged_winoqueer_seegull_gpt_augmentation_trained/model.safetensors
loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.jso

Saved full results to: result_output_bert/merged_winoqueer_seegull_gpt_augmentation_trained/winoqueer_gpt_augmentation/full_results.csv
Saved classification report to: result_output_bert/merged_winoqueer_seegull_gpt_augmentation_trained/winoqueer_gpt_augmentation/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████████| 414/414 [00:00<00:00, 31292.88 examples/s]
Map: 100%|██████████████████████████| 414/414 [00:00<00:00, 31673.60 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'The Somali refugees were labeled as unacquainted by some', 'label': 0, 'group': 'Somalis', 'data_name': 'SeeGULL - GPT Augmentation', '__index_level_0__': 840, 'input_ids': [101, 1996, 16831, 8711, 2020, 12599, 2004, 14477, 2278, 16211, 18447, 2098, 2011, 2070, 102, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_bert/merged_winoqueer_seegull_gpt_augmentation_trained/config.json
Model config BertConfig {
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "dtype": "float32",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "transformers_version": "4.57.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file model_output_bert/merged_winoqueer_seegull_gpt_augmentation_trained/model.safetensors
loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.jso

Saved full results to: result_output_bert/merged_winoqueer_seegull_gpt_augmentation_trained/seegull_gpt_augmentation/full_results.csv
Saved classification report to: result_output_bert/merged_winoqueer_seegull_gpt_augmentation_trained/seegull_gpt_augmentation/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████| 10374/10374 [00:00<00:00, 28434.64 examples/s]
Map: 100%|██████████████████████| 10374/10374 [00:00<00:00, 27442.92 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'group': 'race', 'text': 'When i see an arab, i think they are a happy.', 'label': 0, 'data_name': 'MGSD', '__index_level_0__': 17851, 'input_ids': [101, 2043, 1045, 2156, 2019, 5424, 1010, 1045, 2228, 2027, 2024, 1037, 3407, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}


loading configuration file model_output_bert/merged_winoqueer_seegull_gpt_augmentation_trained/config.json
Model config BertConfig {
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "dtype": "float32",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "transformers_version": "4.57.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file model_output_bert/merged_winoqueer_seegull_gpt_augmentation_trained/model.safetensors
loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.jso

Saved full results to: result_output_bert/merged_winoqueer_seegull_gpt_augmentation_trained/mgsd/full_results.csv
Saved classification report to: result_output_bert/merged_winoqueer_seegull_gpt_augmentation_trained/mgsd/classification_report.csv
Number of unique labels: 2


Map: 100%|██████████████████████| 11441/11441 [00:00<00:00, 29407.80 examples/s]
Map: 100%|██████████████████████| 11441/11441 [00:00<00:00, 30577.80 examples/s]
Device set to use cuda:0


Sample tokenized input from test: {'text': 'The Somali refugees were labeled as unacquainted by some', 'label': 0, 'group': 'Somalis', 'data_name': 'SeeGULL - GPT Augmentation', 'input_ids': [101, 1996, 16831, 8711, 2020, 12599, 2004, 14477, 2278, 16211, 18447, 2098, 2011, 2070, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': 0}
Saved full results to: result_output_bert/merged_winoqueer_seegull_gpt_augmentation_trained/merged_winoqueer_seegull_gpt_augmentation/full_results.csv
Saved classification report to: result_output_bert/merged_winoqueer_seegull_gpt_augmentation_trained/merged_winoqueer_seegull_gpt_augmentation/classification_report.csv


### $\color{pink}{Question\ 3.1:}$ Results 

The expected model checkpoints, tokenizer files, and configuration artefacts were successfully generated:

- /model_output_albertv2/mgsd_trained/

- /model_output_albertv2/winoqueer_gpt_augmentation_trained/

- /model_output_albertv2/seegull_gpt_augmentation_trained/

- /model_output_albertv2/merged_winoqueer_seegull_gpt_augmentation_trained/

- /result_output_albertv2

- /result_output_bert

- /result_output_distilbert

In [None]:
# Check the outputs directory 
output_dir = "/tmp/HEARTS-Text-Stereotype-Detection/model_output_albertv2"
print(os.listdir(output_dir))

['mgsd_baseline', 'mgsd_trained', 'winoqueer_gpt_augmentation_trained', 'seegull_gpt_augmentation_trained', 'merged_winoqueer_seegull_gpt_augmentation_trained']


In [None]:
# Check the output directory of the MGSD-trained model
mgsd_path = "/tmp/HEARTS-Text-Stereotype-Detection/model_output_albertv2/mgsd_trained"
print(os.listdir(mgsd_path))

['checkpoint-1038', 'config.json', 'model.safetensors', 'tokenizer_config.json', 'special_tokens_map.json', 'tokenizer.json', 'training_args.bin']
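
The same check can be automated so that a missing file is reported rather than spotted by eye. This is a minimal sketch, assuming every trained model directory should contain the files listed in the output above; the helper name `missing_artefacts` and the constant `EXPECTED_FILES` are hypothetical and not part of the HEARTS repository.

```python
from pathlib import Path
import tempfile

# File names taken from the directory listing above (assumed to be the full expected set)
EXPECTED_FILES = {
    "config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "tokenizer.json",
    "training_args.bin",
}

def missing_artefacts(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from model_dir, sorted."""
    present = {p.name for p in Path(model_dir).iterdir()}
    return sorted(set(expected) - present)

# Demonstration on a throwaway directory that holds only two of the files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("config.json", "model.safetensors"):
        (Path(tmp) / name).touch()
    missing = missing_artefacts(tmp)
    print(missing)
```

Calling `missing_artefacts(mgsd_path)` on a complete model directory would return an empty list.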


In [None]:
# Define the project root
PROJECT_ROOT = Path("/tmp/HEARTS-Text-Stereotype-Detection")

# Build the summary table
summary_df = collect_summary(PROJECT_ROOT).sort_values(["model", "train_on", "eval_on"])
display(summary_df)

| | model | train_on | eval_on | accuracy | macro_precision | macro_recall | macro_f1 |
|---|---|---|---|---|---|---|---|
| 0 | albertv2 | merged_winoqueer_seegull_gpt_augmentation_trained | merged_winoqueer_seegull_gpt_augmentation | 0.82956 | 0.812057 | 0.804933 | 0.808239 |
| 1 | albertv2 | merged_winoqueer_seegull_gpt_augmentation_trained | mgsd | 0.817525 | 0.798737 | 0.791267 | 0.7947 |
| 2 | albertv2 | merged_winoqueer_seegull_gpt_augmentation_trained | seegull_gpt_augmentation | 0.898551 | 0.886902 | 0.884058 | 0.885452 |
| 3 | albertv2 | merged_winoqueer_seegull_gpt_augmentation_trained | winoqueer_gpt_augmentation | 0.977029 | 0.973671 | 0.97475 | 0.974207 |
| 4 | albertv2 | mgsd_trained | merged_winoqueer_seegull_gpt_augmentation | 0.806485 | 0.795266 | 0.760456 | 0.772482 |
| 5 | albertv2 | mgsd_trained | mgsd | 0.813187 | 0.802741 | 0.769727 | 0.781473 |
| 6 | albertv2 | mgsd_trained | seegull_gpt_augmentation | 0.748792 | 0.733152 | 0.759058 | 0.735743 |
| 7 | albertv2 | mgsd_trained | winoqueer_gpt_augmentation | 0.7366 | 0.815491 | 0.611225 | 0.602798 |
| 8 | albertv2 | seegull_gpt_augmentation_trained | merged_winoqueer_seegull_gpt_augmentation | 0.699851 | 0.662657 | 0.653821 | 0.657215 |
| 9 | albertv2 | seegull_gpt_augmentation_trained | mgsd | 0.687199 | 0.647613 | 0.638152 | 0.641524 |
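
The `model`, `train_on`, and `eval_on` columns follow directly from the directory layout seen in the evaluation logs (`result_output_<model>/<train_on>/<eval_on>/classification_report.csv`). The sketch below shows how those keys can be recovered from the paths alone. It is not the notebook's `collect_summary`: the helper name `list_reports` is hypothetical, and reading the metrics out of each CSV is left out because the report's column layout is not shown here.

```python
from pathlib import Path
import tempfile

def list_reports(project_root):
    """Return (model, train_on, eval_on, path) for every classification report found."""
    rows = []
    pattern = "result_output_*/*/*/classification_report.csv"
    for report in sorted(Path(project_root).glob(pattern)):
        eval_on = report.parent.name
        train_on = report.parent.parent.name
        # Strip the "result_output_" prefix to get the model name, e.g. "bert"
        model = report.parent.parent.parent.name[len("result_output_"):]
        rows.append((model, train_on, eval_on, report))
    return rows

# Demonstration on a throwaway directory that mimics the logged layout
with tempfile.TemporaryDirectory() as tmp:
    report_dir = Path(tmp) / "result_output_bert" / "mgsd_trained" / "mgsd"
    report_dir.mkdir(parents=True)
    (report_dir / "classification_report.csv").write_text("placeholder\n")
    found = [(m, t, e) for m, t, e, _ in list_reports(tmp)]
    print(found)
```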


In [None]:
# Mapping from full training directory names to short labels
train_map = {
    "mgsd_trained": "MGSD",
    "winoqueer_gpt_augmentation_trained": "AWinQ",
    "seegull_gpt_augmentation_trained": "ASeeGULL",
    "merged_winoqueer_seegull_gpt_augmentation_trained": "EMGSD",
}

# Mapping from full evaluation directory names to short labels
eval_map = {
    "mgsd": "MGSD",
    "winoqueer_gpt_augmentation": "AWinQ",
    "seegull_gpt_augmentation": "ASeeGULL",
    "merged_winoqueer_seegull_gpt_augmentation": "EMGSD",
}

# Build and display the table
table = build_table(summary_df)
table

Macro F1 by `train_short` (rows) and `eval_short` (columns):

| model | train_short | ASeeGULL | AWinQ | EMGSD | MGSD |
|---|---|---|---|---|---|
| albertv2 | ASeeGULL | 0.878625 | 0.758341 | 0.657215 | 0.641524 |
| albertv2 | AWinQ | 0.742479 | 0.97957 | 0.643667 | 0.615635 |
| albertv2 | EMGSD | 0.885452 | 0.974207 | 0.808239 | 0.7947 |
| albertv2 | MGSD | 0.735743 | 0.602798 | 0.772482 | 0.781473 |
| bert | ASeeGULL | 0.889947 | 0.823317 | 0.653108 | 0.631083 |
| bert | AWinQ | 0.726933 | 0.991441 | 0.651479 | 0.62756 |
| bert | EMGSD | 0.891304 | 0.975954 | 0.825447 | 0.813525 |
| bert | MGSD | 0.711166 | 0.783431 | 0.803933 | 0.808953 |
| distilbert | ASeeGULL | 0.883363 | 0.868559 | 0.64804 | 0.621496 |
| distilbert | AWinQ | 0.701243 | 0.974379 | 0.648156 | 0.622647 |
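
One way to obtain a table of this shape is to map the directory names to their short labels and pivot on Macro F1. The sketch below is an assumption about what `build_table` does, not its actual implementation. The helper name `pivot_macro_f1` is hypothetical, the maps are trimmed to the entries used, and the two input rows are copied from the ALBERT-V2 summary output earlier in this section.

```python
import pandas as pd

# Subsets of the label maps defined in the cell above
train_map = {"mgsd_trained": "MGSD"}
eval_map = {"mgsd": "MGSD", "winoqueer_gpt_augmentation": "AWinQ"}

def pivot_macro_f1(summary_df, train_map, eval_map):
    """Pivot a long summary into (model, train_short) rows and eval_short columns."""
    df = summary_df.copy()
    df["train_short"] = df["train_on"].map(train_map)
    df["eval_short"] = df["eval_on"].map(eval_map)
    return df.pivot_table(
        index=["model", "train_short"],
        columns="eval_short",
        values="macro_f1",
    )

# Two rows copied from the summary output (ALBERT-V2 trained on MGSD)
rows = pd.DataFrame(
    [
        {"model": "albertv2", "train_on": "mgsd_trained",
         "eval_on": "mgsd", "macro_f1": 0.781473},
        {"model": "albertv2", "train_on": "mgsd_trained",
         "eval_on": "winoqueer_gpt_augmentation", "macro_f1": 0.602798},
    ]
)
demo_table = pivot_macro_f1(rows, train_map, eval_map)
print(demo_table)
```

`pivot_table` averages duplicate (row, column) pairs by default, which is harmless here because each train-test combination appears once per model.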


### Table: HEARTS Replicated Results

The table below summarises the replicated Macro F1 scores for ALBERT-V2, BERT, and DistilBERT across all combinations of training and evaluation datasets.

![HEARTS Replicated Results](COMP0173_Figures/hearts_replicated_results.png)
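
Macro F1 is taken here to be the unweighted mean of the per-class F1 scores, which is the convention followed by scikit-learn's classification report; that the reports were produced this way is an assumption. It is generally not the harmonic mean of macro precision and macro recall. For instance, the ALBERT-V2 model trained on MGSD and evaluated on the augmented WinoQueer set has a macro precision of 0.815 and a macro recall of 0.611 in the summary output, whose harmonic mean is about 0.699, while the reported Macro F1 is 0.603. The sketch below illustrates the definition on made-up labels; the function names are hypothetical.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(f1(precision, recall))
    return sum(scores) / len(scores)

# Made-up binary labels for illustration only
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(macro_f1(y_true, y_pred))  # per-class F1 is 0.75 and 0.5, so the mean is 0.625

# Harmonic mean of the macro precision and recall quoted above
print(round(f1(0.815491, 0.611225), 3))
```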

### Table: HEARTS Original Results [1]

![Model Results Table](COMP0173_Figures/hearts_results.png)

### Results Interpretation

The replicated Macro F1 scores for ALBERT-V2, BERT, and DistilBERT closely match the results reported in the HEARTS paper. All but two of the train-test combinations fall within the required $\pm 5\%$ range, confirming the successful replication of the baseline AI methodology.
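
The comparison can be made explicit with a small check. This sketch assumes the tolerance is an absolute difference of at most five percentage points in Macro F1, in line with the margin described for ALBERT-V2 in the next subsection. The function name `within_tolerance` is hypothetical, and all numbers are toy values, not figures from this replication or from the HEARTS paper.

```python
def within_tolerance(replicated, reference, tolerance=0.05):
    """Map each key to (absolute difference, whether it is within tolerance)."""
    result = {}
    for key, value in replicated.items():
        diff = abs(value - reference[key])
        result[key] = (round(diff, 6), diff <= tolerance)
    return result

# Toy values for illustration only (keys stand in for train-test pairs)
replicated = {"pair_a": 0.78, "pair_b": 0.60}
reference = {"pair_a": 0.80, "pair_b": 0.70}

check = within_tolerance(replicated, reference)
for key, (diff, ok) in check.items():
    print(key, diff, "within" if ok else "outside")
```

If the requirement is instead read as a relative deviation, `diff` would be divided by the reference value before the comparison.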

#### ALBERT-V2

For ALBERT-V2, the replicated Macro F1 scores generally align with the original results. Most train-test combinations remain within a five-percentage-point margin, and the relative performance ordering across datasets is consistent with the behaviour reported in the HEARTS paper. Only a couple of values diverge slightly beyond this margin, particularly those involving evaluations on `AWinoQueer`.

![ALBERT-V2 Results Table](COMP0173_Figures/results_comparison_albert.png)

#### BERT

The BERT model reproduces well. All replicated scores are within the $\pm 5\%$ range, and the differences between the original and replicated values are typically minor, often just one or two points.

![BERT Results Table](COMP0173_Figures/results_comparison_bert.png)

#### DistilBERT

DistilBERT shows strong alignment between the original and replicated metrics. Every train-test result falls within the target deviation range.

![DistilBERT Results Table](COMP0173_Figures/results_comparison_distilbert.png)

# References 

[1] Theo King, Zekun Wu, Adriano Koshiyama, Emre Kazim, and Philip Treleaven. 2024.
HEARTS: A holistic framework for explainable, sustainable and robust text stereotype detection.
arXiv preprint arXiv:2409.11579.
Available at: https://arxiv.org/abs/2409.11579
(Accessed: 4 December 2025).
https://doi.org/10.48550/arXiv.2409.11579

[2] Theo King, Zekun Wu, Adriano Koshiyama, Emre Kazim, and Philip Treleaven. 2024.
HEARTS-Text-Stereotype-Detection (GitHub Repository).
Available at: https://github.com/holistic-ai/HEARTS-Text-Stereotype-Detection
(Accessed: 4 December 2025).

[3] Theo King, Zekun Wu, Adriano Koshiyama, Emre Kazim, and Philip Treleaven. Holistic AI. 2024.
EMGSD: Expanded Multi-Group Stereotype Dataset (HuggingFace Dataset).
Available at: https://huggingface.co/datasets/holistic-ai/EMGSD
(Accessed: 4 December 2025).

[4] University College London Technical Support Group (TSG). 2025.
GPU Access and Usage Documentation.
Available at: https://tsg.cs.ucl.ac.uk/gpus/
(Accessed: 6 December 2025).

[5] United Nations. 2025. The 2030 Agenda for Sustainable Development. 
Available at: https://sdgs.un.org/2030agenda 
(Accessed: 6 December 2025).

[6] Veronika Grigoreva, Anastasiia Ivanova, Ilseyar Alimova, and Ekaterina Artemova. 2024.
RuBia: A Russian Language Bias Detection Dataset.
Available at: https://arxiv.org/abs/2403.17553
(Accessed: 9 December 2025).

[7] Veronika Grigoreva, Anastasiia Ivanova, Ilseyar Alimova, and Ekaterina Artemova. 2024.
RuBia-Dataset (GitHub Repository).
Available at: https://github.com/vergrig/RuBia-Dataset
(Accessed: 9 December 2025).

[8] Sismetanin. 2020. Toxic Comments Detection in Russian (GitHub Repository).
Available at: https://github.com/sismetanin/toxic-comments-detection-in-russian
(Accessed: 9 December 2025).

[9] DeepPavlov. 2019. RuBERT-base-cased (Hugging Face Model).
Available at: https://huggingface.co/DeepPavlov/rubert-base-cased
(Accessed: 9 December 2025).

[10] AI-Forever. 2023. RuBERT-base (Hugging Face Model).
Available at: https://huggingface.co/ai-forever/ruBert-base
(Accessed: 9 December 2025).

[11] Hugging Face. 2024. XLM-RoBERTa: Model Documentation.
Available at: https://huggingface.co/docs/transformers/en/model_doc/xlm-roberta
(Accessed: 9 December 2025).

[12] DeepPavlov. 2020. ruBERT-base-cased-sentence (Hugging Face Model).
Available at: https://huggingface.co/DeepPavlov/rubert-base-cased-sentence
(Accessed: 9 December 2025).