# Train ModernBERT on Pseudo-labeled Data

Pseudo-labeled data comes from two sources:
- MultiRC, pseudo-labeled by GPT5
- Authentic iTELL data, pseudo-labeled by o3-mini

Humans have labeled a non-overlapping portion of the authentic iTELL data. This will be our held-out test set.

In [1]:
import os
from pathlib import Path

import numpy as np
import pandas as pd
import torch
import datasets
from transformers import (
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding,
    AutoTokenizer,
    AutoModelForSequenceClassification,
)
from sklearn import metrics
from scipy import stats

torch.set_float32_matmul_precision("high")
os.environ["TOKENIZERS_PARALLELISM"] = "true"

In [2]:
model_name_or_path = "answerdotai/ModernBERT-base"
output_dir = "../../results/modernbert-multirc-pseudo-labeled"

# Training/Validation Data:
datadict_path = "../../data/authentic-03-scores-multirc-gpt5-scores.hf"  # The prepared training and validation data
multirc_path = "../../data/multirc-data-w-gpt5-scores.csv"  # A subsample of MultiRC, scored by GPT 5
authentic_path = "../../data/authentic_train_data.csv"  # Authentic data from iTELL, scored by o3-mini using the same rubric/prompt

# Test Data:
test_data_path = "../../data/authentic_test_data.csv"  # Authentic data from iTELL, scored by the iTELL development team

batch_size = 4
num_epochs = 6
learning_rate = 1e-5
seed = 42

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

## Construct Dataset

In [5]:
pd.read_csv(test_data_path)

Unnamed: 0,response,ensemble_score,user_id,page_slug,chunk_slug,created_at,volume_slug,volume_title,page_title,chunk_header,chunk_text,question,answer,o3_mini_score,human_score,annotator
0,learning,0,clu91tykw0002l00fz3o1vfsh,emotional,01-Affect-Detection-from-Student-Activity-Data...,2024-04-15 19:57:38.656,cornell,Handbook of Learning Analytics,Emotional Learning Analytics,0.1 Affect Detection from Student Activity Data,Affective states cannot be directly measured b...,What approach was used to infer affect by anal...,"The interaction-based, log-file based, or sens...",1,1.0,langdon
1,answer,0,cluc0hclp0000ju0flnuywkrr,emotional,01-Affect-Detection-from-Student-Activity-Data...,2024-05-05 20:29:28.14,cornell,Handbook of Learning Analytics,Emotional Learning Analytics,0.1 Affect Detection from Student Activity Data,Affective states cannot be directly measured b...,What approach was used to infer affect by anal...,"The interaction-based, log-file based, or sens...",1,1.0,langdon
2,human observers made live annotations regardin...,0,clu7cooq9000kjt0fbxh29vfy,emotional,01-Affect-Detection-from-Student-Activity-Data...,2024-04-15 19:03:37.985,cornell,Handbook of Learning Analytics,Emotional Learning Analytics,0.1 Affect Detection from Student Activity Data,Affective states cannot be directly measured b...,What approach was used to infer affect by anal...,"The interaction-based, log-file based, or sens...",2,3.0,langdon
3,Recorded data from student while doing help ba...,0,clucz38h40005jv0fhc5vjpfp,emotional,01-Affect-Detection-from-Student-Activity-Data...,2024-03-30 01:56:12.488,cornell,Handbook of Learning Analytics,Emotional Learning Analytics,0.1 Affect Detection from Student Activity Data,Affective states cannot be directly measured b...,What approach was used to infer affect by anal...,"The interaction-based, log-file based, or sens...",2,3.0,langdon
4,analyzing context,0,clu8uwkw20000jz0f4yyg6c1d,emotional,01-Affect-Detection-from-Student-Activity-Data...,2024-05-03 20:56:52.298,cornell,Handbook of Learning Analytics,Emotional Learning Analytics,0.1 Affect Detection from Student Activity Data,Affective states cannot be directly measured b...,What approach was used to infer affect by anal...,"The interaction-based, log-file based, or sens...",2,2.0,langdon
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
365,trace refer to ambient data generated by learn...,2,clu7s2zh60005jt0gpk3z8at5,learning-analytics-for-self-regulated-learning,Data-For-Learning-Analytics-About-Learning-And...,2024-04-20 06:30:33.256,cornell,Handbook of Learning Analytics,Learning Analytics for Self-Regulated Learning,Data For Learning Analytics About Learning And...,"Traces\n\nAs learners work, they generate ambi...",What are traces and what do they reveal about ...,Traces are ambient data generated by learners ...,4,4.0,Tobasum
366,They are ambient data generated when learners ...,2,clu8s7sxo0005l50ft5mjtbjd,learning-analytics-for-self-regulated-learning,Data-For-Learning-Analytics-About-Learning-And...,2024-03-27 15:29:40.613,cornell,Handbook of Learning Analytics,Learning Analytics for Self-Regulated Learning,Data For Learning Analytics About Learning And...,"Traces\n\nAs learners work, they generate ambi...",What are traces and what do they reveal about ...,Traces are ambient data generated by learners ...,3,2.0,Tobasum
367,to make bindings in one module accessible in o...,1,fvbptwwe5elyvsqwx7cqgxs5fa,10-modules,ES-modules-845t,2024-09-11 16:04:14.815953+00,eloquent-javascript,Eloquent JavaScript,10. Modules,ES modules,The original JavaScript language did not have ...,What is the purpose of the export keyword in J...,The export keyword is used to indicate that a ...,3,3.0,Tobasum
368,"It is used to make function, class, object, or...",2,c35eotu6a3guavdkekjwe3346i,10-modules,ES-modules-845t,2024-09-10 19:11:56.866991+00,eloquent-javascript,Eloquent JavaScript,10. Modules,ES modules,The original JavaScript language did not have ...,What is the purpose of the export keyword in J...,The export keyword is used to indicate that a ...,4,4.0,Tobasum


In [24]:
train_dev_df1 = pd.read_csv(multirc_path)[
    ["chunk_text", "question", "response", "gpt5_score"]
].rename(columns={"gpt5_score": "label"})
train_dev_df2 = pd.read_csv(authentic_path)[
    ["chunk_text", "question", "response", "o3_mini_score", "answer"]
].rename(columns={"o3_mini_score": "label"})
test_df = pd.read_csv(test_data_path)[
    ["chunk_text", "question", "response", "human_score", "answer"]
].rename(columns={"human_score": "label"})

train_dev_df = pd.concat([train_dev_df1, train_dev_df2])
train_dev_df

Unnamed: 0,chunk_text,question,response,label,answer
0,A flood occurs when a river overflows its bank...,What forms the raised strip near the edge of a...,Sandy desert,1,
1,Force is a vector. What then is a vector? Thin...,What two pieces of information does a vector p...,Motion and distance,2,
2,"Madrid, Spain (CNN) -- Relatives of a woman ki...",Where was the Spanish MD82 bound for when the ...,Spain's Barcelona,1,
3,Flowing water causes sediment to move. Flowing...,How long does it take for water to dissolve ro...,Few days,1,
4,How would the universe look without gravity? I...,How would the universe look without gravity?,No planets,2,
...,...,...,...,...,...
1053,Let’s begin with a brief overview of spectacul...,What were economic conditions like before 1870?,"Slow technological progress, natural disasters...",4,Economic conditions before 1870 were marked by...
1054,Let’s begin with a brief overview of spectacul...,What were economic conditions like before 1870?,Economic conditions before 1870 were sluggish ...,2,Economic conditions before 1870 were marked by...
1055,Let’s begin with a brief overview of spectacul...,What were economic conditions like before 1870?,economic conditions were slow,2,Economic conditions before 1870 were marked by...
1056,Let’s begin with a brief overview of spectacul...,What were economic conditions like before 1870?,"Before 1870, economic conditions were relative...",2,Economic conditions before 1870 were marked by...


In [7]:
train_dev_ds = datasets.Dataset.from_pandas(train_dev_df, preserve_index=False)
dd = train_dev_ds.train_test_split(test_size=0.10, seed=42)
dd["dev"] = dd["test"]

test_ds = datasets.Dataset.from_pandas(test_df, preserve_index=False)
dd["test"] = test_ds
dd

DatasetDict({
    train: Dataset({
        features: ['chunk_text', 'question', 'response', 'label', 'answer'],
        num_rows: 5004
    })
    test: Dataset({
        features: ['chunk_text', 'question', 'response', 'label', 'answer'],
        num_rows: 370
    })
    dev: Dataset({
        features: ['chunk_text', 'question', 'response', 'label', 'answer'],
        num_rows: 556
    })
})

In [8]:
dd.save_to_disk(datadict_path)

Saving the dataset (0/1 shards):   0%|          | 0/5004 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/370 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/556 [00:00<?, ? examples/s]

## Prepare Dataset

In [16]:
dd = datasets.DatasetDict.load_from_disk(datadict_path)


def preprocess_function(example):
    input_str = "\n\n\n".join(
        [
            f"Passage: {example['chunk_text']}",
            f"Question: {example['question']}",
            f"Reference Answer: {example.get('answer', '')}",
            f"Student Response: {example['response']}",
        ]
    )
    new_example = tokenizer(input_str)
    return tokenizer(input_str)


dd = dd.map(
    preprocess_function,
    batched=False,
    remove_columns=[
        "chunk_text",
        "question",
        "response",
        "answer",
    ],
)

# Convert label column to float type
new_features = dd["train"].features.copy()
new_features["label"] = datasets.Value("float32")
dd = dd.cast(new_features)
dd

Casting the dataset:   0%|          | 0/5004 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/370 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/556 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 5004
    })
    test: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 370
    })
    dev: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 556
    })
})

## Set Up Training

In [17]:
def model_init():
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name_or_path,
        num_labels=1,
    )
    return model

In [18]:
def compute_metrics(eval_pred):
    """
    eval_pred : tuple
        A tuple of (logits, labels) provided by the Hugging Face Trainer.
        - logits: numpy array of shape (n_samples,) for binary classification
        - labels: numpy array of shape (n_samples,)
    """
    preds, labels = eval_pred

    metric_dict = {}

    # Regression metrics
    metric_dict["mse"] = metrics.mean_squared_error(labels, preds)
    metric_dict["rmse"] = metrics.mean_squared_error(labels, preds, squared=False)
    metric_dict["mae"] = metrics.mean_absolute_error(labels, preds)
    metric_dict["r2"] = metrics.r2_score(labels, preds)

    # Classification metrics (round to integers for ordinal ratings)
    preds_int = np.round(preds).astype(int)
    labels_int = np.round(labels).astype(int)

    # Quadratic Weighted Kappa
    metric_dict["qwk"] = metrics.cohen_kappa_score(
        labels_int, preds_int, weights="quadratic"
    )

    # Spearman's r
    metric_dict["spearman"] = stats.spearmanr(labels_int, preds_int).statistic

    return metric_dict

In [19]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding="longest")

In [20]:
training_args = TrainingArguments(
    output_dir=output_dir,
    bf16=True,  # bfloat16 training
    optim="adamw_torch_fused",
    num_train_epochs=num_epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    learning_rate=learning_rate,
    logging_dir="../../logs",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    seed=seed,
    log_level="error",
    disable_tqdm=False,
    report_to="none",  # Disable WandB reporting
)

trainer = Trainer(
    model_init=model_init,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dd["train"],
    eval_dataset=dd["dev"],
    compute_metrics=compute_metrics,
)

trainer.train()

Epoch,Training Loss,Validation Loss,Mse,Rmse,Mae,R2,Qwk,Spearman
1,0.8053,0.742637,0.742637,0.861764,0.736581,0.252449,0.316983,0.420148
2,0.6572,0.696517,0.696517,0.834576,0.689032,0.298874,0.430365,0.50637
3,0.5733,0.775656,0.775656,0.880713,0.67191,0.219212,0.56498,0.553641
4,0.4964,0.635834,0.635834,0.797392,0.642564,0.359959,0.565201,0.569532
5,0.366,0.69663,0.69663,0.834644,0.646625,0.298761,0.550837,0.543964
6,0.3183,0.709133,0.709133,0.842101,0.638117,0.286175,0.536239,0.529151


TrainOutput(global_step=7506, training_loss=0.544247556361649, metrics={'train_runtime': 1032.3571, 'train_samples_per_second': 29.083, 'train_steps_per_second': 7.271, 'total_flos': 1.2317887451672856e+16, 'train_loss': 0.544247556361649, 'epoch': 6.0})

In [21]:
trainer.save_model("../../results/modernbert_authentic_multirc_with_reference")

## Functionality Test

In [22]:
from transformers import pipeline

classifier = pipeline(
    task="text-classification",
    model="../../results/modernbert_authentic_multirc",
    tokenizer=model_name_or_path,
    device=0,
)

sample = "Smoking is bad for your health."

classifier(sample)

[{'label': 'LABEL_0', 'score': 1.8023130893707275}]