# Deep Knowledge Tracing with ModernBERT

This notebook demonstrates our approach to simulating student responses in an educational platform using **Deep Knowledge Tracing (DKT)** with **ModernBERT**. We treat each student individually and fine-tune a separate ModernBERT model for each one. The goal is to predict whether a student will answer a given question correctly, based on:

- The text of the question and its multiple-choice options.
- The topic, subject, and other relevant attributes.
- The student's past performance (implicitly learned during training).

Our broader objective is to create synthetic "student models" that can help us estimate question difficulty without requiring costly large-scale field testing with real students.

## Notebook Overview
1. **Imports and Environment Setup**: We load environment variables and import the required Python libraries.
2. **Data Loading and Preparation**: We read the main dataset, remove duplicates, and prepare it for training.
3. **User Selection**: We select the top 50 users based on the number of answers submitted.
4. **Model Training Function**: We define a function that fine-tunes a ModernBERT model for a specific user.
5. **Batch Training**: We loop over the selected users and fine-tune a separate model for each.
6. **Next Steps**: We outline potential follow-ups, such as evaluating performance and computing difficulty metrics.

In [None]:
# 1) Imports and Environment Setup

from dotenv import load_dotenv  # For loading environment variables from a .env file
import os
import pandas as pd
import matplotlib.pyplot as plt

# Load environment variables
load_dotenv()

print("[INFO] Environment variables loaded and libraries imported.")

### Data Loading
In this section, we load the main dataset, which contains all user interactions (student answers to questions). We then drop duplicate rows so that each `answer_id` appears only once and no answer is counted twice.
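
As a small illustration of the deduplication step, the sketch below applies `drop_duplicates` to a toy DataFrame. The values are made up for the example; only the `answer_id`, `user_id`, and `is_correct` column names come from this notebook. By default pandas keeps the first occurrence of each duplicated key.

```python
import pandas as pd

# Toy data (hypothetical values): answer_id 2 appears twice.
toy = pd.DataFrame(
    {
        "answer_id": [1, 2, 2, 3],
        "user_id": [10, 10, 10, 11],
        "is_correct": [True, False, False, True],
    }
)

# keep="first" is the pandas default: the first row per answer_id survives.
deduped = toy.drop_duplicates(subset=["answer_id"]).reset_index(drop=True)

removed = len(toy) - len(deduped)
print(f"Removed {removed} duplicate row(s); {len(deduped)} remain.")
```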

In [None]:
# 2) Data Loading and Preparation

# Load the master dataset that has the translated version of all records
df_original = pd.read_csv('../data/new/master_translated.csv')
print(f"[INFO] Loaded master dataset with {len(df_original):,} rows.")

# Drop duplicate answer_ids to ensure each answer is only counted once
initial_count = len(df_original)
df_original.drop_duplicates(subset=['answer_id'], inplace=True)
final_count = len(df_original)
print(f"[INFO] Duplicates removed: {initial_count - final_count}")
print(f"[INFO] Dataset now has {final_count:,} rows.")

### User Selection
We next identify the top 50 users based on the number of answers submitted. These top users are chosen for individual model training, which will allow us to capture a variety of response patterns from different individuals.
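
The selection relies on the fact that `value_counts()` returns counts sorted in descending order, so `head(n)` keeps the `n` most active users. A minimal sketch on toy data (the user ids are hypothetical):

```python
import pandas as pd

# Toy interaction log: one row per submitted answer.
toy = pd.DataFrame({"user_id": [7, 7, 7, 3, 3, 9, 9, 9, 9, 5]})

# value_counts() sorts by count, largest first.
counts = toy["user_id"].value_counts()

# Keep the two most active users.
top_2 = counts.head(2)

for u_id, n_answers in top_2.items():
    print(f"User {u_id}: {n_answers} answers")
```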

In [None]:
# 3) Selecting the top 50 users by answer count

# Count how many answers each user has
user_answer_counts = df_original['user_id'].value_counts()

# Select the top 50
top_50_users = user_answer_counts.head(50)

# (Optional) Reverse the order so that users with fewer answers are processed first
top_50_users = top_50_users.iloc[::-1]

print("[INFO] Top 50 users by number of answers:")
for u_id, count in top_50_users.items():
    print(f"- User {u_id}: {count:,} answers")

### Model Training Setup
Below, we import the necessary libraries and define the function that trains a **ModernBERT** model for a single user. We use Hugging Face Transformers to:

1. **Tokenize** our question text + multiple-choice options.
2. **Fine-tune** a ModernBERT-based binary classifier.
3. **Evaluate** performance (using a weighted F1 score, which is important because the dataset can be imbalanced).

Key libraries and components:
- **Datasets**: For handling train/validation splits in a memory-efficient manner.
- **AutoTokenizer** and **AutoModelForSequenceClassification**: Automatically load the correct tokenizer and model configuration.
- **Trainer** and **TrainingArguments**: Simplify the training loop, logging, checkpointing, and evaluation.
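
To make the evaluation metric concrete, here is a minimal sketch of what a weighted F1 score computes: the per-class F1 scores, averaged with weights proportional to each class's support in the true labels. The helper names and toy labels are illustrative only; the notebook itself relies on `sklearn.metrics.f1_score(..., average="weighted")`.

```python
from collections import Counter


def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by class support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        f1_for_class(y_true, y_pred, cls) * n / total
        for cls, n in support.items()
    )


# Toy, imbalanced labels: 8 correct answers (1) and 2 incorrect ones (0).
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
# A degenerate model that always predicts "correct".
y_pred = [1] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = weighted_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f}, weighted_f1={score:.3f}")
```

In this toy case the always-correct predictor reaches 0.8 accuracy but a lower weighted F1, because the minority class contributes an F1 of zero.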

In [None]:
# 4) Model Definition and Utility Imports

from sklearn.model_selection import train_test_split

from datasets import Dataset, DatasetDict
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)
from transformers.trainer_utils import get_last_checkpoint
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

print("[INFO] Transformers and related libraries loaded.")

### Train a Model for One User
The following function, `train_user_model`, encapsulates all the steps for training a ModernBERT model on data from a single user. The pipeline:

1. **Filter** the DataFrame to retrieve only rows belonging to the target user.
2. **Combine** the question text, the multiple-choice options, and meta-info (topic, subject, axis) into one string.
3. **Split** the user's data into train/test sets.
4. **Tokenize** the text for ModernBERT.
5. **Initialize** and **train** the model (using the Hugging Face Trainer).
6. **Evaluate** using a weighted F1 metric.
7. **Save** the best checkpoint.

If `push_to_hub=True`, the final model can be automatically uploaded to the Hugging Face Hub.
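
To illustrate step 2, the sketch below builds the combined input string for a single toy record. The field names mirror the columns used by `train_user_model`; the values are invented, and `build_prompt` is a hypothetical stand-alone helper rather than the notebook's own code.

```python
def build_prompt(row: dict) -> str:
    """Join metadata, question text, and options into one model input string."""
    header = (
        f"Topic: {row.get('topic_name', '')}\n"
        f"Subject: {row.get('subject_name', '')}\n"
        f"Axis: {row.get('axis_name', '')}\n\n"
    )
    question = f"Question: {row.get('question_title', '')}\n"
    # One line per option, labelled A) through E).
    options = "\n".join(
        "{}) {}".format(letter.upper(), row.get("option_" + letter, ""))
        for letter in "abcde"
    )
    return header + question + options


# Toy record (hypothetical values).
toy_row = {
    "topic_name": "Algebra",
    "subject_name": "Mathematics",
    "axis_name": "Numbers",
    "question_title": "What is 2 + 3?",
    "option_a": "2",
    "option_b": "3",
    "option_c": "4",
    "option_d": "5",
    "option_e": "6",
}

text = build_prompt(toy_row)
print(text)
```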

In [None]:
def train_user_model(
    df: pd.DataFrame,
    user_id: int,
    model_name: str = "answerdotai/ModernBERT-base",
    output_dir_base: str = "user_models",
    push_to_hub: bool = False,
    hub_repo_org: str = None,
    hub_token: str = None,
    num_train_epochs: int = 5,
    batch_size: int = 4
):
    """
    Trains a ModernBERT model to predict if a user will answer correctly.

    Args:
        df (pd.DataFrame): The entire dataset containing rows for all users.
        user_id (int): The target user's identifier.
        model_name (str): Hugging Face model ID for ModernBERT.
        output_dir_base (str): Base directory where user-specific checkpoints will be stored.
        push_to_hub (bool): Whether to push the final model to the Hugging Face Hub.
        hub_repo_org (str): Org or username for HF Hub repositories.
        hub_token (str): Hugging Face access token used to authenticate when pushing to the Hub.
        num_train_epochs (int): Number of epochs to train.
        batch_size (int): Training batch size.
    """

    # 1) Filter user data
    df_user = df[df["user_id"] == user_id].copy()

    # Check if user has any data
    if df_user.empty:
        print(f"[User {user_id}] No data. Skipping.")
        return

    # Convert correctness to integer labels (0/1)
    df_user["label"] = df_user["is_correct"].astype(int)

    # 2) Combine text: question + choices + topics
    def combine_text(row):
        q = str(row.get("question_title", ""))
        a = str(row.get("option_a", ""))
        b = str(row.get("option_b", ""))
        c = str(row.get("option_c", ""))
        d = str(row.get("option_d", ""))
        e = str(row.get("option_e", ""))
        topic = str(row.get("topic_name", ""))
        subj = str(row.get("subject_name", ""))
        axis = str(row.get("axis_name", ""))

        return (
            f"Topic: {topic}\n"
            f"Subject: {subj}\n"
            f"Axis: {axis}\n\n"
            f"Question: {q}\n"
            f"A) {a}\n"
            f"B) {b}\n"
            f"C) {c}\n"
            f"D) {d}\n"
            f"E) {e}"
        )

    df_user["text"] = df_user.apply(combine_text, axis=1)

    # 3) Train/Test split
    # train_test_split raises a ValueError if either side would be empty,
    # so check for enough rows before splitting.
    if len(df_user) < 2:
        print(f"[User {user_id}] Not enough data to split. Skipping.")
        return

    train_df, test_df = train_test_split(
        df_user,
        test_size=0.2,
        shuffle=True,
        random_state=42
    )

    # Build Hugging Face Datasets
    train_dataset_hf = Dataset.from_pandas(train_df[["text","label"]].reset_index(drop=True))
    test_dataset_hf  = Dataset.from_pandas(test_df[["text","label"]].reset_index(drop=True))

    ds_dict = DatasetDict({
        "train": train_dataset_hf,
        "test": test_dataset_hf
    })

    # 4) Tokenize
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def tokenize_fn(batch):
        return tokenizer(
            batch["text"],
            padding="max_length",
            truncation=True,
            max_length=512
        )

    ds_dict = ds_dict.map(tokenize_fn, batched=True)

    # 5) Load Model
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=2
    )
    model.config.problem_type = "single_label_classification"

    # 6) Training Arguments
    user_output_dir = os.path.join(output_dir_base, f"user_{user_id}")
    os.makedirs(user_output_dir, exist_ok=True)

    if push_to_hub:
        if hub_repo_org:
            hub_repo_id = f"{hub_repo_org}/modernbert-user-{user_id}"
        else:
            hub_repo_id = f"modernbert-user-{user_id}"
    else:
        hub_repo_id = None

    training_args = TrainingArguments(
        output_dir=user_output_dir,
        eval_strategy="epoch",        # Called `evaluation_strategy` in older transformers releases
        save_strategy="epoch",
        learning_rate=5e-5,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=num_train_epochs,
        use_mps_device=True,          # For Apple Silicon; deprecated in recent transformers, which select MPS automatically
        bf16=True,                    # Use bfloat16 if supported
        optim="adamw_torch_fused",   # Fused optimizer for performance
        logging_steps=50,
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        save_total_limit=2,
        push_to_hub=push_to_hub,
        hub_model_id=hub_repo_id,
        hub_token=hub_token,
        hub_strategy="end",
    )

    # Define metrics
    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        preds = np.argmax(logits, axis=-1)
        acc = accuracy_score(labels, preds)
        f1_val = f1_score(labels, preds, average="weighted")
        return {"accuracy": acc, "f1": f1_val}

    # 7) Create Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=ds_dict["train"],
        eval_dataset=ds_dict["test"],
        processing_class=tokenizer,  # Replaces the deprecated `tokenizer=` argument in recent transformers
        compute_metrics=compute_metrics,
    )

    # Check if we should resume training
    last_checkpoint = None
    if os.path.isdir(user_output_dir):
        last_checkpoint = get_last_checkpoint(user_output_dir)
    if last_checkpoint is not None:
        print(f"[User {user_id}] Resuming from {last_checkpoint}.")
    else:
        print(f"[User {user_id}] Starting training from scratch.")

    # 8) Train
    trainer.train(resume_from_checkpoint=last_checkpoint)

    # Evaluate final
    eval_metrics = trainer.evaluate()
    print(f"[User {user_id}] Final eval metrics: {eval_metrics}")

    # 9) Save the best checkpoint
    trainer.save_model(user_output_dir)
    print(f"[User {user_id}] Done! Best model saved at: {user_output_dir}.")

### Train Models for the Top 50 Users
Now we iterate over the selected top 50 users and train a separate ModernBERT model for each. This step can be time-consuming depending on your hardware and the amount of data per user.

> **Note**: The loop below is configured with `push_to_hub=True`. If a valid Hugging Face token is available in the `HUGGINGFACE_ACCESS_TOKEN` environment variable, each trained model is uploaded to your account (or to an organization, if you pass `hub_repo_org`). Set `push_to_hub=False` if you only want to train locally.
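
One cautious way to wire this up is to enable pushing only when a token is actually present in the environment. The sketch below assumes the token is stored under `HUGGINGFACE_ACCESS_TOKEN`, the variable name read by the training loop; `resolve_push_settings` is a hypothetical helper, not part of the notebook.

```python
import os


def resolve_push_settings(env_var: str = "HUGGINGFACE_ACCESS_TOKEN"):
    """Return (push_to_hub, token); pushing is enabled only if a token is set."""
    token = os.getenv(env_var)
    # Treat an empty string the same as a missing variable.
    if not token:
        return False, None
    return True, token


push_to_hub, hub_token = resolve_push_settings()
print(f"push_to_hub={push_to_hub}")
```

The two returned values could then be passed to `train_user_model` as `push_to_hub` and `hub_token` instead of hard-coding `push_to_hub=True`.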

In [None]:
# 5) Batch Training for Top Users

OUTPUT_DIR_BASE = "user_models"  # Local directory for saving user-specific model checkpoints

for user_id, count in top_50_users.items():
    print(f"\n=== Training model for User {user_id} (Answers = {count}) ===")
    train_user_model(
        df=df_original,
        user_id=user_id,
        model_name="answerdotai/ModernBERT-base",
        output_dir_base=OUTPUT_DIR_BASE,
        push_to_hub=True,  # Set to False if you don't want to push to Hugging Face Hub
        hub_token=os.getenv("HUGGINGFACE_ACCESS_TOKEN"),
        num_train_epochs=10,
        batch_size=4
    )

## Next Steps
1. **Performance Analysis**: Check the distribution of F1 scores across the 50 user models, and test whether the number of training samples per user correlates with final model performance.
2. **Generalization**: Consider how well these models will generalize to new questions or topics not seen in the training set.
3. **Difficulty Estimation**: If these models accurately simulate student responses, use them to compute question difficulty by querying each synthetic student.
4. **Integration with Real Students**: Compare synthetic responses with real student performance to validate the realism of the simulated data.
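
As a rough sketch of the difficulty-estimation idea, one simple definition (an assumption here, not something this notebook has settled on) is the fraction of synthetic students predicted to answer a question incorrectly. The predictions below are hard-coded toy values standing in for the outputs of the per-user models, and `estimate_difficulty` is a hypothetical helper.

```python
# Toy predictions (hypothetical): predictions[user][question] is 1 if that
# user's model predicts a correct answer, 0 otherwise.
predictions = {
    "user_a": {"q1": 1, "q2": 0, "q3": 1},
    "user_b": {"q1": 1, "q2": 0, "q3": 0},
    "user_c": {"q1": 1, "q2": 1, "q3": 0},
    "user_d": {"q1": 1, "q2": 0, "q3": 1},
}


def estimate_difficulty(preds: dict) -> dict:
    """Difficulty = share of synthetic students predicted to answer incorrectly."""
    question_ids = sorted({q for answers in preds.values() for q in answers})
    difficulty = {}
    for q in question_ids:
        # Only count students that have a prediction for this question.
        votes = [answers[q] for answers in preds.values() if q in answers]
        difficulty[q] = 1.0 - sum(votes) / len(votes)
    return difficulty


difficulty = estimate_difficulty(predictions)

# List questions from easiest to hardest.
for q, d in sorted(difficulty.items(), key=lambda item: item[1]):
    print(f"{q}: difficulty={d:.2f}")
```

A refinement would be to average predicted probabilities rather than hard 0/1 votes, which keeps more information from each model.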

This completes the core demonstration of training separate ModernBERT models per user.