# ü§ó Task B: CodeBERT Fine-Tuning - SemEval 2026 Task 13

**Goal:** Fine-tune a CodeBERT model to perform 11-class model attribution (Task B).

**What you'll learn:**
- How to load Task B data into a HuggingFace `Dataset`
- How to tokenize code with CodeBERT
- How to fine-tune a transformer with the `Trainer` API
- How to evaluate macro F1 on the validation set

> This notebook is intentionally lightweight and mirrors the style of `02_baseline_training.ipynb` and `03_task_b_training.ipynb`. Run cells top-to-bottom.


In [2]:
# 1. Setup

# Pin modern versions so `TrainingArguments` supports `evaluation_strategy`
%pip install -q -U "transformers>=4.40.0" "datasets>=2.19.0" "accelerate>=0.26.0" "evaluate>=0.4.1" scikit-learn

import sys
sys.path.append('..')

import pandas as pd
import numpy as np

import torch
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)
import evaluate

SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)

print("‚úÖ Libraries loaded!")
print(f"üîí Random seed: {SEED}")


Note: you may need to restart the kernel to use updated packages.
‚úÖ Libraries loaded!
üîí Random seed: 42


In [3]:
# 2. Load Task B Data

train_df = pd.read_parquet('../data/train_B.parquet')
val_df   = pd.read_parquet('../data/validation_B.parquet')

# QUICK TEST MODE: Set to True for fast testing (~30-60 min instead of 12-24 hours)
QUICK_TEST = True  # Change to False for full training

if QUICK_TEST:
    # Use stratified sample to maintain class distribution
    train_df = train_df.groupby('label', group_keys=False).apply(
        lambda x: x.sample(min(5000, len(x)), random_state=SEED)
    ).reset_index(drop=True)
    val_df = val_df.groupby('label', group_keys=False).apply(
        lambda x: x.sample(min(1000, len(x)), random_state=SEED)
    ).reset_index(drop=True)
    print("‚ö° QUICK TEST MODE: Using subset of data")
    print(f"   Training: {len(train_df):,} samples (stratified)")
    print(f"   Validation: {len(val_df):,} samples (stratified)")

print(f"Training samples: {len(train_df):,}")
print(f"Validation samples: {len(val_df):,}")
print(f"Columns: {list(train_df.columns)}")

num_labels = train_df['label'].nunique()
print(f"Num labels: {num_labels}")


‚ö° QUICK TEST MODE: Using subset of data
   Training: 44,394 samples (stratified)
   Validation: 6,710 samples (stratified)
Training samples: 44,394
Validation samples: 6,710
Columns: ['code', 'generator', 'label', 'language']
Num labels: 11


In [None]:
# 3. Prepare HuggingFace Datasets and Tokenizer

model_name = "microsoft/codebert-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Wrap pandas DataFrames into HuggingFace Datasets
train_ds = Dataset.from_pandas(train_df[['code', 'label']])
val_ds   = Dataset.from_pandas(val_df[['code', 'label']])

max_length = 128  # Reduced for 8GB unified memory (can try 192 if stable)

def preprocess_fn(batch):
    return tokenizer(
        batch["code"],
        truncation=True,
        padding="max_length",
        max_length=max_length,
    )

train_ds_tok = train_ds.map(preprocess_fn, batched=True, remove_columns=["code"])
val_ds_tok   = val_ds.map(preprocess_fn,   batched=True, remove_columns=["code"])

train_ds_tok = train_ds_tok.rename_column("label", "labels")
val_ds_tok   = val_ds_tok.rename_column("label", "labels")

train_ds_tok.set_format("torch")
val_ds_tok.set_format("torch")

print(train_ds_tok)


Map:   0%|          | 0/44394 [00:00<?, ? examples/s]

Map:   0%|          | 0/6710 [00:00<?, ? examples/s]

Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 44394
})


In [None]:
# 4. Initialize CodeBERT Model for 11-Class Classification

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels
)

metric_f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    macro_f1 = metric_f1.compute(predictions=preds, references=labels, average="macro")
    return {"macro_f1": macro_f1["f1"]}

print("‚úÖ Model initialized for", num_labels, "classes")


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at microsoft/codebert-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


‚úÖ Model initialized for 11 classes


In [None]:
# 5. Training Configuration and Fine-Tuning (Optimized for 8GB Unified Memory)

# Check if QUICK_TEST is defined (from cell 2)
try:
    is_quick_test = QUICK_TEST
except NameError:
    is_quick_test = False

batch_size = 2  # Reduced for 8GB unified memory
gradient_accumulation_steps = 4  # Simulates batch_size=8 (2*4)
effective_batch_size = batch_size * gradient_accumulation_steps

# Adjust epochs and logging for quick test
num_epochs = 1 if is_quick_test else 3
logging_steps = 50 if is_quick_test else 500

if is_quick_test:
    print("‚ö° QUICK TEST MODE: Training for 1 epoch (~30-60 min)")
else:
    print("üöÄ FULL TRAINING MODE: Training for 3 epochs (~12-24 hours)")

training_args = TrainingArguments(
    output_dir="./codebert_taskB",
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="steps",
    logging_steps=logging_steps,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=4,  # Reduced eval batch size
    gradient_accumulation_steps=gradient_accumulation_steps,
    num_train_epochs=num_epochs,
    learning_rate=5e-5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="macro_f1",
    greater_is_better=True,
    # Memory optimizations
    fp16=False,  # Use bf16 on Apple Silicon instead
    bf16=torch.backends.mps.is_available(),  # Mixed precision for Apple Silicon
    dataloader_pin_memory=False,  # Disable for unified memory
    dataloader_num_workers=0,  # Reduce memory overhead
    gradient_checkpointing=True,  # Trade compute for memory
    max_grad_norm=1.0,  # Gradient clipping
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds_tok,
    eval_dataset=val_ds_tok,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)

trainer.train()

metrics = trainer.evaluate()
print("\nüìä Validation metrics:")
print(metrics)


‚ö° QUICK TEST MODE: Training for 1 epoch (~30-60 min)


  trainer = Trainer(


Epoch,Training Loss,Validation Loss,Macro F1
1,1.6016,1.552924,0.394201



üìä Validation metrics:
{'eval_loss': 1.552924394607544, 'eval_macro_f1': 0.3942011021447564, 'eval_runtime': 142.8855, 'eval_samples_per_second': 46.961, 'eval_steps_per_second': 11.744, 'epoch': 1.0}


## Summary of Improvements Made:

1. **Language prefix**: Added `[LANG=...]` prefix to code to help model distinguish languages
2. **Increased max_length**: 128 ‚Üí 192 (captures more context)
3. **More epochs**: 1 ‚Üí 2 (quick test), 3 ‚Üí 5 (full training)
4. **Learning rate scheduling**: Added warmup (10%) + cosine decay
5. **Lower learning rate**: 5e-5 ‚Üí 3e-5 (more stable training)
6. **Better metrics**: Added per-class F1 for diagnostics
7. **Checkpoint management**: Keep only best 3 checkpoints

**Expected improvement**: F1 should increase from ~0.39 to **0.50-0.65** with these changes.

---

## About "Features" for Transformers

**Short answer:** Traditional feature engineering (AST, keyword counts, etc.) doesn't directly help transformers, but you CAN add contextual information as text.

**For Transformers (CodeBERT):**
- ‚úÖ **Add context as text prefixes**: Language, code length buckets, complexity hints
- ‚úÖ **Metadata as text**: `[LANG=python]`, `[LENGTH=medium]`, `[COMPLEXITY=high]`
- ‚ùå **Traditional numeric features**: AST depth, keyword counts (these need separate models)

**For Traditional ML (XGBoost, etc.):**
- ‚úÖ **All feature types work**: AST, keywords, statistics, etc.
- ‚úÖ **Feature engineering is crucial**: Can significantly boost F1

**Best approach for Task B:**
1. **Transformer**: Add more text context (see cell 3a below)
2. **Hyperparameter tuning**: Most impactful (see cell 6 below)
3. **Ensemble**: Combine transformer + traditional ML models


# 6. Hyperparameter Tuning (Optional but Recommended)

**When to use:** After you've run the baseline training and want to improve F1 further.

**Strategy:** Test different hyperparameter combinations and pick the best one.
