# ModernBERT Fine-tuning for 20 Newsgroups Text Classification

An implementation of ModernBERT fine-tuning for multi-class text classification
on the 20 newsgroups dataset, tuned to fit a Kaggle T4 GPU (16 GB VRAM).

## 1. Design Decisions

| Parameter | Value | Justification |
|-----------|-------|---------------|
| **Model** | answerdotai/ModernBERT-base | Modern encoder with RoPE, 8192 context, pre-trained on 2T tokens |
| **Learning Rate** | 5e-5 | Upper end of the 2e-5 to 5e-5 range commonly used for BERT fine-tuning; paired with warmup |
| **Batch Size** | 32 | T4 16GB handles this with FP16 at seq len 256 |
| **Gradient Accum** | 2 | Effective batch size 64 for better convergence |
| **Epochs** | 3 | ModernBERT converges faster; avoids overfitting |
| **Max Length** | 256 | Good coverage of newsgroup posts within T4 memory budget |
| **Optimizer** | AdamW | Adam with decoupled weight decay; standard for transformer fine-tuning |
| **Scheduler** | Linear warmup + decay | Prevents early instability |
| **Layer Freezing** | Bottom 50% of encoder | Faster training, less memory, prevents overfitting |
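
The batch size, gradient accumulation, epoch, and warmup settings interact, and the resulting step counts can be sketched with plain arithmetic. This is only an illustration: the training-set size of 11,314 is an assumption (the usual 20 newsgroups train split), and `plan_steps` is a hypothetical helper that does not appear elsewhere in this notebook.

```python
import math

def plan_steps(num_examples, batch_size, grad_accum, epochs, warmup_ratio):
    """Derive step counts from the hyperparameters in the table above."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # One optimizer update per `grad_accum` batches; a partial group still updates.
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    total_updates = updates_per_epoch * epochs
    return {
        "effective_batch_size": batch_size * grad_accum,
        "batches_per_epoch": batches_per_epoch,
        "total_updates": total_updates,
        "warmup_updates": int(total_updates * warmup_ratio),
    }

# Assumed dataset size; the other values come from the table.
plan = plan_steps(num_examples=11_314, batch_size=32, grad_accum=2,
                  epochs=3, warmup_ratio=0.1)
print(plan)
```

Under these assumptions the run performs a few hundred optimizer updates in total, so a 10% warmup covers only a few dozen of them.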

## 2. Imports

In [None]:


import os
import json
import random
import time
from typing import Tuple
from dataclasses import dataclass, field

import numpy as np
import torch
from torch.utils.data import DataLoader
from torch.amp import GradScaler, autocast
from torch.optim import AdamW

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoConfig,
    get_linear_schedule_with_warmup
)
from sklearn.metrics import (
    accuracy_score,
    precision_recall_fscore_support,
    classification_report,
    confusion_matrix
)
from tqdm import tqdm

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    # Enable cuDNN auto-tuner for T4 optimization
    torch.backends.cudnn.benchmark = True

## 3. Configuration

All hyperparameters are centralized here with documented justifications.

In [None]:


"""
Configuration for ModernBERT Fine-tuning on 20 Newsgroups
(Optimized for Kaggle T4 GPU — 16 GB VRAM)

Design Decisions:
-----------------
1. Model: answerdotai/ModernBERT-base
   - Modern bidirectional encoder pre-trained on 2 trillion tokens
   - Uses Rotary Positional Embeddings (RoPE) and alternating local-global attention
   - Native 8192-token context window (we use 256 for efficiency)
   
2. Learning Rate: 5e-5
   - Upper end of the 2e-5 to 5e-5 range commonly used for BERT fine-tuning
   - Combined with warmup to prevent early instability
   
3. Batch Size: 32 (effective 64 with gradient accumulation)
   - T4 with 16 GB VRAM handles batch_size=32 at seq_len=256 with FP16
   - Gradient accumulation steps=2 gives effective batch of 64
   
4. Epochs: 3
   - ModernBERT converges faster than classic BERT
   - Prevents overfitting on 11K training examples
   
5. Max Sequence Length: 256
   - Better coverage of newsgroup posts than 128
   - Still memory-efficient on T4 with FP16
   
6. Layer Freezing: Bottom 50% of encoder layers
   - Freezes embeddings + first 11 of 22 layers
   - Drastically reduces trainable params and memory
   - Pre-trained features in lower layers transfer well

7. Warmup Ratio: 0.1
   - 10% of training steps for linear warmup
   
8. Weight Decay: 0.01
   - Decoupled weight decay as applied by AdamW (not an L2 penalty added to the loss)
"""

@dataclass
class Config:
    """Configuration class with all hyperparameters and settings."""
    
    # Model Configuration
    model_name: str = "answerdotai/ModernBERT-base"
    num_labels: int = 20
    
    # Training Hyperparameters
    learning_rate: float = 5e-5
    batch_size: int = 32  # T4 16GB can handle this with FP16 @ seq_len=256
    num_epochs: int = 3
    warmup_ratio: float = 0.1
    weight_decay: float = 0.01
    max_grad_norm: float = 1.0
    gradient_accumulation_steps: int = 2  # Effective batch size = 64
    
    # Layer Freezing
    freeze_layers: bool = True
    freeze_ratio: float = 0.5  # Freeze bottom 50% of encoder layers
    
    # Data Configuration
    dataset_name: str = "SetFit/20_newsgroups"
    max_length: int = 256  # Better coverage than 128; T4 handles it fine
    
    # Training Settings
    seed: int = 42
    use_fp16: bool = True
    save_model: bool = True
    output_dir: str = "./output"
    
    # Device (auto-detected)
    device: str = field(default_factory=lambda: "cuda" if torch.cuda.is_available() else "cpu")
    
    def __post_init__(self):
        if self.device == "cpu":
            self.use_fp16 = False
            
    def to_dict(self) -> dict:
        return {
            "model_name": self.model_name,
            "num_labels": self.num_labels,
            "learning_rate": self.learning_rate,
            "batch_size": self.batch_size,
            "effective_batch_size": self.batch_size * self.gradient_accumulation_steps,
            "num_epochs": self.num_epochs,
            "warmup_ratio": self.warmup_ratio,
            "weight_decay": self.weight_decay,
            "gradient_accumulation_steps": self.gradient_accumulation_steps,
            "max_length": self.max_length,
            "freeze_layers": self.freeze_layers,
            "freeze_ratio": self.freeze_ratio,
            "seed": self.seed,
            "use_fp16": self.use_fp16,
            "device": self.device,
        }

# Initialize configuration
config = Config()

print("Configuration:")
for key, value in config.to_dict().items():
    print(f"  {key}: {value}")
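
Because `Config` is a dataclass, variants for quick experiments can be derived from it without editing the defaults. The sketch below uses a hypothetical, stripped-down `MiniConfig` (not the notebook's `Config`) to show that `dataclasses.replace` builds a new object and re-runs `__post_init__`, so the CPU rule for FP16 still applies.

```python
from dataclasses import dataclass, replace

@dataclass
class MiniConfig:
    """Hypothetical stand-in for Config with only the fields needed here."""
    device: str = "cuda"
    use_fp16: bool = True
    learning_rate: float = 5e-5

    def __post_init__(self):
        # Same rule as the notebook's Config: no FP16 autocast on CPU.
        if self.device == "cpu":
            self.use_fp16 = False

gpu_cfg = MiniConfig()
cpu_cfg = replace(gpu_cfg, device="cpu")           # re-runs __post_init__
sweep_cfg = replace(gpu_cfg, learning_rate=2e-5)   # original object is untouched

print(cpu_cfg.use_fp16, sweep_cfg.learning_rate, gpu_cfg.learning_rate)
```

The same pattern should work with the full `Config`, for example to try a lower learning rate while keeping every other setting fixed.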

## 4. Data Loading

Load the 20 newsgroups dataset from the Hugging Face Hub and tokenize it with the ModernBERT tokenizer.

In [None]:


"""
Data Loading and Preprocessing for 20 Newsgroups

Design Decisions:
-----------------
1. Dataset Source: SetFit/20_newsgroups from HuggingFace
   - Pre-cleaned version with headers, signatures, and quotations removed
   - Follows scikit-learn's advice to strip this metadata so the classifier cannot lean on it
   
2. Tokenization Strategy:
   - Use ModernBERT tokenizer (handles [CLS]/[SEP] automatically)
   - Truncate to max_length (256)
   - Pad to max_length so every batch has the same fixed shape
   
3. DataLoader:
   - 4 workers (Kaggle T4 instances have 4 vCPUs)
   - Pin memory for faster GPU transfers
"""

def get_label_names() -> list:
    """Get the 20 newsgroup category names."""
    return [
        'alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
        'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x',
        'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball',
        'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med',
        'sci.space', 'soc.religion.christian', 'talk.politics.guns',
        'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc'
    ]


def load_and_prepare_data(config) -> Tuple[DataLoader, DataLoader]:
    """Load 20 newsgroups dataset, tokenize, and create DataLoaders."""
    print(f"Loading dataset: {config.dataset_name}")
    
    # Load raw dataset from HuggingFace
    dataset = load_dataset(config.dataset_name)
    
    train_dataset = dataset['train']
    test_dataset = dataset['test']
    
    print(f"Train size: {len(train_dataset)}")
    print(f"Test size: {len(test_dataset)}")
    
    # Initialize tokenizer
    tokenizer = AutoTokenizer.from_pretrained(config.model_name)
    
    def tokenize_function(examples):
        return tokenizer(
            examples['text'],
            truncation=True,
            padding='max_length',
            max_length=config.max_length,
            return_tensors=None
        )
    
    # Apply tokenization
    print("Tokenizing datasets...")
    train_dataset = train_dataset.map(tokenize_function, batched=True, desc="Tokenizing train")
    test_dataset = test_dataset.map(tokenize_function, batched=True, desc="Tokenizing test")
    
    # Set format for PyTorch
    columns = ['input_ids', 'attention_mask', 'label']
    train_dataset.set_format(type='torch', columns=columns)
    test_dataset.set_format(type='torch', columns=columns)
    
    # Create DataLoaders (4 workers for Kaggle T4 which has 4 vCPUs)
    num_workers = 4 if config.device == 'cuda' else 0
    
    train_loader = DataLoader(
        train_dataset,
        batch_size=config.batch_size,
        shuffle=True,
        num_workers=num_workers,
        pin_memory=True if config.device == 'cuda' else False
    )
    
    test_loader = DataLoader(
        test_dataset,
        batch_size=config.batch_size,
        shuffle=False,
        num_workers=num_workers,
        pin_memory=True if config.device == 'cuda' else False
    )
    
    print(f"Created DataLoaders - Train batches: {len(train_loader)}, Test batches: {len(test_loader)}")
    
    return train_loader, test_loader


# Load data
train_loader, test_loader = load_and_prepare_data(config)
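
What `truncation=True` combined with `padding='max_length'` does to each example can be sketched in a few lines. This is a conceptual illustration, not the tokenizer's implementation: the token ids are made up, `pad_id=0` is a placeholder rather than ModernBERT's real padding id, and the real tokenizer also keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]              # drop anything past max_length
    num_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * num_pad      # right-pad short sequences
    attention_mask = [1] * len(kept) + [0] * num_pad   # 0 marks padding
    return input_ids, attention_mask

# Made-up token ids, chosen only to show the two cases.
short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore padded positions, which is why the DataLoader carries it alongside `input_ids`.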

## 5. Model

Initialize the ModernBERT model with a classification head and optional layer freezing.

In [None]:


"""
ModernBERT Model for Text Classification

Design Decisions:
-----------------
1. Architecture: ModernBERT + SequenceClassification head
   - Pre-trained ModernBERT-base (149M params) + classification head
   - Pools the encoder output as set by config.classifier_pooling ("cls" or "mean")
   
2. Layer Freezing:
   - Freeze embeddings and bottom 50% of encoder layers
   - Only fine-tune top layers + classification head
   - Reduces trainable params from ~149M to roughly 56M
     (the frozen embedding matrix alone is ~39M)
   - Faster training, less memory, better regularization on small datasets
   
3. Initialization:
   - Load pre-trained weights from HuggingFace Hub
   - Classification head initialized randomly (will be trained)
"""

def get_model(config):
    """Initialize ModernBERT model for sequence classification with optional layer freezing."""
    print(f"Loading model: {config.model_name}")
    print(f"Number of classes: {config.num_labels}")
    
    model_config = AutoConfig.from_pretrained(
        config.model_name,
        num_labels=config.num_labels,
        finetuning_task="text-classification"
    )
    
    model = AutoModelForSequenceClassification.from_pretrained(
        config.model_name,
        config=model_config
    )
    
    # Layer freezing for efficiency on T4
    if config.freeze_layers:
        # Freeze embeddings
        if hasattr(model, 'model') and hasattr(model.model, 'embeddings'):
            for param in model.model.embeddings.parameters():
                param.requires_grad = False
            print("  Froze embedding layer")
        elif hasattr(model, 'bert') and hasattr(model.bert, 'embeddings'):
            for param in model.bert.embeddings.parameters():
                param.requires_grad = False
            print("  Froze embedding layer")
        
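        # Freeze bottom encoder layers
        # ModernBERT exposes its blocks directly as model.model.layers;
        # classic BERT keeps them under model.bert.encoder.layer.
        encoder_layers = None
        if hasattr(model, 'model') and hasattr(model.model, 'layers'):
            encoder_layers = model.model.layers
        elif hasattr(model, 'model') and hasattr(model.model, 'encoder'):
            encoder = model.model.encoder
            if hasattr(encoder, 'layers'):
                encoder_layers = encoder.layers
            elif hasattr(encoder, 'layer'):
                encoder_layers = encoder.layer
        elif hasattr(model, 'bert') and hasattr(model.bert, 'encoder'):
            encoder = model.bert.encoder
            if hasattr(encoder, 'layer'):
                encoder_layers = encoder.layer
        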
        if encoder_layers is not None:
            num_layers = len(encoder_layers)
            num_freeze = int(num_layers * config.freeze_ratio)
            for i, layer in enumerate(encoder_layers):
                if i < num_freeze:
                    for param in layer.parameters():
                        param.requires_grad = False
            print(f"  Froze {num_freeze}/{num_layers} encoder layers")
        else:
            print("  Warning: Could not identify encoder layers for freezing")
    
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen_params = total_params - trainable_params
    
    print(f"Total parameters: {total_params:,}")
    print(f"Trainable parameters: {trainable_params:,} ({100*trainable_params/total_params:.1f}%)")
    print(f"Frozen parameters: {frozen_params:,} ({100*frozen_params/total_params:.1f}%)")
    
    return model


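# Seed first so the randomly initialized classification head is reproducible
torch.manual_seed(config.seed)
# Initialize model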
model = get_model(config)
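
The effect of freezing on the trainable-parameter count can be estimated by hand. The sketch below is a rough estimate under stated assumptions: the dimensions (hidden size 768, intermediate size 1152, 22 layers, vocabulary 50,368) are taken to be those of ModernBERT-base, bias terms are assumed absent, and normalization weights and the classification head are ignored. `estimate_params` is a hypothetical helper; the counts printed by `get_model` are the authoritative ones.

```python
def estimate_params(hidden=768, intermediate=1152, num_layers=22,
                    vocab=50_368, freeze_ratio=0.5):
    """Rough weight-matrix count; norms and the classification head are ignored."""
    embeddings = vocab * hidden
    attention = hidden * 3 * hidden + hidden * hidden          # fused QKV + output
    mlp = hidden * 2 * intermediate + intermediate * hidden    # gated input + output
    per_layer = attention + mlp
    num_frozen = int(num_layers * freeze_ratio)                # same rule as get_model
    total = embeddings + num_layers * per_layer
    trainable = (num_layers - num_frozen) * per_layer          # embeddings frozen too
    return total, trainable, num_frozen

total, trainable, num_frozen = estimate_params()
print(f"total ~{total / 1e6:.0f}M, trainable ~{trainable / 1e6:.0f}M, "
      f"frozen layers: {num_frozen}")
```

Because the embedding matrix is large relative to a single encoder block, freezing it together with half of the layers leaves well under half of the parameters trainable.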

## 6. Trainer

Training loop with AdamW optimizer, warmup scheduler, gradient accumulation, and FP16.

In [None]:


"""
Training Module for ModernBERT Fine-tuning (T4 Optimized)

Design Decisions:
-----------------
1. Optimizer: AdamW
   - Adam with decoupled weight decay (decay is applied to the weights directly,
     not folded into the gradient as an L2 penalty)
   - Standard choice for transformer fine-tuning
   
2. Scheduler: Linear warmup + linear decay
   - Warmup prevents early instability with pre-trained models
   - Linear decay gradually reduces learning rate
   
3. Mixed Precision (FP16):
   - Runs matrix multiplies on the T4's FP16 tensor cores, which typically speeds up training
   - Lowers activation memory, leaving room for larger batches
   - Uses modern torch.amp API (not deprecated torch.cuda.amp)
   
4. Gradient Accumulation:
   - Steps=2 gives effective batch size of 64
   - Better convergence without extra memory
   
5. Gradient Clipping:
   - Max norm = 1.0 (standard for transformers)
   - Prevents exploding gradients
"""

class Trainer:
    """Trainer class for ModernBERT fine-tuning on T4 GPU."""
    
    def __init__(self, model, config, train_loader):
        self.model = model
        self.config = config
        self.train_loader = train_loader
        self.device = config.device
        self.grad_accum_steps = config.gradient_accumulation_steps
        
        self.model.to(self.device)
        
        # Only optimize trainable parameters
        self.optimizer = AdamW(
            filter(lambda p: p.requires_grad, self.model.parameters()),
            lr=config.learning_rate,
            weight_decay=config.weight_decay
        )
        
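        # Total optimizer steps accounts for gradient accumulation; round up because
        # train_epoch also steps on a leftover partial accumulation group
        steps_per_epoch = -(-len(train_loader) // self.grad_accum_steps)
        self.total_steps = steps_per_epoch * config.num_epochs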
        self.warmup_steps = int(self.total_steps * config.warmup_ratio)
        
        self.scheduler = get_linear_schedule_with_warmup(
            self.optimizer,
            num_warmup_steps=self.warmup_steps,
            num_training_steps=self.total_steps
        )
        
        # Modern AMP API
        self.scaler = GradScaler("cuda") if config.use_fp16 else None
        self.use_fp16 = config.use_fp16
        
        self.history = {'train_loss': [], 'learning_rate': []}
        
        print(f"\nTraining Configuration:")
        print(f"  Device: {self.device}")
        print(f"  Total optimizer steps: {self.total_steps}")
        print(f"  Warmup steps: {self.warmup_steps}")
        print(f"  Gradient accumulation steps: {self.grad_accum_steps}")
        print(f"  Effective batch size: {config.batch_size * self.grad_accum_steps}")
        print(f"  Mixed precision (FP16): {self.use_fp16}")
    
    def train_epoch(self, epoch):
        """Train for one epoch with gradient accumulation."""
        self.model.train()
        total_loss = 0
        num_batches = 0
        
        progress_bar = tqdm(
            self.train_loader,
            desc=f"Epoch {epoch+1}/{self.config.num_epochs}",
            leave=True
        )
        
        self.optimizer.zero_grad()
        
        for step, batch in enumerate(progress_bar):
            input_ids = batch['input_ids'].to(self.device)
            attention_mask = batch['attention_mask'].to(self.device)
            labels = batch['label'].to(self.device)
            
            if self.use_fp16:
                with autocast("cuda"):
                    outputs = self.model(
                        input_ids=input_ids,
                        attention_mask=attention_mask,
                        labels=labels
                    )
                    loss = outputs.loss / self.grad_accum_steps
                
                self.scaler.scale(loss).backward()
                
                if (step + 1) % self.grad_accum_steps == 0:
                    self.scaler.unscale_(self.optimizer)
                    torch.nn.utils.clip_grad_norm_(
                        filter(lambda p: p.requires_grad, self.model.parameters()),
                        self.config.max_grad_norm
                    )
                    self.scaler.step(self.optimizer)
                    self.scaler.update()
                    self.scheduler.step()
                    self.optimizer.zero_grad()
            else:
                outputs = self.model(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    labels=labels
                )
                loss = outputs.loss / self.grad_accum_steps
                
                loss.backward()
                
                if (step + 1) % self.grad_accum_steps == 0:
                    torch.nn.utils.clip_grad_norm_(
                        filter(lambda p: p.requires_grad, self.model.parameters()),
                        self.config.max_grad_norm
                    )
                    self.optimizer.step()
                    self.scheduler.step()
                    self.optimizer.zero_grad()
            
            total_loss += loss.item() * self.grad_accum_steps  # Undo the division for logging
            num_batches += 1
            
            progress_bar.set_postfix({
                'loss': f'{loss.item() * self.grad_accum_steps:.4f}',
                'lr': f'{self.scheduler.get_last_lr()[0]:.2e}'
            })
        
        # Handle remaining gradients if batches not divisible by accum steps
        remaining = len(self.train_loader) % self.grad_accum_steps
        if remaining != 0:
            if self.use_fp16:
                self.scaler.unscale_(self.optimizer)
                torch.nn.utils.clip_grad_norm_(
                    filter(lambda p: p.requires_grad, self.model.parameters()),
                    self.config.max_grad_norm
                )
                self.scaler.step(self.optimizer)
                self.scaler.update()
            else:
                torch.nn.utils.clip_grad_norm_(
                    filter(lambda p: p.requires_grad, self.model.parameters()),
                    self.config.max_grad_norm
                )
                self.optimizer.step()
            self.scheduler.step()
            self.optimizer.zero_grad()
        
        return total_loss / num_batches
    
    def train(self):
        """Full training loop."""
        print("\n" + "="*60)
        print("Starting Training")
        print("="*60 + "\n")
        
        start_time = time.time()
        
        for epoch in range(self.config.num_epochs):
            epoch_start = time.time()
            
            train_loss = self.train_epoch(epoch)
            self.history['train_loss'].append(train_loss)
            
            current_lr = self.scheduler.get_last_lr()[0]
            self.history['learning_rate'].append(current_lr)
            
            epoch_time = time.time() - epoch_start
            
            print(f"\nEpoch {epoch+1}/{self.config.num_epochs} - "
                  f"Train Loss: {train_loss:.4f} - "
                  f"LR: {current_lr:.2e} - "
                  f"Time: {epoch_time:.1f}s")
        
        total_time = time.time() - start_time
        print(f"\nTraining Complete! Total time: {total_time/60:.1f} minutes")
        
        return self.history
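
The reason `train_epoch` divides each loss by `grad_accum_steps` can be checked numerically. A minimal sketch with made-up per-example losses: summing the scaled micro-batch means reproduces the mean over the full effective batch, and because gradients are linear in the loss, the accumulated gradient matches too. This holds exactly only when the micro-batches are the same size.

```python
def mean(values):
    return sum(values) / len(values)

def accumulated_loss(per_example_losses, micro_batch_size):
    """Sum of (micro-batch mean / number of micro-batches), as in train_epoch."""
    chunks = [per_example_losses[i:i + micro_batch_size]
              for i in range(0, len(per_example_losses), micro_batch_size)]
    return sum(mean(chunk) / len(chunks) for chunk in chunks)

losses = [0.9, 1.1, 0.4, 0.6, 2.0, 1.0, 0.5, 1.5]   # made-up per-example losses
full_batch = mean(losses)                            # one batch of 8
accumulated = accumulated_loss(losses, micro_batch_size=4)   # two batches of 4
print(full_batch, accumulated)
```

Without the division, the accumulated gradient would be `grad_accum_steps` times larger, which acts like an unintended increase in the learning rate.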

## 7. Train the Model

In [None]:


# Set random seed for reproducibility
def set_seed(seed: int):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(config.seed)

# Initialize trainer and train
trainer = Trainer(model, config, train_loader)
history = trainer.train()
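
The learning-rate trajectory followed during training can be sketched as a plain function of the optimizer step. This re-implements the shape of a linear warmup followed by linear decay for illustration only; it is not the library's code. The step counts (531 total, 53 warmup) are assumptions derived from 354 batches per epoch, accumulation of 2, and 3 epochs.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)           # ramp up from 0 to 1
    # Decay linearly from 1 at the end of warmup to 0 at total_steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

BASE_LR = 5e-5
TOTAL_STEPS, WARMUP_STEPS = 531, 53   # assumed step counts, see lead-in

for step in (0, 26, 53, 292, 531):
    lr = BASE_LR * linear_warmup_decay(step, WARMUP_STEPS, TOTAL_STEPS)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

The peak of 5e-5 is reached only at the end of warmup, so the average learning rate over the run is roughly half the configured value.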

## 8. Evaluation

Evaluate the trained model on the test set.

In [None]:


"""
Evaluation Module for ModernBERT Fine-tuning

Metrics:
- Accuracy: Overall correctness
- Macro F1: Balanced performance across all classes
- Per-class metrics: Identify weak categories
- Confusion matrix: Reveals class confusion patterns
"""

@torch.no_grad()
def evaluate(model, test_loader, config):
    """Comprehensive evaluation on test set."""
    print("\n" + "="*60)
    print("Evaluating on Test Set")
    print("="*60 + "\n")
    
    model.eval()
    model.to(config.device)
    
    all_predictions = []
    all_labels = []
    total_loss = 0
    
    for batch in tqdm(test_loader, desc="Evaluating"):
        input_ids = batch['input_ids'].to(config.device)
        attention_mask = batch['attention_mask'].to(config.device)
        labels = batch['label'].to(config.device)
        
        if config.use_fp16:
            with autocast("cuda"):
                outputs = model(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    labels=labels
                )
        else:
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=labels
            )
        
        total_loss += outputs.loss.item()
        predictions = torch.argmax(outputs.logits, dim=-1)
        
        all_predictions.extend(predictions.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())
    
    all_predictions = np.array(all_predictions)
    all_labels = np.array(all_labels)
    
    # Calculate metrics
    accuracy = accuracy_score(all_labels, all_predictions)
    precision_macro, recall_macro, f1_macro, _ = precision_recall_fscore_support(
        all_labels, all_predictions, average='macro'
    )
    precision_weighted, recall_weighted, f1_weighted, _ = precision_recall_fscore_support(
        all_labels, all_predictions, average='weighted'
    )
    avg_loss = total_loss / len(test_loader)
    
    label_names = get_label_names()
    report = classification_report(all_labels, all_predictions, target_names=label_names, digits=4)
    conf_matrix = confusion_matrix(all_labels, all_predictions)
    
    # Print results
    print("\n" + "="*60)
    print("EVALUATION RESULTS")
    print("="*60)
    print(f"\n[Overall Metrics]")
    print(f"  Test Loss: {avg_loss:.4f}")
    print(f"  Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
    print(f"\n[Macro Averages]")
    print(f"  Precision: {precision_macro:.4f}")
    print(f"  Recall: {recall_macro:.4f}")
    print(f"  F1 Score: {f1_macro:.4f}")
    print(f"\n[Weighted Averages]")
    print(f"  Precision: {precision_weighted:.4f}")
    print(f"  Recall: {recall_weighted:.4f}")
    print(f"  F1 Score: {f1_weighted:.4f}")
    print("\n" + "="*60)
    print("CLASSIFICATION REPORT (Per-Class)")
    print("="*60)
    print(report)
    
    return {
        'test_loss': avg_loss,
        'accuracy': accuracy,
        'precision_macro': precision_macro,
        'recall_macro': recall_macro,
        'f1_macro': f1_macro,
        'precision_weighted': precision_weighted,
        'recall_weighted': recall_weighted,
        'f1_weighted': f1_weighted,
        'classification_report': report,
        'confusion_matrix': conf_matrix,
        'predictions': all_predictions,
        'labels': all_labels,
        'label_names': label_names
    }


# Evaluate
results = evaluate(model, test_loader, config)
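
The difference between the macro and weighted averages reported above is easiest to see on a toy example. The sketch below recomputes both from scratch with hypothetical helpers (`f1_per_class`, `macro_and_weighted_f1`) on made-up labels; it illustrates the definitions and is not a replacement for the scikit-learn calls used in `evaluate`.

```python
from collections import Counter

def f1_per_class(labels, preds):
    """Per-class F1 from parallel lists of true and predicted labels."""
    scores = {}
    for cls in sorted(set(labels)):
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores

def macro_and_weighted_f1(labels, preds):
    scores = f1_per_class(labels, preds)
    support = Counter(labels)                       # true examples per class
    macro = sum(scores.values()) / len(scores)      # every class counts equally
    weighted = sum(scores[c] * support[c] for c in scores) / len(labels)
    return macro, weighted

# Toy data: class 0 is frequent and easy, class 1 is rare and often missed.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
preds  = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
macro, weighted = macro_and_weighted_f1(labels, preds)
print(f"macro F1 = {macro:.3f}, weighted F1 = {weighted:.3f}")
```

Macro F1 drops more sharply when a small class is handled badly, which is why it is the more informative number for spotting weak newsgroup categories.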

## 9. Final Summary

In [None]:


print("\n" + "="*70)
print("FINAL RESULTS SUMMARY")
print("="*70)
print(f"  Model: {config.model_name}")
print(f"  Test Accuracy: {results['accuracy']:.4f} ({results['accuracy']*100:.2f}%)")
print(f"  Macro F1 Score: {results['f1_macro']:.4f}")
print(f"  Weighted F1 Score: {results['f1_weighted']:.4f}")
print("="*70)

## 10. Save Model (Optional)

In [None]:


if config.save_model:
    os.makedirs(config.output_dir, exist_ok=True)
    
    # Save model
    model_path = os.path.join(config.output_dir, "model")
    model.save_pretrained(model_path)
    print(f"Model saved to: {model_path}")
    
    # Save tokenizer for easy reloading
    tokenizer = AutoTokenizer.from_pretrained(config.model_name)
    tokenizer.save_pretrained(model_path)
    print(f"Tokenizer saved to: {model_path}")
    
    # Save training history
    history_path = os.path.join(config.output_dir, "training_history.json")
    with open(history_path, 'w') as f:
        json.dump(history, f, indent=2)
    print(f"Training history saved to: {history_path}")
    
    # Save evaluation metrics
    metrics = {
        'model': config.model_name,
        'test_loss': results['test_loss'],
        'accuracy': results['accuracy'],
        'precision_macro': results['precision_macro'],
        'recall_macro': results['recall_macro'],
        'f1_macro': results['f1_macro'],
        'precision_weighted': results['precision_weighted'],
        'recall_weighted': results['recall_weighted'],
        'f1_weighted': results['f1_weighted'],
    }
    metrics_path = os.path.join(config.output_dir, "evaluation_metrics.json")
    with open(metrics_path, 'w') as f:
        json.dump(metrics, f, indent=2)
    print(f"Evaluation metrics saved to: {metrics_path}")