## Installation

In [None]:
# Install MPDistil
!pip install git+https://github.com/parmanu-lcs2/mpdistil.git -q

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for mpdistil (pyproject.toml) ... done


## Import Libraries

In [2]:
!pip install evaluate
from mpdistil import MPDistil, load_superglue_dataset
import torch

print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.6
PyTorch: 2.9.0+cu126
CUDA available: True


---

## Example 1: Accuracy Metric (Default)

**Task:** CB (CommitmentBank) classification  
**Metric:** Accuracy using `evaluate.load('accuracy')`  
**Use Case:** Standard classification tasks
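As a rough illustration of what the accuracy metric measures, the sketch below computes it in plain Python. The helper name `accuracy` and the toy labels are assumptions made for illustration only; MPDistil itself obtains the value through `evaluate.load('accuracy')`, as noted above.

```python
def accuracy(predictions, references):
    """Fraction of predictions that equal their reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Toy 3-class example (CB has three labels); these values are made up.
toy_references = [0, 1, 2, 1, 0]
toy_predictions = [0, 1, 1, 1, 2]
toy_acc = accuracy(toy_predictions, toy_references)
print(f"acc = {toy_acc:.2f}")
```

Three of the five toy predictions match, so the sketch reports 0.60.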

In [3]:

print("Loading CB dataset...")
loaders, num_labels = load_superglue_dataset('CB', max_samples=100)

print(f"Dataset loaded: {num_labels} labels")
print(f"Train batches: {len(loaders['train'])}")
print(f"Val batches: {len(loaders['val'])}")

Loading CB dataset...


Dataset loaded: 3 labels
Train batches: 13
Val batches: 7


In [4]:
# Initialize model with default accuracy metric
model_accuracy = MPDistil(
    task_name='CB',
    num_labels=num_labels,
    metric='accuracy',
    student_layers=6,
    device='auto'
)

print(f"Model initialized!")
print(f"Using metric: {model_accuracy.metric}")

Using device: cuda
Model initialized!
Using metric: accuracy


In [6]:
# Train the model
print("\nTraining with accuracy metric...")
history_accuracy = model_accuracy.train(
    train_loader=loaders['train'],
    val_loader=loaders['val'],
    teacher_epochs=2, student_epochs=2
)

print("\nTraining complete!")

# Extract accuracy from Phase 2 (Student) metrics
phase2_metrics = history_accuracy['phase2']['val_metrics']
accuracies = [m['acc'] for m in phase2_metrics]
print(f"Best accuracy: {max(accuracies):.4f}")


Training with accuracy metric...

Validating DataLoaders...

Preparing task loaders...
Tasks: ['CB']
Label counts: {'cb': 3}

Initializing models...
Slicing student model to 6 layers (original: 12)




Model Sizes:
  Teacher: 109,485,316 parameters
  Student: 66,958,084 parameters (61.2% of teacher)
  Action:  769 parameters

Starting MPDistil Training

=== Phase 1: Teacher Fine-tuning ===


Phase 1 Epoch 1/2: 100%|██████████| 10/10 [00:02<00:00,  4.88it/s, loss=1.24]



Epoch 1: Train Loss=1.1375, Val Metrics={'acc': 0.42, 'val_loss': 1.2858424711227416, 'task': 'CB'}


Phase 1 Epoch 2/2: 100%|██████████| 10/10 [00:01<00:00,  9.06it/s, loss=0.645]


Epoch 2: Train Loss=0.7843, Val Metrics={'acc': 0.46, 'val_loss': 1.0241307878494263, 'task': 'CB'}

=== Phase 2: Student Knowledge Distillation ===


  pkd_loss = F.mse_loss(s_features, t_features, reduction="mean")
Phase 2 Epoch 1/2: 100%|██████████| 10/10 [00:00<00:00, 10.47it/s, total_loss=0.53, task_loss=0.543]


Epoch 1: Train Loss=0.6606, Task Loss=0.8020, Val Metrics={'acc': 0.42, 'val_loss': 1.0228382349014282, 'task': 'CB'}


  pkd_loss = F.mse_loss(s_features, t_features, reduction="mean")
Phase 2 Epoch 2/2: 100%|██████████| 10/10 [00:00<00:00, 11.66it/s, total_loss=0.606, task_loss=0.707]


Epoch 2: Train Loss=0.6164, Task Loss=0.7195, Val Metrics={'acc': 0.44, 'val_loss': 0.9910922050476074, 'task': 'CB'}

=== Phase 3: Meta-Teacher Learning ===


Phase 3: Meta-Teacher: 100%|██████████| 3/3 [00:00<00:00, 29.37it/s, meta_loss=0.38]


Phase 3: Meta Loss=0.7203, Val Metrics={'acc': 0.42, 'val_loss': 1.1109878635406494, 'task': 'CB'}

Training Complete!

Training complete!


---

## Example 2: F1 Score Metric

**Metric:** Macro F1 + Accuracy using `evaluate.load('f1')` and `evaluate.load('accuracy')`  
**Use Case:** Imbalanced classification, multi-class tasks  
**Returns:** `acc`, `f1`, and `acc_and_f1` (average)
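To make the three returned values concrete, here is a minimal plain-Python sketch of macro F1 and the `acc_and_f1` average. The helper `macro_f1` and the label lists are hypothetical and exist only for this illustration; the library computes these values through `evaluate`, and its exact averaging options are not shown in this notebook.

```python
def macro_f1(predictions, references):
    """Unweighted mean of per-class F1 over every label that appears."""
    labels = sorted(set(predictions) | set(references))
    per_class = []
    for label in labels:
        tp = sum(p == label and r == label for p, r in zip(predictions, references))
        fp = sum(p == label and r != label for p, r in zip(predictions, references))
        fn = sum(p != label and r == label for p, r in zip(predictions, references))
        denom = 2 * tp + fp + fn
        # A class with no true or predicted examples contributes 0.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)

# Hypothetical labels: 50 validation examples, 21 of them in class 1,
# and a model that predicts class 1 for every example.
references = [0] * 23 + [1] * 21 + [2] * 6
predictions = [1] * 50

acc = sum(p == r for p, r in zip(predictions, references)) / len(references)
f1 = macro_f1(predictions, references)
acc_and_f1 = (acc + f1) / 2
print(f"acc={acc:.4f} f1={f1:.4f} acc_and_f1={acc_and_f1:.4f}")
```

With this made-up split, accuracy is 0.42 while macro F1 is only about 0.197, because two of the three classes score an F1 of zero. This is the typical signature of a classifier that has collapsed to a single class, and it is why macro F1 is the more informative number on imbalanced data.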

In [7]:
# Initialize model with F1 metric
model_f1 = MPDistil(
    task_name='CB',
    num_labels=num_labels,
    metric='f1',
    student_layers=6,
    device='auto'
)

print(f"Using metric: {model_f1.metric}")

Using device: cuda
Using metric: f1


In [9]:
# Train with F1 metric
print("Training with F1 metric...")
history_f1 = model_f1.train(
    train_loader=loaders['train'],
    val_loader=loaders['val'],
    teacher_epochs=2, student_epochs=2
)

print(f"\nTraining complete!")

# Extract F1 and accuracy from Phase 2 (Student) metrics
phase2_metrics = history_f1['phase2']['val_metrics']
accuracies = [m['acc'] for m in phase2_metrics]
f1_scores = [m['f1'] for m in phase2_metrics]
print(f"Best accuracy: {max(accuracies):.4f}")
print(f"Best F1 score: {max(f1_scores):.4f}")

Training with F1 metric...

Validating DataLoaders...

Preparing task loaders...
Tasks: ['CB']
Label counts: {'cb': 3}

Initializing models...
Slicing student model to 6 layers (original: 12)

Model Sizes:
  Teacher: 109,485,316 parameters
  Student: 66,958,084 parameters (61.2% of teacher)
  Action:  769 parameters

Starting MPDistil Training

=== Phase 1: Teacher Fine-tuning ===


Phase 1 Epoch 1/2: 100%|██████████| 10/10 [00:01<00:00,  9.01it/s, loss=1.24]



Epoch 1: Train Loss=1.1375, Val Metrics={'acc': 0.42, 'f1': 0.19718309859154928, 'acc_and_f1': 0.30859154929577465, 'val_loss': 1.2858424711227416, 'task': 'CB'}


Phase 1 Epoch 2/2: 100%|██████████| 10/10 [00:01<00:00,  9.03it/s, loss=0.645]


Epoch 2: Train Loss=0.7843, Val Metrics={'acc': 0.46, 'f1': 0.2522812667740204, 'acc_and_f1': 0.3561406333870102, 'val_loss': 1.0241307878494263, 'task': 'CB'}

=== Phase 2: Student Knowledge Distillation ===


  pkd_loss = F.mse_loss(s_features, t_features, reduction="mean")
Phase 2 Epoch 1/2: 100%|██████████| 10/10 [00:00<00:00, 11.56it/s, total_loss=0.53, task_loss=0.543]


Epoch 1: Train Loss=0.6606, Task Loss=0.8020, Val Metrics={'acc': 0.42, 'f1': 0.19718309859154928, 'acc_and_f1': 0.30859154929577465, 'val_loss': 1.0228382349014282, 'task': 'CB'}


  pkd_loss = F.mse_loss(s_features, t_features, reduction="mean")
Phase 2 Epoch 2/2: 100%|██████████| 10/10 [00:00<00:00, 11.74it/s, total_loss=0.606, task_loss=0.707]


Epoch 2: Train Loss=0.6164, Task Loss=0.7195, Val Metrics={'acc': 0.44, 'f1': 0.24369747899159666, 'acc_and_f1': 0.3418487394957983, 'val_loss': 0.9910922050476074, 'task': 'CB'}

=== Phase 3: Meta-Teacher Learning ===


Phase 3: Meta-Teacher: 100%|██████████| 3/3 [00:00<00:00, 30.60it/s, meta_loss=0.38]


Phase 3: Meta Loss=0.7203, Val Metrics={'acc': 0.42, 'f1': 0.19718309859154928, 'acc_and_f1': 0.30859154929577465, 'val_loss': 1.1109878635406494, 'task': 'CB'}

Training Complete!

Training complete!
Best accuracy: 0.4400
Best F1 score: 0.2437


---

## Example 3: Matthews Correlation Coefficient (MCC)

**Metric:** MCC using `evaluate.load('matthews_correlation')`  
**Use Case:** Binary classification with imbalanced data  
**Range:** -1 (worst) to +1 (perfect)
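For reference, the binary MCC can be written directly from the confusion-matrix counts. The sketch below is a plain-Python illustration under stated assumptions: the function name `binary_mcc` and the toy labels are made up, and returning 0.0 when the denominator vanishes is a common convention that this sketch assumes rather than something confirmed for MPDistil.

```python
import math

def binary_mcc(predictions, references):
    """Matthews correlation coefficient for labels in {0, 1}."""
    pairs = list(zip(predictions, references))
    tp = sum(p == 1 and r == 1 for p, r in pairs)
    tn = sum(p == 0 and r == 0 for p, r in pairs)
    fp = sum(p == 1 and r == 0 for p, r in pairs)
    fn = sum(p == 0 and r == 1 for p, r in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: an undefined MCC (zero denominator) is reported as 0.
    return (tp * tn - fp * fn) / denom if denom else 0.0

toy_labels = [1, 1, 0, 0]
mcc_perfect = binary_mcc([1, 1, 0, 0], toy_labels)   # every prediction correct
mcc_inverted = binary_mcc([0, 0, 1, 1], toy_labels)  # every prediction wrong
mcc_constant = binary_mcc([1, 1, 1, 1], toy_labels)  # always predicts class 1
print(mcc_perfect, mcc_inverted, mcc_constant)
```

The three toy cases cover the ends and the middle of the range: perfect agreement gives +1, fully inverted predictions give -1, and a constant predictor gives 0 regardless of how imbalanced the labels are.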

In [10]:
# Load binary classification task (RTE)
print("Loading RTE dataset...")
loaders_rte, num_labels_rte = load_superglue_dataset('RTE', max_samples=100)

print(f"RTE loaded: {num_labels_rte} labels (binary classification)")

Loading RTE dataset...


RTE loaded: 2 labels (binary classification)


In [11]:
# Initialize with MCC metric
model_mcc = MPDistil(
    task_name='RTE',
    num_labels=num_labels_rte,
    metric='mcc',
    student_layers=6,
    device='auto'
)

print(f"Using metric: {model_mcc.metric}")

Using device: cuda
Using metric: mcc


In [13]:
# Train with MCC metric
print("Training with MCC metric...")
history_mcc = model_mcc.train(
    train_loader=loaders_rte['train'],
    val_loader=loaders_rte['val'],
    teacher_epochs=2, student_epochs=2
)

print(f"\nTraining complete!")


# Extract MCC from Phase 2 (Student) metrics
phase2_metrics = history_mcc['phase2']['val_metrics']
mcc_scores = [m['mcc'] for m in phase2_metrics]
print(f"Best MCC: {max(mcc_scores):.4f}")

Training with MCC metric...

Validating DataLoaders...

Preparing task loaders...
Tasks: ['RTE']
Label counts: {'rte': 2}

Initializing models...
Slicing student model to 6 layers (original: 12)

Model Sizes:
  Teacher: 109,484,547 parameters
  Student: 66,957,315 parameters (61.2% of teacher)
  Action:  769 parameters

Starting MPDistil Training

=== Phase 1: Teacher Fine-tuning ===


Phase 1 Epoch 1/2: 100%|██████████| 10/10 [00:01<00:00,  8.95it/s, loss=0.774]



Epoch 1: Train Loss=0.7330, Val Metrics={'mcc': 0.1091089451179962, 'val_loss': 0.6927059817314148, 'task': 'RTE'}


Phase 1 Epoch 2/2: 100%|██████████| 10/10 [00:01<00:00,  9.00it/s, loss=0.556]


Epoch 2: Train Loss=0.6124, Val Metrics={'mcc': 0.12038585308576918, 'val_loss': 0.6905442714691162, 'task': 'RTE'}

=== Phase 2: Student Knowledge Distillation ===


  pkd_loss = F.mse_loss(s_features, t_features, reduction="mean")
Phase 2 Epoch 1/2: 100%|██████████| 10/10 [00:00<00:00, 11.55it/s, total_loss=0.577, task_loss=0.652]


Epoch 1: Train Loss=0.6122, Task Loss=0.7121, Val Metrics={'mcc': 0.15617376188860607, 'val_loss': 0.6829880952835083, 'task': 'RTE'}


  pkd_loss = F.mse_loss(s_features, t_features, reduction="mean")
Phase 2 Epoch 2/2: 100%|██████████| 10/10 [00:00<00:00, 11.59it/s, total_loss=0.565, task_loss=0.633]


Epoch 2: Train Loss=0.5561, Task Loss=0.6154, Val Metrics={'mcc': 0.04364357804719848, 'val_loss': 0.6787544012069702, 'task': 'RTE'}

=== Phase 3: Meta-Teacher Learning ===


Phase 3: Meta-Teacher: 100%|██████████| 3/3 [00:00<00:00, 29.03it/s, meta_loss=0.667]


Phase 3: Meta Loss=0.7121, Val Metrics={'mcc': 0.0, 'val_loss': 0.8017272746562958, 'task': 'RTE'}

Training Complete!

Training complete!


---

## Example 4: Correlation Metrics (Pearson + Spearman)

**Metric:** Pearson & Spearman correlation using `evaluate.load('pearsonr')` and `evaluate.load('spearmanr')`  
**Use Case:** Regression tasks, semantic similarity  
**Returns:** `pearson`, `spearmanr`, and `corr` (average)
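The difference between the two correlations is easiest to see on a small example. The following plain-Python sketch is an illustration only: the helper names (`pearson`, `average_ranks`, `spearman`) and the toy scores are assumptions, and the library itself relies on `evaluate` for these values.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up similarity scores: monotonic in the references, but not linear.
toy_refs = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_preds = [0.5, 1.0, 2.0, 4.0, 8.0]

pearson_r = pearson(toy_preds, toy_refs)
spearman_r = spearman(toy_preds, toy_refs)
corr = (pearson_r + spearman_r) / 2
print(f"pearson={pearson_r:.4f} spearman={spearman_r:.4f} corr={corr:.4f}")
```

Because the toy predictions preserve the ordering of the references, the Spearman value is 1 (up to floating-point rounding), while the Pearson value is lower because the relationship is not linear. Averaging the two, as `corr` does, rewards both correct ordering and correct scale.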

In [14]:
# Note: For demonstration with classification data
# In practice, use this with regression tasks like STS-B

model_corr = MPDistil(
    task_name='CB',
    num_labels=num_labels,
    metric='correlation',
    student_layers=6,
    device='auto'
)

print(f"Using metric: {model_corr.metric}")
print("\nNote: Correlation metrics are best suited for regression tasks like STS-B")

Using device: cuda
Using metric: correlation

Note: Correlation metrics are best suited for regression tasks like STS-B


---

## Example 6: Complete Training Pipeline

Full example with all 4 phases: Teacher fine-tuning → Student distillation → Meta-teacher → Curriculum learning

In [20]:
# Load fresh dataset for complete pipeline
print("Setting up complete training pipeline...\n")
loaders_full, num_labels_full = load_superglue_dataset('BoolQ', max_samples=200)

# Initialize with F1 metric for comprehensive evaluation
model_full = MPDistil(
    task_name='BoolQ',
    num_labels=num_labels_full,
    metric='f1',
    teacher_model='bert-base-uncased',
    student_layers=6,
    device='auto'
)

print(f"Model: {model_full.teacher_model} → {model_full.student_model}-layer student")
print(f"Metric: {model_full.metric} (HuggingFace evaluate)")
print(f"Task: {model_full.task_name}")

Setting up complete training pipeline...

Using device: cuda
Model: None → None-layer student
Metric: f1 (HuggingFace evaluate)
Task: BoolQ


In [21]:
# Run complete 4-phase training
print("\n" + "="*60)
print("Starting 4-Phase MPDistil Training")
print("="*60)

history_full = model_full.train(
    train_loader=loaders_full['train'],
    val_loader=loaders_full['val'],
    teacher_epochs=3, student_epochs=3
)

print("\n" + "="*60)
print("Training Complete!")
print("="*60)


Starting 4-Phase MPDistil Training

Validating DataLoaders...

Preparing task loaders...
Tasks: ['BoolQ']
Label counts: {'boolq': 2}

Initializing models...
Slicing student model to 6 layers (original: 12)

Model Sizes:
  Teacher: 109,484,547 parameters
  Student: 66,957,315 parameters (61.2% of teacher)
  Action:  769 parameters

Starting MPDistil Training

=== Phase 1: Teacher Fine-tuning ===


Phase 1 Epoch 1/3: 100%|██████████| 20/20 [00:02<00:00,  8.88it/s, loss=0.572]


Epoch 1: Train Loss=0.7099, Val Metrics={'acc': 0.7, 'f1': 0.4117647058823529, 'acc_and_f1': 0.5558823529411765, 'val_loss': 0.6251359701156616, 'task': 'BoolQ'}


Phase 1 Epoch 2/3: 100%|██████████| 20/20 [00:02<00:00,  8.84it/s, loss=0.65]


Epoch 2: Train Loss=0.6193, Val Metrics={'acc': 0.69, 'f1': 0.43748865904554524, 'acc_and_f1': 0.5637443295227726, 'val_loss': 0.6132808947563171, 'task': 'BoolQ'}


Phase 1 Epoch 3/3: 100%|██████████| 20/20 [00:02<00:00,  8.85it/s, loss=0.591]


Epoch 3: Train Loss=0.4834, Val Metrics={'acc': 0.59, 'f1': 0.4737517648568862, 'acc_and_f1': 0.5318758824284431, 'val_loss': 0.6899695634841919, 'task': 'BoolQ'}

=== Phase 2: Student Knowledge Distillation ===


  pkd_loss = F.mse_loss(s_features, t_features, reduction="mean")
Phase 2 Epoch 1/3: 100%|██████████| 20/20 [00:01<00:00, 11.33it/s, total_loss=0.702, task_loss=0.806]


Epoch 1: Train Loss=0.6535, Task Loss=0.7161, Val Metrics={'acc': 0.7, 'f1': 0.4117647058823529, 'acc_and_f1': 0.5558823529411765, 'val_loss': 0.6490185976028442, 'task': 'BoolQ'}


  pkd_loss = F.mse_loss(s_features, t_features, reduction="mean")
Phase 2 Epoch 2/3: 100%|██████████| 20/20 [00:01<00:00, 11.23it/s, total_loss=0.493, task_loss=0.481]


Epoch 2: Train Loss=0.6011, Task Loss=0.6397, Val Metrics={'acc': 0.7, 'f1': 0.4117647058823529, 'acc_and_f1': 0.5558823529411765, 'val_loss': 0.6223375773429871, 'task': 'BoolQ'}


  pkd_loss = F.mse_loss(s_features, t_features, reduction="mean")
Phase 2 Epoch 3/3: 100%|██████████| 20/20 [00:01<00:00, 11.30it/s, total_loss=0.503, task_loss=0.512]


Epoch 3: Train Loss=0.5494, Task Loss=0.5846, Val Metrics={'acc': 0.65, 'f1': 0.4631078386255561, 'acc_and_f1': 0.5565539193127781, 'val_loss': 0.6065957617759704, 'task': 'BoolQ'}

=== Phase 3: Meta-Teacher Learning ===


Phase 3: Meta-Teacher: 100%|██████████| 5/5 [00:00<00:00, 24.66it/s, meta_loss=0.843]


Phase 3: Meta Loss=0.7638, Val Metrics={'acc': 0.68, 'f1': 0.5, 'acc_and_f1': 0.5900000000000001, 'val_loss': 0.6959292936325073, 'task': 'BoolQ'}

Training Complete!

Training Complete!


In [22]:
# Display results
print("\n📊 Final Results (HuggingFace Evaluate Metrics):\n")

# Extract metrics from Phase 2 (Student distillation)
phase2_metrics = history_full['phase2']['val_metrics']
f1_scores = [m['f1'] for m in phase2_metrics]

accuracies = [m['acc'] for m in phase2_metrics]
print(f"Final Accuracy: {accuracies[-1]:.4f}")

print(f"Final F1:       {f1_scores[-1]:.4f}")

print(f"Best F1 Score:  {max(f1_scores):.4f}")
print(f"Best Accuracy:  {max(accuracies):.4f}")


📊 Final Results (HuggingFace Evaluate Metrics):

Final Accuracy: 0.6500
Final F1:       0.4631
Best F1 Score:  0.4631
Best Accuracy:  0.7000
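The best accuracy and the best F1 above come from different epochs, so it can help to report which epoch produced each. The sketch below shows one way to do that. It assumes only the structure already used in this notebook, a list of per-epoch dicts under `history['phase2']['val_metrics']`, and uses a stand-in `history` dict with values rounded from the BoolQ Phase 2 log so that it runs without training; `best_epoch` is a hypothetical helper, not part of MPDistil.

```python
# Stand-in for the dict returned by model_full.train(); values are rounded
# from the Phase 2 log lines above.
history = {
    'phase2': {
        'val_metrics': [
            {'acc': 0.70, 'f1': 0.4118},
            {'acc': 0.70, 'f1': 0.4118},
            {'acc': 0.65, 'f1': 0.4631},
        ]
    }
}

def best_epoch(val_metrics, key):
    """Return (1-based epoch, value) of the highest `key`; earliest epoch wins ties."""
    index = max(range(len(val_metrics)), key=lambda i: val_metrics[i][key])
    return index + 1, val_metrics[index][key]

student_metrics = history['phase2']['val_metrics']
best_acc_epoch, best_acc = best_epoch(student_metrics, 'acc')
best_f1_epoch, best_f1 = best_epoch(student_metrics, 'f1')
print(f"Best accuracy {best_acc:.4f} at epoch {best_acc_epoch}")
print(f"Best F1       {best_f1:.4f} at epoch {best_f1_epoch}")
```

Here accuracy peaks in epoch 1 while F1 peaks in epoch 3, so which checkpoint counts as "best" depends on the metric you select for the task.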
