## 1. Installation

Install MPDistil from GitHub:

In [None]:
!pip install git+https://github.com/yashpatel2010/mpdistil.git

Collecting git+https://github.com/yashpatel2010/mpdistil.git
  Cloning https://github.com/yashpatel2010/mpdistil.git to /tmp/pip-req-build-7ixji8di
  Running command git clone --filter=blob:none --quiet https://github.com/yashpatel2010/mpdistil.git /tmp/pip-req-build-7ixji8di
  Resolved https://github.com/yashpatel2010/mpdistil.git to commit 71f3a2c6e99de60d2c9d56f1c8d37294770e0668
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: mpdistil
  Building wheel for mpdistil (pyproject.toml) ... done
  Created wheel for mpdistil: filename=mpdistil-0.1.0-py3-none-any.whl size=25523 sha256=3e8c514108c1ab2ead8563338b8a686ca4ac738d8d534f21c9927e37541cad4f
  Stored in directory: /tmp/pip-ephem-wheel-cache-35k75_iw/wheels/7d/64/9f/ea7c4ebf8e3fa28e14953d20a44980e1ef72a2d61ef916fbe4
Successfully built mpdistil
Installing collecte

## 2. Import Libraries

In [None]:
from mpdistil import MPDistil, load_superglue_dataset
import torch

print(f"Using GPU: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")

Using GPU: True
GPU Name: Tesla T4


## 3. Load SuperGLUE Dataset

This example uses the CB (CommitmentBank) task, a three-class textual entailment task with 250 training, 56 validation, and 250 test examples.

In [None]:
# Load CB dataset
loaders, num_labels = load_superglue_dataset(
    task_name='CB',
    tokenizer_name='bert-base-uncased',
    max_length=128,
    batch_size=8
)

print(f"Number of labels: {num_labels}")
print(f"Available splits: {list(loaders.keys())}")

Number of labels: 3
Available splits: ['train', 'val', 'test']


## 4. Initialize MPDistil Model

Create a 6-layer BERT student that learns from a 12-layer BERT teacher. Both are initialized from `bert-base-uncased`; `student_layers=6` sets the student's depth.

In [None]:
model = MPDistil(
    task_name='CB',
    num_labels=num_labels,
    teacher_model='bert-base-uncased',
    student_model='bert-base-uncased',
    student_layers=6,
    device='auto'
)

Using device: cuda
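
The training log in the next section reports 109,485,316 teacher parameters and 66,958,084 student parameters (61.2%), even though the student has half as many layers. The sketch below reproduces those totals from the published `bert-base-uncased` configuration to show where the difference comes from: the embedding tables and the pooler are not reduced when layers are dropped. The `HEAD` value is an assumption, inferred as the remainder between the logged totals and the encoder count; the layout of MPDistil's task heads is not shown in this notebook.

```python
# Parameter count of a BERT encoder, derived from its configuration.
# The config values are those of the public bert-base-uncased checkpoint.
VOCAB, MAX_POS, TYPE_VOCAB = 30522, 512, 2
HIDDEN, INTERMEDIATE = 768, 3072


def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(dim: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * dim


def embedding_params() -> int:
    """Word, position, and token-type tables plus the embedding LayerNorm."""
    return (VOCAB + MAX_POS + TYPE_VOCAB) * HIDDEN + layer_norm(HIDDEN)


def encoder_layer_params() -> int:
    """One transformer layer: self-attention block plus feed-forward block."""
    attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm(HIDDEN)  # Q, K, V, output
    feed_forward = (
        linear(HIDDEN, INTERMEDIATE)
        + linear(INTERMEDIATE, HIDDEN)
        + layer_norm(HIDDEN)
    )
    return attention + feed_forward


def bert_params(num_layers: int) -> int:
    """Embeddings, `num_layers` encoder layers, and the pooler."""
    pooler = linear(HIDDEN, HIDDEN)
    return embedding_params() + num_layers * encoder_layer_params() + pooler


# Assumption: parameters outside the encoder, inferred from the logged totals.
HEAD = 3076

teacher_total = bert_params(12) + HEAD
student_total = bert_params(6) + HEAD

print(f"Teacher: {teacher_total:,}")                 # Teacher: 109,485,316
print(f"Student: {student_total:,}")                 # Student: 66,958,084
print(f"Ratio:   {student_total / teacher_total:.1%}")  # Ratio:   61.2%
```

Each encoder layer accounts for about 7.1M parameters, while the embeddings alone hold about 23.8M, so halving the depth removes roughly 39% of the total rather than 50%.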


## 5. Train the Model

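MPDistil trains in four phases. This demo runs the first three and skips the fourth by setting `num_episodes=0`:
1. Teacher fine-tuning
2. Student knowledge distillation
3. Meta-teacher learning
4. Curriculum learning (skipped when no `meta_loaders` are provided)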

In [None]:
# Train with reduced epochs for a quick demo
history = model.fit(
    train_loader=loaders['train'],
    val_loader=loaders['val'],
    test_loader=loaders['test'],
    teacher_epochs=3,   # reduced for the demo (use 10 for real training)
    student_epochs=3,   # reduced for the demo (use 10 for real training)
    num_episodes=0      # skip phase 4 for the demo
)


Validating DataLoaders...

Preparing task loaders...
Tasks: ['CB']
Label counts: {'cb': 3}

Initializing models...



Model Sizes:
  Teacher: 109,485,316 parameters
  Student: 66,958,084 parameters (61.2% of teacher)
  Action:  769 parameters

Starting MPDistil Training

=== Phase 1: Teacher Fine-tuning ===


Phase 1 Epoch 1/3: 100%|██████████| 25/25 [00:05<00:00,  4.35it/s, loss=0.863]


Epoch 1: Train Loss=1.0368, Val Metrics={'acc': 0.6785714285714286, 'val_loss': 0.8354034083230155, 'task': 'CB'}


Phase 1 Epoch 2/3: 100%|██████████| 25/25 [00:04<00:00,  5.22it/s, loss=1.17]


Epoch 2: Train Loss=0.6845, Val Metrics={'acc': 0.6964285714285714, 'val_loss': 0.7838200756481716, 'task': 'CB'}


Phase 1 Epoch 3/3: 100%|██████████| 25/25 [00:04<00:00,  5.23it/s, loss=0.465]


Epoch 3: Train Loss=0.4575, Val Metrics={'acc': 0.75, 'val_loss': 0.7040706596204213, 'task': 'CB'}

=== Phase 2: Student Knowledge Distillation ===


Phase 2 Epoch 1/3: 100%|██████████| 25/25 [00:04<00:00,  6.20it/s, total_loss=0.457, task_loss=0.363]


Epoch 1: Train Loss=0.7155, Task Loss=0.8113, Val Metrics={'acc': 0.7142857142857143, 'val_loss': 0.7793670722416469, 'task': 'CB'}


Phase 2 Epoch 2/3: 100%|██████████| 25/25 [00:03<00:00,  6.45it/s, total_loss=0.498, task_loss=0.441]


Epoch 2: Train Loss=0.5996, Task Loss=0.6377, Val Metrics={'acc': 0.7321428571428571, 'val_loss': 0.7819410988262722, 'task': 'CB'}


Phase 2 Epoch 3/3: 100%|██████████| 25/25 [00:03<00:00,  6.33it/s, total_loss=0.598, task_loss=0.673]


Epoch 3: Train Loss=0.4857, Task Loss=0.4522, Val Metrics={'acc': 0.7321428571428571, 'val_loss': 0.7836860035146985, 'task': 'CB'}

=== Phase 3: Meta-Teacher Learning ===


Phase 3: Meta-Teacher: 100%|██████████| 7/7 [00:00<00:00, 14.11it/s, meta_loss=0.126]


Phase 3: Meta Loss=0.5194, Val Metrics={'acc': 0.75, 'val_loss': 0.832642854324409, 'task': 'CB'}

Training Complete!
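
Phase 2 logs a total loss alongside the task loss, which is typical of distillation objectives that mix several terms. The sketch below illustrates one common form of such an objective: cross-entropy on the true label, a temperature-softened KL term against the teacher's logits, and a mean-squared error between intermediate features. It is a generic illustration in plain Python, not MPDistil's actual implementation; the function names and the `alpha`, `beta`, and `temperature` values are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability of the true label."""
    return -math.log(softmax(logits)[label])


def soft_target_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


def feature_mse(student_features, teacher_features):
    """Mean squared error between two equally sized feature vectors."""
    diffs = [(s - t) ** 2 for s, t in zip(student_features, teacher_features)]
    return sum(diffs) / len(diffs)


def distillation_loss(student_logits, teacher_logits, label,
                      student_features, teacher_features,
                      alpha=0.5, beta=1.0, temperature=2.0):
    """Return (total_loss, task_loss) for one example.

    alpha, beta, and temperature are hypothetical weights for illustration.
    """
    task = cross_entropy(student_logits, label)
    soft = soft_target_loss(student_logits, teacher_logits, temperature)
    feat = feature_mse(student_features, teacher_features)
    total = (1 - alpha) * task + alpha * soft + beta * feat
    return total, task


total, task = distillation_loss(
    student_logits=[1.0, 0.2, -0.5],
    teacher_logits=[2.0, 0.1, -1.0],
    label=0,
    student_features=[0.1, 0.4, -0.2],
    teacher_features=[0.2, 0.3, -0.1],
)
print(f"total_loss={total:.3f}, task_loss={task:.3f}")
```

With a weighting like this, the total can be smaller than the task loss whenever the distillation terms are small, which would be consistent with the epoch summaries above where the training loss is sometimes below the task loss.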


## 6. Evaluate Results

In [None]:
# Print the final validation metrics recorded for each phase
for key, title in [('phase1', 'Phase 1 (Teacher)'),
                   ('phase2', 'Phase 2 (Student PKD)')]:
    if key in history:
        print(f"\n{title} Final Metrics:")
        print(history[key]['val_metrics'][-1])


Phase 1 (Teacher) Final Metrics:
{'acc': 0.75, 'val_loss': 0.7040706596204213, 'task': 'CB'}

Phase 2 (Student PKD) Final Metrics:
{'acc': 0.7321428571428571, 'val_loss': 0.7836860035146985, 'task': 'CB'}
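
Accuracy differences on a small validation set are easier to judge as raw counts. The sketch below converts the accuracies reported above into numbers of correct predictions, assuming the validation loader covers all 56 examples of the CB validation split. The helper name `correct_count` is hypothetical.

```python
VAL_SIZE = 56  # assumption: size of the CB validation split


def correct_count(accuracy: float, num_examples: int) -> int:
    """Recover the number of correct predictions from an accuracy value."""
    count = round(accuracy * num_examples)
    # Guard against an accuracy that does not correspond to a whole count.
    if abs(count / num_examples - accuracy) > 1e-9:
        raise ValueError("accuracy is not a multiple of 1 / num_examples")
    return count


teacher_correct = correct_count(0.75, VAL_SIZE)
student_correct = correct_count(0.7321428571428571, VAL_SIZE)

print(f"Teacher: {teacher_correct}/{VAL_SIZE}")  # Teacher: 42/56
print(f"Student: {student_correct}/{VAL_SIZE}")  # Student: 41/56
```

Under that assumption the teacher and student differ by a single validation example, so the gap between 0.75 and 0.732 should not be over-interpreted.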


## 7. Make Predictions

In [None]:
# Generate predictions on test set
predictions = model.predict(loaders['test'])

print(f"Generated {len(predictions)} predictions")
print(f"First 10 predictions: {predictions[:10]}")


Generating predictions...
Generated 250 predictions
Generated 250 predictions
First 10 predictions: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
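
The first ten predictions are all class 0, so it is worth checking how the predictions are distributed over the three classes before trusting them. A minimal sketch, assuming `predictions` is a flat list of integer class ids as the printed output suggests; the helper `class_distribution` is hypothetical and not part of MPDistil.

```python
from collections import Counter


def class_distribution(predictions, num_labels):
    """Return the fraction of predictions assigned to each class id."""
    if not predictions:
        raise ValueError("predictions is empty")
    counts = Counter(predictions)
    total = len(predictions)
    return {label: counts.get(label, 0) / total for label in range(num_labels)}


# The first ten predictions printed above, used here as sample input.
sample = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(class_distribution(sample, num_labels=3))  # {0: 1.0, 1: 0.0, 2: 0.0}
```

Running the same function on the full list of 250 predictions shows whether the model predicts more than one class or has collapsed onto a single label.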


## 8. Save the Student Model

In [None]:
# Save student model
model.save_student('./my_student_cb')

# Save predictions
model.save_predictions(
    predictions,
    './cb_predictions.tsv',
    label_mapping={0: 'entailment', 1: 'contradiction', 2: 'neutral'}
)


Saving student model to ./my_student_cb...
Model saved successfully!
Predictions saved to ./cb_predictions.tsv
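
The `label_mapping` argument pairs each integer class id with a label name. How `save_predictions` applies it and the column layout of the TSV file are not shown here; the sketch below only illustrates the lookup itself, with a check for ids that have no name. The helper `to_label_names` is hypothetical.

```python
label_mapping = {0: 'entailment', 1: 'contradiction', 2: 'neutral'}


def to_label_names(predictions, mapping):
    """Translate integer class ids into label names, rejecting unknown ids."""
    unknown = sorted(set(predictions) - set(mapping))
    if unknown:
        raise ValueError(f"no label name for class ids: {unknown}")
    return [mapping[p] for p in predictions]


print(to_label_names([0, 2, 1], label_mapping))
# ['entailment', 'neutral', 'contradiction']
```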


## 9. Load and Use Saved Model

In [None]:
# Create new model instance
new_model = MPDistil(
    task_name='CB',
    num_labels=3
)

# Load saved student
new_model.load_student('./my_student_cb')

# Use for predictions
new_predictions = new_model.predict(loaders['test'])
print(f"Predictions match: {predictions == new_predictions}")

Using device: cuda

Loading student model from ./my_student_cb...
Model loaded successfully!

Generating predictions...
Generated 250 predictions
Predictions match: True
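
The comparison `predictions == new_predictions` prints a single `True` here, which suggests both values are plain Python lists. If they were arrays or tensors, `==` would compare element by element instead. The sketch below is a container-agnostic alternative that reports the positions where two prediction sequences differ; the helper `mismatches` is hypothetical.

```python
def mismatches(first, second):
    """Return the indices at which two prediction sequences disagree."""
    # Convert every element to int so list, array, or tensor entries compare alike.
    a = [int(x) for x in first]
    b = [int(x) for x in second]
    if len(a) != len(b):
        raise ValueError(f"length mismatch: {len(a)} vs {len(b)}")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


print(mismatches([0, 1, 2, 0], [0, 1, 2, 0]))  # []
print(mismatches([0, 1, 2, 0], [0, 2, 2, 1]))  # [1, 3]
```

An empty result means the reloaded student reproduces the original predictions exactly.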
