pedroandreou/supreme-unlearning

⚡ SUPREME - Standardised Unlearning Platform for REproducible Method Evaluation

🔬 Tech Stack
Core: Python 3.9, PyTorch, Lightning Fabric, HuggingFace Transformers
Accelerators: CUDA 12.1, MPS, TPU via PyTorch XLA
Distributed & precision: DeepSpeed, bitsandbytes, NVIDIA TransformerEngine

🛠️ Tooling
Experiment tracking: Weights & Biases, TensorBoard
Environment: Docker, Dev Containers
Debug & profile: debugpy, Scalene Profiler
Code quality: Ruff, pre-commit

📄 Publication
arXiv Preprint (TBA), Project Page. Under double-blind review at the WIPE-OUT 2 Workshop, ECML-PKDD 2026.

📦 Repository
MIT License


📖 Overview

SUPREME is an open-source framework for evaluating machine unlearning methods on image classification tasks.

What is machine unlearning? Given a model that was trained on some data, machine unlearning removes the influence of a chosen subset of that data (a class, a sub-class, or a random sample of examples) without retraining the model from scratch. Doing this well is hard: a good unlearned model should behave as if it had never seen the forgotten data, while still classifying everything else accurately. Many methods have been proposed, and they need a fair, repeatable way to be compared.

What SUPREME does. It runs the same three-stage pipeline end to end for any registered combination of dataset, model, unlearning method, and evaluation metric:

  1. Train a baseline model on the full dataset.
  2. Unlearn the chosen subset using the selected unlearning method.
  3. Evaluate the unlearned model against a from-scratch retrained baseline (trained only on the data that was kept), using a configurable set of metrics that cover forgetting, utility, privacy, behavioural/parametric equivalence, and efficiency.
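Stage 3 can be illustrated with a small, self-contained sketch. This is not SUPREME's evaluation code: the function names and the toy softmax outputs are invented for illustration, and only two of the metric families are touched (accuracy on the forget set, and a Jensen-Shannon divergence between the outputs of the unlearned and the retrained model).

```python
import math
from typing import List, Sequence


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        # Terms with a_i == 0 contribute nothing to the KL sum.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def accuracy(probs: List[List[float]], labels: List[int]) -> float:
    """Fraction of rows whose arg-max class equals the label."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in probs]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


# Toy softmax outputs on three forget-set examples (hypothetical numbers).
forget_labels = [0, 0, 0]
unlearned = [[0.2, 0.5, 0.3], [0.1, 0.6, 0.3], [0.6, 0.2, 0.2]]
retrained = [[0.2, 0.5, 0.3], [0.2, 0.5, 0.3], [0.3, 0.4, 0.3]]

forget_acc_gap = accuracy(unlearned, forget_labels) - accuracy(retrained, forget_labels)
mean_js = sum(js_divergence(u, r) for u, r in zip(unlearned, retrained)) / len(unlearned)

print(f"forget-accuracy gap vs retrain: {forget_acc_gap:+.3f}")
print(f"mean JS divergence vs retrain:  {mean_js:.4f}")
```

In this toy setup, values close to zero for both numbers would indicate that the unlearned model behaves much like the retrained baseline on the forgotten data.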

It ships ready-to-use implementations of 5 datasets, 2 model architectures, 11 unlearning methods, and 10 evaluation metrics, all selectable through command-line flags.

What makes SUPREME different:

  • Reproducible by design. Recent work has shown that single-seed unlearning results can misrepresent a method's true behaviour. SUPREME runs the same experiment under multiple seeds, independently for the training, unlearning, and evaluation stages, so you measure distributions, not point estimates. The number of seeds at each stage is configurable per run.
  • Multi-GPU and multi-precision. Built on PyTorch and Lightning Fabric. Distributed execution (DDP, FSDP, DeepSpeed ZeRO 1/2/3) is available in all three stages, alongside mixed precision (fp16 / bf16) and NVIDIA, Apple Silicon, and CPU back-ends.
  • Registry-based extensibility. Add a dataset, model, unlearning method, or metric by implementing a small interface and registering its module path, with no framework changes required (see docs/extending.md).
  • Efficient. When several experiments share the same training configuration, the model is trained once and reused across them, guarded by a file lock so parallel SLURM jobs and concurrent local runs stay consistent.
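The train-once-and-reuse idea in the last bullet can be sketched with an exclusive file lock. This is an illustration under stated assumptions, not SUPREME's implementation: `train_once`, `fake_train`, and the checkpoint file name are hypothetical, the checkpoint is a JSON stand-in, and the lock uses `fcntl.flock` from the standard library (POSIX only; the locking library SUPREME actually uses is not stated here).

```python
import fcntl
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def train_once(ckpt: Path, train_fn: Callable[[], Dict]) -> bool:
    """Run train_fn only if ckpt is missing. Returns True if this call trained."""
    ckpt.parent.mkdir(parents=True, exist_ok=True)
    lock_path = ckpt.with_name(ckpt.name + ".lock")
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks while another process holds the lock
        try:
            if ckpt.exists():
                return False  # another run already trained this configuration
            tmp = ckpt.with_name(ckpt.name + ".tmp")
            tmp.write_text(json.dumps(train_fn()))
            os.replace(tmp, ckpt)  # atomic rename: readers never see a partial file
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


calls: List[int] = []


def fake_train() -> Dict:
    """Hypothetical stand-in for the real training stage."""
    calls.append(1)
    return {"weights": [0.1, 0.2]}


with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt = Path(tmp_dir) / "ViT_Cifar10_seed260.json"
    first = train_once(ckpt, fake_train)   # trains and writes the checkpoint
    second = train_once(ckpt, fake_train)  # finds the checkpoint and skips training

print(first, second, len(calls))  # True False 1
```

Checking for the checkpoint only after acquiring the lock is what keeps two concurrent callers from both deciding to train.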

For the formal pipeline algorithm and mathematical notation (seed formulas, set definitions, operation signatures), see src/README.md and docs/notation.md.


🗃️ Available Components

Registry-based components are user-extensible - implement the relevant interface and register the module path (see docs/extending.md). The components provided via Lightning Fabric cover the supported hardware and execution configurations.

Registry-based (user-extensible)

| Component | Available implementations |
|---|---|
| Datasets | CIFAR-10, CIFAR-20, CIFAR-100, PinsFaceRecognition, Caltech-101 |
| Models | ResNet18, Vision Transformer (ViT) |
| Baselines | Retrain, Original |
| Unlearning methods | Fine-Tuning (FT), Bad Teacher (BadT), Random Labels (RL), UNSIR, SSD, LFSSD, ASSD, SCRUB, JIT |
| Evaluation metrics | Accuracy, Loss/Error, ZRF, Activation Distance, JS-Divergence, Layer-wise Distance, Membership Inference Attack, Completeness, Resource Consumption, Time |
| Unlearning scenarios | Full-class, Subclass, Random sample |

Paper-evaluated subset: Retrain, FT, BadT, RL, UNSIR, SSD, LFSSD. The remaining methods (ASSD, SCRUB, JIT) are experimental implementations that aren't in the paper's evaluation but can be selected via --methods.
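To make the unlearning scenarios concrete, here is a minimal sketch of how a forget set and a retain set could be derived for the full-class and random-sample cases. The function names are hypothetical, and treating the forget amount as a fraction of the dataset (so 0.01 means 1% of the examples) is an assumption; SUPREME's own split logic may differ.

```python
import random
from typing import List, Sequence, Tuple

Split = Tuple[List[int], List[int]]  # (forget indices, retain indices)


def split_full_class(labels: Sequence[int], forget_class: int) -> Split:
    """Full-class scenario: every example of one class is forgotten."""
    forget = [i for i, y in enumerate(labels) if y == forget_class]
    retain = [i for i, y in enumerate(labels) if y != forget_class]
    return forget, retain


def split_random(n_examples: int, forget_fraction: float, seed: int) -> Split:
    """Random-sample scenario: a seeded random subset is forgotten."""
    rng = random.Random(seed)
    k = max(1, round(n_examples * forget_fraction))  # forget at least one example
    forget = sorted(rng.sample(range(n_examples), k))
    forget_set = set(forget)
    retain = [i for i in range(n_examples) if i not in forget_set]
    return forget, retain


labels = [0, 1, 2, 0, 1, 2]
class_forget, class_retain = split_full_class(labels, forget_class=1)
rand_forget, rand_retain = split_random(n_examples=1000, forget_fraction=0.01, seed=260)

print(class_forget, class_retain)          # [1, 4] [0, 2, 3, 5]
print(len(rand_forget), len(rand_retain))  # 10 990
```

The retain set is what a from-scratch retrained baseline would be trained on, which is why the two sets are kept disjoint and together cover the whole dataset.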

Provided via Lightning Fabric

| Component | Available implementations |
|---|---|
| Accelerators | CPU, CUDA, MPS, TPU |
| Precision modes | 64-true, 32-true, 16-mixed, bf16-mixed, 16-true, bf16-true, transformer-engine, transformer-engine-float16 (FP8), nf4, nf4-dq, fp4, fp4-dq, int8, int8-training |
| Distributed strategies | DDP, FSDP, DeepSpeed (ZeRO Stage 1/2/3) |
| Loggers | Weights & Biases, TensorBoard, CSV |

⚡ Quickstart

# 1. Clone
git clone https://github.com/pedroandreou/supreme-unlearning.git
cd supreme-unlearning

# 2. Set up environment
python3.9 -m venv gpu_env
source gpu_env/bin/activate
pip install -r requirements.cuda_12_1.txt   # NVIDIA GPU (Linux / WSL2). Apple Silicon: use requirements.mps.txt
pip install -e .

# 3. Configure W&B + HF tokens
cp .env.example .env
# edit .env with your WANDB_API_KEY and HF_TOKEN

# 4. Smoke test - one seed, one model, one dataset (retrain baseline plus two methods)
bash src/run_local.sh \
  --gpu 0 --models ViT --training-seeds 260 \
  --methods retrain,finetune,ssd \
  --strategies random_ --datasets Cifar10 \
  --forget-percs 0.01

Full environment setup (Docker Dev Container, MPS prerequisites, etc.) is documented in docs/environment_setup.md. The Docker image is NVIDIA-only (Linux / WSL2); macOS users follow the virtual-env path above.


🧪 Running Experiments

The pipeline runs train → unlearn → evaluate automatically. Re-running is safe: per-stage outputs (training checkpoints, unlearning checkpoints, already-logged W&B results) are detected and skipped.

Local (workstation, GPU server, interactive cluster node)

# All 10 seeds, all methods, all datasets - defaults
bash src/run_local.sh --gpu 0

# Filter the sweep
bash src/run_local.sh \
  --gpu 0,1 \
  --models ViT \
  --training-seeds 260,261,262 \
  --methods retrain,finetune,bad_teacher,ssd \
  --strategies fullclass,random_ \
  --datasets PinsFaceRecognition

| Flag | Description | Default |
|---|---|---|
| --gpu | GPU ID(s): 0 for a single GPU, 0,1,2,3 for multi-GPU | 0 |
| --models | ResNet18, ViT | both |
| --training-seeds | Comma-separated training seeds (outer loop, I) | 260–269 |
| --unlearning-seeds | Space-separated indices for J (e.g. "0 1 2" for J=3) | "0" (matched) |
| --evaluation-seeds | Space-separated indices for K | "0" (matched) |
| --methods | Unlearning methods to run | all 11 |
| --strategies | fullclass, subclass, random_ | all |
| --datasets | Datasets to use | all 5 |
| --forget-percs | Forget % for random_ strategy | 0.001–0.10 |
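The three seed flags describe nested loops over I training seeds, J unlearning indices, and K evaluation indices. The sketch below only illustrates how such a sweep yields a distribution of results instead of a single number. `placeholder_metric` is a hypothetical stand-in that returns a fake score, and the way it derives a random stream from the three values is an assumption, not the seed formula from docs/notation.md.

```python
import random
from itertools import product
from statistics import mean, stdev

training_seeds = [260, 261, 262]  # I = 3, as passed to --training-seeds
unlearning_indices = [0, 1]       # J = 2, as passed to --unlearning-seeds
evaluation_indices = [0]          # K = 1, as passed to --evaluation-seeds


def placeholder_metric(train_seed: int, j: int, k: int) -> float:
    """Stand-in for one full train/unlearn/evaluate run; returns a fake score."""
    rng = random.Random(f"{train_seed}-{j}-{k}")  # hypothetical seed derivation
    return 0.80 + 0.05 * rng.random()


# One entry per (training seed, unlearning index, evaluation index) combination.
runs = {
    (s, j, k): placeholder_metric(s, j, k)
    for s, j, k in product(training_seeds, unlearning_indices, evaluation_indices)
}

scores = list(runs.values())
print(f"{len(runs)} runs, mean={mean(scores):.3f}, std={stdev(scores):.3f}")
```

Reporting the mean together with the spread over all I × J × K runs is the kind of summary the multi-seed design is meant to enable.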

SLURM (HPC, login node)

# Preview the grid (no submission)
./src/run_slurm.sh --dry-run

# Submit all experiments, max 12 concurrent jobs
./src/run_slurm.sh --max-concurrent 12

# Subset
./src/run_slurm.sh \
  --datasets Cifar10,Cifar20 \
  --models ViT \
  --training-seeds 260,261,262

# Multi-GPU DDP per job
./src/run_slurm.sh --gpus 4

Each submitted job runs one (seed, dataset, model) cell independently; cells run in parallel across the cluster. Distributed-strategy selection (DDP / FSDP / DeepSpeed) is documented in docs/implementation_notes.md → Distributed Strategies.
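As a worked example of the cell grid, the subset command above (two datasets, one model, three training seeds) expands to 2 × 1 × 3 = 6 independent jobs. The enumeration below is a sketch of that arithmetic only; the printed format is made up and is not the output of --dry-run.

```python
from itertools import product

datasets = ["Cifar10", "Cifar20"]
models = ["ViT"]
training_seeds = [260, 261, 262]

# One cell per (seed, dataset, model) combination.
cells = list(product(training_seeds, datasets, models))

for seed, dataset, model in cells:
    print(f"seed={seed} dataset={dataset} model={model}")
print(f"{len(cells)} jobs")  # 6 jobs
```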


🔁 Reproducing the paper

Reproducing the paper's numbers is a two-step process: run the experiment grid on Pins Face Recognition (both architectures, both scenarios, all 10 seeds) and then render the three paper LaTeX tables from the W&B-logged results using src/utils/wandb_utils/results_analysis/pins_paper_tables.ipynb. The exact command, the table-rendering workflow, and the troubleshooting notes are documented in docs/reproducing_the_paper.md.


➕ Extending SUPREME

Adding a dataset, model, method, or metric follows a consistent register-and-implement pattern. Walkthroughs and Fabric-integration rules live in docs/extending.md:

| What to add | Walkthrough |
|---|---|
| New dataset | docs/extending.md → Adding a new dataset |
| New model | docs/extending.md → Adding a new model |
| New unlearning method | docs/extending.md → Adding a new unlearning method |
| New evaluation metric | docs/extending.md → Adding a new evaluation metric |
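The register-and-implement pattern can be sketched generically as a mapping from a name to a module path that is imported on demand. Everything here is hypothetical (`METRIC_REGISTRY`, `register`, `resolve`, and the `"module:attribute"` string format), and standard-library functions stand in for real metric implementations; docs/extending.md describes the actual interfaces.

```python
import importlib
from typing import Any, Callable, Dict

# Maps a short name to "module.path:attribute" (format is an assumption).
METRIC_REGISTRY: Dict[str, str] = {}


def register(name: str, target: str) -> None:
    """Record where an implementation lives without importing it yet."""
    if name in METRIC_REGISTRY:
        raise ValueError(f"{name!r} is already registered")
    METRIC_REGISTRY[name] = target


def resolve(name: str) -> Callable[..., Any]:
    """Import the registered module and return the named attribute."""
    module_path, _, attribute = METRIC_REGISTRY[name].partition(":")
    module = importlib.import_module(module_path)  # imported lazily, on first use
    return getattr(module, attribute)


# Standard-library functions stand in for real metric implementations.
register("mean", "statistics:mean")
register("median", "statistics:median")

values = [0.91, 0.88, 0.97]
results = {name: resolve(name)(values) for name in METRIC_REGISTRY}
print(results)
```

Because the framework only looks names up in the registry, adding a new entry does not require touching the code that runs the pipeline, which is the property the walkthroughs rely on.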

📚 Documentation

| Document | Covers |
|---|---|
| docs/notation.md | Symbol glossary: seeds, datasets, models, indices, counts |
| src/README.md | Formal algorithm specification (matched and decoupled protocols) |
| docs/environment_setup.md | Virtual-env and Docker Dev Container setup, .env template, prerequisites |
| docs/reproducing_the_paper.md | Single command for the paper's experiment grid plus the W&B-export-to-LaTeX-tables workflow |
| docs/script_arguments.md | Full argument reference for train_main.py and unlearn_main.py |
| docs/extending.md | How to add new datasets, models, methods, and metrics |
| docs/tooling.md | Debugger, profiler, Fabric callbacks, process tracker, split export, W&B exporter |
| docs/wandb_integration.md | W&B runtime behaviour: rank-0 logging, offline mode, sync workflow, metric synchronisation |
| docs/wandb_fields.md | Paper-to-W&B metric mapping and per-metric field paths |
| docs/implementation_notes.md | Distributed strategies, gradient handling, batch-size scaling, memory, known limitations |
| docs/adding_pinsfacerecognition.md | Manual Kaggle download for the Pins Face Recognition dataset |
| docs/future_work.md | Planned extensions |

📝 Citing this work

@misc{supreme2026,
  title  = {SUPREME: Standardised Unlearning Platform for REproducible Method Evaluation},
  author = {Petros Andreou and Jamie Lanyon and Axel Finke and Georgina Cosma},
  year   = {2026},
  howpublished = {}
}

This work was conducted at Loughborough University.


📄 License

Released under the MIT License.
