This repository contains the training, evaluation, and uncertainty estimation stack we use to study Neural Cellular Automata (NCA) based segmentation on a diverse set of biomedical datasets. In addition to classical nuclei benchmarks (DSB2018, MoNuSeg) we support augmented “hard” splits, real/synthetic ultrasound (RUS), hand-curated NuInsSeg tissues, and ISIC 2017 Task 1 lesions with multi-annotator labels used to build disagreement-aware baselines.

- End-to-end NCA training / evaluation via `train.py` and `evaluate.py` with configurable steps, channel budgets, image sizes, and datasets.
- Unified dataloader (`dataloader.py`) for VOC, CamVid, DSB2018, MoNuSeg (with automatic XML rasterization + augmentation), NuInsSeg (with automatic split generation + augmentation), RUS ultrasound, and ISIC 2017 Task 1 (binary lesions with multi-annotator metadata).
- Uncertainty tooling:
  - `quality_labels.py` computes Dice / boundary scores per sample and records annotator disagreement if available.
  - `compute_uncertainty.py` implements single forward entropy, stop-time, stability, flicker, resilience, test-time augmentation (TTA), and the ISIC-specific disagreement baseline derived from annotator variance.
  - `evaluate_uncertainty.py` reports Dice@80/90, AURC, AUROC, AUPRC, and adds two fusion baselines (rank-average and validation-tuned weighted fusion) to combine the best signals.
  - `plot_uncertainty_examples.py` creates qualitative panels.
- Augmented “hard” splits for DSB2018 (photometric + geometric noise), MoNuSeg, and NuInsSeg (`*_aug` splits are generated on the fly if missing).
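
As a rough illustration of the simplest signal in the list above, single forward entropy can be read as the per-pixel binary entropy of the predicted foreground probability, averaged into one scalar per image. The sketch below assumes that reading and uses plain Python lists instead of tensors. The names `pixel_entropy`, `entropy_map`, and `mean_entropy` are hypothetical and are not taken from `compute_uncertainty.py`.

```python
import math


def pixel_entropy(p, eps=1e-12):
    """Binary entropy (in bits) of a foreground probability p in [0, 1]."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def entropy_map(prob_map):
    """Per-pixel entropy for a 2D probability map given as nested lists."""
    return [[pixel_entropy(p) for p in row] for row in prob_map]


def mean_entropy(prob_map):
    """Scalar uncertainty: the average of the per-pixel entropy map."""
    flat = [h for row in entropy_map(prob_map) for h in row]
    return sum(flat) / len(flat)


# Toy probability maps (assumed values, not real model outputs).
confident = [[0.01, 0.99], [0.98, 0.02]]
ambiguous = [[0.50, 0.60], [0.40, 0.50]]
print(mean_entropy(confident), mean_entropy(ambiguous))
```

A map whose probabilities sit near 0 or 1 yields a low scalar, while one hovering around 0.5 yields a value close to 1 bit.
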

Create a virtual environment and install the dependencies:

    python -m venv nca_env
    source nca_env/bin/activate
    pip install --upgrade pip
    pip install -r requirements.txt
    # Ensure the matching CUDA builds of torch/torchvision if training on GPU.

| Dataset | Expected Location | Notes |
|---|---|---|
| VOC 2012 | `datasets/voc/VOCdevkit/VOC2012` or `datasets/VOC2012_train_val` | Standard PASCAL VOC segmentation |
| CamVid | `datasets/CamVid` | Requires `train`/`val`/`test` and label folders |
| DSB2018 | `datasets/dsb2018` | Contains `stage1_train`; `_aug` splits auto-generated |
| MoNuSeg | `datasets/MoNuSeg` | Run `python datasets/MoNuSeg/split_monuseg.py` first |
| RUS Ultrasound | `datasets/US/RUS/{images,annotations}/{train,val,test}` | Provide masks for training/validation splits |
| NuInsSeg | `datasets/NuInsSeg` | Run `python datasets/NuInsSeg/split_nuinsseg.py --root datasets/NuInsSeg --force` once; `_aug` splits are generated on demand |
| ISIC 2017 Task 1 | `datasets/isic/isic2017_task1` | Keep the official `ISIC-2017_<split>_Data` and `*_Part1_GroundTruth` folders. Additional annotator masks (if present) are automatically discovered for disagreement baselines. |

All scripts accept --data_root to override the automatically detected locations when needed.

Train on any supported dataset:

    # Example: NuInsSeg baseline
    python train.py \
      --dataset nuinsseg \
      --data_root datasets/NuInsSeg \
      --batch_size 4 \
      --epochs 50 \
      --lr 1e-3 \
      --steps_min 32 --steps_max 64 \
      --image_size 512 512 \
      --num_workers 4 \
      --device cuda \
      --exp_name nuinsseg_baseline

Evaluate a checkpoint on a target split:

    python evaluate.py \
      --checkpoint runs/nuinsseg_baseline/best.pt \
      --dataset nuinsseg \
      --split test \
      --data_root datasets/NuInsSeg \
      --batch_size 2 \
      --num_workers 4 \
      --image_size 512 512 \
      --device cuda

`evaluate.py` reports pixel accuracy, mIoU, Dice, and boundary F1.

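
For the binary case, three of the reported metrics reduce to simple counts over the confusion matrix. The following is a minimal sketch under that assumption, written with nested lists of 0/1 for readability. `confusion_counts` and `binary_metrics` are hypothetical names, not the functions used in `evaluate.py`. mIoU would average the IoU over classes, and boundary F1 is left out.

```python
def confusion_counts(pred, target):
    """Count TP / FP / FN / TN over two same-shaped binary masks."""
    tp = fp = fn = tn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def binary_metrics(pred, target, eps=1e-7):
    """Pixel accuracy, Dice, and IoU for the foreground class.

    eps smooths the ratios so that two empty masks score 1.0 instead of 0/0.
    """
    tp, fp, fn, tn = confusion_counts(pred, target)
    return {
        "pixel_acc": (tp + tn) / max(tp + fp + fn + tn, 1),
        "dice": (2 * tp + eps) / (2 * tp + fp + fn + eps),
        "iou": (tp + eps) / (tp + fp + fn + eps),
    }


# Toy masks (assumed values): 2 true positives, 1 false positive, 1 false negative.
pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
print(binary_metrics(pred, target))
```
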
- Quality labels (Dice, boundary F1, annotator stats):

      python quality_labels.py \
        --datasets dsb2018 monuseg rus nuinsseg isic2017 \
        --split test \
        --runs_dir runs \
        --pattern "*best.pt" \
        --batch_size 2 \
        --num_workers 4 \
        --device cuda

- Uncertainty computation (per-method maps + scalars). Available methods: `single`, `stoptime`, `stability`, `flicker`, `resilience`, `tta`, `disagreement`.

      python compute_uncertainty.py \
        --datasets dsb2018 nuinsseg isic2017 \
        --split test test_aug \
        --methods single stoptime stability flicker resilience tta disagreement \
        --runs_dir runs \
        --batch_size 2 \
        --num_workers 4 \
        --device cuda \
        --save_png

  - `--tta_max_transforms` controls how many geometric transforms are used.
  - `disagreement` requires multi-annotator metadata (ISIC 2017); it falls back to zeros otherwise.

- Evaluation + fusion using `evaluate_uncertainty.py`. This script loads the quality/uncertainty JSON pairs per run and computes Dice@80/90, AURC, AUROC, and AUPRC. Optional fusion baselines:

      python evaluate_uncertainty.py \
        --datasets dsb2018 nuinsseg isic2017 \
        --split test \
        --methods single tta stability disagreement \
        --fusion_pairs tta,stability tta,single \
        --fusion_metric aurc \
        --fusion_alpha_steps 21 \
        --fusion_val_split val \
        --runs_dir runs \
        --output runs/unc_summary_test.json

  - `fusion_rank_<m1>_<m2>`: average normalized ranks of each method.
  - `fusion_weighted_<m1>_<m2>`: learn α on the validation split (grid search) after standardizing each uncertainty score.

- Qualitative panels:

      python plot_uncertainty_examples.py \
        --datasets nuinsseg isic2017 \
        --split test \
        --methods single tta disagreement \
        --runs_dir runs \
        --output runs/uncertainty_plots

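
The two fusion baselines and the AURC metric used to tune them can be sketched in a few lines. This is an illustration under stated assumptions, not the code in `evaluate_uncertainty.py`: risk is taken to be `1 - Dice` per sample, AURC uses one common discrete definition (mean risk of the accepted samples, averaged over all coverage levels), and every function name and toy value below is hypothetical.

```python
import statistics


def aurc(uncertainty, risk):
    """Area under the risk-coverage curve (lower is better).

    Samples are accepted from least to most uncertain; at coverage k/n the
    risk is the mean risk of the k accepted samples.
    """
    order = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i])
    running, total = 0.0, 0.0
    for k, idx in enumerate(order, start=1):
        running += risk[idx]
        total += running / k
    return total / len(order)


def normalized_ranks(scores):
    """Map scores to ranks in [0, 1]; tied scores share their average rank."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2.0 / max(n - 1, 1)
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_average(a, b):
    """Rank fusion: average the normalized ranks of two uncertainty scores."""
    return [(x + y) / 2.0 for x, y in zip(normalized_ranks(a), normalized_ranks(b))]


def standardize(scores):
    """Z-score a list of scores; constant lists are only mean-centered."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores) or 1.0
    return [(s - mean) / std for s in scores]


def tune_alpha(a, b, risk, steps=21):
    """Grid-search alpha in [0, 1] that minimizes AURC of alpha*z(a) + (1-alpha)*z(b)."""
    za, zb = standardize(a), standardize(b)
    best_alpha, best_score = 0.0, float("inf")
    for step in range(steps):
        alpha = step / (steps - 1)
        fused = [alpha * x + (1.0 - alpha) * y for x, y in zip(za, zb)]
        score = aurc(fused, risk)
        if score < best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score


# Toy validation data (assumed values): risk = 1 - Dice per sample.
risk = [0.05, 0.10, 0.40, 0.60, 0.20]
tta = [0.10, 0.20, 0.90, 0.50, 0.40]
stability = [0.05, 0.15, 0.30, 0.80, 0.35]

print("tta AURC:", aurc(tta, risk))
print("stability AURC:", aurc(stability, risk))
print("rank fusion AURC:", aurc(rank_average(tta, stability), risk))
print("weighted fusion (alpha, AURC):", tune_alpha(tta, stability, risk))
```

In this sketch α is tuned and scored on the same toy data. In a real run the α selected on the validation split would be reused unchanged on the test split, ideally together with the validation mean and standard deviation.
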

- `dataloader.ISIC2017Dataset` stores multi-annotator masks (when available). Downstream scripts expose `multi_mask_paths` metadata.
- `quality_labels.py` logs per-sample annotator variance, maximum disagreement, and pairwise Dice statistics.
- The `disagreement` uncertainty method computes pixel-wise variance and the average mask, serving as an upper bound for how much label ambiguity exists per image.

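
The annotator statistics above can be illustrated with a small sketch. It assumes each annotator contributes a same-shaped binary mask and uses nested lists for readability. `disagreement_maps`, `dice`, and `disagreement_summary` are hypothetical names, and the exact statistics logged by `quality_labels.py` may differ.

```python
from itertools import combinations


def disagreement_maps(masks):
    """Return (mean_mask, variance_map) over a list of same-shaped binary masks."""
    n = len(masks)
    height, width = len(masks[0]), len(masks[0][0])
    mean = [
        [sum(m[r][c] for m in masks) / n for c in range(width)]
        for r in range(height)
    ]
    # For 0/1 labels the population variance at a pixel is p * (1 - p).
    var = [[p * (1.0 - p) for p in row] for row in mean]
    return mean, var


def dice(a, b, eps=1e-7):
    """Dice overlap between two binary masks given as nested lists."""
    inter = sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    total = sum(map(sum, a)) + sum(map(sum, b))
    return (2 * inter + eps) / (total + eps)


def disagreement_summary(masks):
    """Per-image scalars: mean variance, peak variance, mean pairwise Dice."""
    _, var = disagreement_maps(masks)
    flat = [v for row in var for v in row]
    pair_dice = [dice(a, b) for a, b in combinations(masks, 2)]
    return {
        "mean_variance": sum(flat) / len(flat),
        "max_variance": max(flat),
        "mean_pairwise_dice": sum(pair_dice) / len(pair_dice) if pair_dice else 1.0,
    }


# Three toy annotator masks (assumed values) that disagree on two pixels.
annotators = [
    [[1, 1], [0, 0]],
    [[1, 0], [0, 0]],
    [[1, 1], [1, 0]],
]
print(disagreement_summary(annotators))
```

With a single annotator the variance map is all zeros, which matches the fallback-to-zeros behaviour described for datasets without multi-annotator metadata.
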

- DSB2018: requesting `split=test_aug` or `train_aug` triggers `_ensure_dsb_augmented_split`, which applies geometric and photometric noise to each case.
- MoNuSeg: `split=test_aug` uses `_ensure_monuseg_augmented_split` to rasterize XML annotations, apply augmentations, and cache the results.
- NuInsSeg: `split=test_aug` calls `_ensure_nuinsseg_augmented_split` to create noisy counterparts of each split image/mask.

Use --split test for in-distribution evaluation and --split test_aug to stress the models with synthetic perturbations.
The repository includes SLURM helpers such as run_dsb2018.sbatch, run_monuseg.sbatch, and run_isic.sbatch. Update these scripts with your environment-specific parameters (partition, time, etc.) before submitting them to a cluster.

    ├── NCA.py                        # Backbone NCA implementation
    ├── dataloader.py                 # Dataset builders and augmenters
    ├── train.py / evaluate.py        # Segmentation training and evaluation
    ├── quality_labels.py             # Dice/boundary tracking + annotator stats
    ├── compute_uncertainty.py        # Uncertainty map generation (multiple methods)
    ├── evaluate_uncertainty.py       # Risk/coverage, ROC/AUC, fusion baselines
    ├── plot_uncertainty_examples.py  # Qualitative grids
    ├── datasets/
    │   ├── NuInsSeg/split_nuinsseg.py
    │   ├── MoNuSeg/split_monuseg.py
    │   └── ...
    └── runs/                         # Default output directory for checkpoints + metrics

- Missing dataset roots: all loaders call `_ensure_exists` and raise descriptive errors. Use `--data_root` to point to the correct location or create symlinks under `datasets/`.
- RUS annotations: ensure `datasets/US/RUS/annotations/{train,val,test}` exist. Training will fail otherwise.
- NuInsSeg splits: run `datasets/NuInsSeg/split_nuinsseg.py` once before training; re-run with `--force` to rebuild.
- Torch install: match the CUDA version on your system. For cluster jobs, load the corresponding module before `pip install`.

This repository is part of an ongoing NCA-based uncertainty study. Please contact the maintainers for up-to-date citation and licensing information before redistributing or publishing results derived from this codebase.