DiffuseSeg demonstrates how a single, unconditionally trained Denoising Diffusion Probabilistic Model (DDPM) can serve as a powerful backbone for both high-fidelity synthetic image generation and label-efficient semantic segmentation.
The core idea is to repurpose the rich, multi-scale features learned by the U-Net decoder of a DDPM. By extracting these features, we can train a lightweight, pixel-level segmentation head with very few labeled examples, effectively turning the generative model into a labeled-data factory.
This project was inspired by the paper cited at the end of this README (Baranchuk et al., ICLR 2022).
- End-to-End Pipeline: Train a DDPM on unlabeled images and then use it to generate paired synthetic images and segmentation masks.
- Label-Efficient: The segmentation head needs very little annotated data to reach reasonable results (mIoU > 0.35 when trained on as few as 100 labeled images).
- High-Quality Synthesis: Generates realistic 64x64 face images.
- Data Augmentation: Easily create large-scale synthetic datasets with perfect pixel-level annotations for downstream tasks.
The project is implemented in two main stages:
- An unconditional DDPM with a U-Net core is trained from scratch on a dataset of unlabeled face images (CelebA-HQ 64x64). This model learns to reverse a diffusion process, progressively transforming Gaussian noise into a realistic face image (a minimal training-step sketch follows this list).
- One can use the scripts from here to train on any dataset with minor dataset-specific modifications.
- You can also use a pre-trained model and skip directly to Stage 2.
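As a rough illustration of what Stage 1 optimizes, here is a minimal sketch of one DDPM training step with the standard noise-prediction objective. This is not this repo's exact code; `unet`, the schedule values, and all names are assumptions.

```python
import torch
import torch.nn.functional as F

T = 1000                                # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)   # linear noise schedule (assumed)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def ddpm_loss(unet, x0):
    """One training step's loss. x0: clean images, shape (B, C, H, W)."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)     # sample random timesteps
    noise = torch.randn_like(x0)                        # target noise
    ab = alpha_bars.to(x0.device)[t].view(b, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise    # forward (noising) process
    return F.mse_loss(unet(x_t, t), noise)              # predict the added noise
```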
With the DDPM U-Net frozen, we use it as a feature extractor.
- Feature Extraction: For a given image (real or synthetic), we extract activations from specific decoder blocks of the U-Net at certain timesteps. The specific blocks (all up-blocks) and timesteps (t = 50, 150, 250) were chosen based on the paper and this project's architecture choices.
- Pixel Descriptors: These multi-scale feature maps are upsampled and concatenated to form a single, rich feature vector for every pixel.
- MLP Training: An ensemble of small, pixel-wise Multi-Layer Perceptrons (MLPs) is trained on a small set of labeled images to classify each pixel feature vector into one of the semantic classes (e.g., hair, skin, nose).
This approach allows the model to generate a segmentation mask for any image—real or synthetically generated by the DDPM.
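Below is a minimal sketch of how such per-pixel descriptors can be assembled from the frozen U-Net. The hook target (`unet.up_blocks`), the noise schedule, and all helper names are assumptions about this repo's internals, not its actual API.

```python
import torch
import torch.nn.functional as F

T = 1000
alpha_bars = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, T), dim=0)

@torch.no_grad()
def pixel_descriptors(unet, x0, timesteps=(50, 150, 250), img_size=64):
    """Return one concatenated feature vector per pixel: shape (B, C_total, H, W)."""
    captured, feats = [], []
    # capture every decoder (up) block's activation with forward hooks
    handles = [blk.register_forward_hook(lambda m, i, o: captured.append(o))
               for blk in unet.up_blocks]          # assumed attribute name
    for t in timesteps:
        tt = torch.full((x0.shape[0],), t, dtype=torch.long, device=x0.device)
        ab = alpha_bars.to(x0.device)[tt].view(-1, 1, 1, 1)
        x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * torch.randn_like(x0)  # noise to step t
        captured.clear()
        unet(x_t, tt)                              # forward pass only to fire the hooks
        feats += [F.interpolate(f, size=(img_size, img_size),
                                mode="bilinear", align_corners=False)
                  for f in captured]               # upsample every map to full resolution
    for h in handles:
        h.remove()
    return torch.cat(feats, dim=1)                 # concatenate along the channel axis
```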
The segmentation head achieves strong performance on the CelebA-HQ validation set, demonstrating the quality of the features extracted from the trained DDPM.
Here are some end-to-end results, showing a synthetic image generated by the DDPM and the corresponding segmentation map produced by the MLP head.
Here are some validation results, showing an image, its ground-truth mask from the CelebA-HQ dataset, and the corresponding segmentation map produced by DiffuseSeg.
- Colab Notebook: Colab
- Generated Dataset: A starter synthetic dataset of (image, mask) pairs (to be updated further) can be found here.
- Model Weights (DDPM): Weights trained on the resized CelebA-HQ-256 dataset can be found here.
- Model Weights (MLPs): The segmentation head trained on features obtained from the above DDPM can be found here.
- Clone the repository:

  ```bash
  git clone https://github.com/harish-jhr/DiffuseSeg.git
  cd DiffuseSeg
  ```

- Create a virtual environment and install dependencies:

  ```bash
  conda create -n diffuseg_env python=3.9
  conda activate diffuseg_env
  pip install -r requirements.txt
  ```
- Training the DDPM

  To train the diffusion model from scratch, use the `DDPM-train.py` script. Make sure your dataset path and training parameters are correctly set in `utils/config.yaml`. Also make the dataset-specific changes (`im_size`, `im_channels`) in the config file; architectural changes (the number of down/mid/up blocks) can be made there as well. An illustrative config sketch follows this step.

  ```bash
  python utils/DDPM-train.py
  ```
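For orientation only, the fields mentioned above might look something like the following. This is an illustrative sketch (field names and values are assumptions); consult the actual `utils/config.yaml` for the real schema.

```yaml
# Illustrative only -- see utils/config.yaml for the real schema.
dataset_params:
  im_path: data/CelebAHQ   # dataset path (assumed field name)
  im_size: 64              # dataset-specific
  im_channels: 3           # dataset-specific
model_params:
  num_down_blocks: 3       # architecture is adjustable here
  num_mid_blocks: 2
  num_up_blocks: 3
```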
- Feature Extraction

  Once the DDPM is trained, use the `Feature_extractor.py` script to generate and save the pixel-wise feature descriptors from a set of images.

  ```bash
  python utils/Feature_extractor.py
  ```
- Training the Segmentation Head

  Finally, train the ensemble of MLPs on the extracted features using the `train_MLPs.py` script (a conceptual sketch follows this step).

  ```bash
  python utils/train_MLPs.py
  ```
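Conceptually, each MLP in the ensemble is just a small pixel-wise classifier over the saved descriptors. A minimal sketch, in which the dimensions, class count, and hyperparameters are all assumptions:

```python
import torch
import torch.nn as nn

def make_mlp(in_dim, num_classes, hidden=128):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, num_classes))

def train_ensemble(features, labels, num_classes, n_models=10, epochs=5):
    """features: (N_pixels, C) descriptors; labels: (N_pixels,) class ids."""
    models = [make_mlp(features.shape[1], num_classes) for _ in range(n_models)]
    loss_fn = nn.CrossEntropyLoss()
    for mlp in models:
        opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)
        for _ in range(epochs):
            perm = torch.randperm(features.shape[0])
            for i in range(0, len(perm), 4096):        # mini-batches of pixels
                idx = perm[i:i + 4096]
                loss = loss_fn(mlp(features[idx]), labels[idx])
                opt.zero_grad(); loss.backward(); opt.step()
    return models

# At inference, per-pixel logits are averaged (or majority-voted) across the ensemble.
```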
- Inference

  - To generate synthetic images using the trained DDPM, use the script `DDPM_inference.py` and adjust the inference parameters in the config file.

    ```bash
    python utils/DDPM_inference.py
    ```

  - To test only the segmentation head, use the script `DDPM-seg_inference.py`, which returns predicted masks along with mIoU (mean IoU over all semantic parts) when GT masks are provided (see the metric sketch after this list).

    ```bash
    python utils/DDPM-seg_inference.py
    ```

  - To generate synthetic images with the trained DDPM and then obtain their segmentation maps (end-to-end inference), use the script `DiffuseSeg_e2e.py`.

    ```bash
    python utils/DiffuseSeg_e2e.py
    ```
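For reference, the mIoU reported above is the per-class intersection-over-union averaged over the semantic parts. A minimal sketch of that metric (function and variable names are assumptions, not this repo's code):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """pred, gt: integer class masks of identical shape."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                    # skip classes absent from both masks
            ious.append(inter / union)
    return float(np.mean(ious))
```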
Find below the original paper that inspired this approach:

```bibtex
@inproceedings{baranchuk2022label,
  title={Label-Efficient Semantic Segmentation with Diffusion Models},
  author={Dmitry Baranchuk and Ivan Rubachev and Andrey Voynov and Valentin Khrulkov and Artem Babenko},
  booktitle={International Conference on Learning Representations},
  year={2022}
}
```




