diffusion-DDPM

PyTorch Implementation of "Denoising Diffusion Probabilistic Models", Ho et al., 2020

Overview

This repo is yet another denoising diffusion probabilistic model (DDPM) implementation. It tries to stick to the original paper as closely as possible.

The straightforward UNet model definition (without any fancy model builders, helpers, etc.) is intentional: it can be quite difficult to recover and understand the original model architecture when it is hidden behind layers of abstraction, and a plain definition lets you see the underlying entities clearly. However, some kind of automated model generation from configuration files is handy while experimenting, so it will be added in the near future.

Some equations are borrowed from this blog post, which demystifies the math behind the diffusion process.

Diffusion process

The diffusion process is implemented as part of a class called DDPMPipeline, which contains the forward $q(x_t \vert x_{t-1})$ and backward $p_\theta(x_{t-1} \vert x_t)$ diffusion processes.

The forward diffusion process applies Gaussian noise to the input image according to a variance schedule. The backward diffusion process "denoises" an image using model predictions. It is worth mentioning that the UNet model in this process predicts a noise residual, and the "denoised" image at each step is obtained by applying the following equation: $$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\epsilon(x_t,t)\right) + \sigma_t z$$

Here, $\epsilon$ is the UNet model; $\alpha_t$ and $\bar{\alpha}_t$ are precomputed from the variance schedule, and $\sigma_t$ is calculated from these precomputed values at each diffusion step.
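The two processes above can be sketched in a few lines of PyTorch. This is an illustrative sketch, not the repo's actual `DDPMPipeline` API: the names (`forward_diffusion`, `reverse_step`), the linear beta schedule, and the choice $\sigma_t^2 = \beta_t$ are assumptions for the example.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # linear variance schedule (assumed)
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)  # cumulative product, a_bar_t

def forward_diffusion(x0, t, noise):
    """q(x_t | x_0) in closed form: x_t = sqrt(a_bar_t)*x0 + sqrt(1 - a_bar_t)*eps."""
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

def reverse_step(model_eps, x_t, t, z):
    """One p_theta(x_{t-1} | x_t) step, implementing the equation above.

    model_eps is the UNet's predicted noise eps(x_t, t); z ~ N(0, I).
    """
    beta_t = betas[t].view(-1, 1, 1, 1)
    alpha_t = alphas[t].view(-1, 1, 1, 1)
    a_bar_t = alphas_bar[t].view(-1, 1, 1, 1)
    mean = (x_t - (1.0 - alpha_t) / (1.0 - a_bar_t).sqrt() * model_eps) / alpha_t.sqrt()
    sigma_t = beta_t.sqrt()  # the sigma_t^2 = beta_t variant from the paper
    return mean + sigma_t * z
```

Note that the forward process needs no loop: because Gaussians compose, $x_t$ can be sampled directly from $x_0$, while sampling runs `reverse_step` for all $T$ timesteps in reverse.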

UNet

As stated in the original paper:

  • Our neural network architecture follows the backbone of PixelCNN++, which is a U-Net based on a Wide ResNet.
  • We replaced weight normalization with group normalization to make the implementation simpler.
  • Our 32×32 models use four feature map resolutions (32×32 to 4×4), and our 256×256 models use six.
  • All models have two convolutional residual blocks per resolution level and self-attention blocks at the 16×16 resolution between the convolutional blocks.
  • Diffusion time is specified by adding the Transformer sinusoidal position embedding into each residual block.

This implementation follows the default ResNet block architecture without any width multipliers, for simplicity. Also, the current UNet implementation works better at 128×128 resolution (see next sections) and thus has 5 feature map resolutions (128 → 64 → 32 → 16 → 8). It is worth noting that subsequent papers suggest more appropriate and better UNet architectures for the diffusion problem.
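The Transformer sinusoidal position embedding mentioned in the list above can be sketched as follows. The function name and the assumption that `dim` is even are illustrative, not the repo's actual code:

```python
import math
import torch

def timestep_embedding(t, dim):
    """Transformer-style sinusoidal embedding of diffusion timesteps.

    t: (B,) integer tensor of timesteps; returns a (B, dim) float tensor.
    Sketch only: assumes dim is even.
    """
    half = dim // 2
    # Geometric progression of frequencies from 1 down to 1/10000.
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
```

The resulting vector is typically passed through a small MLP and added into each residual block, so every block knows which diffusion step it is denoising.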

Results

Training was performed on two datasets:

128×128 resolution

All 128×128 models were trained for 300 epochs with a cosine annealing schedule, initial learning rate 2e-4, batch size 6, and 1000 diffusion timesteps.
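With those hyperparameters, the optimizer and scheduler setup might look like the sketch below. The optimizer choice (Adam), the `T_max` value, and the placeholder model are assumptions; the repo's actual training script may differ:

```python
import torch

# Placeholder for the repo's UNet; any nn.Module with parameters works here.
model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)

# Hyperparameters from the section above; Adam is an assumed choice.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
# Cosine annealing over the full run (step count from the butterflies run below).
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50266)
```

Calling `scheduler.step()` after each optimizer step then decays the learning rate from 2e-4 toward zero along a cosine curve over the whole run.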

Training on smithsonian-butterflies-subset

300 epochs, 50266 steps

[Sample grids at epochs 4, 99, 204, and 300, plus two sampling runs from the epoch-300 checkpoint]

Training on croupier-mtg-dataset

300 epochs, 72599 steps

[Sample grids at epochs 4, 99, 204, and 300, plus two sampling runs from the epoch-300 checkpoint]

256×256 resolution

All 256×256 models were trained for 300 epochs with a cosine annealing schedule, initial learning rate 2e-5, batch size 6, and 1000 diffusion timesteps.

Training on smithsonian-butterflies-subset

300 epochs, 50266 steps

[Sample grids at epochs 4, 100, 205, and 300]

[TODO description]
