## Setup

In [None]:
# login to huggingface
from huggingface_hub import notebook_login

notebook_login()

In [2]:
%%capture
!sudo apt -qq install git-lfs
!git config --global credential.helper store

In [None]:
import numpy as np
import torch
import torch.nn.functional as F
from matplotlib import pyplot as plt
from PIL import Image
import torchvision

def show_images(x):
    """Given a batch of images x, make a grid and convert to PIL"""
    x = x * 0.5 + 0.5  # Map from (-1, 1) back to (0, 1)
    grid = torchvision.utils.make_grid(x)
    grid_im = grid.detach().cpu().permute(1, 2, 0).clip(0, 1) * 255
    grid_im = Image.fromarray(np.array(grid_im).astype(np.uint8))
    return grid_im


def make_grid(images, size=64):
    """Given a list of PIL images, stack them together into a line for easy viewing"""
    output_im = Image.new("RGB", (size * len(images), size))
    for i, im in enumerate(images):
        output_im.paste(im.resize((size, size)), (i * size, 0))
    return output_im


# Mac users may need device = 'mps' (untested)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

## Minimum Viable Pipeline

Hugging Face Diffusers is divided into three main components:
- **Pipelines**: high-level classes designed to rapidly generate samples from popular trained diffusion models in a user-friendly fashion.
- **Models**: popular architectures for training new diffusion models, e.g. UNet.
- **Schedulers**: techniques for generating images from noise during inference and for producing noisy images during training.

In [None]:
# import denoising diffusion probabilistic models (see https://arxiv.org/abs/2006.11239)
from diffusers import DDPMPipeline

# load the butterfly ddpm model pipeline
butterfly_pipeline = DDPMPipeline.from_pretrained("johnowhitaker/ddpm-butterflies-32px").to(device)

In [None]:
# create a batch of images
images = butterfly_pipeline(batch_size=8).images

# show the images
make_grid(images)

## Train the DDPM

**Case study**: Train on the [Smithsonian Butterflies subset](https://huggingface.co/datasets/huggan/smithsonian_butterflies_subset), a dataset of 1000 butterfly images.

### Dataset Setup

In [None]:
import torchvision
from datasets import load_dataset
from torchvision import transforms

# load the dataset
dataset = load_dataset("huggan/smithsonian_butterflies_subset", split="train")

In [7]:
IMAGE_SIZE = 32
BATCH_SIZE = 64

# augmentation transforms
preprocess = transforms.Compose(
    [
        transforms.Resize((IMAGE_SIZE, IMAGE_SIZE)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    ]
)

def do_transform(examples):
    images = [preprocess(image.convert("RGB")) for image in examples["image"]]
    return {"images": images}

# apply the transform to the dataset
dataset.set_transform(do_transform)

# create a dataloader
train_dataloader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

In [None]:
# show the first batch of images
xb = next(iter(train_dataloader))["images"].to(device)[:8]
print("X shape:", xb.shape)
show_images(xb).resize((8 * 64, 64), resample=Image.NEAREST)
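
The `Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))` transform and the `x * 0.5 + 0.5` line in `show_images` are inverses of each other. As a minimal sketch in plain Python (the helper names `normalize` and `denormalize` are hypothetical and exist only for this illustration), the round trip for a single channel value looks like this:

```python
def normalize(pixel, mean=0.5, std=0.5):
    """Mimic transforms.Normalize for one channel value in [0, 1]."""
    return (pixel - mean) / std


def denormalize(value):
    """Inverse mapping used in show_images: (-1, 1) back to (0, 1)."""
    return value * 0.5 + 0.5


samples = [0.0, 0.25, 0.5, 1.0]           # pixel values after ToTensor()
normalized = [normalize(p) for p in samples]
restored = [denormalize(v) for v in normalized]

print(normalized)  # [-1.0, -0.5, 0.0, 1.0]
print(restored)    # [0.0, 0.25, 0.5, 1.0]
```

Training on values centered around zero matches the zero-mean Gaussian noise that is added later.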

### Scheduler Setup

Training requires noisy versions of the input images at many different noise levels. The **DDPMScheduler** adds this noise for us, so we can produce a noisy image **in a single call without working through the mathematical calculations** by hand.

In [9]:
from diffusers import DDPMScheduler

noise_scheduler = DDPMScheduler(num_train_timesteps=1000)

The DDPM paper describes a corruption process that adds a small amount of noise for every `timestep`. 

$q(\mathbf{x}_t \vert \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t\mathbf{I}) \quad
q(\mathbf{x}_{1:T} \vert \mathbf{x}_0) = \prod^T_{t=1} q(\mathbf{x}_t \vert \mathbf{x}_{t-1})$<br><br>

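Where $x_{t-1}$ is the image at the previous timestep, $x_t$ is the image at the current timestep, and $\beta_t$ is the timestep-dependent noise level (the variance of the noise added at that step).

**Flow**:
- Take $x_{t-1}$ as input
- Scale it by $\sqrt{1 - \beta_t}$
- Add Gaussian noise with covariance $\beta_t\mathbf{I}$ (standard deviation $\sqrt{\beta_t}$ per pixel)
- Output $x_t$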
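
The flow above can be sketched in plain Python on a toy "image" of four pixel values. This is only an illustration: `forward_step` is a hypothetical helper, and the linear range of `betas` from 0.0001 to 0.02 is an assumption chosen to resemble a typical DDPM schedule.

```python
import math
import random


def forward_step(x_prev, beta_t, rng):
    """One corruption step: scale by sqrt(1 - beta_t), then add N(0, beta_t) noise."""
    scale = math.sqrt(1.0 - beta_t)
    sigma = math.sqrt(beta_t)  # beta_t is a variance, so the std is its square root
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x_prev]


rng = random.Random(0)
x = [1.0, -1.0, 0.5, 0.0]  # toy "image" (assumed values in [-1, 1])

# Assumed linear schedule with 1000 steps
betas = [0.0001 + i * (0.02 - 0.0001) / 999 for i in range(1000)]

# Apply the step once per timestep; after all steps x is close to pure noise
for beta_t in betas:
    x = forward_step(x, beta_t, rng)

print(x)
```

Running the loop for every timestep is exactly the repeated work that the closed-form expression for $q(\mathbf{x}_t \vert \mathbf{x}_0)$ avoids.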

However, we don't want to repeat this operation $t$ times to reach the noisy image at timestep $t$, so we use another formula that gives $x_t$ directly from $x_0$.

$\begin{aligned}
q(\mathbf{x}_t \vert \mathbf{x}_0) &= \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, {(1 - \bar{\alpha}_t)} \mathbf{I})
\end{aligned}$ where $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$ and $\alpha_i = 1-\beta_i$<br><br>

Where:
- $x_0$ is the initial (clean) image
- $x_t$ is the noisy image at timestep $t$
- $\bar{\alpha}_t$ is the cumulative product of the $\alpha_i$ up to timestep $t$, which sets the overall noise level at that timestep
- $\sqrt{\bar{\alpha}_t}$ is how much the **original image is scaled**; when this value is 0, the image is completely random noise, and when it is 1, the image is the same as the input image.
- $\sqrt{1 - \bar{\alpha}_t}$ is how much **noise is added to the image**; when this value is 0, the image is the same as the input image, and when it is 1, the image is completely random noise.
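
As a rough sketch of the closed form, the snippet below builds $\bar{\alpha}_t$ for a linear schedule and samples $x_t$ directly from $x_0$ as $\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon$. The schedule range (0.0001 to 0.02) is assumed to match the `DDPMScheduler` defaults, and `q_sample` is a hypothetical helper name, not a Diffusers function.

```python
import math
import random

T = 1000
# Assumed linear beta schedule
betas = [0.0001 + i * (0.02 - 0.0001) / (T - 1) for i in range(T)]

# alpha_bar_t is the running product of (1 - beta_i) for i <= t
alphas_cumprod = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alphas_cumprod.append(running)


def q_sample(x0, t, rng):
    """Draw x_t directly from x_0 using the closed form."""
    signal = math.sqrt(alphas_cumprod[t])        # scale of the original image
    noise = math.sqrt(1.0 - alphas_cumprod[t])   # scale of the added noise
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]


rng = random.Random(0)
x0 = [1.0, -1.0, 0.5, 0.0]  # toy "image"
for t in (0, 250, 999):
    signal = math.sqrt(alphas_cumprod[t])
    noise = math.sqrt(1.0 - alphas_cumprod[t])
    print(t, round(signal, 3), round(noise, 3), q_sample(x0, t, rng))
```

The squares of the two coefficients always sum to 1, so the total variance stays roughly constant while the signal is traded for noise.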

In [None]:
plt.plot(noise_scheduler.alphas_cumprod.cpu() ** 0.5, label=r"${\sqrt{\bar{\alpha}_t}}$")
plt.plot((1 - noise_scheduler.alphas_cumprod.cpu()) ** 0.5, label=r"$\sqrt{(1 - \bar{\alpha}_t)}$")
plt.legend(fontsize="x-large");

At the timestep where the **two curves cross**, the contributions of the original image ($x_0$) and the noise to the combined image ($x_t$) are **equal**.
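
The two curves are equal when $\sqrt{\bar{\alpha}_t} = \sqrt{1 - \bar{\alpha}_t}$, i.e. when $\bar{\alpha}_t = 0.5$. A small sketch that locates this timestep, assuming a linear schedule from 0.0001 to 0.02 over 1000 steps (the variable name `crossover` is just for illustration):

```python
T = 1000
# Assumed linear beta schedule
betas = [0.0001 + i * (0.02 - 0.0001) / (T - 1) for i in range(T)]

alpha_bar = 1.0
crossover = None
for t, beta in enumerate(betas):
    alpha_bar *= 1.0 - beta
    if alpha_bar <= 0.5:
        # first timestep where noise contributes at least as much as the image
        crossover = t
        break

print("curves cross near timestep", crossover)
```

For a different `beta_start`, `beta_end`, or `beta_schedule`, the crossing point moves, which is one quick way to compare schedules.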

In [None]:
# Just random things to try:
# One with too little noise added:
noise_scheduler_temp = DDPMScheduler(num_train_timesteps=1000, beta_start=0.001, beta_end=0.004)

plt.plot(noise_scheduler_temp.alphas_cumprod.cpu() ** 0.5, label=r"${\sqrt{\bar{\alpha}_t}}$")
plt.plot((1 - noise_scheduler_temp.alphas_cumprod.cpu()) ** 0.5, label=r"$\sqrt{(1 - \bar{\alpha}_t)}$")
plt.legend(fontsize="x-large");


In [None]:
# The 'cosine' schedule, which may be better for small image sizes:
noise_scheduler_temp = DDPMScheduler(num_train_timesteps=1000, beta_schedule='squaredcos_cap_v2')

plt.plot(noise_scheduler_temp.alphas_cumprod.cpu() ** 0.5, label=r"${\sqrt{\bar{\alpha}_t}}$")
plt.plot((1 - noise_scheduler_temp.alphas_cumprod.cpu()) ** 0.5, label=r"$\sqrt{(1 - \bar{\alpha}_t)}$")
plt.legend(fontsize="x-large");
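
For intuition about where the cosine curve comes from, here is a sketch based on the cosine schedule of Nichol and Dhariwal's "Improved Denoising Diffusion Probabilistic Models". It is an assumption that `squaredcos_cap_v2` follows this recipe with an offset of 0.008 and a cap of 0.999; the function names below are hypothetical.

```python
import math


def cosine_alpha_bar(u):
    """Target cumulative signal level at fractional time u in [0, 1]."""
    return math.cos((u + 0.008) / 1.008 * math.pi / 2) ** 2


def cosine_betas(num_steps, max_beta=0.999):
    """Derive per-step betas from ratios of the target alpha_bar curve."""
    betas = []
    for i in range(num_steps):
        t1 = i / num_steps
        t2 = (i + 1) / num_steps
        # cap beta so the final steps do not blow up
        betas.append(min(1.0 - cosine_alpha_bar(t2) / cosine_alpha_bar(t1), max_beta))
    return betas


betas = cosine_betas(1000)

# Rebuild alpha_bar as the running product of (1 - beta)
alphas_cumprod = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alphas_cumprod.append(running)

for t in (0, 250, 500, 750, 999):
    print(t, round(math.sqrt(alphas_cumprod[t]), 4))
```

Compared with the linear schedule, the signal level here falls off more gradually in the early and middle timesteps.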


In [None]:
# simulate adding noise to the batch
timesteps = torch.linspace(0, 999, 8).long().to(device)
noise = torch.randn_like(xb)
noisy_xb = noise_scheduler.add_noise(xb, noise, timesteps)

print(f"Timesteps: {timesteps}")
print(f"Noise shape: {noise.shape}")
print(f"Noisy Xb shape: {noisy_xb.shape}")
# display() is used because a notebook cell only renders its last expression
display(show_images(noise).resize((8 * 64, 64), resample=Image.NEAREST))
display(show_images(noisy_xb).resize((8 * 64, 64), resample=Image.NEAREST))

## Model Setup

Most diffusion models are based on the [UNet](https://arxiv.org/abs/1505.04597) architecture.

![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/unet-model.png)

**Flow**:

- The input image passes through several blocks of ResNet layers, each of which halves the image size
- It then passes through the same number of blocks, which upsample it again
- Skip connections link the features on the downsample path to the corresponding layers in the upsample path


The model produces an output with the **same size** as the input image.

In [None]:
from diffusers import UNet2DModel

# create a UNet model
model = UNet2DModel(
    sample_size=IMAGE_SIZE, # target image size
    in_channels=3, # number of input channels
    out_channels=3, # number of output channels
    layers_per_block=2, # how many ResNet layers to use per UNet block
    block_out_channels=(64, 128, 256, 512), # number of output channels for each UNet block; More channels -> more parameters
    down_block_types=(
        "DownBlock2D", # regular ResNet downsampling block
        "DownBlock2D",
        "AttnDownBlock2D", # ResNet downsampling block with spatial self-attention
        "AttnDownBlock2D",
    ),
    up_block_types=(
        "AttnUpBlock2D", # ResNet upsampling block with spatial self-attention
        "AttnUpBlock2D",
        "UpBlock2D", # regular ResNet upsampling block
        "UpBlock2D",
    ),
)
model.to(device);

Notes:
- Higher-resolution images may require more down and up-blocks
- Keep the attention layers only at the lowest resolution to save memory
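
To see why attention is cheaper at low resolution, note that spatial self-attention compares every position of a feature map with every other position. The sketch below counts the pairwise scores for one attention map; it ignores heads, channels, and batch size, and the resolutions are illustrative values obtained by repeatedly halving a 32px input.

```python
def attention_entries(resolution):
    """Pairwise scores in one self-attention map over a square feature map."""
    tokens = resolution * resolution  # one token per spatial position
    return tokens * tokens            # every token attends to every token


for res in (32, 16, 8, 4):
    print(f"{res}x{res}: {attention_entries(res):,} scores")
# 32x32: 1,048,576 scores
# 16x16: 65,536 scores
# 8x8: 4,096 scores
# 4x4: 256 scores
```

Each halving of the resolution cuts the attention map by a factor of 16, which is why attention is usually placed in the deepest blocks.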

In [None]:
with torch.no_grad():
    model_prediction = model(noisy_xb, timesteps).sample
model_prediction.shape

## Training Loop

Flow:
- Get each batch of images
- Sample some random timesteps
- Noise the data accordingly
- Feed the noisy data through the model
- Compare the model predictions with the target (noise) using MSE as the loss function
- Update the model parameters via `loss.backward()` and `optimizer.step()`

In [None]:
# set the noise scheduler
noise_scheduler = DDPMScheduler(
    num_train_timesteps=1000, beta_schedule="squaredcos_cap_v2"
)

# training loop
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

losses = []

for epoch in range(30):
    for step, batch in enumerate(train_dataloader):
        clean_images = batch["images"].to(device)

        # sample noise to add to the images
        noise = torch.randn_like(clean_images)
        bs = clean_images.shape[0]

        # sample a random timestep for each image in the batch
        timesteps = torch.randint(
            0, noise_scheduler.num_train_timesteps, (bs,), device=clean_images.device
        )

        # add noise to the images according to the noise magnitude at the sampled timestep
        noisy_images = noise_scheduler.add_noise(clean_images, noise, timesteps)

        # get the model prediction
        noise_pred = model(noisy_images, timesteps, return_dict=False)[0]

        # calculate the loss
        loss = F.mse_loss(noise_pred, noise)
        loss.backward()
        losses.append(loss.item())

        # update the model parameters with the optimizer
        optimizer.step()
        optimizer.zero_grad()

    if (epoch + 1) % 5 == 0:
        loss_last_epoch = sum(losses[-len(train_dataloader) :]) / len(train_dataloader)
        print(f"Epoch {epoch + 1}, Loss: {loss_last_epoch:.4f}")


In [None]:
fig, axs = plt.subplots(1, 2, figsize=(12, 4))
axs[0].plot(losses)
axs[1].plot(np.log(losses))
plt.show()

In [None]:
# Uncomment to instead load the model I trained earlier:
# model = butterfly_pipeline.unet

## Inference Process (Generate Images)

### Option 1: Creating a pipeline

In [None]:
from diffusers import DDPMPipeline

image_pipe = DDPMPipeline(
    unet=model,
    scheduler=noise_scheduler,
)

In [None]:
pipeline_output = image_pipe()
pipeline_output.images[0]

In [26]:
# save the pipeline
image_pipe.save_pretrained("butterfly_ddpm")

### Option 2: Writing a Sampling Loop

Flow:
- Begin with random noise
- Run through the scheduler timesteps from most to least noisy
- Remove a small amount of noise at each step
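
For reference, the noise-removal step can be written out explicitly. As a sketch, the sampling update from the DDPM paper (Algorithm 2), using the same $\alpha_t$ and $\bar{\alpha}_t$ as in the scheduler section and writing $\boldsymbol{\epsilon}_\theta$ for the model's noise prediction, is:

$\begin{aligned}
\mathbf{x}_{t-1} &= \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right) + \sigma_t \mathbf{z}
\end{aligned}$<br><br>

Here $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ for every step except the last, where $\mathbf{z} = \mathbf{0}$, and $\sigma_t^2 = \beta_t$ is one of the variance choices discussed in the paper. It is an assumption that the scheduler's `step()` call computes an update equivalent to this one; its internals are not shown in this notebook.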

In [35]:
from PIL import Image

In [None]:
# random starting point (8 random images):
sample = torch.randn(8, 3, IMAGE_SIZE, IMAGE_SIZE).to(device)

for i, t in enumerate(noise_scheduler.timesteps):

    # get model pred
    with torch.no_grad():
        residual = model(sample, t).sample

    # update sample with step
    sample = noise_scheduler.step(residual, t, sample).prev_sample

show_images(sample)

```
Hardware usage:
RTX 2060 6GB
CPU: 115%
Memory: 1639MB
GPU: 92%
GPU Memory: 3794MB
```

## Push Model to Huggingface

In [None]:
from huggingface_hub import get_full_repo_name

model_name = "butterfly-ddpm-32"
hub_model_id = get_full_repo_name(model_name)
hub_model_id

In [None]:
from huggingface_hub import HfApi, create_repo

create_repo(hub_model_id)

api = HfApi()
api.upload_folder(
    folder_path="butterfly_ddpm/scheduler", path_in_repo="", repo_id=hub_model_id
)
api.upload_folder(folder_path="butterfly_ddpm/unet", path_in_repo="", repo_id=hub_model_id)
api.upload_file(
    path_or_fileobj="butterfly_ddpm/model_index.json",
    path_in_repo="model_index.json",
    repo_id=hub_model_id,
)

In [None]:
from huggingface_hub import ModelCard

content = f"""
---
license: mit
tags:
- pytorch
- diffusers
- unconditional-image-generation
- diffusion-models-class
---

# Model Card for Unit 1 of the [Diffusion Models Class 🧨](https://github.com/huggingface/diffusion-models-class)

This model is a diffusion model for unconditional image generation of cute 🦋.

## Usage

```python
from diffusers import DDPMPipeline

pipeline = DDPMPipeline.from_pretrained('{hub_model_id}')
image = pipeline().images[0]
image
```
"""

card = ModelCard(content)
card.push_to_hub(hub_model_id)

In [None]:
from diffusers import DDPMPipeline

image_pipe = DDPMPipeline.from_pretrained(hub_model_id).to(device)
pipeline_output = image_pipe(batch_size=8)
make_grid(pipeline_output.images)

If loading fails with an error like the following:
```
An error occurred while trying to fetch /home/sugab/.cache/huggingface/hub/models--hiseulgi--butterfly-ddpm-32/snapshots/be8922159e3c7177c4573627a25ea3b28d074720: Error no file named diffusion_pytorch_model.safetensors found in directory /home/sugab/.cache/huggingface/hub/models--hiseulgi--butterfly-ddpm-32/snapshots/be8922159e3c7177c4573627a25ea3b28d074720.

Defaulting to unsafe serialization. Pass `allow_pickle=False` to raise an error instead.
```

You need to manually download the `*.safetensors` file from the Hugging Face repository and place it in the cache directory named in the error message.

## Acknowledgements

- [Denoising Diffusion Probabilistic Models - Paper](https://arxiv.org/abs/2006.11239)
- [What are Diffusion Models? - YouTube](https://www.youtube.com/watch?v=fbLgFrlTnGU)