We apply the head-wise masking technique to Stable Diffusion v3, aiming to preserve its generation performance while enabling the creation of complex degraded images through the masking mechanism.
We duplicate the branch that receives the input image and inject clean-image information by summing it with the output of a zero-convolution layer, allowing the model to preserve the original content.
However, we remove the modulation mechanism of AdaLN-Zero from the clean image input path, as the clean image information does not need to be influenced by the class conditioning.
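The zero-convolution summing described above can be sketched as follows. This is a minimal illustration with assumed module and variable names (`ZeroConvInject`, `base_feat`, `clean_feat`), not the project's actual code; the key property is that the conv is zero-initialized, so training starts from the unmodified base model.

```python
import torch
import torch.nn as nn

class ZeroConvInject(nn.Module):
    """Sketch: add clean-image features through a zero-initialized 1x1 conv."""

    def __init__(self, channels):
        super().__init__()
        self.zero_conv = nn.Conv2d(channels, channels, kernel_size=1)
        # Zero init: the injected path contributes nothing at the start of training
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, base_feat, clean_feat):
        # At initialization, output == base_feat exactly, preserving the
        # pretrained model's behavior before fine-tuning begins
        return base_feat + self.zero_conv(clean_feat)
```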
Haze and rain both increase overall image brightness. When the two conditions are applied simultaneously, this can cause the rain effect to become overly blurred.
When generating multi-degradation images, set the masking ratio to a fractional (float) value between 0 and 1, rather than strictly 0 or 1.
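A fractional head-wise mask can be sketched as below. The function name and tensor layout (`B, num_heads, L, D`) are assumptions for illustration; the point is that a per-head mask of 0/1 reproduces hard masking, while intermediate values softly attenuate a head's contribution for mixing degradations.

```python
import torch

def masked_heads(head_out, mask):
    """Apply a per-head mask to attention-head outputs.

    head_out: (B, num_heads, L, D) attention-head outputs
    mask:     (num_heads,) values in [0, 1]; 0/1 gives hard head-wise
              masking, fractional values blend degradation strengths
    """
    return head_out * mask.view(1, -1, 1, 1)
```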
Noise equation: noise_init = α · noise + β · input, where α + β = 1.
As the noise weight decreases, the degradation condition becomes weaker. Therefore, we want the sum of α and β to be greater than 1. Initial noise equation: noise_init = α · noise + β · input, where α + β > 1.
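The initial-noise blend above can be sketched in a few lines. The function name is hypothetical; the only constraint taken from the text is α + β > 1, which keeps the noise (and hence the degradation condition) strong while still injecting content from the clean input.

```python
import torch

def make_initial_noise(clean, alpha, beta):
    """Blend Gaussian noise with the clean input image.

    clean: image tensor (e.g. in [-1, 1]); alpha + beta > 1 keeps the
    degradation condition strong while preserving input content.
    """
    assert alpha + beta > 1.0, "the text requires alpha + beta > 1"
    noise = torch.randn_like(clean)
    return alpha * noise + beta * clean
```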
Problem: The generated results tend to preserve the overall color structure of the initial input image.
Step-wise generation results are presented for the haze, rain, and haze&rain classes.
It is observed that haze, being a low-frequency degradation, is generated in the early steps of the diffusion model, whereas rain, which has high-frequency characteristics, is generated in the later steps.
[Qian et al., "Boosting Diffusion Models with Moving Average Sampling in Frequency Domain," CVPR 2024]
Qian et al. stated that "Diffusion models at the denoising process first focus on the recovery of low-frequency components in the earlier timesteps and gradually shift to recovering high-frequency details in the later timesteps."
Therefore, degradation-specific details (rain) should be generated in the later stages of the denoising process.
[Jiang et al., "Focal Frequency Loss for Image Reconstruction and Synthesis," ICCV 2021]
Jiang et al. use a frequency-domain loss instead of a pixel-space loss when training GANs or VAEs, to better learn high-frequency details.
We train the model to learn the degradation details (high-frequency components).
Since high-frequency components become more important in the later stages of the backward process (i.e., at smaller timesteps), we multiply the focal frequency loss by a weighting factor of (1 − T/1000) to assign greater importance when T is small.
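A minimal sketch of this timestep-weighted focal frequency loss is shown below. It follows the spirit of Jiang et al. (spectral error weighted by its own magnitude) plus the (1 − T/1000) factor from the text; the exact normalization and `alpha` exponent are assumptions, not the project's exact implementation.

```python
import torch

def focal_frequency_loss(pred, target, alpha=1.0):
    """Frequency-domain loss that up-weights poorly reconstructed frequencies."""
    pred_f = torch.fft.fft2(pred, norm="ortho")
    target_f = torch.fft.fft2(target, norm="ortho")
    diff = pred_f - target_f
    # Focal weight: larger spectral errors get larger weights (Jiang et al., 2021)
    w = diff.abs() ** alpha
    w = (w / (w.max() + 1e-8)).detach()  # normalize to [0, 1]; no grad through w
    return (w * diff.abs() ** 2).mean()

def weighted_ffl(pred, target, t, t_max=1000):
    # (1 - T/1000): weight grows as t shrinks, i.e. in the later denoising
    # stages where high-frequency detail (rain) is formed
    return (1.0 - t / t_max) * focal_frequency_loss(pred, target)
```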
Artificial noise is suppressed, resulting in the effective generation of images containing a mixture of rain and haze degradations.
It shows visually effective results in specific style mixing scenarios.
We recommend installing 🤗 Diffusers in a virtual environment from PyPI or Conda. For more details about installing PyTorch and Flax, please refer to their official documentation.
With pip (official package):
pip install --upgrade diffusers[torch]
With conda (maintained by the community):
conda install -c conda-forge diffusers
With pip (official package):
pip install --upgrade diffusers[flax]
Please refer to the How to use Stable Diffusion in Apple Silicon guide.
Generating outputs is super easy with 🤗 Diffusers. To generate an image from text, use the from_pretrained method to load any pretrained diffusion model (browse the Hub for 30,000+ checkpoints):
from diffusers import DiffusionPipeline
import torch
pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipeline.to("cuda")
pipeline("An image of a squirrel in Picasso style").images[0]
You can also dig into the models and schedulers toolbox to build your own diffusion system:
from diffusers import DDPMScheduler, UNet2DModel
from PIL import Image
import torch
scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256")
model = UNet2DModel.from_pretrained("google/ddpm-cat-256").to("cuda")
scheduler.set_timesteps(50)
sample_size = model.config.sample_size
noise = torch.randn((1, 3, sample_size, sample_size), device="cuda")
input = noise
for t in scheduler.timesteps:
    with torch.no_grad():
        noisy_residual = model(input, t).sample
    prev_noisy_sample = scheduler.step(noisy_residual, t, input).prev_sample
    input = prev_noisy_sample
image = (input / 2 + 0.5).clamp(0, 1)
image = image.cpu().permute(0, 2, 3, 1).numpy()[0]
image = Image.fromarray((image * 255).round().astype("uint8"))
image
Check out the Quickstart to launch your diffusion journey today!