[feature] support stable diffusion inference (#1502)
* [sd] add files and run good.

* [sd] misc change.

* [sd] remove unused files.

* [sd] directly load 7 submodels.

* [sd] make pipeline clear

* [sd] remove unrelated scheduler

* [sd] use our own scheduler

* [sd] rm schedulers and outputs data structures

* [sd] replace log with mmengine log

* [sd] rm utils dir

* [sd] remove configure utils and model utils

* [sd] move transformer models to clip wrapper.

* [sd] load resource from url.

* [sd] remove utils and accelerate related

* [sd] remove utils

* [sd] separate vae from unet.

* [sd] move vae outside.

* [sd] move conditional unet to ddpm

* [sd] add stable unet to denoisenet.

* [sd] use denoising unet in ddpm and run good.

* [sd] unet forward with stable type

* [sd] delete unused code.

* [sd] remove default parameters

* [sd] add copy right and format clip_wrapper.py

* [sd] format vae.py

* [sd] format stable_diffuser.py

* [sd] append to last commit

* [sd] format unet_blocks.py

* [sd] format files.

* [sd] format init.py

* [sd] format demo

* [sd] format config

* [sd] add docstring.

* [sd] add transformers dependency.

* [sd] rename to stablediffusion.

* [sd] add docstr in stable_diffusion.py

* [sd] fix linter complain

* [sd] res_block.py add docstring.

* [sd] add docstring for vae.py

* [sd] fix linter.

* [sd] add docstrings for unet_blocks.py

* [sd] stable diffusion return torch tensor

* [sd] run linter.

* [sd] add docstr

* [sd] add docstring

* [sd] put load ckpt together.

* [sd] misc change

* [sd] add clip wrapper ut.

* [sd] add stable_diffusion ut

* [sd] sd ut skip windows cuda

* [sd] add vae ut.

* [sd] fix linter.

* [sd] fix vae ut.

* [sd] add ddpm ddim ut and remove unused block

* [sd] remove ut untested code.

* [sd] add resblock ut

* [sd] add resblock ut.

* [sd] add attention ut.

* [sd] add unet block ut.

* [sd] add embeddings ut.

* [sd] add unet block ut.

* [sd] add denoising unet ut.

* [sd] add vae ut.

* [sd] ddim ut

* [sd] add ddim ddpm ut

* [sd] add ut.

* [sd] add attention ut.

* [sd] remove useless code.

* [sd] add sd ut.

* [sd] add ddpm ut.

* [sd] rename config.

* [sd] put function inside timestep class.

* [sd] use basemodel for sd.

* [sd] add check for silu

* [sd] fix typo.

* [sd] add ut.

* [sd] add ut for unet_blocks.py

* [sd] add ut.

* [sd] load pretrained weights as mm way.

* [sd] rename device.

* [sd] remove main function in test files.

* [sd] add readme and remove demo.

* [sd] add stable diffusion readme.

* [sd] load pretrained ckpt by diffusers.

* [sd] update readme.

* [sd] fix clip_wrapper ut.

* [sd] format sd config.

* [sd] update metafile.yml

* [sd] try import transformers.
liuwenran committed Dec 30, 2022
1 parent 57d49ab commit c9ef99b
Showing 28 changed files with 4,311 additions and 153 deletions.
67 changes: 67 additions & 0 deletions configs/stable_diffusion/README.md
@@ -0,0 +1,67 @@
# Stable Diffusion (2022)

> [Stable Diffusion](https://github.com/CompVis/stable-diffusion)
> **Task**: Text2Image
<!-- [ALGORITHM] -->

## Abstract

<!-- [ABSTRACT] -->

Stable Diffusion is a latent diffusion model conditioned on the text embeddings of a CLIP text encoder, which allows you to create images from text inputs.

<!-- [IMAGE] -->

<div align=center >
<img src="https://user-images.githubusercontent.com/12782558/209609229-8221c7cc-d5c9-44d5-a1af-c254b5a95fae.png" width="400"/>
</div >

## Pretrained models

We use the Stable Diffusion v1.5 weights. The model is made up of several submodels with separate weights, including the VAE, the UNet, and the CLIP text encoder. Download the weights from [stable-diffusion-1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5) and set 'pretrained_model_path' in the config to the weights directory.

| Diffusion Model | Config | Download |
| :-------------------: | :------------------------------------------------: | :------------------------------------------------------------: |
| stable_diffusion_v1.5 | [config](./stable-diffusion_ddim_denoisingunet.py) | [model](https://huggingface.co/runwayml/stable-diffusion-v1-5) |
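
If `huggingface_hub` is installed, one convenient way to fetch all of the submodel weights (`vae/`, `unet/`, `text_encoder/`, etc.) is `snapshot_download`; a minimal sketch — the directory it returns is what `pretrained_model_path` should point to:

```python
from huggingface_hub import snapshot_download

# Downloads the repo's submodel folders and returns the local directory.
weights_dir = snapshot_download('runwayml/stable-diffusion-v1-5')
print(weights_dir)
```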

## Quick Start

Run the following code to generate an image from a text prompt.

```python
from mmengine import MODELS, Config
from torchvision import utils

from mmedit.utils import register_all_modules

register_all_modules()

config = 'configs/stable_diffusion/stable-diffusion_ddim_denoisingunet.py'
StableDiffuser = MODELS.build(Config.fromfile(config).model)
prompt = 'A mecha robot in a favela in expressionist style'
StableDiffuser = StableDiffuser.to('cuda')

image = StableDiffuser.infer(prompt)['samples']
utils.save_image(image, 'robot.png')
```
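
The config ships with an empty `pretrained_model_path`, so the weights directory has to be filled in before the model is built. If you prefer not to edit the config file on disk, the path can also be set programmatically; a minimal sketch, assuming the weights were downloaded to the hypothetical directory `./stable-diffusion-v1-5`:

```python
from mmengine import MODELS, Config

from mmedit.utils import register_all_modules

register_all_modules()

config = Config.fromfile(
    'configs/stable_diffusion/stable-diffusion_ddim_denoisingunet.py')
# Override init_cfg instead of editing the config file.
config.model.init_cfg.pretrained_model_path = './stable-diffusion-v1-5'
StableDiffuser = MODELS.build(config.model)
```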

## Comments

Our codebase for the stable diffusion models builds heavily on the [diffusers codebase](https://github.com/huggingface/diffusers), and the model weights are from [stable-diffusion-1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5).

Thanks for the efforts of the community!

## Citation

```bibtex
@misc{rombach2021highresolution,
title={High-Resolution Image Synthesis with Latent Diffusion Models},
author={Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},
year={2021},
eprint={2112.10752},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
22 changes: 22 additions & 0 deletions configs/stable_diffusion/metafile.yml
@@ -0,0 +1,22 @@
Collections:
- Metadata:
    Architecture:
    - Stable Diffusion
  Name: Stable Diffusion
  Paper:
  - https://github.com/CompVis/stable-diffusion
  README: configs/stable_diffusion/README.md
  Task:
  - text2image
  Year: 2022
Models:
- Config: configs/stable_diffusion/stable-diffusion_ddim_denoisingunet.py
  In Collection: Stable Diffusion
  Metadata:
    Training Data: Others
  Name: stable-diffusion_ddim_denoisingunet
  Results:
  - Dataset: Others
    Metrics: {}
    Task: Text2Image
  Weights: https://huggingface.co/runwayml/stable-diffusion-v1-5
58 changes: 58 additions & 0 deletions configs/stable_diffusion/stable-diffusion_ddim_denoisingunet.py
@@ -0,0 +1,58 @@
unet = dict(
    type='DenoisingUnet',
    image_size=512,
    base_channels=320,
    channels_cfg=[1, 2, 4, 4],
    unet_type='stable',
    act_cfg=dict(type='silu'),
    cross_attention_dim=768,
    num_heads=8,
    in_channels=4,
    layers_per_block=2,
    down_block_types=[
        'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D',
        'DownBlock2D'
    ],
    up_block_types=[
        'UpBlock2D', 'CrossAttnUpBlock2D', 'CrossAttnUpBlock2D',
        'CrossAttnUpBlock2D'
    ],
    output_cfg=dict(var='fixed'))

vae = dict(
    act_fn='silu',
    block_out_channels=[128, 256, 512, 512],
    down_block_types=[
        'DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D',
        'DownEncoderBlock2D'
    ],
    in_channels=3,
    latent_channels=4,
    layers_per_block=2,
    norm_num_groups=32,
    out_channels=3,
    sample_size=512,
    up_block_types=[
        'UpDecoderBlock2D', 'UpDecoderBlock2D', 'UpDecoderBlock2D',
        'UpDecoderBlock2D'
    ])

diffusion_scheduler = dict(
    type='DDIMScheduler',
    variance_type='learned_range',
    beta_end=0.012,
    beta_schedule='scaled_linear',
    beta_start=0.00085,
    num_train_timesteps=1000,
    set_alpha_to_one=False,
    clip_sample=False)

init_cfg = dict(type='Pretrained', pretrained_model_path='')

model = dict(
    type='StableDiffusion',
    diffusion_scheduler=diffusion_scheduler,
    unet=unet,
    vae=vae,
    init_cfg=init_cfg,
)
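
For reference, the `scaled_linear` schedule named in `diffusion_scheduler` is, in the diffusers convention this scheduler mirrors, linear in sqrt(beta) rather than in beta; a quick sketch of the resulting schedule under that assumption:

```python
import numpy as np

beta_start, beta_end = 0.00085, 0.012
num_train_timesteps = 1000

# 'scaled_linear': interpolate in sqrt(beta), then square.
betas = np.linspace(beta_start**0.5, beta_end**0.5, num_train_timesteps)**2
alphas_cumprod = np.cumprod(1.0 - betas)
print(betas[0], betas[-1])  # 0.00085, 0.012
```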
4 changes: 3 additions & 1 deletion mmedit/models/editors/__init__.py
@@ -51,6 +51,7 @@
from .singan import SinGAN
from .srcnn import SRCNNNet
from .srgan import SRGAN, ModifiedVGG, MSRResNet
+from .stable_diffusion import StableDiffusion
from .stylegan1 import StyleGAN1
from .stylegan2 import StyleGAN2
from .stylegan3 import StyleGAN3, StyleGAN3Generator
@@ -84,5 +85,6 @@
    'DiscoDiffusion', 'IDLossModel', 'PESinGAN', 'MSPIEStyleGAN2',
    'StyleGAN3Generator', 'InstColorization', 'NAFBaseline',
    'NAFBaselineLocal', 'NAFNet', 'NAFNetLocal', 'DDIMScheduler',
-    'DDPMScheduler', 'DenoisingUnet', 'ClipWrapper', 'EG3D', 'Restormer'
+    'DDPMScheduler', 'DenoisingUnet', 'ClipWrapper', 'EG3D', 'Restormer',
+    'StableDiffusion'
]
13 changes: 11 additions & 2 deletions mmedit/models/editors/ddim/ddim_scheduler.py
@@ -4,8 +4,8 @@
import numpy as np
import torch

+from mmedit.models.utils.diffusion_utils import betas_for_alpha_bar
from mmedit.registry import DIFFUSION_SCHEDULERS
-from ...utils.diffusion_utils import betas_for_alpha_bar


@DIFFUSION_SCHEDULERS.register_module()
@@ -82,13 +82,17 @@ def __init__(
        self.timesteps = np.arange(0, num_train_timesteps)[::-1].copy()

    def set_timesteps(self, num_inference_steps, offset=0):
+        """set time steps."""
+
        self.num_inference_steps = num_inference_steps
        self.timesteps = np.arange(
            0, self.num_train_timesteps,
            self.num_train_timesteps // self.num_inference_steps)[::-1].copy()
        self.timesteps += offset
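
To make the stride concrete: with the config's 1000 training steps and, say, 50 inference steps, `set_timesteps` above produces an evenly spaced, reversed schedule. Re-running the same expression standalone:

```python
import numpy as np

num_train_timesteps = 1000
num_inference_steps = 50

timesteps = np.arange(
    0, num_train_timesteps,
    num_train_timesteps // num_inference_steps)[::-1].copy()
print(timesteps[:3], timesteps[-3:])  # [980 960 940] ... [40 20 0]
```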

    def _get_variance(self, timestep, prev_timestep):
+        """get variance."""
+
        alpha_prod_t = self.alphas_cumprod[timestep]
        alpha_prod_t_prev = self.alphas_cumprod[
            prev_timestep] if prev_timestep >= 0 else self.final_alpha_cumprod
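
The body of `_get_variance` is cut off by the diff viewer; following Eq. (16) of the DDIM paper (and the diffusers convention this scheduler mirrors), the quantity it presumably computes from the two alphas above is:

```python
# Example cumulative-product alphas for the current and previous timestep.
alpha_prod_t, alpha_prod_t_prev = 0.1, 0.3

# sigma_t^2 = (1 - alpha_bar_prev) / (1 - alpha_bar_t)
#             * (1 - alpha_bar_t / alpha_bar_prev)
beta_prod_t = 1 - alpha_prod_t
beta_prod_t_prev = 1 - alpha_prod_t_prev
variance = (beta_prod_t_prev / beta_prod_t) * (
    1 - alpha_prod_t / alpha_prod_t_prev)
```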
@@ -109,6 +113,8 @@ def step(
        use_clipped_model_output: bool = False,
        generator=None,
    ):
+        """step forward."""
+
        output = {}
        if self.num_inference_steps is None:
            raise ValueError("Number of inference steps is 'None', '\
Expand All @@ -123,7 +129,8 @@ def step(
1] * 2 and self.variance_type in ['learned', 'learned_range']:
model_output, _ = torch.split(model_output, sample.shape[1], dim=1)
else:
raise TypeError
if not model_output.shape == sample.shape:
raise TypeError

# See formulas (12) and (16) of DDIM paper https://arxiv.org/pdf/2010.02502.pdf # noqa
# Ideally, read DDIM paper in-detail understanding
@@ -209,6 +216,8 @@ def step(
        return output

    def add_noise(self, original_samples, noise, timesteps):
+        """add noise."""
+
        sqrt_alpha_prod = self.alphas_cumprod[timesteps]**0.5
        sqrt_one_minus_alpha_prod = (1 - self.alphas_cumprod[timesteps])**0.5
        noisy_samples = (
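`add_noise` is likewise truncated; the visible coefficients are those of the standard forward process x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise, so a minimal sketch of the full computation (a standalone function with a scalar timestep, for simplicity) would be:

```python
import torch

def add_noise(alphas_cumprod, original_samples, noise, timestep):
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    sqrt_alpha_prod = alphas_cumprod[timestep]**0.5
    sqrt_one_minus_alpha_prod = (1 - alphas_cumprod[timestep])**0.5
    return (sqrt_alpha_prod * original_samples +
            sqrt_one_minus_alpha_prod * noise)

alphas_cumprod = torch.linspace(0.9999, 0.002, 1000)  # toy schedule
x0 = torch.randn(1, 4, 64, 64)
xt = add_noise(alphas_cumprod, x0, torch.randn_like(x0), timestep=500)
```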
