From 3fec8e2d7556d4a5ed7f27cf6ed85b2b3dca541d Mon Sep 17 00:00:00 2001 From: Steven Date: Tue, 21 Mar 2023 14:44:02 -0700 Subject: [PATCH 1/7] update performance tutorial --- docs/source/en/_toctree.yml | 4 +- docs/source/en/stable_diffusion.mdx | 476 ++++++++++++---------------- 2 files changed, 207 insertions(+), 273 deletions(-) diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index 09012a5c693d..86cca71dd3d1 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -3,8 +3,6 @@ title: ๐Ÿงจ Diffusers - local: quicktour title: Quicktour - - local: stable_diffusion - title: Stable Diffusion - local: installation title: Installation title: Get started @@ -13,6 +11,8 @@ title: Overview - local: using-diffusers/write_own_pipeline title: Understanding models and schedulers + - local: stable_diffusion + title: Improve DiffusionPipeline performance - local: tutorials/basic_training title: Train a diffusion model title: Tutorials diff --git a/docs/source/en/stable_diffusion.mdx b/docs/source/en/stable_diffusion.mdx index 8190813e488a..72f3e1a39121 100644 --- a/docs/source/en/stable_diffusion.mdx +++ b/docs/source/en/stable_diffusion.mdx @@ -1,333 +1,267 @@ - - -# The Stable Diffusion Guide ๐ŸŽจ - - Open In Colab - - -## Intro - -Stable Diffusion is a [Latent Diffusion model](https://github.com/CompVis/latent-diffusion) developed by researchers from the Machine Vision and Learning group at LMU Munich, *a.k.a* CompVis. -Model checkpoints were publicly released at the end of August 2022 by a collaboration of Stability AI, CompVis, and Runway with support from EleutherAI and LAION. For more information, you can check out [the official blog post](https://stability.ai/blog/stable-diffusion-public-release). - -Since its public release the community has done an incredible job at working together to make the stable diffusion checkpoints **faster**, **more memory efficient**, and **more performant**. - -๐Ÿงจ Diffusers offers a simple API to run stable diffusion with all memory, computing, and quality improvements. - -This notebook walks you through the improvements one-by-one so you can best leverage [`StableDiffusionPipeline`] for **inference**. - -## Prompt Engineering ๐ŸŽจ - -When running *Stable Diffusion* in inference, we usually want to generate a certain type, or style of image and then improve upon it. Improving upon a previously generated image means running inference over and over again with a different prompt and potentially a different seed until we are happy with our generation. - -So to begin with, it is most important to speed up stable diffusion as much as possible to generate as many pictures as possible in a given amount of time. - -This can be done by both improving the **computational efficiency** (speed) and the **memory efficiency** (GPU RAM). - -Let's start by looking into computational efficiency first. - -Throughout the notebook, we will focus on [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5): - -``` python -model_id = "runwayml/stable-diffusion-v1-5" -``` - -Let's load the pipeline. - -## Speed Optimization - -``` python -from diffusers import StableDiffusionPipeline - -pipe = StableDiffusionPipeline.from_pretrained(model_id) -``` - -We aim at generating a beautiful photograph of an *old warrior chief* and will later try to find the best prompt to generate such a photograph. 
For now, let's keep the prompt simple: - -``` python -prompt = "portrait photo of a old warrior chief" -``` - -To begin with, we should make sure we run inference on GPU, so let's move the pipeline to GPU, just like you would with any PyTorch module. - -``` python -pipe = pipe.to("cuda") -``` - -To generate an image, you should use the [~`StableDiffusionPipeline.__call__`] method. - -To make sure we can reproduce more or less the same image in every call, let's make use of the generator. See the documentation on reproducibility [here](./conceptual/reproducibility) for more information. - -``` python -generator = torch.Generator("cuda").manual_seed(0) -``` - -Now, let's take a spin on it. - -``` python -image = pipe(prompt, generator=generator).images[0] -image -``` - -![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_1.png) - -Cool, this now took roughly 30 seconds on a T4 GPU (you might see faster inference if your allocated GPU is better than a T4). - -The default run we did above used full float32 precision and ran the default number of inference steps (50). The easiest speed-ups come from switching to float16 (or half) precision and simply running fewer inference steps. Let's load the model now in float16 instead. - -``` python -import torch - -pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16) -pipe = pipe.to("cuda") -``` - -And we can again call the pipeline to generate an image. - -``` python -generator = torch.Generator("cuda").manual_seed(0) - -image = pipe(prompt, generator=generator).images[0] -image -``` -![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_2.png) - -Cool, this is almost three times as fast for arguably the same image quality. - -We strongly suggest always running your pipelines in float16 as so far we have very rarely seen degradations in quality because of it. - -Next, let's see if we need to use 50 inference steps or whether we could use significantly fewer. The number of inference steps is associated with the denoising scheduler we use. Choosing a more efficient scheduler could help us decrease the number of steps. - -Let's have a look at all the schedulers the stable diffusion pipeline is compatible with. - -``` python -pipe.scheduler.compatibles -``` - -``` - [diffusers.schedulers.scheduling_dpmsolver_singlestep.DPMSolverSinglestepScheduler, - diffusers.schedulers.scheduling_lms_discrete.LMSDiscreteScheduler, - diffusers.schedulers.scheduling_heun_discrete.HeunDiscreteScheduler, - diffusers.schedulers.scheduling_pndm.PNDMScheduler, - diffusers.schedulers.scheduling_euler_discrete.EulerDiscreteScheduler, - diffusers.schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteScheduler, - diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler, - diffusers.schedulers.scheduling_ddpm.DDPMScheduler, - diffusers.schedulers.scheduling_ddim.DDIMScheduler] -``` - -Cool, that's a lot of schedulers. - -๐Ÿงจ Diffusers is constantly adding a bunch of novel schedulers/samplers that can be used with Stable Diffusion. For more information, we recommend taking a look at the official documentation [here](https://huggingface.co/docs/diffusers/main/en/api/schedulers/overview). - -Alright, right now Stable Diffusion is using the `PNDMScheduler` which usually requires around 50 inference steps. 
However, other schedulers such as `DPMSolverMultistepScheduler` or `DPMSolverSinglestepScheduler` seem to get away with just 20 to 25 inference steps. Let's try them out. - -You can set a new scheduler by making use of the [from_config](https://huggingface.co/docs/diffusers/main/en/api/configuration#diffusers.ConfigMixin.from_config) function. - -``` python -from diffusers import DPMSolverMultistepScheduler - -pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) -``` - -Now, let's try to reduce the number of inference steps to just 20. - -``` python -generator = torch.Generator("cuda").manual_seed(0) - -image = pipe(prompt, generator=generator, num_inference_steps=20).images[0] -image -``` - -![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_3.png) - -The image now does look a little different, but it's arguably still of equally high quality. We now cut inference time to just 4 seconds though ๐Ÿ˜. - -## Memory Optimization + + +# Improve DiffusionPipeline performance -``` python -def get_inputs(batch_size=1): - generator = [torch.Generator("cuda").manual_seed(i) for i in range(batch_size)] - prompts = batch_size * [prompt] - num_inference_steps = 20 +[[open-in-colab]] - return {"prompt": prompts, "generator": generator, "num_inference_steps": num_inference_steps} -``` -This function returns a list of prompts and a list of generators, so we can reuse the generator that produced a result we like. +When you're using the [`DiffusionPipeline`] for inference, you're trying to generate an image with a certain style and the image should contain the elements specified in the prompt. If you've already tried generating something, you know that it can be tricky to generate an image you're happy with, and you might have to run the [`DiffusionPipeline`] multiple times. But generating something out of nothing is a computationally intensive process, especially if you're running inference over and over again. -We also need a method that allows us to easily display a batch of images. +The key is to get the most *computational efficiency* (speed) and *memory efficiency* (GPU RAM) from the pipeline to shorten the time to getting high-quality outputs. -``` python -from PIL import Image +This tutorial will walk you through how to improve your [`DiffusionPipeline`]'s performance to generate faster. -def image_grid(imgs, rows=2, cols=2): - w, h = imgs[0].size - grid = Image.new('RGB', size=(cols*w, rows*h)) - - for i, img in enumerate(imgs): - grid.paste(img, box=(i%cols*w, i//cols*h)) - return grid -``` +Begin by loading the [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) model: -Cool, let's see how much memory we can use starting with `batch_size=4`. +```python +from diffusers import DiffusionPipeline -``` python -images = pipe(**get_inputs(batch_size=4)).images -image_grid(images) -``` +model_id = "runwayml/stable-diffusion-v1-5" +pipeline = DiffusionPipeline.from_pretrained(model_id) +``` -![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_4.png) +The example prompt you'll use is a portrait of an old warrior chief, but feel free to use your own prompt: -Going over a batch_size of 4 will error out in this notebook (assuming we are running it on a T4 GPU). Also, we can see we only generate slightly more images per second (3.75s/image) compared to 4s/image previously. 
+```python +prompt = "portrait photo of a old warrior chief" +``` -However, the community has found some nice tricks to improve the memory constraints further. After stable diffusion was released, the community found improvements within days and shared them freely over GitHub - open-source at its finest! I believe the original idea came from [this](https://github.com/basujindal/stable-diffusion/pull/117) GitHub thread. +## Speed -By far most of the memory is taken up by the cross-attention layers. Instead of running this operation in batch, one can run it sequentially to save a significant amount of memory. + -It can easily be enabled by calling `enable_attention_slicing` as is documented [here](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline.enable_attention_slicing). +๐Ÿ’ก If you don't have access to a physical GPU, you can use one for free from a GPU provider like [Colab](https://colab.research.google.com/)! -``` python -pipe.enable_attention_slicing() -``` + -Great, now that attention slicing is enabled, let's try to double the batch size again, going for `batch_size=8`. +One of the simplest ways to speed up inference is to place the pipeline on a GPU the same way you would with any PyTorch module: -``` python -images = pipe(**get_inputs(batch_size=8)).images -image_grid(images, rows=2, cols=4) -``` +```python +pipeline = pipeline.to("cuda") +``` -![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_5.png) +To make sure you can use the same image and improve on it, use a [`Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) and set a seed for [reproducibility](./using-diffusers/reproducibility). -Nice, it works. However, the speed gain is again not very big (it might however be much more significant on other GPUs). +```python +generator = torch.Generator("cuda").manual_seed(0) +``` -We're at roughly 3.5 seconds per image ๐Ÿ”ฅ which is probably the fastest we can be with a simple T4 without sacrificing quality. +Now you can generate an image: -Next, let's look into how to improve the quality! +```python +image = pipeline(prompt, generator=generator).images[0] +image +``` -## Quality Improvements +
+ +
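+
+The exact timings in this tutorial depend heavily on your hardware, so treat them as rough reference points. If you want to measure each optimization on your own GPU, a minimal timing sketch like the one below can help (it assumes you're running on a CUDA device):
+
+```python
+import time
+
+import torch
+
+torch.cuda.synchronize()  # wait for any queued GPU work before starting the timer
+start = time.perf_counter()
+image = pipeline(prompt, generator=torch.Generator("cuda").manual_seed(0)).images[0]
+torch.cuda.synchronize()  # make sure the generation has actually finished
+print(f"generation took {time.perf_counter() - start:.1f} seconds")
+```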
-Now that our image generation pipeline is blazing fast, let's try to get maximum image quality. +This process took ~30 seconds on a T4 GPU (it might see faster if your allocated GPU is better than a T4). By default, the [`DiffusionPipeline`] runs inference with full `float32` precision for 50 inference steps. You can speed this up by switching to a lower precision like `float16` or running fewer inference steps. -First of all, image quality is extremely subjective, so it's difficult to make general claims here. +Load the model in `float16` and generate an image: -The most obvious step to take to improve quality is to use *better checkpoints*. Since the release of Stable Diffusion, many improved versions have been released, which are summarized here: +```python +import torch -- *Official Release - 22 Aug 2022*: [Stable-Diffusion 1.4](https://huggingface.co/CompVis/stable-diffusion-v1-4) -- *20 October 2022*: [Stable-Diffusion 1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5) -- *24 Nov 2022*: [Stable-Diffusion 2.0](https://huggingface.co/stabilityai/stable-diffusion-2-0) -- *7 Dec 2022*: [Stable-Diffusion 2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1) +pipeline = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16) +pipeline = pipeline.to("cuda") +generator = torch.Generator("cuda").manual_seed(0) +image = pipeline(prompt, generator=generator).images[0] +image +``` -Newer versions don't necessarily mean better image quality with the same parameters. People mentioned that *2.0* is slightly worse than *1.5* for certain prompts, but given the right prompt engineering *2.0* and *2.1* seem to be better. +
+ +
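+
+Besides being faster, half precision also roughly halves the memory taken up by the model weights. If you're curious how much GPU memory a single generation peaks at, you can check it with PyTorch's memory statistics (again assuming a CUDA device):
+
+```python
+torch.cuda.reset_peak_memory_stats()
+image = pipeline(prompt, generator=torch.Generator("cuda").manual_seed(0)).images[0]
+# peak GPU memory allocated during the call, in GiB
+print(f"peak memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
+```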
-Overall, we strongly recommend just trying the models out and reading up on advice online (e.g. it has been shown that using negative prompts is very important for 2.0 and 2.1 to get the highest possible quality. See for example [this nice blog post](https://minimaxir.com/2022/11/stable-diffusion-negative-prompt/). +This time, it only took ~11 seconds to generate the image, which is almost 3x faster than before! -Additionally, the community has started fine-tuning many of the above versions on certain styles with some of them having an extremely high quality and gaining a lot of traction. + -We recommend having a look at all [diffusers checkpoints sorted by downloads and trying out the different checkpoints](https://huggingface.co/models?library=diffusers). +๐Ÿ’ก We strongly suggest always running your pipelines in `float16`, and so far, we've rarely seen any degradation in output quality. -For the following, we will stick to v1.5 for simplicity. + -Next, we can also try to optimize single components of the pipeline, e.g. switching out the latent decoder. For more details on how the whole Stable Diffusion pipeline works, please have a look at [this blog post](https://huggingface.co/blog/stable_diffusion). +The other option is to reduce the number of inference steps. The number of inference steps is associated with a denoising scheduler, and choosing a more efficient scheduler could help decrease the number of steps. You can find out which schedulers are compatible with the [`DiffusionPipeline`] by calling the `compatibles` method: -Let's load [stabilityai's newest auto-decoder](https://huggingface.co/stabilityai/stable-diffusion-2-1). +```python +pipeline.scheduler.compatibles +[ + diffusers.schedulers.scheduling_lms_discrete.LMSDiscreteScheduler, + diffusers.schedulers.scheduling_unipc_multistep.UniPCMultistepScheduler, + diffusers.schedulers.scheduling_k_dpm_2_discrete.KDPM2DiscreteScheduler, + diffusers.schedulers.scheduling_deis_multistep.DEISMultistepScheduler, + diffusers.schedulers.scheduling_euler_discrete.EulerDiscreteScheduler, + diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler, + diffusers.schedulers.scheduling_ddpm.DDPMScheduler, + diffusers.schedulers.scheduling_dpmsolver_singlestep.DPMSolverSinglestepScheduler, + diffusers.schedulers.scheduling_k_dpm_2_ancestral_discrete.KDPM2AncestralDiscreteScheduler, + diffusers.schedulers.scheduling_heun_discrete.HeunDiscreteScheduler, + diffusers.schedulers.scheduling_pndm.PNDMScheduler, + diffusers.schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteScheduler, + diffusers.schedulers.scheduling_ddim.DDIMScheduler, +] +``` -``` python -from diffusers import AutoencoderKL +The Stable Diffusion model uses the [`PNDMScheduler`] by default which usually requires ~50 inference steps. But other more performant schedulers, like [`DPMSolverMultistepScheduler`], typically require only ~20 or 25 inference steps. Set a new scheduler with the [`ConfigMixin.from_config`] method: -vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16).to("cuda") -``` +```python +from diffusers import DPMSolverMultistepScheduler -Now we can set it to the vae of the pipeline to use it. +pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config) +``` -``` python -pipe.vae = vae -``` +Now let's set the `num_inference_steps` to just 20: -Let's run the same prompt as before to compare quality. 
+```python +generator = torch.Generator("cuda").manual_seed(0) +image = pipeline(prompt, generator=generator, num_inference_steps=20).images[0] +image +``` -``` python -images = pipe(**get_inputs(batch_size=8)).images -image_grid(images, rows=2, cols=4) -``` +
+ +
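+
+The ideal number of steps is a trade-off between speed and quality that you can explore yourself. For example, you can sweep a few values of `num_inference_steps` with the same seed and compare the saved results (the file names below are just placeholders):
+
+```python
+# generate with the same seed at several step counts to compare quality against speed
+for steps in [10, 15, 20, 25]:
+    generator = torch.Generator("cuda").manual_seed(0)
+    image = pipeline(prompt, generator=generator, num_inference_steps=steps).images[0]
+    image.save(f"warrior_chief_{steps}_steps.png")
+```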
-![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_6.png) +Now you've managed to cut the inference time to just 4 seconds! โšก๏ธ -Seems like the difference is only very minor, but the new generations are arguably a bit *sharper*. +## Memory -Cool, finally, let's look a bit into prompt engineering. +The other key to improving pipeline performance is consuming less memory, which indirectly implies more speed, since you're often trying to maximize the number of images generated per second. The easiest way to see how many images you can generate at once is to see when you get an `OutOfMemoryError` (OOM). -Our goal was to generate a photo of an old warrior chief. Let's now try to bring a bit more color into the photos and make the look more impressive. +Create a function that'll generate a batch of images from a list of prompts and `Generators`. Make sure to assign each `Generator` a seed so you can reuse it if it produces a good result. -Originally our prompt was "*portrait photo of an old warrior chief*". +```python +def get_inputs(batch_size=1): + generator = [torch.Generator("cuda").manual_seed(i) for i in range(batch_size)] + prompts = batch_size * [prompt] + num_inference_steps = 20 -To improve the prompt, it often helps to add cues that could have been used online to save high-quality photos, as well as add more details. -Essentially, when doing prompt engineering, one has to think: + return {"prompt": prompts, "generator": generator, "num_inference_steps": num_inference_steps} +``` -- How was the photo or similar photos of the one I want probably stored on the internet? -- What additional detail can I give that steers the models into the style that I want? +You'll also need a function that'll display each batch of images: -Cool, let's add more details. +```python +from PIL import image -``` python -prompt += ", tribal panther make up, blue on red, side profile, looking away, serious eyes" -``` -and let's also add some cues that usually help to generate higher quality images. +def image_grid(imgs, rows=2, cols=2): + w, h = imgs[0].size + grid = Image.new("RGB", size=(cols * w, rows * h)) -``` python -prompt += " 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta" -prompt -``` + for i, img in enumerate(imgs): + grid.paste(img, box=(i % cols * w, i // cols * h)) + return grid +``` -Cool, let's now try this prompt. +Generate a batch of images with `batch_size=4` and see how much memory you've consumed: -``` python -images = pipe(**get_inputs(batch_size=8)).images -image_grid(images, rows=2, cols=4) -``` +```python +images = pipeline(**get_inputs(batch_size=4)).images +image_grid(images) +``` -![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_7.png) +Unless you have a GPU with more RAM, the code above probably returned an `OOM` error! Most of the memory is taken up by the cross-attention layers. Instead of running this operation in a batch, you can run it sequentially to save a significant amount of memory. All you have to do is enable [`~DiffusionPipeline.enable_attention_slicing`]. -Pretty impressive! We got some very high-quality image generations there. The 2nd image is my personal favorite, so I'll re-use this seed and see whether I can tweak the prompts slightly by using "oldest warrior", "old", "", and "young" instead of "old". +Try increasing the `batch_size` to 8! 
-``` python -prompts = [ - "portrait photo of the oldest warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", - "portrait photo of a old warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", - "portrait photo of a warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", - "portrait photo of a young warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", -] +```python +images = pipeline(**get_inputs(batch_size=4)).images +image_grid(images, rows=2, cols=4) +``` -generator = [torch.Generator("cuda").manual_seed(1) for _ in range(len(prompts))] # 1 because we want the 2nd image +
+ +
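+
+The largest batch size you can fit depends on your GPU. One rough way to find it is to try increasingly large batches until you hit an out-of-memory error (in PyTorch, CUDA out-of-memory errors are raised as a `RuntimeError` subclass):
+
+```python
+for batch_size in [1, 2, 4, 8, 16]:
+    try:
+        _ = pipeline(**get_inputs(batch_size=batch_size)).images
+        print(f"batch_size={batch_size} fits")
+    except RuntimeError:  # most likely an out-of-memory error
+        torch.cuda.empty_cache()  # release what was allocated before the failure
+        print(f"batch_size={batch_size} is too large for this GPU")
+        break
+```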
-images = pipe(prompt=prompts, generator=generator, num_inference_steps=25).images -image_grid(images) -``` +Whereas before you couldn't even generate a batch of 4 images, now you can generate a batch of 8 images at ~3.5 seconds per image! This is probably the fastest you can go on a T4 GPU without sacrificing quality. -![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/stable_diffusion_101/sd_101_8.png) +## Quality -The first picture looks nice! The eye movement slightly changed and looks nice. This finished up our 101-guide on how to use Stable Diffusion ๐Ÿค—. +In the last two sections, you learned how to optimize the speed of your pipeline by using `fp16`, reducing the number of inference steps by using a more performant scheduler, and enabling attention slicing to reduce memory consumption. Now you're going to focus on how to improve the quality of generated images. -For more information on optimization or other guides, I recommend taking a look at the following: +### Better checkpoints -- [Blog post about Stable Diffusion](https://huggingface.co/blog/stable_diffusion): In-detail blog post explaining Stable Diffusion. -- [FlashAttention](https://huggingface.co/docs/diffusers/optimization/xformers): XFormers flash attention can optimize your model even further with more speed and memory improvements. -- [Dreambooth](https://huggingface.co/docs/diffusers/training/dreambooth) - Quickly customize the model by fine-tuning it. -- [General info on Stable Diffusion](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/overview) - Info on other tasks that are powered by Stable Diffusion. +The most obvious step is to use better checkpoints. The Stable Diffusion model is a good starting point, and since its official release, several improved versions have also been released. However, using a newer version doesn't automatically mean you'll get better results. You'll have to experiment with different checkpoints yourself, and do a little research (such as using [negative prompts](https://minimaxir.com/2022/11/stable-diffusion-negative-prompt/)) to get the best results. + +As the field grows, there are more high-quality checkpoints finetuned to produce certain styles. Try exploring the [Hub](https://huggingface.co/models?library=diffusers&sort=downloads) and [Diffusers Gallery](https://huggingface.co/spaces/huggingface-projects/diffusers-gallery) to find a checkpoint you're interested in! + +### Better pipeline components + +The other thing you can try is optimizing the pipeline components such as the latent decoder. Let's try loading the latest [autodecoder](https://huggingface.co/stabilityai/stable-diffusion-2-1/tree/main/vae) from Stability AI into the pipeline, and generate some images: + +```python +from diffusers import AutoencoderKL + +vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16).to("cuda") +pipeline.vae = vae +images = pipeline(**get_inputs(batch_size=8)).images +image_grid(images, rows=2, cols=4) +``` + +
+ +
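+
+Instead of overwriting `pipeline.vae` after the fact, components like the VAE can usually also be passed directly to `from_pretrained` when the pipeline is created. A sketch of that pattern, assuming the same fine-tuned VAE as above:
+
+```python
+import torch
+from diffusers import AutoencoderKL, DiffusionPipeline
+
+vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16)
+# pass the fine-tuned VAE at load time instead of swapping it in afterwards
+pipeline = DiffusionPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.float16).to("cuda")
+```
+
+Keep in mind that reloading the pipeline this way starts from the default scheduler again, so reapply any scheduler changes you made earlier.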
+ +### Better prompt engineering + +The text prompt you use to generate an image is super important, so much so that it is called *prompt engineering*. Some considerations to keep during prompt engineering are: + +- How is the image or similar images of the one I want to generate stored on the internet? +- What additional detail can I give that steers the model towards the style I want? + +With this in mind, let's improve the prompt to include color and higher quality details: + +```python +prompt += ", tribal panther make up, blue on red, side profile, looking away, serious eyes" +prompt += " 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta" +``` + +Generate a batch of images with the new prompt: + +```python +images = pipeline(**get_inputs(batch_size=8)).images +image_grid(images, rows=2, cols=4) +``` + +
+ +
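+
+Another prompt engineering lever worth trying is the `negative_prompt` argument accepted by the Stable Diffusion pipeline, which describes what you *don't* want in the image. The negative prompt text below is only a suggestion, so experiment with your own:
+
+```python
+negative_prompt = "blurry, low quality, cartoon, disfigured"
+# one negative prompt per image in the batch
+images = pipeline(**get_inputs(batch_size=8), negative_prompt=8 * [negative_prompt]).images
+image_grid(images, rows=2, cols=4)
+```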
+ +Pretty impressive! Let's tweak the second image - corresponding to the `Generator` with a seed of `1` - a bit more by adding some text about the age of the subject: + +```python +prommpts = [ + "portrait photo of the oldest warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", + "portrait photo of a old warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", + "portrait photo of a warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", + "portrait photo of a young warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", +] + +generator = [torch.Generator("cuda").manual_seed(1) for _ in range(len(prompts))] +images = pipeline(prompt=prompts, generator=generator, num_inference_steps=25).images +image_grid(images) +``` + +
+ +
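+
+If you have the [xFormers](https://github.com/facebookresearch/xformers) library installed, you can usually trade a single extra call for more speed and lower memory use with memory-efficient attention - see the next steps below for more details:
+
+```python
+# requires the xformers package to be installed
+pipeline.enable_xformers_memory_efficient_attention()
+```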
+ +## Next steps + +In this tutorial, you learned how to optimize a [`DiffusionPipeline`] for computational and memory efficiency as well as improving the quality of generated outputs. If you're interested in making your pipeline even faster, take a look at the following resources: + +- Enable [xFormers](./optimization/xformers) memory efficient attention mechanism for faster speed and reduced memory consumption. +- Learn how in [PyTorch 2.0](./optimization/torch2.0), [`torch.compile`](https://pytorch.org/docs/stable/generated/torch.compile.html) can yield 2-9% faster inference speed. +- Many optimization techniques for inference are also included in this memory and speed [guide](./optimization/fp16), such as memory offloading. \ No newline at end of file From 02a28ea501d109ee7ba9c2a4507e17f3512297ab Mon Sep 17 00:00:00 2001 From: Steven Date: Tue, 21 Mar 2023 15:34:11 -0700 Subject: [PATCH 2/7] fix divs --- docs/source/en/stable_diffusion.mdx | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/docs/source/en/stable_diffusion.mdx b/docs/source/en/stable_diffusion.mdx index 72f3e1a39121..f33b4f1f906b 100644 --- a/docs/source/en/stable_diffusion.mdx +++ b/docs/source/en/stable_diffusion.mdx @@ -62,7 +62,7 @@ image = pipeline(prompt, generator=generator).images[0] image ``` -
+
@@ -80,7 +80,7 @@ image = pipeline(prompt, generator=generator).images[0] image ``` -
+
@@ -129,7 +129,7 @@ image = pipeline(prompt, generator=generator, num_inference_steps=20).images[0] image ``` -
+
@@ -177,12 +177,12 @@ Unless you have a GPU with more RAM, the code above probably returned an `OOM` e Try increasing the `batch_size` to 8! ```python -images = pipeline(**get_inputs(batch_size=4)).images +images = pipeline(**get_inputs(batch_size=8)).images image_grid(images, rows=2, cols=4) ``` -
- +
+
Whereas before you couldn't even generate a batch of 4 images, now you can generate a batch of 8 images at ~3.5 seconds per image! This is probably the fastest you can go on a T4 GPU without sacrificing quality. @@ -210,7 +210,7 @@ images = pipeline(**get_inputs(batch_size=8)).images image_grid(images, rows=2, cols=4) ``` -
+
@@ -235,7 +235,7 @@ images = pipeline(**get_inputs(batch_size=8)).images image_grid(images, rows=2, cols=4) ``` -
+
@@ -254,7 +254,7 @@ images = pipeline(prompt=prompts, generator=generator, num_inference_steps=25).i image_grid(images) ``` -
+
From f6bb776cfc2acd65d093933373731526756b2eaa Mon Sep 17 00:00:00 2001 From: Steven Date: Tue, 21 Mar 2023 17:07:38 -0700 Subject: [PATCH 3/7] oops forgot to close tag --- docs/source/en/stable_diffusion.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/stable_diffusion.mdx b/docs/source/en/stable_diffusion.mdx index f33b4f1f906b..153b838f8105 100644 --- a/docs/source/en/stable_diffusion.mdx +++ b/docs/source/en/stable_diffusion.mdx @@ -64,7 +64,7 @@ image
-
+
This process took ~30 seconds on a T4 GPU (it might see faster if your allocated GPU is better than a T4). By default, the [`DiffusionPipeline`] runs inference with full `float32` precision for 50 inference steps. You can speed this up by switching to a lower precision like `float16` or running fewer inference steps. From a46a53d1dd278996504439629d532d8dddba0187 Mon Sep 17 00:00:00 2001 From: Steven Date: Wed, 22 Mar 2023 10:46:49 -0700 Subject: [PATCH 4/7] apply feedback --- docs/source/en/stable_diffusion.mdx | 40 ++++++++++++++++------------- 1 file changed, 22 insertions(+), 18 deletions(-) diff --git a/docs/source/en/stable_diffusion.mdx b/docs/source/en/stable_diffusion.mdx index 153b838f8105..5a25fa39a513 100644 --- a/docs/source/en/stable_diffusion.mdx +++ b/docs/source/en/stable_diffusion.mdx @@ -14,11 +14,11 @@ specific language governing permissions and limitations under the License. [[open-in-colab]] -When you're using the [`DiffusionPipeline`] for inference, you're trying to generate an image with a certain style and the image should contain the elements specified in the prompt. If you've already tried generating something, you know that it can be tricky to generate an image you're happy with, and you might have to run the [`DiffusionPipeline`] multiple times. But generating something out of nothing is a computationally intensive process, especially if you're running inference over and over again. +Getting the [`DiffusionPipeline`] to generate images in a certain style or include what you want can be tricky. Often times, you have to run the [`DiffusionPipeline`] several times before you end up with an image you're happy with. But generating something out of nothing is a computationally intensive process, especially if you're running inference over and over again. -The key is to get the most *computational efficiency* (speed) and *memory efficiency* (GPU RAM) from the pipeline to shorten the time to getting high-quality outputs. +This is why it's important to get the most *computational* (speed) and *memory* (GPU RAM) efficiency from the pipeline to reduce the time between inference cycles so you can iterate faster. -This tutorial will walk you through how to improve your [`DiffusionPipeline`]'s performance to generate faster. +This tutorial walks you through how to generate faster and better with the [`DiffusionPipeline`]. Begin by loading the [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) model: @@ -39,7 +39,7 @@ prompt = "portrait photo of a old warrior chief" -๐Ÿ’ก If you don't have access to a physical GPU, you can use one for free from a GPU provider like [Colab](https://colab.research.google.com/)! +๐Ÿ’ก If you don't have access to a GPU, you can use one for free from a GPU provider like [Colab](https://colab.research.google.com/)! @@ -49,7 +49,7 @@ One of the simplest ways to speed up inference is to place the pipeline on a GPU pipeline = pipeline.to("cuda") ``` -To make sure you can use the same image and improve on it, use a [`Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) and set a seed for [reproducibility](./using-diffusers/reproducibility). +To make sure you can use the same image and improve on it, use a [`Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) and set a seed for [reproducibility](./using-diffusers/reproducibility): ```python generator = torch.Generator("cuda").manual_seed(0) @@ -66,9 +66,9 @@ image
-This process took ~30 seconds on a T4 GPU (it might see faster if your allocated GPU is better than a T4). By default, the [`DiffusionPipeline`] runs inference with full `float32` precision for 50 inference steps. You can speed this up by switching to a lower precision like `float16` or running fewer inference steps. +This process took ~30 seconds on a T4 GPU (it might be faster if your allocated GPU is better than a T4). By default, the [`DiffusionPipeline`] runs inference with full `float32` precision for 50 inference steps. You can speed this up by switching to a lower precision like `float16` or running fewer inference steps. -Load the model in `float16` and generate an image: +Let's start by loading the model in `float16` and generate an image: ```python import torch @@ -92,7 +92,7 @@ This time, it only took ~11 seconds to generate the image, which is almost 3x fa -The other option is to reduce the number of inference steps. The number of inference steps is associated with a denoising scheduler, and choosing a more efficient scheduler could help decrease the number of steps. You can find out which schedulers are compatible with the [`DiffusionPipeline`] by calling the `compatibles` method: +Another option is to reduce the number of inference steps. Choosing a more efficient scheduler could help decrease the number of steps without sacrificing output quality. You can find which schedulers are compatible with the current model in the [`DiffusionPipeline`] by calling the `compatibles` method: ```python pipeline.scheduler.compatibles @@ -113,7 +113,7 @@ pipeline.scheduler.compatibles ] ``` -The Stable Diffusion model uses the [`PNDMScheduler`] by default which usually requires ~50 inference steps. But other more performant schedulers, like [`DPMSolverMultistepScheduler`], typically require only ~20 or 25 inference steps. Set a new scheduler with the [`ConfigMixin.from_config`] method: +The Stable Diffusion model uses the [`PNDMScheduler`] by default which usually requires ~50 inference steps, but more performant schedulers like [`DPMSolverMultistepScheduler`], require only ~20 or 25 inference steps. Use the [`ConfigMixin.from_config`] method to load a new scheduler: ```python from diffusers import DPMSolverMultistepScheduler @@ -121,7 +121,7 @@ from diffusers import DPMSolverMultistepScheduler pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config) ``` -Now let's set the `num_inference_steps` to just 20: +Now set the `num_inference_steps` to 20: ```python generator = torch.Generator("cuda").manual_seed(0) @@ -133,11 +133,11 @@ image
-Now you've managed to cut the inference time to just 4 seconds! โšก๏ธ +Great, you've managed to cut the inference time to just 4 seconds! โšก๏ธ ## Memory -The other key to improving pipeline performance is consuming less memory, which indirectly implies more speed, since you're often trying to maximize the number of images generated per second. The easiest way to see how many images you can generate at once is to see when you get an `OutOfMemoryError` (OOM). +The other key to improving pipeline performance is consuming less memory, which indirectly implies more speed, since you're often trying to maximize the number of images generated per second. The easiest way to see how many images you can generate at once is to try out different batch sizes until you get an `OutOfMemoryError` (OOM). Create a function that'll generate a batch of images from a list of prompts and `Generators`. Make sure to assign each `Generator` a seed so you can reuse it if it produces a good result. @@ -165,16 +165,20 @@ def image_grid(imgs, rows=2, cols=2): return grid ``` -Generate a batch of images with `batch_size=4` and see how much memory you've consumed: +Start with `batch_size=4` and see how much memory you've consumed: ```python images = pipeline(**get_inputs(batch_size=4)).images image_grid(images) ``` -Unless you have a GPU with more RAM, the code above probably returned an `OOM` error! Most of the memory is taken up by the cross-attention layers. Instead of running this operation in a batch, you can run it sequentially to save a significant amount of memory. All you have to do is enable [`~DiffusionPipeline.enable_attention_slicing`]. +Unless you have a GPU with more RAM, the code above probably returned an `OOM` error! Most of the memory is taken up by the cross-attention layers. Instead of running this operation in a batch, you can run it sequentially to save a significant amount of memory. All you have to do is configure the pipeline to use the [`~DiffusionPipeline.enable_attention_slicing`] function: -Try increasing the `batch_size` to 8! +```python +pipeline.enable_attention_slicing() +``` + +Now try increasing the `batch_size` to 8! ```python images = pipeline(**get_inputs(batch_size=8)).images @@ -193,13 +197,13 @@ In the last two sections, you learned how to optimize the speed of your pipeline ### Better checkpoints -The most obvious step is to use better checkpoints. The Stable Diffusion model is a good starting point, and since its official release, several improved versions have also been released. However, using a newer version doesn't automatically mean you'll get better results. You'll have to experiment with different checkpoints yourself, and do a little research (such as using [negative prompts](https://minimaxir.com/2022/11/stable-diffusion-negative-prompt/)) to get the best results. +The most obvious step is to use better checkpoints. The Stable Diffusion model is a good starting point, and since its official launch, several improved versions have also been released. However, using a newer version doesn't automatically mean you'll get better results. You'll still have to experiment with different checkpoints yourself, and do a little research (such as using [negative prompts](https://minimaxir.com/2022/11/stable-diffusion-negative-prompt/)) to get the best results. -As the field grows, there are more high-quality checkpoints finetuned to produce certain styles. 
Try exploring the [Hub](https://huggingface.co/models?library=diffusers&sort=downloads) and [Diffusers Gallery](https://huggingface.co/spaces/huggingface-projects/diffusers-gallery) to find a checkpoint you're interested in! +As the field grows, there are more and more high-quality checkpoints finetuned to produce certain styles. Try exploring the [Hub](https://huggingface.co/models?library=diffusers&sort=downloads) and [Diffusers Gallery](https://huggingface.co/spaces/huggingface-projects/diffusers-gallery) to find one you're interested in! ### Better pipeline components -The other thing you can try is optimizing the pipeline components such as the latent decoder. Let's try loading the latest [autodecoder](https://huggingface.co/stabilityai/stable-diffusion-2-1/tree/main/vae) from Stability AI into the pipeline, and generate some images: +You can also try replacing the current pipeline components with a newer version. Let's try loading the latest [autodecoder](https://huggingface.co/stabilityai/stable-diffusion-2-1/tree/main/vae) from Stability AI into the pipeline, and generate some images: ```python from diffusers import AutoencoderKL From 544c70c1f579dcdb0b17945919db2b0222966563 Mon Sep 17 00:00:00 2001 From: Steven Date: Thu, 23 Mar 2023 10:18:29 -0700 Subject: [PATCH 5/7] apply feedback --- docs/source/en/_toctree.yml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index 86cca71dd3d1..72465cfe1fb8 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -3,6 +3,8 @@ title: ๐Ÿงจ Diffusers - local: quicktour title: Quicktour + - local: stable_diffusion + title: Improve DiffusionPipeline performance - local: installation title: Installation title: Get started @@ -11,8 +13,6 @@ title: Overview - local: using-diffusers/write_own_pipeline title: Understanding models and schedulers - - local: stable_diffusion - title: Improve DiffusionPipeline performance - local: tutorials/basic_training title: Train a diffusion model title: Tutorials From 3b9c0209f7313b59eb1ebdd06e63ae6503079443 Mon Sep 17 00:00:00 2001 From: Steven Date: Mon, 27 Mar 2023 13:04:20 -0700 Subject: [PATCH 6/7] apply feedback --- docs/source/en/_toctree.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index 72465cfe1fb8..66762bc3114a 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -4,7 +4,7 @@ - local: quicktour title: Quicktour - local: stable_diffusion - title: Improve DiffusionPipeline performance + title: Effective and efficient diffusion - local: installation title: Installation title: Get started From 519335e728c7df423cd33e2492f1f34022d9d9b0 Mon Sep 17 00:00:00 2001 From: Steven Date: Tue, 28 Mar 2023 10:46:28 -0700 Subject: [PATCH 7/7] align doc title --- docs/source/en/stable_diffusion.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/stable_diffusion.mdx b/docs/source/en/stable_diffusion.mdx index 5a25fa39a513..eebe0ec660f2 100644 --- a/docs/source/en/stable_diffusion.mdx +++ b/docs/source/en/stable_diffusion.mdx @@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Improve DiffusionPipeline performance +# Effective and efficient diffusion [[open-in-colab]]