
Commit 54747b8

try hfoptions syntax
1 parent 12a0de2 commit 54747b8

2 files changed: +65 −13 lines
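
The `<hfoptions>` / `<hfoption>` tags this commit experiments with are the tab-style markup used in Hugging Face docs to present alternative versions of the same snippet: each `<hfoption>` holds one variant and renders as a selectable tab. A minimal sketch of the pattern (the id and tab names here are illustrative, not taken from this commit):

```
<hfoptions id="some-group">
<hfoption id="Variant A">

...snippet shown when the "Variant A" tab is selected...

</hfoption>
<hfoption id="Variant B">

...snippet shown when the "Variant B" tab is selected...

</hfoption>
</hfoptions>
```

This is the structure the kandinsky.md diff below uses to split the Kandinsky 2.1 and Kandinsky 2.2 snippets.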

docs/source/en/_toctree.yml

Lines changed: 2 additions & 2 deletions
````diff
@@ -42,8 +42,6 @@
 - sections:
   - local: using-diffusers/pipeline_overview
     title: Overview
-  - local: using-diffusers/kandinsky
-    title: Kandinsky
   - local: using-diffusers/unconditional_image_generation
     title: Unconditional image generation
   - local: using-diffusers/conditional_image_generation
@@ -74,6 +72,8 @@
     title: Overview
   - local: using-diffusers/sdxl
     title: Stable Diffusion XL
+  - local: using-diffusers/kandinsky
+    title: Kandinsky
   - local: using-diffusers/controlnet
     title: ControlNet
   - local: using-diffusers/shap-e
````

docs/source/en/using-diffusers/kandinsky.md

Lines changed: 63 additions & 11 deletions
````diff
@@ -4,9 +4,9 @@

 The Kandinsky models are a series of multilingual text-to-image generation models. The Kandinsky 2.0 model uses two multilingual text encoders and concatenates those results for the UNet.

-[Kandinsky 2.1](../api/pipelines/kandinsky) changes the architecture to include an image prior model ([`CLIP`](https://huggingface.co/docs/transformers/model_doc/clip)) to generate a mapping between text and image embeddings. The mapping embedding provides better text-image alignment are is used with the text embeddings for training, leading to higher quality results. Finally, Kandinsky 2.1 uses a [Modulating Quantized Vectors (MoVQ)](https://huggingface.co/papers/2209.09002) decoder - adds a spatial conditional normalization layer to increase photorealism - to decode the latents into images.
+[Kandinsky 2.1](../api/pipelines/kandinsky) changes the architecture to include an image prior model ([`CLIP`](https://huggingface.co/docs/transformers/model_doc/clip)) to generate a mapping between text and image embeddings. The mapping provides better text-image alignment and it is used with the text embeddings during training, leading to higher quality results. Finally, Kandinsky 2.1 uses a [Modulating Quantized Vectors (MoVQ)](https://huggingface.co/papers/2209.09002) decoder - which adds a spatial conditional normalization layer to increase photorealism - to decode the latents into images.

-[Kandinsky 2.2](../api/pipelines/kandinsky_v22) improves on the previous model by replacing the image encoder of the image prior model with a larger CLIP-ViT-G model to increase quality. The image prior model was also retrained on images with different resolutions and aspect ratios to generate higher-resolution images and different image sizes.
+[Kandinsky 2.2](../api/pipelines/kandinsky_v22) improves on the previous model by replacing the image encoder of the image prior model with a larger CLIP-ViT-G model to improve quality. The image prior model was also retrained on images with different resolutions and aspect ratios to generate higher-resolution images and different image sizes.

 This guide will show you how to use the Kandinsky models for text-to-image, image-to-image, inpainting, interpolation, and more.

@@ -19,20 +19,24 @@ Before you begin, make sure you have the following libraries installed:

 <Tip warning={true}>

-Kandinsky 2.1 and 2.2 usage is very similar! The only difference is Kandinsky 2.2 no longer accepts `prompt` as an input when decoding the latents. Instead, Kandinsky 2.2 only accepts `image_embeds` during decoding.
+Kandinsky 2.1 and 2.2 usage is very similar! The only difference is Kandinsky 2.2 doesn't accept `prompt` as an input when decoding the latents. Instead, Kandinsky 2.2 only accepts `image_embeds` during decoding.

 </Tip>

 ## Text-to-image

 To use the Kandinsky models for any task, you always start by setting up the prior pipeline to encode the prompt and generate the image embeddings. The prior pipeline also generates `negative_image_embeds` that correspond to the negative prompt `""`. For better results, you can pass an actual `negative_prompt` to the prior pipeline, but this'll increase the effective batch size of the prior pipeline by 2x.

+
+<hfoptions id="kandinsky-text-to-image">
+<hfoption id="Kandinsky 2.1">
+
 ```py
-import torch
 from diffusers import KandinskyPriorPipeline, KandinskyPipeline
+import torch

-prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
-pipeline = KandinskyPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
+prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16).to("cuda")
+pipeline = KandinskyPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16).to("cuda")

 prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
 negative_prompt = "low quality, bad quality" # optional to include a negative prompt, but results are usually better
@@ -50,21 +54,69 @@ image
 <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-docs/cheeseburger.png"/>
 </div>

-🤗 Diffusers also provides an end-to-end API with [`KandinskyCombinedPipeline`], meaning you don't have to separately load the prior and text-to-image pipeline. The combined pipeline automatically loads both [`KandinskyPriorPipeline`] and [`KandinskyPipeline`]. You can still set different values for the prior pipeline with the `prior_guidance_scale` and `prior_num_inference_steps` parameters if you want.
+<hfoption>
+<hfoption id="Kandinsky 2.2">
+
+```py
+from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline
+import torch
+
+prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16).to("cuda")
+pipeline = KandinskyV22Pipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16).to("cuda")
+
+prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
+negative_prompt = "low quality, bad quality" # optional to include a negative prompt, but results are usually better
+image_embeds, negative_image_embeds = prior_pipeline(prompt, guidance_scale=1.0).to_tuple()
+```
+
+Pass the `image_embeds` and `negative_image_embeds` to the [`KandinskyV22Pipeline`] to generate an image:
+
+```py
+image = pipeline(image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768).images[0]
+image
+```
+
+<div class="flex justify-center">
+<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-text-to-image.png"/>
+</div>
+
+</hfoption>
+</hfoptions>
+
+🤗 Diffusers also provides an end-to-end API with the [`KandinskyCombinedPipeline`] and [`KandinskyV22CombinedPipeline`], meaning you don't have to separately load the prior and text-to-image pipeline. The combined pipeline automatically loads both the prior model and the decoder. You can still set different values for the prior pipeline with the `prior_guidance_scale` and `prior_num_inference_steps` parameters if you want.

-Use the [`AutoPipelineForText2Image`] to automatically call the [`KandinskyCombinedPipeline`] under the hood:
+Use the [`AutoPipelineForText2Image`] to automatically call the combined pipelines under the hood:
+
+<hfoptions id="combined-text-to-image">
+<hfoption id="Kandinsky 2.1">

 ```py
 from diffusers import AutoPipelineForText2Image
 import torch

-pipe = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
-pipe.enable_model_cpu_offload()
+pipeline = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16).to("cuda")
+pipeline.enable_model_cpu_offload()
+
+prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
+negative_prompt = "low quality, bad quality"
+
+image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale = 4.0, height=768, width=768).images[0]
+```
+
+</hfoption>
+<hfoption id="Kandinsky 2.2">
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16).to("cuda")
+pipeline.enable_model_cpu_offload()

 prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
 negative_prompt = "low quality, bad quality"

-image = pipe(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale = 4.0, height=768, width=768).images[0]
+image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale = 4.0, height=768, width=768).images[0]
 ```

 ## Image-to-image
````
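
The hunks above stop at the prior step for Kandinsky 2.1, so the decoding call that the Tip contrasts with Kandinsky 2.2 is not visible in this diff. As a reference, a minimal end-to-end sketch of the 2.1 flow, assembled from the snippets in this diff plus a decode call from the surrounding guide (the exact decode arguments are an assumption here), could look like:

```py
import torch
from diffusers import KandinskyPriorPipeline, KandinskyPipeline

# Prior pipeline: encode the prompt (and negative prompt) into image embeddings
prior_pipeline = KandinskyPriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16
).to("cuda")
pipeline = KandinskyPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16
).to("cuda")

prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality"
image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt, guidance_scale=1.0).to_tuple()

# Decoder: unlike Kandinsky 2.2, the 2.1 decoder still accepts the prompt alongside the embeddings
image = pipeline(prompt, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768).images[0]
```

The difference the Tip describes shows up in the last line: the 2.1 decoder takes `prompt` together with `image_embeds`, while the 2.2 decoder added in the diff is called with the embeddings only.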
