In [None]:
!pip install -qU diffusers transformers accelerate

# Kandinsky

The Kandinsky models are a series of multilingual text-to-image generation models.

* The Kandinsky 2.0 model uses two multilingual text encoders and concatenates those results for the UNet.

* Kandinsky 2.1 changes the architecture to include an image prior model (`CLIP`) that generates a mapping between text and image embeddings. It also uses a `Modulating Quantized Vectors` (MoVQ) decoder to decode the latents into images; the decoder adds a spatial conditional normalization layer to increase photorealism.

* Kandinsky 2.2 improves on the previous model by replacing the image encoder of the image prior model with a larger CLIP-ViT-G model to improve quality.

* Kandinsky 3 simplifies the architecture and shifts away from the two-stage generation process involving the prior model and diffusion model. Instead, Kandinsky 3 uses `Flan-UL2` to encode text, a UNet with `BigGan-Deep` blocks, and `Sber-MoVQGAN` to decode the latents into images.
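
As a quick orientation, the versions above can be summarized by whether they need a separate prior stage. The sketch below only collects the checkpoint names used later in this notebook; the dictionary layout and the `needs_prior` helper are illustrative assumptions, not part of Diffusers, and Kandinsky 2.0 is left out because no checkpoint for it is used here.

```python
# Checkpoint names as they appear in this notebook; the structure is illustrative only.
KANDINSKY_CHECKPOINTS = {
    "2.1": {
        "prior": "kandinsky-community/kandinsky-2-1-prior",
        "decoder": "kandinsky-community/kandinsky-2-1",
    },
    "2.2": {
        "prior": "kandinsky-community/kandinsky-2-2-prior",
        "decoder": "kandinsky-community/kandinsky-2-2-decoder",
    },
    # Kandinsky 3 has no prior stage, so only one checkpoint is loaded.
    "3": {"prior": None, "decoder": "kandinsky-community/kandinsky-3"},
}


def needs_prior(version: str) -> bool:
    """Return True when the version uses a separate prior pipeline (hypothetical helper)."""
    return KANDINSKY_CHECKPOINTS[version]["prior"] is not None


for version, repos in KANDINSKY_CHECKPOINTS.items():
    stages = "prior + decoder" if needs_prior(version) else "single pipeline"
    print(f"Kandinsky {version}: {stages} -> {repos['decoder']}")
```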

## Text-to-image

For Kandinsky 2.1 and 2.2, we always start by setting up the prior pipeline to encode the prompt and generate the image embeddings. The decoder pipeline then turns those embeddings into an image.

##### Kandinsky 2.1

In [None]:
from diffusers import KandinskyPriorPipeline, KandinskyPipeline
import torch

prior_pipeline = KandinskyPriorPipeline.from_pretrained(
    'kandinsky-community/kandinsky-2-1-prior',
    torch_dtype=torch.float16,
).to('cuda')

pipeline = KandinskyPipeline.from_pretrained(
    'kandinsky-community/kandinsky-2-1',
    torch_dtype=torch.float16
).to('cuda')

In [None]:
prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality"

image_embeds, negative_image_embeds = prior_pipeline(
    prompt,
    negative_prompt=negative_prompt,
    guidance_scale=1.0
).to_tuple()

image = pipeline(
    prompt,
    negative_prompt=negative_prompt,
    image_embeds=image_embeds,
    negative_image_embeds=negative_image_embeds,
    height=768,
    width=768
).images[0]
image

##### Kandinsky 2.2

In [None]:
from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline
import torch

prior_pipeline = KandinskyV22PriorPipeline.from_pretrained(
    'kandinsky-community/kandinsky-2-2-prior',
    torch_dtype=torch.float16
).to('cuda')

pipeline = KandinskyV22Pipeline.from_pretrained(
    'kandinsky-community/kandinsky-2-2-decoder',
    torch_dtype=torch.float16
).to('cuda')

In [None]:
prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality"

image_embeds, negative_image_embeds = prior_pipeline(
    prompt,
    negative_prompt=negative_prompt,
    guidance_scale=1.0
).to_tuple()

image = pipeline(
    image_embeds=image_embeds,
    negative_image_embeds=negative_image_embeds,
    height=768,
    width=768
).images[0]
image

##### Kandinsky 3

Kandinsky 3 does not require a prior model, so we can directly load the `Kandinsky3Pipeline` and pass a prompt to generate an image:

In [None]:
from diffusers import Kandinsky3Pipeline
import torch

pipeline = Kandinsky3Pipeline.from_pretrained(
    "kandinsky-community/kandinsky-3",
    variant="fp16",
    torch_dtype=torch.float16
)
pipeline.enable_model_cpu_offload()

In [None]:
prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
image = pipeline(prompt).images[0]
image

#### AutoPipeline end-to-end API


Diffusers provides an end-to-end API with the `KandinskyCombinedPipeline` and `KandinskyV22CombinedPipeline`, which automatically load both the prior model and the decoder. Loading a Kandinsky checkpoint through `AutoPipelineForText2Image` selects the matching combined pipeline.

##### Kandinsky 2.1

In [None]:
from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained(
    "kandinsky-community/kandinsky-2-1",
    torch_dtype=torch.float16
)
pipeline.enable_model_cpu_offload()

In [None]:
prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality"

image = pipeline(
    prompt,
    negative_prompt=negative_prompt,
    prior_guidance_scale=1.0,
    guidance_scale=4.0,
    height=768,
    width=768
).images[0]
image
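
The `prior_guidance_scale` and `guidance_scale` arguments above control how strongly each stage follows the prompt. A minimal sketch of the standard classifier-free guidance formula, using toy two-element lists in place of real model predictions (the `apply_guidance` helper and the numbers are illustrative assumptions, not the pipelines' actual code):

```python
from typing import List


def apply_guidance(uncond: List[float], cond: List[float], guidance_scale: float) -> List[float]:
    """Classifier-free guidance: start at the unconditional prediction and
    move toward (or past) the conditional one by `guidance_scale`."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 0.0]  # toy prediction for the negative / empty prompt
cond = [1.0, 2.0]    # toy prediction for the actual prompt

# A scale of 1.0 returns the conditional prediction unchanged.
print(apply_guidance(uncond, cond, 1.0))
# A scale of 4.0 extrapolates beyond the conditional prediction.
print(apply_guidance(uncond, cond, 4.0))
```

Under this formula, a scale of 1.0 means the negative prompt has no effect on the result, while larger values push the output further toward the prompt.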

##### Kandinsky 2.2

In [None]:
from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder",
    torch_dtype=torch.float16
)
pipeline.enable_model_cpu_offload()

In [None]:
prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality"

image = pipeline(
    prompt,
    negative_prompt=negative_prompt,
    prior_guidance_scale=1.0,
    guidance_scale=4.0,
    height=768,
    width=768
).images[0]
image

## Image-to-image

##### Kandinsky 2.1

In [None]:
from diffusers import KandinskyImg2ImgPipeline, KandinskyPriorPipeline
from diffusers.utils import load_image, make_image_grid
import torch

prior_pipeline = KandinskyPriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1-prior",
    torch_dtype=torch.float16,
    use_safetensors=True
).to('cuda')

pipeline = KandinskyImg2ImgPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1",
    torch_dtype=torch.float16,
    use_safetensors=True
).to("cuda")

In [None]:
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
original_image = load_image(url)
original_image = original_image.resize((768, 512))

prompt = "A fantasy landscape, Cinematic lighting"
negative_prompt = "low quality, bad quality"

image_embeds, negative_image_embeds = prior_pipeline(
    prompt,
    negative_prompt
).to_tuple()

image = pipeline(
    prompt,
    negative_prompt=negative_prompt,
    image=original_image,
    image_embeds=image_embeds,
    negative_image_embeds=negative_image_embeds,
    height=768,
    width=768,
    strength=0.3
).images[0]
make_image_grid([original_image.resize((512,512)), image.resize((512,512))], rows=1, cols=2)
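
The `strength` argument controls how far the result may drift from the input image. A rough sketch of the usual image-to-image behavior, assuming the pipeline noises the input part of the way and then runs only the remaining portion of the schedule (the `denoising_steps` helper is hypothetical and only mirrors that arithmetic):

```python
def denoising_steps(num_inference_steps: int, strength: float) -> int:
    """Approximate number of denoising steps run for a given strength.

    Assumes the schedule is truncated in proportion to `strength`, so a low
    strength keeps the result close to the input image.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)


for strength in (0.3, 0.5, 0.75, 1.0):
    print(f"strength={strength}: {denoising_steps(100, strength)} of 100 steps")
```

With `strength=1.0` the whole schedule runs and the input image has little influence; values such as 0.3 preserve most of the original composition.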

##### Kandinsky 2.2

In [None]:
from diffusers import KandinskyV22Img2ImgPipeline, KandinskyV22PriorPipeline
from diffusers.utils import load_image, make_image_grid
import torch

prior_pipeline = KandinskyV22PriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior",
    torch_dtype=torch.float16,
    use_safetensors=True
).to("cuda")

pipeline = KandinskyV22Img2ImgPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder",
    torch_dtype=torch.float16,
    use_safetensors=True
).to("cuda")

In [None]:
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
original_image = load_image(url)
original_image = original_image.resize((768, 512))

prompt = "A fantasy landscape, Cinematic lighting"
negative_prompt = "low quality, bad quality"

image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt).to_tuple()

image = pipeline(
    image=original_image,
    image_embeds=image_embeds,
    negative_image_embeds=negative_image_embeds,
    height=768,
    width=768,
    strength=0.3
).images[0]

make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)

##### Kandinsky 3

In [None]:
from diffusers import Kandinsky3Img2ImgPipeline
from diffusers.utils import load_image, make_image_grid
import torch

pipeline = Kandinsky3Img2ImgPipeline.from_pretrained(
    "kandinsky-community/kandinsky-3",
    variant="fp16",
    torch_dtype=torch.float16
)
pipeline.enable_model_cpu_offload()

In [None]:
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
original_image = load_image(url)
original_image = original_image.resize((768, 512))

prompt = "A fantasy landscape, Cinematic lighting"
negative_prompt = "low quality, bad quality"

image = pipeline(
    prompt,
    negative_prompt=negative_prompt,
    image=original_image,
    strength=0.75,
    num_inference_steps=25
).images[0]
make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)

#### AutoPipeline end-to-end API

##### Kandinsky 2.1

In [None]:
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import make_image_grid, load_image
import torch

pipeline = AutoPipelineForImage2Image.from_pretrained(
    'kandinsky-community/kandinsky-2-1',
    torch_dtype=torch.float16,
    use_safetensors=True
)
pipeline.enable_model_cpu_offload()

In [None]:
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
original_image = load_image(url)

prompt = "A fantasy landscape, Cinematic lighting"
negative_prompt = "low quality, bad quality"

original_image.thumbnail((768, 768))

image = pipeline(
    prompt,
    negative_prompt=negative_prompt,
    image=original_image,
    strength=0.3
).images[0]
make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)

##### Kandinsky 2.2

In [None]:
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import make_image_grid, load_image
import torch

pipeline = AutoPipelineForImage2Image.from_pretrained(
    'kandinsky-community/kandinsky-2-2-decoder',
    torch_dtype=torch.float16,
    use_safetensors=True
)
pipeline.enable_model_cpu_offload()

In [None]:
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
original_image = load_image(url)

prompt = "A fantasy landscape, Cinematic lighting"
negative_prompt = "low quality, bad quality"

original_image.thumbnail((768, 768))

image = pipeline(
    prompt,
    negative_prompt=negative_prompt,
    image=original_image,
    strength=0.3
).images[0]
make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)

## Inpainting

The Kandinsky inpainting pipelines now expect **white pixels** to mark the masked area (the region to repaint); previously, black pixels were used.

If our mask follows the older black-pixel convention, we need to invert it:
```python
# For PIL input
import PIL.ImageOps
mask = PIL.ImageOps.invert(mask)

# For PyTorch and NumPy input
mask = 1 - mask
```

##### Kandinsky 2.1

In [None]:
from diffusers import KandinskyInpaintPipeline, KandinskyPriorPipeline
from diffusers.utils import load_image, make_image_grid
import torch
import numpy as np
from PIL import Image

prior_pipeline = KandinskyPriorPipeline.from_pretrained(
    'kandinsky-community/kandinsky-2-1-prior',
    torch_dtype=torch.float16,
    use_safetensors=True
).to('cuda')

pipeline = KandinskyInpaintPipeline.from_pretrained(
    'kandinsky-community/kandinsky-2-1-inpaint',
    torch_dtype=torch.float16,
    use_safetensors=True
).to('cuda')

In [None]:
init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")

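# PIL reports size as (width, height); NumPy expects (height, width)
mask = np.zeros(init_image.size[::-1], dtype=np.float32)
# Mask out the area above the cat's head
mask[:250, 250:-250] = 1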

prompt = 'a hat'

prior_output = prior_pipeline(prompt)

output_image = pipeline(
    prompt,
    image=init_image,
    mask_image=mask,
    **prior_output,
    num_inference_steps=150,
    height=768,
    width=768,
).images[0]

mask = Image.fromarray((mask*255).astype('uint8'), 'L')

make_image_grid([init_image, mask, output_image], rows=1, cols=3)
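
To see which pixels `mask[:250, 250:-250] = 1` actually selects, the slices can be resolved into explicit bounds. The sketch below assumes a 768 x 768 mask, and `masked_region` is a hypothetical helper written for this illustration:

```python
def masked_region(height: int, width: int, rows: slice, cols: slice):
    """Resolve NumPy-style slices into explicit (start, stop) pixel bounds."""
    row_start, row_stop, _ = rows.indices(height)
    col_start, col_stop, _ = cols.indices(width)
    return (row_start, row_stop), (col_start, col_stop)


# Same slices as `mask[:250, 250:-250] = 1`, assuming a 768 x 768 mask.
row_bounds, col_bounds = masked_region(768, 768, slice(None, 250), slice(250, -250))
area = (row_bounds[1] - row_bounds[0]) * (col_bounds[1] - col_bounds[0])

print("rows:", row_bounds, "cols:", col_bounds)
print(f"masked share of the image: {area / (768 * 768):.1%}")
```

The masked rectangle covers the top 250 rows and the middle band of columns, which is the region the pipeline repaints with the prompt.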

The end-to-end `KandinskyInpaintCombinedPipeline`:

In [None]:
import torch
import numpy as np
from PIL import Image
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image, make_image_grid

pipe = AutoPipelineForInpainting.from_pretrained(
    'kandinsky-community/kandinsky-2-1-inpaint',
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()

In [None]:
init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
mask = np.zeros((768, 768), dtype=np.float32)

mask[:250, 250:-250] = 1
prompt = 'a hat'

output_image = pipe(
    prompt,
    image=init_image,
    mask_image=mask
).images[0]
mask = Image.fromarray((mask*255).astype('uint8'), 'L')
make_image_grid([init_image, mask, output_image], rows=1, cols=3)

##### Kandinsky 2.2

In [None]:
from diffusers import KandinskyV22InpaintPipeline, KandinskyV22PriorPipeline
from diffusers.utils import load_image, make_image_grid
import torch
import numpy as np
from PIL import Image

prior_pipeline = KandinskyV22PriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior",
    torch_dtype=torch.float16,
    use_safetensors=True
).to("cuda")

pipeline = KandinskyV22InpaintPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder-inpaint",
    torch_dtype=torch.float16,
    use_safetensors=True
).to("cuda")

In [None]:
init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")

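# PIL reports size as (width, height); NumPy expects (height, width)
mask = np.zeros(init_image.size[::-1], dtype=np.float32)
# Mask out the area above the cat's head
mask[:250, 250:-250] = 1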

prompt = 'a hat'

prior_output = prior_pipeline(prompt)

output_image = pipeline(
    image=init_image,
    mask_image=mask,
    **prior_output,
    height=768,
    width=768,
    num_inference_steps=150
).images[0]
mask = Image.fromarray((mask*255).astype('uint8'), 'L')
make_image_grid([init_image, mask, output_image], rows=1, cols=3)

The end-to-end `KandinskyV22InpaintCombinedPipeline`:

In [None]:
import torch
import numpy as np
from PIL import Image
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image, make_image_grid

pipe = AutoPipelineForInpainting.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder-inpaint",
    torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

In [None]:
init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
mask = np.zeros((768, 768), dtype=np.float32)

mask[:250, 250:-250] = 1
prompt = 'a hat'

output_image = pipe(
    prompt,
    image=init_image,
    mask_image=mask
).images[0]
mask = Image.fromarray((mask*255).astype('uint8'), 'L')
make_image_grid([init_image, mask, output_image], rows=1, cols=3)

## Interpolation

Interpolation allows us to explore the latent space between the image and text embeddings.

##### Kandinsky 2.1

In [None]:
from diffusers import KandinskyPriorPipeline, KandinskyPipeline
from diffusers.utils import load_image, make_image_grid
import torch

prior_pipeline = KandinskyPriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1-prior",
    torch_dtype=torch.float16,
    use_safetensors=True
).to('cuda')

pipeline = KandinskyPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1",
    torch_dtype=torch.float16,
    use_safetensors=True
).to("cuda")

In [None]:
img_1 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/starry_night.jpeg")
make_image_grid([img_1.resize((512,512)), img_2.resize((512,512))], rows=1, cols=2)

We can specify the text or images to interpolate, and set the weights for each text or image.

In [None]:
images_texts = ['a cat', img_1, img_2]
weights = [0.3, 0.3, 0.4]

# prompt can be left empty
prompt = ""
prior_out = prior_pipeline.interpolate(images_texts, weights)

image = pipeline(
    prompt,
    **prior_out,
    height=768,
    width=768,
).images[0]
make_image_grid([img_1.resize((512,512)), img_2.resize((512,512)), image.resize((512,512))], rows=1, cols=3)
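
Conceptually, the weights blend the individual embeddings into a single conditioning vector. A toy sketch with three-element lists standing in for real embeddings (the `weighted_sum` helper and the vectors are illustrative assumptions; the actual prior works on high-dimensional tensors):

```python
from typing import List, Sequence


def weighted_sum(embeddings: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Blend several embeddings into one by summing them with per-item weights."""
    if len(embeddings) != len(weights):
        raise ValueError("need exactly one weight per embedding")
    dim = len(embeddings[0])
    return [sum(w * emb[i] for emb, w in zip(embeddings, weights)) for i in range(dim)]


# Toy stand-ins for the embeddings of 'a cat', img_1, and img_2.
text_emb = [1.0, 0.0, 0.0]
cat_emb = [0.0, 1.0, 0.0]
painting_emb = [0.0, 0.0, 1.0]

mixed = weighted_sum([text_emb, cat_emb, painting_emb], [0.3, 0.3, 0.4])
print(mixed)
```

Raising one weight relative to the others pulls the blended embedding, and therefore the generated image, toward that input.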

##### Kandinsky 2.2

In [None]:
from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline
from diffusers.utils import load_image, make_image_grid
import torch

prior_pipeline = KandinskyV22PriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior",
    torch_dtype=torch.float16,
    use_safetensors=True
).to("cuda")

pipeline = KandinskyV22Pipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder",
    torch_dtype=torch.float16,
    use_safetensors=True
).to("cuda")

In [None]:
img_1 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/starry_night.jpeg")
make_image_grid([img_1.resize((512,512)), img_2.resize((512,512))], rows=1, cols=2)

In [None]:
images_texts = ['a cat', img_1, img_2]
weights = [0.3, 0.3, 0.4]

# The Kandinsky 2.2 decoder is conditioned only on the embeddings, so no prompt is passed
prior_out = prior_pipeline.interpolate(images_texts, weights)

image = pipeline(
    **prior_out,
    height=768,
    width=768,
).images[0]
make_image_grid([img_1.resize((512,512)), img_2.resize((512,512)), image.resize((512,512))], rows=1, cols=3)

## ControlNet

ControlNet is only supported for Kandinsky 2.2.

In [None]:
from diffusers.utils import load_image, make_image_grid
import torch
import numpy as np
from transformers import pipeline

img = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png"
).resize((768, 768))
img

In [None]:
def make_hint(image, depth_estimator):
    image = depth_estimator(image)['depth']
    image = np.array(image)
    image = image[:, :, None]
    image = np.concatenate([image, image, image], axis=2)
    detected_map = torch.from_numpy(image).float() / 255.
    hint = detected_map.permute(2, 0, 1)
    return hint


depth_estimator = pipeline('depth-estimation')

In [None]:
hint = make_hint(img, depth_estimator).unsqueeze(0).half().to('cuda')

### Text-to-image

In [None]:
from diffusers import KandinskyV22PriorPipeline, KandinskyV22ControlnetPipeline

prior_pipeline = KandinskyV22PriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior",
    torch_dtype=torch.float16,
    use_safetensors=True
).to("cuda")

pipeline = KandinskyV22ControlnetPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-controlnet-depth",
    torch_dtype=torch.float16
).to("cuda")

In [None]:
prompt = "A robot, 4k photo"
negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature"

generator = torch.Generator('cuda').manual_seed(111)

image_emb, zero_image_emb = prior_pipeline(
    prompt,
    negative_prompt=negative_prior_prompt,
    generator=generator,
).to_tuple()

image = pipeline(
    image_embeds=image_emb,
    negative_image_embeds=zero_image_emb,
    hint=hint,
    num_inference_steps=50,
    generator=generator,
    height=768,
    width=768,
).images[0]
make_image_grid([img, image], rows=1, cols=2)

### Image-to-image

In [None]:
from diffusers import KandinskyV22PriorEmb2EmbPipeline, KandinskyV22ControlnetImg2ImgPipeline

prior_pipeline = KandinskyV22PriorEmb2EmbPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior",
    torch_dtype=torch.float16,
    use_safetensors=True
).to("cuda")

pipeline = KandinskyV22ControlnetImg2ImgPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-controlnet-depth",
    torch_dtype=torch.float16
).to("cuda")

In [None]:
prompt = "A robot, 4k photo"
negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature"

generator = torch.Generator('cuda').manual_seed(111)

img_emb = prior_pipeline(
    prompt,
    image=img,
    strength=0.85,
    generator=generator
)
negative_emb = prior_pipeline(
    negative_prior_prompt,
    image=img,
    strength=1,
    generator=generator
)

In [None]:
image = pipeline(
    image=img,
    strength=0.5,
    image_embeds=img_emb.image_embeds,
    negative_image_embeds=negative_emb.image_embeds,
    hint=hint,
    num_inference_steps=50,
    generator=generator,
    height=768,
    width=768
).images[0]
make_image_grid([img.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)

## Optimizations

Since Kandinsky requires a prior pipeline to generate the mappings and a second pipeline to decode the latents into an image, our optimizations should focus on the second pipeline because that is where the bulk of the computation happens.

1. Enable `xFormers` if we use PyTorch < 2.0:

In [None]:
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
pipe.enable_xformers_memory_efficient_attention()

2. With PyTorch >= 2.0, scaled dot-product attention (SDPA) is used automatically. We can additionally compile the UNet with `torch.compile` to speed up inference:

In [None]:
pipe.unet.to(memory_format=torch.channels_last)
pipe.unet = torch.compile(pipe.unet, mode='reduce-overhead', fullgraph=True)

Relying on the default SDPA behavior is the same as explicitly setting the attention processor to `AttnAddedKVProcessor2_0`:

In [None]:
from diffusers.models.attention_processor import AttnAddedKVProcessor2_0

pipe.unet.set_attn_processor(AttnAddedKVProcessor2_0())

3. Offload the model to the CPU with `enable_model_cpu_offload()` to avoid out-of-memory errors:

In [None]:
pipe.enable_model_cpu_offload()

4. By default, the text-to-image pipeline uses the `DDIMScheduler`, but we can replace it with another scheduler such as `DDPMScheduler` to see how that affects the tradeoff between inference speed and image quality:

In [None]:
from diffusers import DDPMScheduler
from diffusers import DiffusionPipeline

scheduler = DDPMScheduler.from_pretrained(
    "kandinsky-community/kandinsky-2-1",
    subfolder="ddpm_scheduler"
)
pipe = DiffusionPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1",
    scheduler=scheduler,
    torch_dtype=torch.float16,
    use_safetensors=True
).to("cuda")
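
The choice between steps 1 and 2 above depends only on the installed PyTorch version. A small sketch of that decision, where `attention_backend` is a hypothetical helper and the returned labels are just names for the two options:

```python
def attention_backend(torch_version: str) -> str:
    """Pick the attention optimization suggested above for a PyTorch version string."""
    # Drop local build tags such as "+cu118" before parsing the major version.
    major = int(torch_version.split("+")[0].split(".")[0])
    return "sdpa" if major >= 2 else "xformers"


print(attention_backend("1.13.1"))
print(attention_backend("2.1.0+cu118"))
```

In a real session the argument would be `torch.__version__`, and the result would decide whether to call `enable_xformers_memory_efficient_attention()` or rely on SDPA.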