In [None]:
!pip install -qU diffusers accelerate transformers huggingface_hub

In [None]:
from huggingface_hub import notebook_login
notebook_login()

# Image-to-Image

In addition to a text prompt, we can also pass an initial image as a starting point for the diffusion process. The initial image is encoded to latent space and noise is added to it.

Then the latent diffusion model takes a prompt and the noisy latent image, predicts the added noise, and removes the predicted noise from the noisy latent image to get the new latent image. Finally, a decoder decodes the new latent image back into an image.
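As a rough illustration of the "noise is added" step, the sketch below mixes a latent with Gaussian noise in the way DDPM-style schedulers commonly do. The function name `add_noise`, the plain-list latent, and the `alpha_bar` value are assumptions made for illustration; the real pipeline relies on its scheduler and on tensors, not on this code.

```python
import math
import random

def add_noise(latent, noise, alpha_bar):
    """Mix a clean latent with noise (DDPM-style forward step, illustrative only).

    alpha_bar close to 1.0 keeps mostly the original latent,
    alpha_bar close to 0.0 keeps mostly the noise.
    """
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * x + noise_scale * n for x, n in zip(latent, noise)]

random.seed(0)
latent = [0.5, -1.0, 0.25, 2.0]                   # hypothetical stand-in for an encoded image
noise = [random.gauss(0.0, 1.0) for _ in latent]  # Gaussian noise of the same length
noisy_latent = add_noise(latent, noise, alpha_bar=0.3)
print(noisy_latent)
```

The denoising model then starts from `noisy_latent` instead of pure noise, which is why traces of the initial image survive in the output.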

1. Load a checkpoint into the `AutoPipelineForImage2Image` class.

In [None]:
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image, make_image_grid

pipeline = AutoPipelineForImage2Image.from_pretrained(
    'kandinsky-community/kandinsky-2-2-decoder',
    torch_dtype=torch.float16,
    use_safetensors=True,
)

pipeline.enable_model_cpu_offload()
# Remove the following line if xFormers is not installed or if PyTorch 2.0 or higher is installed
pipeline.enable_xformers_memory_efficient_attention()

If we are using PyTorch 2.0 or higher, we do not need to call `enable_xformers_memory_efficient_attention()` on our pipeline because it already uses PyTorch's native `scaled_dot_product_attention`.
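One way to make that choice automatic is to inspect the installed PyTorch version before enabling xFormers. The helper name `has_native_sdpa` is hypothetical, and the check only looks at the major version number, so treat it as a minimal sketch and not as an official diffusers utility.

```python
def has_native_sdpa(torch_version: str) -> bool:
    """Return True when the version string is PyTorch 2.0 or newer."""
    major = int(torch_version.split(".")[0])
    return major >= 2

# Typical use in this notebook (commented out so the sketch runs without torch):
# import torch
# if not has_native_sdpa(torch.__version__):
#     pipeline.enable_xformers_memory_efficient_attention()

print(has_native_sdpa("1.13.1"))
print(has_native_sdpa("2.1.0+cu118"))
```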

2. Load an image to pass to the pipeline.

In [None]:
init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")

3. Pass a prompt and image to the pipeline to generate an image.

In [None]:
prompt = "cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"

image = pipeline(
    prompt,
    image=init_image,
).images[0]

make_image_grid([init_image, image], rows=1, cols=2)

## Popular models

### Stable Diffusion v1.5

In [None]:
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image, make_image_grid

pipeline = AutoPipelineForImage2Image.from_pretrained(
    'stable-diffusion-v1-5/stable-diffusion-v1-5',
    torch_dtype=torch.float16,
    variant='fp16',
    use_safetensors=True,
)
pipeline.enable_model_cpu_offload()
pipeline.enable_xformers_memory_efficient_attention()

In [None]:
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
init_image = load_image(url)

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"

image = pipeline(
    prompt,
    image=init_image,
).images[0]
make_image_grid([init_image, image], rows=1, cols=2)

### Stable Diffusion XL (SDXL)

SDXL uses a larger base model, and an additional refiner model to increase the quality of the base model's output.

In [None]:
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import make_image_grid, load_image

pipeline = AutoPipelineForImage2Image.from_pretrained(
    'stabilityai/stable-diffusion-xl-refiner-1.0',
    torch_dtype=torch.float16,
    variant='fp16',
    use_safetensors=True,
)
pipeline.enable_model_cpu_offload()
pipeline.enable_xformers_memory_efficient_attention()

In [None]:
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-sdxl-init.png"
init_image = load_image(url)

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"

image = pipeline(
    prompt,
    image=init_image,
    strength=0.5
).images[0]

make_image_grid([init_image, image], rows=1, cols=2)

### Kandinsky 2.2

The Kandinsky model is different from the Stable Diffusion models because it uses an image prior model to create image embeddings. The embeddings help create a better alignment between text and images, allowing the latent diffusion model to generate better images.

In [None]:
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import make_image_grid, load_image

pipeline = AutoPipelineForImage2Image.from_pretrained(
    'kandinsky-community/kandinsky-2-2-decoder',
    torch_dtype=torch.float16,
    use_safetensors=True,
)
pipeline.enable_model_cpu_offload()
pipeline.enable_xformers_memory_efficient_attention()

In [None]:
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
init_image = load_image(url)

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"

image = pipeline(
    prompt,
    image=init_image,
).images[0]
make_image_grid([init_image, image], rows=1, cols=2)

## Configure pipeline parameters

### Strength

`strength` has a huge impact on our generated image. It determines how much the generated image resembles the initial image.
* a higher `strength` value gives the model more "creativity" to generate an image that is different from the initial image; a `strength` value of 1.0 means the initial image is more or less ignored.
* a lower `strength` value means the generated image is more similar to the initial image.

`strength` also determines how many noise steps are added. If `num_inference_steps` is 50 and `strength` is 0.8, then 40 (50 * 0.8) steps of noise are added to the initial image, which is then denoised for 40 steps to produce the newly generated image.
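The arithmetic above can be sketched in a few lines. The function `denoising_steps` is a hypothetical helper written for this notebook; it follows the step calculation described above and is not a call into diffusers itself.

```python
def denoising_steps(num_inference_steps: int, strength: float) -> int:
    """Number of steps that are actually run for a given strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    # Truncate to a whole number of steps and never exceed the full schedule.
    return min(int(num_inference_steps * strength), num_inference_steps)

for strength in (0.5, 0.8, 1.0):
    print(strength, denoising_steps(50, strength))
```

A practical consequence is that a low `strength` also shortens generation time, since fewer denoising steps are run.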

In [None]:
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image, make_image_grid

pipeline = AutoPipelineForImage2Image.from_pretrained(
    'stable-diffusion-v1-5/stable-diffusion-v1-5',
    torch_dtype=torch.float16,
    variant='fp16',
    use_safetensors=True,
)
pipeline.enable_model_cpu_offload()
pipeline.enable_xformers_memory_efficient_attention()

In [None]:
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
init_image = load_image(url)

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"

image = pipeline(
    prompt,
    image=init_image,
    strength=0.8,
).images[0]
make_image_grid([init_image, image], rows=1, cols=2)

### Guidance scale

The `guidance_scale` is used to control how closely aligned the generated image and the text prompt are.
* a higher `guidance_scale` value means our generated image is more aligned with the prompt.
* a lower `guidance_scale` value means our generated image has more space to deviate from the prompt.

We can combine `guidance_scale` with `strength` for even more precise control over how expressive the model is.
* a high `strength` combined with a high `guidance_scale` for maximum creativity.
* a low `strength` combined with a low `guidance_scale` to generate an image that resembles the initial image but is not as strictly bound to the prompt.
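To see why a larger `guidance_scale` pulls the result toward the prompt, here is a minimal sketch of the classifier-free guidance combination that Stable Diffusion style pipelines typically apply at each denoising step. The function name `apply_guidance` and the toy noise values are assumptions for illustration; real pipelines operate on tensors.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Combine unconditional and prompt-conditioned noise predictions.

    The difference (text - uncond) is the direction the prompt pushes in;
    guidance_scale controls how far we move along it.
    """
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]

noise_uncond = [0.0, 1.0, -1.0]  # hypothetical prediction without the prompt
noise_text = [1.0, 1.0, 0.0]     # hypothetical prediction with the prompt
for scale in (1.0, 5.0, 8.0):
    print(scale, apply_guidance(noise_uncond, noise_text, scale))
```

With a scale of 1.0 the result equals the prompt-conditioned prediction, and larger values exaggerate the prompt's influence.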

In [None]:
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image, make_image_grid

pipeline = AutoPipelineForImage2Image.from_pretrained(
    'stable-diffusion-v1-5/stable-diffusion-v1-5',
    torch_dtype=torch.float16,
    variant='fp16',
    use_safetensors=True,
)
pipeline.enable_model_cpu_offload()
pipeline.enable_xformers_memory_efficient_attention()

In [None]:
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
init_image = load_image(url)

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"

image = pipeline(
    prompt,
    image=init_image,
    guidance_scale=8.0,
).images[0]
make_image_grid([init_image, image], rows=1, cols=2)

In [None]:
image = pipeline(
    prompt,
    image=init_image,
    guidance_scale=0.1,
).images[0]
make_image_grid([init_image, image], rows=1, cols=2)

In [None]:
image = pipeline(
    prompt,
    image=init_image,
    guidance_scale=5.0,
).images[0]
make_image_grid([init_image, image], rows=1, cols=2)

In [None]:
image = pipeline(
    prompt,
    image=init_image,
    guidance_scale=10.0,
).images[0]
make_image_grid([init_image, image], rows=1, cols=2)

In [None]:
image = pipeline(
    prompt,
    image=init_image,
    guidance_scale=8.0,
    strength=0.5,
).images[0]
make_image_grid([init_image, image], rows=1, cols=2)

In [None]:
image = pipeline(
    prompt,
    image=init_image,
    guidance_scale=8.0,
    strength=0.1,
).images[0]
make_image_grid([init_image, image], rows=1, cols=2)

In [None]:
image = pipeline(
    prompt,
    image=init_image,
    guidance_scale=8.0,
    strength=0.8,
).images[0]
make_image_grid([init_image, image], rows=1, cols=2)

### Negative prompt

In [None]:
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import make_image_grid, load_image

pipeline = AutoPipelineForImage2Image.from_pretrained(
    'stabilityai/stable-diffusion-xl-refiner-1.0',
    torch_dtype=torch.float16,
    variant='fp16',
    use_safetensors=True,
)
pipeline.enable_model_cpu_offload()
pipeline.enable_xformers_memory_efficient_attention()

In [None]:
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
init_image = load_image(url)

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
negative_prompt = "ugly, deformed, disfigured, poor details, bad anatomy"

image = pipeline(
    prompt,
    negative_prompt=negative_prompt,
    image=init_image,
).images[0]
make_image_grid([init_image, image], rows=1, cols=2)

## Chained image-to-image pipelines

### Text-to-image-to-image

Chaining a text-to-image and image-to-image pipeline allows us to generate an image from text and use the generated image as the initial image for the image-to-image pipeline.

In [None]:
import torch
from diffusers import AutoPipelineForImage2Image, AutoPipelineForText2Image
from diffusers.utils import load_image, make_image_grid

pipeline = AutoPipelineForText2Image.from_pretrained(
    'stable-diffusion-v1-5/stable-diffusion-v1-5',
    torch_dtype=torch.float16,
    variant='fp16',
    use_safetensors=True,
)
pipeline.enable_model_cpu_offload()
pipeline.enable_xformers_memory_efficient_attention()

In [None]:
text2image = pipeline(
    "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
).images[0]
text2image

In [None]:
pipeline = AutoPipelineForImage2Image.from_pretrained(
    'kandinsky-community/kandinsky-2-2-decoder',
    torch_dtype=torch.float16,
    use_safetensors=True,
)
pipeline.enable_model_cpu_offload()
pipeline.enable_xformers_memory_efficient_attention()

In [None]:
image2image = pipeline(
    "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
    image=text2image,
).images[0]  # take the first generated image so it can be shown in the grid
make_image_grid([text2image, image2image], rows=1, cols=2)

### Image-to-image-to-image

In [None]:
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image, make_image_grid

pipeline = AutoPipelineForImage2Image.from_pretrained(
    'stable-diffusion-v1-5/stable-diffusion-v1-5',
    torch_dtype=torch.float16,
    variant='fp16',
    use_safetensors=True,
)
pipeline.enable_model_cpu_offload()
pipeline.enable_xformers_memory_efficient_attention()

In [None]:
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
init_image = load_image(url)

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"

image = pipeline(
    prompt,
    image=init_image,
    output_type='latent',
).images[0]

It is important to specify `output_type='latent'` in the pipeline so that the outputs stay in latent space, which avoids an unnecessary decode-encode step. This only works if the chained pipelines use the same VAE.
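For a sense of what is being handed from one pipeline to the next, the sketch below computes the latent shape for a given image size. The defaults of 4 latent channels and a downscaling factor of 8 are assumptions that match the Stable Diffusion v1.5 family; `latent_shape` is a hypothetical helper, and matching shapes are a necessary but not sufficient sign that two checkpoints share a VAE.

```python
def latent_shape(height: int, width: int, latent_channels: int = 4, vae_scale_factor: int = 8):
    """Shape (channels, height, width) of the latent for one image."""
    return (latent_channels, height // vae_scale_factor, width // vae_scale_factor)

print(latent_shape(512, 512))
```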

In [None]:
pipeline = AutoPipelineForImage2Image.from_pretrained(
    'ogkalu/Comic-Diffusion', # comic book art style
    torch_dtype=torch.float16,
)
pipeline.enable_model_cpu_offload()
pipeline.enable_xformers_memory_efficient_attention()

In [None]:
# need to include the token `charliebo artstyle` in the prompt to use this checkpoint
image = pipeline(
    'Astronaut in a jungle, charliebo artstyle',
    image=image,
    output_type='latent',
).images[0]  # keep only the latent so it can be passed to the next pipeline

In [None]:
# repeat one more time to generate the final image in a pixel art style
pipeline = AutoPipelineForImage2Image.from_pretrained(
    'kohbanye/pixel-art-style',
    torch_dtype=torch.float16,
)
pipeline.enable_model_cpu_offload()
pipeline.enable_xformers_memory_efficient_attention()

In [None]:
# need to include the token `pixelartstyle` in the prompt to use this checkpoint
image = pipeline(
    'Astronaut in a jungle, pixelartstyle',
    image=image,
).images[0]

make_image_grid([init_image, image], rows=1, cols=2)

### Image-to-upscaler-to-super-resolution

In [None]:
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image, make_image_grid

pipeline = AutoPipelineForImage2Image.from_pretrained(
    'stable-diffusion-v1-5/stable-diffusion-v1-5',
    torch_dtype=torch.float16,
    variant='fp16',
    use_safetensors=True,
)
pipeline.enable_model_cpu_offload()
pipeline.enable_xformers_memory_efficient_attention()

In [None]:
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
init_image = load_image(url)

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"

image_1 = pipeline(
    prompt,
    image=init_image,
    output_type='latent',
).images[0]

Make sure to set `output_type='latent'` in the pipeline call so that the output stays in latent space.

Chain it to an upscaler pipeline to increase the image resolution:

In [None]:
from diffusers import StableDiffusionLatentUpscalePipeline

upscaler = StableDiffusionLatentUpscalePipeline.from_pretrained(
    'stabilityai/sd-x2-latent-upscaler',
    torch_dtype=torch.float16,
    variant='fp16',
    use_safetensors=True,
)
upscaler.enable_model_cpu_offload()
upscaler.enable_xformers_memory_efficient_attention()

In [None]:
image_2 = upscaler(
    prompt,
    image=image_1,
    output_type='latent',
).images[0]

Finally, chain it to a super-resolution pipeline to further enhance the resolution:

In [None]:
from diffusers import StableDiffusionUpscalePipeline

super_res = StableDiffusionUpscalePipeline.from_pretrained(
    'stabilityai/stable-diffusion-x4-upscaler',
    torch_dtype=torch.float16,
    variant='fp16',
    use_safetensors=True,
)
super_res.enable_model_cpu_offload()
super_res.enable_xformers_memory_efficient_attention()

In [None]:
image_3 = super_res(
    prompt,
    image=image_2,
).images[0]

make_image_grid(
    [init_image, image_3.resize((512, 512))],
    rows=1,
    cols=2,
)