# Image Interpolation with Stable Diffusion

In this example, we will use Stable Diffusion (SD) to interpolate between images. **Image interpolation** using Stable Diffusion is the process of creating intermediate images that smoothly transition from one given image to another, using a generative model based on diffusion.

Use cases of image interpolation:
- Data Augmentation: SD can augment training data for ML models by generating synthetic images that lie between existing data points. This can improve the generalization and robustness of ML models, especially in image generation, classification, or object detection.
- Product Design and Prototyping: SD can aid in product design by generating variations of product designs or prototypes with subtle differences. This can be useful for exploring design alternatives, conducting user studies, or visualizing design iterations before committing to physical prototypes.
- Content Generation for Media Productions: SD can be used to generate intermediate frames between key frames, enabling smoother transitions and enhancing visual storytelling.


Below, we walk through several examples of image interpolation with SD and show how latent space walking can be implemented to create smooth transitions between images.

## Setup

In [None]:
!pip install -qU diffusers transformers xformers accelerate numpy scipy ftfy Pillow

In [1]:
import torch
import numpy as np
import os
import time

from PIL import Image
from IPython import display as IPdisplay
from tqdm.auto import tqdm

from diffusers import (
    StableDiffusionPipeline,
    DDIMScheduler,
    PNDMScheduler,
    LMSDiscreteScheduler,
    DPMSolverMultistepScheduler,
    EulerDiscreteScheduler,
    EulerAncestralDiscreteScheduler
)

from transformers import logging

logging.set_verbosity_error()

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

torch.backends.cudnn.benchmark = True
torch.backends.cuda.matmul.allow_tf32 = True

## Model

The [`stable-diffusion-v1-5`](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) model and the `LMSDiscreteScheduler` were chosen to generate images.

In [2]:
model_name_or_path = 'stable-diffusion-v1-5/stable-diffusion-v1-5'

scheduler = LMSDiscreteScheduler(
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule='scaled_linear',
    num_train_timesteps=1000
)

pipe = StableDiffusionPipeline.from_pretrained(
    model_name_or_path,
    scheduler=scheduler,
    torch_dtype=torch.float32
).to(device)

# disable image generation progress bar
# we will display our own
pipe.set_progress_bar_config(disable=True)



The following methods reduce GPU memory consumption. This cell can be skipped if enough VRAM is available.

In [3]:
# Offload the weights to the CPU and load them onto the GPU only for the forward pass
pipe.enable_model_cpu_offload()

# Tighter ordering of memory tensors
pipe.unet.to(memory_format=torch.channels_last)

# Decoding large batches of images with limited VRAM
# or batches with 32 images or more by decoding the batches of latents one image at a time
pipe.enable_vae_slicing()

# Splitting the image into overlapping tiles, decoding the tiles,
# and then blending the outputs together to compose the final image
pipe.enable_vae_tiling()

# Use xformers memory-efficient attention;
# with PyTorch >= 2.0, we should not expect an inference speed-up from enabling xformers
pipe.enable_xformers_memory_efficient_attention()

The `display_images` function converts a list of generated images into a GIF, saves it to the specified path, and returns the GIF object for display.

In [4]:
def display_images(images, save_path):
    # Generate a filename based on the current time, using hyphens instead of colons
    # so that the filename is valid on file systems that do not allow colons.
    # The path is built before the `try` block so it is always defined
    filename = time.strftime('%H-%M-%S', time.localtime())
    gif_path = f'{save_path}/{filename}.gif'

    try:
        # Each element of `images` is the list returned by `pipe(...).images`;
        # take its first image and convert it to an Image object
        frames = [Image.fromarray(np.array(image[0], dtype=np.uint8)) for image in images]

        # Save the first image as a GIF file at `gif_path`.
        # The rest of the images are added as subsequent frames to the GIF.
        # The GIF will play each frame for 100 ms and will loop indefinitely
        frames[0].save(
            gif_path,
            save_all=True,
            append_images=frames[1:],
            duration=100,
            loop=0
        )
    except Exception as e:
        # Report the error and return nothing, since no GIF was written
        print(e)
        return None

    # Return the saved GIF as an IPython display object
    return IPdisplay.Image(gif_path)
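The GIF-writing step can be tried in isolation, without running the diffusion model. The following is a minimal sketch that assumes Pillow and NumPy are installed; the solid-gray frames and the temporary output path are illustrative stand-ins, not part of the notebook.

```python
import tempfile
from pathlib import Path

import numpy as np
from PIL import Image

# Three solid-gray 64x64 frames stand in for generated images (illustrative only)
frames = [
    Image.fromarray(np.full((64, 64, 3), value, dtype=np.uint8))
    for value in (0, 128, 255)
]

# Write the frames to a temporary directory as an animated, endlessly looping GIF
gif_path = Path(tempfile.mkdtemp()) / "demo.gif"
frames[0].save(
    gif_path,
    save_all=True,             # write every frame, not just the first
    append_images=frames[1:],  # frames that follow the first one
    duration=100,              # milliseconds per frame
    loop=0,                    # 0 means loop forever
)

# Re-open the file to confirm how many frames were stored
with Image.open(gif_path) as gif:
    n_frames = gif.n_frames
```

Re-opening the file is a cheap way to confirm that all frames were written before wiring the function into a long generation loop.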

## Generation parameters

- `seed`: sets a specific random seed for reproducibility.
- `generator`: set to a PyTorch random number generator object if a seed is provided, otherwise `None`. It ensures that the operations using it have reproducible outcomes.
- `guidance_scale`: controls the extent to which the model should follow the prompt in text-to-image generation tasks, with higher values leading to stronger adherence to the prompt.
- `num_inference_steps`: specifies the number of steps the model takes to generate an image. More steps can lead to a higher quality image but take longer to generate.
- `num_interpolation_steps`: determines the number of steps used when interpolating between two points in the latent space, affecting the smoothness of transitions in generated animations.
- `height`: height of the generated images in pixels.
- `width`: width of the generated images in pixels.
- `save_path`: file system path where the generated gifs will be saved.
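To make the role of `guidance_scale` more concrete, here is a minimal NumPy sketch of the classifier-free guidance combination that Stable Diffusion pipelines are generally described as applying at each denoising step. The helper name `apply_guidance` and the toy arrays are assumptions made for illustration; they are not part of the `diffusers` API.

```python
import numpy as np


def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Hypothetical helper: push the prediction away from the unconditional
    (or negative-prompt) branch and toward the text-conditioned branch."""
    return noise_uncond + guidance_scale * (noise_text - noise_uncond)


# Toy noise predictions standing in for the two UNet outputs
noise_uncond = np.array([0.0, 1.0, 2.0])
noise_text = np.array([1.0, 1.0, 0.0])

# A scale of 1 reproduces the text-conditioned prediction unchanged
no_guidance = apply_guidance(noise_uncond, noise_text, 1.0)

# A scale of 8 (the value used in this notebook) amplifies the difference
strong = apply_guidance(noise_uncond, noise_text, 8.0)
```

Under this formulation, larger values extrapolate further along the direction from the unconditional prediction to the prompt-conditioned one, which is why very high scales tend to produce over-saturated results.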

In [5]:
seed = None

if seed is not None:
    generator = torch.manual_seed(seed)
else:
    generator = None

guidance_scale = 8
num_inference_steps = 15
num_interpolation_steps = 30
height = 512
width = 512

save_path = '/output'
if not os.path.exists(save_path):
    os.makedirs(save_path)

## Example 1: Prompt interpolation

Walking through the prompt embedding space lets us explore the region around the concept described by a prompt, producing a variety of images whose characteristics change gradually.

Here, the walk adds scaled deltas to the original positive and negative prompt embeddings, creating a series of new embeddings that are later used to generate images with smooth transitions between states derived from the original prompt.

In [32]:
prompt = "Epic shot of Sweden, ultra detailed lake with an ren dear, nostalgic vintage, ultra cozy and inviting, wonderful light atmosphere, fairy, little photorealistic, digital painting, sharp focus, ultra cozy and inviting, wish to be there. very detailed, arty, should rank high on youtube for a dream trip."
negative_prompt = "poorly drawn,cartoon, 2d, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry"

# step size for the walk in the prompt embedding space
step_size = 0.001

prompt_tokens = pipe.tokenizer(
    prompt,
    padding='max_length',
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors='pt'
)
prompt_embeds = pipe.text_encoder(prompt_tokens.input_ids.to(device))[0]

if negative_prompt is None:
    negative_prompt = [""]

negative_prompt_tokens = pipe.tokenizer(
    negative_prompt,
    padding='max_length',
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors='pt'
)
negative_prompt_embeds = pipe.text_encoder(negative_prompt_tokens.input_ids.to(device))[0]

Now we generate a random initial latent tensor from a normal distribution, shaped to match the dimensions expected by the diffusion model (UNet). Then we walk the embeddings by adding a small, incrementally growing offset at each iteration. The shifted embeddings are stored in a list named `walked_embeddings`.

In [33]:
latents = torch.randn(
    (1, pipe.unet.config.in_channels, height // 8, width // 8),
    generator=generator
)

walked_embeddings = []
# interpolating between embeddings for the given number of interpolation steps
for i in range(num_interpolation_steps):
    walked_embeddings.append(
        [prompt_embeds + step_size * i, negative_prompt_embeds + step_size * i]
    )
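To see what this walk does numerically, the sketch below repeats the same arithmetic on a small random array that stands in for a prompt embedding. The array shape and the NumPy stand-in are assumptions chosen to keep the example light; the real embeddings are PyTorch tensors produced by the text encoder.

```python
import numpy as np

step_size = 0.001
num_interpolation_steps = 30

# Small random array standing in for a prompt embedding (illustrative shape)
rng = np.random.default_rng(0)
base = rng.standard_normal((1, 4, 8)).astype(np.float32)

# Same rule as in the notebook: shift every component by step_size * i
walked = [base + step_size * i for i in range(num_interpolation_steps)]

# The first entry is the unmodified embedding; the last one is shifted
# uniformly by step_size * (num_interpolation_steps - 1)
max_shift = float(np.abs(walked[-1] - base).max())
```

Because the same scalar is added to every component, the walk moves along a single fixed direction in embedding space, and the total displacement is controlled jointly by `step_size` and `num_interpolation_steps`.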

In [11]:
pipe.safety_checker = None

Finally, we generate a series of images based on the walked embeddings and then display them.

In [34]:
images = []
for embeds in tqdm(walked_embeddings):
    images.append(
        pipe(
            height=height,
            width=width,
            num_images_per_prompt=1,
            prompt_embeds=embeds[0],
            negative_prompt_embeds=embeds[1],
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            generator=generator,
            latents=latents
        ).images
    )


In [35]:
display_images(images, save_path)


## Example 2: Diffusion latents interpolation for a single prompt

In this example, we interpolate between two initial noise latents of the diffusion model itself, rather than between prompt embeddings.

The `slerp` function below stands for [**Spherical Linear Interpolation**](https://en.wikipedia.org/wiki/Slerp), which is a method of interpolation on the surface of a sphere. This function is commonly used in computer graphics to animate rotations in a smooth manner and can also be used to interpolate between high-dimensional data points in ML, such as latent vectors in generative models.

In [22]:
def slerp(v0, v1, num, t0=0, t1=1):
    v0 = v0.detach().cpu().numpy()
    v1 = v1.detach().cpu().numpy()

    def interpolation(t, v0, v1, DOT_THRESHOLD=0.9995):
        """Helper function to spherically interpolate two arrays v0 and v1"""
        dot = np.sum(v0 * v1 / (np.linalg.norm(v0) * np.linalg.norm(v1)))
        if np.abs(dot) > DOT_THRESHOLD:
            v2 = (1 - t) * v0 + t * v1
        else:
            theta_0 = np.arccos(dot)
            sin_theta_0 = np.sin(theta_0)
            theta_t = theta_0 * t
            sin_theta_t = np.sin(theta_t)

            s0 = np.sin(theta_0 - theta_t) / sin_theta_0
            s1 = sin_theta_t / sin_theta_0
            v2 = s0 * v0 + s1 * v1

        return v2

    t = np.linspace(t0, t1, num)

    v3 = torch.tensor(
        np.array([interpolation(t[i], v0, v1) for i in range(num)]),
        dtype=torch.float32
    )

    return v3
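One way to see why spherical interpolation is often preferred over plain linear interpolation for noise latents is to compare the norms of the midpoints. The sketch below is a simplified, standalone version for 1-D vectors that assumes the two inputs are not parallel; `slerp_point` is a hypothetical name and is not the notebook's `slerp`.

```python
import numpy as np


def slerp_point(t, v0, v1):
    """Simplified spherical interpolation for two non-parallel 1-D vectors."""
    cos_theta = np.dot(v0, v1) / (np.linalg.norm(v0) * np.linalg.norm(v1))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    return (np.sin((1 - t) * theta) * v0 + np.sin(t * theta) * v1) / np.sin(theta)


# Two orthogonal unit vectors
v0 = np.array([1.0, 0.0])
v1 = np.array([0.0, 1.0])

# Midpoint by linear and by spherical interpolation
lerp_mid = 0.5 * v0 + 0.5 * v1
slerp_mid = slerp_point(0.5, v0, v1)

lerp_norm = np.linalg.norm(lerp_mid)    # shorter than the endpoints
slerp_norm = np.linalg.norm(slerp_mid)  # same length as the endpoints
```

The linear midpoint cuts through the inside of the circle and loses length, while the spherical midpoint stays on it. For Gaussian noise latents, whose norms cluster around a typical value, keeping that norm is commonly cited as the reason intermediate frames remain plausible inputs to the UNet.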

In [23]:
prompt = "Sci-fi digital painting of an alien landscape with otherworldly plants, strange creatures, and distant planets."
negative_prompt = "poorly drawn,cartoon, 3d, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry"

# Generating initial latent vectors from a random normal distribution
# Two latent vectors are generated, serving as start and end points for the interpolation
# These vectors are shaped to fit the input requirements of the diffusion model's UNet
latents = torch.randn(
    (2, pipe.unet.config.in_channels, height // 8, width // 8),
    generator=generator
)

# Spherically interpolate between the two latent tensors
interpolated_latents = slerp(latents[0], latents[1], num_interpolation_steps)

# Generate images using the interpolated latents
images = []
for latent_vector in tqdm(interpolated_latents):
    images.append(
        pipe(
            prompt,
            negative_prompt=negative_prompt,
            height=height,
            width=width,
            num_images_per_prompt=1,
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            generator=generator,
            latents=latent_vector[None, ...]
        ).images
    )


In [24]:
display_images(images, save_path)


## Example 3: Interpolation between multiple prompts

In this example, we will interpolate between any number of prompts. We will take consecutive pairs of prompts and create smooth transitions between them. Then, we will combine the interpolations of these consecutive pairs and instruct the model to generate images based on them. For interpolation, we still use the `slerp` function shown in the second example.

In [25]:
prompts = [
    "A cute dog in a beautiful field of lavander colorful flowers everywhere, perfect lighting, leica summicron 35mm f2.0, kodak portra 400, film grain",
    "A cute cat in a beautiful field of lavander colorful flowers everywhere, perfect lighting, leica summicron 35mm f2.0, kodak portra 400, film grain",
]

negative_prompts = [
    "poorly drawn,cartoon, 2d, sketch, cartoon, drawing, anime, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry",
    "poorly drawn,cartoon, 2d, sketch, cartoon, drawing, anime, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry",
]

batch_size = len(prompts)


prompts_tokens = pipe.tokenizer(
    prompts,
    padding='max_length',
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors='pt'
)
prompts_embeds = pipe.text_encoder(prompts_tokens.input_ids.to(device))[0]

if negative_prompts is None:
    negative_prompts = [""] * batch_size

negative_prompts_tokens = pipe.tokenizer(
    negative_prompts,
    padding='max_length',
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors='pt'
)
negative_prompts_embeds = pipe.text_encoder(negative_prompts_tokens.input_ids.to(device))[0]

We take consecutive pairs of prompt embeddings and create smooth transitions between them with the `slerp` function.

In [26]:
latents = torch.randn(
    (1, pipe.unet.config.in_channels, height // 8, width // 8),
    generator=generator
)

# Interpolate between embedding pairs for the given number of interpolation steps
interpolated_prompt_embeds = []
interpolated_negative_prompt_embeds = []
for i in range(batch_size - 1):
    interpolated_prompt_embeds.append(
        slerp(
            prompts_embeds[i],
            prompts_embeds[i + 1],
            num_interpolation_steps
        )
    )
    interpolated_negative_prompt_embeds.append(
        slerp(
            negative_prompts_embeds[i],
            negative_prompts_embeds[i + 1],
            num_interpolation_steps
        )
    )

interpolated_prompt_embeds = torch.cat(interpolated_prompt_embeds, dim=0).to(device)
interpolated_negative_prompt_embeds = torch.cat(interpolated_negative_prompt_embeds, dim=0).to(device)
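The chaining logic can be checked on toy data. The sketch below uses three small random arrays in place of prompt embeddings and a hypothetical `interpolate_pair` helper that performs plain linear interpolation as a stand-in for `slerp`; only the segment bookkeeping is the point here.

```python
import numpy as np


def interpolate_pair(a, b, num):
    """Hypothetical stand-in for `slerp`: linear interpolation, endpoints included."""
    ts = np.linspace(0.0, 1.0, num)
    return np.stack([(1 - t) * a + t * b for t in ts])


num_interpolation_steps = 30

# Three toy "prompt embeddings" (illustrative shape)
rng = np.random.default_rng(0)
embeds = rng.standard_normal((3, 4, 8))

# One segment per consecutive pair, then concatenate along the frame axis
segments = [
    interpolate_pair(embeds[i], embeds[i + 1], num_interpolation_steps)
    for i in range(len(embeds) - 1)
]
frames = np.concatenate(segments, axis=0)

# The shared prompt appears twice: as the last frame of one segment
# and as the first frame of the next
joint_is_duplicated = np.allclose(
    frames[num_interpolation_steps - 1], frames[num_interpolation_steps]
)
```

With `N` prompts the result has `(N - 1) * num_interpolation_steps` frames. Because each segment includes both endpoints, the frame at every joint is repeated once, which shows up as a brief pause in the animation; dropping the first frame of every segment after the first would be one way to avoid it.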

In [27]:
images = []
for prompt_embeds, negative_prompt_embeds in tqdm(
    zip(interpolated_prompt_embeds, interpolated_negative_prompt_embeds),
    total=len(interpolated_prompt_embeds)
):
    images.append(
        pipe(
            height=height,
            width=width,
            num_images_per_prompt=1,
            prompt_embeds=prompt_embeds[None, ...],
            negative_prompt_embeds=negative_prompt_embeds[None, ...],
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            generator=generator,
            latents=latents
        ).images
    )


In [28]:
display_images(images, save_path)


## Example 4: Circular walk through the diffusion latent space for a single prompt

Suppose we have two noise tensors, `x` and `y`. We sweep an angle $\theta$ from $0$ to $2\pi$ and, at each step, combine the tensors as $\cos(\theta)\,x + \sin(\theta)\,y$. At $\theta = 2\pi$ the combination equals `x` again, so the walk ends on the same noise values it started with and the resulting animation closes into a loop.

In [29]:
prompt = "Beautiful sea sunset, warm light, Aivazovsky style"
negative_prompt = "picture frames"

latents = torch.randn(
    (2, 1, pipe.unet.config.in_channels, height // 8, width // 8),
    generator=generator
)

# Two independent noise tensors that span the circular walk
walk_noise_x = latents[0].to(device)
walk_noise_y = latents[1].to(device)

# Walk on a trigonometric circle
walk_scale_x = torch.cos(
    torch.linspace(0, 2, num_interpolation_steps) * np.pi
).to(device)
walk_scale_y = torch.sin(
    torch.linspace(0, 2, num_interpolation_steps) * np.pi
).to(device)

# Apply interpolation to noise
noise_x = torch.tensordot(walk_scale_x, walk_noise_x, dims=0)
noise_y = torch.tensordot(walk_scale_y, walk_noise_y, dims=0)

circular_latents = noise_x + noise_y
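The claim that the walk returns to its starting point can be verified on small arrays. This NumPy sketch mirrors the computation above with toy shapes; the shapes and variable names are assumptions for illustration.

```python
import numpy as np

num_interpolation_steps = 30

# Two independent toy noise arrays (illustrative shape)
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
y = rng.standard_normal((4, 8, 8))

# Sweep the angle from 0 to 2*pi, endpoints included, as in the notebook
angles = np.linspace(0.0, 2.0 * np.pi, num_interpolation_steps)
frames = np.stack([np.cos(a) * x + np.sin(a) * y for a in angles])

# cos(0) = cos(2*pi) = 1 and sin(0) = sin(2*pi) = 0 (up to rounding),
# so the first and last frames both equal x
loop_closes = np.allclose(frames[0], frames[-1])
```

Since $\cos^2\theta + \sin^2\theta = 1$, each combined array has the same per-element variance as `x` and `y` when the two are independent, so every frame remains a plausible noise sample. Because the first and last frames coincide, passing `endpoint=False` to `np.linspace` would be one way to avoid a repeated frame at the loop seam.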

In [30]:
images = []
for latent_vector in tqdm(circular_latents):
    images.append(
        pipe(
            prompt,
            negative_prompt=negative_prompt,
            height=height,
            width=width,
            num_images_per_prompt=1,
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            generator=generator,
            latents=latent_vector
        ).images
    )


In [31]:
display_images(images, save_path)
