# Stable Diffusion optimization using Intel OpenVINO and Optimum Intel

Latent diffusion models are game changers for text-to-image generation. Stable Diffusion is one of the best-known examples and has been widely adopted in the community and industry. The idea behind the Stable Diffusion model is simple and compelling: you generate an image from a noise vector in many small steps, each one refining the noise into a latent image representation.

However, this iterative approach inevitably increases the overall inference time and degrades the user experience when the model is deployed on a client machine.
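To make the cost of the iterative scheme concrete, here is a toy sketch in plain Python. It is not the Stable Diffusion scheduler or UNet: `toy_denoiser`, `denoise`, and the update rule are hypothetical stand-ins, chosen only to show that the denoiser runs once per step, so the work grows linearly with the number of steps.

```python
import random

def toy_denoiser(latent, target):
    """Hypothetical stand-in for the UNet: the 'noise' is the gap to a known target."""
    return [x - t for x, t in zip(latent, target)]

def denoise(target, num_inference_steps, seed=0):
    """Refine a random latent toward `target`, calling the denoiser once per step."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise
    calls = 0
    for step in range(num_inference_steps):
        noise = toy_denoiser(latent, target)
        calls += 1
        # Remove a growing fraction of the estimated noise; the last step removes all of it.
        fraction = 1.0 / (num_inference_steps - step)
        latent = [x - fraction * n for x, n in zip(latent, noise)]
    return latent, calls

target = [0.5, -1.0, 2.0]
for steps in (10, 20):
    latent, calls = denoise(target, steps)
    error = max(abs(x - t) for x, t in zip(latent, target))
    print(f"steps={steps} denoiser_calls={calls} max_error={error:.2e}")
```

In the real model each of those calls is a full forward pass of a large network, which is why the step count dominates latency.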

In this notebook, we outline the problems of optimizing Stable Diffusion models and propose a workflow that substantially reduces their latency on resource-constrained hardware such as CPUs.

In [1]:
from diffusers.training_utils import set_seed
from IPython.display import display

In [2]:
import time

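def elapsed_time(pipeline, prompt, nb_pass=4, num_inference_steps=20):
    """Return the average latency, in seconds, of `nb_pass` pipeline calls."""
    # Warmup run, excluded from the measurement
    _ = pipeline(prompt, num_inference_steps=10)
    # perf_counter is a monotonic clock, better suited to benchmarking than time.time
    start = time.perf_counter()
    for _ in range(nb_pass):
        _ = pipeline(prompt, num_inference_steps=num_inference_steps, output_type="np")
    end = time.perf_counter()
    return (end - start) / nb_pass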
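A single average hides run-to-run variation. The following is a minimal sketch of a variant that also reports the spread, assuming any callable with the same call signature as the pipeline. `latency_stats` and `fake_pipeline` are hypothetical names; the stub merely sleeps so that the example runs without downloading a model.

```python
import statistics
import time

def latency_stats(pipeline, prompt, nb_pass=4, num_inference_steps=20):
    """Time each pass separately and return (mean, standard deviation) in seconds."""
    pipeline(prompt, num_inference_steps=10)  # warmup, not measured
    timings = []
    for _ in range(nb_pass):
        start = time.perf_counter()
        pipeline(prompt, num_inference_steps=num_inference_steps, output_type="np")
        timings.append(time.perf_counter() - start)
    # statistics.stdev needs at least two samples, so nb_pass must be >= 2
    return statistics.mean(timings), statistics.stdev(timings)

def fake_pipeline(prompt, num_inference_steps=20, output_type="pil"):
    """Hypothetical stand-in for a diffusion pipeline: sleeps 1 ms per step."""
    time.sleep(0.001 * num_inference_steps)

mean, spread = latency_stats(fake_pipeline, "sailing ship in storm by Rembrandt")
print(f"{mean:.3f} s +/- {spread:.3f} s")
```

A large spread relative to the mean suggests the machine was busy during the benchmark and the run is worth repeating.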

The Diffusers library makes it extremely simple to generate images with Stable Diffusion models. If you're not familiar with these models, here's a great illustrated [introduction](https://jalammar.github.io/illustrated-stable-diffusion/).

Let's build a StableDiffusionPipeline with the default float32 data type, and measure its inference latency.

In [3]:
from diffusers import StableDiffusionPipeline

model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id)
prompt = "sailing ship in storm by Rembrandt"
latency = elapsed_time(pipe, prompt)
print(latency)

`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["id2label"]` will be overriden.


  0%|          | 0/10 [00:00<?, ?it/s]

  0%|          | 0/20 [00:00<?, ?it/s]

  0%|          | 0/20 [00:00<?, ?it/s]

  0%|          | 0/20 [00:00<?, ?it/s]

  0%|          | 0/20 [00:00<?, ?it/s]

35.45383483171463


## Optimum Intel and OpenVINO

Optimum Intel accelerates end-to-end pipelines on Intel architectures. Its API is extremely similar to the vanilla Diffusers API, making it trivial to adapt existing code.

Optimum Intel supports OpenVINO, an Intel open-source toolkit for high-performance inference.

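Optimum Intel and OpenVINO can be installed with `pip install optimum[openvino]`. Once they are installed, we load the model with `OVStableDiffusionPipeline`; passing `export=True` converts it to the OpenVINO format on the fly: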

In [4]:
from optimum.intel.openvino import OVStableDiffusionPipeline

ov_pipe = OVStableDiffusionPipeline.from_pretrained(model_id, export=True)
latency = elapsed_time(ov_pipe, prompt)

Framework not specified. Using pt to export to ONNX.
Using framework PyTorch: 2.1.0.dev20230609+cpu
Saving external data to one file...

Compiling the text_encoder...
Compiling the vae_decoder...
Compiling the unet...


  0%|          | 0/11 [00:00<?, ?it/s]

  0%|          | 0/21 [00:00<?, ?it/s]

  0%|          | 0/21 [00:00<?, ?it/s]

  0%|          | 0/21 [00:00<?, ?it/s]

  0%|          | 0/21 [00:00<?, ?it/s]

In [5]:
print(latency)

14.661957800388336


The pipeline above supports dynamic input shapes, with no restriction on the number of images or their resolution. With Stable Diffusion, however, an application is usually restricted to one (or a few) output resolutions, such as 512x512 or 256x256. In that case, reshaping the pipeline to a fixed resolution lets OpenVINO compile the model for static shapes, which can unlock significant acceleration. If you need more than one output resolution, you can simply maintain a few pipeline instances, one for each resolution.
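Keeping one pipeline per resolution might look like the sketch below. `PipelineCache` and `build_stub` are hypothetical names introduced for illustration, and the stub factory returns a string so that the example runs without a model. In real use, the factory would presumably load an `OVStableDiffusionPipeline` and call its `reshape` method with the requested height and width, as this notebook does for 512x512.

```python
class PipelineCache:
    """Keep one statically shaped pipeline per (height, width) and reuse it."""

    def __init__(self, build_pipeline):
        # build_pipeline(height, width) is supplied by the caller (an assumption of this sketch)
        self._build_pipeline = build_pipeline
        self._pipelines = {}

    def get(self, height, width):
        key = (height, width)
        if key not in self._pipelines:
            # Build (and, in real use, reshape and compile) only on first request
            self._pipelines[key] = self._build_pipeline(height, width)
        return self._pipelines[key]

# Stub factory that records which resolutions were actually built
built = []

def build_stub(height, width):
    built.append((height, width))
    return f"pipeline-{height}x{width}"

cache = PipelineCache(build_stub)
cache.get(512, 512)
cache.get(256, 256)
cache.get(512, 512)  # served from the cache, nothing is rebuilt
print(built)  # [(512, 512), (256, 256)]
```

Each instance is likely to hold its own copy of the compiled model, so memory use grows with the number of resolutions kept alive.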

In [6]:
ov_pipe.reshape(batch_size=1, height=512, width=512, num_images_per_prompt=1)
latency = elapsed_time(ov_pipe, prompt)

Compiling the text_encoder...


  0%|          | 0/11 [00:00<?, ?it/s]

Compiling the unet...
Compiling the vae_decoder...


  0%|          | 0/21 [00:00<?, ?it/s]

  0%|          | 0/21 [00:00<?, ?it/s]

  0%|          | 0/21 [00:00<?, ?it/s]

  0%|          | 0/21 [00:00<?, ?it/s]

In [7]:
print(latency)

10.09998345375061
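The three measurements are easier to compare as speedups over the PyTorch baseline. The short computation below uses only the latencies printed in this notebook; they apply to the machine the notebook ran on, which is not described in this section, so the ratios should be read as indicative.

```python
# Latencies in seconds, copied from the cell outputs above
latencies = {
    "PyTorch float32": 35.45383483171463,
    "OpenVINO, dynamic shapes": 14.661957800388336,
    "OpenVINO, static 512x512": 10.09998345375061,
}

baseline = latencies["PyTorch float32"]
speedups = {name: baseline / value for name, value in latencies.items()}

for name, speedup in speedups.items():
    print(f"{name}: {latencies[name]:.2f} s, {speedup:.2f}x")
# PyTorch float32: 35.45 s, 1.00x
# OpenVINO, dynamic shapes: 14.66 s, 2.42x
# OpenVINO, static 512x512: 10.10 s, 3.51x
```

Exporting to OpenVINO accounts for most of the gain (about 2.4x), and fixing the shapes brings the total to about 3.5x.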
