# Text-to-Image Generation with Stable Diffusion and OpenVINO™

Stable Diffusion is a text-to-image latent diffusion model created by researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/) and [LAION](https://laion.ai/). It is trained on 512x512 images from a subset of the [LAION-5B](https://laion.ai/blog/laion-5b/) database. The model uses a frozen CLIP ViT-L/14 text encoder to condition generation on text prompts. It consists of an 860M-parameter UNet and a 123M-parameter text encoder.
See the [model card](https://huggingface.co/CompVis/stable-diffusion) for more information.

In general, diffusion models are machine learning systems trained to denoise random Gaussian noise step by step until they arrive at a sample of interest, such as an image.
Diffusion models have been shown to achieve state-of-the-art results for generating image data. One downside is that the reverse denoising process is slow. In addition, these models consume a lot of memory because they operate in pixel space, which becomes unreasonably expensive when generating high-resolution images. As a result, it is challenging both to train these models and to use them for inference. OpenVINO brings capabilities to run model inference on Intel hardware and opens the door to the fantastic world of diffusion models for everyone!

The model's capabilities are not limited to text-to-image generation. It can also solve additional tasks, such as text-guided image-to-image generation and inpainting. This tutorial also covers how to run text-guided image-to-image generation using Stable Diffusion.


This notebook demonstrates how to convert and run a Stable Diffusion model using OpenVINO.



#### Table of contents:

- [Prerequisites](#Prerequisites)
- [Prepare Inference Pipelines](#Prepare-Inference-Pipelines)
- [Text-to-image pipeline](#Text-to-image-pipeline)
    - [Load Stable Diffusion model and create text-to-image pipeline](#Load-Stable-Diffusion-model-and-create-text-to-image-pipeline)
    - [Text-to-Image generation](#Text-to-Image-generation)
    - [Interactive text-to-image demo](#Interactive-text-to-image-demo)
- [Image-to-Image pipeline](#Image-to-Image-pipeline)
    - [Create image-to-Image pipeline](#Create-image-to-Image-pipeline)
    - [Image-to-Image generation](#Image-to-Image-generation)
    - [Interactive image-to-image demo](#Interactive-image-to-image-demo)

### Installation Instructions

This is a self-contained example that relies solely on its own code.

We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to [Installation Guide](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/README.md#-installation-guide).

## Prerequisites
[back to top ⬆️](#Table-of-contents:)

In [None]:
import os
import sys
# Resolve the per-user bin directory from the home directory, so this also works when USER is unset
user_bin_path = os.path.expanduser("~/.local/bin")
sys.path.append(user_bin_path)
print(sys.path)

In [None]:
!{sys.executable} -m pip install -q "openvino>=2023.1.0" "git+https://github.com/huggingface/optimum-intel.git"
!{sys.executable} -m pip install -q --extra-index-url https://download.pytorch.org/whl/cpu "diffusers>=0.9.0" "torch>=2.1"
!{sys.executable} -m pip install -q "huggingface-hub>=0.9.1"
!{sys.executable} -m pip install -q transformers Pillow opencv-python tqdm ipywidgets

In [None]:
# Fetch `notebook_utils` module
import requests

r = requests.get(
    url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py",
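)
r.raise_for_status()  # fail early if the download did not succeed

# Write the helper module next to the notebook so that it can be imported below
with open("notebook_utils.py", "w", encoding="utf-8") as f:
    f.write(r.text)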

from notebook_utils import download_file, device_widget

## Prepare Inference Pipelines
[back to top ⬆️](#Table-of-contents:)

Let us now take a closer look at how the model works in inference by illustrating the logical flow.

![sd-pipeline](https://user-images.githubusercontent.com/29454499/260981188-c112dd0a-5752-4515-adca-8b09bea5d14a.png)

As you can see from the diagram, the only difference in approach between Text-to-Image and text-guided Image-to-Image generation is how the initial latent state is generated. In Image-to-Image generation, the initial latent state is an image encoded by the VAE encoder and mixed with noise produced from the latent seed, while in Text-to-Image generation the initial latent state is noise only.
The Stable Diffusion model takes two inputs: a latent image representation of size $64 \times 64$ and a text prompt, which is transformed into text embeddings of size $77 \times 768$ by CLIP's text encoder.
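As a rough illustration of the difference in the initial latent state described above, the sketch below builds both starting points with plain Python lists. The latent shape `(4, 64, 64)` assumes the Stable Diffusion v1 VAE (4 latent channels), and `strength`, `encoded_image`, and the linear blend are simplifying assumptions made for illustration. In the real pipeline, the scheduler adds noise according to its own noise schedule.

```python
import random

LATENT_CHANNELS, LATENT_H, LATENT_W = 4, 64, 64  # assumed SD v1 latent shape
NUM_VALUES = LATENT_CHANNELS * LATENT_H * LATENT_W


def sample_noise(seed):
    """Draw a flat list of standard Gaussian values from a seeded generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(NUM_VALUES)]


def text_to_image_latents(seed):
    """Text-to-image: the initial latent state is pure noise."""
    return sample_noise(seed)


def image_to_image_latents(encoded_image, seed, strength=0.5):
    """Image-to-image: mix the VAE-encoded image with noise.

    `strength` in [0, 1] is a hypothetical knob: 0 keeps the encoded image,
    1 replaces it with noise. A linear blend stands in for the scheduler's
    noise schedule here.
    """
    noise = sample_noise(seed)
    return [(1.0 - strength) * x + strength * n for x, n in zip(encoded_image, noise)]


# Stand-in for the output of the VAE encoder (all zeros for simplicity).
encoded_image = [0.0] * NUM_VALUES

txt2img_start = text_to_image_latents(seed=42)
img2img_start = image_to_image_latents(encoded_image, seed=42, strength=0.5)
print(len(txt2img_start), len(img2img_start))
```

Both starting points have the same size, so the rest of the pipeline (the denoising loop and the VAE decoder) can be shared between the two modes.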

Next, the U-Net iteratively *denoises* the random latent image representations while being conditioned on the text embeddings. The output of the U-Net, being the noise residual, is used to compute a denoised latent image representation via a scheduler algorithm. Many different scheduler algorithms can be used for this computation, each having its pros and cons. For Stable Diffusion, it is recommended to use one of:

- [PNDM scheduler](https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_pndm.py)
- [DDIM scheduler](https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_ddim.py)
- [K-LMS scheduler](https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_lms_discrete.py) (you will use it in your pipeline)

The theory behind scheduler algorithms is out of scope for this notebook. In short, remember that the predicted denoised image representation is computed from the previous noise representation and the predicted noise residual.
For more information, refer to the paper [Elucidating the Design Space of Diffusion-Based Generative Models](https://arxiv.org/abs/2206.00364).
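The relationship described above can be illustrated with the standard noise-prediction identity used by diffusion models: if a noisy sample is $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, then a predicted $\epsilon$ yields an estimate of $x_0$. The sketch below is a minimal version with plain floats. The name `alpha_bar_t` (the cumulative noise-schedule coefficient) and both function names are illustrative and are not taken from any specific scheduler implementation.

```python
import math


def add_noise(x0, eps, alpha_bar_t):
    """Forward process: blend a clean value with noise at noise level alpha_bar_t."""
    return math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps


def predict_denoised(x_t, predicted_eps, alpha_bar_t):
    """Estimate the clean value from the noisy value and the predicted noise residual."""
    return (x_t - math.sqrt(1.0 - alpha_bar_t) * predicted_eps) / math.sqrt(alpha_bar_t)


x0, eps, alpha_bar_t = 0.8, -1.3, 0.25
x_t = add_noise(x0, eps, alpha_bar_t)

# With a perfect noise prediction, the clean value is recovered (up to rounding).
x0_estimate = predict_denoised(x_t, eps, alpha_bar_t)
print(round(x0_estimate, 6))
```

Real schedulers differ in how they use this estimate to move to the next noise level, which is where their trade-offs come from.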

The *denoising* process is repeated a given number of times (50 by default) to retrieve progressively better latent image representations.
When complete, the latent image representation is decoded by the decoder part of the variational autoencoder (VAE).
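To make the loop structure concrete, here is a toy sketch of the iteration. A stand-in "U-Net" predicts the noise, a deterministic DDIM-style update (chosen for brevity; it is not the K-LMS scheduler listed above) moves the latent to the next, less noisy level, and a placeholder decoder runs at the end. The linear `alpha_bars` schedule, the oracle noise predictor `toy_unet`, and `decode` are all assumptions made so that the example runs without any model.

```python
import math
import random

NUM_STEPS = 50  # default number of denoising steps mentioned above

# Hypothetical schedule: alpha_bar rises from 0.01 (mostly noise) to 1.0 (clean).
alpha_bars = [min(1.0, 0.01 + 0.99 * i / NUM_STEPS) for i in range(NUM_STEPS + 1)]

target = [0.5, -0.2, 0.9]  # stand-in for the "true" clean latent


def toy_unet(latent, alpha_bar):
    """Oracle noise predictor: returns the noise that separates latent from target."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - a * t) / s for x, t in zip(latent, target)]


def ddim_step(latent, eps, alpha_bar, alpha_bar_next):
    """Deterministic DDIM-style update from one noise level to the next."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x0_pred = [(x - s * e) / a for x, e in zip(latent, eps)]
    a_next, s_next = math.sqrt(alpha_bar_next), math.sqrt(1.0 - alpha_bar_next)
    return [a_next * p + s_next * e for p, e in zip(x0_pred, eps)]


def decode(latent):
    """Placeholder for the VAE decoder: identity mapping."""
    return list(latent)


rng = random.Random(42)
latent = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise

for i in range(NUM_STEPS):
    eps = toy_unet(latent, alpha_bars[i])
    latent = ddim_step(latent, eps, alpha_bars[i], alpha_bars[i + 1])

image = decode(latent)
print([round(v, 4) for v in image])
```

Because the toy predictor is an oracle, the loop ends at `target`. A trained U-Net only approximates the noise, which is why more steps tend to give better results.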

## Text-to-image pipeline
[back to top ⬆️](#Table-of-contents:)


### Load Stable Diffusion model and create text-to-image pipeline
[back to top ⬆️](#Table-of-contents:)

We will load an optimized Stable Diffusion model from the Hugging Face Hub and create a pipeline that runs inference with OpenVINO Runtime through [Optimum Intel](https://huggingface.co/docs/optimum/intel/inference#stable-diffusion).

To run the Stable Diffusion model with Optimum Intel, we will use the `optimum.intel.OVStableDiffusionPipeline` class, which represents the inference pipeline. `OVStableDiffusionPipeline` is initialized by the `from_pretrained` method, which supports on-the-fly conversion of PyTorch models through the `export=True` parameter. A converted model can be saved to disk with the `save_pretrained` method so that later runs can skip the conversion.

When Stable Diffusion models are exported to the OpenVINO format, they are decomposed into three components (the text encoder, the U-Net, and the VAE) that together consist of four models, which are combined into the pipeline during inference:

* The text encoder
    * The text encoder is responsible for transforming the input prompt (for example, "a photo of an astronaut riding a horse") into an embedding space that can be understood by the U-Net. It is usually a simple transformer-based encoder that maps a sequence of input tokens to a sequence of latent text embeddings.
* The U-Net
    * The U-Net predicts the `sample` state for the next step.
* The VAE encoder
    * The encoder is used to convert the image into a low dimensional latent representation, which will serve as the input to the U-Net model.
* The VAE decoder
    * The decoder transforms the latent representation back into an image.
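To keep track of what each of the four models consumes and produces, the following sketch only does shape bookkeeping; no model is loaded. The $77 \times 768$ embedding shape and the $64 \times 64$ latent size come from the description above, while the 4 latent channels and the 8x VAE scale factor are assumptions based on the Stable Diffusion v1 architecture. All function names are illustrative.

```python
VAE_SCALE_FACTOR = 8   # assumed spatial downsampling of the SD v1 VAE
LATENT_CHANNELS = 4    # assumed number of latent channels in SD v1
MAX_TOKENS, EMBED_DIM = 77, 768


def text_encoder_shape():
    """Prompt tokens -> text embeddings."""
    return (MAX_TOKENS, EMBED_DIM)


def vae_encoder_shape(height, width):
    """RGB image (3, H, W) -> latent representation."""
    return (LATENT_CHANNELS, height // VAE_SCALE_FACTOR, width // VAE_SCALE_FACTOR)


def unet_shape(latent_shape, text_shape):
    """The U-Net output has the same shape as its latent input."""
    return latent_shape


def vae_decoder_shape(latent_shape):
    """Latent representation -> RGB image (3, H, W)."""
    _, h, w = latent_shape
    return (3, h * VAE_SCALE_FACTOR, w * VAE_SCALE_FACTOR)


latent = vae_encoder_shape(512, 512)
unet_output = unet_shape(latent, text_encoder_shape())
image = vae_decoder_shape(latent)
print(latent, unet_output, image)
```

Under these assumptions, a 512x512 image corresponds to a latent that is 8 times smaller along each spatial dimension, which is what makes denoising in latent space cheaper than in pixel space.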

Select a device from the dropdown list to run inference with OpenVINO.

In [None]:
device = device_widget()
device

In [None]:
from optimum.intel.openvino import OVStableDiffusionPipeline
from pathlib import Path

DEVICE = device.value

MODEL_ID = "prompthero/openjourney"
MODEL_DIR = Path("diffusion_pipeline")

if not MODEL_DIR.exists():
    ov_pipe = OVStableDiffusionPipeline.from_pretrained(MODEL_ID, export=True, device=DEVICE, compile=False)
    ov_pipe.save_pretrained(MODEL_DIR)
else:
    ov_pipe = OVStableDiffusionPipeline.from_pretrained(MODEL_DIR, device=DEVICE, compile=False)

ov_pipe.compile()

### Text-to-Image generation
[back to top ⬆️](#Table-of-contents:)

Now, you can define a text prompt for image generation and run the inference pipeline.

> **Note**: Consider increasing `steps` to get more precise results. A suggested value is `50`, but processing will take longer.

In [None]:
import ipywidgets as widgets

sample_text = (
    "cyberpunk cityscape like Tokyo New York  with tall buildings at dusk golden hour cinematic lighting, epic composition. "
    "A golden daylight, hyper-realistic environment. "
    "Hyper and intricate detail, photo-realistic. "
    "Cinematic and volumetric light. "
    "Epic concept art. "
    "Octane render and Unreal Engine, trending on artstation"
)
text_prompt = widgets.Text(value=sample_text, description="your text")
num_steps = widgets.IntSlider(min=1, max=50, value=20, description="steps:")
seed = widgets.IntSlider(min=0, max=10000000, description="seed: ", value=42)
widgets.VBox([text_prompt, num_steps, seed])

In [None]:
print("Pipeline settings")
print(f"Input text: {text_prompt.value}")
print(f"Seed: {seed.value}")
print(f"Number of steps: {num_steps.value}")

Let's generate an image and save the result.
The pipeline returns one or several results; `images` contains the final generated images. To get more than one image per prompt, set the `num_images_per_prompt` parameter.
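As a sketch of how several images per prompt could be requested and saved, the helper below passes `num_images_per_prompt` to the pipeline call and writes each returned image to disk. The helper name `generate_and_save`, the output file naming, and the `_StubPipeline` and `_StubImage` classes are hypothetical. The stubs only mimic the call used in this notebook so that the example runs without a model; in the notebook, you would pass `ov_pipe` instead.

```python
import tempfile
from pathlib import Path


def generate_and_save(pipe, prompt, num_images=2, steps=20, out_dir="."):
    """Run the pipeline once and save every returned image as result_<i>.png."""
    result = pipe(prompt, num_inference_steps=steps, num_images_per_prompt=num_images)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, image in enumerate(result["images"]):
        path = out_dir / f"result_{i}.png"
        image.save(path)
        paths.append(path)
    return paths


class _StubImage:
    """Hypothetical stand-in for a PIL image: `save` writes a placeholder file."""

    def save(self, path):
        Path(path).write_bytes(b"placeholder")


class _StubPipeline:
    """Hypothetical stand-in that mimics the pipeline call used in this notebook."""

    def __call__(self, prompt, num_inference_steps=50, num_images_per_prompt=1):
        return {"images": [_StubImage() for _ in range(num_images_per_prompt)]}


saved = generate_and_save(
    _StubPipeline(),
    "a photo of an astronaut riding a horse",
    num_images=3,
    out_dir=tempfile.mkdtemp(),
)
print([p.name for p in saved])
```

Generating several images in one call increases memory use and run time roughly in proportion to the number of images, so start with a small value.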

In [None]:
import numpy as np

np.random.seed(seed.value)

result = ov_pipe(text_prompt.value, num_inference_steps=num_steps.value)

final_image = result["images"][0]
final_image.save("result.png")

Now it is show time!

In [None]:
text = "\n\t".join(text_prompt.value.split("."))
print("Input text:")
print("\t" + text)
display(final_image)

As you can see, the image was rendered in high definition 🔥.