# Generate creative QR codes with ControlNet QR Code Monster and OpenVINO™

[Stable Diffusion](https://github.com/CompVis/stable-diffusion) is a cutting-edge image generation technique that can be further enhanced by combining it with [ControlNet](https://arxiv.org/abs/2302.05543), a widely used control network approach. The combination allows Stable Diffusion to use a condition input to guide the image generation process, resulting in highly accurate and visually appealing images. The condition input can take many forms, such as scribbles, edge maps, pose key points, depth maps, segmentation maps, normal maps, or any other information that helps guide the content of the generated image, for example, QR codes! This method is particularly useful in complex image generation scenarios where precise control and fine-tuning are required to achieve the desired results.

In this tutorial, we will learn how to convert and run [Controlnet QR Code Monster For SD-1.5](https://huggingface.co/monster-labs/control_v1p_sd15_qrcode_monster) by [monster-labs](https://qrcodemonster.art/).

![](https://github.com/openvinotoolkit/openvino_notebooks/assets/76463150/1a5978c6-e7a0-4824-9318-a3d8f4912c47)

If you want to learn more about ControlNet, and in particular about conditioning by pose, please refer to this [tutorial](../235-controlnet-stable-diffusion/235-controlnet-stable-diffusion.ipynb).

#### Table of contents:
- [Prerequisites](#Prerequisites-Uparrow)
- [Instantiating Generation Pipeline](#Instantiating-Generation-Pipeline-Uparrow)
- [Convert models to OpenVINO Intermediate representation (IR) format](#Convert-models-to-OpenVINO-Intermediate-representation-(IR)-format-Uparrow)
    - [ControlNet conversion](#ControlNet-conversion-Uparrow)
    - [UNet conversion](#UNet-conversion-Uparrow)
    - [Text Encoder](#Text-Encoder-Uparrow)
    - [VAE Decoder conversion](#VAE-Decoder-conversion-Uparrow)
- [Select inference device](#Select-inference-device-for-Stable-Diffusion-pipeline-Uparrow)
- [Prepare Inference pipeline](#Prepare-Inference-pipeline-Uparrow)
- [Running Text-to-Image Generation with ControlNet Conditioning and OpenVINO](#Running-Text-to-Image-Generation-with-ControlNet-Conditioning-and-OpenVINO-Uparrow)

## Prerequisites [$\Uparrow$](#Table-of-content:)


In [2]:
%pip install -q accelerate diffusers transformers torch gradio --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q "openvino>=2023.1.0"

Note: you may need to restart the kernel to use updated packages.


## Instantiating Generation Pipeline [$\Uparrow$](#Table-of-content:)

### ControlNet in Diffusers library

To work with Stable Diffusion and ControlNet models, we will use the Hugging Face [Diffusers](https://github.com/huggingface/diffusers) library. To experiment with ControlNet, Diffusers exposes the [`StableDiffusionControlNetPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/controlnet), similar to the [other Diffusers pipelines](https://huggingface.co/docs/diffusers/api/pipelines/overview). Central to the `StableDiffusionControlNetPipeline` is the `controlnet` argument, which enables providing a specifically trained [`ControlNetModel`](https://huggingface.co/docs/diffusers/main/en/api/models#diffusers.ControlNetModel) instance while keeping the pre-trained diffusion model weights the same. The code below demonstrates how to create `StableDiffusionControlNetPipeline`, using the `controlnet-openpose` controlnet model and `stable-diffusion-v1-5`:

In [1]:
%load_ext autoreload
%autoreload 2

from diffusers import AudioLDM2Pipeline
from IPython.display import Audio
import torch

import gc
from functools import partial
from pathlib import Path
import openvino as ov

MODEL_ID = "cvssp/audioldm2"
pipe = AudioLDM2Pipeline.from_pretrained(MODEL_ID)

prompt = "applause underwater high quality"
negative_prompt = "Low quality"
audio = pipe(
    prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=3,
    audio_length_in_s=7.0
).audios[0]

sampling_rate = 16000
Audio(audio, rate=sampling_rate)


## Convert models to OpenVINO Intermediate representation (IR) format [$\Uparrow$](#Table-of-content:)

To obtain an OpenVINO `ov.Model` object, we pass the model object and example input data for model tracing to the `ov.convert_model` function. The resulting model can be saved to disk for later deployment with the `ov.save_model` function.

The pipeline consists of four important parts:

* ControlNet for conditioning by image annotation.
* Text Encoder for creating the condition used to generate an image from a text prompt.
* UNet for step-by-step denoising of the latent image representation.
* Autoencoder (VAE) for decoding latent space to image.
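
Each conversion cell in this section follows the same pattern: convert only when the IR file is missing, otherwise reuse the file already on disk. The sketch below illustrates just that pattern. `convert_if_missing` and `fake_convert_and_save` are hypothetical names introduced for illustration; the stand-in converter replaces the real `ov.convert_model` / `ov.save_model` calls so the example runs without any model weights.

```python
import tempfile
from pathlib import Path
from typing import Callable


def convert_if_missing(ir_path: Path, convert_and_save: Callable[[Path], None]) -> bool:
    """
    Hypothetical helper (not part of OpenVINO): run the conversion only when
    the IR file does not exist yet. Returns True if a conversion happened.
    """
    if ir_path.exists():
        return False
    convert_and_save(ir_path)
    return True


def fake_convert_and_save(path: Path) -> None:
    # Stand-in for `ov.save_model(ov.convert_model(...), path)`.
    path.write_text("<net/>")


with tempfile.TemporaryDirectory() as tmp_dir:
    ir_path = Path(tmp_dir) / "text_encoder.xml"
    first_run = convert_if_missing(ir_path, fake_convert_and_save)   # converts
    second_run = convert_if_missing(ir_path, fake_convert_and_save)  # reuses the file
```

Checking for the file first keeps repeated notebook runs fast, because tracing a large model is usually the slowest part of the conversion.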

In [2]:
import gc
from functools import partial
from pathlib import Path
from PIL import Image
import openvino as ov
import torch

def cleanup_torchscript_cache():
    """
    Helper for removing cached model representation
    """
    torch._C._jit_clear_class_registry()
    torch.jit._recursive.concrete_type_store = torch.jit._recursive.ConcreteTypeStore()
    torch.jit._state._clear_class_state()

### Text Encoder [$\Uparrow$](#Table-of-content:)
The text-encoder is responsible for transforming the input prompt, for example, "a photo of an astronaut riding a horse" into an embedding space that can be understood by the U-Net. It is usually a simple transformer-based encoder that maps a sequence of input tokens to a sequence of latent text embeddings.

The input of the text encoder is tensor `input_ids`, which contains indexes of tokens from text processed by the tokenizer and padded to the maximum length accepted by the model. Model outputs are two tensors: `last_hidden_state` - hidden state from the last MultiHeadAttention layer in the model and `pooler_out` - pooled output for whole model hidden states.
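
To make the padding step concrete, here is a minimal pure-Python sketch of what a tokenizer does when it pads `input_ids` to a fixed length and builds the matching attention mask. The token ids, the pad id, and the `pad_to_max_length` helper are hypothetical; a real tokenizer from `transformers` handles this internally.

```python
from typing import List, Tuple


def pad_to_max_length(
    token_ids: List[int], max_length: int, pad_token_id: int
) -> Tuple[List[int], List[int]]:
    """Truncate or right-pad token ids and build the matching attention mask."""
    token_ids = token_ids[:max_length]
    num_padding = max_length - len(token_ids)
    input_ids = token_ids + [pad_token_id] * num_padding
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [1] * len(token_ids) + [0] * num_padding
    return input_ids, attention_mask


# Hypothetical token ids and pad id, for illustration only.
input_ids, attention_mask = pad_to_max_length([0, 713, 16, 2], max_length=8, pad_token_id=1)
```

The attention mask is what lets a model distinguish real tokens from padding when `input_ids` has a fixed length.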

In [3]:
class ClapEncoderWrapper(torch.nn.Module):
    def __init__(self, encoder):
        super().__init__()
        encoder.eval()
        self.encoder = encoder

    def forward(self, input_ids):
        return self.encoder.get_text_features(input_ids)

clap_text_encoder_ir_path = Path('./clap_text_encoder.xml')

if not clap_text_encoder_ir_path.exists():
    with torch.no_grad():
        ov_model = ov.convert_model(
            ClapEncoderWrapper(pipe.text_encoder),  # model instance
            example_input=torch.ones((1, 512), dtype=torch.long),  # inputs for model tracing
        )
    ov.save_model(ov_model, clap_text_encoder_ir_path)
    # del ov_model
    # del pipe.text_encoder
    cleanup_torchscript_cache()
    print('Text Encoder successfully converted to IR')
else:
    # del pipe.text_encoder
    print(f"Text Encoder will be loaded from {clap_text_encoder_ir_path}")

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


Text Encoder successfully converted to IR


In [4]:
t5_text_encoder_ir_path = Path('./t5_text_encoder.xml')

if not t5_text_encoder_ir_path.exists():
    pipe.text_encoder_2.eval()
    with torch.no_grad():
        ov_model = ov.convert_model(
            pipe.text_encoder_2,  # model instance
            example_input=torch.ones((1, 7), dtype=torch.long),  # inputs for model tracing
        )
    ov.save_model(ov_model, t5_text_encoder_ir_path)
    # del ov_model
    # del pipe.text_encoder_2
    cleanup_torchscript_cache()
    print('Text Encoder successfully converted to IR')
else:
    # del pipe.text_encoder_2
    print(f"Text Encoder will be loaded from {t5_text_encoder_ir_path}")

Text Encoder successfully converted to IR


### Vocoder conversion

In [11]:
vocoder_ir_path = Path('./vocoder.xml')

if not vocoder_ir_path.exists():
    pipe.vocoder.eval()
    with torch.no_grad():
        ov_model = ov.convert_model(
            pipe.vocoder,  # model instance
            example_input=torch.ones((1, 700, 64), dtype=torch.float32),  # inputs for model tracing
        )
    ov.save_model(ov_model, vocoder_ir_path)
    # del ov_model
    # del pipe.vocoder
    cleanup_torchscript_cache()
    print('The Vocoder successfully converted to IR')
else:
    # del pipe.vocoder
    print(f"The Vocoder will be loaded from {vocoder_ir_path}")

The Vocoder will be loaded from vocoder.xml


### GPT-2 language model

In [2]:
pipe.language_model

GPT2Model(
  (wte): Embedding(50257, 768)
  (wpe): Embedding(1024, 768)
  (drop): Dropout(p=0.1, inplace=False)
  (h): ModuleList(
    (0-11): 12 x GPT2Block(
      (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): GPT2Attention(
        (c_attn): Conv1D()
        (c_proj): Conv1D()
        (attn_dropout): Dropout(p=0.1, inplace=False)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
      (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (mlp): GPT2MLP(
        (c_fc): Conv1D()
        (c_proj): Conv1D()
        (act): NewGELUActivation()
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)

In [8]:
core = ov.Core()
core.compile_model(language_model_ir_path)(language_model_inputs)

{<ConstOutput: names[1296] shape[?,?,768] type: f32>: array([[[ 0.8527496 , -0.12465352,  0.1506904 , ...,  0.46492258,
          0.31112245,  0.5430192 ],
        [ 0.49439627,  0.28936243,  0.26366374, ...,  0.5778445 ,
          0.08182814,  0.68793845],
        [ 0.31121126,  0.44631216,  0.13288724, ...,  0.5011845 ,
          0.28979525,  0.45660776],
        ...,
        [ 0.41636455,  0.23728015,  0.4695267 , ...,  0.6083536 ,
          0.37676963,  0.23251592],
        [ 0.4711076 ,  0.33413666,  0.23145561, ...,  0.7286286 ,
          0.14590473,  0.23739147],
        [ 0.5131769 ,  0.3060617 ,  0.15885174, ...,  0.32218865,
          0.2660841 ,  0.32491276]]], dtype=float32), <ConstOutput: names[key.1, 224] shape[?,12,?,64] type: f32>: array([[[[-0.642802  ,  0.51117724,  0.37933215, ..., -0.9475289 ,
          -0.15000518,  0.3280833 ],
         [ 0.10413963,  0.24237704,  0.0858774 , ...,  0.81702495,
          -0.8155389 ,  0.8845759 ],
         [-0.08738628, -0.55112225

In [7]:
pipe.language_model.config.torchscript = False
pipe.language_model.__call__ = partial(pipe.language_model.__call__, kwargs={
                "past_key_values": None,
                "use_cache": False,
                "return_dict": True})
pipe.language_model(**language_model_inputs)

BaseModelOutputWithPastAndCrossAttentions(last_hidden_state=tensor([[[ 0.8527, -0.1243,  0.1505,  ...,  0.4652,  0.3114,  0.5433],
         [ 0.4943,  0.2895,  0.2635,  ...,  0.5781,  0.0820,  0.6881],
         [ 0.3111,  0.4464,  0.1327,  ...,  0.5014,  0.2900,  0.4567],
         ...,
         [ 0.4163,  0.2372,  0.4695,  ...,  0.6086,  0.3768,  0.2326],
         [ 0.4710,  0.3342,  0.2314,  ...,  0.7289,  0.1460,  0.2374],
         [ 0.5131,  0.3059,  0.1588,  ...,  0.3223,  0.2661,  0.3250]]],
       grad_fn=<ViewBackward0>), past_key_values=((tensor([[[[-0.6427,  0.5108,  0.3794,  ..., -0.9476, -0.1499,  0.3280],
          [ 0.1039,  0.2426,  0.0859,  ...,  0.8169, -0.8157,  0.8848],
          [-0.0873, -0.5512, -0.6696,  ..., -1.0817,  0.2380,  0.3290],
          ...,
          [-0.1912, -0.7259, -0.1856,  ..., -0.2303,  0.8725, -0.4864],
          [-0.8761, -0.5943, -0.3634,  ..., -0.5033, -0.3858, -0.8335],
          [-0.9531, -0.3947,  0.4453,  ..., -1.1079, -0.0334,  0.7825]],

### GPT-2 conversion

In [4]:
from functools import partial

language_model_ir_path = Path('./language_model.xml')

language_model_inputs = {
    "inputs_embeds": torch.randn((1, 12, 768), dtype=torch.float32),
    "attention_mask": torch.ones((1, 12), dtype=torch.int64),
}

if not language_model_ir_path.exists():
    pipe.language_model.config.torchscript = True
    pipe.language_model.eval()
    pipe.language_model.__call__ = partial(pipe.language_model.__call__, kwargs={
                "past_key_values": None,
                "use_cache": False,
                "return_dict": False})
    with torch.no_grad():
        ov_model = ov.convert_model(
            pipe.language_model,  # model instance
            example_input=language_model_inputs,  # inputs for model tracing
        )
    ov.save_model(ov_model, language_model_ir_path)
    # del ov_model
    # del pipe.language_model
    cleanup_torchscript_cache()
    print('The Language Model successfully converted to IR')
else:
    # del pipe.language_model
    print(f"The Language Model will be loaded from {language_model_ir_path}")

The Language Model will be loaded from language_model.xml


### Projection model conversion

In [7]:
projection_model_ir_path = Path('./projection_model.xml')

projection_model_inputs = {
    "hidden_states": torch.randn((1, 1, 512), dtype=torch.float32),
    "hidden_states_1": torch.randn((1, 7, 1024), dtype=torch.float32),
    "attention_mask": torch.ones((1, 1), dtype=torch.int64),
    "attention_mask_1": torch.ones((1, 7), dtype=torch.int64),
}

if not projection_model_ir_path.exists():
    pipe.projection_model.eval()
    with torch.no_grad():
        ov_model = ov.convert_model(
            pipe.projection_model,  # model instance
            example_input=projection_model_inputs,  # inputs for model tracing
        )
    ov.save_model(ov_model, projection_model_ir_path)
    # del ov_model
    # del pipe.projection_model
    cleanup_torchscript_cache()
    print('The Projection Model successfully converted to IR')
else:
    # del pipe.projection_model
    print(f"The Projection Model will be loaded from {projection_model_ir_path}")

The Projection Model will be loaded from projection_model.xml


### UNet conversion [$\Uparrow$](#Table-of-content:)

The UNet conversion process is the same as for the original Stable Diffusion model, except that it must account for the new inputs generated by ControlNet.
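
The static shapes used as example inputs in the conversion cells can be related to each other with some simple arithmetic. The sketch below reproduces the numbers that appear in this notebook (7 seconds of audio, 700 spectrogram frames, a `(2, 8, 175, 16)` UNet sample). The downscaling factor of 4 is an assumption inferred from those shapes, not a value read from the model configuration.

```python
import math

audio_length_in_s = 7.0          # value used in the first pipeline call
vocoder_upsample_factor = 0.01   # seconds of audio per spectrogram frame, as in the pipeline code
num_mel_bins = 64                # last dimension of the vocoder example input
latent_channels = 8              # channel dimension of the UNet example input
latent_downscale = 4             # assumption: inferred from 700 -> 175 and 64 -> 16

# number of spectrogram frames for the requested audio length
spectrogram_frames = int(audio_length_in_s / vocoder_upsample_factor)

# the latent space is smaller than the spectrogram along both axes
latent_frames = math.ceil(spectrogram_frames / latent_downscale)
latent_bins = num_mel_bins // latent_downscale

# batch of 2: one unconditional and one text-conditioned sample for guidance
unet_sample_shape = (2, latent_channels, latent_frames, latent_bins)
vocoder_input_shape = (1, spectrogram_frames, num_mel_bins)
```

Because the UNet is saved with static shapes, a different audio length would require example inputs recomputed in the same way.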

In [2]:
unet_ir_path = Path('./unet.xml')

dtype_mapping = {
    torch.float32: ov.Type.f32,
    torch.float64: ov.Type.f64,
    torch.int32: ov.Type.i32,
    torch.int64: ov.Type.i64
}

def flattenize_inputs(inputs):
    flatten_inputs = []
    for input_data in inputs:
        if input_data is None:
            continue
        if isinstance(input_data, (list, tuple)):
            flatten_inputs.extend(flattenize_inputs(input_data))
        else:
            flatten_inputs.append(input_data)
    return flatten_inputs


pipe.unet.eval()
unet_inputs = {
    "sample": torch.randn((2, 8, 175, 16), dtype=torch.float32),
    "timestep": torch.tensor(1, dtype=torch.int64),
    "encoder_hidden_states": torch.randn((2, 8, 768), dtype=torch.float32),
    "encoder_hidden_states_1": torch.randn((2, 7, 1024), dtype=torch.float32),
    "encoder_attention_mask_1": torch.ones((2, 7), dtype=torch.int64),
}

if not unet_ir_path.exists():
    with torch.no_grad():
        ov_model = ov.convert_model(pipe.unet, example_input=unet_inputs)

    flatten_inputs = flattenize_inputs(unet_inputs.values())
    for input_data, input_tensor in zip(flatten_inputs, ov_model.inputs):
        input_tensor.get_node().set_partial_shape(ov.PartialShape(input_data.shape))
        input_tensor.get_node().set_element_type(dtype_mapping[input_data.dtype])
    ov_model.validate_nodes_and_infer_types()
        
    ov.save_model(ov_model, unet_ir_path)
    # del ov_model
    # del pipe.unet
    cleanup_torchscript_cache()
    gc.collect()
    print('Unet successfully converted to IR')
else:
    # del pipe.unet
    print(f"Unet will be loaded from {unet_ir_path}")


### VAE Decoder conversion [$\Uparrow$](#Table-of-content:)

The VAE model has two parts, an encoder, and a decoder. The encoder is used to convert the image into a low-dimensional latent representation, which will serve as the input to the U-Net model. The decoder, conversely, transforms the latent representation back into an image.

During latent diffusion training, the encoder is used to get the latent representations (latents) of the images for the forward diffusion process, which applies more and more noise at each step. During inference, the denoised latents generated by the reverse diffusion process are converted back into images using the VAE decoder, so we **only need the VAE decoder**. You can find instructions on how to convert the encoder part in a stable diffusion [notebook](../225-stable-diffusion-text-to-image/225-stable-diffusion-text-to-image.ipynb).

In [3]:
vae_ir_path = Path('./vae.xml')


class VAEDecoderWrapper(torch.nn.Module):
    def __init__(self, vae):
        super().__init__()
        vae.eval()
        self.vae = vae

    def forward(self, latents):
        return self.vae.decode(latents)

if not vae_ir_path.exists():
    vae_decoder = VAEDecoderWrapper(pipe.vae)
    latents = torch.zeros((1, 8, 175, 16))

    vae_decoder.eval()
    with torch.no_grad():
        ov_model = ov.convert_model(vae_decoder, example_input=latents)
        ov.save_model(ov_model, vae_ir_path)
    # del ov_model
    # del pipe.vae
    cleanup_torchscript_cache()
    print('VAE decoder successfully converted to IR')
else:
    # del pipe.vae
    print(f"VAE decoder will be loaded from {vae_ir_path}")


## Select inference device for Stable Diffusion pipeline [$\Uparrow$](#Table-of-content:)

Select a device from the dropdown list for running inference using OpenVINO.

In [9]:
import ipywidgets as widgets

core = ov.Core()

device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value="CPU",
    description="Device:",
    disabled=False,
)

device

Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')

## Prepare Inference pipeline [$\Uparrow$](#Table-of-content:)

The Stable Diffusion model takes both a latent seed and a text prompt as input. The latent seed is used to generate random latent image representations of size $96 \times 96$, whereas the text prompt is transformed into text embeddings of size $77 \times 768$ via CLIP's text encoder.
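
As a rough illustration of these sizes, the sketch below draws random latents for a $768 \times 768$ input window. The VAE downscaling factor of 8 is an assumption based on the standard Stable Diffusion v1.5 architecture, and the seed value is arbitrary.

```python
import numpy as np

image_size = 768          # model input window used by the preprocessing helper
vae_scale_factor = 8      # assumption: the Stable Diffusion v1.5 VAE downscales by 8
num_channels_latents = 4  # latent channels, as in the pipeline code

rng = np.random.default_rng(seed=42)  # plays the role of the latent seed
latent_size = image_size // vae_scale_factor
latents = rng.standard_normal(
    (1, num_channels_latents, latent_size, latent_size)
).astype(np.float32)

# (batch, tokens, embedding size) of the text embeddings described above
text_embeddings_shape = (1, 77, 768)
```

Reusing the same seed reproduces the same starting latents, which is useful when comparing prompts.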

Next, the U-Net iteratively *denoises* the random latent image representations while being conditioned on the text embeddings. Compared with the original Stable Diffusion pipeline, on each denoising step the latent image representation, encoder hidden states, and control condition annotation are first passed through ControlNet to obtain the outputs of its down and middle blocks; these outputs are then additionally provided to the UNet model to control the generation process. The output of the U-Net, being the noise residual, is used to compute a denoised latent image representation via a scheduler algorithm. Many different scheduler algorithms can be used for this computation, each having its pros and cons. For Stable Diffusion, it is recommended to use one of:

- [PNDM scheduler](https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_pndm.py)
- [DDIM scheduler](https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_ddim.py)
- [K-LMS scheduler](https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_lms_discrete.py)

The theory of how scheduler algorithms work is out of scope for this notebook, but in short, they compute the predicted denoised image representation from the previous noise representation and the predicted noise residual.
For more information, see [Elucidating the Design Space of Diffusion-Based Generative Models](https://arxiv.org/abs/2206.00364).
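
To give a feel for what a scheduler step computes, here is a minimal NumPy sketch of one deterministic DDIM update (with `eta = 0`). It is an illustration only: it is not the Diffusers implementation, and it is not the Euler ancestral scheduler used in this tutorial. The `ddim_step` function and the toy noise levels are assumptions made for the example.

```python
import numpy as np


def ddim_step(sample, noise_pred, alpha_t, alpha_prev):
    """
    One deterministic DDIM update (eta = 0).
    `alpha_t` and `alpha_prev` are the cumulative products of (1 - beta)
    at the current and the previous (less noisy) timestep.
    """
    # estimate the clean sample from the noisy one and the predicted noise
    pred_original = (sample - np.sqrt(1.0 - alpha_t) * noise_pred) / np.sqrt(alpha_t)
    # re-noise the estimate to the noise level of the previous timestep
    return np.sqrt(alpha_prev) * pred_original + np.sqrt(1.0 - alpha_prev) * noise_pred


rng = np.random.default_rng(0)
clean = rng.standard_normal((1, 4, 8, 8))
noise = rng.standard_normal((1, 4, 8, 8))
alpha_t, alpha_prev = 0.5, 1.0  # toy noise levels; alpha_prev = 1.0 means "no noise left"

noisy = np.sqrt(alpha_t) * clean + np.sqrt(1.0 - alpha_t) * noise
# with a perfect noise prediction, a single step recovers the clean sample
denoised = ddim_step(noisy, noise, alpha_t, alpha_prev)
```

In a real pipeline the noise prediction comes from the U-Net and is imperfect, which is why the update is repeated over many timesteps.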

In this tutorial, instead of using Stable Diffusion's default [`PNDMScheduler`](https://huggingface.co/docs/diffusers/main/en/api/schedulers/pndm), we use [`EulerAncestralDiscreteScheduler`](https://huggingface.co/docs/diffusers/api/schedulers/euler_ancestral), as recommended by the authors. More information regarding schedulers can be found [here](https://huggingface.co/docs/diffusers/main/en/using-diffusers/schedulers).

The *denoising* process is repeated a given number of times (by default 50) to step-by-step retrieve better latent image representations.
Once complete, the latent image representation is decoded by the decoder part of the variational auto-encoder.
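
Within each denoising step, the pipeline code combines an unconditional and a text-conditioned noise prediction (classifier-free guidance). A minimal NumPy sketch of that combination, assuming the two predictions are stacked along the batch axis; `apply_guidance` is a hypothetical helper name and the inputs are toy values:

```python
import numpy as np


def apply_guidance(noise_pred: np.ndarray, guidance_scale: float) -> np.ndarray:
    """Combine the unconditional and text-conditioned halves of a doubled batch."""
    noise_pred_uncond, noise_pred_text = np.split(noise_pred, 2)
    return noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)


# Toy predictions: the first half is unconditional, the second half is text-conditioned.
noise_pred = np.concatenate([np.zeros((1, 4, 2, 2)), np.ones((1, 4, 2, 2))])
guided = apply_guidance(noise_pred, guidance_scale=7.5)
```

Splitting with `np.split` rather than indexing `[0]` and `[1]` keeps the batch dimension, so the same code works for more than one prompt.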

Similarly to Diffusers `StableDiffusionControlNetPipeline`, we define our own `OVContrlNetStableDiffusionPipeline` inference pipeline based on OpenVINO.

In [10]:
from diffusers import DiffusionPipeline
from diffusers.schedulers import KarrasDiffusionSchedulers
from transformers import (
    T5Tokenizer,
    T5TokenizerFast,
    RobertaTokenizer,
    RobertaTokenizerFast,
)
from typing import Union, List, Optional, Tuple
import cv2
import numpy as np


def scale_fit_to_window(dst_width:int, dst_height:int, image_width:int, image_height:int):
    """
    Preprocessing helper function that calculates the image size for resizing while preserving the original aspect ratio
    and fitting the image to a specific window size
    
    Parameters:
      dst_width (int): destination window width
      dst_height (int): destination window height
      image_width (int): source image width
      image_height (int): source image height
    Returns:
      result_width (int): calculated width for resize
      result_height (int): calculated height for resize
    """
    im_scale = min(dst_height / image_height, dst_width / image_width)
    return int(im_scale * image_width), int(im_scale * image_height)


def preprocess(image: Image.Image):
    """
    Image preprocessing function. Takes an image in PIL.Image format, resizes it while keeping the aspect ratio to fit the 768x768 model input window,
    then converts it to np.ndarray and pads it with zeros on the right or bottom side (depending on the aspect ratio). After that, it
    converts the data to float32, rescales values from [0, 255] to [0, 1], and finally converts the data layout from NHWC to NCHW.
    The function returns the preprocessed input tensor and the padding size, which can be used in postprocessing.

    Parameters:
      image (Image.Image): input image
    Returns:
       image (np.ndarray): preprocessed image tensor
       pad (Tuple[int]): padding size for each dimension for restoring image size in postprocessing
    """
    src_width, src_height = image.size
    dst_width, dst_height = scale_fit_to_window(768, 768, src_width, src_height)
    image = image.convert("RGB")
    image = np.array(image.resize((dst_width, dst_height), resample=Image.Resampling.LANCZOS))[None, :]
    pad_width = 768 - dst_width
    pad_height = 768 - dst_height
    pad = ((0, 0), (0, pad_height), (0, pad_width), (0, 0))
    image = np.pad(image, pad, mode="constant")
    image = image.astype(np.float32) / 255.0
    image = image.transpose(0, 3, 1, 2)
    return image, pad


def randn_tensor(
    shape: Union[Tuple, List],
    dtype: Optional[np.dtype] = np.float32,
):
    """
    Helper function for generating a tensor of random values with a given shape and data type

    Parameters:
      shape (Union[Tuple, List]): shape of the tensor to fill with random values
      dtype (np.dtype, *optional*, np.float32): data type for result
    Returns:
      latents (np.ndarray): tensor with random values with given data type and shape (usually represents noise in latent space)
    """
    latents = np.random.randn(*shape).astype(dtype)

    return latents


class OVAudioLDM2Pipeline(DiffusionPipeline):
    """
    OpenVINO inference pipeline for AudioLDM2
    """
    def __init__(
        self,
        core: ov.Core,
        tokenizer: Union[RobertaTokenizer, RobertaTokenizerFast],
        tokenizer_2: Union[T5Tokenizer, T5TokenizerFast],
        scheduler: KarrasDiffusionSchedulers,
        text_encoder: ov.Model,
        text_encoder_2: ov.Model,
        vae_decoder: ov.Model,
        projection_model: ov.Model,
        language_model: ov.Model,
        unet: ov.Model,
        vocoder: ov.Model,
        device:str = "AUTO"
    ):
        super().__init__()
        self.tokenizer = tokenizer
        self.tokenizer_2 = tokenizer_2
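        # assumption: the traced shapes map 700 spectrogram frames to 175 latent frames, i.e. a factor of 4
        self.vae_scale_factor = 4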
        self.scheduler = scheduler
        self.load_models(
            core, device,
            text_encoder, text_encoder_2,
            unet, vae_decoder,
            projection_model, language_model, vocoder
        )
        self.set_progress_bar_config(disable=True)

    def load_models(
            self, core: ov.Core, device: str,
            text_encoder: ov.Model, text_encoder_2: ov.Model,
            unet: ov.Model, vae_decoder: ov.Model,
            projection_model: ov.Model, language_model: ov.Model,
            vocoder: ov.Model
        ):
        """
        Function for loading models on device using OpenVINO
        
        Parameters:
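          core (Core): OpenVINO runtime Core class instance
          device (str): inference device
          text_encoder (Model): OpenVINO Model object represents the first text encoder
          text_encoder_2 (Model): OpenVINO Model object represents the second text encoder
          unet (Model): OpenVINO Model object represents UNet
          vae_decoder (Model): OpenVINO Model object represents VAE decoder
          projection_model (Model): OpenVINO Model object represents the projection model
          language_model (Model): OpenVINO Model object represents the language model
          vocoder (Model): OpenVINO Model object represents the vocoder
        Returns
          None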
        """
        self.text_encoder = core.compile_model(text_encoder, device)
        self.text_encoder_out = self.text_encoder.output(0)
        self.text_encoder_2 = core.compile_model(text_encoder_2, device)
        self.text_encoder_2_out = self.text_encoder_2.output(0)
        self.unet = core.compile_model(unet, device)
        self.unet_out = self.unet.output(0)
        self.vae_decoder = core.compile_model(vae_decoder, device)
        self.vae_decoder_out = self.vae_decoder.output(0)
        self.projection_model = core.compile_model(projection_model, device)
        self.projection_model_out = self.projection_model.output(0)
        self.language_model = core.compile_model(language_model, device)
        self.language_model_out = self.language_model.output(0)
        self.vocoder = core.compile_model(vocoder, device)
        self.vocoder_out = self.vocoder.output(0)

    def __call__(
        self,
        prompt: Union[str, List[str]],
        audio_length_in_s: Optional[float] = None,
        num_inference_steps: int = 50,
        guidance_scale: float = 7.5,
        negative_prompt: Union[str, List[str]] = None,
        eta: float = 0.0,
        latents: Optional[np.array] = None,
        output_type: Optional[str] = "np",
    ):
        """
        Function invoked when calling the pipeline for generation.

        Parameters:
            prompt (`str` or `List[str]`):
                The prompt or prompts to guide the image generation.
            image (`Image.Image`):
                `Image`, or tensor representing an image batch which will be repainted according to `prompt`.
            num_inference_steps (`int`, *optional*, defaults to 50):
                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference.
            negative_prompt (`str` or `List[str]`):
                negative prompt or prompts for generation
            guidance_scale (`float`, *optional*, defaults to 7.5):
                Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
                `guidance_scale` is defined as `w` of equation 2. of [Imagen
                Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
                1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
                usually at the expense of lower image quality. This pipeline requires a value of at least `1`.
            latents (`np.ndarray`, *optional*):
                Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
                tensor will be generated by random sampling.
            output_type (`str`, *optional*, defaults to `"np"`):
                The output format of the generate image. Choose between
                [PIL](https://pillow.readthedocs.io/en/stable/): `Image.Image` or `np.array`.
        Returns:
            image ([List[Union[np.ndarray, Image.Image]]): generated images
            
        """
        # 0. Convert audio input length from seconds to spectrogram height
        vocoder_upsample_factor = 0.01
        if audio_length_in_s is None:
            audio_length_in_s = 5

        height = int(audio_length_in_s / vocoder_upsample_factor)

        original_waveform_length = int(audio_length_in_s * sampling_rate)
        # round the spectrogram height up to the next multiple of 4
        if height % 4 != 0:
            height = int(np.ceil(height / 4)) * 4
            print(
                f"Audio length in seconds {audio_length_in_s} is increased to {height * vocoder_upsample_factor} "
                f"so that it can be handled by the model. It will be cut to {audio_length_in_s} after the "
                f"denoising process."
            )

        # 1. Define call parameters
        batch_size = 1 if isinstance(prompt, str) else len(prompt)
        # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2)
        # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
        # corresponds to doing no classifier free guidance.
        do_classifier_free_guidance = guidance_scale > 1.0
        # 2. Encode input prompt
        text_embeddings = self._encode_prompt(prompt, negative_prompt=negative_prompt)

        # 3. Preprocess image
        orig_width, orig_height = image.size
        image, pad = preprocess(image)
        height, width = image.shape[-2:]
        if do_classifier_free_guidance:
            image = np.concatenate(([image] * 2))

        # 4. set timesteps
        self.scheduler.set_timesteps(num_inference_steps)
        timesteps = self.scheduler.timesteps

        # 6. Prepare latent variables
        num_channels_latents = 4
        latents = self.prepare_latents(
            batch_size,
            num_channels_latents,
            height,
            width,
            text_embeddings.dtype,
            latents,
        )

        # 7. Denoising loop
        num_warmup_steps = len(timesteps) - num_inference_steps * self.scheduler.order
        with self.progress_bar(total=num_inference_steps) as progress_bar:
            for i, t in enumerate(timesteps):
                # Expand the latents if we are doing classifier free guidance:
                # the batch is duplicated so that one half is evaluated with the
                # unconditional embeddings and the other half with the text embeddings.
                latent_model_input = np.concatenate(
                    [latents] * 2) if do_classifier_free_guidance else latents
                latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

                # run ControlNet and scale its down/mid block residuals by the conditioning scale
                result = self.controlnet([latent_model_input, t, text_embeddings, image])
                down_and_mid_block_samples = [sample * controlnet_conditioning_scale for _, sample in result.items()]

                # predict the noise residual
                noise_pred = self.unet([latent_model_input, t, text_embeddings, *down_and_mid_block_samples])[self.unet_out]

                # perform guidance
                if do_classifier_free_guidance:
                    noise_pred_uncond, noise_pred_text = noise_pred[0], noise_pred[1]
                    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

                # compute the previous noisy sample x_t -> x_t-1
                latents = self.scheduler.step(torch.from_numpy(noise_pred), t, torch.from_numpy(latents)).prev_sample.numpy()

                # update progress
                if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
                    progress_bar.update()

        # 8. Post-processing
        image = self.decode_latents(latents, pad)

        # 9. Convert to PIL
        if output_type == "pil":
            image = self.numpy_to_pil(image)
            image = [img.resize((orig_width, orig_height), Image.Resampling.LANCZOS) for img in image]
        else:
            image = [cv2.resize(img, (orig_width, orig_height))
                     for img in image]

        return image

    def _encode_prompt(self, prompt:Union[str, List[str]], num_images_per_prompt:int = 1, do_classifier_free_guidance:bool = True, negative_prompt:Union[str, List[str]] = None):
        """
        Encodes the prompt into text encoder hidden states.

        Parameters:
            prompt (str or list(str)): prompt to be encoded
            num_images_per_prompt (int): number of images that should be generated per prompt
            do_classifier_free_guidance (bool): whether to use classifier free guidance or not
            negative_prompt (str or list(str)): negative prompt to be encoded
        Returns:
            text_embeddings (np.ndarray): text encoder hidden states
        """
        batch_size = len(prompt) if isinstance(prompt, list) else 1

        # tokenize input prompts
        text_inputs = self.tokenizer(
            prompt,
            padding="max_length",
            max_length=self.tokenizer.model_max_length,
            truncation=True,
            return_tensors="np",
        )
        text_input_ids = text_inputs.input_ids

        text_embeddings = self.text_encoder(
            text_input_ids)[self.text_encoder_out]

        # duplicate text embeddings for each generation per prompt
        if num_images_per_prompt != 1:
            bs_embed, seq_len, _ = text_embeddings.shape
            text_embeddings = np.tile(
                text_embeddings, (1, num_images_per_prompt, 1))
            text_embeddings = np.reshape(
                text_embeddings, (bs_embed * num_images_per_prompt, seq_len, -1))

        # get unconditional embeddings for classifier free guidance
        if do_classifier_free_guidance:
            uncond_tokens: List[str]
            max_length = text_input_ids.shape[-1]
            if negative_prompt is None:
                uncond_tokens = [""] * batch_size
            elif isinstance(negative_prompt, str):
                uncond_tokens = [negative_prompt]
            else:
                uncond_tokens = negative_prompt
            uncond_input = self.tokenizer(
                uncond_tokens,
                padding="max_length",
                max_length=max_length,
                truncation=True,
                return_tensors="np",
            )

            uncond_embeddings = self.text_encoder(uncond_input.input_ids)[self.text_encoder_out]

            # duplicate unconditional embeddings for each generation per prompt
            seq_len = uncond_embeddings.shape[1]
            uncond_embeddings = np.tile(uncond_embeddings, (1, num_images_per_prompt, 1))
            uncond_embeddings = np.reshape(uncond_embeddings, (batch_size * num_images_per_prompt, seq_len, -1))

            # For classifier free guidance, we need to do two forward passes.
            # Here we concatenate the unconditional and text embeddings into a single batch
            # to avoid doing two forward passes
            text_embeddings = np.concatenate([uncond_embeddings, text_embeddings])

        return text_embeddings

    def prepare_latents(self, batch_size:int, num_channels_latents:int, height:int, width:int, dtype:np.dtype = np.float32, latents:np.ndarray = None):
        """
        Prepares the initial noise for image generation. If initial latents are not provided, they are generated randomly.
        The latents are then scaled by the standard deviation required by the scheduler.
        
        Parameters:
           batch_size (int): input batch size
           num_channels_latents (int): number of channels for noise generation
           height (int): image height
           width (int): image width
           dtype (np.dtype, *optional*, np.float32): dtype for latents generation
           latents (np.ndarray, *optional*, None): initial latent noise tensor, if not provided will be generated
        Returns:
           latents (np.ndarray): scaled initial noise for diffusion
        """
        shape = (batch_size, num_channels_latents, height // self.vae_scale_factor, width // self.vae_scale_factor)
        if latents is None:
            latents = randn_tensor(shape, dtype=dtype)

        # scale the initial noise by the standard deviation required by the scheduler
        latents = latents * np.array(self.scheduler.init_noise_sigma)
        return latents

    def decode_latents(self, latents:np.array, pad:Tuple[int]):
        """
        Decodes the predicted image from latent space using the VAE decoder and removes the padding.

        Parameters:
           latents (np.ndarray): image encoded in diffusion latent space
           pad (Tuple[int]): padding sizes for each side obtained at the preprocessing step
        Returns:
           image (np.ndarray): image decoded by the VAE decoder, in NHWC layout with values in [0, 1]
        """
        latents = 1 / 0.18215 * latents
        image = self.vae_decoder(latents)[self.vae_decoder_out]
        (_, end_h), (_, end_w) = pad[1:3]
        h, w = image.shape[2:]
        unpad_h = h - end_h
        unpad_w = w - end_w
        image = image[:, :, :unpad_h, :unpad_w]
        image = np.clip(image / 2 + 0.5, 0, 1)
        image = np.transpose(image, (0, 2, 3, 1))
        return image

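
The shape arithmetic in `prepare_latents` can be illustrated with a minimal, standalone sketch. It assumes a VAE downscaling factor of 8 (typical for Stable Diffusion 1.5) and uses a hypothetical helper, `make_initial_latents`, that takes the scheduler's starting sigma as a plain float; it is not the pipeline's own implementation.

```python
import numpy as np


def make_initial_latents(batch_size, height, width, init_noise_sigma=1.0,
                         num_channels=4, vae_scale_factor=8, seed=None):
    """Hypothetical helper: Gaussian noise in latent space, scaled for the scheduler."""
    # The latent grid is vae_scale_factor times smaller than the image
    # in each spatial dimension (assumed to be 8 here).
    shape = (batch_size, num_channels,
             height // vae_scale_factor, width // vae_scale_factor)
    rng = np.random.default_rng(seed)
    latents = rng.standard_normal(shape).astype(np.float32)
    # Schedulers expect the initial noise scaled by their starting sigma.
    return latents * np.float32(init_noise_sigma)


latents = make_initial_latents(1, 768, 768, seed=42)
print(latents.shape)  # (1, 4, 96, 96)
```

A 768x768 conditioning image therefore corresponds to a 96x96 latent grid with 4 channels, which is what the denoising loop iterates on.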


## Running Text-to-Image Generation with ControlNet Conditioning and OpenVINO [$\Uparrow$](#Table-of-content:)

Now we are ready to start generation. To improve the generation process, the pipeline also accepts a `negative prompt`. Technically, the positive prompt steers the diffusion toward the images associated with it, while the negative prompt steers the diffusion away from it. More explanation of how it works can be found in this [article](https://stable-diffusion-art.com/how-negative-prompt-work/). We can leave this field empty if we want to generate an image without negative prompting.
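
To make the steering concrete, here is a minimal sketch of the guidance step that the pipeline applies to the two noise predictions. In the pipeline, the negative prompt's embeddings take the place of the empty-prompt embeddings in the unconditional half of the batch. The function name `apply_guidance` and the toy arrays are illustrative assumptions, not part of the notebook.

```python
import numpy as np


def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Combine two noise predictions as in classifier-free guidance."""
    # Start at the unconditional (or negative-prompt) prediction and move
    # toward the text-conditioned one, scaled by guidance_scale.
    return noise_uncond + guidance_scale * (noise_text - noise_uncond)


# Toy 1-D "noise predictions" standing in for UNet outputs (illustrative values only).
noise_negative = np.array([0.0, 1.0])
noise_positive = np.array([1.0, 1.0])

print(apply_guidance(noise_negative, noise_positive, 1.0))  # [1. 1.]
print(apply_guidance(noise_negative, noise_positive, 7.0))  # [7. 1.]
```

With a scale of 1 the result equals the text-conditioned prediction, so the negative prompt has no effect. Larger scales push the result further away from the negative-prompt prediction wherever the two predictions differ.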

In [11]:
from diffusers import EulerAncestralDiscreteScheduler

tokenizer = CLIPTokenizer.from_pretrained('openai/clip-vit-large-patch14')
scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

ov_pipe = OVContrlNetStableDiffusionPipeline(tokenizer, scheduler, core, controlnet_ir_path, text_encoder_ir_path, unet_ir_path, vae_ir_path, device=device.value)


In [None]:
import qrcode
import gradio as gr  # create_code raises gr.Error for oversized QR codes

def create_code(content: str):
    """Creates QR codes with provided content."""
    qr = qrcode.QRCode(
        version=1,
        error_correction=qrcode.constants.ERROR_CORRECT_H,
        box_size=16,
        border=0,
    )
    qr.add_data(content)
    qr.make(fit=True)
    img = qr.make_image(fill_color="black", back_color="white")

    # find smallest image size multiple of 256 that can fit qr
    offset_min = 8 * 16
    w, h = img.size
    w = (w + 255 + offset_min) // 256 * 256
    h = (h + 255 + offset_min) // 256 * 256
    if w > 1024:
        raise gr.Error("QR code is too large, please use a shorter content")
    bg = Image.new('L', (w, h), 128)

    # align on 16px grid
    coords = ((w - img.size[0]) // 2 // 16 * 16,
              (h - img.size[1]) // 2 // 16 * 16)
    bg.paste(img, coords)
    return bg
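
The canvas arithmetic in `create_code` can be checked in isolation. The sketch below reproduces only the size and offset calculations; the helper names `canvas_side` and `paste_offset` are hypothetical, and the 25-module example is a hand-picked input rather than the output of the `qrcode` library.

```python
def canvas_side(qr_side, min_margin=8 * 16, multiple=256):
    """Smallest multiple of `multiple` that fits the code plus `min_margin` pixels in total."""
    return (qr_side + min_margin + multiple - 1) // multiple * multiple


def paste_offset(canvas, qr_side, grid=16):
    """Top-left offset that centers the code, rounded down to the 16px module grid."""
    return (canvas - qr_side) // 2 // grid * grid


qr_side = 25 * 16  # a 25-module code (QR version 2) drawn with box_size=16
side = canvas_side(qr_side)
print(side, paste_offset(side, qr_side))  # 768 176
```

Rounding the offset down to a multiple of 16 keeps every QR module aligned with the 16-pixel grid, at the cost of the code sitting slightly off-center on the gray background.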

In [None]:
import gradio as gr

def _generate(
    qr_code_content: str,
    prompt: str,
    negative_prompt: str,
    seed: Optional[int] = 42,
    guidance_scale: float = 10.0,
    controlnet_conditioning_scale: float = 2.0,
    num_inference_steps: int = 5,
):
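    # The UI allows -1 as a seed; NumPy rejects negative seeds, so a negative value leaves the generator unseeded.
    if seed is not None and int(seed) >= 0:
        np.random.seed(int(seed) % 2**32)  # np.random.seed accepts values in [0, 2**32 - 1]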
    qrcode_image = create_code(qr_code_content)
    return ov_pipe(
        prompt, qrcode_image, negative_prompt=negative_prompt,
        num_inference_steps=int(num_inference_steps),
        guidance_scale=guidance_scale,
        controlnet_conditioning_scale=controlnet_conditioning_scale
    )[0]

demo = gr.Interface(
    _generate,
    inputs=[
        gr.Textbox(label="QR Code content"),
        gr.Textbox(label="Text Prompt"),
        gr.Textbox(label="Negative Text Prompt"),
        gr.Number(
            minimum=-1,
            maximum=9999999999,
            step=1,
            value=42,
            label="Seed",
            info="Seed for the random number generator"
        ),
        gr.Slider(
            minimum=0.0,
            maximum=25.0,
            step=0.25,
            value=7,
            label="Guidance Scale",
            info="Controls how strongly the text prompt guides the image generation"
        ),
        gr.Slider(
            minimum=0.5,
            maximum=2.5,
            step=0.01,
            value=1.5,
            label="Controlnet Conditioning Scale",
            info="""Controls the readability/creativity of the QR code.
            High values: The generated QR code will be more readable.
            Low values: The generated QR code will be more creative.
            """
        ),
        gr.Slider(label="Steps", step=1, value=5, minimum=1, maximum=50)
    ],
    outputs=[
        "image"
    ],
    examples=[
        [
            "Hi OpenVINO",
            "snowy mountains 8k",
            "blurry unreal occluded",
            42, 7, 1.7, 5
        ],
    ],
)
try:
    demo.queue().launch(debug=True)
except Exception:
    demo.queue().launch(share=True, debug=True)

# If you are launching remotely, specify server_name and server_port
# EXAMPLE: `demo.launch(server_name='your server name', server_port='server port in int')`
# To learn more please refer to the Gradio docs: https://gradio.app/docs/