# Stable Diffusion v2 Demo with Torch Compile

## Prerequisites

install required packages

In [None]:
%pip install -q "diffusers>=0.14.0" openvino-nightly "datasets>=2.14.6" "transformers>=4.25.1" "gradio>=4.19" "torch>=2.1" Pillow opencv-python --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q git+https://github.com/openvinotoolkit/nncf.git
%pip install -q accelerate

In [1]:
import diffusers
diffusers.__version__

'0.20.0'

## Stable Diffusion v2 for Text-to-Image Generation

To start, let's look on Text-to-Image process for Stable Diffusion v2. We will use [Stable Diffusion v2-1](https://huggingface.co/stabilityai/stable-diffusion-2-1) model for these purposes. The main difference from Stable Diffusion v2 and Stable Diffusion v2.1 is usage of more data, more training, and less restrictive filtering of the dataset, that gives promising results for selecting wide range of input text prompts. More details about model can be found in [Stability AI blog post](https://stability.ai/blog/stablediffusion2-1-release7-dec-2022) and original model [repository](https://github.com/Stability-AI/stablediffusion).

### Stable Diffusion in Diffusers library
To work with Stable Diffusion v2, we will use Hugging Face [Diffusers](https://github.com/huggingface/diffusers) library. To experiment with Stable Diffusion models, Diffusers exposes the [`StableDiffusionPipeline`](https://huggingface.co/docs/diffusers/using-diffusers/conditional_image_generation) similar to the [other Diffusers pipelines](https://huggingface.co/docs/diffusers/api/pipelines/overview).  The code below demonstrates how to create `StableDiffusionPipeline` using `stable-diffusion-2-1`:

In [1]:
from torch._export import capture_pre_autograd_graph
from nncf.torch.dynamic_graph.patch_pytorch import disable_patching
import numpy as np

INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino


In [2]:
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")
pipe.to("cpu")

# if using torch < 2.0
# pipe.enable_xformers_memory_efficient_attention()

prompt = "An astronaut riding a green horse"

images = pipe(prompt=prompt, num_inference_steps=2).images[0]

Loading pipeline components...:   0%|          | 0/7 [00:00<?, ?it/s]

  0%|          | 0/2 [00:00<?, ?it/s]

torch.Size([2, 1280]) torch.Size([2, 6]) torch.Size([2, 4, 128, 128]) torch.Size([]) torch.Size([2, 77, 2048])
torch.Size([2, 1280]) torch.Size([2, 6]) torch.Size([2, 4, 128, 128]) torch.Size([]) torch.Size([2, 77, 2048])


In [None]:
images = pipe(prompt=prompt, num_inference_steps=5).images[0]

In [3]:
class UNetWrapper(torch.nn.Module):
    def __init__(self, unet):
        super().__init__()
        self.unet = unet
        self.captured_args = []

    def forward(self, *args, **kwargs):
        self.captured_args.append(args)
        return self.unet(*args, **kwargs)

In [None]:
original_unet = pipe.unet
wrapped_unet = UNetWrapper(original_unet)
_ = pipe(prompt=prompt, num_inference_steps=1)
pipe.unet = wrapped_unet
_ = pipe(prompt=prompt, num_inference_steps=1)
print(wrapped_unet.captured_args)

In [None]:


pipe.unet(*unet_input, added_cond_kwargs=added_cond_kwargs)

In [62]:
text_encoder_input = torch.ones((1, 77), dtype=torch.long)

pipe.text_encoder(text_encoder_input)[0].shape

torch.Size([1, 77, 768])

In [7]:
text_encoder_input = torch.ones((1, 77), dtype=torch.long)
text_encoder_2_input = torch.ones((1, 77), dtype=torch.long)
vae_input = torch.ones((1, 3, 256, 256))

encoder_hidden_state = torch.ones((2, 77, 2048))
latents_shape = (2, 4, 128, 128)
latents = torch.randn(latents_shape)
t = torch.from_numpy(np.array(1, dtype=np.float32))
added_cond_kwargs = {}
added_cond_kwargs["text_embeds"] = torch.ones((2, 1280))
added_cond_kwargs["time_ids"] = torch.ones((2,6))
unet_kwargs = {}
unet_kwargs["added_cond_kwargs"] = added_cond_kwargs
unet_input = (latents, t, encoder_hidden_state)

with disable_patching():
    with torch.no_grad():
        # text_encoder = capture_pre_autograd_graph(pipe.text_encoder, args=(text_encoder_input,))
        # text_encoder_2 = capture_pre_autograd_graph(pipe.text_encoder_2, args=(text_encoder_2_input,))
        # vae_encoder = capture_pre_autograd_graph(pipe.vae.encoder, args=(vae_input,))
        unet = capture_pre_autograd_graph(pipe.unet, args=(*unet_input,), kwargs=(unet_kwargs))


TypeError: UNet2DConditionModel.forward() got an unexpected keyword argument 'text_embeds'

In [51]:
count = 0
for i in vae_encoder.graph.nodes:
    count += 1

print(count)

234


In [None]:
text_encoder_2 = capture_pre_autograd_graph(pipe.text_encoder_2.eval())
unet = capture_pre_autograd_graph(pipe.unet.eval())
vae_encoder = capture_pre_autograd_graph(pipe.vae.encoder.eval())
vae_decoder = capture_pre_autograd_graph(pipe.vae.decoder.eval())

In [27]:
pipe.config

FrozenDict([('vae', ('diffusers', 'AutoencoderKL')),
            ('text_encoder', ('transformers', 'CLIPTextModel')),
            ('text_encoder_2',
             ('transformers', 'CLIPTextModelWithProjection')),
            ('tokenizer', ('transformers', 'CLIPTokenizer')),
            ('tokenizer_2', ('transformers', 'CLIPTokenizer')),
            ('unet', ('diffusers', 'UNet2DConditionModel')),
            ('scheduler', ('diffusers', 'EulerDiscreteScheduler')),
            ('image_encoder', (None, None)),
            ('feature_extractor', (None, None)),
            ('force_zeros_for_empty_prompt', True),
            ('_name_or_path', 'stabilityai/stable-diffusion-xl-base-1.0'),
            ('_class_name', 'StableDiffusionXLPipeline'),
            ('_diffusers_version', '0.30.0')])

In [1]:
from diffusers import DiffusionPipeline
import torch 
import numpy as np
from transformers import CLIPTokenizer
from diffusers.schedulers import DDIMScheduler

seed = 42

np.random.seed(seed)
torch.manual_seed(seed)

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, use_safetensors=True, variant="fp16").to("cpu")


# pipe.text_encoder = torch.compile(pipe.text_encoder.eval(), backend='openvino')
# # pipe.unet = torch.compile(pipe.unet.eval(), backend='openvino')
# pipe.vae = torch.compile(pipe.vae.eval(), backend='openvino')

# pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)  # DDIMScheduler is used because UNet quantization produces better results with it
# pipe.tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

Fetching 19 files:   0%|          | 0/19 [00:00<?, ?it/s]

model.fp16.safetensors:   0%|          | 0.00/1.39G [00:00<?, ?B/s]

diffusion_pytorch_model.fp16.safetensors:   0%|          | 0.00/167M [00:00<?, ?B/s]

diffusion_pytorch_model.fp16.safetensors:   0%|          | 0.00/5.14G [00:00<?, ?B/s]

model.fp16.safetensors:   0%|          | 0.00/246M [00:00<?, ?B/s]

diffusion_pytorch_model.fp16.safetensors:   0%|          | 0.00/167M [00:00<?, ?B/s]

Loading pipeline components...:   0%|          | 0/5 [00:00<?, ?it/s]

You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing `safety_checker=None`. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .
Pipelines loaded with `dtype=torch.float16` cannot run with `cpu` device. It is not recommended to move them to `cpu` as running them will fail. Please make sure to use an accelerator to run the pipeline in inference, due to the lack of support for`float16` operations on this device in PyTorch. Please, remove the `torch_dtype=torch.float16` argum

In [4]:
pipe.config

FrozenDict([('vae', ('diffusers', 'AutoencoderKL')),
            ('text_encoder', ('transformers', 'CLIPTextModel')),
            ('tokenizer', ('transformers', 'CLIPTokenizer')),
            ('unet', ('diffusers', 'UNet2DConditionModel')),
            ('scheduler', ('diffusers', 'EulerDiscreteScheduler')),
            ('safety_checker', (None, None)),
            ('feature_extractor', (None, None)),
            ('image_encoder', (None, None)),
            ('requires_safety_checker', True),
            ('_name_or_path', 'stabilityai/stable-diffusion-xl-base-1.0')])

### Model Compilation

This step involves passing the data to initially compile all the models in the pipeline after inference

In [5]:
#Warmup the model for initial compile
prompt = "valley in the Alps at sunset, epic vista, beautiful landscape, 4k, 8k"
negative_prompt = "frames, borderline, text, charachter, duplicate, error, out of frame, watermark, low quality, ugly, deformed, blur"
num_steps = 1

image = pipe(prompt=prompt).images[0]

  0%|          | 0/50 [00:00<?, ?it/s]

TypeError: argument of type 'NoneType' is not iterable

## Running Inference
Generating an image with the same parameters as the original OV Stable diffusion model for comparison

In [31]:
num_steps = 25

with torch.no_grad():
    image = pipe(prompt, negative_prompt=negative_prompt, num_inference_steps=num_steps, latents=latents, guidance_scale=7.5).images[0]
image.show()


  0%|          | 0/25 [00:00<?, ?it/s]