# Stable Diffusion v2 Demo with Torch Compile

## Prerequisites

Install the required packages.

In [None]:
%pip install -q "diffusers>=0.14.0" openvino-nightly "datasets>=2.14.6" "transformers>=4.25.1" "gradio>=4.19" "torch>=2.1" Pillow opencv-python --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q git+https://github.com/openvinotoolkit/nncf.git
%pip install -q accelerate

## Stable Diffusion v2 for Text-to-Image Generation

To start, let's look at the Text-to-Image process for Stable Diffusion v2. We will use the [Stable Diffusion v2-1](https://huggingface.co/stabilityai/stable-diffusion-2-1) model. The main difference between Stable Diffusion v2 and Stable Diffusion v2.1 is the use of more data, more training, and less restrictive filtering of the dataset, which gives promising results across a wide range of input text prompts. More details about the model can be found in the [Stability AI blog post](https://stability.ai/blog/stablediffusion2-1-release7-dec-2022) and the original model [repository](https://github.com/Stability-AI/stablediffusion).

### Stable Diffusion in Diffusers library
To work with Stable Diffusion v2, we will use the Hugging Face [Diffusers](https://github.com/huggingface/diffusers) library. For experimenting with Stable Diffusion models, Diffusers exposes the [`StableDiffusionPipeline`](https://huggingface.co/docs/diffusers/using-diffusers/conditional_image_generation), which works like the [other Diffusers pipelines](https://huggingface.co/docs/diffusers/api/pipelines/overview). The code below creates a `StableDiffusionPipeline` using `stable-diffusion-2-1` and wraps its text encoder, UNet, and VAE with `torch.compile` using the OpenVINO backend:

In [1]:
from diffusers import StableDiffusionPipeline
import torch 
import numpy as np
from transformers import CLIPTokenizer
from diffusers.schedulers import DDIMScheduler

seed = 42

np.random.seed(seed)
torch.manual_seed(seed)

# Load the Stable Diffusion v2-1 checkpoint described above and keep it on the CPU
pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1").to("cpu")

pipe.text_encoder = torch.compile(pipe.text_encoder.eval(), backend='openvino')
pipe.unet = torch.compile(pipe.unet.eval(), backend='openvino')
pipe.vae = torch.compile(pipe.vae.eval(), backend='openvino')

# pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)  # DDIMScheduler is used because UNet quantization produces better results with it
# pipe.tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing `safety_checker=None`. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .


## Provide numpy seed for random generation
To make the results comparable with the OpenVINO Stable Diffusion pipeline, the initial latents are generated with the NumPy random generator, seeded the same way as in the original OV Stable Diffusion pipeline, and then passed to this pipeline manually.
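The latent tensor has to match the image resolution the pipeline will produce. As a minimal sketch, assuming the usual Stable Diffusion layout of 4 latent channels and a VAE that downsamples by a factor of 8 (in a real pipeline these values would be read from `pipe.unet.config.in_channels` and `pipe.vae_scale_factor`), the helper below shows how a 768x768 image corresponds to the 96x96 latents used in this notebook. The function name `latent_shape` is hypothetical and only serves the illustration.

```python
def latent_shape(height, width, batch_size=1, latent_channels=4, vae_scale_factor=8):
    """Return the (N, C, H, W) shape of the initial latents for a given image size.

    The defaults (4 channels, scale factor 8) are assumptions that match the
    usual Stable Diffusion v1/v2 setup; read them from the pipeline when possible.
    """
    if height % vae_scale_factor or width % vae_scale_factor:
        raise ValueError("height and width must be divisible by vae_scale_factor")
    return (
        batch_size,
        latent_channels,
        height // vae_scale_factor,
        width // vae_scale_factor,
    )


shape = latent_shape(768, 768)
print(shape)  # (1, 4, 96, 96)
```

Passing latents of a different spatial size changes the output resolution accordingly, so the shape should be kept in sync with the pipeline being compared against.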

In [29]:
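# Generate the latents from a NumPy seed so the image is reproducible with the OV Stable Diffusion pipeline
np.random.seed(seed)
# Read in_channels from the UNet config; accessing it directly on the module is deprecated
latents = np.random.randn(1, pipe.unet.config.in_channels, 96, 96).astype(np.float32)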
latents = torch.from_numpy(latents).to("cpu")


### Model Compilation

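This step runs a single inference step through the pipeline to trigger the initial compilation of all its models. With `torch.compile`, compilation happens lazily on the first call, so this warm-up run absorbs the compilation cost before the actual inference.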
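To see why a warm-up run matters, it helps to time the first call separately from later ones. The sketch below is illustrative only: `LazyModel` is a hypothetical stand-in that simulates a one-time setup cost with `time.sleep`, and `timed` is a small helper introduced here, not part of any library. The same `timed` helper could be wrapped around the `pipe(...)` calls in this notebook.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


class LazyModel:
    """Hypothetical stand-in for a compiled module: the first call pays a setup cost."""

    def __init__(self, setup_seconds=0.05):
        self._setup_seconds = setup_seconds
        self._ready = False

    def __call__(self, x):
        if not self._ready:
            time.sleep(self._setup_seconds)  # simulates lazy compilation
            self._ready = True
        return x * 2


model = LazyModel()
_, warmup_s = timed(model, 3)    # first call includes the simulated compile
out, steady_s = timed(model, 3)  # later calls skip it
print(f"warm-up: {warmup_s:.3f}s, steady state: {steady_s:.3f}s")
```

Timing the warm-up separately keeps the one-time compilation cost out of any latency comparison with other pipelines.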

In [30]:
# Warm up the pipeline with a single step to trigger the initial compilation
prompt = "valley in the Alps at sunset, epic vista, beautiful landscape, 4k, 8k"
negative_prompt = "frames, borderline, text, charachter, duplicate, error, out of frame, watermark, low quality, ugly, deformed, blur"
num_steps = 1

image = pipe(prompt, negative_prompt=negative_prompt, num_inference_steps=num_steps, latents=latents).images[0]

## Running Inference
Generate an image with the same parameters as the original OV Stable Diffusion pipeline for comparison.
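One of those parameters is `guidance_scale`, which controls how strongly the prompt steers each denoising step. The sketch below shows the way classifier-free guidance is commonly computed, with plain Python lists standing in for noise tensors. The function name `apply_guidance` and the sample values are hypothetical. When a negative prompt is supplied, its embedding typically takes the place of the empty prompt in the unconditional branch.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Combine unconditional and text-conditioned noise predictions.

    Implements uncond + scale * (text - uncond) element-wise; lists are used
    here only as a stand-in for tensors.
    """
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


uncond = [0.0, 1.0, -1.0]  # hypothetical prediction for the negative/empty prompt
text = [1.0, 1.0, 0.0]     # hypothetical prediction for the text prompt

print(apply_guidance(uncond, text, 7.5))  # [7.5, 1.0, 6.5]
print(apply_guidance(uncond, text, 1.0))  # [1.0, 1.0, 0.0]
```

A scale of 1.0 reproduces the text-conditioned prediction unchanged, while larger values push the result further away from the unconditional one.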

In [31]:
num_steps = 25

with torch.no_grad():
    image = pipe(prompt, negative_prompt=negative_prompt, num_inference_steps=num_steps, latents=latents, guidance_scale=7.5).images[0]
image.show()

