# Stable Diffusion Demo on 4th Generation Intel® Xeon® Scalable Processors with Intel® Extension for PyTorch*

### 4th generation Intel® Xeon® Scalable processors support the Intel® Advanced Matrix Extensions (Intel® AMX) instruction set, which speeds up matrix multiplication for the BFloat16 and int8 data types
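
Before running the rest of the notebook, it can help to confirm that the machine actually exposes AMX. The sketch below is not part of the original demo: it assumes a Linux host, where the kernel lists CPU feature flags in `/proc/cpuinfo`, and the helper name `amx_support` is hypothetical.

```python
from pathlib import Path

# Feature flags the Linux kernel reports for AMX-capable CPUs.
AMX_FLAGS = ("amx_tile", "amx_bf16", "amx_int8")


def amx_support(cpuinfo_text: str) -> dict:
    """Report which AMX flags appear in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return {name: name in flags for name in AMX_FLAGS}


# Only meaningful on Linux; on other systems the file does not exist.
cpuinfo = Path("/proc/cpuinfo")
if cpuinfo.exists():
    print(amx_support(cpuinfo.read_text()))
```

If any of the three flags is reported as missing, the BFloat16 cells later in the notebook should still run, but likely without the AMX speed-up.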

In [None]:
# Imports
import torch
import intel_extension_for_pytorch as ipex
from PIL import Image
from diffusers import StableDiffusionPipeline

import copy

# This module is provided with the source code
from quantization_modules import load_int8_model, convert_to_fp32model

### Running Stable Diffusion with PyTorch* Float32 in eager mode will probably hang this process.

#### This notebook uses a patched diffusers branch, built by following the instructions at https://github.com/intel/intel-extension-for-pytorch/tree/dev_demo/examples/cpu/inference/python/stable_diffusion. Run the cell below with a different kernel, then switch back to the kernel used for this notebook.

In [None]:
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

output = pipe("a photo of an astronaut riding a horse on mars", generator=torch.manual_seed(13)).images[0]

#### If the cell above takes too long to run, restart the kernel and skip it.


## First, we'll eyeball FP32 performance, using Intel® Extension for PyTorch* to prepack weights, apply the channels-last memory format automatically, and so on

https://intel.github.io/intel-extension-for-pytorch/latest/tutorials/api_doc.html#ipex.optimize
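
As a small illustration of what "channels-last" means, the sketch below uses plain PyTorch (no Intel extension) on a tensor with the same shape as the latent sample input used in this notebook. It only shows the layout change; it does not claim to reproduce what `ipex.optimize` does internally.

```python
import torch

# Same shape as the latent sample input used in this notebook: (N, C, H, W).
x = torch.randn(2, 4, 64, 64)
x_cl = x.to(memory_format=torch.channels_last)

# The logical shape and values are unchanged; only the strides differ,
# so that the channel dimension becomes the fastest-moving one in memory.
print("contiguous strides:   ", x.stride())
print("channels-last strides:", x_cl.stride())
print("is channels-last:     ", x_cl.is_contiguous(memory_format=torch.channels_last))
```

Because the values are identical, a channels-last tensor can be passed anywhere the original tensor was accepted.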

In [None]:
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# Sample input with the shapes the UNet sees at inference time:
# latents (channels-last), timestep, and text-encoder hidden states
sample_input = (
    torch.randn(2, 4, 64, 64).to(memory_format=torch.channels_last),
    torch.tensor(921),
    torch.randn(2, 77, 768),
)

# These are the 3 main components of Stable Diffusion
pipe.text_encoder = ipex.optimize(pipe.text_encoder.eval(), inplace=True)
pipe.unet = ipex.optimize(pipe.unet.eval(), inplace=True)
pipe.vae = ipex.optimize(pipe.vae.eval(), inplace=True)

# JIT-tracing
with torch.no_grad():
    pipe.unet = torch.jit.trace(pipe.unet, sample_input, strict=False)
    pipe.unet = torch.jit.freeze(pipe.unet)
    # Two warm-up calls so the JIT can finish profiling and optimizing the graph
    pipe.unet(*sample_input)
    pipe.unet(*sample_input)
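
The trace, freeze, and warm-up pattern used in the cell above can be tried on a toy module without downloading any model. The sketch below uses plain PyTorch; `TinyBlock` is a hypothetical stand-in for the UNet, and it returns a dict to show why `strict=False` is passed to `torch.jit.trace` when a module's output is a mutable container.

```python
import torch


class TinyBlock(torch.nn.Module):
    """Hypothetical stand-in for the UNet: one conv layer, dict output."""

    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1)

    def forward(self, x):
        return {"sample": torch.relu(self.conv(x))}


model = TinyBlock().eval()  # freeze() requires eval mode
example = (torch.randn(2, 4, 16, 16),)

with torch.no_grad():
    expected = model(*example)["sample"]
    # strict=False lets the tracer accept the dict output
    traced = torch.jit.trace(model, example, strict=False)
    frozen = torch.jit.freeze(traced)
    # Warm-up calls, mirroring the notebook cell
    for _ in range(2):
        frozen(*example)
    actual = frozen(*example)["sample"]

print("max abs difference:", (expected - actual).abs().max().item())
```

Comparing the frozen module's output against eager mode, as done here, is a cheap sanity check that tracing did not change the numerics.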

In [None]:
output = pipe("a photo of an astronaut riding a horse on mars", generator=torch.manual_seed(13)).images[0]

In [None]:
output
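
To put rough numbers on the comparison between configurations instead of judging by feel, a wall-clock timer is enough. The helper below is a minimal sketch using only the standard library; the name `timed` and the dummy workload are hypothetical, and in the notebook the `with` block would wrap the `pipe(...)` call instead.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("dummy workload", timings):
    # Placeholder for a call such as pipe(prompt, ...)
    sum(i * i for i in range(100_000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```

Reusing the same `timings` dict across the FP32 and BFloat16 sections would give a side-by-side view at the end of the notebook.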

## Now, let's look at performance with Automatic Mixed Precision (BFloat16 datatype)

In [None]:
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

pipe.text_encoder = ipex.optimize(pipe.text_encoder.eval(), dtype=torch.bfloat16, inplace=True)
pipe.unet = ipex.optimize(pipe.unet.eval(), dtype=torch.bfloat16, inplace=True)
pipe.vae = ipex.optimize(pipe.vae.eval(), dtype=torch.bfloat16, inplace=True)

# Sample input with the shapes the UNet sees at inference time:
# latents (channels-last), timestep, and text-encoder hidden states
sample_input = (
    torch.randn(2, 4, 64, 64).to(memory_format=torch.channels_last),
    torch.tensor(921),
    torch.randn(2, 77, 768),
)

# JIT-tracing under BFloat16 autocast
with torch.cpu.amp.autocast(dtype=torch.bfloat16), torch.no_grad():
    pipe.unet = torch.jit.trace(pipe.unet, sample_input, strict=False)
    pipe.unet = torch.jit.freeze(pipe.unet)
    # Two warm-up calls so the JIT can finish profiling and optimizing the graph
    pipe.unet(*sample_input)
    pipe.unet(*sample_input)

In [None]:
with torch.cpu.amp.autocast(dtype=torch.bfloat16), torch.no_grad():
    output = pipe("a photo of an astronaut riding a horse on mars", generator=torch.manual_seed(13)).images[0]

In [None]:
output
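
The effect of the autocast context used above can be seen on a single layer. The sketch below uses plain PyTorch and the device-generic `torch.autocast` entry point rather than `torch.cpu.amp.autocast`; the toy `Linear` layer is a hypothetical stand-in for the real model.

```python
import torch

linear = torch.nn.Linear(8, 4).eval()
x = torch.randn(2, 8)

with torch.no_grad():
    y_fp32 = linear(x)
    # Inside autocast, eligible ops such as linear run in BFloat16
    # even though the weights and inputs are stored as Float32.
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        y_bf16 = linear(x)

max_err = (y_fp32 - y_bf16.float()).abs().max().item()
print(y_fp32.dtype, y_bf16.dtype)
print("max abs difference vs FP32:", max_err)
```

The small difference between the two outputs is the precision cost of BFloat16, which keeps Float32's exponent range but fewer mantissa bits.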

In [None]:
with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU], record_shapes=True) as p:
    with torch.cpu.amp.autocast(dtype=torch.bfloat16), torch.no_grad():
        pipe("a photo of an astronaut riding a horse on mars", generator=torch.manual_seed(13)).images
# Use a separate name so the generated image in `output` is not overwritten
profile_table = p.key_averages().table(sort_by="self_cpu_time_total")
print(profile_table)
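
The profiler table for the full pipeline is long. To get familiar with its columns first, the same profiler calls can be run on a tiny workload. The sketch below uses plain PyTorch; the matrix sizes and the `row_limit=5` setting are arbitrary choices for illustration.

```python
import torch
from torch.profiler import ProfilerActivity, profile

a = torch.randn(256, 256)
b = torch.randn(256, 256)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with torch.no_grad():
        for _ in range(10):
            torch.mm(a, b)

# Aggregate events by operator name and show the most expensive ones.
events = prof.key_averages()
print(events.table(sort_by="self_cpu_time_total", row_limit=5))
```

In the pipeline's table, the rows at the top after sorting by self CPU time indicate which operators dominate the run.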

## Even Stable Diffusion v2.1 performs well with Intel Extension for PyTorch* with BFloat16


In [None]:
pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")

pipe.text_encoder = ipex.optimize(pipe.text_encoder.eval(), dtype=torch.bfloat16, inplace=True)
pipe.unet = ipex.optimize(pipe.unet.eval(), dtype=torch.bfloat16, inplace=True)
pipe.vae = ipex.optimize(pipe.vae.eval(), dtype=torch.bfloat16, inplace=True)

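# Sample input with the shapes the UNet sees at inference time:
# latents (channels-last), timestep, and text-encoder hidden states
# (hidden size is 1024 here, versus 768 for v1.5)
sample_input = (
    torch.randn(2, 4, 64, 64).to(memory_format=torch.channels_last),
    torch.tensor(921),
    torch.randn(2, 77, 1024),
)

# JIT-tracing under BFloat16 autocast
with torch.cpu.amp.autocast(dtype=torch.bfloat16), torch.no_grad():
    pipe.unet = torch.jit.trace(pipe.unet, sample_input, strict=False)
    pipe.unet = torch.jit.freeze(pipe.unet)
    # Two warm-up calls so the JIT can finish profiling and optimizing the graph
    pipe.unet(*sample_input)
    pipe.unet(*sample_input)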
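
The sample inputs for v1.5 and v2.1 in this notebook differ only in the last dimension of the text-encoder hidden states (768 versus 1024). A small helper makes that explicit. This is a sketch: `make_unet_sample_input` is a hypothetical name, and reading these values from `pipe.unet.config` (for example `cross_attention_dim`) is an assumption about the diffusers version in use, so the sizes are passed in as plain integers here.

```python
import torch


def make_unet_sample_input(in_channels, latent_size, cross_attention_dim,
                           batch_size=2, max_tokens=77, timestep=921):
    """Build a (latents, timestep, hidden_states) tuple for tracing a UNet."""
    latents = torch.randn(batch_size, in_channels, latent_size, latent_size)
    latents = latents.to(memory_format=torch.channels_last)
    hidden_states = torch.randn(batch_size, max_tokens, cross_attention_dim)
    return latents, torch.tensor(timestep), hidden_states


# Values taken from the cells in this notebook.
sd15_input = make_unet_sample_input(4, 64, 768)
sd21_input = make_unet_sample_input(4, 64, 1024)

for name, sample in (("v1.5", sd15_input), ("v2.1", sd21_input)):
    print(name, [tuple(t.shape) for t in sample])
```

The batch size of 2 likely corresponds to classifier-free guidance, where the pipeline evaluates an unconditional and a conditional prompt together for a single image.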

In [None]:
with torch.cpu.amp.autocast(dtype=torch.bfloat16), torch.no_grad():
    result = pipe("a photo of an astronaut riding a horse on mars", generator=torch.manual_seed(13)).images[0]
result

In [None]:
with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU], record_shapes=True) as p:
    with torch.cpu.amp.autocast(dtype=torch.bfloat16), torch.no_grad():
        pipe("a photo of an astronaut riding a horse on mars", generator=torch.manual_seed(13)).images
# Use a separate name so earlier results are not overwritten
profile_table = p.key_averages().table(sort_by="self_cpu_time_total")
print(profile_table)

Stable Diffusion is not yet optimized for int8 static quantization in JIT/graph mode. You can still experiment with eager-mode int8 by quantizing Stable Diffusion with quantization-aware training (QAT), following the instructions at https://github.com/intel/intel-extension-for-transformers/blob/main/examples/huggingface/pytorch/text-to-image/quantization/qat/README.md

Note that in that case you should install diffusers 0.16 via conda or pip.

\* Other names & brands may be claimed as the property of others
  