<img src="../imgs/cover.png" alt="cover" />

# Accelerate Stable Diffusion Model Using Speedster + New TensorRT 8.6 Release and Gradio App Creation

#### 🚀 Welcome to this notebook focused on optimizing Stable Diffusion using Speedster 🚀

### Goal
The goal is to optimize a Stable Diffusion model with Speedster and the latest TensorRT version, obtaining an inference model that runs ~2 times faster than the original one. We will then show how easy it is to integrate the optimized model into a Gradio application by building a small game with it. The game is called *guess the Pokémon*: the user has to identify the name of a Pokémon generated by the optimized Stable Diffusion model.

<img src="../imgs/sd-cover.png" alt="cover" />

### Stable Diffusion

Stable Diffusion, which was released in 2022, is a text-to-image model that is mainly utilized for generating images based on textual descriptions.

Stable Diffusion consists of three main components: 
* **a text-understanding component** that uses a Transformer language model to translate text into a numeric representation
* **an image information creator** that works in the latent space, made up of the UNet neural network and a scheduling algorithm
* **an image decoder** (the autoencoder decoder) that produces the final image

The image information creator works in the latent space and gradually processes the information over several steps to generate high-quality images. The image decoder then uses the processed information array to produce the final pixel image. Speedster's main focus for model acceleration is the conditional UNet used to denoise the latent representation of the image. The reason for this emphasis is that the UNet is executed once per step, as set by the `num_inference_steps` hyperparameter, which is usually 50 by default. As a result, the computational cost of the UNet far outweighs that of the other two components, which are only executed once.
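To make the cost argument concrete, here is a rough sketch of how the per-component calls add up. The per-call timings are hypothetical placeholders chosen for illustration, not measurements; only the default of 50 steps comes from the text above, and `unet_time_share` is a helper name introduced here.

```python
def unet_time_share(num_inference_steps, unet_s, text_encoder_s, decoder_s):
    """Fraction of the total pipeline time spent inside the UNet.

    The UNet runs once per denoising step, while the text encoder and
    the image decoder each run a single time.
    """
    unet_total = num_inference_steps * unet_s
    total = unet_total + text_encoder_s + decoder_s
    return unet_total / total


# Hypothetical per-call timings in seconds (assumed, not measured)
share = unet_time_share(50, unet_s=0.08, text_encoder_s=0.01, decoder_s=0.05)
print(f"UNet share of total time: {share:.1%}")
```

Even if the other two components were several times slower than assumed here, the UNet would still dominate, which is why optimizing it gives most of the benefit.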

<center><img src="../imgs/stab1.jpg" alt="stab1" width="600" height="400"/></center>

<center><small style="font-size: 12px;">Image taken from The Illustrated Stable Diffusion blog by Jay Alammar.</small></center>

<center><img src="../imgs/stab2.jpg" alt="stab2"  width="600" height="400"/></center>

<center><small style="font-size: 12px;">Image taken from The Illustrated Stable Diffusion blog by Jay Alammar.</small></center>

### Speedster

Nebuly's [Speedster](https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster) is an **open-source** module that enables fast AI inference through just a few lines of code. The module automatically applies state-of-the-art optimization techniques to get the largest inference speed-up the hardware allows on a single machine, in terms of latency, throughput, and model size.

To get started, you can refer to the [getting started guide](https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster) for five different input model frameworks supported by Speedster, which are PyTorch, Hugging Face Transformers, Hugging Face Diffusers, TensorFlow/Keras, and ONNX.

In this notebook, we will delve into the details of Stable Diffusion and the Speedster algorithm. We will learn how to implement Speedster in Python and use it to optimize Stable Diffusion.

In addition to optimization techniques, we will also create a Gradio application to interact with our accelerated model.

<center><img src="../imgs/neb.jpg" alt="nebuly" width="600" height="400"/></center>

<center><small style="font-size: 12px;">Image taken from Nebuly's website.</small></center>

## Install and Imports

Before we dive into the analysis, we need to make sure that we have all the necessary tools to execute the code. In this section, we will install the required packages and libraries to ensure a smooth and error-free run of the notebook. These packages include torch, Speedster and some Hugging Face libraries. Once we have installed the packages, we will import them into the notebook and get ready to use them in our analysis.

In [None]:
! pip install accelerate torch datasets diffusers gradio speedster --quiet 

In [None]:
! python -m nebullvm.installers.auto_installer --frameworks diffusers --compilers all

In [None]:
import torch

import pandas as pd

from diffusers import StableDiffusionPipeline
from datasets import load_dataset

from speedster import optimize_model, save_model, load_model

import gradio as gr

from random import randrange
import os

## Environment Check

![ChessUrl](https://media.giphy.com/media/rAm0u2k17rM3e/giphy.gif "env")

In order to make everything work, we need to check that the environment meets the necessary requirements. In particular, to optimize a model of the Hugging Face Diffusers library you need to have `CUDA>=12` and `tensorrt>=8.6.0`.

From TensorRT 8.6, all the TensorRT pre-built wheels released by NVIDIA support only `CUDA>=12.0`. Speedster's auto-installer installs `tensorrt>=8.6.0` only if it detects `CUDA>=12.0`; otherwise it installs `tensorrt==8.5.3.1`. In that case, you will have to upgrade your CUDA version and then upgrade TensorRT to 8.6.0 or above to execute this notebook.

First of all, let's check the CUDA version installed on the machine:
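The version rule described above can be summarized in a few lines. This is only an illustration of the rule as stated in this notebook, not Speedster's actual installer code; `tensorrt_requirement` is a hypothetical helper name.

```python
def tensorrt_requirement(cuda_major: int) -> str:
    """Return the pip requirement string that matches the rule in the text.

    Illustrative only: the real auto-installer may apply further checks.
    """
    if cuda_major >= 12:
        return "tensorrt>=8.6.0"
    return "tensorrt==8.5.3.1"


for cuda_major in (11, 12):
    print(cuda_major, "->", tensorrt_requirement(cuda_major))
```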

In [2]:
import subprocess

if torch.cuda.is_available():
    cuda_version = subprocess.check_output(["nvidia-smi"])
    cuda_version = int(cuda_version.decode("utf-8").split("\n")[2].split("|")[-2].split(":")[-1].strip().split(".")[0])
    assert cuda_version >= 12, ("This notebook requires CUDA>=12.0 to be executed, please upgrade your CUDA version.")

If you have `CUDA<12.0`, you can upgrade it at this link: https://developer.nvidia.com/cuda-downloads. The installation is very simple: you only have to select the characteristics of your environment and copy and paste the commands provided by NVIDIA.

Then, let's check the TensorRT version installed on the platform. Stable Diffusion optimization is supported starting from `tensorrt==8.6.0`.
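The check above depends on the exact row and column layout of the `nvidia-smi` table. A slightly more tolerant alternative is to search for the `CUDA Version:` field with a regular expression, as in the sketch below. The `sample` string is an illustrative header line, not output captured from this machine, and `parse_cuda_major` is a name introduced here. Keep in mind that `nvidia-smi` reports the highest CUDA version supported by the installed driver.

```python
import re


def parse_cuda_major(smi_output: str) -> int:
    """Extract the major CUDA version from the text printed by nvidia-smi."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        raise ValueError("CUDA version not found in nvidia-smi output")
    return int(match.group(1))


# Illustrative header line (assumed format, not a real capture)
sample = "| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |"
print(parse_cuda_major(sample))
```

In the notebook, the function could be fed with `subprocess.check_output(["nvidia-smi"]).decode("utf-8")`.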

In [3]:
import tensorrt
from nebullvm.tools.utils import check_module_version

assert check_module_version(tensorrt, "8.6.0"), ("This notebook can be run only with tensorrt>=8.6.0, if using an older version you could have issues during the optimization.")

If you have an older version, after ensuring you have `CUDA>=12.0` installed, you can upgrade your TensorRT version by running:
```
! pip install -U tensorrt
```
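For readers curious about what a minimum-version check involves, here is a minimal standard-library sketch. It is not how `check_module_version` is implemented in nebullvm; `version_tuple` and `meets_minimum` are hypothetical helpers, and a dedicated version-parsing library is a more robust choice for real use.

```python
import re


def version_tuple(version: str) -> tuple:
    """Turn '8.6.0' into (8, 6, 0), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare versions numerically, so that '10.0.1' ranks above '8.6.0'."""
    return version_tuple(installed) >= version_tuple(minimum)


print(meets_minimum("8.5.3.1", "8.6.0"))
print(meets_minimum("8.6.1", "8.6.0"))
```

Comparing tuples of integers avoids the pitfall of plain string comparison, where `"10.0.1" < "8.6.0"`.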

## Load Data

The data used for analysis and modeling is derived from the [Bulbagarden](https://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_name) website and consists of a small dataset with two columns, one with the names of the Pokémon and the other with their descriptions.

You can get the data from [Hugging Face](https://huggingface.co/datasets/mfumanelli/pokemon-description-xs).

In [None]:
dataset = load_dataset("mfumanelli/pokemon-description-xs")

In [5]:
data = dataset['train'].to_pandas()

In [6]:
data.head(2)

| Unnamed: 0 | name | description |
|---|---|---|
| 0 | Bulbasaur | it's a blue and green Pokémon. It has a seed o... |
| 1 | Pikachu | it's a yellow and black Pokémon. It's a creatu... |


![ChessUrl](https://media.giphy.com/media/I2nZMy0sI0ySA/giphy.gif "chess")

## Model

First of all, we have to choose the version of the Stable Diffusion model we want to optimize. Speedster officially supports the most used versions:

* [CompVis/stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4)
* [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5)
* [stabilityai/stable-diffusion-2-1-base](https://huggingface.co/stabilityai/stable-diffusion-2-1-base)
* [stabilityai/stable-diffusion-2-1](https://huggingface.co/stabilityai/stable-diffusion-2-1)

Other Stable Diffusion versions from the [Diffusers library](https://github.com/huggingface/diffusers) should work but have never been officially tested. 

⚠️ If you try a version not included among these and it works, please feel free to report it to us on [Discord](https://discord.com/invite/RbeQMu886J) so we can add it to the list of supported versions. If you try a version that does not work, you can open an issue and possibly a PR on [GitHub](https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster) ⚠️

In this notebook, we'll be using the `Stable-Diffusion-v1-4` checkpoint. This checkpoint was initialized with the weights of the `Stable-Diffusion-v1-2` checkpoint and subsequently fine-tuned for 225k steps at a resolution of 512x512 on the "laion-aesthetics v2 5+" dataset, with 10% of text-conditioning removed to enhance classifier-free guidance sampling.

**NOTE**: If you want to run the `stable-diffusion-2-1` version of Stable Diffusion, you need a GPU with a minimum of 22GB memory. If your GPU has less than 22GB memory, you can opt for other versions of the model, such as `stable-diffusion-2-1-base`.
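The note above can be turned into a small selection helper. This is a sketch under the 22GB threshold stated in the note; `choose_model_id` is a hypothetical function, and the memory value is passed in by hand here. With PyTorch, the total memory in bytes can be read from `torch.cuda.get_device_properties(0).total_memory`.

```python
MIN_MEMORY_GB_FOR_2_1 = 22  # threshold taken from the note above


def choose_model_id(gpu_memory_gb: float) -> str:
    """Pick the 2.1 checkpoint only when the GPU has enough memory for it."""
    if gpu_memory_gb >= MIN_MEMORY_GB_FOR_2_1:
        return "stabilityai/stable-diffusion-2-1"
    return "stabilityai/stable-diffusion-2-1-base"


for memory_gb in (16, 24):
    print(memory_gb, "GB ->", choose_model_id(memory_gb))
```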

In [7]:
model_id = "CompVis/stable-diffusion-v1-4"

## Model Optimization

Using Speedster's latest API, there are two options to enhance the speed of your models. The first option accelerates your models without compromising accuracy, whereas the second option increases their speed further by letting you specify how much accuracy you are willing to trade off. To achieve this acceleration, Speedster uses several optimization techniques, including deep learning compilers (in both options) and, among others, quantization and half precision (in the second option).

Here are the outcomes achieved for Stable Diffusion on the A10 GPU (the GPU that we will use) and on the 3090Ti:

<img src="../imgs/sd-benchmarks.png" alt="cover" />

We conducted tests on each GPU, evaluating the performance of the four most commonly utilized versions of Stable Diffusion. We compared the performance of the base version in fp16 with the version utilizing xformers and the speedster-compiled model. Our findings demonstrate that the optimized attention algorithm implemented within xformers delivers considerably better performance than the base model, particularly for the most complex model (the 2.1). Moreover, Speedster enhances the model's speed even further. In fact, the latest version of TensorRT outperforms both the base version and the one that utilizes xformers in all tested cases.

When utilizing Speedster to optimize Stable Diffusion models from Hugging Face's Diffusers library, two easy steps must be followed:

1) Once you have selected and downloaded the desired model, generate a small set of sample input data:

In [8]:
input_data = data.description[0:5].tolist()

2) Run the optimization by specifying as arguments: 
* **Model** (`model`)
* **Input data** (`input_data`)
* **Optimization time** (`optimization_time`): whether to limit it or not
* **Compilers to exclude** (`ignore_compilers`): depending on hardware and model, certain compilers may be unsuitable
* **Acceptable level of accuracy loss** (`metric_drop_ths`) during optimization: it can be set to zero as well

**Please note**: Optimization of Stable Diffusion requires **a lot of RAM**. If you are running this notebook on Google Colab, make sure to use the high RAM option, otherwise the kernel may crash. If the kernel also crashes with the high RAM option, please try adding "torchscript" to the `ignore_compilers` list.

### Optimization

Let's optimize the model on a single **A10 GPU**. Note that on GPU we load the model in half precision by default, because it's faster and lighter:

In [9]:
device = "cuda" if torch.cuda.is_available() else "cpu"

In [None]:
if device == "cuda":
    pipe = StableDiffusionPipeline.from_pretrained(model_id, revision='fp16', torch_dtype=torch.float16)
else:
    pipe = StableDiffusionPipeline.from_pretrained(model_id)

pipe.to(device)

In [None]:
optimized_model = optimize_model(
    model=pipe,
    input_data=input_data,
    optimization_time="unconstrained",
    ignore_compilers=["torch_tensor_rt", "tvm"],  # TensorRT from the torch pipeline has some issues with Stable Diffusion, so we are going to skip it.
    metric_drop_ths=0.2,
    device=device
)

<img src="../imgs/output-stable-diffusion.png" alt="cover" />

As can be seen, the compiler that allows for greater acceleration of the model is TensorRT. Speedster has integrated the latest release of TensorRT, [**TensorRT v8.6.0**](https://github.com/NVIDIA/TensorRT/releases/tag/v8.6.0), which was recently made available by Nvidia.

In essence, TensorRT optimizes a model's mathematical coordinates to strike a balance between the smallest possible size and highest achievable accuracy for the intended system. In the latest release, one of the key updates is that now the demoDiffusion acceleration is supported out of the box in TensorRT without requiring the installation of additional plugins.

Using Speedster has several advantages over using TensorRT directly: it lets us compare different optimization tools on the available hardware and software; it is simple and intuitive to use, performing all steps automatically; and it returns a directly usable model rather than a TensorRT engine that would need additional operations before being used in inference. 

The UNet is accelerated by ~2 times. This value refers only to the speedup on the UNet. To see how the end-to-end latency changes, let's benchmark the original model and the optimized one.
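Because only the UNet is accelerated, the end-to-end gain depends on how much of the total time the UNet accounts for. Amdahl's law gives a quick estimate, sketched below. The 95% UNet fraction is an assumed value for illustration, not a measurement from this notebook, and `end_to_end_speedup` is a name introduced here.

```python
def end_to_end_speedup(unet_fraction: float, unet_speedup: float) -> float:
    """Amdahl's law: overall speedup when only part of the work gets faster."""
    return 1.0 / ((1.0 - unet_fraction) + unet_fraction / unet_speedup)


# Assumed: the UNet takes 95% of the original runtime and becomes 2x faster
estimate = end_to_end_speedup(unet_fraction=0.95, unet_speedup=2.0)
print(f"Estimated end-to-end speedup: {estimate:.2f}x")
```

The closer the UNet fraction is to 1, the closer the end-to-end speedup gets to the UNet speedup itself.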

## Benchmarks

Let's run the prediction 10 times (2 warm-up iterations followed by 8 timed ones) to calculate the average response time of the original model.

In [12]:
test_prompt = "futuristic llama with a cyberpunk city on the background"

In [None]:
if device == "cuda":
    pipe = StableDiffusionPipeline.from_pretrained(model_id, revision='fp16', torch_dtype=torch.float16)
else:
    pipe = StableDiffusionPipeline.from_pretrained(model_id)

pipe.to(device)

In [None]:
import time

times = []

# Warmup for 2 iterations
for _ in range(2):
    with torch.no_grad():
        final_out = pipe(test_prompt).images[0]

# Benchmark
for _ in range(8):
    st = time.time()
    with torch.no_grad():
        final_out = pipe(test_prompt).images[0]
    times.append(time.time()-st)
original_model_time = sum(times)/len(times)

In [15]:
print(f"Average response time for original Stable Diffusion 1.4 on a A10 GPU: {original_model_time} s")

Average response time for original Stable Diffusion 1.4 on a A10 GPU: 4.100552409887314 s


The average response time of the optimized model turns out to be:

In [None]:
times = []

# Warmup for 2 iterations
for _ in range(2):
    with torch.no_grad():
        final_out = optimized_model(test_prompt).images[0]

# Benchmark
for _ in range(8):
    st = time.time()
    with torch.no_grad():
        final_out = optimized_model(test_prompt).images[0]
    times.append(time.time()-st)
optimized_model_time = sum(times)/len(times)

In [17]:
print(f"Average response time for optimized Stable Diffusion 1.4: {optimized_model_time} s")

Average response time for optimized Stable Diffusion 1.4: 2.160187304019928 s


The entire model has been sped up by roughly two times, reducing the average response time from around 4 seconds to approximately 2 seconds.
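The two benchmark loops above share the same structure, so they could be folded into one helper. The sketch below is one possible way to do that, using `time.perf_counter` instead of `time.time`; `average_latency` is a hypothetical name. It also computes the speedup from the two averages printed above.

```python
import time


def average_latency(fn, warmup=2, runs=8):
    """Call fn a few times to warm up, then return the mean time per call."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


# Averages printed by the two benchmark cells above
original_s = 4.100552409887314
optimized_s = 2.160187304019928
speedup = original_s / optimized_s
print(f"End-to-end speedup: {speedup:.2f}x")
```

In the notebook it could be called as `average_latency(lambda: pipe(test_prompt).images[0])` and likewise with `optimized_model`.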

## Gradio App

<img src="../imgs/neb_x_gradio.png" alt="neb_gradio" />

In this section, we will show how to create a Gradio application that utilizes the **optimized Stable Diffusion model generated using Speedster**. 

This Gradio application will allow users to generate images of Pokémon based on the descriptions provided in the previously loaded dataset. Let's dive in and create our very own Pokémon generator using Stable Diffusion and Gradio!

The first step is to create an inference function that generates images from a prompt:

In [18]:
def infer(prompt, steps, scale):
    generator = torch.Generator(device=device)

    if device == 'cuda':
        with torch.autocast(device):
            image = optimized_model(
                f"""Cutest cartoon ever created: {prompt}""",
                num_inference_steps=steps,
                guidance_scale=scale,
                generator=generator,
            )
    else:
        image = optimized_model(
            f"""Cutest cartoon ever created: {prompt}""",
            num_inference_steps=steps,
            guidance_scale=scale,
            generator=generator,
        )

    return image

The function needs to be invoked with a randomly selected Pokémon description from the dataset as its input:

In [19]:
def generate_pokemon():
    seed = randrange(data.shape[0])
    random_description = data.iloc[seed]["description"]
    image = infer(random_description, 50, 7)

    return image[0][0], seed

Last but not least, let's make a function to get the name of the generated Pokémon:

In [20]:
def pokemon_name(seed):
    return data.iloc[int(seed)]["name"]

Here is some CSS code to improve the visual appearance of our application:

In [21]:
css = """
        .gradio-container {
            font-family: 'IBM Plex Sans', sans-serif;
        }
        .gr-button {
            color: white;
            border-color: black;
            background: black;
        }
        input[type='range'] {
            accent-color: black;
        }
        .dark input[type='range'] {
            accent-color: #dfdfdf;
        }
        .container {
            max-width: 730px;
            margin: auto;
            padding-top: 1.5rem;
        }
        #image {
            min-height: 22rem;
            margin-bottom: 15px;
            margin-left: auto;
            margin-right: auto;
            border-bottom-right-radius: .5rem !important;
            border-bottom-left-radius: .5rem !important;
        }
        #image>div>.h-full {
            min-height: 20rem;
        }
        .details:hover {
            text-decoration: underline;
        }
        .gr-button {
            white-space: nowrap;
        }
        .gr-button:focus {
            border-color: rgb(147 197 253 / var(--tw-border-opacity));
            outline: none;
            box-shadow: var(--tw-ring-offset-shadow), var(--tw-ring-shadow), var(--tw-shadow, 0 0 #0000);
            --tw-border-opacity: 1;
            --tw-ring-offset-shadow: var(--tw-ring-inset) 0 0 0 var(--tw-ring-offset-width) var(--tw-ring-offset-color);
            --tw-ring-shadow: var(--tw-ring-inset) 0 0 0 calc(3px + var(--tw-ring-offset-width)) var(--tw-ring-color);
            --tw-ring-color: rgb(191 219 254 / var(--tw-ring-opacity));
            --tw-ring-opacity: .5;
        }
        .footer {
            margin-bottom: 45px;
            margin-top: 35px;
            text-align: center;
            border-bottom: 1px solid #e5e5e5;
        }
        .footer>p {
            font-size: .8rem;
            display: inline-block;
            padding: 0 10px;
            transform: translateY(10px);
            background: white;
        }
        .dark .footer {
            border-color: #303030;
        }
        .dark .footer>p {
            background: #0b0f19;
        }
        .acknowledgments h4{
            margin: 1.25em 0 .25em 0;
            font-weight: bold;
            font-size: 115%;
        }
"""

Lastly, we will build and launch the application. It can also be shared through a temporary public link for 72 hours by using: <p style="font-family:monospace"> demo.launch(share=True)</p>

In [None]:
with gr.Blocks(css=css) as demo:
    gr.HTML(
        """
            <div style="text-align: center; max-width: 650px; margin: 0 auto;">
              <div
                style="
                  display: inline-flex;
                  align-items: center;
                  gap: 0.8rem;
                  font-size: 1.75rem;
                "
              >
                <svg xmlns="http://www.w3.org/2000/svg" width="20%" height="20%" viewBox="0 0 100 100">
	<path d="M 30 50
		a 1 1 1 0 1 40 0
		h-12.5
		a 1 1 1 0 0 -15 0
		z"
		fill="#f00" stroke="#222"
	></path>
	<circle
		cx="50"
		cy="50"
		r="5"
		fill="#222" stroke="#222"
	></circle>
	<path d="M 30 50
		a 1 1 1 0 0 40 0
		h-12.5
		a 1 1 1 0 1 -15 0
		z"
		fill="#fff" stroke="#222"
	></path>
</svg>
                <h1 style="font-weight: 900; margin-bottom: 7px;">
                  Stable Diffusion Loves Pokémon
                </h1>
              </div>
              <p style="margin-bottom: 20px; font-size: 94%">
                Stable Diffusion is a state-of-the-art text-to-image model that generates images from text, 
                in this demo it is used to generate Pokémon from their description. <br></p>
                <hr style="height:2px;border-width:0;color:gray;background-color:gray">
                <br>
              <p align="left" style="margin-bottom: 10px; font-size: 94%">
                <b>Instructions</b>: press the "Generate a Pokémon" button to generate an image and see if you can guess the Pokémon's name.
                You can see if you guessed right by pressing the "Tell me the name" button.
              </p>
              <br>
              <b>NOTE: If a completely black image is generated, it means that the NSFW checker has blocked the display of the output.</b>
            </div>
        """
    )
    with gr.Group():
        with gr.Box():
            with gr.Row().style(mobile_collapse=False, equal_height=True):
                b1 = gr.Button("Generate a Pokémon")
                b2 = gr.Button("Tell me the name")
            text = gr.Textbox(label="Name:")
            image = gr.Image(
                label="Generated images", show_label=False, elem_id="image"
            ).style(height="auto")

            seed = gr.Number(visible=False)

            b1.click(generate_pokemon, inputs=None, outputs=[image, seed])
            b2.click(pokemon_name, inputs=seed, outputs=text)

demo.launch(share=False)

<img src="../imgs/gradio_app.png" alt="cover" />

## Conclusions

And here we are. Let's summarize what we saw in this notebook:
* Stable Diffusion is a text-to-image model mainly used to generate images based on textual descriptions
* Accelerating this model with Speedster is very simple and fast
* Speedster also integrates the latest version of TensorRT for 🚀🚀🚀🚀  performance and ease of use
* Using Speedster we roughly halved the end-to-end latency, from about 4.1 s to about 2.2 s on an A10 GPU
* Using a Speedster-optimized model within a Gradio application is straightforward, as we showed in our example with the **guess the Pokémon** application

We hope this notebook will be useful to you. You can find many more Speedster use cases in [this repository](https://github.com/nebuly-ai/learning-hub/notebooks).

Remember to join our [Discord community](https://discord.com/invite/RbeQMu886J). If you are interested in AI optimization or if you liked this notebook, please leave a star on our repo [Speedster](https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster) 💕🌟!

![ChessUrl](https://media.giphy.com/media/slVWEctHZKvWU/giphy.gif "env")