Implement the stable diffusion algorithm in CUDA. Existing open-source implementations of diffusion models [1][2][3][4] use high-level APIs such as PyTorch or TensorFlow. Their performance is orders of magnitude away from the latency requirements of real-time applications: even on modern desktop-class GPUs, these models take on the order of weeks to train and tens of seconds to generate even relatively low-resolution (512x512) images. [6] has shown that CUDA-based implementations of MLP networks can achieve an approximately 100x improvement in training and inference times over their high-level (TensorFlow-based) counterparts. Moreover, frame-generation latency scales with frame resolution, which makes matters even worse for higher-resolution images. Hence, it makes sense to accelerate the stable diffusion algorithm using GPU programming languages. I propose to implement the state-of-the-art stable diffusion algorithm [4] in a GPU programming language (CUDA or Vulkan), performing "hardware-aware" fusion of certain layers of the diffusion model, which can exploit more on-chip data reuse for intermediate outputs and avoid expensive off-chip DRAM accesses.
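For reference, the core CrossAttention computation targeted here is softmax(Q Kᵀ / √d) V, built from exactly the two primitives listed later (matrix multiplication and softmax). Below is a minimal pure-Python sketch; the function names and shapes are illustrative only, not the project's actual interface:

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(a, b):
    # Naive matrix multiply: a is (n x k), b is (k x m).
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def cross_attention(q, k, v):
    # softmax(Q K^T / sqrt(d)) V, applied row by row.
    d = len(q[0])
    scores = matmul(q, transpose(k))
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]
    return matmul(weights, v)
```

A fused GPU kernel would keep `scores`/`weights` in shared memory or registers instead of materializing them in DRAM, which is the data-reuse opportunity described above.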
Experiments were conducted on a single GeForce RTX 3050 Mobile GPU. TODO: Add Nsight profiling results.
| Model | Average Forward Time |
|---|---|
| Original CrossAttention | 0.723301888 |
| Our CrossAttention (t=1024) | 0.700392 |
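The averages above were presumably obtained by timing repeated forward passes. A minimal timing-harness sketch follows; the warm-up and iteration counts are assumptions, and on the GPU one would additionally synchronize (e.g. `torch.cuda.synchronize()`) before reading the clock so that asynchronous kernel launches are actually counted:

```python
import time

def average_forward_time(forward, n_iters=100, warmup=10):
    # Warm-up iterations so one-time costs (kernel compilation,
    # allocator growth, caches) do not skew the average.
    for _ in range(warmup):
        forward()
    start = time.perf_counter()
    for _ in range(n_iters):
        forward()
    return (time.perf_counter() - start) / n_iters
```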
- Currently, this project can only be run using `rai`, a tool from UIUC to run on cloud servers:

  ```
  rai -p .
  ```

- If you need to do profiling, you can use the following command:

  ```
  rai -p . --queue rai_amd64_exclusive
  ```
- Run ControlNet with our AttentionBlock implementation by using the following build commands:

  ```yaml
  ...
  commands:
    build:
      - /bin/sh -c 'cd /src/python/ && python3.8 -m pip install ./'
      - /bin/sh -c 'cd /src/ControlNet && python3.8 my_scribble2image.py'
  ```
- Generate the fake data for running the kernel from the CrossAttentionBuild unit test to verify the correctness of the kernel
- Move the model weights into the Docker container and push
- Integrate PyTorch & CUDA C
- Find the necessary modules in the ControlNet code to determine what we need to implement
- Find out how to connect the C++ code to PyTorch
- Implement the CUDA operations in CrossAttention
  - Matrix multiplication
  - Softmax
- Plug our CrossAttention implementation into the ControlNet model and load the weights successfully
- Provide argparse options to switch between the original and our implementation
- Profiling
  - ControlNet with the original CrossAttention
  - ControlNet with our CrossAttention
- Potential code optimizations
  - pointer/ref
  - `torch.zero_grad()`
- Decouple the project from rai
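The argparse switch mentioned above could look like the following sketch; the flag name `--use_custom_attention` is hypothetical, and the repository's actual option may differ:

```python
import argparse

def build_parser():
    # Hypothetical CLI flag for toggling the attention implementation.
    parser = argparse.ArgumentParser(description="Run ControlNet inference")
    parser.add_argument("--use_custom_attention", action="store_true",
                        help="Use our CUDA CrossAttention instead of the original")
    return parser

def select_attention(args, original_cls, custom_cls):
    # Pick the attention class based on the CLI flag.
    return custom_cls if args.use_custom_attention else original_cls
```

Keeping both implementations selectable at run time makes the correctness check and the profiling comparison in the plan straightforward.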
| Info | Description | Contact |
|---|---|---|
| TeamID | Team-10 | |
| Member1 | Cheng-Han Chiang | chc11@illinois.edu |
| Member2 | Shao-Chian Chen | scchen4@illinois.edu |
| Member3 | Po-Wei Wang | poweiww2@illinois.edu |
We tried to run gradio_hough2image.py in ControlNet, but we do not have enough GPUs to run it.