
KMint1819/cuda-diffusion


ECE/CS 508 Spring 2023 Project Team 10

Project Title

CUDA Stable Diffusion Algorithm

ControlNet repository

Project Summary

Implement the stable diffusion algorithm in CUDA. Existing open-source implementations of diffusion models [1][2][3][4] use high-level APIs such as PyTorch or TensorFlow. Their performance is orders of magnitude away from the latency requirements of real-time applications: even on modern desktop-class GPUs, these models take on the order of weeks to train and tens of seconds to generate even relatively low-resolution (512x512) images. [6] has shown that CUDA-based implementations of MLP networks can achieve approximately one order of magnitude of improvement in training and inference times over their high-level (TensorFlow-based) counterparts. Moreover, frame-generation latency scales with frame resolution, which makes matters even worse for higher-resolution images. It therefore makes sense to accelerate stable diffusion using GPU programming languages. We propose to implement the state-of-the-art stable diffusion algorithm [4] in a GPU programming language (CUDA or Vulkan), where "hardware-aware" fusion of certain layers of the diffusion model can be performed. Such fusion can exploit on-chip data reuse for intermediate outputs and avoid expensive off-chip DRAM accesses.
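The computation we target for fusion is the cross-attention forward pass: two matrix multiplications with a row-wise softmax between them. A minimal NumPy reference sketch (tensor shapes here are illustrative, not taken from ControlNet):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    # q: (n, d) query tokens; k, v: (m, d) context tokens.
    # Standard scaled dot-product attention: softmax(QK^T / sqrt(d)) V.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)        # (n, m)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v                   # (n, d)

q = np.random.rand(4, 8)
k = np.random.rand(6, 8)
v = np.random.rand(6, 8)
out = cross_attention(q, k, v)
print(out.shape)  # (4, 8)
```

A fused CUDA kernel would keep `scores` and `weights` in shared memory or registers instead of round-tripping each intermediate through DRAM.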

CrossAttentionBlock Forward Pass Comparison

Experiments were conducted on a single GeForce RTX 3050 Mobile GPU. TODO: add Nsight profiling results.

| Model | Average Forward Time |
| --- | --- |
| Original CrossAttention | 0.723301888 |
| Our CrossAttention (t=1024) | 0.700392 |
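The table does not state the measurement methodology, so a minimal timing harness of the kind we might use is sketched below (the warmup/iteration counts are assumptions; for CUDA kernels, `torch.cuda.synchronize()` or CUDA events would be needed before reading the clock, since kernel launches are asynchronous):

```python
import time

def average_forward_time(fn, *args, warmup=3, iters=10):
    # Warm-up iterations exclude one-time costs (allocation, JIT,
    # kernel compilation) from the measured average.
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters

# Toy usage: time a cheap Python function.
avg = average_forward_time(lambda: sum(range(10_000)))
print(f"average: {avg:.6f} s")
```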

How to run

  • Currently, this project can only be run using rai, a tool from UIUC for running jobs on cloud servers:
rai -p .
  • If you need to do profiling, you can use the following command:
rai -p . --queue rai_amd64_exclusive
  • Input image:
    • turtle_scribble
  • Output image:
    • turtle_image

How to configure rai build commands

  • Run ControlNet with our CrossAttention (attention-block) implementation:
...
commands:
  build:
    - /bin/sh -c 'cd /src/python/ && python3.8 -m pip install ./'
    - /bin/sh -c 'cd /src/ControlNet && python3.8 my_scribble2image.py'
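To switch between the original and our implementation (TODO item 7.1), the second build command could pass a flag to the script. A hedged sketch, keeping only the fields shown above; the `--attention` option is hypothetical until the argparse switch is implemented:

```yaml
commands:
  build:
    # Install the package that registers our CUDA CrossAttention extension.
    - /bin/sh -c 'cd /src/python/ && python3.8 -m pip install ./'
    # Hypothetical flag: assumes my_scribble2image.py gains an argparse
    # option selecting the original vs. our CrossAttention implementation.
    - /bin/sh -c 'cd /src/ControlNet && python3.8 my_scribble2image.py --attention ours'
```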

TODO:

  1. Generate fake input data (matching the tensor shapes in CrossAttention) for running the kernel standalone
  2. Build unittest for verifying the correctness of the kernel
  3. Move the model weights into the docker container and push
  4. Integrate pytorch & cuda-c
    1. Find necessary modules from the ControlNet code to make sure what we need to implement
    2. Find out how to connect the C++ code to pytorch
  5. Implement reference versions of the operations in CrossAttention
    1. Matrix Multiplication
    2. Softmax
  6. Implement the CUDA versions of the operations in CrossAttention
    1. Matrix Multiplication
    2. Softmax
  7. Plug our CrossAttention implementation into the ControlNet model and load weights successfully
    1. Provide argparse options to switch between the original and our implementation
  8. Profiling
    1. ControlNet with the original CrossAttention
    2. ControlNet with our CrossAttention
  9. Potential code optimizations
    1. pointer/ref
    2. torch.zero_grad()
  10. Decouple the project from rai
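For items 2 and 5, a unit test could compare the kernel's output against a NumPy reference. A sketch under that assumption (the class and function names are hypothetical; in a real test, the CUDA kernel's output would replace one side of each comparison):

```python
import unittest
import numpy as np

def softmax_reference(x):
    # Numerically stable row-wise softmax, used as ground truth
    # for verifying the CUDA softmax kernel.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

class TestSoftmaxKernel(unittest.TestCase):
    def test_rows_sum_to_one(self):
        y = softmax_reference(np.random.rand(16, 64))
        np.testing.assert_allclose(y.sum(axis=-1), np.ones(16), atol=1e-6)

    def test_shift_invariance(self):
        # Softmax is invariant to adding a constant per row; a numerically
        # stable kernel must preserve this even for large inputs.
        x = np.random.rand(4, 8)
        np.testing.assert_allclose(
            softmax_reference(x), softmax_reference(x + 100.0), atol=1e-6
        )
```

Run with `python -m unittest`.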

Team Information

| Info | Description | Email |
| --- | --- | --- |
| TeamID | Team-10 | |
| Member1 | Cheng-Han Chiang | chc11@illinois.edu |
| Member2 | Shao-Chian Chen | scchen4@illinois.edu |
| Member3 | Po-Wei Wang | poweiww2@illinois.edu |

4/20 meeting status overview

Where we are in the process

We are trying to run gradio_hough2image.py in ControlNet, but we do not have enough GPUs to run it.
