SD3: initial support (#2124)

TO DO:
- [x] readme
- [x] text
- [x] table of content
- [x] device selection
- [x] quantization
- [x] meta
- [x] gradio
openvinotoolkit · Jun 20, 2024 · 08cb183
1 parent 1f11b58
commit 08cb183
Showing 9 changed files with 1,986 additions and 4 deletions.
diff --git a/.ci/ignore_convert_execution.txt b/.ci/ignore_convert_execution.txt
@@ -61,4 +61,5 @@ notebooks/stable-video-diffusion/stable-video-diffusion.ipynb
 notebooks/llm-agent-langchain/llm-agent-langchain.ipynb
 notebooks/hello-npu/hello-npu.ipynb
 notebooks/yolov10-optimization/yolov10-optimization.ipynb
-notebooks/hunyuan-dit-image-generation/hunyuan-dit-image-generation.ipynb
+notebooks/hunyuan-dit-image-generation/hunyuan-dit-image-generation.ipynb
+notebooks/stable-diffusion-v3/stable-diffusion-v3.ipynb
diff --git a/.ci/ignore_pip_conflicts.txt b/.ci/ignore_pip_conflicts.txt
@@ -29,4 +29,5 @@ notebooks/sketch-to-image-pix2pix-turbo/sketch-to-image-pix2pix-turbo.ipynb
 notebooks/yolov10-optimization/yolov10-optimization.ipynb # nncf from git
 notebooks/person-counting-webcam/person-counting.ipynb # numpy should be installed first
 notebooks/llava-multimodal-chatbot/videollava-multimodal-chatbot.ipynb # torchvision < 0.17.0
-notebooks/parler-tts-text-to-speech/parler-tts-text-to-speech.ipynb # torch >= 2.2
+notebooks/parler-tts-text-to-speech/parler-tts-text-to-speech.ipynb # torch >= 2.2
+notebooks/stable-diffusion-v3/stable-diffusion-v3.ipynb # diffusers from git
diff --git a/.ci/ignore_treon_docker.txt b/.ci/ignore_treon_docker.txt
@@ -68,3 +68,4 @@ notebooks/yolov10-optimization/yolov10-optimization.ipynb
 notebooks/whisper-subtitles-generation/whisper-subtitles-generation.ipynb
 notebooks/speechbrain-emotion-recognition/speechbrain-emotion-recognition.ipynb
 notebooks/hunyuan-dit-image-generation/hunyuan-dit-image-generation.ipynb
+notebooks/stable-diffusion-v3/stable-diffusion-v3.ipynb
diff --git a/.ci/ignore_treon_linux.txt b/.ci/ignore_treon_linux.txt
@@ -67,4 +67,5 @@ notebooks/stable-cascade-image-generation/stable-cascade-image-generation.ipynb
 notebooks/dynamicrafter-animating-images/dynamicrafter-animating-images.ipynb
 notebooks/yolov10-optimization/yolov10-optimization.ipynb
 notebooks/whisper-subtitles-generation/whisper-subtitles-generation.ipynb
-notebooks/hunyuan-dit-image-generation/hunyuan-dit-image-generation.ipynb
+notebooks/hunyuan-dit-image-generation/hunyuan-dit-image-generation.ipynb
+notebooks/stable-diffusion-v3/stable-diffusion-v3.ipynb
diff --git a/.ci/ignore_treon_mac.txt b/.ci/ignore_treon_mac.txt
@@ -69,4 +69,5 @@ notebooks/dynamicrafter-animating-images/dynamicrafter-animating-images.ipynb
 notebooks/yolov10-optimization/yolov10-optimization.ipynb
 notebooks/nano-llava-multimodal-chatbot/nano-llava-multimodal-chatbot.ipynb
 notebooks/whisper-subtitles-generation/whisper-subtitles-generation.ipynb
-notebooks/hunyuan-dit-image-generation/hunyuan-dit-image-generation.ipynb
+notebooks/hunyuan-dit-image-generation/hunyuan-dit-image-generation.ipynb
+notebooks/stable-diffusion-v3/stable-diffusion-v3.ipynb
diff --git a/.ci/ignore_treon_win.txt b/.ci/ignore_treon_win.txt
@@ -66,3 +66,4 @@ notebooks/dynamicrafter-animating-images/dynamicrafter-animating-images.ipynb
 notebooks/yolov10-optimization/yolov10-optimization.ipynb
 notebooks/whisper-subtitles-generation/whisper-subtitles-generation.ipynb
 notebooks/hunyuan-dit-image-generation/hunyuan-dit-image-generation.ipynb
+notebooks/stable-diffusion-v3/stable-diffusion-v3.ipynb
diff --git a/.ci/spellcheck/.pyspelling.wordlist.txt b/.ci/spellcheck/.pyspelling.wordlist.txt
@@ -443,6 +443,7 @@ MLLM
 MLLMs
 MMVLM
 MLP
+MMDiT
 MobileCLIP
 MobileLLaMA
 mobilenet

diff --git a/notebooks/stable-diffusion-v3/README.md b/notebooks/stable-diffusion-v3/README.md
@@ -0,0 +1,42 @@
+# Image generation with Stable Diffusion v3 and OpenVINO
+
+Stable Diffusion v3 is the next generation of the Stable Diffusion family of latent diffusion image models. According to human preference evaluations, it outperforms state-of-the-art text-to-image generation systems in typography and prompt adherence. In comparison with previous versions, it is based on the Multimodal Diffusion Transformer (MMDiT), a text-to-image model that features greatly improved image quality, typography, complex prompt understanding, and resource efficiency.
+
+![mmdit.png](https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/dd079427-89f2-4d28-a10e-c80792d750bf)
+
+More details about the model can be found in the [model card](https://huggingface.co/stabilityai/stable-diffusion-3-medium), [research paper](https://stability.ai/news/stable-diffusion-3-research-paper) and [Stability.AI blog post](https://stability.ai/news/stable-diffusion-3-medium).
+In this tutorial, we will consider how to convert and optimize Stable Diffusion v3 for running with OpenVINO.
+If you want to run previous Stable Diffusion versions, please check our other notebooks:
+
+* [Stable Diffusion](../stable-diffusion-text-to-image)
+* [Stable Diffusion v2](../stable-diffusion-v2)
+* [Stable Diffusion XL](../stable-diffusion-xl)
+* [LCM Stable Diffusion](../latent-consistency-models-image-generation)
+* [Turbo SDXL](../sdxl-turbo)
+* [Turbo SD](../sketch-to-image-pix2pix-turbo)
+
+
+The notebook provides a simple interface for interacting with the model through text instructions. In this demonstration, the user provides an input prompt and the model generates an image. An additional part demonstrates how to optimize the model with [NNCF](https://github.com/openvinotoolkit/nncf/) to speed up the pipeline and reduce memory consumption.
+
+The image below shows an example of a generated image.
+
+![text2img_example.png](https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/ac99098c-66ec-4b7b-9e01-e80625f1dc3f)
+
+>**Note**: Some demonstrated models can require at least 32GB RAM for conversion and running.
+
+### Notebook Contents
+
+The tutorial consists of the following steps:
+
+- Install prerequisites
+- Collect PyTorch model pipeline
+- Convert model to OpenVINO intermediate representation (IR) format and compress weights using NNCF
+- Prepare OpenVINO Inference pipeline
+- Run Text-to-Image generation
+- Launch interactive demo
+
+## Installation Instructions
+
+This is a self-contained example that relies solely on its own code.<br>
+We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
+For details, please refer to [Installation Guide](../../README.md).
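
The "convert to IR and compress weights" step listed in the README above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the component names, the `ir_paths` and `convert_and_compress` helpers, and the directory layout are hypothetical. It assumes `openvino` and `nncf` are installed and uses `ov.convert_model`, `nncf.compress_weights` (with its default 8-bit mode), and `ov.save_model`.

```python
from pathlib import Path


def ir_paths(model_dir, components=("transformer", "text_encoder", "vae_decoder")):
    """Map each (hypothetical) pipeline component name to its IR .xml path."""
    root = Path(model_dir)
    return {name: root / f"{name}.xml" for name in components}


def convert_and_compress(torch_module, example_input, xml_path, compress=True):
    """Convert one PyTorch module to OpenVINO IR, optionally compressing weights.

    Assumption: each pipeline component is converted separately and saved
    as its own .xml/.bin pair.
    """
    # Imported lazily so that ir_paths() works without these packages installed.
    import nncf
    import openvino as ov

    # Trace the PyTorch module with a representative input.
    ov_model = ov.convert_model(torch_module, example_input=example_input)
    if compress:
        # With default arguments, NNCF applies 8-bit weight compression.
        ov_model = nncf.compress_weights(ov_model)
    # Writes the .xml file and a matching .bin file with the weights.
    ov.save_model(ov_model, xml_path)
    return xml_path
```

A caller would typically skip conversion when the target file already exists, for example `if not path.exists(): convert_and_compress(module, example, path)`, because conversion of large models is slow and memory-hungry.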