From 8b31a993a4b084a3c9b2ad49648b17718eb9a3c6 Mon Sep 17 00:00:00 2001 From: isamu-isozaki Date: Sat, 21 Oct 2023 23:28:55 -0400 Subject: [PATCH 1/9] Finished _toctree.yml and index.md --- docs/source/jp/_toctree.yml | 84 +++ .../stable_diffusion/stable_diffusion_xl.md | 52 ++ docs/source/jp/in_translation.md | 4 + docs/source/jp/index.md | 98 +++ docs/source/jp/installation.md | 146 ++++ docs/source/jp/optimization/fp16.md | 68 ++ docs/source/jp/optimization/memory.md | 357 +++++++++ docs/source/jp/optimization/open_vino.md | 81 ++ docs/source/jp/optimization/opt_overview.md | 17 + docs/source/jp/optimization/tome.md | 89 +++ docs/source/jp/optimization/torch2.0.md | 434 +++++++++++ docs/source/jp/quicktour.md | 314 ++++++++ docs/source/jp/stable_diffusion.md | 260 +++++++ docs/source/jp/training/create_dataset.md | 90 +++ docs/source/jp/training/dreambooth.md | 710 ++++++++++++++++++ docs/source/jp/training/overview.md | 84 +++ docs/source/jp/training/text_inversion.md | 277 +++++++ docs/source/jp/tutorials/autopipeline.md | 146 ++++ docs/source/jp/tutorials/basic_training.md | 404 ++++++++++ docs/source/jp/tutorials/tutorial_overview.md | 23 + .../jp/tutorials/using_peft_for_inference.md | 165 ++++ .../jp/using-diffusers/pipeline_overview.md | 17 + docs/source/jp/using-diffusers/sdxl.md | 431 +++++++++++ 23 files changed, 4351 insertions(+) create mode 100644 docs/source/jp/_toctree.yml create mode 100644 docs/source/jp/api/pipelines/stable_diffusion/stable_diffusion_xl.md create mode 100644 docs/source/jp/in_translation.md create mode 100644 docs/source/jp/index.md create mode 100644 docs/source/jp/installation.md create mode 100644 docs/source/jp/optimization/fp16.md create mode 100644 docs/source/jp/optimization/memory.md create mode 100644 docs/source/jp/optimization/open_vino.md create mode 100644 docs/source/jp/optimization/opt_overview.md create mode 100644 docs/source/jp/optimization/tome.md create mode 100644 docs/source/jp/optimization/torch2.0.md create mode 100644 docs/source/jp/quicktour.md create mode 100644 docs/source/jp/stable_diffusion.md create mode 100644 docs/source/jp/training/create_dataset.md create mode 100644 docs/source/jp/training/dreambooth.md create mode 100644 docs/source/jp/training/overview.md create mode 100644 docs/source/jp/training/text_inversion.md create mode 100644 docs/source/jp/tutorials/autopipeline.md create mode 100644 docs/source/jp/tutorials/basic_training.md create mode 100644 docs/source/jp/tutorials/tutorial_overview.md create mode 100644 docs/source/jp/tutorials/using_peft_for_inference.md create mode 100644 docs/source/jp/using-diffusers/pipeline_overview.md create mode 100644 docs/source/jp/using-diffusers/sdxl.md diff --git a/docs/source/jp/_toctree.yml b/docs/source/jp/_toctree.yml new file mode 100644 index 000000000000..f5c8b7d1ddea --- /dev/null +++ b/docs/source/jp/_toctree.yml @@ -0,0 +1,84 @@ +- sections: + - local: index + title: 🧚 Diffusers + - local: quicktour + title: 簡単な案内 + - local: stable_diffusion + title: 効果的で効率的な拡散モデル + - local: installation + title: むンストヌル + - local: in_translation + title: 翻蚳に぀いお + title: はじめに +- sections: + - local: tutorials/tutorial_overview + title: 抂芁 + - local: using-diffusers/write_own_pipeline + title: モデルずスケゞュヌラを理解する + - local: tutorials/autopipeline + title: 自動パむプラむン + - local: tutorials/basic_training + title: 拡散モデルのトレヌニング + - local: tutorials/using_peft_for_inference + title: PEFTで効率よく生成 + title: チュヌトリアル +- sections: + - sections: + - local: using-diffusers/loading_overview + title: 抂芁 
+ - local: using-diffusers/loading + title: パむプラむン、モデル、スケゞュヌラのロヌド + - local: using-diffusers/schedulers + title: 異なるスケゞュヌラのロヌドず比范 + - local: using-diffusers/custom_pipeline_overview + title: コミュニティ・パむプラむンをロヌドする + - local: using-diffusers/using_safetensors + title: 安党なモデルのロヌド + - local: using-diffusers/other-formats + title: 様々なStable Diffusionフォヌマットのロヌド + - local: using-diffusers/push_to_hub + title: ファむルをハブにプッシュする + title: 技術 + - sections: + - local: using-diffusers/pipeline_overview + title: 抂芁 + - local: using-diffusers/sdxl + title: Stable Diffusion XL + title: 生成のためのパむプラむン + - sections: + - local: training/overview + title: 抂芁 + - local: training/create_dataset + title: トレヌニングのためのデヌタセット䜜成 + - local: training/text_inversion + title: Textual Inversion + - local: training/dreambooth + title: Dreambooth + title: トレヌニング + title: Diffusersの䜿い方 +- sections: + - local: optimization/opt_overview + title: 抂芁 + - sections: + - local: optimization/fp16 + title: 生成のスピヌドアップ + - local: optimization/memory + title: メモリヌ䜿甚の削枛 + - local: optimization/torch2.0 + title: Torch 2.0 + - local: optimization/tome + title: トヌクンのマヌゞ + title: 䞀般的な最適化 + - sections: + - local: optimization/open_vino + title: OpenVINO + title: 最適化されたモデルタむプ + title: 最適化 +- sections: + - sections: + - sections: + - local: api/pipelines/stable_diffusion/stable_diffusion_xl + title: Stable Diffusion XL + title: Stable Diffusion + title: パむプラむン + title: API diff --git a/docs/source/jp/api/pipelines/stable_diffusion/stable_diffusion_xl.md b/docs/source/jp/api/pipelines/stable_diffusion/stable_diffusion_xl.md new file mode 100644 index 000000000000..aedb03d51caf --- /dev/null +++ b/docs/source/jp/api/pipelines/stable_diffusion/stable_diffusion_xl.md @@ -0,0 +1,52 @@ + + +# Stable Diffusion XL + +Stable Diffusion XL (SDXL) was proposed in [SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis](https://huggingface.co/papers/2307.01952) by Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas MÃŒller, Joe Penna, and Robin Rombach. + +The abstract from the paper is: + +*We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators.* + +## Tips + +- Most SDXL checkpoints work best with an image size of 1024x1024. Image sizes of 768x768 and 512x512 are also supported, but the results aren't as good. Anything below 512x512 is not recommended and likely won't for for default checkpoints like [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0). +- SDXL can pass a different prompt for each of the text encoders it was trained on. We can even pass different parts of the same prompt to the text encoders. +- SDXL output images can be improved by making use of a refiner model in an image-to-image setting. 
+- SDXL offers `negative_original_size`, `negative_crops_coords_top_left`, and `negative_target_size` to negatively condition the model on image resolution and cropping parameters. + + + +To learn how to use SDXL for various tasks, how to optimize performance, and other usage examples, take a look at the [Stable Diffusion XL](../../../using-diffusers/sdxl) guide. + +Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the official base and refiner model checkpoints! + + + +## StableDiffusionXLPipeline + +[[autodoc]] StableDiffusionXLPipeline + - all + - __call__ + +## StableDiffusionXLImg2ImgPipeline + +[[autodoc]] StableDiffusionXLImg2ImgPipeline + - all + - __call__ + +## StableDiffusionXLInpaintPipeline + +[[autodoc]] StableDiffusionXLInpaintPipeline + - all + - __call__ diff --git a/docs/source/jp/in_translation.md b/docs/source/jp/in_translation.md new file mode 100644 index 000000000000..72f38f363687 --- /dev/null +++ b/docs/source/jp/in_translation.md @@ -0,0 +1,4 @@ +# 翻蚳䞭 + +䞀生懞呜翻蚳を進行䞭です。少しだけお埅ちください。 +ありがずうございたす。 \ No newline at end of file diff --git a/docs/source/jp/index.md b/docs/source/jp/index.md new file mode 100644 index 000000000000..6e8ba78dd55f --- /dev/null +++ b/docs/source/jp/index.md @@ -0,0 +1,98 @@ + + +

+ +# Diffusers + +🀗 Diffusers は、画像や音声、さらには分子の3D構造を生成するための、最先端の事前孊習枈みDiffusion Model(拡散モデル)を提䟛するラむブラリです。シンプルな生成゜リュヌションをお探しの堎合でも、独自の拡散モデルをトレヌニングしたい堎合でも、🀗 Diffusers はその䞡方をサポヌトするモゞュヌル匏のツヌルボックスです。我々のラむブラリは、[性胜より䜿いやすさ](conceptual/philosophy#usability-over-performance)、[簡単よりシンプル](conceptual/philosophy#simple-over-easy)、[抜象化よりカスタマむズ性](conceptual/philosophy#tweakable-contributorfriendly-over-abstraction)に重点を眮いお蚭蚈されおいたす。 + +このラむブラリには3぀の䞻芁コンポヌネントがありたす: + +- 最先端の[拡散パむプラむン](api/pipelines/overview)で数行のコヌドで生成が可胜です。 +- 亀換可胜な[ノむズスケゞュヌラ](api/schedulers/overview)で生成速床ず品質のトレヌドオフのバランスをずれたす。 +- 事前に蚓緎された[モデル](api/models)は、ビルディングブロックずしお䜿甚するこずができ、スケゞュヌラず組み合わせるこずで、独自の゚ンドツヌ゚ンドの拡散システムを䜜成するこずができたす。 + +
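+䟋えば、以䞋は事前孊習枈みのパむプラむンをロヌドし、スケゞュヌラを差し替えお画像を生成する最小限の䟋です(モデルIDやスケゞュヌラの遞択はあくたで䞀䟋です):
+
+```python
+import torch
+from diffusers import DiffusionPipeline, DDIMScheduler
+
+# 事前孊習枈みパむプラむン(モデル + スケゞュヌラ)をロヌド
+pipeline = DiffusionPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
+).to("cuda")
+
+# スケゞュヌラは蚭定を匕き継いだたた簡単に差し替えられたす
+pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
+
+image = pipeline("a photo of an astronaut riding a horse on mars").images[0]
+```
+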
+ドキュメントは以䞋の4぀のセクションで構成されおいたす:
+
+- **チュヌトリアル**: 出力の生成、独自の拡散システムの構築、拡散モデルのトレヌニングを開始するために必芁な基本的なスキルを孊ぶこずができたす。初めお🀗Diffusersを䜿甚する堎合は、ここから始めるこずをお勧めしたす。
+- **ガむド**: パむプラむン、モデル、スケゞュヌラのロヌドに圹立぀実践的なガむドです。たた、特定のタスクにパむプラむンを䜿甚する方法、出力の生成方法を制埡する方法、生成速床を最適化する方法、さたざたなトレヌニング手法に぀いおも孊ぶこずができたす。
+- **Conceptual guides**: ラむブラリがなぜこのように蚭蚈されたのかを理解し、ラむブラリを利甚する際の倫理的ガむドラむンや安党察策に぀いお詳しく孊べたす。
+- **Reference**: 🀗 Diffusersのクラスずメ゜ッドがどのように機胜するかに぀いおの技術的な説明です。
+ +## Supported pipelines + +| Pipeline | Paper/Repository | Tasks | +|---|---|:---:| +| [alt_diffusion](./api/pipelines/alt_diffusion) | [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) | Image-to-Image Text-Guided Generation | +| [audio_diffusion](./api/pipelines/audio_diffusion) | [Audio Diffusion](https://github.com/teticio/audio-diffusion.git) | Unconditional Audio Generation | +| [controlnet](./api/pipelines/controlnet) | [Adding Conditional Control to Text-to-Image Diffusion Models](https://arxiv.org/abs/2302.05543) | Image-to-Image Text-Guided Generation | +| [cycle_diffusion](./api/pipelines/cycle_diffusion) | [Unifying Diffusion Models' Latent Space, with Applications to CycleDiffusion and Guidance](https://arxiv.org/abs/2210.05559) | Image-to-Image Text-Guided Generation | +| [dance_diffusion](./api/pipelines/dance_diffusion) | [Dance Diffusion](https://github.com/williamberman/diffusers.git) | Unconditional Audio Generation | +| [ddpm](./api/pipelines/ddpm) | [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239) | Unconditional Image Generation | +| [ddim](./api/pipelines/ddim) | [Denoising Diffusion Implicit Models](https://arxiv.org/abs/2010.02502) | Unconditional Image Generation | +| [if](./if) | [**IF**](./api/pipelines/if) | Image Generation | +| [if_img2img](./if) | [**IF**](./api/pipelines/if) | Image-to-Image Generation | +| [if_inpainting](./if) | [**IF**](./api/pipelines/if) | Image-to-Image Generation | +| [latent_diffusion](./api/pipelines/latent_diffusion) | [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752)| Text-to-Image Generation | +| [latent_diffusion](./api/pipelines/latent_diffusion) | [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752)| Super Resolution Image-to-Image | +| [latent_diffusion_uncond](./api/pipelines/latent_diffusion_uncond) | [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752) | Unconditional Image Generation | +| [paint_by_example](./api/pipelines/paint_by_example) | [Paint by Example: Exemplar-based Image Editing with Diffusion Models](https://arxiv.org/abs/2211.13227) | Image-Guided Image Inpainting | +| [pndm](./api/pipelines/pndm) | [Pseudo Numerical Methods for Diffusion Models on Manifolds](https://arxiv.org/abs/2202.09778) | Unconditional Image Generation | +| [score_sde_ve](./api/pipelines/score_sde_ve) | [Score-Based Generative Modeling through Stochastic Differential Equations](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation | +| [score_sde_vp](./api/pipelines/score_sde_vp) | [Score-Based Generative Modeling through Stochastic Differential Equations](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation | +| [semantic_stable_diffusion](./api/pipelines/semantic_stable_diffusion) | [Semantic Guidance](https://arxiv.org/abs/2301.12247) | Text-Guided Generation | +| [stable_diffusion_adapter](./api/pipelines/stable_diffusion/adapter) | [**T2I-Adapter**](https://arxiv.org/abs/2302.08453) | Image-to-Image Text-Guided Generation | - +| [stable_diffusion_text2img](./api/pipelines/stable_diffusion/text2img) | [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release) | Text-to-Image Generation | +| [stable_diffusion_img2img](./api/pipelines/stable_diffusion/img2img) | [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release) 
| Image-to-Image Text-Guided Generation | +| [stable_diffusion_inpaint](./api/pipelines/stable_diffusion/inpaint) | [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release) | Text-Guided Image Inpainting | +| [stable_diffusion_panorama](./api/pipelines/stable_diffusion/panorama) | [MultiDiffusion](https://multidiffusion.github.io/) | Text-to-Panorama Generation | +| [stable_diffusion_pix2pix](./api/pipelines/stable_diffusion/pix2pix) | [InstructPix2Pix: Learning to Follow Image Editing Instructions](https://arxiv.org/abs/2211.09800) | Text-Guided Image Editing| +| [stable_diffusion_pix2pix_zero](./api/pipelines/stable_diffusion/pix2pix_zero) | [Zero-shot Image-to-Image Translation](https://pix2pixzero.github.io/) | Text-Guided Image Editing | +| [stable_diffusion_attend_and_excite](./api/pipelines/stable_diffusion/attend_and_excite) | [Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models](https://arxiv.org/abs/2301.13826) | Text-to-Image Generation | +| [stable_diffusion_self_attention_guidance](./api/pipelines/stable_diffusion/self_attention_guidance) | [Improving Sample Quality of Diffusion Models Using Self-Attention Guidance](https://arxiv.org/abs/2210.00939) | Text-to-Image Generation Unconditional Image Generation | +| [stable_diffusion_image_variation](./stable_diffusion/image_variation) | [Stable Diffusion Image Variations](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations) | Image-to-Image Generation | +| [stable_diffusion_latent_upscale](./stable_diffusion/latent_upscale) | [Stable Diffusion Latent Upscaler](https://twitter.com/StabilityAI/status/1590531958815064065) | Text-Guided Super Resolution Image-to-Image | +| [stable_diffusion_model_editing](./api/pipelines/stable_diffusion/model_editing) | [Editing Implicit Assumptions in Text-to-Image Diffusion Models](https://time-diffusion.github.io/) | Text-to-Image Model Editing | +| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [Stable Diffusion 2](https://stability.ai/blog/stable-diffusion-v2-release) | Text-to-Image Generation | +| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [Stable Diffusion 2](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Image Inpainting | +| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [Depth-Conditional Stable Diffusion](https://github.com/Stability-AI/stablediffusion#depth-conditional-stable-diffusion) | Depth-to-Image Generation | +| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [Stable Diffusion 2](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Super Resolution Image-to-Image | +| [stable_diffusion_safe](./api/pipelines/stable_diffusion_safe) | [Safe Stable Diffusion](https://arxiv.org/abs/2211.05105) | Text-Guided Generation | +| [stable_unclip](./stable_unclip) | Stable unCLIP | Text-to-Image Generation | +| [stable_unclip](./stable_unclip) | Stable unCLIP | Image-to-Image Text-Guided Generation | +| [stochastic_karras_ve](./api/pipelines/stochastic_karras_ve) | [Elucidating the Design Space of Diffusion-Based Generative Models](https://arxiv.org/abs/2206.00364) | Unconditional Image Generation | +| [text_to_video_sd](./api/pipelines/text_to_video) | [Modelscope's Text-to-video-synthesis Model in Open Domain](https://modelscope.cn/models/damo/text-to-video-synthesis/summary) | Text-to-Video Generation | +| [unclip](./api/pipelines/unclip) | [Hierarchical Text-Conditional Image Generation with CLIP 
Latents](https://arxiv.org/abs/2204.06125)(implementation by [kakaobrain](https://github.com/kakaobrain/karlo)) | Text-to-Image Generation | +| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Text-to-Image Generation | +| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Image Variations Generation | +| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Dual Image and Text Guided Generation | +| [vq_diffusion](./api/pipelines/vq_diffusion) | [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://arxiv.org/abs/2111.14822) | Text-to-Image Generation | +| [stable_diffusion_ldm3d](./api/pipelines/stable_diffusion/ldm3d_diffusion) | [LDM3D: Latent Diffusion Model for 3D](https://arxiv.org/abs/2305.10853) | Text to Image and Depth Generation | diff --git a/docs/source/jp/installation.md b/docs/source/jp/installation.md new file mode 100644 index 000000000000..1a0951bf7bba --- /dev/null +++ b/docs/source/jp/installation.md @@ -0,0 +1,146 @@ + + +# Installation + +Install 🀗 Diffusers for whichever deep learning library you're working with. + +🀗 Diffusers is tested on Python 3.8+, PyTorch 1.7.0+ and Flax. Follow the installation instructions below for the deep learning library you are using: + +- [PyTorch](https://pytorch.org/get-started/locally/) installation instructions. +- [Flax](https://flax.readthedocs.io/en/latest/) installation instructions. + +## Install with pip + +You should install 🀗 Diffusers in a [virtual environment](https://docs.python.org/3/library/venv.html). +If you're unfamiliar with Python virtual environments, take a look at this [guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/). +A virtual environment makes it easier to manage different projects and avoid compatibility issues between dependencies. + +Start by creating a virtual environment in your project directory: + +```bash +python -m venv .env +``` + +Activate the virtual environment: + +```bash +source .env/bin/activate +``` + +🀗 Diffusers also relies on the 🀗 Transformers library, and you can install both with the following command: + + + +```bash +pip install diffusers["torch"] transformers +``` + + +```bash +pip install diffusers["flax"] transformers +``` + + + +## Install from source + +Before installing 🀗 Diffusers from source, make sure you have `torch` and 🀗 Accelerate installed. + +For `torch` installation, refer to the `torch` [installation](https://pytorch.org/get-started/locally/#start-locally) guide. + +To install 🀗 Accelerate: + +```bash +pip install accelerate +``` + +Install 🀗 Diffusers from source with the following command: + +```bash +pip install git+https://github.com/huggingface/diffusers +``` + +This command installs the bleeding edge `main` version rather than the latest `stable` version. +The `main` version is useful for staying up-to-date with the latest developments. +For instance, if a bug has been fixed since the last official release but a new release hasn't been rolled out yet. +However, this means the `main` version may not always be stable. +We strive to keep the `main` version operational, and most issues are usually resolved within a few hours or a day. 
+If you run into a problem, please open an [Issue](https://github.com/huggingface/diffusers/issues/new/choose), so we can fix it even sooner! + +## Editable install + +You will need an editable install if you'd like to: + +* Use the `main` version of the source code. +* Contribute to 🀗 Diffusers and need to test changes in the code. + +Clone the repository and install 🀗 Diffusers with the following commands: + +```bash +git clone https://github.com/huggingface/diffusers.git +cd diffusers +``` + + + +```bash +pip install -e ".[torch]" +``` + + +```bash +pip install -e ".[flax]" +``` + + + +These commands will link the folder you cloned the repository to and your Python library paths. +Python will now look inside the folder you cloned to in addition to the normal library paths. +For example, if your Python packages are typically installed in `~/anaconda3/envs/main/lib/python3.8/site-packages/`, Python will also search the `~/diffusers/` folder you cloned to. + + + +You must keep the `diffusers` folder if you want to keep using the library. + + + +Now you can easily update your clone to the latest version of 🀗 Diffusers with the following command: + +```bash +cd ~/diffusers/ +git pull +``` + +Your Python environment will find the `main` version of 🀗 Diffusers on the next run. + +## Notice on telemetry logging + +Our library gathers telemetry information during `from_pretrained()` requests. +This data includes the version of Diffusers and PyTorch/Flax, the requested model or pipeline class, +and the path to a pre-trained checkpoint if it is hosted on the Hub. +This usage data helps us debug issues and prioritize new features. +Telemetry is only sent when loading models and pipelines from the HuggingFace Hub, +and is not collected during local usage. + +We understand that not everyone wants to share additional information, and we respect your privacy, +so you can disable telemetry collection by setting the `DISABLE_TELEMETRY` environment variable from your terminal: + +On Linux/MacOS: +```bash +export DISABLE_TELEMETRY=YES +``` + +On Windows: +```bash +set DISABLE_TELEMETRY=YES +``` diff --git a/docs/source/jp/optimization/fp16.md b/docs/source/jp/optimization/fp16.md new file mode 100644 index 000000000000..2ac16786eb46 --- /dev/null +++ b/docs/source/jp/optimization/fp16.md @@ -0,0 +1,68 @@ + + +# Speed up inference + +There are several ways to optimize 🀗 Diffusers for inference speed. As a general rule of thumb, we recommend using either [xFormers](xformers) or `torch.nn.functional.scaled_dot_product_attention` in PyTorch 2.0 for their memory-efficient attention. + + + +In many cases, optimizing for speed or memory leads to improved performance in the other, so you should try to optimize for both whenever you can. This guide focuses on inference speed, but you can learn more about preserving memory in the [Reduce memory usage](memory) guide. + + + +The results below are obtained from generating a single 512x512 image from the prompt `a photo of an astronaut riding a horse on mars` with 50 DDIM steps on a Nvidia Titan RTX, demonstrating the speed-up you can expect. 
+ +| | latency | speed-up | +| ---------------- | ------- | ------- | +| original | 9.50s | x1 | +| fp16 | 3.61s | x2.63 | +| channels last | 3.30s | x2.88 | +| traced UNet | 3.21s | x2.96 | +| memory efficient attention | 2.63s | x3.61 | + +## Use TensorFloat-32 + +On Ampere and later CUDA devices, matrix multiplications and convolutions can use the [TensorFloat-32 (TF32)](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) mode for faster, but slightly less accurate computations. By default, PyTorch enables TF32 mode for convolutions but not matrix multiplications. Unless your network requires full float32 precision, we recommend enabling TF32 for matrix multiplications. It can significantly speeds up computations with typically negligible loss in numerical accuracy. + +```python +import torch + +torch.backends.cuda.matmul.allow_tf32 = True +``` + +You can learn more about TF32 in the [Mixed precision training](https://huggingface.co/docs/transformers/en/perf_train_gpu_one#tf32) guide. + +## Half-precision weights + +To save GPU memory and get more speed, try loading and running the model weights directly in half-precision or float16: + +```Python +import torch +from diffusers import DiffusionPipeline + +pipe = DiffusionPipeline.from_pretrained( + "runwayml/stable-diffusion-v1-5", + torch_dtype=torch.float16, + use_safetensors=True, +) +pipe = pipe.to("cuda") + +prompt = "a photo of an astronaut riding a horse on mars" +image = pipe(prompt).images[0] +``` + + + +Don't use [`torch.autocast`](https://pytorch.org/docs/stable/amp.html#torch.autocast) in any of the pipelines as it can lead to black images and is always slower than pure float16 precision. + + \ No newline at end of file diff --git a/docs/source/jp/optimization/memory.md b/docs/source/jp/optimization/memory.md new file mode 100644 index 000000000000..c91fed1b2784 --- /dev/null +++ b/docs/source/jp/optimization/memory.md @@ -0,0 +1,357 @@ +# Reduce memory usage + +A barrier to using diffusion models is the large amount of memory required. To overcome this challenge, there are several memory-reducing techniques you can use to run even some of the largest models on free-tier or consumer GPUs. Some of these techniques can even be combined to further reduce memory usage. + + + +In many cases, optimizing for memory or speed leads to improved performance in the other, so you should try to optimize for both whenever you can. This guide focuses on minimizing memory usage, but you can also learn more about how to [Speed up inference](fp16). + + + +The results below are obtained from generating a single 512x512 image from the prompt a photo of an astronaut riding a horse on mars with 50 DDIM steps on a Nvidia Titan RTX, demonstrating the speed-up you can expect as a result of reduced memory consumption. + +| | latency | speed-up | +| ---------------- | ------- | ------- | +| original | 9.50s | x1 | +| fp16 | 3.61s | x2.63 | +| channels last | 3.30s | x2.88 | +| traced UNet | 3.21s | x2.96 | +| memory-efficient attention | 2.63s | x3.61 | + + +## Sliced VAE + +Sliced VAE enables decoding large batches of images with limited VRAM or batches with 32 images or more by decoding the batches of latents one image at a time. You'll likely want to couple this with [`~ModelMixin.enable_xformers_memory_efficient_attention`] to further reduce memory use. 
+ +To use sliced VAE, call [`~StableDiffusionPipeline.enable_vae_slicing`] on your pipeline before inference: + +```python +import torch +from diffusers import StableDiffusionPipeline + +pipe = StableDiffusionPipeline.from_pretrained( + "runwayml/stable-diffusion-v1-5", + torch_dtype=torch.float16, + use_safetensors=True, +) +pipe = pipe.to("cuda") + +prompt = "a photo of an astronaut riding a horse on mars" +pipe.enable_vae_slicing() +images = pipe([prompt] * 32).images +``` + +You may see a small performance boost in VAE decoding on multi-image batches, and there should be no performance impact on single-image batches. + +## Tiled VAE + +Tiled VAE processing also enables working with large images on limited VRAM (for example, generating 4k images on 8GB of VRAM) by splitting the image into overlapping tiles, decoding the tiles, and then blending the outputs together to compose the final image. You should also used tiled VAE with [`~ModelMixin.enable_xformers_memory_efficient_attention`] to further reduce memory use. + +To use tiled VAE processing, call [`~StableDiffusionPipeline.enable_vae_tiling`] on your pipeline before inference: + +```python +import torch +from diffusers import StableDiffusionPipeline, UniPCMultistepScheduler + +pipe = StableDiffusionPipeline.from_pretrained( + "runwayml/stable-diffusion-v1-5", + torch_dtype=torch.float16, + use_safetensors=True, +) +pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config) +pipe = pipe.to("cuda") +prompt = "a beautiful landscape photograph" +pipe.enable_vae_tiling() +pipe.enable_xformers_memory_efficient_attention() + +image = pipe([prompt], width=3840, height=2224, num_inference_steps=20).images[0] +``` + +The output image has some tile-to-tile tone variation because the tiles are decoded separately, but you shouldn't see any sharp and obvious seams between the tiles. Tiling is turned off for images that are 512x512 or smaller. + +## CPU offloading + +Offloading the weights to the CPU and only loading them on the GPU when performing the forward pass can also save memory. Often, this technique can reduce memory consumption to less than 3GB. + +To perform CPU offloading, call [`~StableDiffusionPipeline.enable_sequential_cpu_offload`]: + +```Python +import torch +from diffusers import StableDiffusionPipeline + +pipe = StableDiffusionPipeline.from_pretrained( + "runwayml/stable-diffusion-v1-5", + torch_dtype=torch.float16, + use_safetensors=True, +) + +prompt = "a photo of an astronaut riding a horse on mars" +pipe.enable_sequential_cpu_offload() +image = pipe(prompt).images[0] +``` + +CPU offloading works on submodules rather than whole models. This is the best way to minimize memory consumption, but inference is much slower due to the iterative nature of the diffusion process. The UNet component of the pipeline runs several times (as many as `num_inference_steps`); each time, the different UNet submodules are sequentially onloaded and offloaded as needed, resulting in a large number of memory transfers. + + + +Consider using [model offloading](#model-offloading) if you want to optimize for speed because it is much faster. The tradeoff is your memory savings won't be as large. + + + +CPU offloading can also be chained with attention slicing to reduce memory consumption to less than 2GB. 
+ +```Python +import torch +from diffusers import StableDiffusionPipeline + +pipe = StableDiffusionPipeline.from_pretrained( + "runwayml/stable-diffusion-v1-5", + torch_dtype=torch.float16, + use_safetensors=True, +) + +prompt = "a photo of an astronaut riding a horse on mars" +pipe.enable_sequential_cpu_offload() + +image = pipe(prompt).images[0] +``` + + + +When using [`~StableDiffusionPipeline.enable_sequential_cpu_offload`], don't move the pipeline to CUDA beforehand or else the gain in memory consumption will only be minimal (see this [issue](https://github.com/huggingface/diffusers/issues/1934) for more information). + +[`~StableDiffusionPipeline.enable_sequential_cpu_offload`] is a stateful operation that installs hooks on the models. + + + +## Model offloading + + + +Model offloading requires 🀗 Accelerate version 0.17.0 or higher. + + + +[Sequential CPU offloading](#cpu-offloading) preserves a lot of memory but it makes inference slower because submodules are moved to GPU as needed, and they're immediately returned to the CPU when a new module runs. + +Full-model offloading is an alternative that moves whole models to the GPU, instead of handling each model's constituent *submodules*. There is a negligible impact on inference time (compared with moving the pipeline to `cuda`), and it still provides some memory savings. + +During model offloading, only one of the main components of the pipeline (typically the text encoder, UNet and VAE) +is placed on the GPU while the others wait on the CPU. Components like the UNet that run for multiple iterations stay on the GPU until they're no longer needed. + +Enable model offloading by calling [`~StableDiffusionPipeline.enable_model_cpu_offload`] on the pipeline: + +```Python +import torch +from diffusers import StableDiffusionPipeline + +pipe = StableDiffusionPipeline.from_pretrained( + "runwayml/stable-diffusion-v1-5", + torch_dtype=torch.float16, + use_safetensors=True, +) + +prompt = "a photo of an astronaut riding a horse on mars" +pipe.enable_model_cpu_offload() +image = pipe(prompt).images[0] +``` + +Model offloading can also be combined with attention slicing for additional memory savings. + +```Python +import torch +from diffusers import StableDiffusionPipeline + +pipe = StableDiffusionPipeline.from_pretrained( + "runwayml/stable-diffusion-v1-5", + torch_dtype=torch.float16, + use_safetensors=True, +) + +prompt = "a photo of an astronaut riding a horse on mars" +pipe.enable_model_cpu_offload() + +image = pipe(prompt).images[0] +``` + + + +In order to properly offload models after they're called, it is required to run the entire pipeline and models are called in the pipeline's expected order. Exercise caution if models are reused outside the context of the pipeline after hooks have been installed. See [Removing Hooks](https://huggingface.co/docs/accelerate/en/package_reference/big_modeling#accelerate.hooks.remove_hook_from_module) +for more information. + +[`~StableDiffusionPipeline.enable_model_cpu_offload`] is a stateful operation that installs hooks on the models and state on the pipeline. + + + +## Channels-last memory format + +The channels-last memory format is an alternative way of ordering NCHW tensors in memory to preserve dimension ordering. Channels-last tensors are ordered in such a way that the channels become the densest dimension (storing images pixel-per-pixel). 
Since not all operators currently support the channels-last format, it may result in worst performance but you should still try and see if it works for your model. + +For example, to set the pipeline's UNet to use the channels-last format: + +```python +print(pipe.unet.conv_out.state_dict()["weight"].stride()) # (2880, 9, 3, 1) +pipe.unet.to(memory_format=torch.channels_last) # in-place operation +print( + pipe.unet.conv_out.state_dict()["weight"].stride() +) # (2880, 1, 960, 320) having a stride of 1 for the 2nd dimension proves that it works +``` + +## Tracing + +Tracing runs an example input tensor through the model and captures the operations that are performed on it as that input makes its way through the model's layers. The executable or `ScriptFunction` that is returned is optimized with just-in-time compilation. + +To trace a UNet: + +```python +import time +import torch +from diffusers import StableDiffusionPipeline +import functools + +# torch disable grad +torch.set_grad_enabled(False) + +# set variables +n_experiments = 2 +unet_runs_per_experiment = 50 + + +# load inputs +def generate_inputs(): + sample = torch.randn(2, 4, 64, 64).half().cuda() + timestep = torch.rand(1).half().cuda() * 999 + encoder_hidden_states = torch.randn(2, 77, 768).half().cuda() + return sample, timestep, encoder_hidden_states + + +pipe = StableDiffusionPipeline.from_pretrained( + "runwayml/stable-diffusion-v1-5", + torch_dtype=torch.float16, + use_safetensors=True, +).to("cuda") +unet = pipe.unet +unet.eval() +unet.to(memory_format=torch.channels_last) # use channels_last memory format +unet.forward = functools.partial(unet.forward, return_dict=False) # set return_dict=False as default + +# warmup +for _ in range(3): + with torch.inference_mode(): + inputs = generate_inputs() + orig_output = unet(*inputs) + +# trace +print("tracing..") +unet_traced = torch.jit.trace(unet, inputs) +unet_traced.eval() +print("done tracing") + + +# warmup and optimize graph +for _ in range(5): + with torch.inference_mode(): + inputs = generate_inputs() + orig_output = unet_traced(*inputs) + + +# benchmarking +with torch.inference_mode(): + for _ in range(n_experiments): + torch.cuda.synchronize() + start_time = time.time() + for _ in range(unet_runs_per_experiment): + orig_output = unet_traced(*inputs) + torch.cuda.synchronize() + print(f"unet traced inference took {time.time() - start_time:.2f} seconds") + for _ in range(n_experiments): + torch.cuda.synchronize() + start_time = time.time() + for _ in range(unet_runs_per_experiment): + orig_output = unet(*inputs) + torch.cuda.synchronize() + print(f"unet inference took {time.time() - start_time:.2f} seconds") + +# save the model +unet_traced.save("unet_traced.pt") +``` + +Replace the `unet` attribute of the pipeline with the traced model: + +```python +from diffusers import StableDiffusionPipeline +import torch +from dataclasses import dataclass + + +@dataclass +class UNet2DConditionOutput: + sample: torch.FloatTensor + + +pipe = StableDiffusionPipeline.from_pretrained( + "runwayml/stable-diffusion-v1-5", + torch_dtype=torch.float16, + use_safetensors=True, +).to("cuda") + +# use jitted unet +unet_traced = torch.jit.load("unet_traced.pt") + + +# del pipe.unet +class TracedUNet(torch.nn.Module): + def __init__(self): + super().__init__() + self.in_channels = pipe.unet.in_channels + self.device = pipe.unet.device + + def forward(self, latent_model_input, t, encoder_hidden_states): + sample = unet_traced(latent_model_input, t, encoder_hidden_states)[0] + return 
UNet2DConditionOutput(sample=sample) + + +pipe.unet = TracedUNet() + +with torch.inference_mode(): + image = pipe([prompt] * 1, num_inference_steps=50).images[0] +``` + +## Memory-efficient attention + +Recent work on optimizing bandwidth in the attention block has generated huge speed-ups and reductions in GPU memory usage. The most recent type of memory-efficient attention is [Flash Attention](https://arxiv.org/pdf/2205.14135.pdf) (you can check out the original code at [HazyResearch/flash-attention](https://github.com/HazyResearch/flash-attention)). + + + +If you have PyTorch >= 2.0 installed, you should not expect a speed-up for inference when enabling `xformers`. + + + +To use Flash Attention, install the following: + +- PyTorch > 1.12 +- CUDA available +- [xFormers](xformers) + +Then call [`~ModelMixin.enable_xformers_memory_efficient_attention`] on the pipeline: + +```python +from diffusers import DiffusionPipeline +import torch + +pipe = DiffusionPipeline.from_pretrained( + "runwayml/stable-diffusion-v1-5", + torch_dtype=torch.float16, + use_safetensors=True, +).to("cuda") + +pipe.enable_xformers_memory_efficient_attention() + +with torch.inference_mode(): + sample = pipe("a small cat") + +# optional: You can disable it via +# pipe.disable_xformers_memory_efficient_attention() +``` + +The iteration speed when using `xformers` should match the iteration speed of Torch 2.0 as described [here](torch2.0). diff --git a/docs/source/jp/optimization/open_vino.md b/docs/source/jp/optimization/open_vino.md new file mode 100644 index 000000000000..606c2207bcda --- /dev/null +++ b/docs/source/jp/optimization/open_vino.md @@ -0,0 +1,81 @@ + + + +# OpenVINO + +🀗 [Optimum](https://github.com/huggingface/optimum-intel) provides Stable Diffusion pipelines compatible with OpenVINO to perform inference on a variety of Intel processors (see the [full list]((https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html)) of supported devices). + +You'll need to install 🀗 Optimum Intel with the `--upgrade-strategy eager` option to ensure [`optimum-intel`](https://github.com/huggingface/optimum-intel) is using the latest version: + +``` +pip install --upgrade-strategy eager optimum["openvino"] +``` + +This guide will show you how to use the Stable Diffusion and Stable Diffusion XL (SDXL) pipelines with OpenVINO. + +## Stable Diffusion + +To load and run inference, use the [`~optimum.intel.OVStableDiffusionPipeline`]. If you want to load a PyTorch model and convert it to the OpenVINO format on-the-fly, set `export=True`: + +```python +from optimum.intel import OVStableDiffusionPipeline + +model_id = "runwayml/stable-diffusion-v1-5" +pipeline = OVStableDiffusionPipeline.from_pretrained(model_id, export=True) +prompt = "sailing ship in storm by Rembrandt" +image = pipeline(prompt).images[0] + +# Don't forget to save the exported model +pipeline.save_pretrained("openvino-sd-v1-5") +``` + +To further speed-up inference, statically reshape the model. If you change any parameters such as the outputs height or width, you’ll need to statically reshape your model again. + +```python +# Define the shapes related to the inputs and desired outputs +batch_size, num_images, height, width = 1, 1, 512, 512 + +# Statically reshape the model +pipeline.reshape(batch_size, height, width, num_images) +# Compile the model before inference +pipeline.compile() + +image = pipeline( + prompt, + height=height, + width=width, + num_images_per_prompt=num_images, +).images[0] +``` +
+ +
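+Once exported and saved, the OpenVINO pipeline can be reloaded later without converting it again. A minimal sketch, assuming the `openvino-sd-v1-5` directory created by the `save_pretrained` call above:
+
+```python
+from optimum.intel import OVStableDiffusionPipeline
+
+# Reload the previously exported model; no `export=True` is needed this time
+pipeline = OVStableDiffusionPipeline.from_pretrained("openvino-sd-v1-5")
+image = pipeline("sailing ship in storm by Rembrandt").images[0]
+```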
+ +You can find more examples in the 🀗 Optimum [documentation](https://huggingface.co/docs/optimum/intel/inference#stable-diffusion), and Stable Diffusion is supported for text-to-image, image-to-image, and inpainting. + +## Stable Diffusion XL + +To load and run inference with SDXL, use the [`~optimum.intel.OVStableDiffusionXLPipeline`]: + +```python +from optimum.intel import OVStableDiffusionXLPipeline + +model_id = "stabilityai/stable-diffusion-xl-base-1.0" +pipeline = OVStableDiffusionXLPipeline.from_pretrained(model_id) +prompt = "sailing ship in storm by Rembrandt" +image = pipeline(prompt).images[0] +``` + +To further speed-up inference, [statically reshape](#stable-diffusion) the model as shown in the Stable Diffusion section. + +You can find more examples in the 🀗 Optimum [documentation](https://huggingface.co/docs/optimum/intel/inference#stable-diffusion-xl), and running SDXL in OpenVINO is supported for text-to-image and image-to-image. diff --git a/docs/source/jp/optimization/opt_overview.md b/docs/source/jp/optimization/opt_overview.md new file mode 100644 index 000000000000..1f809bb011ce --- /dev/null +++ b/docs/source/jp/optimization/opt_overview.md @@ -0,0 +1,17 @@ + + +# Overview + +Generating high-quality outputs is computationally intensive, especially during each iterative step where you go from a noisy output to a less noisy output. One of 🀗 Diffuser's goal is to make this technology widely accessible to everyone, which includes enabling fast inference on consumer and specialized hardware. + +This section will cover tips and tricks - like half-precision weights and sliced attention - for optimizing inference speed and reducing memory-consumption. You'll also learn how to speed up your PyTorch code with [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) or [ONNX Runtime](https://onnxruntime.ai/docs/), and enable memory-efficient attention with [xFormers](https://facebookresearch.github.io/xformers/). There are also guides for running inference on specific hardware like Apple Silicon, and Intel or Habana processors. \ No newline at end of file diff --git a/docs/source/jp/optimization/tome.md b/docs/source/jp/optimization/tome.md new file mode 100644 index 000000000000..66d69c6900cc --- /dev/null +++ b/docs/source/jp/optimization/tome.md @@ -0,0 +1,89 @@ + + +# Token merging + +[Token merging](https://huggingface.co/papers/2303.17604) (ToMe) merges redundant tokens/patches progressively in the forward pass of a Transformer-based network which can speed-up the inference latency of [`StableDiffusionPipeline`]. + +You can use ToMe from the [`tomesd`](https://github.com/dbolya/tomesd) library with the [`apply_patch`](https://github.com/dbolya/tomesd?tab=readme-ov-file#usage) function: + +```diff +from diffusers import StableDiffusionPipeline +import tomesd + +pipeline = StableDiffusionPipeline.from_pretrained( + "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True, +).to("cuda") ++ tomesd.apply_patch(pipeline, ratio=0.5) + +image = pipeline("a photo of an astronaut riding a horse on mars").images[0] +``` + +The `apply_patch` function exposes a number of [arguments](https://github.com/dbolya/tomesd#usage) to help strike a balance between pipeline inference speed and the quality of the generated tokens. The most important argument is `ratio` which controls the number of tokens that are merged during the forward pass. 
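+
+A larger `ratio` merges more tokens and trades a little image quality for extra speed. As a sketch, you could patch with a more aggressive value instead (0.75 here is purely illustrative):
+
+```python
+# A more aggressive merge ratio than the 0.5 used above; pick based on your quality/speed needs
+tomesd.apply_patch(pipeline, ratio=0.75)
+```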
+ +As reported in the [paper](https://huggingface.co/papers/2303.17604), ToMe can greatly preserve the quality of the generated images while boosting inference speed. By increasing the `ratio`, you can speed-up inference even further, but at the cost of some degraded image quality. + +To test the quality of the generated images, we sampled a few prompts from [Parti Prompts](https://parti.research.google/) and performed inference with the [`StableDiffusionPipeline`] with the following settings: + +
+ +
+ +We didn’t notice any significant decrease in the quality of the generated samples, and you can check out the generated samples in this [WandB report](https://wandb.ai/sayakpaul/tomesd-results/runs/23j4bj3i?workspace=). If you're interested in reproducing this experiment, use this [script](https://gist.github.com/sayakpaul/8cac98d7f22399085a060992f411ecbd). + +## Benchmarks + +We also benchmarked the impact of `tomesd` on the [`StableDiffusionPipeline`] with [xFormers](https://huggingface.co/docs/diffusers/optimization/xformers) enabled across several image resolutions. The results are obtained from A100 and V100 GPUs in the following development environment: + +```bash +- `diffusers` version: 0.15.1 +- Python version: 3.8.16 +- PyTorch version (GPU?): 1.13.1+cu116 (True) +- Huggingface_hub version: 0.13.2 +- Transformers version: 4.27.2 +- Accelerate version: 0.18.0 +- xFormers version: 0.0.16 +- tomesd version: 0.1.2 +``` + +To reproduce this benchmark, feel free to use this [script](https://gist.github.com/sayakpaul/27aec6bca7eb7b0e0aa4112205850335). The results are reported in seconds, and where applicable we report the speed-up percentage over the vanilla pipeline when using ToMe and ToMe + xFormers. + +| **GPU** | **Resolution** | **Batch size** | **Vanilla** | **ToMe** | **ToMe + xFormers** | +|----------|----------------|----------------|-------------|----------------|---------------------| +| **A100** | 512 | 10 | 6.88 | 5.26 (+23.55%) | 4.69 (+31.83%) | +| | 768 | 10 | OOM | 14.71 | 11 | +| | | 8 | OOM | 11.56 | 8.84 | +| | | 4 | OOM | 5.98 | 4.66 | +| | | 2 | 4.99 | 3.24 (+35.07%) | 2.1 (+37.88%) | +| | | 1 | 3.29 | 2.24 (+31.91%) | 2.03 (+38.3%) | +| | 1024 | 10 | OOM | OOM | OOM | +| | | 8 | OOM | OOM | OOM | +| | | 4 | OOM | 12.51 | 9.09 | +| | | 2 | OOM | 6.52 | 4.96 | +| | | 1 | 6.4 | 3.61 (+43.59%) | 2.81 (+56.09%) | +| **V100** | 512 | 10 | OOM | 10.03 | 9.29 | +| | | 8 | OOM | 8.05 | 7.47 | +| | | 4 | 5.7 | 4.3 (+24.56%) | 3.98 (+30.18%) | +| | | 2 | 3.14 | 2.43 (+22.61%) | 2.27 (+27.71%) | +| | | 1 | 1.88 | 1.57 (+16.49%) | 1.57 (+16.49%) | +| | 768 | 10 | OOM | OOM | 23.67 | +| | | 8 | OOM | OOM | 18.81 | +| | | 4 | OOM | 11.81 | 9.7 | +| | | 2 | OOM | 6.27 | 5.2 | +| | | 1 | 5.43 | 3.38 (+37.75%) | 2.82 (+48.07%) | +| | 1024 | 10 | OOM | OOM | OOM | +| | | 8 | OOM | OOM | OOM | +| | | 4 | OOM | OOM | 19.35 | +| | | 2 | OOM | 13 | 10.78 | +| | | 1 | OOM | 6.66 | 5.54 | + +As seen in the tables above, the speed-up from `tomesd` becomes more pronounced for larger image resolutions. It is also interesting to note that with `tomesd`, it is possible to run the pipeline on a higher resolution like 1024x1024. You may be able to speed-up inference even more with [`torch.compile`](torch2.0). diff --git a/docs/source/jp/optimization/torch2.0.md b/docs/source/jp/optimization/torch2.0.md new file mode 100644 index 000000000000..1e07b876514f --- /dev/null +++ b/docs/source/jp/optimization/torch2.0.md @@ -0,0 +1,434 @@ + + +# Torch 2.0 + +🀗 Diffusers supports the latest optimizations from [PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/) which include: + +1. A memory-efficient attention implementation, scaled dot product attention, without requiring any extra dependencies such as xFormers. +2. [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html), a just-in-time (JIT) compiler to provide an extra performance boost when individual models are compiled. + +Both of these optimizations require PyTorch 2.0 or later and 🀗 Diffusers > 0.13.0. 
+ +```bash +pip install --upgrade torch diffusers +``` + +## Scaled dot product attention + +[`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention) (SDPA) is an optimized and memory-efficient attention (similar to xFormers) that automatically enables several other optimizations depending on the model inputs and GPU type. SDPA is enabled by default if you're using PyTorch 2.0 and the latest version of 🀗 Diffusers, so you don't need to add anything to your code. + +However, if you want to explicitly enable it, you can set a [`DiffusionPipeline`] to use [`~models.attention_processor.AttnProcessor2_0`]: + +```diff + import torch + from diffusers import DiffusionPipeline ++ from diffusers.models.attention_processor import AttnProcessor2_0 + + pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda") ++ pipe.unet.set_attn_processor(AttnProcessor2_0()) + + prompt = "a photo of an astronaut riding a horse on mars" + image = pipe(prompt).images[0] +``` + +SDPA should be as fast and memory efficient as `xFormers`; check the [benchmark](#benchmark) for more details. + +In some cases - such as making the pipeline more deterministic or converting it to other formats - it may be helpful to use the vanilla attention processor, [`~models.attention_processor.AttnProcessor`]. To revert to [`~models.attention_processor.AttnProcessor`], call the [`~UNet2DConditionModel.set_default_attn_processor`] function on the pipeline: + +```diff + import torch + from diffusers import DiffusionPipeline + from diffusers.models.attention_processor import AttnProcessor + + pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda") ++ pipe.unet.set_default_attn_processor() + + prompt = "a photo of an astronaut riding a horse on mars" + image = pipe(prompt).images[0] +``` + +## torch.compile + +The `torch.compile` function can often provide an additional speed-up to your PyTorch code. In 🀗 Diffusers, it is usually best to wrap the UNet with `torch.compile` because it does most of the heavy lifting in the pipeline. + +```python +from diffusers import DiffusionPipeline +import torch + +pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda") +pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) +images = pipe(prompt, num_inference_steps=steps, num_images_per_prompt=batch_size).images[0] +``` + +Depending on GPU type, `torch.compile` can provide an *additional speed-up* of **5-300x** on top of SDPA! If you're using more recent GPU architectures such as Ampere (A100, 3090), Ada (4090), and Hopper (H100), `torch.compile` is able to squeeze even more performance out of these GPUs. + +Compilation requires some time to complete, so it is best suited for situations where you prepare your pipeline once and then perform the same type of inference operations multiple times. For example, calling the compiled pipeline on a different image size triggers compilation again which can be expensive. + +For more information and different options about `torch.compile`, refer to the [`torch_compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) tutorial. 
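+
+If you need more headroom, the VAE decoder can often be compiled as well. This is a sketch rather than an official recipe, and it assumes the `pipe` object from the snippet above:
+
+```python
+# Optionally compile the VAE decoder too; gains depend on GPU and workload
+pipe.vae.decode = torch.compile(pipe.vae.decode, mode="reduce-overhead", fullgraph=True)
+```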
+
+## Benchmark
+
+We conducted a comprehensive benchmark with PyTorch 2.0's efficient attention implementation and `torch.compile` across different GPUs and batch sizes for five of our most used pipelines. The code is benchmarked on 🀗 Diffusers v0.17.0.dev0 to optimize `torch.compile` usage (see [here](https://github.com/huggingface/diffusers/pull/3313) for more details).
+
+The code used to benchmark each pipeline is shown below:
+
+ +### Stable Diffusion text-to-image + +```python +from diffusers import DiffusionPipeline +import torch + +path = "runwayml/stable-diffusion-v1-5" + +run_compile = True # Set True / False + +pipe = DiffusionPipeline.from_pretrained(path, torch_dtype=torch.float16, use_safetensors=True) +pipe = pipe.to("cuda") +pipe.unet.to(memory_format=torch.channels_last) + +if run_compile: + print("Run torch compile") + pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) + +prompt = "ghibli style, a fantasy landscape with castles" + +for _ in range(3): + images = pipe(prompt=prompt).images +``` + +### Stable Diffusion image-to-image + +```python +from diffusers import StableDiffusionImg2ImgPipeline +import requests +import torch +from PIL import Image +from io import BytesIO + +url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" + +response = requests.get(url) +init_image = Image.open(BytesIO(response.content)).convert("RGB") +init_image = init_image.resize((512, 512)) + +path = "runwayml/stable-diffusion-v1-5" + +run_compile = True # Set True / False + +pipe = StableDiffusionImg2ImgPipeline.from_pretrained(path, torch_dtype=torch.float16, use_safetensors=True) +pipe = pipe.to("cuda") +pipe.unet.to(memory_format=torch.channels_last) + +if run_compile: + print("Run torch compile") + pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) + +prompt = "ghibli style, a fantasy landscape with castles" + +for _ in range(3): + image = pipe(prompt=prompt, image=init_image).images[0] +``` + +### Stable Diffusion inpainting + +```python +from diffusers import StableDiffusionInpaintPipeline +import requests +import torch +from PIL import Image +from io import BytesIO + +url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" + +def download_image(url): + response = requests.get(url) + return Image.open(BytesIO(response.content)).convert("RGB") + + +img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" +mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" + +init_image = download_image(img_url).resize((512, 512)) +mask_image = download_image(mask_url).resize((512, 512)) + +path = "runwayml/stable-diffusion-inpainting" + +run_compile = True # Set True / False + +pipe = StableDiffusionInpaintPipeline.from_pretrained(path, torch_dtype=torch.float16, use_safetensors=True) +pipe = pipe.to("cuda") +pipe.unet.to(memory_format=torch.channels_last) + +if run_compile: + print("Run torch compile") + pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) + +prompt = "ghibli style, a fantasy landscape with castles" + +for _ in range(3): + image = pipe(prompt=prompt, image=init_image, mask_image=mask_image).images[0] +``` + +### ControlNet + +```python +from diffusers import StableDiffusionControlNetPipeline, ControlNetModel +import requests +import torch +from PIL import Image +from io import BytesIO + +url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" + +response = requests.get(url) +init_image = Image.open(BytesIO(response.content)).convert("RGB") +init_image = init_image.resize((512, 512)) + +path = "runwayml/stable-diffusion-v1-5" + +run_compile = True # Set True 
/ False +controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16, use_safetensors=True) +pipe = StableDiffusionControlNetPipeline.from_pretrained( + path, controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True +) + +pipe = pipe.to("cuda") +pipe.unet.to(memory_format=torch.channels_last) +pipe.controlnet.to(memory_format=torch.channels_last) + +if run_compile: + print("Run torch compile") + pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) + pipe.controlnet = torch.compile(pipe.controlnet, mode="reduce-overhead", fullgraph=True) + +prompt = "ghibli style, a fantasy landscape with castles" + +for _ in range(3): + image = pipe(prompt=prompt, image=init_image).images[0] +``` + +### DeepFloyd IF text-to-image + upscaling + +```python +from diffusers import DiffusionPipeline +import torch + +run_compile = True # Set True / False + +pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16, use_safetensors=True) +pipe.to("cuda") +pipe_2 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-II-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16, use_safetensors=True) +pipe_2.to("cuda") +pipe_3 = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16, use_safetensors=True) +pipe_3.to("cuda") + + +pipe.unet.to(memory_format=torch.channels_last) +pipe_2.unet.to(memory_format=torch.channels_last) +pipe_3.unet.to(memory_format=torch.channels_last) + +if run_compile: + pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) + pipe_2.unet = torch.compile(pipe_2.unet, mode="reduce-overhead", fullgraph=True) + pipe_3.unet = torch.compile(pipe_3.unet, mode="reduce-overhead", fullgraph=True) + +prompt = "the blue hulk" + +prompt_embeds = torch.randn((1, 2, 4096), dtype=torch.float16) +neg_prompt_embeds = torch.randn((1, 2, 4096), dtype=torch.float16) + +for _ in range(3): + image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=neg_prompt_embeds, output_type="pt").images + image_2 = pipe_2(image=image, prompt_embeds=prompt_embeds, negative_prompt_embeds=neg_prompt_embeds, output_type="pt").images + image_3 = pipe_3(prompt=prompt, image=image, noise_level=100).images +``` +
+ +The graph below highlights the relative speed-ups for the [`StableDiffusionPipeline`] across five GPU families with PyTorch 2.0 and `torch.compile` enabled. The benchmarks for the following graphs are measured in *number of iterations/second*. + +![t2i_speedup](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/pt2_benchmarks/t2i_speedup.png) + +To give you an even better idea of how this speed-up holds for the other pipelines, consider the following +graph for an A100 with PyTorch 2.0 and `torch.compile`: + +![a100_numbers](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/pt2_benchmarks/a100_numbers.png) + +In the following tables, we report our findings in terms of the *number of iterations/second*. + +### A100 (batch size: 1) + +| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | +|:---:|:---:|:---:|:---:|:---:| +| SD - txt2img | 21.66 | 23.13 | 44.03 | 49.74 | +| SD - img2img | 21.81 | 22.40 | 43.92 | 46.32 | +| SD - inpaint | 22.24 | 23.23 | 43.76 | 49.25 | +| SD - controlnet | 15.02 | 15.82 | 32.13 | 36.08 | +| IF | 20.21 /
13.84 /
24.00 | 20.12 /
13.70 /
24.03 | ❌ | 97.34 /
27.23 /
111.66 | +| SDXL - txt2img | 8.64 | 9.9 | - | - | + +### A100 (batch size: 4) + +| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | +|:---:|:---:|:---:|:---:|:---:| +| SD - txt2img | 11.6 | 13.12 | 14.62 | 17.27 | +| SD - img2img | 11.47 | 13.06 | 14.66 | 17.25 | +| SD - inpaint | 11.67 | 13.31 | 14.88 | 17.48 | +| SD - controlnet | 8.28 | 9.38 | 10.51 | 12.41 | +| IF | 25.02 | 18.04 | ❌ | 48.47 | +| SDXL - txt2img | 2.44 | 2.74 | - | - | + +### A100 (batch size: 16) + +| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | +|:---:|:---:|:---:|:---:|:---:| +| SD - txt2img | 3.04 | 3.6 | 3.83 | 4.68 | +| SD - img2img | 2.98 | 3.58 | 3.83 | 4.67 | +| SD - inpaint | 3.04 | 3.66 | 3.9 | 4.76 | +| SD - controlnet | 2.15 | 2.58 | 2.74 | 3.35 | +| IF | 8.78 | 9.82 | ❌ | 16.77 | +| SDXL - txt2img | 0.64 | 0.72 | - | - | + +### V100 (batch size: 1) + +| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | +|:---:|:---:|:---:|:---:|:---:| +| SD - txt2img | 18.99 | 19.14 | 20.95 | 22.17 | +| SD - img2img | 18.56 | 19.18 | 20.95 | 22.11 | +| SD - inpaint | 19.14 | 19.06 | 21.08 | 22.20 | +| SD - controlnet | 13.48 | 13.93 | 15.18 | 15.88 | +| IF | 20.01 /
9.08 /
23.34 | 19.79 /
8.98 /
24.10 | ❌ | 55.75 /
11.57 /
57.67 | + +### V100 (batch size: 4) + +| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | +|:---:|:---:|:---:|:---:|:---:| +| SD - txt2img | 5.96 | 5.89 | 6.83 | 6.86 | +| SD - img2img | 5.90 | 5.91 | 6.81 | 6.82 | +| SD - inpaint | 5.99 | 6.03 | 6.93 | 6.95 | +| SD - controlnet | 4.26 | 4.29 | 4.92 | 4.93 | +| IF | 15.41 | 14.76 | ❌ | 22.95 | + +### V100 (batch size: 16) + +| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | +|:---:|:---:|:---:|:---:|:---:| +| SD - txt2img | 1.66 | 1.66 | 1.92 | 1.90 | +| SD - img2img | 1.65 | 1.65 | 1.91 | 1.89 | +| SD - inpaint | 1.69 | 1.69 | 1.95 | 1.93 | +| SD - controlnet | 1.19 | 1.19 | OOM after warmup | 1.36 | +| IF | 5.43 | 5.29 | ❌ | 7.06 | + +### T4 (batch size: 1) + +| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | +|:---:|:---:|:---:|:---:|:---:| +| SD - txt2img | 6.9 | 6.95 | 7.3 | 7.56 | +| SD - img2img | 6.84 | 6.99 | 7.04 | 7.55 | +| SD - inpaint | 6.91 | 6.7 | 7.01 | 7.37 | +| SD - controlnet | 4.89 | 4.86 | 5.35 | 5.48 | +| IF | 17.42 /
2.47 /
18.52 | 16.96 /
2.45 /
18.69 | ❌ | 24.63 /
2.47 /
23.39 | +| SDXL - txt2img | 1.15 | 1.16 | - | - | + +### T4 (batch size: 4) + +| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | +|:---:|:---:|:---:|:---:|:---:| +| SD - txt2img | 1.79 | 1.79 | 2.03 | 1.99 | +| SD - img2img | 1.77 | 1.77 | 2.05 | 2.04 | +| SD - inpaint | 1.81 | 1.82 | 2.09 | 2.09 | +| SD - controlnet | 1.34 | 1.27 | 1.47 | 1.46 | +| IF | 5.79 | 5.61 | ❌ | 7.39 | +| SDXL - txt2img | 0.288 | 0.289 | - | - | + +### T4 (batch size: 16) + +| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | +|:---:|:---:|:---:|:---:|:---:| +| SD - txt2img | 2.34s | 2.30s | OOM after 2nd iteration | 1.99s | +| SD - img2img | 2.35s | 2.31s | OOM after warmup | 2.00s | +| SD - inpaint | 2.30s | 2.26s | OOM after 2nd iteration | 1.95s | +| SD - controlnet | OOM after 2nd iteration | OOM after 2nd iteration | OOM after warmup | OOM after warmup | +| IF * | 1.44 | 1.44 | ❌ | 1.94 | +| SDXL - txt2img | OOM | OOM | - | - | + +### RTX 3090 (batch size: 1) + +| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | +|:---:|:---:|:---:|:---:|:---:| +| SD - txt2img | 22.56 | 22.84 | 23.84 | 25.69 | +| SD - img2img | 22.25 | 22.61 | 24.1 | 25.83 | +| SD - inpaint | 22.22 | 22.54 | 24.26 | 26.02 | +| SD - controlnet | 16.03 | 16.33 | 17.38 | 18.56 | +| IF | 27.08 /
9.07 /
31.23 | 26.75 /
8.92 /
31.47 | ❌ | 68.08 /
11.16 /
65.29 | + +### RTX 3090 (batch size: 4) + +| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | +|:---:|:---:|:---:|:---:|:---:| +| SD - txt2img | 6.46 | 6.35 | 7.29 | 7.3 | +| SD - img2img | 6.33 | 6.27 | 7.31 | 7.26 | +| SD - inpaint | 6.47 | 6.4 | 7.44 | 7.39 | +| SD - controlnet | 4.59 | 4.54 | 5.27 | 5.26 | +| IF | 16.81 | 16.62 | ❌ | 21.57 | + +### RTX 3090 (batch size: 16) + +| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | +|:---:|:---:|:---:|:---:|:---:| +| SD - txt2img | 1.7 | 1.69 | 1.93 | 1.91 | +| SD - img2img | 1.68 | 1.67 | 1.93 | 1.9 | +| SD - inpaint | 1.72 | 1.71 | 1.97 | 1.94 | +| SD - controlnet | 1.23 | 1.22 | 1.4 | 1.38 | +| IF | 5.01 | 5.00 | ❌ | 6.33 | + +### RTX 4090 (batch size: 1) + +| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | +|:---:|:---:|:---:|:---:|:---:| +| SD - txt2img | 40.5 | 41.89 | 44.65 | 49.81 | +| SD - img2img | 40.39 | 41.95 | 44.46 | 49.8 | +| SD - inpaint | 40.51 | 41.88 | 44.58 | 49.72 | +| SD - controlnet | 29.27 | 30.29 | 32.26 | 36.03 | +| IF | 69.71 /
18.78 /
85.49 | 69.13 /
18.80 /
85.56 | ❌ | 124.60 /
26.37 /
138.79 | +| SDXL - txt2img | 6.8 | 8.18 | - | - | + +### RTX 4090 (batch size: 4) + +| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** |
|:---:|:---:|:---:|:---:|:---:|
| SD - txt2img | 12.62 | 12.84 | 15.32 | 15.59 |
| SD - img2img | 12.61 | 12.79 | 15.35 | 15.66 |
| SD - inpaint | 12.65 | 12.81 | 15.3 | 15.58 |
| SD - controlnet | 9.1 | 9.25 | 11.03 | 11.22 |
| IF | 31.88 | 31.14 | ❌ | 43.92 |
| SDXL - txt2img | 2.19 | 2.35 | - | - |

### RTX 4090 (batch size: 16)

| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | +|:---:|:---:|:---:|:---:|:---:| +| SD - txt2img | 3.17 | 3.2 | 3.84 | 3.85 | +| SD - img2img | 3.16 | 3.2 | 3.84 | 3.85 | +| SD - inpaint | 3.17 | 3.2 | 3.85 | 3.85 | +| SD - controlnet | 2.23 | 2.3 | 2.7 | 2.75 | +| IF | 9.26 | 9.2 | ❌ | 13.31 | +| SDXL - txt2img | 0.52 | 0.53 | - | - | + +## Notes + +* Follow this [PR](https://github.com/huggingface/diffusers/pull/3313) for more details on the environment used for conducting the benchmarks. +* For the DeepFloyd IF pipeline where batch sizes > 1, we only used a batch size of > 1 in the first IF pipeline for text-to-image generation and NOT for upscaling. That means the two upscaling pipelines received a batch size of 1. + +*Thanks to [Horace He](https://github.com/Chillee) from the PyTorch team for their support in improving our support of `torch.compile()` in Diffusers.* diff --git a/docs/source/jp/quicktour.md b/docs/source/jp/quicktour.md new file mode 100644 index 000000000000..3cf6851e4683 --- /dev/null +++ b/docs/source/jp/quicktour.md @@ -0,0 +1,314 @@ + + +[[open-in-colab]] + +# Quicktour + +Diffusion models are trained to denoise random Gaussian noise step-by-step to generate a sample of interest, such as an image or audio. This has sparked a tremendous amount of interest in generative AI, and you have probably seen examples of diffusion generated images on the internet. 🧚 Diffusers is a library aimed at making diffusion models widely accessible to everyone. + +Whether you're a developer or an everyday user, this quicktour will introduce you to 🧚 Diffusers and help you get up and generating quickly! There are three main components of the library to know about: + +* The [`DiffusionPipeline`] is a high-level end-to-end class designed to rapidly generate samples from pretrained diffusion models for inference. +* Popular pretrained [model](./api/models) architectures and modules that can be used as building blocks for creating diffusion systems. +* Many different [schedulers](./api/schedulers/overview) - algorithms that control how noise is added for training, and how to generate denoised images during inference. + +The quicktour will show you how to use the [`DiffusionPipeline`] for inference, and then walk you through how to combine a model and scheduler to replicate what's happening inside the [`DiffusionPipeline`]. + + + +The quicktour is a simplified version of the introductory 🧚 Diffusers [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/diffusers_intro.ipynb) to help you get started quickly. If you want to learn more about 🧚 Diffusers goal, design philosophy, and additional details about it's core API, check out the notebook! + + + +Before you begin, make sure you have all the necessary libraries installed: + +```py +# uncomment to install the necessary libraries in Colab +#!pip install --upgrade diffusers accelerate transformers +``` + +- [🀗 Accelerate](https://huggingface.co/docs/accelerate/index) speeds up model loading for inference and training. +- [🀗 Transformers](https://huggingface.co/docs/transformers/index) is required to run the most popular diffusion models, such as [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview). + +## DiffusionPipeline + +The [`DiffusionPipeline`] is the easiest way to use a pretrained diffusion system for inference. It is an end-to-end system containing the model and the scheduler. You can use the [`DiffusionPipeline`] out-of-the-box for many tasks. 
Take a look at the table below for some supported tasks, and for a complete list of supported tasks, check out the [🧚 Diffusers Summary](./api/pipelines/overview#diffusers-summary) table. + +| **Task** | **Description** | **Pipeline** +|------------------------------|--------------------------------------------------------------------------------------------------------------|-----------------| +| Unconditional Image Generation | generate an image from Gaussian noise | [unconditional_image_generation](./using-diffusers/unconditional_image_generation) | +| Text-Guided Image Generation | generate an image given a text prompt | [conditional_image_generation](./using-diffusers/conditional_image_generation) | +| Text-Guided Image-to-Image Translation | adapt an image guided by a text prompt | [img2img](./using-diffusers/img2img) | +| Text-Guided Image-Inpainting | fill the masked part of an image given the image, the mask and a text prompt | [inpaint](./using-diffusers/inpaint) | +| Text-Guided Depth-to-Image Translation | adapt parts of an image guided by a text prompt while preserving structure via depth estimation | [depth2img](./using-diffusers/depth2img) | + +Start by creating an instance of a [`DiffusionPipeline`] and specify which pipeline checkpoint you would like to download. +You can use the [`DiffusionPipeline`] for any [checkpoint](https://huggingface.co/models?library=diffusers&sort=downloads) stored on the Hugging Face Hub. +In this quicktour, you'll load the [`stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) checkpoint for text-to-image generation. + + + +For [Stable Diffusion](https://huggingface.co/CompVis/stable-diffusion) models, please carefully read the [license](https://huggingface.co/spaces/CompVis/stable-diffusion-license) first before running the model. 🧚 Diffusers implements a [`safety_checker`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/safety_checker.py) to prevent offensive or harmful content, but the model's improved image generation capabilities can still produce potentially harmful content. + + + +Load the model with the [`~DiffusionPipeline.from_pretrained`] method: + +```python +>>> from diffusers import DiffusionPipeline + +>>> pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True) +``` + +The [`DiffusionPipeline`] downloads and caches all modeling, tokenization, and scheduling components. You'll see that the Stable Diffusion pipeline is composed of the [`UNet2DConditionModel`] and [`PNDMScheduler`] among other things: + +```py +>>> pipeline +StableDiffusionPipeline { + "_class_name": "StableDiffusionPipeline", + "_diffusers_version": "0.13.1", + ..., + "scheduler": [ + "diffusers", + "PNDMScheduler" + ], + ..., + "unet": [ + "diffusers", + "UNet2DConditionModel" + ], + "vae": [ + "diffusers", + "AutoencoderKL" + ] +} +``` + +We strongly recommend running the pipeline on a GPU because the model consists of roughly 1.4 billion parameters. +You can move the generator object to a GPU, just like you would in PyTorch: + +```python +>>> pipeline.to("cuda") +``` + +Now you can pass a text prompt to the `pipeline` to generate an image, and then access the denoised image. By default, the image output is wrapped in a [`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html?highlight=image#the-image-class) object. + +```python +>>> image = pipeline("An image of a squirrel in Picasso style").images[0] +>>> image +``` + +
+ +
+ +Save the image by calling `save`: + +```python +>>> image.save("image_of_squirrel_painting.png") +``` + +### Local pipeline + +You can also use the pipeline locally. The only difference is you need to download the weights first: + +```bash +!git lfs install +!git clone https://huggingface.co/runwayml/stable-diffusion-v1-5 +``` + +Then load the saved weights into the pipeline: + +```python +>>> pipeline = DiffusionPipeline.from_pretrained("./stable-diffusion-v1-5", use_safetensors=True) +``` + +Now you can run the pipeline as you would in the section above. + +### Swapping schedulers + +Different schedulers come with different denoising speeds and quality trade-offs. The best way to find out which one works best for you is to try them out! One of the main features of 🧚 Diffusers is to allow you to easily switch between schedulers. For example, to replace the default [`PNDMScheduler`] with the [`EulerDiscreteScheduler`], load it with the [`~diffusers.ConfigMixin.from_config`] method: + +```py +>>> from diffusers import EulerDiscreteScheduler + +>>> pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True) +>>> pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config) +``` + +Try generating an image with the new scheduler and see if you notice a difference! + +In the next section, you'll take a closer look at the components - the model and scheduler - that make up the [`DiffusionPipeline`] and learn how to use these components to generate an image of a cat. + +## Models + +Most models take a noisy sample, and at each timestep it predicts the *noise residual* (other models learn to predict the previous sample directly or the velocity or [`v-prediction`](https://github.com/huggingface/diffusers/blob/5e5ce13e2f89ac45a0066cb3f369462a3cf1d9ef/src/diffusers/schedulers/scheduling_ddim.py#L110)), the difference between a less noisy image and the input image. You can mix and match models to create other diffusion systems. + +Models are initiated with the [`~ModelMixin.from_pretrained`] method which also locally caches the model weights so it is faster the next time you load the model. For the quicktour, you'll load the [`UNet2DModel`], a basic unconditional image generation model with a checkpoint trained on cat images: + +```py +>>> from diffusers import UNet2DModel + +>>> repo_id = "google/ddpm-cat-256" +>>> model = UNet2DModel.from_pretrained(repo_id, use_safetensors=True) +``` + +To access the model parameters, call `model.config`: + +```py +>>> model.config +``` + +The model configuration is a 🧊 frozen 🧊 dictionary, which means those parameters can't be changed after the model is created. This is intentional and ensures that the parameters used to define the model architecture at the start remain the same, while other parameters can still be adjusted during inference. + +Some of the most important parameters are: + +* `sample_size`: the height and width dimension of the input sample. +* `in_channels`: the number of input channels of the input sample. +* `down_block_types` and `up_block_types`: the type of down- and upsampling blocks used to create the UNet architecture. +* `block_out_channels`: the number of output channels of the downsampling blocks; also used in reverse order for the number of input channels of the upsampling blocks. +* `layers_per_block`: the number of ResNet blocks present in each UNet block. + +To use the model for inference, create the image shape with random Gaussian noise. 
It should have a `batch` axis because the model can receive multiple random noises, a `channel` axis corresponding to the number of input channels, and a `sample_size` axis for the height and width of the image: + +```py +>>> import torch + +>>> torch.manual_seed(0) + +>>> noisy_sample = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size) +>>> noisy_sample.shape +torch.Size([1, 3, 256, 256]) +``` + +For inference, pass the noisy image to the model and a `timestep`. The `timestep` indicates how noisy the input image is, with more noise at the beginning and less at the end. This helps the model determine its position in the diffusion process, whether it is closer to the start or the end. Use the `sample` method to get the model output: + +```py +>>> with torch.no_grad(): +... noisy_residual = model(sample=noisy_sample, timestep=2).sample +``` + +To generate actual examples though, you'll need a scheduler to guide the denoising process. In the next section, you'll learn how to couple a model with a scheduler. + +## Schedulers + +Schedulers manage going from a noisy sample to a less noisy sample given the model output - in this case, it is the `noisy_residual`. + + + +🧚 Diffusers is a toolbox for building diffusion systems. While the [`DiffusionPipeline`] is a convenient way to get started with a pre-built diffusion system, you can also choose your own model and scheduler components separately to build a custom diffusion system. + + + +For the quicktour, you'll instantiate the [`DDPMScheduler`] with it's [`~diffusers.ConfigMixin.from_config`] method: + +```py +>>> from diffusers import DDPMScheduler + +>>> scheduler = DDPMScheduler.from_config(repo_id) +>>> scheduler +DDPMScheduler { + "_class_name": "DDPMScheduler", + "_diffusers_version": "0.13.1", + "beta_end": 0.02, + "beta_schedule": "linear", + "beta_start": 0.0001, + "clip_sample": true, + "clip_sample_range": 1.0, + "num_train_timesteps": 1000, + "prediction_type": "epsilon", + "trained_betas": null, + "variance_type": "fixed_small" +} +``` + + + +💡 Notice how the scheduler is instantiated from a configuration. Unlike a model, a scheduler does not have trainable weights and is parameter-free! + + + +Some of the most important parameters are: + +* `num_train_timesteps`: the length of the denoising process or in other words, the number of timesteps required to process random Gaussian noise into a data sample. +* `beta_schedule`: the type of noise schedule to use for inference and training. +* `beta_start` and `beta_end`: the start and end noise values for the noise schedule. + +To predict a slightly less noisy image, pass the following to the scheduler's [`~diffusers.DDPMScheduler.step`] method: model output, `timestep`, and current `sample`. + +```py +>>> less_noisy_sample = scheduler.step(model_output=noisy_residual, timestep=2, sample=noisy_sample).prev_sample +>>> less_noisy_sample.shape +``` + +The `less_noisy_sample` can be passed to the next `timestep` where it'll get even less noisier! Let's bring it all together now and visualize the entire denoising process. + +First, create a function that postprocesses and displays the denoised image as a `PIL.Image`: + +```py +>>> import PIL.Image +>>> import numpy as np + + +>>> def display_sample(sample, i): +... image_processed = sample.cpu().permute(0, 2, 3, 1) +... image_processed = (image_processed + 1.0) * 127.5 +... image_processed = image_processed.numpy().astype(np.uint8) + +... image_pil = PIL.Image.fromarray(image_processed[0]) +... 
display(f"Image at step {i}") +... display(image_pil) +``` + +To speed up the denoising process, move the input and model to a GPU: + +```py +>>> model.to("cuda") +>>> noisy_sample = noisy_sample.to("cuda") +``` + +Now create a denoising loop that predicts the residual of the less noisy sample, and computes the less noisy sample with the scheduler: + +```py +>>> import tqdm + +>>> sample = noisy_sample + +>>> for i, t in enumerate(tqdm.tqdm(scheduler.timesteps)): +... # 1. predict noise residual +... with torch.no_grad(): +... residual = model(sample, t).sample + +... # 2. compute less noisy image and set x_t -> x_t-1 +... sample = scheduler.step(residual, t, sample).prev_sample + +... # 3. optionally look at image +... if (i + 1) % 50 == 0: +... display_sample(sample, i + 1) +``` + +Sit back and watch as a cat is generated from nothing but noise! 😻 + +
+ +
+ +## Next steps + +Hopefully you generated some cool images with 🧚 Diffusers in this quicktour! For your next steps, you can: + +* Train or finetune a model to generate your own images in the [training](./tutorials/basic_training) tutorial. +* See example official and community [training or finetuning scripts](https://github.com/huggingface/diffusers/tree/main/examples#-diffusers-examples) for a variety of use cases. +* Learn more about loading, accessing, changing and comparing schedulers in the [Using different Schedulers](./using-diffusers/schedulers) guide. +* Explore prompt engineering, speed and memory optimizations, and tips and tricks for generating higher quality images with the [Stable Diffusion](./stable_diffusion) guide. +* Dive deeper into speeding up 🧚 Diffusers with guides on [optimized PyTorch on a GPU](./optimization/fp16), and inference guides for running [Stable Diffusion on Apple Silicon (M1/M2)](./optimization/mps) and [ONNX Runtime](./optimization/onnx). diff --git a/docs/source/jp/stable_diffusion.md b/docs/source/jp/stable_diffusion.md new file mode 100644 index 000000000000..31d5f9dc6bb8 --- /dev/null +++ b/docs/source/jp/stable_diffusion.md @@ -0,0 +1,260 @@ + + +# Effective and efficient diffusion + +[[open-in-colab]] + +Getting the [`DiffusionPipeline`] to generate images in a certain style or include what you want can be tricky. Often times, you have to run the [`DiffusionPipeline`] several times before you end up with an image you're happy with. But generating something out of nothing is a computationally intensive process, especially if you're running inference over and over again. + +This is why it's important to get the most *computational* (speed) and *memory* (GPU RAM) efficiency from the pipeline to reduce the time between inference cycles so you can iterate faster. + +This tutorial walks you through how to generate faster and better with the [`DiffusionPipeline`]. + +Begin by loading the [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) model: + +```python +from diffusers import DiffusionPipeline + +model_id = "runwayml/stable-diffusion-v1-5" +pipeline = DiffusionPipeline.from_pretrained(model_id, use_safetensors=True) +``` + +The example prompt you'll use is a portrait of an old warrior chief, but feel free to use your own prompt: + +```python +prompt = "portrait photo of a old warrior chief" +``` + +## Speed + + + +💡 If you don't have access to a GPU, you can use one for free from a GPU provider like [Colab](https://colab.research.google.com/)! + + + +One of the simplest ways to speed up inference is to place the pipeline on a GPU the same way you would with any PyTorch module: + +```python +pipeline = pipeline.to("cuda") +``` + +To make sure you can use the same image and improve on it, use a [`Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) and set a seed for [reproducibility](./using-diffusers/reproducibility): + +```python +import torch + +generator = torch.Generator("cuda").manual_seed(0) +``` + +Now you can generate an image: + +```python +image = pipeline(prompt, generator=generator).images[0] +image +``` + +
+ +
+ +This process took ~30 seconds on a T4 GPU (it might be faster if your allocated GPU is better than a T4). By default, the [`DiffusionPipeline`] runs inference with full `float32` precision for 50 inference steps. You can speed this up by switching to a lower precision like `float16` or running fewer inference steps. + +Let's start by loading the model in `float16` and generate an image: + +```python +import torch + +pipeline = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, use_safetensors=True) +pipeline = pipeline.to("cuda") +generator = torch.Generator("cuda").manual_seed(0) +image = pipeline(prompt, generator=generator).images[0] +image +``` + +
+ +
+ +This time, it only took ~11 seconds to generate the image, which is almost 3x faster than before! + + + +💡 We strongly suggest always running your pipelines in `float16`, and so far, we've rarely seen any degradation in output quality. + + + +Another option is to reduce the number of inference steps. Choosing a more efficient scheduler could help decrease the number of steps without sacrificing output quality. You can find which schedulers are compatible with the current model in the [`DiffusionPipeline`] by calling the `compatibles` method: + +```python +pipeline.scheduler.compatibles +[ + diffusers.schedulers.scheduling_lms_discrete.LMSDiscreteScheduler, + diffusers.schedulers.scheduling_unipc_multistep.UniPCMultistepScheduler, + diffusers.schedulers.scheduling_k_dpm_2_discrete.KDPM2DiscreteScheduler, + diffusers.schedulers.scheduling_deis_multistep.DEISMultistepScheduler, + diffusers.schedulers.scheduling_euler_discrete.EulerDiscreteScheduler, + diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler, + diffusers.schedulers.scheduling_ddpm.DDPMScheduler, + diffusers.schedulers.scheduling_dpmsolver_singlestep.DPMSolverSinglestepScheduler, + diffusers.schedulers.scheduling_k_dpm_2_ancestral_discrete.KDPM2AncestralDiscreteScheduler, + diffusers.schedulers.scheduling_heun_discrete.HeunDiscreteScheduler, + diffusers.schedulers.scheduling_pndm.PNDMScheduler, + diffusers.schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteScheduler, + diffusers.schedulers.scheduling_ddim.DDIMScheduler, +] +``` + +The Stable Diffusion model uses the [`PNDMScheduler`] by default which usually requires ~50 inference steps, but more performant schedulers like [`DPMSolverMultistepScheduler`], require only ~20 or 25 inference steps. Use the [`ConfigMixin.from_config`] method to load a new scheduler: + +```python +from diffusers import DPMSolverMultistepScheduler + +pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config) +``` + +Now set the `num_inference_steps` to 20: + +```python +generator = torch.Generator("cuda").manual_seed(0) +image = pipeline(prompt, generator=generator, num_inference_steps=20).images[0] +image +``` + +
+ +
+ +Great, you've managed to cut the inference time to just 4 seconds! ⚡ + +## Memory + +The other key to improving pipeline performance is consuming less memory, which indirectly implies more speed, since you're often trying to maximize the number of images generated per second. The easiest way to see how many images you can generate at once is to try out different batch sizes until you get an `OutOfMemoryError` (OOM). + +Create a function that'll generate a batch of images from a list of prompts and `Generators`. Make sure to assign each `Generator` a seed so you can reuse it if it produces a good result. + +```python +def get_inputs(batch_size=1): + generator = [torch.Generator("cuda").manual_seed(i) for i in range(batch_size)] + prompts = batch_size * [prompt] + num_inference_steps = 20 + + return {"prompt": prompts, "generator": generator, "num_inference_steps": num_inference_steps} +``` + +Start with `batch_size=4` and see how much memory you've consumed: + +```python +from diffusers.utils import make_image_grid + +images = pipeline(**get_inputs(batch_size=4)).images +make_image_grid(images, 2, 2) +``` + +Unless you have a GPU with more RAM, the code above probably returned an `OOM` error! Most of the memory is taken up by the cross-attention layers. Instead of running this operation in a batch, you can run it sequentially to save a significant amount of memory. All you have to do is configure the pipeline to use the [`~DiffusionPipeline.enable_attention_slicing`] function: + +```python +pipeline.enable_attention_slicing() +``` + +Now try increasing the `batch_size` to 8! + +```python +images = pipeline(**get_inputs(batch_size=8)).images +make_image_grid(images, rows=2, cols=4) +``` + +
+ +
+ +Whereas before you couldn't even generate a batch of 4 images, now you can generate a batch of 8 images at ~3.5 seconds per image! This is probably the fastest you can go on a T4 GPU without sacrificing quality. + +## Quality + +In the last two sections, you learned how to optimize the speed of your pipeline by using `fp16`, reducing the number of inference steps by using a more performant scheduler, and enabling attention slicing to reduce memory consumption. Now you're going to focus on how to improve the quality of generated images. + +### Better checkpoints + +The most obvious step is to use better checkpoints. The Stable Diffusion model is a good starting point, and since its official launch, several improved versions have also been released. However, using a newer version doesn't automatically mean you'll get better results. You'll still have to experiment with different checkpoints yourself, and do a little research (such as using [negative prompts](https://minimaxir.com/2022/11/stable-diffusion-negative-prompt/)) to get the best results. + +As the field grows, there are more and more high-quality checkpoints finetuned to produce certain styles. Try exploring the [Hub](https://huggingface.co/models?library=diffusers&sort=downloads) and [Diffusers Gallery](https://huggingface.co/spaces/huggingface-projects/diffusers-gallery) to find one you're interested in! + +### Better pipeline components + +You can also try replacing the current pipeline components with a newer version. Let's try loading the latest [autodecoder](https://huggingface.co/stabilityai/stable-diffusion-2-1/tree/main/vae) from Stability AI into the pipeline, and generate some images: + +```python +from diffusers import AutoencoderKL + +vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16).to("cuda") +pipeline.vae = vae +images = pipeline(**get_inputs(batch_size=8)).images +make_image_grid(images, rows=2, cols=4) +``` + +
+ +
+ +### Better prompt engineering + +The text prompt you use to generate an image is super important, so much so that it is called *prompt engineering*. Some considerations to keep during prompt engineering are: + +- How is the image or similar images of the one I want to generate stored on the internet? +- What additional detail can I give that steers the model towards the style I want? + +With this in mind, let's improve the prompt to include color and higher quality details: + +```python +prompt += ", tribal panther make up, blue on red, side profile, looking away, serious eyes" +prompt += " 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta" +``` + +Generate a batch of images with the new prompt: + +```python +images = pipeline(**get_inputs(batch_size=8)).images +make_image_grid(images, rows=2, cols=4) +``` + +
+ +
+ +Pretty impressive! Let's tweak the second image - corresponding to the `Generator` with a seed of `1` - a bit more by adding some text about the age of the subject: + +```python +prompts = [ + "portrait photo of the oldest warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", + "portrait photo of a old warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", + "portrait photo of a warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", + "portrait photo of a young warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", +] + +generator = [torch.Generator("cuda").manual_seed(1) for _ in range(len(prompts))] +images = pipeline(prompt=prompts, generator=generator, num_inference_steps=25).images +make_image_grid(images, 2, 2) +``` + +
+ +
+ +## Next steps + +In this tutorial, you learned how to optimize a [`DiffusionPipeline`] for computational and memory efficiency as well as improving the quality of generated outputs. If you're interested in making your pipeline even faster, take a look at the following resources: + +- Learn how [PyTorch 2.0](./optimization/torch2.0) and [`torch.compile`](https://pytorch.org/docs/stable/generated/torch.compile.html) can yield 5 - 300% faster inference speed. On an A100 GPU, inference can be up to 50% faster! +- If you can't use PyTorch 2, we recommend you install [xFormers](./optimization/xformers). Its memory-efficient attention mechanism works great with PyTorch 1.13.1 for faster speed and reduced memory consumption. +- Other optimization techniques, such as model offloading, are covered in [this guide](./optimization/fp16). diff --git a/docs/source/jp/training/create_dataset.md b/docs/source/jp/training/create_dataset.md new file mode 100644 index 000000000000..f215d3eb2c1b --- /dev/null +++ b/docs/source/jp/training/create_dataset.md @@ -0,0 +1,90 @@ +# Create a dataset for training + +There are many datasets on the [Hub](https://huggingface.co/datasets?task_categories=task_categories:text-to-image&sort=downloads) to train a model on, but if you can't find one you're interested in or want to use your own, you can create a dataset with the 🀗 [Datasets](hf.co/docs/datasets) library. The dataset structure depends on the task you want to train your model on. The most basic dataset structure is a directory of images for tasks like unconditional image generation. Another dataset structure may be a directory of images and a text file containing their corresponding text captions for tasks like text-to-image generation. + +This guide will show you two ways to create a dataset to finetune on: + +- provide a folder of images to the `--train_data_dir` argument +- upload a dataset to the Hub and pass the dataset repository id to the `--dataset_name` argument + + + +💡 Learn more about how to create an image dataset for training in the [Create an image dataset](https://huggingface.co/docs/datasets/image_dataset) guide. + + + +## Provide a dataset as a folder + +For unconditional generation, you can provide your own dataset as a folder of images. The training script uses the [`ImageFolder`](https://huggingface.co/docs/datasets/en/image_dataset#imagefolder) builder from 🀗 Datasets to automatically build a dataset from the folder. Your directory structure should look like: + +```bash +data_dir/xxx.png +data_dir/xxy.png +data_dir/[...]/xxz.png +``` + +Pass the path to the dataset directory to the `--train_data_dir` argument, and then you can start training: + +```bash +accelerate launch train_unconditional.py \ + --train_data_dir \ + +``` + +## Upload your data to the Hub + + + +💡 For more details and context about creating and uploading a dataset to the Hub, take a look at the [Image search with 🀗 Datasets](https://huggingface.co/blog/image-search-datasets) post. + + + +Start by creating a dataset with the [`ImageFolder`](https://huggingface.co/docs/datasets/image_load#imagefolder) feature, which creates an `image` column containing the PIL-encoded images. + +You can use the `data_dir` or `data_files` parameters to specify the location of the dataset. 
The `data_files` parameter supports mapping specific files to dataset splits like `train` or `test`: + +```python +from datasets import load_dataset + +# example 1: local folder +dataset = load_dataset("imagefolder", data_dir="path_to_your_folder") + +# example 2: local files (supported formats are tar, gzip, zip, xz, rar, zstd) +dataset = load_dataset("imagefolder", data_files="path_to_zip_file") + +# example 3: remote files (supported formats are tar, gzip, zip, xz, rar, zstd) +dataset = load_dataset( + "imagefolder", + data_files="https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip", +) + +# example 4: providing several splits +dataset = load_dataset( + "imagefolder", data_files={"train": ["path/to/file1", "path/to/file2"], "test": ["path/to/file3", "path/to/file4"]} +) +``` + +Then use the [`~datasets.Dataset.push_to_hub`] method to upload the dataset to the Hub: + +```python +# assuming you have ran the huggingface-cli login command in a terminal +dataset.push_to_hub("name_of_your_dataset") + +# if you want to push to a private repo, simply pass private=True: +dataset.push_to_hub("name_of_your_dataset", private=True) +``` + +Now the dataset is available for training by passing the dataset name to the `--dataset_name` argument: + +```bash +accelerate launch --mixed_precision="fp16" train_text_to_image.py \ + --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \ + --dataset_name="name_of_your_dataset" \ + +``` + +## Next steps + +Now that you've created a dataset, you can plug it into the `train_data_dir` (if your dataset is local) or `dataset_name` (if your dataset is on the Hub) arguments of a training script. + +For your next steps, feel free to try and use your dataset to train a model for [unconditional generation](unconditional_training) or [text-to-image generation](text2image)! \ No newline at end of file diff --git a/docs/source/jp/training/dreambooth.md b/docs/source/jp/training/dreambooth.md new file mode 100644 index 000000000000..30a20a971966 --- /dev/null +++ b/docs/source/jp/training/dreambooth.md @@ -0,0 +1,710 @@ + + +# DreamBooth + +[DreamBooth](https://arxiv.org/abs/2208.12242) is a method to personalize text-to-image models like Stable Diffusion given just a few (3-5) images of a subject. It allows the model to generate contextualized images of the subject in different scenes, poses, and views. + +![Dreambooth examples from the project's blog](https://dreambooth.github.io/DreamBooth_files/teaser_static.jpg) +Dreambooth examples from the project's blog. + +This guide will show you how to finetune DreamBooth with the [`CompVis/stable-diffusion-v1-4`](https://huggingface.co/CompVis/stable-diffusion-v1-4) model for various GPU sizes, and with Flax. All the training scripts for DreamBooth used in this guide can be found [here](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth) if you're interested in digging deeper and seeing how things work. + +Before running the scripts, make sure you install the library's training dependencies. We also recommend installing 🧚 Diffusers from the `main` GitHub branch: + +```bash +pip install git+https://github.com/huggingface/diffusers +pip install -U -r diffusers/examples/dreambooth/requirements.txt +``` + +xFormers is not part of the training requirements, but we recommend you [install](../optimization/xformers) it if you can because it could make your training faster and less memory intensive. 
+ +After all the dependencies have been set up, initialize a [🀗 Accelerate](https://github.com/huggingface/accelerate/) environment with: + +```bash +accelerate config +``` + +To setup a default 🀗 Accelerate environment without choosing any configurations: + +```bash +accelerate config default +``` + +Or if your environment doesn't support an interactive shell like a notebook, you can use: + +```py +from accelerate.utils import write_basic_config + +write_basic_config() +``` + +Finally, download a [few images of a dog](https://huggingface.co/datasets/diffusers/dog-example) to DreamBooth with: + +```py +from huggingface_hub import snapshot_download + +local_dir = "./dog" +snapshot_download( + "diffusers/dog-example", + local_dir=local_dir, + repo_type="dataset", + ignore_patterns=".gitattributes", +) +``` + +To use your own dataset, take a look at the [Create a dataset for training](create_dataset) guide. + +## Finetuning + + + +DreamBooth finetuning is very sensitive to hyperparameters and easy to overfit. We recommend you take a look at our [in-depth analysis](https://huggingface.co/blog/dreambooth) with recommended settings for different subjects to help you choose the appropriate hyperparameters. + + + + + +Set the `INSTANCE_DIR` environment variable to the path of the directory containing the dog images. + +Specify the `MODEL_NAME` environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`] argument. The `instance_prompt` argument is a text prompt that contains a unique identifier, such as `sks`, and the class the image belongs to, which in this example is `a photo of a sks dog`. + +```bash +export MODEL_NAME="CompVis/stable-diffusion-v1-4" +export INSTANCE_DIR="./dog" +export OUTPUT_DIR="path_to_saved_model" +``` + +Then you can launch the training script (you can find the full training script [here](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth.py)) with the following command: + +```bash +accelerate launch train_dreambooth.py \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --instance_data_dir=$INSTANCE_DIR \ + --output_dir=$OUTPUT_DIR \ + --instance_prompt="a photo of sks dog" \ + --resolution=512 \ + --train_batch_size=1 \ + --gradient_accumulation_steps=1 \ + --learning_rate=5e-6 \ + --lr_scheduler="constant" \ + --lr_warmup_steps=0 \ + --max_train_steps=400 \ + --push_to_hub +``` + + +If you have access to TPUs or want to train even faster, you can try out the [Flax training script](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_flax.py). The Flax training script doesn't support gradient checkpointing or gradient accumulation, so you'll need a GPU with at least 30GB of memory. + +Before running the script, make sure you have the requirements installed: + +```bash +pip install -U -r requirements.txt +``` + +Specify the `MODEL_NAME` environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`] argument. The `instance_prompt` argument is a text prompt that contains a unique identifier, such as `sks`, and the class the image belongs to, which in this example is `a photo of a sks dog`. 
+ +Now you can launch the training script with the following command: + +```bash +export MODEL_NAME="duongna/stable-diffusion-v1-4-flax" +export INSTANCE_DIR="./dog" +export OUTPUT_DIR="path-to-save-model" + +python train_dreambooth_flax.py \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --instance_data_dir=$INSTANCE_DIR \ + --output_dir=$OUTPUT_DIR \ + --instance_prompt="a photo of sks dog" \ + --resolution=512 \ + --train_batch_size=1 \ + --learning_rate=5e-6 \ + --max_train_steps=400 \ + --push_to_hub +``` + + + +## Finetuning with prior-preserving loss + +Prior preservation is used to avoid overfitting and language-drift (check out the [paper](https://arxiv.org/abs/2208.12242) to learn more if you're interested). For prior preservation, you use other images of the same class as part of the training process. The nice thing is that you can generate those images using the Stable Diffusion model itself! The training script will save the generated images to a local path you specify. + +The authors recommend generating `num_epochs * num_samples` images for prior preservation. In most cases, 200-300 images work well. + + + +```bash +export MODEL_NAME="CompVis/stable-diffusion-v1-4" +export INSTANCE_DIR="./dog" +export CLASS_DIR="path_to_class_images" +export OUTPUT_DIR="path_to_saved_model" + +accelerate launch train_dreambooth.py \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --instance_data_dir=$INSTANCE_DIR \ + --class_data_dir=$CLASS_DIR \ + --output_dir=$OUTPUT_DIR \ + --with_prior_preservation --prior_loss_weight=1.0 \ + --instance_prompt="a photo of sks dog" \ + --class_prompt="a photo of dog" \ + --resolution=512 \ + --train_batch_size=1 \ + --gradient_accumulation_steps=1 \ + --learning_rate=5e-6 \ + --lr_scheduler="constant" \ + --lr_warmup_steps=0 \ + --num_class_images=200 \ + --max_train_steps=800 \ + --push_to_hub +``` + + +```bash +export MODEL_NAME="duongna/stable-diffusion-v1-4-flax" +export INSTANCE_DIR="./dog" +export CLASS_DIR="path-to-class-images" +export OUTPUT_DIR="path-to-save-model" + +python train_dreambooth_flax.py \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --instance_data_dir=$INSTANCE_DIR \ + --class_data_dir=$CLASS_DIR \ + --output_dir=$OUTPUT_DIR \ + --with_prior_preservation --prior_loss_weight=1.0 \ + --instance_prompt="a photo of sks dog" \ + --class_prompt="a photo of dog" \ + --resolution=512 \ + --train_batch_size=1 \ + --learning_rate=5e-6 \ + --num_class_images=200 \ + --max_train_steps=800 \ + --push_to_hub +``` + + + +## Finetuning the text encoder and UNet + +The script also allows you to finetune the `text_encoder` along with the `unet`. In our experiments (check out the [Training Stable Diffusion with DreamBooth using 🧚 Diffusers](https://huggingface.co/blog/dreambooth) post for more details), this yields much better results, especially when generating images of faces. + + + +Training the text encoder requires additional memory and it won't fit on a 16GB GPU. You'll need at least 24GB VRAM to use this option. 
+ + + +Pass the `--train_text_encoder` argument to the training script to enable finetuning the `text_encoder` and `unet`: + + + +```bash +export MODEL_NAME="CompVis/stable-diffusion-v1-4" +export INSTANCE_DIR="./dog" +export CLASS_DIR="path_to_class_images" +export OUTPUT_DIR="path_to_saved_model" + +accelerate launch train_dreambooth.py \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --train_text_encoder \ + --instance_data_dir=$INSTANCE_DIR \ + --class_data_dir=$CLASS_DIR \ + --output_dir=$OUTPUT_DIR \ + --with_prior_preservation --prior_loss_weight=1.0 \ + --instance_prompt="a photo of sks dog" \ + --class_prompt="a photo of dog" \ + --resolution=512 \ + --train_batch_size=1 \ + --use_8bit_adam \ + --gradient_checkpointing \ + --learning_rate=2e-6 \ + --lr_scheduler="constant" \ + --lr_warmup_steps=0 \ + --num_class_images=200 \ + --max_train_steps=800 \ + --push_to_hub +``` + + +```bash +export MODEL_NAME="duongna/stable-diffusion-v1-4-flax" +export INSTANCE_DIR="./dog" +export CLASS_DIR="path-to-class-images" +export OUTPUT_DIR="path-to-save-model" + +python train_dreambooth_flax.py \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --train_text_encoder \ + --instance_data_dir=$INSTANCE_DIR \ + --class_data_dir=$CLASS_DIR \ + --output_dir=$OUTPUT_DIR \ + --with_prior_preservation --prior_loss_weight=1.0 \ + --instance_prompt="a photo of sks dog" \ + --class_prompt="a photo of dog" \ + --resolution=512 \ + --train_batch_size=1 \ + --learning_rate=2e-6 \ + --num_class_images=200 \ + --max_train_steps=800 \ + --push_to_hub +``` + + + +## Finetuning with LoRA + +You can also use Low-Rank Adaptation of Large Language Models (LoRA), a fine-tuning technique for accelerating training large models, on DreamBooth. For more details, take a look at the [LoRA training](./lora#dreambooth) guide. + +## Saving checkpoints while training + +It's easy to overfit while training with Dreambooth, so sometimes it's useful to save regular checkpoints during the training process. One of the intermediate checkpoints might actually work better than the final model! Pass the following argument to the training script to enable saving checkpoints: + +```bash + --checkpointing_steps=500 +``` + +This saves the full training state in subfolders of your `output_dir`. Subfolder names begin with the prefix `checkpoint-`, followed by the number of steps performed so far; for example, `checkpoint-1500` would be a checkpoint saved after 1500 training steps. + +### Resume training from a saved checkpoint + +If you want to resume training from any of the saved checkpoints, you can pass the argument `--resume_from_checkpoint` to the script and specify the name of the checkpoint you want to use. You can also use the special string `"latest"` to resume from the last saved checkpoint (the one with the largest number of steps). For example, the following would resume training from the checkpoint saved after 1500 steps: + +```bash + --resume_from_checkpoint="checkpoint-1500" +``` + +This is a good opportunity to tweak some of your hyperparameters if you wish. + +### Inference from a saved checkpoint + +Saved checkpoints are stored in a format suitable for resuming training. They not only include the model weights, but also the state of the optimizer, data loaders, and learning rate. + +If you have **`"accelerate>=0.16.0"`** installed, use the following code to run +inference from an intermediate checkpoint. 
+ +```python +from diffusers import DiffusionPipeline, UNet2DConditionModel +from transformers import CLIPTextModel +import torch + +# Load the pipeline with the same arguments (model, revision) that were used for training +model_id = "CompVis/stable-diffusion-v1-4" + +unet = UNet2DConditionModel.from_pretrained("/sddata/dreambooth/daruma-v2-1/checkpoint-100/unet") + +# if you have trained with `--args.train_text_encoder` make sure to also load the text encoder +text_encoder = CLIPTextModel.from_pretrained("/sddata/dreambooth/daruma-v2-1/checkpoint-100/text_encoder") + +pipeline = DiffusionPipeline.from_pretrained( + model_id, unet=unet, text_encoder=text_encoder, dtype=torch.float16, use_safetensors=True +) +pipeline.to("cuda") + +# Perform inference, or save, or push to the hub +pipeline.save_pretrained("dreambooth-pipeline") +``` + +If you have **`"accelerate<0.16.0"`** installed, you need to convert it to an inference pipeline first: + +```python +from accelerate import Accelerator +from diffusers import DiffusionPipeline + +# Load the pipeline with the same arguments (model, revision) that were used for training +model_id = "CompVis/stable-diffusion-v1-4" +pipeline = DiffusionPipeline.from_pretrained(model_id, use_safetensors=True) + +accelerator = Accelerator() + +# Use text_encoder if `--train_text_encoder` was used for the initial training +unet, text_encoder = accelerator.prepare(pipeline.unet, pipeline.text_encoder) + +# Restore state from a checkpoint path. You have to use the absolute path here. +accelerator.load_state("/sddata/dreambooth/daruma-v2-1/checkpoint-100") + +# Rebuild the pipeline with the unwrapped models (assignment to .unet and .text_encoder should work too) +pipeline = DiffusionPipeline.from_pretrained( + model_id, + unet=accelerator.unwrap_model(unet), + text_encoder=accelerator.unwrap_model(text_encoder), + use_safetensors=True, +) + +# Perform inference, or save, or push to the hub +pipeline.save_pretrained("dreambooth-pipeline") +``` + +## Optimizations for different GPU sizes + +Depending on your hardware, there are a few different ways to optimize DreamBooth on GPUs from 16GB to just 8GB! + +### xFormers + +[xFormers](https://github.com/facebookresearch/xformers) is a toolbox for optimizing Transformers, and it includes a [memory-efficient attention](https://facebookresearch.github.io/xformers/components/ops.html#module-xformers.ops) mechanism that is used in 🧚 Diffusers. You'll need to [install xFormers](./optimization/xformers) and then add the following argument to your training script: + +```bash + --enable_xformers_memory_efficient_attention +``` + +xFormers is not available in Flax. + +### Set gradients to none + +Another way you can lower your memory footprint is to [set the gradients](https://pytorch.org/docs/stable/generated/torch.optim.Optimizer.zero_grad.html) to `None` instead of zero. However, this may change certain behaviors, so if you run into any issues, try removing this argument. Add the following argument to your training script to set the gradients to `None`: + +```bash + --set_grads_to_none +``` + +### 16GB GPU + +With the help of gradient checkpointing and [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) 8-bit optimizer, it's possible to train DreamBooth on a 16GB GPU. 
Make sure you have bitsandbytes installed: + +```bash +pip install bitsandbytes +``` + +Then pass the `--use_8bit_adam` option to the training script: + +```bash +export MODEL_NAME="CompVis/stable-diffusion-v1-4" +export INSTANCE_DIR="./dog" +export CLASS_DIR="path_to_class_images" +export OUTPUT_DIR="path_to_saved_model" + +accelerate launch train_dreambooth.py \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --instance_data_dir=$INSTANCE_DIR \ + --class_data_dir=$CLASS_DIR \ + --output_dir=$OUTPUT_DIR \ + --with_prior_preservation --prior_loss_weight=1.0 \ + --instance_prompt="a photo of sks dog" \ + --class_prompt="a photo of dog" \ + --resolution=512 \ + --train_batch_size=1 \ + --gradient_accumulation_steps=2 --gradient_checkpointing \ + --use_8bit_adam \ + --learning_rate=5e-6 \ + --lr_scheduler="constant" \ + --lr_warmup_steps=0 \ + --num_class_images=200 \ + --max_train_steps=800 \ + --push_to_hub +``` + +### 12GB GPU + +To run DreamBooth on a 12GB GPU, you'll need to enable gradient checkpointing, the 8-bit optimizer, xFormers, and set the gradients to `None`: + +```bash +export MODEL_NAME="CompVis/stable-diffusion-v1-4" +export INSTANCE_DIR="./dog" +export CLASS_DIR="path-to-class-images" +export OUTPUT_DIR="path-to-save-model" + +accelerate launch train_dreambooth.py \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --instance_data_dir=$INSTANCE_DIR \ + --class_data_dir=$CLASS_DIR \ + --output_dir=$OUTPUT_DIR \ + --with_prior_preservation --prior_loss_weight=1.0 \ + --instance_prompt="a photo of sks dog" \ + --class_prompt="a photo of dog" \ + --resolution=512 \ + --train_batch_size=1 \ + --gradient_accumulation_steps=1 --gradient_checkpointing \ + --use_8bit_adam \ + --enable_xformers_memory_efficient_attention \ + --set_grads_to_none \ + --learning_rate=2e-6 \ + --lr_scheduler="constant" \ + --lr_warmup_steps=0 \ + --num_class_images=200 \ + --max_train_steps=800 \ + --push_to_hub +``` + +### 8 GB GPU + +For 8GB GPUs, you'll need the help of [DeepSpeed](https://www.deepspeed.ai/) to offload some +tensors from the VRAM to either the CPU or NVME, enabling training with less GPU memory. + +Run the following command to configure your 🀗 Accelerate environment: + +```bash +accelerate config +``` + +During configuration, confirm that you want to use DeepSpeed. Now it's possible to train on under 8GB VRAM by combining DeepSpeed stage 2, fp16 mixed precision, and offloading the model parameters and the optimizer state to the CPU. The drawback is that this requires more system RAM, about 25 GB. See [the DeepSpeed documentation](https://huggingface.co/docs/accelerate/usage_guides/deepspeed) for more configuration options. + +You should also change the default Adam optimizer to DeepSpeed's optimized version of Adam +[`deepspeed.ops.adam.DeepSpeedCPUAdam`](https://deepspeed.readthedocs.io/en/latest/optimizers.html#adam-cpu) for a substantial speedup. Enabling `DeepSpeedCPUAdam` requires your system's CUDA toolchain version to be the same as the one installed with PyTorch. + +8-bit optimizers don't seem to be compatible with DeepSpeed at the moment. 
+ +Launch training with the following command: + +```bash +export MODEL_NAME="CompVis/stable-diffusion-v1-4" +export INSTANCE_DIR="./dog" +export CLASS_DIR="path_to_class_images" +export OUTPUT_DIR="path_to_saved_model" + +accelerate launch train_dreambooth.py \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --instance_data_dir=$INSTANCE_DIR \ + --class_data_dir=$CLASS_DIR \ + --output_dir=$OUTPUT_DIR \ + --with_prior_preservation --prior_loss_weight=1.0 \ + --instance_prompt="a photo of sks dog" \ + --class_prompt="a photo of dog" \ + --resolution=512 \ + --train_batch_size=1 \ + --sample_batch_size=1 \ + --gradient_accumulation_steps=1 --gradient_checkpointing \ + --learning_rate=5e-6 \ + --lr_scheduler="constant" \ + --lr_warmup_steps=0 \ + --num_class_images=200 \ + --max_train_steps=800 \ + --mixed_precision=fp16 \ + --push_to_hub +``` + +## Inference + +Once you have trained a model, specify the path to where the model is saved, and use it for inference in the [`StableDiffusionPipeline`]. Make sure your prompts include the special `identifier` used during training (`sks` in the previous examples). + +If you have **`"accelerate>=0.16.0"`** installed, you can use the following code to run +inference from an intermediate checkpoint: + +```python +from diffusers import DiffusionPipeline +import torch + +model_id = "path_to_saved_model" +pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, use_safetensors=True).to("cuda") + +prompt = "A photo of sks dog in a bucket" +image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0] + +image.save("dog-bucket.png") +``` + +You may also run inference from any of the [saved training checkpoints](#inference-from-a-saved-checkpoint). + +## IF + +You can use the lora and full dreambooth scripts to train the text to image [IF model](https://huggingface.co/DeepFloyd/IF-I-XL-v1.0) and the stage II upscaler +[IF model](https://huggingface.co/DeepFloyd/IF-II-L-v1.0). + +Note that IF has a predicted variance, and our finetuning scripts only train the models predicted error, so for finetuned IF models we switch to a fixed +variance schedule. The full finetuning scripts will update the scheduler config for the full saved model. However, when loading saved LoRA weights, you +must also update the pipeline's scheduler config. + +```py +from diffusers import DiffusionPipeline + +pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", use_safetensors=True) + +pipe.load_lora_weights("") + +# Update scheduler config to fixed variance schedule +pipe.scheduler = pipe.scheduler.__class__.from_config(pipe.scheduler.config, variance_type="fixed_small") +``` + +Additionally, a few alternative cli flags are needed for IF. + +`--resolution=64`: IF is a pixel space diffusion model. In order to operate on un-compressed pixels, the input images are of a much smaller resolution. + +`--pre_compute_text_embeddings`: IF uses [T5](https://huggingface.co/docs/transformers/model_doc/t5) for its text encoder. In order to save GPU memory, we pre compute all text embeddings and then de-allocate +T5. + +`--tokenizer_max_length=77`: T5 has a longer default text length, but the default IF encoding procedure uses a smaller number. + +`--text_encoder_use_attention_mask`: T5 passes the attention mask to the text encoder. + +### Tips and Tricks +We find LoRA to be sufficient for finetuning the stage I model as the low resolution of the model makes representing finegrained detail hard regardless. 
+ +For common and/or not-visually complex object concepts, you can get away with not-finetuning the upscaler. Just be sure to adjust the prompt passed to the +upscaler to remove the new token from the instance prompt. I.e. if your stage I prompt is "a sks dog", use "a dog" for your stage II prompt. + +For finegrained detail like faces that aren't present in the original training set, we find that full finetuning of the stage II upscaler is better than +LoRA finetuning stage II. + +For finegrained detail like faces, we find that lower learning rates along with larger batch sizes work best. + +For stage II, we find that lower learning rates are also needed. + +We found experimentally that the DDPM scheduler with the default larger number of denoising steps to sometimes work better than the DPM Solver scheduler +used in the training scripts. + +### Stage II additional validation images + +The stage II validation requires images to upscale, we can download a downsized version of the training set: + +```py +from huggingface_hub import snapshot_download + +local_dir = "./dog_downsized" +snapshot_download( + "diffusers/dog-example-downsized", + local_dir=local_dir, + repo_type="dataset", + ignore_patterns=".gitattributes", +) +``` + +### IF stage I LoRA Dreambooth +This training configuration requires ~28 GB VRAM. + +```sh +export MODEL_NAME="DeepFloyd/IF-I-XL-v1.0" +export INSTANCE_DIR="dog" +export OUTPUT_DIR="dreambooth_dog_lora" + +accelerate launch train_dreambooth_lora.py \ + --report_to wandb \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --instance_data_dir=$INSTANCE_DIR \ + --output_dir=$OUTPUT_DIR \ + --instance_prompt="a sks dog" \ + --resolution=64 \ + --train_batch_size=4 \ + --gradient_accumulation_steps=1 \ + --learning_rate=5e-6 \ + --scale_lr \ + --max_train_steps=1200 \ + --validation_prompt="a sks dog" \ + --validation_epochs=25 \ + --checkpointing_steps=100 \ + --pre_compute_text_embeddings \ + --tokenizer_max_length=77 \ + --text_encoder_use_attention_mask +``` + +### IF stage II LoRA Dreambooth + +`--validation_images`: These images are upscaled during validation steps. + +`--class_labels_conditioning=timesteps`: Pass additional conditioning to the UNet needed for stage II. + +`--learning_rate=1e-6`: Lower learning rate than stage I. + +`--resolution=256`: The upscaler expects higher resolution inputs + +```sh +export MODEL_NAME="DeepFloyd/IF-II-L-v1.0" +export INSTANCE_DIR="dog" +export OUTPUT_DIR="dreambooth_dog_upscale" +export VALIDATION_IMAGES="dog_downsized/image_1.png dog_downsized/image_2.png dog_downsized/image_3.png dog_downsized/image_4.png" + +python train_dreambooth_lora.py \ + --report_to wandb \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --instance_data_dir=$INSTANCE_DIR \ + --output_dir=$OUTPUT_DIR \ + --instance_prompt="a sks dog" \ + --resolution=256 \ + --train_batch_size=4 \ + --gradient_accumulation_steps=1 \ + --learning_rate=1e-6 \ + --max_train_steps=2000 \ + --validation_prompt="a sks dog" \ + --validation_epochs=100 \ + --checkpointing_steps=500 \ + --pre_compute_text_embeddings \ + --tokenizer_max_length=77 \ + --text_encoder_use_attention_mask \ + --validation_images $VALIDATION_IMAGES \ + --class_labels_conditioning=timesteps +``` + +### IF Stage I Full Dreambooth +`--skip_save_text_encoder`: When training the full model, this will skip saving the entire T5 with the finetuned model. You can still load the pipeline +with a T5 loaded from the original model. 
+ +`use_8bit_adam`: Due to the size of the optimizer states, we recommend training the full XL IF model with 8bit adam. + +`--learning_rate=1e-7`: For full dreambooth, IF requires very low learning rates. With higher learning rates model quality will degrade. Note that it is +likely the learning rate can be increased with larger batch sizes. + +Using 8bit adam and a batch size of 4, the model can be trained in ~48 GB VRAM. + +```sh +export MODEL_NAME="DeepFloyd/IF-I-XL-v1.0" + +export INSTANCE_DIR="dog" +export OUTPUT_DIR="dreambooth_if" + +accelerate launch train_dreambooth.py \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --instance_data_dir=$INSTANCE_DIR \ + --output_dir=$OUTPUT_DIR \ + --instance_prompt="a photo of sks dog" \ + --resolution=64 \ + --train_batch_size=4 \ + --gradient_accumulation_steps=1 \ + --learning_rate=1e-7 \ + --max_train_steps=150 \ + --validation_prompt "a photo of sks dog" \ + --validation_steps 25 \ + --text_encoder_use_attention_mask \ + --tokenizer_max_length 77 \ + --pre_compute_text_embeddings \ + --use_8bit_adam \ + --set_grads_to_none \ + --skip_save_text_encoder \ + --push_to_hub +``` + +### IF Stage II Full Dreambooth + +`--learning_rate=5e-6`: With a smaller effective batch size of 4, we found that we required learning rates as low as +1e-8. + +`--resolution=256`: The upscaler expects higher resolution inputs + +`--train_batch_size=2` and `--gradient_accumulation_steps=6`: We found that full training of stage II particularly with +faces required large effective batch sizes. + +```sh +export MODEL_NAME="DeepFloyd/IF-II-L-v1.0" +export INSTANCE_DIR="dog" +export OUTPUT_DIR="dreambooth_dog_upscale" +export VALIDATION_IMAGES="dog_downsized/image_1.png dog_downsized/image_2.png dog_downsized/image_3.png dog_downsized/image_4.png" + +accelerate launch train_dreambooth.py \ + --report_to wandb \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --instance_data_dir=$INSTANCE_DIR \ + --output_dir=$OUTPUT_DIR \ + --instance_prompt="a sks dog" \ + --resolution=256 \ + --train_batch_size=2 \ + --gradient_accumulation_steps=6 \ + --learning_rate=5e-6 \ + --max_train_steps=2000 \ + --validation_prompt="a sks dog" \ + --validation_steps=150 \ + --checkpointing_steps=500 \ + --pre_compute_text_embeddings \ + --tokenizer_max_length=77 \ + --text_encoder_use_attention_mask \ + --validation_images $VALIDATION_IMAGES \ + --class_labels_conditioning timesteps \ + --push_to_hub +``` + +## Stable Diffusion XL + +We support fine-tuning of the UNet and text encoders shipped in [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) with DreamBooth and LoRA via the `train_dreambooth_lora_sdxl.py` script. Please refer to the docs [here](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/README_sdxl.md). \ No newline at end of file diff --git a/docs/source/jp/training/overview.md b/docs/source/jp/training/overview.md new file mode 100644 index 000000000000..c6fe339eda73 --- /dev/null +++ b/docs/source/jp/training/overview.md @@ -0,0 +1,84 @@ + + +# 🧚 Diffusers Training Examples + +Diffusers training examples are a collection of scripts to demonstrate how to effectively use the `diffusers` library +for a variety of use cases. 
+ +**Note**: If you are looking for **official** examples on how to use `diffusers` for inference, +please have a look at [src/diffusers/pipelines](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines) + +Our examples aspire to be **self-contained**, **easy-to-tweak**, **beginner-friendly** and for **one-purpose-only**. +More specifically, this means: + +- **Self-contained**: An example script shall only depend on "pip-install-able" Python packages that can be found in a `requirements.txt` file. Example scripts shall **not** depend on any local files. This means that one can simply download an example script, *e.g.* [train_unconditional.py](https://github.com/huggingface/diffusers/blob/main/examples/unconditional_image_generation/train_unconditional.py), install the required dependencies, *e.g.* [requirements.txt](https://github.com/huggingface/diffusers/blob/main/examples/unconditional_image_generation/requirements.txt) and execute the example script. +- **Easy-to-tweak**: While we strive to present as many use cases as possible, the example scripts are just that - examples. It is expected that they won't work out-of-the box on your specific problem and that you will be required to change a few lines of code to adapt them to your needs. To help you with that, most of the examples fully expose the preprocessing of the data and the training loop to allow you to tweak and edit them as required. +- **Beginner-friendly**: We do not aim for providing state-of-the-art training scripts for the newest models, but rather examples that can be used as a way to better understand diffusion models and how to use them with the `diffusers` library. We often purposefully leave out certain state-of-the-art methods if we consider them too complex for beginners. +- **One-purpose-only**: Examples should show one task and one task only. Even if a task is from a modeling +point of view very similar, *e.g.* image super-resolution and image modification tend to use the same model and training method, we want examples to showcase only one task to keep them as readable and easy-to-understand as possible. + +We provide **official** examples that cover the most popular tasks of diffusion models. +*Official* examples are **actively** maintained by the `diffusers` maintainers and we try to rigorously follow our example philosophy as defined above. +If you feel like another important example should exist, we are more than happy to welcome a [Feature Request](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feature_request.md&title=) or directly a [Pull Request](https://github.com/huggingface/diffusers/compare) from you! + +Training examples show how to pretrain or fine-tune diffusion models for a variety of tasks. Currently we support: + +- [Unconditional Training](./unconditional_training) +- [Text-to-Image Training](./text2image)* +- [Text Inversion](./text_inversion) +- [Dreambooth](./dreambooth)* +- [LoRA Support](./lora)* +- [ControlNet](./controlnet)* +- [InstructPix2Pix](./instructpix2pix)* +- [Custom Diffusion](./custom_diffusion) +- [T2I-Adapters](./t2i_adapters)* + +*: Supports [Stable Diffusion XL](../api/pipelines/stable_diffusion/stable_diffusion_xl). + +If possible, please [install xFormers](../optimization/xformers) for memory efficient attention. This could help make your training faster and less memory intensive. 
+ +| Task | 🀗 Accelerate | 🀗 Datasets | Colab +|---|---|:---:|:---:| +| [**Unconditional Image Generation**](./unconditional_training) | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/training_example.ipynb) +| [**Text-to-Image fine-tuning**](./text2image) | ✅ | ✅ | +| [**Textual Inversion**](./text_inversion) | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/sd_textual_inversion_training.ipynb) +| [**Dreambooth**](./dreambooth) | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/sd_dreambooth_training.ipynb) +| [**Training with LoRA**](./lora) | ✅ | - | - | +| [**ControlNet**](./controlnet) | ✅ | ✅ | - | +| [**InstructPix2Pix**](./instructpix2pix) | ✅ | ✅ | - | +| [**Custom Diffusion**](./custom_diffusion) | ✅ | ✅ | - | +| [**T2I Adapters**](./t2i_adapters) | ✅ | ✅ | - | + +## Community + +In addition, we provide **community** examples, which are examples added and maintained by our community. +Community examples can consist of both *training* examples or *inference* pipelines. +For such examples, we are more lenient regarding the philosophy defined above and also cannot guarantee to provide maintenance for every issue. +Examples that are useful for the community, but are either not yet deemed popular or not yet following our above philosophy should go into the [community examples](https://github.com/huggingface/diffusers/tree/main/examples/community) folder. The community folder therefore includes training examples and inference pipelines. +**Note**: Community examples can be a [great first contribution](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) to show to the community how you like to use `diffusers` 🪄. + +## Important note + +To make sure you can successfully run the latest versions of the example scripts, you have to **install the library from source** and install some example-specific requirements. To do this, execute the following steps in a new virtual environment: + +```bash +git clone https://github.com/huggingface/diffusers +cd diffusers +pip install . +``` + +Then cd in the example folder of your choice and run + +```bash +pip install -r requirements.txt +``` diff --git a/docs/source/jp/training/text_inversion.md b/docs/source/jp/training/text_inversion.md new file mode 100644 index 000000000000..7cc7d57e7c6c --- /dev/null +++ b/docs/source/jp/training/text_inversion.md @@ -0,0 +1,277 @@ + + + + +# Textual Inversion + +[Textual Inversion](https://arxiv.org/abs/2208.01618) is a technique for capturing novel concepts from a small number of example images. While the technique was originally demonstrated with a [latent diffusion model](https://github.com/CompVis/latent-diffusion), it has since been applied to other model variants like [Stable Diffusion](https://huggingface.co/docs/diffusers/main/en/conceptual/stable_diffusion). The learned concepts can be used to better control the images generated from text-to-image pipelines. It learns new "words" in the text encoder's embedding space, which are used within text prompts for personalized image generation. 
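+
+To make "learning new words" a bit more concrete, the sketch below shows how a placeholder token gets its own row in the text encoder's embedding matrix; during training, only that row is optimized. This is illustrative only (the training script does this for you), and the `<cat-toy>` placeholder and `toy` initializer token are example choices:
+
+```python
+# Illustrative sketch of the mechanism behind Textual Inversion.
+import torch
+from transformers import CLIPTextModel, CLIPTokenizer
+
+model_id = "runwayml/stable-diffusion-v1-5"
+tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
+text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
+
+# Register the placeholder token and grow the embedding matrix by one row
+tokenizer.add_tokens("<cat-toy>")
+text_encoder.resize_token_embeddings(len(tokenizer))
+
+# Initialize the new row from a related existing token ("toy"); training then updates only this row
+with torch.no_grad():
+    embeddings = text_encoder.get_input_embeddings().weight
+    new_id = tokenizer.convert_tokens_to_ids("<cat-toy>")
+    init_id = tokenizer.encode("toy", add_special_tokens=False)[0]
+    embeddings[new_id] = embeddings[init_id].clone()
+```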
+ +![Textual Inversion example](https://textual-inversion.github.io/static/images/editing/colorful_teapot.JPG) +By using just 3-5 images you can teach new concepts to a model such as Stable Diffusion for personalized image generation (image source). + +This guide will show you how to train a [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) model with Textual Inversion. All the training scripts for Textual Inversion used in this guide can be found [here](https://github.com/huggingface/diffusers/tree/main/examples/textual_inversion) if you're interested in taking a closer look at how things work under the hood. + + + +There is a community-created collection of trained Textual Inversion models in the [Stable Diffusion Textual Inversion Concepts Library](https://huggingface.co/sd-concepts-library) which are readily available for inference. Over time, this'll hopefully grow into a useful resource as more concepts are added! + + + +Before you begin, make sure you install the library's training dependencies: + +```bash +pip install diffusers accelerate transformers +``` + +After all the dependencies have been set up, initialize a [🀗Accelerate](https://github.com/huggingface/accelerate/) environment with: + +```bash +accelerate config +``` + +To setup a default 🀗 Accelerate environment without choosing any configurations: + +```bash +accelerate config default +``` + +Or if your environment doesn't support an interactive shell like a notebook, you can use: + +```bash +from accelerate.utils import write_basic_config + +write_basic_config() +``` + +Finally, you try and [install xFormers](https://huggingface.co/docs/diffusers/main/en/training/optimization/xformers) to reduce your memory footprint with xFormers memory-efficient attention. Once you have xFormers installed, add the `--enable_xformers_memory_efficient_attention` argument to the training script. xFormers is not supported for Flax. + +## Upload model to Hub + +If you want to store your model on the Hub, add the following argument to the training script: + +```bash +--push_to_hub +``` + +## Save and load checkpoints + +It is often a good idea to regularly save checkpoints of your model during training. This way, you can resume training from a saved checkpoint if your training is interrupted for any reason. To save a checkpoint, pass the following argument to the training script to save the full training state in a subfolder in `output_dir` every 500 steps: + +```bash +--checkpointing_steps=500 +``` + +To resume training from a saved checkpoint, pass the following argument to the training script and the specific checkpoint you'd like to resume from: + +```bash +--resume_from_checkpoint="checkpoint-1500" +``` + +## Finetuning + +For your training dataset, download these [images of a cat toy](https://huggingface.co/datasets/diffusers/cat_toy_example) and store them in a directory. To use your own dataset, take a look at the [Create a dataset for training](create_dataset) guide. 
+ +```py +from huggingface_hub import snapshot_download + +local_dir = "./cat" +snapshot_download( + "diffusers/cat_toy_example", local_dir=local_dir, repo_type="dataset", ignore_patterns=".gitattributes" +) +``` + +Specify the `MODEL_NAME` environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`](https://huggingface.co/docs/diffusers/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained.pretrained_model_name_or_path) argument, and the `DATA_DIR` environment variable to the path of the directory containing the images. + +Now you can launch the [training script](https://github.com/huggingface/diffusers/blob/main/examples/textual_inversion/textual_inversion.py). The script creates and saves the following files to your repository: `learned_embeds.bin`, `token_identifier.txt`, and `type_of_concept.txt`. + + + +💡 A full training run takes ~1 hour on one V100 GPU. While you're waiting for the training to complete, feel free to check out [how Textual Inversion works](#how-it-works) in the section below if you're curious! + + + + + +```bash +export MODEL_NAME="runwayml/stable-diffusion-v1-5" +export DATA_DIR="./cat" + +accelerate launch textual_inversion.py \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --train_data_dir=$DATA_DIR \ + --learnable_property="object" \ + --placeholder_token="" --initializer_token="toy" \ + --resolution=512 \ + --train_batch_size=1 \ + --gradient_accumulation_steps=4 \ + --max_train_steps=3000 \ + --learning_rate=5.0e-04 --scale_lr \ + --lr_scheduler="constant" \ + --lr_warmup_steps=0 \ + --output_dir="textual_inversion_cat" \ + --push_to_hub +``` + + + +💡 If you want to increase the trainable capacity, you can associate your placeholder token, *e.g.* `` to +multiple embedding vectors. This can help the model to better capture the style of more (complex) images. +To enable training multiple embedding vectors, simply pass: + +```bash +--num_vectors=5 +``` + + + + +If you have access to TPUs, try out the [Flax training script](https://github.com/huggingface/diffusers/blob/main/examples/textual_inversion/textual_inversion_flax.py) to train even faster (this'll also work for GPUs). With the same configuration settings, the Flax training script should be at least 70% faster than the PyTorch training script! ⚡ + +Before you begin, make sure you install the Flax specific dependencies: + +```bash +pip install -U -r requirements_flax.txt +``` + +Specify the `MODEL_NAME` environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`](https://huggingface.co/docs/diffusers/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained.pretrained_model_name_or_path) argument. 
+ +Then you can launch the [training script](https://github.com/huggingface/diffusers/blob/main/examples/textual_inversion/textual_inversion_flax.py): + +```bash +export MODEL_NAME="duongna/stable-diffusion-v1-4-flax" +export DATA_DIR="./cat" + +python textual_inversion_flax.py \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --train_data_dir=$DATA_DIR \ + --learnable_property="object" \ + --placeholder_token="" --initializer_token="toy" \ + --resolution=512 \ + --train_batch_size=1 \ + --max_train_steps=3000 \ + --learning_rate=5.0e-04 --scale_lr \ + --output_dir="textual_inversion_cat" \ + --push_to_hub +``` + + + +### Intermediate logging + +If you're interested in following along with your model training progress, you can save the generated images from the training process. Add the following arguments to the training script to enable intermediate logging: + +- `validation_prompt`, the prompt used to generate samples (this is set to `None` by default and intermediate logging is disabled) +- `num_validation_images`, the number of sample images to generate +- `validation_steps`, the number of steps before generating `num_validation_images` from the `validation_prompt` + +```bash +--validation_prompt="A backpack" +--num_validation_images=4 +--validation_steps=100 +``` + +## Inference + +Once you have trained a model, you can use it for inference with the [`StableDiffusionPipeline`]. + +The textual inversion script will by default only save the textual inversion embedding vector(s) that have +been added to the text encoder embedding matrix and consequently been trained. + + + + + +💡 The community has created a large library of different textual inversion embedding vectors, called [sd-concepts-library](https://huggingface.co/sd-concepts-library). +Instead of training textual inversion embeddings from scratch you can also see whether a fitting textual inversion embedding has already been added to the library. + + + +To load the textual inversion embeddings you first need to load the base model that was used when training +your textual inversion embedding vectors. Here we assume that [`runwayml/stable-diffusion-v1-5`](runwayml/stable-diffusion-v1-5) +was used as a base model so we load it first: +```python +from diffusers import StableDiffusionPipeline +import torch + +model_id = "runwayml/stable-diffusion-v1-5" +pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, use_safetensors=True).to("cuda") +``` + +Next, we need to load the textual inversion embedding vector which can be done via the [`TextualInversionLoaderMixin.load_textual_inversion`] +function. Here we'll load the embeddings of the "" example from before. +```python +pipe.load_textual_inversion("sd-concepts-library/cat-toy") +``` + +Now we can run the pipeline making sure that the placeholder token `` is used in our prompt. + +```python +prompt = "A backpack" + +image = pipe(prompt, num_inference_steps=50).images[0] +image.save("cat-backpack.png") +``` + +The function [`TextualInversionLoaderMixin.load_textual_inversion`] can not only +load textual embedding vectors saved in Diffusers' format, but also embedding vectors +saved in [Automatic1111](https://github.com/AUTOMATIC1111/stable-diffusion-webui) format. 
+To do so, you can first download an embedding vector from [civitAI](https://civitai.com/models/3036?modelVersionId=8387) +and then load it locally: +```python +pipe.load_textual_inversion("./charturnerv2.pt") +``` + + +Currently there is no `load_textual_inversion` function for Flax so one has to make sure the textual inversion +embedding vector is saved as part of the model after training. + +The model can then be run just like any other Flax model: + +```python +import jax +import numpy as np +from flax.jax_utils import replicate +from flax.training.common_utils import shard +from diffusers import FlaxStableDiffusionPipeline + +model_path = "path-to-your-trained-model" +pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(model_path, dtype=jax.numpy.bfloat16) + +prompt = "A backpack" +prng_seed = jax.random.PRNGKey(0) +num_inference_steps = 50 + +num_samples = jax.device_count() +prompt = num_samples * [prompt] +prompt_ids = pipeline.prepare_inputs(prompt) + +# shard inputs and rng +params = replicate(params) +prng_seed = jax.random.split(prng_seed, jax.device_count()) +prompt_ids = shard(prompt_ids) + +images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images +images = pipeline.numpy_to_pil(np.asarray(images.reshape((num_samples,) + images.shape[-3:]))) +image.save("cat-backpack.png") +``` + + + +## How it works + +![Diagram from the paper showing overview](https://textual-inversion.github.io/static/images/training/training.JPG) +Architecture overview from the Textual Inversion blog post. + +Usually, text prompts are tokenized into an embedding before being passed to a model, which is often a transformer. Textual Inversion does something similar, but it learns a new token embedding, `v*`, from a special token `S*` in the diagram above. The model output is used to condition the diffusion model, which helps the diffusion model understand the prompt and new concepts from just a few example images. + +To do this, Textual Inversion uses a generator model and noisy versions of the training images. The generator tries to predict less noisy versions of the images, and the token embedding `v*` is optimized based on how well the generator does. If the token embedding successfully captures the new concept, it gives more useful information to the diffusion model and helps create clearer images with less noise. This optimization process typically occurs after several thousand steps of exposure to a variety of prompt and image variants. diff --git a/docs/source/jp/tutorials/autopipeline.md b/docs/source/jp/tutorials/autopipeline.md new file mode 100644 index 000000000000..973a83c73eb1 --- /dev/null +++ b/docs/source/jp/tutorials/autopipeline.md @@ -0,0 +1,146 @@ +# AutoPipeline + +🀗 Diffusers is able to complete many different tasks, and you can often reuse the same pretrained weights for multiple tasks such as text-to-image, image-to-image, and inpainting. If you're new to the library and diffusion models though, it may be difficult to know which pipeline to use for a task. For example, if you're using the [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5) checkpoint for text-to-image, you might not know that you could also use it for image-to-image and inpainting by loading the checkpoint with the [`StableDiffusionImg2ImgPipeline`] and [`StableDiffusionInpaintPipeline`] classes respectively. + +The `AutoPipeline` class is designed to simplify the variety of pipelines in 🀗 Diffusers. 
It is a generic, *task-first* pipeline that lets you focus on the task. The `AutoPipeline` automatically detects the correct pipeline class to use, which makes it easier to load a checkpoint for a task without knowing the specific pipeline class name. + + + +Take a look at the [AutoPipeline](./pipelines/auto_pipeline) reference to see which tasks are supported. Currently, it supports text-to-image, image-to-image, and inpainting. + + + +This tutorial shows you how to use an `AutoPipeline` to automatically infer the pipeline class to load for a specific task, given the pretrained weights. + +## Choose an AutoPipeline for your task + +Start by picking a checkpoint. For example, if you're interested in text-to-image with the [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5) checkpoint, use [`AutoPipelineForText2Image`]: + +```py +from diffusers import AutoPipelineForText2Image +import torch + +pipeline = AutoPipelineForText2Image.from_pretrained( + "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True +).to("cuda") +prompt = "peasant and dragon combat, wood cutting style, viking era, bevel with rune" + +image = pipeline(prompt, num_inference_steps=25).images[0] +``` + +
+*Generated image of a peasant fighting a dragon in wood-cutting style*
+
+Under the hood, [`AutoPipelineForText2Image`]:
+
+1. automatically detects a `"stable-diffusion"` class from the [`model_index.json`](https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/model_index.json) file
+2. loads the corresponding text-to-image [`StableDiffusionPipeline`] based on the `"stable-diffusion"` class name
+
+Likewise, for image-to-image, [`AutoPipelineForImage2Image`] detects a `"stable-diffusion"` checkpoint from the `model_index.json` file and it'll load the corresponding [`StableDiffusionImg2ImgPipeline`] behind the scenes. You can also pass any additional arguments specific to the pipeline class such as `strength`, which determines the amount of noise or variation added to an input image:
+
+```py
+from io import BytesIO
+
+import requests
+import torch
+from diffusers import AutoPipelineForImage2Image
+from PIL import Image
+
+pipeline = AutoPipelineForImage2Image.from_pretrained(
+    "runwayml/stable-diffusion-v1-5",
+    torch_dtype=torch.float16,
+    use_safetensors=True,
+).to("cuda")
+prompt = "a portrait of a dog wearing a pearl earring"
+
+# download and prepare the initial image
+url = "https://upload.wikimedia.org/wikipedia/commons/thumb/0/0f/1665_Girl_with_a_Pearl_Earring.jpg/800px-1665_Girl_with_a_Pearl_Earring.jpg"
+
+response = requests.get(url)
+image = Image.open(BytesIO(response.content)).convert("RGB")
+image.thumbnail((768, 768))
+
+image = pipeline(prompt, image, num_inference_steps=200, strength=0.75, guidance_scale=10.5).images[0]
+```
+
+*Generated image of a Vermeer-style portrait of a dog wearing a pearl earring*
+ +And if you want to do inpainting, then [`AutoPipelineForInpainting`] loads the underlying [`StableDiffusionInpaintPipeline`] class in the same way: + +```py +from diffusers import AutoPipelineForInpainting +from diffusers.utils import load_image + +pipeline = AutoPipelineForInpainting.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, use_safetensors=True +).to("cuda") + +img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" +mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" + +init_image = load_image(img_url).convert("RGB") +mask_image = load_image(mask_url).convert("RGB") + +prompt = "A majestic tiger sitting on a bench" +image = pipeline(prompt, image=init_image, mask_image=mask_image, num_inference_steps=50, strength=0.80).images[0] +``` + +
+
+*Generated image of a tiger sitting on a bench*
+ +If you try to load an unsupported checkpoint, it'll throw an error: + +```py +from diffusers import AutoPipelineForImage2Image +import torch + +pipeline = AutoPipelineForImage2Image.from_pretrained( + "openai/shap-e-img2img", torch_dtype=torch.float16, use_safetensors=True +) +"ValueError: AutoPipeline can't find a pipeline linked to ShapEImg2ImgPipeline for None" +``` + +## Use multiple pipelines + +For some workflows or if you're loading many pipelines, it is more memory-efficient to reuse the same components from a checkpoint instead of reloading them which would unnecessarily consume additional memory. For example, if you're using a checkpoint for text-to-image and you want to use it again for image-to-image, use the [`~AutoPipelineForImage2Image.from_pipe`] method. This method creates a new pipeline from the components of a previously loaded pipeline at no additional memory cost. + +The [`~AutoPipelineForImage2Image.from_pipe`] method detects the original pipeline class and maps it to the new pipeline class corresponding to the task you want to do. For example, if you load a `"stable-diffusion"` class pipeline for text-to-image: + +```py +from diffusers import AutoPipelineForText2Image, AutoPipelineForImage2Image + +pipeline_text2img = AutoPipelineForText2Image.from_pretrained( + "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True +) +print(type(pipeline_text2img)) +"" +``` + +Then [`~AutoPipelineForImage2Image.from_pipe`] maps the original `"stable-diffusion"` pipeline class to [`StableDiffusionImg2ImgPipeline`]: + +```py +pipeline_img2img = AutoPipelineForImage2Image.from_pipe(pipeline_text2img) +print(type(pipeline_img2img)) +"" +``` + +If you passed an optional argument - like disabling the safety checker - to the original pipeline, this argument is also passed on to the new pipeline: + +```py +from diffusers import AutoPipelineForText2Image, AutoPipelineForImage2Image + +pipeline_text2img = AutoPipelineForText2Image.from_pretrained( + "runwayml/stable-diffusion-v1-5", + torch_dtype=torch.float16, + use_safetensors=True, + requires_safety_checker=False, +).to("cuda") + +pipeline_img2img = AutoPipelineForImage2Image.from_pipe(pipeline_text2img) +print(pipe.config.requires_safety_checker) +"False" +``` + +You can overwrite any of the arguments and even configuration from the original pipeline if you want to change the behavior of the new pipeline. For example, to turn the safety checker back on and add the `strength` argument: + +```py +pipeline_img2img = AutoPipelineForImage2Image.from_pipe(pipeline_text2img, requires_safety_checker=True, strength=0.3) +``` diff --git a/docs/source/jp/tutorials/basic_training.md b/docs/source/jp/tutorials/basic_training.md new file mode 100644 index 000000000000..3a9366baf84a --- /dev/null +++ b/docs/source/jp/tutorials/basic_training.md @@ -0,0 +1,404 @@ + + +[[open-in-colab]] + +# Train a diffusion model + +Unconditional image generation is a popular application of diffusion models that generates images that look like those in the dataset used for training. Typically, the best results are obtained from finetuning a pretrained model on a specific dataset. You can find many of these checkpoints on the [Hub](https://huggingface.co/search/full-text?q=unconditional-image-generation&type=model), but if you can't find one you like, you can always train your own! 
+ +This tutorial will teach you how to train a [`UNet2DModel`] from scratch on a subset of the [Smithsonian Butterflies](https://huggingface.co/datasets/huggan/smithsonian_butterflies_subset) dataset to generate your own 🊋 butterflies 🊋. + + + +💡 This training tutorial is based on the [Training with 🧚 Diffusers](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/training_example.ipynb) notebook. For additional details and context about diffusion models like how they work, check out the notebook! + + + +Before you begin, make sure you have 🀗 Datasets installed to load and preprocess image datasets, and 🀗 Accelerate, to simplify training on any number of GPUs. The following command will also install [TensorBoard](https://www.tensorflow.org/tensorboard) to visualize training metrics (you can also use [Weights & Biases](https://docs.wandb.ai/) to track your training). + +```py +# uncomment to install the necessary libraries in Colab +#!pip install diffusers[training] +``` + +We encourage you to share your model with the community, and in order to do that, you'll need to login to your Hugging Face account (create one [here](https://hf.co/join) if you don't already have one!). You can login from a notebook and enter your token when prompted: + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +Or login in from the terminal: + +```bash +huggingface-cli login +``` + +Since the model checkpoints are quite large, install [Git-LFS](https://git-lfs.com/) to version these large files: + +```bash +!sudo apt -qq install git-lfs +!git config --global credential.helper store +``` + +## Training configuration + +For convenience, create a `TrainingConfig` class containing the training hyperparameters (feel free to adjust them): + +```py +>>> from dataclasses import dataclass + + +>>> @dataclass +... class TrainingConfig: +... image_size = 128 # the generated image resolution +... train_batch_size = 16 +... eval_batch_size = 16 # how many images to sample during evaluation +... num_epochs = 50 +... gradient_accumulation_steps = 1 +... learning_rate = 1e-4 +... lr_warmup_steps = 500 +... save_image_epochs = 10 +... save_model_epochs = 30 +... mixed_precision = "fp16" # `no` for float32, `fp16` for automatic mixed precision +... output_dir = "ddpm-butterflies-128" # the model name locally and on the HF Hub + +... push_to_hub = True # whether to upload the saved model to the HF Hub +... hub_private_repo = False +... overwrite_output_dir = True # overwrite the old model when re-running the notebook +... seed = 0 + + +>>> config = TrainingConfig() +``` + +## Load the dataset + +You can easily load the [Smithsonian Butterflies](https://huggingface.co/datasets/huggan/smithsonian_butterflies_subset) dataset with the 🀗 Datasets library: + +```py +>>> from datasets import load_dataset + +>>> config.dataset_name = "huggan/smithsonian_butterflies_subset" +>>> dataset = load_dataset(config.dataset_name, split="train") +``` + + + +💡 You can find additional datasets from the [HugGan Community Event](https://huggingface.co/huggan) or you can use your own dataset by creating a local [`ImageFolder`](https://huggingface.co/docs/datasets/image_dataset#imagefolder). Set `config.dataset_name` to the repository id of the dataset if it is from the HugGan Community Event, or `imagefolder` if you're using your own images. 
+ + + +🀗 Datasets uses the [`~datasets.Image`] feature to automatically decode the image data and load it as a [`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html) which we can visualize: + +```py +>>> import matplotlib.pyplot as plt + +>>> fig, axs = plt.subplots(1, 4, figsize=(16, 4)) +>>> for i, image in enumerate(dataset[:4]["image"]): +... axs[i].imshow(image) +... axs[i].set_axis_off() +>>> fig.show() +``` + +
+ +The images are all different sizes though, so you'll need to preprocess them first: + +* `Resize` changes the image size to the one defined in `config.image_size`. +* `RandomHorizontalFlip` augments the dataset by randomly mirroring the images. +* `Normalize` is important to rescale the pixel values into a [-1, 1] range, which is what the model expects. + +```py +>>> from torchvision import transforms + +>>> preprocess = transforms.Compose( +... [ +... transforms.Resize((config.image_size, config.image_size)), +... transforms.RandomHorizontalFlip(), +... transforms.ToTensor(), +... transforms.Normalize([0.5], [0.5]), +... ] +... ) +``` + +Use 🀗 Datasets' [`~datasets.Dataset.set_transform`] method to apply the `preprocess` function on the fly during training: + +```py +>>> def transform(examples): +... images = [preprocess(image.convert("RGB")) for image in examples["image"]] +... return {"images": images} + + +>>> dataset.set_transform(transform) +``` + +Feel free to visualize the images again to confirm that they've been resized. Now you're ready to wrap the dataset in a [DataLoader](https://pytorch.org/docs/stable/data#torch.utils.data.DataLoader) for training! + +```py +>>> import torch + +>>> train_dataloader = torch.utils.data.DataLoader(dataset, batch_size=config.train_batch_size, shuffle=True) +``` + +## Create a UNet2DModel + +Pretrained models in 🧚 Diffusers are easily created from their model class with the parameters you want. For example, to create a [`UNet2DModel`]: + +```py +>>> from diffusers import UNet2DModel + +>>> model = UNet2DModel( +... sample_size=config.image_size, # the target image resolution +... in_channels=3, # the number of input channels, 3 for RGB images +... out_channels=3, # the number of output channels +... layers_per_block=2, # how many ResNet layers to use per UNet block +... block_out_channels=(128, 128, 256, 256, 512, 512), # the number of output channels for each UNet block +... down_block_types=( +... "DownBlock2D", # a regular ResNet downsampling block +... "DownBlock2D", +... "DownBlock2D", +... "DownBlock2D", +... "AttnDownBlock2D", # a ResNet downsampling block with spatial self-attention +... "DownBlock2D", +... ), +... up_block_types=( +... "UpBlock2D", # a regular ResNet upsampling block +... "AttnUpBlock2D", # a ResNet upsampling block with spatial self-attention +... "UpBlock2D", +... "UpBlock2D", +... "UpBlock2D", +... "UpBlock2D", +... ), +... ) +``` + +It is often a good idea to quickly check the sample image shape matches the model output shape: + +```py +>>> sample_image = dataset[0]["images"].unsqueeze(0) +>>> print("Input shape:", sample_image.shape) +Input shape: torch.Size([1, 3, 128, 128]) + +>>> print("Output shape:", model(sample_image, timestep=0).sample.shape) +Output shape: torch.Size([1, 3, 128, 128]) +``` + +Great! Next, you'll need a scheduler to add some noise to the image. + +## Create a scheduler + +The scheduler behaves differently depending on whether you're using the model for training or inference. During inference, the scheduler generates image from the noise. During training, the scheduler takes a model output - or a sample - from a specific point in the diffusion process and applies noise to the image according to a *noise schedule* and an *update rule*. 
+ +Let's take a look at the [`DDPMScheduler`] and use the `add_noise` method to add some random noise to the `sample_image` from before: + +```py +>>> import torch +>>> from PIL import Image +>>> from diffusers import DDPMScheduler + +>>> noise_scheduler = DDPMScheduler(num_train_timesteps=1000) +>>> noise = torch.randn(sample_image.shape) +>>> timesteps = torch.LongTensor([50]) +>>> noisy_image = noise_scheduler.add_noise(sample_image, noise, timesteps) + +>>> Image.fromarray(((noisy_image.permute(0, 2, 3, 1) + 1.0) * 127.5).type(torch.uint8).numpy()[0]) +``` + +
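+
+For reference, `add_noise` follows the standard DDPM forward process: the clean image and the noise are each scaled by factors taken from the scheduler's cumulative noise schedule (`alphas_cumprod`). The short check below is a sketch of that same computation and should reproduce `noisy_image`:
+
+```py
+>>> # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
+>>> alpha_bar_t = noise_scheduler.alphas_cumprod[timesteps]
+>>> manual_noisy = alpha_bar_t.sqrt() * sample_image + (1 - alpha_bar_t).sqrt() * noise
+>>> torch.allclose(manual_noisy, noisy_image)
+True
+```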
+ +The training objective of the model is to predict the noise added to the image. The loss at this step can be calculated by: + +```py +>>> import torch.nn.functional as F + +>>> noise_pred = model(noisy_image, timesteps).sample +>>> loss = F.mse_loss(noise_pred, noise) +``` + +## Train the model + +By now, you have most of the pieces to start training the model and all that's left is putting everything together. + +First, you'll need an optimizer and a learning rate scheduler: + +```py +>>> from diffusers.optimization import get_cosine_schedule_with_warmup + +>>> optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate) +>>> lr_scheduler = get_cosine_schedule_with_warmup( +... optimizer=optimizer, +... num_warmup_steps=config.lr_warmup_steps, +... num_training_steps=(len(train_dataloader) * config.num_epochs), +... ) +``` + +Then, you'll need a way to evaluate the model. For evaluation, you can use the [`DDPMPipeline`] to generate a batch of sample images and save it as a grid: + +```py +>>> from diffusers import DDPMPipeline +>>> from diffusers.utils import make_image_grid +>>> import math +>>> import os + + +>>> def evaluate(config, epoch, pipeline): +... # Sample some images from random noise (this is the backward diffusion process). +... # The default pipeline output type is `List[PIL.Image]` +... images = pipeline( +... batch_size=config.eval_batch_size, +... generator=torch.manual_seed(config.seed), +... ).images + +... # Make a grid out of the images +... image_grid = make_image_grid(images, rows=4, cols=4) + +... # Save the images +... test_dir = os.path.join(config.output_dir, "samples") +... os.makedirs(test_dir, exist_ok=True) +... image_grid.save(f"{test_dir}/{epoch:04d}.png") +``` + +Now you can wrap all these components together in a training loop with 🀗 Accelerate for easy TensorBoard logging, gradient accumulation, and mixed precision training. To upload the model to the Hub, write a function to get your repository name and information and then push it to the Hub. + + + +💡 The training loop below may look intimidating and long, but it'll be worth it later when you launch your training in just one line of code! If you can't wait and want to start generating images, feel free to copy and run the code below. You can always come back and examine the training loop more closely later, like when you're waiting for your model to finish training. 🀗 + + + +```py +>>> from accelerate import Accelerator +>>> from huggingface_hub import create_repo, upload_folder +>>> from tqdm.auto import tqdm +>>> from pathlib import Path +>>> import os + +>>> def train_loop(config, model, noise_scheduler, optimizer, train_dataloader, lr_scheduler): +... # Initialize accelerator and tensorboard logging +... accelerator = Accelerator( +... mixed_precision=config.mixed_precision, +... gradient_accumulation_steps=config.gradient_accumulation_steps, +... log_with="tensorboard", +... project_dir=os.path.join(config.output_dir, "logs"), +... ) +... if accelerator.is_main_process: +... if config.output_dir is not None: +... os.makedirs(config.output_dir, exist_ok=True) +... if config.push_to_hub: +... repo_id = create_repo( +... repo_id=config.hub_model_id or Path(config.output_dir).name, exist_ok=True +... ).repo_id +... accelerator.init_trackers("train_example") + +... # Prepare everything +... # There is no specific order to remember, you just need to unpack the +... # objects in the same order you gave them to the prepare method. +... 
model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare( +... model, optimizer, train_dataloader, lr_scheduler +... ) + +... global_step = 0 + +... # Now you train the model +... for epoch in range(config.num_epochs): +... progress_bar = tqdm(total=len(train_dataloader), disable=not accelerator.is_local_main_process) +... progress_bar.set_description(f"Epoch {epoch}") + +... for step, batch in enumerate(train_dataloader): +... clean_images = batch["images"] +... # Sample noise to add to the images +... noise = torch.randn(clean_images.shape).to(clean_images.device) +... bs = clean_images.shape[0] + +... # Sample a random timestep for each image +... timesteps = torch.randint( +... 0, noise_scheduler.config.num_train_timesteps, (bs,), device=clean_images.device +... ).long() + +... # Add noise to the clean images according to the noise magnitude at each timestep +... # (this is the forward diffusion process) +... noisy_images = noise_scheduler.add_noise(clean_images, noise, timesteps) + +... with accelerator.accumulate(model): +... # Predict the noise residual +... noise_pred = model(noisy_images, timesteps, return_dict=False)[0] +... loss = F.mse_loss(noise_pred, noise) +... accelerator.backward(loss) + +... accelerator.clip_grad_norm_(model.parameters(), 1.0) +... optimizer.step() +... lr_scheduler.step() +... optimizer.zero_grad() + +... progress_bar.update(1) +... logs = {"loss": loss.detach().item(), "lr": lr_scheduler.get_last_lr()[0], "step": global_step} +... progress_bar.set_postfix(**logs) +... accelerator.log(logs, step=global_step) +... global_step += 1 + +... # After each epoch you optionally sample some demo images with evaluate() and save the model +... if accelerator.is_main_process: +... pipeline = DDPMPipeline(unet=accelerator.unwrap_model(model), scheduler=noise_scheduler) + +... if (epoch + 1) % config.save_image_epochs == 0 or epoch == config.num_epochs - 1: +... evaluate(config, epoch, pipeline) + +... if (epoch + 1) % config.save_model_epochs == 0 or epoch == config.num_epochs - 1: +... if config.push_to_hub: +... upload_folder( +... repo_id=repo_id, +... folder_path=config.output_dir, +... commit_message=f"Epoch {epoch}", +... ignore_patterns=["step_*", "epoch_*"], +... ) +... else: +... pipeline.save_pretrained(config.output_dir) +``` + +Phew, that was quite a bit of code! But you're finally ready to launch the training with 🀗 Accelerate's [`~accelerate.notebook_launcher`] function. Pass the function the training loop, all the training arguments, and the number of processes (you can change this value to the number of GPUs available to you) to use for training: + +```py +>>> from accelerate import notebook_launcher + +>>> args = (config, model, noise_scheduler, optimizer, train_dataloader, lr_scheduler) + +>>> notebook_launcher(train_loop, args, num_processes=1) +``` + +Once training is complete, take a look at the final 🊋 images 🊋 generated by your diffusion model! + +```py +>>> import glob + +>>> sample_images = sorted(glob.glob(f"{config.output_dir}/samples/*.png")) +>>> Image.open(sample_images[-1]) +``` + +
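+
+Since the training loop saves the full pipeline, you can also reload it later and sample new images. A short sketch, assuming the pipeline was saved to `config.output_dir` (use the Hub repository id instead if you only pushed it to the Hub):
+
+```py
+>>> from diffusers import DDPMPipeline
+
+>>> pipeline = DDPMPipeline.from_pretrained(config.output_dir).to("cuda")
+>>> images = pipeline(batch_size=4, generator=torch.manual_seed(config.seed)).images
+>>> images[0]
+```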
+ +## Next steps + +Unconditional image generation is one example of a task that can be trained. You can explore other tasks and training techniques by visiting the [🧚 Diffusers Training Examples](../training/overview) page. Here are some examples of what you can learn: + +* [Textual Inversion](../training/text_inversion), an algorithm that teaches a model a specific visual concept and integrates it into the generated image. +* [DreamBooth](../training/dreambooth), a technique for generating personalized images of a subject given several input images of the subject. +* [Guide](../training/text2image) to finetuning a Stable Diffusion model on your own dataset. +* [Guide](../training/lora) to using LoRA, a memory-efficient technique for finetuning really large models faster. diff --git a/docs/source/jp/tutorials/tutorial_overview.md b/docs/source/jp/tutorials/tutorial_overview.md new file mode 100644 index 000000000000..0cec9a317ddb --- /dev/null +++ b/docs/source/jp/tutorials/tutorial_overview.md @@ -0,0 +1,23 @@ + + +# Overview + +Welcome to 🧚 Diffusers! If you're new to diffusion models and generative AI, and want to learn more, then you've come to the right place. These beginner-friendly tutorials are designed to provide a gentle introduction to diffusion models and help you understand the library fundamentals - the core components and how 🧚 Diffusers is meant to be used. + +You'll learn how to use a pipeline for inference to rapidly generate things, and then deconstruct that pipeline to really understand how to use the library as a modular toolbox for building your own diffusion systems. In the next lesson, you'll learn how to train your own diffusion model to generate what you want. + +After completing the tutorials, you'll have gained the necessary skills to start exploring the library on your own and see how to use it for your own projects and applications. + +Feel free to join our community on [Discord](https://discord.com/invite/JfAtkvEtRb) or the [forums](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63) to connect and collaborate with other users and developers! + +Let's start diffusing! 🧚 \ No newline at end of file diff --git a/docs/source/jp/tutorials/using_peft_for_inference.md b/docs/source/jp/tutorials/using_peft_for_inference.md new file mode 100644 index 000000000000..4629cf8ba43c --- /dev/null +++ b/docs/source/jp/tutorials/using_peft_for_inference.md @@ -0,0 +1,165 @@ + + +[[open-in-colab]] + +# Inference with PEFT + +There are many adapters trained in different styles to achieve different effects. You can even combine multiple adapters to create new and unique images. With the 🀗 [PEFT](https://huggingface.co/docs/peft/index) integration in 🀗 Diffusers, it is really easy to load and manage adapters for inference. In this guide, you'll learn how to use different adapters with [Stable Diffusion XL (SDXL)](./pipelines/stable_diffusion/stable_diffusion_xl) for inference. + +Throughout this guide, you'll use LoRA as the main adapter technique, so we'll use the terms LoRA and adapter interchangeably. You should have some familiarity with LoRA, and if you don't, we welcome you to check out the [LoRA guide](https://huggingface.co/docs/peft/conceptual_guides/lora). + +Let's first install all the required libraries. + +```bash +!pip install -q transformers accelerate +# Will be updated once the stable releases are done. 
+!pip install -q git+https://github.com/huggingface/peft.git +!pip install -q git+https://github.com/huggingface/diffusers.git +``` + +Now, let's load a pipeline with a SDXL checkpoint: + +```python +from diffusers import DiffusionPipeline +import torch + +pipe_id = "stabilityai/stable-diffusion-xl-base-1.0" +pipe = DiffusionPipeline.from_pretrained(pipe_id, torch_dtype=torch.float16).to("cuda") +``` + + +Next, load a LoRA checkpoint with the [`~diffusers.loaders.StableDiffusionXLLoraLoaderMixin.load_lora_weights`] method. + +With the 🀗 PEFT integration, you can assign a specific `adapter_name` to the checkpoint, which let's you easily switch between different LoRA checkpoints. Let's call this adapter `"toy"`. + +```python +pipe.load_lora_weights("CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors", adapter_name="toy") +``` + +And then perform inference: + +```python +prompt = "toy_face of a hacker with a hoodie" + +lora_scale= 0.9 +image = pipe( + prompt, num_inference_steps=30, cross_attention_kwargs={"scale": lora_scale}, generator=torch.manual_seed(0) +).images[0] +image +``` + +![toy-face](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peft_integration/diffusers_peft_lora_inference_8_1.png) + + +With the `adapter_name` parameter, it is really easy to use another adapter for inference! Load the [nerijs/pixel-art-xl](https://huggingface.co/nerijs/pixel-art-xl) adapter that has been fine-tuned to generate pixel art images, and let's call it `"pixel"`. + +The pipeline automatically sets the first loaded adapter (`"toy"`) as the active adapter. But you can activate the `"pixel"` adapter with the [`~diffusers.loaders.set_adapters`] method as shown below: + +```python +pipe.load_lora_weights("nerijs/pixel-art-xl", weight_name="pixel-art-xl.safetensors", adapter_name="pixel") +pipe.set_adapters("pixel") +``` + +Let's now generate an image with the second adapter and check the result: + +```python +prompt = "a hacker with a hoodie, pixel art" +image = pipe( + prompt, num_inference_steps=30, cross_attention_kwargs={"scale": lora_scale}, generator=torch.manual_seed(0) +).images[0] +image +``` + +![pixel-art](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peft_integration/diffusers_peft_lora_inference_12_1.png) + +## Combine multiple adapters + +You can also perform multi-adapter inference where you combine different adapter checkpoints for inference. + +Once again, use the [`~diffusers.loaders.set_adapters`] method to activate two LoRA checkpoints and specify the weight for how the checkpoints should be combined. + +```python +pipe.set_adapters(["pixel", "toy"], adapter_weights=[0.5, 1.0]) +``` + +Now that we have set these two adapters, let's generate an image from the combined adapters! + + + +LoRA checkpoints in the diffusion community are almost always obtained with [DreamBooth](https://huggingface.co/docs/diffusers/main/en/training/dreambooth). DreamBooth training often relies on "trigger" words in the input text prompts in order for the generation results to look as expected. When you combine multiple LoRA checkpoints, it's important to ensure the trigger words for the corresponding LoRA checkpoints are present in the input text prompts. + + + +The trigger words for [CiroN2022/toy-face](https://hf.co/CiroN2022/toy-face) and [nerijs/pixel-art-xl](https://hf.co/nerijs/pixel-art-xl) are found in their repositories. + + +```python +# Notice how the prompt is constructed. 
+prompt = "toy_face of a hacker with a hoodie, pixel art" +image = pipe( + prompt, num_inference_steps=30, cross_attention_kwargs={"scale": 1.0}, generator=torch.manual_seed(0) +).images[0] +image +``` + +![toy-face-pixel-art](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peft_integration/diffusers_peft_lora_inference_16_1.png) + +Impressive! As you can see, the model was able to generate an image that mixes the characteristics of both adapters. + +If you want to go back to using only one adapter, use the [`~diffusers.loaders.set_adapters`] method to activate the `"toy"` adapter: + +```python +# First, set the adapter. +pipe.set_adapters("toy") + +# Then, run inference. +prompt = "toy_face of a hacker with a hoodie" +lora_scale= 0.9 +image = pipe( + prompt, num_inference_steps=30, cross_attention_kwargs={"scale": lora_scale}, generator=torch.manual_seed(0) +).images[0] +image +``` + +![toy-face-again](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peft_integration/diffusers_peft_lora_inference_18_1.png) + + +If you want to switch to only the base model, disable all LoRAs with the [`~diffusers.loaders.disable_lora`] method. + + +```python +pipe.disable_lora() + +prompt = "toy_face of a hacker with a hoodie" +lora_scale= 0.9 +image = pipe(prompt, num_inference_steps=30, generator=torch.manual_seed(0)).images[0] +image +``` + +![no-lora](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peft_integration/diffusers_peft_lora_inference_20_1.png) + +## Monitoring active adapters + +You have attached multiple adapters in this tutorial, and if you're feeling a bit lost on what adapters have been attached to the pipeline's components, you can easily check the list of active adapters using the [`~diffusers.loaders.get_active_adapters`] method: + +```python +active_adapters = pipe.get_active_adapters() +>>> ["toy", "pixel"] +``` + +You can also get the active adapters of each pipeline component with [`~diffusers.loaders.get_list_adapters`]: + +```python +list_adapters_component_wise = pipe.get_list_adapters() +>>> {"text_encoder": ["toy", "pixel"], "unet": ["toy", "pixel"], "text_encoder_2": ["toy", "pixel"]} +``` diff --git a/docs/source/jp/using-diffusers/pipeline_overview.md b/docs/source/jp/using-diffusers/pipeline_overview.md new file mode 100644 index 000000000000..4ee25b51dc6f --- /dev/null +++ b/docs/source/jp/using-diffusers/pipeline_overview.md @@ -0,0 +1,17 @@ + + +# Overview + +A pipeline is an end-to-end class that provides a quick and easy way to use a diffusion system for inference by bundling independently trained models and schedulers together. Certain combinations of models and schedulers define specific pipeline types, like [`StableDiffusionXLPipeline`] or [`StableDiffusionControlNetPipeline`], with specific capabilities. All pipeline types inherit from the base [`DiffusionPipeline`] class; pass it any checkpoint, and it'll automatically detect the pipeline type and load the necessary components. + +This section introduces you to some of the more complex pipelines like Stable Diffusion XL, ControlNet, and DiffEdit, which require additional inputs. You'll also learn how to use a distilled version of the Stable Diffusion model to speed up inference, how to control randomness on your hardware when generating images, and how to create a community pipeline for a custom task like generating images from speech. 
\ No newline at end of file diff --git a/docs/source/jp/using-diffusers/sdxl.md b/docs/source/jp/using-diffusers/sdxl.md new file mode 100644 index 000000000000..36286ecad863 --- /dev/null +++ b/docs/source/jp/using-diffusers/sdxl.md @@ -0,0 +1,431 @@ +# Stable Diffusion XL + +[[open-in-colab]] + +[Stable Diffusion XL](https://huggingface.co/papers/2307.01952) (SDXL) is a powerful text-to-image generation model that iterates on the previous Stable Diffusion models in three key ways: + +1. the UNet is 3x larger and SDXL combines a second text encoder (OpenCLIP ViT-bigG/14) with the original text encoder to significantly increase the number of parameters +2. introduces size and crop-conditioning to preserve training data from being discarded and gain more control over how a generated image should be cropped +3. introduces a two-stage model process; the *base* model (can also be run as a standalone model) generates an image as an input to the *refiner* model which adds additional high-quality details + +This guide will show you how to use SDXL for text-to-image, image-to-image, and inpainting. + +Before you begin, make sure you have the following libraries installed: + +```py +# uncomment to install the necessary libraries in Colab +#!pip install diffusers transformers accelerate safetensors omegaconf invisible-watermark>=0.2.0 +``` + + + +We recommend installing the [invisible-watermark](https://pypi.org/project/invisible-watermark/) library to help identify images that are generated. If the invisible-watermark library is installed, it is used by default. To disable the watermarker: + +```py +pipeline = StableDiffusionXLPipeline.from_pretrained(..., add_watermarker=False) +``` + + + +## Load model checkpoints + +Model weights may be stored in separate subfolders on the Hub or locally, in which case, you should use the [`~StableDiffusionXLPipeline.from_pretrained`] method: + +```py +from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline +import torch + +pipeline = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, use_safetensors=True, variant="fp16" +).to("cuda") +``` + +You can also use the [`~StableDiffusionXLPipeline.from_single_file`] method to load a model checkpoint stored in a single file format (`.ckpt` or `.safetensors`) from the Hub or locally: + +```py +from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline +import torch + +pipeline = StableDiffusionXLPipeline.from_single_file( + "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_base_1.0.safetensors", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +refiner = StableDiffusionXLImg2ImgPipeline.from_single_file( + "https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/blob/main/sd_xl_refiner_1.0.safetensors", torch_dtype=torch.float16, use_safetensors=True, variant="fp16" +).to("cuda") +``` + +## Text-to-image + +For text-to-image, pass a text prompt. By default, SDXL generates a 1024x1024 image for the best results. You can try setting the `height` and `width` parameters to 768x768 or 512x512, but anything below 512x512 is not likely to work. 
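+If you do want to override the default resolution, a minimal sketch might look like the following (768x768 is only an illustrative choice; the standard 1024x1024 example follows):
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline_text2image = AutoPipelineForText2Image.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+).to("cuda")
+
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+# Sketch: explicitly request a 768x768 image instead of the 1024x1024 default
+image = pipeline_text2image(prompt=prompt, height=768, width=768).images[0]
+```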
+ +```py +from diffusers import AutoPipelineForText2Image +import torch + +pipeline_text2image = AutoPipelineForText2Image.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" +image = pipeline(prompt=prompt).images[0] +``` + +
+ generated image of an astronaut in a jungle +
+ +## Image-to-image + +For image-to-image, SDXL works especially well with image sizes between 768x768 and 1024x1024. Pass an initial image, and a text prompt to condition the image with: + +```py +from diffusers import AutoPipelineForImg2Img +from diffusers.utils import load_image + +# use from_pipe to avoid consuming additional memory when loading a checkpoint +pipeline = AutoPipelineForImage2Image.from_pipe(pipeline_text2image).to("cuda") +url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-img2img.png" + +init_image = load_image(url).convert("RGB") +prompt = "a dog catching a frisbee in the jungle" +image = pipeline(prompt, image=init_image, strength=0.8, guidance_scale=10.5).images[0] +``` + +
+ generated image of a dog catching a frisbee in a jungle +
+ +## Inpainting + +For inpainting, you'll need the original image and a mask of what you want to replace in the original image. Create a prompt to describe what you want to replace the masked area with. + +```py +from diffusers import AutoPipelineForInpainting +from diffusers.utils import load_image + +# use from_pipe to avoid consuming additional memory when loading a checkpoint +pipeline = AutoPipelineForInpainting.from_pipe(pipeline_text2image).to("cuda") + +img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png" +mask_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-inpaint-mask.png" + +init_image = load_image(img_url).convert("RGB") +mask_image = load_image(mask_url).convert("RGB") + +prompt = "A deep sea diver floating" +image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.85, guidance_scale=12.5).images[0] +``` + +
+ generated image of a deep sea diver in a jungle +
+ +## Refine image quality + +SDXL includes a [refiner model](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0) specialized in denoising low-noise stage images to generate higher-quality images from the base model. There are two ways to use the refiner: + +1. use the base and refiner model together to produce a refined image +2. use the base model to produce an image, and subsequently use the refiner model to add more details to the image (this is how SDXL is originally trained) + +### Base + refiner model + +When you use the base and refiner model together to generate an image, this is known as an ([*ensemble of expert denoisers*](https://research.nvidia.com/labs/dir/eDiff-I/)). The ensemble of expert denoisers approach requires less overall denoising steps versus passing the base model's output to the refiner model, so it should be significantly faster to run. However, you won't be able to inspect the base model's output because it still contains a large amount of noise. + +As an ensemble of expert denoisers, the base model serves as the expert during the high-noise diffusion stage and the refiner model serves as the expert during the low-noise diffusion stage. Load the base and refiner model: + +```py +from diffusers import DiffusionPipeline +import torch + +base = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +refiner = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-refiner-1.0", + text_encoder_2=base.text_encoder_2, + vae=base.vae, + torch_dtype=torch.float16, + use_safetensors=True, + variant="fp16", +).to("cuda") +``` + +To use this approach, you need to define the number of timesteps for each model to run through their respective stages. For the base model, this is controlled by the [`denoising_end`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.denoising_end) parameter and for the refiner model, it is controlled by the [`denoising_start`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLImg2ImgPipeline.__call__.denoising_start) parameter. + + + +The `denoising_end` and `denoising_start` parameters should be a float between 0 and 1. These parameters are represented as a proportion of discrete timesteps as defined by the scheduler. If you're also using the `strength` parameter, it'll be ignored because the number of denoising steps is determined by the discrete timesteps the model is trained on and the declared fractional cutoff. + + + +Let's set `denoising_end=0.8` so the base model performs the first 80% of denoising the **high-noise** timesteps and set `denoising_start=0.8` so the refiner model performs the last 20% of denoising the **low-noise** timesteps. The base model output should be in **latent** space instead of a PIL image. + +```py +prompt = "A majestic lion jumping from a big stone at night" + +image = base( + prompt=prompt, + num_inference_steps=40, + denoising_end=0.8, + output_type="latent", +).images +image = refiner( + prompt=prompt, + num_inference_steps=40, + denoising_start=0.8, + image=image, +).images[0] +``` + +
+
+ generated image of a lion on a rock at night +
base model
+
+
+ generated image of a lion on a rock at night in higher quality +
ensemble of expert denoisers
+
+
+ +The refiner model can also be used for inpainting in the [`StableDiffusionXLInpaintPipeline`]: + +```py +from diffusers import StableDiffusionXLInpaintPipeline +from diffusers.utils import load_image + +base = StableDiffusionXLInpaintPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +refiner = StableDiffusionXLInpaintPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-refiner-1.0", + text_encoder_2=pipe.text_encoder_2, + vae=pipe.vae, + torch_dtype=torch.float16, + use_safetensors=True, + variant="fp16", +).to("cuda") + +img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" +mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" + +init_image = load_image(img_url).convert("RGB") +mask_image = load_image(mask_url).convert("RGB") + +prompt = "A majestic tiger sitting on a bench" +num_inference_steps = 75 +high_noise_frac = 0.7 + +image = base( + prompt=prompt, + image=init_image, + mask_image=mask_image, + num_inference_steps=num_inference_steps, + denoising_end=high_noise_frac, + output_type="latent", +).images +image = refiner( + prompt=prompt, + image=image, + mask_image=mask_image, + num_inference_steps=num_inference_steps, + denoising_start=high_noise_frac, +).images[0] +``` + +This ensemble of expert denoisers method works well for all available schedulers! + +### Base to refiner model + +SDXL gets a boost in image quality by using the refiner model to add additional high-quality details to the fully-denoised image from the base model, in an image-to-image setting. + +Load the base and refiner models: + +```py +from diffusers import DiffusionPipeline +import torch + +base = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +refiner = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-refiner-1.0", + text_encoder_2=pipe.text_encoder_2, + vae=pipe.vae, + torch_dtype=torch.float16, + use_safetensors=True, + variant="fp16", +).to("cuda") +``` + +Generate an image from the base model, and set the model output to **latent** space: + +```py +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" + +image = base(prompt=prompt, output_type="latent").images[0] +``` + +Pass the generated image to the refiner model: + +```py +image = refiner(prompt=prompt, image=image[None, :]).images[0] +``` + +
+
+ generated image of an astronaut riding a green horse on Mars +
base model
+
+
+ higher quality generated image of an astronaut riding a green horse on Mars +
base model + refiner model
+
+
+ +For inpainting, load the refiner model in the [`StableDiffusionXLInpaintPipeline`], remove the `denoising_end` and `denoising_start` parameters, and choose a smaller number of inference steps for the refiner. + +## Micro-conditioning + +SDXL training involves several additional conditioning techniques, which are referred to as *micro-conditioning*. These include original image size, target image size, and cropping parameters. The micro-conditionings can be used at inference time to create high-quality, centered images. + + + +You can use both micro-conditioning and negative micro-conditioning parameters thanks to classifier-free guidance. They are available in the [`StableDiffusionXLPipeline`], [`StableDiffusionXLImg2ImgPipeline`], [`StableDiffusionXLInpaintPipeline`], and [`StableDiffusionXLControlNetPipeline`]. + + + +### Size conditioning + +There are two types of size conditioning: + +- [`original_size`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.original_size) conditioning comes from upscaled images in the training batch (because it would be wasteful to discard the smaller images which make up almost 40% of the total training data). This way, SDXL learns that upscaling artifacts are not supposed to be present in high-resolution images. During inference, you can use `original_size` to indicate the original image resolution. Using the default value of `(1024, 1024)` produces higher-quality images that resemble the 1024x1024 images in the dataset. If you choose to use a lower resolution, such as `(256, 256)`, the model still generates 1024x1024 images, but they'll look like the low resolution images (simpler patterns, blurring) in the dataset. + +- [`target_size`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.target_size) conditioning comes from finetuning SDXL to support different image aspect ratios. During inference, if you use the default value of `(1024, 1024)`, you'll get an image that resembles the composition of square images in the dataset. We recommend using the same value for `target_size` and `original_size`, but feel free to experiment with other options! + +🀗 Diffusers also lets you specify negative conditions about an image's size to steer generation away from certain image resolutions: + +```py +from diffusers import StableDiffusionXLPipeline +import torch + +pipe = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" +image = pipe( + prompt=prompt, + negative_original_size=(512, 512), + negative_target_size=(1024, 1024), +).images[0] +``` + +
+ +
Images negative conditioned on image resolutions of (128, 128), (256, 256), and (512, 512).
+
+ +### Crop conditioning + +Images generated by previous Stable Diffusion models may sometimes appear to be cropped. This is because images are actually cropped during training so that all the images in a batch have the same size. By conditioning on crop coordinates, SDXL *learns* that no cropping - coordinates `(0, 0)` - usually correlates with centered subjects and complete faces (this is the default value in 🀗 Diffusers). You can experiment with different coordinates if you want to generate off-centered compositions! + +```py +from diffusers import StableDiffusionXLPipeline +import torch + + +pipeline = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" +image = pipeline(prompt=prompt, crops_coords_top_left=(256,0)).images[0] +``` + +
+ generated image of an astronaut in a jungle, slightly cropped +
+ +You can also specify negative cropping coordinates to steer generation away from certain cropping parameters: + +```py +from diffusers import StableDiffusionXLPipeline +import torch + +pipe = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" +image = pipe( + prompt=prompt, + negative_original_size=(512, 512), + negative_crops_coords_top_left=(0, 0), + negative_target_size=(1024, 1024), +).images[0] +``` + +## Use a different prompt for each text-encoder + +SDXL uses two text-encoders, so it is possible to pass a different prompt to each text-encoder, which can [improve quality](https://github.com/huggingface/diffusers/issues/4004#issuecomment-1627764201). Pass your original prompt to `prompt` and the second prompt to `prompt_2` (use `negative_prompt` and `negative_prompt_2` if you're using a negative prompts): + +```py +from diffusers import StableDiffusionXLPipeline +import torch + +pipeline = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +# prompt is passed to OAI CLIP-ViT/L-14 +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" +# prompt_2 is passed to OpenCLIP-ViT/bigG-14 +prompt_2 = "Van Gogh painting" +image = pipeline(prompt=prompt, prompt_2=prompt_2).images[0] +``` + +
+ generated image of an astronaut in a jungle in the style of a van gogh painting +
+ +The dual text-encoders also support textual inversion embeddings that need to be loaded separately as explained in the [SDXL textual inversion](textual_inversion_inference#stable-diffusion-xl] section. + +## Optimizations + +SDXL is a large model, and you may need to optimize memory to get it to run on your hardware. Here are some tips to save memory and speed up inference. + +1. Offload the model to the CPU with [`~StableDiffusionXLPipeline.enable_model_cpu_offload`] for out-of-memory errors: + +```diff +- base.to("cuda") +- refiner.to("cuda") ++ base.enable_model_cpu_offload ++ refiner.enable_model_cpu_offload +``` + +2. Use `torch.compile` for ~20% speed-up (you need `torch>2.0`): + +```diff ++ base.unet = torch.compile(base.unet, mode="reduce-overhead", fullgraph=True) ++ refiner.unet = torch.compile(refiner.unet, mode="reduce-overhead", fullgraph=True) +``` + +3. Enable [xFormers](/optimization/xformers) to run SDXL if `torch<2.0`: + +```diff ++ base.enable_xformers_memory_efficient_attention() ++ refiner.enable_xformers_memory_efficient_attention() +``` + +## Other resources + +If you're interested in experimenting with a minimal version of the [`UNet2DConditionModel`] used in SDXL, take a look at the [minSDXL](https://github.com/cloneofsimo/minSDXL) implementation which is written in PyTorch and directly compatible with 🀗 Diffusers. From 3325d20bbdc86fc93ba866a53e7c5b974a3bd6f7 Mon Sep 17 00:00:00 2001 From: isamu-isozaki Date: Sat, 21 Oct 2023 23:47:24 -0400 Subject: [PATCH 2/9] Finished installation.md --- docs/source/jp/installation.md | 67 +++++++++++++++++----------------- 1 file changed, 34 insertions(+), 33 deletions(-) diff --git a/docs/source/jp/installation.md b/docs/source/jp/installation.md index 1a0951bf7bba..3e4e8b21255c 100644 --- a/docs/source/jp/installation.md +++ b/docs/source/jp/installation.md @@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Installation +# むンストヌル Install 🀗 Diffusers for whichever deep learning library you're working with. @@ -54,37 +54,39 @@ pip install diffusers["flax"] transformers ## Install from source -Before installing 🀗 Diffusers from source, make sure you have `torch` and 🀗 Accelerate installed. +゜ヌスからのむンストヌル -For `torch` installation, refer to the `torch` [installation](https://pytorch.org/get-started/locally/#start-locally) guide. +゜ヌスから🀗 Diffusersをむンストヌルする前に、`torch`ず🀗 Accelerateがむンストヌルされおいるこずを確認しおください。 -To install 🀗 Accelerate: +`torch`のむンストヌルに぀いおは、`torch` [むンストヌル](https://pytorch.org/get-started/locally/#start-locally)ガむドを参照しおください。 + +🀗 Accelerateをむンストヌルするには ```bash pip install accelerate ``` -Install 🀗 Diffusers from source with the following command: +以䞋のコマンドで゜ヌスから🀗 Diffusersをむンストヌルできたす ```bash pip install git+https://github.com/huggingface/diffusers ``` -This command installs the bleeding edge `main` version rather than the latest `stable` version. -The `main` version is useful for staying up-to-date with the latest developments. -For instance, if a bug has been fixed since the last official release but a new release hasn't been rolled out yet. -However, this means the `main` version may not always be stable. -We strive to keep the `main` version operational, and most issues are usually resolved within a few hours or a day. -If you run into a problem, please open an [Issue](https://github.com/huggingface/diffusers/issues/new/choose), so we can fix it even sooner! 
+このコマンドは最新の `stable` バヌゞョンではなく、最先端の `main` バヌゞョンをむンストヌルしたす。 +`main`バヌゞョンは最新の開発に察応するのに䟿利です。 +䟋えば、前回の公匏リリヌス以降にバグが修正されたが、新しいリリヌスがただリリヌスされおいない堎合などには郜合がいいです。 +しかし、これは `main` バヌゞョンが垞に安定しおいるずは限らないです。 +私たちは `main` バヌゞョンを運甚し続けるよう努力しおおり、ほずんどの問題は通垞数時間から1日以内に解決されたす。 +もし問題が発生した堎合は、[Issue](https://github.com/huggingface/diffusers/issues/new/choose) を開いおください -## Editable install +## 線集可胜なむンストヌル -You will need an editable install if you'd like to: +以䞋の堎合、線集可胜なむンストヌルが必芁です -* Use the `main` version of the source code. -* Contribute to 🀗 Diffusers and need to test changes in the code. +* ゜ヌスコヌドの `main` バヌゞョンを䜿甚する。 +* 🀗 Diffusers に貢献し、コヌドの倉曎をテストする必芁がある堎合。 -Clone the repository and install 🀗 Diffusers with the following commands: +リポゞトリをクロヌンし、次のコマンドで 🀗 Diffusers をむンストヌルしおください ```bash git clone https://github.com/huggingface/diffusers.git @@ -104,43 +106,42 @@ pip install -e ".[flax]" -These commands will link the folder you cloned the repository to and your Python library paths. -Python will now look inside the folder you cloned to in addition to the normal library paths. -For example, if your Python packages are typically installed in `~/anaconda3/envs/main/lib/python3.8/site-packages/`, Python will also search the `~/diffusers/` folder you cloned to. +これらのコマンドは、リポゞトリをクロヌンしたフォルダず Python のラむブラリパスをリンクしたす。 +Python は通垞のラむブラリパスに加えお、クロヌンしたフォルダの䞭を探すようになりたす。 +䟋えば、Python パッケヌゞが通垞 `~/anaconda3/envs/main/lib/python3.8/site-packages/` にむンストヌルされおいる堎合、Python はクロヌンした `~/diffusers/` フォルダも同様に参照したす。 -You must keep the `diffusers` folder if you want to keep using the library. +ラむブラリを䜿い続けたい堎合は、`diffusers`フォルダを残しおおく必芁がありたす。 -Now you can easily update your clone to the latest version of 🀗 Diffusers with the following command: +これで、以䞋のコマンドで簡単にクロヌンを最新版の🀗 Diffusersにアップデヌトできたす ```bash cd ~/diffusers/ git pull ``` -Your Python environment will find the `main` version of 🀗 Diffusers on the next run. +Python環境は次の実行時に `main` バヌゞョンの🀗 Diffusersを芋぀けたす。 -## Notice on telemetry logging +## テレメトリヌ・ロギングに関するお知らせ -Our library gathers telemetry information during `from_pretrained()` requests. -This data includes the version of Diffusers and PyTorch/Flax, the requested model or pipeline class, -and the path to a pre-trained checkpoint if it is hosted on the Hub. -This usage data helps us debug issues and prioritize new features. -Telemetry is only sent when loading models and pipelines from the HuggingFace Hub, -and is not collected during local usage. 
+このラむブラリは `from_pretrained()` リク゚スト䞭にデヌタを収集したす。 +このデヌタには Diffusers ず PyTorch/Flax のバヌゞョン、芁求されたモデルやパむプラむンクラスが含たれたす。 +たた、Hubでホストされおいる堎合は、事前に孊習されたチェックポむントぞのパスが含たれたす。 +この䜿甚デヌタは問題のデバッグや新機胜の優先順䜍付けに圹立ちたす。 +テレメトリヌはHuggingFace Hubからモデルやパむプラむンをロヌドするずきのみ送信されたす。ロヌカルでの䜿甚䞭は収集されたせん。 -We understand that not everyone wants to share additional information, and we respect your privacy, -so you can disable telemetry collection by setting the `DISABLE_TELEMETRY` environment variable from your terminal: +我々は、すべおの人が远加情報を共有したくないこずを理解し、あなたのプラむバシヌを尊重したす。 +そのため、タヌミナルから `DISABLE_TELEMETRY` 環境倉数を蚭定するこずで、デヌタ収集を無効にするこずができたす -On Linux/MacOS: +Linux/MacOSの堎合 ```bash export DISABLE_TELEMETRY=YES ``` -On Windows: +Windows の堎合 ```bash set DISABLE_TELEMETRY=YES ``` From de96a183b188dbb4db511f0c140ad0590691775d Mon Sep 17 00:00:00 2001 From: isamu-isozaki Date: Sun, 22 Oct 2023 00:17:02 -0400 Subject: [PATCH 3/9] Properly finished installation.md and almost finished quicktour --- docs/source/jp/_toctree.yml | 74 +- .../stable_diffusion/stable_diffusion_xl.md | 52 -- docs/source/jp/installation.md | 26 +- docs/source/jp/optimization/fp16.md | 68 -- docs/source/jp/optimization/memory.md | 357 --------- docs/source/jp/optimization/open_vino.md | 81 -- docs/source/jp/optimization/opt_overview.md | 17 - docs/source/jp/optimization/tome.md | 89 --- docs/source/jp/optimization/torch2.0.md | 434 ----------- docs/source/jp/quicktour.md | 26 +- docs/source/jp/training/create_dataset.md | 90 --- docs/source/jp/training/dreambooth.md | 710 ------------------ docs/source/jp/training/overview.md | 84 --- docs/source/jp/training/text_inversion.md | 277 ------- docs/source/jp/tutorials/autopipeline.md | 146 ---- docs/source/jp/tutorials/basic_training.md | 404 ---------- docs/source/jp/tutorials/tutorial_overview.md | 23 - .../jp/tutorials/using_peft_for_inference.md | 165 ---- .../jp/using-diffusers/pipeline_overview.md | 17 - docs/source/jp/using-diffusers/sdxl.md | 431 ----------- 20 files changed, 26 insertions(+), 3545 deletions(-) delete mode 100644 docs/source/jp/api/pipelines/stable_diffusion/stable_diffusion_xl.md delete mode 100644 docs/source/jp/optimization/fp16.md delete mode 100644 docs/source/jp/optimization/memory.md delete mode 100644 docs/source/jp/optimization/open_vino.md delete mode 100644 docs/source/jp/optimization/opt_overview.md delete mode 100644 docs/source/jp/optimization/tome.md delete mode 100644 docs/source/jp/optimization/torch2.0.md delete mode 100644 docs/source/jp/training/create_dataset.md delete mode 100644 docs/source/jp/training/dreambooth.md delete mode 100644 docs/source/jp/training/overview.md delete mode 100644 docs/source/jp/training/text_inversion.md delete mode 100644 docs/source/jp/tutorials/autopipeline.md delete mode 100644 docs/source/jp/tutorials/basic_training.md delete mode 100644 docs/source/jp/tutorials/tutorial_overview.md delete mode 100644 docs/source/jp/tutorials/using_peft_for_inference.md delete mode 100644 docs/source/jp/using-diffusers/pipeline_overview.md delete mode 100644 docs/source/jp/using-diffusers/sdxl.md diff --git a/docs/source/jp/_toctree.yml b/docs/source/jp/_toctree.yml index f5c8b7d1ddea..dc1c30afe4bc 100644 --- a/docs/source/jp/_toctree.yml +++ b/docs/source/jp/_toctree.yml @@ -9,76 +9,4 @@ title: むンストヌル - local: in_translation title: 翻蚳に぀いお - title: はじめに -- sections: - - local: tutorials/tutorial_overview - title: 抂芁 - - local: using-diffusers/write_own_pipeline - title: モデルずスケゞュヌラを理解する - - local: tutorials/autopipeline - title: 自動パむプラむン - - local: 
tutorials/basic_training - title: 拡散モデルのトレヌニング - - local: tutorials/using_peft_for_inference - title: PEFTで効率よく生成 - title: チュヌトリアル -- sections: - - sections: - - local: using-diffusers/loading_overview - title: 抂芁 - - local: using-diffusers/loading - title: パむプラむン、モデル、スケゞュヌラのロヌド - - local: using-diffusers/schedulers - title: 異なるスケゞュヌラのロヌドず比范 - - local: using-diffusers/custom_pipeline_overview - title: コミュニティ・パむプラむンをロヌドする - - local: using-diffusers/using_safetensors - title: 安党なモデルのロヌド - - local: using-diffusers/other-formats - title: 様々なStable Diffusionフォヌマットのロヌド - - local: using-diffusers/push_to_hub - title: ファむルをハブにプッシュする - title: 技術 - - sections: - - local: using-diffusers/pipeline_overview - title: 抂芁 - - local: using-diffusers/sdxl - title: Stable Diffusion XL - title: 生成のためのパむプラむン - - sections: - - local: training/overview - title: 抂芁 - - local: training/create_dataset - title: トレヌニングのためのデヌタセット䜜成 - - local: training/text_inversion - title: Textual Inversion - - local: training/dreambooth - title: Dreambooth - title: トレヌニング - title: Diffusersの䜿い方 -- sections: - - local: optimization/opt_overview - title: 抂芁 - - sections: - - local: optimization/fp16 - title: 生成のスピヌドアップ - - local: optimization/memory - title: メモリヌ䜿甚の削枛 - - local: optimization/torch2.0 - title: Torch 2.0 - - local: optimization/tome - title: トヌクンのマヌゞ - title: 䞀般的な最適化 - - sections: - - local: optimization/open_vino - title: OpenVINO - title: 最適化されたモデルタむプ - title: 最適化 -- sections: - - sections: - - sections: - - local: api/pipelines/stable_diffusion/stable_diffusion_xl - title: Stable Diffusion XL - title: Stable Diffusion - title: パむプラむン - title: API + title: はじめに \ No newline at end of file diff --git a/docs/source/jp/api/pipelines/stable_diffusion/stable_diffusion_xl.md b/docs/source/jp/api/pipelines/stable_diffusion/stable_diffusion_xl.md deleted file mode 100644 index aedb03d51caf..000000000000 --- a/docs/source/jp/api/pipelines/stable_diffusion/stable_diffusion_xl.md +++ /dev/null @@ -1,52 +0,0 @@ - - -# Stable Diffusion XL - -Stable Diffusion XL (SDXL) was proposed in [SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis](https://huggingface.co/papers/2307.01952) by Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas MÃŒller, Joe Penna, and Robin Rombach. - -The abstract from the paper is: - -*We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators.* - -## Tips - -- Most SDXL checkpoints work best with an image size of 1024x1024. Image sizes of 768x768 and 512x512 are also supported, but the results aren't as good. Anything below 512x512 is not recommended and likely won't for for default checkpoints like [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0). 
-- SDXL can pass a different prompt for each of the text encoders it was trained on. We can even pass different parts of the same prompt to the text encoders. -- SDXL output images can be improved by making use of a refiner model in an image-to-image setting. -- SDXL offers `negative_original_size`, `negative_crops_coords_top_left`, and `negative_target_size` to negatively condition the model on image resolution and cropping parameters. - - - -To learn how to use SDXL for various tasks, how to optimize performance, and other usage examples, take a look at the [Stable Diffusion XL](../../../using-diffusers/sdxl) guide. - -Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the official base and refiner model checkpoints! - - - -## StableDiffusionXLPipeline - -[[autodoc]] StableDiffusionXLPipeline - - all - - __call__ - -## StableDiffusionXLImg2ImgPipeline - -[[autodoc]] StableDiffusionXLImg2ImgPipeline - - all - - __call__ - -## StableDiffusionXLInpaintPipeline - -[[autodoc]] StableDiffusionXLInpaintPipeline - - all - - __call__ diff --git a/docs/source/jp/installation.md b/docs/source/jp/installation.md index 3e4e8b21255c..dbfd19d6cb7a 100644 --- a/docs/source/jp/installation.md +++ b/docs/source/jp/installation.md @@ -12,32 +12,32 @@ specific language governing permissions and limitations under the License. # むンストヌル -Install 🀗 Diffusers for whichever deep learning library you're working with. +お䜿いのディヌプラヌニングラむブラリに合わせおDiffusersをむンストヌルできたす。 -🀗 Diffusers is tested on Python 3.8+, PyTorch 1.7.0+ and Flax. Follow the installation instructions below for the deep learning library you are using: +🀗 DiffusersはPython 3.8+、PyTorch 1.7.0+、Flaxでテストされおいたす。䜿甚するディヌプラヌニングラむブラリの以䞋のむンストヌル手順に埓っおください -- [PyTorch](https://pytorch.org/get-started/locally/) installation instructions. -- [Flax](https://flax.readthedocs.io/en/latest/) installation instructions. +- [PyTorch](https://pytorch.org/get-started/locally/)のむンストヌル手順。 +- [Flax](https://flax.readthedocs.io/en/latest/)のむンストヌル手順。 -## Install with pip +## pip でむンストヌル -You should install 🀗 Diffusers in a [virtual environment](https://docs.python.org/3/library/venv.html). -If you're unfamiliar with Python virtual environments, take a look at this [guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/). -A virtual environment makes it easier to manage different projects and avoid compatibility issues between dependencies. 
+Diffusersは[仮想環境](https://docs.python.org/3/library/venv.html)の䞭でむンストヌルするこずが掚奚されおいたす。 +Python の仮想環境に぀いおよく知らない堎合は、こちらの [ガむド](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/) を参照しおください。 +仮想環境は異なるプロゞェクトの管理を容易にし、䟝存関係間の互換性の問題を回避したす。 -Start by creating a virtual environment in your project directory: +ではさっそく、プロゞェクトディレクトリに仮想環境を䜜っおみたす ```bash python -m venv .env ``` -Activate the virtual environment: +仮想環境をアクティブにしたす ```bash source .env/bin/activate ``` -🀗 Diffusers also relies on the 🀗 Transformers library, and you can install both with the following command: +🀗 Diffusers もたた 🀗 Transformers ラむブラリに䟝存しおおり、以䞋のコマンドで䞡方をむンストヌルできたす @@ -52,9 +52,7 @@ pip install diffusers["flax"] transformers -## Install from source - -゜ヌスからのむンストヌル +## ゜ヌスからのむンストヌル ゜ヌスから🀗 Diffusersをむンストヌルする前に、`torch`ず🀗 Accelerateがむンストヌルされおいるこずを確認しおください。 diff --git a/docs/source/jp/optimization/fp16.md b/docs/source/jp/optimization/fp16.md deleted file mode 100644 index 2ac16786eb46..000000000000 --- a/docs/source/jp/optimization/fp16.md +++ /dev/null @@ -1,68 +0,0 @@ - - -# Speed up inference - -There are several ways to optimize 🀗 Diffusers for inference speed. As a general rule of thumb, we recommend using either [xFormers](xformers) or `torch.nn.functional.scaled_dot_product_attention` in PyTorch 2.0 for their memory-efficient attention. - - - -In many cases, optimizing for speed or memory leads to improved performance in the other, so you should try to optimize for both whenever you can. This guide focuses on inference speed, but you can learn more about preserving memory in the [Reduce memory usage](memory) guide. - - - -The results below are obtained from generating a single 512x512 image from the prompt `a photo of an astronaut riding a horse on mars` with 50 DDIM steps on a Nvidia Titan RTX, demonstrating the speed-up you can expect. - -| | latency | speed-up | -| ---------------- | ------- | ------- | -| original | 9.50s | x1 | -| fp16 | 3.61s | x2.63 | -| channels last | 3.30s | x2.88 | -| traced UNet | 3.21s | x2.96 | -| memory efficient attention | 2.63s | x3.61 | - -## Use TensorFloat-32 - -On Ampere and later CUDA devices, matrix multiplications and convolutions can use the [TensorFloat-32 (TF32)](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) mode for faster, but slightly less accurate computations. By default, PyTorch enables TF32 mode for convolutions but not matrix multiplications. Unless your network requires full float32 precision, we recommend enabling TF32 for matrix multiplications. It can significantly speeds up computations with typically negligible loss in numerical accuracy. - -```python -import torch - -torch.backends.cuda.matmul.allow_tf32 = True -``` - -You can learn more about TF32 in the [Mixed precision training](https://huggingface.co/docs/transformers/en/perf_train_gpu_one#tf32) guide. 
- -## Half-precision weights - -To save GPU memory and get more speed, try loading and running the model weights directly in half-precision or float16: - -```Python -import torch -from diffusers import DiffusionPipeline - -pipe = DiffusionPipeline.from_pretrained( - "runwayml/stable-diffusion-v1-5", - torch_dtype=torch.float16, - use_safetensors=True, -) -pipe = pipe.to("cuda") - -prompt = "a photo of an astronaut riding a horse on mars" -image = pipe(prompt).images[0] -``` - - - -Don't use [`torch.autocast`](https://pytorch.org/docs/stable/amp.html#torch.autocast) in any of the pipelines as it can lead to black images and is always slower than pure float16 precision. - - \ No newline at end of file diff --git a/docs/source/jp/optimization/memory.md b/docs/source/jp/optimization/memory.md deleted file mode 100644 index c91fed1b2784..000000000000 --- a/docs/source/jp/optimization/memory.md +++ /dev/null @@ -1,357 +0,0 @@ -# Reduce memory usage - -A barrier to using diffusion models is the large amount of memory required. To overcome this challenge, there are several memory-reducing techniques you can use to run even some of the largest models on free-tier or consumer GPUs. Some of these techniques can even be combined to further reduce memory usage. - - - -In many cases, optimizing for memory or speed leads to improved performance in the other, so you should try to optimize for both whenever you can. This guide focuses on minimizing memory usage, but you can also learn more about how to [Speed up inference](fp16). - - - -The results below are obtained from generating a single 512x512 image from the prompt a photo of an astronaut riding a horse on mars with 50 DDIM steps on a Nvidia Titan RTX, demonstrating the speed-up you can expect as a result of reduced memory consumption. - -| | latency | speed-up | -| ---------------- | ------- | ------- | -| original | 9.50s | x1 | -| fp16 | 3.61s | x2.63 | -| channels last | 3.30s | x2.88 | -| traced UNet | 3.21s | x2.96 | -| memory-efficient attention | 2.63s | x3.61 | - - -## Sliced VAE - -Sliced VAE enables decoding large batches of images with limited VRAM or batches with 32 images or more by decoding the batches of latents one image at a time. You'll likely want to couple this with [`~ModelMixin.enable_xformers_memory_efficient_attention`] to further reduce memory use. - -To use sliced VAE, call [`~StableDiffusionPipeline.enable_vae_slicing`] on your pipeline before inference: - -```python -import torch -from diffusers import StableDiffusionPipeline - -pipe = StableDiffusionPipeline.from_pretrained( - "runwayml/stable-diffusion-v1-5", - torch_dtype=torch.float16, - use_safetensors=True, -) -pipe = pipe.to("cuda") - -prompt = "a photo of an astronaut riding a horse on mars" -pipe.enable_vae_slicing() -images = pipe([prompt] * 32).images -``` - -You may see a small performance boost in VAE decoding on multi-image batches, and there should be no performance impact on single-image batches. - -## Tiled VAE - -Tiled VAE processing also enables working with large images on limited VRAM (for example, generating 4k images on 8GB of VRAM) by splitting the image into overlapping tiles, decoding the tiles, and then blending the outputs together to compose the final image. You should also used tiled VAE with [`~ModelMixin.enable_xformers_memory_efficient_attention`] to further reduce memory use. 
- -To use tiled VAE processing, call [`~StableDiffusionPipeline.enable_vae_tiling`] on your pipeline before inference: - -```python -import torch -from diffusers import StableDiffusionPipeline, UniPCMultistepScheduler - -pipe = StableDiffusionPipeline.from_pretrained( - "runwayml/stable-diffusion-v1-5", - torch_dtype=torch.float16, - use_safetensors=True, -) -pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config) -pipe = pipe.to("cuda") -prompt = "a beautiful landscape photograph" -pipe.enable_vae_tiling() -pipe.enable_xformers_memory_efficient_attention() - -image = pipe([prompt], width=3840, height=2224, num_inference_steps=20).images[0] -``` - -The output image has some tile-to-tile tone variation because the tiles are decoded separately, but you shouldn't see any sharp and obvious seams between the tiles. Tiling is turned off for images that are 512x512 or smaller. - -## CPU offloading - -Offloading the weights to the CPU and only loading them on the GPU when performing the forward pass can also save memory. Often, this technique can reduce memory consumption to less than 3GB. - -To perform CPU offloading, call [`~StableDiffusionPipeline.enable_sequential_cpu_offload`]: - -```Python -import torch -from diffusers import StableDiffusionPipeline - -pipe = StableDiffusionPipeline.from_pretrained( - "runwayml/stable-diffusion-v1-5", - torch_dtype=torch.float16, - use_safetensors=True, -) - -prompt = "a photo of an astronaut riding a horse on mars" -pipe.enable_sequential_cpu_offload() -image = pipe(prompt).images[0] -``` - -CPU offloading works on submodules rather than whole models. This is the best way to minimize memory consumption, but inference is much slower due to the iterative nature of the diffusion process. The UNet component of the pipeline runs several times (as many as `num_inference_steps`); each time, the different UNet submodules are sequentially onloaded and offloaded as needed, resulting in a large number of memory transfers. - - - -Consider using [model offloading](#model-offloading) if you want to optimize for speed because it is much faster. The tradeoff is your memory savings won't be as large. - - - -CPU offloading can also be chained with attention slicing to reduce memory consumption to less than 2GB. - -```Python -import torch -from diffusers import StableDiffusionPipeline - -pipe = StableDiffusionPipeline.from_pretrained( - "runwayml/stable-diffusion-v1-5", - torch_dtype=torch.float16, - use_safetensors=True, -) - -prompt = "a photo of an astronaut riding a horse on mars" -pipe.enable_sequential_cpu_offload() - -image = pipe(prompt).images[0] -``` - - - -When using [`~StableDiffusionPipeline.enable_sequential_cpu_offload`], don't move the pipeline to CUDA beforehand or else the gain in memory consumption will only be minimal (see this [issue](https://github.com/huggingface/diffusers/issues/1934) for more information). - -[`~StableDiffusionPipeline.enable_sequential_cpu_offload`] is a stateful operation that installs hooks on the models. - - - -## Model offloading - - - -Model offloading requires 🀗 Accelerate version 0.17.0 or higher. - - - -[Sequential CPU offloading](#cpu-offloading) preserves a lot of memory but it makes inference slower because submodules are moved to GPU as needed, and they're immediately returned to the CPU when a new module runs. - -Full-model offloading is an alternative that moves whole models to the GPU, instead of handling each model's constituent *submodules*. 
There is a negligible impact on inference time (compared with moving the pipeline to `cuda`), and it still provides some memory savings. - -During model offloading, only one of the main components of the pipeline (typically the text encoder, UNet and VAE) -is placed on the GPU while the others wait on the CPU. Components like the UNet that run for multiple iterations stay on the GPU until they're no longer needed. - -Enable model offloading by calling [`~StableDiffusionPipeline.enable_model_cpu_offload`] on the pipeline: - -```Python -import torch -from diffusers import StableDiffusionPipeline - -pipe = StableDiffusionPipeline.from_pretrained( - "runwayml/stable-diffusion-v1-5", - torch_dtype=torch.float16, - use_safetensors=True, -) - -prompt = "a photo of an astronaut riding a horse on mars" -pipe.enable_model_cpu_offload() -image = pipe(prompt).images[0] -``` - -Model offloading can also be combined with attention slicing for additional memory savings. - -```Python -import torch -from diffusers import StableDiffusionPipeline - -pipe = StableDiffusionPipeline.from_pretrained( - "runwayml/stable-diffusion-v1-5", - torch_dtype=torch.float16, - use_safetensors=True, -) - -prompt = "a photo of an astronaut riding a horse on mars" -pipe.enable_model_cpu_offload() - -image = pipe(prompt).images[0] -``` - - - -In order to properly offload models after they're called, it is required to run the entire pipeline and models are called in the pipeline's expected order. Exercise caution if models are reused outside the context of the pipeline after hooks have been installed. See [Removing Hooks](https://huggingface.co/docs/accelerate/en/package_reference/big_modeling#accelerate.hooks.remove_hook_from_module) -for more information. - -[`~StableDiffusionPipeline.enable_model_cpu_offload`] is a stateful operation that installs hooks on the models and state on the pipeline. - - - -## Channels-last memory format - -The channels-last memory format is an alternative way of ordering NCHW tensors in memory to preserve dimension ordering. Channels-last tensors are ordered in such a way that the channels become the densest dimension (storing images pixel-per-pixel). Since not all operators currently support the channels-last format, it may result in worst performance but you should still try and see if it works for your model. - -For example, to set the pipeline's UNet to use the channels-last format: - -```python -print(pipe.unet.conv_out.state_dict()["weight"].stride()) # (2880, 9, 3, 1) -pipe.unet.to(memory_format=torch.channels_last) # in-place operation -print( - pipe.unet.conv_out.state_dict()["weight"].stride() -) # (2880, 1, 960, 320) having a stride of 1 for the 2nd dimension proves that it works -``` - -## Tracing - -Tracing runs an example input tensor through the model and captures the operations that are performed on it as that input makes its way through the model's layers. The executable or `ScriptFunction` that is returned is optimized with just-in-time compilation. 
- -To trace a UNet: - -```python -import time -import torch -from diffusers import StableDiffusionPipeline -import functools - -# torch disable grad -torch.set_grad_enabled(False) - -# set variables -n_experiments = 2 -unet_runs_per_experiment = 50 - - -# load inputs -def generate_inputs(): - sample = torch.randn(2, 4, 64, 64).half().cuda() - timestep = torch.rand(1).half().cuda() * 999 - encoder_hidden_states = torch.randn(2, 77, 768).half().cuda() - return sample, timestep, encoder_hidden_states - - -pipe = StableDiffusionPipeline.from_pretrained( - "runwayml/stable-diffusion-v1-5", - torch_dtype=torch.float16, - use_safetensors=True, -).to("cuda") -unet = pipe.unet -unet.eval() -unet.to(memory_format=torch.channels_last) # use channels_last memory format -unet.forward = functools.partial(unet.forward, return_dict=False) # set return_dict=False as default - -# warmup -for _ in range(3): - with torch.inference_mode(): - inputs = generate_inputs() - orig_output = unet(*inputs) - -# trace -print("tracing..") -unet_traced = torch.jit.trace(unet, inputs) -unet_traced.eval() -print("done tracing") - - -# warmup and optimize graph -for _ in range(5): - with torch.inference_mode(): - inputs = generate_inputs() - orig_output = unet_traced(*inputs) - - -# benchmarking -with torch.inference_mode(): - for _ in range(n_experiments): - torch.cuda.synchronize() - start_time = time.time() - for _ in range(unet_runs_per_experiment): - orig_output = unet_traced(*inputs) - torch.cuda.synchronize() - print(f"unet traced inference took {time.time() - start_time:.2f} seconds") - for _ in range(n_experiments): - torch.cuda.synchronize() - start_time = time.time() - for _ in range(unet_runs_per_experiment): - orig_output = unet(*inputs) - torch.cuda.synchronize() - print(f"unet inference took {time.time() - start_time:.2f} seconds") - -# save the model -unet_traced.save("unet_traced.pt") -``` - -Replace the `unet` attribute of the pipeline with the traced model: - -```python -from diffusers import StableDiffusionPipeline -import torch -from dataclasses import dataclass - - -@dataclass -class UNet2DConditionOutput: - sample: torch.FloatTensor - - -pipe = StableDiffusionPipeline.from_pretrained( - "runwayml/stable-diffusion-v1-5", - torch_dtype=torch.float16, - use_safetensors=True, -).to("cuda") - -# use jitted unet -unet_traced = torch.jit.load("unet_traced.pt") - - -# del pipe.unet -class TracedUNet(torch.nn.Module): - def __init__(self): - super().__init__() - self.in_channels = pipe.unet.in_channels - self.device = pipe.unet.device - - def forward(self, latent_model_input, t, encoder_hidden_states): - sample = unet_traced(latent_model_input, t, encoder_hidden_states)[0] - return UNet2DConditionOutput(sample=sample) - - -pipe.unet = TracedUNet() - -with torch.inference_mode(): - image = pipe([prompt] * 1, num_inference_steps=50).images[0] -``` - -## Memory-efficient attention - -Recent work on optimizing bandwidth in the attention block has generated huge speed-ups and reductions in GPU memory usage. The most recent type of memory-efficient attention is [Flash Attention](https://arxiv.org/pdf/2205.14135.pdf) (you can check out the original code at [HazyResearch/flash-attention](https://github.com/HazyResearch/flash-attention)). - - - -If you have PyTorch >= 2.0 installed, you should not expect a speed-up for inference when enabling `xformers`. 
- - - -To use Flash Attention, install the following: - -- PyTorch > 1.12 -- CUDA available -- [xFormers](xformers) - -Then call [`~ModelMixin.enable_xformers_memory_efficient_attention`] on the pipeline: - -```python -from diffusers import DiffusionPipeline -import torch - -pipe = DiffusionPipeline.from_pretrained( - "runwayml/stable-diffusion-v1-5", - torch_dtype=torch.float16, - use_safetensors=True, -).to("cuda") - -pipe.enable_xformers_memory_efficient_attention() - -with torch.inference_mode(): - sample = pipe("a small cat") - -# optional: You can disable it via -# pipe.disable_xformers_memory_efficient_attention() -``` - -The iteration speed when using `xformers` should match the iteration speed of Torch 2.0 as described [here](torch2.0). diff --git a/docs/source/jp/optimization/open_vino.md b/docs/source/jp/optimization/open_vino.md deleted file mode 100644 index 606c2207bcda..000000000000 --- a/docs/source/jp/optimization/open_vino.md +++ /dev/null @@ -1,81 +0,0 @@ - - - -# OpenVINO - -🀗 [Optimum](https://github.com/huggingface/optimum-intel) provides Stable Diffusion pipelines compatible with OpenVINO to perform inference on a variety of Intel processors (see the [full list]((https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html)) of supported devices). - -You'll need to install 🀗 Optimum Intel with the `--upgrade-strategy eager` option to ensure [`optimum-intel`](https://github.com/huggingface/optimum-intel) is using the latest version: - -``` -pip install --upgrade-strategy eager optimum["openvino"] -``` - -This guide will show you how to use the Stable Diffusion and Stable Diffusion XL (SDXL) pipelines with OpenVINO. - -## Stable Diffusion - -To load and run inference, use the [`~optimum.intel.OVStableDiffusionPipeline`]. If you want to load a PyTorch model and convert it to the OpenVINO format on-the-fly, set `export=True`: - -```python -from optimum.intel import OVStableDiffusionPipeline - -model_id = "runwayml/stable-diffusion-v1-5" -pipeline = OVStableDiffusionPipeline.from_pretrained(model_id, export=True) -prompt = "sailing ship in storm by Rembrandt" -image = pipeline(prompt).images[0] - -# Don't forget to save the exported model -pipeline.save_pretrained("openvino-sd-v1-5") -``` - -To further speed-up inference, statically reshape the model. If you change any parameters such as the outputs height or width, you’ll need to statically reshape your model again. - -```python -# Define the shapes related to the inputs and desired outputs -batch_size, num_images, height, width = 1, 1, 512, 512 - -# Statically reshape the model -pipeline.reshape(batch_size, height, width, num_images) -# Compile the model before inference -pipeline.compile() - -image = pipeline( - prompt, - height=height, - width=width, - num_images_per_prompt=num_images, -).images[0] -``` -
- -
- -You can find more examples in the 🀗 Optimum [documentation](https://huggingface.co/docs/optimum/intel/inference#stable-diffusion), and Stable Diffusion is supported for text-to-image, image-to-image, and inpainting. - -## Stable Diffusion XL - -To load and run inference with SDXL, use the [`~optimum.intel.OVStableDiffusionXLPipeline`]: - -```python -from optimum.intel import OVStableDiffusionXLPipeline - -model_id = "stabilityai/stable-diffusion-xl-base-1.0" -pipeline = OVStableDiffusionXLPipeline.from_pretrained(model_id) -prompt = "sailing ship in storm by Rembrandt" -image = pipeline(prompt).images[0] -``` - -To further speed-up inference, [statically reshape](#stable-diffusion) the model as shown in the Stable Diffusion section. - -You can find more examples in the 🀗 Optimum [documentation](https://huggingface.co/docs/optimum/intel/inference#stable-diffusion-xl), and running SDXL in OpenVINO is supported for text-to-image and image-to-image. diff --git a/docs/source/jp/optimization/opt_overview.md b/docs/source/jp/optimization/opt_overview.md deleted file mode 100644 index 1f809bb011ce..000000000000 --- a/docs/source/jp/optimization/opt_overview.md +++ /dev/null @@ -1,17 +0,0 @@ - - -# Overview - -Generating high-quality outputs is computationally intensive, especially during each iterative step where you go from a noisy output to a less noisy output. One of 🀗 Diffuser's goal is to make this technology widely accessible to everyone, which includes enabling fast inference on consumer and specialized hardware. - -This section will cover tips and tricks - like half-precision weights and sliced attention - for optimizing inference speed and reducing memory-consumption. You'll also learn how to speed up your PyTorch code with [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) or [ONNX Runtime](https://onnxruntime.ai/docs/), and enable memory-efficient attention with [xFormers](https://facebookresearch.github.io/xformers/). There are also guides for running inference on specific hardware like Apple Silicon, and Intel or Habana processors. \ No newline at end of file diff --git a/docs/source/jp/optimization/tome.md b/docs/source/jp/optimization/tome.md deleted file mode 100644 index 66d69c6900cc..000000000000 --- a/docs/source/jp/optimization/tome.md +++ /dev/null @@ -1,89 +0,0 @@ - - -# Token merging - -[Token merging](https://huggingface.co/papers/2303.17604) (ToMe) merges redundant tokens/patches progressively in the forward pass of a Transformer-based network which can speed-up the inference latency of [`StableDiffusionPipeline`]. - -You can use ToMe from the [`tomesd`](https://github.com/dbolya/tomesd) library with the [`apply_patch`](https://github.com/dbolya/tomesd?tab=readme-ov-file#usage) function: - -```diff -from diffusers import StableDiffusionPipeline -import tomesd - -pipeline = StableDiffusionPipeline.from_pretrained( - "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True, -).to("cuda") -+ tomesd.apply_patch(pipeline, ratio=0.5) - -image = pipeline("a photo of an astronaut riding a horse on mars").images[0] -``` - -The `apply_patch` function exposes a number of [arguments](https://github.com/dbolya/tomesd#usage) to help strike a balance between pipeline inference speed and the quality of the generated tokens. The most important argument is `ratio` which controls the number of tokens that are merged during the forward pass. 
- -As reported in the [paper](https://huggingface.co/papers/2303.17604), ToMe largely preserves the quality of the generated images while boosting inference speed. By increasing the `ratio`, you can speed up inference even further, but at the cost of some degraded image quality. - -To test the quality of the generated images, we sampled a few prompts from [Parti Prompts](https://parti.research.google/) and performed inference using the [`StableDiffusionPipeline`] with the following settings: -
- -
- -We didn’t notice any significant decrease in the quality of the generated samples, and you can check out the generated samples in this [WandB report](https://wandb.ai/sayakpaul/tomesd-results/runs/23j4bj3i?workspace=). If you're interested in reproducing this experiment, use this [script](https://gist.github.com/sayakpaul/8cac98d7f22399085a060992f411ecbd). - -## Benchmarks - -We also benchmarked the impact of `tomesd` on the [`StableDiffusionPipeline`] with [xFormers](https://huggingface.co/docs/diffusers/optimization/xformers) enabled across several image resolutions. The results are obtained from A100 and V100 GPUs in the following development environment: - -```bash -- `diffusers` version: 0.15.1 -- Python version: 3.8.16 -- PyTorch version (GPU?): 1.13.1+cu116 (True) -- Huggingface_hub version: 0.13.2 -- Transformers version: 4.27.2 -- Accelerate version: 0.18.0 -- xFormers version: 0.0.16 -- tomesd version: 0.1.2 -``` - -To reproduce this benchmark, feel free to use this [script](https://gist.github.com/sayakpaul/27aec6bca7eb7b0e0aa4112205850335). The results are reported in seconds, and where applicable we report the speed-up percentage over the vanilla pipeline when using ToMe and ToMe + xFormers. - -| **GPU** | **Resolution** | **Batch size** | **Vanilla** | **ToMe** | **ToMe + xFormers** | -|----------|----------------|----------------|-------------|----------------|---------------------| -| **A100** | 512 | 10 | 6.88 | 5.26 (+23.55%) | 4.69 (+31.83%) | -| | 768 | 10 | OOM | 14.71 | 11 | -| | | 8 | OOM | 11.56 | 8.84 | -| | | 4 | OOM | 5.98 | 4.66 | -| | | 2 | 4.99 | 3.24 (+35.07%) | 2.1 (+37.88%) | -| | | 1 | 3.29 | 2.24 (+31.91%) | 2.03 (+38.3%) | -| | 1024 | 10 | OOM | OOM | OOM | -| | | 8 | OOM | OOM | OOM | -| | | 4 | OOM | 12.51 | 9.09 | -| | | 2 | OOM | 6.52 | 4.96 | -| | | 1 | 6.4 | 3.61 (+43.59%) | 2.81 (+56.09%) | -| **V100** | 512 | 10 | OOM | 10.03 | 9.29 | -| | | 8 | OOM | 8.05 | 7.47 | -| | | 4 | 5.7 | 4.3 (+24.56%) | 3.98 (+30.18%) | -| | | 2 | 3.14 | 2.43 (+22.61%) | 2.27 (+27.71%) | -| | | 1 | 1.88 | 1.57 (+16.49%) | 1.57 (+16.49%) | -| | 768 | 10 | OOM | OOM | 23.67 | -| | | 8 | OOM | OOM | 18.81 | -| | | 4 | OOM | 11.81 | 9.7 | -| | | 2 | OOM | 6.27 | 5.2 | -| | | 1 | 5.43 | 3.38 (+37.75%) | 2.82 (+48.07%) | -| | 1024 | 10 | OOM | OOM | OOM | -| | | 8 | OOM | OOM | OOM | -| | | 4 | OOM | OOM | 19.35 | -| | | 2 | OOM | 13 | 10.78 | -| | | 1 | OOM | 6.66 | 5.54 | - -As seen in the tables above, the speed-up from `tomesd` becomes more pronounced for larger image resolutions. It is also interesting to note that with `tomesd`, it is possible to run the pipeline on a higher resolution like 1024x1024. You may be able to speed-up inference even more with [`torch.compile`](torch2.0). diff --git a/docs/source/jp/optimization/torch2.0.md b/docs/source/jp/optimization/torch2.0.md deleted file mode 100644 index 1e07b876514f..000000000000 --- a/docs/source/jp/optimization/torch2.0.md +++ /dev/null @@ -1,434 +0,0 @@ - - -# Torch 2.0 - -🀗 Diffusers supports the latest optimizations from [PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/) which include: - -1. A memory-efficient attention implementation, scaled dot product attention, without requiring any extra dependencies such as xFormers. -2. [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html), a just-in-time (JIT) compiler to provide an extra performance boost when individual models are compiled. 
- -Both of these optimizations require PyTorch 2.0 or later and 🀗 Diffusers > 0.13.0. - -```bash -pip install --upgrade torch diffusers -``` - -## Scaled dot product attention - -[`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention) (SDPA) is an optimized and memory-efficient attention (similar to xFormers) that automatically enables several other optimizations depending on the model inputs and GPU type. SDPA is enabled by default if you're using PyTorch 2.0 and the latest version of 🀗 Diffusers, so you don't need to add anything to your code. - -However, if you want to explicitly enable it, you can set a [`DiffusionPipeline`] to use [`~models.attention_processor.AttnProcessor2_0`]: - -```diff - import torch - from diffusers import DiffusionPipeline -+ from diffusers.models.attention_processor import AttnProcessor2_0 - - pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda") -+ pipe.unet.set_attn_processor(AttnProcessor2_0()) - - prompt = "a photo of an astronaut riding a horse on mars" - image = pipe(prompt).images[0] -``` - -SDPA should be as fast and memory efficient as `xFormers`; check the [benchmark](#benchmark) for more details. - -In some cases - such as making the pipeline more deterministic or converting it to other formats - it may be helpful to use the vanilla attention processor, [`~models.attention_processor.AttnProcessor`]. To revert to [`~models.attention_processor.AttnProcessor`], call the [`~UNet2DConditionModel.set_default_attn_processor`] function on the pipeline: - -```diff - import torch - from diffusers import DiffusionPipeline - from diffusers.models.attention_processor import AttnProcessor - - pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda") -+ pipe.unet.set_default_attn_processor() - - prompt = "a photo of an astronaut riding a horse on mars" - image = pipe(prompt).images[0] -``` - -## torch.compile - -The `torch.compile` function can often provide an additional speed-up to your PyTorch code. In 🀗 Diffusers, it is usually best to wrap the UNet with `torch.compile` because it does most of the heavy lifting in the pipeline. - -```python -from diffusers import DiffusionPipeline -import torch - -pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda") -pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) -images = pipe(prompt, num_inference_steps=steps, num_images_per_prompt=batch_size).images[0] -``` - -Depending on GPU type, `torch.compile` can provide an *additional speed-up* of **5-300x** on top of SDPA! If you're using more recent GPU architectures such as Ampere (A100, 3090), Ada (4090), and Hopper (H100), `torch.compile` is able to squeeze even more performance out of these GPUs. - -Compilation requires some time to complete, so it is best suited for situations where you prepare your pipeline once and then perform the same type of inference operations multiple times. For example, calling the compiled pipeline on a different image size triggers compilation again which can be expensive. - -For more information and different options about `torch.compile`, refer to the [`torch_compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) tutorial. 
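In practice this means paying the compilation cost once and then reusing the pipeline with the same shapes. The sketch below spells out the variables (prompt, steps, image size) that the shorter snippet above leaves undefined; the model ID, prompt, and step count are only examples:

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

prompt = "a photo of an astronaut riding a horse on mars"

# The first call triggers compilation and is slow; subsequent calls with the
# same image size reuse the compiled graph and run at full speed.
_ = pipe(prompt, height=512, width=512, num_inference_steps=30).images

image = pipe(prompt, height=512, width=512, num_inference_steps=30).images[0]
```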
- -## Benchmark - -We conducted a comprehensive benchmark with PyTorch 2.0's efficient attention implementation and `torch.compile` across different GPUs and batch sizes for five of our most used pipelines. The code was benchmarked with 🀗 Diffusers v0.17.0.dev0, which ensures `torch.compile` is used optimally (see [here](https://github.com/huggingface/diffusers/pull/3313) for more details). - -Expand the dropdown below to find the code used to benchmark each pipeline: -
- -### Stable Diffusion text-to-image - -```python -from diffusers import DiffusionPipeline -import torch - -path = "runwayml/stable-diffusion-v1-5" - -run_compile = True # Set True / False - -pipe = DiffusionPipeline.from_pretrained(path, torch_dtype=torch.float16, use_safetensors=True) -pipe = pipe.to("cuda") -pipe.unet.to(memory_format=torch.channels_last) - -if run_compile: - print("Run torch compile") - pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) - -prompt = "ghibli style, a fantasy landscape with castles" - -for _ in range(3): - images = pipe(prompt=prompt).images -``` - -### Stable Diffusion image-to-image - -```python -from diffusers import StableDiffusionImg2ImgPipeline -import requests -import torch -from PIL import Image -from io import BytesIO - -url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" - -response = requests.get(url) -init_image = Image.open(BytesIO(response.content)).convert("RGB") -init_image = init_image.resize((512, 512)) - -path = "runwayml/stable-diffusion-v1-5" - -run_compile = True # Set True / False - -pipe = StableDiffusionImg2ImgPipeline.from_pretrained(path, torch_dtype=torch.float16, use_safetensors=True) -pipe = pipe.to("cuda") -pipe.unet.to(memory_format=torch.channels_last) - -if run_compile: - print("Run torch compile") - pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) - -prompt = "ghibli style, a fantasy landscape with castles" - -for _ in range(3): - image = pipe(prompt=prompt, image=init_image).images[0] -``` - -### Stable Diffusion inpainting - -```python -from diffusers import StableDiffusionInpaintPipeline -import requests -import torch -from PIL import Image -from io import BytesIO - -url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" - -def download_image(url): - response = requests.get(url) - return Image.open(BytesIO(response.content)).convert("RGB") - - -img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" -mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" - -init_image = download_image(img_url).resize((512, 512)) -mask_image = download_image(mask_url).resize((512, 512)) - -path = "runwayml/stable-diffusion-inpainting" - -run_compile = True # Set True / False - -pipe = StableDiffusionInpaintPipeline.from_pretrained(path, torch_dtype=torch.float16, use_safetensors=True) -pipe = pipe.to("cuda") -pipe.unet.to(memory_format=torch.channels_last) - -if run_compile: - print("Run torch compile") - pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) - -prompt = "ghibli style, a fantasy landscape with castles" - -for _ in range(3): - image = pipe(prompt=prompt, image=init_image, mask_image=mask_image).images[0] -``` - -### ControlNet - -```python -from diffusers import StableDiffusionControlNetPipeline, ControlNetModel -import requests -import torch -from PIL import Image -from io import BytesIO - -url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" - -response = requests.get(url) -init_image = Image.open(BytesIO(response.content)).convert("RGB") -init_image = init_image.resize((512, 512)) - -path = "runwayml/stable-diffusion-v1-5" - -run_compile = True # Set True 
/ False -controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16, use_safetensors=True) -pipe = StableDiffusionControlNetPipeline.from_pretrained( - path, controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True -) - -pipe = pipe.to("cuda") -pipe.unet.to(memory_format=torch.channels_last) -pipe.controlnet.to(memory_format=torch.channels_last) - -if run_compile: - print("Run torch compile") - pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) - pipe.controlnet = torch.compile(pipe.controlnet, mode="reduce-overhead", fullgraph=True) - -prompt = "ghibli style, a fantasy landscape with castles" - -for _ in range(3): - image = pipe(prompt=prompt, image=init_image).images[0] -``` - -### DeepFloyd IF text-to-image + upscaling - -```python -from diffusers import DiffusionPipeline -import torch - -run_compile = True # Set True / False - -pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16, use_safetensors=True) -pipe.to("cuda") -pipe_2 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-II-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16, use_safetensors=True) -pipe_2.to("cuda") -pipe_3 = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16, use_safetensors=True) -pipe_3.to("cuda") - - -pipe.unet.to(memory_format=torch.channels_last) -pipe_2.unet.to(memory_format=torch.channels_last) -pipe_3.unet.to(memory_format=torch.channels_last) - -if run_compile: - pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) - pipe_2.unet = torch.compile(pipe_2.unet, mode="reduce-overhead", fullgraph=True) - pipe_3.unet = torch.compile(pipe_3.unet, mode="reduce-overhead", fullgraph=True) - -prompt = "the blue hulk" - -prompt_embeds = torch.randn((1, 2, 4096), dtype=torch.float16) -neg_prompt_embeds = torch.randn((1, 2, 4096), dtype=torch.float16) - -for _ in range(3): - image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=neg_prompt_embeds, output_type="pt").images - image_2 = pipe_2(image=image, prompt_embeds=prompt_embeds, negative_prompt_embeds=neg_prompt_embeds, output_type="pt").images - image_3 = pipe_3(prompt=prompt, image=image, noise_level=100).images -``` -
- -The graph below highlights the relative speed-ups for the [`StableDiffusionPipeline`] across five GPU families with PyTorch 2.0 and `torch.compile` enabled. The benchmarks for the following graphs are measured in *number of iterations/second*. - -![t2i_speedup](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/pt2_benchmarks/t2i_speedup.png) - -To give you an even better idea of how this speed-up holds for the other pipelines, consider the following -graph for an A100 with PyTorch 2.0 and `torch.compile`: - -![a100_numbers](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/pt2_benchmarks/a100_numbers.png) - -In the following tables, we report our findings in terms of the *number of iterations/second*. - -### A100 (batch size: 1) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 21.66 | 23.13 | 44.03 | 49.74 | -| SD - img2img | 21.81 | 22.40 | 43.92 | 46.32 | -| SD - inpaint | 22.24 | 23.23 | 43.76 | 49.25 | -| SD - controlnet | 15.02 | 15.82 | 32.13 | 36.08 | -| IF | 20.21 /
13.84 /
24.00 | 20.12 /
13.70 /
24.03 | ❌ | 97.34 /
27.23 /
111.66 | -| SDXL - txt2img | 8.64 | 9.9 | - | - | - -### A100 (batch size: 4) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 11.6 | 13.12 | 14.62 | 17.27 | -| SD - img2img | 11.47 | 13.06 | 14.66 | 17.25 | -| SD - inpaint | 11.67 | 13.31 | 14.88 | 17.48 | -| SD - controlnet | 8.28 | 9.38 | 10.51 | 12.41 | -| IF | 25.02 | 18.04 | ❌ | 48.47 | -| SDXL - txt2img | 2.44 | 2.74 | - | - | - -### A100 (batch size: 16) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 3.04 | 3.6 | 3.83 | 4.68 | -| SD - img2img | 2.98 | 3.58 | 3.83 | 4.67 | -| SD - inpaint | 3.04 | 3.66 | 3.9 | 4.76 | -| SD - controlnet | 2.15 | 2.58 | 2.74 | 3.35 | -| IF | 8.78 | 9.82 | ❌ | 16.77 | -| SDXL - txt2img | 0.64 | 0.72 | - | - | - -### V100 (batch size: 1) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 18.99 | 19.14 | 20.95 | 22.17 | -| SD - img2img | 18.56 | 19.18 | 20.95 | 22.11 | -| SD - inpaint | 19.14 | 19.06 | 21.08 | 22.20 | -| SD - controlnet | 13.48 | 13.93 | 15.18 | 15.88 | -| IF | 20.01 /
9.08 /
23.34 | 19.79 /
8.98 /
24.10 | ❌ | 55.75 /
11.57 /
57.67 | - -### V100 (batch size: 4) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 5.96 | 5.89 | 6.83 | 6.86 | -| SD - img2img | 5.90 | 5.91 | 6.81 | 6.82 | -| SD - inpaint | 5.99 | 6.03 | 6.93 | 6.95 | -| SD - controlnet | 4.26 | 4.29 | 4.92 | 4.93 | -| IF | 15.41 | 14.76 | ❌ | 22.95 | - -### V100 (batch size: 16) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 1.66 | 1.66 | 1.92 | 1.90 | -| SD - img2img | 1.65 | 1.65 | 1.91 | 1.89 | -| SD - inpaint | 1.69 | 1.69 | 1.95 | 1.93 | -| SD - controlnet | 1.19 | 1.19 | OOM after warmup | 1.36 | -| IF | 5.43 | 5.29 | ❌ | 7.06 | - -### T4 (batch size: 1) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 6.9 | 6.95 | 7.3 | 7.56 | -| SD - img2img | 6.84 | 6.99 | 7.04 | 7.55 | -| SD - inpaint | 6.91 | 6.7 | 7.01 | 7.37 | -| SD - controlnet | 4.89 | 4.86 | 5.35 | 5.48 | -| IF | 17.42 /
2.47 /
18.52 | 16.96 /
2.45 /
18.69 | ❌ | 24.63 /
2.47 /
23.39 | -| SDXL - txt2img | 1.15 | 1.16 | - | - | - -### T4 (batch size: 4) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 1.79 | 1.79 | 2.03 | 1.99 | -| SD - img2img | 1.77 | 1.77 | 2.05 | 2.04 | -| SD - inpaint | 1.81 | 1.82 | 2.09 | 2.09 | -| SD - controlnet | 1.34 | 1.27 | 1.47 | 1.46 | -| IF | 5.79 | 5.61 | ❌ | 7.39 | -| SDXL - txt2img | 0.288 | 0.289 | - | - | - -### T4 (batch size: 16) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 2.34s | 2.30s | OOM after 2nd iteration | 1.99s | -| SD - img2img | 2.35s | 2.31s | OOM after warmup | 2.00s | -| SD - inpaint | 2.30s | 2.26s | OOM after 2nd iteration | 1.95s | -| SD - controlnet | OOM after 2nd iteration | OOM after 2nd iteration | OOM after warmup | OOM after warmup | -| IF * | 1.44 | 1.44 | ❌ | 1.94 | -| SDXL - txt2img | OOM | OOM | - | - | - -### RTX 3090 (batch size: 1) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 22.56 | 22.84 | 23.84 | 25.69 | -| SD - img2img | 22.25 | 22.61 | 24.1 | 25.83 | -| SD - inpaint | 22.22 | 22.54 | 24.26 | 26.02 | -| SD - controlnet | 16.03 | 16.33 | 17.38 | 18.56 | -| IF | 27.08 /
9.07 /
31.23 | 26.75 /
8.92 /
31.47 | ❌ | 68.08 /
11.16 /
65.29 | - -### RTX 3090 (batch size: 4) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 6.46 | 6.35 | 7.29 | 7.3 | -| SD - img2img | 6.33 | 6.27 | 7.31 | 7.26 | -| SD - inpaint | 6.47 | 6.4 | 7.44 | 7.39 | -| SD - controlnet | 4.59 | 4.54 | 5.27 | 5.26 | -| IF | 16.81 | 16.62 | ❌ | 21.57 | - -### RTX 3090 (batch size: 16) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 1.7 | 1.69 | 1.93 | 1.91 | -| SD - img2img | 1.68 | 1.67 | 1.93 | 1.9 | -| SD - inpaint | 1.72 | 1.71 | 1.97 | 1.94 | -| SD - controlnet | 1.23 | 1.22 | 1.4 | 1.38 | -| IF | 5.01 | 5.00 | ❌ | 6.33 | - -### RTX 4090 (batch size: 1) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 40.5 | 41.89 | 44.65 | 49.81 | -| SD - img2img | 40.39 | 41.95 | 44.46 | 49.8 | -| SD - inpaint | 40.51 | 41.88 | 44.58 | 49.72 | -| SD - controlnet | 29.27 | 30.29 | 32.26 | 36.03 | -| IF | 69.71 /
18.78 /
85.49 | 69.13 /
18.80 /
85.56 | ❌ | 124.60 /
26.37 /
138.79 | -| SDXL - txt2img | 6.8 | 8.18 | - | - | - -### RTX 4090 (batch size: 4) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 12.62 | 12.84 | 15.32 | 15.59 | -| SD - img2img | 12.61 | 12.79 | 15.35 | 15.66 | -| SD - inpaint | 12.65 | 12.81 | 15.3 | 15.58 | -| SD - controlnet | 9.1 | 9.25 | 11.03 | 11.22 | -| IF | 31.88 | 31.14 | ❌ | 43.92 | -| SDXL - txt2img | 2.19 | 2.35 | - | - | - -### RTX 4090 (batch size: 16) - -| **Pipeline** | **torch 2.0 -
no compile** | **torch nightly -
no compile** | **torch 2.0 -
compile** | **torch nightly -
compile** | -|:---:|:---:|:---:|:---:|:---:| -| SD - txt2img | 3.17 | 3.2 | 3.84 | 3.85 | -| SD - img2img | 3.16 | 3.2 | 3.84 | 3.85 | -| SD - inpaint | 3.17 | 3.2 | 3.85 | 3.85 | -| SD - controlnet | 2.23 | 2.3 | 2.7 | 2.75 | -| IF | 9.26 | 9.2 | ❌ | 13.31 | -| SDXL - txt2img | 0.52 | 0.53 | - | - | - -## Notes - -* Follow this [PR](https://github.com/huggingface/diffusers/pull/3313) for more details on the environment used for conducting the benchmarks. -* For the DeepFloyd IF pipeline where batch sizes > 1, we only used a batch size of > 1 in the first IF pipeline for text-to-image generation and NOT for upscaling. That means the two upscaling pipelines received a batch size of 1. - -*Thanks to [Horace He](https://github.com/Chillee) from the PyTorch team for their support in improving our support of `torch.compile()` in Diffusers.* diff --git a/docs/source/jp/quicktour.md b/docs/source/jp/quicktour.md index 3cf6851e4683..e8b0be9d8a1c 100644 --- a/docs/source/jp/quicktour.md +++ b/docs/source/jp/quicktour.md @@ -12,37 +12,37 @@ specific language governing permissions and limitations under the License. [[open-in-colab]] -# Quicktour +# 簡単な案内 -Diffusion models are trained to denoise random Gaussian noise step-by-step to generate a sample of interest, such as an image or audio. This has sparked a tremendous amount of interest in generative AI, and you have probably seen examples of diffusion generated images on the internet. 🧚 Diffusers is a library aimed at making diffusion models widely accessible to everyone. +拡散モデル(Diffusion Model)は、ランダムなガりスノむズを段階的にノむズ陀去するように孊習され、画像や音声などの目的のものを生成できたす。これは生成AIに倚倧な関心を呌び起こしたした。むンタヌネット䞊で拡散によっお生成された画像の䟋を芋たこずがあるでしょう。🧚 Diffusersは、誰もが拡散モデルに広くアクセスできるようにするこずを目的ずしたラむブラリです。 -Whether you're a developer or an everyday user, this quicktour will introduce you to 🧚 Diffusers and help you get up and generating quickly! There are three main components of the library to know about: +この案内では、開発者たたは日垞的なナヌザヌに関わらず、🧚 Diffusers を玹介し、玠早く目的のものを生成できるようにしたすこのラむブラリには3぀の䞻芁コンポヌネントがありたす: -* The [`DiffusionPipeline`] is a high-level end-to-end class designed to rapidly generate samples from pretrained diffusion models for inference. -* Popular pretrained [model](./api/models) architectures and modules that can be used as building blocks for creating diffusion systems. -* Many different [schedulers](./api/schedulers/overview) - algorithms that control how noise is added for training, and how to generate denoised images during inference. +* [`DiffusionPipeline`]は事前に孊習された拡散モデルからサンプルを迅速に生成するために蚭蚈された高レベルの゚ンドツヌ゚ンドクラス。 +* 拡散システムを䜜成するためのビルディングブロックずしお䜿甚できる、人気のある事前孊習された[モデル](./api/models)アヌキテクチャずモゞュヌル。 +* 倚くの異なる[スケゞュヌラ](./api/schedulers/overview) - ノむズがどのようにトレヌニングのために加えられるか、そしお生成䞭にどのようにノむズ陀去された画像を生成するかを制埡するアルゎリズム。 -The quicktour will show you how to use the [`DiffusionPipeline`] for inference, and then walk you through how to combine a model and scheduler to replicate what's happening inside the [`DiffusionPipeline`]. +この案内では、[`DiffusionPipeline`]を生成に䜿甚する方法を玹介し、モデルずスケゞュヌラを組み合わせお[`DiffusionPipeline`]の内郚で起こっおいるこずを再珟する方法を説明したす。 -The quicktour is a simplified version of the introductory 🧚 Diffusers [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/diffusers_intro.ipynb) to help you get started quickly. If you want to learn more about 🧚 Diffusers goal, design philosophy, and additional details about it's core API, check out the notebook! 
+この案内は🧚 Diffusers [ノヌトブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/diffusers_intro.ipynb)を簡略化したもので、すぐに䜿い始めるこずができたす。Diffusers 🧚のゎヌル、蚭蚈哲孊、コアAPIの詳现に぀いおもっず知りたい方は、ノヌトブックをご芧ください -Before you begin, make sure you have all the necessary libraries installed: +始める前に必芁なラむブラリヌがすべおむンストヌルされおいるこずを確認しおください ```py # uncomment to install the necessary libraries in Colab #!pip install --upgrade diffusers accelerate transformers ``` -- [🀗 Accelerate](https://huggingface.co/docs/accelerate/index) speeds up model loading for inference and training. -- [🀗 Transformers](https://huggingface.co/docs/transformers/index) is required to run the most popular diffusion models, such as [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview). +- [🀗 Accelerate](https://huggingface.co/docs/accelerate/index)生成ずトレヌニングのためのモデルのロヌドを高速化したす +- [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview)ような最も䞀般的な拡散モデルを実行するには、[🀗 Transformers](https://huggingface.co/docs/transformers/index)が必芁です。 -## DiffusionPipeline +# 拡散パむプラむン -The [`DiffusionPipeline`] is the easiest way to use a pretrained diffusion system for inference. It is an end-to-end system containing the model and the scheduler. You can use the [`DiffusionPipeline`] out-of-the-box for many tasks. Take a look at the table below for some supported tasks, and for a complete list of supported tasks, check out the [🧚 Diffusers Summary](./api/pipelines/overview#diffusers-summary) table. +[`DiffusionPipeline`]は事前孊習された拡散システムを生成に䜿甚する最も簡単な方法です。これはモデルずスケゞュヌラを含む゚ンドツヌ゚ンドのシステムです。[`DiffusionPipeline`]は倚くの䜜業タスクにすぐに䜿甚するこずができたす。たた、サポヌトされおいるタスクの完党なリストに぀いおは[🧚Diffusersの抂芁](./api/pipelines/overview#diffusers-summary)の衚を参照しおください。 | **Task** | **Description** | **Pipeline** |------------------------------|--------------------------------------------------------------------------------------------------------------|-----------------| diff --git a/docs/source/jp/training/create_dataset.md b/docs/source/jp/training/create_dataset.md deleted file mode 100644 index f215d3eb2c1b..000000000000 --- a/docs/source/jp/training/create_dataset.md +++ /dev/null @@ -1,90 +0,0 @@ -# Create a dataset for training - -There are many datasets on the [Hub](https://huggingface.co/datasets?task_categories=task_categories:text-to-image&sort=downloads) to train a model on, but if you can't find one you're interested in or want to use your own, you can create a dataset with the 🀗 [Datasets](hf.co/docs/datasets) library. The dataset structure depends on the task you want to train your model on. The most basic dataset structure is a directory of images for tasks like unconditional image generation. Another dataset structure may be a directory of images and a text file containing their corresponding text captions for tasks like text-to-image generation. - -This guide will show you two ways to create a dataset to finetune on: - -- provide a folder of images to the `--train_data_dir` argument -- upload a dataset to the Hub and pass the dataset repository id to the `--dataset_name` argument - - - -💡 Learn more about how to create an image dataset for training in the [Create an image dataset](https://huggingface.co/docs/datasets/image_dataset) guide. - - - -## Provide a dataset as a folder - -For unconditional generation, you can provide your own dataset as a folder of images. 
The training script uses the [`ImageFolder`](https://huggingface.co/docs/datasets/en/image_dataset#imagefolder) builder from 🀗 Datasets to automatically build a dataset from the folder. Your directory structure should look like: - -```bash -data_dir/xxx.png -data_dir/xxy.png -data_dir/[...]/xxz.png -``` - -Pass the path to the dataset directory to the `--train_data_dir` argument, and then you can start training: - -```bash -accelerate launch train_unconditional.py \ - --train_data_dir \ - -``` - -## Upload your data to the Hub - - - -💡 For more details and context about creating and uploading a dataset to the Hub, take a look at the [Image search with 🀗 Datasets](https://huggingface.co/blog/image-search-datasets) post. - - - -Start by creating a dataset with the [`ImageFolder`](https://huggingface.co/docs/datasets/image_load#imagefolder) feature, which creates an `image` column containing the PIL-encoded images. - -You can use the `data_dir` or `data_files` parameters to specify the location of the dataset. The `data_files` parameter supports mapping specific files to dataset splits like `train` or `test`: - -```python -from datasets import load_dataset - -# example 1: local folder -dataset = load_dataset("imagefolder", data_dir="path_to_your_folder") - -# example 2: local files (supported formats are tar, gzip, zip, xz, rar, zstd) -dataset = load_dataset("imagefolder", data_files="path_to_zip_file") - -# example 3: remote files (supported formats are tar, gzip, zip, xz, rar, zstd) -dataset = load_dataset( - "imagefolder", - data_files="https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip", -) - -# example 4: providing several splits -dataset = load_dataset( - "imagefolder", data_files={"train": ["path/to/file1", "path/to/file2"], "test": ["path/to/file3", "path/to/file4"]} -) -``` - -Then use the [`~datasets.Dataset.push_to_hub`] method to upload the dataset to the Hub: - -```python -# assuming you have ran the huggingface-cli login command in a terminal -dataset.push_to_hub("name_of_your_dataset") - -# if you want to push to a private repo, simply pass private=True: -dataset.push_to_hub("name_of_your_dataset", private=True) -``` - -Now the dataset is available for training by passing the dataset name to the `--dataset_name` argument: - -```bash -accelerate launch --mixed_precision="fp16" train_text_to_image.py \ - --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \ - --dataset_name="name_of_your_dataset" \ - -``` - -## Next steps - -Now that you've created a dataset, you can plug it into the `train_data_dir` (if your dataset is local) or `dataset_name` (if your dataset is on the Hub) arguments of a training script. - -For your next steps, feel free to try and use your dataset to train a model for [unconditional generation](unconditional_training) or [text-to-image generation](text2image)! \ No newline at end of file diff --git a/docs/source/jp/training/dreambooth.md b/docs/source/jp/training/dreambooth.md deleted file mode 100644 index 30a20a971966..000000000000 --- a/docs/source/jp/training/dreambooth.md +++ /dev/null @@ -1,710 +0,0 @@ - - -# DreamBooth - -[DreamBooth](https://arxiv.org/abs/2208.12242) is a method to personalize text-to-image models like Stable Diffusion given just a few (3-5) images of a subject. It allows the model to generate contextualized images of the subject in different scenes, poses, and views. 
- -![Dreambooth examples from the project's blog](https://dreambooth.github.io/DreamBooth_files/teaser_static.jpg) -Dreambooth examples from the project's blog. - -This guide will show you how to finetune DreamBooth with the [`CompVis/stable-diffusion-v1-4`](https://huggingface.co/CompVis/stable-diffusion-v1-4) model for various GPU sizes, and with Flax. All the training scripts for DreamBooth used in this guide can be found [here](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth) if you're interested in digging deeper and seeing how things work. - -Before running the scripts, make sure you install the library's training dependencies. We also recommend installing 🧚 Diffusers from the `main` GitHub branch: - -```bash -pip install git+https://github.com/huggingface/diffusers -pip install -U -r diffusers/examples/dreambooth/requirements.txt -``` - -xFormers is not part of the training requirements, but we recommend you [install](../optimization/xformers) it if you can because it could make your training faster and less memory intensive. - -After all the dependencies have been set up, initialize a [🀗 Accelerate](https://github.com/huggingface/accelerate/) environment with: - -```bash -accelerate config -``` - -To setup a default 🀗 Accelerate environment without choosing any configurations: - -```bash -accelerate config default -``` - -Or if your environment doesn't support an interactive shell like a notebook, you can use: - -```py -from accelerate.utils import write_basic_config - -write_basic_config() -``` - -Finally, download a [few images of a dog](https://huggingface.co/datasets/diffusers/dog-example) to DreamBooth with: - -```py -from huggingface_hub import snapshot_download - -local_dir = "./dog" -snapshot_download( - "diffusers/dog-example", - local_dir=local_dir, - repo_type="dataset", - ignore_patterns=".gitattributes", -) -``` - -To use your own dataset, take a look at the [Create a dataset for training](create_dataset) guide. - -## Finetuning - - - -DreamBooth finetuning is very sensitive to hyperparameters and easy to overfit. We recommend you take a look at our [in-depth analysis](https://huggingface.co/blog/dreambooth) with recommended settings for different subjects to help you choose the appropriate hyperparameters. - - - - - -Set the `INSTANCE_DIR` environment variable to the path of the directory containing the dog images. - -Specify the `MODEL_NAME` environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`] argument. The `instance_prompt` argument is a text prompt that contains a unique identifier, such as `sks`, and the class the image belongs to, which in this example is `a photo of a sks dog`. 
- -```bash -export MODEL_NAME="CompVis/stable-diffusion-v1-4" -export INSTANCE_DIR="./dog" -export OUTPUT_DIR="path_to_saved_model" -``` - -Then you can launch the training script (you can find the full training script [here](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth.py)) with the following command: - -```bash -accelerate launch train_dreambooth.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --instance_data_dir=$INSTANCE_DIR \ - --output_dir=$OUTPUT_DIR \ - --instance_prompt="a photo of sks dog" \ - --resolution=512 \ - --train_batch_size=1 \ - --gradient_accumulation_steps=1 \ - --learning_rate=5e-6 \ - --lr_scheduler="constant" \ - --lr_warmup_steps=0 \ - --max_train_steps=400 \ - --push_to_hub -``` - - -If you have access to TPUs or want to train even faster, you can try out the [Flax training script](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_flax.py). The Flax training script doesn't support gradient checkpointing or gradient accumulation, so you'll need a GPU with at least 30GB of memory. - -Before running the script, make sure you have the requirements installed: - -```bash -pip install -U -r requirements.txt -``` - -Specify the `MODEL_NAME` environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`] argument. The `instance_prompt` argument is a text prompt that contains a unique identifier, such as `sks`, and the class the image belongs to, which in this example is `a photo of a sks dog`. - -Now you can launch the training script with the following command: - -```bash -export MODEL_NAME="duongna/stable-diffusion-v1-4-flax" -export INSTANCE_DIR="./dog" -export OUTPUT_DIR="path-to-save-model" - -python train_dreambooth_flax.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --instance_data_dir=$INSTANCE_DIR \ - --output_dir=$OUTPUT_DIR \ - --instance_prompt="a photo of sks dog" \ - --resolution=512 \ - --train_batch_size=1 \ - --learning_rate=5e-6 \ - --max_train_steps=400 \ - --push_to_hub -``` - - - -## Finetuning with prior-preserving loss - -Prior preservation is used to avoid overfitting and language-drift (check out the [paper](https://arxiv.org/abs/2208.12242) to learn more if you're interested). For prior preservation, you use other images of the same class as part of the training process. The nice thing is that you can generate those images using the Stable Diffusion model itself! The training script will save the generated images to a local path you specify. - -The authors recommend generating `num_epochs * num_samples` images for prior preservation. In most cases, 200-300 images work well. 
- - - -```bash -export MODEL_NAME="CompVis/stable-diffusion-v1-4" -export INSTANCE_DIR="./dog" -export CLASS_DIR="path_to_class_images" -export OUTPUT_DIR="path_to_saved_model" - -accelerate launch train_dreambooth.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --instance_data_dir=$INSTANCE_DIR \ - --class_data_dir=$CLASS_DIR \ - --output_dir=$OUTPUT_DIR \ - --with_prior_preservation --prior_loss_weight=1.0 \ - --instance_prompt="a photo of sks dog" \ - --class_prompt="a photo of dog" \ - --resolution=512 \ - --train_batch_size=1 \ - --gradient_accumulation_steps=1 \ - --learning_rate=5e-6 \ - --lr_scheduler="constant" \ - --lr_warmup_steps=0 \ - --num_class_images=200 \ - --max_train_steps=800 \ - --push_to_hub -``` - - -```bash -export MODEL_NAME="duongna/stable-diffusion-v1-4-flax" -export INSTANCE_DIR="./dog" -export CLASS_DIR="path-to-class-images" -export OUTPUT_DIR="path-to-save-model" - -python train_dreambooth_flax.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --instance_data_dir=$INSTANCE_DIR \ - --class_data_dir=$CLASS_DIR \ - --output_dir=$OUTPUT_DIR \ - --with_prior_preservation --prior_loss_weight=1.0 \ - --instance_prompt="a photo of sks dog" \ - --class_prompt="a photo of dog" \ - --resolution=512 \ - --train_batch_size=1 \ - --learning_rate=5e-6 \ - --num_class_images=200 \ - --max_train_steps=800 \ - --push_to_hub -``` - - - -## Finetuning the text encoder and UNet - -The script also allows you to finetune the `text_encoder` along with the `unet`. In our experiments (check out the [Training Stable Diffusion with DreamBooth using 🧚 Diffusers](https://huggingface.co/blog/dreambooth) post for more details), this yields much better results, especially when generating images of faces. - - - -Training the text encoder requires additional memory and it won't fit on a 16GB GPU. You'll need at least 24GB VRAM to use this option. 
- - - -Pass the `--train_text_encoder` argument to the training script to enable finetuning the `text_encoder` and `unet`: - - - -```bash -export MODEL_NAME="CompVis/stable-diffusion-v1-4" -export INSTANCE_DIR="./dog" -export CLASS_DIR="path_to_class_images" -export OUTPUT_DIR="path_to_saved_model" - -accelerate launch train_dreambooth.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --train_text_encoder \ - --instance_data_dir=$INSTANCE_DIR \ - --class_data_dir=$CLASS_DIR \ - --output_dir=$OUTPUT_DIR \ - --with_prior_preservation --prior_loss_weight=1.0 \ - --instance_prompt="a photo of sks dog" \ - --class_prompt="a photo of dog" \ - --resolution=512 \ - --train_batch_size=1 \ - --use_8bit_adam \ - --gradient_checkpointing \ - --learning_rate=2e-6 \ - --lr_scheduler="constant" \ - --lr_warmup_steps=0 \ - --num_class_images=200 \ - --max_train_steps=800 \ - --push_to_hub -``` - - -```bash -export MODEL_NAME="duongna/stable-diffusion-v1-4-flax" -export INSTANCE_DIR="./dog" -export CLASS_DIR="path-to-class-images" -export OUTPUT_DIR="path-to-save-model" - -python train_dreambooth_flax.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --train_text_encoder \ - --instance_data_dir=$INSTANCE_DIR \ - --class_data_dir=$CLASS_DIR \ - --output_dir=$OUTPUT_DIR \ - --with_prior_preservation --prior_loss_weight=1.0 \ - --instance_prompt="a photo of sks dog" \ - --class_prompt="a photo of dog" \ - --resolution=512 \ - --train_batch_size=1 \ - --learning_rate=2e-6 \ - --num_class_images=200 \ - --max_train_steps=800 \ - --push_to_hub -``` - - - -## Finetuning with LoRA - -You can also use Low-Rank Adaptation of Large Language Models (LoRA), a fine-tuning technique for accelerating training large models, on DreamBooth. For more details, take a look at the [LoRA training](./lora#dreambooth) guide. - -## Saving checkpoints while training - -It's easy to overfit while training with Dreambooth, so sometimes it's useful to save regular checkpoints during the training process. One of the intermediate checkpoints might actually work better than the final model! Pass the following argument to the training script to enable saving checkpoints: - -```bash - --checkpointing_steps=500 -``` - -This saves the full training state in subfolders of your `output_dir`. Subfolder names begin with the prefix `checkpoint-`, followed by the number of steps performed so far; for example, `checkpoint-1500` would be a checkpoint saved after 1500 training steps. - -### Resume training from a saved checkpoint - -If you want to resume training from any of the saved checkpoints, you can pass the argument `--resume_from_checkpoint` to the script and specify the name of the checkpoint you want to use. You can also use the special string `"latest"` to resume from the last saved checkpoint (the one with the largest number of steps). For example, the following would resume training from the checkpoint saved after 1500 steps: - -```bash - --resume_from_checkpoint="checkpoint-1500" -``` - -This is a good opportunity to tweak some of your hyperparameters if you wish. - -### Inference from a saved checkpoint - -Saved checkpoints are stored in a format suitable for resuming training. They not only include the model weights, but also the state of the optimizer, data loaders, and learning rate. - -If you have **`"accelerate>=0.16.0"`** installed, use the following code to run -inference from an intermediate checkpoint. 
- -```python -from diffusers import DiffusionPipeline, UNet2DConditionModel -from transformers import CLIPTextModel -import torch - -# Load the pipeline with the same arguments (model, revision) that were used for training -model_id = "CompVis/stable-diffusion-v1-4" - -unet = UNet2DConditionModel.from_pretrained("/sddata/dreambooth/daruma-v2-1/checkpoint-100/unet") - -# if you have trained with `--args.train_text_encoder` make sure to also load the text encoder -text_encoder = CLIPTextModel.from_pretrained("/sddata/dreambooth/daruma-v2-1/checkpoint-100/text_encoder") - -pipeline = DiffusionPipeline.from_pretrained( - model_id, unet=unet, text_encoder=text_encoder, dtype=torch.float16, use_safetensors=True -) -pipeline.to("cuda") - -# Perform inference, or save, or push to the hub -pipeline.save_pretrained("dreambooth-pipeline") -``` - -If you have **`"accelerate<0.16.0"`** installed, you need to convert it to an inference pipeline first: - -```python -from accelerate import Accelerator -from diffusers import DiffusionPipeline - -# Load the pipeline with the same arguments (model, revision) that were used for training -model_id = "CompVis/stable-diffusion-v1-4" -pipeline = DiffusionPipeline.from_pretrained(model_id, use_safetensors=True) - -accelerator = Accelerator() - -# Use text_encoder if `--train_text_encoder` was used for the initial training -unet, text_encoder = accelerator.prepare(pipeline.unet, pipeline.text_encoder) - -# Restore state from a checkpoint path. You have to use the absolute path here. -accelerator.load_state("/sddata/dreambooth/daruma-v2-1/checkpoint-100") - -# Rebuild the pipeline with the unwrapped models (assignment to .unet and .text_encoder should work too) -pipeline = DiffusionPipeline.from_pretrained( - model_id, - unet=accelerator.unwrap_model(unet), - text_encoder=accelerator.unwrap_model(text_encoder), - use_safetensors=True, -) - -# Perform inference, or save, or push to the hub -pipeline.save_pretrained("dreambooth-pipeline") -``` - -## Optimizations for different GPU sizes - -Depending on your hardware, there are a few different ways to optimize DreamBooth on GPUs from 16GB to just 8GB! - -### xFormers - -[xFormers](https://github.com/facebookresearch/xformers) is a toolbox for optimizing Transformers, and it includes a [memory-efficient attention](https://facebookresearch.github.io/xformers/components/ops.html#module-xformers.ops) mechanism that is used in 🧚 Diffusers. You'll need to [install xFormers](./optimization/xformers) and then add the following argument to your training script: - -```bash - --enable_xformers_memory_efficient_attention -``` - -xFormers is not available in Flax. - -### Set gradients to none - -Another way you can lower your memory footprint is to [set the gradients](https://pytorch.org/docs/stable/generated/torch.optim.Optimizer.zero_grad.html) to `None` instead of zero. However, this may change certain behaviors, so if you run into any issues, try removing this argument. Add the following argument to your training script to set the gradients to `None`: - -```bash - --set_grads_to_none -``` - -### 16GB GPU - -With the help of gradient checkpointing and [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) 8-bit optimizer, it's possible to train DreamBooth on a 16GB GPU. 
Make sure you have bitsandbytes installed: - -```bash -pip install bitsandbytes -``` - -Then pass the `--use_8bit_adam` option to the training script: - -```bash -export MODEL_NAME="CompVis/stable-diffusion-v1-4" -export INSTANCE_DIR="./dog" -export CLASS_DIR="path_to_class_images" -export OUTPUT_DIR="path_to_saved_model" - -accelerate launch train_dreambooth.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --instance_data_dir=$INSTANCE_DIR \ - --class_data_dir=$CLASS_DIR \ - --output_dir=$OUTPUT_DIR \ - --with_prior_preservation --prior_loss_weight=1.0 \ - --instance_prompt="a photo of sks dog" \ - --class_prompt="a photo of dog" \ - --resolution=512 \ - --train_batch_size=1 \ - --gradient_accumulation_steps=2 --gradient_checkpointing \ - --use_8bit_adam \ - --learning_rate=5e-6 \ - --lr_scheduler="constant" \ - --lr_warmup_steps=0 \ - --num_class_images=200 \ - --max_train_steps=800 \ - --push_to_hub -``` - -### 12GB GPU - -To run DreamBooth on a 12GB GPU, you'll need to enable gradient checkpointing, the 8-bit optimizer, xFormers, and set the gradients to `None`: - -```bash -export MODEL_NAME="CompVis/stable-diffusion-v1-4" -export INSTANCE_DIR="./dog" -export CLASS_DIR="path-to-class-images" -export OUTPUT_DIR="path-to-save-model" - -accelerate launch train_dreambooth.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --instance_data_dir=$INSTANCE_DIR \ - --class_data_dir=$CLASS_DIR \ - --output_dir=$OUTPUT_DIR \ - --with_prior_preservation --prior_loss_weight=1.0 \ - --instance_prompt="a photo of sks dog" \ - --class_prompt="a photo of dog" \ - --resolution=512 \ - --train_batch_size=1 \ - --gradient_accumulation_steps=1 --gradient_checkpointing \ - --use_8bit_adam \ - --enable_xformers_memory_efficient_attention \ - --set_grads_to_none \ - --learning_rate=2e-6 \ - --lr_scheduler="constant" \ - --lr_warmup_steps=0 \ - --num_class_images=200 \ - --max_train_steps=800 \ - --push_to_hub -``` - -### 8 GB GPU - -For 8GB GPUs, you'll need the help of [DeepSpeed](https://www.deepspeed.ai/) to offload some -tensors from the VRAM to either the CPU or NVME, enabling training with less GPU memory. - -Run the following command to configure your 🀗 Accelerate environment: - -```bash -accelerate config -``` - -During configuration, confirm that you want to use DeepSpeed. Now it's possible to train on under 8GB VRAM by combining DeepSpeed stage 2, fp16 mixed precision, and offloading the model parameters and the optimizer state to the CPU. The drawback is that this requires more system RAM, about 25 GB. See [the DeepSpeed documentation](https://huggingface.co/docs/accelerate/usage_guides/deepspeed) for more configuration options. - -You should also change the default Adam optimizer to DeepSpeed's optimized version of Adam -[`deepspeed.ops.adam.DeepSpeedCPUAdam`](https://deepspeed.readthedocs.io/en/latest/optimizers.html#adam-cpu) for a substantial speedup. Enabling `DeepSpeedCPUAdam` requires your system's CUDA toolchain version to be the same as the one installed with PyTorch. - -8-bit optimizers don't seem to be compatible with DeepSpeed at the moment. 
- -Launch training with the following command: - -```bash -export MODEL_NAME="CompVis/stable-diffusion-v1-4" -export INSTANCE_DIR="./dog" -export CLASS_DIR="path_to_class_images" -export OUTPUT_DIR="path_to_saved_model" - -accelerate launch train_dreambooth.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --instance_data_dir=$INSTANCE_DIR \ - --class_data_dir=$CLASS_DIR \ - --output_dir=$OUTPUT_DIR \ - --with_prior_preservation --prior_loss_weight=1.0 \ - --instance_prompt="a photo of sks dog" \ - --class_prompt="a photo of dog" \ - --resolution=512 \ - --train_batch_size=1 \ - --sample_batch_size=1 \ - --gradient_accumulation_steps=1 --gradient_checkpointing \ - --learning_rate=5e-6 \ - --lr_scheduler="constant" \ - --lr_warmup_steps=0 \ - --num_class_images=200 \ - --max_train_steps=800 \ - --mixed_precision=fp16 \ - --push_to_hub -``` - -## Inference - -Once you have trained a model, specify the path to where the model is saved, and use it for inference in the [`StableDiffusionPipeline`]. Make sure your prompts include the special `identifier` used during training (`sks` in the previous examples). - -If you have **`"accelerate>=0.16.0"`** installed, you can use the following code to run -inference from an intermediate checkpoint: - -```python -from diffusers import DiffusionPipeline -import torch - -model_id = "path_to_saved_model" -pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, use_safetensors=True).to("cuda") - -prompt = "A photo of sks dog in a bucket" -image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0] - -image.save("dog-bucket.png") -``` - -You may also run inference from any of the [saved training checkpoints](#inference-from-a-saved-checkpoint). - -## IF - -You can use the lora and full dreambooth scripts to train the text to image [IF model](https://huggingface.co/DeepFloyd/IF-I-XL-v1.0) and the stage II upscaler -[IF model](https://huggingface.co/DeepFloyd/IF-II-L-v1.0). - -Note that IF has a predicted variance, and our finetuning scripts only train the models predicted error, so for finetuned IF models we switch to a fixed -variance schedule. The full finetuning scripts will update the scheduler config for the full saved model. However, when loading saved LoRA weights, you -must also update the pipeline's scheduler config. - -```py -from diffusers import DiffusionPipeline - -pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", use_safetensors=True) - -pipe.load_lora_weights("") - -# Update scheduler config to fixed variance schedule -pipe.scheduler = pipe.scheduler.__class__.from_config(pipe.scheduler.config, variance_type="fixed_small") -``` - -Additionally, a few alternative cli flags are needed for IF. - -`--resolution=64`: IF is a pixel space diffusion model. In order to operate on un-compressed pixels, the input images are of a much smaller resolution. - -`--pre_compute_text_embeddings`: IF uses [T5](https://huggingface.co/docs/transformers/model_doc/t5) for its text encoder. In order to save GPU memory, we pre compute all text embeddings and then de-allocate -T5. - -`--tokenizer_max_length=77`: T5 has a longer default text length, but the default IF encoding procedure uses a smaller number. - -`--text_encoder_use_attention_mask`: T5 passes the attention mask to the text encoder. - -### Tips and Tricks -We find LoRA to be sufficient for finetuning the stage I model as the low resolution of the model makes representing finegrained detail hard regardless. 
- -For common and/or not-visually complex object concepts, you can get away with not-finetuning the upscaler. Just be sure to adjust the prompt passed to the -upscaler to remove the new token from the instance prompt. I.e. if your stage I prompt is "a sks dog", use "a dog" for your stage II prompt. - -For finegrained detail like faces that aren't present in the original training set, we find that full finetuning of the stage II upscaler is better than -LoRA finetuning stage II. - -For finegrained detail like faces, we find that lower learning rates along with larger batch sizes work best. - -For stage II, we find that lower learning rates are also needed. - -We found experimentally that the DDPM scheduler with the default larger number of denoising steps to sometimes work better than the DPM Solver scheduler -used in the training scripts. - -### Stage II additional validation images - -The stage II validation requires images to upscale, we can download a downsized version of the training set: - -```py -from huggingface_hub import snapshot_download - -local_dir = "./dog_downsized" -snapshot_download( - "diffusers/dog-example-downsized", - local_dir=local_dir, - repo_type="dataset", - ignore_patterns=".gitattributes", -) -``` - -### IF stage I LoRA Dreambooth -This training configuration requires ~28 GB VRAM. - -```sh -export MODEL_NAME="DeepFloyd/IF-I-XL-v1.0" -export INSTANCE_DIR="dog" -export OUTPUT_DIR="dreambooth_dog_lora" - -accelerate launch train_dreambooth_lora.py \ - --report_to wandb \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --instance_data_dir=$INSTANCE_DIR \ - --output_dir=$OUTPUT_DIR \ - --instance_prompt="a sks dog" \ - --resolution=64 \ - --train_batch_size=4 \ - --gradient_accumulation_steps=1 \ - --learning_rate=5e-6 \ - --scale_lr \ - --max_train_steps=1200 \ - --validation_prompt="a sks dog" \ - --validation_epochs=25 \ - --checkpointing_steps=100 \ - --pre_compute_text_embeddings \ - --tokenizer_max_length=77 \ - --text_encoder_use_attention_mask -``` - -### IF stage II LoRA Dreambooth - -`--validation_images`: These images are upscaled during validation steps. - -`--class_labels_conditioning=timesteps`: Pass additional conditioning to the UNet needed for stage II. - -`--learning_rate=1e-6`: Lower learning rate than stage I. - -`--resolution=256`: The upscaler expects higher resolution inputs - -```sh -export MODEL_NAME="DeepFloyd/IF-II-L-v1.0" -export INSTANCE_DIR="dog" -export OUTPUT_DIR="dreambooth_dog_upscale" -export VALIDATION_IMAGES="dog_downsized/image_1.png dog_downsized/image_2.png dog_downsized/image_3.png dog_downsized/image_4.png" - -python train_dreambooth_lora.py \ - --report_to wandb \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --instance_data_dir=$INSTANCE_DIR \ - --output_dir=$OUTPUT_DIR \ - --instance_prompt="a sks dog" \ - --resolution=256 \ - --train_batch_size=4 \ - --gradient_accumulation_steps=1 \ - --learning_rate=1e-6 \ - --max_train_steps=2000 \ - --validation_prompt="a sks dog" \ - --validation_epochs=100 \ - --checkpointing_steps=500 \ - --pre_compute_text_embeddings \ - --tokenizer_max_length=77 \ - --text_encoder_use_attention_mask \ - --validation_images $VALIDATION_IMAGES \ - --class_labels_conditioning=timesteps -``` - -### IF Stage I Full Dreambooth -`--skip_save_text_encoder`: When training the full model, this will skip saving the entire T5 with the finetuned model. You can still load the pipeline -with a T5 loaded from the original model. 
- -`use_8bit_adam`: Due to the size of the optimizer states, we recommend training the full XL IF model with 8bit adam. - -`--learning_rate=1e-7`: For full dreambooth, IF requires very low learning rates. With higher learning rates model quality will degrade. Note that it is -likely the learning rate can be increased with larger batch sizes. - -Using 8bit adam and a batch size of 4, the model can be trained in ~48 GB VRAM. - -```sh -export MODEL_NAME="DeepFloyd/IF-I-XL-v1.0" - -export INSTANCE_DIR="dog" -export OUTPUT_DIR="dreambooth_if" - -accelerate launch train_dreambooth.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --instance_data_dir=$INSTANCE_DIR \ - --output_dir=$OUTPUT_DIR \ - --instance_prompt="a photo of sks dog" \ - --resolution=64 \ - --train_batch_size=4 \ - --gradient_accumulation_steps=1 \ - --learning_rate=1e-7 \ - --max_train_steps=150 \ - --validation_prompt "a photo of sks dog" \ - --validation_steps 25 \ - --text_encoder_use_attention_mask \ - --tokenizer_max_length 77 \ - --pre_compute_text_embeddings \ - --use_8bit_adam \ - --set_grads_to_none \ - --skip_save_text_encoder \ - --push_to_hub -``` - -### IF Stage II Full Dreambooth - -`--learning_rate=5e-6`: With a smaller effective batch size of 4, we found that we required learning rates as low as -1e-8. - -`--resolution=256`: The upscaler expects higher resolution inputs - -`--train_batch_size=2` and `--gradient_accumulation_steps=6`: We found that full training of stage II particularly with -faces required large effective batch sizes. - -```sh -export MODEL_NAME="DeepFloyd/IF-II-L-v1.0" -export INSTANCE_DIR="dog" -export OUTPUT_DIR="dreambooth_dog_upscale" -export VALIDATION_IMAGES="dog_downsized/image_1.png dog_downsized/image_2.png dog_downsized/image_3.png dog_downsized/image_4.png" - -accelerate launch train_dreambooth.py \ - --report_to wandb \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --instance_data_dir=$INSTANCE_DIR \ - --output_dir=$OUTPUT_DIR \ - --instance_prompt="a sks dog" \ - --resolution=256 \ - --train_batch_size=2 \ - --gradient_accumulation_steps=6 \ - --learning_rate=5e-6 \ - --max_train_steps=2000 \ - --validation_prompt="a sks dog" \ - --validation_steps=150 \ - --checkpointing_steps=500 \ - --pre_compute_text_embeddings \ - --tokenizer_max_length=77 \ - --text_encoder_use_attention_mask \ - --validation_images $VALIDATION_IMAGES \ - --class_labels_conditioning timesteps \ - --push_to_hub -``` - -## Stable Diffusion XL - -We support fine-tuning of the UNet and text encoders shipped in [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) with DreamBooth and LoRA via the `train_dreambooth_lora_sdxl.py` script. Please refer to the docs [here](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/README_sdxl.md). \ No newline at end of file diff --git a/docs/source/jp/training/overview.md b/docs/source/jp/training/overview.md deleted file mode 100644 index c6fe339eda73..000000000000 --- a/docs/source/jp/training/overview.md +++ /dev/null @@ -1,84 +0,0 @@ - - -# 🧚 Diffusers Training Examples - -Diffusers training examples are a collection of scripts to demonstrate how to effectively use the `diffusers` library -for a variety of use cases. 
- -**Note**: If you are looking for **official** examples on how to use `diffusers` for inference, -please have a look at [src/diffusers/pipelines](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines) - -Our examples aspire to be **self-contained**, **easy-to-tweak**, **beginner-friendly** and for **one-purpose-only**. -More specifically, this means: - -- **Self-contained**: An example script shall only depend on "pip-install-able" Python packages that can be found in a `requirements.txt` file. Example scripts shall **not** depend on any local files. This means that one can simply download an example script, *e.g.* [train_unconditional.py](https://github.com/huggingface/diffusers/blob/main/examples/unconditional_image_generation/train_unconditional.py), install the required dependencies, *e.g.* [requirements.txt](https://github.com/huggingface/diffusers/blob/main/examples/unconditional_image_generation/requirements.txt) and execute the example script. -- **Easy-to-tweak**: While we strive to present as many use cases as possible, the example scripts are just that - examples. It is expected that they won't work out-of-the box on your specific problem and that you will be required to change a few lines of code to adapt them to your needs. To help you with that, most of the examples fully expose the preprocessing of the data and the training loop to allow you to tweak and edit them as required. -- **Beginner-friendly**: We do not aim for providing state-of-the-art training scripts for the newest models, but rather examples that can be used as a way to better understand diffusion models and how to use them with the `diffusers` library. We often purposefully leave out certain state-of-the-art methods if we consider them too complex for beginners. -- **One-purpose-only**: Examples should show one task and one task only. Even if a task is from a modeling -point of view very similar, *e.g.* image super-resolution and image modification tend to use the same model and training method, we want examples to showcase only one task to keep them as readable and easy-to-understand as possible. - -We provide **official** examples that cover the most popular tasks of diffusion models. -*Official* examples are **actively** maintained by the `diffusers` maintainers and we try to rigorously follow our example philosophy as defined above. -If you feel like another important example should exist, we are more than happy to welcome a [Feature Request](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feature_request.md&title=) or directly a [Pull Request](https://github.com/huggingface/diffusers/compare) from you! - -Training examples show how to pretrain or fine-tune diffusion models for a variety of tasks. Currently we support: - -- [Unconditional Training](./unconditional_training) -- [Text-to-Image Training](./text2image)* -- [Text Inversion](./text_inversion) -- [Dreambooth](./dreambooth)* -- [LoRA Support](./lora)* -- [ControlNet](./controlnet)* -- [InstructPix2Pix](./instructpix2pix)* -- [Custom Diffusion](./custom_diffusion) -- [T2I-Adapters](./t2i_adapters)* - -*: Supports [Stable Diffusion XL](../api/pipelines/stable_diffusion/stable_diffusion_xl). - -If possible, please [install xFormers](../optimization/xformers) for memory efficient attention. This could help make your training faster and less memory intensive. 
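-
-As a rough sketch (assuming a recent PyTorch setup; pick the xFormers build that matches your PyTorch and CUDA versions), installation is usually a one-liner:
-
-```bash
-# choose the xformers build that matches your PyTorch/CUDA setup
-pip install xformers
-```
-
-Most of the example scripts then expose an `--enable_xformers_memory_efficient_attention` flag to turn it on.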
- -| Task | 🀗 Accelerate | 🀗 Datasets | Colab -|---|---|:---:|:---:| -| [**Unconditional Image Generation**](./unconditional_training) | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/training_example.ipynb) -| [**Text-to-Image fine-tuning**](./text2image) | ✅ | ✅ | -| [**Textual Inversion**](./text_inversion) | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/sd_textual_inversion_training.ipynb) -| [**Dreambooth**](./dreambooth) | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/sd_dreambooth_training.ipynb) -| [**Training with LoRA**](./lora) | ✅ | - | - | -| [**ControlNet**](./controlnet) | ✅ | ✅ | - | -| [**InstructPix2Pix**](./instructpix2pix) | ✅ | ✅ | - | -| [**Custom Diffusion**](./custom_diffusion) | ✅ | ✅ | - | -| [**T2I Adapters**](./t2i_adapters) | ✅ | ✅ | - | - -## Community - -In addition, we provide **community** examples, which are examples added and maintained by our community. -Community examples can consist of both *training* examples or *inference* pipelines. -For such examples, we are more lenient regarding the philosophy defined above and also cannot guarantee to provide maintenance for every issue. -Examples that are useful for the community, but are either not yet deemed popular or not yet following our above philosophy should go into the [community examples](https://github.com/huggingface/diffusers/tree/main/examples/community) folder. The community folder therefore includes training examples and inference pipelines. -**Note**: Community examples can be a [great first contribution](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) to show to the community how you like to use `diffusers` 🪄. - -## Important note - -To make sure you can successfully run the latest versions of the example scripts, you have to **install the library from source** and install some example-specific requirements. To do this, execute the following steps in a new virtual environment: - -```bash -git clone https://github.com/huggingface/diffusers -cd diffusers -pip install . -``` - -Then cd in the example folder of your choice and run - -```bash -pip install -r requirements.txt -``` diff --git a/docs/source/jp/training/text_inversion.md b/docs/source/jp/training/text_inversion.md deleted file mode 100644 index 7cc7d57e7c6c..000000000000 --- a/docs/source/jp/training/text_inversion.md +++ /dev/null @@ -1,277 +0,0 @@ - - - - -# Textual Inversion - -[Textual Inversion](https://arxiv.org/abs/2208.01618) is a technique for capturing novel concepts from a small number of example images. While the technique was originally demonstrated with a [latent diffusion model](https://github.com/CompVis/latent-diffusion), it has since been applied to other model variants like [Stable Diffusion](https://huggingface.co/docs/diffusers/main/en/conceptual/stable_diffusion). The learned concepts can be used to better control the images generated from text-to-image pipelines. It learns new "words" in the text encoder's embedding space, which are used within text prompts for personalized image generation. 
- -![Textual Inversion example](https://textual-inversion.github.io/static/images/editing/colorful_teapot.JPG) -By using just 3-5 images you can teach new concepts to a model such as Stable Diffusion for personalized image generation (image source). - -This guide will show you how to train a [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) model with Textual Inversion. All the training scripts for Textual Inversion used in this guide can be found [here](https://github.com/huggingface/diffusers/tree/main/examples/textual_inversion) if you're interested in taking a closer look at how things work under the hood. - - - -There is a community-created collection of trained Textual Inversion models in the [Stable Diffusion Textual Inversion Concepts Library](https://huggingface.co/sd-concepts-library) which are readily available for inference. Over time, this'll hopefully grow into a useful resource as more concepts are added! - - - -Before you begin, make sure you install the library's training dependencies: - -```bash -pip install diffusers accelerate transformers -``` - -After all the dependencies have been set up, initialize a [🀗Accelerate](https://github.com/huggingface/accelerate/) environment with: - -```bash -accelerate config -``` - -To setup a default 🀗 Accelerate environment without choosing any configurations: - -```bash -accelerate config default -``` - -Or if your environment doesn't support an interactive shell like a notebook, you can use: - -```bash -from accelerate.utils import write_basic_config - -write_basic_config() -``` - -Finally, you try and [install xFormers](https://huggingface.co/docs/diffusers/main/en/training/optimization/xformers) to reduce your memory footprint with xFormers memory-efficient attention. Once you have xFormers installed, add the `--enable_xformers_memory_efficient_attention` argument to the training script. xFormers is not supported for Flax. - -## Upload model to Hub - -If you want to store your model on the Hub, add the following argument to the training script: - -```bash ---push_to_hub -``` - -## Save and load checkpoints - -It is often a good idea to regularly save checkpoints of your model during training. This way, you can resume training from a saved checkpoint if your training is interrupted for any reason. To save a checkpoint, pass the following argument to the training script to save the full training state in a subfolder in `output_dir` every 500 steps: - -```bash ---checkpointing_steps=500 -``` - -To resume training from a saved checkpoint, pass the following argument to the training script and the specific checkpoint you'd like to resume from: - -```bash ---resume_from_checkpoint="checkpoint-1500" -``` - -## Finetuning - -For your training dataset, download these [images of a cat toy](https://huggingface.co/datasets/diffusers/cat_toy_example) and store them in a directory. To use your own dataset, take a look at the [Create a dataset for training](create_dataset) guide. 
- -```py -from huggingface_hub import snapshot_download - -local_dir = "./cat" -snapshot_download( - "diffusers/cat_toy_example", local_dir=local_dir, repo_type="dataset", ignore_patterns=".gitattributes" -) -``` - -Specify the `MODEL_NAME` environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`](https://huggingface.co/docs/diffusers/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained.pretrained_model_name_or_path) argument, and the `DATA_DIR` environment variable to the path of the directory containing the images. - -Now you can launch the [training script](https://github.com/huggingface/diffusers/blob/main/examples/textual_inversion/textual_inversion.py). The script creates and saves the following files to your repository: `learned_embeds.bin`, `token_identifier.txt`, and `type_of_concept.txt`. - - - -💡 A full training run takes ~1 hour on one V100 GPU. While you're waiting for the training to complete, feel free to check out [how Textual Inversion works](#how-it-works) in the section below if you're curious! - - - - - -```bash -export MODEL_NAME="runwayml/stable-diffusion-v1-5" -export DATA_DIR="./cat" - -accelerate launch textual_inversion.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --train_data_dir=$DATA_DIR \ - --learnable_property="object" \ - --placeholder_token="" --initializer_token="toy" \ - --resolution=512 \ - --train_batch_size=1 \ - --gradient_accumulation_steps=4 \ - --max_train_steps=3000 \ - --learning_rate=5.0e-04 --scale_lr \ - --lr_scheduler="constant" \ - --lr_warmup_steps=0 \ - --output_dir="textual_inversion_cat" \ - --push_to_hub -``` - - - -💡 If you want to increase the trainable capacity, you can associate your placeholder token, *e.g.* `` to -multiple embedding vectors. This can help the model to better capture the style of more (complex) images. -To enable training multiple embedding vectors, simply pass: - -```bash ---num_vectors=5 -``` - - - - -If you have access to TPUs, try out the [Flax training script](https://github.com/huggingface/diffusers/blob/main/examples/textual_inversion/textual_inversion_flax.py) to train even faster (this'll also work for GPUs). With the same configuration settings, the Flax training script should be at least 70% faster than the PyTorch training script! ⚡ - -Before you begin, make sure you install the Flax specific dependencies: - -```bash -pip install -U -r requirements_flax.txt -``` - -Specify the `MODEL_NAME` environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`](https://huggingface.co/docs/diffusers/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained.pretrained_model_name_or_path) argument. 
- -Then you can launch the [training script](https://github.com/huggingface/diffusers/blob/main/examples/textual_inversion/textual_inversion_flax.py): - -```bash -export MODEL_NAME="duongna/stable-diffusion-v1-4-flax" -export DATA_DIR="./cat" - -python textual_inversion_flax.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --train_data_dir=$DATA_DIR \ - --learnable_property="object" \ - --placeholder_token="" --initializer_token="toy" \ - --resolution=512 \ - --train_batch_size=1 \ - --max_train_steps=3000 \ - --learning_rate=5.0e-04 --scale_lr \ - --output_dir="textual_inversion_cat" \ - --push_to_hub -``` - - - -### Intermediate logging - -If you're interested in following along with your model training progress, you can save the generated images from the training process. Add the following arguments to the training script to enable intermediate logging: - -- `validation_prompt`, the prompt used to generate samples (this is set to `None` by default and intermediate logging is disabled) -- `num_validation_images`, the number of sample images to generate -- `validation_steps`, the number of steps before generating `num_validation_images` from the `validation_prompt` - -```bash ---validation_prompt="A backpack" ---num_validation_images=4 ---validation_steps=100 -``` - -## Inference - -Once you have trained a model, you can use it for inference with the [`StableDiffusionPipeline`]. - -The textual inversion script will by default only save the textual inversion embedding vector(s) that have -been added to the text encoder embedding matrix and consequently been trained. - - - - - -💡 The community has created a large library of different textual inversion embedding vectors, called [sd-concepts-library](https://huggingface.co/sd-concepts-library). -Instead of training textual inversion embeddings from scratch you can also see whether a fitting textual inversion embedding has already been added to the library. - - - -To load the textual inversion embeddings you first need to load the base model that was used when training -your textual inversion embedding vectors. Here we assume that [`runwayml/stable-diffusion-v1-5`](runwayml/stable-diffusion-v1-5) -was used as a base model so we load it first: -```python -from diffusers import StableDiffusionPipeline -import torch - -model_id = "runwayml/stable-diffusion-v1-5" -pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, use_safetensors=True).to("cuda") -``` - -Next, we need to load the textual inversion embedding vector which can be done via the [`TextualInversionLoaderMixin.load_textual_inversion`] -function. Here we'll load the embeddings of the "" example from before. -```python -pipe.load_textual_inversion("sd-concepts-library/cat-toy") -``` - -Now we can run the pipeline making sure that the placeholder token `` is used in our prompt. - -```python -prompt = "A backpack" - -image = pipe(prompt, num_inference_steps=50).images[0] -image.save("cat-backpack.png") -``` - -The function [`TextualInversionLoaderMixin.load_textual_inversion`] can not only -load textual embedding vectors saved in Diffusers' format, but also embedding vectors -saved in [Automatic1111](https://github.com/AUTOMATIC1111/stable-diffusion-webui) format. 
-To do so, you can first download an embedding vector from [civitAI](https://civitai.com/models/3036?modelVersionId=8387) -and then load it locally: -```python -pipe.load_textual_inversion("./charturnerv2.pt") -``` - - -Currently there is no `load_textual_inversion` function for Flax so one has to make sure the textual inversion -embedding vector is saved as part of the model after training. - -The model can then be run just like any other Flax model: - -```python -import jax -import numpy as np -from flax.jax_utils import replicate -from flax.training.common_utils import shard -from diffusers import FlaxStableDiffusionPipeline - -model_path = "path-to-your-trained-model" -pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(model_path, dtype=jax.numpy.bfloat16) - -prompt = "A backpack" -prng_seed = jax.random.PRNGKey(0) -num_inference_steps = 50 - -num_samples = jax.device_count() -prompt = num_samples * [prompt] -prompt_ids = pipeline.prepare_inputs(prompt) - -# shard inputs and rng -params = replicate(params) -prng_seed = jax.random.split(prng_seed, jax.device_count()) -prompt_ids = shard(prompt_ids) - -images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images -images = pipeline.numpy_to_pil(np.asarray(images.reshape((num_samples,) + images.shape[-3:]))) -image.save("cat-backpack.png") -``` - - - -## How it works - -![Diagram from the paper showing overview](https://textual-inversion.github.io/static/images/training/training.JPG) -Architecture overview from the Textual Inversion blog post. - -Usually, text prompts are tokenized into an embedding before being passed to a model, which is often a transformer. Textual Inversion does something similar, but it learns a new token embedding, `v*`, from a special token `S*` in the diagram above. The model output is used to condition the diffusion model, which helps the diffusion model understand the prompt and new concepts from just a few example images. - -To do this, Textual Inversion uses a generator model and noisy versions of the training images. The generator tries to predict less noisy versions of the images, and the token embedding `v*` is optimized based on how well the generator does. If the token embedding successfully captures the new concept, it gives more useful information to the diffusion model and helps create clearer images with less noise. This optimization process typically occurs after several thousand steps of exposure to a variety of prompt and image variants. diff --git a/docs/source/jp/tutorials/autopipeline.md b/docs/source/jp/tutorials/autopipeline.md deleted file mode 100644 index 973a83c73eb1..000000000000 --- a/docs/source/jp/tutorials/autopipeline.md +++ /dev/null @@ -1,146 +0,0 @@ -# AutoPipeline - -🀗 Diffusers is able to complete many different tasks, and you can often reuse the same pretrained weights for multiple tasks such as text-to-image, image-to-image, and inpainting. If you're new to the library and diffusion models though, it may be difficult to know which pipeline to use for a task. For example, if you're using the [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5) checkpoint for text-to-image, you might not know that you could also use it for image-to-image and inpainting by loading the checkpoint with the [`StableDiffusionImg2ImgPipeline`] and [`StableDiffusionInpaintPipeline`] classes respectively. - -The `AutoPipeline` class is designed to simplify the variety of pipelines in 🀗 Diffusers. 
It is a generic, *task-first* pipeline that lets you focus on the task. The `AutoPipeline` automatically detects the correct pipeline class to use, which makes it easier to load a checkpoint for a task without knowing the specific pipeline class name. - - - -Take a look at the [AutoPipeline](./pipelines/auto_pipeline) reference to see which tasks are supported. Currently, it supports text-to-image, image-to-image, and inpainting. - - - -This tutorial shows you how to use an `AutoPipeline` to automatically infer the pipeline class to load for a specific task, given the pretrained weights. - -## Choose an AutoPipeline for your task - -Start by picking a checkpoint. For example, if you're interested in text-to-image with the [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5) checkpoint, use [`AutoPipelineForText2Image`]: - -```py -from diffusers import AutoPipelineForText2Image -import torch - -pipeline = AutoPipelineForText2Image.from_pretrained( - "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True -).to("cuda") -prompt = "peasant and dragon combat, wood cutting style, viking era, bevel with rune" - -image = pipeline(prompt, num_inference_steps=25).images[0] -``` - -
- generated image of peasant fighting dragon in wood cutting style -
-
-Under the hood, [`AutoPipelineForText2Image`]:
-
-1. automatically detects a `"stable-diffusion"` class from the [`model_index.json`](https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/model_index.json) file
-2. loads the corresponding text-to-image [`StableDiffusionPipeline`] based on the `"stable-diffusion"` class name
-
-Likewise, for image-to-image, [`AutoPipelineForImage2Image`] detects a `"stable-diffusion"` checkpoint from the `model_index.json` file and loads the corresponding [`StableDiffusionImg2ImgPipeline`] behind the scenes. You can also pass any additional arguments specific to the pipeline class such as `strength`, which determines the amount of noise or variation added to an input image:
-
-```py
-from io import BytesIO
-
-import requests
-import torch
-from PIL import Image
-
-from diffusers import AutoPipelineForImage2Image
-
-pipeline = AutoPipelineForImage2Image.from_pretrained(
-    "runwayml/stable-diffusion-v1-5",
-    torch_dtype=torch.float16,
-    use_safetensors=True,
-).to("cuda")
-prompt = "a portrait of a dog wearing a pearl earring"
-
-url = "https://upload.wikimedia.org/wikipedia/commons/thumb/0/0f/1665_Girl_with_a_Pearl_Earring.jpg/800px-1665_Girl_with_a_Pearl_Earring.jpg"
-
-response = requests.get(url)
-image = Image.open(BytesIO(response.content)).convert("RGB")
-image.thumbnail((768, 768))
-
-image = pipeline(prompt, image, num_inference_steps=200, strength=0.75, guidance_scale=10.5).images[0]
-```
-
-
- generated image of a vermeer portrait of a dog wearing a pearl earring -
-
-And if you want to do inpainting, then [`AutoPipelineForInpainting`] loads the underlying [`StableDiffusionXLInpaintPipeline`] class in the same way:
-
-```py
-import torch
-from diffusers import AutoPipelineForInpainting
-from diffusers.utils import load_image
-
-pipeline = AutoPipelineForInpainting.from_pretrained(
-    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, use_safetensors=True
-).to("cuda")
-
-img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
-mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
-
-init_image = load_image(img_url).convert("RGB")
-mask_image = load_image(mask_url).convert("RGB")
-
-prompt = "A majestic tiger sitting on a bench"
-image = pipeline(prompt, image=init_image, mask_image=mask_image, num_inference_steps=50, strength=0.80).images[0]
-```
-
-
- generated image of a tiger sitting on a bench -
- -If you try to load an unsupported checkpoint, it'll throw an error: - -```py -from diffusers import AutoPipelineForImage2Image -import torch - -pipeline = AutoPipelineForImage2Image.from_pretrained( - "openai/shap-e-img2img", torch_dtype=torch.float16, use_safetensors=True -) -"ValueError: AutoPipeline can't find a pipeline linked to ShapEImg2ImgPipeline for None" -``` - -## Use multiple pipelines - -For some workflows or if you're loading many pipelines, it is more memory-efficient to reuse the same components from a checkpoint instead of reloading them which would unnecessarily consume additional memory. For example, if you're using a checkpoint for text-to-image and you want to use it again for image-to-image, use the [`~AutoPipelineForImage2Image.from_pipe`] method. This method creates a new pipeline from the components of a previously loaded pipeline at no additional memory cost. - -The [`~AutoPipelineForImage2Image.from_pipe`] method detects the original pipeline class and maps it to the new pipeline class corresponding to the task you want to do. For example, if you load a `"stable-diffusion"` class pipeline for text-to-image: - -```py -from diffusers import AutoPipelineForText2Image, AutoPipelineForImage2Image - -pipeline_text2img = AutoPipelineForText2Image.from_pretrained( - "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True -) -print(type(pipeline_text2img)) -"" -``` - -Then [`~AutoPipelineForImage2Image.from_pipe`] maps the original `"stable-diffusion"` pipeline class to [`StableDiffusionImg2ImgPipeline`]: - -```py -pipeline_img2img = AutoPipelineForImage2Image.from_pipe(pipeline_text2img) -print(type(pipeline_img2img)) -"" -``` - -If you passed an optional argument - like disabling the safety checker - to the original pipeline, this argument is also passed on to the new pipeline: - -```py -from diffusers import AutoPipelineForText2Image, AutoPipelineForImage2Image - -pipeline_text2img = AutoPipelineForText2Image.from_pretrained( - "runwayml/stable-diffusion-v1-5", - torch_dtype=torch.float16, - use_safetensors=True, - requires_safety_checker=False, -).to("cuda") - -pipeline_img2img = AutoPipelineForImage2Image.from_pipe(pipeline_text2img) -print(pipe.config.requires_safety_checker) -"False" -``` - -You can overwrite any of the arguments and even configuration from the original pipeline if you want to change the behavior of the new pipeline. For example, to turn the safety checker back on and add the `strength` argument: - -```py -pipeline_img2img = AutoPipelineForImage2Image.from_pipe(pipeline_text2img, requires_safety_checker=True, strength=0.3) -``` diff --git a/docs/source/jp/tutorials/basic_training.md b/docs/source/jp/tutorials/basic_training.md deleted file mode 100644 index 3a9366baf84a..000000000000 --- a/docs/source/jp/tutorials/basic_training.md +++ /dev/null @@ -1,404 +0,0 @@ - - -[[open-in-colab]] - -# Train a diffusion model - -Unconditional image generation is a popular application of diffusion models that generates images that look like those in the dataset used for training. Typically, the best results are obtained from finetuning a pretrained model on a specific dataset. You can find many of these checkpoints on the [Hub](https://huggingface.co/search/full-text?q=unconditional-image-generation&type=model), but if you can't find one you like, you can always train your own! 
- -This tutorial will teach you how to train a [`UNet2DModel`] from scratch on a subset of the [Smithsonian Butterflies](https://huggingface.co/datasets/huggan/smithsonian_butterflies_subset) dataset to generate your own 🊋 butterflies 🊋. - - - -💡 This training tutorial is based on the [Training with 🧚 Diffusers](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/training_example.ipynb) notebook. For additional details and context about diffusion models like how they work, check out the notebook! - - - -Before you begin, make sure you have 🀗 Datasets installed to load and preprocess image datasets, and 🀗 Accelerate, to simplify training on any number of GPUs. The following command will also install [TensorBoard](https://www.tensorflow.org/tensorboard) to visualize training metrics (you can also use [Weights & Biases](https://docs.wandb.ai/) to track your training). - -```py -# uncomment to install the necessary libraries in Colab -#!pip install diffusers[training] -``` - -We encourage you to share your model with the community, and in order to do that, you'll need to login to your Hugging Face account (create one [here](https://hf.co/join) if you don't already have one!). You can login from a notebook and enter your token when prompted: - -```py ->>> from huggingface_hub import notebook_login - ->>> notebook_login() -``` - -Or login in from the terminal: - -```bash -huggingface-cli login -``` - -Since the model checkpoints are quite large, install [Git-LFS](https://git-lfs.com/) to version these large files: - -```bash -!sudo apt -qq install git-lfs -!git config --global credential.helper store -``` - -## Training configuration - -For convenience, create a `TrainingConfig` class containing the training hyperparameters (feel free to adjust them): - -```py ->>> from dataclasses import dataclass - - ->>> @dataclass -... class TrainingConfig: -... image_size = 128 # the generated image resolution -... train_batch_size = 16 -... eval_batch_size = 16 # how many images to sample during evaluation -... num_epochs = 50 -... gradient_accumulation_steps = 1 -... learning_rate = 1e-4 -... lr_warmup_steps = 500 -... save_image_epochs = 10 -... save_model_epochs = 30 -... mixed_precision = "fp16" # `no` for float32, `fp16` for automatic mixed precision -... output_dir = "ddpm-butterflies-128" # the model name locally and on the HF Hub - -... push_to_hub = True # whether to upload the saved model to the HF Hub -... hub_private_repo = False -... overwrite_output_dir = True # overwrite the old model when re-running the notebook -... seed = 0 - - ->>> config = TrainingConfig() -``` - -## Load the dataset - -You can easily load the [Smithsonian Butterflies](https://huggingface.co/datasets/huggan/smithsonian_butterflies_subset) dataset with the 🀗 Datasets library: - -```py ->>> from datasets import load_dataset - ->>> config.dataset_name = "huggan/smithsonian_butterflies_subset" ->>> dataset = load_dataset(config.dataset_name, split="train") -``` - - - -💡 You can find additional datasets from the [HugGan Community Event](https://huggingface.co/huggan) or you can use your own dataset by creating a local [`ImageFolder`](https://huggingface.co/docs/datasets/image_dataset#imagefolder). Set `config.dataset_name` to the repository id of the dataset if it is from the HugGan Community Event, or `imagefolder` if you're using your own images. 
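-
-For example, a minimal sketch of the `imagefolder` route (the `data_dir` path below is a placeholder for wherever your own images live):
-
-```py
->>> from datasets import load_dataset
-
->>> # every image found under data_dir ends up in a single "train" split
->>> dataset = load_dataset("imagefolder", data_dir="path/to/your/images", split="train")
-```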
- - - -🀗 Datasets uses the [`~datasets.Image`] feature to automatically decode the image data and load it as a [`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html) which we can visualize: - -```py ->>> import matplotlib.pyplot as plt - ->>> fig, axs = plt.subplots(1, 4, figsize=(16, 4)) ->>> for i, image in enumerate(dataset[:4]["image"]): -... axs[i].imshow(image) -... axs[i].set_axis_off() ->>> fig.show() -``` - -
- -
- -The images are all different sizes though, so you'll need to preprocess them first: - -* `Resize` changes the image size to the one defined in `config.image_size`. -* `RandomHorizontalFlip` augments the dataset by randomly mirroring the images. -* `Normalize` is important to rescale the pixel values into a [-1, 1] range, which is what the model expects. - -```py ->>> from torchvision import transforms - ->>> preprocess = transforms.Compose( -... [ -... transforms.Resize((config.image_size, config.image_size)), -... transforms.RandomHorizontalFlip(), -... transforms.ToTensor(), -... transforms.Normalize([0.5], [0.5]), -... ] -... ) -``` - -Use 🀗 Datasets' [`~datasets.Dataset.set_transform`] method to apply the `preprocess` function on the fly during training: - -```py ->>> def transform(examples): -... images = [preprocess(image.convert("RGB")) for image in examples["image"]] -... return {"images": images} - - ->>> dataset.set_transform(transform) -``` - -Feel free to visualize the images again to confirm that they've been resized. Now you're ready to wrap the dataset in a [DataLoader](https://pytorch.org/docs/stable/data#torch.utils.data.DataLoader) for training! - -```py ->>> import torch - ->>> train_dataloader = torch.utils.data.DataLoader(dataset, batch_size=config.train_batch_size, shuffle=True) -``` - -## Create a UNet2DModel - -Pretrained models in 🧚 Diffusers are easily created from their model class with the parameters you want. For example, to create a [`UNet2DModel`]: - -```py ->>> from diffusers import UNet2DModel - ->>> model = UNet2DModel( -... sample_size=config.image_size, # the target image resolution -... in_channels=3, # the number of input channels, 3 for RGB images -... out_channels=3, # the number of output channels -... layers_per_block=2, # how many ResNet layers to use per UNet block -... block_out_channels=(128, 128, 256, 256, 512, 512), # the number of output channels for each UNet block -... down_block_types=( -... "DownBlock2D", # a regular ResNet downsampling block -... "DownBlock2D", -... "DownBlock2D", -... "DownBlock2D", -... "AttnDownBlock2D", # a ResNet downsampling block with spatial self-attention -... "DownBlock2D", -... ), -... up_block_types=( -... "UpBlock2D", # a regular ResNet upsampling block -... "AttnUpBlock2D", # a ResNet upsampling block with spatial self-attention -... "UpBlock2D", -... "UpBlock2D", -... "UpBlock2D", -... "UpBlock2D", -... ), -... ) -``` - -It is often a good idea to quickly check the sample image shape matches the model output shape: - -```py ->>> sample_image = dataset[0]["images"].unsqueeze(0) ->>> print("Input shape:", sample_image.shape) -Input shape: torch.Size([1, 3, 128, 128]) - ->>> print("Output shape:", model(sample_image, timestep=0).sample.shape) -Output shape: torch.Size([1, 3, 128, 128]) -``` - -Great! Next, you'll need a scheduler to add some noise to the image. - -## Create a scheduler - -The scheduler behaves differently depending on whether you're using the model for training or inference. During inference, the scheduler generates image from the noise. During training, the scheduler takes a model output - or a sample - from a specific point in the diffusion process and applies noise to the image according to a *noise schedule* and an *update rule*. 
- -Let's take a look at the [`DDPMScheduler`] and use the `add_noise` method to add some random noise to the `sample_image` from before: - -```py ->>> import torch ->>> from PIL import Image ->>> from diffusers import DDPMScheduler - ->>> noise_scheduler = DDPMScheduler(num_train_timesteps=1000) ->>> noise = torch.randn(sample_image.shape) ->>> timesteps = torch.LongTensor([50]) ->>> noisy_image = noise_scheduler.add_noise(sample_image, noise, timesteps) - ->>> Image.fromarray(((noisy_image.permute(0, 2, 3, 1) + 1.0) * 127.5).type(torch.uint8).numpy()[0]) -``` - -
- -
- -The training objective of the model is to predict the noise added to the image. The loss at this step can be calculated by: - -```py ->>> import torch.nn.functional as F - ->>> noise_pred = model(noisy_image, timesteps).sample ->>> loss = F.mse_loss(noise_pred, noise) -``` - -## Train the model - -By now, you have most of the pieces to start training the model and all that's left is putting everything together. - -First, you'll need an optimizer and a learning rate scheduler: - -```py ->>> from diffusers.optimization import get_cosine_schedule_with_warmup - ->>> optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate) ->>> lr_scheduler = get_cosine_schedule_with_warmup( -... optimizer=optimizer, -... num_warmup_steps=config.lr_warmup_steps, -... num_training_steps=(len(train_dataloader) * config.num_epochs), -... ) -``` - -Then, you'll need a way to evaluate the model. For evaluation, you can use the [`DDPMPipeline`] to generate a batch of sample images and save it as a grid: - -```py ->>> from diffusers import DDPMPipeline ->>> from diffusers.utils import make_image_grid ->>> import math ->>> import os - - ->>> def evaluate(config, epoch, pipeline): -... # Sample some images from random noise (this is the backward diffusion process). -... # The default pipeline output type is `List[PIL.Image]` -... images = pipeline( -... batch_size=config.eval_batch_size, -... generator=torch.manual_seed(config.seed), -... ).images - -... # Make a grid out of the images -... image_grid = make_image_grid(images, rows=4, cols=4) - -... # Save the images -... test_dir = os.path.join(config.output_dir, "samples") -... os.makedirs(test_dir, exist_ok=True) -... image_grid.save(f"{test_dir}/{epoch:04d}.png") -``` - -Now you can wrap all these components together in a training loop with 🀗 Accelerate for easy TensorBoard logging, gradient accumulation, and mixed precision training. To upload the model to the Hub, write a function to get your repository name and information and then push it to the Hub. - - - -💡 The training loop below may look intimidating and long, but it'll be worth it later when you launch your training in just one line of code! If you can't wait and want to start generating images, feel free to copy and run the code below. You can always come back and examine the training loop more closely later, like when you're waiting for your model to finish training. 🀗 - - - -```py ->>> from accelerate import Accelerator ->>> from huggingface_hub import create_repo, upload_folder ->>> from tqdm.auto import tqdm ->>> from pathlib import Path ->>> import os - ->>> def train_loop(config, model, noise_scheduler, optimizer, train_dataloader, lr_scheduler): -... # Initialize accelerator and tensorboard logging -... accelerator = Accelerator( -... mixed_precision=config.mixed_precision, -... gradient_accumulation_steps=config.gradient_accumulation_steps, -... log_with="tensorboard", -... project_dir=os.path.join(config.output_dir, "logs"), -... ) -... if accelerator.is_main_process: -... if config.output_dir is not None: -... os.makedirs(config.output_dir, exist_ok=True) -... if config.push_to_hub: -... repo_id = create_repo( -... repo_id=config.hub_model_id or Path(config.output_dir).name, exist_ok=True -... ).repo_id -... accelerator.init_trackers("train_example") - -... # Prepare everything -... # There is no specific order to remember, you just need to unpack the -... # objects in the same order you gave them to the prepare method. -... 
model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare( -... model, optimizer, train_dataloader, lr_scheduler -... ) - -... global_step = 0 - -... # Now you train the model -... for epoch in range(config.num_epochs): -... progress_bar = tqdm(total=len(train_dataloader), disable=not accelerator.is_local_main_process) -... progress_bar.set_description(f"Epoch {epoch}") - -... for step, batch in enumerate(train_dataloader): -... clean_images = batch["images"] -... # Sample noise to add to the images -... noise = torch.randn(clean_images.shape).to(clean_images.device) -... bs = clean_images.shape[0] - -... # Sample a random timestep for each image -... timesteps = torch.randint( -... 0, noise_scheduler.config.num_train_timesteps, (bs,), device=clean_images.device -... ).long() - -... # Add noise to the clean images according to the noise magnitude at each timestep -... # (this is the forward diffusion process) -... noisy_images = noise_scheduler.add_noise(clean_images, noise, timesteps) - -... with accelerator.accumulate(model): -... # Predict the noise residual -... noise_pred = model(noisy_images, timesteps, return_dict=False)[0] -... loss = F.mse_loss(noise_pred, noise) -... accelerator.backward(loss) - -... accelerator.clip_grad_norm_(model.parameters(), 1.0) -... optimizer.step() -... lr_scheduler.step() -... optimizer.zero_grad() - -... progress_bar.update(1) -... logs = {"loss": loss.detach().item(), "lr": lr_scheduler.get_last_lr()[0], "step": global_step} -... progress_bar.set_postfix(**logs) -... accelerator.log(logs, step=global_step) -... global_step += 1 - -... # After each epoch you optionally sample some demo images with evaluate() and save the model -... if accelerator.is_main_process: -... pipeline = DDPMPipeline(unet=accelerator.unwrap_model(model), scheduler=noise_scheduler) - -... if (epoch + 1) % config.save_image_epochs == 0 or epoch == config.num_epochs - 1: -... evaluate(config, epoch, pipeline) - -... if (epoch + 1) % config.save_model_epochs == 0 or epoch == config.num_epochs - 1: -... if config.push_to_hub: -... upload_folder( -... repo_id=repo_id, -... folder_path=config.output_dir, -... commit_message=f"Epoch {epoch}", -... ignore_patterns=["step_*", "epoch_*"], -... ) -... else: -... pipeline.save_pretrained(config.output_dir) -``` - -Phew, that was quite a bit of code! But you're finally ready to launch the training with 🀗 Accelerate's [`~accelerate.notebook_launcher`] function. Pass the function the training loop, all the training arguments, and the number of processes (you can change this value to the number of GPUs available to you) to use for training: - -```py ->>> from accelerate import notebook_launcher - ->>> args = (config, model, noise_scheduler, optimizer, train_dataloader, lr_scheduler) - ->>> notebook_launcher(train_loop, args, num_processes=1) -``` - -Once training is complete, take a look at the final 🊋 images 🊋 generated by your diffusion model! - -```py ->>> import glob - ->>> sample_images = sorted(glob.glob(f"{config.output_dir}/samples/*.png")) ->>> Image.open(sample_images[-1]) -``` - -
- -
- -## Next steps - -Unconditional image generation is one example of a task that can be trained. You can explore other tasks and training techniques by visiting the [🧚 Diffusers Training Examples](../training/overview) page. Here are some examples of what you can learn: - -* [Textual Inversion](../training/text_inversion), an algorithm that teaches a model a specific visual concept and integrates it into the generated image. -* [DreamBooth](../training/dreambooth), a technique for generating personalized images of a subject given several input images of the subject. -* [Guide](../training/text2image) to finetuning a Stable Diffusion model on your own dataset. -* [Guide](../training/lora) to using LoRA, a memory-efficient technique for finetuning really large models faster. diff --git a/docs/source/jp/tutorials/tutorial_overview.md b/docs/source/jp/tutorials/tutorial_overview.md deleted file mode 100644 index 0cec9a317ddb..000000000000 --- a/docs/source/jp/tutorials/tutorial_overview.md +++ /dev/null @@ -1,23 +0,0 @@ - - -# Overview - -Welcome to 🧚 Diffusers! If you're new to diffusion models and generative AI, and want to learn more, then you've come to the right place. These beginner-friendly tutorials are designed to provide a gentle introduction to diffusion models and help you understand the library fundamentals - the core components and how 🧚 Diffusers is meant to be used. - -You'll learn how to use a pipeline for inference to rapidly generate things, and then deconstruct that pipeline to really understand how to use the library as a modular toolbox for building your own diffusion systems. In the next lesson, you'll learn how to train your own diffusion model to generate what you want. - -After completing the tutorials, you'll have gained the necessary skills to start exploring the library on your own and see how to use it for your own projects and applications. - -Feel free to join our community on [Discord](https://discord.com/invite/JfAtkvEtRb) or the [forums](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63) to connect and collaborate with other users and developers! - -Let's start diffusing! 🧚 \ No newline at end of file diff --git a/docs/source/jp/tutorials/using_peft_for_inference.md b/docs/source/jp/tutorials/using_peft_for_inference.md deleted file mode 100644 index 4629cf8ba43c..000000000000 --- a/docs/source/jp/tutorials/using_peft_for_inference.md +++ /dev/null @@ -1,165 +0,0 @@ - - -[[open-in-colab]] - -# Inference with PEFT - -There are many adapters trained in different styles to achieve different effects. You can even combine multiple adapters to create new and unique images. With the 🀗 [PEFT](https://huggingface.co/docs/peft/index) integration in 🀗 Diffusers, it is really easy to load and manage adapters for inference. In this guide, you'll learn how to use different adapters with [Stable Diffusion XL (SDXL)](./pipelines/stable_diffusion/stable_diffusion_xl) for inference. - -Throughout this guide, you'll use LoRA as the main adapter technique, so we'll use the terms LoRA and adapter interchangeably. You should have some familiarity with LoRA, and if you don't, we welcome you to check out the [LoRA guide](https://huggingface.co/docs/peft/conceptual_guides/lora). - -Let's first install all the required libraries. - -```bash -!pip install -q transformers accelerate -# Will be updated once the stable releases are done. 
-!pip install -q git+https://github.com/huggingface/peft.git -!pip install -q git+https://github.com/huggingface/diffusers.git -``` - -Now, let's load a pipeline with a SDXL checkpoint: - -```python -from diffusers import DiffusionPipeline -import torch - -pipe_id = "stabilityai/stable-diffusion-xl-base-1.0" -pipe = DiffusionPipeline.from_pretrained(pipe_id, torch_dtype=torch.float16).to("cuda") -``` - - -Next, load a LoRA checkpoint with the [`~diffusers.loaders.StableDiffusionXLLoraLoaderMixin.load_lora_weights`] method. - -With the 🀗 PEFT integration, you can assign a specific `adapter_name` to the checkpoint, which let's you easily switch between different LoRA checkpoints. Let's call this adapter `"toy"`. - -```python -pipe.load_lora_weights("CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors", adapter_name="toy") -``` - -And then perform inference: - -```python -prompt = "toy_face of a hacker with a hoodie" - -lora_scale= 0.9 -image = pipe( - prompt, num_inference_steps=30, cross_attention_kwargs={"scale": lora_scale}, generator=torch.manual_seed(0) -).images[0] -image -``` - -![toy-face](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peft_integration/diffusers_peft_lora_inference_8_1.png) - - -With the `adapter_name` parameter, it is really easy to use another adapter for inference! Load the [nerijs/pixel-art-xl](https://huggingface.co/nerijs/pixel-art-xl) adapter that has been fine-tuned to generate pixel art images, and let's call it `"pixel"`. - -The pipeline automatically sets the first loaded adapter (`"toy"`) as the active adapter. But you can activate the `"pixel"` adapter with the [`~diffusers.loaders.set_adapters`] method as shown below: - -```python -pipe.load_lora_weights("nerijs/pixel-art-xl", weight_name="pixel-art-xl.safetensors", adapter_name="pixel") -pipe.set_adapters("pixel") -``` - -Let's now generate an image with the second adapter and check the result: - -```python -prompt = "a hacker with a hoodie, pixel art" -image = pipe( - prompt, num_inference_steps=30, cross_attention_kwargs={"scale": lora_scale}, generator=torch.manual_seed(0) -).images[0] -image -``` - -![pixel-art](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peft_integration/diffusers_peft_lora_inference_12_1.png) - -## Combine multiple adapters - -You can also perform multi-adapter inference where you combine different adapter checkpoints for inference. - -Once again, use the [`~diffusers.loaders.set_adapters`] method to activate two LoRA checkpoints and specify the weight for how the checkpoints should be combined. - -```python -pipe.set_adapters(["pixel", "toy"], adapter_weights=[0.5, 1.0]) -``` - -Now that we have set these two adapters, let's generate an image from the combined adapters! - - - -LoRA checkpoints in the diffusion community are almost always obtained with [DreamBooth](https://huggingface.co/docs/diffusers/main/en/training/dreambooth). DreamBooth training often relies on "trigger" words in the input text prompts in order for the generation results to look as expected. When you combine multiple LoRA checkpoints, it's important to ensure the trigger words for the corresponding LoRA checkpoints are present in the input text prompts. - - - -The trigger words for [CiroN2022/toy-face](https://hf.co/CiroN2022/toy-face) and [nerijs/pixel-art-xl](https://hf.co/nerijs/pixel-art-xl) are found in their repositories. - - -```python -# Notice how the prompt is constructed. 
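-# It combines the trigger phrases used for each adapter earlier: "toy_face" and "pixel art".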
-prompt = "toy_face of a hacker with a hoodie, pixel art" -image = pipe( - prompt, num_inference_steps=30, cross_attention_kwargs={"scale": 1.0}, generator=torch.manual_seed(0) -).images[0] -image -``` - -![toy-face-pixel-art](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peft_integration/diffusers_peft_lora_inference_16_1.png) - -Impressive! As you can see, the model was able to generate an image that mixes the characteristics of both adapters. - -If you want to go back to using only one adapter, use the [`~diffusers.loaders.set_adapters`] method to activate the `"toy"` adapter: - -```python -# First, set the adapter. -pipe.set_adapters("toy") - -# Then, run inference. -prompt = "toy_face of a hacker with a hoodie" -lora_scale= 0.9 -image = pipe( - prompt, num_inference_steps=30, cross_attention_kwargs={"scale": lora_scale}, generator=torch.manual_seed(0) -).images[0] -image -``` - -![toy-face-again](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peft_integration/diffusers_peft_lora_inference_18_1.png) - - -If you want to switch to only the base model, disable all LoRAs with the [`~diffusers.loaders.disable_lora`] method. - - -```python -pipe.disable_lora() - -prompt = "toy_face of a hacker with a hoodie" -lora_scale= 0.9 -image = pipe(prompt, num_inference_steps=30, generator=torch.manual_seed(0)).images[0] -image -``` - -![no-lora](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peft_integration/diffusers_peft_lora_inference_20_1.png) - -## Monitoring active adapters - -You have attached multiple adapters in this tutorial, and if you're feeling a bit lost on what adapters have been attached to the pipeline's components, you can easily check the list of active adapters using the [`~diffusers.loaders.get_active_adapters`] method: - -```python -active_adapters = pipe.get_active_adapters() ->>> ["toy", "pixel"] -``` - -You can also get the active adapters of each pipeline component with [`~diffusers.loaders.get_list_adapters`]: - -```python -list_adapters_component_wise = pipe.get_list_adapters() ->>> {"text_encoder": ["toy", "pixel"], "unet": ["toy", "pixel"], "text_encoder_2": ["toy", "pixel"]} -``` diff --git a/docs/source/jp/using-diffusers/pipeline_overview.md b/docs/source/jp/using-diffusers/pipeline_overview.md deleted file mode 100644 index 4ee25b51dc6f..000000000000 --- a/docs/source/jp/using-diffusers/pipeline_overview.md +++ /dev/null @@ -1,17 +0,0 @@ - - -# Overview - -A pipeline is an end-to-end class that provides a quick and easy way to use a diffusion system for inference by bundling independently trained models and schedulers together. Certain combinations of models and schedulers define specific pipeline types, like [`StableDiffusionXLPipeline`] or [`StableDiffusionControlNetPipeline`], with specific capabilities. All pipeline types inherit from the base [`DiffusionPipeline`] class; pass it any checkpoint, and it'll automatically detect the pipeline type and load the necessary components. - -This section introduces you to some of the more complex pipelines like Stable Diffusion XL, ControlNet, and DiffEdit, which require additional inputs. You'll also learn how to use a distilled version of the Stable Diffusion model to speed up inference, how to control randomness on your hardware when generating images, and how to create a community pipeline for a custom task like generating images from speech. 
\ No newline at end of file diff --git a/docs/source/jp/using-diffusers/sdxl.md b/docs/source/jp/using-diffusers/sdxl.md deleted file mode 100644 index 36286ecad863..000000000000 --- a/docs/source/jp/using-diffusers/sdxl.md +++ /dev/null @@ -1,431 +0,0 @@ -# Stable Diffusion XL - -[[open-in-colab]] - -[Stable Diffusion XL](https://huggingface.co/papers/2307.01952) (SDXL) is a powerful text-to-image generation model that iterates on the previous Stable Diffusion models in three key ways: - -1. the UNet is 3x larger and SDXL combines a second text encoder (OpenCLIP ViT-bigG/14) with the original text encoder to significantly increase the number of parameters -2. introduces size and crop-conditioning to preserve training data from being discarded and gain more control over how a generated image should be cropped -3. introduces a two-stage model process; the *base* model (can also be run as a standalone model) generates an image as an input to the *refiner* model which adds additional high-quality details - -This guide will show you how to use SDXL for text-to-image, image-to-image, and inpainting. - -Before you begin, make sure you have the following libraries installed: - -```py -# uncomment to install the necessary libraries in Colab -#!pip install diffusers transformers accelerate safetensors omegaconf invisible-watermark>=0.2.0 -``` - - - -We recommend installing the [invisible-watermark](https://pypi.org/project/invisible-watermark/) library to help identify images that are generated. If the invisible-watermark library is installed, it is used by default. To disable the watermarker: - -```py -pipeline = StableDiffusionXLPipeline.from_pretrained(..., add_watermarker=False) -``` - - - -## Load model checkpoints - -Model weights may be stored in separate subfolders on the Hub or locally, in which case, you should use the [`~StableDiffusionXLPipeline.from_pretrained`] method: - -```py -from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline -import torch - -pipeline = StableDiffusionXLPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") - -refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, use_safetensors=True, variant="fp16" -).to("cuda") -``` - -You can also use the [`~StableDiffusionXLPipeline.from_single_file`] method to load a model checkpoint stored in a single file format (`.ckpt` or `.safetensors`) from the Hub or locally: - -```py -from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline -import torch - -pipeline = StableDiffusionXLPipeline.from_single_file( - "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_base_1.0.safetensors", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") - -refiner = StableDiffusionXLImg2ImgPipeline.from_single_file( - "https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/blob/main/sd_xl_refiner_1.0.safetensors", torch_dtype=torch.float16, use_safetensors=True, variant="fp16" -).to("cuda") -``` - -## Text-to-image - -For text-to-image, pass a text prompt. By default, SDXL generates a 1024x1024 image for the best results. You can try setting the `height` and `width` parameters to 768x768 or 512x512, but anything below 512x512 is not likely to work. 
-
-```py
-from diffusers import AutoPipelineForText2Image
-import torch
-
-pipeline_text2image = AutoPipelineForText2Image.from_pretrained(
-    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
-).to("cuda")
-
-prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
-image = pipeline_text2image(prompt=prompt).images[0]
-```
-
-
- generated image of an astronaut in a jungle -
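-
-If you want to try the lower resolutions mentioned above, a quick sketch (reusing `pipeline_text2image` and `prompt` from the previous snippet) is to pass `height` and `width` explicitly:
-
-```py
-# 768x768 is usually still fine; anything below 512x512 tends to fall apart
-image = pipeline_text2image(prompt=prompt, height=768, width=768).images[0]
-```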
-
-
-## Image-to-image
-
-For image-to-image, SDXL works especially well with image sizes between 768x768 and 1024x1024. Pass an initial image and a text prompt to condition the image with:
-
-```py
-from diffusers import AutoPipelineForImage2Image
-from diffusers.utils import load_image
-
-# use from_pipe to avoid consuming additional memory when loading a checkpoint
-pipeline = AutoPipelineForImage2Image.from_pipe(pipeline_text2image).to("cuda")
-url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-img2img.png"
-
-init_image = load_image(url).convert("RGB")
-prompt = "a dog catching a frisbee in the jungle"
-image = pipeline(prompt, image=init_image, strength=0.8, guidance_scale=10.5).images[0]
-```
-
-
- generated image of a dog catching a frisbee in a jungle -
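-
-The `strength` value controls how much noise is added to the initial image: as a rough rule, lower values stay closer to the input and higher values follow the prompt more freely. A quick sketch reusing the pipeline above:
-
-```py
-# keep more of the original composition by lowering strength
-image = pipeline(prompt, image=init_image, strength=0.5, guidance_scale=10.5).images[0]
-```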
- -## Inpainting - -For inpainting, you'll need the original image and a mask of what you want to replace in the original image. Create a prompt to describe what you want to replace the masked area with. - -```py -from diffusers import AutoPipelineForInpainting -from diffusers.utils import load_image - -# use from_pipe to avoid consuming additional memory when loading a checkpoint -pipeline = AutoPipelineForInpainting.from_pipe(pipeline_text2image).to("cuda") - -img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png" -mask_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-inpaint-mask.png" - -init_image = load_image(img_url).convert("RGB") -mask_image = load_image(mask_url).convert("RGB") - -prompt = "A deep sea diver floating" -image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.85, guidance_scale=12.5).images[0] -``` - -
- generated image of a deep sea diver in a jungle -
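To compare the inputs and the result side by side, one option is the `make_image_grid` helper from `diffusers.utils`. A small sketch, reusing `init_image`, `mask_image`, and `image` from the example above:

```py
from diffusers.utils import make_image_grid

# Sketch: show the original image, the mask, and the inpainted result in a single row.
make_image_grid([init_image, mask_image, image], rows=1, cols=3)
```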
- -## Refine image quality - -SDXL includes a [refiner model](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0) specialized in denoising low-noise stage images to generate higher-quality images from the base model. There are two ways to use the refiner: - -1. use the base and refiner model together to produce a refined image -2. use the base model to produce an image, and subsequently use the refiner model to add more details to the image (this is how SDXL is originally trained) - -### Base + refiner model - -When you use the base and refiner model together to generate an image, this is known as an ([*ensemble of expert denoisers*](https://research.nvidia.com/labs/dir/eDiff-I/)). The ensemble of expert denoisers approach requires less overall denoising steps versus passing the base model's output to the refiner model, so it should be significantly faster to run. However, you won't be able to inspect the base model's output because it still contains a large amount of noise. - -As an ensemble of expert denoisers, the base model serves as the expert during the high-noise diffusion stage and the refiner model serves as the expert during the low-noise diffusion stage. Load the base and refiner model: - -```py -from diffusers import DiffusionPipeline -import torch - -base = DiffusionPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") - -refiner = DiffusionPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-refiner-1.0", - text_encoder_2=base.text_encoder_2, - vae=base.vae, - torch_dtype=torch.float16, - use_safetensors=True, - variant="fp16", -).to("cuda") -``` - -To use this approach, you need to define the number of timesteps for each model to run through their respective stages. For the base model, this is controlled by the [`denoising_end`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.denoising_end) parameter and for the refiner model, it is controlled by the [`denoising_start`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLImg2ImgPipeline.__call__.denoising_start) parameter. - - - -The `denoising_end` and `denoising_start` parameters should be a float between 0 and 1. These parameters are represented as a proportion of discrete timesteps as defined by the scheduler. If you're also using the `strength` parameter, it'll be ignored because the number of denoising steps is determined by the discrete timesteps the model is trained on and the declared fractional cutoff. - - - -Let's set `denoising_end=0.8` so the base model performs the first 80% of denoising the **high-noise** timesteps and set `denoising_start=0.8` so the refiner model performs the last 20% of denoising the **low-noise** timesteps. The base model output should be in **latent** space instead of a PIL image. - -```py -prompt = "A majestic lion jumping from a big stone at night" - -image = base( - prompt=prompt, - num_inference_steps=40, - denoising_end=0.8, - output_type="latent", -).images -image = refiner( - prompt=prompt, - num_inference_steps=40, - denoising_start=0.8, - image=image, -).images[0] -``` - -
-
- generated image of a lion on a rock at night -
base model
-
-
- generated image of a lion on a rock at night in higher quality -
ensemble of expert denoisers
-
-
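As a rough back-of-the-envelope sketch of how the cutoff splits the work between the two models (the exact split depends on the scheduler's discrete timesteps):

```py
# Approximate step split for num_inference_steps=40 and a cutoff of 0.8:
num_inference_steps = 40
cutoff = 0.8  # denoising_end for the base model, denoising_start for the refiner

base_steps = int(num_inference_steps * cutoff)    # ~32 high-noise steps handled by the base model
refiner_steps = num_inference_steps - base_steps  # ~8 low-noise steps handled by the refiner
print(base_steps, refiner_steps)
```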
- -The refiner model can also be used for inpainting in the [`StableDiffusionXLInpaintPipeline`]: - -```py -from diffusers import StableDiffusionXLInpaintPipeline -from diffusers.utils import load_image - -base = StableDiffusionXLInpaintPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") - -refiner = StableDiffusionXLInpaintPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-refiner-1.0", - text_encoder_2=pipe.text_encoder_2, - vae=pipe.vae, - torch_dtype=torch.float16, - use_safetensors=True, - variant="fp16", -).to("cuda") - -img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" -mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" - -init_image = load_image(img_url).convert("RGB") -mask_image = load_image(mask_url).convert("RGB") - -prompt = "A majestic tiger sitting on a bench" -num_inference_steps = 75 -high_noise_frac = 0.7 - -image = base( - prompt=prompt, - image=init_image, - mask_image=mask_image, - num_inference_steps=num_inference_steps, - denoising_end=high_noise_frac, - output_type="latent", -).images -image = refiner( - prompt=prompt, - image=image, - mask_image=mask_image, - num_inference_steps=num_inference_steps, - denoising_start=high_noise_frac, -).images[0] -``` - -This ensemble of expert denoisers method works well for all available schedulers! - -### Base to refiner model - -SDXL gets a boost in image quality by using the refiner model to add additional high-quality details to the fully-denoised image from the base model, in an image-to-image setting. - -Load the base and refiner models: - -```py -from diffusers import DiffusionPipeline -import torch - -base = DiffusionPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") - -refiner = DiffusionPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-refiner-1.0", - text_encoder_2=pipe.text_encoder_2, - vae=pipe.vae, - torch_dtype=torch.float16, - use_safetensors=True, - variant="fp16", -).to("cuda") -``` - -Generate an image from the base model, and set the model output to **latent** space: - -```py -prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" - -image = base(prompt=prompt, output_type="latent").images[0] -``` - -Pass the generated image to the refiner model: - -```py -image = refiner(prompt=prompt, image=image[None, :]).images[0] -``` - -
-
- generated image of an astronaut riding a green horse on Mars -
base model
-
-
- higher quality generated image of an astronaut riding a green horse on Mars -
base model + refiner model
-
-
- -For inpainting, load the refiner model in the [`StableDiffusionXLInpaintPipeline`], remove the `denoising_end` and `denoising_start` parameters, and choose a smaller number of inference steps for the refiner. - -## Micro-conditioning - -SDXL training involves several additional conditioning techniques, which are referred to as *micro-conditioning*. These include original image size, target image size, and cropping parameters. The micro-conditionings can be used at inference time to create high-quality, centered images. - - - -You can use both micro-conditioning and negative micro-conditioning parameters thanks to classifier-free guidance. They are available in the [`StableDiffusionXLPipeline`], [`StableDiffusionXLImg2ImgPipeline`], [`StableDiffusionXLInpaintPipeline`], and [`StableDiffusionXLControlNetPipeline`]. - - - -### Size conditioning - -There are two types of size conditioning: - -- [`original_size`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.original_size) conditioning comes from upscaled images in the training batch (because it would be wasteful to discard the smaller images which make up almost 40% of the total training data). This way, SDXL learns that upscaling artifacts are not supposed to be present in high-resolution images. During inference, you can use `original_size` to indicate the original image resolution. Using the default value of `(1024, 1024)` produces higher-quality images that resemble the 1024x1024 images in the dataset. If you choose to use a lower resolution, such as `(256, 256)`, the model still generates 1024x1024 images, but they'll look like the low resolution images (simpler patterns, blurring) in the dataset. - -- [`target_size`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.target_size) conditioning comes from finetuning SDXL to support different image aspect ratios. During inference, if you use the default value of `(1024, 1024)`, you'll get an image that resembles the composition of square images in the dataset. We recommend using the same value for `target_size` and `original_size`, but feel free to experiment with other options! - -🀗 Diffusers also lets you specify negative conditions about an image's size to steer generation away from certain image resolutions: - -```py -from diffusers import StableDiffusionXLPipeline -import torch - -pipe = StableDiffusionXLPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") - -prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" -image = pipe( - prompt=prompt, - negative_original_size=(512, 512), - negative_target_size=(1024, 1024), -).images[0] -``` - -
- -
Images negative conditioned on image resolutions of (128, 128), (256, 256), and (512, 512).
-
- -### Crop conditioning - -Images generated by previous Stable Diffusion models may sometimes appear to be cropped. This is because images are actually cropped during training so that all the images in a batch have the same size. By conditioning on crop coordinates, SDXL *learns* that no cropping - coordinates `(0, 0)` - usually correlates with centered subjects and complete faces (this is the default value in 🀗 Diffusers). You can experiment with different coordinates if you want to generate off-centered compositions! - -```py -from diffusers import StableDiffusionXLPipeline -import torch - - -pipeline = StableDiffusionXLPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") - -prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" -image = pipeline(prompt=prompt, crops_coords_top_left=(256,0)).images[0] -``` - -
- generated image of an astronaut in a jungle, slightly cropped -
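The micro-conditioning parameters can also be combined in a single call. A sketch reusing the pipeline above (the values here are arbitrary and only meant to illustrate the parameters):

```py
# Sketch: combine size and crop conditioning in one generation call.
image = pipeline(
    prompt="Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
    original_size=(1024, 1024),    # condition on a full-resolution "original" image
    target_size=(1024, 1024),      # aim for the composition of square 1024x1024 images
    crops_coords_top_left=(0, 0),  # (0, 0) biases toward centered, uncropped compositions
).images[0]
```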
- -You can also specify negative cropping coordinates to steer generation away from certain cropping parameters: - -```py -from diffusers import StableDiffusionXLPipeline -import torch - -pipe = StableDiffusionXLPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") - -prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" -image = pipe( - prompt=prompt, - negative_original_size=(512, 512), - negative_crops_coords_top_left=(0, 0), - negative_target_size=(1024, 1024), -).images[0] -``` - -## Use a different prompt for each text-encoder - -SDXL uses two text-encoders, so it is possible to pass a different prompt to each text-encoder, which can [improve quality](https://github.com/huggingface/diffusers/issues/4004#issuecomment-1627764201). Pass your original prompt to `prompt` and the second prompt to `prompt_2` (use `negative_prompt` and `negative_prompt_2` if you're using a negative prompts): - -```py -from diffusers import StableDiffusionXLPipeline -import torch - -pipeline = StableDiffusionXLPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") - -# prompt is passed to OAI CLIP-ViT/L-14 -prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" -# prompt_2 is passed to OpenCLIP-ViT/bigG-14 -prompt_2 = "Van Gogh painting" -image = pipeline(prompt=prompt, prompt_2=prompt_2).images[0] -``` - -
- generated image of an astronaut in a jungle in the style of a van gogh painting -
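Negative prompts follow the same split: `negative_prompt` is paired with the first text-encoder and `negative_prompt_2` with the second. A sketch reusing the pipeline above (the negative prompts here are arbitrary examples):

```py
# Sketch: pass a separate negative prompt to each text-encoder.
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
prompt_2 = "Van Gogh painting"

image = pipeline(
    prompt=prompt,
    prompt_2=prompt_2,
    negative_prompt="lowres, bad anatomy, worst quality",  # goes to OAI CLIP-ViT/L-14
    negative_prompt_2="photorealistic",                    # goes to OpenCLIP-ViT/bigG-14
).images[0]
```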
- -The dual text-encoders also support textual inversion embeddings that need to be loaded separately as explained in the [SDXL textual inversion](textual_inversion_inference#stable-diffusion-xl] section. - -## Optimizations - -SDXL is a large model, and you may need to optimize memory to get it to run on your hardware. Here are some tips to save memory and speed up inference. - -1. Offload the model to the CPU with [`~StableDiffusionXLPipeline.enable_model_cpu_offload`] for out-of-memory errors: - -```diff -- base.to("cuda") -- refiner.to("cuda") -+ base.enable_model_cpu_offload -+ refiner.enable_model_cpu_offload -``` - -2. Use `torch.compile` for ~20% speed-up (you need `torch>2.0`): - -```diff -+ base.unet = torch.compile(base.unet, mode="reduce-overhead", fullgraph=True) -+ refiner.unet = torch.compile(refiner.unet, mode="reduce-overhead", fullgraph=True) -``` - -3. Enable [xFormers](/optimization/xformers) to run SDXL if `torch<2.0`: - -```diff -+ base.enable_xformers_memory_efficient_attention() -+ refiner.enable_xformers_memory_efficient_attention() -``` - -## Other resources - -If you're interested in experimenting with a minimal version of the [`UNet2DConditionModel`] used in SDXL, take a look at the [minSDXL](https://github.com/cloneofsimo/minSDXL) implementation which is written in PyTorch and directly compatible with 🀗 Diffusers. From 81d374b644a868b869076555842bdb0e55623fe2 Mon Sep 17 00:00:00 2001 From: isamu-isozaki Date: Sun, 22 Oct 2023 01:09:59 -0400 Subject: [PATCH 4/9] Finished quicktour --- docs/source/jp/quicktour.md | 126 ++++++++++++++++++------------------ 1 file changed, 64 insertions(+), 62 deletions(-) diff --git a/docs/source/jp/quicktour.md b/docs/source/jp/quicktour.md index e8b0be9d8a1c..bef9b930d31e 100644 --- a/docs/source/jp/quicktour.md +++ b/docs/source/jp/quicktour.md @@ -14,7 +14,7 @@ specific language governing permissions and limitations under the License. # 簡単な案内 -拡散モデル(Diffusion Model)は、ランダムなガりスノむズを段階的にノむズ陀去するように孊習され、画像や音声などの目的のものを生成できたす。これは生成AIに倚倧な関心を呌び起こしたした。むンタヌネット䞊で拡散によっお生成された画像の䟋を芋たこずがあるでしょう。🧚 Diffusersは、誰もが拡散モデルに広くアクセスできるようにするこずを目的ずしたラむブラリです。 +拡散モデル(Diffusion Model)は、ランダムな正芏分垃から段階的にノむズ陀去するように孊習され、画像や音声などの目的のものを生成できたす。これは生成AIに倚倧な関心を呌び起こしたした。むンタヌネット䞊で拡散によっお生成された画像の䟋を芋たこずがあるでしょう。🧚 Diffusersは、誰もが拡散モデルに広くアクセスできるようにするこずを目的ずしたラむブラリです。 この案内では、開発者たたは日垞的なナヌザヌに関わらず、🧚 Diffusers を玹介し、玠早く目的のものを生成できるようにしたすこのラむブラリには3぀の䞻芁コンポヌネントがありたす: @@ -44,33 +44,32 @@ specific language governing permissions and limitations under the License. 
[`DiffusionPipeline`]は事前孊習された拡散システムを生成に䜿甚する最も簡単な方法です。これはモデルずスケゞュヌラを含む゚ンドツヌ゚ンドのシステムです。[`DiffusionPipeline`]は倚くの䜜業タスクにすぐに䜿甚するこずができたす。たた、サポヌトされおいるタスクの完党なリストに぀いおは[🧚Diffusersの抂芁](./api/pipelines/overview#diffusers-summary)の衚を参照しおください。 -| **Task** | **Description** | **Pipeline** +| **タスク** | **説明** | **パむプラむン** |------------------------------|--------------------------------------------------------------------------------------------------------------|-----------------| -| Unconditional Image Generation | generate an image from Gaussian noise | [unconditional_image_generation](./using-diffusers/unconditional_image_generation) | -| Text-Guided Image Generation | generate an image given a text prompt | [conditional_image_generation](./using-diffusers/conditional_image_generation) | -| Text-Guided Image-to-Image Translation | adapt an image guided by a text prompt | [img2img](./using-diffusers/img2img) | -| Text-Guided Image-Inpainting | fill the masked part of an image given the image, the mask and a text prompt | [inpaint](./using-diffusers/inpaint) | -| Text-Guided Depth-to-Image Translation | adapt parts of an image guided by a text prompt while preserving structure via depth estimation | [depth2img](./using-diffusers/depth2img) | +| Unconditional Image Generation | 正芏分垃から画像生成 | [unconditional_image_generation](./using-diffusers/unconditional_image_generation) | +| Text-Guided Image Generation | 文章から画像生成 | [conditional_image_generation](./using-diffusers/conditional_image_generation) | +| Text-Guided Image-to-Image Translation | 画像ず文章から新たな画像生成 | [img2img](./using-diffusers/img2img) | +| Text-Guided Image-Inpainting | 画像、マスク、および文章が指定された堎合に、画像のマスクされた郚分を文章をもずに修埩 | [inpaint](./using-diffusers/inpaint) | +| Text-Guided Depth-to-Image Translation | 文章ず深床掚定によっお構造を保持しながら画像生成 | [depth2img](./using-diffusers/depth2img) | -Start by creating an instance of a [`DiffusionPipeline`] and specify which pipeline checkpoint you would like to download. -You can use the [`DiffusionPipeline`] for any [checkpoint](https://huggingface.co/models?library=diffusers&sort=downloads) stored on the Hugging Face Hub. -In this quicktour, you'll load the [`stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) checkpoint for text-to-image generation. +たず、[`DiffusionPipeline`]のむンスタンスを䜜成し、ダりンロヌドしたいパむプラむンのチェックポむントを指定したす。 +この[`DiffusionPipeline`]はHugging Face Hubに保存されおいる任意の[チェックポむント](https://huggingface.co/models?library=diffusers&sort=downloads)を䜿甚するこずができたす。 +この案内では、[`stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5)チェックポむントでテキストから画像ぞ生成したす。 -For [Stable Diffusion](https://huggingface.co/CompVis/stable-diffusion) models, please carefully read the [license](https://huggingface.co/spaces/CompVis/stable-diffusion-license) first before running the model. 🧚 Diffusers implements a [`safety_checker`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/safety_checker.py) to prevent offensive or harmful content, but the model's improved image generation capabilities can still produce potentially harmful content. 
+[Stable Diffusion]モデルに぀いおは、モデルを実行する前にたず[ラむセンス](https://huggingface.co/spaces/CompVis/stable-diffusion-license)を泚意深くお読みください。🧚 Diffusers は、攻撃的たたは有害なコンテンツを防ぐために [`safety_checker`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/safety_checker.py) を実装しおいたすが、モデルの改良された画像生成機胜により、朜圚的に有害なコンテンツが生成される可胜性がありたす。 -Load the model with the [`~DiffusionPipeline.from_pretrained`] method: +モデルを[`~DiffusionPipeline.from_pretrained`]メ゜ッドでロヌドしたす ```python >>> from diffusers import DiffusionPipeline >>> pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True) ``` - -The [`DiffusionPipeline`] downloads and caches all modeling, tokenization, and scheduling components. You'll see that the Stable Diffusion pipeline is composed of the [`UNet2DConditionModel`] and [`PNDMScheduler`] among other things: +[`DiffusionPipeline`]は党おのモデリング、トヌクン化、スケゞュヌリングコンポヌネントをダりンロヌドしおキャッシュしたす。Stable Diffusionパむプラむンは[`UNet2DConditionModel`]ず[`PNDMScheduler`]などで構成されおいたす ```py >>> pipeline @@ -94,14 +93,14 @@ StableDiffusionPipeline { } ``` -We strongly recommend running the pipeline on a GPU because the model consists of roughly 1.4 billion parameters. -You can move the generator object to a GPU, just like you would in PyTorch: +このモデルはおよそ14億個のパラメヌタで構成されおいるため、GPU䞊でパむプラむンを実行するこずを匷く掚奚したす。 +PyTorchず同じように、ゞェネレヌタオブゞェクトをGPUに移すこずができたす ```python >>> pipeline.to("cuda") ``` -Now you can pass a text prompt to the `pipeline` to generate an image, and then access the denoised image. By default, the image output is wrapped in a [`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html?highlight=image#the-image-class) object. +これで、文章を `pipeline` に枡しお画像を生成し、ノむズ陀去された画像にアクセスできるようになりたした。デフォルトでは、画像出力は[`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html?highlight=image#the-image-class)オブゞェクトでラップされたす。 ```python >>> image = pipeline("An image of a squirrel in Picasso style").images[0] @@ -112,32 +111,32 @@ Now you can pass a text prompt to the `pipeline` to generate an image, and then -Save the image by calling `save`: +`save`関数で画像を保存できたす: ```python >>> image.save("image_of_squirrel_painting.png") ``` -### Local pipeline +### ロヌカルパむプラむン -You can also use the pipeline locally. The only difference is you need to download the weights first: +ロヌカルでパむプラむンを䜿甚するこずもできたす。唯䞀の違いは、最初にりェむトをダりンロヌドする必芁があるこずです ```bash !git lfs install !git clone https://huggingface.co/runwayml/stable-diffusion-v1-5 ``` -Then load the saved weights into the pipeline: +保存したりェむトをパむプラむンにロヌドしたす ```python >>> pipeline = DiffusionPipeline.from_pretrained("./stable-diffusion-v1-5", use_safetensors=True) ``` -Now you can run the pipeline as you would in the section above. +これで、䞊のセクションず同じようにパむプラむンを動かすこずができたす。 -### Swapping schedulers +### スケゞュヌラの亀換 -Different schedulers come with different denoising speeds and quality trade-offs. The best way to find out which one works best for you is to try them out! One of the main features of 🧚 Diffusers is to allow you to easily switch between schedulers. 
For example, to replace the default [`PNDMScheduler`] with the [`EulerDiscreteScheduler`], load it with the [`~diffusers.ConfigMixin.from_config`] method: +スケゞュヌラヌによっお、ノむズ陀去のスピヌドや品質のトレヌドオフが異なりたす。どれが自分に最適かを知る最善の方法は、実際に詊しおみるこずですDiffusers 🧚の䞻な機胜の1぀は、スケゞュヌラを簡単に切り替えるこずができるこずです。䟋えば、デフォルトの[`PNDMScheduler`]を[`EulerDiscreteScheduler`]に眮き換えるには、[`~diffusers.ConfigMixin.from_config`]メ゜ッドでロヌドできたす ```py >>> from diffusers import EulerDiscreteScheduler @@ -146,15 +145,15 @@ Different schedulers come with different denoising speeds and quality trade-offs >>> pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config) ``` -Try generating an image with the new scheduler and see if you notice a difference! +新しいスケゞュヌラを䜿っお画像を生成し、その違いに気づくかどうか詊しおみおください -In the next section, you'll take a closer look at the components - the model and scheduler - that make up the [`DiffusionPipeline`] and learn how to use these components to generate an image of a cat. +次のセクションでは、[`DiffusionPipeline`]を構成するコンポヌネントモデルずスケゞュヌラを詳しく芋お、これらのコンポヌネントを䜿っお猫の画像を生成する方法を孊びたす。 -## Models +## モデル -Most models take a noisy sample, and at each timestep it predicts the *noise residual* (other models learn to predict the previous sample directly or the velocity or [`v-prediction`](https://github.com/huggingface/diffusers/blob/5e5ce13e2f89ac45a0066cb3f369462a3cf1d9ef/src/diffusers/schedulers/scheduling_ddim.py#L110)), the difference between a less noisy image and the input image. You can mix and match models to create other diffusion systems. +ほずんどのモデルはノむズの倚いサンプルを取り、各タむムステップで*残りのノむズ*を予枬したす他のモデルは前のサンプルを盎接予枬するか、速床たたは[`v-prediction`](https://github.com/huggingface/diffusers/blob/5e5ce13e2f89ac45a0066cb3f369462a3cf1d9ef/src/diffusers/schedulers/scheduling_ddim.py#L110)を予枬するように孊習したす。モデルを混ぜお他の拡散システムを䜜るこずもできたす。 -Models are initiated with the [`~ModelMixin.from_pretrained`] method which also locally caches the model weights so it is faster the next time you load the model. For the quicktour, you'll load the [`UNet2DModel`], a basic unconditional image generation model with a checkpoint trained on cat images: +モデルは[`~ModelMixin.from_pretrained`]メ゜ッドで開始されたす。このメ゜ッドはモデルをロヌカルにキャッシュするので、次にモデルをロヌドするずきに高速になりたす。この案内では、[`UNet2DModel`]をロヌドしたす。これは基本的な画像生成モデルであり、猫画像で孊習されたチェックポむントを䜿いたす ```py >>> from diffusers import UNet2DModel @@ -163,23 +162,23 @@ Models are initiated with the [`~ModelMixin.from_pretrained`] method which also >>> model = UNet2DModel.from_pretrained(repo_id, use_safetensors=True) ``` -To access the model parameters, call `model.config`: +モデルのパラメヌタにアクセスするには、`model.config` を呌び出せたす ```py >>> model.config ``` -The model configuration is a 🧊 frozen 🧊 dictionary, which means those parameters can't be changed after the model is created. This is intentional and ensures that the parameters used to define the model architecture at the start remain the same, while other parameters can still be adjusted during inference. +モデル構成は🧊凍結🧊されたディクショナリであり、モデル䜜成埌にこれらのパラメヌ タを倉曎するこずはできたせん。これは意図的なもので、最初にモデル・アヌキテクチャを定矩するために䜿甚されるパラメヌタが同じたたであるこずを保蚌したす。他のパラメヌタは生成䞭に調敎するこずができたす。 -Some of the most important parameters are: +最も重芁なパラメヌタは以䞋の通りです -* `sample_size`: the height and width dimension of the input sample. -* `in_channels`: the number of input channels of the input sample. -* `down_block_types` and `up_block_types`: the type of down- and upsampling blocks used to create the UNet architecture. 
-* `block_out_channels`: the number of output channels of the downsampling blocks; also used in reverse order for the number of input channels of the upsampling blocks. -* `layers_per_block`: the number of ResNet blocks present in each UNet block. +* sample_size`: 入力サンプルの高さず幅。 +* `in_channels`: 入力サンプルの入力チャンネル数。 +* down_block_types` ず `up_block_types`: UNet アヌキテクチャを䜜成するために䜿甚されるダりンサンプリングブロックずアップサンプリングブロックのタむプ。 +* block_out_channels`: ダりンサンプリングブロックの出力チャンネル数。逆順でアップサンプリングブロックの入力チャンネル数にも䜿甚されたす。 +* layer_per_block`: 各 UNet ブロックに含たれる ResNet ブロックの数。 -To use the model for inference, create the image shape with random Gaussian noise. It should have a `batch` axis because the model can receive multiple random noises, a `channel` axis corresponding to the number of input channels, and a `sample_size` axis for the height and width of the image: +このモデルを生成に䜿甚するには、ランダムな画像の圢の正芏分垃を䜜成したす。このモデルは耇数のランダムな正芏分垃を受け取るこずができるため`batch`軞を入れたす。入力チャンネル数に察応する`channel`軞も必芁です。画像の高さず幅に察応する`sample_size`軞を持぀必芁がありたす ```py >>> import torch @@ -191,26 +190,27 @@ To use the model for inference, create the image shape with random Gaussian nois torch.Size([1, 3, 256, 256]) ``` -For inference, pass the noisy image to the model and a `timestep`. The `timestep` indicates how noisy the input image is, with more noise at the beginning and less at the end. This helps the model determine its position in the diffusion process, whether it is closer to the start or the end. Use the `sample` method to get the model output: +画像生成には、ノむズの倚い画像ず `timestep` をモデルに枡したす。`timestep`は入力画像がどの皋床ノむズが倚いかを瀺したす。これは、モデルが拡散プロセスにおける自分の䜍眮を決定するのに圹立ちたす。モデルの出力を埗るには `sample` メ゜ッドを䜿甚したす ```py >>> with torch.no_grad(): ... noisy_residual = model(sample=noisy_sample, timestep=2).sample ``` -To generate actual examples though, you'll need a scheduler to guide the denoising process. In the next section, you'll learn how to couple a model with a scheduler. +しかし、実際の䟋を生成するには、ノむズ陀去プロセスをガむドするスケゞュヌラが必芁です。次のセクションでは、モデルをスケゞュヌラず組み合わせる方法を孊びたす。 + +## スケゞュヌラ -## Schedulers +スケゞュヌラは、モデルの出力この堎合は `noisy_residual` が䞎えられたずきに、ノむズの倚いサンプルからノむズの少ないサンプルぞの移行を管理したす。 -Schedulers manage going from a noisy sample to a less noisy sample given the model output - in this case, it is the `noisy_residual`. -🧚 Diffusers is a toolbox for building diffusion systems. While the [`DiffusionPipeline`] is a convenient way to get started with a pre-built diffusion system, you can also choose your own model and scheduler components separately to build a custom diffusion system. +🧚 Diffusersは拡散システムを構築するためのツヌルボックスです。[`DiffusionPipeline`]は事前に構築された拡散システムを䜿い始めるのに䟿利な方法ですが、独自のモデルずスケゞュヌラコンポヌネントを個別に遞択しおカスタム拡散システムを構築するこずもできたす。 -For the quicktour, you'll instantiate the [`DDPMScheduler`] with it's [`~diffusers.ConfigMixin.from_config`] method: +この案内では、[`DDPMScheduler`]を[`~diffusers.ConfigMixin.from_config`]メ゜ッドでむンスタンス化したす ```py >>> from diffusers import DDPMScheduler @@ -234,26 +234,28 @@ DDPMScheduler { -💡 Notice how the scheduler is instantiated from a configuration. Unlike a model, a scheduler does not have trainable weights and is parameter-free! +💡 スケゞュヌラがどのようにコンフィギュレヌションからむンスタンス化されるかに泚目しおください。モデルずは異なり、スケゞュヌラは孊習可胜な重みを持たず、パラメヌタヌを持ちたせん -Some of the most important parameters are: +最も重芁なパラメヌタは以䞋の通りです -* `num_train_timesteps`: the length of the denoising process or in other words, the number of timesteps required to process random Gaussian noise into a data sample. -* `beta_schedule`: the type of noise schedule to use for inference and training. 
-* `beta_start` and `beta_end`: the start and end noise values for the noise schedule. +* num_train_timesteps`: ノむズ陀去凊理の長さ、蚀い換えれば、ランダムな正芏分垃をデヌタサンプルに凊理するのに必芁なタむムステップ数です。 +* `beta_schedule`: 生成ずトレヌニングに䜿甚するノむズスケゞュヌルのタむプ。 +* `beta_start` ず `beta_end`: ノむズスケゞュヌルの開始倀ず終了倀。 -To predict a slightly less noisy image, pass the following to the scheduler's [`~diffusers.DDPMScheduler.step`] method: model output, `timestep`, and current `sample`. +少しノむズの少ない画像を予枬するには、スケゞュヌラの [`~diffusers.DDPMScheduler.step`] メ゜ッドに以䞋を枡したす: モデルの出力、`timestep`、珟圚の `sample`。 ```py >>> less_noisy_sample = scheduler.step(model_output=noisy_residual, timestep=2, sample=noisy_sample).prev_sample >>> less_noisy_sample.shape ``` -The `less_noisy_sample` can be passed to the next `timestep` where it'll get even less noisier! Let's bring it all together now and visualize the entire denoising process. +`less_noisy_sample`は次の`timestep`に枡すこずができ、そこでさらにノむズが少なくなりたす + +では、すべおをたずめお、ノむズ陀去プロセス党䜓を芖芚化しおみたしょう。 -First, create a function that postprocesses and displays the denoised image as a `PIL.Image`: +たず、ノむズ陀去された画像を埌凊理しお `PIL.Image` ずしお衚瀺する関数を䜜成したす ```py >>> import PIL.Image @@ -270,14 +272,14 @@ First, create a function that postprocesses and displays the denoised image as a ... display(image_pil) ``` -To speed up the denoising process, move the input and model to a GPU: +ノむズ陀去凊理を高速化するために入力ずモデルをGPUに移したす ```py >>> model.to("cuda") >>> noisy_sample = noisy_sample.to("cuda") ``` -Now create a denoising loop that predicts the residual of the less noisy sample, and computes the less noisy sample with the scheduler: +ここで、ノむズが少なくなったサンプルの残りのノむズを予枬するノむズ陀去ルヌプを䜜成し、スケゞュヌラを䜿っおさらにノむズの少ないサンプルを蚈算したす ```py >>> import tqdm @@ -297,18 +299,18 @@ Now create a denoising loop that predicts the residual of the less noisy sample, ... display_sample(sample, i + 1) ``` -Sit back and watch as a cat is generated from nothing but noise! 😻 +䜕もないずころから猫が生成されるのを、座っお芋おください😻
-## Next steps +## 次のステップ -Hopefully you generated some cool images with 🧚 Diffusers in this quicktour! For your next steps, you can: +このクむックツアヌで、🧚ディフュヌザヌを䜿ったクヌルな画像をいく぀か䜜成できたず思いたす次のステップずしお -* Train or finetune a model to generate your own images in the [training](./tutorials/basic_training) tutorial. -* See example official and community [training or finetuning scripts](https://github.com/huggingface/diffusers/tree/main/examples#-diffusers-examples) for a variety of use cases. -* Learn more about loading, accessing, changing and comparing schedulers in the [Using different Schedulers](./using-diffusers/schedulers) guide. -* Explore prompt engineering, speed and memory optimizations, and tips and tricks for generating higher quality images with the [Stable Diffusion](./stable_diffusion) guide. -* Dive deeper into speeding up 🧚 Diffusers with guides on [optimized PyTorch on a GPU](./optimization/fp16), and inference guides for running [Stable Diffusion on Apple Silicon (M1/M2)](./optimization/mps) and [ONNX Runtime](./optimization/onnx). +* モデルをトレヌニングたたは埮調敎に぀いおは、[training](./tutorials/basic_training)チュヌトリアルを参照しおください。 +* 様々な䜿甚䟋に぀いおは、公匏およびコミュニティの[training or finetuning scripts](https://github.com/huggingface/diffusers/tree/main/examples#-diffusers-examples)の䟋を参照しおください。 +* スケゞュヌラのロヌド、アクセス、倉曎、比范に぀いおは[Using different Schedulers](./using-diffusers/schedulers)ガむドを参照しおください。 +* プロンプト゚ンゞニアリング、スピヌドずメモリの最適化、より高品質な画像を生成するためのヒントやトリックに぀いおは、[Stable Diffusion](./stable_diffusion)ガむドを参照しおください。 +* 🧚 Diffusers の高速化に぀いおは、最適化された [PyTorch on a GPU](./optimization/fp16)のガむド、[Stable Diffusion on Apple Silicon (M1/M2)](./optimization/mps)ず[ONNX Runtime](./optimization/onnx)を参照しおください。 From dfd04b9ff54031fc4e3a3110c81daf26f49af8df Mon Sep 17 00:00:00 2001 From: isamu-isozaki Date: Sun, 22 Oct 2023 01:35:57 -0400 Subject: [PATCH 5/9] Finished stable diffusion doc --- docs/source/jp/stable_diffusion.md | 90 +++++++++++++++--------------- 1 file changed, 45 insertions(+), 45 deletions(-) diff --git a/docs/source/jp/stable_diffusion.md b/docs/source/jp/stable_diffusion.md index 31d5f9dc6bb8..fb5afc49435b 100644 --- a/docs/source/jp/stable_diffusion.md +++ b/docs/source/jp/stable_diffusion.md @@ -9,18 +9,18 @@ Unless required by applicable law or agreed to in writing, software distributed an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. --> - -# Effective and efficient diffusion + +# 効果的で効率的な拡散モデル [[open-in-colab]] -Getting the [`DiffusionPipeline`] to generate images in a certain style or include what you want can be tricky. Often times, you have to run the [`DiffusionPipeline`] several times before you end up with an image you're happy with. But generating something out of nothing is a computationally intensive process, especially if you're running inference over and over again. +[`DiffusionPipeline`]を䜿っお特定のスタむルで画像を生成したり、垌望する画像を生成したりするのは難しいこずです。倚くの堎合、[`DiffusionPipeline`]を䜕床か実行しおからでないず満足のいく画像は埗られたせん。しかし、䜕もないずころから䜕かを生成するにはたくさんの蚈算が必芁です。生成を䜕床も䜕床も実行する堎合、特にたくさんの蚈算量が必芁になりたす。 -This is why it's important to get the most *computational* (speed) and *memory* (GPU RAM) efficiency from the pipeline to reduce the time between inference cycles so you can iterate faster. +そのため、パむプラむンから*蚈算*速床ず*メモリ*GPU RAMの効率を最倧限に匕き出し、生成サむクル間の時間を短瞮するこずで、より高速な反埩凊理を行えるようにするこずが重芁です。 -This tutorial walks you through how to generate faster and better with the [`DiffusionPipeline`]. 
+このチュヌトリアルでは、[`DiffusionPipeline`]を甚いお、より速く、より良い蚈算を行う方法を説明したす。 -Begin by loading the [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) model: +たず、[`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5)モデルをロヌドしたす ```python from diffusers import DiffusionPipeline @@ -29,7 +29,7 @@ model_id = "runwayml/stable-diffusion-v1-5" pipeline = DiffusionPipeline.from_pretrained(model_id, use_safetensors=True) ``` -The example prompt you'll use is a portrait of an old warrior chief, but feel free to use your own prompt: +ここで䜿甚するプロンプトの䟋は幎老いた戊士の長の肖像画ですが、ご自由に倉曎しおください ```python prompt = "portrait photo of a old warrior chief" @@ -39,17 +39,17 @@ prompt = "portrait photo of a old warrior chief" -💡 If you don't have access to a GPU, you can use one for free from a GPU provider like [Colab](https://colab.research.google.com/)! +💡 GPUを利甚できない堎合は、[Colab](https://colab.research.google.com/)のようなGPUプロバむダヌから無料で利甚できたす -One of the simplest ways to speed up inference is to place the pipeline on a GPU the same way you would with any PyTorch module: +画像生成を高速化する最も簡単な方法の1぀は、PyTorchモゞュヌルず同じようにGPU䞊にパむプラむンを配眮するこずです ```python pipeline = pipeline.to("cuda") ``` -To make sure you can use the same image and improve on it, use a [`Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) and set a seed for [reproducibility](./using-diffusers/reproducibility): +同じむメヌゞを䜿っお改良できるようにするには、[`Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html)を䜿い、[reproducibility](./using-diffusers/reproducibility)の皮を蚭定したす ```python import torch @@ -57,7 +57,7 @@ import torch generator = torch.Generator("cuda").manual_seed(0) ``` -Now you can generate an image: +これで画像を生成できたす ```python image = pipeline(prompt, generator=generator).images[0] @@ -68,9 +68,9 @@ image -This process took ~30 seconds on a T4 GPU (it might be faster if your allocated GPU is better than a T4). By default, the [`DiffusionPipeline`] runs inference with full `float32` precision for 50 inference steps. You can speed this up by switching to a lower precision like `float16` or running fewer inference steps. +この凊理にはT4 GPUで~30秒かかりたした割り圓おられおいるGPUがT4より優れおいる堎合はもっず速いかもしれたせん。デフォルトでは、[`DiffusionPipeline`]は完党な`float32`粟床で生成を50ステップ実行したす。float16`のような䜎い粟床に倉曎するか、掚論ステップ数を枛らすこずで高速化するこずができたす。 -Let's start by loading the model in `float16` and generate an image: +たずは `float16` でモデルをロヌドしお画像を生成しおみたしょう ```python import torch @@ -86,15 +86,15 @@ image -This time, it only took ~11 seconds to generate the image, which is almost 3x faster than before! +今回、画像生成にかかった時間はわずか11秒で、以前より3倍近く速くなりたした -💡 We strongly suggest always running your pipelines in `float16`, and so far, we've rarely seen any degradation in output quality. +💡 パむプラむンは垞に `float16` で実行するこずを匷くお勧めしたす。 -Another option is to reduce the number of inference steps. Choosing a more efficient scheduler could help decrease the number of steps without sacrificing output quality. 
You can find which schedulers are compatible with the current model in the [`DiffusionPipeline`] by calling the `compatibles` method: +生成ステップ数を枛らすずいう方法もありたす。より効率的なスケゞュヌラを遞択するこずで、出力品質を犠牲にするこずなくステップ数を枛らすこずができたす。`compatibles`メ゜ッドを呌び出すこずで、[`DiffusionPipeline`]の珟圚のモデルず互換性のあるスケゞュヌラを芋぀けるこずができたす ```python pipeline.scheduler.compatibles @@ -115,7 +115,7 @@ pipeline.scheduler.compatibles ] ``` -The Stable Diffusion model uses the [`PNDMScheduler`] by default which usually requires ~50 inference steps, but more performant schedulers like [`DPMSolverMultistepScheduler`], require only ~20 or 25 inference steps. Use the [`ConfigMixin.from_config`] method to load a new scheduler: +Stable Diffusionモデルはデフォルトで[`PNDMScheduler`]を䜿甚したす。このスケゞュヌラは通垞~50の掚論ステップを必芁ずしたすが、[`DPMSolverMultistepScheduler`]のような高性胜なスケゞュヌラでは~20たたは25の掚論ステップで枈みたす。[`ConfigMixin.from_config`]メ゜ッドを䜿甚するず、新しいスケゞュヌラをロヌドするこずができたす ```python from diffusers import DPMSolverMultistepScheduler @@ -123,7 +123,7 @@ from diffusers import DPMSolverMultistepScheduler pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config) ``` -Now set the `num_inference_steps` to 20: +ここで `num_inference_steps` を20に蚭定したす ```python generator = torch.Generator("cuda").manual_seed(0) @@ -135,13 +135,13 @@ image -Great, you've managed to cut the inference time to just 4 seconds! ⚡ +掚論時間をわずか4秒に短瞮するこずに成功した⚡ -## Memory +## メモリヌ -The other key to improving pipeline performance is consuming less memory, which indirectly implies more speed, since you're often trying to maximize the number of images generated per second. The easiest way to see how many images you can generate at once is to try out different batch sizes until you get an `OutOfMemoryError` (OOM). +パむプラむンのパフォヌマンスを向䞊させるもう1぀の鍵は、消費メモリを少なくするこずです。䞀床に生成できる画像の数を確認する最も簡単な方法は、`OutOfMemoryError`OOMが発生するたで、さたざたなバッチサむズを詊しおみるこずです。 -Create a function that'll generate a batch of images from a list of prompts and `Generators`. Make sure to assign each `Generator` a seed so you can reuse it if it produces a good result. +文章ず `Generators` のリストから画像のバッチを生成する関数を䜜成したす。各 `Generator` にシヌドを割り圓おお、良い結果が埗られた堎合に再利甚できるようにしたす。 ```python def get_inputs(batch_size=1): @@ -152,7 +152,7 @@ def get_inputs(batch_size=1): return {"prompt": prompts, "generator": generator, "num_inference_steps": num_inference_steps} ``` -Start with `batch_size=4` and see how much memory you've consumed: +`batch_size=4`で開始し、どれだけメモリを消費したかを確認したす ```python from diffusers.utils import make_image_grid @@ -161,13 +161,13 @@ images = pipeline(**get_inputs(batch_size=4)).images make_image_grid(images, 2, 2) ``` -Unless you have a GPU with more RAM, the code above probably returned an `OOM` error! Most of the memory is taken up by the cross-attention layers. Instead of running this operation in a batch, you can run it sequentially to save a significant amount of memory. All you have to do is configure the pipeline to use the [`~DiffusionPipeline.enable_attention_slicing`] function: +倧容量のRAMを搭茉したGPUでない限り、䞊蚘のコヌドはおそらく`OOM`゚ラヌを返したはずですメモリの倧半はクロスアテンションレむダヌが占めおいたす。この凊理をバッチで実行する代わりに、逐次実行するこずでメモリを倧幅に節玄できたす。必芁なのは、[`~DiffusionPipeline.enable_attention_slicing`]関数を䜿甚するこずだけです ```python pipeline.enable_attention_slicing() ``` -Now try increasing the `batch_size` to 8! +今床は`batch_size`を8にしおみおください ```python images = pipeline(**get_inputs(batch_size=8)).images @@ -178,21 +178,21 @@ make_image_grid(images, rows=2, cols=4) -Whereas before you couldn't even generate a batch of 4 images, now you can generate a batch of 8 images at ~3.5 seconds per image! 
This is probably the fastest you can go on a T4 GPU without sacrificing quality. +以前は4枚の画像のバッチを生成するこずさえできたせんでしたが、今では8枚の画像のバッチを1枚あたり3.5秒で生成できたすこれはおそらく、品質を犠牲にするこずなくT4 GPUでできる最速の凊理速床です。 -## Quality +## 品質 -In the last two sections, you learned how to optimize the speed of your pipeline by using `fp16`, reducing the number of inference steps by using a more performant scheduler, and enabling attention slicing to reduce memory consumption. Now you're going to focus on how to improve the quality of generated images. +前の2぀のセクションでは、`fp16` を䜿っおパむプラむンの速床を最適化する方法、よりパフォヌマン スなスケゞュヌラヌを䜿っお生成ステップ数を枛らす方法、アテンションスラむスを有効 にしおメモリ消費量を枛らす方法に぀いお孊びたした。今床は、生成される画像の品質を向䞊させる方法に焊点を圓おたす。 -### Better checkpoints +### より良いチェックポむント -The most obvious step is to use better checkpoints. The Stable Diffusion model is a good starting point, and since its official launch, several improved versions have also been released. However, using a newer version doesn't automatically mean you'll get better results. You'll still have to experiment with different checkpoints yourself, and do a little research (such as using [negative prompts](https://minimaxir.com/2022/11/stable-diffusion-negative-prompt/)) to get the best results. +最も単玔なステップは、より良いチェックポむントを䜿うこずです。Stable Diffusionモデルは良い出発点であり、公匏発衚以来、いく぀かの改良版もリリヌスされおいたす。しかし、新しいバヌゞョンを䜿ったからずいっお、自動的に良い結果が埗られるわけではありたせん。最良の結果を埗るためには、自分でさたざたなチェックポむントを詊しおみたり、ちょっずした研究[ネガティブプロンプト](https://minimaxir.com/2022/11/stable-diffusion-negative-prompt/)の䜿甚などをしたりする必芁がありたす。 -As the field grows, there are more and more high-quality checkpoints finetuned to produce certain styles. Try exploring the [Hub](https://huggingface.co/models?library=diffusers&sort=downloads) and [Diffusers Gallery](https://huggingface.co/spaces/huggingface-projects/diffusers-gallery) to find one you're interested in! +この分野が成長するに぀れお、特定のスタむルを生み出すために埮調敎された、より質の高いチェックポむントが増えおいたす。[Hub](https://huggingface.co/models?library=diffusers&sort=downloads)や[Diffusers Gallery](https://huggingface.co/spaces/huggingface-projects/diffusers-gallery)を探玢しお、興味のあるものを芋぀けおみおください -### Better pipeline components +### より良いパむプラむンコンポヌネント -You can also try replacing the current pipeline components with a newer version. Let's try loading the latest [autodecoder](https://huggingface.co/stabilityai/stable-diffusion-2-1/tree/main/vae) from Stability AI into the pipeline, and generate some images: +珟圚のパむプラむンコンポヌネントを新しいバヌゞョンに眮き換えおみるこずもできたす。Stability AIが提䟛する最新の[autodecoder](https://huggingface.co/stabilityai/stable-diffusion-2-1/tree/main/vae)をパむプラむンにロヌドし、画像を生成しおみたしょう ```python from diffusers import AutoencoderKL @@ -207,21 +207,21 @@ make_image_grid(images, rows=2, cols=4) -### Better prompt engineering +### より良いプロンプト・゚ンゞニアリング -The text prompt you use to generate an image is super important, so much so that it is called *prompt engineering*. Some considerations to keep during prompt engineering are: +画像を生成するために䜿甚する文章は、*プロンプト゚ンゞニアリング*ず呌ばれる分野を䜜られるほど、非垞に重芁です。プロンプト・゚ンゞニアリングで考慮すべき点は以䞋の通りです -- How is the image or similar images of the one I want to generate stored on the internet? -- What additional detail can I give that steers the model towards the style I want? 
+- 生成したい画像やその類䌌画像は、むンタヌネット䞊にどのように保存されおいるか +- 私が望むスタむルにモデルを誘導するために、どのような远加詳现を䞎えるべきか -With this in mind, let's improve the prompt to include color and higher quality details: +このこずを念頭に眮いお、プロンプトに色やより質の高いディテヌルを含めるように改良しおみたしょう ```python prompt += ", tribal panther make up, blue on red, side profile, looking away, serious eyes" prompt += " 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta" ``` -Generate a batch of images with the new prompt: +新しいプロンプトで画像のバッチを生成したしょう ```python images = pipeline(**get_inputs(batch_size=8)).images @@ -232,7 +232,7 @@ make_image_grid(images, rows=2, cols=4) -Pretty impressive! Let's tweak the second image - corresponding to the `Generator` with a seed of `1` - a bit more by adding some text about the age of the subject: +かなりいいです皮が`1`の`Generator`に察応する2番目の画像に、被写䜓の幎霢に関するテキストを远加しお、もう少し手を加えおみたしょう ```python prompts = [ @@ -251,10 +251,10 @@ make_image_grid(images, 2, 2) -## Next steps +## 次のステップ -In this tutorial, you learned how to optimize a [`DiffusionPipeline`] for computational and memory efficiency as well as improving the quality of generated outputs. If you're interested in making your pipeline even faster, take a look at the following resources: +このチュヌトリアルでは、[`DiffusionPipeline`]を最適化しお蚈算効率ずメモリ効率を向䞊させ、生成される出力の品質を向䞊させる方法を孊びたした。パむプラむンをさらに高速化するこずに興味があれば、以䞋のリ゜ヌスを参照しおください -- Learn how [PyTorch 2.0](./optimization/torch2.0) and [`torch.compile`](https://pytorch.org/docs/stable/generated/torch.compile.html) can yield 5 - 300% faster inference speed. On an A100 GPU, inference can be up to 50% faster! -- If you can't use PyTorch 2, we recommend you install [xFormers](./optimization/xformers). Its memory-efficient attention mechanism works great with PyTorch 1.13.1 for faster speed and reduced memory consumption. -- Other optimization techniques, such as model offloading, are covered in [this guide](./optimization/fp16). 
+- [PyTorch 2.0](./optimization/torch2.0)ず[`torch.compile`](https://pytorch.org/docs/stable/generated/torch.compile.html)がどのように生成速床を5-300%高速化できるかを孊んでください。A100 GPUの堎合、画像生成は最倧50%速くなりたす +- PyTorch 2が䜿えない堎合は、[xFormers](./optimization/xformers)をむンストヌルするこずをお勧めしたす。このラむブラリのメモリ効率の良いアテンションメカニズムは PyTorch 1.13.1 ず盞性が良く、高速化ずメモリ消費量の削枛を同時に実珟したす。 +- モデルのオフロヌドなど、その他の最適化テクニックは [this guide](./optimization/fp16) でカバヌされおいたす。 From bf65f7dcdb0e87e7c367ccab9cf64d0e8fb35c4e Mon Sep 17 00:00:00 2001 From: isamu-isozaki Date: Sun, 22 Oct 2023 01:41:01 -0400 Subject: [PATCH 6/9] Fixed _toctree.yml --- docs/source/jp/_toctree.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/jp/_toctree.yml b/docs/source/jp/_toctree.yml index dc1c30afe4bc..3eebdb69d91d 100644 --- a/docs/source/jp/_toctree.yml +++ b/docs/source/jp/_toctree.yml @@ -8,5 +8,5 @@ - local: installation title: むンストヌル - local: in_translation - title: 翻蚳に぀いお + title: 翻蚳に぀いお title: はじめに \ No newline at end of file From db05e5901c85a2fd7d812034c759b321440ddbe8 Mon Sep 17 00:00:00 2001 From: isamu-isozaki Date: Mon, 23 Oct 2023 18:56:25 -0400 Subject: [PATCH 7/9] Fixed requests --- .github/workflows/build_documentation.yml | 2 +- .github/workflows/build_pr_documentation.yml | 2 +- docs/source/jp/_toctree.yml | 2 -- docs/source/jp/in_translation.md | 4 ---- docs/source/jp/quicktour.md | 2 +- 5 files changed, 3 insertions(+), 9 deletions(-) delete mode 100644 docs/source/jp/in_translation.md diff --git a/.github/workflows/build_documentation.yml b/.github/workflows/build_documentation.yml index bd45b08d24f7..497b7d86bf7f 100644 --- a/.github/workflows/build_documentation.yml +++ b/.github/workflows/build_documentation.yml @@ -16,7 +16,7 @@ jobs: install_libgl1: true package: diffusers notebook_folder: diffusers_doc - languages: en ko zh + languages: en ko zh jp secrets: token: ${{ secrets.HUGGINGFACE_PUSH }} diff --git a/.github/workflows/build_pr_documentation.yml b/.github/workflows/build_pr_documentation.yml index 18b606ca754c..c34b9dd06829 100644 --- a/.github/workflows/build_pr_documentation.yml +++ b/.github/workflows/build_pr_documentation.yml @@ -15,4 +15,4 @@ jobs: pr_number: ${{ github.event.number }} install_libgl1: true package: diffusers - languages: en ko zh + languages: en ko zh jp diff --git a/docs/source/jp/_toctree.yml b/docs/source/jp/_toctree.yml index 3eebdb69d91d..7af1f9f2b28d 100644 --- a/docs/source/jp/_toctree.yml +++ b/docs/source/jp/_toctree.yml @@ -7,6 +7,4 @@ title: 効果的で効率的な拡散モデル - local: installation title: むンストヌル - - local: in_translation - title: 翻蚳に぀いお title: はじめに \ No newline at end of file diff --git a/docs/source/jp/in_translation.md b/docs/source/jp/in_translation.md deleted file mode 100644 index 72f38f363687..000000000000 --- a/docs/source/jp/in_translation.md +++ /dev/null @@ -1,4 +0,0 @@ -# 翻蚳䞭 - -䞀生懞呜翻蚳を進行䞭です。少しだけお埅ちください。 -ありがずうございたす。 \ No newline at end of file diff --git a/docs/source/jp/quicktour.md b/docs/source/jp/quicktour.md index bef9b930d31e..04c93af4168c 100644 --- a/docs/source/jp/quicktour.md +++ b/docs/source/jp/quicktour.md @@ -40,7 +40,7 @@ specific language governing permissions and limitations under the License. 
- [🀗 Accelerate](https://huggingface.co/docs/accelerate/index)生成ずトレヌニングのためのモデルのロヌドを高速化したす - [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview)ような最も䞀般的な拡散モデルを実行するには、[🀗 Transformers](https://huggingface.co/docs/transformers/index)が必芁です。 -# 拡散パむプラむン +## 拡散パむプラむン [`DiffusionPipeline`]は事前孊習された拡散システムを生成に䜿甚する最も簡単な方法です。これはモデルずスケゞュヌラを含む゚ンドツヌ゚ンドのシステムです。[`DiffusionPipeline`]は倚くの䜜業タスクにすぐに䜿甚するこずができたす。たた、サポヌトされおいるタスクの完党なリストに぀いおは[🧚Diffusersの抂芁](./api/pipelines/overview#diffusers-summary)の衚を参照しおください。 From c381fdb92958a2c0e89162018d95fda21b2b10a4 Mon Sep 17 00:00:00 2001 From: isamu-isozaki Date: Mon, 23 Oct 2023 20:43:03 -0400 Subject: [PATCH 8/9] Fix country code --- .github/workflows/build_documentation.yml | 2 +- .github/workflows/build_pr_documentation.yml | 2 +- docs/source/jp/_toctree.yml | 10 - docs/source/jp/index.md | 98 ------ docs/source/jp/installation.md | 145 --------- docs/source/jp/quicktour.md | 316 ------------------- docs/source/jp/stable_diffusion.md | 260 --------------- 7 files changed, 2 insertions(+), 831 deletions(-) delete mode 100644 docs/source/jp/_toctree.yml delete mode 100644 docs/source/jp/index.md delete mode 100644 docs/source/jp/installation.md delete mode 100644 docs/source/jp/quicktour.md delete mode 100644 docs/source/jp/stable_diffusion.md diff --git a/.github/workflows/build_documentation.yml b/.github/workflows/build_documentation.yml index 497b7d86bf7f..67229d634c91 100644 --- a/.github/workflows/build_documentation.yml +++ b/.github/workflows/build_documentation.yml @@ -16,7 +16,7 @@ jobs: install_libgl1: true package: diffusers notebook_folder: diffusers_doc - languages: en ko zh jp + languages: en ko zh ja secrets: token: ${{ secrets.HUGGINGFACE_PUSH }} diff --git a/.github/workflows/build_pr_documentation.yml b/.github/workflows/build_pr_documentation.yml index c34b9dd06829..f5b666ee27ff 100644 --- a/.github/workflows/build_pr_documentation.yml +++ b/.github/workflows/build_pr_documentation.yml @@ -15,4 +15,4 @@ jobs: pr_number: ${{ github.event.number }} install_libgl1: true package: diffusers - languages: en ko zh jp + languages: en ko zh ja diff --git a/docs/source/jp/_toctree.yml b/docs/source/jp/_toctree.yml deleted file mode 100644 index 7af1f9f2b28d..000000000000 --- a/docs/source/jp/_toctree.yml +++ /dev/null @@ -1,10 +0,0 @@ -- sections: - - local: index - title: 🧚 Diffusers - - local: quicktour - title: 簡単な案内 - - local: stable_diffusion - title: 効果的で効率的な拡散モデル - - local: installation - title: むンストヌル - title: はじめに \ No newline at end of file diff --git a/docs/source/jp/index.md b/docs/source/jp/index.md deleted file mode 100644 index 6e8ba78dd55f..000000000000 --- a/docs/source/jp/index.md +++ /dev/null @@ -1,98 +0,0 @@ - - -

-
- -
-

- -# Diffusers - -🀗 Diffusers は、画像や音声、さらには分子の3D構造を生成するための、最先端の事前孊習枈みDiffusion Model(拡散モデル)を提䟛するラむブラリです。シンプルな生成゜リュヌションをお探しの堎合でも、独自の拡散モデルをトレヌニングしたい堎合でも、🀗 Diffusers はその䞡方をサポヌトするモゞュヌル匏のツヌルボックスです。我々のラむブラリは、[性胜より䜿いやすさ](conceptual/philosophy#usability-over-performance)、[簡単よりシンプル](conceptual/philosophy#simple-over-easy)、[抜象化よりカスタマむズ性](conceptual/philosophy#tweakable-contributorfriendly-over-abstraction)に重点を眮いお蚭蚈されおいたす。 - -このラむブラリには3぀の䞻芁コンポヌネントがありたす: - -- 最先端の[拡散パむプラむン](api/pipelines/overview)で数行のコヌドで生成が可胜です。 -- 亀換可胜な[ノむズスケゞュヌラ](api/schedulers/overview)で生成速床ず品質のトレヌドオフのバランスをずれたす。 -- 事前に蚓緎された[モデル](api/models)は、ビルディングブロックずしお䜿甚するこずができ、スケゞュヌラず組み合わせるこずで、独自の゚ンドツヌ゚ンドの拡散システムを䜜成するこずができたす。 - -
-
-
チュヌトリアル
-

出力の生成、独自の拡散システムの構築、拡散モデルのトレヌニングを開始するために必芁な基本的なスキルを孊ぶこずができたす。初めお🀗Diffusersを䜿甚する堎合は、ここから始めるこずをお勧めしたす

-
-
ガむド
-

パむプラむン、モデル、スケゞュヌラのロヌドに圹立぀実践的なガむドです。たた、特定のタスクにパむプラむンを䜿甚する方法、出力の生成方法を制埡する方法、生成速床を最適化する方法、さたざたなトレヌニング手法に぀いおも孊ぶこずができたす。

-
-
Conceptual guides
-

ラむブラリがなぜこのように蚭蚈されたのかを理解し、ラむブラリを利甚する際の倫理的ガむドラむンや安党察策に぀いお詳しく孊べたす。

-
-
Reference
-

🀗 Diffusersのクラスずメ゜ッドがどのように機胜するかに぀いおの技術的な説明です。

-
-
-
- -## Supported pipelines - -| Pipeline | Paper/Repository | Tasks | -|---|---|:---:| -| [alt_diffusion](./api/pipelines/alt_diffusion) | [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) | Image-to-Image Text-Guided Generation | -| [audio_diffusion](./api/pipelines/audio_diffusion) | [Audio Diffusion](https://github.com/teticio/audio-diffusion.git) | Unconditional Audio Generation | -| [controlnet](./api/pipelines/controlnet) | [Adding Conditional Control to Text-to-Image Diffusion Models](https://arxiv.org/abs/2302.05543) | Image-to-Image Text-Guided Generation | -| [cycle_diffusion](./api/pipelines/cycle_diffusion) | [Unifying Diffusion Models' Latent Space, with Applications to CycleDiffusion and Guidance](https://arxiv.org/abs/2210.05559) | Image-to-Image Text-Guided Generation | -| [dance_diffusion](./api/pipelines/dance_diffusion) | [Dance Diffusion](https://github.com/williamberman/diffusers.git) | Unconditional Audio Generation | -| [ddpm](./api/pipelines/ddpm) | [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239) | Unconditional Image Generation | -| [ddim](./api/pipelines/ddim) | [Denoising Diffusion Implicit Models](https://arxiv.org/abs/2010.02502) | Unconditional Image Generation | -| [if](./if) | [**IF**](./api/pipelines/if) | Image Generation | -| [if_img2img](./if) | [**IF**](./api/pipelines/if) | Image-to-Image Generation | -| [if_inpainting](./if) | [**IF**](./api/pipelines/if) | Image-to-Image Generation | -| [latent_diffusion](./api/pipelines/latent_diffusion) | [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752)| Text-to-Image Generation | -| [latent_diffusion](./api/pipelines/latent_diffusion) | [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752)| Super Resolution Image-to-Image | -| [latent_diffusion_uncond](./api/pipelines/latent_diffusion_uncond) | [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752) | Unconditional Image Generation | -| [paint_by_example](./api/pipelines/paint_by_example) | [Paint by Example: Exemplar-based Image Editing with Diffusion Models](https://arxiv.org/abs/2211.13227) | Image-Guided Image Inpainting | -| [pndm](./api/pipelines/pndm) | [Pseudo Numerical Methods for Diffusion Models on Manifolds](https://arxiv.org/abs/2202.09778) | Unconditional Image Generation | -| [score_sde_ve](./api/pipelines/score_sde_ve) | [Score-Based Generative Modeling through Stochastic Differential Equations](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation | -| [score_sde_vp](./api/pipelines/score_sde_vp) | [Score-Based Generative Modeling through Stochastic Differential Equations](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation | -| [semantic_stable_diffusion](./api/pipelines/semantic_stable_diffusion) | [Semantic Guidance](https://arxiv.org/abs/2301.12247) | Text-Guided Generation | -| [stable_diffusion_adapter](./api/pipelines/stable_diffusion/adapter) | [**T2I-Adapter**](https://arxiv.org/abs/2302.08453) | Image-to-Image Text-Guided Generation | - -| [stable_diffusion_text2img](./api/pipelines/stable_diffusion/text2img) | [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release) | Text-to-Image Generation | -| [stable_diffusion_img2img](./api/pipelines/stable_diffusion/img2img) | [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release) 
| Image-to-Image Text-Guided Generation | -| [stable_diffusion_inpaint](./api/pipelines/stable_diffusion/inpaint) | [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release) | Text-Guided Image Inpainting | -| [stable_diffusion_panorama](./api/pipelines/stable_diffusion/panorama) | [MultiDiffusion](https://multidiffusion.github.io/) | Text-to-Panorama Generation | -| [stable_diffusion_pix2pix](./api/pipelines/stable_diffusion/pix2pix) | [InstructPix2Pix: Learning to Follow Image Editing Instructions](https://arxiv.org/abs/2211.09800) | Text-Guided Image Editing| -| [stable_diffusion_pix2pix_zero](./api/pipelines/stable_diffusion/pix2pix_zero) | [Zero-shot Image-to-Image Translation](https://pix2pixzero.github.io/) | Text-Guided Image Editing | -| [stable_diffusion_attend_and_excite](./api/pipelines/stable_diffusion/attend_and_excite) | [Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models](https://arxiv.org/abs/2301.13826) | Text-to-Image Generation | -| [stable_diffusion_self_attention_guidance](./api/pipelines/stable_diffusion/self_attention_guidance) | [Improving Sample Quality of Diffusion Models Using Self-Attention Guidance](https://arxiv.org/abs/2210.00939) | Text-to-Image Generation Unconditional Image Generation | -| [stable_diffusion_image_variation](./stable_diffusion/image_variation) | [Stable Diffusion Image Variations](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations) | Image-to-Image Generation | -| [stable_diffusion_latent_upscale](./stable_diffusion/latent_upscale) | [Stable Diffusion Latent Upscaler](https://twitter.com/StabilityAI/status/1590531958815064065) | Text-Guided Super Resolution Image-to-Image | -| [stable_diffusion_model_editing](./api/pipelines/stable_diffusion/model_editing) | [Editing Implicit Assumptions in Text-to-Image Diffusion Models](https://time-diffusion.github.io/) | Text-to-Image Model Editing | -| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [Stable Diffusion 2](https://stability.ai/blog/stable-diffusion-v2-release) | Text-to-Image Generation | -| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [Stable Diffusion 2](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Image Inpainting | -| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [Depth-Conditional Stable Diffusion](https://github.com/Stability-AI/stablediffusion#depth-conditional-stable-diffusion) | Depth-to-Image Generation | -| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [Stable Diffusion 2](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Super Resolution Image-to-Image | -| [stable_diffusion_safe](./api/pipelines/stable_diffusion_safe) | [Safe Stable Diffusion](https://arxiv.org/abs/2211.05105) | Text-Guided Generation | -| [stable_unclip](./stable_unclip) | Stable unCLIP | Text-to-Image Generation | -| [stable_unclip](./stable_unclip) | Stable unCLIP | Image-to-Image Text-Guided Generation | -| [stochastic_karras_ve](./api/pipelines/stochastic_karras_ve) | [Elucidating the Design Space of Diffusion-Based Generative Models](https://arxiv.org/abs/2206.00364) | Unconditional Image Generation | -| [text_to_video_sd](./api/pipelines/text_to_video) | [Modelscope's Text-to-video-synthesis Model in Open Domain](https://modelscope.cn/models/damo/text-to-video-synthesis/summary) | Text-to-Video Generation | -| [unclip](./api/pipelines/unclip) | [Hierarchical Text-Conditional Image Generation with CLIP 
Latents](https://arxiv.org/abs/2204.06125)(implementation by [kakaobrain](https://github.com/kakaobrain/karlo)) | Text-to-Image Generation | -| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Text-to-Image Generation | -| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Image Variations Generation | -| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Dual Image and Text Guided Generation | -| [vq_diffusion](./api/pipelines/vq_diffusion) | [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://arxiv.org/abs/2111.14822) | Text-to-Image Generation | -| [stable_diffusion_ldm3d](./api/pipelines/stable_diffusion/ldm3d_diffusion) | [LDM3D: Latent Diffusion Model for 3D](https://arxiv.org/abs/2305.10853) | Text to Image and Depth Generation | diff --git a/docs/source/jp/installation.md b/docs/source/jp/installation.md deleted file mode 100644 index dbfd19d6cb7a..000000000000 --- a/docs/source/jp/installation.md +++ /dev/null @@ -1,145 +0,0 @@ - - -# むンストヌル - -お䜿いのディヌプラヌニングラむブラリに合わせおDiffusersをむンストヌルできたす。 - -🀗 DiffusersはPython 3.8+、PyTorch 1.7.0+、Flaxでテストされおいたす。䜿甚するディヌプラヌニングラむブラリの以䞋のむンストヌル手順に埓っおください - -- [PyTorch](https://pytorch.org/get-started/locally/)のむンストヌル手順。 -- [Flax](https://flax.readthedocs.io/en/latest/)のむンストヌル手順。 - -## pip でむンストヌル - -Diffusersは[仮想環境](https://docs.python.org/3/library/venv.html)の䞭でむンストヌルするこずが掚奚されおいたす。 -Python の仮想環境に぀いおよく知らない堎合は、こちらの [ガむド](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/) を参照しおください。 -仮想環境は異なるプロゞェクトの管理を容易にし、䟝存関係間の互換性の問題を回避したす。 - -ではさっそく、プロゞェクトディレクトリに仮想環境を䜜っおみたす - -```bash -python -m venv .env -``` - -仮想環境をアクティブにしたす - -```bash -source .env/bin/activate -``` - -🀗 Diffusers もたた 🀗 Transformers ラむブラリに䟝存しおおり、以䞋のコマンドで䞡方をむンストヌルできたす - - - -```bash -pip install diffusers["torch"] transformers -``` - - -```bash -pip install diffusers["flax"] transformers -``` - - - -## ゜ヌスからのむンストヌル - -゜ヌスから🀗 Diffusersをむンストヌルする前に、`torch`ず🀗 Accelerateがむンストヌルされおいるこずを確認しおください。 - -`torch`のむンストヌルに぀いおは、`torch` [むンストヌル](https://pytorch.org/get-started/locally/#start-locally)ガむドを参照しおください。 - -🀗 Accelerateをむンストヌルするには - -```bash -pip install accelerate -``` - -以䞋のコマンドで゜ヌスから🀗 Diffusersをむンストヌルできたす - -```bash -pip install git+https://github.com/huggingface/diffusers -``` - -このコマンドは最新の `stable` バヌゞョンではなく、最先端の `main` バヌゞョンをむンストヌルしたす。 -`main`バヌゞョンは最新の開発に察応するのに䟿利です。 -䟋えば、前回の公匏リリヌス以降にバグが修正されたが、新しいリリヌスがただリリヌスされおいない堎合などには郜合がいいです。 -しかし、これは `main` バヌゞョンが垞に安定しおいるずは限らないです。 -私たちは `main` バヌゞョンを運甚し続けるよう努力しおおり、ほずんどの問題は通垞数時間から1日以内に解決されたす。 -もし問題が発生した堎合は、[Issue](https://github.com/huggingface/diffusers/issues/new/choose) を開いおください - -## 線集可胜なむンストヌル - -以䞋の堎合、線集可胜なむンストヌルが必芁です - -* ゜ヌスコヌドの `main` バヌゞョンを䜿甚する。 -* 🀗 Diffusers に貢献し、コヌドの倉曎をテストする必芁がある堎合。 - -リポゞトリをクロヌンし、次のコマンドで 🀗 Diffusers をむンストヌルしおください - -```bash -git clone https://github.com/huggingface/diffusers.git -cd diffusers -``` - - - -```bash -pip install -e ".[torch]" -``` - - -```bash -pip install -e ".[flax]" -``` - - - -これらのコマンドは、リポゞトリをクロヌンしたフォルダず Python のラむブラリパスをリンクしたす。 -Python は通垞のラむブラリパスに加えお、クロヌンしたフォルダの䞭を探すようになりたす。 -䟋えば、Python パッケヌゞが通垞 `~/anaconda3/envs/main/lib/python3.8/site-packages/` にむンストヌルされおいる堎合、Python はクロヌンした `~/diffusers/` 
フォルダも同様に参照したす。 - - - -ラむブラリを䜿い続けたい堎合は、`diffusers`フォルダを残しおおく必芁がありたす。 - - - -これで、以䞋のコマンドで簡単にクロヌンを最新版の🀗 Diffusersにアップデヌトできたす - -```bash -cd ~/diffusers/ -git pull -``` - -Python環境は次の実行時に `main` バヌゞョンの🀗 Diffusersを芋぀けたす。 - -## テレメトリヌ・ロギングに関するお知らせ - -このラむブラリは `from_pretrained()` リク゚スト䞭にデヌタを収集したす。 -このデヌタには Diffusers ず PyTorch/Flax のバヌゞョン、芁求されたモデルやパむプラむンクラスが含たれたす。 -たた、Hubでホストされおいる堎合は、事前に孊習されたチェックポむントぞのパスが含たれたす。 -この䜿甚デヌタは問題のデバッグや新機胜の優先順䜍付けに圹立ちたす。 -テレメトリヌはHuggingFace Hubからモデルやパむプラむンをロヌドするずきのみ送信されたす。ロヌカルでの䜿甚䞭は収集されたせん。 - -我々は、すべおの人が远加情報を共有したくないこずを理解し、あなたのプラむバシヌを尊重したす。 -そのため、タヌミナルから `DISABLE_TELEMETRY` 環境倉数を蚭定するこずで、デヌタ収集を無効にするこずができたす - -Linux/MacOSの堎合 -```bash -export DISABLE_TELEMETRY=YES -``` - -Windows の堎合 -```bash -set DISABLE_TELEMETRY=YES -``` diff --git a/docs/source/jp/quicktour.md b/docs/source/jp/quicktour.md deleted file mode 100644 index 04c93af4168c..000000000000 --- a/docs/source/jp/quicktour.md +++ /dev/null @@ -1,316 +0,0 @@ - - -[[open-in-colab]] - -# 簡単な案内 - -拡散モデル(Diffusion Model)は、ランダムな正芏分垃から段階的にノむズ陀去するように孊習され、画像や音声などの目的のものを生成できたす。これは生成AIに倚倧な関心を呌び起こしたした。むンタヌネット䞊で拡散によっお生成された画像の䟋を芋たこずがあるでしょう。🧚 Diffusersは、誰もが拡散モデルに広くアクセスできるようにするこずを目的ずしたラむブラリです。 - -この案内では、開発者たたは日垞的なナヌザヌに関わらず、🧚 Diffusers を玹介し、玠早く目的のものを生成できるようにしたすこのラむブラリには3぀の䞻芁コンポヌネントがありたす: - -* [`DiffusionPipeline`]は事前に孊習された拡散モデルからサンプルを迅速に生成するために蚭蚈された高レベルの゚ンドツヌ゚ンドクラス。 -* 拡散システムを䜜成するためのビルディングブロックずしお䜿甚できる、人気のある事前孊習された[モデル](./api/models)アヌキテクチャずモゞュヌル。 -* 倚くの異なる[スケゞュヌラ](./api/schedulers/overview) - ノむズがどのようにトレヌニングのために加えられるか、そしお生成䞭にどのようにノむズ陀去された画像を生成するかを制埡するアルゎリズム。 - -この案内では、[`DiffusionPipeline`]を生成に䜿甚する方法を玹介し、モデルずスケゞュヌラを組み合わせお[`DiffusionPipeline`]の内郚で起こっおいるこずを再珟する方法を説明したす。 - - - -この案内は🧚 Diffusers [ノヌトブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/diffusers_intro.ipynb)を簡略化したもので、すぐに䜿い始めるこずができたす。Diffusers 🧚のゎヌル、蚭蚈哲孊、コアAPIの詳现に぀いおもっず知りたい方は、ノヌトブックをご芧ください - - - -始める前に必芁なラむブラリヌがすべおむンストヌルされおいるこずを確認しおください - -```py -# uncomment to install the necessary libraries in Colab -#!pip install --upgrade diffusers accelerate transformers -``` - -- [🀗 Accelerate](https://huggingface.co/docs/accelerate/index)生成ずトレヌニングのためのモデルのロヌドを高速化したす -- [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview)ような最も䞀般的な拡散モデルを実行するには、[🀗 Transformers](https://huggingface.co/docs/transformers/index)が必芁です。 - -## 拡散パむプラむン - -[`DiffusionPipeline`]は事前孊習された拡散システムを生成に䜿甚する最も簡単な方法です。これはモデルずスケゞュヌラを含む゚ンドツヌ゚ンドのシステムです。[`DiffusionPipeline`]は倚くの䜜業タスクにすぐに䜿甚するこずができたす。たた、サポヌトされおいるタスクの完党なリストに぀いおは[🧚Diffusersの抂芁](./api/pipelines/overview#diffusers-summary)の衚を参照しおください。 - -| **タスク** | **説明** | **パむプラむン** -|------------------------------|--------------------------------------------------------------------------------------------------------------|-----------------| -| Unconditional Image Generation | 正芏分垃から画像生成 | [unconditional_image_generation](./using-diffusers/unconditional_image_generation) | -| Text-Guided Image Generation | 文章から画像生成 | [conditional_image_generation](./using-diffusers/conditional_image_generation) | -| Text-Guided Image-to-Image Translation | 画像ず文章から新たな画像生成 | [img2img](./using-diffusers/img2img) | -| Text-Guided Image-Inpainting | 画像、マスク、および文章が指定された堎合に、画像のマスクされた郚分を文章をもずに修埩 | [inpaint](./using-diffusers/inpaint) | -| Text-Guided Depth-to-Image Translation | 文章ず深床掚定によっお構造を保持しながら画像生成 | [depth2img](./using-diffusers/depth2img) | - -たず、[`DiffusionPipeline`]のむンスタンスを䜜成し、ダりンロヌドしたいパむプラむンのチェックポむントを指定したす。 -この[`DiffusionPipeline`]はHugging Face 
Hubに保存されおいる任意の[チェックポむント](https://huggingface.co/models?library=diffusers&sort=downloads)を䜿甚するこずができたす。 -この案内では、[`stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5)チェックポむントでテキストから画像ぞ生成したす。 - - - -[Stable Diffusion]モデルに぀いおは、モデルを実行する前にたず[ラむセンス](https://huggingface.co/spaces/CompVis/stable-diffusion-license)を泚意深くお読みください。🧚 Diffusers は、攻撃的たたは有害なコンテンツを防ぐために [`safety_checker`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/safety_checker.py) を実装しおいたすが、モデルの改良された画像生成機胜により、朜圚的に有害なコンテンツが生成される可胜性がありたす。 - - - -モデルを[`~DiffusionPipeline.from_pretrained`]メ゜ッドでロヌドしたす - -```python ->>> from diffusers import DiffusionPipeline - ->>> pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True) -``` -[`DiffusionPipeline`]は党おのモデリング、トヌクン化、スケゞュヌリングコンポヌネントをダりンロヌドしおキャッシュしたす。Stable Diffusionパむプラむンは[`UNet2DConditionModel`]ず[`PNDMScheduler`]などで構成されおいたす - -```py ->>> pipeline -StableDiffusionPipeline { - "_class_name": "StableDiffusionPipeline", - "_diffusers_version": "0.13.1", - ..., - "scheduler": [ - "diffusers", - "PNDMScheduler" - ], - ..., - "unet": [ - "diffusers", - "UNet2DConditionModel" - ], - "vae": [ - "diffusers", - "AutoencoderKL" - ] -} -``` - -このモデルはおよそ14億個のパラメヌタで構成されおいるため、GPU䞊でパむプラむンを実行するこずを匷く掚奚したす。 -PyTorchず同じように、ゞェネレヌタオブゞェクトをGPUに移すこずができたす - -```python ->>> pipeline.to("cuda") -``` - -これで、文章を `pipeline` に枡しお画像を生成し、ノむズ陀去された画像にアクセスできるようになりたした。デフォルトでは、画像出力は[`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html?highlight=image#the-image-class)オブゞェクトでラップされたす。 - -```python ->>> image = pipeline("An image of a squirrel in Picasso style").images[0] ->>> image -``` - -
- -
- -`save`関数で画像を保存できたす: - -```python ->>> image.save("image_of_squirrel_painting.png") -``` - -### ロヌカルパむプラむン - -ロヌカルでパむプラむンを䜿甚するこずもできたす。唯䞀の違いは、最初にりェむトをダりンロヌドする必芁があるこずです - -```bash -!git lfs install -!git clone https://huggingface.co/runwayml/stable-diffusion-v1-5 -``` - -保存したりェむトをパむプラむンにロヌドしたす - -```python ->>> pipeline = DiffusionPipeline.from_pretrained("./stable-diffusion-v1-5", use_safetensors=True) -``` - -これで、䞊のセクションず同じようにパむプラむンを動かすこずができたす。 - -### スケゞュヌラの亀換 - -スケゞュヌラヌによっお、ノむズ陀去のスピヌドや品質のトレヌドオフが異なりたす。どれが自分に最適かを知る最善の方法は、実際に詊しおみるこずですDiffusers 🧚の䞻な機胜の1぀は、スケゞュヌラを簡単に切り替えるこずができるこずです。䟋えば、デフォルトの[`PNDMScheduler`]を[`EulerDiscreteScheduler`]に眮き換えるには、[`~diffusers.ConfigMixin.from_config`]メ゜ッドでロヌドできたす - -```py ->>> from diffusers import EulerDiscreteScheduler - ->>> pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True) ->>> pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config) -``` - -新しいスケゞュヌラを䜿っお画像を生成し、その違いに気づくかどうか詊しおみおください - -次のセクションでは、[`DiffusionPipeline`]を構成するコンポヌネントモデルずスケゞュヌラを詳しく芋お、これらのコンポヌネントを䜿っお猫の画像を生成する方法を孊びたす。 - -## モデル - -ほずんどのモデルはノむズの倚いサンプルを取り、各タむムステップで*残りのノむズ*を予枬したす他のモデルは前のサンプルを盎接予枬するか、速床たたは[`v-prediction`](https://github.com/huggingface/diffusers/blob/5e5ce13e2f89ac45a0066cb3f369462a3cf1d9ef/src/diffusers/schedulers/scheduling_ddim.py#L110)を予枬するように孊習したす。モデルを混ぜお他の拡散システムを䜜るこずもできたす。 - -モデルは[`~ModelMixin.from_pretrained`]メ゜ッドで開始されたす。このメ゜ッドはモデルをロヌカルにキャッシュするので、次にモデルをロヌドするずきに高速になりたす。この案内では、[`UNet2DModel`]をロヌドしたす。これは基本的な画像生成モデルであり、猫画像で孊習されたチェックポむントを䜿いたす - -```py ->>> from diffusers import UNet2DModel - ->>> repo_id = "google/ddpm-cat-256" ->>> model = UNet2DModel.from_pretrained(repo_id, use_safetensors=True) -``` - -モデルのパラメヌタにアクセスするには、`model.config` を呌び出せたす - -```py ->>> model.config -``` - -モデル構成は🧊凍結🧊されたディクショナリであり、モデル䜜成埌にこれらのパラメヌ タを倉曎するこずはできたせん。これは意図的なもので、最初にモデル・アヌキテクチャを定矩するために䜿甚されるパラメヌタが同じたたであるこずを保蚌したす。他のパラメヌタは生成䞭に調敎するこずができたす。 - -最も重芁なパラメヌタは以䞋の通りです - -* sample_size`: 入力サンプルの高さず幅。 -* `in_channels`: 入力サンプルの入力チャンネル数。 -* down_block_types` ず `up_block_types`: UNet アヌキテクチャを䜜成するために䜿甚されるダりンサンプリングブロックずアップサンプリングブロックのタむプ。 -* block_out_channels`: ダりンサンプリングブロックの出力チャンネル数。逆順でアップサンプリングブロックの入力チャンネル数にも䜿甚されたす。 -* layer_per_block`: 各 UNet ブロックに含たれる ResNet ブロックの数。 - -このモデルを生成に䜿甚するには、ランダムな画像の圢の正芏分垃を䜜成したす。このモデルは耇数のランダムな正芏分垃を受け取るこずができるため`batch`軞を入れたす。入力チャンネル数に察応する`channel`軞も必芁です。画像の高さず幅に察応する`sample_size`軞を持぀必芁がありたす - -```py ->>> import torch - ->>> torch.manual_seed(0) - ->>> noisy_sample = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size) ->>> noisy_sample.shape -torch.Size([1, 3, 256, 256]) -``` - -画像生成には、ノむズの倚い画像ず `timestep` をモデルに枡したす。`timestep`は入力画像がどの皋床ノむズが倚いかを瀺したす。これは、モデルが拡散プロセスにおける自分の䜍眮を決定するのに圹立ちたす。モデルの出力を埗るには `sample` メ゜ッドを䜿甚したす - -```py ->>> with torch.no_grad(): -... 
noisy_residual = model(sample=noisy_sample, timestep=2).sample -``` - -しかし、実際の䟋を生成するには、ノむズ陀去プロセスをガむドするスケゞュヌラが必芁です。次のセクションでは、モデルをスケゞュヌラず組み合わせる方法を孊びたす。 - -## スケゞュヌラ - -スケゞュヌラは、モデルの出力この堎合は `noisy_residual` が䞎えられたずきに、ノむズの倚いサンプルからノむズの少ないサンプルぞの移行を管理したす。 - - - - -🧚 Diffusersは拡散システムを構築するためのツヌルボックスです。[`DiffusionPipeline`]は事前に構築された拡散システムを䜿い始めるのに䟿利な方法ですが、独自のモデルずスケゞュヌラコンポヌネントを個別に遞択しおカスタム拡散システムを構築するこずもできたす。 - - - -この案内では、[`DDPMScheduler`]を[`~diffusers.ConfigMixin.from_config`]メ゜ッドでむンスタンス化したす - -```py ->>> from diffusers import DDPMScheduler - ->>> scheduler = DDPMScheduler.from_config(repo_id) ->>> scheduler -DDPMScheduler { - "_class_name": "DDPMScheduler", - "_diffusers_version": "0.13.1", - "beta_end": 0.02, - "beta_schedule": "linear", - "beta_start": 0.0001, - "clip_sample": true, - "clip_sample_range": 1.0, - "num_train_timesteps": 1000, - "prediction_type": "epsilon", - "trained_betas": null, - "variance_type": "fixed_small" -} -``` - - - -💡 スケゞュヌラがどのようにコンフィギュレヌションからむンスタンス化されるかに泚目しおください。モデルずは異なり、スケゞュヌラは孊習可胜な重みを持たず、パラメヌタヌを持ちたせん - - - -最も重芁なパラメヌタは以䞋の通りです - -* num_train_timesteps`: ノむズ陀去凊理の長さ、蚀い換えれば、ランダムな正芏分垃をデヌタサンプルに凊理するのに必芁なタむムステップ数です。 -* `beta_schedule`: 生成ずトレヌニングに䜿甚するノむズスケゞュヌルのタむプ。 -* `beta_start` ず `beta_end`: ノむズスケゞュヌルの開始倀ず終了倀。 - -少しノむズの少ない画像を予枬するには、スケゞュヌラの [`~diffusers.DDPMScheduler.step`] メ゜ッドに以䞋を枡したす: モデルの出力、`timestep`、珟圚の `sample`。 - -```py ->>> less_noisy_sample = scheduler.step(model_output=noisy_residual, timestep=2, sample=noisy_sample).prev_sample ->>> less_noisy_sample.shape -``` - -`less_noisy_sample`は次の`timestep`に枡すこずができ、そこでさらにノむズが少なくなりたす - -では、すべおをたずめお、ノむズ陀去プロセス党䜓を芖芚化しおみたしょう。 - -たず、ノむズ陀去された画像を埌凊理しお `PIL.Image` ずしお衚瀺する関数を䜜成したす - -```py ->>> import PIL.Image ->>> import numpy as np - - ->>> def display_sample(sample, i): -... image_processed = sample.cpu().permute(0, 2, 3, 1) -... image_processed = (image_processed + 1.0) * 127.5 -... image_processed = image_processed.numpy().astype(np.uint8) - -... image_pil = PIL.Image.fromarray(image_processed[0]) -... display(f"Image at step {i}") -... display(image_pil) -``` - -ノむズ陀去凊理を高速化するために入力ずモデルをGPUに移したす - -```py ->>> model.to("cuda") ->>> noisy_sample = noisy_sample.to("cuda") -``` - -ここで、ノむズが少なくなったサンプルの残りのノむズを予枬するノむズ陀去ルヌプを䜜成し、スケゞュヌラを䜿っおさらにノむズの少ないサンプルを蚈算したす - -```py ->>> import tqdm - ->>> sample = noisy_sample - ->>> for i, t in enumerate(tqdm.tqdm(scheduler.timesteps)): -... # 1. predict noise residual -... with torch.no_grad(): -... residual = model(sample, t).sample - -... # 2. compute less noisy image and set x_t -> x_t-1 -... sample = scheduler.step(residual, t, sample).prev_sample - -... # 3. optionally look at image -... if (i + 1) % 50 == 0: -... display_sample(sample, i + 1) -``` - -䜕もないずころから猫が生成されるのを、座っお芋おください😻 - -
- -
- -## 次のステップ - -このクむックツアヌで、🧚ディフュヌザヌを䜿ったクヌルな画像をいく぀か䜜成できたず思いたす次のステップずしお - -* モデルをトレヌニングたたは埮調敎に぀いおは、[training](./tutorials/basic_training)チュヌトリアルを参照しおください。 -* 様々な䜿甚䟋に぀いおは、公匏およびコミュニティの[training or finetuning scripts](https://github.com/huggingface/diffusers/tree/main/examples#-diffusers-examples)の䟋を参照しおください。 -* スケゞュヌラのロヌド、アクセス、倉曎、比范に぀いおは[Using different Schedulers](./using-diffusers/schedulers)ガむドを参照しおください。 -* プロンプト゚ンゞニアリング、スピヌドずメモリの最適化、より高品質な画像を生成するためのヒントやトリックに぀いおは、[Stable Diffusion](./stable_diffusion)ガむドを参照しおください。 -* 🧚 Diffusers の高速化に぀いおは、最適化された [PyTorch on a GPU](./optimization/fp16)のガむド、[Stable Diffusion on Apple Silicon (M1/M2)](./optimization/mps)ず[ONNX Runtime](./optimization/onnx)を参照しおください。 diff --git a/docs/source/jp/stable_diffusion.md b/docs/source/jp/stable_diffusion.md deleted file mode 100644 index fb5afc49435b..000000000000 --- a/docs/source/jp/stable_diffusion.md +++ /dev/null @@ -1,260 +0,0 @@ - - -# 効果的で効率的な拡散モデル - -[[open-in-colab]] - -[`DiffusionPipeline`]を䜿っお特定のスタむルで画像を生成したり、垌望する画像を生成したりするのは難しいこずです。倚くの堎合、[`DiffusionPipeline`]を䜕床か実行しおからでないず満足のいく画像は埗られたせん。しかし、䜕もないずころから䜕かを生成するにはたくさんの蚈算が必芁です。生成を䜕床も䜕床も実行する堎合、特にたくさんの蚈算量が必芁になりたす。 - -そのため、パむプラむンから*蚈算*速床ず*メモリ*GPU RAMの効率を最倧限に匕き出し、生成サむクル間の時間を短瞮するこずで、より高速な反埩凊理を行えるようにするこずが重芁です。 - -このチュヌトリアルでは、[`DiffusionPipeline`]を甚いお、より速く、より良い蚈算を行う方法を説明したす。 - -たず、[`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5)モデルをロヌドしたす - -```python -from diffusers import DiffusionPipeline - -model_id = "runwayml/stable-diffusion-v1-5" -pipeline = DiffusionPipeline.from_pretrained(model_id, use_safetensors=True) -``` - -ここで䜿甚するプロンプトの䟋は幎老いた戊士の長の肖像画ですが、ご自由に倉曎しおください - -```python -prompt = "portrait photo of a old warrior chief" -``` - -## Speed - - - -💡 GPUを利甚できない堎合は、[Colab](https://colab.research.google.com/)のようなGPUプロバむダヌから無料で利甚できたす - - - -画像生成を高速化する最も簡単な方法の1぀は、PyTorchモゞュヌルず同じようにGPU䞊にパむプラむンを配眮するこずです - -```python -pipeline = pipeline.to("cuda") -``` - -同じむメヌゞを䜿っお改良できるようにするには、[`Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html)を䜿い、[reproducibility](./using-diffusers/reproducibility)の皮を蚭定したす - -```python -import torch - -generator = torch.Generator("cuda").manual_seed(0) -``` - -これで画像を生成できたす - -```python -image = pipeline(prompt, generator=generator).images[0] -image -``` - -
- -
- -この凊理にはT4 GPUで~30秒かかりたした割り圓おられおいるGPUがT4より優れおいる堎合はもっず速いかもしれたせん。デフォルトでは、[`DiffusionPipeline`]は完党な`float32`粟床で生成を50ステップ実行したす。float16`のような䜎い粟床に倉曎するか、掚論ステップ数を枛らすこずで高速化するこずができたす。 - -たずは `float16` でモデルをロヌドしお画像を生成しおみたしょう - -```python -import torch - -pipeline = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, use_safetensors=True) -pipeline = pipeline.to("cuda") -generator = torch.Generator("cuda").manual_seed(0) -image = pipeline(prompt, generator=generator).images[0] -image -``` - -
- -
- -今回、画像生成にかかった時間はわずか11秒で、以前より3倍近く速くなりたした - - - -💡 パむプラむンは垞に `float16` で実行するこずを匷くお勧めしたす。 - - - -生成ステップ数を枛らすずいう方法もありたす。より効率的なスケゞュヌラを遞択するこずで、出力品質を犠牲にするこずなくステップ数を枛らすこずができたす。`compatibles`メ゜ッドを呌び出すこずで、[`DiffusionPipeline`]の珟圚のモデルず互換性のあるスケゞュヌラを芋぀けるこずができたす - -```python -pipeline.scheduler.compatibles -[ - diffusers.schedulers.scheduling_lms_discrete.LMSDiscreteScheduler, - diffusers.schedulers.scheduling_unipc_multistep.UniPCMultistepScheduler, - diffusers.schedulers.scheduling_k_dpm_2_discrete.KDPM2DiscreteScheduler, - diffusers.schedulers.scheduling_deis_multistep.DEISMultistepScheduler, - diffusers.schedulers.scheduling_euler_discrete.EulerDiscreteScheduler, - diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler, - diffusers.schedulers.scheduling_ddpm.DDPMScheduler, - diffusers.schedulers.scheduling_dpmsolver_singlestep.DPMSolverSinglestepScheduler, - diffusers.schedulers.scheduling_k_dpm_2_ancestral_discrete.KDPM2AncestralDiscreteScheduler, - diffusers.schedulers.scheduling_heun_discrete.HeunDiscreteScheduler, - diffusers.schedulers.scheduling_pndm.PNDMScheduler, - diffusers.schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteScheduler, - diffusers.schedulers.scheduling_ddim.DDIMScheduler, -] -``` - -Stable Diffusionモデルはデフォルトで[`PNDMScheduler`]を䜿甚したす。このスケゞュヌラは通垞~50の掚論ステップを必芁ずしたすが、[`DPMSolverMultistepScheduler`]のような高性胜なスケゞュヌラでは~20たたは25の掚論ステップで枈みたす。[`ConfigMixin.from_config`]メ゜ッドを䜿甚するず、新しいスケゞュヌラをロヌドするこずができたす - -```python -from diffusers import DPMSolverMultistepScheduler - -pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config) -``` - -ここで `num_inference_steps` を20に蚭定したす - -```python -generator = torch.Generator("cuda").manual_seed(0) -image = pipeline(prompt, generator=generator, num_inference_steps=20).images[0] -image -``` - -
- -
- -掚論時間をわずか4秒に短瞮するこずに成功した⚡ - -## メモリヌ - -パむプラむンのパフォヌマンスを向䞊させるもう1぀の鍵は、消費メモリを少なくするこずです。䞀床に生成できる画像の数を確認する最も簡単な方法は、`OutOfMemoryError`OOMが発生するたで、さたざたなバッチサむズを詊しおみるこずです。 - -文章ず `Generators` のリストから画像のバッチを生成する関数を䜜成したす。各 `Generator` にシヌドを割り圓おお、良い結果が埗られた堎合に再利甚できるようにしたす。 - -```python -def get_inputs(batch_size=1): - generator = [torch.Generator("cuda").manual_seed(i) for i in range(batch_size)] - prompts = batch_size * [prompt] - num_inference_steps = 20 - - return {"prompt": prompts, "generator": generator, "num_inference_steps": num_inference_steps} -``` - -`batch_size=4`で開始し、どれだけメモリを消費したかを確認したす - -```python -from diffusers.utils import make_image_grid - -images = pipeline(**get_inputs(batch_size=4)).images -make_image_grid(images, 2, 2) -``` - -倧容量のRAMを搭茉したGPUでない限り、䞊蚘のコヌドはおそらく`OOM`゚ラヌを返したはずですメモリの倧半はクロスアテンションレむダヌが占めおいたす。この凊理をバッチで実行する代わりに、逐次実行するこずでメモリを倧幅に節玄できたす。必芁なのは、[`~DiffusionPipeline.enable_attention_slicing`]関数を䜿甚するこずだけです - -```python -pipeline.enable_attention_slicing() -``` - -今床は`batch_size`を8にしおみおください - -```python -images = pipeline(**get_inputs(batch_size=8)).images -make_image_grid(images, rows=2, cols=4) -``` - -
- -
- -以前は4枚の画像のバッチを生成するこずさえできたせんでしたが、今では8枚の画像のバッチを1枚あたり3.5秒で生成できたすこれはおそらく、品質を犠牲にするこずなくT4 GPUでできる最速の凊理速床です。 - -## 品質 - -前の2぀のセクションでは、`fp16` を䜿っおパむプラむンの速床を最適化する方法、よりパフォヌマン スなスケゞュヌラヌを䜿っお生成ステップ数を枛らす方法、アテンションスラむスを有効 にしおメモリ消費量を枛らす方法に぀いお孊びたした。今床は、生成される画像の品質を向䞊させる方法に焊点を圓おたす。 - -### より良いチェックポむント - -最も単玔なステップは、より良いチェックポむントを䜿うこずです。Stable Diffusionモデルは良い出発点であり、公匏発衚以来、いく぀かの改良版もリリヌスされおいたす。しかし、新しいバヌゞョンを䜿ったからずいっお、自動的に良い結果が埗られるわけではありたせん。最良の結果を埗るためには、自分でさたざたなチェックポむントを詊しおみたり、ちょっずした研究[ネガティブプロンプト](https://minimaxir.com/2022/11/stable-diffusion-negative-prompt/)の䜿甚などをしたりする必芁がありたす。 - -この分野が成長するに぀れお、特定のスタむルを生み出すために埮調敎された、より質の高いチェックポむントが増えおいたす。[Hub](https://huggingface.co/models?library=diffusers&sort=downloads)や[Diffusers Gallery](https://huggingface.co/spaces/huggingface-projects/diffusers-gallery)を探玢しお、興味のあるものを芋぀けおみおください - -### より良いパむプラむンコンポヌネント - -珟圚のパむプラむンコンポヌネントを新しいバヌゞョンに眮き換えおみるこずもできたす。Stability AIが提䟛する最新の[autodecoder](https://huggingface.co/stabilityai/stable-diffusion-2-1/tree/main/vae)をパむプラむンにロヌドし、画像を生成しおみたしょう - -```python -from diffusers import AutoencoderKL - -vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16).to("cuda") -pipeline.vae = vae -images = pipeline(**get_inputs(batch_size=8)).images -make_image_grid(images, rows=2, cols=4) -``` - -
- -
- -### より良いプロンプト・゚ンゞニアリング - -画像を生成するために䜿甚する文章は、*プロンプト゚ンゞニアリング*ず呌ばれる分野を䜜られるほど、非垞に重芁です。プロンプト・゚ンゞニアリングで考慮すべき点は以䞋の通りです - -- 生成したい画像やその類䌌画像は、むンタヌネット䞊にどのように保存されおいるか -- 私が望むスタむルにモデルを誘導するために、どのような远加詳现を䞎えるべきか - -このこずを念頭に眮いお、プロンプトに色やより質の高いディテヌルを含めるように改良しおみたしょう - -```python -prompt += ", tribal panther make up, blue on red, side profile, looking away, serious eyes" -prompt += " 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta" -``` - -新しいプロンプトで画像のバッチを生成したしょう - -```python -images = pipeline(**get_inputs(batch_size=8)).images -make_image_grid(images, rows=2, cols=4) -``` - -
- -
- -かなりいいです皮が`1`の`Generator`に察応する2番目の画像に、被写䜓の幎霢に関するテキストを远加しお、もう少し手を加えおみたしょう - -```python -prompts = [ - "portrait photo of the oldest warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", - "portrait photo of a old warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", - "portrait photo of a warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", - "portrait photo of a young warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", -] - -generator = [torch.Generator("cuda").manual_seed(1) for _ in range(len(prompts))] -images = pipeline(prompt=prompts, generator=generator, num_inference_steps=25).images -make_image_grid(images, 2, 2) -``` - -
- -
- -## 次のステップ - -このチュヌトリアルでは、[`DiffusionPipeline`]を最適化しお蚈算効率ずメモリ効率を向䞊させ、生成される出力の品質を向䞊させる方法を孊びたした。パむプラむンをさらに高速化するこずに興味があれば、以䞋のリ゜ヌスを参照しおください - -- [PyTorch 2.0](./optimization/torch2.0)ず[`torch.compile`](https://pytorch.org/docs/stable/generated/torch.compile.html)がどのように生成速床を5-300%高速化できるかを孊んでください。A100 GPUの堎合、画像生成は最倧50%速くなりたす -- PyTorch 2が䜿えない堎合は、[xFormers](./optimization/xformers)をむンストヌルするこずをお勧めしたす。このラむブラリのメモリ効率の良いアテンションメカニズムは PyTorch 1.13.1 ず盞性が良く、高速化ずメモリ消費量の削枛を同時に実珟したす。 -- モデルのオフロヌドなど、その他の最適化テクニックは [this guide](./optimization/fp16) でカバヌされおいたす。 From 70b481898c40af8d88315e9368ac833965490c5c Mon Sep 17 00:00:00 2001 From: isamu-isozaki Date: Mon, 23 Oct 2023 21:11:40 -0400 Subject: [PATCH 9/9] Properly push --- docs/source/ja/_toctree.yml | 10 + docs/source/ja/index.md | 98 +++++++++ docs/source/ja/installation.md | 145 +++++++++++++ docs/source/ja/quicktour.md | 316 +++++++++++++++++++++++++++++ docs/source/ja/stable_diffusion.md | 260 ++++++++++++++++++++++++ 5 files changed, 829 insertions(+) create mode 100644 docs/source/ja/_toctree.yml create mode 100644 docs/source/ja/index.md create mode 100644 docs/source/ja/installation.md create mode 100644 docs/source/ja/quicktour.md create mode 100644 docs/source/ja/stable_diffusion.md diff --git a/docs/source/ja/_toctree.yml b/docs/source/ja/_toctree.yml new file mode 100644 index 000000000000..7af1f9f2b28d --- /dev/null +++ b/docs/source/ja/_toctree.yml @@ -0,0 +1,10 @@ +- sections: + - local: index + title: 🧚 Diffusers + - local: quicktour + title: 簡単な案内 + - local: stable_diffusion + title: 効果的で効率的な拡散モデル + - local: installation + title: むンストヌル + title: はじめに \ No newline at end of file diff --git a/docs/source/ja/index.md b/docs/source/ja/index.md new file mode 100644 index 000000000000..6e8ba78dd55f --- /dev/null +++ b/docs/source/ja/index.md @@ -0,0 +1,98 @@ + + +

+ +# Diffusers + +🀗 Diffusers は、画像や音声、さらには分子の3D構造を生成するための、最先端の事前孊習枈みDiffusion Model(拡散モデル)を提䟛するラむブラリです。シンプルな生成゜リュヌションをお探しの堎合でも、独自の拡散モデルをトレヌニングしたい堎合でも、🀗 Diffusers はその䞡方をサポヌトするモゞュヌル匏のツヌルボックスです。我々のラむブラリは、[性胜より䜿いやすさ](conceptual/philosophy#usability-over-performance)、[簡単よりシンプル](conceptual/philosophy#simple-over-easy)、[抜象化よりカスタマむズ性](conceptual/philosophy#tweakable-contributorfriendly-over-abstraction)に重点を眮いお蚭蚈されおいたす。 + +このラむブラリには3぀の䞻芁コンポヌネントがありたす: + +- 最先端の[拡散パむプラむン](api/pipelines/overview)で数行のコヌドで生成が可胜です。 +- 亀換可胜な[ノむズスケゞュヌラ](api/schedulers/overview)で生成速床ず品質のトレヌドオフのバランスをずれたす。 +- 事前に蚓緎された[モデル](api/models)は、ビルディングブロックずしお䜿甚するこずができ、スケゞュヌラず組み合わせるこずで、独自の゚ンドツヌ゚ンドの拡散システムを䜜成するこずができたす。 + +
- **チュヌトリアル**: 出力の生成、独自の拡散システムの構築、拡散モデルのトレヌニングを開始するために必芁な基本的なスキルを孊ぶこずができたす。初めお🀗Diffusersを䜿甚する堎合は、ここから始めるこずをお勧めしたす
- **ガむド**: パむプラむン、モデル、スケゞュヌラのロヌドに圹立぀実践的なガむドです。たた、特定のタスクにパむプラむンを䜿甚する方法、出力の生成方法を制埡する方法、生成速床を最適化する方法、さたざたなトレヌニング手法に぀いおも孊ぶこずができたす。
- **Conceptual guides**: ラむブラリがなぜこのように蚭蚈されたのかを理解し、ラむブラリを利甚する際の倫理的ガむドラむンや安党察策に぀いお詳しく孊べたす。
- **Reference**: 🀗 Diffusersのクラスずメ゜ッドがどのように機胜するかに぀いおの技術的な説明です。
+ +## Supported pipelines + +| Pipeline | Paper/Repository | Tasks | +|---|---|:---:| +| [alt_diffusion](./api/pipelines/alt_diffusion) | [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) | Image-to-Image Text-Guided Generation | +| [audio_diffusion](./api/pipelines/audio_diffusion) | [Audio Diffusion](https://github.com/teticio/audio-diffusion.git) | Unconditional Audio Generation | +| [controlnet](./api/pipelines/controlnet) | [Adding Conditional Control to Text-to-Image Diffusion Models](https://arxiv.org/abs/2302.05543) | Image-to-Image Text-Guided Generation | +| [cycle_diffusion](./api/pipelines/cycle_diffusion) | [Unifying Diffusion Models' Latent Space, with Applications to CycleDiffusion and Guidance](https://arxiv.org/abs/2210.05559) | Image-to-Image Text-Guided Generation | +| [dance_diffusion](./api/pipelines/dance_diffusion) | [Dance Diffusion](https://github.com/williamberman/diffusers.git) | Unconditional Audio Generation | +| [ddpm](./api/pipelines/ddpm) | [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239) | Unconditional Image Generation | +| [ddim](./api/pipelines/ddim) | [Denoising Diffusion Implicit Models](https://arxiv.org/abs/2010.02502) | Unconditional Image Generation | +| [if](./if) | [**IF**](./api/pipelines/if) | Image Generation | +| [if_img2img](./if) | [**IF**](./api/pipelines/if) | Image-to-Image Generation | +| [if_inpainting](./if) | [**IF**](./api/pipelines/if) | Image-to-Image Generation | +| [latent_diffusion](./api/pipelines/latent_diffusion) | [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752)| Text-to-Image Generation | +| [latent_diffusion](./api/pipelines/latent_diffusion) | [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752)| Super Resolution Image-to-Image | +| [latent_diffusion_uncond](./api/pipelines/latent_diffusion_uncond) | [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752) | Unconditional Image Generation | +| [paint_by_example](./api/pipelines/paint_by_example) | [Paint by Example: Exemplar-based Image Editing with Diffusion Models](https://arxiv.org/abs/2211.13227) | Image-Guided Image Inpainting | +| [pndm](./api/pipelines/pndm) | [Pseudo Numerical Methods for Diffusion Models on Manifolds](https://arxiv.org/abs/2202.09778) | Unconditional Image Generation | +| [score_sde_ve](./api/pipelines/score_sde_ve) | [Score-Based Generative Modeling through Stochastic Differential Equations](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation | +| [score_sde_vp](./api/pipelines/score_sde_vp) | [Score-Based Generative Modeling through Stochastic Differential Equations](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation | +| [semantic_stable_diffusion](./api/pipelines/semantic_stable_diffusion) | [Semantic Guidance](https://arxiv.org/abs/2301.12247) | Text-Guided Generation | +| [stable_diffusion_adapter](./api/pipelines/stable_diffusion/adapter) | [**T2I-Adapter**](https://arxiv.org/abs/2302.08453) | Image-to-Image Text-Guided Generation | - +| [stable_diffusion_text2img](./api/pipelines/stable_diffusion/text2img) | [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release) | Text-to-Image Generation | +| [stable_diffusion_img2img](./api/pipelines/stable_diffusion/img2img) | [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release) 
| Image-to-Image Text-Guided Generation | +| [stable_diffusion_inpaint](./api/pipelines/stable_diffusion/inpaint) | [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release) | Text-Guided Image Inpainting | +| [stable_diffusion_panorama](./api/pipelines/stable_diffusion/panorama) | [MultiDiffusion](https://multidiffusion.github.io/) | Text-to-Panorama Generation | +| [stable_diffusion_pix2pix](./api/pipelines/stable_diffusion/pix2pix) | [InstructPix2Pix: Learning to Follow Image Editing Instructions](https://arxiv.org/abs/2211.09800) | Text-Guided Image Editing| +| [stable_diffusion_pix2pix_zero](./api/pipelines/stable_diffusion/pix2pix_zero) | [Zero-shot Image-to-Image Translation](https://pix2pixzero.github.io/) | Text-Guided Image Editing | +| [stable_diffusion_attend_and_excite](./api/pipelines/stable_diffusion/attend_and_excite) | [Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models](https://arxiv.org/abs/2301.13826) | Text-to-Image Generation | +| [stable_diffusion_self_attention_guidance](./api/pipelines/stable_diffusion/self_attention_guidance) | [Improving Sample Quality of Diffusion Models Using Self-Attention Guidance](https://arxiv.org/abs/2210.00939) | Text-to-Image Generation Unconditional Image Generation | +| [stable_diffusion_image_variation](./stable_diffusion/image_variation) | [Stable Diffusion Image Variations](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations) | Image-to-Image Generation | +| [stable_diffusion_latent_upscale](./stable_diffusion/latent_upscale) | [Stable Diffusion Latent Upscaler](https://twitter.com/StabilityAI/status/1590531958815064065) | Text-Guided Super Resolution Image-to-Image | +| [stable_diffusion_model_editing](./api/pipelines/stable_diffusion/model_editing) | [Editing Implicit Assumptions in Text-to-Image Diffusion Models](https://time-diffusion.github.io/) | Text-to-Image Model Editing | +| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [Stable Diffusion 2](https://stability.ai/blog/stable-diffusion-v2-release) | Text-to-Image Generation | +| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [Stable Diffusion 2](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Image Inpainting | +| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [Depth-Conditional Stable Diffusion](https://github.com/Stability-AI/stablediffusion#depth-conditional-stable-diffusion) | Depth-to-Image Generation | +| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [Stable Diffusion 2](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Super Resolution Image-to-Image | +| [stable_diffusion_safe](./api/pipelines/stable_diffusion_safe) | [Safe Stable Diffusion](https://arxiv.org/abs/2211.05105) | Text-Guided Generation | +| [stable_unclip](./stable_unclip) | Stable unCLIP | Text-to-Image Generation | +| [stable_unclip](./stable_unclip) | Stable unCLIP | Image-to-Image Text-Guided Generation | +| [stochastic_karras_ve](./api/pipelines/stochastic_karras_ve) | [Elucidating the Design Space of Diffusion-Based Generative Models](https://arxiv.org/abs/2206.00364) | Unconditional Image Generation | +| [text_to_video_sd](./api/pipelines/text_to_video) | [Modelscope's Text-to-video-synthesis Model in Open Domain](https://modelscope.cn/models/damo/text-to-video-synthesis/summary) | Text-to-Video Generation | +| [unclip](./api/pipelines/unclip) | [Hierarchical Text-Conditional Image Generation with CLIP 
Latents](https://arxiv.org/abs/2204.06125)(implementation by [kakaobrain](https://github.com/kakaobrain/karlo)) | Text-to-Image Generation | +| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Text-to-Image Generation | +| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Image Variations Generation | +| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Dual Image and Text Guided Generation | +| [vq_diffusion](./api/pipelines/vq_diffusion) | [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://arxiv.org/abs/2111.14822) | Text-to-Image Generation | +| [stable_diffusion_ldm3d](./api/pipelines/stable_diffusion/ldm3d_diffusion) | [LDM3D: Latent Diffusion Model for 3D](https://arxiv.org/abs/2305.10853) | Text to Image and Depth Generation | diff --git a/docs/source/ja/installation.md b/docs/source/ja/installation.md new file mode 100644 index 000000000000..dbfd19d6cb7a --- /dev/null +++ b/docs/source/ja/installation.md @@ -0,0 +1,145 @@ + + +# むンストヌル + +お䜿いのディヌプラヌニングラむブラリに合わせおDiffusersをむンストヌルできたす。 + +🀗 DiffusersはPython 3.8+、PyTorch 1.7.0+、Flaxでテストされおいたす。䜿甚するディヌプラヌニングラむブラリの以䞋のむンストヌル手順に埓っおください + +- [PyTorch](https://pytorch.org/get-started/locally/)のむンストヌル手順。 +- [Flax](https://flax.readthedocs.io/en/latest/)のむンストヌル手順。 + +## pip でむンストヌル + +Diffusersは[仮想環境](https://docs.python.org/3/library/venv.html)の䞭でむンストヌルするこずが掚奚されおいたす。 +Python の仮想環境に぀いおよく知らない堎合は、こちらの [ガむド](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/) を参照しおください。 +仮想環境は異なるプロゞェクトの管理を容易にし、䟝存関係間の互換性の問題を回避したす。 + +ではさっそく、プロゞェクトディレクトリに仮想環境を䜜っおみたす + +```bash +python -m venv .env +``` + +仮想環境をアクティブにしたす + +```bash +source .env/bin/activate +``` + +🀗 Diffusers もたた 🀗 Transformers ラむブラリに䟝存しおおり、以䞋のコマンドで䞡方をむンストヌルできたす + + + +```bash +pip install diffusers["torch"] transformers +``` + + +```bash +pip install diffusers["flax"] transformers +``` + + + +## ゜ヌスからのむンストヌル + +゜ヌスから🀗 Diffusersをむンストヌルする前に、`torch`ず🀗 Accelerateがむンストヌルされおいるこずを確認しおください。 + +`torch`のむンストヌルに぀いおは、`torch` [むンストヌル](https://pytorch.org/get-started/locally/#start-locally)ガむドを参照しおください。 + +🀗 Accelerateをむンストヌルするには + +```bash +pip install accelerate +``` + +以䞋のコマンドで゜ヌスから🀗 Diffusersをむンストヌルできたす + +```bash +pip install git+https://github.com/huggingface/diffusers +``` + +このコマンドは最新の `stable` バヌゞョンではなく、最先端の `main` バヌゞョンをむンストヌルしたす。 +`main`バヌゞョンは最新の開発に察応するのに䟿利です。 +䟋えば、前回の公匏リリヌス以降にバグが修正されたが、新しいリリヌスがただリリヌスされおいない堎合などには郜合がいいです。 +しかし、これは `main` バヌゞョンが垞に安定しおいるずは限らないです。 +私たちは `main` バヌゞョンを運甚し続けるよう努力しおおり、ほずんどの問題は通垞数時間から1日以内に解決されたす。 +もし問題が発生した堎合は、[Issue](https://github.com/huggingface/diffusers/issues/new/choose) を開いおください + +## 線集可胜なむンストヌル + +以䞋の堎合、線集可胜なむンストヌルが必芁です + +* ゜ヌスコヌドの `main` バヌゞョンを䜿甚する。 +* 🀗 Diffusers に貢献し、コヌドの倉曎をテストする必芁がある堎合。 + +リポゞトリをクロヌンし、次のコマンドで 🀗 Diffusers をむンストヌルしおください + +```bash +git clone https://github.com/huggingface/diffusers.git +cd diffusers +``` + + + +```bash +pip install -e ".[torch]" +``` + + +```bash +pip install -e ".[flax]" +``` + + + +これらのコマンドは、リポゞトリをクロヌンしたフォルダず Python のラむブラリパスをリンクしたす。 +Python は通垞のラむブラリパスに加えお、クロヌンしたフォルダの䞭を探すようになりたす。 +䟋えば、Python パッケヌゞが通垞 `~/anaconda3/envs/main/lib/python3.8/site-packages/` にむンストヌルされおいる堎合、Python はクロヌンした `~/diffusers/` 
フォルダも同様に参照したす。 + + + +ラむブラリを䜿い続けたい堎合は、`diffusers`フォルダを残しおおく必芁がありたす。 + + + +これで、以䞋のコマンドで簡単にクロヌンを最新版の🀗 Diffusersにアップデヌトできたす + +```bash +cd ~/diffusers/ +git pull +``` + +Python環境は次の実行時に `main` バヌゞョンの🀗 Diffusersを芋぀けたす。 + +## テレメトリヌ・ロギングに関するお知らせ + +このラむブラリは `from_pretrained()` リク゚スト䞭にデヌタを収集したす。 +このデヌタには Diffusers ず PyTorch/Flax のバヌゞョン、芁求されたモデルやパむプラむンクラスが含たれたす。 +たた、Hubでホストされおいる堎合は、事前に孊習されたチェックポむントぞのパスが含たれたす。 +この䜿甚デヌタは問題のデバッグや新機胜の優先順䜍付けに圹立ちたす。 +テレメトリヌはHuggingFace Hubからモデルやパむプラむンをロヌドするずきのみ送信されたす。ロヌカルでの䜿甚䞭は収集されたせん。 + +我々は、すべおの人が远加情報を共有したくないこずを理解し、あなたのプラむバシヌを尊重したす。 +そのため、タヌミナルから `DISABLE_TELEMETRY` 環境倉数を蚭定するこずで、デヌタ収集を無効にするこずができたす + +Linux/MacOSの堎合 +```bash +export DISABLE_TELEMETRY=YES +``` + +Windows の堎合 +```bash +set DISABLE_TELEMETRY=YES +``` diff --git a/docs/source/ja/quicktour.md b/docs/source/ja/quicktour.md new file mode 100644 index 000000000000..04c93af4168c --- /dev/null +++ b/docs/source/ja/quicktour.md @@ -0,0 +1,316 @@ + + +[[open-in-colab]] + +# 簡単な案内 + +拡散モデル(Diffusion Model)は、ランダムな正芏分垃から段階的にノむズ陀去するように孊習され、画像や音声などの目的のものを生成できたす。これは生成AIに倚倧な関心を呌び起こしたした。むンタヌネット䞊で拡散によっお生成された画像の䟋を芋たこずがあるでしょう。🧚 Diffusersは、誰もが拡散モデルに広くアクセスできるようにするこずを目的ずしたラむブラリです。 + +この案内では、開発者たたは日垞的なナヌザヌに関わらず、🧚 Diffusers を玹介し、玠早く目的のものを生成できるようにしたすこのラむブラリには3぀の䞻芁コンポヌネントがありたす: + +* [`DiffusionPipeline`]は事前に孊習された拡散モデルからサンプルを迅速に生成するために蚭蚈された高レベルの゚ンドツヌ゚ンドクラス。 +* 拡散システムを䜜成するためのビルディングブロックずしお䜿甚できる、人気のある事前孊習された[モデル](./api/models)アヌキテクチャずモゞュヌル。 +* 倚くの異なる[スケゞュヌラ](./api/schedulers/overview) - ノむズがどのようにトレヌニングのために加えられるか、そしお生成䞭にどのようにノむズ陀去された画像を生成するかを制埡するアルゎリズム。 + +この案内では、[`DiffusionPipeline`]を生成に䜿甚する方法を玹介し、モデルずスケゞュヌラを組み合わせお[`DiffusionPipeline`]の内郚で起こっおいるこずを再珟する方法を説明したす。 + + + +この案内は🧚 Diffusers [ノヌトブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/diffusers_intro.ipynb)を簡略化したもので、すぐに䜿い始めるこずができたす。Diffusers 🧚のゎヌル、蚭蚈哲孊、コアAPIの詳现に぀いおもっず知りたい方は、ノヌトブックをご芧ください + + + +始める前に必芁なラむブラリヌがすべおむンストヌルされおいるこずを確認しおください + +```py +# uncomment to install the necessary libraries in Colab +#!pip install --upgrade diffusers accelerate transformers +``` + +- [🀗 Accelerate](https://huggingface.co/docs/accelerate/index)生成ずトレヌニングのためのモデルのロヌドを高速化したす +- [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview)ような最も䞀般的な拡散モデルを実行するには、[🀗 Transformers](https://huggingface.co/docs/transformers/index)が必芁です。 + +## 拡散パむプラむン + +[`DiffusionPipeline`]は事前孊習された拡散システムを生成に䜿甚する最も簡単な方法です。これはモデルずスケゞュヌラを含む゚ンドツヌ゚ンドのシステムです。[`DiffusionPipeline`]は倚くの䜜業タスクにすぐに䜿甚するこずができたす。たた、サポヌトされおいるタスクの完党なリストに぀いおは[🧚Diffusersの抂芁](./api/pipelines/overview#diffusers-summary)の衚を参照しおください。 + +| **タスク** | **説明** | **パむプラむン** +|------------------------------|--------------------------------------------------------------------------------------------------------------|-----------------| +| Unconditional Image Generation | 正芏分垃から画像生成 | [unconditional_image_generation](./using-diffusers/unconditional_image_generation) | +| Text-Guided Image Generation | 文章から画像生成 | [conditional_image_generation](./using-diffusers/conditional_image_generation) | +| Text-Guided Image-to-Image Translation | 画像ず文章から新たな画像生成 | [img2img](./using-diffusers/img2img) | +| Text-Guided Image-Inpainting | 画像、マスク、および文章が指定された堎合に、画像のマスクされた郚分を文章をもずに修埩 | [inpaint](./using-diffusers/inpaint) | +| Text-Guided Depth-to-Image Translation | 文章ず深床掚定によっお構造を保持しながら画像生成 | [depth2img](./using-diffusers/depth2img) | + +たず、[`DiffusionPipeline`]のむンスタンスを䜜成し、ダりンロヌドしたいパむプラむンのチェックポむントを指定したす。 +この[`DiffusionPipeline`]はHugging Face 
Hubに保存されおいる任意の[チェックポむント](https://huggingface.co/models?library=diffusers&sort=downloads)を䜿甚するこずができたす。 +この案内では、[`stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5)チェックポむントでテキストから画像ぞ生成したす。 + + + +[Stable Diffusion]モデルに぀いおは、モデルを実行する前にたず[ラむセンス](https://huggingface.co/spaces/CompVis/stable-diffusion-license)を泚意深くお読みください。🧚 Diffusers は、攻撃的たたは有害なコンテンツを防ぐために [`safety_checker`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/safety_checker.py) を実装しおいたすが、モデルの改良された画像生成機胜により、朜圚的に有害なコンテンツが生成される可胜性がありたす。 + + + +モデルを[`~DiffusionPipeline.from_pretrained`]メ゜ッドでロヌドしたす + +```python +>>> from diffusers import DiffusionPipeline + +>>> pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True) +``` +[`DiffusionPipeline`]は党おのモデリング、トヌクン化、スケゞュヌリングコンポヌネントをダりンロヌドしおキャッシュしたす。Stable Diffusionパむプラむンは[`UNet2DConditionModel`]ず[`PNDMScheduler`]などで構成されおいたす + +```py +>>> pipeline +StableDiffusionPipeline { + "_class_name": "StableDiffusionPipeline", + "_diffusers_version": "0.13.1", + ..., + "scheduler": [ + "diffusers", + "PNDMScheduler" + ], + ..., + "unet": [ + "diffusers", + "UNet2DConditionModel" + ], + "vae": [ + "diffusers", + "AutoencoderKL" + ] +} +``` + +このモデルはおよそ14億個のパラメヌタで構成されおいるため、GPU䞊でパむプラむンを実行するこずを匷く掚奚したす。 +PyTorchず同じように、ゞェネレヌタオブゞェクトをGPUに移すこずができたす + +```python +>>> pipeline.to("cuda") +``` + +これで、文章を `pipeline` に枡しお画像を生成し、ノむズ陀去された画像にアクセスできるようになりたした。デフォルトでは、画像出力は[`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html?highlight=image#the-image-class)オブゞェクトでラップされたす。 + +```python +>>> image = pipeline("An image of a squirrel in Picasso style").images[0] +>>> image +``` + +
+ +
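なお、パむプラむンの出力 `images` はリストなので、1回の呌び出しで耇数枚を生成するこずもできたす。以䞋は本文にはない補足のスケッチで、Stable Diffusion パむプラむンがサポヌトする `num_images_per_prompt` 匕数を䜿う䟋です

```python
>>> # Supplementary sketch (not part of the original guide):
>>> # `images` is a list, so several images can be generated for one prompt.
>>> images = pipeline(
...     "An image of a squirrel in Picasso style",
...     num_images_per_prompt=2,
... ).images
>>> len(images)
2
```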
+ +`save`関数で画像を保存できたす: + +```python +>>> image.save("image_of_squirrel_painting.png") +``` + +### ロヌカルパむプラむン + +ロヌカルでパむプラむンを䜿甚するこずもできたす。唯䞀の違いは、最初にりェむトをダりンロヌドする必芁があるこずです + +```bash +!git lfs install +!git clone https://huggingface.co/runwayml/stable-diffusion-v1-5 +``` + +保存したりェむトをパむプラむンにロヌドしたす + +```python +>>> pipeline = DiffusionPipeline.from_pretrained("./stable-diffusion-v1-5", use_safetensors=True) +``` + +これで、䞊のセクションず同じようにパむプラむンを動かすこずができたす。 + +### スケゞュヌラの亀換 + +スケゞュヌラヌによっお、ノむズ陀去のスピヌドや品質のトレヌドオフが異なりたす。どれが自分に最適かを知る最善の方法は、実際に詊しおみるこずですDiffusers 🧚の䞻な機胜の1぀は、スケゞュヌラを簡単に切り替えるこずができるこずです。䟋えば、デフォルトの[`PNDMScheduler`]を[`EulerDiscreteScheduler`]に眮き換えるには、[`~diffusers.ConfigMixin.from_config`]メ゜ッドでロヌドできたす + +```py +>>> from diffusers import EulerDiscreteScheduler + +>>> pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True) +>>> pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config) +``` + +新しいスケゞュヌラを䜿っお画像を生成し、その違いに気づくかどうか詊しおみおください + +次のセクションでは、[`DiffusionPipeline`]を構成するコンポヌネントモデルずスケゞュヌラを詳しく芋お、これらのコンポヌネントを䜿っお猫の画像を生成する方法を孊びたす。 + +## モデル + +ほずんどのモデルはノむズの倚いサンプルを取り、各タむムステップで*残りのノむズ*を予枬したす他のモデルは前のサンプルを盎接予枬するか、速床たたは[`v-prediction`](https://github.com/huggingface/diffusers/blob/5e5ce13e2f89ac45a0066cb3f369462a3cf1d9ef/src/diffusers/schedulers/scheduling_ddim.py#L110)を予枬するように孊習したす。モデルを混ぜお他の拡散システムを䜜るこずもできたす。 + +モデルは[`~ModelMixin.from_pretrained`]メ゜ッドで開始されたす。このメ゜ッドはモデルをロヌカルにキャッシュするので、次にモデルをロヌドするずきに高速になりたす。この案内では、[`UNet2DModel`]をロヌドしたす。これは基本的な画像生成モデルであり、猫画像で孊習されたチェックポむントを䜿いたす + +```py +>>> from diffusers import UNet2DModel + +>>> repo_id = "google/ddpm-cat-256" +>>> model = UNet2DModel.from_pretrained(repo_id, use_safetensors=True) +``` + +モデルのパラメヌタにアクセスするには、`model.config` を呌び出せたす + +```py +>>> model.config +``` + +モデル構成は🧊凍結🧊されたディクショナリであり、モデル䜜成埌にこれらのパラメヌ タを倉曎するこずはできたせん。これは意図的なもので、最初にモデル・アヌキテクチャを定矩するために䜿甚されるパラメヌタが同じたたであるこずを保蚌したす。他のパラメヌタは生成䞭に調敎するこずができたす。 + +最も重芁なパラメヌタは以䞋の通りです + +* sample_size`: 入力サンプルの高さず幅。 +* `in_channels`: 入力サンプルの入力チャンネル数。 +* down_block_types` ず `up_block_types`: UNet アヌキテクチャを䜜成するために䜿甚されるダりンサンプリングブロックずアップサンプリングブロックのタむプ。 +* block_out_channels`: ダりンサンプリングブロックの出力チャンネル数。逆順でアップサンプリングブロックの入力チャンネル数にも䜿甚されたす。 +* layer_per_block`: 各 UNet ブロックに含たれる ResNet ブロックの数。 + +このモデルを生成に䜿甚するには、ランダムな画像の圢の正芏分垃を䜜成したす。このモデルは耇数のランダムな正芏分垃を受け取るこずができるため`batch`軞を入れたす。入力チャンネル数に察応する`channel`軞も必芁です。画像の高さず幅に察応する`sample_size`軞を持぀必芁がありたす + +```py +>>> import torch + +>>> torch.manual_seed(0) + +>>> noisy_sample = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size) +>>> noisy_sample.shape +torch.Size([1, 3, 256, 256]) +``` + +画像生成には、ノむズの倚い画像ず `timestep` をモデルに枡したす。`timestep`は入力画像がどの皋床ノむズが倚いかを瀺したす。これは、モデルが拡散プロセスにおける自分の䜍眮を決定するのに圹立ちたす。モデルの出力を埗るには `sample` メ゜ッドを䜿甚したす + +```py +>>> with torch.no_grad(): +... 
noisy_residual = model(sample=noisy_sample, timestep=2).sample +``` + +しかし、実際の䟋を生成するには、ノむズ陀去プロセスをガむドするスケゞュヌラが必芁です。次のセクションでは、モデルをスケゞュヌラず組み合わせる方法を孊びたす。 + +## スケゞュヌラ + +スケゞュヌラは、モデルの出力この堎合は `noisy_residual` が䞎えられたずきに、ノむズの倚いサンプルからノむズの少ないサンプルぞの移行を管理したす。 + + + + +🧚 Diffusersは拡散システムを構築するためのツヌルボックスです。[`DiffusionPipeline`]は事前に構築された拡散システムを䜿い始めるのに䟿利な方法ですが、独自のモデルずスケゞュヌラコンポヌネントを個別に遞択しおカスタム拡散システムを構築するこずもできたす。 + + + +この案内では、[`DDPMScheduler`]を[`~diffusers.ConfigMixin.from_config`]メ゜ッドでむンスタンス化したす + +```py +>>> from diffusers import DDPMScheduler + +>>> scheduler = DDPMScheduler.from_config(repo_id) +>>> scheduler +DDPMScheduler { + "_class_name": "DDPMScheduler", + "_diffusers_version": "0.13.1", + "beta_end": 0.02, + "beta_schedule": "linear", + "beta_start": 0.0001, + "clip_sample": true, + "clip_sample_range": 1.0, + "num_train_timesteps": 1000, + "prediction_type": "epsilon", + "trained_betas": null, + "variance_type": "fixed_small" +} +``` + + + +💡 スケゞュヌラがどのようにコンフィギュレヌションからむンスタンス化されるかに泚目しおください。モデルずは異なり、スケゞュヌラは孊習可胜な重みを持たず、パラメヌタヌを持ちたせん + + + +最も重芁なパラメヌタは以䞋の通りです + +* num_train_timesteps`: ノむズ陀去凊理の長さ、蚀い換えれば、ランダムな正芏分垃をデヌタサンプルに凊理するのに必芁なタむムステップ数です。 +* `beta_schedule`: 生成ずトレヌニングに䜿甚するノむズスケゞュヌルのタむプ。 +* `beta_start` ず `beta_end`: ノむズスケゞュヌルの開始倀ず終了倀。 + +少しノむズの少ない画像を予枬するには、スケゞュヌラの [`~diffusers.DDPMScheduler.step`] メ゜ッドに以䞋を枡したす: モデルの出力、`timestep`、珟圚の `sample`。 + +```py +>>> less_noisy_sample = scheduler.step(model_output=noisy_residual, timestep=2, sample=noisy_sample).prev_sample +>>> less_noisy_sample.shape +``` + +`less_noisy_sample`は次の`timestep`に枡すこずができ、そこでさらにノむズが少なくなりたす + +では、すべおをたずめお、ノむズ陀去プロセス党䜓を芖芚化しおみたしょう。 + +たず、ノむズ陀去された画像を埌凊理しお `PIL.Image` ずしお衚瀺する関数を䜜成したす + +```py +>>> import PIL.Image +>>> import numpy as np + + +>>> def display_sample(sample, i): +... image_processed = sample.cpu().permute(0, 2, 3, 1) +... image_processed = (image_processed + 1.0) * 127.5 +... image_processed = image_processed.numpy().astype(np.uint8) + +... image_pil = PIL.Image.fromarray(image_processed[0]) +... display(f"Image at step {i}") +... display(image_pil) +``` + +ノむズ陀去凊理を高速化するために入力ずモデルをGPUに移したす + +```py +>>> model.to("cuda") +>>> noisy_sample = noisy_sample.to("cuda") +``` + +ここで、ノむズが少なくなったサンプルの残りのノむズを予枬するノむズ陀去ルヌプを䜜成し、スケゞュヌラを䜿っおさらにノむズの少ないサンプルを蚈算したす + +```py +>>> import tqdm + +>>> sample = noisy_sample + +>>> for i, t in enumerate(tqdm.tqdm(scheduler.timesteps)): +... # 1. predict noise residual +... with torch.no_grad(): +... residual = model(sample, t).sample + +... # 2. compute less noisy image and set x_t -> x_t-1 +... sample = scheduler.step(residual, t, sample).prev_sample + +... # 3. optionally look at image +... if (i + 1) % 50 == 0: +... display_sample(sample, i + 1) +``` + +䜕もないずころから猫が生成されるのを、座っお芋おください😻 + +
+ +
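ここで䜿った `model` ず `scheduler` は、そのたた高レベルのパむプラむンにたずめるこずもできたす。以䞋は本文にはない補足のスケッチで、䞊のノむズ陀去ルヌプず同等の凊理を `DDPMPipeline` に任せる䟋です

```python
>>> # Supplementary sketch: the manual denoising loop above is what DDPMPipeline
>>> # runs internally, so the same components can be wrapped into one object.
>>> import torch
>>> from diffusers import DDPMPipeline

>>> ddpm_pipeline = DDPMPipeline(unet=model, scheduler=scheduler).to("cuda")
>>> image = ddpm_pipeline(generator=torch.Generator("cuda").manual_seed(0)).images[0]
>>> image.save("cat.png")
```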
+ +## 次のステップ + +このクむックツアヌで、🧚ディフュヌザヌを䜿ったクヌルな画像をいく぀か䜜成できたず思いたす次のステップずしお + +* モデルをトレヌニングたたは埮調敎に぀いおは、[training](./tutorials/basic_training)チュヌトリアルを参照しおください。 +* 様々な䜿甚䟋に぀いおは、公匏およびコミュニティの[training or finetuning scripts](https://github.com/huggingface/diffusers/tree/main/examples#-diffusers-examples)の䟋を参照しおください。 +* スケゞュヌラのロヌド、アクセス、倉曎、比范に぀いおは[Using different Schedulers](./using-diffusers/schedulers)ガむドを参照しおください。 +* プロンプト゚ンゞニアリング、スピヌドずメモリの最適化、より高品質な画像を生成するためのヒントやトリックに぀いおは、[Stable Diffusion](./stable_diffusion)ガむドを参照しおください。 +* 🧚 Diffusers の高速化に぀いおは、最適化された [PyTorch on a GPU](./optimization/fp16)のガむド、[Stable Diffusion on Apple Silicon (M1/M2)](./optimization/mps)ず[ONNX Runtime](./optimization/onnx)を参照しおください。 diff --git a/docs/source/ja/stable_diffusion.md b/docs/source/ja/stable_diffusion.md new file mode 100644 index 000000000000..fb5afc49435b --- /dev/null +++ b/docs/source/ja/stable_diffusion.md @@ -0,0 +1,260 @@ + + +# 効果的で効率的な拡散モデル + +[[open-in-colab]] + +[`DiffusionPipeline`]を䜿っお特定のスタむルで画像を生成したり、垌望する画像を生成したりするのは難しいこずです。倚くの堎合、[`DiffusionPipeline`]を䜕床か実行しおからでないず満足のいく画像は埗られたせん。しかし、䜕もないずころから䜕かを生成するにはたくさんの蚈算が必芁です。生成を䜕床も䜕床も実行する堎合、特にたくさんの蚈算量が必芁になりたす。 + +そのため、パむプラむンから*蚈算*速床ず*メモリ*GPU RAMの効率を最倧限に匕き出し、生成サむクル間の時間を短瞮するこずで、より高速な反埩凊理を行えるようにするこずが重芁です。 + +このチュヌトリアルでは、[`DiffusionPipeline`]を甚いお、より速く、より良い蚈算を行う方法を説明したす。 + +たず、[`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5)モデルをロヌドしたす + +```python +from diffusers import DiffusionPipeline + +model_id = "runwayml/stable-diffusion-v1-5" +pipeline = DiffusionPipeline.from_pretrained(model_id, use_safetensors=True) +``` + +ここで䜿甚するプロンプトの䟋は幎老いた戊士の長の肖像画ですが、ご自由に倉曎しおください + +```python +prompt = "portrait photo of a old warrior chief" +``` + +## Speed + + + +💡 GPUを利甚できない堎合は、[Colab](https://colab.research.google.com/)のようなGPUプロバむダヌから無料で利甚できたす + + + +画像生成を高速化する最も簡単な方法の1぀は、PyTorchモゞュヌルず同じようにGPU䞊にパむプラむンを配眮するこずです + +```python +pipeline = pipeline.to("cuda") +``` + +同じむメヌゞを䜿っお改良できるようにするには、[`Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html)を䜿い、[reproducibility](./using-diffusers/reproducibility)の皮を蚭定したす + +```python +import torch + +generator = torch.Generator("cuda").manual_seed(0) +``` + +これで画像を生成できたす + +```python +image = pipeline(prompt, generator=generator).images[0] +image +``` + +
+ +
+ +この凊理にはT4 GPUで~30秒かかりたした割り圓おられおいるGPUがT4より優れおいる堎合はもっず速いかもしれたせん。デフォルトでは、[`DiffusionPipeline`]は完党な`float32`粟床で生成を50ステップ実行したす。float16`のような䜎い粟床に倉曎するか、掚論ステップ数を枛らすこずで高速化するこずができたす。 + +たずは `float16` でモデルをロヌドしお画像を生成しおみたしょう + +```python +import torch + +pipeline = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, use_safetensors=True) +pipeline = pipeline.to("cuda") +generator = torch.Generator("cuda").manual_seed(0) +image = pipeline(prompt, generator=generator).images[0] +image +``` + +
+ +
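なお、本文に出おくる所芁時間は䜿甚する GPU や環境によっお倉わりたす。手元で確かめたい堎合は、䟋えば次のように蚈枬できたす本文にはない補足のスケッチで、関数名は説明甚の仮のものです

```python
import time

# Supplementary helper (not part of the original guide) to time one generation call.
def timed_generation(pipe, prompt, **kwargs):
    start = time.perf_counter()
    image = pipe(prompt, **kwargs).images[0]
    print(f"generation took {time.perf_counter() - start:.1f}s")
    return image

generator = torch.Generator("cuda").manual_seed(0)
image = timed_generation(pipeline, prompt, generator=generator)
```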
+ +今回、画像生成にかかった時間はわずか11秒で、以前より3倍近く速くなりたした + + + +💡 パむプラむンは垞に `float16` で実行するこずを匷くお勧めしたす。 + + + +生成ステップ数を枛らすずいう方法もありたす。より効率的なスケゞュヌラを遞択するこずで、出力品質を犠牲にするこずなくステップ数を枛らすこずができたす。`compatibles`メ゜ッドを呌び出すこずで、[`DiffusionPipeline`]の珟圚のモデルず互換性のあるスケゞュヌラを芋぀けるこずができたす + +```python +pipeline.scheduler.compatibles +[ + diffusers.schedulers.scheduling_lms_discrete.LMSDiscreteScheduler, + diffusers.schedulers.scheduling_unipc_multistep.UniPCMultistepScheduler, + diffusers.schedulers.scheduling_k_dpm_2_discrete.KDPM2DiscreteScheduler, + diffusers.schedulers.scheduling_deis_multistep.DEISMultistepScheduler, + diffusers.schedulers.scheduling_euler_discrete.EulerDiscreteScheduler, + diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler, + diffusers.schedulers.scheduling_ddpm.DDPMScheduler, + diffusers.schedulers.scheduling_dpmsolver_singlestep.DPMSolverSinglestepScheduler, + diffusers.schedulers.scheduling_k_dpm_2_ancestral_discrete.KDPM2AncestralDiscreteScheduler, + diffusers.schedulers.scheduling_heun_discrete.HeunDiscreteScheduler, + diffusers.schedulers.scheduling_pndm.PNDMScheduler, + diffusers.schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteScheduler, + diffusers.schedulers.scheduling_ddim.DDIMScheduler, +] +``` + +Stable Diffusionモデルはデフォルトで[`PNDMScheduler`]を䜿甚したす。このスケゞュヌラは通垞~50の掚論ステップを必芁ずしたすが、[`DPMSolverMultistepScheduler`]のような高性胜なスケゞュヌラでは~20たたは25の掚論ステップで枈みたす。[`ConfigMixin.from_config`]メ゜ッドを䜿甚するず、新しいスケゞュヌラをロヌドするこずができたす + +```python +from diffusers import DPMSolverMultistepScheduler + +pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config) +``` + +ここで `num_inference_steps` を20に蚭定したす + +```python +generator = torch.Generator("cuda").manual_seed(0) +image = pipeline(prompt, generator=generator, num_inference_steps=20).images[0] +image +``` + +
+ +
+ +掚論時間をわずか4秒に短瞮するこずに成功した⚡ + +## メモリヌ + +パむプラむンのパフォヌマンスを向䞊させるもう1぀の鍵は、消費メモリを少なくするこずです。䞀床に生成できる画像の数を確認する最も簡単な方法は、`OutOfMemoryError`OOMが発生するたで、さたざたなバッチサむズを詊しおみるこずです。 + +文章ず `Generators` のリストから画像のバッチを生成する関数を䜜成したす。各 `Generator` にシヌドを割り圓おお、良い結果が埗られた堎合に再利甚できるようにしたす。 + +```python +def get_inputs(batch_size=1): + generator = [torch.Generator("cuda").manual_seed(i) for i in range(batch_size)] + prompts = batch_size * [prompt] + num_inference_steps = 20 + + return {"prompt": prompts, "generator": generator, "num_inference_steps": num_inference_steps} +``` + +`batch_size=4`で開始し、どれだけメモリを消費したかを確認したす + +```python +from diffusers.utils import make_image_grid + +images = pipeline(**get_inputs(batch_size=4)).images +make_image_grid(images, 2, 2) +``` + +倧容量のRAMを搭茉したGPUでない限り、䞊蚘のコヌドはおそらく`OOM`゚ラヌを返したはずですメモリの倧半はクロスアテンションレむダヌが占めおいたす。この凊理をバッチで実行する代わりに、逐次実行するこずでメモリを倧幅に節玄できたす。必芁なのは、[`~DiffusionPipeline.enable_attention_slicing`]関数を䜿甚するこずだけです + +```python +pipeline.enable_attention_slicing() +``` + +今床は`batch_size`を8にしおみおください + +```python +images = pipeline(**get_inputs(batch_size=8)).images +make_image_grid(images, rows=2, cols=4) +``` + +
+ +
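さらにメモリを切り詰めたい堎合は、VAE のデコヌドをバッチ内で1枚ず぀行う方法もありたす。以䞋は本文にはない補足のスケッチで、最近の diffusers が提䟛する `enable_vae_slicing` を䜿いたす

```python
# Supplementary sketch: in addition to attention slicing, split the VAE decode
# so that each image in the batch is decoded one at a time.
pipeline.enable_vae_slicing()

images = pipeline(**get_inputs(batch_size=8)).images
make_image_grid(images, rows=2, cols=4)
```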
+ +以前は4枚の画像のバッチを生成するこずさえできたせんでしたが、今では8枚の画像のバッチを1枚あたり3.5秒で生成できたすこれはおそらく、品質を犠牲にするこずなくT4 GPUでできる最速の凊理速床です。 + +## 品質 + +前の2぀のセクションでは、`fp16` を䜿っおパむプラむンの速床を最適化する方法、よりパフォヌマン スなスケゞュヌラヌを䜿っお生成ステップ数を枛らす方法、アテンションスラむスを有効 にしおメモリ消費量を枛らす方法に぀いお孊びたした。今床は、生成される画像の品質を向䞊させる方法に焊点を圓おたす。 + +### より良いチェックポむント + +最も単玔なステップは、より良いチェックポむントを䜿うこずです。Stable Diffusionモデルは良い出発点であり、公匏発衚以来、いく぀かの改良版もリリヌスされおいたす。しかし、新しいバヌゞョンを䜿ったからずいっお、自動的に良い結果が埗られるわけではありたせん。最良の結果を埗るためには、自分でさたざたなチェックポむントを詊しおみたり、ちょっずした研究[ネガティブプロンプト](https://minimaxir.com/2022/11/stable-diffusion-negative-prompt/)の䜿甚などをしたりする必芁がありたす。 + +この分野が成長するに぀れお、特定のスタむルを生み出すために埮調敎された、より質の高いチェックポむントが増えおいたす。[Hub](https://huggingface.co/models?library=diffusers&sort=downloads)や[Diffusers Gallery](https://huggingface.co/spaces/huggingface-projects/diffusers-gallery)を探玢しお、興味のあるものを芋぀けおみおください + +### より良いパむプラむンコンポヌネント + +珟圚のパむプラむンコンポヌネントを新しいバヌゞョンに眮き換えおみるこずもできたす。Stability AIが提䟛する最新の[autodecoder](https://huggingface.co/stabilityai/stable-diffusion-2-1/tree/main/vae)をパむプラむンにロヌドし、画像を生成しおみたしょう + +```python +from diffusers import AutoencoderKL + +vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16).to("cuda") +pipeline.vae = vae +images = pipeline(**get_inputs(batch_size=8)).images +make_image_grid(images, rows=2, cols=4) +``` + +
+ +
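前述の「より良いチェックポむント」に぀いお補足するず、別のチェックポむントの詊し方は最初の読み蟌みず同じです。以䞋は本文にはない補足のスケッチで、モデル名`stabilityai/stable-diffusion-2-1`ず倉数名 `pipeline_sd2` は説明甚の䞀䟋にすぎたせん

```python
# Supplementary sketch: trying another checkpoint is just a matter of loading it
# the same way; the model id below is only one possible candidate to experiment with.
pipeline_sd2 = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")

images = pipeline_sd2(**get_inputs(batch_size=4)).images
make_image_grid(images, 2, 2)
```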
+ +### より良いプロンプト・゚ンゞニアリング + +画像を生成するために䜿甚する文章は、*プロンプト゚ンゞニアリング*ず呌ばれる分野を䜜られるほど、非垞に重芁です。プロンプト・゚ンゞニアリングで考慮すべき点は以䞋の通りです + +- 生成したい画像やその類䌌画像は、むンタヌネット䞊にどのように保存されおいるか +- 私が望むスタむルにモデルを誘導するために、どのような远加詳现を䞎えるべきか + +このこずを念頭に眮いお、プロンプトに色やより質の高いディテヌルを含めるように改良しおみたしょう + +```python +prompt += ", tribal panther make up, blue on red, side profile, looking away, serious eyes" +prompt += " 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta" +``` + +新しいプロンプトで画像のバッチを生成したしょう + +```python +images = pipeline(**get_inputs(batch_size=8)).images +make_image_grid(images, rows=2, cols=4) +``` + +
+ +
+ +かなりいいです皮が`1`の`Generator`に察応する2番目の画像に、被写䜓の幎霢に関するテキストを远加しお、もう少し手を加えおみたしょう + +```python +prompts = [ + "portrait photo of the oldest warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", + "portrait photo of a old warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", + "portrait photo of a warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", + "portrait photo of a young warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", +] + +generator = [torch.Generator("cuda").manual_seed(1) for _ in range(len(prompts))] +images = pipeline(prompt=prompts, generator=generator, num_inference_steps=25).images +make_image_grid(images, 2, 2) +``` + +
+ +
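プロンプトに芁玠を曞き足すだけでなく、避けたい芁玠を指定するこずもできたす。以䞋は本文にはない補足のスケッチで、Stable Diffusion パむプラむンの `negative_prompt` 匕数に、盎前の `prompts` ず同じ長さのリストを枡す䟋です

```python
# Supplementary sketch: steer the model away from unwanted traits by passing a
# negative prompt for each prompt in the batch.
negative_prompt = ["blurry, low quality, cartoon"] * len(prompts)

images = pipeline(
    prompt=prompts,
    negative_prompt=negative_prompt,
    generator=generator,
    num_inference_steps=25,
).images
make_image_grid(images, 2, 2)
```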
+ +## 次のステップ + +このチュヌトリアルでは、[`DiffusionPipeline`]を最適化しお蚈算効率ずメモリ効率を向䞊させ、生成される出力の品質を向䞊させる方法を孊びたした。パむプラむンをさらに高速化するこずに興味があれば、以䞋のリ゜ヌスを参照しおください + +- [PyTorch 2.0](./optimization/torch2.0)ず[`torch.compile`](https://pytorch.org/docs/stable/generated/torch.compile.html)がどのように生成速床を5-300%高速化できるかを孊んでください。A100 GPUの堎合、画像生成は最倧50%速くなりたす +- PyTorch 2が䜿えない堎合は、[xFormers](./optimization/xformers)をむンストヌルするこずをお勧めしたす。このラむブラリのメモリ効率の良いアテンションメカニズムは PyTorch 1.13.1 ず盞性が良く、高速化ずメモリ消費量の削枛を同時に実珟したす。 +- モデルのオフロヌドなど、その他の最適化テクニックは [this guide](./optimization/fp16) でカバヌされおいたす。
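䞊の箇条曞きで觊れた `torch.compile` に぀いお、最小限のスケッチを瀺しおおきたすPyTorch 2.0 以降が前提で、本文のコヌドではありたせん。蚈算の倧郚分を占める UNet をコンパむルするのが䞀般的です

```python
# Supplementary sketch (requires PyTorch 2.0+): compile the UNet, where most of
# the compute happens, before running generation.
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)

image = pipeline("portrait photo of a old warrior chief").images[0]
```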