From 38e1b9933ca4064f7c21c0edc44e9781e0a65e6e Mon Sep 17 00:00:00 2001
From: Aryan
Date: Thu, 11 Jul 2024 19:06:17 +0200
Subject: [PATCH 1/3] add pipeline docs for latte

---
 docs/source/en/_toctree.yml           |  2 +
 docs/source/en/api/pipelines/latte.md | 68 +++++++++++++++++++++++++++
 2 files changed, 70 insertions(+)
 create mode 100644 docs/source/en/api/pipelines/latte.md

diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
index 1a1a23e2938a..09bce24aca3e 100644
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -328,6 +328,8 @@
       title: Latent Consistency Models
     - local: api/pipelines/latent_diffusion
       title: Latent Diffusion
+    - local: api/pipelines/latte
+      title: Latte
     - local: api/pipelines/ledits_pp
       title: LEDITS++
     - local: api/pipelines/lumina
diff --git a/docs/source/en/api/pipelines/latte.md b/docs/source/en/api/pipelines/latte.md
new file mode 100644
index 000000000000..66754dcfddc6
--- /dev/null
+++ b/docs/source/en/api/pipelines/latte.md
@@ -0,0 +1,68 @@
+# Latte
+
+![latte text-to-video](https://github.com/Vchitect/Latte/blob/52bc0029899babbd6e9250384c83d8ed2670ff7a/visuals/latte.gif?raw=true)
+
+[Latte: Latent Diffusion Transformer for Video Generation](https://arxiv.org/abs/2401.03048) from Monash University, Shanghai AI Lab, Nanjing University, and Nanyang Technological University.
+
+The abstract from the paper is:
+
+*We propose a novel Latent Diffusion Transformer, namely Latte, for video generation. Latte first extracts spatio-temporal tokens from input videos and then adopts a series of Transformer blocks to model video distribution in the latent space. In order to model a substantial number of tokens extracted from videos, four efficient variants are introduced from the perspective of decomposing the spatial and temporal dimensions of input videos. To improve the quality of generated videos, we determine the best practices of Latte through rigorous experimental analysis, including video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Our comprehensive evaluation demonstrates that Latte achieves state-of-the-art performance across four standard video generation datasets, i.e., FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In addition, we extend Latte to text-to-video generation (T2V) task, where Latte achieves comparable results compared to recent T2V models. We strongly believe that Latte provides valuable insights for future research on incorporating Transformers into diffusion models for video generation.*
+
+**Highlights**: Latte is a latent diffusion transformer proposed as a backbone for modeling different modalities (trained for text-to-video generation here). It achieves state-of-the-art performance across four standard video benchmarks.
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
+
+### Inference
+
+Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency.
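+
+Note that `torch.compile` is lazy: the first call of a compiled pipeline triggers compilation and is much slower than the calls that follow, so the speed-up only shows up from the second generation onwards. The snippet below is an illustration of this point rather than part of the original example, and it assumes the `pipeline` object built in the steps that follow:
+
+```python
+# First call after torch.compile(): pays the one-time compilation cost (slow).
+_ = pipeline(prompt="A warm-up prompt")
+# Later calls reuse the compiled graph (fast).
+video = pipeline(prompt="A dog wearing sunglasses floating in space").frames[0]
+```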
+
+First, load the pipeline:
+
+```python
+import torch
+from diffusers import LattePipeline
+
+pipeline = LattePipeline.from_pretrained(
+    "maxin-cn/Latte-1", torch_dtype=torch.float16
+).to("cuda")
+```
+
+Then change the memory layout of the pipeline's `transformer` and `vae` components to `torch.channels_last`:
+
+```python
+pipeline.transformer.to(memory_format=torch.channels_last)
+pipeline.vae.to(memory_format=torch.channels_last)
+```
+
+Finally, compile the components and run inference:
+
+```python
+pipeline.transformer = torch.compile(pipeline.transformer, mode="max-autotune", fullgraph=True)
+pipeline.vae.decode = torch.compile(pipeline.vae.decode, mode="max-autotune", fullgraph=True)
+
+video = pipeline(prompt="A dog wearing sunglasses floating in space, surreal, nebulae in background").frames[0]
+```
+
+## LattePipeline
+
+[[autodoc]] LattePipeline
+  - all
+  - __call__

From 71acc453b0809b0be93354e7eb712f50f645d13c Mon Sep 17 00:00:00 2001
From: Aryan
Date: Sat, 13 Jul 2024 01:42:22 +0200
Subject: [PATCH 2/3] add inference time to latte docs

---
 docs/source/en/api/pipelines/latte.md | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/docs/source/en/api/pipelines/latte.md b/docs/source/en/api/pipelines/latte.md
index 66754dcfddc6..40edfa15a083 100644
--- a/docs/source/en/api/pipelines/latte.md
+++ b/docs/source/en/api/pipelines/latte.md
@@ -55,12 +55,19 @@ pipeline.vae.to(memory_format=torch.channels_last)
 Finally, compile the components and run inference:
 
 ```python
-pipeline.transformer = torch.compile(pipeline.transformer, mode="max-autotune", fullgraph=True)
-pipeline.vae.decode = torch.compile(pipeline.vae.decode, mode="max-autotune", fullgraph=True)
+pipeline.transformer = torch.compile(pipeline.transformer)
+pipeline.vae.decode = torch.compile(pipeline.vae.decode)
 
 video = pipeline(prompt="A dog wearing sunglasses floating in space, surreal, nebulae in background").frames[0]
 ```
 
+The [benchmark](https://gist.github.com/a-r-r-o-w/4e1694ca46374793c0361d740a99ff19) results on an 80GB A100 machine are:
+
+```
+Without torch.compile(): Average inference time: 16.246 seconds.
+With torch.compile(): Average inference time: 14.573 seconds.
+```
+
 ## LattePipeline
 
 [[autodoc]] LattePipeline

From c8261615b754616f68df665b13edb4b01d706632 Mon Sep 17 00:00:00 2001
From: a-r-r-o-w
Date: Tue, 16 Jul 2024 03:33:03 +0530
Subject: [PATCH 3/3] apply review suggestions

---
 docs/source/en/api/pipelines/latte.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/en/api/pipelines/latte.md b/docs/source/en/api/pipelines/latte.md
index 40edfa15a083..2572e11e152d 100644
--- a/docs/source/en/api/pipelines/latte.md
+++ b/docs/source/en/api/pipelines/latte.md
@@ -1,4 +1,4 @@
-
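
For reference, the inference-time numbers added in [PATCH 2/3] come from a separate benchmark script (see the gist linked in that patch). The sketch below only illustrates how such a comparison could be run and is not the gist's actual code; the warm-up policy, number of timed runs, and prompt are arbitrary choices:

```python
import time

import torch
from diffusers import LattePipeline

prompt = "A dog wearing sunglasses floating in space, surreal, nebulae in background"
pipeline = LattePipeline.from_pretrained("maxin-cn/Latte-1", torch_dtype=torch.float16).to("cuda")


def average_inference_time(pipe, num_runs=3):
    # Warm-up call so one-time costs (CUDA init, torch.compile compilation) are not timed.
    pipe(prompt=prompt)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(num_runs):
        pipe(prompt=prompt)
    torch.cuda.synchronize()
    return (time.time() - start) / num_runs


# Baseline timing on the eager (uncompiled) pipeline.
baseline = average_inference_time(pipeline)

# Compile the heavy components, then time again.
pipeline.transformer = torch.compile(pipeline.transformer)
pipeline.vae.decode = torch.compile(pipeline.vae.decode)
compiled = average_inference_time(pipeline)

print(f"Without torch.compile(): Average inference time: {baseline:.3f} seconds.")
print(f"With torch.compile(): Average inference time: {compiled:.3f} seconds.")
```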