[docs] pipeline docs for latte #8844

a-r-r-o-w · 2024-07-11T17:09:14Z

What does this PR do?

It turns out we missed adding the pipeline docs in the Latte PR and only added docs for the Latte Transformer 3D model. @maxin-cn Let me know if you want anything else to be added.

Maybe a line about #8842 (ff-chunked inference) could be added once that's merged.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@yiyixuxu

HuggingFaceDocBuilderDev · 2024-07-11T17:15:18Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

maxin-cn · 2024-07-13T03:36:41Z

Thank you for continuing to refine the inference process of Latte. I don't have anything extra to add at the moment.

sayakpaul · 2024-07-15T10:00:31Z

docs/source/en/api/pipelines/latte.md

+
+*We propose a novel Latent Diffusion Transformer, namely Latte, for video generation. Latte first extracts spatio-temporal tokens from input videos and then adopts a series of Transformer blocks to model video distribution in the latent space. In order to model a substantial number of tokens extracted from videos, four efficient variants are introduced from the perspective of decomposing the spatial and temporal dimensions of input videos. To improve the quality of generated videos, we determine the best practices of Latte through rigorous experimental analysis, including video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Our comprehensive evaluation demonstrates that Latte achieves state-of-the-art performance across four standard video generation datasets, i.e., FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In addition, we extend Latte to text-to-video generation (T2V) task, where Latte achieves comparable results compared to recent T2V models. We strongly believe that Latte provides valuable insights for future research on incorporating Transformers into diffusion models for video generation.*
+
+**Highlights**: Latte is a latent diffusion transformer proposed as a backbone for modeling different modalities (trained for text-to-video generation here). It achieves state-of-the-art performance across four standard video benchmarks.
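The abstract's point about decomposing the spatial and temporal dimensions to handle a large number of tokens can be illustrated with a rough count of attention pairs. This is a minimal sketch, not the Latte implementation. It assumes one common form of the decomposition, in which spatial attention runs within each frame and temporal attention runs across frames at each spatial location. The function name `attention_pairs` and the sizes (16 frames, 256 tokens per frame) are hypothetical and chosen only for illustration.

```python
def attention_pairs(num_frames: int, tokens_per_frame: int) -> dict:
    """Count query-key pairs for joint vs. factorized spatio-temporal attention.

    Hypothetical helper for illustration; the sizes passed in are assumptions,
    not values taken from the Latte paper or the diffusers pipeline.
    """
    total_tokens = num_frames * tokens_per_frame
    # Joint attention: every token attends to every other token in the clip.
    joint = total_tokens ** 2
    # Spatial attention: tokens attend only within their own frame.
    spatial = num_frames * tokens_per_frame ** 2
    # Temporal attention: each spatial location attends across frames.
    temporal = tokens_per_frame * num_frames ** 2
    return {
        "joint": joint,
        "spatial": spatial,
        "temporal": temporal,
        "factorized": spatial + temporal,
    }


counts = attention_pairs(num_frames=16, tokens_per_frame=256)
print(counts)
```

With these assumed sizes, the factorized count is roughly 15 times smaller than the joint count, which is the kind of saving that motivates treating the spatial and temporal dimensions separately.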


sayakpaul · Jul 15, 2024

Can we hyperlink the benchmarks section?

maxin-cn · Jul 15, 2024

Here is the hyperlink for the four benchmarks: https://github.com/Vchitect/Latte/blob/main/docs/datasets_evaluation.md#download-datasets

a-r-r-o-w · Jul 15, 2024 (edited)

Do you mean a hyperlink to the benchmarks section of the arXiv paper, or individual links to each of the four benchmarks?

sayakpaul

Some minor comments.

a-r-r-o-w · 2024-07-17T05:43:29Z

@sayakpaul Requesting another review before merge for latest changes

* add pipeline docs for latte
* add inference time to latte docs
* apply review suggestions

add pipeline docs for latte

38e1b99

a-r-r-o-w added 2 commits July 12, 2024 23:19

Merge branch 'main' into latte/pipeline-documentation

459c8a6

add inference time to latte docs

71acc45

a-r-r-o-w requested a review from sayakpaul July 15, 2024 09:42

sayakpaul reviewed Jul 15, 2024

sayakpaul approved these changes Jul 15, 2024

apply review suggestions

c826161

sayakpaul approved these changes Jul 18, 2024

sayakpaul merged commit 12625c1 into main Jul 18, 2024

sayakpaul deleted the latte/pipeline-documentation branch July 18, 2024 03:57

Disty0 pushed a commit to Disty0/diffusers that referenced this pull request Jul 18, 2024

[docs] pipeline docs for latte (huggingface#8844)

1cb0e05

* add pipeline docs for latte
* add inference time to latte docs
* apply review suggestions

sayakpaul pushed a commit that referenced this pull request Dec 23, 2024

[docs] pipeline docs for latte (#8844)

0f2c512

* add pipeline docs for latte
* add inference time to latte docs
* apply review suggestions

5 participants

