[core] support chunked feed forward in latte #8842
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Failing tests seem unrelated.
thanks! did you see a significant difference with this for latte?
Unfortunately, the impact is minimal. I expected it to be larger, but only realised this after testing, which I got to a while after opening the PR since GPUs were occupied at the time.
Given those results, I don't think it makes sense to support this: the increase in inference time is not justified by the memory savings.
sayakpaul left a comment
Thanks! Left a single comment.
self.gradient_checkpointing = value

# Copied from diffusers.models.unets.unet_3d_condition.UNet3DConditionModel.enable_forward_chunking
def enable_forward_chunking(self, chunk_size: Optional[int] = None, dim: int = 0) -> None:
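For context, the copied method essentially walks the module tree and hands a chunk size to every block that supports chunked feed forward. A simplified sketch of that pattern (not the verbatim diffusers implementation):

```python
from typing import Optional

import torch.nn as nn


def enable_forward_chunking(model: nn.Module, chunk_size: Optional[int] = None, dim: int = 0) -> None:
    # Recursively pass the chunk size to every submodule that implements
    # `set_chunk_feed_forward` (e.g. BasicTransformerBlock); such a block then runs
    # its feed-forward network chunk by chunk along `dim` instead of on the full input.
    chunk_size = chunk_size or 1  # a chunk size of 1 chunks most aggressively

    def _recurse(module: nn.Module) -> None:
        if hasattr(module, "set_chunk_feed_forward"):
            module.set_chunk_feed_forward(chunk_size=chunk_size, dim=dim)
        for child in module.children():
            _recurse(child)

    _recurse(model)
```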
I assume Latte doesn't use a novel transformer block like SD3's JointTransformerBlock?
If so, we need to do it like this:
diffusers/src/diffusers/models/attention.py (line 182 in 973a62d):
if self._chunk_size is not None:
Latte uses the BasicTransformerBlock, which already has the chunked feed-forward implementation here:
diffusers/src/diffusers/models/attention.py (line 515 in bbd2f9d):
if self._chunk_size is not None:
Have I missed anything else that might be causing the very poor memory improvements?
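For reference, the chunked feed forward in BasicTransformerBlock boils down to roughly the following pattern (a simplified sketch, not the exact code at the line referenced above):

```python
import torch
import torch.nn as nn


def chunked_feed_forward(ff: nn.Module, hidden_states: torch.Tensor, chunk_dim: int, chunk_size: int) -> torch.Tensor:
    # Run the feed-forward network on `chunk_size`-sized slices along `chunk_dim` and
    # concatenate the results, so only one slice's expanded intermediate activations
    # are alive at any point instead of the full sequence's.
    if hidden_states.shape[chunk_dim] % chunk_size != 0:
        raise ValueError(
            f"dim {chunk_dim} of size {hidden_states.shape[chunk_dim]} must be divisible by chunk_size {chunk_size}"
        )
    num_chunks = hidden_states.shape[chunk_dim] // chunk_size
    return torch.cat(
        [ff(chunk) for chunk in hidden_states.chunk(num_chunks, dim=chunk_dim)],
        dim=chunk_dim,
    )
```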
No, then we are good.
As in, we close this PR without merging it, right? The memory improvements are negligible for far too large an increase in inference time.
We might benefit from adding VAE slicing and tiling support here, by the way. Decode memory goes up to 19 GB.
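A minimal sketch of what that could look like, assuming the pipeline's VAE is an AutoencoderKL (which already exposes enable_slicing / enable_tiling); the checkpoint name and prompt are only illustrative:

```python
import torch

from diffusers import LattePipeline

# Assumes the pipeline's VAE is an AutoencoderKL; checkpoint and prompt are illustrative.
pipe = LattePipeline.from_pretrained("maxin-cn/Latte-1", torch_dtype=torch.float16).to("cuda")

pipe.vae.enable_slicing()  # decode the batch one sample at a time
pipe.vae.enable_tiling()   # decode each sample in overlapping spatial tiles

output = pipe("a cat playing the piano", num_inference_steps=25)
```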
Might be worth checking that, yes.
This forward-chunking phenomenon (smaller memory savings) is becoming evident for models that are purely transformer based, as opposed to hybrid architectures such as the SDXL UNet. For I2VGenXL or SVD we see savings; for pure transformer models like Latte and SD3, we don't.
For the SD3 800M variant, I did see nice improvements in memory, but not so much for the 2B one. So I think there is some specific FLOP/parameter-count regime where chunking tends to shine.
We should investigate this phenomenon further because feed-forward chunking could be crucial for these models. WDYT?
Oh interesting. I haven't really explored the transformer-based image/video models much and can't comment on why we see this behaviour, but we can definitely investigate. I will try to put together a script over the weekend unless we have something ready already.
Quick question: IIUC, we do chunking only on the FeedForward modules and never on the individual linear layers (such as the QKV projection layers of attention). Isn't the benefit of chunking the FFN lost if we never also apply it to all the linear layers around it? Apologies if I'm being thick-headed; it's been a while since I've looked at the internal modeling code and I can't wrap my head around this. We have attention followed by the FFN. In attention we don't chunk, and we materialize the QK product, which should be a main contributor to the overall memory required. In the FFN we chunk and save some memory, but as I understand it, that saving is largely "invisible" in the overall measurement, where we care about peak usage, because of the unchunked attention projection layers and the QK product, right?
We cannot chunk all linear layers just like that. Read more here:
https://huggingface.co/blog/reformer
Regarding QK materialization, you are likely overlooking the fact that attention is computed with SDPA, which already optimizes memory. So I don't think that is as much of an issue.
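To illustrate the SDPA point with a toy comparison (not diffusers code; the tensor sizes are only illustrative): a naive implementation materializes the full query-key score matrix, while torch.nn.functional.scaled_dot_product_attention can dispatch to flash / memory-efficient kernels that never hold it in memory all at once.

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) -- sizes chosen only to make the point
q = k = v = torch.randn(1, 16, 4096, 64, device="cuda", dtype=torch.float16)

# Naive attention materializes a (seq_len x seq_len) score matrix per head:
# 1 * 16 * 4096 * 4096 fp16 values, roughly 0.5 GB, before the softmax and value matmul.
scores = (q @ k.transpose(-1, -2)) / (q.shape[-1] ** 0.5)
naive_out = scores.softmax(dim=-1) @ v

# SDPA computes the same attention but can pick a flash / memory-efficient kernel
# that processes the scores in tiles and never materializes the full matrix.
sdpa_out = F.scaled_dot_product_attention(q, k, v)
```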
let's not add this then if it does not help?
What does this PR do?
Adds chunked feed-forward support to the Latte video pipeline.
Code
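The original snippet from the PR description is not reproduced here; below is a hypothetical sketch of how the feature this PR adds would be used (the checkpoint, prompt, and chunking parameters are placeholders, not values from the PR):

```python
import torch

from diffusers import LattePipeline

# Hypothetical usage of the enable_forward_chunking method this PR adds to the Latte
# transformer; checkpoint, chunk_size, and dim are placeholders.
pipe = LattePipeline.from_pretrained("maxin-cn/Latte-1", torch_dtype=torch.float16).to("cuda")
pipe.transformer.enable_forward_chunking(chunk_size=1, dim=0)

torch.cuda.reset_peak_memory_stats()
pipe("a cat playing the piano", num_inference_steps=25)
print(f"peak memory with chunking: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
```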
Who can review?
@maxin-cn @sayakpaul @yiyixuxu