[group offloading] avoid unnecessary moving out to speed up inference #12910
base: main
Conversation
Refactor offloading logic to simplify memory management.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Pinging to remove stale.
Hi @gameofdimension, thanks for putting this together. I believe this change could lead to a big spike in CPU RAM usage, right? Would you mind benchmarking the change to get an idea of throughput (iterations/second), GPU VRAM, and CPU RAM usage?
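For reference, those numbers could be gathered with something along these lines (an illustrative helper, not from this PR; the pipeline call and step count are assumptions):

```python
# Illustrative benchmarking helper (not part of this PR): measures throughput,
# peak GPU VRAM via torch.cuda stats, and process CPU RSS via psutil.
import time

import psutil
import torch


def benchmark(pipe, prompt, num_steps=50):
    torch.cuda.reset_peak_memory_stats()
    process = psutil.Process()

    start = time.perf_counter()
    pipe(prompt, num_inference_steps=num_steps)
    elapsed = time.perf_counter() - start

    print(f"throughput: {num_steps / elapsed:.2f} it/s")
    print(f"peak GPU VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
    print(f"CPU RSS: {process.memory_info().rss / 1024**3:.2f} GiB")
```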
@gameofdimension Just curious, why not just enable similar behaviour through CUDA streams? The expectation is already set there that there will be a trade-off between CPU memory and speed. Is there some specific case where you don't want to use streams?
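For context, the existing stream-based path is enabled roughly like this (a sketch following the documented group offloading API; the model choice is taken from this PR, and parameter names may differ across diffusers versions):

```python
# Sketch of the existing stream-based group offloading path.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16)

# Prefetches weights on a separate CUDA stream and already trades
# extra CPU RAM for speed.
pipe.transformer.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
    use_stream=True,
)
```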
@DN6 IMHO, based on the observed latency differences, I proposed this adjustment for consideration. Wouldn't that approach also disadvantage XPU users?
Explicitly moving weights back to the CPU after computation is unnecessary; we can avoid it just like in the `use_stream=True` case. Since device-to-host copying is expensive, this change significantly improves inference speed when `use_stream=False`.
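Conceptually the change amounts to the following (a toy illustration only, not the actual diffusers hook code):

```python
# Toy illustration of the idea (not the actual diffusers hook implementation):
# keep parameters on the onload device after the forward pass instead of
# copying them back to the CPU, mirroring what the use_stream=True path does.
import torch


class GroupOffloadSketch:
    def __init__(self, module: torch.nn.Module, onload_device, offload_device):
        self.module = module
        self.onload_device = onload_device
        self.offload_device = offload_device

    def pre_forward(self):
        # The host-to-device copy is still required before compute.
        self.module.to(self.onload_device)

    def post_forward(self, keep_on_device: bool = True):
        if not keep_on_device:
            # This device-to-host copy is the expensive step being avoided.
            self.module.to(self.offload_device)
```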
Improvement

device: A100 40G
model: Qwen/Qwen-Image

Test code
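The original test script isn't reproduced here; an illustrative setup matching the description above (Qwen/Qwen-Image, group offloading with `use_stream=False`) might look like the following. The prompt, step count, dtype, and component handling are assumptions.

```python
import time

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16)

# Offload only the transformer in groups; keep the smaller components on GPU.
pipe.transformer.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="block_level",
    num_blocks_per_group=1,
    use_stream=False,  # the code path this PR speeds up
)
pipe.text_encoder.to("cuda")
pipe.vae.to("cuda")

start = time.perf_counter()
image = pipe("a photo of a cat", num_inference_steps=50).images[0]
print(f"total inference time: {time.perf_counter() - start:.1f}s")
```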
What does this PR do?
Fixes # (issue)
Before submitting: see the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.