[group offloading] avoid unnecessary moving out to speed up inference #12910
base: main
Conversation
Refactor offloading logic to simplify memory management.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Pinging to remove stale.
Hi @gameofdimension, thanks for putting this together. I believe this change could lead to a big spike in CPU RAM usage, right? Would you mind benchmarking the change to get an idea of throughput (iterations/second), GPU VRAM, and CPU RAM usage?
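For reference, those numbers could be gathered with something along these lines (an illustrative helper, not from this PR; the pipeline call and step count are assumptions):

```python
# Illustrative benchmarking helper (not part of this PR): measures throughput,
# peak GPU VRAM via torch.cuda stats, and process CPU RSS via psutil.
import time

import psutil
import torch


def benchmark(pipe, prompt, num_steps=50):
    torch.cuda.reset_peak_memory_stats()
    process = psutil.Process()

    start = time.perf_counter()
    pipe(prompt, num_inference_steps=num_steps)
    elapsed = time.perf_counter() - start

    print(f"throughput: {num_steps / elapsed:.2f} it/s")
    print(f"peak GPU VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
    print(f"CPU RSS: {process.memory_info().rss / 1024**3:.2f} GiB")
```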
@gameofdimension Just curious, why not just enable similar behaviour through CUDA streams? The expectation is already set there that there will be a trade-off between CPU memory and speed. Is there some specific case where you don't want to use streams?
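For context, the existing stream-based path is enabled roughly like this (a sketch following the documented group offloading API; the model choice is taken from this PR, and parameter names may differ across diffusers versions):

```python
# Sketch of the existing stream-based group offloading path.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16)

# Prefetches weights on a separate CUDA stream and already trades
# extra CPU RAM for speed.
pipe.transformer.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
    use_stream=True,
)
```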
@DN6 IMHO, based on the observed latency differences, I proposed this adjustment for consideration. Wouldn't that approach also disadvantage XPU users?
Explicitly moving weights back to the CPU after computation is unnecessary; we can avoid it just like in the `use_stream=True` case. Since device-to-host copying is expensive, this change significantly improves inference speed when `use_stream=False`.
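Conceptually the change amounts to the following (a toy illustration only, not the actual diffusers hook code):

```python
# Toy illustration of the idea (not the actual diffusers hook implementation):
# keep parameters on the onload device after the forward pass instead of
# copying them back to the CPU, mirroring what the use_stream=True path does.
import torch


class GroupOffloadSketch:
    def __init__(self, module: torch.nn.Module, onload_device, offload_device):
        self.module = module
        self.onload_device = onload_device
        self.offload_device = offload_device

    def pre_forward(self):
        # The host-to-device copy is still required before compute.
        self.module.to(self.onload_device)

    def post_forward(self, keep_on_device: bool = True):
        if not keep_on_device:
            # This device-to-host copy is the expensive step being avoided.
            self.module.to(self.offload_device)
```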
Improvement

device: A100 40G
model: Qwen/Qwen-Image

Test code
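The original test script isn't reproduced here; an illustrative setup matching the description above (Qwen/Qwen-Image, group offloading with `use_stream=False`) might look like the following. The prompt, step count, dtype, and component handling are assumptions.

```python
import time

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16)

# Offload only the transformer in groups; keep the smaller components on GPU.
pipe.transformer.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="block_level",
    num_blocks_per_group=1,
    use_stream=False,  # the code path this PR speeds up
)
pipe.text_encoder.to("cuda")
pipe.vae.to("cuda")

start = time.perf_counter()
image = pipe("a photo of a cat", num_inference_steps=50).images[0]
print(f"total inference time: {time.perf_counter() - start:.1f}s")
```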
What does this PR do?
Fixes # (issue)
Before submitting: see the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.