global, eager model weight GPU unloading #8605
Comments
We cannot control the level of hierarchy project maintainers want to have in their projects. For the things that are within our scope of control, we try to document them explicitly. We cannot do everything on behalf of users, as requirements vary from project to project, but we can provide simple, easy-to-use APIs that cover the most common use cases. On a related note, @Wauplin has started to include a couple of modeling utilities.
I observe this pattern:

```python
x = AutoModel.from_pretrained(...)
y = AutoModel.from_pretrained(...)

assert memory_usage(x) + memory_usage(y) > gpu_memory_available()

load(x)
x()
unload(x)

load(y)
y()
unload(y)
```

reinvented over and over again by downstream users of these libraries.
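As a concrete illustration of what that pseudocode tends to become in practice, here is a minimal sketch using plain PyTorch `.to()` moves. The checkpoint ids and the `inputs_x` / `inputs_y` values are placeholders, not anything from a real project:

```python
# Hand-rolled "load"/"unload": move one model onto the GPU, run it,
# push it back to CPU, free the cache, then repeat for the next model.
import torch
from transformers import AutoModel

device = torch.device("cuda:0")

x = AutoModel.from_pretrained("org/model-a")  # placeholder checkpoint ids
y = AutoModel.from_pretrained("org/model-b")

x.to(device)
out_x = x(**inputs_x)      # inputs_x assumed to already live on `device`
x.to("cpu")
torch.cuda.empty_cache()   # release cached blocks before loading y

y.to(device)
out_y = y(**inputs_y)
y.to("cpu")
torch.cuda.empty_cache()
```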
Example implementation:

```python
x = AutoModel.from_pretrained(...)
y = AutoModel.from_pretrained(...)

with sequential_offload(offload_device=torch.device('cpu'), load_device=torch.device('cuda:0')):
    x(...)
    y(...)
```

and

```python
with sequential_offload(offload_device=torch.device('cpu'), load_device=torch.device('cuda:0'), models=[x, y]):
    x(...)
    y(...)
```

could be implemented now with little issue. More broadly:

```python
# For example, prefer weights and inference in bfloat16, then GPTQ 8-bit if supported,
# then bitsandbytes 8-bit, etc.
llm_strategy = BinPacking(
    devices=["cuda:0", "cuda:1", ...],
    levels=[
        BitsAndBytesConfig(...),
        GPTQConfig(...),
        torch.bfloat16,
    ],
)

# Some models do not perform well at 8-bit, for example, so that level shouldn't be used at all.
unet_strategy = BinPacking(
    levels=[torch.float16, torch.bfloat16],
)

# Maybe there are separate weight-loading and inference strategies.
t5_strategy = BinPacking(load_in=[torch.float8, torch.float16], compute_in=[torch.float16])

with model_management(strategy=llm_strategy):
    x(...)

with model_management(strategy=unet_strategy):
    y(...)
```

Ultimately downstream projects keep writing this, and it fuels a misconception that "Hugging Face" doesn't "run on my machine." For example: "Automatically loading/unloading models from memory [is something that Ollama supports that Hugging Face does not]." You guys are fighting pretty pervasive misconceptions too: "Huggingface isn't local."
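Returning to the `BinPacking` idea above, one piece of it can be sketched today: a helper that walks the preference-ordered `levels` and keeps the first one whose estimated footprint fits in free GPU memory. Everything here (`estimate_bytes`, `choose_level`) is a hypothetical illustration, not an existing diffusers or transformers API:

```python
# Hypothetical level selection: try each precision/quantization level in
# preference order and keep the first one that fits in free GPU memory.
import torch

def estimate_bytes(num_params: int, level) -> int:
    """Very rough per-parameter footprint for a dtype or an 8-bit quant config."""
    if isinstance(level, torch.dtype):
        return num_params * torch.tensor([], dtype=level).element_size()
    return num_params  # assume ~1 byte/param for 8-bit quantization configs

def choose_level(num_params: int, levels, device_index: int = 0):
    """Return the first level that fits into currently free memory on the device."""
    free_bytes, _total_bytes = torch.cuda.mem_get_info(device_index)
    for level in levels:
        if estimate_bytes(num_params, level) <= free_bytes:
            return level
    raise RuntimeError("no configured level fits in available GPU memory")
```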
Perhaps the contextlib approach is best.
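A minimal sketch of what that contextlib-based `sequential_offload` could look like with plain PyTorch forward pre-hooks; this is an assumption-laden illustration, not an existing diffusers or accelerate API:

```python
# Sketch: each tracked model gets a forward pre-hook that offloads every
# other tracked model, then moves itself to the load device. Inputs are
# expected to already be on the load device.
from contextlib import contextmanager
import torch

@contextmanager
def sequential_offload(models, load_device, offload_device=torch.device("cpu")):
    handles = []

    def make_hook(active):
        def hook(module, args):
            for other in models:
                if other is not active:
                    other.to(offload_device)
            active.to(load_device)
        return hook

    try:
        for model in models:
            handles.append(model.register_forward_pre_hook(make_hook(model)))
        yield
    finally:
        for handle in handles:
            handle.remove()
        for model in models:
            model.to(offload_device)

# Usage mirroring the proposal above:
# with sequential_offload([x, y], load_device=torch.device("cuda:0")):
#     x(...)   # y sits on the CPU while x runs
#     y(...)   # x is pushed back to the CPU before y runs
```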
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
What API design would you like to have changed or added to the library? Why?
Most people expect `diffusers` and `transformers` "models" to be "unloaded" so that they can "just" "run" a "big" "pipeline" using their "VRAM" so that it "fits."

In other words: author a mixin that keeps track of all weights in Hugging Face hierarchy objects loaded onto the GPU; when `forward` is called on any Hugging Face hierarchy object, it moves the weights of the other tracked objects to ordinary RAM. Essentially, this is sequential CPU offload for scopes larger than a single Hugging Face hierarchy object.
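A minimal sketch of such a mixin, assuming every participating class is also a `torch.nn.Module`; the class and method names below are hypothetical, not existing diffusers API:

```python
# Sketch: a process-wide registry of models plus a forward pre-hook that,
# before any tracked model runs, pushes the other tracked models to CPU.
import weakref
import torch

class EagerOffloadMixin:
    _tracked = weakref.WeakSet()  # weak refs so models can still be garbage collected

    def enable_global_offload(self, load_device="cuda:0", offload_device="cpu"):
        self._load_device = torch.device(load_device)
        self._offload_device = torch.device(offload_device)
        EagerOffloadMixin._tracked.add(self)
        # Relies on the concrete class also being a torch.nn.Module.
        self.register_forward_pre_hook(EagerOffloadMixin._offload_others)

    @staticmethod
    def _offload_others(module, args):
        for other in EagerOffloadMixin._tracked:
            if other is not module:
                other.to(other._offload_device)
        module.to(module._load_device)
```

A model class that opts in would inherit the mixin and call `enable_global_offload()` once after loading; from then on, calling it eagerly evicts the other tracked models from the GPU.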
What use case would this enable or better enable? Can you give us a code example?

The number of issues about GPU RAM usage scales linearly with adoption. You guys can't keep dealing with the brain damage of having your Issues polluted by this.

Separately, it would eliminate the main source of toil for people who integrate `diffusers` into other products like ComfyUI.