global, eager model weight GPU unloading #8605

Open
doctorpangloss opened this issue Jun 17, 2024 · 4 comments
Labels: stale (Issues that haven't received updates)

Comments

@doctorpangloss commented Jun 17, 2024

What API design would you like to have changed or added to the library? Why?

Most people expect diffusers and transformers "models" to be "unloaded" so that they can "just" "run" a "big" "pipeline" using their "VRAM" so that it "fits."

In other words, author a mixin that keeps track of all the weights from Hugging Face hierarchy objects loaded onto the GPU and, when forward is called on any of those objects, moves the weights of the other tracked objects back to ordinary RAM. Essentially, this is sequential CPU offload at a scope larger than a single Hugging Face hierarchy object.

What use case would this enable or better enable? Can you give us a code example?

The number of issues about GPU RAM usage scales linearly with adoption. You guys can't deal with the brain damage of having your Issues polluted by this.

Separately, it would eliminate the main source of toil for people who integrate diffusers into other products like ComfyUI.

@sayakpaul (Member)

We cannot control the level of hierarchy project maintainers want to have in their projects.

For the things that are within our scope of control, we try to document them explicitly (such as enable_model_cpu_offload or enable_sequential_cpu_offload). When these are combined with other techniques such as prompt pre-computation, 8-bit inference of the text encoders, etc., they reduce VRAM consumption quite a lot.
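
For reference, a minimal sketch of how those two documented options are enabled on a pipeline (the pipeline class and checkpoint id are just example choices, not prescribed by this thread):

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)

# Whole-component offload: each component is moved to the GPU only while it runs.
pipe.enable_model_cpu_offload()

# Or the slower, more aggressive variant that offloads at the submodule level:
# pipe.enable_sequential_cpu_offload()

image = pipe("an astronaut riding a horse").images[0]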

We cannot do everything on behalf of the users as requirements vary from project to project, but we can provide simple and easy-to-use APIs for the users so that they cover the most common use cases.

On a related note, @Wauplin has started to include a couple of modeling utilities within huggingface_hub (such as a model sharding utility). So I wonder if this is something for him to consider. Or maybe it lies more within the scope of accelerate (maybe something already exists that I am unable to recall). So, ccing @muellerzr @SunMarc.

@doctorpangloss (Author)

I observe:

x = AutoModel.from_pretrained(...)
y = AutoModel.from_pretrained(...)

assert memory_usage(x) + memory_usage(y) > gpu_memory_available()
load(x)
x()
unload(x)
load(y)
y()
unload(y)

reinvented over and over again by downstream users of diffusers and PyTorch.

load and unload are somewhat hacky ideas. They only make sense in a single-GPU, personal-computer context that runs one task at a time.
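
Concretely, the pattern that keeps getting reinvented looks roughly like this (the checkpoints are small stand-ins; real projects shuffle multi-gigabyte UNets and text encoders the same way):

import torch
from transformers import AutoModel, AutoTokenizer

device = torch.device("cuda:0")

x = AutoModel.from_pretrained("bert-base-uncased")
y = AutoModel.from_pretrained("distilbert-base-uncased")
tok_x = AutoTokenizer.from_pretrained("bert-base-uncased")
tok_y = AutoTokenizer.from_pretrained("distilbert-base-uncased")

x.to(device)                                                 # load(x)
out_x = x(**tok_x("hello", return_tensors="pt").to(device))
x.to("cpu")                                                  # unload(x) to free VRAM
torch.cuda.empty_cache()

y.to(device)                                                 # load(y)
out_y = y(**tok_y("hello", return_tensors="pt").to(device))
y.to("cpu")                                                  # unload(y)
torch.cuda.empty_cache()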

@doctorpangloss (Author) commented Jun 18, 2024

Example implementation:

x = AutoModel.from_pretrained(...)
y = AutoModel.from_pretrained(...)

with sequential_offload(offload_device=torch.device('cpu'), load_device=torch.device('cuda:0')):
  x(...)
  y(...)

Here, forward in the mixin would check a contextvar to determine which models are loaded and unloaded. Alternatively:

with sequential_offload(offload_device=torch.device('cpu'), load_device=torch.device('cuda:0'), models=[x, y]):
  x(...)
  y(...)

could be implemented now with little issue.
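
As a sanity check that the models=[...] variant is implementable today, here is a rough sketch built on plain PyTorch forward pre-hooks (sequential_offload here is hypothetical, not an existing diffusers or accelerate API):

from contextlib import contextmanager

@contextmanager
def sequential_offload(offload_device, load_device, models):
    # Register a pre-hook on every tracked model; whichever model is about to
    # run evicts the others to offload_device and pulls itself onto load_device.
    handles = []

    def make_hook(active):
        def pre_hook(module, args):
            for m in models:
                if m is not active:
                    m.to(offload_device)
            active.to(load_device)
        return pre_hook

    try:
        for m in models:
            handles.append(m.register_forward_pre_hook(make_hook(m)))
        yield
    finally:
        for h in handles:
            h.remove()
        for m in models:
            m.to(offload_device)  # leave the accelerator empty on exit

The cost is a host-to-device copy every time execution switches between tracked models, which is the same trade-off enable_sequential_cpu_offload already makes at the submodule level.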

More broadly:

# for example, prefer weights and inferencing in bfloat16, then gptq 8 bit if supported, then bitsandbytes 8 bit, etc.
llm_strategy = BinPacking(
  devices=["cuda:0", "cuda:1", ...],
  levels=[
    BitsAndBytesConfig(...),
    GPTQConfig(...),
    torch.bfloat16
  ]
)

# some models do not perform well at 8 bit for example so shouldn't be used there at all
unet_strategy = BinPacking(
  levels=[torch.float16, torch.bfloat16]
)

# maybe there is a separate inference and weights strategy
t5_strategy = BinPacking(load_in=[torch.float8, torch.float16], compute_in=[torch.float16])

with model_management(strategy=llm_strategy):
  x(...)
  with model_management(strategy=unet_strategy):
    y(...)

Ultimately downstream projects keep writing this, and it fuels a misconception that "Hugging Face" doesn't "run on my machine."

For example: "Automatically loading/unloading models from memory [is something that Ollama supports that Hugging Face does not]"

You guys are fighting pretty pervasive misconceptions too: "Huggingface isn't local"

> We cannot control the level of hierarchy project maintainers want to have in their projects.

Perhaps the contextlib approach is best.

@github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions bot added the stale (Issues that haven't received updates) label on Sep 14, 2024