Use Accelerate in from_pretrained for big model inference #17341

Conversation
```python
setattr(submodule, param_name, new_val)

for param_name, param in state_dict.items():
    # First part of the test is always true as loaded_state_dict_keys always contains state_dict keys.
    if param_name not in loaded_state_dict_keys or param_name not in expected_keys:
```
First part of the test is left the same as before but, as said in the comment, it shouldn't be necessary:
- `loaded_state_dict_keys = state_dict.keys()` when the checkpoint is one file
- `loaded_state_dict_keys` contains `state_dict.keys()` when the checkpoint is sharded
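A tiny sketch of that invariant (hypothetical keys and shapes, not the actual loading code):

```python
import torch

# Hypothetical sharded checkpoint: loaded_state_dict_keys covers every shard,
# while state_dict only holds the shard currently being loaded.
loaded_state_dict_keys = ["embed.weight", "layer.0.weight", "layer.1.weight"]
state_dict = {"layer.1.weight": torch.zeros(2, 2)}  # second shard only

# So "param_name not in loaded_state_dict_keys" can never be True for a
# param_name coming out of state_dict.items().
assert all(k in loaded_state_dict_keys for k in state_dict)
```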
```python
raise ValueError(f"{param_name} doesn't have any device set.")
param_device = device_map[module_name]

set_module_tensor_to_device(model, param_name, param_device, value=param)
```
This single line does the same thing as before using Accelerate. What's above is just:
- using the right dtype
- finding the right device
What's below deals with disk offload or temp offload of the CPU state dict.
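For illustration, a minimal self-contained use of that Accelerate helper (toy module and hypothetical device, not the PR's code):

```python
import torch
from torch import nn
from accelerate.utils import set_module_tensor_to_device

model = nn.Sequential(nn.Linear(4, 4))
new_weight = torch.randn(4, 4)

# Place one checkpoint tensor on its assigned device; if the model was
# instantiated on "meta", the tensor is materialized there directly.
set_module_tensor_to_device(model, "0.weight", "cpu", value=new_weight)
```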
```
@@ -870,6 +947,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMixin):
    base_model_prefix = ""
    main_input_name = "input_ids"
    _auto_class = None
    _no_split_modules = None
```
New attribute to fill in on all models (for now GPT-J and T5 are given as examples) that specifies the blocks that should not be split across devices.
sorry just to understand - why should certain blocks not be split across devices?
If you split a GPTBlock across devices, the residual connection (initial input of the block) added at the end will create a device mismatch.
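Concretely, the attribute looks like this (a sketch along the lines of what the PR adds for GPT-J):

```python
from transformers.modeling_utils import PreTrainedModel

class GPTJPreTrainedModel(PreTrainedModel):
    ...
    # Keep each GPTJBlock on a single device: its residual connection adds the
    # block's initial input to its output, which fails across devices.
    _no_split_modules = ["GPTJBlock"]
```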
```python
elif low_cpu_mem_usage:
    init_contexts.append(init_empty_weights())

with ContextManagers(init_contexts):
```
Same as before + the no_init_weights context manager, but cleaner (IMO)
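A runnable sketch of the pattern, with contextlib.ExitStack standing in for transformers' ContextManagers helper:

```python
from contextlib import ExitStack

from accelerate import init_empty_weights
from torch import nn

low_cpu_mem_usage = True
init_contexts = []
if low_cpu_mem_usage:
    # Parameters are created on the "meta" device: no RAM is allocated and no
    # weight init runs; real values are filled in later from the checkpoint.
    init_contexts.append(init_empty_weights())

with ExitStack() as stack:
    for ctx in init_contexts:
        stack.enter_context(ctx)
    model = nn.Linear(1024, 1024)

print(model.weight.device)  # meta
```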
src/transformers/modeling_utils.py (Outdated)
```python
if device_map == "auto":
    no_split_modules = [] if model._no_split_modules is None else model._no_split_modules
    device_map = infer_auto_device_map(model, no_split_module_classes=no_split_modules, dtype=torch_dtype)
```
This is where the auto device map is built.
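As a standalone illustration of what infer_auto_device_map returns (toy model and a made-up memory budget):

```python
import torch
from accelerate import infer_auto_device_map
from torch import nn

model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(8)])

# With a deliberately tight budget, the first layers land on GPU 0, the next
# ones on CPU, and whatever is left is assigned to "disk".
device_map = infer_auto_device_map(
    model,
    max_memory={0: "6MB", "cpu": "6MB"},
    no_split_module_classes=["Linear"],
    dtype=torch.float16,
)
print(device_map)  # e.g. {"0": 0, "1": 0, "2": "cpu", "3": "cpu", "4": "disk", ...}
```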
```python
offload_index = {} if device_map is not None and "disk" in device_map.values() else None
if offload_state_dict:
    state_dict_folder = tempfile.mkdtemp()
    state_dict_index = {}
else:
    state_dict_folder = None
    state_dict_index = None
```
For offloaded weights (either on disk or temp offload of the CPU weights), this index contains the map param_name -> metadata (shape and dtype).
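A hypothetical entry, to make the shape of that index concrete:

```python
# Each offloaded parameter maps to the metadata needed to read it back from
# the file written in the offload folder (names and sizes made up).
offload_index = {
    "decoder.block.23.layer.1.DenseReluDense.wi.weight": {
        "dtype": "float16",
        "shape": [10240, 4096],
    },
}
```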
```python
else:
    error_msgs += _load_state_dict_into_model(model_to_load, state_dict, start_prefix)

# force memory release
del state_dict
gc.collect()

save_offload_index(offload_index, offload_folder)
```
Save the index for disk offload if necessary.
```python
if offload_state_dict:
    # Load back temporarily offloaded state dict
    load_offloaded_weights(model, state_dict_index, state_dict_folder)
    shutil.rmtree(state_dict_folder)
```
Reload the temp offloaded CPU state dict now that RAM is free.
```
@@ -2013,18 +2124,22 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.PathLike]],
    config.name_or_path = pretrained_model_name_or_path

    # Instantiate model.
    init_contexts = [no_init_weights(_enable=_fast_init)]
```
If low_cpu_mem_usage=True, then no_init_weights is not needed, no? As far as I understand, when low_cpu_mem_usage=True all weights will be either meta or pretrained weights, and no init will happen anyway, no? But I guess it also doesn't hurt.
It doesn't really hurt, but it shouldn't be needed, yes.
Not necessarily linked to this PR, but in general the following code fails:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
inputs = tokenizer("Task: copy but say the opposite. PSG won its match against Barca.", return_tensors="pt")
#inputs = inputs.to(0)
output = model(inputs["input_ids"])
```

Should we maybe throw a nice warning in this case?
Warning, no, but assert yes - it's abnormal if a model is returned with weights that are on meta. The whole meta device thing is a behind-the-scenes hack and it shouldn't bleed out to user-land, IMHO.
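A sketch of such an assert (hypothetical helper name; where exactly it would live in from_pretrained is not decided here):

```python
from torch import nn

def assert_no_meta_params(model: nn.Module) -> None:
    # No parameter may still be on the meta device by the time the model is
    # handed back to the user.
    for name, param in model.named_parameters():
        assert not param.is_meta, f"{name} is still on the meta device"
```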
```python
if device_map is None:
    param_device = "cpu"
else:
    while len(module_name) > 0 and module_name not in device_map:
```
A comment would be super nice here to understand a bit what is happening. Maybe something like:

```python
# find next higher level module that is defined in device_map: bert.lm_head.weight -> bert.lm_head -> bert -> ''
```
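In context, the suggested comment and loop would read like this (standalone toy example with a hypothetical device_map):

```python
device_map = {"bert": 0, "lm_head": "cpu"}  # hypothetical map
module_name = "bert.encoder.layer.0.attention.self.query.weight"

# Find the next higher-level module that is defined in device_map:
# bert.lm_head.weight -> bert.lm_head -> bert -> ''
while len(module_name) > 0 and module_name not in device_map:
    module_name = ".".join(module_name.split(".")[:-1])

print(module_name)  # "bert"
```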
Cool! Also tried it out on OPT-30b and it works well.
Thank you for working on this Sylvain - curious to see where it'd lead.
I ran most of the deepspeed tests - nothing is broken.
Added a few nits.
Also, please check out a related interesting new development at NVIDIA with GPUDirect Storage https://docs.nvidia.com/gpudirect-storage/configuration-guide/index.html which would allow allocating tensors on disk.
Tunji is working on this feature in DeepSpeed; this would allow tensor.to(nvme) and then using it as a normal tensor.
Additionally, Tunji and I are working on a universal checkpoint for huge models which doesn't contain any topology data and can shrink/expand on the fly. This is based on my earlier proposal for a checkpoint format where each tensor is a separate file.
The problem with all other current approaches is that they require TBs of CPU memory for models like 176B if you have to manipulate optim_states, etc.
And the next step will be where we load a checkpoint and it'd use 0 CPU memory and go directly from disk to the target GPU.
```python
else:
    while len(module_name) > 0 and module_name not in device_map:
        module_name = ".".join(module_name.split(".")[:-1])
    if module_name == "" and "" not in device_map:
```
What would `device_map[""]` signify?
If the whole model goes on the same device, the device_map is `{"": device}` when it's auto-inferred.
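For instance (hypothetical devices):

```python
# Auto-inferred maps when everything fits on a single device: the empty-string
# key means "the whole model".
device_map = {"": 0}      # entire model on GPU 0
device_map = {"": "cpu"}  # entire model on CPU
```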
Thank you, Sylvain - perhaps adding that in a comment would make it easier to follow the code.
```
To have Accelerate compute the most optimized `device_map` automatically, set `device_map="auto"`.
offload_folder (`str` or `os.PathLike`, *optional*):
    If the `device_map` contains any value `"disk"`, the folder where we will offload weights.
```
I wasn't able to parse this last sentence.
src/transformers/modeling_utils.py (Outdated)
```
If the `device_map` contains any value `"disk"`, the folder where we will offload weights.
offload_state_dict (`bool`, *optional*, defaults to `False`):
    If `True`, will temporarily offload the CPU state dict to the hard drive to avoid running out of
    CPU RAM if the weight of the CPU state dict + the biggest shard does not fit.
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
biggest shard of what?
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Thanks a lot for your reviews @patrickvonplaten and @stas00 !
The model should be fully initialized outside of the meta device. I haven't checked yet models with randomly initialized heads (as the primary goal is inference) but will make sure this is fixed before merging.
Once it's landed I'd be very interested in using it when
Note that in this instance passing a
@tjruwase, just a heads up - as you work on these new features, could you please consider making the offload/prefetch API public so that the HF Trainers and the core could make direct use of them? Thank you! Though I understand that it's deeply tied into the tracing mechanism, which is currently inseparable from the pre-fetch mechanism - the tracing mechanism figures out which params to prefetch and when. But perhaps we can discuss with Sylvain how he envisions using it.
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Very clean! Looking forward to the tests :)
Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
…ace#17341)

* Initial work
* More or less finished with first draft
* Update src/transformers/modeling_utils.py
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* Update src/transformers/modeling_utils.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Fix randomly initialized weights
* Update src/transformers/modeling_utils.py
Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
* Address review comments
* Rename DeepSpeed folder to temporarily fix the test issue?
* Revert to try if Accelerate fix works
* Use latest Accelerate release
* Quality and fixes
* Style
* Quality
* Add doc
* Test + fix
* More blocks

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
What does this PR do?
This PR is a first draft for using the newly released big model inference APIs from Accelerate inside `from_pretrained`. For now it does this with the option `low_cpu_mem_usage=True` and:
- when a `device_map` is passed
- `device_map="auto"` will auto-infer a proper device map with the available GPU(s) RAM and CPU RAM

This PR is just a first step, there is a bit more cleanup to do, namely:
Example of use:
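A sketch of what usage could look like (model picked for illustration; exact placement depends on the available GPU and CPU RAM):

```python
from transformers import AutoModelForSeq2SeqLM

# device_map="auto" spreads the checkpoint over the available GPU(s), then CPU
# RAM, then the disk offload folder if needed.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "t5-3b",
    device_map="auto",
    offload_folder="offload",  # only used if some weights end up on "disk"
)
```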
Still missing: