raise error for duplicate accelerate config values when using deepspeed_config_file
#941
Conversation
The documentation is not available anymore as the PR was closed or merged.
LGTM but let's also wait for @stas00 to check this.
Thanks for working on it!
Thank you very much for working on this, Sourab!
I think perhaps the issue I posted didn't succeed in communicating what the main problem is.
The mismatch is one issue, and your PR takes care of it fantastically, but what we really want is one definitive source of information.
The problem is that unless users have spent enough time with this code base, they will try to maintain 2 sets of configs, which makes work very painful. This is not hypothetical: it comes from my own experience and that of others on my team who don't understand that they needn't maintain 2 definitions. I have just discovered that myself.
What I propose is that if ds_config is used, any config duplicates are outright refused. As in:
The ds_config already defines the value for gradient_accumulation; please remove the ambiguous gradient accumulation setting from the accelerate config file.
Or alternatively, require a single source either way, since we don't want to restrict users' freedom in how they do things.
In other words, every time a duplicate is detected, Accelerate will assert and say:
Detected gradient_accumulation defined in both ds_config and the accelerate config file; as this creates ambiguity, please remove one of the settings.
Does my thinking make sense?
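The check proposed above could be sketched roughly like this. This is a hypothetical helper for illustration only, not Accelerate's actual implementation; the function name and the flat-dict shape of the inputs are my own assumptions:

```python
# Hypothetical sketch (not Accelerate's code): refuse duplicate settings
# when a DeepSpeed JSON config file is also supplied.
def check_ambiguity(accelerate_ds_fields: dict, ds_config: dict) -> None:
    # Any field set in both places creates two sources of truth.
    duplicates = [
        key for key, value in accelerate_ds_fields.items()
        if value is not None and key in ds_config
    ]
    if duplicates:
        raise ValueError(
            f"Detected {duplicates} defined in both ds_config and the "
            "accelerate config file; as this creates ambiguity, please "
            "remove one of the settings."
        )

try:
    check_ambiguity(
        {"gradient_accumulation_steps": 1},
        {"gradient_accumulation_steps": 10},
    )
except ValueError as err:
    print(err)
```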
I tried this branch, getting:
configs:
@pacman100 default needs to be 1 instead of
Also clarifying defaults for args in this PR. Now,
`deepspeed_config_file`
Now, the error will be this:
Hello @stas00, let us know if this addresses the issue. When
Thanks for working on this! Left a couple of nits.
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Thank you for further improvements, Sourab.
Will this still work with `auto` entries in `ds_config.json`, which were designed to be filled either by the system when they need to be calculated at run time, or via cmd line args? I'm yet to try it out, but I have a feeling that the latter will no longer work, would it?
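For reference, filling `auto` entries at run time can be sketched as a toy recursive substitution. This is my own simplified illustration, not Accelerate's implementation; it assumes the overrides dict mirrors the config's nesting:

```python
# Toy sketch: replace "auto" placeholders in a DeepSpeed-style config with
# concrete values supplied at run time. Not Accelerate's actual logic.
def fill_auto_values(ds_config: dict, overrides: dict) -> dict:
    filled = {}
    for key, value in ds_config.items():
        if isinstance(value, dict):
            # Recurse into nested sections, e.g. "zero_optimization".
            filled[key] = fill_auto_values(value, overrides.get(key, {}))
        elif value == "auto" and key in overrides:
            filled[key] = overrides[key]
        else:
            # Non-"auto" values (and unfilled "auto"s) pass through untouched.
            filled[key] = value
    return filled

print(fill_auto_values(
    {"train_micro_batch_size_per_gpu": "auto",
     "zero_optimization": {"stage": "auto"}},
    {"train_micro_batch_size_per_gpu": 8,
     "zero_optimization": {"stage": 3}},
))
```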
But this shows otherwise:
why can't it be
I wonder if we are having a miscommunication here. I brought up the issue of duplication between the 2 styles of setting the ds config, since accelerate has used its own config files from the beginning, but I have never suggested that support for setting values via cmd line args should be dropped.
Regarding
OK, so your logic is different from HF Trainer's then. The HF Trainer was directing cmd args into the config file's
I'm not saying the 2 logics have to match. If I am not mistaken, the accelerate logic is less flexible, but it's OK if you prefer it that way. In HF Trainer the
As accelerate is meant to work with all models, not just Transformers, with the user being in control of the training loop, the user is in charge of all the arguments, and the naming convention of arguments will differ across users. On the other hand, in Trainer, users are restricted to a given args set, and as such those can be used to fill the DeepSpeed config due to the clear mapping between args and DS config params. The idea is that artifacts sent to
In accelerate, the
Thank you for explaining this to me, Sourab, but I'm having a hard time following how Accelerate is any different from HF Trainer wrt sending cmd line arg values to the unfilled config values in ds_config. E.g., the Accelerate launcher provides an explicit list of cmd line args for DeepSpeed use. There is a 1:1 mapping here as well. Could you please explain how this is different from the HF Trainer? But as I said above, it's totally fine if you prefer to do it this way, Sourab. This doesn't prevent users from doing what they need.
Understood. More work, but doable. Thank you for the explanations.
So far, `accelerate launch` cmd args were used for filling deepspeed plugin fields and not for setting `auto` values. This PR enables that too. It also raises assertion errors when ambiguous values are passed in the accelerate config file while using `deepspeed_config_file`.
…cman100/accelerate into smangrul/ds-config-assertions
Users can have
Please note that
Latest changes:

Code:

```python
from accelerate import Accelerator
from accelerate.state import AcceleratorState


def main():
    accelerator = Accelerator()
    accelerator.print(f"{AcceleratorState()}")


if __name__ == "__main__":
    main()
```

Scenario 1: manually tampered accelerate config file having `deepspeed_config_file` along with other entries:
```yaml
command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: 'cpu'
  offload_param_device: 'cpu'
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
  deepspeed_config_file: 'ds_config.json'
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: null
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false
```
with `ds_config.json` being:

```json
{
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 3,
    "stage3_gather_16bit_weights_on_model_save": false,
    "offload_optimizer": {
      "device": "none"
    },
    "offload_param": {
      "device": "none"
    }
  },
  "gradient_clipping": 1.0,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": 10,
  "steps_per_print": 2000000
}
```
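To make the ambiguity in Scenario 1 concrete, here is a small illustrative script that reports which accelerate `deepspeed_config` fields collide with paths already defined in the JSON file. The field-to-path mapping below is my own assumption for this sketch, not Accelerate's internal table:

```python
# Illustrative only: map accelerate config fields to DeepSpeed JSON paths.
# This mapping is an assumption for the sketch, not Accelerate's code.
KEY_MAP = {
    "gradient_accumulation_steps": ("gradient_accumulation_steps",),
    "gradient_clipping": ("gradient_clipping",),
    "zero_stage": ("zero_optimization", "stage"),
    "offload_optimizer_device": ("zero_optimization", "offload_optimizer", "device"),
    "offload_param_device": ("zero_optimization", "offload_param", "device"),
    "zero3_save_16bit_model": ("zero_optimization", "stage3_gather_16bit_weights_on_model_save"),
}


def path_exists(config: dict, path: tuple) -> bool:
    # Walk the nested dict along `path`; True if every key is present.
    node = config
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return True


def find_collisions(accelerate_fields: dict, ds_config: dict) -> list:
    # Fields set in the accelerate config that the JSON file also defines.
    return [f for f in accelerate_fields
            if f in KEY_MAP and path_exists(ds_config, KEY_MAP[f])]


ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "none"},
        "offload_param": {"device": "none"},
    },
    "gradient_clipping": 1.0,
    "gradient_accumulation_steps": 10,
}
accelerate_fields = {
    "gradient_accumulation_steps": 1,
    "gradient_clipping": 1.0,
    "offload_optimizer_device": "cpu",
    "zero_stage": 3,
}
print(find_collisions(accelerate_fields, ds_config))
```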
Scenario 2: Use the solution suggested by the error to create a new accelerate config and check that no ambiguity error is thrown now.
```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: ds_config.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
use_cpu: false
```
Scenario 3: Setting the entries in `ds_config.json` to `auto` so they are filled at run time:
```json
{
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": "auto",
    "stage3_gather_16bit_weights_on_model_save": "auto",
    "offload_optimizer": {
      "device": "auto"
    },
    "offload_param": {
      "device": "auto"
    }
  },
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "steps_per_print": 2000000
}
```
Output:

```
Distributed environment: DEEPSPEED  Backend: nccl
Num processes: 4
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: fp16
ds_config: {'bf16': {'enabled': False}, 'zero_optimization': {'stage': 3, 'stage3_gather_16bit_weights_on_model_save': True, 'offload_optimizer': {'device': 'nvme'}, 'offload_param': {'device': 'cpu'}}, 'gradient_clipping': 1.0, 'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 5, 'steps_per_print': inf, 'fp16': {'enabled': True, 'auto_cast': True}}
```

Note: Remaining
Looks fantastic, Sourab! Thank you for the improvements and for taking the time to lay out the different scenarios; if I'm not mistaken, those would make for perfect additions to the documentation, if that resonates (at the very least the last one, to demo how
BTW, the config generates things like:
why not just skip parts that the user hasn't asked for? It just makes the config scarier than it is, no? I'm asking since when I first looked at it I wasn't at all sure which of the empty placeholders were safe to remove and which weren't. My personal preference is for an active config: only ever list config entries that I work with, and any defaults should be just that, defaults, and not be listed at all. Which I suppose isn't the case with typical configs, where everything is listed out whether it's being used or not. And I can of course remove all those, so it's definitely not an issue; I'm just asking if my thinking resonates with you.
Sourab, I found one more ambiguous combo in one of our tests:
This combo is quietly getting accepted. I'm concerned that a developer may see
Do you think accelerate should assert when
Hello @stas00, with the current setup the below warning is given, which I think is fine:
UserWarning: DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.
Oh boy, I didn't see it. :( If a tree falls in a forest and no one is around to hear it, does it make a sound? I guess I need to start using the pragma that turns warnings into errors, but then some warnings can't be acted upon. :(
This is, for example, an even larger issue for tests, where the distributed setup hides most warnings, or again there are too many warnings to see anything.
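Promoting warnings to errors, as mentioned above, is a standard Python mechanism via the `warnings` module (or `python -W error` on the command line); here is a minimal sketch showing how the warning quoted earlier would become fatal instead of scrolling by unnoticed:

```python
import warnings

# Promote UserWarnings to exceptions so a misconfiguration warning cannot
# pass silently. This is plain Python, nothing Accelerate-specific.
warnings.filterwarnings("error", category=UserWarning)

try:
    warnings.warn(
        "DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. "
        "Setting it to False.",
        UserWarning,
    )
except UserWarning as err:
    print(f"previously a silent warning, now fatal: {err}")
```

The downside, as noted above, is that warnings you cannot act upon (e.g. from third-party dependencies) also become fatal, so in practice the filter is often scoped with `warnings.catch_warnings()` or narrowed by the `module` argument of `filterwarnings`.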
Left some nits on the doc. Why did we switch the default for `zero3_init_flag` to `True`?
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
What does this PR do?
Example: `deepspeed_config_file` and other ds config entries that are available in the json config file: