
Why not save frozen params unless: self.zero_optimization_stage() >= ZeroStageEnum.gradients? #5439

Open · freckletonj opened this issue Apr 19, 2024 · 3 comments

@freckletonj

I've spent 2 days drilling into why my frozen params aren't getting saved, and it comes down to this line:

https://github.com/microsoft/DeepSpeed/blob/c632ea09f8d107d10f76aa2b776e4df3c1ccf98a/deepspeed/runtime/engine.py#L3297C1-L3297C107

        save_frozen_param = self.zero_optimization_partition_gradients() and not exclude_frozen_parameters

exclude_frozen_parameters is therefore misleading, since that is not the only determinant of whether frozen params get saved.
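
For context, my reading of the surrounding code (combining that line with the stage check in the title) is roughly the sketch below; the body of zero_optimization_partition_gradients here is my paraphrase, not a verbatim copy:

    # My paraphrase of the gate in engine.py, as I understand it: frozen
    # params are only written out when gradient partitioning (ZeRO stage >= 2)
    # is active AND the caller did not opt out.
    def zero_optimization_partition_gradients(self):
        return self.zero_optimization_stage() >= ZeroStageEnum.gradients  # gradients == 2

    save_frozen_param = self.zero_optimization_partition_gradients() and not exclude_frozen_parameters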

To make matters more confusing, I am using deepspeed 2, but if I set a breakpoint inside that zero_optimization_partition_gradients function, I see:

(Pdb) self.zero_optimization_stage()
1
(Pdb) ZeroStageEnum.gradients
<ZeroStageEnum.gradients: 2>

Why is this, and is there a straightforward non-hacky solution to get frozen params to save?

@tjruwase
Contributor

@freckletonj, thanks for reporting this issue. I agree it is quite confusing, sorry about that. Unfortunately, I can't remember the rationale for including self.zero_optimization_partition_gradients() in the conditional logic.

Can you please clarify what you mean by "deepspeed 2"? Do you mean you are using zero stage 2? Can you please share your ds_config? Your breakpoint printout suggests that you are running zero stage 1.

@freckletonj
Author

@tjruwase thanks for the fast response!

Yes, I'm using ZeRO stage 2 via PyTorch Lightning, with a config.yaml:

trainer:
  accelerator: gpu
  devices: auto
  num_nodes: 1
  strategy: deepspeed_stage_2
...
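
(For reference, I believe the programmatic equivalent of that strategy string is roughly the sketch below; this assumes Lightning's DeepSpeedStrategy API and is not my actual training script.)

    # Rough programmatic equivalent of strategy: deepspeed_stage_2 (a sketch,
    # assuming Lightning's DeepSpeedStrategy; not my actual training script).
    import lightning.pytorch as pl
    from lightning.pytorch.strategies import DeepSpeedStrategy

    trainer = pl.Trainer(
        accelerator="gpu",
        devices="auto",
        num_nodes=1,
        strategy=DeepSpeedStrategy(stage=2),  # expected to run ZeRO stage 2
    )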

I was surprised to see the breakpoint print that I'm in stage 1, but I think that's a separate issue from the confusing conditional logic.

And there's a chance I'm just going about this all wrong; I'm new to both Lightning and DeepSpeed, so forgive me if I'm overlooking something important :)

To clarify, my only concern is how to save frozen params along with the model.
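
What I was hoping would be enough is roughly this (a sketch; I'm assuming the engine's save_checkpoint accepts the same exclude_frozen_parameters flag that appears in the conditional above):

    # Sketch of what I expected to work; `engine` stands for the DeepSpeed
    # engine that Lightning wraps, and exclude_frozen_parameters is the same
    # flag referenced in the conditional above.
    engine.save_checkpoint(
        "checkpoints/",
        exclude_frozen_parameters=False,  # explicitly ask to keep frozen params
    )
    # ...but with zero_optimization_partition_gradients() returning False
    # (ZeRO stage < 2), the frozen params are still dropped.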


Some more background: I'm working on a fork of the RWKV project, where the weights are saved with a copy of zero_to_fp32.py.

I've added a hack to this file to keep the params, which do live under the state dict's 'module' key but not under FROZEN_PARAM_SHAPES, where they'd get picked up automatically: RWKV/RWKV-infctx-trainer@51f9173#diff-d1b1e811618e950083898fd2b934639a17307a0d339ee61aa96f3d7539463e26R142
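
The gist of that hack is roughly the sketch below; the mp_rank_00_model_states.pt filename and the 'module' key are my observations of the checkpoint layout, not something I pulled from the DeepSpeed docs:

    # Simplified sketch of the workaround in my patched zero_to_fp32.py.
    # Assumption: the full set of tensors (frozen + trainable) lives under the
    # 'module' key of mp_rank_00_model_states.pt, as observed in my checkpoints.
    import os
    import torch

    def merge_frozen_params(checkpoint_dir, fp32_state_dict):
        model_states = torch.load(
            os.path.join(checkpoint_dir, "mp_rank_00_model_states.pt"),
            map_location="cpu",
        )
        for name, tensor in model_states["module"].items():
            # Copy back anything the ZeRO conversion dropped (i.e. params that
            # were not recorded under FROZEN_PARAM_SHAPES).
            if name not in fp32_state_dict:
                fp32_state_dict[name] = tensor.clone()
        return fp32_state_dict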

I've also tried lightning's version of this function, but it also drops the frozen params: https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.utilities.deepspeed.html#lightning.pytorch.utilities.deepspeed.convert_zero_checkpoint_to_fp32_state_dict

tjruwase assigned jomayeri and samadejacobs and unassigned jomayeri on May 13, 2024
@tjruwase
Contributor

Some more background: I'm working on a fork of the RWKV project, where the weights are saved with a copy of zero_to_fp32.py.

@freckletonj, apologies for the delayed response here. Is this the RWKV project? https://github.com/BlinkDL/RWKV-LM.

Can you please share your current status? Can you provide repro steps for us? Thanks!
