Reshape ZeroStage=0 FP16 Checkpoint #2031

Open · Muennighoff opened this issue Jun 20, 2022 · 5 comments
Labels: bug (Something isn't working)

Muennighoff commented Jun 20, 2022

What is the best way to reshape a checkpoint trained with zero stage = 0 & fp16?

I see two options:
a) Continue training with zero stage 1 for 1 step & adapt this PR to work with fp16
b) Adapt the script here to work without needing ZeRO checkpoints; the difficult part will just be reshaping the optimizer states in the mp_rank files (a rough sketch of inspecting one of these files is below)
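
To make option (b) concrete, here is a minimal sketch of inspecting one mp_rank file to see the optimizer state that would need reshaping. The path and the key names are assumptions based on typical Megatron-DeepSpeed checkpoints, not taken from this thread:

```python
# Minimal sketch for option (b): peek at the fp16 optimizer state stored in one
# mp_rank file. The path and key names ("optimizer", etc.) are assumptions based
# on typical Megatron-DeepSpeed checkpoints and may differ in practice.
import torch

ckpt_path = "global_step1000/mp_rank_00_model_states.pt"  # hypothetical path
ckpt = torch.load(ckpt_path, map_location="cpu")

print(list(ckpt.keys()))           # e.g. module, optimizer, lr_scheduler, args, ...
opt_state = ckpt.get("optimizer")  # the part that would need reshaping
if opt_state is not None:
    for key, value in opt_state.items():
        print(key, type(value))
```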

Maybe @tjruwase could give me a quick hint on whether a) or b) makes more sense before I waste my time? Thanks!

Muennighoff added the bug label on Jun 20, 2022
tjruwase (Contributor) commented

@Muennighoff, thanks for your question. Can you please clarify a bit more, because zero_stage=0 actually disables ZeRO and is pure DDP? The only reshaping needs I can imagine in that case would be due to tensor parallelism or pipeline parallelism.

Muennighoff (Author) commented Jun 20, 2022

> @Muennighoff, thanks for your question. Can you please clarify a bit more, because zero_stage=0 actually disables ZeRO and is pure DDP? The only reshaping needs I can imagine in that case would be due to tensor parallelism or pipeline parallelism.

Yes, there's no ZeRO used, only TP & PP. The TP is based on the Megatron-DS implementation. Specifically, I am looking at a TP=4, PP=4 model. Based on my understanding, I need to change the layer files due to TP and the mp_rank files due to TP & PP.
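
For reference, a sketch of how the files in one checkpoint step directory of a TP=4, PP=4 run could be grouped by what reshaping touches; the glob patterns and the directory name are assumptions based on common Megatron-DeepSpeed naming:

```python
# Sketch: group the checkpoint files of a TP=4, PP=4 run to show which ones are
# touched by reshaping. Patterns and the directory name are assumptions based on
# common Megatron-DeepSpeed naming and may differ in your checkpoints.
from pathlib import Path

step_dir = Path("checkpoints/global_step1000")  # hypothetical path

# per-layer weight shards, split across TP ranks
layer_files = sorted(step_dir.glob("layer_*-model_*-model_states.pt"))
# per-rank state files (incl. optimizer state), one per (TP, PP) rank
mp_files = sorted(step_dir.glob("mp_rank_*_model_states.pt"))

print(f"{len(layer_files)} layer files -> reshape when TP changes")
print(f"{len(mp_files)} mp_rank files -> reshape when TP or PP changes")
```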

How would you go about it?

tjruwase (Contributor) commented Jun 20, 2022

Great, thanks for the clarification. Also, do you need to reshape just the model weights, or the optimizer state as well? The reshaping logic you reference is split across bigscience/megatron-deepspeed and deepspeed, is very new, and has only been tested with bf16 + pipeline parallelism + ZeRO stage 1.

In terms of your proposed options, I feel (b) is more straightforward and thus easier. Option (a) would require (1) creating ZeRO checkpoints only for the sake of reshaping and (2) porting the reshaping changes from the bf16_optimizer into the fp16 zero_stage_1 optimizer. Although option (b) requires changes to the reshaping script, I think those changes will be useful anyway for non-ZeRO training scenarios such as yours. Does that make sense?

Perhaps @stas00, who is the co-author of the reshaping feature, might have some thoughts as well.

Muennighoff (Author) commented

Yes, I need to continue training in the new shape, so I think I will also need to reshape the optimizer states. I will continue training with zero stage 1, however.

Thanks for your thoughts! I will work on (b) then. I think I only need to figure out how to merge the optimizer states in the mp_rank files correctly.
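
As a starting point, a rough sketch of merging one parameter's Adam moments across TP ranks; the state keys and the assumption that each tensor is split along a single partition dimension mirror Megatron-style TP but are not taken from this thread:

```python
# Rough sketch: merge one parameter's Adam state across TP ranks by concatenating
# along its tensor-parallel split dimension (0 for column-parallel weights, 1 for
# row-parallel ones in Megatron-style TP). Keys and layout are assumptions.
import torch

def merge_param_state(rank_states, partition_dim):
    """Concatenate per-rank Adam moments for one parameter along its TP split dim."""
    merged = {}
    for key in ("exp_avg", "exp_avg_sq"):        # Adam first/second moments
        merged[key] = torch.cat([s[key] for s in rank_states], dim=partition_dim)
    merged["step"] = rank_states[0].get("step")  # should be identical on every rank
    return merged
```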

stas00 (Contributor) commented Jun 27, 2022

Yes, once the bf16/z0 PR is merged we can look at fp16/z0 next.

The other approach is to:

  1. start with random optimizer states
  2. run for some steps with LR=0 to let the optimizer catch up
  3. resume training with the normal LR

The details and the math for how many steps to run are in the 104B chronicles; I can dig up the link if you want to explore this option.
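
For completeness, a minimal sketch of that alternative: hold the learning rate at zero for a number of steps so the freshly initialized Adam moments can accumulate, then return to the base schedule. The freeze length and the scheduler wiring here are placeholders, not the values from the 104B chronicles:

```python
# Sketch: keep LR at 0 for the first N optimizer steps so freshly initialized
# Adam moments can warm up, then return to the base LR. N_FREEZE is a
# placeholder, not the value worked out in the 104B chronicles.
import torch

N_FREEZE = 100
model = torch.nn.Linear(8, 8)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: 0.0 if step < N_FREEZE else 1.0
)
# in the training loop: loss.backward(); optimizer.step(); scheduler.step()
```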
