Reshape ZeroStage=0 FP16 Checkpoint #2031
@Muennighoff, thanks for your question. Can you please clarify a bit more because …
Yes, there's no ZeRO used, only TP & PP. The TP is based on the Megatron-DS implementation. Specifically, I am looking at a TP=4, PP=4 model. Based on my understanding, I need to change the layer files due to TP and the mp_rank files due to TP & PP. How would you go about it?
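For context on what "changing the layer files due to TP" involves: in Megatron-style tensor parallelism, each TP rank's layer file holds a slice of every partitioned weight, so un-sharding is a concatenation along the partition dimension (dim 0 for column-parallel weights, dim 1 for row-parallel ones, replicated params taken from any one rank). A minimal numpy sketch of that merge, with illustrative shapes (the `hidden` size and shard contents here are made up, not from any real checkpoint):

```python
import numpy as np

# Sketch: merging TP=4 shards of one layer's weights back into full tensors.
# In Megatron-style TP, ColumnParallelLinear partitions its weight along
# dim 0 (output features) and RowParallelLinear along dim 1 (input features);
# layernorm parameters are replicated across TP ranks.
TP = 4
hidden = 8  # illustrative hidden size

def merge_tp_shards(shards, partition_dim):
    """Concatenate per-rank shards back into the full tensor."""
    if partition_dim is None:       # replicated param (e.g. layernorm)
        return shards[0]
    return np.concatenate(shards, axis=partition_dim)

# each rank holds a (4*hidden/TP, hidden) slice of the column-parallel h->4h weight
col_shards = [np.ones((hidden * 4 // TP, hidden)) * r for r in range(TP)]
full_col = merge_tp_shards(col_shards, partition_dim=0)

# each rank holds a (hidden, 4*hidden/TP) slice of the row-parallel 4h->h weight
row_shards = [np.ones((hidden, hidden * 4 // TP)) * r for r in range(TP)]
full_row = merge_tp_shards(row_shards, partition_dim=1)

print(full_col.shape)  # (32, 8)
print(full_row.shape)  # (8, 32)
```

Re-sharding to a new TP degree is the inverse: split the merged tensor along the same partition dimension into the new number of slices.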
Great, thanks for the clarification. Also, do you need reshaping of just the model weights, or also of the optimizer state? The reshaping logic you reference is split across bigscience/megatron-deepspeed and deepspeed, is very new, and has only been tested with bf16 + pipeline parallelism + ZeRO stage 1. In terms of your proposed options, I feel (b) is more straightforward and thus easier. Option (a) would require (1) creating ZeRO checkpoints only for the sake of reshaping and (2) porting the reshaping changes in the bf16_optimizer into the fp16 zero_stage_1 optimizer. Although option (b) requires changes to the reshaping script, I think those changes will be useful anyway for non-ZeRO training scenarios such as yours. Does that make sense? Perhaps @stas00, who is the co-author of the reshaping feature, has some thoughts as well.
Yes, I need to continue training in the new shape, so I think I will also need to reshape the optimizer states. I will continue training with ZeRO stage 1, however. Thanks for your thoughts! I will work on (b) then. I think I only need to figure out how to merge the optimizer states in the mp_rank files correctly.
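One observation that may simplify merging the optimizer states: Adam's per-parameter states (`exp_avg`, `exp_avg_sq`) have the same shape as their parameter, so they merge along exactly the same partition dimension as the weight itself, while scalar state like `step` is replicated. A hedged sketch of that idea; the dict layout and `load_rank_state` stand-in below are illustrative, not the actual on-disk format DeepSpeed writes into mp_rank files:

```python
import numpy as np

# Sketch: merging per-TP-rank Adam optimizer states for one parameter that is
# partitioned along dim 0. In a real script, load_rank_state would be a
# torch.load of the corresponding mp_rank file (hypothetical layout here).
TP = 4
rows_per_rank, cols = 4, 8

def load_rank_state(rank):
    # stand-in for reading rank `rank`'s optimizer state from disk
    return {
        "exp_avg": np.full((rows_per_rank, cols), rank, dtype=np.float32),
        "exp_avg_sq": np.full((rows_per_rank, cols), rank**2, dtype=np.float32),
        "step": 1000,
    }

rank_states = [load_rank_state(r) for r in range(TP)]
merged = {
    key: np.concatenate([s[key] for s in rank_states], axis=0)  # same dim as the weight
    for key in ("exp_avg", "exp_avg_sq")
}
merged["step"] = rank_states[0]["step"]  # scalar state is replicated across ranks

print(merged["exp_avg"].shape)  # (16, 8)
```

The tricky part in practice is knowing each parameter's partition dimension (or that it is replicated), which has to come from the model definition rather than the checkpoint itself.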
Yes, once the bf16/z0 PR is merged, we can look at fp16/z0 next. The other approach is to:
The details and math of how many steps to run are in the 104B chronicles; I can dig up the link if you want to explore this option.
What is the best way to reshape a checkpoint trained with ZeRO stage 0 & fp16?
I see two options:
a) Continue training with ZeRO stage 1 for 1 step and adapt this PR to work with fp16
b) Adapt the script here to work without needing ZeRO checkpoints; the difficult part will just be reshaping the optimizer states in the mp_rank files
Maybe @tjruwase could give me a quick hint whether a) or b) makes more sense before I waste my time? Thanks!
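For option (b), a natural first step is grouping the per-TP-rank layer files by layer index so each group can be merged without any ZeRO checkpoint involved. A small sketch of that grouping, run against a fake checkpoint directory; the `layer_XX-model_YY-model_states.pt` naming is my understanding of DeepSpeed's pipeline-engine convention and should be verified against an actual checkpoint:

```python
import re
import tempfile
from collections import defaultdict
from pathlib import Path

# Sketch: collect the TP shards of each pipeline layer file so they can be
# merged. The filename pattern is assumed, not taken from any spec.
LAYER_RE = re.compile(r"layer_(\d+)-model_(\d+)-model_states\.pt")

def group_layer_files(ckpt_dir):
    """Map layer index -> list of per-TP-rank shard paths."""
    groups = defaultdict(list)
    for path in sorted(Path(ckpt_dir).iterdir()):
        m = LAYER_RE.fullmatch(path.name)
        if m:
            groups[int(m.group(1))].append(path)
    return dict(groups)

# demo on a fake TP=4 checkpoint directory with two layers
with tempfile.TemporaryDirectory() as d:
    for layer in (0, 1):
        for tp in range(4):
            (Path(d) / f"layer_{layer:02d}-model_{tp:02d}-model_states.pt").touch()
    groups = group_layer_files(d)
    print(sorted(groups))  # [0, 1]
    print(len(groups[0]))  # 4
```

Each group would then be merged tensor-by-tensor along the right partition dimension, and the mp_rank files handled separately for the optimizer states.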