Reshape ZeroStage=0 FP16 Checkpoint #2031

Open · Muennighoff opened this issue Jun 20, 2022 · 5 comments
Labels: bug (Something isn't working)

Muennighoff commented Jun 20, 2022

What is the best way to reshape a checkpoint trained with zero stage = 0 & fp16?

I see two options:
a) Continue training with zero stage 1 for 1 step & adapt this PR to work with fp16
b) Adapt the script here to work without needing ZeRO checkpoints; the difficult part will just be reshaping the optimizer states in the mp_rank files (a rough sketch of inspecting one of these files is below)
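
To make option (b) concrete, here is a minimal sketch of inspecting one mp_rank file to see the optimizer state that would need reshaping. The path and the key names are assumptions based on typical Megatron-DeepSpeed checkpoints, not taken from this thread:

```python
# Minimal sketch for option (b): peek at the fp16 optimizer state stored in one
# mp_rank file. The path and key names ("optimizer", etc.) are assumptions based
# on typical Megatron-DeepSpeed checkpoints and may differ in practice.
import torch

ckpt_path = "global_step1000/mp_rank_00_model_states.pt"  # hypothetical path
ckpt = torch.load(ckpt_path, map_location="cpu")

print(list(ckpt.keys()))           # e.g. module, optimizer, lr_scheduler, args, ...
opt_state = ckpt.get("optimizer")  # the part that would need reshaping
if opt_state is not None:
    for key, value in opt_state.items():
        print(key, type(value))
```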

Maybe @tjruwase could give me a quick hint on whether a) or b) makes more sense before I waste my time? Thanks!

Muennighoff added the bug label on Jun 20, 2022
tjruwase (Contributor) commented

@Muennighoff, thanks for your question. Can you please clarify a bit more, because zero_stage=0 actually disables ZeRO and is pure DDP? The only reshaping needs I can imagine in that case would be due to tensor parallelism or pipeline parallelism.

Muennighoff (Author) commented Jun 20, 2022

> @Muennighoff, thanks for your question. Can you please clarify a bit more, because zero_stage=0 actually disables ZeRO and is pure DDP? The only reshaping needs I can imagine in that case would be due to tensor parallelism or pipeline parallelism.

Yes, there's no ZeRO used, only TP & PP. The TP is based on the Megatron-DS implementation. Specifically, I am looking at a TP=4, PP=4 model. Based on my understanding, I need to change the layer files due to TP and the mp_rank files due to TP & PP.
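
For reference, a sketch of how the files in one checkpoint step directory of a TP=4, PP=4 run could be grouped by what reshaping touches; the glob patterns and the directory name are assumptions based on common Megatron-DeepSpeed naming:

```python
# Sketch: group the checkpoint files of a TP=4, PP=4 run to show which ones are
# touched by reshaping. Patterns and the directory name are assumptions based on
# common Megatron-DeepSpeed naming and may differ in your checkpoints.
from pathlib import Path

step_dir = Path("checkpoints/global_step1000")  # hypothetical path

# per-layer weight shards, split across TP ranks
layer_files = sorted(step_dir.glob("layer_*-model_*-model_states.pt"))
# per-rank state files (incl. optimizer state), one per (TP, PP) rank
mp_files = sorted(step_dir.glob("mp_rank_*_model_states.pt"))

print(f"{len(layer_files)} layer files -> reshape when TP changes")
print(f"{len(mp_files)} mp_rank files -> reshape when TP or PP changes")
```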

How would you go about it?

tjruwase (Contributor) commented Jun 20, 2022

Great, thanks for the clarification. Also, do you need to reshape just the model weights, or the optimizer state as well? The reshaping logic you reference is split across bigscience/megatron-deepspeed and deepspeed, is very new, and has only been tested with bf16 + pipeline parallelism + ZeRO stage 1.

In terms of your proposed options, I feel (b) is more straightforward and thus easier. Option (a) would require (1) creating ZeRO checkpoints only for the sake of reshaping and (2) porting the reshaping changes from the bf16_optimizer into the fp16 zero_stage_1 optimizer. Although option (b) requires changes to the reshaping script, I think those changes will be useful anyway for non-ZeRO training scenarios such as yours. Does that make sense?

Perhaps @stas00, who is the co-author of the reshaping feature, might have some thoughts as well.

Muennighoff (Author) commented

Yes, I need to continue training in the new shape, so I think I will also need to reshape the optimizer states. I will continue training with zero stage 1, however.

Thanks for your thoughts! I will work on (b) then. I think I only need to figure out how to merge the optimizer states in the mp_rank files correctly.
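
As a starting point, a rough sketch of merging one parameter's Adam moments across TP ranks; the state keys and the assumption that each tensor is split along a single partition dimension mirror Megatron-style TP but are not taken from this thread:

```python
# Rough sketch: merge one parameter's Adam state across TP ranks by concatenating
# along its tensor-parallel split dimension (0 for column-parallel weights, 1 for
# row-parallel ones in Megatron-style TP). Keys and layout are assumptions.
import torch

def merge_param_state(rank_states, partition_dim):
    """Concatenate per-rank Adam moments for one parameter along its TP split dim."""
    merged = {}
    for key in ("exp_avg", "exp_avg_sq"):        # Adam first/second moments
        merged[key] = torch.cat([s[key] for s in rank_states], dim=partition_dim)
    merged["step"] = rank_states[0].get("step")  # should be identical on every rank
    return merged
```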

stas00 (Contributor) commented Jun 27, 2022

Yes, once the bf16/z0 PR is merged we can look at fp16/z0 next.

The other approach is to:

  1. start with random optimizer states
  2. run for some steps with LR=0 to let the optimizer catch up
  3. resume training with the normal LR

The details and the math for how many steps to run are in the 104B chronicles; I can dig up the link if you want to explore this option.
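
For completeness, a minimal sketch of that alternative: hold the learning rate at zero for a number of steps so the freshly initialized Adam moments can accumulate, then return to the base schedule. The freeze length and the scheduler wiring here are placeholders, not the values from the 104B chronicles:

```python
# Sketch: keep LR at 0 for the first N optimizer steps so freshly initialized
# Adam moments can warm up, then return to the base LR. N_FREEZE is a
# placeholder, not the value worked out in the 104B chronicles.
import torch

N_FREEZE = 100
model = torch.nn.Linear(8, 8)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: 0.0 if step < N_FREEZE else 1.0
)
# in the training loop: loss.backward(); optimizer.step(); scheduler.step()
```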
