
[REQUEST] universal checkpoint for ZeRO - 1,2,3 #2921

Open
2 of 3 tasks
stas00 opened this issue Mar 1, 2023 · 33 comments · Fixed by #4516
Assignees
Labels
enhancement New feature or request

Comments

@stas00
Contributor

stas00 commented Mar 1, 2023

Is your feature request related to a problem? Please describe.

I think we now have all the components ready to do a universal checkpoint for ZeRO-1, 2, and 3, like we have for BF16Optimizer.

The need is to be able to add/remove GPUs once training has started, i.e. to resume from a checkpoint with a different number of GPUs.

Thank you.

Progress update:

@tjruwase

stas00 added the enhancement (New feature or request) label Mar 1, 2023
@stas00
Contributor Author

stas00 commented Jun 12, 2023

Any plans to work on that, Tunji? We could have really used that feature in the current 80b m4 training, as we would like to add new parameters that were previously frozen and thus aren't in the optimizer.

which also adds an interesting new feature request.

Train a new model with some pretrained frozen params and then towards the end of the training expand the optimizer to include the frozen params and unfreeze those to finetune the whole ensemble.

Granted, one could train from the beginning with lr=0 for the frozen params, but that would require a lot more memory from the start. So this approach could save days to weeks of compute on a large model training.
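
A rough PyTorch sketch of this unfreeze-and-extend idea, using plain torch.optim rather than DeepSpeed's ZeRO optimizer (the model and hyperparameters are purely illustrative):

import torch
from torch import nn

model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 4))

# Phase 1: keep the pretrained first layer frozen so it never enters the optimizer
# and consumes no optimizer-state memory.
frozen = list(model[0].parameters())
for p in frozen:
    p.requires_grad = False
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)

# ... train for most of the run ...

# Phase 2: towards the end, unfreeze the pretrained params and extend the optimizer
# with a new param group so they start receiving updates.
for p in frozen:
    p.requires_grad = True
optimizer.add_param_group({"params": frozen, "lr": 1e-5})

Under ZeRO the extra step is re-partitioning the sharded optimizer state to cover the newly trainable params, which is part of what makes a universal checkpoint format attractive here.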

@iMountTai

+1

@tjruwase
Contributor

@stas00, I believe ZeRO stage 1 should be supported: #2284. I think stage 2 might be working or pretty close. I do plan to work on this, so perhaps we can chat a bit about your timing.

@iMountTai

Excuse me, can stage 1 directly resume with a different number of cards, or do you need to set some parameters?

@GradientGuru

GradientGuru commented Jul 24, 2023

like we had for BF16Optimizer.

does it mean DeepSpeed already supports automatically changing the world size for ZeRO-1,2,3 if I use the following?

    "bf16": {
        "enabled": true
    }

@GradientGuru

@stas00, I believe ZeRO stage 1 should be supported: #2284. I think stage 2 might be working or pretty close. I do plan to work on this, so perhaps we can chat a bit about your timing.

I'm attempting to enhance DeepSpeed by enabling it to support a dynamic world size. This is particularly for the setup involving AdamW, stage 3, and bf16. However, I'm uncertain about the level of complexity in comparison to expanding DeepSpeed's support for a universal dynamic world size across all optimizers and precisions. Could you provide some insights on this matter?

@stas00
Contributor Author

stas00 commented Jul 24, 2023

does it mean deepspeed already supports automatically changing world size for Zero-1,2,3 if I use bf16

No, it's not. It's currently a confusing situation, as BF16Optimizer was written specifically for Megatron-DeepSpeed when we trained BLOOM-176B, so it works only in that framework.

As the heavy lifting to support the universal checkpoint has already been done, porting it to ZeRO should take significantly less effort than the initial work, since all the components are already in place. So it's really about @tjruwase and his team finding the time and prioritizing this effort. Clearly it's very desirable to many users at this point.

@GradientGuru

No, it's not. It's currently a confusing situation as BF16Optimizer was written specifically for Megatron-Deepspeed when we trained BLOOM-176B, so it works only in that framework.

Would it be possible, and perhaps easier, to convert DeepSpeed's checkpoint into the Megatron-DeepSpeed format, change the world size, and then convert back to DeepSpeed's format? 😀

@stas00
Contributor Author

stas00 commented Jul 24, 2023

Hmm, there you convert from/to a TP/DP/PP topology. In ZeRO-3 you only have DP, so perhaps it might be possible, but the converter won't find info on TP/PP and will probably fail, e.g. it'd expect a different set of shard files for TP and PP, which don't exist in ZeRO-3.

But even if the conversion to the universal checkpoint worked, the tricky part would be moving to the new topology, as again that code is written for the Meg-DS 3D topology.

But as I have just explained, the ZeRO case is much simpler than TP/DP/PP, so it should be relatively easy to make it work with just the ZeRO files.

@GradientGuru

But as I have just explained the ZeRO case is much simpler than TP/DP/PP so it should be relatively easy to make it work with just ZeRO files.

I think it can be achieved with a single tool similar to zero_to_fp32.py.

@GradientGuru

GradientGuru commented Jul 31, 2023

But as I have just explained the ZeRO case is much simpler than TP/DP/PP so it should be relatively easy to make it work with just ZeRO files.

I have implemented the conversion tool, and I now find myself faced with a minor question. In the 'bf16_zero_*_optim_states.pt' file, the loss scaler is stored as <deepspeed.runtime.fp16.loss_scaler.LossScaler object at 0x7f0733de5610>. However, the address 0x7f0733de5610 doesn't serve any purpose, correct? Additionally, is there a need to rescale the stored optimizer state (the gradient moment and squared-gradient moment for all trainable params) according to the old and new world sizes?

@stas00
Contributor Author

stas00 commented Jul 31, 2023

I'm just a contributor, so I am tagging @tjruwase, who hopefully will have the resources to address your questions, @GradientGuru.

@tjruwase
Contributor

@GradientGuru, saving loss_scaler as an object instead of state_dict is a bug. Please feel free to submit a PR. Thanks!
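
A self-contained sketch of the general fix pattern, using a toy stand-in class (the real LossScaler's attributes and DeepSpeed's actual checkpoint layout may differ; this is illustration only, not the DeepSpeed code):

import torch

class TinyLossScaler:
    # Toy stand-in for deepspeed.runtime.fp16.loss_scaler.LossScaler (illustrative only).
    def __init__(self, cur_scale=1.0):
        self.cur_scale = cur_scale

scaler = TinyLossScaler(cur_scale=4096.0)

# Buggy pattern: the whole Python object gets pickled, so the checkpoint prints as
# "<...LossScaler object at 0x7f0733de5610>"; the memory address itself carries no information.
torch.save({"loss_scaler": scaler}, "bad_ckpt.pt")

# Safer pattern: persist only plain data (a state_dict-style dict) and rebuild on load.
torch.save({"loss_scaler": {"cur_scale": scaler.cur_scale}}, "good_ckpt.pt")
restored = TinyLossScaler(**torch.load("good_ckpt.pt")["loss_scaler"])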

mrwyattii self-assigned this Aug 18, 2023
@orm011

orm011 commented Oct 5, 2023

I'm not sure this is the right place to ask. We're researchers in a situation where we sometimes get access to a bunch of GPUs and sometimes we don't, and we're counting on this donated GPU time to train a large model that requires DeepSpeed ZeRO-2 even just to fit into the GPUs.

We're trying to figure out how best to handle the changing world sizes that come with the above setting, as it looks like we currently would not be able to restore the optimizer state from a checkpoint. I'm wondering what advice you have on how we could proceed?

@stas00
Contributor Author

stas00 commented Oct 5, 2023

It's the perfect place to ask. @tjruwase, is it possible to raise the priority on this? This is a very critical requirement for users choosing DeepSpeed over other frameworks. Thanks a ton!

tjruwase self-assigned this Oct 9, 2023
@tjruwase
Contributor

tjruwase commented Oct 9, 2023

@stas00, thanks for this notification. We will raise the priority.

@orm011, are you using Megatron by any chance? There is some partial support there that you could start playing with.

@orm011

orm011 commented Oct 10, 2023

Didn't know about Megatron. Do you mean this library: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/megatron.html? Or is it something else?

@orm011

orm011 commented Oct 10, 2023

Or probably this one: https://github.com/microsoft/Megatron-DeepSpeed

@tjruwase
Contributor

Or probably this one: https://github.com/microsoft/Megatron-DeepSpeed

Or this one? https://github.com/bigscience-workshop/Megatron-DeepSpeed.

Both of the above are forks of the original from NVIDIA: https://github.com/NVIDIA/Megatron-LM.

@stas00
Contributor Author

stas00 commented Oct 25, 2023

@tjruwase, thank you for implementing the universal checkpoint for stage 1 in #4516.

This issue was opened for stages 1, 2, and 3, so perhaps it shouldn't have been closed yet?

@tjruwase
Contributor

@stas00, correct! I didn't know how to partially close an issue :)

tjruwase reopened this Oct 25, 2023
@stas00
Contributor Author

stas00 commented Oct 25, 2023

I updated the OP to note that stage 1 is done.

@zaptrem

zaptrem commented Feb 20, 2024

Also interested in stage 2 support.

@tjruwase
Contributor

@zaptrem, stage 1/2 and bf16_optimizer are supported. Only stage 3 support is pending.
https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/universal_checkpointing#zero-stage-2-training

@samadejacobs, @lekurile FYI
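
For reference, a rough sketch of driving the checkpoint-to-universal conversion from Python, following the workflow in that example (the script location and flag names reflect my reading of DeepSpeed's ds_to_universal.py and should be verified against its --help for your installed version; the checkpoint paths are placeholders):

import subprocess
import sys

# ds_to_universal.py ships under deepspeed/checkpoint/ in the DeepSpeed source tree.
subprocess.run(
    [
        sys.executable, "deepspeed/checkpoint/ds_to_universal.py",
        "--input_folder", "checkpoints/global_step1000",
        "--output_folder", "checkpoints/global_step1000_universal",
    ],
    check=True,
)

Resuming from the converted folder then requires telling the training script to load the universal format; in the linked Megatron-DeepSpeed example this is done with a --universal-checkpoint argument.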

@zaptrem

zaptrem commented Feb 20, 2024

@zaptrem, stage 1/2 and bf16_optimizer are supported. Only stage 3 support is pending.

https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/universal_checkpointing#zero-stage-2-training

@samadejacobs, @lekurile FYI

We tried to use this last night, and the universal checkpoint conversion script failed because our DS checkpoint was missing the universal checkpoint info (UNIVERSAL_CHECKPOINT_INFO). We commented out all references to it and converted anyway, then got this error when we tried to restore from the newly converted universal checkpoint:

deepspeed.runtime.zero.utils.ZeRORuntimeException: The checkpoint being loaded used a DP world size of 8 but the current world size is 128. Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.

@lekurile
Contributor

We tried to use this last night and the universal checkpoint conversion script failed because our DS checkpoint was missing universal _checkpoint_info. We commented out all references to that and converted it anyway then got this error when we tried to restore from the newly converted universal checkpoint:

deepspeed.runtime.zero.utils.ZeRORuntimeException: The checkpoint being loaded used a DP world size of 8 but the current world size is 128. Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.

Hello @zaptrem,

Just as a clarification, are you using Megatron-DeepSpeed for creation of the checkpoint?

In Megatron-DeepSpeed, when the checkpoint gets saved, there's a call to _universal_checkpoint_info(model) that updates state_dict[UNIVERSAL_CHECKPOINT_INFO] here.

state_dict[UNIVERSAL_CHECKPOINT_INFO] = _universal_checkpoint_info(model)

If you're not using Megatron-DeepSpeed, you can try ensuring that the same universal checkpoint metadata that gets stored in the _universal_checkpoint_info() call is present in your checkpoint as well.
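
A hedged sketch of what that could look like when saving from a plain DeepSpeed engine (no Megatron-DeepSpeed); the key name and metadata fields below are assumptions and need to be matched against what the conversion script actually expects:

# Sketch only: the "universal_checkpoint_info" key and its contents are assumptions.
def save_with_universal_info(engine, save_dir, tag):
    universal_info = {
        # e.g. a checkpoint-format version, plus whatever parameter-pattern
        # metadata the converter needs for your model (not shown here).
        "universal_checkpoint_version": 0.2,
    }
    # client_state entries are written into the mp_rank_*_model_states.pt file
    # alongside the module state, mirroring where Megatron-DeepSpeed stores
    # UNIVERSAL_CHECKPOINT_INFO in the snippet above.
    engine.save_checkpoint(save_dir, tag=tag,
                           client_state={"universal_checkpoint_info": universal_info})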

Please share any questions or concerns.

Thanks,
Lev

@zaptrem

zaptrem commented Feb 21, 2024

We're not training a language model. Is that just a fork of DeepSpeed or something specifically for transformer LMs?

@rgtjf

rgtjf commented Mar 14, 2024

Instead of the universal checkpoint, I use the code from @tjruwase to convert a DeepSpeed checkpoint without TP and PP (128 ranks) to another DeepSpeed checkpoint (32 ranks).
https://gist.github.com/rgtjf/aa90fc37efe38ad773046623780a1026

Discussions and comments are welcome.

@tjruwase
Contributor

We're not training a language model. Is that just a fork of DeepSpeed or something specifically for transformer LMs?

@zaptrem, just wanted to check if this is something we can still help with? Thanks!

@tjruwase
Contributor

Instead of the universal checkpoint, I use the code from @tjruwase to convert a DeepSpeed checkpoint without TP and PP (128 ranks) to another DeepSpeed checkpoint (32 ranks). https://gist.github.com/rgtjf/e621f3ac27192cb34a10bea700d9a0c0

Discussions and comments are welcome.

@rgtjf, thanks for sharing your usage of the conversion script. However, our plan is to focus on universal checkpointing, which is general, so that it replaces the conversion script. Are you able to work with us to make universal checkpointing work correctly for your scenario? I am looking at your report here. Thanks!

@rgtjf

rgtjf commented Mar 15, 2024

@tjruwase A big thank you for your quick reply. I'd love to work with you to make universal checkpointing better.

In my testing, I've found that merging the shards in their listed order isn't quite correct; looking forward to more insight.

import glob
import os

import torch


def check_mp_equal_to_fp32(args):
    # Compare the merged fp32 params against the reference mp_rank_00 model states.
    output_folder = "./output"

    mp_sd = torch.load(
        os.path.join(output_folder, "output", "mp_rank_00_model_states.pt"),
        map_location=torch.device("cpu"),
    )
    zero_output_folder = os.path.join(output_folder, "zero")
    tensor_name_paths = sorted(glob.glob(f"{zero_output_folder}/*"))
    for tensor_name_path in tensor_name_paths:
        if "model" not in tensor_name_path:
            continue
        tensor_name = os.path.basename(tensor_name_path)
        # Cast the merged fp32 tensor to the dtype/device of the reference tensor before comparing.
        fp32 = torch.load(os.path.join(tensor_name_path, "fp32.pt"))["param"].to(mp_sd["module"][tensor_name])
        torch.testing.assert_allclose(fp32, mp_sd["module"][tensor_name], msg=f"{tensor_name}, fp32: \n{fp32}, mp_sd: \n{mp_sd['module'][tensor_name]}")

In this example, I found that the correct ordering is neither alphabetical nor numeric.

@xylian86
Contributor

@stas00 ZeRO Stage 3 should now be supported: #5475. As stages 1/2/3 and the BF16 optimizer are all now supported, we can close this issue if everything looks good to you. Thank you all for your hard work on this important feature!

@stas00
Contributor Author

stas00 commented Jun 28, 2024

Thank you for the heads up, @xylian86 - I will let @tjruwase decide when this sub-project is complete.
