[BUG] When initializing model_engine, if an mpu is specified, it can lead to an excessively large checkpoint size, and the checkpoint may not be convertible through the zero_to_fp32.py script.
#5514
Open
Kwen-Chen opened this issue on May 9, 2024 · 0 comments
Describe the bug
When initializing model_engine, if an mpu (model parallelism unit) is specified, it can lead to an excessively large checkpoint size, and the checkpoint may not be convertible through the zero_to_fp32.py script.
The issue can be reproduced simply by initializing an mpu, passing it to deepspeed.initialize() when creating model_engine, and then saving a checkpoint directly, without performing any other operations.
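The reporter's train_pt.py is not reproduced in this issue text; the sketch below is a hypothetical minimal version, assuming a Megatron-style mpu in which all four ranks form a single model-parallel (sequence-parallel) group. SimpleMPU, the paths, and the checkpoint tag are illustrative, not the reporter's exact code:

```python
import os

import deepspeed
import torch.distributed as dist
from transformers import AutoModelForCausalLM


class SimpleMPU:
    """Hypothetical Megatron-style model-parallelism unit (mpu).

    Exposes the get_{model,data}_parallel_{group,rank,world_size}
    interface that deepspeed.initialize() expects. All ranks are placed
    in one model-parallel group (mirroring sequence_parallel_size=4),
    so each rank's data-parallel group has size 1.
    """

    def __init__(self):
        world_size = dist.get_world_size()
        rank = dist.get_rank()
        # A single model-parallel group spanning every rank.
        self.mp_group = dist.new_group(ranks=list(range(world_size)))
        # One single-rank data-parallel group per rank. new_group() is a
        # collective call, so every rank must take part in every call.
        for i in range(world_size):
            group = dist.new_group(ranks=[i])
            if i == rank:
                self.dp_group = group

    def get_model_parallel_group(self):
        return self.mp_group

    def get_model_parallel_rank(self):
        return dist.get_rank(group=self.mp_group)

    def get_model_parallel_world_size(self):
        return dist.get_world_size(group=self.mp_group)

    def get_data_parallel_group(self):
        return self.dp_group

    def get_data_parallel_rank(self):
        return dist.get_rank(group=self.dp_group)

    def get_data_parallel_world_size(self):
        return dist.get_world_size(group=self.dp_group)


if __name__ == "__main__":
    deepspeed.init_distributed()
    model = AutoModelForCausalLM.from_pretrained(
        os.path.expanduser("~/work/Llama-2-7b-hf"))
    model_engine, _, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config="config/deepspeed.json",
        mpu=SimpleMPU(),  # dropping this argument restores the normal size
    )
    # Save immediately; no forward/backward pass is needed to reproduce.
    model_engine.save_checkpoint(".", tag="checkpoint-0")
```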
Using this setup, the model's checkpoints are saved as very large files. [Screenshot in the original issue: listing of the saved checkpoint files for Llama-2-7B.] One can observe that the saved size is exactly four times the normal size, i.e. a factor of sequence_parallel_size (4 in this run).
When I run `python zero_to_fp32.py . model.bin`, the following error occurs:
```
[2024-05-08 14:16:02,836] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Processing zero checkpoint './checkpoint-0'
Detected checkpoint of type zero stage 3, world_size: 4
Parsing checkpoint created by deepspeed==0.14.0
Traceback (most recent call last):
  File "/u01/chenkun/work/ring_dp_train/checkpoints/zero_to_fp32.py", line 601, in <module>
    convert_zero_checkpoint_to_fp32_state_dict(args.checkpoint_dir,
  File "/u01/chenkun/work/ring_dp_train/checkpoints/zero_to_fp32.py", line 536, in convert_zero_checkpoint_to_fp32_state_dict
    state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag, exclude_frozen_parameters)
  File "/u01/chenkun/work/ring_dp_train/checkpoints/zero_to_fp32.py", line 521, in get_fp32_state_dict_from_zero_checkpoint
    return _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters)
  File "/u01/chenkun/work/ring_dp_train/checkpoints/zero_to_fp32.py", line 217, in _get_fp32_state_dict_from_zero_checkpoint
    return _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states,
  File "/u01/chenkun/work/ring_dp_train/checkpoints/zero_to_fp32.py", line 464, in _get_fp32_state_dict_from_zero3_checkpoint
    _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)
  File "/u01/chenkun/work/ring_dp_train/checkpoints/zero_to_fp32.py", line 446, in _zero3_merge_trainable_params
    raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")
ValueError: consumed 6738415616 numels out of 26953662464 - something is wrong
```
Note that 6738415616 is exactly the parameter count of Llama-2-7B, while 26953662464 = 4 × 6738415616, again the sequence_parallel_size factor. Although this checkpoint cannot be converted to model.bin, it can still be loaded by model_engine.load_checkpoint().
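For reference, the load that does succeed is the ordinary engine API (a sketch, reusing the save directory and tag from the reproduction above):

```python
# The oversized ZeRO-3 checkpoint loads back into the engine fine,
# even though zero_to_fp32.py cannot consolidate it into model.bin.
load_path, client_state = model_engine.load_checkpoint(".", tag="checkpoint-0")
assert load_path is not None  # None would mean the checkpoint failed to load
```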
Expected behavior
The checkpoint is saved at the normal size and can be converted by zero_to_fp32.py.
ds_report output
System info (please complete the following information):
To Reproduce
Steps to reproduce the behavior:
```
deepspeed --include localhost:0,1,2,3 \
    --master_port=25640 \
    train_pt.py \
    --model ~/work/Llama-2-7b-hf \
    --deepspeed --deepspeed_config config/deepspeed.json
```
Launcher context
```
deepspeed --include localhost:0,1,2,3 \
    --master_port=25640 \
    train_pt.py \
    --model ~/work/Llama-2-7b-hf \
    --deepspeed --deepspeed_config config/deepspeed.json
```
Additional context
The DeepSpeed config is as follows:
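The config file itself is not included above; a minimal ZeRO stage 3 config consistent with the "zero stage 3" line in the log (all values here are placeholders, not the reporter's actual settings) would look like:

```json
{
  "train_batch_size": 4,
  "bf16": { "enabled": true },
  "optimizer": {
    "type": "AdamW",
    "params": { "lr": 1e-5 }
  },
  "zero_optimization": { "stage": 3 }
}
```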