Question: Can't Load ZeRO3 model with Engine.load_checkpoint() #1394
Comments
@skpig, sorry for the late response. Is this still an issue? Thanks. |
Yes, the issue still exists with the most recent commit (30965ea) |
@skpig, thanks for confirming. As you have probably noticed, a number of related PRs are in the process of being merged by early next week. We can then evaluate whether the issue still remains. |
@skpig, we have just merged a bunch of checkpoint PRs. Can you please check again? Thanks. |
@skpig, please try to follow the instructions here to set up HF transformers to do the right thing during training if you're not using the HF Trainer: |
@tjruwase The issue still exists with the current commit (d8e9ef6). And thanks for your reminder, @stas00. My test code:
Traceback
|
means that ZeRO3 hasn't gathered the param from the GPUs before using it. Thank you for trying that. Does the problem go away if you don't use param groups, i.e. just pass the args to the optimizer normally? |
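For context on what "param groups" refers to here: instead of passing `model.parameters()` straight to the optimizer, HF-style training code often splits parameters into a weight-decay group and a no-decay group. The sketch below is illustrative only; `split_decay_groups` is a hypothetical helper, not part of DeepSpeed or transformers.

```python
def split_decay_groups(named_params, no_decay=("bias", "LayerNorm.weight")):
    """Partition (name, param) pairs into decay / no-decay optimizer groups."""
    decay, skip = [], []
    for name, param in named_params:
        # Params whose name matches a no_decay pattern get weight_decay 0.0
        (skip if any(nd in name for nd in no_decay) else decay).append(param)
    return [
        {"params": decay, "weight_decay": 0.01},
        {"params": skip, "weight_decay": 0.0},
    ]
```

With a torch optimizer this would be used as, e.g., `torch.optim.AdamW(split_decay_groups(model.named_parameters()))`; tjruwase's suggestion amounts to dropping the grouping and passing `model.parameters()` directly instead.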
No, it doesn't. The traceback is the same as before with |
OK, I was able to reproduce the issue with the help of your script after adding some missing bits, and I will have a look now. Will keep you posted. The full code to reproduce the issue is below (the config is from #1394 (comment))
and run with just:
I have only 2 GPUs, and I'm using a tiny model to speed up the debugging. |
OK, I figured it out. It appears that the Deepspeed engine wasn't designed to do save/load on the same engine. It was designed to save, save, save in process A and then do the load in a new process when it restarts. So it tries to load into a model that's already partitioned/used, and it fails because the saved model is correct, with normal shapes, but during load it tries to load it into fake ds_params which are of placeholder shapes. Here is a working workaround:
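The workaround code itself is not reproduced in this thread; a minimal sketch of the shape stas00 describes would be to load into a freshly constructed engine rather than the one that just saved. `make_engine` below stands in for your own `deepspeed.initialize(...)` call; nothing in this helper is DeepSpeed API.

```python
def save_then_load_fresh(make_engine, ckpt_dir, tag):
    """Round-trip a checkpoint through a freshly constructed engine."""
    engine = make_engine()
    engine.save_checkpoint(ckpt_dir, tag)  # ZeRO-3 writes per-rank shards
    del engine                             # discard the already-partitioned engine
    fresh = make_engine()                  # rebuild from scratch, as a restart would
    fresh.load_checkpoint(ckpt_dir, tag)   # now load into the fresh engine
    return fresh
```

This mirrors the save-in-process-A / load-in-process-B lifecycle the engine was designed around, without actually restarting the process.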
The culprit is that
In theory, this should have worked as a workaround:
but it doesn't. You can see that the model's weights are gathered correctly, but I suspect that it is not the partitioning that is the issue but something else: first, because my 2nd workaround didn't work, and second, because under zero.Init
I will let @tjruwase comment on why this is so, as I haven't designed this engine. (Sidenote: since HF integrated Deepspeed, the Deepspeed team has been repeatedly surprised that their awesome tool has been put to use in dozens of ways they hadn't originally envisioned. So this attempt at expanding adoption is a blessing and a curse at the same time.) |
Sorry for the late response. I guess it is not an important feature. But maybe the documentation needs to highlight the issue and indicate the right way to use |
I totally agree, @skpig, it indeed should be documented. If you'd like, you could make a PR adding a note explaining this limitation somewhere in the docstring here: |
I used Engine.save_checkpoint to save my ZeRO3 model_engine, but when I load it with Engine.load_checkpoint(), I encounter the runtime error below. I'm using deepspeed ZeRO3 to train my bart (implemented by Huggingface's transformers) with 4 GPUs (deepspeed --num_gpus=4 train.py --deepspeed --deepspeed_config config/ds_config.json). Here is my code (to simplify the question, I skip all the training code and only test the load & save functions):
And here is my ds_config.json:
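The actual config file isn't reproduced in this thread; a minimal ZeRO stage-3 config of the kind the deepspeed launcher accepts might look like the following (all values illustrative, not taken from the issue):

```json
{
  "train_batch_size": 8,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```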
I'm new to deepspeed and not familiar with every detail of ZeRO3. Please help me solve my problem. Thanks a lot!!!