Issues with saving model/optimizer and loading them back #285
Hi there, thanks for reaching out! Saving should only be done in the main process (no need to save several times!), but loading should be done in all processes so each process has the right state. We'll work on adding more documentation to #255 next week (Zach is on vacation right now :-) ). As for a release, it's coming today actually!
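For readers landing here later, a minimal sketch of that pattern (the model name and checkpoint path are placeholders, not from this thread):

```python
from accelerate import Accelerator
from transformers import AutoModelForSequenceClassification

accelerator = Accelerator()
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model = accelerator.prepare(model)

# saving: only the main process writes, so the checkpoint isn't written several times
accelerator.wait_for_everyone()
if accelerator.is_main_process:
    accelerator.unwrap_model(model).save_pretrained("my_checkpoint")

# loading: every process reloads, so each rank ends up with the same weights
accelerator.wait_for_everyone()
reloaded = AutoModelForSequenceClassification.from_pretrained("my_checkpoint")
accelerator.unwrap_model(model).load_state_dict(reloaded.state_dict())
```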
Thanks for the response! What do you mean by "want to reload your checkpoints outside of your script"? If my training stops abruptly midway and I need to resume it from the last saved checkpoint, would that be possible with load_state (and without calling save_pretrained between checkpoints)? @sgugger If my use case doesn't involve pushing to the Hub (which save_pretrained does under the hood), could you elaborate on the specific differences between save_pretrained and save_state with regards to model saving (since save_state may be saving other entities too)? The docs for save_pretrained are abstract, as they just mention "saves a model", so I don't know exactly what save_state saves for the model vs. save_pretrained and whether there are any differences (it looks to me like there would be some, since you also recommend towards the end to use
The difference between save_pretrained and save_state is that save_pretrained is meant for reloading the checkpoint outside of your script. But your model is already instantiated in your script, so you can reload the weights inside it (with load_state).
Thanks, that makes sense to me now! Regarding "loading should be done in all processes so each process has the right state": conceptually, shouldn't calling load_state() on all processes and calling load_state() on the main process only plus accelerator.prepare() on the loaded model ideally have the same result? If not, why? If yes, does the former have any explicit advantages (other than reducing boilerplate code)? @sgugger Small side question: so load_state will load the model across all processes and also allocate/move the model to each GPU (in a multi-GPU case) automatically, without explicitly calling accelerator.prepare() - is that the correct understanding?
Note that you can safely call load_state in all processes.
That makes sense. Let me clarify the side question above a bit. My use case is roughly this: I do some cross-validation with train/val sets, call save_state at the end of each epoch, and get the best-performing model by calling load_state towards the end. This best-performing model is then evaluated on the test set in the same script to get the final eval metrics. A pretty standard use case, basically.

Where I was coming from with the side question: by the time I call load_state to get the best model, all my data loaders (including the test-set one), the existing model (which might not be the best model, since the best model could come from an earlier epoch), optimizers, etc. have already been prepared with the accelerator at the top of the script. So when I actually call load_state to get the best model, do I have to call accelerator.prepare specifically on this best model again (which != the existing model), or does load_state do the heavy lifting internally of distributing the weights to all processes and moving them to the GPUs? (If my understanding is right, the former would mostly happen anyway, because load_state is called in each process rather than only in the main process as you suggested, but I wasn't sure whether the weights would also be moved to the relevant GPUs without calling accelerator.prepare.) Apologies for not clarifying this fully earlier, but this was the intent behind the side question in my previous message. Could you help answer? @sgugger
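A compressed sketch of that workflow, assuming everything is prepared once at the top and each epoch's checkpoint is kept (the model, paths, and validation metric are placeholders):

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(32, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer = accelerator.prepare(model, optimizer)

best_metric, best_dir = float("inf"), None
for epoch in range(3):
    # ... train for one epoch, then compute a validation metric ...
    val_loss = 1.0 / (epoch + 1)  # placeholder metric
    ckpt_dir = f"checkpoint_epoch_{epoch}"
    accelerator.save_state(ckpt_dir)  # model + optimizer + RNG states for this epoch
    if val_loss < best_metric:
        best_metric, best_dir = val_loss, ckpt_dir

# restore the best epoch's weights into the already-prepared model;
# call this on all processes, no second accelerator.prepare() is needed
accelerator.load_state(best_dir)
# ... evaluate `model` on the test set ...
```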
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi @sgugger, I think there is a bug. I can make a reproducible example if you think that's necessary, but I want to write it down first.
But when save_state saves the model, it calls the code at accelerate/src/accelerate/accelerator.py, line 925 at 6ebddcd,
which unwraps the model first before returning the state_dict (accelerate/src/accelerate/accelerator.py, lines 1020-1021 at 6ebddcd).
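For readers without the permalink handy, the gist of the referenced lines is roughly the following (a simplified paraphrase, not the actual accelerate source):

```python
import torch.nn as nn

def get_state_dict_like(model: nn.Module) -> dict:
    # simplified: strip a known parallel wrapper before grabbing the state dict,
    # so the saved keys are those of the *unwrapped* model
    if isinstance(model, nn.parallel.DistributedDataParallel):
        model = model.module
    return model.state_dict()

print(list(get_state_dict_like(nn.Linear(2, 2)).keys()))  # ['weight', 'bias']
```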
So when I call load_state, the saved (unwrapped) state dict no longer matches the wrapped model.
I added a line to print out the loaded dict, and it looks like this:
cc @muellerzr
@cccntu are we saving the state before calling accelerator.prepare()?
@muellerzr I followed the suggestion above to save after accelerator.prepare(). Here is my example: name this file 1.py and run the script below twice.
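The original 1.py did not survive the copy into this thread; a hypothetical reconstruction of the kind of script being described (toy model and checkpoint path assumed) could look like this - run it twice, so the first run saves a checkpoint and the second loads it:

```python
import os
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
model, optimizer = accelerator.prepare(model, optimizer)

ckpt_dir = "checkpoint"
if os.path.isdir(ckpt_dir):
    accelerator.print("loading")
    accelerator.load_state(ckpt_dir)
else:
    accelerator.print("did not load")

# one dummy step so the optimizer has state worth saving
x = torch.randn(4, 8, device=accelerator.device)
loss = model(x).sum()
accelerator.backward(loss)
optimizer.step()
optimizer.zero_grad()

accelerator.save_state(ckpt_dir)
```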
@cccntu I saw no issue with this when I saved and loaded from a 2-GPU state, or saved and loaded from a 1-GPU state. Are you modifying the accelerate config in between runs? Or did you save the state in a particular distributed mode and then try loading it without distributed? My steps:

accelerate config
--- Script answers below ---
(0, 2, 1, no, no, 2, no)
accelerate launch test_script.py
# printed "did not load"
accelerate launch test_script.py
# printed "loading" and ran successfully

# delete the checkpoint originally made
rm -r checkpoint

accelerate config
--- Script answers below ---
(0, no, no, no)
accelerate launch test_script.py
# printed "did not load"
accelerate launch test_script.py
# printed "loading" and ran successfully
@muellerzr Could you try it again, but with DeepSpeed?
Or use my bash script in my last reply. I just noticed the error I got has this:
and the code involved is at accelerate/src/accelerate/utils/other.py, lines 35-51 at 6ebddcd.
The error only happens with DeepSpeed because normally the model is appended to self._models before it gets wrapped (accelerate/src/accelerate/accelerator.py, lines 381-383 at 6ebddcd).
But with DeepSpeed, prepare goes through a separate code path (accelerate/src/accelerate/accelerator.py, lines 484-488 at 6ebddcd),
and the model is appended to self._models after wrapping, i.e. self._models holds the wrapper class rather than the plain model (accelerate/src/accelerate/accelerator.py, lines 691-692 at 6ebddcd).
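To make the key mismatch concrete, here is a toy illustration with a stand-in wrapper (not the DeepSpeed engine itself): the unwrapped keys that get saved no longer line up with the keys the stored wrapper produces at load time.

```python
import torch.nn as nn

class Wrapper(nn.Module):
    """Toy stand-in for a wrapper such as the engine object stored in self._models."""
    def __init__(self, module: nn.Module):
        super().__init__()
        self.module = module

inner = nn.Linear(2, 2)
wrapped = Wrapper(inner)
print(list(inner.state_dict().keys()))    # ['weight', 'bias']               <- what gets saved
print(list(wrapped.state_dict().keys()))  # ['module.weight', 'module.bias'] <- what load expects
```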
Hello @cccntu, a few quick queries:
Hi @pacman100,
Does that mean
Update: I tried DeepSpeed. It can also load normal optimizer states, but not
I met the same problem with DeepSpeed stage 2.
@cyk1337, could you share a minimal reproducible example? The DeepSpeed tests run daily on a GPU-hosted runner, and the checkpoint saving and loading functionality is working fine there: https://github.com/huggingface/accelerate/blob/main/tests/deepspeed/test_deepspeed.py#L699
Hi @pacman100, I am using FSDP with full sharding. I use the following to save the state so that I can resume from the last state:
And to resume:
However, while saving, it hangs at
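The snippets above were not captured when this comment was copied; a typical save/resume pattern with Accelerate's own checkpointing looks roughly like the sketch below (directory name is a placeholder). One common cause of hangs with FSDP full sharding, stated here as an assumption about this setup, is that the collective save is reached by only some of the ranks, so every process should call save_state.

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # launched via `accelerate launch` with an FSDP config
model = torch.nn.Linear(64, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer = accelerator.prepare(model, optimizer)

# save the last state: call on ALL processes so FSDP can gather/write its shards
accelerator.wait_for_everyone()
accelerator.save_state("last_checkpoint")

# resume: also call on all processes before continuing training
accelerator.load_state("last_checkpoint")
```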
I have the same problem as @cyk1337 with DeepSpeed stage 2.
But I found that when I use the following code to save the optimizer parameters, it saves only a certain part of the optimizer parameters, not the complete optimizer parameters:
Is there anything that can be done to fix this?
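For context, one likely explanation (an assumption, since the snippet itself wasn't captured above): ZeRO stage 2 partitions the optimizer state across ranks, so saving optimizer.state_dict() from a single process only captures that rank's partition. One way to sidestep this is to let Accelerate/DeepSpeed write the checkpoint, roughly:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # launched via `accelerate launch` with a ZeRO-2 DeepSpeed config
model = torch.nn.Linear(16, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer = accelerator.prepare(model, optimizer)

# ... training steps ...

# every rank participates; the partitioned optimizer state is handled for you
accelerator.save_state("ds_checkpoint")
# later, to resume:
accelerator.load_state("ds_checkpoint")
```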
I met the same problem. Do you have any solutions for it? Thank you~
If you are doing the following with ZeRO-3 DeepSpeed,
then you have to run it from all processes.
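The snippet referenced above wasn't captured, but the pattern usually under discussion looks roughly like the sketch below. accelerator.get_state_dict(model) gathers the ZeRO-3 partitioned weights, which is a collective operation, so the whole block has to run on every process even though only the main process ends up writing files.

```python
from accelerate import Accelerator
from transformers import AutoModelForSequenceClassification

accelerator = Accelerator()  # launched with a ZeRO-3 DeepSpeed config
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model = accelerator.prepare(model)

# run on ALL processes: get_state_dict gathers the ZeRO-3 shards collectively
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(
    "zero3_checkpoint",
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
    state_dict=accelerator.get_state_dict(model),
)
```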
I got an error: the Flux object does not have a save_pretrained attribute.
Hello @sgugger,
Came across multiple related issues regarding this - #242, #154. They were all closed with PR #255, but unfortunately that PR doesn't seem to have much documentation.
I was specifically looking to save a model, its optimizer state, LR scheduler state, its random seeds/states, epoch/step count, and other similar state, so that training runs are reproducible and can be resumed correctly.
I know there's a very brief doc here and here, but it looks like there are still a few grey areas regarding its usage that aren't documented yet.

a) My question is specifically this: like the official example here, which saves with save_pretrained only in the main process, should I be calling these only in the main process too (both save and load)? And in the case of load_state, do I have to call prepare() afterwards to set things up for multi-GPU training/inference, or does load_state do all of that internally?

b) Does the save_state method call save_pretrained for the model internally, or do I have to do both? FWIW, I'm using HF's BERT and other pretrained models from the transformers lib, so if there are other specialized methods specifically for those, please advise on that as well. If there's any simple toy example that already uses these new checkpointing methods and you could share it, that would be really helpful!

The last release seems to be way back in Sept 2021 - https://github.com/huggingface/accelerate/releases/tag/v0.5.1 - and the PR is just about a month old. Any plans for a soonish version-bump release of accelerate?

Request: if some more detailed examples could be added to the docs, that would be really awesome and would help clarify some of these specifics for users!
Thanks so much in advance! :)