Issues with saving model/optimizer and loading them back #285

Closed
ashutoshsaboo opened this issue Mar 18, 2022 · 23 comments · Fixed by #532
@ashutoshsaboo

ashutoshsaboo commented Mar 18, 2022

Hello @sgugger ,

I came across multiple related issues regarding this: #242, #154. They were all closed with this PR: #255, but unfortunately the PR doesn't seem to have much documentation.

Specifically, I'm looking to save a model, its optimizer state, LR scheduler state, random seeds/states, epoch/step count, and other similar state, so that training runs are reproducible and can be resumed correctly.

I know there's very brief documentation here and here, but it looks like there are still a few grey areas regarding its usage that aren't documented yet.
a) The official example here (link) saves with save_pretrained only in the main process. Should I call these new methods only in the main process too (both save and load)? And for load_state, will I have to call prepare() afterwards to set things up for multi-GPU training/inference, or does load_state do all of that internally itself?
b) Does save_state call the model's save_pretrained method internally, or do I have to do both? FWIW, I'm using HF's BERT and other pretrained models from the transformers lib, so if there are any specialized methods for those, please advise. If there's a simple toy example that already uses these new checkpointing methods and you can share it, that would be very helpful!

The last release seems to be way back in Sept 2021 - https://github.com/huggingface/accelerate/releases/tag/v0.5.1 - and the PR is just about a month old. Any plans for a soonish version-bump release of accelerate?

Request: if some more detailed examples could be added to the docs, that would be really awesome and would help clarify some of these specifics for users!

Thanks so much in advance! :)

@sgugger
Collaborator

sgugger commented Mar 18, 2022

Hi there, thanks for reaching out! Saving should only be done in the main process (no need to save several times!) but the loading should be done in all processes so each process has the right state. We'll work on adding more documentation on #255 next week (Zach is on vacation right now :-) )

The save_state method won't call save_pretrained; it just saves the model's state dictionary. So you will need to call save_pretrained manually alongside your checkpoint saves if you want to reload those checkpoints outside of your script. If you only use them within the script, load_state will work perfectly fine; it's only for the final model save that you will need to call save_pretrained.
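
Roughly, for the in-script part, something like this (just a sketch with a placeholder path, assuming the usual accelerator object from a training script):

# Checkpoint during training: stores model weights, optimizer, scheduler
# and RNG states so the run can be resumed from within the script.
accelerator.save_state("checkpoints/step_1000")

# Later (or in a re-run of the script, after re-creating and preparing the
# same objects): restore those states in place.
accelerator.load_state("checkpoints/step_1000")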

As for a release, it's coming today actually!

@ashutoshsaboo
Author

ashutoshsaboo commented Mar 18, 2022

Thanks for the response! What do you mean by "want to reload your checkpoints outside of your script"? If my training abruptly stops in between, and I need to resume it from the last saved checkpoint - would that be possible with load_state (and not calling save_pretrained between checkpoints)? @sgugger

If my use case doesn't involve pushing to the Hub (which save_pretrained can do under the hood), could you elaborate on the specific differences between save_pretrained and save_state with regard to model saving, since save_state might be saving other things too?

The docs for save_pretrained are abstract, as they just mention "saves a model", so I don't know exactly what save_state saves for the model vs save_pretrained, and whether there are any differences (it looks to me like there would be some, since you also recommend towards the end to use save_pretrained for the final model and not save_state).

Save a model and its configuration file to a directory, so that it can be re-loaded using the from_pretrained() class method.

@sgugger
Collaborator

sgugger commented Mar 18, 2022

The difference between save_pretrained and save_state wrt the model is that save_state only saves the model weights, whereas save_pretrained saves the model config as well.

But your model is already instantiated in your script, so you can reload the weights inside it (with load_state); save_pretrained is not necessary for that. However, if you want to use your model outside of your training script, especially with the from_pretrained method, you will need the model config, hence my suggestion to call save_pretrained at the end to save the final model.
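
For that final step, something along these lines (a sketch; the directory name and model class are only examples):

# End of training: write the weights *and* the config so the model can be
# reloaded anywhere with from_pretrained.
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained("final_model", save_function=accelerator.save)

# In a separate inference script, no Accelerate state is needed:
from transformers import AutoModel
model = AutoModel.from_pretrained("final_model")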

@ashutoshsaboo
Author

ashutoshsaboo commented Mar 18, 2022

Thanks, that makes sense to me now!

For "loading should be done in all processes so each process has the right state" -> I was curious conceptually, shouldn't load_state() on all processes vs load_state() on main process only + calling accelerator.prepare() on the loaded model have the same result ideally? If not, why? If yes, does the former have any explicit advantages (other than the naive part of writing reduced boilerplate code)? @sgugger

Small side question: So load_state will load the model across all processes and also allocate/move model to each GPU (in a multi gpu case) automatically without explicitly calling accelerator.prepare(), is that the correct understanding?

@sgugger
Collaborator

sgugger commented Mar 18, 2022

load_state should be called after accelerator.prepare to ensure reproducibility. It shouldn't change anything whether you call it before or after for the model or optimizers, but since accelerator.prepare initializes the generators of the dataloaders, you will see a slight difference there.
save_state saves the RNG state of each process, so load_state needs to be called on each process to be able to restore it.

Note that you can safely call save_state on each process too; the fact that it will only save on the main process is handled by Accelerate behind the scenes.
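
Put together, the ordering looks roughly like this (a sketch; resume_from_checkpoint and checkpoint_dir are placeholder names):

# Every process runs all of this.
model, optimizer, dataloader, scheduler = accelerator.prepare(
    model, optimizer, dataloader, scheduler
)

if resume_from_checkpoint:  # placeholder flag
    # Restores model/optimizer/scheduler states and the per-process RNGs.
    accelerator.load_state(checkpoint_dir)

# ... training loop ...

# Safe to call on every process; only the main process actually writes.
accelerator.save_state(checkpoint_dir)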

@ashutoshsaboo
Author

ashutoshsaboo commented Mar 18, 2022

That makes sense. Let me clarify the side question above a bit. My use case is something like this: I do cross validation with train/val sets, call save_state at the end of each epoch, and get the best-performing model by calling load_state right towards the end. That best-performing model is then run on the test set in the same script to get the final eval metrics. A pretty standard use case, basically.

Where I was coming from with the side question is that, by the time I get the best-performing model with load_state, all my dataloaders (including the test one), the existing model (which might not be the same as the best model, since the best checkpoint could be from an earlier epoch), the optimizers, etc. have already been prepared with the accelerator at the top of the script.

So when I actually call load_state to get the best model, would I have to call accelerator.prepare again specifically on this best model (which != the existing model), or does load_state do the heavy lifting internally of loading these weights in every process and also moving them to the GPUs? If my understanding is right, the first part most likely happens anyway, because load_state is called on every process rather than only the main process as you suggested, but I wasn't sure whether the weights would also end up on the relevant GPUs without calling accelerator.prepare. Apologies, I didn't clarify this entirely earlier; this was the intent behind the side question in my previous message. Could you help answer? @sgugger
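
Roughly, the flow I have in mind (a simplified sketch; the helper functions and best_epoch are placeholders):

# Everything is prepared once at the top of the script, on every process.
model, optimizer, train_dl, val_dl, test_dl = accelerator.prepare(
    model, optimizer, train_dl, val_dl, test_dl
)

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer, train_dl)   # placeholder helper
    validate(model, val_dl)                       # placeholder helper
    accelerator.save_state(f"checkpoints/epoch_{epoch}")

# Towards the end: restore the best epoch's weights into the already-prepared
# model, then evaluate it on the test set.
accelerator.load_state(f"checkpoints/epoch_{best_epoch}")
evaluate(model, test_dl)                          # placeholder helper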

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Jun 2, 2022
@cccntu
Contributor

cccntu commented Jul 3, 2022

Hi @sgugger , I think there is a bug. I can try to make a reproducible example if you think that's necessary. But I want to write it down first.

load_state should be called after accelerator.prepare to ensure reproducibility.

But when save_state saves the model, it calls get_state_dict

weights = [self.get_state_dict(m) for m in self._models]

which unwraps the model first before returning the state_dict:
model = self.unwrap_model(model)
state_dict = model.state_dict()

So when I call accelerator.prepare, then load_state, I get an error:

  File "python3.8/site-packages/accelerate/accelerator.py", line 940, in load_state
    load_accelerator_state(
  File "python3.8/site-packages/accelerate/checkpointing.py", line 136, in load_accelerator_state
    models[i].load_state_dict(torch.load(input_model_file, map_location="cpu"))
  File "python3.8/site-packages/torch/nn/modules/module.py", line 1497, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DeepSpeedEngine:
        Missing key(s) in state_dict: "module.transformer.wte.weight", "module.transformer.wpe.weight", "module.transformer.h.0.ln_1.weight", "module.transformer.h.0.ln_1.bi

I added a line to print out the loaded dict; it looks like this:

['transformer.wte.weight', 'transformer.wpe.weight', 'transformer.h.0.ln_1.weight', 

@sgugger
Collaborator

sgugger commented Jul 4, 2022

cc @muellerzr

@muellerzr muellerzr reopened this Jul 4, 2022
@muellerzr muellerzr self-assigned this Jul 4, 2022
@muellerzr
Collaborator

@cccntu are you saving the state before calling accelerator.prepare and then trying to load it in? A reproducer is needed, yes.

@cccntu
Contributor

cccntu commented Jul 4, 2022

@muellerzr I followed the suggestion above to save after prepare.

Here is my example:

Name the following file 1.py and launch it twice with the commands below:

import os

from accelerate import Accelerator
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, get_scheduler, set_seed


def main():
    dataset = [1, 2, 3]
    dataloader = DataLoader(dataset, batch_size=1)
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    optimizer = AdamW(model.parameters(), lr=0.001)
    accelerator = Accelerator()
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
    # if path exist , load model
    path = "checkpoint"
    if os.path.exists(path):
        print("loading")
        accelerator.load_state(path)
    else:
        print("did not load")
    accelerator.save_state(path)


if __name__ == "__main__":
    main()

echo "compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu
  zero_stage: 1
distributed_type: DEEPSPEED
fp16: true
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 1
mixed_precision: bf16
" > accelerate_config.yaml

accelerate launch --config_file accelerate_config.yaml 1.py
pip freeze | grep -e accelerate -e deepspeed
accelerate==0.10.0
deepspeed==0.6.5

@muellerzr
Collaborator

muellerzr commented Jul 4, 2022

@cccntu I saw no issue with this when I saved and loaded from a 2-GPU state or saved and loaded from a 1-GPU state. Are you modifying the accelerate config between runs?

Or did you save the state in a particular distributed mode and then try loading it without distributed?

My steps:

accelerate config
--- Script answers below ---
(0, 2, 1, no, no, 2, no)
accelerate launch test_script.py
# printed "did not load"
accelerate launch test_script.py
# printed "loading" and ran successfully
# delete our checkpoint originally made
rm -r checkpoint
accelerate config
--- Script answers below ---
(0, no, no, no)
accelerate launch test_script.py
# printed "did not load"
accelerate launch test_script.py
# printed "loading" and ran successfully

@cccntu
Contributor

cccntu commented Jul 5, 2022

@muellerzr Could you try it again, but with DeepSpeed?

accelerate config
--- Script answers below ---
(0, 0, no, yes, *defaults)
accelerate launch test_script.py
# printed "did not load"
accelerate launch test_script.py
# printed "loading" and get errors

Or use my bash script in my last reply.

I just noticed the error I got has this:

RuntimeError: Error(s) in loading state_dict for DeepSpeedEngine:

and the code that unwrap_model calls has this:

def extract_model_from_parallel(model):
    """
    Extract a model from its distributed containers.

    Args:
        model (`torch.nn.Module`): The model to extract.

    Returns:
        `torch.nn.Module`: The extracted model.
    """
    options = (torch.nn.parallel.DistributedDataParallel, torch.nn.DataParallel)
    if is_deepspeed_available():
        options += (DeepSpeedEngine,)
    while isinstance(model, options):
        model = model.module
    return model

The error only happens with DeepSpeed because normally the model is appended to self._models before being wrapped in another class:

elif isinstance(obj, torch.nn.Module) and first_pass:
    self._models.append(obj)
    return self.prepare_model(obj)

But with DeepSpeed, it calls _prepare_deepspeed instead:

if self.distributed_type == DistributedType.DEEPSPEED:
    result = self._prepare_deepspeed(*args)
else:
    result = tuple(self._prepare_one(obj, first_pass=True) for obj in args)
    result = tuple(self._prepare_one(obj) for obj in result)

And with DeepSpeed, it is the engine (the wrapper class around the model) that gets appended to self._models:
self.deepspeed_engine_wrapped = DeepSpeedEngineWrapper(engine)
self._models.append(engine)

@pacman100
Contributor

pacman100 commented Jul 5, 2022

Hello @cccntu ,

A few quick queries:

  1. Are you trying to reload the checkpoint for further training/resuming training or inference? If you want to run inference, then only ZeRO stage 3 is applicable because there are no optimizer states and gradients involved during inference. If you want to resume training, please follow the answer to this issue #418

  2. Does the following section of the DeepSpeed integration documentation on saving and loading models enable you to achieve your end goal: deepspeed ?

@cccntu
Contributor

cccntu commented Jul 5, 2022

Hi @pacman100 ,

  1. I am trying to resume training; that's why I use save_state to also save the optimizer, instead of unwrap_model + save. The answer you linked uses DeepSpeed's load_checkpoint. I think that might work; I need to try it, thanks for the pointer. From the usage I am not sure whether it loads the optimizer and scheduler states, see load_checkpoint nuances microsoft/DeepSpeed#647

  2. The doc says

Saving and loading of models is unchanged for ZeRO Stage-1 and Stage-2.

Does that mean save_state and load_state should work? I am not using stage-3.

@cccntu
Contributor

cccntu commented Jul 6, 2022

Update: I tried DeepSpeed's model.save_checkpoint and model.load_checkpoint. It worked; it can restore optimizer and scheduler states when they are created via DummyOptim and DummyScheduler.

It can also load normal optimizer states, but not LambdaLR schedulers.
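
For reference, roughly what that looks like (a sketch; the directory name and tag are placeholders, and model here is the object returned by accelerator.prepare, i.e. the DeepSpeed engine in my setup):

# Save model, optimizer and scheduler state via the DeepSpeed engine.
model.save_checkpoint("ds_checkpoints", tag="step_1000")

# In a later run, after accelerator.prepare, restore them:
load_path, client_state = model.load_checkpoint("ds_checkpoints", tag="step_1000")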

@cyk1337

cyk1337 commented Sep 7, 2022

Update: I tried DeepSpeed's model.save_checkpoint and model.load_checkpoint. It worked; it can restore optimizer and scheduler states when they are created via DummyOptim and DummyScheduler.

It can also load normal optimizer states, but not LambdaLR schedulers.

I met the same problem. With DeepSpeed, both model.save_checkpoint and accelerator.save_state hang. How can I save the optimizer / lr_scheduler states for resuming training (with DeepSpeed)?

@pacman100
Contributor

@cyk1337, could you provide a minimal reproducible example? The DeepSpeed tests run daily on a GPU-hosted runner, where the checkpoint saving and loading functionality works fine.

https://github.com/huggingface/accelerate/blob/main/tests/deepspeed/test_deepspeed.py#L699

@amarazad

amarazad commented Sep 20, 2023

Hi @pacman100 , I am using FSDP with full sharding. I use the following to save the state so that I can resume from the last state:

accelerator.wait_for_everyone()
if accelerator.is_main_process:
    if config["SAVE_STATE"]:
        accelerator.save_state(save_state_dir)

And to resume:

model = accelerator.prepare(model)
optimizer = accelerator.prepare(optimizer)
...
if config["RESUME_STATE"]:
    accelerator.wait_for_everyone()
    accelerator.load_state(save_state_dir)

However, while saving, it hangs at accelerator.save_state(save_state_dir) and after a long time throws the following error:

INFO:accelerate.accelerator:Saving FSDP model
[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18364, OpType=ALLGATHER, Timeout(ms)=4800000) ran for 4804648 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18364, OpType=ALLGATHER, Timeout(ms)=4800000) ran for 4804768 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18364, OpType=ALLGATHER, Timeout(ms)=4800000) ran for 4804777 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18338, OpType=_ALLGATHER_BASE, Timeout(ms)=4800000) ran for 4807653 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2820777 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2820778 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2820780 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 2820776) of binary: /home/apa/anaconda3/envs/py39_llm/bin/python
Traceback (most recent call last): ..
....

@jiaxilv

jiaxilv commented Mar 18, 2024

I have the same problem as @cyk1337: with DeepSpeed stage 2, model.save_checkpoint and accelerator.save_state block the program from continuing. I am now saving my unet with the following code:

unwrapped_model = accelerator.unwrap_model(unet)
accelerator.save(unwrapped_model.state_dict(), os.path.join(save_path, "model.pth"))

But I found that when I use the following code to save the optimizer parameters, it saves only a certain part of the optimizer parameters, not the complete optimizer parameters:

unwrapped_optimizer = accelerator.unwrap_model(optimizer)
accelerator.save(unwrapped_optimizer.state_dict(), os.path.join(save_path, "optimizer.pth"))

Is there anything that can be done to fix this?

@jhliu17

jhliu17 commented May 25, 2024

Hi @pacman100 , I am using FSDP with full sharding. [...] However, while saving, it hangs at accelerator.save_state(save_state_dir) and after a long time throws the NCCL watchdog timeout errors shown above.

I met the same problem. Do you have any solutions to it? Thank you~

@ParthaEth

If you are doing the following with ZeRO-3 DeepSpeed:

unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(
    self.output_dir,
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
    state_dict=accelerator.get_state_dict(model),
)

then you have to run it from all processes.

@Reginald-L

If you are doing the following with ZeRO-3 DeepSpeed:

unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(
    self.output_dir,
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
    state_dict=accelerator.get_state_dict(model),
)

then you have to run it from all processes.

I got an error: the Flux object does not have a save_pretrained attribute.
The Flux object is a subclass of torch.nn.Module.
