Issues with saving model/optimizer and loading them back #285

Closed
ashutoshsaboo opened this issue Mar 18, 2022 · 23 comments · Fixed by #532
@ashutoshsaboo

ashutoshsaboo commented Mar 18, 2022

Hello @sgugger ,

I came across multiple related issues regarding this: #242, #154. They were all closed with this PR: #255, but unfortunately the PR doesn't seem to have much documentation.

Specifically, I'm looking to save a model, its optimizer state, LR scheduler state, random seeds/states, epoch/step count, and other similar state, so that training runs are reproducible and can be resumed correctly.

I know there's very brief documentation here and here, but it looks like there are still a few grey areas regarding its usage that aren't documented yet.
a) The official example here (link) saves with save_pretrained only in the main process. Should I call these new methods only in the main process too (both save and load)? And for load_state, will I have to call prepare() afterwards to set things up for multi-GPU training/inference, or does load_state do all of that internally itself?
b) Does save_state call the model's save_pretrained method internally, or do I have to do both? FWIW, I'm using HF's BERT and other pretrained models from the transformers lib, so if there are any specialized methods for those, please advise. If there's a simple toy example that already uses these new checkpointing methods and you can share it, that would be very helpful!

The last release seems to be way back in Sept 2021 - https://github.com/huggingface/accelerate/releases/tag/v0.5.1 - and the PR is just about a month old. Any plans for a soonish version-bump release of accelerate?

Request: if some more detailed examples could be added to the docs, that would be really awesome and would help clarify some of these specifics for users!

Thanks so much in advance! :)

@sgugger
Collaborator

sgugger commented Mar 18, 2022

Hi there, thanks for reaching out! Saving should only be done in the main process (no need to save several times!) but the loading should be done in all processes so each process has the right state. We'll work on adding more documentation on #255 next week (Zach is on vacation right now :-) )

The save_state method won't call save_pretrained; it just saves the model's state dictionary. So you will need to call save_pretrained manually alongside your checkpoint saves if you want to reload those checkpoints outside of your script. If you only use them within the script, load_state will work perfectly fine; it's only for the final model save that you will need to call save_pretrained.
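
Roughly, for the in-script part, something like this (just a sketch with a placeholder path, assuming the usual accelerator object from a training script):

# Checkpoint during training: stores model weights, optimizer, scheduler
# and RNG states so the run can be resumed from within the script.
accelerator.save_state("checkpoints/step_1000")

# Later (or in a re-run of the script, after re-creating and preparing the
# same objects): restore those states in place.
accelerator.load_state("checkpoints/step_1000")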

As for a release, it's coming today actually!

@ashutoshsaboo
Author

ashutoshsaboo commented Mar 18, 2022

Thanks for the response! What do you mean by "want to reload your checkpoints outside of your script"? If my training abruptly stops in between, and I need to resume it from the last saved checkpoint - would that be possible with load_state (and not calling save_pretrained between checkpoints)? @sgugger

If my use case doesn't involve pushing to the Hub (which save_pretrained can do under the hood), could you elaborate on the specific differences between save_pretrained and save_state with regard to model saving, since save_state might be saving other things too?

The docs for save_pretrained are abstract, as they just mention "saves a model", so I don't know exactly what save_state saves for the model vs save_pretrained, and whether there are any differences (it looks to me like there would be some, since you also recommend towards the end to use save_pretrained for the final model and not save_state).

Save a model and its configuration file to a directory, so that it can be re-loaded using the from_pretrained() class method.

@sgugger
Collaborator

sgugger commented Mar 18, 2022

The difference between save_pretrained and save_state wrt the model is that save_state only saves the model weights, whereas save_pretrained saves the model config as well.

But your model is already instantiated in your script, so you can reload the weights inside it (with load_state); save_pretrained is not necessary for that. However, if you want to use your model outside of your training script, especially with the from_pretrained method, you will need the model config, hence my suggestion to call save_pretrained at the end to save the final model.
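
For that final step, something along these lines (a sketch; the directory name and model class are only examples):

# End of training: write the weights *and* the config so the model can be
# reloaded anywhere with from_pretrained.
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained("final_model", save_function=accelerator.save)

# In a separate inference script, no Accelerate state is needed:
from transformers import AutoModel
model = AutoModel.from_pretrained("final_model")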

@ashutoshsaboo
Author

ashutoshsaboo commented Mar 18, 2022

Thanks, that makes sense to me now!

For "loading should be done in all processes so each process has the right state" -> I was curious conceptually, shouldn't load_state() on all processes vs load_state() on main process only + calling accelerator.prepare() on the loaded model have the same result ideally? If not, why? If yes, does the former have any explicit advantages (other than the naive part of writing reduced boilerplate code)? @sgugger

Small side question: So load_state will load the model across all processes and also allocate/move model to each GPU (in a multi gpu case) automatically without explicitly calling accelerator.prepare(), is that the correct understanding?

@sgugger
Collaborator

sgugger commented Mar 18, 2022

load_state should be called after accelerator.prepare to ensure reproducibility. It shouldn't change anything whether you call it before or after for the model or optimizers, but since accelerator.prepare initializes the generators of the dataloaders, you will see a slight difference there.
save_state saves the RNG state of each process, so load_state needs to be called on each process to be able to restore it.

Note that you can safely call save_state on each process too; the fact that it will only save on the main process is handled by Accelerate behind the scenes.
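
Put together, the ordering looks roughly like this (a sketch; resume_from_checkpoint and checkpoint_dir are placeholder names):

# Every process runs all of this.
model, optimizer, dataloader, scheduler = accelerator.prepare(
    model, optimizer, dataloader, scheduler
)

if resume_from_checkpoint:  # placeholder flag
    # Restores model/optimizer/scheduler states and the per-process RNGs.
    accelerator.load_state(checkpoint_dir)

# ... training loop ...

# Safe to call on every process; only the main process actually writes.
accelerator.save_state(checkpoint_dir)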

@ashutoshsaboo
Author

ashutoshsaboo commented Mar 18, 2022

That makes sense. Let me clarify the side question above a bit. My use case is something like this: I do cross validation with train/val sets, call save_state at the end of each epoch, and get the best-performing model by calling load_state right towards the end. That best-performing model is then run on the test set in the same script to get the final eval metrics. A pretty standard use case, basically.

Where I was coming from with the side question is that, by the time I get the best-performing model with load_state, all my dataloaders (including the test one), the existing model (which might not be the same as the best model, since the best checkpoint could be from an earlier epoch), the optimizers, etc. have already been prepared with the accelerator at the top of the script.

So when I actually call load_state to get the best model, would I have to call accelerator.prepare again specifically on this best model (which != the existing model), or does load_state do the heavy lifting internally of loading these weights in every process and also moving them to the GPUs? If my understanding is right, the first part most likely happens anyway, because load_state is called on every process rather than only the main process as you suggested, but I wasn't sure whether the weights would also end up on the relevant GPUs without calling accelerator.prepare. Apologies, I didn't clarify this entirely earlier; this was the intent behind the side question in my previous message. Could you help answer? @sgugger
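
Roughly, the flow I have in mind (a simplified sketch; the helper functions and best_epoch are placeholders):

# Everything is prepared once at the top of the script, on every process.
model, optimizer, train_dl, val_dl, test_dl = accelerator.prepare(
    model, optimizer, train_dl, val_dl, test_dl
)

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer, train_dl)   # placeholder helper
    validate(model, val_dl)                       # placeholder helper
    accelerator.save_state(f"checkpoints/epoch_{epoch}")

# Towards the end: restore the best epoch's weights into the already-prepared
# model, then evaluate it on the test set.
accelerator.load_state(f"checkpoints/epoch_{best_epoch}")
evaluate(model, test_dl)                          # placeholder helper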

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Jun 2, 2022
@cccntu
Contributor

cccntu commented Jul 3, 2022

Hi @sgugger , I think there is a bug. I can try to make a reproducible example if you think that's necessary. But I want to write it down first.

load_state should be called after accelerator.prepare to ensure reproducibility.

But when save_state saves the model, it calls get_state_dict

weights = [self.get_state_dict(m) for m in self._models]

which unwraps the model first before returning the state_dict:
model = self.unwrap_model(model)
state_dict = model.state_dict()

So when I call accelerator.prepare, then load_state, I get an error:

  File "python3.8/site-packages/accelerate/accelerator.py", line 940, in load_state
    load_accelerator_state(
  File "python3.8/site-packages/accelerate/checkpointing.py", line 136, in load_accelerator_state
    models[i].load_state_dict(torch.load(input_model_file, map_location="cpu"))
  File "python3.8/site-packages/torch/nn/modules/module.py", line 1497, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DeepSpeedEngine:
        Missing key(s) in state_dict: "module.transformer.wte.weight", "module.transformer.wpe.weight", "module.transformer.h.0.ln_1.weight", "module.transformer.h.0.ln_1.bi

I added a line to print out the loaded dict; it looks like this:

['transformer.wte.weight', 'transformer.wpe.weight', 'transformer.h.0.ln_1.weight', 

@sgugger
Collaborator

sgugger commented Jul 4, 2022

cc @muellerzr

@muellerzr muellerzr reopened this Jul 4, 2022
@muellerzr muellerzr self-assigned this Jul 4, 2022
@muellerzr
Collaborator

@cccntu are you saving the state before calling accelerator.prepare and then trying to load it in? A reproducer is needed, yes.

@cccntu
Contributor

cccntu commented Jul 4, 2022

@muellerzr I followed the suggestion above to save after prepare.

Here is my example:

Name the following file 1.py and launch it twice with the commands below:

import os

from accelerate import Accelerator
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, get_scheduler, set_seed


def main():
    dataset = [1, 2, 3]
    dataloader = DataLoader(dataset, batch_size=1)
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    optimizer = AdamW(model.parameters(), lr=0.001)
    accelerator = Accelerator()
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
    # if path exist , load model
    path = "checkpoint"
    if os.path.exists(path):
        print("loading")
        accelerator.load_state(path)
    else:
        print("did not load")
    accelerator.save_state(path)


if __name__ == "__main__":
    main()

echo "compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu
  zero_stage: 1
distributed_type: DEEPSPEED
fp16: true
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 1
mixed_precision: bf16
" > accelerate_config.yaml

accelerate launch --config_file accelerate_config.yaml 1.py
pip freeze | grep -e accelerate -e deepspeed
accelerate==0.10.0
deepspeed==0.6.5

@muellerzr
Collaborator

muellerzr commented Jul 4, 2022

@cccntu I saw no issue with this when I saved and loaded from a 2-GPU state or saved and loaded from a 1-GPU state. Are you modifying the accelerate config between runs?

Or did you save the state in a particular distributed mode and then try loading it without distributed?

My steps:

accelerate config
--- Script answers below ---
(0, 2, 1, no, no, 2, no)
accelerate launch test_script.py
# printed "did not load"
accelerate launch test_script.py
# printed "loading" and ran successfully
# delete our checkpoint originally made
rm -r checkpoint
accelerate config
--- Script answers below ---
(0, no, no, no)
accelerate launch test_script.py
# printed "did not load"
accelerate launch test_script.py
# printed "loading" and ran successfully

@cccntu
Contributor

cccntu commented Jul 5, 2022

@muellerzr Could you try it again, but with DeepSpeed?

accelerate config
--- Script answers below ---
(0, 0, no, yes, *defaults)
accelerate launch test_script.py
# printed "did not load"
accelerate launch test_script.py
# printed "loading" and get errors

Or use my bash script in my last reply.

I just noticed the error I got has this:

RuntimeError: Error(s) in loading state_dict for DeepSpeedEngine:

and the code that unwrap_model calls has this:

def extract_model_from_parallel(model):
    """
    Extract a model from its distributed containers.

    Args:
        model (`torch.nn.Module`): The model to extract.

    Returns:
        `torch.nn.Module`: The extracted model.
    """
    options = (torch.nn.parallel.DistributedDataParallel, torch.nn.DataParallel)
    if is_deepspeed_available():
        options += (DeepSpeedEngine,)
    while isinstance(model, options):
        model = model.module
    return model

The error only happens with DeepSpeed because normally the model is appended to self._models before being wrapped in another class:

elif isinstance(obj, torch.nn.Module) and first_pass:
    self._models.append(obj)
    return self.prepare_model(obj)

But with DeepSpeed, it calls _prepare_deepspeed instead:

if self.distributed_type == DistributedType.DEEPSPEED:
    result = self._prepare_deepspeed(*args)
else:
    result = tuple(self._prepare_one(obj, first_pass=True) for obj in args)
    result = tuple(self._prepare_one(obj) for obj in result)

And with DeepSpeed, it is the engine (the wrapper class around the model) that gets appended to self._models:
self.deepspeed_engine_wrapped = DeepSpeedEngineWrapper(engine)
self._models.append(engine)

@pacman100
Contributor

pacman100 commented Jul 5, 2022

Hello @cccntu ,

A few quick queries:

  1. Are you trying to reload the checkpoint for further training/resuming training or inference? If you want to run inference, then only ZeRO stage 3 is applicable because there are no optimizer states and gradients involved during inference. If you want to resume training, please follow the answer to this issue #418

  2. Does the following section of the DeepSpeed integration documentation on saving and loading models enable you to achieve your end goal: deepspeed ?

@cccntu
Contributor

cccntu commented Jul 5, 2022

Hi @pacman100 ,

  1. I am trying to resume training; that's why I use save_state to also save the optimizer, instead of unwrap_model + save. The answer you linked uses DeepSpeed's load_checkpoint. I think that might work; I need to try it, thanks for the pointer. From the usage I am not sure whether it loads the optimizer and scheduler states, see load_checkpoint nuances microsoft/DeepSpeed#647

  2. The doc says

Saving and loading of models is unchanged for ZeRO Stage-1 and Stage-2.

Does that mean save_state and load_state should work? I am not using stage-3.

@cccntu
Contributor

cccntu commented Jul 6, 2022

Update: I tried DeepSpeed's model.save_checkpoint and model.load_checkpoint. It worked; it can restore optimizer and scheduler states when they are created via DummyOptim and DummyScheduler.

It can also load normal optimizer states, but not LambdaLR schedulers.
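
For reference, roughly what that looks like (a sketch; the directory name and tag are placeholders, and model here is the object returned by accelerator.prepare, i.e. the DeepSpeed engine in my setup):

# Save model, optimizer and scheduler state via the DeepSpeed engine.
model.save_checkpoint("ds_checkpoints", tag="step_1000")

# In a later run, after accelerator.prepare, restore them:
load_path, client_state = model.load_checkpoint("ds_checkpoints", tag="step_1000")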

@cyk1337

cyk1337 commented Sep 7, 2022

Update: I tried DeepSpeed's model.save_checkpoint and model.load_checkpoint. It worked; it can restore optimizer and scheduler states when they are created via DummyOptim and DummyScheduler.

It can also load normal optimizer states, but not LambdaLR schedulers.

I met the same problem. With DeepSpeed, both model.save_checkpoint and accelerator.save_state hang. How can I save the optimizer / lr_scheduler states for resuming training (with DeepSpeed)?

@pacman100
Contributor

@cyk1337, could you provide a minimal reproducible example? The DeepSpeed tests run daily on a GPU-hosted runner, where the checkpoint saving and loading functionality works fine.

https://github.com/huggingface/accelerate/blob/main/tests/deepspeed/test_deepspeed.py#L699

@amarazad

amarazad commented Sep 20, 2023

Hi @pacman100 , I am using FSDP with full sharding. I use the following to save the state so that I can resume from the last state:

accelerator.wait_for_everyone()
if accelerator.is_main_process:
    if config["SAVE_STATE"]:
        accelerator.save_state(save_state_dir)

And to resume:

model = accelerator.prepare(model)
optimizer = accelerator.prepare(optimizer)
...
if config["RESUME_STATE"]:
    accelerator.wait_for_everyone()
    accelerator.load_state(save_state_dir)

However, while saving, it hangs at accelerator.save_state(save_state_dir) and after a long time throws the following error:

INFO:accelerate.accelerator:Saving FSDP model
[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18364, OpType=ALLGATHER, Timeout(ms)=4800000) ran for 4804648 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18364, OpType=ALLGATHER, Timeout(ms)=4800000) ran for 4804768 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18364, OpType=ALLGATHER, Timeout(ms)=4800000) ran for 4804777 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18338, OpType=_ALLGATHER_BASE, Timeout(ms)=4800000) ran for 4807653 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2820777 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2820778 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2820780 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 2820776) of binary: /home/apa/anaconda3/envs/py39_llm/bin/python
Traceback (most recent call last): ..
....

@jiaxilv

jiaxilv commented Mar 18, 2024

I have the same problem as @cyk1337: with DeepSpeed stage 2, model.save_checkpoint and accelerator.save_state block the program from continuing. I am now saving my unet with the following code:

unwrapped_model = accelerator.unwrap_model(unet)
accelerator.save(unwrapped_model.state_dict(), os.path.join(save_path, "model.pth"))

But I found that when I use the following code to save the optimizer parameters, it saves only a certain part of the optimizer parameters, not the complete optimizer parameters:

unwrapped_optimizer = accelerator.unwrap_model(optimizer)
accelerator.save(unwrapped_optimizer.state_dict(), os.path.join(save_path, "optimizer.pth"))

Is there anything that can be done to fix this?

@jhliu17

jhliu17 commented May 25, 2024

Hi @pacman100 , I am using FSDP with full sharding. [...] However, while saving, it hangs at accelerator.save_state(save_state_dir) and after a long time throws the NCCL watchdog timeout errors shown above.

I met the same problem. Do you have any solutions to it? Thank you~

@ParthaEth

If you are doing the following with ZeRO-3 DeepSpeed:

unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(
    self.output_dir,
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
    state_dict=accelerator.get_state_dict(model),
)

then you have to run it from all processes.

@Reginald-L

If you are doing the following with ZeRO-3 DeepSpeed:

unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(
    self.output_dir,
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
    state_dict=accelerator.get_state_dict(model),
)

then you have to run it from all processes.

I got an error: the Flux object does not have a save_pretrained attribute.
The Flux object is a subclass of torch.nn.Module.
