
Issue with saving accelerator state with FSDP  #2000

@amarazad

Description


System Info

  • Python 3.9
  • accelerate 0.21.0
  • PyTorch 2.0.1 (py3.9_cuda11.7_cudnn8.5.0_0)

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Hi @pacman100, I am using FSDP with full sharding. I use the following to save the state so that I can resume from the last saved state:

accelerator.wait_for_everyone()
if accelerator.is_main_process:
    if config["SAVE_STATE"]:
        accelerator.save_state(save_state_dir)

And to resume:

model = accelerator.prepare(model)
optimizer = accelerator.prepare(optimizer)
...
if config["RESUME_STATE"]:
    accelerator.wait_for_everyone()
    accelerator.load_state(save_state_dir)

However, while saving, the run hangs at accelerator.save_state(save_state_dir) and after a long time throws the following error:

INFO:accelerate.accelerator:Saving FSDP model
[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18364, OpType=ALLGATHER, Timeout(ms)=4800000) ran for 4804648 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18364, OpType=ALLGATHER, Timeout(ms)=4800000) ran for 4804768 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18364, OpType=ALLGATHER, Timeout(ms)=4800000) ran for 4804777 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18338, OpType=_ALLGATHER_BASE, Timeout(ms)=4800000) ran for 4807653 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2820777 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2820778 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2820780 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 2820776) of binary: /home/apa/anaconda3/envs/py39_llm/bin/python
Traceback (most recent call last): ..
....

Expected behavior

As per the documentation, it should save the state so that training can be resumed from there.
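
For reference, since the timeouts above are on ALLGATHER, I wonder whether gathering the FSDP state dict is a collective operation that every rank has to enter, in which case gating save_state behind is_main_process would leave the other ranks waiting. Below is a minimal sketch of the alternative pattern I am considering (untested; save_state_dir and the config flag are the same as in the code above):

# Sketch only: assumes save_state must run on every rank so the FSDP
# state-dict all-gather can complete; Accelerate decides internally
# which rank actually writes the checkpoint files.
if config["SAVE_STATE"]:
    accelerator.wait_for_everyone()
    accelerator.save_state(save_state_dir)  # called on all processes, not just main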
