System Info
python 3.9
accelerate 0.21.0
pytorch 2.0.1 (py3.9_cuda11.7_cudnn8.5.0_0)
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- My own task or dataset (give details below)
Reproduction
Hi @pacman100, I am using FSDP with full sharding. I use the following to save the state so that I can resume from the last checkpoint:
accelerator.wait_for_everyone()
if accelerator.is_main_process:
    if config["SAVE_STATE"]:
        accelerator.save_state(save_state_dir)
And to resume:
model = accelerator.prepare(model)
optimizer = accelerator.prepare(optimizer)
...
if config["RESUME_STATE"]:
accelerator.wait_for_everyone()
accelerator.load_state(save_state_dir)
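For reference, here is a condensed, self-contained sketch of how I wire this together; the model, optimizer, path, and config values below are placeholders standing in for my actual setup:

```python
import torch
from accelerate import Accelerator

# Placeholders for my actual setup
config = {"SAVE_STATE": True, "RESUME_STATE": False}
save_state_dir = "checkpoints/latest"

accelerator = Accelerator()  # launched via `accelerate launch` with an FSDP full-sharding config
model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer = accelerator.prepare(model, optimizer)

# Resume from the last saved state, if requested
if config["RESUME_STATE"]:
    accelerator.wait_for_everyone()
    accelerator.load_state(save_state_dir)

# ... training loop ...

# Save the current state so training can be resumed later
accelerator.wait_for_everyone()
if accelerator.is_main_process:
    if config["SAVE_STATE"]:
        accelerator.save_state(save_state_dir)  # this is where it hangs (see error below)
```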
However, while saving, it hangs at accelerator.save_state(save_state_dir) and after a long time throws the following error:
INFO:accelerate.accelerator:Saving FSDP model
[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18364, OpType=ALLGATHER, Timeout(ms)=4800000) ran for 4804648 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18364, OpType=ALLGATHER, Timeout(ms)=4800000) ran for 4804768 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18364, OpType=ALLGATHER, Timeout(ms)=4800000) ran for 4804777 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18338, OpType=_ALLGATHER_BASE, Timeout(ms)=4800000) ran for 4807653 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2820777 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2820778 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2820780 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 2820776) of binary: /home/apa/anaconda3/envs/py39_llm/bin/python
Traceback (most recent call last): ..
....
Expected behavior
As per the documentation, it should save the state so that training can be resumed from it.
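For comparison, this is roughly the checkpointing pattern from the Accelerate documentation (the toy model and path here are illustrative), which is what I expected to work under FSDP as well:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(8, 8)                         # toy model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = accelerator.prepare(model, optimizer)

# Save everything needed to resume (model, optimizer, RNG states, ...)
accelerator.save_state("my_checkpoint")               # illustrative path

# ... later, to resume from the same point ...
accelerator.load_state("my_checkpoint")
```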