
Issue with saving accelerator state with FSDP  #2000

@amarazad

Description


System Info

  • Python 3.9
  • accelerate 0.21.0
  • PyTorch 2.0.1 (py3.9_cuda11.7_cudnn8.5.0_0)

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Hi @pacman100, I am using FSDP with full sharding. I use the following to save the state so that I can resume from the last saved state:

accelerator.wait_for_everyone()
if accelerator.is_main_process:
    if config["SAVE_STATE"]:
        accelerator.save_state(save_state_dir)

And to resume:

model = accelerator.prepare(model)
optimizer = accelerator.prepare(optimizer)
...
if config["RESUME_STATE"]:
    accelerator.wait_for_everyone()
    accelerator.load_state(save_state_dir)

However, while saving, the run hangs at accelerator.save_state(save_state_dir) and after a long time throws the following error:

INFO:accelerate.accelerator:Saving FSDP model
[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18364, OpType=ALLGATHER, Timeout(ms)=4800000) ran for 4804648 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18364, OpType=ALLGATHER, Timeout(ms)=4800000) ran for 4804768 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18364, OpType=ALLGATHER, Timeout(ms)=4800000) ran for 4804777 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18338, OpType=_ALLGATHER_BASE, Timeout(ms)=4800000) ran for 4807653 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2820777 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2820778 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2820780 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 2820776) of binary: /home/apa/anaconda3/envs/py39_llm/bin/python
Traceback (most recent call last): ..
....

Expected behavior

As per the documentation, it should save the state so that training can be resumed from there.
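
For reference, since the timeouts above are on ALLGATHER, I wonder whether gathering the FSDP state dict is a collective operation that every rank has to enter, in which case gating save_state behind is_main_process would leave the other ranks waiting. Below is a minimal sketch of the alternative pattern I am considering (untested; save_state_dir and the config flag are the same as in the code above):

# Sketch only: assumes save_state must run on every rank so the FSDP
# state-dict all-gather can complete; Accelerate decides internally
# which rank actually writes the checkpoint files.
if config["SAVE_STATE"]:
    accelerator.wait_for_everyone()
    accelerator.save_state(save_state_dir)  # called on all processes, not just main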
