
No effect from InitProcessGroupKwargs timeout #2236

Closed · 2 of 4 tasks
Randl opened this issue Dec 10, 2023 · 15 comments

Randl commented Dec 10, 2023

System Info

- `Accelerate` version: 0.23.0
- Platform: Linux-6.2.0-37-generic-x86_64-with-glibc2.35
- Python version: 3.10.13
- Numpy version: 1.26.2
- PyTorch version (GPU?): 2.1.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 62.65 GB
- GPU type: NVIDIA RTX 6000 Ada Generation
- `Accelerate` default config:
	Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

  1. Follow the instructions from https://github.com/huggingface/alignment-handbook/tree/main/scripts and install the environment to run LoRA SFT training.
  2. Change the timeout to 3 hours:

accelerator = Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=6 * 1800))])

and run the training.
  3. Get a crash due to the timeout: https://wandb.ai/evgeniizh/huggingface/runs/pskgg48d

[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1124292, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800584 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1124292, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800584 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1124292, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800584 milliseconds before timing out.
[2023-12-09 08:46:08,664] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 54784 closing signal SIGTERM
[2023-12-09 08:46:11,834] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 1 (pid: 54785) of binary: /home/evgenii/.conda/envs/handbook/bin/python
Traceback (most recent call last):
  File "/home/evgenii/.conda/envs/handbook/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/accelerate/commands/launch.py", line 971, in launch_command
    deepspeed_launcher(args)
  File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/accelerate/commands/launch.py", line 687, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
scripts/run_sft.py FAILED

Note that the timeout is still 1800 seconds
(see also huggingface/alignment-handbook#59)
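
For clarity, the Timeout(ms)=1800000 in the watchdog message above matches the 30-minute default rather than the requested 3 hours; a quick sanity check of the numbers (a sketch for illustration only):

from datetime import timedelta

requested = timedelta(seconds=6 * 1800)  # value passed to InitProcessGroupKwargs above
default = timedelta(minutes=30)          # torch.distributed's default timeout

print(int(requested.total_seconds() * 1000))  # 10800000 ms -> what the watchdog would presumably report if the handler's value applied
print(int(default.total_seconds() * 1000))    # 1800000 ms  -> what the watchdog actually reports (Timeout(ms)=1800000)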

Expected behavior

The timeout is increased, and there is no crash.

SunMarc (Member) commented Dec 20, 2023

Hi @Randl, thanks for asking. This is normal for the nccl backend. I invite you to read the description of the timeout arg in the related doc for more information.
Excerpt:

timeout (timedelta, optional) – Timeout for operations executed against the process group. Default value equals 30 minutes. This is applicable for the gloo backend. For nccl, this is applicable only if the environment variable NCCL_BLOCKING_WAIT or NCCL_ASYNC_ERROR_HANDLING is set to 1.
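
Based on that excerpt, a minimal sketch of one way to make the handler's timeout apply under nccl: set NCCL_ASYNC_ERROR_HANDLING (or NCCL_BLOCKING_WAIT) to 1 before the process group is created. The 3-hour value mirrors the report; whether a launcher or another library already sets this variable is environment-specific.

import os
from datetime import timedelta

# Per the torch.distributed doc quoted above, the nccl timeout is only honored
# when NCCL_BLOCKING_WAIT or NCCL_ASYNC_ERROR_HANDLING is set to 1. Set it
# before the process group is initialized (i.e. before the Accelerator is built).
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")

from accelerate import Accelerator, InitProcessGroupKwargs

accelerator = Accelerator(
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(hours=3))]
)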

muellerzr (Collaborator) commented

@Randl can you rerun your code, building accelerate from pip install git+https://github.com/huggingface/accelerate@check-for-nccl, and verify we can catch this early? (And that this is indeed what is wrong with your setup?) 😄

Randl (Author) commented Dec 20, 2023

@muellerzr
NCCL_ASYNC_ERROR_HANDLING is set to 1 (by some of the libraries I use, I guess? I didn't set it).
In fact, the function changed in this branch is called only twice in my code, both from training_args
https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py#L1871-L1873
once with self.backend=nccl and once with self.backend=None, so InitProcessGroupKwargs(timeout=timedelta(seconds=6 * 1800)) can't even influence it?

I've also tried setting --ddp_timeout=10800 (this is what is passed from training_args) in my command, and it is passed to this function only in the second call; I still get the 30-minute timeout in my code.
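
For reference, a minimal sketch of setting the same value through TrainingArguments instead of the CLI flag; ddp_timeout is the field behind --ddp_timeout, and output_dir here is only illustrative:

from transformers import TrainingArguments

# Equivalent of passing --ddp_timeout=10800 on the command line; per the thread,
# this is the value training_args forwards when it initializes the process group.
args = TrainingArguments(output_dir="out", ddp_timeout=10800)
print(args.ddp_timeout)  # 10800 seconds, i.e. 3 hours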

github-actions bot commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Randl (Author) commented Jan 14, 2024

I don't think it was addressed?

github-actions bot commented Feb 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Randl (Author) commented Feb 12, 2024

Still not resolved?

github-actions bot commented Mar 7, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Randl (Author) commented Mar 7, 2024

...

muellerzr (Collaborator) commented

Looking into this again this week, sorry for the delay

muellerzr (Collaborator) commented

I'm definitely seeing an effect here. Note that the timeout only applies in situations where wait_for_everyone (or gather, etc.) has been called.

Minimal test:

import time
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs
from torch import tensor

# 4-second timeout for the process group
kwargs = [InitProcessGroupKwargs(timeout=timedelta(seconds=4))]
accelerator = Accelerator(kwargs_handlers=kwargs)

if accelerator.is_main_process:
    t = tensor(0).to(accelerator.device)
    time.sleep(8)  # stall the main process past the timeout
else:
    t = tensor(0).to(accelerator.device)
accelerator.wait_for_everyone()  # the other ranks block here and hit the timeout

print("All called!")

This will lead to a failure; change the 4 to a 10 and it will pass.
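
As a usage note, the snippet needs at least two processes for the collective to matter; something like accelerate launch --num_processes 2 timeout_test.py (the script name is illustrative) should reproduce the fail/pass behavior described above.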

muellerzr (Collaborator) commented

Can you give us more of your trace? It doesn't hint at where it's failing.

Randl (Author) commented Mar 7, 2024

I don't have access to the machine currently. I'll update you when I can run things on it.
I don't think there was any additional information there. From the logs, it's failing after uploading the checkpoint to the Hub, i.e. somewhere around
https://github.com/huggingface/alignment-handbook/blob/ff618a4d13a2c77cf97479fac8af2c576619062a/scripts/run_sft.py#L203-L205

muellerzr (Collaborator) commented

Thanks, that's helpful

muellerzr (Collaborator) commented

I see the exact issue: it's due to SFTTrainer, and it is not an accelerate issue (though it is accelerate-adjacent). Can you open an issue in trl for this and ping me?
