deepseek R1 671B with deepspeed zero3: Some NCCL operations have failed or timed out #7196
Unanswered
Tongmengfei
asked this question in
Q&A
Replies: 1 comment
Check the slave workers to see whether any of them hit a CUDA OOM. Also consider using cpu_offload.
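A minimal sketch of how the suggested cpu_offload could be wired into the asker's ZeRO-3 config, using DeepSpeed's standard `offload_optimizer` / `offload_param` fields (the `pin_memory` values are illustrative defaults, not something from this thread):

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}
```

Offloading optimizer state and parameters to CPU trades step time for GPU memory, which can remove the OOM that stalls one rank and makes the others time out in the collective.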
Has anyone run into `Some NCCL operations have failed or timed out.` when training the deepseek R1 671B model with deepspeed zero3? The full error is as follows:
hgpu8077: [rank7]:[E305 21:36:52.669123121 ProcessGroupNCCL.cpp:616] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=512650, OpType=_ALLGATHER_BASE, NumelIn=180224, NumelOut=2883584, Timeout(ms)=1800000) ran for 1800004 milliseconds before timing out.
hgpu8092: [rank10]:[E305 21:36:52.594230057 ProcessGroupNCCL.cpp:616] [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=512650, OpType=_ALLGATHER_BASE, NumelIn=360448, NumelOut=5767168, Timeout(ms)=1800000) ran for 1800003 milliseconds before timing out.
hgpu8077: [rank7]:[E305 21:36:52.669455337 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 7] Exception (either an error or timeout) detected by watchdog at work: 512650, last enqueued NCCL work: 512653, last completed NCCL work: 512649.
hgpu8092: [rank10]:[E305 21:36:52.594651945 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 10] Exception (either an error or timeout) detected by watchdog at work: 512650, last enqueued NCCL work: 512653, last completed NCCL work: 512649.
hgpu8077: [rank7]:[E305 21:36:52.669469959 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 7] Timeout at NCCL work: 512650, last enqueued NCCL work: 512653, last completed NCCL work: 512649.
hgpu8092: [rank10]:[E305 21:36:52.594665866 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 10] Timeout at NCCL work: 512650, last enqueued NCCL work: 512653, last completed NCCL work: 512649.
hgpu8077: [rank7]:[E305 21:36:52.669477433 ProcessGroupNCCL.cpp:630] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
hgpu8092: [rank10]:[E305 21:36:52.594672482 ProcessGroupNCCL.cpp:630] [Rank 10] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
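Before re-running, NCCL's own debug logging can be enabled to see which rank and which collective actually stall; these are standard NCCL/PyTorch environment variables, not something specific to this setup:

```shell
# Print NCCL setup and per-collective activity to stderr (rank-prefixed).
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=COLL

# Tear the whole job down promptly on a watchdog timeout
# instead of leaving the remaining ranks hanging.
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
```

With these set, the rank whose log stops progressing first is usually the one that OOMed or desynchronized.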
The config I am using is ds_z3_config.json:
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
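The watchdog in the log fires at `Timeout(ms)=1800000`, i.e. the default 30-minute NCCL timeout. One common mitigation, independent of fixing the underlying stall, is to raise the process-group timeout via the `timeout` parameter of `deepspeed.init_distributed`; a minimal sketch, where the 2-hour value is illustrative and should be tuned for the cluster:

```python
from datetime import timedelta

# The watchdog fired at the default 30 minutes; a larger timeout gives
# slow all-gather / offload steps on a 671B model more headroom.
NCCL_TIMEOUT = timedelta(hours=2)  # illustrative value, not from the thread

# On a real cluster this would be called before training starts:
# import deepspeed
# deepspeed.init_distributed(dist_backend="nccl", timeout=NCCL_TIMEOUT)

print(int(NCCL_TIMEOUT.total_seconds()))  # → 7200
```

Note that a longer timeout only delays the failure if one rank is genuinely stuck (e.g. CUDA OOM); it mainly helps when the collective is merely slow.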