deepseek R1 671B with deepspeed zero3: Some NCCL operations have failed or timed out #7196
Unanswered
Tongmengfei
asked this question in
Q&A
Replies: 1 comment
Check the slave workers to see whether any of them hit a CUDA OOM. Also consider using cpu_offload.
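A minimal sketch of how the suggested cpu_offload could be wired into the asker's ZeRO-3 config, using DeepSpeed's standard `offload_optimizer` / `offload_param` fields (the `pin_memory` values are illustrative defaults, not something from this thread):

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}
```

Offloading optimizer state and parameters to CPU trades step time for GPU memory, which can remove the OOM that stalls one rank and makes the others time out in the collective.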
Has anyone run into `Some NCCL operations have failed or timed out.` when training the deepseek R1 671B model with deepspeed zero3? The full error is as follows:
hgpu8077: [rank7]:[E305 21:36:52.669123121 ProcessGroupNCCL.cpp:616] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=512650, OpType=_ALLGATHER_BASE, NumelIn=180224, NumelOut=2883584, Timeout(ms)=1800000) ran for 1800004 milliseconds before timing out.
hgpu8092: [rank10]:[E305 21:36:52.594230057 ProcessGroupNCCL.cpp:616] [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=512650, OpType=_ALLGATHER_BASE, NumelIn=360448, NumelOut=5767168, Timeout(ms)=1800000) ran for 1800003 milliseconds before timing out.
hgpu8077: [rank7]:[E305 21:36:52.669455337 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 7] Exception (either an error or timeout) detected by watchdog at work: 512650, last enqueued NCCL work: 512653, last completed NCCL work: 512649.
hgpu8092: [rank10]:[E305 21:36:52.594651945 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 10] Exception (either an error or timeout) detected by watchdog at work: 512650, last enqueued NCCL work: 512653, last completed NCCL work: 512649.
hgpu8077: [rank7]:[E305 21:36:52.669469959 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 7] Timeout at NCCL work: 512650, last enqueued NCCL work: 512653, last completed NCCL work: 512649.
hgpu8092: [rank10]:[E305 21:36:52.594665866 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 10] Timeout at NCCL work: 512650, last enqueued NCCL work: 512653, last completed NCCL work: 512649.
hgpu8077: [rank7]:[E305 21:36:52.669477433 ProcessGroupNCCL.cpp:630] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
hgpu8092: [rank10]:[E305 21:36:52.594672482 ProcessGroupNCCL.cpp:630] [Rank 10] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
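Before re-running, NCCL's own debug logging can be enabled to see which rank and which collective actually stall; these are standard NCCL/PyTorch environment variables, not something specific to this setup:

```shell
# Print NCCL setup and per-collective activity to stderr (rank-prefixed).
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=COLL

# Tear the whole job down promptly on a watchdog timeout
# instead of leaving the remaining ranks hanging.
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
```

With these set, the rank whose log stops progressing first is usually the one that OOMed or desynchronized.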
The config I am using is ds_z3_config.json:
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
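The watchdog in the log fires at `Timeout(ms)=1800000`, i.e. the default 30-minute NCCL timeout. One common mitigation, independent of fixing the underlying stall, is to raise the process-group timeout via the `timeout` parameter of `deepspeed.init_distributed`; a minimal sketch, where the 2-hour value is illustrative and should be tuned for the cluster:

```python
from datetime import timedelta

# The watchdog fired at the default 30 minutes; a larger timeout gives
# slow all-gather / offload steps on a 671B model more headroom.
NCCL_TIMEOUT = timedelta(hours=2)  # illustrative value, not from the thread

# On a real cluster this would be called before training starts:
# import deepspeed
# deepspeed.init_distributed(dist_backend="nccl", timeout=NCCL_TIMEOUT)

print(int(NCCL_TIMEOUT.total_seconds()))  # → 7200
```

Note that a longer timeout only delays the failure if one rank is genuinely stuck (e.g. CUDA OOM); it mainly helps when the collective is merely slow.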