
RLHF GRPO training: communication freezes in vLLM server mode #5910

@helin-wang-zte

Description

CUDA_VISIBLE_DEVICES=0,1 \
swift rollout \
    --model /mnt/tenant-home_speed/shared/models/huggingface/Qwen3-14B--1f9e100-C8 \
    --model_type qwen3 \
    --vllm_tensor_parallel_size 2 \
    --host 0.0.0.0 \
    --port 8088

On another node:

CUDA_VISIBLE_DEVICES=2,3,4,5,6,7 \
NPROC_PER_NODE=6 \
swift rlhf \
    --rlhf_type grpo \
    --model /mnt/tenant-home_speed/whl/swift_learn/model/Qwen3-14B--1f9e100-C8 \
    --model_type qwen3 \
    --output_dir /mnt/tenant-home_speed/whl/swift_learn/output/grpo-training/qwen14B \
    --importance_sampling_level sequence \
    --reward_funcs accuracy \
    --use_vllm true \
    --vllm_mode server \
    --vllm_server_host xxxx \
    --vllm_server_port 8088 \
    --vllm_gpu_memory_utilization 0.6 \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset /mnt/tenant-home_speed/whl/swift_learn/dataset/numina_math_64#32 \
    --split_dataset_ratio 0 \
    --max_completion_length 2048 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 1 \
    --save_total_limit 2 \
    --logging_steps 1 \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4 \
    --num_generations 6 \
    --temperature 1.0 \
    --top_p 0.9 \
    --top_k 50 \
    --deepspeed zero3 \
    --log_completions true \
    --num_iterations 1 \
    --report_to tensorboard wandb \
    --beta 0.0 \
    --offload_optimizer true \
    --offload_model true \
    --sleep_level 1

After deploying the rollout server, I ran the training script. The server receives the requests normally, but the training side fails to connect and freezes. Swapping which node runs rollout and which runs RLHF gives the same result. However, when both run on the same node, the connection succeeds immediately.
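Since the server log shows /health/, /get_world_size/, and /init_communicator/ all returning 200 before the hang, plain HTTP reachability is evidently fine; the freeze happens during the subsequent NCCL handshake, which opens its own TCP connections. A minimal probe (host and port are placeholders matching the command above; the extra raw-TCP check is my assumption about what is worth testing from the training node) can separate the two cases:

```python
# Hypothetical connectivity probe for the cross-node hang described above.
# HTTP reaching the rollout server does NOT imply the NCCL communicator's
# own sockets can connect, so check both from the training node.
import socket
import urllib.request

SERVER_HOST = "xxxx"  # placeholder, as in --vllm_server_host above
SERVER_PORT = 8088


def http_health(host: str, port: int, timeout: float = 5.0) -> bool:
    """True if the rollout server answers /health/ with 200."""
    try:
        url = f"http://{host}:{port}/health/"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def tcp_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """True if a raw TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If `http_health` succeeds but raw TCP to the ports NCCL chooses is blocked (firewall, or a sidecar proxy — the `127.0.0.6` source address in the server log is typical of one), that would explain an indefinite freeze after `init_communicator` returns.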

server log:
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8088 (Press CTRL+C to quit)
INFO: 127.0.0.6:34157 - "GET /health/ HTTP/1.1" 200 OK
INFO: 127.0.0.6:34157 - "GET /get_world_size/ HTTP/1.1" 200 OK
INFO: 127.0.0.6:34157 - "POST /init_communicator/ HTTP/1.1" 200 OK
(VllmWorker rank=1 pid=3028) INFO 09-22 09:39:36 [__init__.py:1375] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=3028) INFO 09-22 09:39:36 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=0 pid=3027) INFO 09-22 09:39:36 [__init__.py:1375] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=3027) INFO 09-22 09:39:36 [pynccl.py:70] vLLM is using nccl==2.26.2

training-node log:
[INFO:swift] Successfully registered /usr/local/lib/python3.11/site-packages/swift/llm/dataset/data/dataset_info.json.
[INFO:swift] Setting args.remove_unused_columns: False
INFO 09-22 09:39:34 [__init__.py:235] Automatically detected platform cuda.
INFO 09-22 09:39:34 [__init__.py:235] Automatically detected platform cuda.
INFO 09-22 09:39:34 [__init__.py:235] Automatically detected platform cuda.
INFO 09-22 09:39:34 [__init__.py:235] Automatically detected platform cuda.
INFO 09-22 09:39:34 [__init__.py:235] Automatically detected platform cuda.
INFO 09-22 09:39:34 [__init__.py:235] Automatically detected platform cuda.
[INFO:swift] Start connecting to vLLM server
INFO 09-22 09:39:36 [__init__.py:1375] Found nccl from library libnccl.so.2
INFO 09-22 09:39:36 [pynccl.py:70] vLLM is using nccl==2.26.2
[2025-09-22 09:39:38,043] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-09-22 09:39:38,070] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-09-22 09:39:38,070] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-09-22 09:39:38,093] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-09-22 09:39:38,179] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-09-22 09:39:39,186] [INFO] [comm.py:669:init_distributed] cdb=None
[2025-09-22 09:39:39,202] [INFO] [comm.py:669:init_distributed] cdb=None
[2025-09-22 09:39:39,220] [INFO] [comm.py:669:init_distributed] cdb=None
[2025-09-22 09:39:39,251] [INFO] [comm.py:669:init_distributed] cdb=None
[2025-09-22 09:39:39,321] [INFO] [comm.py:669:init_distributed] cdb=None
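The training-node log stops right after NCCL is loaded, which across machines usually means NCCL picked a network interface that is not routable from the other node. A debugging sketch using standard NCCL environment variables (not ms-swift flags; the interface name `eth0` is an assumption — substitute a NIC that is actually reachable between the two nodes) to set before launching both commands:

```shell
# Standard NCCL debug knobs (assumed relevant here, not confirmed):
export NCCL_DEBUG=INFO            # log which interface/transport NCCL selects
export NCCL_SOCKET_IFNAME=eth0    # pin NCCL to a NIC routable between nodes (name is an assumption)
# export NCCL_IB_DISABLE=1        # optionally rule out InfiniBand as the culprit
```

With `NCCL_DEBUG=INFO`, both sides print the interface and address they bind, which should show whether the server is advertising an unreachable address (e.g. one behind the proxy implied by `127.0.0.6`).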
