
RLHF GRPO training: communication freezes in vLLM server mode #5910

@helin-wang-zte

Description

CUDA_VISIBLE_DEVICES=0,1 \
swift rollout \
    --model /mnt/tenant-home_speed/shared/models/huggingface/Qwen3-14B--1f9e100-C8 \
    --model_type qwen3 \
    --vllm_tensor_parallel_size 2 \
    --host 0.0.0.0 \
    --port 8088

On another node:

CUDA_VISIBLE_DEVICES=2,3,4,5,6,7 \
NPROC_PER_NODE=6 \
swift rlhf \
    --rlhf_type grpo \
    --model /mnt/tenant-home_speed/whl/swift_learn/model/Qwen3-14B--1f9e100-C8 \
    --model_type qwen3 \
    --output_dir /mnt/tenant-home_speed/whl/swift_learn/output/grpo-training/qwen14B \
    --importance_sampling_level sequence \
    --reward_funcs accuracy \
    --use_vllm true \
    --vllm_mode server \
    --vllm_server_host xxxx \
    --vllm_server_port 8088 \
    --vllm_gpu_memory_utilization 0.6 \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset /mnt/tenant-home_speed/whl/swift_learn/dataset/numina_math_64#32 \
    --split_dataset_ratio 0 \
    --max_completion_length 2048 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 1 \
    --save_total_limit 2 \
    --logging_steps 1 \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4 \
    --num_generations 6 \
    --temperature 1.0 \
    --top_p 0.9 \
    --top_k 50 \
    --deepspeed zero3 \
    --log_completions true \
    --num_iterations 1 \
    --report_to tensorboard wandb \
    --beta 0.0 \
    --offload_optimizer true \
    --offload_model true \
    --sleep_level 1

After deploying the rollout server, I ran the training script. The server receives the requests normally, but the training side fails to connect and freezes. Swapping which node runs rollout and which runs RLHF gives the same result. However, when both run on the same node, the connection succeeds immediately.
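Since the server log shows /health/, /get_world_size/, and /init_communicator/ all returning 200 before the hang, plain HTTP reachability is evidently fine; the freeze happens during the subsequent NCCL handshake, which opens its own TCP connections. A minimal probe (host and port are placeholders matching the command above; the extra raw-TCP check is my assumption about what is worth testing from the training node) can separate the two cases:

```python
# Hypothetical connectivity probe for the cross-node hang described above.
# HTTP reaching the rollout server does NOT imply the NCCL communicator's
# own sockets can connect, so check both from the training node.
import socket
import urllib.request

SERVER_HOST = "xxxx"  # placeholder, as in --vllm_server_host above
SERVER_PORT = 8088


def http_health(host: str, port: int, timeout: float = 5.0) -> bool:
    """True if the rollout server answers /health/ with 200."""
    try:
        url = f"http://{host}:{port}/health/"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def tcp_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """True if a raw TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If `http_health` succeeds but raw TCP to the ports NCCL chooses is blocked (firewall, or a sidecar proxy — the `127.0.0.6` source address in the server log is typical of one), that would explain an indefinite freeze after `init_communicator` returns.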

server log:
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8088 (Press CTRL+C to quit)
INFO: 127.0.0.6:34157 - "GET /health/ HTTP/1.1" 200 OK
INFO: 127.0.0.6:34157 - "GET /get_world_size/ HTTP/1.1" 200 OK
INFO: 127.0.0.6:34157 - "POST /init_communicator/ HTTP/1.1" 200 OK
(VllmWorker rank=1 pid=3028) INFO 09-22 09:39:36 [__init__.py:1375] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=3028) INFO 09-22 09:39:36 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=0 pid=3027) INFO 09-22 09:39:36 [__init__.py:1375] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=3027) INFO 09-22 09:39:36 [pynccl.py:70] vLLM is using nccl==2.26.2

training-node log:
[INFO:swift] Successfully registered /usr/local/lib/python3.11/site-packages/swift/llm/dataset/data/dataset_info.json.
[INFO:swift] Setting args.remove_unused_columns: False
INFO 09-22 09:39:34 [__init__.py:235] Automatically detected platform cuda.
INFO 09-22 09:39:34 [__init__.py:235] Automatically detected platform cuda.
INFO 09-22 09:39:34 [__init__.py:235] Automatically detected platform cuda.
INFO 09-22 09:39:34 [__init__.py:235] Automatically detected platform cuda.
INFO 09-22 09:39:34 [__init__.py:235] Automatically detected platform cuda.
INFO 09-22 09:39:34 [__init__.py:235] Automatically detected platform cuda.
[INFO:swift] Start connecting to vLLM server
INFO 09-22 09:39:36 [__init__.py:1375] Found nccl from library libnccl.so.2
INFO 09-22 09:39:36 [pynccl.py:70] vLLM is using nccl==2.26.2
[2025-09-22 09:39:38,043] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-09-22 09:39:38,070] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-09-22 09:39:38,070] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-09-22 09:39:38,093] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-09-22 09:39:38,179] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-09-22 09:39:39,186] [INFO] [comm.py:669:init_distributed] cdb=None
[2025-09-22 09:39:39,202] [INFO] [comm.py:669:init_distributed] cdb=None
[2025-09-22 09:39:39,220] [INFO] [comm.py:669:init_distributed] cdb=None
[2025-09-22 09:39:39,251] [INFO] [comm.py:669:init_distributed] cdb=None
[2025-09-22 09:39:39,321] [INFO] [comm.py:669:init_distributed] cdb=None
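The training-node log stops right after NCCL is loaded, which across machines usually means NCCL picked a network interface that is not routable from the other node. A debugging sketch using standard NCCL environment variables (not ms-swift flags; the interface name `eth0` is an assumption — substitute a NIC that is actually reachable between the two nodes) to set before launching both commands:

```shell
# Standard NCCL debug knobs (assumed relevant here, not confirmed):
export NCCL_DEBUG=INFO            # log which interface/transport NCCL selects
export NCCL_SOCKET_IFNAME=eth0    # pin NCCL to a NIC routable between nodes (name is an assumption)
# export NCCL_IB_DISABLE=1        # optionally rule out InfiniBand as the culprit
```

With `NCCL_DEBUG=INFO`, both sides print the interface and address they bind, which should show whether the server is advertising an unreachable address (e.g. one behind the proxy implied by `127.0.0.6`).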
