Failed to update weights to vLLM #313

Closed
thirteenflt opened this issue Jun 4, 2024 · 3 comments

Comments

@thirteenflt

Does anyone have any clue about this error?

(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148] Error executing method start_worker_execution_loop. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148] Traceback (most recent call last):
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]     return func(*args, **kwargs)
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/vllm/worker/worker.py", line 286, in start_worker_execution_loop
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]     while self._execute_model_non_driver():
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/vllm/worker/worker.py", line 295, in _execute_model_non_driver
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]     data = broadcast_tensor_dict(src=0)
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/vllm/distributed/communication_op.py", line 284, in broadcast_tensor_dict
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]     torch.distributed.broadcast_object_list(recv_metadata_list,
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]     return func(*args, **kwargs)
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]     broadcast(object_sizes_tensor, src=src, group=group)
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]     return func(*args, **kwargs)
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148]     work.wait()
(RayWorkerWrapper pid=4183, ip=10.3.32.122) ERROR 06-03 23:13:39 worker_base.py:148] RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
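
For context, the traceback shows a non-driver Ray worker blocked in start_worker_execution_loop: it calls broadcast_tensor_dict(src=0) and waits to receive metadata from the driver (rank 0) over gloo, which gives up after the 1800000 ms shown above when the matching broadcast never arrives. A minimal standalone sketch of that timeout pattern (illustrative only, not OpenRLHF or vLLM code; gloo backend and a short timeout are assumed):

# repro_broadcast_timeout.py -- hypothetical sketch; launch with:
#   torchrun --nproc_per_node=2 repro_broadcast_timeout.py
import time
from datetime import timedelta

import torch.distributed as dist

def main():
    # gloo group with a short timeout so the hang shows up quickly;
    # the log above hit the 1800000 ms (30 min) timeout instead.
    dist.init_process_group(backend="gloo", timeout=timedelta(seconds=30))
    if dist.get_rank() == 0:
        # Stand-in for a driver that is stuck (e.g. in a weight update)
        # and never issues the matching broadcast.
        time.sleep(60)
    else:
        obj = [None]
        # Mirrors broadcast_tensor_dict(src=0) in the worker loop: the recv
        # inside broadcast_object_list times out with the same
        # "Timed out waiting ... for recv operation to complete" error.
        dist.broadcast_object_list(obj, src=0)
    dist.destroy_process_group()

if __name__ == "__main__":
    main()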
@hijkzzz
Collaborator

hijkzzz commented Jun 4, 2024

What is your vLLM version?

@thirteenflt
Author

vllm version is 0.4.3
nvidia-nccl-cu12==2.20.5

@hijkzzz
Collaborator

hijkzzz commented Jun 4, 2024

vllm version is 0.4.3 nvidia-nccl-cu12==2.20.5

Do you use our Docker container (https://github.com/OpenLLMAI/OpenRLHF/tree/main/dockerfile) and NCCL between multiple nodes (e.g., over InfiniBand)?
I recommend trying vLLM 0.4.2 first, because we haven't tested 0.4.3 thoroughly enough.
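
If you go that route, pinning the suggested version in a pip-managed environment would look like this (the exact wheel/CUDA variant depends on your setup):

pip install vllm==0.4.2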
