train_ppo_llama_ray_70b.sh run two H800 machine error #316

Closed
yangzhipeng1108 opened this issue Jun 6, 2024 · 1 comment

Comments

yangzhipeng1108 commented Jun 6, 2024

[Two images attached; contents not preserved]

yangzhipeng1108 changed the title from "train_ppo_llama_ray_70b.sh run two H800 machine" to "train_ppo_llama_ray_70b.sh run two H800 machine error" on Jun 6, 2024
Collaborator

hijkzzz commented Jun 6, 2024

Is there an NCCL connection between the two machines? It is required for the vLLM weight sync.
If not, you will need to modify the code here so that the weights are synced using gloo instead:
https://github.com/OpenLLMAI/OpenRLHF/blob/main/openrlhf/trainer/ray/ppo_actor.py#L85
Finally, please use vLLM v0.4.2, since vLLM v0.4.3 has a bug.
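
A quick preflight along these lines may help narrow things down before launching the script. This is only a minimal sketch using the Python standard library, not part of OpenRLHF or vLLM. It checks two things: whether a TCP port on the other machine is reachable (necessary for NCCL or gloo to bootstrap, but not proof that NCCL itself works), and whether the installed vLLM matches the recommended version (written as v0.42 in the comment, assumed here to mean the 0.4.2 release). The helper names, the default timeout, and any host or port you pass in are hypothetical.

```python
import socket
from importlib import metadata

# Assumption: "v0.42" in the comment refers to the vLLM 0.4.2 release.
RECOMMENDED_VLLM = "0.4.2"


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    A reachable port is a prerequisite for the NCCL/gloo rendezvous,
    but it does not prove that NCCL traffic itself will work.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def installed_vllm_version():
    """Return the installed vLLM version string, or None if vLLM is absent."""
    try:
        return metadata.version("vllm")
    except metadata.PackageNotFoundError:
        return None


def vllm_version_ok(installed, recommended: str = RECOMMENDED_VLLM) -> bool:
    """True when the installed version equals the recommended one.

    A local build suffix such as "+cu118" is ignored for the comparison.
    """
    if installed is None:
        return False
    return installed.split("+")[0] == recommended
```

On each machine, one could call `can_reach(<other node address>, <open port>)` and `vllm_version_ok(installed_vllm_version())`. If the port check fails, the problem is at the network level rather than in the training script.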
