GRPOTrainer adds support for OpenAI API-compatible servers to models that generate samples #2901
Conversation
Cc @edbeeching
I would like to clarify this PR, since I am not a native English speaker and there might be some errors in my original description. The default code loads the vLLM generation model on a single GPU. During training, the other GPUs must wait for that single GPU to finish generating, which causes delays. In this PR, I have added a new optional feature that allows using an external API for completions instead of relying solely on the local vLLM implementation. Thanks!
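To make the idea concrete, here is a minimal sketch of how training could be pointed at an external server. The commented-out arguments (`use_external_api`, `external_api_base`) are placeholders for whatever this PR ultimately names them, not the exact API:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

# A toy reward just for illustration: longer completions get higher reward.
def reward_len(completions, **kwargs):
    return [float(len(c)) for c in completions]

training_args = GRPOConfig(
    output_dir="Qwen2-0.5B-GRPO",
    use_vllm=False,  # do not spin up the local single-GPU vLLM engine
    # Hypothetical options this PR could expose (names are illustrative):
    # use_external_api=True,
    # external_api_base="http://localhost:8000/v1",
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```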
I might not fully understand it, but I don't see how the external OpenAI-compatible model is updated during training. The original (slow) implementation loads the most recent weights into vLLM at each step before generating responses.
Hmm, you're right. This code doesn't update the vLLM server model weights in real time. I'm currently looking for ways to address this issue. |
Actually, vLLM addressed weight updates for RLHF in the following commit: This enables the user to update weights via an NCCL process group. That means there would be two process groups, one for training and one for updating the weights on the vLLM server.
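As a rough illustration of the two-process-group idea, here is a sketch written with plain `torch.distributed` rather than vLLM's actual RLHF helpers. It assumes rank 0 holds the training model, the remaining ranks host the inference server, and both sides have already initialized an NCCL backend:

```python
import torch
import torch.distributed as dist

def broadcast_updated_weights(model: torch.nn.Module, weight_update_group):
    """Push the trainer's current parameters to the inference ranks."""
    for param in model.parameters():
        # Rank 0 (the trainer) broadcasts each tensor; the inference ranks
        # receive it and overwrite their own copy of the parameter.
        dist.broadcast(param.data, src=0, group=weight_update_group)

# A dedicated group for weight updates, separate from the training group,
# so generation-side communication does not interfere with training collectives.
# Ranks are illustrative: 0 = trainer, 1-3 = inference server.
# weight_update_group = dist.new_group(ranks=[0, 1, 2, 3], backend="nccl")
# broadcast_updated_weights(policy_model, weight_update_group)
```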

What does this PR do?
The original generative model is loaded on a single GPU, which becomes very slow when the generation length reaches 4096 tokens or more. Therefore, I suggest loading the generation model externally to leverage multi-GPU support.
Motivation Behind This Feature:
The motivation for this feature stems from performance limitations when using the original vLLM generative model, which is currently loaded on a single GPU. This becomes a bottleneck, especially when the generation length reaches 4096 tokens or more, significantly slowing down the process. By utilizing multiple GPUs, we can distribute the workload more efficiently and drastically improve performance, especially for long-generation tasks.
This feature is crucial for my project, as faster generation times are essential. I believe it could also benefit the broader community by enhancing the scalability and efficiency of model inference on multi-GPU setups.
Requested Feature:
The feature I am requesting involves loading the generative model outside of the current single-GPU setup in order to leverage multi-GPU capabilities. This would improve performance for tasks involving large generation lengths, making the library more efficient and scalable for demanding use cases.
Code Snippet:
Modified methods: `GRPOTrainer.__init__` and `GRPOTrainer._prepare_inputs`.
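For reference, a hedged sketch of what the external completion call inside `_prepare_inputs` could look like, using the `openai` client against an OpenAI API-compatible endpoint (the base URL, model name, helper name, and sampling parameters below are illustrative, not the PR's exact code):

```python
from openai import OpenAI

# Illustrative helper: request completions from an OpenAI API-compatible
# server (for example one started with `vllm serve`) instead of the
# in-process vLLM engine.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def generate_completions(prompts, n_generations, max_tokens, temperature):
    completions = []
    for prompt in prompts:
        response = client.completions.create(
            model="Qwen/Qwen2-0.5B-Instruct",
            prompt=prompt,
            n=n_generations,       # one group of samples per prompt, as in GRPO
            max_tokens=max_tokens,
            temperature=temperature,
        )
        completions.append([choice.text for choice in response.choices])
    return completions
```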
Fixes # (issue)
The issue mentioned in: #2887
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Please follow the documentation guidelines.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR. If additional documentation improvements are needed, I would be happy to contribute.