GRPOTrainer adds support for OpenAI API-compatible servers to models that generate samples #2901
Conversation
Cc @edbeeching
I would like to clarify this PR, since I am not a native English speaker and there might be some errors in my original description. The default code loads the vLLM generation model on a single GPU. During training, the other GPUs must wait for that single GPU to finish generating, which causes delays. In this PR, I have added a new optional feature that allows using an external API for completions instead of relying solely on the local vLLM implementation. Thanks!
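To make the idea concrete, here is a minimal sketch of how training could be pointed at an external server. The commented-out arguments (`use_external_api`, `external_api_base`) are placeholders for whatever this PR ultimately names them, not the exact API:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

# A toy reward just for illustration: longer completions get higher reward.
def reward_len(completions, **kwargs):
    return [float(len(c)) for c in completions]

training_args = GRPOConfig(
    output_dir="Qwen2-0.5B-GRPO",
    use_vllm=False,  # do not spin up the local single-GPU vLLM engine
    # Hypothetical options this PR could expose (names are illustrative):
    # use_external_api=True,
    # external_api_base="http://localhost:8000/v1",
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```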
I might not fully understand it, but I don't see how the external OpenAI-compatible model is updated during training. The original (slow) implementation loads the most recent weights into vLLM at each step before generating responses.
Hmm, you're right. This code doesn't update the vLLM server model weights in real time. I'm currently looking for ways to address this issue. |
Actually, vLLM addressed weight updates for RLHF in the following commit: This enables the user to update weights via an NCCL process group. That means there would be two process groups, one for training and one for updating the weights on the vLLM server.
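As a rough illustration of the two-process-group idea, here is a sketch written with plain `torch.distributed` rather than vLLM's actual RLHF helpers. It assumes rank 0 holds the training model, the remaining ranks host the inference server, and both sides have already initialized an NCCL backend:

```python
import torch
import torch.distributed as dist

def broadcast_updated_weights(model: torch.nn.Module, weight_update_group):
    """Push the trainer's current parameters to the inference ranks."""
    for param in model.parameters():
        # Rank 0 (the trainer) broadcasts each tensor; the inference ranks
        # receive it and overwrite their own copy of the parameter.
        dist.broadcast(param.data, src=0, group=weight_update_group)

# A dedicated group for weight updates, separate from the training group,
# so generation-side communication does not interfere with training collectives.
# Ranks are illustrative: 0 = trainer, 1-3 = inference server.
# weight_update_group = dist.new_group(ranks=[0, 1, 2, 3], backend="nccl")
# broadcast_updated_weights(policy_model, weight_update_group)
```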

What does this PR do?
The original generative model is loaded on a single GPU, which becomes very slow when the generation length reaches 4096 tokens or more. Therefore, I suggest loading the generation model externally to leverage multi-GPU support.
Motivation Behind This Feature:
The motivation for this feature stems from performance limitations when using the original vLLM generative model, which is currently loaded on a single GPU. This becomes a bottleneck, especially when the generation length reaches 4096 tokens or more, significantly slowing down the process. By utilizing multiple GPUs, we can distribute the workload more efficiently and drastically improve performance, especially for long-generation tasks.
This feature is crucial for my project, as faster generation times are essential. I believe it could also benefit the broader community by enhancing the scalability and efficiency of model inference on multi-GPU setups.
Requested Feature:
The feature I am requesting involves loading the generative model outside of the current single-GPU setup in order to leverage multi-GPU capabilities. This would improve performance for tasks involving large generation lengths, making the library more efficient and scalable for demanding use cases.
Code Snippet:
Modified methods: `GRPOTrainer.__init__` and `GRPOTrainer._prepare_inputs`.
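For reference, a hedged sketch of what the external completion call inside `_prepare_inputs` could look like, using the `openai` client against an OpenAI API-compatible endpoint (the base URL, model name, helper name, and sampling parameters below are illustrative, not the PR's exact code):

```python
from openai import OpenAI

# Illustrative helper: request completions from an OpenAI API-compatible
# server (for example one started with `vllm serve`) instead of the
# in-process vLLM engine.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def generate_completions(prompts, n_generations, max_tokens, temperature):
    completions = []
    for prompt in prompts:
        response = client.completions.create(
            model="Qwen/Qwen2-0.5B-Instruct",
            prompt=prompt,
            n=n_generations,       # one group of samples per prompt, as in GRPO
            max_tokens=max_tokens,
            temperature=temperature,
        )
        completions.append([choice.text for choice in response.choices])
    return completions
```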
Fixes # (issue)
The issue mentioned in: #2887
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Please follow the documentation guidelines.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR. If additional documentation improvements are needed, I would be happy to contribute.