@rolshoven rolshoven commented Sep 15, 2025

As part of a community task I have been collaborating on, I encountered several challenges with the litellm judge backend (#962):

  • Judge outputs are currently not cached. If the evaluation script fails, the judge responses have to be regenerated, leading to higher evaluation costs.
  • Not all inference providers support the same set of parameters when generating chat completions.
  • The maximum number of generated tokens is currently hardcoded to 512 in the litellm backend.
  • 100 concurrent requests are currently performed, which can cause the client to run into rate limits.

These challenges are addressed in this PR as follows:

The JudgeLM and JudgeLLM constructors now accept a backend_options parameter (a dict). For the litellm backend, this dict is converted into a new LitellmBackendOptions dataclass, which lets you specify whether to use caching, how many concurrent requests to perform, and whether to increase the number of output tokens for reasoning models.

Additionally, the litellm backend now ignores by default any chat completion arguments that the current inference provider does not support, and it respects the max_tokens parameter instead of using a hardcoded value.
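To make the new interface concrete, here is a minimal sketch of how a backend_options dict could map onto the LitellmBackendOptions dataclass. The field names and defaults below (`enable_cache`, `max_concurrent_requests`, `raise_max_tokens_for_reasoning`) are illustrative assumptions, not the exact names used in the PR.

```python
# Illustrative sketch only: field names and defaults are assumptions,
# not the actual LitellmBackendOptions definition from this PR.
from dataclasses import dataclass


@dataclass
class LitellmBackendOptions:
    """Options for the litellm judge backend (hypothetical field names)."""
    enable_cache: bool = True                     # cache judge responses to avoid re-billing on reruns
    max_concurrent_requests: int = 10             # lower this if you hit provider rate limits
    raise_max_tokens_for_reasoning: bool = False  # grant a larger output budget to reasoning models


# A JudgeLLM-style constructor would take a plain dict and convert it internally:
backend_options = {
    "enable_cache": True,
    "max_concurrent_requests": 20,
    "raise_max_tokens_for_reasoning": True,
}
options = LitellmBackendOptions(**backend_options)  # dict -> dataclass conversion
```

Accepting a plain dict keeps the judge constructors backend-agnostic, while each backend converts and validates only the options it understands, which is what lets the mechanism be extended to other backends later.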

I'm looking forward to discussing the current solution or potential alternatives!


@NathanHB NathanHB left a comment


hey! Thanks for the PR, it looks great. Only a few nits and it will be good to be merged.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@NathanHB NathanHB merged commit 9ba430f into huggingface:main Sep 16, 2025
4 checks passed
NathanHB added a commit that referenced this pull request Sep 19, 2025
* Added `backend_options` parameter to llm judges.

Currently only used for litellm backend but can be extended to other backends as well. Allows to specify whether to use caching or not, the number of concurrent requests, and whether the token output budget should be increased for reasoning models.

* Implemented changes from code review

* Ran pre-commit hooks

---------

Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>