
Conversation

Jack-Khuu
Contributor

Adds a basic LLMJudge (vllm generation from Policy) based on the methods listed in https://hazyresearch.stanford.edu/blog/2025-06-18-weaver

Note: Policy should probably be renamed, but this isn't the PR to do it and there's separate abstraction discussions in #149

@meta-cla bot added the CLA Signed label on Sep 17, 2025
"""
Returns whether there is a match to the first output
"""
first_output = outputs.outputs[0]
Contributor

nit: maybe handle the edge case where the outputs are empty?
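A minimal sketch of the guard this nit is asking for, assuming a vLLM-style `RequestOutput` whose `.outputs` list holds the generated samples. The `RequestOutput`/`CompletionOutput` classes and the `matches_first_output`/`expected` names are stubs invented here for illustration, not the PR's actual code:

```python
from dataclasses import dataclass, field


@dataclass
class CompletionOutput:
    """Stub of one generated sample (vLLM's real class carries more fields)."""
    text: str


@dataclass
class RequestOutput:
    """Stub of a generation result holding zero or more samples."""
    outputs: list = field(default_factory=list)


def matches_first_output(outputs: RequestOutput, expected: str) -> bool:
    """Returns whether there is a match to the first output."""
    if not outputs.outputs:
        # Edge case from the review nit: generation produced no samples,
        # so indexing outputs.outputs[0] would raise IndexError.
        return False
    first_output = outputs.outputs[0]
    return first_output.text.strip() == expected.strip()
```

With the guard, an empty result scores as a non-match instead of raising.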



@dataclass
class LLMJudge:
Contributor

if LLMJudge here is evaluating responses i.e. Pass@1, Majority, First Sample, etc. - aren't these metrics rather than judges?

Contributor Author

Aren't metrics just for logging?

The Judges can be used as part of the generation evaluation (analogous to Rewards)

Contributor

yeah Metric is sort of a loaded term unfortunately. Maybe it's EvaluationMetric?

class LLMJudge:
    """Simple interface for Judges utilizing LLMs."""

    judge_model: ServiceInterface
Contributor

instead of this class holding a ServiceInterface, can we instead pass the generated responses directly to evaluate()?

Contributor Author

So have the main loop directly manage calling all N weak verifiers in the rollout?

I don't have a strong preference here, but doing so does increase the boilerplate/management load

Contributor

hmm, so this is my mental model of how Weaver works:

  1. Generator generates K responses
  2. The K responses go through N verifiers, producing K·N verifier results
  3. The K·N results get distilled down to a scalar 0/1 through Weaver

Meanwhile, pass@1, majority, first sample, and pass@k are more like "assuming we already know the answer, was the generator able to produce the correct result in K tries?"

so when it comes to this PR, it depends on what we're trying to accomplish - is it step 2?
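The three-step flow described above can be sketched as follows. This is an illustrative toy, not Weaver itself: the `Verifier` signature, the fixed `weights`, and `weaver_style_select` are all invented here (the real Weaver learns how to weight its weak verifiers, per the linked blog post):

```python
from typing import Callable, List

# A weak verifier maps one response string to a score in [0, 1].
Verifier = Callable[[str], float]


def weaver_style_select(
    responses: List[str],          # step 1: the K generated responses
    verifiers: List[Verifier],     # step 2: the N weak verifiers
    weights: List[float],          # stand-in for Weaver's learned weighting
) -> str:
    """Run all N verifiers over all K responses (K*N results), distill each
    response's N scores into one scalar, and return the top-scoring response."""

    def combined_score(response: str) -> float:
        # Step 3: collapse the N verifier scores for this response into
        # a single scalar via a weighted sum.
        return sum(w * v(response) for w, v in zip(weights, verifiers))

    return max(responses, key=combined_score)
```

For example, with two toy verifiers (one checking length, one checking for a digit), the response satisfying both wins.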

Contributor Author

Step 2 (I'll update it to return a list/tensor of length K), where the N judges correspond to the N verifiers

Good catch though, I should generalize this to K responses

Contributor

ok, so if we're shooting for step 2 then I wouldn't necessarily include pass@1, majority, first sample, pass@k, etc., which are separate from the verifiers

What we should show is something like N different models from RewardBench as individual services

if we want to introduce a Judge concept, then I think we should do two things:

  • Rename Policy* to Inference*
  • Make the Judge a special instance of Policy that uses generate to turn the final result into scalars or w/e is needed
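A rough sketch of the second bullet, a Judge as a special instance of the generation service that calls generate and parses the result into a scalar. `ServiceInterface`, its `generate` signature, the `evaluate` method, and the YES/NO prompt are all assumptions for illustration, not the repo's actual abstractions:

```python
from dataclasses import dataclass
from typing import Protocol


class ServiceInterface(Protocol):
    """Assumed shape of the generation service (stand-in for the real one)."""

    def generate(self, prompt: str) -> str: ...


@dataclass
class LLMJudge:
    """Simple interface for Judges utilizing LLMs: wraps a generation
    service and turns its text output into a scalar verdict."""

    judge_model: ServiceInterface

    def evaluate(self, question: str, response: str) -> float:
        # Ask the judge model for a verdict, then parse it into a scalar.
        prompt = (
            f"Question: {question}\n"
            f"Response: {response}\n"
            "Is the response correct? Answer YES or NO:"
        )
        verdict = self.judge_model.generate(prompt)
        return 1.0 if verdict.strip().upper().startswith("YES") else 0.0
```

N such judges, each wrapping a different service (e.g. different RewardBench models), would produce the N verifier scores per response discussed above.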

@Jack-Khuu
Contributor Author

Will remake PR, since first N commits were going in a different direction than needed

