Adds basic LLMJudges #167
Conversation
""" | ||
Returns whether there is a match to the first output | ||
""" | ||
first_output = outputs.outputs[0] |
nit: maybe handle the edge case where the outputs are empty?
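For illustration, a minimal sketch of the suggested guard; the function name, the `expected` argument, and the match rule are assumptions, not this PR's actual code:

def first_output_match(outputs, expected: str) -> bool:
    """Returns whether there is a match to the first output; False if empty."""
    if not outputs.outputs:  # guard the empty edge case raised in the nit above
        return False
    first_output = outputs.outputs[0]
    return first_output == expected  # placeholder match rule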
@dataclass
class LLMJudge:
If LLMJudge here is evaluating responses, i.e. Pass@1, Majority, First Sample, etc., aren't these metrics rather than judges?
Aren't metrics just for logging?
The Judges can be used as part of the generation evaluation (analogous to Rewards)
Yeah, Metric is sort of a loaded term, unfortunately. Maybe it's EvaluationMetric?
class LLMJudge:
    """Simple interface for Judges utilizing LLMs."""

    judge_model: ServiceInterface
Instead of this class holding a ServiceInterface, can we pass the generated responses directly to evaluate()?
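To make the two shapes concrete, a hypothetical sketch; the class names, method signatures, and return types are illustrative, not from this PR:

from dataclasses import dataclass

@dataclass
class LLMJudgeWithService:
    # Shape in the PR: the judge owns the model it calls.
    judge_model: "ServiceInterface"

    def evaluate(self, prompt: str) -> float: ...

@dataclass
class StatelessLLMJudge:
    # Suggested shape: the main loop generates; evaluate() only
    # scores the responses it is handed.
    def evaluate(self, responses: list[str]) -> list[float]: ...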
So have the main loop directly manage calling all N weak verifiers in the rollout? I don't have a strong preference here, but doing so does increase the boilerplate/management load.
Hmm, so this is my mental model of how Weaver works:
1. The generator generates K responses
2. The K responses go through N verifiers, producing K×N verifier results
3. The K×N results get distilled down to a scalar 0/1 through Weaver

Meanwhile, pass@1, majority, first sample, and pass@k are more like "assuming we know the answer already, was the generator able to produce the correct result in K tries?"
So when it comes to this PR, it depends on what we're trying to accomplish: is it step 2?
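For concreteness, a rough sketch of that mental model; the generate/verify names are stand-ins, and the plain mean is a placeholder for Weaver's learned aggregation:

def weaver_pipeline(generator, verifiers, prompt, k: int) -> list[float]:
    # Step 1: the generator produces K responses.
    responses = [generator.generate(prompt) for _ in range(k)]
    # Step 2: each response goes through all N verifiers -> K x N results.
    results = [[v.verify(prompt, r) for v in verifiers] for r in responses]
    # Step 3: distill each row of N results to one scalar; a mean stands in
    # for Weaver's learned weighting.
    return [sum(row) / len(row) for row in results]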
Step 2 (will update to return a list/tensor of length K), where the N judges correspond to the N verifiers.
Good catch, though; I should generalize this to K responses.
OK, so if we're shooting for step 2, then I wouldn't necessarily include pass@1, majority, first sample, pass@k, etc., which are separate from the verifiers.
What we should show is something like N different models from RewardBench as individual services.
If we want to introduce a Judge concept, then I think we should do two things:
- Rename Policy* to Inference*
- Make the Judge a special instance of Policy that uses generate to turn the final result into scalars or whatever is needed (see the sketch below)

Will remake the PR, since the first N commits were going in a different direction than needed.
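A hypothetical sketch of that second bullet, with Policy stubbed out; the class names, prompt format, and parsing rule are assumptions, not the repo's actual API:

class Policy:  # stand-in for the repo's Policy (to-be Inference*) base class
    def generate(self, prompt: str) -> str:
        raise NotImplementedError

class Judge(Policy):
    """A special instance of Policy that turns generate() output into a scalar."""

    def score(self, prompt: str, response: str) -> float:
        verdict = self.generate(
            f"Rate the response from 0 to 1.\nPrompt: {prompt}\n"
            f"Response: {response}\nScore:"
        )
        try:
            return float(verdict.strip())
        except ValueError:
            return 0.0  # treat an unparseable verdict as a failed check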
Adds a basic LLMJudge (vLLM generation from Policy) based on the methods listed in https://hazyresearch.stanford.edu/blog/2025-06-18-weaver

Note: Policy should probably be renamed, but this isn't the PR to do it, and there are separate abstraction discussions in #149.