Adds basic LLMJudges #167
Conversation
""" | ||
Returns whether there is a match to the first output | ||
""" | ||
first_output = outputs.outputs[0] |
nit: maybe handle the edge case where the outputs are empty?
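For illustration, a minimal sketch of the suggested guard; the function name, the `expected` argument, and the match rule are assumptions, not this PR's actual code:

def first_output_match(outputs, expected: str) -> bool:
    """Returns whether there is a match to the first output; False if empty."""
    if not outputs.outputs:  # guard the empty edge case raised in the nit above
        return False
    first_output = outputs.outputs[0]
    return first_output == expected  # placeholder match rule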
@dataclass
class LLMJudge:
If LLMJudge here is evaluating responses, i.e. Pass@1, Majority, First Sample, etc., aren't these metrics rather than judges?
Aren't metrics just for logging?
The Judges can be used as part of the generation evaluation (analogous to Rewards)
Yeah, Metric is sort of a loaded term, unfortunately. Maybe it's EvaluationMetric?
class LLMJudge:
    """Simple interface for Judges utilizing LLMs."""

    judge_model: ServiceInterface
Instead of this class holding a ServiceInterface, can we pass the generated responses directly to evaluate()?
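To make the two shapes concrete, a hypothetical sketch; the class names, method signatures, and return types are illustrative, not from this PR:

from dataclasses import dataclass

@dataclass
class LLMJudgeWithService:
    # Shape in the PR: the judge owns the model it calls.
    judge_model: "ServiceInterface"

    def evaluate(self, prompt: str) -> float: ...

@dataclass
class StatelessLLMJudge:
    # Suggested shape: the main loop generates; evaluate() only
    # scores the responses it is handed.
    def evaluate(self, responses: list[str]) -> list[float]: ...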
So have the main loop directly manage calling all N weak verifiers in the rollout? I don't have a strong preference here, but doing so does increase the boilerplate/management load.
Hmm, so this is my mental model of how Weaver works:
1. The generator generates K responses
2. The K responses go through N verifiers, producing K×N verifier results
3. The K×N results get distilled down to a scalar 0/1 through Weaver

Meanwhile, pass@1, majority, first sample, and pass@k are more like "assuming we know the answer already, was the generator able to produce the correct result in K tries?"
So when it comes to this PR, it depends on what we're trying to accomplish: is it step 2?
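For concreteness, a rough sketch of that mental model; the generate/verify names are stand-ins, and the plain mean is a placeholder for Weaver's learned aggregation:

def weaver_pipeline(generator, verifiers, prompt, k: int) -> list[float]:
    # Step 1: the generator produces K responses.
    responses = [generator.generate(prompt) for _ in range(k)]
    # Step 2: each response goes through all N verifiers -> K x N results.
    results = [[v.verify(prompt, r) for v in verifiers] for r in responses]
    # Step 3: distill each row of N results to one scalar; a mean stands in
    # for Weaver's learned weighting.
    return [sum(row) / len(row) for row in results]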
Step 2 (will update to return a list/tensor of length K), where the N judges correspond to the N verifiers.
Good catch, though; I should generalize this to K responses.
OK, so if we're shooting for step 2, then I wouldn't necessarily include pass@1, majority, first sample, pass@k, etc., which are separate from the verifiers.
What we should show is something like N different models from RewardBench as individual services.
If we want to introduce a Judge concept, then I think we should do two things:
- Rename Policy* to Inference*
- Make the Judge a special instance of Policy that uses generate to turn the final result into scalars or whatever is needed (see the sketch below)

Will remake the PR, since the first N commits were going in a different direction than needed.
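A hypothetical sketch of that second bullet, with Policy stubbed out; the class names, prompt format, and parsing rule are assumptions, not the repo's actual API:

class Policy:  # stand-in for the repo's Policy (to-be Inference*) base class
    def generate(self, prompt: str) -> str:
        raise NotImplementedError

class Judge(Policy):
    """A special instance of Policy that turns generate() output into a scalar."""

    def score(self, prompt: str, response: str) -> float:
        verdict = self.generate(
            f"Rate the response from 0 to 1.\nPrompt: {prompt}\n"
            f"Response: {response}\nScore:"
        )
        try:
            return float(verdict.strip())
        except ValueError:
            return 0.0  # treat an unparseable verdict as a failed check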
Adds a basic LLMJudge (vLLM generation from Policy) based on the methods listed in https://hazyresearch.stanford.edu/blog/2025-06-18-weaver

Note: Policy should probably be renamed, but this isn't the PR to do it, and there are separate abstraction discussions in #149.