[WIP] Creates Judges as a wrapper on Policy #202
Conversation
apps/vllm/judge.py
Outdated
```python
print(f"Responses: {responses}\n")

try:
    async with policy.session():
```
why do you need this?
We don't, will remove
apps/vllm/judge.py
Outdated
```python
print("Spawning service...")
policy = await Policy.options(**cfg.services.policy).as_service(**cfg.policy)
evaluate = GenerativeJudge(
```
Could you help me understand why we can't actor'ify the generative judge? The policy/service config can be passed to the GenerativeJudgeActor, and it can figure out the actor creation, session semantics (if that is needed), etc.
I think Jack's intention here is that the generators can also be used as the judge (e.g. how it's done here), but intuitively I feel like they should be kept separate. A judge-specific actor and a reward-model-specific actor make sense too, with all of the boilerplate kept with those implementations.
> Could you help me understand why we can't actor'ify the generative judge?

> generators can also be used as the judge
The idea behind taking in a hydrated generator was to make it easy to use an existing policy (or policy version) as a discriminator.

That said, we absolutely can make this an actor and push the setup inside. I wanted to avoid some of the boilerplate, but if we're fine with it, I'll send up a JudgeActor and RewardModelActor.
> enable uses of an existing policy (or policy version) as a discriminator.

Is this something that's common in the literature? If it's not, I'd want to wait and see until it's requested.
> I'll send up a JudgeActor, RewardModelActor

I think `JudgeActor` and `RewardModelActor` are reasonable. I also think it's reasonable to do another PR before this one that renames Policy to VLLM / VLLMWorker etc.
```
Prompt: What is the capital of Japan?
Responses: ['Aardvark', 'Durian', 'Tokyo']

Generation Results:
================================================================================
Sample 1
Evaluation: 3
--------------------------------------------------------------------------------
Sample 2
Evaluation: 3
--------------------------------------------------------------------------------
Sample 3
Evaluation: 3
--------------------------------------------------------------------------------
Sample 4
Evaluation: 3
--------------------------------------------------------------------------------
```
lol is this working correctly?
apps/vllm/judge.py
Outdated
```python
Note: This is not a "good" prompt setup, it just demonstrates how to make one
"""

def _wrapper(prompt: str, responses: list[str]) -> str:
```
Hmm, this is a good start and I think we can improve it as well. IIUC, for LLMs as verifiers we have two tracks:
- Reward models
- LLM as a judge

What seems to differ is the prompt you input and whether or not you have to massage the outputs. Is there a way we can minimize the user code to focus on just that?
> Is there a way we can minimize the user code to focus on just that?

Agreed that the wrapper seems funky; I wanted to test generic models. In practice a user would just pass a `(str, list[str]) -> str` callable to the constructor.

We can bake in a default of "LLM as a judge", e.g. `Judge(model_name, policy_config)`, and reduce the scope of the class?
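As a concrete sketch of that `(str, list[str]) -> str` contract, a default LLM-as-a-judge prompt wrapper could look like the following. The function name and prompt wording are mine, not from the PR:

```python
def llm_judge_prompt(prompt: str, responses: list[str]) -> str:
    """Format a judging prompt asking the model to pick the best response.

    This is only an illustrative default; users can pass any callable
    with the same (str, list[str]) -> str signature.
    """
    numbered = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(responses))
    return (
        f"Question: {prompt}\n"
        f"Candidate answers:\n{numbered}\n"
        "Reply with only the number of the best answer."
    )
```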
apps/vllm/judge.py
Outdated
```python
print("Spawning service...")
policy = await Policy.options(**cfg.services.policy).as_service(**cfg.policy)
evaluate = GenerativeJudge(
```
> I think Jack's intention here is that the generators can also be used as the judge (e.g. how it's done here), but intuitively I feel like they should be kept separate. A judge-specific actor and a reward-model-specific actor make sense too, with all of the boilerplate kept with those implementations.

I wrote this prompt from the deep archives of my mind and I'm also shocked that the prompting worked.
src/forge/actors/generative_judge.py
Outdated
```python
cls,
prompt_wrapper=prompt_wrapper,
output_postprocessor=output_postprocessor,
generator=policy,
```
Running into a pickling issue when passing around ServiceInterfaces:
```
File "/home/jackkhuu/forge/src/forge/actors/generative_judge.py", line 51, in launch
    llm_judge = await judge_procs.spawn(
File "/home/jackkhuu/.fbpkg_conda_envs/forge-a7401c7/lib/python3.10/site-packages/monarch/_src/actor/proc_mesh.py", line 254, in spawn
    return self._spawn_nonblocking(name, Class, *args, **kwargs)
File "/home/jackkhuu/.fbpkg_conda_envs/forge-a7401c7/lib/python3.10/site-packages/monarch/_src/actor/proc_mesh.py", line 366, in _spawn_nonblocking
    return self._spawn_nonblocking_on(self._proc_mesh, name, Class, *args, **kwargs)
File "/home/jackkhuu/.fbpkg_conda_envs/forge-a7401c7/lib/python3.10/site-packages/monarch/_src/actor/proc_mesh.py", line 386, in _spawn_nonblocking_on
    service = ActorMesh._create(
File "/home/jackkhuu/.fbpkg_conda_envs/forge-a7401c7/lib/python3.10/site-packages/monarch/_src/actor/actor_mesh.py", line 1048, in _create
    send(ep, (mesh._class, proc_mesh, controller_controller, *args), kwargs)
File "/home/jackkhuu/.fbpkg_conda_envs/forge-a7401c7/lib/python3.10/site-packages/monarch/_src/actor/actor_mesh.py", line 603, in send
    endpoint._send(args, kwargs, port, selection)
File "/home/jackkhuu/.fbpkg_conda_envs/forge-a7401c7/lib/python3.10/site-packages/monarch/_src/actor/actor_mesh.py", line 465, in _send
    objects, bytes = flatten((args, kwargs), _is_ref_or_mailbox)
File "/home/jackkhuu/.fbpkg_conda_envs/forge-a7401c7/lib/python3.10/site-packages/monarch/_src/actor/pickle.py", line 73, in flatten
    pickler.dump(obj)
File "/home/jackkhuu/.fbpkg_conda_envs/forge-a7401c7/lib/python3.10/site-packages/cloudpickle/cloudpickle.py", line 1303, in dump
    return super().dump(obj)
TypeError: cannot pickle '_asyncio.Future' object
```
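The root cause is reproducible in isolation: a live `asyncio` future (which a hydrated service handle can hold internally) cannot cross a pickle boundary, which is why passing plain config instead of the hydrated handle sidesteps the error. A minimal repro, independent of Forge or Monarch:

```python
import asyncio
import pickle


def can_pickle_future() -> bool:
    """Try to pickle a live asyncio Future; return whether it succeeded."""
    loop = asyncio.new_event_loop()
    try:
        fut = loop.create_future()
        try:
            # Raises TypeError: cannot pickle '_asyncio.Future' object
            pickle.dumps(fut)
            return True
        except TypeError:
            return False
    finally:
        loop.close()


print(can_pickle_future())  # False
```

Anything spawned on a remote proc mesh therefore needs its arguments to be plain data (configs, strings, dicts), with live handles reconstructed on the receiving side.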
LLM Judges and Reward Models (LLMs fine-tuned for evals) can both be used as "Verifiers" or "Graders".

This PR creates a `Judge` class which helps manage the pre/post processing that may be required. Judges take as input (prompts + responses) generated from a model, and return the evaluated quality of those samples. Results can then be used to make decisions on which responses to utilize (e.g. to-user or as a training metric).
Testing in progress
Outdated PR: #167