Hi, thanks for maintaining PinchBench. I noticed a possible issue with the judge prompt for meeting tasks.
In my run, many meeting tasks received judge comments such as:
used prohibited tools
violated instructions by using tools
unnecessary tool usage
However, the task_meeting_*.md prompts seem to mainly ask the agent to read the transcript and write the requested output. I did not see instructions that prohibit the tested agent from using tools.
My guess is that the judge may be treating this judge-side instruction:
as a restriction on the evaluated agent, rather than one that applies only to the judge itself.
Could you clarify the judge prompt so that tool-use restrictions apply only to the judge, unless the task prompt explicitly forbids tool use?
This seems important because normal file read/write tool use is expected for these meeting tasks, so penalizing it may affect the validity of the scores.
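To illustrate the kind of clarification I mean, here is a rough sketch of a judge prompt that states the scope explicitly. Everything in it is hypothetical: the names `JUDGE_SCOPE_NOTE` and `build_judge_prompt`, the section headings, and the wording are my own assumptions and are not taken from the PinchBench code. I have not checked how the judge prompt is actually assembled.

```python
# Hypothetical sketch; none of these names or headings come from PinchBench.
JUDGE_SCOPE_NOTE = (
    "The tool-use restrictions in this prompt apply only to you, the judge. "
    "Do not penalize the evaluated agent for using tools unless the task "
    "prompt explicitly forbids tool use."
)


def build_judge_prompt(judge_instructions: str, task_prompt: str, transcript: str) -> str:
    """Assemble a judge prompt that keeps judge-side rules separate from the task."""
    sections = [
        "## Instructions for the judge\n" + judge_instructions.strip(),
        "## Scope of tool-use restrictions\n" + JUDGE_SCOPE_NOTE,
        "## Task prompt given to the evaluated agent\n" + task_prompt.strip(),
        "## Agent transcript to grade\n" + transcript.strip(),
    ]
    # Blank lines between sections make the judge/agent boundary easy to see.
    return "\n\n".join(sections)
```

Keeping the scope note in its own section, before the task prompt, is just one possible layout; the main point is that the judge is told explicitly who the restriction applies to.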
Thanks!