Judge may be incorrectly penalizing tool use in meeting tasks #374

@kxfan2002

Hi, thanks for maintaining PinchBench. I noticed a possible issue with the judge prompt for meeting tasks.

In my run, many meeting tasks received judge comments such as:

  • used prohibited tools
  • violated instructions by using tools
  • unnecessary tool usage

However, the task_meeting_*.md prompts seem to mainly ask the agent to read the transcript and write the requested output. I did not see instructions that prohibit the tested agent from using tools.

My guess is that the judge may be misinterpreting this judge-side instruction:

Do NOT use any tools

as a restriction on the evaluated agent, rather than only on the judge itself.

Could you clarify the judge prompt so that tool-use restrictions apply only to the judge, unless the task prompt explicitly forbids tool use?
This seems important because normal file read/write tool use is expected for these meeting tasks, so penalizing it may affect the validity of the scores.
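
To illustrate the kind of scoping I mean, here is a minimal sketch. I have not checked how PinchBench actually assembles the judge prompt, so `build_judge_prompt`, the two constants, and the section headers are all hypothetical names made up for this example. The only text taken from the real prompt is the "Do NOT use any tools" line.

```python
# Hypothetical sketch: none of these names come from the PinchBench codebase.

# Rules addressed to the judge only, stated explicitly as such.
JUDGE_ONLY_RULES = (
    "Instructions for you, the judge "
    "(these do NOT apply to the evaluated agent):\n"
    "- Do NOT use any tools while grading."
)

# Explicit statement of what the evaluated agent was allowed to do.
AGENT_SCOPE_NOTE = (
    "The evaluated agent was allowed to use tools (for example, reading and "
    "writing files) unless the task prompt below explicitly forbids it. "
    "Do not penalize tool use that the task prompt does not prohibit."
)


def build_judge_prompt(task_prompt: str, agent_transcript: str) -> str:
    """Assemble a judge prompt that keeps judge-side rules separate
    from the material being graded."""
    sections = [
        JUDGE_ONLY_RULES,
        AGENT_SCOPE_NOTE,
        "=== TASK PROMPT GIVEN TO THE AGENT ===\n" + task_prompt.strip(),
        "=== AGENT TRANSCRIPT TO GRADE ===\n" + agent_transcript.strip(),
    ]
    return "\n\n".join(sections)


if __name__ == "__main__":
    print(build_judge_prompt(
        "Read the meeting transcript and write a summary.",
        "[tool call] read_file(...)\n[tool call] write_file(...)",
    ))
```

The idea is simply that the tool restriction is labeled as applying to the judge, and the agent's permissions are stated separately, so the judge has less room to mix up the two.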

Thanks!
