eval: add error recovery behavioral eval for observe-diagnose-fix-verify loop #23660
Kanhaiya76618 wants to merge 2 commits into google-gemini:main
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a new behavioral evaluation designed to assess an agent's error recovery capabilities. It checks that agents can reliably detect, diagnose, fix, and verify solutions for bugs, which guards against regressions in this behavior.
Code Review
This pull request adds a new behavioral evaluation for the agent's error recovery loop. The overall structure of the test is good, but there is a critical issue in the assertion logic that verifies which file the agent has edited. The check uses incorrect argument names for the file path and is not specific enough, which could lead to the test passing even if the agent modifies the wrong file. I've provided a suggestion to fix this.
Note: Security Review is unavailable for this PR.
const fixedSourceFile = editCalls.some((call) => {
  const args =
    typeof call.toolRequest.args === 'string'
      ? JSON.parse(call.toolRequest.args)
      : call.toolRequest.args;
  const targetFile: string = (
    args.path ??
    args.target_file ??
    args.TargetFile ??
    ''
  ).toLowerCase();
  return targetFile.includes('math');
});
The assertion to verify that the agent edited the correct source file has two issues:
- Incorrect Argument Name: The code checks for args.path, args.target_file, and args.TargetFile to get the file path. However, the primary file editing tools like write_file and edit (which implements replace_file_content) use the file_path argument. This means the current check will likely fail to identify the edited file correctly.
- Imprecise File Match: The check targetFile.includes('math') is too broad. It would incorrectly pass if the agent modified src/math.test.ts instead of the intended src/math.ts source file.
These issues make the test assertion unreliable. The suggestion below corrects the argument name to include file_path and makes the file path check more specific.
const fixedSourceFile = editCalls.some((call) => {
  const args =
    typeof call.toolRequest.args === 'string'
      ? JSON.parse(call.toolRequest.args)
      : call.toolRequest.args;
  const targetFile: string = (
    args.file_path ?? // Correct argument for write_file and edit tools
    args.path ??
    args.target_file ??
    args.TargetFile ??
    ''
  ).toLowerCase();
  // Check for the specific source file to avoid matching the test file.
  return targetFile.includes('src/math.ts');
});
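A possible further tightening, offered as a sketch rather than as part of the suggestion: a substring match on 'src/math.ts' would still accept a path such as src/math.ts.bak, and it would miss paths written with Windows separators. The helper names below (normalizePath, editsTargetFile) are hypothetical and do not exist in the PR. They compare the trailing path segments after normalizing separators, and keep the lowercase comparison used in the original check.

```typescript
// Hypothetical helpers (not from the PR): decide whether a path taken from a
// tool call refers to the expected source file.
function normalizePath(p: string): string {
  // Convert Windows separators, drop a leading "./", and lowercase
  // (lowercasing mirrors the original assertion).
  return p.replace(/\\/g, '/').replace(/^\.\//, '').toLowerCase();
}

function editsTargetFile(rawPath: string, expected: string): boolean {
  const actual = normalizePath(rawPath);
  const want = normalizePath(expected);
  // Accept an exact relative match, or a longer path ending in "/<expected>".
  return actual === want || actual.endsWith('/' + want);
}
```

With this shape, the final line of the assertion could read `return editsTargetFile(targetFile, 'src/math.ts');`, assuming the fixture keeps the source file at that location.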
We do not accept outside contributions for |
Summary
Added a behavioral eval that tests the agent's ability
to recover from errors by following the full
observe → diagnose → fix → verify loop.
Why This Matters
Error recovery is a critical agent behavior. When tests
fail or bugs are present, the agent should be able to
observe the failure, diagnose its cause, apply a fix,
and verify that the fix works.
This eval ensures that behavior is reliable and
does not regress.
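As a rough illustration of how such a loop could be checked from a recorded tool-call log, the sketch below tests for the order "verification run, then edit, then verification run". The ToolCall shape and the followsRecoveryLoop function are assumptions made for this example and are not taken from the eval harness. The diagnose step is not directly observable from tool calls, so the sketch covers only observe, fix, and verify.

```typescript
// Hypothetical shape of a recorded tool call; the real harness may differ.
interface ToolCall {
  name: string;
}

// Returns true when the log contains, in order: a verification run
// (observe), an edit (fix), and a later verification run (verify).
function followsRecoveryLoop(
  calls: ToolCall[],
  isVerify: (c: ToolCall) => boolean,
  isEdit: (c: ToolCall) => boolean,
): boolean {
  const firstVerify = calls.findIndex(isVerify);
  if (firstVerify === -1) return false;
  const editIdx = calls.findIndex((c, i) => i > firstVerify && isEdit(c));
  if (editIdx === -1) return false;
  return calls.some((c, i) => i > editIdx && isVerify(c));
}
```

A caller would supply its own predicates, for example treating a shell command that runs the test suite as a verification run and write_file or edit calls as edits; which tool names apply depends on the agent under test.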
Eval Added
- error_recovery.eval.ts
- bug (subtraction instead of addition)
- fix the bug and verify the fix passes
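For illustration only, a seeded bug of the kind described (subtraction instead of addition) might look like the sketch below. The function names addBuggy, addFixed, and passesAdditionTest are assumptions; the actual fixture contents in the PR are not shown here.

```typescript
// Hypothetical fixture: the seeded bug subtracts where it should add.
function addBuggy(a: number, b: number): number {
  return a - b; // bug: should be a + b
}

// What the agent's fix would be expected to produce.
function addFixed(a: number, b: number): number {
  return a + b;
}

// A minimal check that fails on the buggy version and passes on the fixed one.
function passesAdditionTest(add: (a: number, b: number) => number): boolean {
  return add(2, 3) === 5;
}
```

The input pair (2, 3) is chosen so that the buggy and fixed versions disagree; an input with b equal to 0 would let the bug slip through.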
Test Policy
Uses USUALLY_PASSES policy as agent behavior may
vary slightly across model versions.
Related Issue
Connected to GSoC #23331 — Behavioral Evals,
Quality, and the OSS Community