eval: add error recovery behavioral eval for observe-diagnose-fix-verify loop #23660

Closed
Kanhaiya76618 wants to merge 2 commits into google-gemini:main from Kanhaiya76618:feat/add-clarification-seeking-eval

@Kanhaiya76618

Summary

Added a behavioral eval that tests the agent's ability
to recover from errors by following the full
observe → diagnose → fix → verify loop.

Why This Matters

Error recovery is a critical agent behavior. When tests
fail or bugs are present, the agent should be able to:

  1. Detect the failure
  2. Diagnose the root cause
  3. Fix the source code
  4. Re-run tests to verify the fix

This eval ensures that behavior is reliable and
does not regress.

Eval Added

error_recovery.eval.ts

  • Sets up a realistic project with an intentional
    bug (subtraction instead of addition)
  • Prompts the agent to run tests, find the failure,
    fix the bug and verify the fix passes
  • Asserts that:
    • Agent ran the test suite at least once
    • Agent ran tests more than once (initial + verify)
    • Agent edited the correct source file
    • Final output indicates success
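
The four assertions above could be expressed over a recorded tool-call log roughly as follows. This is a minimal sketch, not the PR's actual eval: the `ToolCall` shape, the tool names (`run_shell_command`, `write_file`, `replace`), the `file_path` and `command` argument names, and the `checkErrorRecovery` helper are all assumptions made for illustration.

```typescript
// Hypothetical shape of one recorded tool call; the real log type may differ.
interface ToolCall {
  toolRequest: { name: string; args: Record<string, unknown> | string };
}

interface RecoveryCheck {
  ranTests: boolean;
  reranTests: boolean;
  editedSource: boolean;
  reportedSuccess: boolean;
}

// Args may arrive as a JSON string or as an object, so normalize first.
function parseArgs(call: ToolCall): Record<string, unknown> {
  const { args } = call.toolRequest;
  return typeof args === 'string' ? JSON.parse(args) : args;
}

function checkErrorRecovery(calls: ToolCall[], finalOutput: string): RecoveryCheck {
  // Count shell invocations whose command looks like a test run.
  const testRuns = calls.filter((call) => {
    if (call.toolRequest.name !== 'run_shell_command') return false;
    const command = String(parseArgs(call).command ?? '');
    return /\btest\b/.test(command);
  }).length;

  // Look for an edit that targets the source file rather than the test file.
  const editedSource = calls.some((call) => {
    const name = call.toolRequest.name;
    if (name !== 'write_file' && name !== 'replace') return false;
    const filePath = String(parseArgs(call).file_path ?? '').replace(/\\/g, '/');
    return filePath === 'src/math.ts' || filePath.endsWith('/src/math.ts');
  });

  return {
    ranTests: testRuns >= 1,
    reranTests: testRuns >= 2, // initial run plus a verification run
    editedSource,
    reportedSuccess: /pass|success|fixed/i.test(finalOutput),
  };
}
```

Returning the four results separately, rather than a single boolean, makes it easier to see which step of the loop the agent skipped when the eval fails.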

Test Policy

Uses the USUALLY_PASSES policy, since agent behavior may
vary slightly across model versions.
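
The PR does not spell out how USUALLY_PASSES is evaluated. One way a tolerant policy of this kind could work is to run the eval several times and require a minimum pass rate. The sketch below is purely illustrative: the `usuallyPasses` helper and the 0.8 threshold are assumptions, not the repository's implementation.

```typescript
// Hypothetical policy helper: the eval counts as passing if enough trials succeed.
// The default threshold of 0.8 is an assumption chosen for illustration.
function usuallyPasses(trialResults: boolean[], minPassRate = 0.8): boolean {
  if (trialResults.length === 0) return false; // no evidence, no pass
  const passed = trialResults.filter(Boolean).length;
  return passed / trialResults.length >= minPassRate;
}

// Five trials with one failure would still be accepted under this threshold.
console.log(usuallyPasses([true, true, false, true, true]));
```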

Type of Change

  • New behavioral eval

Related Issue

Connected to GSoC #23331 — Behavioral Evals,
Quality, and the OSS Community

@Kanhaiya76618 Kanhaiya76618 requested a review from a team as a code owner March 24, 2026 11:38
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new behavioral evaluation designed to assess an agent's critical error recovery capabilities. It ensures that agents can reliably detect, diagnose, fix, and verify solutions for bugs, thereby preventing regressions in this essential behavior.

Highlights

  • New Behavioral Evaluation: Introduced error_recovery.eval.ts to test an agent's ability to recover from errors.
  • Error Recovery Loop Validation: The evaluation specifically validates the "observe → diagnose → fix → verify" loop for agents.
  • Realistic Bug Simulation: The eval sets up a project with an intentional bug (subtraction instead of addition) to simulate a real-world scenario.
  • Agent Behavior Assertions: Asserts that the agent runs tests multiple times, edits the correct source file (src/math.ts), and indicates a successful fix.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request adds a new behavioral evaluation for the agent's error recovery loop. The overall structure of the test is good, but there is a critical issue in the assertion logic that verifies which file the agent has edited. The check uses incorrect argument names for the file path and is not specific enough, which could lead to the test passing even if the agent modifies the wrong file. I've provided a suggestion to fix this.

Note: Security Review is unavailable for this PR.

Comment on lines +92 to +104
      const fixedSourceFile = editCalls.some((call) => {
        const args =
          typeof call.toolRequest.args === 'string'
            ? JSON.parse(call.toolRequest.args)
            : call.toolRequest.args;
        const targetFile: string = (
          args.path ??
          args.target_file ??
          args.TargetFile ??
          ''
        ).toLowerCase();
        return targetFile.includes('math');
      });

critical

The assertion to verify that the agent edited the correct source file has two issues:

  1. Incorrect Argument Name: The code checks for args.path, args.target_file, and args.TargetFile to get the file path. However, the primary file editing tools like write_file and edit (which implements replace_file_content) use the file_path argument. This means the current check will likely fail to identify the edited file correctly.
  2. Imprecise File Match: The check targetFile.includes('math') is too broad. It would incorrectly pass if the agent modified src/math.test.ts instead of the intended src/math.ts source file.

These issues make the test assertion unreliable. The suggestion below corrects the argument name to include file_path and makes the file path check more specific.

      const fixedSourceFile = editCalls.some((call) => {
        const args =
          typeof call.toolRequest.args === 'string'
            ? JSON.parse(call.toolRequest.args)
            : call.toolRequest.args;
        const targetFile: string = (
          args.file_path ?? // Correct argument for write_file and edit tools
          args.path ??
          args.target_file ??
          args.TargetFile ??
          ''
        ).toLowerCase();
        // Check for the specific source file to avoid matching the test file.
        return targetFile.includes('src/math.ts');
      });
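
The difference between the two checks can be seen on a few sample paths. The sketch below also shows one possible further tightening that is not part of the suggestion: normalizing path separators and matching the end of the path, so that a Windows-style path is accepted and something like `src/math.ts.bak` is not. The helper names and sample paths are illustrative assumptions.

```typescript
// Original check from the PR: any path containing "math" matches.
const looseMatch = (p: string): boolean => p.toLowerCase().includes('math');

// Check from the review suggestion: substring match on the source path.
const suggestedMatch = (p: string): boolean =>
  p.toLowerCase().includes('src/math.ts');

// Hypothetical stricter variant: normalize separators, then match the path suffix.
function strictMatch(p: string): boolean {
  const normalized = p.replace(/\\/g, '/').toLowerCase();
  return normalized === 'src/math.ts' || normalized.endsWith('/src/math.ts');
}

const samplePaths = [
  'src/math.ts',
  'src/math.test.ts', // the test file the loose check wrongly accepts
  'C:\\proj\\src\\math.ts',
  'src/math.ts.bak',
];

for (const p of samplePaths) {
  console.log(p, looseMatch(p), suggestedMatch(p), strictMatch(p));
}
```

Whether the stricter variant is worth the extra code depends on how the tool reports paths. If they are always POSIX-style and relative to the project root, the suggested substring check is likely sufficient.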

@gemini-cli gemini-cli Bot added the priority/p2 (Important but can be addressed in a future release), area/agent (Issues related to Core Agent, Tools, Memory, Sub-Agents, Hooks, Agent Quality), and 🔒 maintainer only (⛔ Do not contribute. Internal roadmap item.) labels Mar 24, 2026
@gemini-cli gemini-cli Bot added the area/platform (Issues related to Build infra, Release mgmt, Testing, Eval infra, Capacity, Quota mgmt) label Apr 6, 2026
@scidomino
Collaborator

We do not accept outside contributions for 🔒 maintainer only issues.

@scidomino scidomino closed this Apr 21, 2026