eval: add error recovery behavioral eval for observe-diagnose-fix-verify loop by Kanhaiya76618 · Pull Request #23660 · google-gemini/gemini-cli

Kanhaiya76618 · 2026-03-24T11:38:53Z

Summary

Added a behavioral eval that tests the agent's ability
to recover from errors by following the full
observe → diagnose → fix → verify loop.

Why This Matters

Error recovery is a critical agent behavior. When tests
fail or bugs are present, the agent should be able to:

- Detect the failure
- Diagnose the root cause
- Fix the source code
- Re-run tests to verify the fix

This eval ensures that behavior is reliable and
does not regress.

Eval Added

error_recovery.eval.ts

- Sets up a realistic project with an intentional bug (subtraction instead of addition)
- Prompts the agent to run tests, find the failure, fix the bug, and verify that the fix passes
- Asserts that:
  - Agent ran the test suite at least once
  - Agent ran tests more than once (initial + verify)
  - Agent edited the correct source file
  - Final output indicates success
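
As a rough illustration of the four assertions listed above, here is a minimal, self-contained sketch. It is not the code from error_recovery.eval.ts. The `ToolCall` shape, the `checkRecovery` helper, the `run_shell_command` tool name, and the keyword-based success check are all assumptions made for this example; only the `write_file` and `edit` tool names, the `file_path` argument, and the `src/math.ts` target come from the review discussion in this PR.

```typescript
// Hypothetical shape of one recorded tool call; the real eval harness may differ.
interface ToolCall {
  name: string; // e.g. 'run_shell_command' (assumed), 'edit', 'write_file'
  args: Record<string, unknown>;
}

interface RecoveryCheck {
  ranTests: boolean; // test suite ran at least once
  reranTests: boolean; // test suite ran again (initial + verify)
  editedSource: boolean; // an edit targeted src/math.ts
  reportedSuccess: boolean; // final output mentions success
}

// Hypothetical helper: derives the four checks from a tool-call log and the final output.
function checkRecovery(calls: ToolCall[], finalOutput: string): RecoveryCheck {
  // Count shell commands that mention "test" as a whole word (a crude heuristic).
  const testRuns = calls.filter(
    (c) =>
      c.name === 'run_shell_command' &&
      /\btest\b/.test(String(c.args.command ?? '')),
  );

  // An edit counts only if its path ends with the intended source file.
  const editedSource = calls.some((c) => {
    if (c.name !== 'edit' && c.name !== 'write_file') return false;
    const path = String(c.args.file_path ?? '')
      .replace(/\\/g, '/')
      .toLowerCase();
    return path === 'src/math.ts' || path.endsWith('/src/math.ts');
  });

  return {
    ranTests: testRuns.length >= 1,
    reranTests: testRuns.length >= 2,
    editedSource,
    // Keyword matching is an assumption; it cannot tell "pass" from "did not pass".
    reportedSuccess: /pass|success|fixed/i.test(finalOutput),
  };
}
```

Keeping the four checks as separate fields makes it easier to see which step of the loop the agent skipped when the eval fails.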

Test Policy

Uses the USUALLY_PASSES policy, because agent behavior may
vary slightly across model versions.
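
To make the idea of a tolerant pass policy concrete, the sketch below treats an eval as passing when a minimum fraction of repeated runs succeed. This PR does not state how USUALLY_PASSES is actually evaluated, so the `usuallyPasses` helper and the 0.8 default threshold are assumptions for illustration only.

```typescript
// Hypothetical helper: true when the share of passing runs meets the threshold.
// The 0.8 default is an assumed value, not taken from the project.
function usuallyPasses(results: boolean[], minPassRate = 0.8): boolean {
  if (results.length === 0) return false; // no runs means no evidence of passing
  const passed = results.filter(Boolean).length;
  return passed / results.length >= minPassRate;
}

// Example: nine passing runs out of ten tolerate one flaky failure.
const nineOfTen = usuallyPasses([
  true, true, true, true, true, true, true, true, true, false,
]);
```

A strict policy corresponds to `minPassRate = 1`, where a single failing run fails the eval.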

Type of Change

New behavioral eval

Related Issue

Connected to GSoC #23331 — Behavioral Evals,
Quality, and the OSS Community

gemini-code-assist · 2026-03-24T11:39:06Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new behavioral evaluation designed to assess an agent's critical error recovery capabilities. It ensures that agents can reliably detect, diagnose, fix, and verify solutions for bugs, thereby preventing regressions in this essential behavior.

Highlights

New Behavioral Evaluation: Introduced error_recovery.eval.ts to test an agent's ability to recover from errors.
Error Recovery Loop Validation: The evaluation specifically validates the "observe → diagnose → fix → verify" loop for agents.
Realistic Bug Simulation: The eval sets up a project with an intentional bug (subtraction instead of addition) to simulate a real-world scenario.
Agent Behavior Assertions: Asserts that the agent runs tests multiple times, edits the correct source file (src/math.ts), and indicates a successful fix.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request adds a new behavioral evaluation for the agent's error recovery loop. The overall structure of the test is good, but there is a critical issue in the assertion logic that verifies which file the agent has edited. The check uses incorrect argument names for the file path and is not specific enough, which could lead to the test passing even if the agent modifies the wrong file. I've provided a suggestion to fix this.

Note: Security Review is unavailable for this PR.

gemini-code-assist · 2026-03-24T11:43:02Z

+      const fixedSourceFile = editCalls.some((call) => {
+        const args =
+          typeof call.toolRequest.args === 'string'
+            ? JSON.parse(call.toolRequest.args)
+            : call.toolRequest.args;
+        const targetFile: string = (
+          args.path ??
+          args.target_file ??
+          args.TargetFile ??
+          ''
+        ).toLowerCase();
+        return targetFile.includes('math');
+      });


The assertion to verify that the agent edited the correct source file has two issues:

Incorrect Argument Name: The code checks for args.path, args.target_file, and args.TargetFile to get the file path. However, the primary file editing tools like write_file and edit (which implements replace_file_content) use the file_path argument. This means the current check will likely fail to identify the edited file correctly.

Imprecise File Match: The check targetFile.includes('math') is too broad. It would incorrectly pass if the agent modified src/math.test.ts instead of the intended src/math.ts source file.

These issues make the test assertion unreliable. The suggestion below corrects the argument name to include file_path and makes the file path check more specific.

const fixedSourceFile = editCalls.some((call) => {
  const args =
    typeof call.toolRequest.args === 'string'
      ? JSON.parse(call.toolRequest.args)
      : call.toolRequest.args;
  const targetFile: string = (
    args.file_path ?? // Correct argument for write_file and edit tools
    args.path ??
    args.target_file ??
    args.TargetFile ??
    ''
  ).toLowerCase();
  // Check for the specific source file to avoid matching the test file.
  return targetFile.includes('src/math.ts');
});
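
The difference between the loose and the specific match can be shown with a small standalone sketch. The `editsSourceFile` helper below is hypothetical and not part of the PR. It goes one step further than the suggestion by comparing the end of the normalized path, since a substring check for 'src/math.ts' would still accept paths such as src/math.tsx or src/math.ts.bak.

```typescript
// Hypothetical helper: true only when the path refers to the expected source file.
function editsSourceFile(rawPath: string, expected = 'src/math.ts'): boolean {
  // Normalize Windows separators and case so relative and absolute paths compare consistently.
  const normalized = rawPath.replace(/\\/g, '/').toLowerCase();
  return normalized === expected || normalized.endsWith('/' + expected);
}

// The loose check from the original assertion, kept here for comparison.
function looseMatch(rawPath: string): boolean {
  return rawPath.toLowerCase().includes('math');
}

const looseOnTestFile = looseMatch('src/math.test.ts'); // true: the test file slips through
const strictOnTestFile = editsSourceFile('src/math.test.ts'); // false
const strictOnSource = editsSourceFile('/tmp/project/src/math.ts'); // true
```

Matching on the path suffix accepts both relative and absolute paths, which is useful if the tool receives absolute paths inside a temporary test directory.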

scidomino · 2026-04-21T18:34:43Z

We do not accept outside contributions for 🔒 maintainer only issues.

eval: add error recovery behavioral eval for observe-diagnose-fix-ver…

79c50b7

…ify loop

Kanhaiya76618 requested a review from a team as a code owner March 24, 2026 11:38

gemini-code-assist Bot reviewed Mar 24, 2026

Kanhaiya76618 mentioned this pull request Mar 24, 2026

GSoC Behavioral evals, Quality, and the OSS Community #23331

Open

gemini-cli Bot added the priority/p2 (Important but can be addressed in a future release), area/agent (Issues related to Core Agent, Tools, Memory, Sub-Agents, Hooks, Agent Quality), and 🔒 maintainer only (⛔ Do not contribute. Internal roadmap item.) labels Mar 24, 2026

Kanhaiya76618 mentioned this pull request Mar 24, 2026

docs: fix typos and grammar in evals/README.md and release-confidence.md #23494

Closed

1 task

Merge branch 'main' into feat/add-clarification-seeking-eval

def7ad8

gemini-cli Bot added the area/platform (Issues related to Build infra, Release mgmt, Testing, Eval infra, Capacity, Quota mgmt) label Apr 6, 2026

scidomino closed this Apr 21, 2026

Kanhaiya76618 wants to merge 2 commits into google-gemini:main from Kanhaiya76618:feat/add-clarification-seeking-eval
