Skip to content

Sanity test: tighten agent-mode prompt so the model can't skip the tool call#317426

Merged
bryanchen-d merged 1 commit into
mainfrom
brchen/sanity-agent-prompt-fix
May 20, 2026
Merged

Sanity test: tighten agent-mode prompt so the model can't skip the tool call#317426
bryanchen-d merged 1 commit into
mainfrom
brchen/sanity-agent-prompt-fix

Conversation

@bryanchen-d
Copy link
Copy Markdown
Contributor

@bryanchen-d bryanchen-d commented May 20, 2026

Context

Follow-up to #317407, which added diagnostics that finally pinpointed the cause of the recurring E2E Production agent mode sanity-test flake (see e.g. build 440655).

The enriched timeout error showed:

items=19 [ChatResponseMarkdownPart ×19]
currentProgress="The error check failed due to an invalid input format.
                 This was just a test, so I won't retry as requested.
                 Let me know if you need anything else!"
| follow-up: kind=resolved

So: backend was healthy, request resolved cleanly, but the model never invoked the tool — it hallucinated a failure and skipped the call. The test asserts exactly one thing: that onWillInvokeTool fires. Any prompt where the model can reasonably decide not to call the tool is, by construction, wrong for what's being asserted.

Why the old prompt drifted into broken

The original prompt was:

You must use the get_errors tool to check the window for errors. It may fail, that's ok, just testing, don't retry.

At write time those hints were guarding against older models (a) being chatty before the first call or (b) entering a retry loop on tool failure — both of which would also fail this test by pushing the first onWillInvokeTool outside the 20 s window. Reasonable then.

But the same words conflate two different concerns — "make the call" and "what to do after the call". Newer model snapshots, which prioritize literal instruction-following, reread the post-call permission as pre-call permission: "the tool will fail anyway and I'm told not to retry — so I'll skip it and just narrate a plausible failure."

Change

The test's actual contract:

Test asserts Needs from model
onWillInvokeTool fires within 20 s The model invokes the tool.
getResultPromise resolves The model finishes the turn.
stream.currentProgress is non-empty The model emits some text.

That's it. Event.toPromise(onWillInvokeTool) captures the first invocation — whether the tool succeeded, failed, or the agent loop retried is irrelevant to the assertion. So the "don't retry" hint was a runtime/cost optimization, not a correctness requirement, and any hedging language ("it may fail", "it's fine if it errors") just hands the model an escape hatch.

New prompt — unconditional, no permission to skip:

new TestChatRequest(
    `Call the get_errors tool now to check the current window for errors. ` +
    `You must invoke the tool exactly once, then reply with a brief summary of whatever it returned.`
);
  • "exactly once" prevents both the chatty-before-call and retry-loop failure modes the original prompt was guarding against.
  • No mention of failure → no escape hatch for the "skip the call and narrate" pattern we saw in 440655.

Diagnostics from #317407 stay in place so the next regression (if any) is still self-explanatory.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Tightens the prompt used by the Copilot “E2E Production agent mode” sanity test so the model cannot reasonably skip the required get_errors tool invocation, reducing flakiness where the request resolves without ever producing a tool-call delta.

Changes:

  • Rewords the test prompt to require invoking get_errors exactly once before any textual reply.
  • Clarifies post-call behavior (no retries; summarize whatever the tool returned), while removing wording that could be interpreted as permission to skip the call.
  • Adds inline comments documenting the motivation for the prompt structure.
Show a summary per file
File Description
extensions/copilot/src/extension/test/vscode-node/sanity.sanity-test.ts Updates the agent-mode sanity-test prompt to unconditionally force a single tool invocation before any response text.

Copilot's findings

  • Files reviewed: 1/1 changed files
  • Comments generated: 0

@bryanchen-d bryanchen-d force-pushed the brchen/sanity-agent-prompt-fix branch from bc8f10d to 7579173 Compare May 20, 2026 00:32
@bryanchen-d bryanchen-d merged commit 5608b5f into main May 20, 2026
25 checks passed
@bryanchen-d bryanchen-d deleted the brchen/sanity-agent-prompt-fix branch May 20, 2026 21:59
@vs-code-engineering vs-code-engineering Bot added this to the 1.122.0 milestone May 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants