Sanity test: tighten agent-mode prompt so the model can't skip the tool call by bryanchen-d · Pull Request #317426 · microsoft/vscode

bryanchen-d · 2026-05-20T00:18:25Z

Context

Follow-up to #317407, which added diagnostics that finally pinpointed the cause of the recurring E2E Production agent mode sanity-test flake (see e.g. build 440655).

The enriched timeout error showed:

items=19 [ChatResponseMarkdownPart ×19]
currentProgress="The error check failed due to an invalid input format.
                 This was just a test, so I won't retry as requested.
                 Let me know if you need anything else!"
| follow-up: kind=resolved

So: backend was healthy, request resolved cleanly, but the model never invoked the tool — it hallucinated a failure and skipped the call. The test asserts exactly one thing: that onWillInvokeTool fires. Any prompt where the model can reasonably decide not to call the tool is, by construction, wrong for what's being asserted.

Why the old prompt drifted into broken

The original prompt was:

You must use the get_errors tool to check the window for errors. It may fail, that's ok, just testing, don't retry.

At write time those hints were guarding against older models (a) being chatty before the first call or (b) entering a retry loop on tool failure — both of which would also fail this test by pushing the first onWillInvokeTool outside the 20 s window. Reasonable then.

But the same words conflate two different concerns — "make the call" and "what to do after the call". Newer model snapshots, which prioritize literal instruction-following, reread the post-call permission as pre-call permission: "the tool will fail anyway and I'm told not to retry — so I'll skip it and just narrate a plausible failure."

Change

The test's actual contract:

Test asserts	Needs from model
`onWillInvokeTool` fires within 20 s	The model invokes the tool.
`getResultPromise` resolves	The model finishes the turn.
`stream.currentProgress` is non-empty	The model emits some text.

That's it. Event.toPromise(onWillInvokeTool) captures the first invocation — whether the tool succeeded, failed, or the agent loop retried is irrelevant to the assertion. So the "don't retry" hint was a runtime/cost optimization, not a correctness requirement, and any hedging language ("it may fail", "it's fine if it errors") just hands the model an escape hatch.

New prompt — unconditional, no permission to skip:

new TestChatRequest(
    `Call the get_errors tool now to check the current window for errors. ` +
    `You must invoke the tool exactly once, then reply with a brief summary of whatever it returned.`
);

"exactly once" prevents both the chatty-before-call and retry-loop failure modes the original prompt was guarding against.
No mention of failure → no escape hatch for the "skip the call and narrate" pattern we saw in 440655.

Diagnostics from #317407 stay in place so the next regression (if any) is still self-explanatory.

Copilot

Pull request overview

Tightens the prompt used by the Copilot “E2E Production agent mode” sanity test so the model cannot reasonably skip the required get_errors tool invocation, reducing flakiness where the request resolves without ever producing a tool-call delta.

Changes:

Rewords the test prompt to require invoking get_errors exactly once before any textual reply.
Clarifies post-call behavior (no retries; summarize whatever the tool returned), while removing wording that could be interpreted as permission to skip the call.
Adds inline comments documenting the motivation for the prompt structure.

Show a summary per file

File	Description
extensions/copilot/src/extension/test/vscode-node/sanity.sanity-test.ts	Updates the agent-mode sanity-test prompt to unconditionally force a single tool invocation before any response text.

Copilot's findings

Files reviewed: 1/1 changed files
Comments generated: 0

…ol call

bryanchen-d requested review from Copilot and roblourens May 20, 2026 00:18

Copilot started reviewing on behalf of bryanchen-d May 20, 2026 00:21 View session

Copilot AI reviewed May 20, 2026

View reviewed changes

Sanity test: tighten agent-mode prompt so the model can't skip the to…

7579173

…ol call

bryanchen-d force-pushed the brchen/sanity-agent-prompt-fix branch from bc8f10d to 7579173 Compare May 20, 2026 00:32

vs-code-engineering Bot assigned bryanchen-d May 20, 2026

roblourens approved these changes May 20, 2026

View reviewed changes

bryanchen-d merged commit 5608b5f into main May 20, 2026
25 checks passed

bryanchen-d deleted the brchen/sanity-agent-prompt-fix branch May 20, 2026 21:59

vs-code-engineering Bot added this to the 1.122.0 milestone May 20, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sanity test: tighten agent-mode prompt so the model can't skip the tool call#317426

Sanity test: tighten agent-mode prompt so the model can't skip the tool call#317426
bryanchen-d merged 1 commit into
mainfrom
brchen/sanity-agent-prompt-fix

bryanchen-d commented May 20, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

bryanchen-d commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Context

Why the old prompt drifted into broken

Change

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Copilot's findings

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

bryanchen-d commented May 20, 2026 •

edited

Loading