Sanity test: tighten agent-mode prompt so the model can't skip the tool call#317426
Merged
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
Tightens the prompt used by the Copilot “E2E Production agent mode” sanity test so the model cannot reasonably skip the required get_errors tool invocation, reducing flakiness where the request resolves without ever producing a tool-call delta.
Changes:
- Rewords the test prompt to require invoking
get_errorsexactly once before any textual reply. - Clarifies post-call behavior (no retries; summarize whatever the tool returned), while removing wording that could be interpreted as permission to skip the call.
- Adds inline comments documenting the motivation for the prompt structure.
Show a summary per file
| File | Description |
|---|---|
| extensions/copilot/src/extension/test/vscode-node/sanity.sanity-test.ts | Updates the agent-mode sanity-test prompt to unconditionally force a single tool invocation before any response text. |
Copilot's findings
- Files reviewed: 1/1 changed files
- Comments generated: 0
bc8f10d to
7579173
Compare
roblourens
approved these changes
May 20, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Context
Follow-up to #317407, which added diagnostics that finally pinpointed the cause of the recurring
E2E Production agent modesanity-test flake (see e.g. build 440655).The enriched timeout error showed:
So: backend was healthy, request resolved cleanly, but the model never invoked the tool — it hallucinated a failure and skipped the call. The test asserts exactly one thing: that
onWillInvokeToolfires. Any prompt where the model can reasonably decide not to call the tool is, by construction, wrong for what's being asserted.Why the old prompt drifted into broken
The original prompt was:
At write time those hints were guarding against older models (a) being chatty before the first call or (b) entering a retry loop on tool failure — both of which would also fail this test by pushing the first
onWillInvokeTooloutside the 20 s window. Reasonable then.But the same words conflate two different concerns — "make the call" and "what to do after the call". Newer model snapshots, which prioritize literal instruction-following, reread the post-call permission as pre-call permission: "the tool will fail anyway and I'm told not to retry — so I'll skip it and just narrate a plausible failure."
Change
The test's actual contract:
onWillInvokeToolfires within 20 sgetResultPromiseresolvesstream.currentProgressis non-emptyThat's it.
Event.toPromise(onWillInvokeTool)captures the first invocation — whether the tool succeeded, failed, or the agent loop retried is irrelevant to the assertion. So the "don't retry" hint was a runtime/cost optimization, not a correctness requirement, and any hedging language ("it may fail", "it's fine if it errors") just hands the model an escape hatch.New prompt — unconditional, no permission to skip:
Diagnostics from #317407 stay in place so the next regression (if any) is still self-explanatory.