Skip to content

trackIdleOnPrompt logging#313080

Merged
meganrogge merged 5 commits into
mainfrom
meganrogge/fix-si-flake
Apr 29, 2026
Merged

trackIdleOnPrompt logging#313080
meganrogge merged 5 commits into
mainfrom
meganrogge/fix-si-flake

Conversation

@meganrogge
Copy link
Copy Markdown
Collaborator

@meganrogge meganrogge commented Apr 28, 2026

Fixes #313074

Problem

Eval run 25073061392 (Ubuntu 24.04, bash 5.2) had 20 tasks that hung indefinitely (1-hour eval timeout). The trackIdleOnPrompt state machine — which detects when a command finishes by monitoring shell integration sequences and data idle — is currently a complete black box in the logs. We can see commands start and finish, but nothing about internal state transitions, timer fires, or why the state machine failed to resolve.

Observed behavior

After certain commands (often containing set -e, but not exclusively), shell integration breaks mid-session. Subsequent commands return exitCode=undefined, result.length=0 and resolve only via fallback timers (~11s). Eventually a command hangs forever because no fallback fires.

Log evidencewinning-avg-corewars task: the first two set -e commands complete normally (exitCode=0), the third breaks SI, and the final command hangs indefinitely:

19:42:01.271 Finished with exitCode `0`      ← set -e command #1, SI working
19:43:17.339 Finished with exitCode `0`      ← set -e command #2, SI working  
19:43:46.945 Shell integration failed to add capabilities within 10 seconds
19:44:07.896 Finished with exitCode `undefined`, result.length `1100`  ← SI broken
...
19:45:28.819 Using `rich` execute strategy    ← final command
                                               ← NO "Finished" line. Hangs forever.

What we don't know

The April 6 eval (run 24049820070) had 38 set -e tasks with zero hangs on an older build. The April 28 eval (run 25073061392) had 26 set -e tasks with 20 hangs. The set -e correlation is strong but doesn't fully explain the regression — something else changed between these builds. Without visibility into trackIdleOnPrompt's internal state, we can't determine the exact failure mode.

Changes

1. trackIdleOnPrompt diagnostic logging (executeStrategy.ts)

Add INFO-level logging to the state machine that's currently invisible in logs:

  • State transitions: Initial -> Executing (sequence C), Executing -> PromptAfterExecuting (sequence A)
  • Timer fires: Initial fallback fired, no data events received, Executing fallback fired after 30s data-idle
  • Data event counts: dataEvents=N on each timer fire

This will make the next eval run fully diagnosable — we'll see exactly where and why commands hang.

2. Bash SI set -e protection (shellIntegration-bash.sh)

Wrap all 4 SI handler functions with set +e / restore + return 0:

  • __vsc_prompt_cmd — main PROMPT_COMMAND handler
  • __vsc_prompt_cmd_original — wrapper for user's original PROMPT_COMMAND
  • __vsc_preexec_only — DEBUG trap preexec (emits 633;C)
  • __vsc_preexec_all — DEBUG trap combined handler

On all bash versions, the DEBUG trap is NOT protected from set -e — if any statement returns non-zero, the handler aborts before emitting 633;C (command start). On macOS bash 3.2, PROMPT_COMMAND is also unprotected. This fix hardens both paths. While set -e may not be the sole root cause (see above), this is good defense-in-depth.

3. Tool description update (runInTerminalTool.ts)

Instruct the model to pair set -e with set +e at the end of bash commands, reducing the chance of set -e persisting across commands.

4. Trace log placement fix (shellIntegrationAddon.ts)

Move trace logs from _handleFinalTermSequence/_handleVSCodeSequence into _doHandle* methods per review feedback.

Copilot AI review requested due to automatic review settings April 28, 2026 19:18
@meganrogge meganrogge self-assigned this Apr 28, 2026
@meganrogge meganrogge added this to the 1.119.0 milestone Apr 28, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves terminal shell integration reliability for bash and makes xterm shell-integration sequence processing more diagnosable via trace logging, targeting intermittent mid-session SI breakages (issue #313074).

Changes:

  • Disable and restore errexit (set -e) inside bash DEBUG-trap preexec handlers to prevent bash from dropping the DEBUG trap.
  • Add trace logs for received VS Code (OSC 633) and FinalTerm (OSC 133) sequence types.
Show a summary per file
File Description
src/vs/workbench/contrib/terminal/common/scripts/shellIntegration-bash.sh Wrap DEBUG trap handler logic with save/restore of errexit to prevent trap removal under set -e.
src/vs/platform/terminal/common/xterm/shellIntegrationAddon.ts Add trace logging of received shell integration sequence types for debugging SI dropouts.

Copilot's findings

Comments suppressed due to low confidence (1)

src/vs/platform/terminal/common/xterm/shellIntegrationAddon.ts:445

  • Same as above: _handleVSCodeSequence does a parse purely for logging, but _doHandleVSCodeSequence immediately repeats indexOf/substring to derive command. If trace logs are meant to be cheap, consider logging from the already-parsed command (or gating by log level) to avoid duplicated parsing.
		const argsIndex = data.indexOf(';');
		const sequenceType = argsIndex === -1 ? data : data.substring(0, argsIndex);
		this._logService.trace(`ShellIntegrationAddon#_handleVSCodeSequence: received sequence ${sequenceType}`);
		const didHandle = this._doHandleVSCodeSequence(data);
  • Files reviewed: 2/2 changed files
  • Comments generated: 2

Comment thread src/vs/workbench/contrib/terminal/common/scripts/shellIntegration-bash.sh Outdated
Comment thread src/vs/platform/terminal/common/xterm/shellIntegrationAddon.ts Outdated
@meganrogge meganrogge force-pushed the meganrogge/fix-si-flake branch from 736f57f to ce6f69a Compare April 28, 2026 19:31
@meganrogge meganrogge changed the title terminal: protect bash SI DEBUG trap from set -e and add trace logging terminal: protect bash SI from set -e in PROMPT_COMMAND and add trace logging Apr 28, 2026
@meganrogge meganrogge force-pushed the meganrogge/fix-si-flake branch from ce6f69a to 70a15a1 Compare April 28, 2026 21:49
@meganrogge meganrogge changed the title terminal: protect bash SI from set -e in PROMPT_COMMAND and add trace logging add logging Apr 28, 2026
@meganrogge meganrogge force-pushed the meganrogge/fix-si-flake branch from 70a15a1 to 3cf1ee8 Compare April 28, 2026 21:50
@meganrogge meganrogge force-pushed the meganrogge/fix-si-flake branch from 3cf1ee8 to 4aebd90 Compare April 28, 2026 21:51
@meganrogge meganrogge enabled auto-merge (squash) April 28, 2026 21:52
When shell integration breaks mid-session and the model doesn't specify
a timeout, the foreground race had no timeout at all — leading to
indefinite hangs (observed as 1-hour eval timeouts).

Root cause analysis of eval run 25073061392 (20 hanging tasks):
- For sync mode, args.timeout is set to 0 when model omits it, so
  timeoutValue > 0 is false and no timeout race candidate is added
- trackIdleOnPrompt's fallback timers should resolve, but onData events
  may not fire reliably (xterm write callbacks depend on
  requestAnimationFrame which can be throttled in headless environments)
- When onData does fire with a 633;C sequence, state transitions to
  Executing — neutering both initialFallbackScheduler and
  promptFallbackScheduler — and executingFallbackScheduler can be
  indefinitely rescheduled by periodic data events

Changes:
1. Add 5-minute safety-net timeout to foreground race when no model
   timeout is present, preventing indefinite hangs regardless of root
   cause
2. Add INFO-level logging to trackIdleOnPrompt state machine (state
   transitions, timer fires, data event counts) so the next eval run
   will reveal exactly why fallback timers fail to resolve

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@meganrogge meganrogge changed the title add logging Fix terminal SI flake: set -e protection, foreground safety timeout, trackIdleOnPrompt logging Apr 28, 2026
meganrogge and others added 2 commits April 28, 2026 19:36
Legitimate foreground commands (compilations, test suites) can easily
exceed 5 minutes. The safety-net timeout would interrupt them, returning
incomplete output and potentially causing the model to take wrong actions.

The set -e bash protection and trackIdleOnPrompt logging are the real
fixes. For eval environments, the model should specify an explicit timeout.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@meganrogge meganrogge changed the title Fix terminal SI flake: set -e protection, foreground safety timeout, trackIdleOnPrompt logging Fix terminal SI flake: set -e protection + trackIdleOnPrompt diagnostic logging Apr 28, 2026
@meganrogge meganrogge changed the title Fix terminal SI flake: set -e protection + trackIdleOnPrompt diagnostic logging trackIdleOnPrompt logging Apr 28, 2026
@meganrogge meganrogge added the ~requires-eval-assessment Evals will be run and will generate a report upon completion label Apr 29, 2026
@vs-code-engineering
Copy link
Copy Markdown
Contributor

⏳ Queued vscode build for 04eda7d278ea10d520a66dd8cf289e80edf99bf5 (step 1/2).

@meganrogge meganrogge removed the ~requires-eval-assessment Evals will be run and will generate a report upon completion label Apr 29, 2026
@meganrogge
Copy link
Copy Markdown
Collaborator Author

/requires-eval-assessment terminalbench2 gpt-5.4,claude-opus-4.6,claude-opus-4.7

@vs-code-engineering vs-code-engineering Bot added the ~requires-eval-assessment Evals will be run and will generate a report upon completion label Apr 29, 2026
@vs-code-engineering
Copy link
Copy Markdown
Contributor

🚀 Queued eval-assessment publish build for c1548822493ed9e25a5a45b2c2c892be6f191973 (step 2/2).

@vs-code-engineering
Copy link
Copy Markdown
Contributor

🔬 Queued eval-assessment benchmark for b969e48125.

Results will be posted back here when the run completes.

@vs-code-engineering
Copy link
Copy Markdown
Contributor

✅ Eval-assessment build published.

@vs-code-engineering vs-code-engineering Bot removed the ~requires-eval-assessment Evals will be run and will generate a report upon completion label Apr 29, 2026
@meganrogge meganrogge merged commit 8f9e979 into main Apr 29, 2026
26 checks passed
@meganrogge meganrogge deleted the meganrogge/fix-si-flake branch April 29, 2026 15:12
@vs-code-engineering
Copy link
Copy Markdown
Contributor

📊 Eval-assessment benchmark complete.

Analysis Results

Resolution Rate

Benchmark Total Cases Passed Failed Resolved Rate
terminalbench2 89 54 35 60.67%

Token Usage

Metric Value
Total Tokens 44,873,263
Input Tokens 44,007,532
Output Tokens 865,731
Cached Tokens 41,681,032

Step Counts

Metric Value
Total Steps 1,442
Mean Steps/Instance 16.20

@vs-code-engineering
Copy link
Copy Markdown
Contributor

📊 Eval-assessment benchmark complete.

Analysis Results

Resolution Rate

Benchmark Total Cases Passed Failed Resolved Rate
terminalbench2 89 47 42 52.81%

Token Usage

Metric Value
Total Tokens 68,706,378
Input Tokens 67,780,506
Output Tokens 925,872
Cached Tokens 55,488,768

Step Counts

Metric Value
Total Steps 1,513
Mean Steps/Instance 17.00

@vs-code-engineering
Copy link
Copy Markdown
Contributor

📊 Eval-assessment benchmark complete.

Analysis Results

Resolution Rate

Benchmark Total Cases Passed Failed Resolved Rate
terminalbench2 89 54 35 60.67%

Token Usage

Metric Value
Total Tokens 66,437,478
Input Tokens 65,261,731
Output Tokens 1,175,747
Cached Tokens 54,700,590

Step Counts

Metric Value
Total Steps 1,988
Mean Steps/Instance 22.34

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Shell integration intermittently breaks mid-session in agent terminal commands

3 participants