trackIdleOnPrompt logging #313080
Pull request overview
Improves terminal shell integration reliability for bash and makes xterm shell-integration sequence processing more diagnosable via trace logging, targeting intermittent mid-session SI breakages (issue #313074).
Changes:
- Disable and restore `errexit` (`set -e`) inside bash DEBUG-trap preexec handlers to prevent bash from dropping the DEBUG trap.
- Add `trace` logs for received VS Code (OSC 633) and FinalTerm (OSC 133) sequence types.
Show a summary per file
| File | Description |
|---|---|
| src/vs/workbench/contrib/terminal/common/scripts/shellIntegration-bash.sh | Wrap DEBUG trap handler logic with save/restore of errexit to prevent trap removal under set -e. |
| src/vs/platform/terminal/common/xterm/shellIntegrationAddon.ts | Add trace logging of received shell integration sequence types for debugging SI dropouts. |
Copilot's findings
Comments suppressed due to low confidence (1)
src/vs/platform/terminal/common/xterm/shellIntegrationAddon.ts:445
- Same as above: `_handleVSCodeSequence` does a parse purely for logging, but `_doHandleVSCodeSequence` immediately repeats `indexOf`/`substring` to derive `command`. If trace logs are meant to be cheap, consider logging from the already-parsed `command` (or gating by log level) to avoid duplicated parsing.
const argsIndex = data.indexOf(';');
const sequenceType = argsIndex === -1 ? data : data.substring(0, argsIndex);
this._logService.trace(`ShellIntegrationAddon#_handleVSCodeSequence: received sequence ${sequenceType}`);
const didHandle = this._doHandleVSCodeSequence(data);
- Files reviewed: 2/2 changed files
- Comments generated: 2
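The suppressed comment's suggestion (parse once, log from the parsed value, and optionally gate on log level) could look roughly like the sketch below. `SketchLogger`, `TRACE_LEVEL`, `parseSequence`, and `handleSequence` are hypothetical names used only for illustration; they are not the actual `ILogService` or `ShellIntegrationAddon` APIs, and the numeric log levels are an assumption.

```typescript
// Hypothetical logger shape; NOT the real VS Code ILogService interface.
interface SketchLogger {
  level: number; // assumption: lower number means more verbose
  trace(message: string): void;
}

const TRACE_LEVEL = 1; // assumed numeric value for "trace"

interface ParsedSequence {
  command: string;
  args: string[];
}

// Split a payload such as "E;ls -la;nonce" into sequence type and arguments, once.
function parseSequence(data: string): ParsedSequence {
  const argsIndex = data.indexOf(';');
  if (argsIndex === -1) {
    return { command: data, args: [] };
  }
  return {
    command: data.substring(0, argsIndex),
    args: data.substring(argsIndex + 1).split(';'),
  };
}

// Log from the already-parsed command, and only build the message when trace is enabled.
function handleSequence(data: string, logger: SketchLogger): ParsedSequence {
  const parsed = parseSequence(data);
  if (logger.level <= TRACE_LEVEL) {
    logger.trace(`received sequence ${parsed.command}`);
  }
  return parsed;
}
```

With this shape the `indexOf`/`substring` work happens in one place, and a logger configured above trace level never pays for string interpolation.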
When shell integration breaks mid-session and the model doesn't specify a timeout, the foreground race had no timeout at all, leading to indefinite hangs (observed as 1-hour eval timeouts).
Root cause analysis of eval run 25073061392 (20 hanging tasks):
- For sync mode, args.timeout is set to 0 when the model omits it, so timeoutValue > 0 is false and no timeout race candidate is added
- trackIdleOnPrompt's fallback timers should resolve, but onData events may not fire reliably (xterm write callbacks depend on requestAnimationFrame, which can be throttled in headless environments)
- When onData does fire with a 633;C sequence, state transitions to Executing, neutering both initialFallbackScheduler and promptFallbackScheduler, and executingFallbackScheduler can be indefinitely rescheduled by periodic data events
Changes:
1. Add 5-minute safety-net timeout to foreground race when no model timeout is present, preventing indefinite hangs regardless of root cause
2. Add INFO-level logging to trackIdleOnPrompt state machine (state transitions, timer fires, data event counts) so the next eval run will reveal exactly why fallback timers fail to resolve
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Legitimate foreground commands (compilations, test suites) can easily exceed 5 minutes. The safety-net timeout would interrupt them, returning incomplete output and potentially causing the model to take wrong actions. The set -e bash protection and trackIdleOnPrompt logging are the real fixes. For eval environments, the model should specify an explicit timeout.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
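To make the timeout discussion in these two commit messages concrete, here is a minimal sketch of how an omitted timeout can leave a race with no timeout candidate, and what the (later dropped) safety net would have changed. `resolveTimeoutMs`, `raceWithTimeout`, and `SAFETY_NET_MS` are hypothetical names; the real logic in the terminal tool is not shown in this PR page and may be structured differently.

```typescript
// Hypothetical constant mirroring the 5-minute safety net described above.
const SAFETY_NET_MS = 5 * 60 * 1000;

// Decide which timeout (if any) takes part in the race.
// Assumption: an omitted model timeout is treated as 0, as described for sync mode.
function resolveTimeoutMs(modelTimeoutMs: number | undefined, useSafetyNet: boolean): number | undefined {
  const timeoutValue = modelTimeoutMs ?? 0;
  if (timeoutValue > 0) {
    return timeoutValue; // explicit model timeout always wins
  }
  // Without a safety net there is no timeout candidate at all.
  return useSafetyNet ? SAFETY_NET_MS : undefined;
}

// Race the command against an optional timeout; with no timeout the work promise is returned as-is.
function raceWithTimeout<T>(work: Promise<T>, timeoutMs: number | undefined): Promise<T | 'timeout'> {
  if (timeoutMs === undefined) {
    return work; // can hang forever if `work` never settles
  }
  let handle: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<'timeout'>(resolve => {
    handle = setTimeout(() => resolve('timeout'), timeoutMs);
  });
  return Promise.race([work, timeout]).finally(() => {
    if (handle !== undefined) {
      clearTimeout(handle); // avoid leaking the timer once the race settles
    }
  });
}
```

The trade-off the second commit describes is visible here: passing `useSafetyNet = true` bounds every foreground command at five minutes, which also cuts off long but legitimate builds and test runs.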
⏳ Queued vscode build for
/requires-eval-assessment terminalbench2 gpt-5.4,claude-opus-4.6,claude-opus-4.7
🚀 Queued eval-assessment publish build for
🔬 Queued eval-assessment benchmark for
Results will be posted back here when the run completes.
✅ Eval-assessment build published.
📊 Eval-assessment benchmark complete.
Analysis Results
Resolution Rate
Token Usage
Step Counts
Fixes #313074
Problem
Eval run 25073061392 (Ubuntu 24.04, bash 5.2) had 20 tasks that hung indefinitely (1-hour eval timeout). The `trackIdleOnPrompt` state machine, which detects when a command finishes by monitoring shell integration sequences and data idle, is currently a complete black box in the logs. We can see commands start and finish, but nothing about internal state transitions, timer fires, or why the state machine failed to resolve.
Observed behavior
After certain commands (often containing `set -e`, but not exclusively), shell integration breaks mid-session. Subsequent commands return `exitCode=undefined, result.length=0` and resolve only via fallback timers (~11s). Eventually a command hangs forever because no fallback fires.
Log evidence (`winning-avg-corewars` task): the first two `set -e` commands complete normally (`exitCode=0`), the third breaks SI, and the final command hangs indefinitely.
What we don't know
The April 6 eval (run 24049820070) had 38 `set -e` tasks with zero hangs on an older build. The April 28 eval (run 25073061392) had 26 `set -e` tasks with 20 hangs. The `set -e` correlation is strong but doesn't fully explain the regression: something else changed between these builds. Without visibility into `trackIdleOnPrompt`'s internal state, we can't determine the exact failure mode.
Changes
1. `trackIdleOnPrompt` diagnostic logging (`executeStrategy.ts`)
Add INFO-level logging to the state machine that's currently invisible in logs:
- State transitions: `Initial -> Executing (sequence C)`, `Executing -> PromptAfterExecuting (sequence A)`
- Timer fires: `Initial fallback fired, no data events received`, `Executing fallback fired after 30s data-idle`
- Data event counts: `dataEvents=N` on each timer fire
This will make the next eval run fully diagnosable: we'll see exactly where and why commands hang.
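As a rough illustration of the kind of logging described in change 1, the sketch below models a tiny idle tracker that logs state transitions and reports the data-event count whenever a fallback timer fires. `createIdleTracker` and its methods are hypothetical, the sequence detection is deliberately simplified, and the message wording only approximates the examples above; none of this is the actual `executeStrategy.ts` code.

```typescript
type TrackerState = 'Initial' | 'Executing' | 'PromptAfterExecuting';

// Hypothetical factory; `log` stands in for an INFO-level logger.
function createIdleTracker(log: (message: string) => void) {
  let state: TrackerState = 'Initial';
  let dataEvents = 0;

  // Log every transition together with the sequence that triggered it.
  const transition = (next: TrackerState, sequence: string): void => {
    log(`${state} -> ${next} (sequence ${sequence})`);
    state = next;
  };

  return {
    // Called for every chunk of terminal data (simplified 633;C / 633;A detection).
    onData(data: string): void {
      dataEvents++;
      if (state === 'Initial' && data.includes('633;C')) {
        transition('Executing', 'C');
      } else if (state === 'Executing' && data.includes('633;A')) {
        transition('PromptAfterExecuting', 'A');
      }
    },
    // Called when a fallback timer fires; records which state we were stuck in.
    onFallbackTimer(): void {
      log(`${state} fallback fired, dataEvents=${dataEvents}`);
    },
    getState: (): TrackerState => state,
  };
}
```

A log line such as `Initial fallback fired, dataEvents=0` would point at missing `onData` events, while a large count in the `Executing` state would suggest the idle timer keeps being rescheduled.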
2. Bash SI `set -e` protection (`shellIntegration-bash.sh`)
Wrap all 4 SI handler functions with `set +e` / restore + `return 0`:
- `__vsc_prompt_cmd`: main PROMPT_COMMAND handler
- `__vsc_prompt_cmd_original`: wrapper for user's original PROMPT_COMMAND
- `__vsc_preexec_only`: DEBUG trap preexec (emits `633;C`)
- `__vsc_preexec_all`: DEBUG trap combined handler
On all bash versions, the DEBUG trap is NOT protected from `set -e`: if any statement returns non-zero, the handler aborts before emitting `633;C` (command start). On macOS bash 3.2, PROMPT_COMMAND is also unprotected. This fix hardens both paths. While `set -e` may not be the sole root cause (see above), this is good defense-in-depth.
3. Tool description update (`runInTerminalTool.ts`)
Instruct the model to pair `set -e` with `set +e` at the end of bash commands, reducing the chance of `set -e` persisting across commands.
4. Trace log placement fix (`shellIntegrationAddon.ts`)
Move trace logs from `_handleFinalTermSequence`/`_handleVSCodeSequence` into `_doHandle*` methods per review feedback.