feat(progress): surface long-running Bash/ScheduleWakeup waits in chat — silent 5–10 min holds look hung

## Context

When a Claude session executes a Bash tool call that internally polls or
sleeps for several minutes (e.g. `until gh run list … grep -q completed; do
sleep 15; done` waiting for a CloudFlare Pages deploy, or a `sleep 600`
waiting for a job, or any `npm run build` on a slow repo), the Untether
chat shows **a single static `▸ Bash` line for the entire duration**.
From the user's phone there is no visual difference between "still
polling" and "frozen". The same pattern applies to `ScheduleWakeup`-driven
`/loop`-style waits and to long-running MCP tool calls.

## Concrete reproduction (live, 2026-05-05 staging)

Session `9905b5fe-7e21-4042-8715-941c9206fa52` on `@hetz_lba1_bot`,
project `lba-web`, chat `-5193338937`. The user reported "session frozen
~40 min". Investigation found two real waits:

1. **10 m 04 s** Bash polling loop:

   ```
   until gh run list --branch main --limit 2 --json status,workflowName \
     --jq '.[] | select(.workflowName == "Deploy Production") | .status' \
     2>/dev/null | grep -q "completed"; do
     sleep 15
     echo "  $(date +%H:%M:%S) Deploy Production: …"
   done
   echo "Deploy Production done."
   ```

   The loop emitted a useful "still alive" line every 15 s on subprocess
   stdout. Untether saw zero of those — the tool result is buffered until
   the Bash call returns, so the `▸ Bash` progress line was static for
   the full 10 minutes.

2. **14 m 00 s** waiting on a `control_request` for an inline approval
   keyboard the user didn't notice (separate UX gap, see follow-up
   comment in this issue thread). Outside scope of this issue.

Plus three CI-poll waits in the same run (3 m + 1.5 m + 2.5 m).

Net effect: the run's Telegram surface looked dead for ~17 minutes of
external-system polling that was actually proceeding normally.

## Related issues

- **#470** — stall-message backoff once `last_event_type=result`. That
  issue solves the post-result-idle silence-then-warn-spam problem;
  this issue is its mid-tool sibling. Together they cover the two
  ways a Claude session can look hung in Telegram. **See section
  "Coordination with #470" below — the stall-suppression work is
  shared.**
- **#289** — full `/loop` support via ScheduleWakeup interception.
  Surfacing the wait time per loop iteration is functionally adjacent;
  if both ship together a 4 m 30 s `/loop` delay should render as a
  live countdown rather than a static action line.
- **#346 / #347** (closed in v0.35.2) — first round of background-task
  surfacing (Monitor / Bash-bg / ScheduleWakeup state). This issue is
  the next layer: not just *that* a long tool is running, but that it
  is *making progress*.

## Proposed work — three coordinated changes

### 1. Long-tool elapsed-time progress timer

When a Bash/ScheduleWakeup/long-MCP tool call exceeds a threshold
(default 60 s), the progress message should update its label to show the
elapsed time and (where available) the tool input snippet:

```
▸ Bash · 7m 12s · still running
  $ until gh run list --branch main … grep -q "completed"; do sleep 15; …
```

Implementation:

- New per-action timer in `MarkdownFormatter.format_action_line()` keyed
  on `action_id`.
- Driven by an anyio task spawned from the Bash/Schedule action handler
  in `runner_bridge.py` that re-renders the action line every
  `[progress] long_tool_render_interval` seconds (default 30 s).
- Cap at one timer per active action; cancel on `tool_result`.
- Respect `[progress] verbosity` and `/verbose` toggle — the elapsed
  string is always shown in `verbose=on`, hidden in `verbose=off` unless
  elapsed exceeds a higher threshold (`long_tool_visible_threshold`,
  default 120 s).

Critical files:

- `src/untether/markdown.py` — `format_action_line`, action timer hook
- `src/untether/runner_bridge.py` — spawn/cancel timer in
  the `ActionEvent` action_started/action_completed branches
- `src/untether/settings.py` — `[progress] long_tool_render_interval`,
  `[progress] long_tool_visible_threshold`
- `tests/test_meta_line.py` — fixture that drives a 5-minute synthetic
  Bash tool, asserts ≥ 8 elapsed-string updates and that the timer is
  cancelled on the tool_result event.

### 2. Surface Bash subprocess stdout as a "last line" on the progress action

Claude Code already streams Bash stdout in its `tool_use_id` partial
events when run in `--verbose --output-format stream-json` (which
Untether uses by default). The line above —
`$(date +%H:%M:%S) Deploy Production: …` — would have been a strong
"still alive" signal. Untether currently discards these partials.

Implementation:

- Extend `claude_schema` (`src/untether/schemas/claude.py`) and
  `translate_claude_event` (`src/untether/runners/claude.py`) to
  preserve the most recent stdout line per active Bash tool call.
- Surface it on the action line as a sub-line:

  ```
  ▸ Bash · 7m 12s · still running
    $ until gh run list … (10 lines suppressed)
    └─ 11:34:12 Deploy Production: in_progress
  ```

- Throttle: at most one update per 5 s per action (configurable via
  `[progress] bash_stdout_render_interval`, default 5).
- Bound: trim the rendered line to ≤ 120 chars.
- Strip ANSI colour codes via the existing `strip_ansi` helper used by
  `_sanitise_stderr` siblings.

Critical files:

- `src/untether/schemas/claude.py` — preserve `bash_output_chunk` partials
- `src/untether/runners/claude.py:translate_claude_event` — feed
  `state.bash_action_stdout[action_id] = line`
- `src/untether/markdown.py:format_action_line` — render the sub-line
- `tests/test_claude_schema.py` and `tests/test_meta_line.py` —
  fixture & rendering coverage

### 3. Coordination with #470 — suppress stall warnings during long-tool waits

Today the stall watchdog
(`runners/claude.py:_subprocess_liveness_stall`) fires every ~3 min
based on `idle_seconds >= stall_threshold` AND
`tree_active/cpu_active`. Once changes 1+2 land, **a long Bash poll
or ScheduleWakeup wait is no longer a "stall"** — the run is
actively progressing.

Both this issue AND #470 should suppress Telegram-side stall warnings
when the subprocess is in one of these expected-idle states:

| State | Suppression rationale |
|---|---|
| `last_event_type == "result"` (post-turn idle) | #470's original case — turn is done, stdin is just held open |
| Active Bash tool with last stdout line within `stall_threshold/2` | the Bash command is producing output, run is alive |
| Active `ScheduleWakeup` with timer fire-time still in the future | the agent explicitly asked to wait — not a stall |
| Active long-MCP-tool call (already partially handled by `mcp_tool_timeout` per #154) | already covered, mention here for completeness |

The structlog `WARNING` event should still fire on each stall threshold
crossing for observability (the `untether-issue-watcher` consumes
those). The change is purely the Telegram surfacing.

New events:

- `progress_edits.stall_long_bash_suppressed` — when active Bash tool
  has fresh stdout
- `progress_edits.stall_schedule_wakeup_suppressed` — when active
  ScheduleWakeup timer is still in the future

This overlaps directly with #470's `progress_edits.stall_post_result_suppressed`
proposal — the three suppression rules should land in the same
`progress_edits.stall_detected` decision branch in `runner_bridge.py`
so the logic stays in one place.

**Action item for #470:** I'll comment on that issue to widen its scope
to include long-tool-wait suppression (this issue) so the changes ship
together rather than as overlapping patches.

## Acceptance criteria

- [ ] A Bash tool call that runs for ≥ 60 s renders an elapsed-time
      string on its action line, refreshing at least every 30 s.
- [ ] If the Bash call produces stdout, the most recent ≤120-char line
      is surfaced as a sub-action under the action line, refreshing
      at least every 5 s while changing.
- [ ] A `ScheduleWakeup` action shows time-until-fire in the action
      line, refreshing at least every 30 s (synergy with #289 if that
      ships first).
- [ ] During an active long-tool wait (Bash with fresh stdout, or
      ScheduleWakeup with future fire time, or post-result stdin idle
      per #470), Telegram stall warnings are suppressed.
- [ ] structlog `subprocess.liveness_stall` warnings still fire on
      each threshold crossing (observability preserved).
- [ ] New `progress_edits.stall_*_suppressed` info logs for each
      suppression branch.
- [ ] No regression on existing fast tools — sub-second tool calls
      render identically (no flicker, no elapsed string).

## Test plan

**Unit** — extend `tests/test_exec_bridge.py` and `tests/test_meta_line.py`:

1. Synthetic 5-minute Bash with periodic stdout — assert elapsed
   string updates ≥ 10 times and stdout sub-line refreshes.
2. Synthetic ScheduleWakeup with `delaySeconds=180` — assert
   countdown renders.
3. Synthetic long Bash + stall threshold cross — assert structlog
   WARNING fires but Telegram surfacing is suppressed.
4. Synthetic post-result + stall threshold cross — assert #470's
   suppression still works (regression).
5. Sub-1-s Bash — assert no elapsed string rendered.

**Integration** — against `@untether_dev_bot`, claude-test chat:

- Send `run a bash command: for i in 1 2 3 4 5 6 7 8 9 10; do echo
  "tick $i $(date +%H:%M:%S)"; sleep 12; done` → observe action line
  count up over 2 min with rolling stdout sub-line; no stall warning.
- Send `/loop … (whatever the syntax is by the time this ships)` and
  let it self-pace → observe countdown per iteration.

## Out of scope

- The 14-min approval-keyboard-wait surfaced in the same lba-web
  session — separate UX issue: chat-side reminder when a
  `control_request` is pending past the stall threshold. Will file
  separately if there's appetite.
- The CloudFlare Pages production-deploy time itself — that's CDN-side,
  Untether can only render the wait better, not shorten it.

## Source

Filed after the v0.35.3 staging dogfood session
`9905b5fe-7e21-4042-8715-941c9206fa52` on 2026-05-05, where a 45-min
run felt frozen but was actually 17 min of healthy CI/CDN polling +
9 min of agent work + a separate 14-min approval-keyboard wait.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(progress): surface long-running Bash/ScheduleWakeup waits in chat — silent 5–10 min holds look hung #481

Context

Concrete reproduction (live, 2026-05-05 staging)

Related issues

Proposed work — three coordinated changes

1. Long-tool elapsed-time progress timer

2. Surface Bash subprocess stdout as a "last line" on the progress action

3. Coordination with #470 — suppress stall warnings during long-tool waits

Acceptance criteria

Test plan

Out of scope

Source

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

State	Suppression rationale
`last_event_type == "result"` (post-turn idle)	#470's original case — turn is done, stdin is just held open
Active Bash tool with last stdout line within `stall_threshold/2`	the Bash command is producing output, run is alive
Active `ScheduleWakeup` with timer fire-time still in the future	the agent explicitly asked to wait — not a stall
Active long-MCP-tool call (already partially handled by `mcp_tool_timeout` per #154)	already covered, mention here for completeness

feat(progress): surface long-running Bash/ScheduleWakeup waits in chat — silent 5–10 min holds look hung #481

Description

Context

Concrete reproduction (live, 2026-05-05 staging)

Related issues

Proposed work — three coordinated changes

1. Long-tool elapsed-time progress timer

2. Surface Bash subprocess stdout as a "last line" on the progress action

3. Coordination with #470 — suppress stall warnings during long-tool waits

Acceptance criteria

Test plan

Out of scope

Source

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions