fix(subagents): pause inactivity watchdog while awaiting human approval #1203

Merged
Aaronontheweb merged 6 commits into netclaw-dev:dev from Aaronontheweb:claude-wt-subagent-tool-approval-time
May 28, 2026

@Aaronontheweb
Collaborator

Summary

  • Sub-agents were aborting after their inactivity budget (default 60s) elapsed while a tool-call approval prompt was outstanding. Approval waits don't produce progress events, so the watchdog killed the sub-agent before the human could click Approve.
  • The fix decouples the approval wait along two axes: a counter of in-flight approval waits so SubAgentTimeout re-arms instead of cancelling, and a dedicated external-cancel CancellationTokenSource so the bridge await is cancellable only by explicit external triggers (parent cancellation, SubAgentCancelled).
  • External cancellation (parent passivation, daemon restart, user cancel) still aborts the wait promptly. The watchdog still governs post-approval inactivity.

Changes

SubAgentActor

  • New _pendingApprovalWaits counter (mailbox-thread only) and _externalCts second CTS.
  • SubAgentTimeout re-arms the timer instead of cancelling (the cancel becomes a no-op) while the counter is > 0.
  • ApprovalWaitStarted / ApprovalWaitCompleted self-messages bracket the bridge call in ExecuteToolsAsync via try/finally. First Started cancels the timer outright; last Completed re-arms it.
  • PostStop cancels and disposes both CTSes; Complete is idempotent via a _completed guard.
  • The per-tool catch propagates OperationCanceledException when externalCt fires, so cancellation routes via ToolExecutionFailed instead of being masked as a fake tool error.
  • ApprovalWaitCompleted logs (rather than silently clamps) any counter underflow, per the constitution's no-silent-fallbacks rule.
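The counter behaviour described in this list can be sketched outside the actor as a small state holder. This is only an illustration under stated assumptions: `ApprovalWaitTracker`, `TimeoutAction`, and `TimerArmed` are hypothetical names, not types from the PR, and the real logic lives in `SubAgentActor` message handlers that run on the mailbox thread.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical usage: two parallel approval waits, then an unbalanced Completed.
var warnings = new List<string>();
var tracker = new ApprovalWaitTracker(warnings.Add);

tracker.OnApprovalWaitStarted();            // first wait: timer cancelled outright
tracker.OnApprovalWaitStarted();            // parallel wait: timer already off
Console.WriteLine(tracker.OnTimeout());     // ReArm (a stale timeout does not cancel)
tracker.OnApprovalWaitCompleted();
Console.WriteLine(tracker.TimerArmed);      // False (one wait still pending)
tracker.OnApprovalWaitCompleted();
Console.WriteLine(tracker.TimerArmed);      // True (last wait cleared, timer re-armed)
Console.WriteLine(tracker.OnTimeout());     // Cancel (watchdog governs again)
tracker.OnApprovalWaitCompleted();          // unbalanced: logged, not clamped
Console.WriteLine(warnings.Count);          // 1

public enum TimeoutAction { ReArm, Cancel }

// Illustrative stand-in for the _pendingApprovalWaits bookkeeping.
// Not thread-safe on purpose: the PR keeps the counter mailbox-thread only.
public sealed class ApprovalWaitTracker
{
    private readonly Action<string> _logWarning;
    private int _pendingApprovalWaits;

    public ApprovalWaitTracker(Action<string> logWarning) => _logWarning = logWarning;

    // True while the inactivity timer would be running.
    public bool TimerArmed { get; private set; } = true;

    public void OnApprovalWaitStarted()
    {
        // Only the first in-flight wait needs to stop the timer.
        if (_pendingApprovalWaits++ == 0)
            TimerArmed = false;
    }

    public void OnApprovalWaitCompleted()
    {
        if (_pendingApprovalWaits == 0)
        {
            // Log and return early instead of clamping with Math.Max(0, ...).
            _logWarning("ApprovalWaitCompleted received with no pending approval waits");
            return;
        }

        // Only the last wait to clear re-arms the timer.
        if (--_pendingApprovalWaits == 0)
            TimerArmed = true;
    }

    // What a timeout message should do given the current counter.
    public TimeoutAction OnTimeout() =>
        _pendingApprovalWaits > 0 ? TimeoutAction.ReArm : TimeoutAction.Cancel;
}
```

Keeping the decision in a single `OnTimeout` check means a timeout message that was already queued when the first wait started is harmless, which is one plausible reason for keeping the re-arm path alongside the outright timer cancel.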

Tests

  • 4 new cases in SubAgentActorTests: indefinite-block-on-approval, prompt-cancel-on-external-CT, post-approval retry runs, parallel approvals all paused.
  • New DelayingParentApprovalBridge helper with a deterministic EnteredApprovalWait signal, which replaces Task.Delay race windows in test orchestration. Uses Task.WaitAsync(ct) instead of hand-rolled WhenAny + Register plumbing.
  • Cancellation assertion loosened from `Contains("cancelled")` to `Contains("cancel")` so the test doesn't couple to UK/US spelling of the underlying error message.
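For readers unfamiliar with `Task.WaitAsync(CancellationToken)` (available since .NET 6), the following is a minimal sketch of how it makes a pending approval cancellable by an external token only. The `ApprovalWait` class and the `TaskCompletionSource` standing in for the human decision are assumptions for illustration; they are not the PR's `DelayingParentApprovalBridge`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// A TaskCompletionSource stands in for a human who has not clicked yet.
var approval = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
using var externalCts = new CancellationTokenSource();

Task<string> outcome = ApprovalWait.AwaitAsync(approval.Task, externalCts.Token);
externalCts.Cancel();                          // e.g. parent cancellation

Console.WriteLine(await outcome);              // cancelled
Console.WriteLine(approval.Task.IsCompleted);  // False (only the wait was abandoned)

// Hypothetical helper, not part of the PR.
public static class ApprovalWait
{
    public static async Task<string> AwaitAsync(Task<bool> approval, CancellationToken externalCt)
    {
        try
        {
            // Completes when the approval completes or when externalCt is cancelled.
            bool approved = await approval.WaitAsync(externalCt);
            return approved ? "approved" : "denied";
        }
        catch (OperationCanceledException)
        {
            return "cancelled";
        }
    }
}
```

Because `WaitAsync` abandons the wait rather than the underlying task, the approval source stays untouched, and no manual token registration or extra `TaskCompletionSource` has to be cleaned up.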

Out of scope (deferred)

  • Persisting an `OriginatedFromSubAgent` flag on `ToolApprovalRequested` and short-circuiting cold-recovery clicks for sub-agent approvals. The existing "resolved-but-no-tool-result" abandonment path (`LlmSessionActor.cs:392-405`) handles this — ungracefully — and the parent session's `Processing` state already disables idle timeout (`LlmSessionActor.cs:434`), so a live parent doesn't passivate during a sub-agent run.

Test plan

  • `dotnet test src/Netclaw.Actors.Tests/Netclaw.Actors.Tests.csproj` — 2032 pass, 0 fail
  • `dotnet slopwatch analyze` — 0 issues
  • `./scripts/Add-FileHeaders.ps1 -Verify` — all files have headers
  • Manual smoke: trigger a sub-agent that requires a Bash approval, wait ~5 minutes before clicking Approve, confirm completion

Sub-agents aborted after their inactivity budget (default 60s) elapsed
while a tool-call approval prompt was outstanding, because no progress
events flow into the actor during the human wait. The internal watchdog
cancelled the execution CTS, which propagated into the approval bridge
and threw OperationCanceledException before the user could click.

Decouple the approval wait along two axes: a counter of in-flight
waits that makes SubAgentTimeout re-arm instead of cancel, and a
dedicated external-cancel CTS so the bridge await is cancellable only
by explicit external triggers (parent cancellation, SubAgentCancelled).
The watchdog still governs post-approval inactivity.

Recall-mode review of the prior fix surfaced eight actionable findings:

- Catch-all in ExecuteToolsAsync was swallowing OCE from externalCt as a
  fake "Error: ..." tool result. Re-raise OCE when externalCt fired so
  the cancellation reaches ToolExecutionFailed instead of producing a
  benign-looking result the LLM would continue from.
- Math.Max(0, ...) on the pending-approval counter was a silent fallback
  (forbidden by the constitution). Replaced with an explicit log + early
  return so a future imbalance fails loudly.
- Complete() was not idempotent; a stale handler arriving after Complete
  could send a duplicate SubAgentResult. Added a _completed guard.
- Inactivity timer kept ticking every budget interval during a long
  approval wait. Cancel the timer outright on the first ApprovalWaitStarted
  and re-arm only when the last wait clears.
- Misleading field comment on _externalCts implied the watchdog never
  touched it; in fact Complete() does. Comment rewritten to match.
- Replaced the hand-rolled AwaitWithCancellation in the test bridge with
  Task.WaitAsync(ct) — same semantics, one allocation, no leaked TCS.
- Added an EnteredApprovalWait signal on the test bridge and switched
  the test orchestration off Task.Delay race windows onto the signal.
- Loosened the cancellation-test assertion from "cancelled" (UK) to
  "cancel" so the test no longer couples to a spelling race between
  Complete's message and OperationCanceledException's Message.
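The first finding in the list above (re-raising OCE when externalCt fired) maps naturally onto a C# exception filter. The sketch below is one way to express it, assuming a hypothetical `ToolRunner.RunAsync` wrapper; the actual `ExecuteToolsAsync` code is not shown in this PR description and may differ.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

using var externalCts = new CancellationTokenSource();

// An ordinary tool failure is still reported to the LLM as an error result.
Console.WriteLine(await ToolRunner.RunAsync(
    _ => throw new InvalidOperationException("boom"), externalCts.Token));
// prints "Error: boom"

externalCts.Cancel();
try
{
    // The tool observes the external cancellation.
    await ToolRunner.RunAsync(ct => Task.FromCanceled<string>(ct), externalCts.Token);
}
catch (OperationCanceledException)
{
    Console.WriteLine("cancellation propagated");
}

// Hypothetical wrapper illustrating the per-tool catch; not the PR's code.
public static class ToolRunner
{
    public static async Task<string> RunAsync(
        Func<CancellationToken, Task<string>> tool, CancellationToken externalCt)
    {
        try
        {
            return await tool(externalCt);
        }
        catch (OperationCanceledException) when (externalCt.IsCancellationRequested)
        {
            // External cancellation must reach the caller, not become a tool result.
            throw;
        }
        catch (Exception ex)
        {
            // Any other failure is surfaced as a result the LLM can react to.
            return $"Error: {ex.Message}";
        }
    }
}
```

The filter keeps the catch-all in place for genuine tool errors while guaranteeing that an external cancel is never rewritten into an "Error: ..." string.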
…l-approval-time

# Conflicts:
#	src/Netclaw.Actors.Tests/SubAgents/SubAgentActorTests.cs
#	src/Netclaw.Actors/SubAgents/SubAgentActor.cs
@Aaronontheweb marked this pull request as ready for review May 27, 2026 23:53
@Aaronontheweb enabled auto-merge (squash) May 27, 2026 23:53
@Aaronontheweb added labels May 27, 2026: security (Security-related changes), subagents (spawn_agent, SubAgentActor, definition loader, discovery context layer, and related features), sessions (LLM session actor, turn lifecycle, pipelines)
@Aaronontheweb disabled auto-merge May 27, 2026 23:54
@Aaronontheweb enabled auto-merge (squash) May 27, 2026 23:54
@Aaronontheweb merged commit 132c2e0 into netclaw-dev:dev May 28, 2026
14 checks passed