fix(subagents): pause inactivity watchdog while awaiting human approval#1203
Merged
Aaronontheweb merged 6 commits intoMay 28, 2026
Merged
Conversation
Sub-agents aborted after their inactivity budget (default 60s) elapsed while a tool-call approval prompt was outstanding, because no progress events flow into the actor during the human wait. The internal watchdog cancelled the execution CTS, which propagated into the approval bridge and threw OperationCanceledException before the user could click. Decouple the approval wait along two axes: a counter of in-flight waits that makes SubAgentTimeout re-arm instead of cancel, and a dedicated external-cancel CTS so the bridge await is cancellable only by explicit external triggers (parent cancellation, SubAgentCancelled). The watchdog still governs post-approval inactivity.
Recall-mode review of the prior fix surfaced eight actionable findings: - Catch-all in ExecuteToolsAsync was swallowing OCE from externalCt as a fake "Error: ..." tool result. Re-raise OCE when externalCt fired so the cancellation reaches ToolExecutionFailed instead of producing a benign-looking result the LLM would continue from. - Math.Max(0, ...) on the pending-approval counter was a silent fallback (forbidden by the constitution). Replaced with an explicit log + early return so a future imbalance fails loudly. - Complete() was not idempotent; a stale handler arriving after Complete could send a duplicate SubAgentResult. Added a _completed guard. - Inactivity timer kept ticking every budget interval during a long approval wait. Cancel the timer outright on the first ApprovalWaitStarted and re-arm only when the last wait clears. - Misleading field comment on _externalCts implied the watchdog never touched it; in fact Complete() does. Comment rewritten to match. - Replaced the hand-rolled AwaitWithCancellation in the test bridge with Task.WaitAsync(ct) — same semantics, one allocation, no leaked TCS. - Added an EnteredApprovalWait signal on the test bridge and switched the test orchestration off Task.Delay race windows onto the signal. - Loosened the cancellation-test assertion from "cancelled" (UK) to "cancel" so the test no longer couples to a spelling race between Complete's message and OperationCanceledException's Message.
…l-approval-time # Conflicts: # src/Netclaw.Actors.Tests/SubAgents/SubAgentActorTests.cs # src/Netclaw.Actors/SubAgents/SubAgentActor.cs
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
SubAgentTimeoutre-arms instead of cancelling, and a dedicated external-cancelCancellationTokenSourceso the bridgeawaitis cancellable only by explicit external triggers (parent cancellation,SubAgentCancelled).Changes
SubAgentActor_pendingApprovalWaitscounter (mailbox-thread only) and_externalCtssecond CTS.SubAgentTimeoutre-arms (no-op cancel) while counter > 0.ApprovalWaitStarted/ApprovalWaitCompletedself-messages bracket the bridge call inExecuteToolsAsyncviatry/finally. First Started cancels the timer outright; last Completed re-arms it.PostStopcancels and disposes both CTSes;Completeis idempotent via a_completedguard.OperationCanceledExceptionwhenexternalCtfires, so cancellation routes viaToolExecutionFailedinstead of being masked as a fake tool error.ApprovalWaitCompletedlogs (rather than silently clamps) any counter underflow, per the constitution's no-silent-fallbacks rule.Tests
SubAgentActorTests: indefinite-block-on-approval, prompt-cancel-on-external-CT, post-approval retry runs, parallel approvals all paused.DelayingParentApprovalBridgehelper with anEnteredApprovalWaitdeterministic signal — replacesTask.Delayrace windows in test orchestration. UsesTask.WaitAsync(ct)instead of a hand-rolledWhenAny + Registerplumbing.Out of scope (deferred)
Test plan