fix(acp): close notify channel on EOF to prevent stream hang #470
Merged
When the ACP child process dies, the reader task drops the MutexGuard but not the Option<Sender> inside, so rx.recv() in the streaming loop never returns None — it hangs forever, holding the per-connection mutex and leaking the pool slot. Fix: set *sub = None to drop the UnboundedSender, closing the channel. Also add a 10-minute timeout to rx.recv() as defense-in-depth so the streaming loop cannot hang indefinitely even if other bugs exist. Closes #295
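The ownership detail behind this bug can be reproduced in isolation. The sketch below is a dependency-free analogue, not the code from connection.rs: it uses std::sync::mpsc and std::sync::Mutex as stand-ins for tokio's unbounded channel and mutex, and the names Slot, new_slot, eof_drop_guard, eof_clear_slot, and is_closed are hypothetical. It assumes the sender lives in a shared Mutex<Option<Sender>> as the description states.

```rust
use std::sync::mpsc::{channel, Receiver, Sender, TryRecvError};
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for the shared subscriber slot described in the PR.
type Slot = Arc<Mutex<Option<Sender<String>>>>;

/// Builds a slot that holds the only sender, plus the matching receiver.
fn new_slot() -> (Slot, Receiver<String>) {
    let (tx, rx) = channel();
    (Arc::new(Mutex::new(Some(tx))), rx)
}

/// Buggy EOF handling: releases the lock, but the Sender stays in the Option.
fn eof_drop_guard(slot: &Slot) {
    let sub = slot.lock().unwrap();
    drop(sub); // drops only the MutexGuard, not the value it guards
}

/// Fixed EOF handling: overwriting the slot drops the Sender it held.
fn eof_clear_slot(slot: &Slot) {
    let mut sub = slot.lock().unwrap();
    *sub = None; // the old Some(sender) is dropped here, closing the channel
}

/// Returns true once the receiver observes that every sender is gone.
fn is_closed(rx: &Receiver<String>) -> bool {
    matches!(rx.try_recv(), Err(TryRecvError::Disconnected))
}

fn main() {
    let (slot, rx) = new_slot();
    eof_drop_guard(&slot);
    println!("after drop(guard): closed = {}", is_closed(&rx)); // false

    let (slot, rx) = new_slot();
    eof_clear_slot(&slot);
    println!("after *sub = None: closed = {}", is_closed(&rx)); // true
}
```

With the guard-only drop the channel stays open, so a blocking receive would wait indefinitely; clearing the slot is what lets the receiver see the close.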
All PRs must reference a prior Discord discussion to ensure community alignment before implementation. Please edit the PR description to include such a link. This PR will be automatically closed in 3 days if the link is not added.
obrutjack (Collaborator) approved these changes on Apr 19, 2026 and left a comment:
Merge checklist verified:
- ✅ CI all green (cargo check + 7 Docker smoke tests)
- ✅ MERGEABLE, upstream branch
- ✅ Fixes real bug: EOF didn't close notify channel → stream_prompt hangs forever
- ✅ 10min timeout as safety net for unresponsive agents
- ✅ Minimal change (+13/-4), correct logic
- ✅ No version regression
Pending @thepagent code owner review.
Summary
Closes #295
When the ACP child process dies (EOF on stdout), the reader task in connection.rs fails to close the notify channel. This causes rx.recv() in the streaming loop to hang forever, holding the per-connection mutex and permanently leaking the pool slot.
Changes
- src/acp/connection.rs: *sub = None instead of drop(sub), which drops the UnboundedSender and closes the channel
- src/adapter.rs: 10-minute timeout on rx.recv() as defense-in-depth
Root Cause
In the reader task EOF handler, drop(sub) only drops the MutexGuard<Option<UnboundedSender>>. The Option remains Some(sender), so the UnboundedSender is never dropped and rx.recv() never returns None.
Defense-in-Depth
Added a 10-minute timeout around rx.recv() in the streaming loop (adapter.rs). If no notification arrives within 10 minutes, the loop breaks with an "Agent stopped responding" error. This prevents indefinite hangs even if other channel-closing bugs exist.
Scope Note
Issue #295 also describes a global write lock problem in with_connection. That was already fixed in a prior refactor: with_connection now uses a read lock + per-connection mutex. See validation comment for details. This PR only addresses the remaining confirmed bug (notify channel EOF).
Validation
- *sub = None drops the UnboundedSender, which is the standard tokio mpsc channel close mechanism
- tokio::time::timeout wrapping rx.recv() is a well-established pattern for stall detection
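The stall-detection idea can be sketched without a runtime. This is an analogue, not the adapter.rs change: the PR wraps the async rx.recv() in tokio::time::timeout with a 10-minute limit, whereas the sketch below assumes a blocking std::sync::mpsc receiver and uses recv_timeout with a short idle limit so it finishes quickly. StreamEnd and stream_loop are hypothetical names.

```rust
use std::sync::mpsc::{channel, Receiver, RecvTimeoutError};
use std::time::Duration;

/// Why the (hypothetical) streaming loop stopped.
#[derive(Debug, PartialEq)]
enum StreamEnd {
    /// Every sender was dropped: the normal close path.
    Closed,
    /// Nothing arrived within the idle limit: treat the agent as unresponsive.
    Stalled,
}

/// Collects notifications until the channel closes or goes quiet for `idle`.
fn stream_loop(rx: &Receiver<String>, idle: Duration) -> (Vec<String>, StreamEnd) {
    let mut seen = Vec::new();
    loop {
        match rx.recv_timeout(idle) {
            Ok(msg) => seen.push(msg),
            Err(RecvTimeoutError::Disconnected) => return (seen, StreamEnd::Closed),
            Err(RecvTimeoutError::Timeout) => return (seen, StreamEnd::Stalled),
        }
    }
}

fn main() {
    let (tx, rx) = channel();
    tx.send("chunk".to_string()).unwrap();

    // The sender is still alive but silent, so the loop gives up after `idle`.
    let (seen, end) = stream_loop(&rx, Duration::from_millis(50));
    println!("{:?} {:?}", seen, end);

    // Once the sender is dropped, the loop ends immediately via the close path.
    drop(tx);
    let (_, end) = stream_loop(&rx, Duration::from_millis(50));
    println!("{:?}", end);
}
```

The timeout applies to each receive, so it bounds the gap between notifications, not the total length of the stream.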