
fix(acp): close notify channel on EOF to prevent stream hang #470

Merged
thepagent merged 1 commit into main from fix/notify-channel-eof-close
Apr 19, 2026


@masami-agent
Contributor

Summary

Closes #295

When the ACP child process dies (EOF on stdout), the reader task in connection.rs fails to close the notify channel. This causes rx.recv() in the streaming loop to hang forever, holding the per-connection mutex and permanently leaking the pool slot.

Changes

| File | Change |
| --- | --- |
| src/acp/connection.rs | Fix EOF handler: *sub = None instead of drop(sub); this drops the UnboundedSender, closing the channel |
| src/adapter.rs | Add 10-minute timeout to rx.recv() as defense-in-depth |

Root Cause

In the reader task EOF handler:

// Before (bug)
let sub = notify_tx.lock().await;
drop(sub);  // drops MutexGuard, NOT the Option<Sender> inside

drop(sub) only drops the MutexGuard<Option<UnboundedSender>>. The Option remains Some(sender), so the UnboundedSender is never dropped and rx.recv() never returns None.

// After (fix)
let mut sub = notify_tx.lock().await;
*sub = None;  // drops the Sender → channel closes → rx.recv() returns None
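
The difference between the two handlers can be reproduced outside the project. The following is a minimal sketch, not the code from connection.rs: it assumes the shared slot holds the only sender, and it substitutes std::sync::Mutex and std::sync::mpsc for the tokio types so that it runs without external crates. The names NotifySlot, eof_handler_buggy, eof_handler_fixed, new_slot, and is_closed are hypothetical.

```rust
use std::sync::mpsc::{channel, Receiver, Sender, TryRecvError};
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the shared `notify_tx` slot (std types, not tokio).
type NotifySlot = Arc<Mutex<Option<Sender<String>>>>;

/// Mirrors the buggy EOF handler: only the guard is released.
fn eof_handler_buggy(slot: &NotifySlot) {
    let sub = slot.lock().unwrap();
    drop(sub); // releases the lock; the Option inside is still Some(sender)
}

/// Mirrors the fixed EOF handler: the sender itself is dropped.
fn eof_handler_fixed(slot: &NotifySlot) {
    let mut sub = slot.lock().unwrap();
    *sub = None; // drops the only Sender, so the channel disconnects
}

/// Builds a slot that owns the only sender, plus the matching receiver.
fn new_slot() -> (NotifySlot, Receiver<String>) {
    let (tx, rx) = channel();
    (Arc::new(Mutex::new(Some(tx))), rx)
}

/// Reports whether the receiver observes the channel as closed.
fn is_closed(rx: &Receiver<String>) -> bool {
    matches!(rx.try_recv(), Err(TryRecvError::Disconnected))
}
```

Calling eof_handler_buggy on a fresh slot leaves is_closed false, because the sender is still alive inside the Option; calling eof_handler_fixed afterwards makes it true. The same reasoning should carry over to tokio's unbounded channel, where recv() yields None once every sender has been dropped.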

Defense-in-Depth

Added a 10-minute timeout around rx.recv() in the streaming loop (adapter.rs). If no notification arrives within 10 minutes, the loop breaks with an "Agent stopped responding" error. This prevents indefinite hangs even if other channel-closing bugs exist.
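
The stall-detection idea can be sketched as follows. This is an illustration under stated assumptions rather than the adapter.rs change itself: it uses the blocking std::sync::mpsc::Receiver::recv_timeout as a std-only analog of wrapping rx.recv() in tokio::time::timeout, and the function name stream_notifications and its idle_limit parameter are hypothetical.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Collects notifications until the channel closes, or fails if a single
/// wait exceeds `idle_limit` (10 minutes in the PR; configurable here).
fn stream_notifications(
    rx: &Receiver<String>,
    idle_limit: Duration,
) -> Result<Vec<String>, String> {
    let mut seen = Vec::new();
    loop {
        match rx.recv_timeout(idle_limit) {
            // A notification arrived in time: keep streaming.
            Ok(msg) => seen.push(msg),
            // All senders were dropped: the normal end of the stream.
            Err(RecvTimeoutError::Disconnected) => return Ok(seen),
            // Nothing arrived within the limit: give up instead of hanging.
            Err(RecvTimeoutError::Timeout) => {
                return Err("Agent stopped responding".to_string())
            }
        }
    }
}
```

The timeout applies to each wait rather than to the whole stream, so a long but steadily progressing response is not cut off; only a gap longer than idle_limit between notifications ends the loop with an error.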

Scope Note

Issue #295 also describes a global write lock problem in with_connection. That was already fixed in a prior refactor — with_connection now uses a read lock + per-connection mutex. See validation comment for details. This PR only addresses the remaining confirmed bug (notify channel EOF).
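
For readers unfamiliar with the "read lock + per-connection mutex" arrangement, here is a rough sketch of the pattern. It is not the project's with_connection: the Connection and Pool types, their fields, and the closure-based signature are all hypothetical, and std locks stand in for whatever async primitives the real code uses.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};

/// Hypothetical connection state; the real type is not shown in this PR.
struct Connection {
    prompts_handled: u32,
}

/// Hypothetical pool keyed by a session identifier.
struct Pool {
    connections: RwLock<HashMap<String, Arc<Mutex<Connection>>>>,
}

impl Pool {
    /// Runs `f` against one connection without blocking the other entries.
    fn with_connection<R>(
        &self,
        key: &str,
        f: impl FnOnce(&mut Connection) -> R,
    ) -> Option<R> {
        // Hold the pool-wide read lock only long enough to clone the Arc.
        let conn = {
            let map = self.connections.read().unwrap();
            map.get(key).cloned()?
        };
        // The read lock is released here; only this connection is locked.
        let mut guard = conn.lock().unwrap();
        Some(f(&mut *guard))
    }
}
```

With this shape, a stalled call holds only its own connection's mutex, so other connections stay usable. It also shows why the unclosed notify channel still mattered after the refactor: the hung stream kept that one mutex locked and its pool slot occupied.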

Validation

  • Code-level verification: *sub = None drops the UnboundedSender held in the slot; once every sender has been dropped, a tokio mpsc channel closes and rx.recv() returns None
  • tokio::time::timeout wrapping rx.recv() is a well-established pattern for stall detection

When the ACP child process dies, the reader task drops the MutexGuard
but not the Option<Sender> inside, so rx.recv() in the streaming loop
never returns None — it hangs forever, holding the per-connection mutex
and leaking the pool slot.

Fix: set *sub = None to drop the UnboundedSender, closing the channel.

Also add a 10-minute timeout to rx.recv() as defense-in-depth so the
streaming loop cannot hang indefinitely even if other bugs exist.

Closes #295
@masami-agent requested a review from thepagent as a code owner on April 19, 2026, 13:55
@github-actions bot added the closing-soon label (PR missing Discord Discussion URL; will auto-close in 3 days) on Apr 19, 2026
@github-actions

⚠️ This PR is missing a Discord Discussion URL in the body.

All PRs must reference a prior Discord discussion to ensure community alignment before implementation.

Please edit the PR description to include a link like:

Discord Discussion URL: https://discord.com/channels/...

This PR will be automatically closed in 3 days if the link is not added.

Collaborator

@obrutjack left a comment


Merge checklist verified:

  • ✅ CI all green (cargo check + 7 Docker smoke tests)
  • ✅ MERGEABLE, upstream branch
  • ✅ Fixes real bug: EOF didn't close notify channel → stream_prompt hangs forever
  • ✅ 10min timeout as safety net for unresponsive agents
  • ✅ Minimal change (+13/-4), correct logic
  • ✅ No version regression

Pending @thepagent code owner review.

@obrutjack removed the closing-soon label (PR missing Discord Discussion URL; will auto-close in 3 days) on Apr 19, 2026
@thepagent merged commit 91a8094 into main on Apr 19, 2026
9 checks passed


Development

Successfully merging this pull request may close these issues.

bug: with_connection global write lock + unclosed notify channel freezes bot when ACP process goes stale

3 participants