Skip to content

refactor(pool): per-connection Arc<Mutex> to unblock concurrent sessions#183

Open
ruan330 wants to merge 1 commit intoopenabdev:mainfrom
ruan330:refactor/per-connection-lock
Open

refactor(pool): per-connection Arc<Mutex> to unblock concurrent sessions#183
ruan330 wants to merge 1 commit intoopenabdev:mainfrom
ruan330:refactor/per-connection-lock

Conversation

@ruan330
Copy link
Copy Markdown

@ruan330 ruan330 commented Apr 10, 2026

Problem

SessionPool::with_connection currently holds the pool's write lock for the entire callback duration. Because stream_prompt in discord.rs runs inside that callback and can take many seconds — or minutes — to drain an ACP turn, every other Discord thread is blocked from touching the pool while one session is streaming, even for get_or_create on a completely unrelated thread_id (which only needs the read lock).

In production on our fork (which runs a few dozen concurrent Discord threads against one broker), this manifests as:

Fix

Wrap each AcpConnection in Arc<Mutex<_>>:

  • with_connection takes the pool's read lock only long enough to clone the Arc for the target thread_id, then releases it.
  • It then locks that specific connection's mutex for the duration of the callback.
  • Other sessions continue to stream concurrently.
  • get_or_create on unrelated thread_ids is no longer blocked.
  • Rebuilds still take the write lock briefly (correct — that's a structural change to the HashMap).

cleanup_idle is updated to follow the same rule on the cleanup path: snapshot the Arcs under the read lock, release it, then try_lock each connection individually. A connection that's currently in use is by definition not idle, so try_lock lets us skip it without ever awaiting on a per-connection mutex while holding the pool lock. The write lock is only re-acquired if there are stale entries to remove.

Note: The cleanup_idle shape addresses the P1 review comment that Codex bot left on the original #77 — it pointed out that cleanup_idle grabbing the pool write lock and then awaiting conn_arc.lock() would re-introduce the exact starvation this refactor is meant to eliminate. The snapshot + try_lock pattern fixes that.

The with_connection signature is unchanged, so no call sites need to be updated — the fix is entirely internal to pool.rs. Diff is +51 / -21 in a single file.

Why this approach

A few alternatives we considered:

  1. Leave it as one big RwLock, convert callers to read lock. Doesn't work — callers need &mut AcpConnection to drive session_prompt, and a read lock can't hand out mutable refs.
  2. DashMap<String, AcpConnection>. DashMap's shard locks are synchronous, so we'd still need per-entry async coordination to let a callback hold exclusive access across .await points. Doesn't avoid the underlying need for a per-connection async mutex.
  3. Per-connection Arc<Mutex<_>> (this PR). Minimal change, preserves the existing with_connection API, fixes the root cause (pool lock held during streaming). This matches the architecture discussed in RFC: Session Management — Design Proposal #78 §2b.

Scope

This PR is only the locking change. Previously bundled in #77 alongside notification-loop resilience, an alive-check safety net, and a startup cleanup routine — on reflection that was too much for a single review. Closing #77 and splitting into three focused PRs; this is the first.

Next PRs (to follow in order, from separate branches off main):

  1. ✅ This PR — per-connection lock.
  2. Notification loop resilience — end_turn can arrive before the final agent_message_chunk via tokio::select! racing; fix is a small drain window + empty-response fallback. Fixes fix: notification loop assumes ordered events, bounded prompts, and managed session lifecycle — none hold in production #76.
  3. Alive check + hard timeout — defensive safety net around notification_loop (30s alive check / 30min hard ceiling).

Supersedes #59 and #77. Closes #58.

Testing

Happy to add a focused test if you'd like — the behavioral change is subtle (a session that's streaming no longer blocks unrelated get_or_create calls), and a concurrent-access test with two tasks would make it visible.

@ruan330 ruan330 requested a review from thepagent as a code owner April 10, 2026 12:40
`SessionPool::with_connection` currently holds the pool's write lock
for the entire callback duration. Because `stream_prompt` in discord.rs
runs inside that callback and can take many seconds (or minutes) to
drain an ACP turn, every other Discord thread is blocked from touching
the pool while one session streams — even for `get_or_create` on a
completely unrelated thread_id, which only needs the read lock.

The fix: wrap each `AcpConnection` in `Arc<Mutex<_>>`. `with_connection`
now takes only the pool's read lock long enough to clone the Arc, then
locks that specific connection's mutex for the callback. The pool lock
is released immediately, so:

  - Other sessions can still stream concurrently.
  - `get_or_create` on unrelated thread_ids proceeds without waiting.
  - Rebuilds still take the write lock briefly (correct — structural
    change to the HashMap).

`cleanup_idle` uses a snapshot-then-probe pattern so the same rule
holds on the cleanup path: clone the Arcs under the read lock, release
it, then `try_lock` each connection individually. A busy connection
is, by definition, not idle — `try_lock` lets us skip it without ever
awaiting on a per-connection mutex while holding the pool lock. The
write lock is only re-acquired if there are stale entries to remove.
This addresses the P1 review comment left by the Codex bot on the
original openabdev#77, which noted that awaiting `conn_arc.lock()` from inside
a held pool write lock would re-introduce the very starvation this
refactor is meant to eliminate.

This matches the architecture discussed in openabdev#78 §2b and closes openabdev#58
(pool write lock deadlock during long-running notification loops).

Supersedes openabdev#59 and openabdev#77. Scoped to just the locking change so it can
be reviewed in isolation — notification-loop resilience and alive
check will follow as separate PRs.

No call-site changes: the `with_connection` signature is unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ruan330 ruan330 force-pushed the refactor/per-connection-lock branch from 93f8f6b to f823131 Compare April 10, 2026 12:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant