## Problem

The notification loop in `stream_prompt` (`discord.rs`) is built on three assumptions that don't hold in production with long-running ACP backends like Claude Code:
### Assumption 1: ACP events arrive in order
The loop breaks when it sees a response with `id`:

```rust
while let Some(notification) = rx.recv().await {
    if notification.id.is_some() {
        break; // assumes all text chunks arrived before this
    }
}
```
**Reality:** `end_turn` (the response with `id`) sometimes arrives *before* the final `agent_message_chunk` notifications. When this happens, the loop breaks early and `text_buf` is empty — the user sees "(no response)" even though the agent did respond.
Evidence from production logs (the chunk payload is Chinese; "先更新實作..." ≈ "first, update the implementation..."):

```
06:31:39 end_turn id=40, totalTokens=0      ← response arrives
06:31:56 agent_message_chunk: "先"          ← text arrives AFTER response
06:31:57 agent_message_chunk: "更新實作..."  ← more text, already discarded
```
### Assumption 2: Prompts always complete
The loop has no exit condition other than receiving the response `id` or the channel closing:

```rust
while let Some(notification) = rx.recv().await {
    // no timeout, no liveness check
}
```
**Reality:** Agents run long tool calls (build commands, test suites) that produce no ACP notifications for minutes. If the tool call never completes (e.g., `flutter run` is a long-lived app server), the loop blocks forever. Combined with the global pool write lock (#58), this freezes the entire broker.

**Evidence:** a single `flutter run` command blocked all sessions for 7+ hours until a manual restart.
### Assumption 3: Session lifecycle is self-managing
There is no cleanup of Discord threads when the broker restarts. After a restart, the in-memory session pool is empty, but Discord threads from previous sessions remain active. Users typing in these stale threads create new sessions with no conversation context, silently consuming session pool slots.
With `max_sessions = 5`, three stale threads + two new threads = pool exhausted.
## These are one problem, not three
All three stem from the same architectural gap: the notification loop has no resilience against real-world conditions — event ordering violations, unbounded blocking, and lifecycle mismatches.
Fixing them individually produces three localized patches that don't address the underlying fragility. Fixing them together produces a robust notification loop that handles production realities.
## Reference implementation
We encountered all three issues in our production deployment (multi-session, Claude Code backend, long-running Flutter builds) and submitted a PR with our fixes:
PR: #77 — includes per-connection locking (#58), alive check, drain window, fallback, and startup cleanup in a single cohesive change.
The key changes:
### 1. Drain window for event ordering (`discord.rs`)
After receiving the `end_turn` response, drain the notification channel for 200 ms to capture late-arriving text chunks:

```rust
if notification.id.is_some() {
    // drain_until must be a tokio::time::Instant (timeout_at requires it)
    let drain_until = Instant::now() + Duration::from_millis(200);
    while let Ok(remaining) = timeout_at(drain_until, rx.recv()).await {
        if let Some(n) = remaining {
            if let Some(AcpEvent::Text(t)) = classify_notification(&n) {
                text_buf.push_str(&t);
            }
        } else {
            break; // channel closed
        }
    }
    break;
}
```
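The drain idea can be exercised outside tokio with std channels, `recv_timeout` standing in for `timeout_at`. This is an illustrative analog, not the broker's code; `drain_late_chunks` is our name for the sketch:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Drain a channel until `window` elapses, collecting text chunks.
/// Std-library analog of the tokio timeout_at drain loop.
fn drain_late_chunks(rx: &mpsc::Receiver<String>, window: Duration) -> String {
    let deadline = Instant::now() + window;
    let mut buf = String::new();
    // None once the deadline has passed → stop draining
    while let Some(remaining) = deadline.checked_duration_since(Instant::now()) {
        match rx.recv_timeout(remaining) {
            Ok(chunk) => buf.push_str(&chunk),
            Err(_) => break, // window elapsed or sender dropped
        }
    }
    buf
}

fn main() {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // simulate chunks that land after end_turn was already seen
        thread::sleep(Duration::from_millis(50));
        tx.send("late ".to_string()).unwrap();
        tx.send("chunk".to_string()).unwrap();
    });
    assert_eq!(drain_late_chunks(&rx, Duration::from_millis(200)), "late chunk");
}
```

Note the trade: the window closes early if the sender drops, so the 200 ms cost is only paid when the channel stays open.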
### 2. Alive check + hard timeout for unbounded blocking (`discord.rs`)
Replace the bare `while let rx.recv()` with a `tokio::select!` that periodically checks process liveness and enforces a hard ceiling:

```rust
let prompt_start = Instant::now();
let hard_timeout = Duration::from_secs(30 * 60);
loop {
    tokio::select! {
        msg = rx.recv() => { /* process notification; break on None */ },
        _ = sleep(Duration::from_secs(30)) => {
            if !conn.alive() { break; }                         // process dead → stop
            if prompt_start.elapsed() > hard_timeout { break; } // safety net
            continue; // alive → keep waiting
        }
    }
}
```
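`conn.alive()` is the broker's own helper; one way such a check can be implemented on a child-process handle is `try_wait`, which reports exit status without blocking. A sketch with `std::process` (tokio's `Child` exposes the same method):

```rust
use std::process::{Child, Command};

/// Returns true while the child process is still running.
/// Sketch of an alive check, assuming the connection owns the child handle.
fn alive(child: &mut Child) -> bool {
    match child.try_wait() {
        Ok(Some(_status)) => false, // process has exited
        Ok(None) => true,           // still running
        Err(_) => false,            // treat wait errors as dead
    }
}

fn main() -> std::io::Result<()> {
    let mut child = Command::new("sleep").arg("1").spawn()?;
    assert!(alive(&mut child)); // still sleeping
    child.wait()?;
    assert!(!alive(&mut child)); // reaped → reported dead
    Ok(())
}
```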
### 3. Startup thread cleanup (`discord.rs`, ready handler)
On startup, fetch the active threads in allowed channels and archive any created by this bot (sketch; `active_threads`, `bot_id`, and `allowed_channels` are fetched/held elsewhere):

```rust
async fn ready(&self, ctx: Context, ready: Ready) {
    for thread in active_threads {
        if thread.owner_id == bot_id && allowed_channels.contains(&thread.parent_id) {
            // archive the stale thread via serenity's EditThread builder
            let _ = thread.id.edit_thread(&ctx.http, EditThread::new().archived(true)).await;
        }
    }
}
```
### 4. Fallback for empty responses (`discord.rs`)
If `text_buf` is empty after draining but tool activity was recorded, compose a fallback from the tool lines instead of showing "(no response)":

```rust
let final_content = if text_buf.trim().is_empty() && !tool_lines.is_empty() {
    format!("{}\n\n_Task completed but no text response was captured._",
            tool_lines.join("\n"))
} else if text_buf.trim().is_empty() {
    "_(no response)_".to_string()
} else {
    compose_display(&tool_lines, &text_buf)
};
```
Tradeoffs
| Decision |
Cost |
Why we chose it |
| 200ms drain window |
Adds 200ms latency to every prompt completion |
Small cost; avoids losing entire responses |
| 30-min hard timeout |
Legitimate 30+ min tasks get interrupted |
Safety net; 30 min is generous for most use cases |
| Auto-archive on startup |
Can't resume old threads after restart |
Old threads have no session context anyway; clean start is safer |
| Fallback message |
Shows tool summary instead of actual response |
Better than "(no response)"; user knows work happened |
These are our solutions based on our specific use case (multi-session, Claude Code backend, Flutter development). The maintainer may have better approaches — for example, the drain window could be replaced by sequence numbers in the ACP protocol, or the hard timeout could be configurable via `config.toml`. We're sharing what worked for us as a starting point, not prescribing the solution.
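If the timeouts were made configurable, a hypothetical `config.toml` fragment might expose both knobs (key names are ours, not the project's):

```toml
[broker]
# hard ceiling on a single prompt, in seconds (hypothetical key)
prompt_hard_timeout_secs = 1800
# drain window for late chunks after end_turn, in ms (hypothetical key)
drain_window_ms = 200
```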
## Related