## Problem

The notification loop in `stream_prompt` (`discord.rs`) is built on three assumptions that don't hold in production with long-running ACP backends like Claude Code:
### Assumption 1: ACP events arrive in order
The loop breaks when it sees a response with `id`:

```rust
while let Some(notification) = rx.recv().await {
    if notification.id.is_some() {
        break; // assumes all text chunks arrived before this
    }
}
```
**Reality:** `end_turn` (the response with `id`) sometimes arrives *before* the final `agent_message_chunk` notifications. When this happens, the loop breaks early and `text_buf` is empty — the user sees "(no response)" even though the agent did respond.
Evidence from production logs (the chunk payload is Chinese; "先更新實作..." ≈ "first, update the implementation..."):

```
06:31:39 end_turn id=40, totalTokens=0      ← response arrives
06:31:56 agent_message_chunk: "先"          ← text arrives AFTER response
06:31:57 agent_message_chunk: "更新實作..."  ← more text, already discarded
```
### Assumption 2: Prompts always complete
The loop has no exit condition other than receiving the response `id` or the channel closing:

```rust
while let Some(notification) = rx.recv().await {
    // no timeout, no liveness check
}
```
**Reality:** Agents run long tool calls (build commands, test suites) that produce no ACP notifications for minutes. If the tool call never completes (e.g., `flutter run` is a long-lived app server), the loop blocks forever. Combined with the global pool write lock (#58), this freezes the entire broker.

**Evidence:** a single `flutter run` command blocked all sessions for 7+ hours until a manual restart.
### Assumption 3: Session lifecycle is self-managing
There is no cleanup of Discord threads when the broker restarts. After a restart, the in-memory session pool is empty, but Discord threads from previous sessions remain active. Users typing in these stale threads create new sessions with no conversation context, silently consuming session pool slots.
With `max_sessions = 5`, three stale threads + two new threads = pool exhausted.
## These are one problem, not three
All three stem from the same architectural gap: the notification loop has no resilience against real-world conditions — event ordering violations, unbounded blocking, and lifecycle mismatches.
Fixing them individually produces three localized patches that don't address the underlying fragility. Fixing them together produces a robust notification loop that handles production realities.
## Reference implementation
We encountered all three issues in our production deployment (multi-session, Claude Code backend, long-running Flutter builds) and submitted a PR with our fixes:
PR: #77 — includes per-connection locking (#58), alive check, drain window, fallback, and startup cleanup in a single cohesive change.
The key changes:
### 1. Drain window for event ordering (`discord.rs`)
After receiving the `end_turn` response, drain the notification channel for 200 ms to capture late-arriving text chunks:

```rust
if notification.id.is_some() {
    // drain_until must be a tokio::time::Instant (timeout_at requires it)
    let drain_until = Instant::now() + Duration::from_millis(200);
    while let Ok(remaining) = timeout_at(drain_until, rx.recv()).await {
        if let Some(n) = remaining {
            if let Some(AcpEvent::Text(t)) = classify_notification(&n) {
                text_buf.push_str(&t);
            }
        } else {
            break; // channel closed
        }
    }
    break;
}
```
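The drain idea can be exercised outside tokio with std channels, `recv_timeout` standing in for `timeout_at`. This is an illustrative analog, not the broker's code; `drain_late_chunks` is our name for the sketch:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Drain a channel until `window` elapses, collecting text chunks.
/// Std-library analog of the tokio timeout_at drain loop.
fn drain_late_chunks(rx: &mpsc::Receiver<String>, window: Duration) -> String {
    let deadline = Instant::now() + window;
    let mut buf = String::new();
    // None once the deadline has passed → stop draining
    while let Some(remaining) = deadline.checked_duration_since(Instant::now()) {
        match rx.recv_timeout(remaining) {
            Ok(chunk) => buf.push_str(&chunk),
            Err(_) => break, // window elapsed or sender dropped
        }
    }
    buf
}

fn main() {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // simulate chunks that land after end_turn was already seen
        thread::sleep(Duration::from_millis(50));
        tx.send("late ".to_string()).unwrap();
        tx.send("chunk".to_string()).unwrap();
    });
    assert_eq!(drain_late_chunks(&rx, Duration::from_millis(200)), "late chunk");
}
```

Note the trade: the window closes early if the sender drops, so the 200 ms cost is only paid when the channel stays open.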
### 2. Alive check + hard timeout for unbounded blocking (`discord.rs`)
Replace the bare `while let rx.recv()` with a `tokio::select!` that periodically checks process liveness and enforces a hard ceiling:

```rust
let prompt_start = Instant::now();
let hard_timeout = Duration::from_secs(30 * 60);
loop {
    tokio::select! {
        msg = rx.recv() => { /* process notification; break on None */ },
        _ = sleep(Duration::from_secs(30)) => {
            if !conn.alive() { break; }                         // process dead → stop
            if prompt_start.elapsed() > hard_timeout { break; } // safety net
            continue; // alive → keep waiting
        }
    }
}
```
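`conn.alive()` is the broker's own helper; one way such a check can be implemented on a child-process handle is `try_wait`, which reports exit status without blocking. A sketch with `std::process` (tokio's `Child` exposes the same method):

```rust
use std::process::{Child, Command};

/// Returns true while the child process is still running.
/// Sketch of an alive check, assuming the connection owns the child handle.
fn alive(child: &mut Child) -> bool {
    match child.try_wait() {
        Ok(Some(_status)) => false, // process has exited
        Ok(None) => true,           // still running
        Err(_) => false,            // treat wait errors as dead
    }
}

fn main() -> std::io::Result<()> {
    let mut child = Command::new("sleep").arg("1").spawn()?;
    assert!(alive(&mut child)); // still sleeping
    child.wait()?;
    assert!(!alive(&mut child)); // reaped → reported dead
    Ok(())
}
```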
### 3. Startup thread cleanup (`discord.rs`, ready handler)
On startup, fetch the active threads in allowed channels and archive any created by this bot (sketch; `active_threads`, `bot_id`, and `allowed_channels` are fetched/held elsewhere):

```rust
async fn ready(&self, ctx: Context, ready: Ready) {
    for thread in active_threads {
        if thread.owner_id == bot_id && allowed_channels.contains(&thread.parent_id) {
            // archive the stale thread via serenity's EditThread builder
            let _ = thread.id.edit_thread(&ctx.http, EditThread::new().archived(true)).await;
        }
    }
}
```
### 4. Fallback for empty responses (`discord.rs`)
If `text_buf` is empty after draining but tool activity was recorded, compose a fallback from the tool lines instead of showing "(no response)":

```rust
let final_content = if text_buf.trim().is_empty() && !tool_lines.is_empty() {
    format!("{}\n\n_Task completed but no text response was captured._",
            tool_lines.join("\n"))
} else if text_buf.trim().is_empty() {
    "_(no response)_".to_string()
} else {
    compose_display(&tool_lines, &text_buf)
};
```
Tradeoffs
| Decision |
Cost |
Why we chose it |
| 200ms drain window |
Adds 200ms latency to every prompt completion |
Small cost; avoids losing entire responses |
| 30-min hard timeout |
Legitimate 30+ min tasks get interrupted |
Safety net; 30 min is generous for most use cases |
| Auto-archive on startup |
Can't resume old threads after restart |
Old threads have no session context anyway; clean start is safer |
| Fallback message |
Shows tool summary instead of actual response |
Better than "(no response)"; user knows work happened |
These are our solutions based on our specific use case (multi-session, Claude Code backend, Flutter development). The maintainer may have better approaches — for example, the drain window could be replaced by sequence numbers in the ACP protocol, or the hard timeout could be configurable via `config.toml`. We're sharing what worked for us as a starting point, not prescribing the solution.
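If the timeouts were made configurable, a hypothetical `config.toml` fragment might expose both knobs (key names are ours, not the project's):

```toml
[broker]
# hard ceiling on a single prompt, in seconds (hypothetical key)
prompt_hard_timeout_secs = 1800
# drain window for late chunks after end_turn, in ms (hypothetical key)
drain_window_ms = 200
```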
## Related