# RFC: Session Management

**Tracking issue:** #75
**Status:** Draft
**Author:** @thepagent

---

## Summary

Comprehensive session management for agent-broker covering lifecycle control, isolation, observability, security, and multi-agent support.

## Current State

- `SessionPool` uses `HashMap<String, AcpConnection>` keyed by Discord thread_id
- Basic `cleanup_idle` (TTL-based) and `shutdown` exist
- No user-facing session control, no API, no metrics
- Write lock held during entire prompt streaming, so one streaming session blocks every other session (#58)

---

## 1. Session Lifecycle

### 1a. `/close` command (#40)
- Discord handler intercepts `/close` message
- Calls `pool.remove(thread_id)` → drops `AcpConnection` → `kill_on_drop` kills child process
- Replies "Session closed." and archives thread

### 1b. Session timeout / auto-expiry
- Split current `session_ttl_hours` (hard max) from new `idle_timeout_minutes` (e.g. 30 min)
- On idle timeout → post "⏰ Session expired due to inactivity" in thread → remove session
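
The split between the hard TTL and the idle timeout could be sketched roughly as below. Only `session_ttl_hours` and `idle_timeout_minutes` come from this RFC; `TimeoutConfig`, `Expiry`, and `check_expiry` are hypothetical names, and checking the hard TTL first is an assumption.

```rust
use std::time::{Duration, Instant};

/// Hypothetical config struct; only the two field names come from the RFC.
pub struct TimeoutConfig {
    pub session_ttl_hours: u64,
    pub idle_timeout_minutes: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Expiry {
    /// Session exceeded the hard maximum lifetime.
    HardTtl,
    /// Session saw no activity for the idle window.
    Idle,
}

/// Returns why a session should be removed, or `None` if it is still live.
/// The hard TTL is checked first so it wins when both limits are exceeded.
pub fn check_expiry(
    cfg: &TimeoutConfig,
    created_at: Instant,
    last_active: Instant,
    now: Instant,
) -> Option<Expiry> {
    let ttl = Duration::from_secs(cfg.session_ttl_hours * 3600);
    let idle = Duration::from_secs(cfg.idle_timeout_minutes * 60);
    if now.saturating_duration_since(created_at) >= ttl {
        Some(Expiry::HardTtl)
    } else if now.saturating_duration_since(last_active) >= idle {
        Some(Expiry::Idle)
    } else {
        None
    }
}
```

Passing `now` in explicitly keeps the check deterministic, so the cleanup task can be tested without sleeping.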

### 1c. Per-user session limits
- Track `HashMap<UserId, Vec<ThreadId>>` for active sessions per user
- Exceeding limit → reply "You have too many active sessions. Use `/close` to free one."
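
A minimal sketch of the per-user tracking, assuming plain `String` IDs in place of the `UserId` and `ThreadId` types (whose definitions are not shown here). `UserSessions`, `try_register`, and `release` are illustrative names, not existing code.

```rust
use std::collections::HashMap;

/// Tracks active thread IDs per user. `String` stands in for the RFC's
/// `UserId` and `ThreadId` types.
pub struct UserSessions {
    max_per_user: usize,
    active: HashMap<String, Vec<String>>,
}

impl UserSessions {
    pub fn new(max_per_user: usize) -> Self {
        Self { max_per_user, active: HashMap::new() }
    }

    /// Registers a session; returns `false` when the user is at the limit,
    /// in which case the caller would send the "too many sessions" reply.
    pub fn try_register(&mut self, user_id: &str, thread_id: &str) -> bool {
        let threads = self.active.entry(user_id.to_owned()).or_default();
        if threads.iter().any(|t| t == thread_id) {
            return true; // already tracked; do not count it twice
        }
        if threads.len() >= self.max_per_user {
            return false;
        }
        threads.push(thread_id.to_owned());
        true
    }

    /// Frees a slot when a session is closed or expires.
    pub fn release(&mut self, user_id: &str, thread_id: &str) {
        if let Some(threads) = self.active.get_mut(user_id) {
            threads.retain(|t| t != thread_id);
            if threads.is_empty() {
                self.active.remove(user_id);
            }
        }
    }
}
```

Every removal path (`/close`, idle expiry, shutdown) would need to call `release`, otherwise a user's slots leak.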

### 1d. Graceful shutdown
- On shutdown, post "🔄 Broker restarting..." in each active thread before clearing pool
- Phase 4 (see Implementation Phases): persist session metadata to disk/S3 for restart recovery

---

## 2. Session Isolation & Stability

### 2a. Per-thread working directories (#38)
- Change `working_dir` to `{base_working_dir}/{thread_id}/`
- Each session gets its own filesystem namespace
- Cleanup deletes working dir along with session
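
One way the per-thread directory handling might look, using only the standard library. The function names are hypothetical, and rejecting non-alphanumeric `thread_id` values is an added assumption (Discord thread IDs are numeric snowflakes) so that a crafted ID cannot escape `base_working_dir`.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Builds `{base_working_dir}/{thread_id}/`. The ID check is an assumed
/// safety measure, not something the RFC specifies.
pub fn session_dir(base_working_dir: &Path, thread_id: &str) -> io::Result<PathBuf> {
    if thread_id.is_empty() || !thread_id.chars().all(|c| c.is_ascii_alphanumeric()) {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "invalid thread_id"));
    }
    Ok(base_working_dir.join(thread_id))
}

/// Creates the per-session directory when the session starts.
pub fn create_session_dir(base: &Path, thread_id: &str) -> io::Result<PathBuf> {
    let dir = session_dir(base, thread_id)?;
    fs::create_dir_all(&dir)?;
    Ok(dir)
}

/// Deletes the directory during session cleanup; a missing directory is not an error.
pub fn remove_session_dir(base: &Path, thread_id: &str) -> io::Result<()> {
    match fs::remove_dir_all(session_dir(base, thread_id)?) {
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(()),
        other => other,
    }
}
```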

### 2b. Cross-session deadlock fix (#58)
- Change from `RwLock<HashMap<K, AcpConnection>>` to `RwLock<HashMap<K, Arc<Mutex<AcpConnection>>>>`
- Outer `RwLock` only protects map insert/remove (released immediately)
- Per-connection `Mutex` protects streaming — session A no longer blocks session B

```rust
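use std::{collections::HashMap, sync::Arc};

use anyhow::{anyhow, Result};
use futures::future::BoxFuture;
use tokio::sync::{Mutex, RwLock};

pub struct SessionPool {
    connections: RwLock<HashMap<String, Arc<Mutex<AcpConnection>>>>,
}

impl SessionPool {
    // Assumed signature: `f` returns a boxed future so it can borrow the connection.
    pub async fn with_connection<F, R>(&self, thread_id: &str, f: F) -> Result<R>
    where
        F: for<'a> FnOnce(&'a mut AcpConnection) -> BoxFuture<'a, Result<R>>,
    {
        // Hold the map read lock only long enough to clone the Arc.
        let conn = {
            let conns = self.connections.read().await;
            conns.get(thread_id).cloned()
                .ok_or_else(|| anyhow!("no connection"))?
        };
        let mut guard = conn.lock().await;  // per-session lock only
        f(&mut *guard).await
    }
}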
```

---

## 3. Session State

### 3a. Session metadata

```rust
struct SessionMetadata {
    thread_id: String,
    user_id: String,
    agent_name: String,
    created_at: Instant,
    last_active: Instant,
    message_count: u64,
    status: SessionStatus,  // Active, Idle, Expired
}
```

- Phase 1: in-memory, used for observability and lifecycle decisions
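- Phase 4 (see Implementation Phases): serialize to disk/S3 for restart recovery. The `Instant` fields would need a wall-clock type such as `SystemTime` first, because `Instant` values are not meaningful across process restarts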

### 3b. Context window management
- Track `message_count` per session, warn user when approaching limits
- Conversation summarization deferred to Phase 2 (requires extra LLM call)

---

## 4. Session Observability (#39)

### 4a. Management API
Lightweight HTTP server on separate port (e.g. `9090`) using `axum`:

```
GET    /sessions              — list all active sessions
GET    /sessions/:thread_id   — session detail
DELETE /sessions/:thread_id   — force terminate
GET    /health                — broker health + pool stats
GET    /metrics               — Prometheus-compatible metrics
```

### 4b. Metrics
- `active_sessions` (gauge)
- `total_sessions_created` (counter)
- `session_duration_seconds` (histogram)
- `messages_per_session` (histogram)
- `pool_exhaustion_events` (counter)
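
As a rough illustration of what `GET /metrics` could return, the sketch below keeps the gauge and the two counters in atomics and renders them in the Prometheus text exposition format. `SessionMetrics` and its methods are hypothetical; the two histograms are left out because a metrics crate would normally handle bucket bookkeeping.

```rust
use std::fmt::Write;
use std::sync::atomic::{AtomicU64, Ordering};

/// In-process values for three of the metrics listed above.
#[derive(Default)]
pub struct SessionMetrics {
    active_sessions: AtomicU64,
    total_sessions_created: AtomicU64,
    pool_exhaustion_events: AtomicU64,
}

impl SessionMetrics {
    pub fn session_created(&self) {
        self.active_sessions.fetch_add(1, Ordering::Relaxed);
        self.total_sessions_created.fetch_add(1, Ordering::Relaxed);
    }

    pub fn session_closed(&self) {
        // Saturating decrement so a double-close cannot wrap the gauge.
        let _ = self.active_sessions.fetch_update(
            Ordering::Relaxed,
            Ordering::Relaxed,
            |v| Some(v.saturating_sub(1)),
        );
    }

    pub fn pool_exhausted(&self) {
        self.pool_exhaustion_events.fetch_add(1, Ordering::Relaxed);
    }

    /// Renders the current values as Prometheus text exposition lines.
    pub fn render(&self) -> String {
        let mut out = String::new();
        let rows = [
            ("active_sessions", "gauge", &self.active_sessions),
            ("total_sessions_created", "counter", &self.total_sessions_created),
            ("pool_exhaustion_events", "counter", &self.pool_exhaustion_events),
        ];
        for (name, kind, value) in rows {
            // Writing to a String cannot fail, so the Result is ignored.
            let _ = writeln!(out, "# TYPE {name} {kind}");
            let _ = writeln!(out, "{name} {}", value.load(Ordering::Relaxed));
        }
        out
    }
}
```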

### 4c. Audit trail
- Structured logging via existing `tracing` — add fields: `thread_id`, `user_id`, `event`
- Events: `session_created`, `session_prompt`, `session_closed`, `session_expired`

---

## 5. Session Security & Access Control

### 5a. Session ownership
- `SessionMetadata` records the session owner (the `user_id` field in 3a, or a dedicated `owner_user_id` field if the two need to differ)
- `/close` restricted to session owner or admin
- Other users can still interact in thread (Discord threads are public)

### 5b. Rate limiting per session
- Per-session sliding window: configurable `max_messages_per_minute` (default: 10)
- Exceeding limit → reply "⏳ Rate limited, please wait."
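
A per-session sliding window could be implemented along these lines. This is a sketch assuming one limiter instance per session and a fixed 60-second window; `RateLimiter` and `allow` are hypothetical names, and only `max_messages_per_minute` comes from this RFC.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Sliding-window limiter for a single session.
pub struct RateLimiter {
    max_messages_per_minute: usize,
    window: Duration,
    timestamps: VecDeque<Instant>,
}

impl RateLimiter {
    pub fn new(max_messages_per_minute: usize) -> Self {
        Self {
            max_messages_per_minute,
            window: Duration::from_secs(60),
            timestamps: VecDeque::new(),
        }
    }

    /// Returns `true` and records the message if it fits in the window;
    /// `false` means the caller should send the "Rate limited" reply.
    pub fn allow(&mut self, now: Instant) -> bool {
        // Drop timestamps that have aged out of the window.
        while let Some(&oldest) = self.timestamps.front() {
            if now.saturating_duration_since(oldest) >= self.window {
                self.timestamps.pop_front();
            } else {
                break;
            }
        }
        if self.timestamps.len() >= self.max_messages_per_minute {
            return false;
        }
        self.timestamps.push_back(now);
        true
    }
}
```

Memory per session is bounded by `max_messages_per_minute` timestamps, since rejected messages are never recorded.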

---

## 6. Multi-agent

### 6a. Session routing
- Extend config from single `[agent]` to `[agents]` table with multiple agent configs
- Routing by Discord channel or `/agent <name>` command
- Pool becomes `HashMap<String, (AgentConfig, AcpConnection)>`, or `HashMap<String, (AgentConfig, Arc<Mutex<AcpConnection>>)>` once the 2b locking change lands
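
The routing decision might be sketched as below. The precedence shown (explicit `/agent <name>` request, then channel mapping, then a default) is an assumption, since the choice between per-channel config and user command is still an open question. `Routing` and its fields are illustrative names only.

```rust
use std::collections::HashMap;

/// Hypothetical routing table: known agent names plus optional
/// per-channel defaults.
pub struct Routing {
    pub default_agent: String,
    pub known_agents: Vec<String>,
    pub channel_agents: HashMap<String, String>,
}

impl Routing {
    /// Picks an agent name. An explicit request wins, then the channel
    /// mapping, then the default. Unknown names are rejected.
    pub fn resolve(&self, channel_id: &str, requested: Option<&str>) -> Result<&str, String> {
        let name = requested
            .or_else(|| self.channel_agents.get(channel_id).map(String::as_str))
            .unwrap_or(&self.default_agent);
        self.known_agents
            .iter()
            .find(|a| a.as_str() == name)
            .map(String::as_str)
            .ok_or_else(|| format!("unknown agent: {name}"))
    }
}
```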

### 6b. Session handoff (Phase 4)
- `/handoff <agent>` → close current connection, respawn with new agent
- Optional: carry conversation summary to new agent system prompt

---

## Implementation Phases

| Phase | Scope | Complexity |
|-------|-------|------------|
| **Phase 1** | #58 deadlock fix, #40 `/close`, idle timeout notification, session metadata | Low-Med |
| **Phase 2** | #39 management API, metrics, per-user limits, rate limiting | Medium |
| **Phase 3** | #38 per-thread working dirs, session ownership, audit trail | Medium |
| **Phase 4** | Multi-agent routing, persistence/recovery, handoff | High |

---

## Open Questions

1. Should the management API require auth (API key / mTLS)?
2. Should we support session resume after agent crash (requires agent-side support)?
3. Multi-agent routing: per-channel config vs. user command vs. both?
4. Rate limiting: per-session or per-user?

---

_Comments and feedback welcome._

