
RFC: Session Management — Design Proposal #78

RFC: Session Management

Tracking issue: #75
Status: Draft
Author: @thepagent


Summary

Comprehensive session management for agent-broker covering lifecycle control, isolation, observability, security, and multi-agent support.

Current State


1. Session Lifecycle

1a. /close command (#40)

  • Discord handler intercepts /close message
  • Calls pool.remove(thread_id) → drops AcpConnection → kill_on_drop kills child process
  • Replies "Session closed." and archives thread

1b. Session timeout / auto-expiry

  • Keep the current session_ttl_hours as the hard max and add a separate idle_timeout_minutes (e.g. 30 min)
  • On idle timeout → post "⏰ Session expired due to inactivity" in thread → remove session
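
The two limits in 1b can be evaluated independently. The following is a minimal sketch, not the broker's actual code: TimeoutConfig, ExpiryReason and expiry_reason are hypothetical names, and the choice to let the hard TTL win when both limits are exceeded is an assumption.

```rust
use std::time::{Duration, Instant};

/// Hypothetical config; the fields mirror session_ttl_hours and
/// idle_timeout_minutes from the proposal, already converted to Durations.
pub struct TimeoutConfig {
    pub session_ttl: Duration,
    pub idle_timeout: Duration,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ExpiryReason {
    HardTtl,
    Idle,
}

/// Returns why a session should be removed at `now`, or None if it is still live.
/// Assumption: the hard TTL is checked first, so it wins when both limits are hit.
pub fn expiry_reason(
    created_at: Instant,
    last_active: Instant,
    now: Instant,
    cfg: &TimeoutConfig,
) -> Option<ExpiryReason> {
    if now.saturating_duration_since(created_at) >= cfg.session_ttl {
        Some(ExpiryReason::HardTtl)
    } else if now.saturating_duration_since(last_active) >= cfg.idle_timeout {
        Some(ExpiryReason::Idle)
    } else {
        None
    }
}
```

A periodic sweep could call this for every session and post the inactivity notice only for the Idle case.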

1c. Per-user session limits

  • Track HashMap<UserId, Vec<ThreadId>> for active sessions per user
  • Exceeding limit → reply "You have too many active sessions. Use /close to free one."
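
A minimal sketch of the per-user tracking described in 1c. UserSessions, try_register and release are hypothetical names, and plain String ids stand in for UserId and ThreadId.

```rust
use std::collections::HashMap;

/// Tracks active thread ids per user (String ids stand in for UserId / ThreadId).
pub struct UserSessions {
    max_per_user: usize,
    active: HashMap<String, Vec<String>>,
}

impl UserSessions {
    pub fn new(max_per_user: usize) -> Self {
        Self { max_per_user, active: HashMap::new() }
    }

    /// Registers a session; returns false if the user is already at the limit.
    pub fn try_register(&mut self, user_id: &str, thread_id: &str) -> bool {
        let threads = self.active.entry(user_id.to_string()).or_default();
        if threads.iter().any(|t| t == thread_id) {
            return true; // already tracked, nothing to do
        }
        if threads.len() >= self.max_per_user {
            return false;
        }
        threads.push(thread_id.to_string());
        true
    }

    /// Frees a slot when a session is closed or expires.
    pub fn release(&mut self, user_id: &str, thread_id: &str) {
        let now_empty = match self.active.get_mut(user_id) {
            Some(threads) => {
                threads.retain(|t| t != thread_id);
                threads.is_empty()
            }
            None => false,
        };
        if now_empty {
            self.active.remove(user_id);
        }
    }
}
```

When try_register returns false, the handler would send the "too many active sessions" reply instead of spawning a connection.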

1d. Graceful shutdown

  • On shutdown, post "🔄 Broker restarting..." in each active thread before clearing pool
  • Phase 2: persist session metadata to disk/S3 for restart recovery

2. Session Isolation & Stability

2a. Per-thread working directories (#38)

  • Change working_dir to {base_working_dir}/{thread_id}/
  • Each session gets its own filesystem namespace
  • Cleanup deletes working dir along with session
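
One way the directory handling in 2a could look, as a sketch only. create_session_dir and remove_session_dir are hypothetical helpers, and the digits-only check assumes thread ids are numeric Discord ids; it is there so an id such as "../x" cannot escape the base directory.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Builds {base_working_dir}/{thread_id}/ and creates it on disk.
/// Assumption: thread ids are numeric, so anything else is rejected.
pub fn create_session_dir(base: &Path, thread_id: &str) -> io::Result<PathBuf> {
    if thread_id.is_empty() || !thread_id.bytes().all(|b| b.is_ascii_digit()) {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "invalid thread id"));
    }
    let dir = base.join(thread_id);
    fs::create_dir_all(&dir)?;
    Ok(dir)
}

/// Deletes the session directory; a missing directory is not an error.
pub fn remove_session_dir(dir: &Path) -> io::Result<()> {
    match fs::remove_dir_all(dir) {
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(()),
        other => other,
    }
}
```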

2b. Cross-session deadlock fix (#58)

  • Change from RwLock<HashMap<K, AcpConnection>> to RwLock<HashMap<K, Arc<Mutex<AcpConnection>>>>
  • Outer RwLock only protects map insert/remove (released immediately)
  • Per-connection Mutex protects streaming — session A no longer blocks session B
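use std::{collections::HashMap, future::Future, pin::Pin, sync::Arc};
use anyhow::{anyhow, Result};
use tokio::sync::{Mutex, RwLock}; // assumed: tokio's async locks

// Boxed future that may borrow the connection while it runs.
type ConnFuture<'a, R> = Pin<Box<dyn Future<Output = Result<R>> + Send + 'a>>;

pub struct SessionPool {
    connections: RwLock<HashMap<String, Arc<Mutex<AcpConnection>>>>,
}

impl SessionPool {
    pub async fn with_connection<F, R>(&self, thread_id: &str, f: F) -> Result<R>
    where
        F: for<'a> FnOnce(&'a mut AcpConnection) -> ConnFuture<'a, R>,
    {
        // Hold the map lock only long enough to clone the Arc.
        let conn = {
            let conns = self.connections.read().await;
            conns.get(thread_id).cloned()
                .ok_or_else(|| anyhow!("no connection"))?
        };
        let mut guard = conn.lock().await;  // per-session lock only
        f(&mut *guard).await
    }
}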
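
The locking pattern above can be exercised without an async runtime. The sketch below swaps in std::sync locks and a stand-in FakeConnection type (both are assumptions for illustration, not the broker's types) to show that holding one session's lock leaves other sessions and the map itself available.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};

/// Stand-in for AcpConnection; the real type wraps a child process.
pub struct FakeConnection {
    pub prompts: u32,
}

pub struct SyncPool {
    connections: RwLock<HashMap<String, Arc<Mutex<FakeConnection>>>>,
}

impl SyncPool {
    pub fn new() -> Self {
        Self { connections: RwLock::new(HashMap::new()) }
    }

    pub fn insert(&self, thread_id: &str) {
        let conn = Arc::new(Mutex::new(FakeConnection { prompts: 0 }));
        self.connections.write().unwrap().insert(thread_id.to_string(), conn);
    }

    /// Clones the Arc out of the map. The read guard is a temporary, so the
    /// map lock is released before the caller takes any per-session lock.
    pub fn get(&self, thread_id: &str) -> Option<Arc<Mutex<FakeConnection>>> {
        self.connections.read().unwrap().get(thread_id).cloned()
    }
}
```

With two sessions in the pool, locking the connection for "a" and then calling try_lock on "b" succeeds, and insert still works while "a" is held.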

3. Session State

3a. Session metadata

struct SessionMetadata {
    thread_id: String,
    user_id: String,
    agent_name: String,
    created_at: Instant,
    last_active: Instant,
    message_count: u64,
    status: SessionStatus,  // Active, Idle, Expired
}
  • Phase 1: in-memory, used for observability and lifecycle decisions
  • Phase 2: serialize to disk/S3 for restart recovery

3b. Context window management

  • Track message_count per session, warn user when approaching limits
  • Conversation summarization deferred to Phase 2 (requires extra LLM call)
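
A sketch of the warning check in 3b. The names ContextStatus and context_status, the max_messages limit and the percentage threshold are all assumptions; message_count is only a rough proxy for actual context usage.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum ContextStatus {
    WithinBudget,
    NearLimit { remaining: u64 },
    AtLimit,
}

/// Classifies a session by message count.
/// `warn_at_percent` is the share of `max_messages` at which warnings start.
pub fn context_status(message_count: u64, max_messages: u64, warn_at_percent: u64) -> ContextStatus {
    if message_count >= max_messages {
        return ContextStatus::AtLimit;
    }
    let threshold = max_messages.saturating_mul(warn_at_percent) / 100;
    if message_count >= threshold {
        ContextStatus::NearLimit { remaining: max_messages - message_count }
    } else {
        ContextStatus::WithinBudget
    }
}
```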

4. Session Observability (#39)

4a. Management API

Lightweight HTTP server on separate port (e.g. 9090) using axum:

GET    /sessions              — list all active sessions
GET    /sessions/:thread_id   — session detail
DELETE /sessions/:thread_id   — force terminate
GET    /health                — broker health + pool stats
GET    /metrics               — prometheus-compatible metrics

4b. Metrics

  • active_sessions (gauge)
  • total_sessions_created (counter)
  • session_duration_seconds (histogram)
  • messages_per_session (histogram)
  • pool_exhaustion_events (counter)
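
For the gauge and the two counters, the /metrics body can be produced with the standard library alone. This is a sketch under assumptions: PoolMetrics and its methods are hypothetical, the histograms are left out, and the output uses the Prometheus text exposition format (a `# TYPE` line followed by a sample line).

```rust
use std::fmt::Write;
use std::sync::atomic::{AtomicU64, Ordering};

#[derive(Default)]
pub struct PoolMetrics {
    active_sessions: AtomicU64,
    total_sessions_created: AtomicU64,
    pool_exhaustion_events: AtomicU64,
}

impl PoolMetrics {
    pub fn session_opened(&self) {
        self.active_sessions.fetch_add(1, Ordering::Relaxed);
        self.total_sessions_created.fetch_add(1, Ordering::Relaxed);
    }

    pub fn session_closed(&self) {
        // Decrement without wrapping below zero.
        let _ = self
            .active_sessions
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |v| v.checked_sub(1));
    }

    pub fn pool_exhausted(&self) {
        self.pool_exhaustion_events.fetch_add(1, Ordering::Relaxed);
    }

    /// Renders the metrics as Prometheus text exposition lines.
    pub fn render(&self) -> String {
        let mut out = String::new();
        let rows = [
            ("active_sessions", "gauge", &self.active_sessions),
            ("total_sessions_created", "counter", &self.total_sessions_created),
            ("pool_exhaustion_events", "counter", &self.pool_exhaustion_events),
        ];
        for (name, kind, value) in rows {
            let v = value.load(Ordering::Relaxed);
            let _ = writeln!(out, "# TYPE {0} {1}\n{0} {2}", name, kind, v);
        }
        out
    }
}
```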

4c. Audit trail

  • Structured logging via existing tracing — add fields: thread_id, user_id, event
  • Events: session_created, session_prompt, session_closed, session_expired

5. Session Security & Access Control

5a. Session ownership

  • SessionMetadata records owner_user_id (the user_id field in 3a can serve as the owner)
  • /close restricted to session owner or admin
  • Other users can still interact in thread (Discord threads are public)
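
The ownership rule in 5a reduces to a small predicate. In this sketch, can_close is a hypothetical name and the admin set is assumed to come from config.

```rust
use std::collections::HashSet;

/// Returns true if `requester_id` may run /close on a session owned by
/// `owner_user_id`. The `admins` set is an assumed config value.
pub fn can_close(requester_id: &str, owner_user_id: &str, admins: &HashSet<String>) -> bool {
    requester_id == owner_user_id || admins.contains(requester_id)
}
```

Prompting is not gated by this check, since other users may still interact in the thread.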

5b. Rate limiting per session

  • Per-session sliding window: configurable max_messages_per_minute (default: 10)
  • Exceeding limit → reply "⏳ Rate limited, please wait."
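
A minimal sketch of the per-session sliding window in 5b. SlidingWindowLimiter is a hypothetical type, and the caller passes `now` in so the logic can be tested without sleeping.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

pub struct SlidingWindowLimiter {
    max_messages: usize,
    window: Duration,
    timestamps: VecDeque<Instant>, // accepted messages, oldest first
}

impl SlidingWindowLimiter {
    /// Builds a limiter from max_messages_per_minute.
    pub fn per_minute(max_messages_per_minute: usize) -> Self {
        Self {
            max_messages: max_messages_per_minute,
            window: Duration::from_secs(60),
            timestamps: VecDeque::new(),
        }
    }

    /// Returns true and records the message if it fits in the window.
    pub fn try_acquire(&mut self, now: Instant) -> bool {
        // Drop timestamps that have left the window.
        while let Some(&oldest) = self.timestamps.front() {
            if now.saturating_duration_since(oldest) >= self.window {
                self.timestamps.pop_front();
            } else {
                break;
            }
        }
        if self.timestamps.len() >= self.max_messages {
            return false;
        }
        self.timestamps.push_back(now);
        true
    }
}
```

Rejected messages are not recorded, so a user who keeps sending while limited does not extend their own wait.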

6. Multi-agent

6a. Session routing

  • Extend config from single [agent] to [agents] table with multiple agent configs
  • Routing by Discord channel or /agent <name> command
  • Pool becomes HashMap<String, (AgentConfig, AcpConnection)> (with the connection wrapped in Arc<Mutex<...>> as in 2b)
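
A sketch of how routing in 6a might resolve an agent name. It assumes both mechanisms are supported, with an explicit /agent <name> command taking precedence over the channel mapping and a default agent as the fallback; that precedence is still an open question in this RFC. AgentRouter, parse_agent_command and resolve are hypothetical names.

```rust
use std::collections::HashMap;

pub struct AgentRouter {
    pub default_agent: String,
    pub by_channel: HashMap<String, String>, // channel id -> agent name
    pub known_agents: Vec<String>,           // keys of the [agents] table
}

/// Parses "/agent <name>"; returns the name if the message is that command.
pub fn parse_agent_command(message: &str) -> Option<&str> {
    let rest = message.trim().strip_prefix("/agent")?;
    let name = rest.trim();
    // Require whitespace after "/agent" so "/agents" is not matched.
    if rest.starts_with(char::is_whitespace) && !name.is_empty() {
        Some(name)
    } else {
        None
    }
}

impl AgentRouter {
    /// Explicit request first, then channel mapping, then the default.
    /// Returns None only when an explicitly requested agent is unknown.
    pub fn resolve(&self, channel_id: &str, requested: Option<&str>) -> Option<&str> {
        if let Some(name) = requested {
            return self
                .known_agents
                .iter()
                .find(|a| a.as_str() == name)
                .map(|a| a.as_str());
        }
        Some(
            self.by_channel
                .get(channel_id)
                .map(String::as_str)
                .unwrap_or(self.default_agent.as_str()),
        )
    }
}
```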

6b. Session handoff (Phase 2)

  • /handoff <agent> → close current connection, respawn with new agent
  • Optional: carry conversation summary to new agent system prompt

Implementation Phases

| Phase   | Scope                                                                       | Complexity |
|---------|-----------------------------------------------------------------------------|------------|
| Phase 1 | #58 deadlock fix, #40 /close, idle timeout notification, session metadata   | Low-Med    |
| Phase 2 | #39 management API, metrics, per-user limits, rate limiting                 | Medium     |
| Phase 3 | #38 per-thread working dirs, session ownership, audit trail                 | Medium     |
| Phase 4 | Multi-agent routing, persistence/recovery, handoff                          | High       |

Open Questions

  1. Should the management API require auth (API key / mTLS)?
  2. Should we support session resume after agent crash (requires agent-side support)?
  3. Multi-agent routing: per-channel config vs. user command vs. both?
  4. Rate limiting: per-session or per-user?

Comments and feedback welcome.
