
feat: lightweight management API for session observability and lifecycle control #57

Open
thepagent wants to merge 1 commit into main from feat/management-api

Conversation

@thepagent
Collaborator

Summary

Add an optional lightweight HTTP management server (zero new dependencies) for session observability and lifecycle control.

Endpoints

| Method | Path | Description |
| --- | --- | --- |
| GET | /healthz | Uptime + Discord connection status (K8s probe target) |
| GET | /sessions | List active sessions with idle time and alive status |
| DELETE | /sessions/{thread_id} | Graceful single-session teardown |
| DELETE | /sessions | Terminate all sessions |

Config

Opt-in via [management] section in config.toml:

[management]
enabled = true
bind = "0.0.0.0:8090"

Defaults to enabled = false — no behavior change for existing deployments.

Changes

  • src/config.rs — ManagementConfig struct (enabled, bind)
  • src/acp/pool.rs — list_sessions(), remove_session(), remove_all_sessions(), max_sessions()
  • src/discord.rs — discord_connected AtomicBool set on ready()
  • src/management.rs — raw TCP HTTP server (tokio::net::TcpListener, no framework)
  • src/main.rs — wire up the management server and shared state
  • config.toml.example — [management] section
  • k8s/deployment.yaml — livenessProbe + readinessProbe on port 8090
  • Helm chart — management.enabled/management.bind in values, configmap, deployment template

Design decisions

  • No framework deps — raw TcpListener keeps the dependency tree unchanged
  • Config opt-in — zero attack surface for deployments that don't need it
  • AtomicBool for discord status — lightweight, lock-free signaling from serenity handler
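The AtomicBool pattern can be sketched roughly as follows — a lock-free flag stored by the serenity ready() handler and loaded by the health endpoint. This is an illustrative sketch, not the PR's exact code; the function and variable names are assumptions.

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

// Loaded by the management server's health handler. Relaxed ordering is
// enough here: the flag only needs eventual visibility, not synchronization
// with other memory operations.
fn health_status(discord_connected: &AtomicBool) -> &'static str {
    if discord_connected.load(Ordering::Relaxed) {
        "connected"
    } else {
        "disconnected"
    }
}

fn main() {
    let discord_connected = Arc::new(AtomicBool::new(false));
    assert_eq!(health_status(&discord_connected), "disconnected");

    // What the serenity ready() handler would do once the gateway is up:
    discord_connected.store(true, Ordering::Relaxed);
    assert_eq!(health_status(&discord_connected), "connected");
}
```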

Closes #39

…cle control

Add optional HTTP management server (no framework deps) with endpoints:
- GET /healthz — uptime + discord connection status for K8s probes
- GET /sessions — list active sessions with idle time
- DELETE /sessions/{id} — graceful single-session teardown
- DELETE /sessions — terminate all sessions

Config opt-in via [management] section (enabled = false by default).
K8s deployment and Helm chart updated with livenessProbe/readinessProbe.

Closes #39
@ruan330

ruan330 commented Apr 10, 2026

Thanks for pushing this forward — we've been running a very similar always-on management API in production (bare metal + Docker) for the past few days, so wanted to share what we've hit. Happy to rebase onto this design once it lands.

1. /healthz as both liveness and readiness will flap.

The serenity Discord gateway reconnects its WebSocket periodically (roughly every 30 minutes, sooner under packet loss). During the reconnect, discord_connected briefly flips to false. If K8s uses /healthz as liveness and expects discord_connected: true, the kubelet will restart the pod every reconnect cycle — which then drops every active session.

Our Docker healthcheck runs into the same class of issue for a different reason (/status returns 200 even when Discord is disconnected, so we can't distinguish "process alive but idle" from "process alive and serving").

Suggested split:

  • /livez — process alive + HTTP server responsive. Never checks Discord. Liveness probe target.
  • /readyz — discord_connected == true AND pool not exhausted. Readiness probe target.

The current PR's livenessProbe + readinessProbe both pointing at /healthz would lead to restart storms under flaky networks. We'd recommend different paths in k8s/deployment.yaml.
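A sketch of what that split could look like in k8s/deployment.yaml, assuming the PR's port 8090 default (the timing values are illustrative, not a recommendation):

```yaml
# Liveness never checks Discord: a gateway reconnect must not restart the pod.
livenessProbe:
  httpGet:
    path: /livez
    port: 8090
  initialDelaySeconds: 5
  periodSeconds: 10
# Readiness gates traffic on Discord connectivity; failing it only removes
# the pod from service endpoints, it never kills active sessions.
readinessProbe:
  httpGet:
    path: /readyz
    port: 8090
  periodSeconds: 5
  failureThreshold: 3
```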

2. DELETE /sessions (bulk) — is the use case worth the risk?

Genuine question: what's the production scenario? We've never needed it. Our /kill/{thread_id} is used in two cases:

  • flutter run dev servers that pass alive() but block streaming → per-thread kill, user unaffected
  • Occasional stuck ACP state during debugging → specific thread

Bulk delete would wipe every active user conversation indiscriminately. If someone hits DELETE /sessions by accident (curl, browser extension, typo), there's no recovery — Discord threads lose their agent state and users get silent sessions.

If there's a real use case (e.g. before broker upgrade), a ?confirm=true query param or an admin_token gate would dramatically reduce blast radius. Or drop it entirely and rely on individual deletes — we've never missed it.
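The ?confirm=true gate could look roughly like this — a hypothetical helper (name and path format are illustrative) that refuses the bulk delete unless the query string opts in explicitly:

```rust
// Sketch of a blast-radius gate for bulk delete: only allow the operation
// when the request target carries an explicit ?confirm=true.
fn bulk_delete_allowed(target: &str) -> bool {
    match target.split_once('?') {
        // Bare "/sessions" or any other path: refuse.
        Some(("/sessions", query)) => query.split('&').any(|p| p == "confirm=true"),
        _ => false,
    }
}

fn main() {
    assert!(!bulk_delete_allowed("/sessions"));
    assert!(!bulk_delete_allowed("/sessions?confirm=false"));
    assert!(bulk_delete_allowed("/sessions?confirm=true"));
    assert!(bulk_delete_allowed("/sessions?x=1&confirm=true"));
}
```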

3. GET /sessions — worth reserving a cwd field.

Our /sessions returns an extra cwd field per session (working directory the agent was spawned with). In production debugging, the most common question is "which project is this stuck thread in?" — thread_id + alive + idle_seconds alone requires cross-referencing Discord channel metadata.

This is currently fork-specific (our [cwd:] directive), but once §2a lands {base_working_dir}/{thread_id}/, every session will have a meaningful cwd to surface. Reserving the field now would save a schema version bump later.

4. Default bind + auth.

bind = "0.0.0.0:8090" is an opinionated choice — perfect for bare metal in a trusted network (which is our setup), but surprising for a K8s deployment without NetworkPolicy.

Options we'd consider, ordered by disruption:

  • Default to 127.0.0.1:8090 (backward incompatible for existing always-on users, so probably not)
  • Keep 0.0.0.0:8090 but document the assumption in config.toml.example
  • Optional [management] auth_token = "${MGMT_TOKEN}" — skip validation when empty, require Authorization: Bearer when set. Two lines of diff, keeps the simple-deployment story.

We don't feel strongly; the token option is what we'd add if this were being shipped to users we don't control.
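The token option really is a small diff. A minimal sketch, assuming a hypothetical `authorized` helper (the config field and header handling are assumptions, not the PR's code): skip validation when no token is configured, otherwise require an exact `Authorization: Bearer <token>` match.

```rust
// Sketch of the optional auth gate. An empty or absent configured token
// disables auth entirely, preserving the simple-deployment story.
fn authorized(configured: Option<&str>, auth_header: Option<&str>) -> bool {
    match configured {
        None | Some("") => true, // auth disabled
        Some(token) => auth_header.map_or(false, |h| h == format!("Bearer {token}")),
    }
}

fn main() {
    assert!(authorized(None, None));                              // no token set
    assert!(authorized(Some(""), Some("anything")));              // empty = disabled
    assert!(!authorized(Some("s3cret"), None));                   // header required
    assert!(authorized(Some("s3cret"), Some("Bearer s3cret")));   // exact match
    assert!(!authorized(Some("s3cret"), Some("Bearer wrong")));   // mismatch
}
```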


None of these are blockers — the core approach (raw TCP, opt-in, zero deps) matches what we independently arrived at, and we'd love to drop our inline implementation in favor of this once it lands. Let me know if any of these would be useful as follow-up PRs vs folded in here.

@chaodu-agent
Collaborator

Community Triage Review — 超渡法師 + 普渡法師

Verdict: Changes required before merge

The feature is needed — main has no management API, no K8s health probes, and no session observability. The core design (raw TCP, opt-in, zero deps) is solid and independently validated by production experience from other deployments. However, there are compile-breaking and production-breaking issues that need to be addressed first.

🔴 Suggested Changes

[1] pool.rs field name conflict — needs rebase (compile error)

The diff references self.connections.read().await, but main uses self.state: RwLock<PoolState>. This PR was written against an older pool.rs structure. Direct merge will fail to compile. Needs rebase and update to match current self.state API.

[2] k8s/deployment.yaml adds probes unconditionally (restart loop when management disabled)

The Helm chart correctly wraps probes in {{- if .Values.management.enabled }}, but the raw k8s/deployment.yaml adds liveness/readiness probes unconditionally. When management.enabled = false (the default), port 8090 has no listener → probes fail forever → pod restart loop.

Fix: wrap in the same conditional, or remove probes from the raw manifest and only include them in the Helm chart.
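For the first option, the guard would mirror what the Helm chart already does (probe paths and port shown here match the PR's defaults; the exact template context is an assumption):

```yaml
# Render probes only when the management server will actually be listening.
{{- if .Values.management.enabled }}
livenessProbe:
  httpGet:
    path: /healthz
    port: 8090
readinessProbe:
  httpGet:
    path: /healthz
    port: 8090
{{- end }}
```

Note the raw (non-templated) manifest cannot express this conditional, which is why removing the probes from it entirely may be the cleaner fix.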

[3] /healthz as both liveness and readiness — restart storm on Discord reconnect

As @ruan330 noted from production experience: serenity's Discord gateway reconnects its WebSocket roughly every 30 minutes. During reconnect, discord_connected briefly flips to false. If kubelet uses /healthz as liveness probe, it will restart the pod — killing all active sessions. This is independently validated by both OpenClaw and Hermes Agent architectures (see PR #302 research comment).

Suggested split:

  • /livez — process alive + HTTP responsive, never checks Discord → liveness target
  • /readyz — discord_connected == true → readiness target (only affects traffic routing, never triggers pod restart)

🟡 NIT

[4] DELETE /sessions bulk delete — no blast-radius protection

A single mistyped curl -X DELETE wipes every active session with no recovery. As @ruan330 noted, the production use case is almost always per-thread kill. Suggested: add ?confirm=true query param, or drop bulk delete entirely and keep only per-thread DELETE /sessions/{thread_id}.

[5] 0.0.0.0:8090 bind + no auth — undocumented security assumption

With 0.0.0.0, any pod in the same namespace can call DELETE /sessions. In a cluster without NetworkPolicy this is a footgun. Suggested: document the security assumption in config.toml.example, or add optional auth_token config (OpenClaw pattern: skip validation on loopback, require Authorization: Bearer when token is set — minimal diff).

[6] remove_session / remove_all_sessions bypass graceful teardown

Both methods do a raw conns.clear() / conns.remove(), relying on Arc Drop to kill the process group. If the Arc has other holders (e.g. a task currently handling a request), the process won't die immediately. Acceptable for now, but worth a comment marking the assumption.

[7] read_line has no size limit

The HTTP request parser uses buf_reader.read_line(&mut request_line) without a size cap. A malformed or malicious request could exhaust memory. Add a reasonable cap (e.g. 8 KB) before the splitn parse.
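One way to cap it is to read the line through a `take` adapter, so no more than the limit can ever be buffered. A synchronous sketch (the PR uses tokio's async I/O, and the helper name here is illustrative):

```rust
use std::io::{BufRead, BufReader, Error, ErrorKind, Read};

// Cap the request line at 8 KB; a longer line is treated as malformed
// rather than allowed to grow the buffer without bound.
const MAX_REQUEST_LINE: u64 = 8 * 1024;

fn read_request_line<R: Read>(stream: R) -> std::io::Result<String> {
    let mut limited = BufReader::new(stream.take(MAX_REQUEST_LINE));
    let mut line = String::new();
    limited.read_line(&mut line)?;
    // If we hit the cap without seeing a newline, the line was too long.
    if line.len() as u64 >= MAX_REQUEST_LINE && !line.ends_with('\n') {
        return Err(Error::new(ErrorKind::InvalidData, "request line too long"));
    }
    Ok(line)
}

fn main() {
    let req: &[u8] = b"GET /healthz HTTP/1.1\r\n";
    let line = read_request_line(req).unwrap();
    assert!(line.starts_with("GET /healthz"));

    // A 9000-byte line with no newline exceeds the cap and is rejected.
    let huge = vec![b'A'; 9000];
    assert!(read_request_line(&huge[..]).is_err());
}
```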

✅ What's good

  • Zero new dependencies (raw TcpListener) — matches project philosophy
  • enabled = false default — zero attack surface for existing deployments
  • AtomicBool for Discord status is the right call (lock-free, only needs eventual visibility)
  • Helm chart conditional is correct; the Helm path is production-ready
  • Error handling in serve() loop is solid: bind failure returns, accept error continues
  • @ruan330's production feedback validates the core approach independently

Summary

| # | Issue | Severity | Status |
| --- | --- | --- | --- |
| 1 | pool.rs field name conflict | 🔴 compile error | needs rebase |
| 2 | k8s/deployment.yaml unconditional probes | 🔴 restart loop | needs fix |
| 3 | /healthz as liveness + readiness | 🔴 restart storm | split to /livez + /readyz |
| 4 | Bulk DELETE no protection | 🟡 blast radius | add ?confirm=true or drop |
| 5 | 0.0.0.0 no auth docs | 🟡 security | add comment or optional token |
| 6 | Graceful teardown bypassed | 🟡 assumption | add code comment |
| 7 | read_line no size limit | 🟡 robustness | cap at 8 KB |

@thepagent
Collaborator Author

@ruan330 feel free to continue. Let me know when you are ready for review.
