
feat: lightweight management API for session observability and lifecycle control #57

Open
thepagent wants to merge 1 commit into main from feat/management-api

Conversation

@thepagent
Collaborator

Summary

Add an optional lightweight HTTP management server (zero new dependencies) for session observability and lifecycle control.

Endpoints

| Method | Path | Description |
| --- | --- | --- |
| GET | /healthz | Uptime + Discord connection status (K8s probe target) |
| GET | /sessions | List active sessions with idle time and alive status |
| DELETE | /sessions/{thread_id} | Graceful single-session teardown |
| DELETE | /sessions | Terminate all sessions |

Config

Opt-in via [management] section in config.toml:

[management]
enabled = true
bind = "0.0.0.0:8090"

Defaults to enabled = false — no behavior change for existing deployments.

Changes

  • src/config.rs — ManagementConfig struct (enabled, bind)
  • src/acp/pool.rs — list_sessions(), remove_session(), remove_all_sessions(), max_sessions()
  • src/discord.rs — discord_connected AtomicBool set on ready()
  • src/management.rs — raw TCP HTTP server (tokio::net::TcpListener, no framework)
  • src/main.rs — wire up the management server and shared state
  • config.toml.example — [management] section
  • k8s/deployment.yaml — livenessProbe + readinessProbe on port 8090
  • Helm chart — management.enabled/management.bind in values, configmap, deployment template

Design decisions

  • No framework deps — raw TcpListener keeps the dependency tree unchanged
  • Config opt-in — zero attack surface for deployments that don't need it
  • AtomicBool for discord status — lightweight, lock-free signaling from serenity handler
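The AtomicBool pattern can be sketched roughly as follows — a lock-free flag stored by the serenity ready() handler and loaded by the health endpoint. This is an illustrative sketch, not the PR's exact code; the function and variable names are assumptions.

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

// Loaded by the management server's health handler. Relaxed ordering is
// enough here: the flag only needs eventual visibility, not synchronization
// with other memory operations.
fn health_status(discord_connected: &AtomicBool) -> &'static str {
    if discord_connected.load(Ordering::Relaxed) {
        "connected"
    } else {
        "disconnected"
    }
}

fn main() {
    let discord_connected = Arc::new(AtomicBool::new(false));
    assert_eq!(health_status(&discord_connected), "disconnected");

    // What the serenity ready() handler would do once the gateway is up:
    discord_connected.store(true, Ordering::Relaxed);
    assert_eq!(health_status(&discord_connected), "connected");
}
```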

Closes #39

…cle control

Add optional HTTP management server (no framework deps) with endpoints:
- GET /healthz — uptime + discord connection status for K8s probes
- GET /sessions — list active sessions with idle time
- DELETE /sessions/{id} — graceful single-session teardown
- DELETE /sessions — terminate all sessions

Config opt-in via [management] section (enabled = false by default).
K8s deployment and Helm chart updated with livenessProbe/readinessProbe.

Closes #39
@ruan330

ruan330 commented Apr 10, 2026

Thanks for pushing this forward — we've been running a very similar always-on management API in production (bare metal + Docker) for the past few days, so wanted to share what we've hit. Happy to rebase onto this design once it lands.

1. /healthz as both liveness and readiness will flap.

The serenity Discord gateway reconnects its WebSocket periodically (roughly every 30 minutes, sooner under packet loss). During the reconnect, discord_connected briefly flips to false. If K8s uses /healthz as liveness and expects discord_connected: true, the kubelet will restart the pod every reconnect cycle — which then drops every active session.

Our Docker healthcheck runs into the same class of issue for a different reason (/status returns 200 even when Discord is disconnected, so we can't distinguish "process alive but idle" from "process alive and serving").

Suggested split:

  • /livez — process alive + HTTP server responsive. Never checks Discord. Liveness probe target.
  • /readyz — discord_connected == true AND pool not exhausted. Readiness probe target.

The current PR's livenessProbe + readinessProbe both pointing at /healthz would lead to restart storms under flaky networks. We'd recommend different paths in k8s/deployment.yaml.
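A sketch of what that split could look like in k8s/deployment.yaml, assuming the PR's port 8090 default (the timing values are illustrative, not a recommendation):

```yaml
# Liveness never checks Discord: a gateway reconnect must not restart the pod.
livenessProbe:
  httpGet:
    path: /livez
    port: 8090
  initialDelaySeconds: 5
  periodSeconds: 10
# Readiness gates traffic on Discord connectivity; failing it only removes
# the pod from service endpoints, it never kills active sessions.
readinessProbe:
  httpGet:
    path: /readyz
    port: 8090
  periodSeconds: 5
  failureThreshold: 3
```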

2. DELETE /sessions (bulk) — is the use case worth the risk?

Genuine question: what's the production scenario? We've never needed it. Our /kill/{thread_id} is used in two cases:

  • flutter run dev servers that pass alive() but block streaming → per-thread kill, user unaffected
  • Occasional stuck ACP state during debugging → specific thread

Bulk delete would wipe every active user conversation indiscriminately. If someone hits DELETE /sessions by accident (curl, browser extension, typo), there's no recovery — Discord threads lose their agent state and users get silent sessions.

If there's a real use case (e.g. before broker upgrade), a ?confirm=true query param or an admin_token gate would dramatically reduce blast radius. Or drop it entirely and rely on individual deletes — we've never missed it.
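The ?confirm=true gate could look roughly like this — a hypothetical helper (name and path format are illustrative) that refuses the bulk delete unless the query string opts in explicitly:

```rust
// Sketch of a blast-radius gate for bulk delete: only allow the operation
// when the request target carries an explicit ?confirm=true.
fn bulk_delete_allowed(target: &str) -> bool {
    match target.split_once('?') {
        // Bare "/sessions" or any other path: refuse.
        Some(("/sessions", query)) => query.split('&').any(|p| p == "confirm=true"),
        _ => false,
    }
}

fn main() {
    assert!(!bulk_delete_allowed("/sessions"));
    assert!(!bulk_delete_allowed("/sessions?confirm=false"));
    assert!(bulk_delete_allowed("/sessions?confirm=true"));
    assert!(bulk_delete_allowed("/sessions?x=1&confirm=true"));
}
```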

3. GET /sessions — worth reserving a cwd field.

Our /sessions returns an extra cwd field per session (working directory the agent was spawned with). In production debugging, the most common question is "which project is this stuck thread in?" — thread_id + alive + idle_seconds alone requires cross-referencing Discord channel metadata.

This is currently fork-specific (our [cwd:] directive), but once §2a lands {base_working_dir}/{thread_id}/, every session will have a meaningful cwd to surface. Reserving the field now would save a schema version bump later.

4. Default bind + auth.

bind = "0.0.0.0:8090" is an opinionated choice — perfect for bare metal in a trusted network (which is our setup), but surprising for a K8s deployment without NetworkPolicy.

Options we'd consider, ordered by disruption:

  • Default to 127.0.0.1:8090 (backward incompatible for existing always-on users, so probably not)
  • Keep 0.0.0.0:8090 but document the assumption in config.toml.example
  • Optional [management] auth_token = "${MGMT_TOKEN}" — skip validation when empty, require Authorization: Bearer when set. Two lines of diff, keeps the simple-deployment story.

We don't feel strongly; the token option is what we'd add if this were being shipped to users we don't control.
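The token option really is a small diff. A minimal sketch, assuming a hypothetical `authorized` helper (the config field and header handling are assumptions, not the PR's code): skip validation when no token is configured, otherwise require an exact `Authorization: Bearer <token>` match.

```rust
// Sketch of the optional auth gate. An empty or absent configured token
// disables auth entirely, preserving the simple-deployment story.
fn authorized(configured: Option<&str>, auth_header: Option<&str>) -> bool {
    match configured {
        None | Some("") => true, // auth disabled
        Some(token) => auth_header.map_or(false, |h| h == format!("Bearer {token}")),
    }
}

fn main() {
    assert!(authorized(None, None));                              // no token set
    assert!(authorized(Some(""), Some("anything")));              // empty = disabled
    assert!(!authorized(Some("s3cret"), None));                   // header required
    assert!(authorized(Some("s3cret"), Some("Bearer s3cret")));   // exact match
    assert!(!authorized(Some("s3cret"), Some("Bearer wrong")));   // mismatch
}
```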


None of these are blockers — the core approach (raw TCP, opt-in, zero deps) matches what we independently arrived at, and we'd love to drop our inline implementation in favor of this once it lands. Let me know if any of these would be useful as follow-up PRs vs folded in here.

@chaodu-agent
Collaborator

Community Triage Review — 超渡法師 + 普渡法師

Verdict: Changes required before merge

The feature is needed — main has no management API, no K8s health probes, and no session observability. The core design (raw TCP, opt-in, zero deps) is solid and independently validated by production experience from other deployments. However, there are compile-breaking and production-breaking issues that need to be addressed first.

🔴 Suggested Changes

[1] pool.rs field name conflict — needs rebase (compile error)

The diff references self.connections.read().await, but main uses self.state: RwLock<PoolState>. This PR was written against an older pool.rs structure. Direct merge will fail to compile. Needs rebase and update to match current self.state API.

[2] k8s/deployment.yaml adds probes unconditionally (restart loop when management disabled)

The Helm chart correctly wraps probes in {{- if .Values.management.enabled }}, but the raw k8s/deployment.yaml adds liveness/readiness probes unconditionally. When management.enabled = false (the default), port 8090 has no listener → probes fail forever → pod restart loop.

Fix: wrap in the same conditional, or remove probes from the raw manifest and only include them in the Helm chart.
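For the first option, the guard would mirror what the Helm chart already does (probe paths and port shown here match the PR's defaults; the exact template context is an assumption):

```yaml
# Render probes only when the management server will actually be listening.
{{- if .Values.management.enabled }}
livenessProbe:
  httpGet:
    path: /healthz
    port: 8090
readinessProbe:
  httpGet:
    path: /healthz
    port: 8090
{{- end }}
```

Note the raw (non-templated) manifest cannot express this conditional, which is why removing the probes from it entirely may be the cleaner fix.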

[3] /healthz as both liveness and readiness — restart storm on Discord reconnect

As @ruan330 noted from production experience: serenity's Discord gateway reconnects its WebSocket roughly every 30 minutes. During reconnect, discord_connected briefly flips to false. If kubelet uses /healthz as liveness probe, it will restart the pod — killing all active sessions. This is independently validated by both OpenClaw and Hermes Agent architectures (see PR #302 research comment).

Suggested split:

  • /livez — process alive + HTTP responsive, never checks Discord → liveness target
  • /readyz — discord_connected == true → readiness target (only affects traffic routing, never triggers pod restart)

🟡 NIT

[4] DELETE /sessions bulk delete — no blast-radius protection

A single mistyped curl -X DELETE wipes every active session with no recovery. As @ruan330 noted, the production use case is almost always per-thread kill. Suggested: add ?confirm=true query param, or drop bulk delete entirely and keep only per-thread DELETE /sessions/{thread_id}.

[5] 0.0.0.0:8090 bind + no auth — undocumented security assumption

With 0.0.0.0, any pod in the same namespace can call DELETE /sessions. In a cluster without NetworkPolicy this is a footgun. Suggested: document the security assumption in config.toml.example, or add optional auth_token config (OpenClaw pattern: skip validation on loopback, require Authorization: Bearer when token is set — minimal diff).

[6] remove_session / remove_all_sessions bypass graceful teardown

Both methods do a raw conns.clear() / conns.remove(), relying on Arc Drop to kill the process group. If the Arc has other holders (e.g. a task currently handling a request), the process won't die immediately. Acceptable for now, but worth a comment marking the assumption.

[7] read_line has no size limit

The HTTP request parser uses buf_reader.read_line(&mut request_line) without a size cap. A malformed or malicious request could exhaust memory. Add a reasonable cap (e.g. 8 KB) before the splitn parse.
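One way to cap it is to read the line through a `take` adapter, so no more than the limit can ever be buffered. A synchronous sketch (the PR uses tokio's async I/O, and the helper name here is illustrative):

```rust
use std::io::{BufRead, BufReader, Error, ErrorKind, Read};

// Cap the request line at 8 KB; a longer line is treated as malformed
// rather than allowed to grow the buffer without bound.
const MAX_REQUEST_LINE: u64 = 8 * 1024;

fn read_request_line<R: Read>(stream: R) -> std::io::Result<String> {
    let mut limited = BufReader::new(stream.take(MAX_REQUEST_LINE));
    let mut line = String::new();
    limited.read_line(&mut line)?;
    // If we hit the cap without seeing a newline, the line was too long.
    if line.len() as u64 >= MAX_REQUEST_LINE && !line.ends_with('\n') {
        return Err(Error::new(ErrorKind::InvalidData, "request line too long"));
    }
    Ok(line)
}

fn main() {
    let req: &[u8] = b"GET /healthz HTTP/1.1\r\n";
    let line = read_request_line(req).unwrap();
    assert!(line.starts_with("GET /healthz"));

    // A 9000-byte line with no newline exceeds the cap and is rejected.
    let huge = vec![b'A'; 9000];
    assert!(read_request_line(&huge[..]).is_err());
}
```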

✅ What's good

  • Zero new dependencies (raw TcpListener) — matches project philosophy
  • enabled = false default — zero attack surface for existing deployments
  • AtomicBool for Discord status is the right call (lock-free, only needs eventual visibility)
  • Helm chart conditional is correct; the Helm path is production-ready
  • Error handling in serve() loop is solid: bind failure returns, accept error continues
  • @ruan330's production feedback validates the core approach independently

Summary

| # | Issue | Severity | Status |
| --- | --- | --- | --- |
| 1 | pool.rs field name conflict | 🔴 compile error | needs rebase |
| 2 | k8s/deployment.yaml unconditional probes | 🔴 restart loop | needs fix |
| 3 | /healthz as liveness + readiness | 🔴 restart storm | split to /livez + /readyz |
| 4 | Bulk DELETE no protection | 🟡 blast radius | add ?confirm=true or drop |
| 5 | 0.0.0.0 no auth docs | 🟡 security | add comment or optional token |
| 6 | Graceful teardown bypassed | 🟡 assumption | add code comment |
| 7 | read_line no size limit | 🟡 robustness | cap at 8 KB |

@thepagent
Collaborator Author

@ruan330 feel free to continue. Let me know when you are ready for review.
