fix: hermes deployment fixes (env block, oauth gzip, mcp mount, branding) by nnemirovsky · Pull Request #37 · nnemirovsky/sluice

nnemirovsky · 2026-05-07T06:28:15Z

Summary

Six fixes turned up while migrating a production deployment from OpenClaw to Hermes. Each has a tractable root cause, a reproducible symptom, and a unit test.

OAuth response gzip + panic: internal/proxy/addon.go and the corresponding regression tests. The OAuth response handler parsed raw gzip bytes (\x1f\x8b) as JSON, then panicked on the parse failure and zeroed out the response body. Reproduced live against auth.openai.com. Now decompresses via the existing safeReplaceToDecodedBody helper, strips Content-Encoding after rewrite, and wraps the whole handler in a deferred recover so future malformed responses surface as errors instead of panics.
Env file marker block: internal/container/types.go. Startup injection used to : > "$ENV_FILE" (truncate), wiping any keys an agent or migration tool had written. Sluice now writes phantom tokens into a fenced BEGIN sluice-managed / END sluice-managed block and replaces only that block on each call. Foreign keys are preserved across both incremental updates and full reconciliation runs. fullReplace is retained for API compatibility but no longer affects behavior. Keys are sorted for diff-friendly output.
MCP gateway always mounts: cmd/sluice/main.go. The /mcp route used to mount only when ≥1 upstream was registered. An agent that registered sluice as an MCP server (the documented setup) hit a 404 and gave up before the operator could add the first upstream. Now the gateway always starts; with zero upstreams it exposes an empty tool list, which is a valid MCP response.
Hermes WireMCPCmd uses venv: internal/container/agent_profile.go. The bare python3 -c <script> failed with ModuleNotFoundError: yaml on the official nousresearch/hermes-agent image because PyYAML lives in /opt/hermes/.venv, not on the system Python path. The wire script is now invoked through a sh wrapper that activates the venv when present. Native installs without the venv keep working via the system python3.
Telegram agent label is configurable: internal/telegram/bot.go. Approval messages used to read "OpenClaw wants to connect to..." regardless of the active profile. New SetAgentDisplayName is wired from the --agent flag at startup. Hermes deployments now read "Hermes wants to connect to...".
Env path validation + sed escaping: internal/container/agent_profile.go. Already-existing validateEnvFileRelPath is now invoked from BuildEnvInjectionScriptForProfile, and the marker delete pattern escapes sed regex metacharacters so future marker edits cannot break the script.

Context

Production deploy of the Hermes profile (shipped in v0.13.0) hit each of these in turn while migrating from OpenClaw. Symptoms documented in commit message; happy to add a deploy runbook to docs if helpful.

Test plan

go build ./...
go test ./... — 2441 passed across 13 packages
gofumpt -l clean
golangci-lint run ./... clean
New unit tests cover gzip OAuth response, panic recovery on malformed body, marker block reconciliation across fullReplace=false|true, sorted keys, empty envMap, configurable Telegram display name, and the venv-aware Hermes wire command.
CI: test, e2e-linux, e2e-macos, gofumpt, golangci-lint, openapi

…ing) Six fixes turned up while migrating a production deployment from openclaw to hermes. Each had a tractable root cause and a real symptom that broke the deploy. - proxy/addon: OAuth response handler now decompresses gzip/br/deflate before parsing the token JSON. The old code parsed raw \x1f\x8b bytes as JSON, then panicked on the parse failure and took the flow's response down with it. Reproduced live against auth.openai.com which returns gzip by default. Wrapped the handler in a deferred recover so future malformed responses surface as errors instead of panics, and stripped Content-Encoding after rewrite so the client does not try to gunzip plaintext. - container/types: env injection no longer truncates the agent's env file. Sluice writes its phantom tokens into a fenced "BEGIN/END sluice-managed" block and replaces only that block on each call. Foreign keys (set by `hermes claw migrate`, the agent's own auth flow, or an operator) survive every injection. The `fullReplace` flag is retained for API compatibility but no longer affects file behavior. Keys are sorted for diff-friendly output. - cmd/sluice: MCP /mcp endpoint mounts unconditionally, not only when at least one upstream is registered. With zero upstreams the gateway exposes an empty tool list, which is a valid MCP response. This avoids the chicken-and-egg where an agent's MCP client gives up on a 404 before the operator can register the first upstream. - container/agent_profile: HermesProfile.WireMCPCmd now invokes python3 through a sh wrapper that activates /opt/hermes/.venv when present. PyYAML lives in the venv on the official Hermes Docker image but is not on the system Python path; the previous bare `python3 -c` failed with ModuleNotFoundError on every wire-up. Native installs without the venv use the system python3. - telegram/bot: agent name in approval messages is configurable via SetAgentDisplayName; the OpenClaw label is no longer hardcoded. Sluice's main wires this from --agent profile name at startup, so Hermes deployments now read "Hermes wants to connect to..." in Telegram. - container/types: BuildEnvInjectionScript validates EnvFileRelPath and escapes sed regex metacharacters on the marker comment so future markers can be edited without breaking the script. All six are covered by new unit tests.

Copilot

Pull request overview

This PR addresses several Hermes production deployment issues discovered during migration from OpenClaw, spanning OAuth token response handling, env-file injection semantics, MCP gateway availability, Hermes MCP wiring, and Telegram approval message branding.

Changes:

Fix OAuth token response rewriting to handle compressed bodies and prevent proxy crashes on malformed responses.
Change env var injection to reconcile a fenced “sluice-managed” block (preserving foreign keys) and sort keys for stable output.
Always mount the MCP gateway, make Hermes MCP wiring venv-aware, and allow Telegram approval branding to reflect the active agent profile.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
internal/telegram/bot.go	Adds configurable agent display name in approval message templates.
internal/telegram/bot_test.go	Adds test coverage for configurable branding.
internal/proxy/addon.go	Decodes compressed OAuth responses before swapping tokens; adds panic recovery and header adjustments.
internal/proxy/addon_test.go	Adds regression tests for gzip token responses and malformed body panic protection.
internal/container/types.go	Reworks env injection to replace only a fenced managed block; sorts keys; adds sed escaping helper.
internal/container/docker_test.go	Updates env injection behavior expectations and Hermes wire command expectations.
internal/container/agent_profile.go	Wraps Hermes MCP wiring with a sh-based venv activation fallback.
internal/container/agent_profile_test.go	Updates tests for Hermes sh wrapper behavior and env injection semantics.
cmd/sluice/main.go	Always starts/mounts MCP gateway; wires Telegram display name from active agent profile.
CLAUDE.md	Documents Hermes venv wiring behavior and the new sluice-managed env block semantics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

The OpenClaw production compose lived in compose.yml at the repo root since the original setup. Now that Hermes is the supported agent, the canonical deploy files in this repo target Hermes too. - compose.yml: new sluice (--agent hermes) + tun2proxy + hermes gateway + hermes-dashboard sidecar + caddy front. Sluice image pulled from ghcr.io/nnemirovsky/sluice:latest. Hermes image pinned to nousresearch/hermes-agent:v2026.4.30. HERMES_HOME is overridden to /opt/data/.hermes so paths align with sluice's HermesProfile. - compose.dev.yml: same shape but builds sluice from local source for the dev loop. Drops caddy + dashboard sidecar. - Caddyfile: cert paths moved off provider-specific /etc/cloudflare to standard FHS /etc/ssl/certs/agent.pem and /etc/ssl/private/agent.key, and reverse_proxy now targets the hermes-tun2proxy-1 dashboard port. - bootstrap.sh: one-shot helper that imports an existing OpenClaw home volume into the Hermes home volume via `hermes claw migrate`, then pre-writes mcp_servers.sluice.url into ~/.hermes/config.yaml. Skips cleanly when no legacy volume is found, so it is safe to invoke on fresh installs too. The old OpenClaw compose is gone. Operators still on OpenClaw can recover it from any v0.12.x tag if needed; the migration path forward is bootstrap.sh + the new compose.yml.

- proxy/addon: snapshot body + Content-Encoding/Length/Transfer-Encoding before decompressing the OAuth response. Restore the snapshot on any error or panic after the decompress so a partial mid-rewrite cannot send the client plaintext bytes with stripped encoding headers (which, for token responses, would mean real tokens delivered with no phantom swap). The flow now ends in one of two states: fully swapped, or untouched. - telegram/bot: HTML-escape agentDisplayName at render time. SetAgent- DisplayName already filters empty values; the escape adds a second line of defense so a future profile name containing '<', '&', etc cannot break the message structure or inject markup. - container/types: replace `sed '/BEGIN/,/END/d'` with an awk-based conditional delete that runs only when both BEGIN and END markers exist in the file. The old sed range would have happily deleted from BEGIN to end-of-file when END was missing (truncated write, manual edit), nuking foreign keys the marker design promises to preserve. The awk script writes to a sibling temp file and renames in place so a crash mid-rewrite leaves the original env file intact. `bsdSed` is kept in the signature for API compatibility but is no longer consulted (awk is portable across BSD and GNU). - container/types: refresh the ContainerManager.InjectEnvVars doc comment to describe the marker-block reconciliation semantics instead of the old merge-vs-truncate split. - container: rename TestHermesProfile_WireMCPUsesPython to TestHermesProfile_WireMCPUsesVenvWrapper now that the assertion is about the sh wrapper that activates the bundled venv.

Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.

- proxy/addon: processAddonOAuthResponse now sets `modified` based on a byte-equality check between the pre-swap and post-swap bodies. A token endpoint that echoes already-phantom tokens (e.g. on a retry after the upstream rotated and we already swapped on the previous attempt) was logging "swapped to phantoms" with metrics bumps even though zero bytes changed. The function now rolls back its decode in the no-op case so the encoding/length headers continue to advertise the original wire form. - Caddyfile: reverse_proxy now targets the docker compose service name `tun2proxy:9119` instead of the project-prefixed container name. The dashboard sidecar shares tun2proxy's namespace, so the service name reaches it on any internal network. Survives renames of the compose project and replica index changes. - bootstrap.sh: typo fix ("read read-only" -> "mounted read-only") and BOOTSTRAP_OVERWRITE env var so a re-run can opt out of the --overwrite flag and let migrate refuse on conflicts. Header comment now describes the actual overwrite semantics instead of claiming the script is unconditionally idempotent.

Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

- container/types: BuildEnvInjectionScript docstring no longer references bsdSed/sed semantics. The implementation switched to awk in round 1 of this PR; the doc comment lagged. Now describes the awk-based block reconciliation accurately. - container/types: env var values are now validated for newlines and NUL bytes via the new validateEnvVarValue helper. The previous doc claimed values were "single-quoted to prevent shell injection", which is true at the shell-command layer but did not stop a value with an embedded newline from splitting one logical entry into two lines of the env file (the second line would either drop or be parsed as a separate KEY=value when sourced). The escape comment now describes the actual quoting model.

Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

Two findings, one squashed commit because they are interleaved by the copilot review thread on this PR. - proxy/ssh: sshHandleChannel previously called srcChan.CloseWrite from the upstream→agent data-copy goroutine the moment it saw EOF on the upstream channel. That CloseWrite (SSH_MSG_CHANNEL_EOF) raced the request-forwarder goroutine writing exit-status on the same channel. Depending on goroutine schedule, the agent could observe EOF and channel-close BEFORE the exit-status request reached the wire, which surfaces as `session.Wait` returning EOF instead of a clean exit. TestCredential_SSHInjection reproduced this 10-20% of the time locally and one in N runs in CI. Fix: hold the agent-side stdout EOF until all three upstream→agent goroutines (request, data, stderr) have drained, then issue CloseWrite followed by Close. The agent now sees the documented order: data, exit-status, EOF, close. Inputs from the agent (stdin → upstream) still get EOF'd as soon as the agent half-closes, so upstream `cat`-style commands keep terminating correctly. - e2e/credential_test: tightened the test SSH server to send exit-status before half-closing the data side. This matches what well-behaved SSH servers do and removes one cause of the same ordering race on the server side. - container/types: BuildEnvInjectionScriptForProfile now counts non-empty values up front and skips the marker-block emission entirely when nothing in envMap has content. The previous code emitted a stray `BEGIN sluice-managed ... END sluice-managed` pair with no entries when every value was empty (the per-call delete semantic), contradicting the "skip when empty" comment.

Copilot

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.

- container/types: env file values are now written as `KEY='value'` with single-quoted values via a quoted-tag heredoc instead of `KEY=value` (unquoted). The previous format relied on shell echo quoting which prevented script-level injection but did NOT survive re-sourcing the file. Production compose loads the env file with `set -a; . file; set +a`, so an unquoted value containing `$`, `$()`, backticks, spaces, or globs would have been re-expanded by the shell at source time. Single quotes around values neutralize every shell metacharacter; embedded single quotes are escaped via the standard `'\''` idiom (close, escaped quote, reopen). New round-trip test runs the generated script in a real shell, sources the file, and confirms values like `pa$$word`, `it's`, `back\`tick\`` reach the environment unchanged. - bootstrap.sh: HERMES_UID and HERMES_GID are now honored from env vars (with the upstream-image defaults of 10000/10000) and applied consistently to every chown across all docker run invocations. An operator who pinned non-default UID/GID via compose.yml previously ended up with migrated files owned by 10000:10000 and unreadable to the runtime user. - bootstrap.sh: the venv activation step is now guarded by a `[ -f /opt/hermes/.venv/bin/activate ] && . ...` check that mirrors the runtime wrapper in HermesProfile.WireMCPCmd. A custom Hermes image without the venv now falls through to system python3 and emits a clear error if PyYAML is missing, instead of failing on a non-existent activate script. - Caddyfile: comment now explicitly says the dashboard is reached *via* tun2proxy because the dashboard sidecar uses network_mode: "service:tun2proxy" and has no internal-network identity of its own. Less likely to mislead a reader trying to swap the proxy target.

Copilot

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.

- container/types: refresh validateEnvVarValue and the BuildEnvInjectionScript docstring to describe the quoted-tag heredoc + KEY='value' format. The previous comments referenced the old echo-based mechanism, and the example escape idiom was rendered with a smart quote that did not match the actual code. - container/types: harden the awk pre-pass that removes the managed block. The old version triggered range-delete behavior whenever both BEGIN and END markers appeared anywhere in the file; if an operator manually shuffled the block (END before BEGIN, or duplicated markers) it could nuke foreign keys after an unmatched BEGIN. The new pass walks the file once, records every well-formed BEGIN..END pair as a contiguous range to delete, then prints every line not in the delete set. Orphan markers stay verbatim so a corrupted file shows what is wrong rather than silently dropping content. - CLAUDE.md: env-block example now uses single-quoted values (KEY1='phantom-value-1') matching the production format, and documents the awk well-formed-pair semantic so operators hand-editing the file know which marker layouts are honored. - CLAUDE.md: the Hermes profile row in the Agent Profiles table now describes the sh wrapper that activates Hermes' bundled venv before exec'ing python3, instead of the old bare `python3 -c <script>` form that no longer matches the code.

Copilot

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

- container/types: awkStringEscape no longer doubles backslashes. In shell single-quoted strings, only the apostrophe needs the break-out-and-rejoin escape; every other byte (backslash included) is taken literally. Doubling backslashes would have silently changed the marker text awk sees vs the bytes in the file if EnvBlockBegin or EnvBlockEnd were ever edited to include a backslash, breaking the pre-pass match. - cmd/sluice: phase 3 (MCP wire-up) comments and log messages no longer hard-code openclaw.json / mcp.servers.sluice.url. The comment now describes the profile-specific dispatch (openclaw uses the gateway RPC, hermes patches mcp_servers in ~/.hermes/config.yaml, future profiles plug their own WireMCPCmd), and the success/error logs include the active profile name so an operator reading the tail of `docker compose logs sluice` sees which agent the wiring targeted.

Copilot