Skip to content

fix: hermes deployment fixes (env block, oauth gzip, mcp mount, branding)#37

Merged
nnemirovsky merged 9 commits intomainfrom
v0.13.1-fixes
May 7, 2026
Merged

fix: hermes deployment fixes (env block, oauth gzip, mcp mount, branding)#37
nnemirovsky merged 9 commits intomainfrom
v0.13.1-fixes

Conversation

@nnemirovsky
Copy link
Copy Markdown
Owner

Summary

Six fixes turned up while migrating a production deployment from OpenClaw to Hermes. Each has a tractable root cause, a reproducible symptom, and a unit test.

  • OAuth response gzip + panic: internal/proxy/addon.go and the corresponding regression tests. The OAuth response handler parsed raw gzip bytes (\x1f\x8b) as JSON, then panicked on the parse failure and zeroed out the response body. Reproduced live against auth.openai.com. Now decompresses via the existing safeReplaceToDecodedBody helper, strips Content-Encoding after rewrite, and wraps the whole handler in a deferred recover so future malformed responses surface as errors instead of panics.
  • Env file marker block: internal/container/types.go. Startup injection used to : > "$ENV_FILE" (truncate), wiping any keys an agent or migration tool had written. Sluice now writes phantom tokens into a fenced BEGIN sluice-managed / END sluice-managed block and replaces only that block on each call. Foreign keys are preserved across both incremental updates and full reconciliation runs. fullReplace is retained for API compatibility but no longer affects behavior. Keys are sorted for diff-friendly output.
  • MCP gateway always mounts: cmd/sluice/main.go. The /mcp route used to mount only when ≥1 upstream was registered. An agent that registered sluice as an MCP server (the documented setup) hit a 404 and gave up before the operator could add the first upstream. Now the gateway always starts; with zero upstreams it exposes an empty tool list, which is a valid MCP response.
  • Hermes WireMCPCmd uses venv: internal/container/agent_profile.go. The bare python3 -c <script> failed with ModuleNotFoundError: yaml on the official nousresearch/hermes-agent image because PyYAML lives in /opt/hermes/.venv, not on the system Python path. The wire script is now invoked through a sh wrapper that activates the venv when present. Native installs without the venv keep working via the system python3.
  • Telegram agent label is configurable: internal/telegram/bot.go. Approval messages used to read "OpenClaw wants to connect to..." regardless of the active profile. New SetAgentDisplayName is wired from the --agent flag at startup. Hermes deployments now read "Hermes wants to connect to...".
  • Env path validation + sed escaping: internal/container/agent_profile.go. Already-existing validateEnvFileRelPath is now invoked from BuildEnvInjectionScriptForProfile, and the marker delete pattern escapes sed regex metacharacters so future marker edits cannot break the script.

Context

Production deploy of the Hermes profile (shipped in v0.13.0) hit each of these in turn while migrating from OpenClaw. Symptoms documented in commit message; happy to add a deploy runbook to docs if helpful.

Test plan

  • go build ./...
  • go test ./... — 2441 passed across 13 packages
  • gofumpt -l clean
  • golangci-lint run ./... clean
  • New unit tests cover gzip OAuth response, panic recovery on malformed body, marker block reconciliation across fullReplace=false|true, sorted keys, empty envMap, configurable Telegram display name, and the venv-aware Hermes wire command.
  • CI: test, e2e-linux, e2e-macos, gofumpt, golangci-lint, openapi

…ing)

Six fixes turned up while migrating a production deployment from
openclaw to hermes. Each had a tractable root cause and a real symptom
that broke the deploy.

- proxy/addon: OAuth response handler now decompresses gzip/br/deflate
  before parsing the token JSON. The old code parsed raw \x1f\x8b
  bytes as JSON, then panicked on the parse failure and took the
  flow's response down with it. Reproduced live against
  auth.openai.com which returns gzip by default. Wrapped the handler
  in a deferred recover so future malformed responses surface as
  errors instead of panics, and stripped Content-Encoding after
  rewrite so the client does not try to gunzip plaintext.
- container/types: env injection no longer truncates the agent's env
  file. Sluice writes its phantom tokens into a fenced
  "BEGIN/END sluice-managed" block and replaces only that block on
  each call. Foreign keys (set by `hermes claw migrate`, the agent's
  own auth flow, or an operator) survive every injection. The
  `fullReplace` flag is retained for API compatibility but no longer
  affects file behavior. Keys are sorted for diff-friendly output.
- cmd/sluice: MCP /mcp endpoint mounts unconditionally, not only when
  at least one upstream is registered. With zero upstreams the
  gateway exposes an empty tool list, which is a valid MCP response.
  This avoids the chicken-and-egg where an agent's MCP client gives
  up on a 404 before the operator can register the first upstream.
- container/agent_profile: HermesProfile.WireMCPCmd now invokes
  python3 through a sh wrapper that activates /opt/hermes/.venv when
  present. PyYAML lives in the venv on the official Hermes Docker
  image but is not on the system Python path; the previous bare
  `python3 -c` failed with ModuleNotFoundError on every wire-up.
  Native installs without the venv use the system python3.
- telegram/bot: agent name in approval messages is configurable via
  SetAgentDisplayName; the OpenClaw label is no longer hardcoded.
  Sluice's main wires this from --agent profile name at startup, so
  Hermes deployments now read "Hermes wants to connect to..." in
  Telegram.
- container/types: BuildEnvInjectionScript validates EnvFileRelPath
  and escapes sed regex metacharacters on the marker comment so
  future markers can be edited without breaking the script.

All six are covered by new unit tests.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses several Hermes production deployment issues discovered during migration from OpenClaw, spanning OAuth token response handling, env-file injection semantics, MCP gateway availability, Hermes MCP wiring, and Telegram approval message branding.

Changes:

  • Fix OAuth token response rewriting to handle compressed bodies and prevent proxy crashes on malformed responses.
  • Change env var injection to reconcile a fenced “sluice-managed” block (preserving foreign keys) and sort keys for stable output.
  • Always mount the MCP gateway, make Hermes MCP wiring venv-aware, and allow Telegram approval branding to reflect the active agent profile.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
internal/telegram/bot.go Adds configurable agent display name in approval message templates.
internal/telegram/bot_test.go Adds test coverage for configurable branding.
internal/proxy/addon.go Decodes compressed OAuth responses before swapping tokens; adds panic recovery and header adjustments.
internal/proxy/addon_test.go Adds regression tests for gzip token responses and malformed body panic protection.
internal/container/types.go Reworks env injection to replace only a fenced managed block; sorts keys; adds sed escaping helper.
internal/container/docker_test.go Updates env injection behavior expectations and Hermes wire command expectations.
internal/container/agent_profile.go Wraps Hermes MCP wiring with a sh-based venv activation fallback.
internal/container/agent_profile_test.go Updates tests for Hermes sh wrapper behavior and env injection semantics.
cmd/sluice/main.go Always starts/mounts MCP gateway; wires Telegram display name from active agent profile.
CLAUDE.md Documents Hermes venv wiring behavior and the new sluice-managed env block semantics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/proxy/addon.go
Comment thread internal/telegram/bot.go
Comment thread internal/container/types.go Outdated
Comment thread internal/container/types.go Outdated
Comment thread internal/container/agent_profile_test.go Outdated
The OpenClaw production compose lived in compose.yml at the repo root
since the original setup. Now that Hermes is the supported agent, the
canonical deploy files in this repo target Hermes too.

- compose.yml: new sluice (--agent hermes) + tun2proxy + hermes gateway
  + hermes-dashboard sidecar + caddy front. Sluice image pulled from
  ghcr.io/nnemirovsky/sluice:latest. Hermes image pinned to
  nousresearch/hermes-agent:v2026.4.30. HERMES_HOME is overridden to
  /opt/data/.hermes so paths align with sluice's HermesProfile.
- compose.dev.yml: same shape but builds sluice from local source for
  the dev loop. Drops caddy + dashboard sidecar.
- Caddyfile: cert paths moved off provider-specific /etc/cloudflare to
  standard FHS /etc/ssl/certs/agent.pem and /etc/ssl/private/agent.key,
  and reverse_proxy now targets the hermes-tun2proxy-1 dashboard port.
- bootstrap.sh: one-shot helper that imports an existing OpenClaw home
  volume into the Hermes home volume via `hermes claw migrate`, then
  pre-writes mcp_servers.sluice.url into ~/.hermes/config.yaml. Skips
  cleanly when no legacy volume is found, so it is safe to invoke on
  fresh installs too.

The old OpenClaw compose is gone. Operators still on OpenClaw can
recover it from any v0.12.x tag if needed; the migration path forward
is bootstrap.sh + the new compose.yml.
- proxy/addon: snapshot body + Content-Encoding/Length/Transfer-Encoding
  before decompressing the OAuth response. Restore the snapshot on any
  error or panic after the decompress so a partial mid-rewrite cannot
  send the client plaintext bytes with stripped encoding headers
  (which, for token responses, would mean real tokens delivered with
  no phantom swap). The flow now ends in one of two states: fully
  swapped, or untouched.
- telegram/bot: HTML-escape agentDisplayName at render time. SetAgent-
  DisplayName already filters empty values; the escape adds a second
  line of defense so a future profile name containing '<', '&', etc
  cannot break the message structure or inject markup.
- container/types: replace `sed '/BEGIN/,/END/d'` with an awk-based
  conditional delete that runs only when both BEGIN and END markers
  exist in the file. The old sed range would have happily deleted
  from BEGIN to end-of-file when END was missing (truncated write,
  manual edit), nuking foreign keys the marker design promises to
  preserve. The awk script writes to a sibling temp file and renames
  in place so a crash mid-rewrite leaves the original env file
  intact. `bsdSed` is kept in the signature for API compatibility
  but is no longer consulted (awk is portable across BSD and GNU).
- container/types: refresh the ContainerManager.InjectEnvVars doc
  comment to describe the marker-block reconciliation semantics
  instead of the old merge-vs-truncate split.
- container: rename TestHermesProfile_WireMCPUsesPython to
  TestHermesProfile_WireMCPUsesVenvWrapper now that the assertion is
  about the sh wrapper that activates the bundled venv.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.

Comment thread internal/container/types.go Outdated
Comment thread internal/proxy/addon.go
Comment thread Caddyfile Outdated
Comment thread bootstrap.sh Outdated
Comment thread bootstrap.sh Outdated
- proxy/addon: processAddonOAuthResponse now sets `modified` based on
  a byte-equality check between the pre-swap and post-swap bodies. A
  token endpoint that echoes already-phantom tokens (e.g. on a retry
  after the upstream rotated and we already swapped on the previous
  attempt) was logging "swapped to phantoms" with metrics bumps even
  though zero bytes changed. The function now rolls back its decode
  in the no-op case so the encoding/length headers continue to
  advertise the original wire form.
- Caddyfile: reverse_proxy now targets the docker compose service
  name `tun2proxy:9119` instead of the project-prefixed container
  name. The dashboard sidecar shares tun2proxy's namespace, so the
  service name reaches it on any internal network. Survives renames
  of the compose project and replica index changes.
- bootstrap.sh: typo fix ("read read-only" -> "mounted read-only")
  and BOOTSTRAP_OVERWRITE env var so a re-run can opt out of the
  --overwrite flag and let migrate refuse on conflicts. Header
  comment now describes the actual overwrite semantics instead of
  claiming the script is unconditionally idempotent.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

Comment thread internal/container/types.go Outdated
Comment thread internal/container/types.go Outdated
- container/types: BuildEnvInjectionScript docstring no longer
  references bsdSed/sed semantics. The implementation switched to awk
  in round 1 of this PR; the doc comment lagged. Now describes the
  awk-based block reconciliation accurately.
- container/types: env var values are now validated for newlines and
  NUL bytes via the new validateEnvVarValue helper. The previous doc
  claimed values were "single-quoted to prevent shell injection",
  which is true at the shell-command layer but did not stop a value
  with an embedded newline from splitting one logical entry into two
  lines of the env file (the second line would either drop or be
  parsed as a separate KEY=value when sourced). The escape comment
  now describes the actual quoting model.
@nnemirovsky nnemirovsky requested a review from Copilot May 7, 2026 06:55
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

Comment thread internal/container/types.go Outdated
Two findings, one squashed commit because they are interleaved by the
copilot review thread on this PR.

- proxy/ssh: sshHandleChannel previously called srcChan.CloseWrite from
  the upstream→agent data-copy goroutine the moment it saw EOF on the
  upstream channel. That CloseWrite (SSH_MSG_CHANNEL_EOF) raced the
  request-forwarder goroutine writing exit-status on the same channel.
  Depending on goroutine schedule, the agent could observe EOF and
  channel-close BEFORE the exit-status request reached the wire, which
  surfaces as `session.Wait` returning EOF instead of a clean exit.
  TestCredential_SSHInjection reproduced this 10-20% of the time
  locally and one in N runs in CI.
  Fix: hold the agent-side stdout EOF until all three upstream→agent
  goroutines (request, data, stderr) have drained, then issue
  CloseWrite followed by Close. The agent now sees the documented
  order: data, exit-status, EOF, close. Inputs from the agent (stdin
  → upstream) still get EOF'd as soon as the agent half-closes, so
  upstream `cat`-style commands keep terminating correctly.
- e2e/credential_test: tightened the test SSH server to send
  exit-status before half-closing the data side. This matches what
  well-behaved SSH servers do and removes one cause of the same
  ordering race on the server side.
- container/types: BuildEnvInjectionScriptForProfile now counts
  non-empty values up front and skips the marker-block emission
  entirely when nothing in envMap has content. The previous code
  emitted a stray `BEGIN sluice-managed ... END sluice-managed` pair
  with no entries when every value was empty (the per-call delete
  semantic), contradicting the "skip when empty" comment.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.

Comment thread internal/container/types.go
Comment thread internal/container/types.go Outdated
Comment thread bootstrap.sh
Comment thread bootstrap.sh Outdated
Comment thread Caddyfile Outdated
- container/types: env file values are now written as `KEY='value'`
  with single-quoted values via a quoted-tag heredoc instead of
  `KEY=value` (unquoted). The previous format relied on shell echo
  quoting which prevented script-level injection but did NOT survive
  re-sourcing the file. Production compose loads the env file with
  `set -a; . file; set +a`, so an unquoted value containing `$`,
  `$()`, backticks, spaces, or globs would have been re-expanded by
  the shell at source time. Single quotes around values neutralize
  every shell metacharacter; embedded single quotes are escaped via
  the standard `'\''` idiom (close, escaped quote, reopen). New
  round-trip test runs the generated script in a real shell, sources
  the file, and confirms values like `pa$$word`, `it's`, `back\`tick\``
  reach the environment unchanged.
- bootstrap.sh: HERMES_UID and HERMES_GID are now honored from env
  vars (with the upstream-image defaults of 10000/10000) and applied
  consistently to every chown across all docker run invocations. An
  operator who pinned non-default UID/GID via compose.yml previously
  ended up with migrated files owned by 10000:10000 and unreadable
  to the runtime user.
- bootstrap.sh: the venv activation step is now guarded by a
  `[ -f /opt/hermes/.venv/bin/activate ] && . ...` check that mirrors
  the runtime wrapper in HermesProfile.WireMCPCmd. A custom Hermes
  image without the venv now falls through to system python3 and
  emits a clear error if PyYAML is missing, instead of failing on a
  non-existent activate script.
- Caddyfile: comment now explicitly says the dashboard is reached
  *via* tun2proxy because the dashboard sidecar uses
  network_mode: "service:tun2proxy" and has no internal-network
  identity of its own. Less likely to mislead a reader trying to
  swap the proxy target.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.

Comment thread internal/container/types.go Outdated
Comment thread internal/container/types.go Outdated
Comment thread internal/container/types.go Outdated
Comment thread CLAUDE.md Outdated
Comment thread CLAUDE.md
- container/types: refresh validateEnvVarValue and the
  BuildEnvInjectionScript docstring to describe the quoted-tag
  heredoc + KEY='value' format. The previous comments referenced
  the old echo-based mechanism, and the example escape idiom was
  rendered with a smart quote that did not match the actual code.
- container/types: harden the awk pre-pass that removes the
  managed block. The old version triggered range-delete behavior
  whenever both BEGIN and END markers appeared anywhere in the
  file; if an operator manually shuffled the block (END before
  BEGIN, or duplicated markers) it could nuke foreign keys after
  an unmatched BEGIN. The new pass walks the file once, records
  every well-formed BEGIN..END pair as a contiguous range to
  delete, then prints every line not in the delete set.
  Orphan markers stay verbatim so a corrupted file shows what is
  wrong rather than silently dropping content.
- CLAUDE.md: env-block example now uses single-quoted values
  (KEY1='phantom-value-1') matching the production format, and
  documents the awk well-formed-pair semantic so operators
  hand-editing the file know which marker layouts are honored.
- CLAUDE.md: the Hermes profile row in the Agent Profiles table
  now describes the sh wrapper that activates Hermes' bundled
  venv before exec'ing python3, instead of the old bare
  `python3 -c <script>` form that no longer matches the code.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

Comment thread internal/container/types.go Outdated
Comment thread cmd/sluice/main.go
- container/types: awkStringEscape no longer doubles backslashes.
  In shell single-quoted strings, only the apostrophe needs the
  break-out-and-rejoin escape; every other byte (backslash
  included) is taken literally. Doubling backslashes would have
  silently changed the marker text awk sees vs the bytes in the
  file if EnvBlockBegin or EnvBlockEnd were ever edited to include
  a backslash, breaking the pre-pass match.
- cmd/sluice: phase 3 (MCP wire-up) comments and log messages no
  longer hard-code openclaw.json / mcp.servers.sluice.url. The
  comment now describes the profile-specific dispatch (openclaw
  uses the gateway RPC, hermes patches mcp_servers in
  ~/.hermes/config.yaml, future profiles plug their own
  WireMCPCmd), and the success/error logs include the active
  profile name so an operator reading the tail of `docker compose
  logs sluice` sees which agent the wiring targeted.
@nnemirovsky nnemirovsky requested a review from Copilot May 7, 2026 07:52
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated no new comments.

@nnemirovsky nnemirovsky merged commit d27b05e into main May 7, 2026
10 checks passed
@nnemirovsky nnemirovsky deleted the v0.13.1-fixes branch May 7, 2026 07:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants