fix(k8s): Service for BoJ to ClusterIP (HCG tier-2 E1 prereq) #131

Merged: hyperpolymath merged 1 commit into main from fix/boj-k8s-service-clusterip on May 20, 2026

Conversation

@hyperpolymath
Owner

Summary

Per ADR-0004 §1 (http-capability-gateway tier-2 placement) and the Phase E rollout-runbook §1.4 prereq #8, BoJ MUST NOT be externally addressable when fronted by HCG. Previously k8s/service.yaml declared type: LoadBalancer, exposing all four BoJ ports (REST 7700, gRPC 7701, GraphQL 7702, SSE 7703) externally — defeating the runbook §3.4 decommission invariant before Phase E even starts.

This is action item #8 from the Phase E consumer-side audit — companion to action #6 (boj-server#130 Cowboy bind tightening). The two layers together give defence in depth: BoJ binds loopback (#130) AND the k8s Service does not expose it externally (this PR).

Refs hyperpolymath/standards#100 (NOT Closes — joint-close is owner-only).
Refs hyperpolymath/standards#91.

What's in

| File | Change |
|---|---|
| k8s/service.yaml | type: LoadBalancer → type: ClusterIP. Header comment explaining the Phase E posture. New annotations: hyperpolymath.dev/exposure: "internal-only", hyperpolymath.dev/external-via: "http-capability-gateway (tier-2)". Ports 7700–7703 retained forward-compatibly. |
| CHANGELOG.md | New ### Changed entry under [Unreleased]. |
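
For orientation, the resulting Service might look roughly like the sketch below. The Service name boj-server comes from the test plan's kubectl describe svc/boj-server, and the annotations and port numbers are the ones listed in this PR. The selector label and the port names are assumptions made for illustration and are not copied from the actual manifest.

```yaml
# Sketch only: an approximation of k8s/service.yaml after this PR.
apiVersion: v1
kind: Service
metadata:
  name: boj-server
  annotations:
    hyperpolymath.dev/exposure: "internal-only"
    hyperpolymath.dev/external-via: "http-capability-gateway (tier-2)"
spec:
  type: ClusterIP            # was LoadBalancer before this PR
  selector:
    app: boj-server          # assumed pod label
  ports:
    - name: rest             # port names are assumed
      port: 7700
      targetPort: 7700
    - name: grpc             # reserved; declared forward-compatibly
      port: 7701
      targetPort: 7701
    - name: graphql          # reserved; declared forward-compatibly
      port: 7702
      targetPort: 7702
    - name: sse              # reserved; declared forward-compatibly
      port: 7703
      targetPort: 7703
```

With type: ClusterIP the Service only receives a cluster-internal virtual IP, so any external traffic has to arrive through the gateway tier.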

Estate cross-check: hypatia/*, rsr-certifier, and opsm-service all use ClusterIP for backend Services. The estate's gateway-fronted-backend pattern is Service: ClusterIP. The only LoadBalancer in the estate's k8s manifests today (besides the BoJ Service this PR fixes) is svalinn-gateway — i.e. the gateway tier, not a backend. This PR brings BoJ's Service in line with the estate pattern.

Why DRAFT

Breaking for anyone running this manifest as-is with a LoadBalancer in production. I don't know whether a live LoadBalancer-fronted BoJ deployment exists. Marking draft so the owner gates merge on confirming no such deployment will break. If one exists, the migration path is HCG-in-front (per the rollout-runbook §2).

For ad-hoc / dev / non-HCG-fronted deployments that need BoJ exposed externally, the header comment in service.yaml shows the kustomize/helm overlay recipe rather than editing the canonical manifest:

- op: replace
  path: /spec/type
  value: LoadBalancer  # only valid for non-HCG-fronted deployments
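
One way the patch above could be wired into a kustomize overlay is sketched here. The overlay location and the ../../base path are hypothetical, and the sketch assumes the canonical manifest is exposed as a kustomize base with its own kustomization.yaml; the Service name boj-server is taken from the test plan.

```yaml
# Hypothetical overlay, e.g. overlays/standalone/kustomization.yaml.
# Only intended for deployments with no HCG in front of BoJ.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base               # assumed base directory containing service.yaml
patches:
  - target:
      kind: Service
      name: boj-server
    patch: |-
      - op: replace
        path: /spec/type
        value: LoadBalancer  # only valid for non-HCG-fronted deployments
```

Keeping the override in an overlay leaves the canonical manifest at ClusterIP, so the internal-only posture stays the default.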

Test plan

  • YAML lints clean (manual review — single-document Service spec, valid v1 API).
  • CI green — governance / a2ml / k9 / hypatia all pass.
  • Owner: confirm no live LoadBalancer-fronted BoJ deployment will break, then flip from DRAFT to ready.
  • Post-merge sanity: kubectl apply -f k8s/service.yaml --dry-run=client accepts the spec; kubectl describe svc/boj-server shows the new annotations.

Risk

Medium for the manifest, low for the codebase. The codebase is untouched; only the k8s manifest changes. Breakage scope is limited to k8s operators relying on the LoadBalancer external IP for BoJ; they get a fixable Service-type override. No mix.exs / Elixir / Zig / Idris2 / cartridge changes; CI should not show any logic regressions.

Follow-ups not in this PR (audit residue)

  • Action #7: stapeln.toml APP_HOST = "[::]" → loopback (separate concern: packager). Can pick up next.
  • Optional: add a NetworkPolicy restricting BoJ pod ingress to only the HCG pod. Defence-in-depth beyond ClusterIP. Not strictly required for Phase E acceptance — Service-type is the controlling lever — but worth filing as a follow-up issue.
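
The optional NetworkPolicy follow-up could take a shape like the sketch below. Both pod labels and the policy name are assumptions; the real labels on the BoJ and HCG pods are not shown in this PR. Only port 7700 is listed because the commit message notes that the current BoJ binds 7700 only.

```yaml
# Sketch only: restrict ingress to BoJ pods to traffic from HCG pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: boj-allow-hcg-only             # hypothetical name
spec:
  podSelector:
    matchLabels:
      app: boj-server                  # assumed BoJ pod label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: http-capability-gateway   # assumed HCG pod label
      ports:
        - protocol: TCP
          port: 7700
```

A policy like this only takes effect when the cluster's network plugin enforces NetworkPolicy, and the podSelector under from matches pods in the same namespace as the policy.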

🤖 Generated with Claude Code

Per ADR-0004 §1 (http-capability-gateway tier-2 placement) and the
Phase E rollout-runbook §1.4 prereq #8, BoJ MUST NOT be externally
addressable when fronted by HCG. Previously `k8s/service.yaml`
declared `type: LoadBalancer`, exposing all four BoJ ports
(REST 7700, gRPC 7701, GraphQL 7702, SSE 7703) externally — defeating
the §3.4 decommission invariant before Phase E even starts.

This is action item #8 from the Phase E consumer-side audit posted on
hyperpolymath/standards#100 — companion to action #6 (boj-server#130
Cowboy bind tightening). The two layers together give defense in
depth: the BoJ process itself binds loopback (#130) AND the k8s
Service does not expose it externally (this PR).

Changes:

- `k8s/service.yaml` type: LoadBalancer -> ClusterIP.
- Add `hyperpolymath.dev/exposure: "internal-only"` and
  `hyperpolymath.dev/external-via: "http-capability-gateway (tier-2)"`
  annotations so the posture is discoverable from `kubectl describe`
  without grepping for ADR references.
- Add a header comment explaining the rationale, pointing to ADR-0004
  and the rollout-runbook, and showing the kustomize/helm override
  recipe for legacy/standalone deployments that need BoJ exposed
  externally (no HCG in front).
- Keep ports 7700-7703 declared forward-compatibly (current BoJ binds
  7700 only; 7701/7702/7703 reserved for gRPC/GraphQL/SSE).

DRAFT — breaking for anyone running this manifest as-is with a
LoadBalancer in production. Owner gates merge on confirming no live
LoadBalancer-fronted BoJ deployment exists that this change would
break. If one exists, the migration path is HCG-in-front
(per the rollout-runbook).

Refs hyperpolymath/standards#100
Refs hyperpolymath/standards#91

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔍 Hypatia Security Scan

Findings: 30 issues detected

Severity Count
🔴 Critical 18
🟠 High 5
🟡 Medium 7

⚠️ Action Required: Critical security issues found!

View findings
[
  {
    "reason": "Stale AI session file -- delete",
    "type": "stale",
    "file": "GEMINI.md",
    "action": "delete",
    "rule_module": "root_hygiene",
    "severity": "medium"
  },
  {
    "reason": "Issue in quality.yml",
    "type": "missing_workflow",
    "file": "quality.yml",
    "action": "create",
    "rule_module": "workflow_audit",
    "severity": "high"
  },
  {
    "reason": "Issue in security-policy.yml",
    "type": "missing_workflow",
    "file": "security-policy.yml",
    "action": "create",
    "rule_module": "workflow_audit",
    "severity": "medium"
  },
  {
    "reason": "Action hyperpolymath/standards/.github/workflows/governance-reusable.yml@main needs attention",
    "type": "unpinned_action",
    "file": "governance.yml",
    "action": "pin_sha",
    "rule_module": "workflow_audit",
    "severity": "high"
  },
  {
    "reason": "TypeScript file detected -- banned language",
    "type": "banned_language_file",
    "file": "/home/runner/work/boj-server/boj-server/cartridges/sanctify-mcp/adapter/mod.ts",
    "action": "flag",
    "rule_module": "cicd_rules",
    "severity": "critical"
  },
  {
    "reason": "TypeScript file detected -- banned language",
    "type": "banned_language_file",
    "file": "/home/runner/work/boj-server/boj-server/cartridges/academic-workflow-mcp/adapter/mod.ts",
    "action": "flag",
    "rule_module": "cicd_rules",
    "severity": "critical"
  },
  {
    "reason": "TypeScript file detected -- banned language",
    "type": "banned_language_file",
    "file": "/home/runner/work/boj-server/boj-server/cartridges/fireflag-mcp/adapter/mod.ts",
    "action": "flag",
    "rule_module": "cicd_rules",
    "severity": "critical"
  },
  {
    "reason": "TypeScript file detected -- banned language",
    "type": "banned_language_file",
    "file": "/home/runner/work/boj-server/boj-server/cartridges/ephapax-mcp/adapter/mod.ts",
    "action": "flag",
    "rule_module": "cicd_rules",
    "severity": "critical"
  },
  {
    "reason": "TypeScript file detected -- banned language",
    "type": "banned_language_file",
    "file": "/home/runner/work/boj-server/boj-server/cartridges/bofig-mcp/adapter/mod.ts",
    "action": "flag",
    "rule_module": "cicd_rules",
    "severity": "critical"
  },
  {
    "reason": "TypeScript file detected -- banned language",
    "type": "banned_language_file",
    "file": "/home/runner/work/boj-server/boj-server/cartridges/hesiod-mcp/adapter/mod.ts",
    "action": "flag",
    "rule_module": "cicd_rules",
    "severity": "critical"
  }
]

Powered by Hypatia Neurosymbolic CI/CD Intelligence

@hyperpolymath hyperpolymath marked this pull request as ready for review May 20, 2026 10:00
@hyperpolymath hyperpolymath merged commit a3ae35f into main May 20, 2026
19 checks passed
@hyperpolymath hyperpolymath deleted the fix/boj-k8s-service-clusterip branch May 20, 2026 10:00
hyperpolymath added a commit that referenced this pull request May 20, 2026
…#132)

## Summary

Tightens three sites that feed the Zig adapter binary's `--host` flag in
production deployments, materialising the ADR-0004 §1 invariant that
BoJ's back-side bind is not externally routable when fronted by
`http-capability-gateway` (HCG tier-2).

This is **action item #7** from the [Phase E consumer-side
audit](hyperpolymath/standards#100 (comment)).
The scope expanded during implementation from one site (the audit named
`stapeln.toml`) to three sites — `entrypoint.sh` and `compose.prod.yaml`
had the same `[::]` default that the audit missed. All three sites feed
into the same Zig-adapter `--host` flag, so they need to flip together
for the change to actually take effect at runtime.

Companion PR to **#130** (Cowboy bind tightening in the Elixir path) and
**#131** (k8s Service ClusterIP). Together the three give defence in
depth: Elixir Cowboy binds loopback **AND** Zig adapter binds loopback
**AND** k8s Service is internal-only.

`Refs hyperpolymath/standards#100` (NOT Closes — joint-close is
owner-only).
`Refs hyperpolymath/standards#91`.

## What's in

| File | Change |
|---|---|
| `stapeln.toml` `[targets.production]` | `APP_HOST = "[::]"` → `APP_HOST = "127.0.0.1"` + comment block. |
| `container/entrypoint.sh` line 40 (log) + line 140 (`exec` invocation) | `${APP_HOST:-[::]}` → `${APP_HOST:-127.0.0.1}` + comment at exec line. |
| `container/compose.prod.yaml` `services.boj-rest.environment` | `APP_HOST: "[::]"` → `APP_HOST: "127.0.0.1"` + comment block. |
| `CHANGELOG.md` | New `### Changed` entry under `[Unreleased]`. |

## Override path for legacy/standalone use

Deployments without HCG in front: set `APP_HOST=0.0.0.0` (IPv4
all-interfaces) or `APP_HOST=::` (IPv6 all-interfaces) in your
deployment config. The in-repo defaults remain loopback.
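
For a Compose-based deployment, the override could live in a separate file rather than in `compose.prod.yaml` itself. The sketch below is a minimal illustration: the file name is hypothetical, and the service name `boj-rest` is taken from the table above.

```yaml
# Hypothetical file: container/compose.standalone.yaml
# Layered on top of compose.prod.yaml for deployments with no HCG in front.
services:
  boj-rest:
    environment:
      APP_HOST: "0.0.0.0"   # IPv4 all-interfaces; the in-repo default stays 127.0.0.1
```

Passing both files to a Compose-compatible tool (for example `-f compose.prod.yaml -f compose.standalone.yaml`) merges the `environment` entries by key, so only `APP_HOST` changes.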

## Audit-residue follow-ups deliberately NOT in this PR

- `container/Containerfile` line 125: `ENV PHX_HOST=0.0.0.0` is
**vestigial**. Nothing in the codebase reads `PHX_HOST` (verified by
`grep -rn "PHX_HOST\|phx_host" --include="*.ex" ...` returning empty).
Leftover from a former Phoenix incarnation. Safe to leave alone; can be
removed in a hygiene PR if desired.
- Unifying `APP_HOST` (Zig adapter) and `BOJ_BIND_IP` (Elixir Cowboy
from #130) into one envelope is broader scope. The divergence exists
because they feed different binaries built by different toolchains. If
it proves annoying in operation, file a separate issue.

## Why DRAFT

Same reason as #131 — this is a behaviour change for anyone running the
stapeln-built production container or `compose.prod.yaml` as-is and
relying on the default `[::]` for external access. Owner gates merge on
confirming no such reliance, or on coordinating with anyone who needs
the migration path (HCG-in-front, or explicit `APP_HOST=0.0.0.0`
override).

## Test plan

- [x] `stapeln.toml` parses as valid TOML (syntax preserved).
- [x] `container/entrypoint.sh` runs through `sh -n` without syntax
error (no syntax change, just literal substitution).
- [x] `container/compose.prod.yaml` parses as valid YAML.
- [ ] CI green — governance / hypatia / a2ml / k9 / dogfooding all pass.
- [ ] Owner: confirm no live deployment depends on `[::]` default; flip
from DRAFT to ready.
- [ ] Post-merge manual: a stapeln-built production container, with no
env override, refuses connections from non-loopback peers.

## Risk

**Low for the codebase, medium for ops.** No Elixir / Zig / Idris2 /
cartridge logic touched; CI should not show any regressions. Ops risk:
anyone whose runbook assumes the container exposes BoJ on all interfaces
by default will need to set `APP_HOST=0.0.0.0` explicitly. Reasonable
default for the Phase E posture; documented override path.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hyperpolymath added a commit that referenced this pull request May 20, 2026
…ds#100/#91) (#138)

## Summary

Adds a second 2026-05-20 entry to `.machine_readable/6a2/STATE.a2ml`
`[session-history]` documenting the afternoon HCG Phase E first-session
output. The morning Tier C entry (already in main) stays in place; this
new entry sits **above** it per the newest-first convention.

`Refs hyperpolymath/standards#100` (Phase E), `Refs
hyperpolymath/standards#91` (HCG tier-2 channel parent). **NOT Closes**.

## What's in

A single new TOML entry in `[session-history] entries = [ ... ]`
summarising the afternoon's deliverables:

- PR `#128` (MERGED) — `docs/integration/hcg-tier2-rollout-runbook.md`
(E5 rollout-and-rollback runbook, 308 lines, `!OWNER:` markers in §1.3 +
§4)
- PR `#130` (MERGED) — Cowboy bind `127.0.0.1` default + `BOJ_BIND_IP`
env override (audit #6)
- PR `#131` (MERGED) — k8s Service `LoadBalancer → ClusterIP` (audit #8)
- PR `#132` (MERGED) — container `APP_HOST` defaults across
`stapeln.toml` + `entrypoint.sh` + `compose.prod.yaml` (audit #7)
- Issue `#135` (filed) — k8s NetworkPolicy follow-up (Low priority,
Phase E acceptance non-critical)
- Defence in depth: 3 independent loopback layers (Elixir Cowboy + Zig
adapter + k8s Service)
- Phase C §3 invariant 3 correction: confirmed via `git log` that the
deny clause landed in `boj-server#106 (40e46f6)`; the channel-status
comment claiming it was owner-gated was stale.

The entry also records the **Phase E gating posture**: E1/E2/E3/E4
wiring + Trustfile `PENDING → DEPLOYED` flip are all explicitly gated on
Phase D-3 (regression alert armed) + D-4 (real baseline numbers
populated), per the runbook §1.1. The afternoon session shipped only the
Phase-D-independent artefacts.

## Why a separate PR (not amended into another)

All four code PRs (#128/#130/#131/#132) are already merged. The
STATE.a2ml entry parallels the morning Tier C entry (already in main
from the morning session), and the convention is per-session per-entry.
Keeping this as its own doc PR is the cleanest record.

## Verification

- TOML syntax: valid (single new `{ date = "...", description = "..." }`
entry prepended).
- Linting: `validate-a2ml` action will run on PR.

## Risk

**Negligible.** Doc-only; no code or workflow changes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>