OTel Phase 6 — Build-time manifest + egress wiring #107

@initializ-mk

Part of the Observability — OpenTelemetry Tracing v1 initiative (master tracking: #108). Effort: S (2–3 engineer-days). Risk: medium (deployment-time semantics, two enforcement layers, dynamic egress). Depends on: Phase 2 (#103).

Goal

A built/deployed agent can reach its collector without baking an environment-specific URL into the image. The endpoint is supplied at deploy time via env; the allowlist resolves to whatever was injected, with zero drift.

Files

File | Change
--- | ---
forge-cli/build egress generation | When tracing.enabled, add a dynamic entry $OTEL_EXPORTER_OTLP_ENDPOINT to egress_allowlist.json, source otel. This reuses the existing $VAR dynamic-egress mechanism (same as a skill's $K8S_API_DOMAIN)
forge-core/security egress resolver | When expanding the otel dynamic entry, host-extract from the value (the env holds a full URL, the matcher needs a host)
forge-cli/build manifest generation | When tracing.enabled, inject env references (ConfigMap/Secret, optional: true) into the Deployment; emit a ConfigMap stub

Two enforcement layers — keep them distinct

1. Forge in-process EgressEnforcer

The layer that would otherwise silently drop the exporter. Solved by the dynamic $VAR entry: the build emits a placeholder, the runtime resolver expands it from the same env var the exporter uses. Destination and allowlist derive from one variable, so they cannot drift.

2. K8s NetworkPolicy (static, network-level)

A NetworkPolicy is static and cannot expand env vars at deploy time. For an external collector with a deploy-injected host, the NetworkPolicy egress rule is therefore the responsibility of the deployer or Platform, the same actor that injects the env. $K8S_API_DOMAIN already carries the same limitation, so this is not a new gap.

Recommended sidestep: run the collector as a sidecar (http://localhost:4318 — no NetworkPolicy egress rule at all) or an in-cluster service (a same-cluster egress rule, not internet). Then the agent's external allowlist / NetworkPolicy is untouched and the collector owns forwarding to the real backend.

Egress entry

{ "domain": "$OTEL_EXPORTER_OTLP_ENDPOINT", "source": "otel" }
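
As a rough illustration of the build-side step, the sketch below adds or removes the dynamic entry depending on tracing.enabled. It is written in Python for brevity and is not Forge's actual code: the function name `with_otel_entry` is hypothetical, and the allowlist is assumed to be a flat JSON array of entries (which is what the jq check under Verify suggests).

```python
import json

# The placeholder entry from this issue; the host is never resolved at build time.
OTEL_ENTRY = {"domain": "$OTEL_EXPORTER_OTLP_ENDPOINT", "source": "otel"}


def with_otel_entry(allowlist, tracing_enabled):
    """Return a new allowlist with exactly zero or one otel entry.

    Assumption: the allowlist is a list of dicts with a "source" key.
    Any stale otel entry is dropped first so repeated builds stay idempotent.
    """
    entries = [e for e in allowlist if e.get("source") != "otel"]
    if tracing_enabled:
        entries.append(dict(OTEL_ENTRY))
    return entries


if __name__ == "__main__":
    # Hypothetical pre-existing skill entry, shown only for context.
    existing = [{"domain": "$K8S_API_DOMAIN", "source": "skill"}]
    print(json.dumps(with_otel_entry(existing, tracing_enabled=True), indent=2))
```

Filtering before appending keeps a rebuild with tracing turned off from leaving a stale otel entry behind.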

Resolver expands at startup → host-extract:

https://otel.initializ.ai:4318/v1/traces → otel.initializ.ai

Skip the entry when the expanded host is localhost (sidecar) or the var is empty. Do not introduce a second _HOST var — host-parse the one endpoint var.
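
The expansion and skip rules above could be sketched as follows. This is an illustrative Python version, not the forge-core resolver: `resolve_otel_host` is a hypothetical name, and treating 127.0.0.1 and ::1 like localhost, as well as tolerating a value without a scheme, are assumptions beyond what this issue states.

```python
from urllib.parse import urlsplit

# Assumption: loopback literals are skipped the same way as "localhost".
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def resolve_otel_host(entry, env):
    """Expand a dynamic otel entry to a bare host, or None to skip it.

    entry: {"domain": "$VAR", "source": "otel"}
    env:   mapping of environment variables (e.g. os.environ)
    """
    var_name = entry["domain"].lstrip("$")
    value = env.get(var_name, "").strip()
    if not value:
        return None  # var unset or empty: nothing to allowlist

    # urlsplit only populates hostname when a netloc is present, so a
    # scheme-less "host:port" value is prefixed with "//" (assumption).
    parts = urlsplit(value if "://" in value else "//" + value)
    host = parts.hostname  # lowercased, port and path stripped
    if host is None or host in LOOPBACK_HOSTS:
        return None  # sidecar collector: no external allowlist entry
    return host


if __name__ == "__main__":
    entry = {"domain": "$OTEL_EXPORTER_OTLP_ENDPOINT", "source": "otel"}
    env = {"OTEL_EXPORTER_OTLP_ENDPOINT": "https://otel.initializ.ai:4318/v1/traces"}
    print(resolve_otel_host(entry, env))
```

Because the function reads the same variable the exporter uses, there is no second value that could fall out of sync with the destination.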

Deployment env injection

When tracing.enabled, emit into the Deployment container (references, not literals — so deploy-time override works and missing config degrades to no-op):

env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    valueFrom:
      configMapKeyRef: { name: forge-otel, key: endpoint, optional: true }
  - name: OTEL_EXPORTER_OTLP_HEADERS
    valueFrom:
      secretKeyRef: { name: forge-otel-auth, key: headers, optional: true }

optional: true is load-bearing: absent ConfigMap → env unset → gate fails (per Phase 2) → no-op, pod still healthy. OTEL_SERVICE_NAME defaults to agent_id. Also emit forge-otel.configmap.example.yaml (a stub) so the operator sees exactly what to populate.
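
A minimal sketch of the injection step, assuming the manifest generator works on a dict-shaped container spec. The helper names `inject_otel_env` and `literal_otel_vars` are hypothetical; only the env names, ConfigMap/Secret names, keys, and optional: true come from this issue.

```python
import copy

# References only, never literal values, so deploy-time override keeps working.
OTEL_ENV_REFS = [
    {
        "name": "OTEL_EXPORTER_OTLP_ENDPOINT",
        "valueFrom": {
            "configMapKeyRef": {"name": "forge-otel", "key": "endpoint", "optional": True}
        },
    },
    {
        "name": "OTEL_EXPORTER_OTLP_HEADERS",
        "valueFrom": {
            "secretKeyRef": {"name": "forge-otel-auth", "key": "headers", "optional": True}
        },
    },
]
OTEL_NAMES = {ref["name"] for ref in OTEL_ENV_REFS}


def inject_otel_env(container, tracing_enabled):
    """Return a copy of the container spec with the otel env refs set or removed."""
    out = copy.deepcopy(container)
    env = [e for e in out.get("env", []) if e["name"] not in OTEL_NAMES]
    if tracing_enabled:
        env.extend(copy.deepcopy(OTEL_ENV_REFS))
    out["env"] = env
    return out


def literal_otel_vars(container):
    """Names of the two otel vars that were set with a literal `value:` (an anti-pattern here)."""
    return [
        e["name"]
        for e in container.get("env", [])
        if e["name"] in OTEL_NAMES and "value" in e
    ]


if __name__ == "__main__":
    spec = inject_otel_env({"name": "agent", "env": []}, tracing_enabled=True)
    print([e["name"] for e in spec["env"]], literal_otel_vars(spec))
```

A check like `literal_otel_vars` could run in a build test to catch a literal endpoint slipping into the Deployment.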

The Initializ Platform populates this ConfigMap automatically at deploy; self-managed operators fill it via kustomize / helm / GitOps.

Build mode interactions

  • --slim: skips manifests + allowlist entirely → no otel env, no otel egress entry. Ops wires everything.
  • --prod: rejects the dev-open egress profile. Because the otel entry is dynamic, the in-process enforcer still resolves it correctly at runtime — but for an external collector the NetworkPolicy must permit it (deploy-owned) or the exporter is silently blocked in prod. State this loudly; recommend the in-cluster collector for prod.
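
The flag interactions above can be condensed into a small decision table. The sketch below is one possible reading, not the forge-cli implementation: `OtelBuildPlan` and `plan_otel_outputs` are hypothetical names, and surfacing the --prod caveat as a build-time warning flag is an assumption about how "state this loudly" might be realized.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OtelBuildPlan:
    egress_entry: bool         # dynamic $OTEL_EXPORTER_OTLP_ENDPOINT allowlist entry
    env_refs: bool             # ConfigMap/Secret env references in the Deployment
    configmap_stub: bool       # forge-otel.configmap.example.yaml
    warn_network_policy: bool  # remind that the NetworkPolicy is deploy-owned


def plan_otel_outputs(tracing_enabled, slim=False, prod=False):
    """Decide which otel artifacts a build emits.

    --slim skips manifests and the allowlist, so nothing otel-related is emitted.
    --prod emits everything but flags the NetworkPolicy caveat, since the build
    cannot know whether the deploy-time endpoint is external.
    """
    emit = tracing_enabled and not slim
    return OtelBuildPlan(
        egress_entry=emit,
        env_refs=emit,
        configmap_stub=emit,
        warn_network_policy=emit and prod,
    )


if __name__ == "__main__":
    print(plan_otel_outputs(tracing_enabled=True, prod=True))
```

The warning is unconditional under --prod with tracing on because the endpoint is only known at deploy time.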

Verify

# Build an agent with tracing.enabled (no endpoint committed in forge.yaml):
forge build
jq '.[] | select(.source=="otel")' .forge-output/egress_allowlist.json
# dynamic $OTEL_EXPORTER_OTLP_ENDPOINT entry present

grep -n 'configMapKeyRef\|forge-otel' .forge-output/k8s/*deployment*.yaml
# env reference + optional:true present

ls .forge-output/k8s/ | grep -i 'forge-otel.*example'
# ConfigMap stub emitted

# Runtime host-extraction:
OTEL_EXPORTER_OTLP_ENDPOINT=https://otel.example.com:4318/v1/traces \
  forge run --tracing --port 8095 &
# confirm enforcer allowlists host "otel.example.com" (not the full URL)
# check egress_allowed audit/log line

# Build with tracing disabled: NO otel egress entry, NO otel env in Deployment.

Anti-patterns to avoid

  • Resolving the endpoint host at build time (it's deploy-time — emit the $VAR placeholder, don't bake a host).
  • Adding a --otel-endpoint flag (redundant — the dynamic entry supersedes it).
  • A second _HOST env var (host-parse the one endpoint var).
  • A literal env value: in place of a valueFrom reference (kills deploy-time override).
  • Allowlisting localhost.
  • Expecting the static NetworkPolicy to resolve $VAR for external collectors.
