Part of the Observability — OpenTelemetry Tracing v1 initiative (master tracking: #108). Effort: S (2–3 engineer-days). Risk: medium (deployment-time semantics, two enforcement layers, dynamic egress). Depends on: Phase 2 (#103).
Goal
A built/deployed agent can reach its collector without baking an environment-specific URL into the image. The endpoint is supplied at deploy time via env; the allowlist resolves to whatever was injected, with zero drift.
Files
| File | Change |
|---|---|
| forge-cli/build egress generation | When tracing.enabled, add a dynamic entry $OTEL_EXPORTER_OTLP_ENDPOINT to egress_allowlist.json with source otel, reusing the existing $VAR dynamic-egress mechanism (same as a skill's $K8S_API_DOMAIN) |
| forge-core/security egress resolver | When expanding the otel dynamic entry, host-extract from the value (the env holds a full URL, the matcher needs a host) |
| forge-cli/build manifest generation | When tracing.enabled, inject env references (ConfigMap/Secret, optional: true) into the Deployment; emit a ConfigMap stub |
Two enforcement layers — keep them distinct
1. Forge in-process EgressEnforcer
The layer that would otherwise silently drop the exporter. Solved by the dynamic $VAR entry: the build emits a placeholder, the runtime resolver expands it from the same env var the exporter uses. Destination and allowlist derive from one variable, so they cannot drift.
2. K8s NetworkPolicy (static, network-level)
It cannot expand env vars at deploy. So for an external collector with a deploy-injected host, the NetworkPolicy egress rule is the deployer's / Platform's responsibility — the same actor that injects the env. This is the same limitation $K8S_API_DOMAIN already carries; not a new gap.
Recommended sidestep: run the collector as a sidecar (http://localhost:4318 — no NetworkPolicy egress rule at all) or an in-cluster service (a same-cluster egress rule, not internet). Then the agent's external allowlist / NetworkPolicy is untouched and the collector owns forwarding to the real backend.
Egress entry
{ "domain": "$OTEL_EXPORTER_OTLP_ENDPOINT", "source": "otel" }
Resolver expands at startup → host-extract:
https://otel.initializ.ai:4318/v1/traces → otel.initializ.ai
Skip the entry when the expanded host is localhost (sidecar) or the var is empty. Do not introduce a second _HOST var — host-parse the one endpoint var.
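The host-extraction and skip rules above can be sketched as follows. This is an illustration only, not the forge-core/security resolver: the function name `resolve_otel_host` is hypothetical, and treating 127.0.0.1 and ::1 the same as localhost, as well as accepting a bare host:port value, are assumptions that go beyond what this issue specifies.

```python
from typing import Mapping, Optional
from urllib.parse import urlsplit

# Hypothetical names for illustration; not Forge's actual resolver API.
OTEL_ENDPOINT_VAR = "OTEL_EXPORTER_OTLP_ENDPOINT"
# Assumption: loopback literals are treated like "localhost" (sidecar case).
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def resolve_otel_host(env: Mapping[str, str]) -> Optional[str]:
    """Return the host to allowlist, or None when the entry should be skipped."""
    raw = env.get(OTEL_ENDPOINT_VAR, "").strip()
    if not raw:
        return None  # var unset or empty: skip the entry
    # urlsplit only populates hostname when a "//" authority marker is present,
    # so a bare "host:port" value (an assumed input shape) gets one prepended.
    if "://" not in raw:
        raw = "//" + raw
    host = urlsplit(raw).hostname  # lowercased; port and path are dropped
    if not host or host in LOOPBACK_HOSTS:
        return None  # sidecar collector: nothing to allowlist
    return host
```

Passing `os.environ` as `env` keeps the allowlist and the exporter reading the same variable, which is the property this issue relies on to rule out drift.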
Deployment env injection
When tracing.enabled, emit into the Deployment container (references, not literals — so deploy-time override works and missing config degrades to no-op):
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    valueFrom:
      configMapKeyRef: { name: forge-otel, key: endpoint, optional: true }
  - name: OTEL_EXPORTER_OTLP_HEADERS
    valueFrom:
      secretKeyRef: { name: forge-otel-auth, key: headers, optional: true }
optional: true is load-bearing: absent ConfigMap → env unset → gate fails (per Phase 2) → no-op, pod still healthy. OTEL_SERVICE_NAME defaults to agent_id. Also emit forge-otel.configmap.example.yaml (a stub) so the operator sees exactly what to populate.
The Initializ Platform populates this ConfigMap automatically at deploy; self-managed operators fill it via kustomize / helm / GitOps.
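A minimal sketch of how manifest generation might build these references, assuming the Deployment is assembled as plain dictionaries before serialization. The function names `otel_env_refs` and `otel_configmap_stub` are hypothetical, and the stub's contents (an empty endpoint key) are an assumption; only the resource names, keys, and optional: true come from this issue.

```python
from typing import Any, Dict, List


def otel_env_refs(tracing_enabled: bool) -> List[Dict[str, Any]]:
    """Env entries for the Deployment container: references, never literals."""
    if not tracing_enabled:
        return []  # tracing disabled: no otel env in the Deployment
    return [
        {
            "name": "OTEL_EXPORTER_OTLP_ENDPOINT",
            "valueFrom": {
                "configMapKeyRef": {
                    "name": "forge-otel",
                    "key": "endpoint",
                    "optional": True,  # absent ConfigMap leaves the env unset
                }
            },
        },
        {
            "name": "OTEL_EXPORTER_OTLP_HEADERS",
            "valueFrom": {
                "secretKeyRef": {
                    "name": "forge-otel-auth",
                    "key": "headers",
                    "optional": True,
                }
            },
        },
    ]


def otel_configmap_stub() -> Dict[str, Any]:
    """Assumed shape of forge-otel.configmap.example.yaml, before serialization."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": "forge-otel"},
        "data": {"endpoint": ""},  # operator fills in the collector URL
    }
```

Keeping every entry on `valueFrom` means a generator like this never has a code path that writes a literal `value:`, which is the anti-pattern that would defeat deploy-time override.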
Build mode interactions
--slim: skips manifests + allowlist entirely → no otel env, no otel egress entry. Ops wires everything.
--prod: rejects the dev-open egress profile. Because the otel entry is dynamic, the in-process enforcer still resolves it correctly at runtime — but for an external collector the NetworkPolicy must permit it (deploy-owned) or the exporter is silently blocked in prod. State this loudly; recommend the in-cluster collector for prod.
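The build-mode rules above reduce to a small decision, sketched here under assumptions: the function names and the warning text are hypothetical, and the real forge-cli/build code may structure this differently. The sketch shows that --slim suppresses the entry, while --prod changes nothing about what is emitted and only adds a loud reminder.

```python
from typing import Dict, List

# The dynamic entry as given in this issue; the placeholder is never expanded at build time.
OTEL_EGRESS_ENTRY: Dict[str, str] = {
    "domain": "$OTEL_EXPORTER_OTLP_ENDPOINT",
    "source": "otel",
}


def otel_egress_entries(tracing_enabled: bool, slim: bool = False) -> List[Dict[str, str]]:
    """Entries to append to egress_allowlist.json for the otel source."""
    if slim or not tracing_enabled:
        return []  # --slim skips the allowlist; disabled tracing emits nothing
    return [dict(OTEL_EGRESS_ENTRY)]


def prod_warnings(tracing_enabled: bool, slim: bool = False, prod: bool = False) -> List[str]:
    """Hypothetical build-time warnings; the message wording is an assumption."""
    if prod and tracing_enabled and not slim:
        return [
            "tracing is enabled under --prod: for an external collector the "
            "NetworkPolicy egress rule is deploy-owned; without it the exporter "
            "is silently blocked. Prefer an in-cluster collector."
        ]
    return []
```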
Verify
# Build an agent with tracing.enabled (no endpoint committed in forge.yaml):
forge build
jq '.[] | select(.source=="otel")' .forge-output/egress_allowlist.json
# dynamic $OTEL_EXPORTER_OTLP_ENDPOINT entry present
grep -n 'configMapKeyRef\|forge-otel' .forge-output/k8s/*deployment*.yaml
# env reference + optional:true present
ls .forge-output/k8s/ | grep -i 'forge-otel.*example'
# ConfigMap stub emitted
# Runtime host-extraction:
OTEL_EXPORTER_OTLP_ENDPOINT=https://otel.example.com:4318/v1/traces \
forge run --tracing --port 8095 &
# confirm enforcer allowlists host "otel.example.com" (not the full URL)
# check egress_allowed audit/log line
# Build with tracing disabled: NO otel egress entry, NO otel env in Deployment.
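The allowlist checks above could also be scripted, for example in CI. The sketch below assumes egress_allowlist.json is a top-level array of objects, which the jq filter suggests but this issue does not state; the function names are hypothetical.

```python
from typing import Any, Dict, List


def otel_entries(allowlist: List[Any]) -> List[Dict[str, Any]]:
    """Same selection as the jq filter: entries whose source is "otel"."""
    return [e for e in allowlist if isinstance(e, dict) and e.get("source") == "otel"]


def check_allowlist(allowlist: List[Any], tracing_enabled: bool) -> None:
    """Raise AssertionError when the otel entries do not match expectations."""
    domains = [e.get("domain") for e in otel_entries(allowlist)]
    if tracing_enabled:
        # Exactly one dynamic placeholder, never a baked-in host.
        assert domains == ["$OTEL_EXPORTER_OTLP_ENDPOINT"], domains
    else:
        assert domains == [], domains
```

A caller would load the file with `json.load` and pass the result in, for instance `check_allowlist(json.load(open(".forge-output/egress_allowlist.json")), tracing_enabled=True)`.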
Anti-patterns to avoid
- Resolving the endpoint host at build time (it's deploy-time: emit the $VAR placeholder, don't bake a host).
- Adding a --otel-endpoint flag (redundant; the dynamic entry supersedes it).
- A second _HOST env var (host-parse the one endpoint var).
- Literal env value: instead of valueFrom reference (kills deploy-time override).
- Allowlisting localhost.
- Expecting the static NetworkPolicy to resolve $VAR for external collectors.