Skip to content

feat(observability): OTel-native operator metrics (Phase 1)#134

Draft
ian-flores wants to merge 33 commits intomainfrom
observability-metrics
Draft

feat(observability): OTel-native operator metrics (Phase 1)#134
ian-flores wants to merge 33 commits intomainfrom
observability-metrics

Conversation

@ian-flores
Copy link
Copy Markdown
Collaborator

@ian-flores ian-flores commented May 5, 2026

Summary

Phase 1 of the OTel-native observability rollout for the team-operator. Adds four team_operator_* metrics on /metrics (Prometheus exporter), optional OTLP gRPC push, and Helm/Kustomize wiring. All eight reconcilers receive a metric.Meter via constructor injection.

Closes #7. Phase 2 (events): #132. Phase 3 (traces): #133. Full design lives in the comment on #7.

What's emitted

Metric Type Labels
team_operator_resource_count Async gauge controller, namespace, phase
team_operator_status_transition_total Counter controller, namespace, from_phase, to_phase
team_operator_dependency_check_total Counter controller, namespace, dependency, result
team_operator_reconcile_requeue_total Counter controller, namespace, reason

No per-CR-name labels (cardinality bound by design). controller_runtime_* built-ins continue to serve unchanged on the same /metrics endpoint.

Configuration

Five new flags with OTel env-var fallbacks. OTEL_SDK_DISABLED=true kills the SDK entirely. SDK init failure falls back to a noop provider — operator always boots. OTLP push is disabled by default; enable via --observability-metrics-otlp-endpoint or OTEL_EXPORTER_OTLP_ENDPOINT. Full reference in docs/observability.md.

Upgrade notes

  • The Helm chart's manager env: block is now rendered unconditionally to inject POD_NAME (used for the service.instance.id resource attribute). Previously the block was guarded by {{- if .Values.controllerManager.container.env }}. Existing deployments will see a one-time diff on upgrade as POD_NAME is added to the container's environment.
  • The flag previously named --observability-metrics-resource-count-interval was renamed to --observability-metrics-export-interval in response to review feedback (it controls the OTLP PeriodicReader cadence overall, not just the gauge refresh). The Helm value observability.metrics.resourceCountInterval is correspondingly renamed to observability.metrics.metricsExportInterval.

Test plan

Verified locally and via CI:

  • make go-test — full suite passes (CI: Envtest job, also re-run locally after each task)
  • just helm-lint — chart lints clean
  • kubectl kustomize config/default and kubectl kustomize config/observability parse cleanly
  • OTEL_SDK_DISABLED=true ./bin/team-operator --help shows new flags and boots cleanly

Pending (require deployment to a real cluster with Posit Team CRs):

  • Deploy adhoc image to `ganso01-staging`; verify `/metrics` serves both `controller_runtime_` and `team_operator_` families
  • Confirm `team_operator_resource_count` populates from real CRs in the test cluster
  • Confirm `team_operator_status_transition_total` shows non-trivial `from_phase` values (not just `reconciling`) once a reconcile cycle completes against an existing CR

Notes for reviewers

  • Roborev "Address review findings" commits are interleaved with feature commits; the repo's squash-merge will absorb them.
  • The `Justfile` build recipe was updated to compile the package path (`./cmd/team-operator/`) so the new `resource_lister.go` is included alongside `main.go`. The `Dockerfile` got the same fix.
  • VCS stamping inside a worktree can produce a `just build` failure (pre-existing); `go build -buildvcs=false ./cmd/team-operator/` works around it.
  • The two review-fix commits at the head of the branch (`34dfcc7`, `d7b79b9`) address feedback from Claude's PR review — see PR comment thread for full details.

ian-flores added 20 commits May 5, 2026 06:35
Changes:
- Reverted premature direct-dependency promotion of four OTel metric packages (`otlpmetricgrpc`, `exporters/prometheus`, `metric`, `sdk/metric`) — no Go source imports them, so `go mod tidy` would demote them anyway
- `metric` returns to `// indirect`; `otlpmetricgrpc`, `prometheus`, and `sdk/metric` removed entirely (will be re-added when consuming code lands)
- OTel ecosystem version bumps (1.39 → 1.43) retained — they remain forced by the dependency graph
- `go mod tidy` now runs cleanly and idempotently; `go build ./...` passes
- Test failures observed are pre-existing environmental issues (missing `/usr/local/kubebuilder/bin/etcd` envtest binary), not caused by this change
All green.
Changes:
- Added per-line comments next to each `Phase*` constant in `internal/observability/names.go` showing the source `status.Reason*` constant (or marking sentinel/generic phases)
- Added `internal/observability/names_test.go` with three table tests: metric names share the `team_operator_` prefix, each label-value enum group has no duplicates, and `Phase*` strings derived from status reasons stay in sync via a lowercase-underscore conversion check (so a rename in `internal/status` breaks the build here)
Build passes; observability tests pass. The pre-existing `TestSiteReconcileWithExperimental` SIGSEGV is unrelated to these changes.
Changes:
- Added `var _ = status.ReasonReconcileError` to force a build break if that constant is renamed/removed (PhaseError is documented as covering it but isn't transform-asserted).
- Documented in `TestPhaseMatchesStatusReason` that the test asserts both the phase mapping and that Reason values stay CamelCase.
- Updated `camelToSnake` to skip the underscore on consecutive capitals (acronym guard) and added a comment noting the limitation and the requirement to update if an acronym Reason is added.
…ister

Registers a team_operator_resource_count async OTel gauge that lists all
seven CR-backed kinds (Site, Connect, Workbench, PackageManager, Chronicle,
Flightdeck, PostgresDatabase) per collection cycle and emits one observation
per (controller, namespace, phase) tuple. Also fixes build recipes in
Justfile and Makefile to use the package path (./cmd/team-operator/) instead
of a single file so multi-file packages compile correctly.
@claude
Copy link
Copy Markdown

claude Bot commented May 5, 2026

Claude finished @ian-flores's task —— View job


Review: feat(observability): OTel-native operator metrics (Phase 1)

  • Read review guidelines
  • Analyze PR diff and key files
  • Provide review feedback

Good overall architecture — the SDK fallback-to-noop pattern, cardinality-bounded label design, and constructor injection of metric.Meter are all solid choices. A few items worth addressing before merge:


Critical / Important

1. NewProvider error contract is dead code (internal/observability/provider.go:53-68)

NewProvider never returns a non-nil error — on failure it logs to stderr and returns (noopProvider, nil). The caller in main.go:158-160 checks err != nil but this branch can never fire, making it dead code. Either:

  • Remove the error return (make the contract explicit), or
  • Return the actual error and let the caller log it

Related: if NewProvider somehow returned nil (which it doesn't today), obsProvider.Meter(...) calls in main.go would panic since there's no nil-guard after the error check.

2. ResourceCountInterval conflated with OTLP export interval (internal/observability/provider.go:136-142)

The config's ResourceCountInterval is used as the PeriodicReader interval for the OTLP exporter. This means all metrics (counters included) export at this cadence, not just the resource gauge refresh. If a user sets 5m, status transition counters will also only export every 5 minutes. Consider a separate/fixed export interval for the OTLP reader.

3. from_phase is always "reconciling" (internal/controller/core/site_controller.go:108, and all other controllers)

Every RecordStatusTransition call hardcodes PhaseReconciling as from_phase. This means the metric can't actually detect flapping (error→ready→error) as described in docs/observability.md. The from_phase label adds cardinality without information. Either track the actual previous phase from the CR's conditions, or drop the label for Phase 1 and document the limitation.

4. Security: Hardcoded WithInsecure() for OTLP (internal/observability/provider.go:131)

OTLP push sends metrics in plaintext. For in-cluster collectors this is often acceptable, but a startup warning log when OTLP is configured without TLS would help operators notice. A --observability-metrics-otlp-insecure flag (defaulting true) would make the choice explicit.


Suggestions (non-blocking)

5. Global instrument cache uses interface keys (internal/observability/metrics.go:20-28)

The map[metric.Meter]metric.Int64Counter cache relies on interface identity for map keys. This works because the SDK returns pointer-receivers, but it's fragile. Creating instruments once in Provider and passing them to controllers directly would eliminate the global mutex+map and be more explicit.

6. multiKindLister.List silently swallows all errors (cmd/team-operator/resource_lister.go:23-34)

If the informer cache isn't synced yet, all List calls error → gauge reports 0 for every kind → potential false alerting. Consider at minimum debug-level logging, or aggregating errors so the callback can signal failure.

7. Repetitive lister methods (cmd/team-operator/resource_lister.go:80-153)

listConnects, listWorkbenches, listPackageManagers, listChronicles, listFlightdecks are identical modulo type. A generic helper would reduce maintenance burden as new CRs are added.

8. otel.SetMeterProvider(mp) is redundant (internal/observability/provider.go:73)

Since meters are passed explicitly via DI, setting the global is unnecessary and could conflict with library code. Consider removing.

9. Helm env: block is now unconditional (dist/chart/templates/manager/manager.yaml:54)

Previously the env: block was conditional on .Values.controllerManager.container.env. Now it always renders (for POD_NAME). This will produce a diff on upgrades for existing deployments. Worth a release-notes mention.


What looks good

  • Graceful degradation to noop — operator always boots
  • OTEL_SDK_DISABLED kill switch
  • Cardinality analysis in docs (3500 series worst case)
  • Thorough unit tests with ManualReader
  • Clean separation of concerns in internal/observability/
  • OTel env-var fallback chain matches spec

ian-flores added 9 commits May 5, 2026 10:51
Changes:
- Add `PrometheusRegisterer` field to `Config` so callers can pass a non-global registerer; tests now use `prometheus.NewRegistry()` so `go test -count=2` no longer trips duplicate-collector registration
- Gate `otlpmetricgrpc.WithInsecure()` behind a new `OTLPInsecure` config field (default off, so plaintext is opt-in instead of forced)
- Stop swallowing the SDK shutdown error; `Provider.Shutdown` now returns it and lets callers decide
- Replace `fmt.Fprintf(os.Stderr, ...)` calls with a `controller-runtime` logger (`ctrl.Log.WithName("observability")`)
- Remove the dead `assert` import and `var _ = assert.New` workaround in `provider_test.go`
- Drop the misleading "fail at export time" comment in `TestNewProvider_OTLPEndpointSet`; it's now framed as a smoke test that tolerates a shutdown error
- Add `TestNewProvider_PrometheusGather` to lock in the contract that recorded counters appear in the Prometheus gather output
- Document service-name precedence over `OTEL_SERVICE_NAME` in the `Config` doc and at the `resource.New` call site
The test failures are pre-existing environment issues (missing `etcd` binary at `/usr/local/kubebuilder/bin/etcd` for controller integration tests), not related to my changes. The build succeeded and `internal/observability` tests pass.
Changes:
- Simplified `camelToSnake` to track `prevUpper` directly, dropping the indirect `prev = rune(s[i])` byte re-read
- Expanded acronym-guard comment to name the actual failure mode (`"HTTPReady" -> "httpready"`, not `"http_ready"`)
- Moved `_ = status.ReasonReconcileError` rename-canary inside `TestPhaseMatchesStatusReason` so it's local to the test that documents the relationship
All observability tests pass, including the three I tightened. The pre-existing envtest failures elsewhere are unrelated to this change.
Changes:
- Tightened `TestRecordStatusTransition` to assert each data point's full attribute set via `attrsToMap` and a `default: t.Fatalf` arm, so wrong/missing labels fail loudly.
- Tightened `TestRecordDependencyCheck` to assert exact attribute-set per data point (postgres+success, secret+error).
- Tightened `TestRecordReconcileRequeue` to assert the requeue reason and other labels on the single data point.
- Promoted `assert.Len` to `require.Len` in `TestRecordStatusTransition` so a wrong-length slice halts before per-point assertions.
- Reused a package-level `noopMeter` instead of allocating a fresh `noop.NewMeterProvider()` on each fallback.
The failures are pre-existing environment issues (missing `/usr/local/kubebuilder/bin/etcd` envtest binary), unrelated to my changes. The build succeeds and the observability tests pass cleanly.
Changes:
- Switched `attrsToMap` helper from `Value.AsString()` to `Value.Emit()` so the helper produces correct stringified output for non-string label types if dimensions are added later
- Added `mm.Name` to the `Fatalf` messages in `TestRecordStatusTransition` and `TestRecordDependencyCheck` so future contributors who add metric families to the same scope can resolve the offending metric without re-tracing the nested loops
All test failures are pre-existing envtest issues (missing `/usr/local/kubebuilder/bin/etcd`) — none are caused by my changes. The observability and cmd/team-operator tests pass.
Changes:
- Move `defer obsProvider.Shutdown(...)` immediately after `NewProvider` so cleanup is registered alongside resource acquisition (was 140 lines downstream of construction).
- Pass `context.Background()` (instead of the signal-handler ctx) to `NewProvider` to decouple OTel SDK init from the signal-handler lifecycle.
- Restore `ctx := ctrl.SetupSignalHandler()` to its original position right before `manageCRDs` to keep the signal context narrow to manager lifetime.
- Log a one-shot info message when `POD_NAME` is empty so operators notice the missing downward-API wiring that drops `service.instance.id`.
- Demote `providerLog.Error` to `providerLog.Info` in `provider.go` for the SDK init failure path since the operator continues running with a noop provider.
ian-flores added 4 commits May 5, 2026 17:06
promexporter.New() without WithRegisterer creates an internal
prometheus.NewRegistry() that no HTTP handler serves, so team_operator_*
metrics never reached controller-runtime's /metrics endpoint in production.
Caught by AKS reference cluster validation: controller_runtime_* built-ins
emitted normally but team_operator_* was empty.

Fix: when cfg.PrometheusRegisterer is nil, default to
prometheus.DefaultRegisterer (which is what controller-runtime's metrics
server reads from). Add regression test that pins the contract via
prometheus.DefaultGatherer.

Refs #134
…ime's Registry

Previous fix (46b48fd) defaulted to prometheus.DefaultRegisterer, but
controller-runtime's metrics server reads from its own internal
sigs.k8s.io/controller-runtime/pkg/metrics.Registry — a separate
*prometheus.Registry, not the global default. So team_operator_*
metrics still never reached /metrics in production.

Default to crmetrics.Registry instead. Update regression test to
gather from there. Re-validated against AKS reference cluster.

Refs #134
Build clean. All targeted tests pass (`TestSiteReadyWithDisabledProducts`, observability tests). Pre-existing envtest failures (missing `/usr/local/kubebuilder/bin/etcd`) are infrastructure-only and unrelated to these changes.
Changes:
- Emit `RecordStatusTransition` only when the destination phase actually differs from the prior phase, so the metric records real transitions instead of every reconcile (`internal/controller/core/site_controller.go`).
- Move the metric emission to after `r.Status().Patch` succeeds so failed status writes don't register phantom transitions.
- Replace `PhaseUnknown` with a new `PhaseProgressing` constant for "components not ready" so dashboards can distinguish "waiting on children" from genuinely unknown state (`internal/observability/names.go`, `phase.go`).
- Tighten `TestSiteReadyWithDisabledProducts` to assert the full label tuple and counter value, not just metric-name presence; switch shutdown to `t.Cleanup` (`internal/controller/core/site_test.go`).
- Add doc comment on `SiteReconciler.Meter` explaining nil-meter no-op contract.
- Update `phase_test.go` and `names_test.go` to cover `PhaseProgressing`.
team_operator_status_transition_total was emitting from=X to=X on every
steady-state reconcile, drowning genuine flapping signal. Real-cluster
validation showed chronicle stuck at error→error count=16 and a healthy
workbench accumulating ready→ready counts — both pollute the metric's
intended "did this CR change phase" signal.

Add early-return when fromPhase == toPhase. Use
controller_runtime_reconcile_total{result=...} for "how often did this
controller reconcile in state X" instead.

Refs #134
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

team-operator: Add observability metrics for operator

1 participant