
Add always-on /metrics endpoint with dual pull/push telemetry #138

Merged
sjmiller609 merged 5 commits into main from codex/dual-mode-metrics-endpoint
Mar 9, 2026

Conversation

@sjmiller609 sjmiller609 commented Mar 9, 2026

Collaborator

Summary

  • add an always-on Prometheus /metrics endpoint on a dedicated metrics listener (default 127.0.0.1:9464)
  • keep OTLP metrics push on a schedule when otel.enabled=true
  • make pull + push share the same OTel meter provider/instruments for a consistent metric model
  • add config keys for metrics listener and OTLP push interval

Config additions

  • metrics.listen_address (default 127.0.0.1)
  • metrics.port (default 9464)
  • otel.metric_export_interval (default 60s)
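The defaults above can be sketched as plain Go values. This is a minimal illustration, not the PR's actual config code: the struct names, the `Validate` rules, and the `Addr` helper are assumptions introduced here; only the key names and default values come from the description.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"strconv"
	"time"
)

// MetricsConfig is a hypothetical struct mirroring the metrics.* keys.
type MetricsConfig struct {
	ListenAddress string // metrics.listen_address
	Port          int    // metrics.port
}

// OtelConfig is a hypothetical struct mirroring the otel.* keys used here.
type OtelConfig struct {
	Enabled              bool          // otel.enabled (no default is stated in this PR)
	MetricExportInterval time.Duration // otel.metric_export_interval
}

// defaultMetricsConfig returns the defaults listed in the PR description.
func defaultMetricsConfig() MetricsConfig {
	return MetricsConfig{ListenAddress: "127.0.0.1", Port: 9464}
}

// defaultOtelConfig returns the documented 60s push interval.
func defaultOtelConfig() OtelConfig {
	return OtelConfig{MetricExportInterval: 60 * time.Second}
}

// Validate applies assumed rules: a non-empty address and a valid TCP port.
func (c MetricsConfig) Validate() error {
	if c.ListenAddress == "" {
		return errors.New("metrics.listen_address must not be empty")
	}
	if c.Port < 1 || c.Port > 65535 {
		return fmt.Errorf("metrics.port must be in 1..65535, got %d", c.Port)
	}
	return nil
}

// Addr composes host and port; net.JoinHostPort brackets IPv6 literals.
func (c MetricsConfig) Addr() string {
	return net.JoinHostPort(c.ListenAddress, strconv.Itoa(c.Port))
}

func main() {
	cfg := defaultMetricsConfig()
	if err := cfg.Validate(); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println(cfg.Addr(), defaultOtelConfig().MetricExportInterval)
}
```

Using `net.JoinHostPort` rather than string concatenation keeps the listener address correct if an IPv6 literal such as `::1` is configured.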

Env mapping (via __ nesting):

  • METRICS__LISTEN_ADDRESS
  • METRICS__PORT
  • OTEL__METRIC_EXPORT_INTERVAL
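The `__` nesting convention can be illustrated with a small helper. This is a sketch of the naming rule only; `envToConfigKey` is a hypothetical function, and the real project may rely on a config library to do this mapping.

```go
package main

import (
	"fmt"
	"strings"
)

// envToConfigKey converts an environment variable name such as
// METRICS__LISTEN_ADDRESS into a dotted config key such as
// metrics.listen_address. A double underscore separates nesting levels;
// single underscores stay part of the key name.
func envToConfigKey(name string) string {
	return strings.ToLower(strings.ReplaceAll(name, "__", "."))
}

func main() {
	for _, env := range []string{
		"METRICS__LISTEN_ADDRESS",
		"METRICS__PORT",
		"OTEL__METRIC_EXPORT_INTERVAL",
	} {
		fmt.Printf("%s -> %s\n", env, envToConfigKey(env))
	}
}
```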

Behavior details

  • Prometheus pull exporter is required for startup
  • OTLP metric push exporter failure is non-fatal (warn + continue with pull metrics)
  • /metrics is served outside API auth/OpenAPI middleware on a separate listener
  • metrics server lifecycle is integrated into the same startup/shutdown errgroup flow as API server
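The first two bullets describe an asymmetric failure policy: the pull exporter must come up, while a push exporter failure only produces a warning. A stdlib-only sketch of that control flow follows. The `exporterFactory` type and all function names are hypothetical stand-ins, not the OpenTelemetry SDK or the code in lib/otel/otel.go.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// exporterFactory is a hypothetical stand-in for constructing an exporter;
// it returns the exporter's name on success.
type exporterFactory func() (string, error)

// okExporter and failingExporter build factories for the demonstration.
func okExporter(name string) exporterFactory {
	return func() (string, error) { return name, nil }
}

func failingExporter(msg string) exporterFactory {
	return func() (string, error) { return "", errors.New(msg) }
}

// initTelemetry returns the names of the active exporters.
// A pull exporter failure aborts startup; a push exporter failure is
// logged and startup continues with pull metrics only.
func initTelemetry(newPull, newPush exporterFactory, pushEnabled bool) ([]string, error) {
	pull, err := newPull()
	if err != nil {
		return nil, fmt.Errorf("prometheus pull exporter: %w", err)
	}
	active := []string{pull}
	if !pushEnabled {
		return active, nil
	}
	push, err := newPush()
	if err != nil {
		log.Printf("WARN: OTLP push exporter unavailable, continuing with pull metrics: %v", err)
		return active, nil
	}
	return append(active, push), nil
}

func main() {
	// Push is enabled but its endpoint is bad: startup still succeeds.
	active, err := initTelemetry(okExporter("prometheus"), failingExporter("connection refused"), true)
	fmt.Println(active, err)
}
```

Treating only the pull path as required means a misconfigured collector endpoint cannot take the process down, while /metrics remains scrapeable.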

Tests

  • config defaults/env override/validation tests for new fields
  • OTel tests verifying /metrics is available with push disabled and with push enabled but bad endpoint
  • metrics server tests for address composition, serve/shutdown lifecycle, and bind failure behavior
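The serve/shutdown lifecycle and bind failure cases named in the last bullet might look roughly like the following. It is a stdlib-only sketch: `serveMetrics`, `probe`, and `bindConflict` are hypothetical names, the handler emits a made-up `demo_up` series, and the PR itself coordinates the servers with an errgroup rather than the bare channels used here.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net"
	"net/http"
	"time"
)

// serveMetrics binds addr, serves handler at /metrics on its own mux (so no
// API middleware applies), and blocks until ctx is cancelled or serving fails.
// If ready is non-nil it receives the bound address; it must be buffered.
func serveMetrics(ctx context.Context, addr string, handler http.Handler, ready chan<- string) error {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return fmt.Errorf("bind metrics listener %s: %w", addr, err)
	}
	mux := http.NewServeMux()
	mux.Handle("/metrics", handler)
	srv := &http.Server{Handler: mux, ReadHeaderTimeout: 5 * time.Second}

	errCh := make(chan error, 1)
	go func() { errCh <- srv.Serve(ln) }()
	if ready != nil {
		ready <- ln.Addr().String()
	}

	select {
	case err := <-errCh:
		return err
	case <-ctx.Done():
		shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		if err := srv.Shutdown(shutdownCtx); err != nil {
			return err
		}
		// Serve returns http.ErrServerClosed after a graceful shutdown.
		if err := <-errCh; !errors.Is(err, http.ErrServerClosed) {
			return err
		}
		return nil
	}
}

// probe starts the server on an ephemeral port, scrapes it once, then shuts down.
func probe() (int, string, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	ready := make(chan string, 1)
	done := make(chan error, 1)
	h := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		io.WriteString(w, "demo_up 1\n")
	})
	go func() { done <- serveMetrics(ctx, "127.0.0.1:0", h, ready) }()

	var addr string
	select {
	case addr = <-ready:
	case err := <-done:
		return 0, "", err
	}
	resp, err := http.Get("http://" + addr + "/metrics")
	if err != nil {
		return 0, "", err
	}
	body, _ := io.ReadAll(resp.Body)
	resp.Body.Close()
	cancel()
	return resp.StatusCode, string(body), <-done
}

// bindConflict occupies a port first, so serveMetrics must report a bind error.
func bindConflict() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	defer ln.Close()
	return serveMetrics(context.Background(), ln.Addr().String(), http.NotFoundHandler(), nil)
}

func main() {
	status, body, err := probe()
	fmt.Println(status, body, err)
	fmt.Println("bind conflict:", bindConflict())
}
```

Binding the listener before starting the goroutine is what lets a port conflict surface as a startup error instead of a background failure.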

Note

Medium Risk
Touches telemetry initialization and server startup/shutdown by adding a dedicated /metrics listener and changing OTel init behavior, which could impact availability and observability if misconfigured. Also changes exported VM metrics (removing a series and adding guardrails), which may affect dashboards/alerts.

Overview
Adds an always-on Prometheus pull metrics endpoint (/metrics) served from a dedicated HTTP listener (default 127.0.0.1:9464), while keeping OTLP push export optional when otel.enabled=true and configurable via otel.metric_export_interval.

Introduces metrics.listen_address, metrics.port, and metrics.vm_label_budget config/env keys with validation, wires the budget into vm_metrics, and adds guardrail metrics for per-VM label cardinality. Also removes the denormalized hypeman_vm_memory_utilization_ratio series and updates the Grafana dashboard query accordingly.
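One way a per-VM label budget could work is sketched below. The summary does not say how metrics.vm_label_budget is enforced, so everything here is an assumption for illustration: the `labelBudget` type, the `__overflow__` label value, and the policy of folding excess VMs into a single overflow series while counting them.

```go
package main

import (
	"fmt"
	"sync"
)

// overflowLabel is a hypothetical label value for VMs beyond the budget.
const overflowLabel = "__overflow__"

// labelBudget admits at most budget distinct VM IDs as label values.
type labelBudget struct {
	mu       sync.Mutex
	budget   int
	seen     map[string]struct{}
	overflow int // observations folded into overflowLabel
}

func newLabelBudget(budget int) *labelBudget {
	return &labelBudget{budget: budget, seen: make(map[string]struct{})}
}

// label returns vmID if it is already admitted or fits in the budget;
// otherwise it returns overflowLabel and increments the overflow count,
// which could itself be exported as a guardrail metric.
func (b *labelBudget) label(vmID string) string {
	b.mu.Lock()
	defer b.mu.Unlock()
	if _, ok := b.seen[vmID]; ok {
		return vmID
	}
	if len(b.seen) < b.budget {
		b.seen[vmID] = struct{}{}
		return vmID
	}
	b.overflow++
	return overflowLabel
}

// overflowCount reports how many observations exceeded the budget.
func (b *labelBudget) overflowCount() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.overflow
}

func main() {
	b := newLabelBudget(2)
	for _, id := range []string{"vm-a", "vm-b", "vm-c", "vm-a"} {
		fmt.Println(id, "->", b.label(id))
	}
	fmt.Println("overflow observations:", b.overflowCount())
}
```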

Improves observability hygiene: adds OTel tracing spans for WebSocket exec/cp sessions with constrained attributes, ensures HTTP metrics use a sentinel path label for unmatched routes, and reduces INFO-level log noise in several background/ingress paths.
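The sentinel path label exists because labelling HTTP metrics with the raw URL path would let arbitrary requests create unbounded label values. A minimal sketch of the idea follows, using net/http's ServeMux purely for illustration. The PR does not show which router lib/middleware/otel.go works with, and the sentinel value `unmatched` is a hypothetical choice.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// unmatchedPathLabel is a hypothetical sentinel for requests that match no route.
const unmatchedPathLabel = "unmatched"

// pathLabel returns the registered route pattern for r, or the sentinel when
// no route matches, so the label set stays bounded by the route table.
func pathLabel(mux *http.ServeMux, r *http.Request) string {
	_, pattern := mux.Handler(r)
	if pattern == "" {
		return unmatchedPathLabel
	}
	return pattern
}

// demoMux registers two example routes.
func demoMux() *http.ServeMux {
	mux := http.NewServeMux()
	ok := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	mux.Handle("/health", ok)
	mux.Handle("/instances/", ok)
	return mux
}

// labelFor computes the metric label for a GET request to path.
func labelFor(path string) string {
	return pathLabel(demoMux(), httptest.NewRequest(http.MethodGet, path, nil))
}

func main() {
	for _, p := range []string{"/health", "/instances/abc", "/wp-login.php"} {
		fmt.Printf("%s -> %s\n", p, labelFor(p))
	}
}
```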

Written by Cursor Bugbot for commit 327d96d.

github-actions Bot commented Mar 9, 2026

✱ Stainless preview builds

This PR will update the hypeman SDKs with the following commit message.

feat: Add always-on /metrics endpoint with dual pull/push telemetry
hypeman-openapi studio · code

Your SDK build had at least one "note" diagnostic.
generate ✅

⚠️ hypeman-typescript studio · code

Your SDK build had at least one "error" diagnostic.
generate ❗ build ✅ lint ✅ test ✅

npm install https://pkg.stainless.com/s/hypeman-typescript/41556a922b51d2de413fd4486e86a2ff107cdd48/dist.tar.gz
hypeman-go studio · code

Your SDK build had at least one "note" diagnostic.
generate ✅ build ✅ lint ✅ test ✅

go get github.com/stainless-sdks/hypeman-go@0b3751a37068dca3771ad04fc1af876c5e52384f

This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
If you push custom code to the preview branch, re-run this workflow to update the comment.
Last updated: 2026-03-09 20:37:00 UTC

@sjmiller609 sjmiller609 force-pushed the codex/dual-mode-metrics-endpoint branch from 262d071 to 0a08f81 Compare March 9, 2026 03:33
@sjmiller609 sjmiller609 marked this pull request as ready for review March 9, 2026 04:36

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Comment thread cmd/api/main.go
Comment thread lib/otel/otel.go Outdated

@cursor cursor Bot left a comment

Automated Risk Assessment

Risk level: Medium-High

Code review is required for this PR.

Why this risk level (from code diff evidence):

  • Large behavioral change set: 34 files, 1010 insertions, 225 deletions.
  • Core runtime/server lifecycle changes in cmd/api/main.go add a second always-on listener and integrate its startup/shutdown into process control flow.
  • Telemetry stack redesign in lib/otel/otel.go changes provider initialization, exporter failure behavior, runtime instrumentation handling, and introduces Prometheus pull export wiring.
  • Shared config and DI surface changed (cmd/api/config/config.go, lib/providers/providers.go) with new validation/defaults and wiring paths.
  • Metric model/labels changed in shared VM metrics subsystem (lib/vm_metrics/*, lib/middleware/otel.go) affecting operational observability semantics.
  • New skill prompt/instruction files under skills/ are included; prompt/instruction changes are treated as elevated-risk by policy.

Decision:

  • Not approved (only Very Low / Low are eligible for auto-approval).
  • Requested reviewers: @hiroTamada, @rgarcia.

@cursor cursor Bot requested review from hiroTamada and rgarcia March 9, 2026 04:46
@sjmiller609 sjmiller609 enabled auto-merge (squash) March 9, 2026 04:51

@hiroTamada hiroTamada left a comment

Contributor

solid design -- always-on prometheus pull with optional OTLP push is the right architecture. cardinality guardrails, denormalized metric removal, log noise reduction, and test consolidation are all good improvements. no blocking concerns.

@sjmiller609 sjmiller609 merged commit 27492b9 into main Mar 9, 2026
6 checks passed
@sjmiller609 sjmiller609 deleted the codex/dual-mode-metrics-endpoint branch March 9, 2026 20:35