Support opentelemetry observability phase2#29

Merged
mwfj merged 17 commits into main from support-opentelemetry-observability-phase2
May 10, 2026
mwfj commented May 9, 2026

OpenTelemetry Observability — Phase 2 (foundation)

Branch: support-opentelemetry-observability-phase2 → main


Summary

This PR ships the foundation half of the Phase 2 OpenTelemetry plan: the OTLP push pipeline + shutdown drain integration (Group 1) and the Propagator interface refactor + native Jaeger / Composite propagators (Group 8). Together these unblock the remaining Phase 2 surface (per-attempt CLIENT spans on the proxy, the auth.idp_check span, the full §7 metrics catalog, kill-marshal CASE A/B, the self-handler shutdown helper, and per-message WebSocket spans), which will land in a separate Phase 3 PR. No production-facing code in Phase 3 is expected to require further changes to the interfaces this PR introduces.

Phase 1 (PR #26) shipped the SDK primitives — TracerProvider, MeterProvider, BatchSpanProcessor, PeriodicMetricReader, OtlpHttpExporter, PrometheusExporter, ObservabilityManager finalize-CAS gate, four-phase shutdown — but the BSP/PMR were not actually wired into the request path's lifecycle: FlushObservabilityForShutdown had a TODO at the post-drain point, the PMR was constructed only in tests, the OTLP exporter had no production caller, and the propagator was a static-method W3CPropagator with no extensibility. This PR closes those gaps end-to-end, then builds the propagator abstraction needed for multi-format extract/inject (W3C alongside Jaeger native).

The PR is intentionally split here because the resulting diff is already at 32 files / +2298 LOC. Continuing through Groups 2–7 in the same PR would have pushed the diff well beyond reviewable size.


What's changed

Group 1 — OTLP push pipeline wired at startup + shutdown drain integration

  • HttpServer::MarkServerReady (server/http_server.cc) constructs otlp_upstream_http_client_ + otlp_exporter_ when traces.exporter == "otlp_http" and atomically swaps a BatchSpanProcessor into the TracerProvider via ObservabilityManager::SwapToBatchSpanProcessor. When metrics.exporter == "otlp_http" is also configured, the same exporter instance is fed to a new PeriodicMetricReader and registered with the manager via RegisterMetricReader.
  • ObservabilityManager::FlushAll(deadline) (server/observability_manager.cc) is the new polymorphic drain primitive. HttpServer::FlushObservabilityForShutdown calls WaitForAllAsyncDrain first so in-flight FinalizeIfSnapshot calls land in the BSP queue, then FlushAll(deadline) blocks the BSP and PMR until both queues drain (or the deadline fires). Encapsulates the processor/reader internals so HttpServer doesn't reach across abstraction boundaries.
  • Shared-exporter shutdown coordinator. When BSP and PMR hold the same OtlpHttpExporter shared_ptr (the common case), ObservabilityManager::BeginShutdown calls DisableExporterShutdownOnDrain on both before signalling them, then calls SignalShutdown() on the exporter exactly once after both workers drain. Without this, the first worker to finish would call SignalShutdown on the shared exporter, causing the other's final Export() to return kFailedNotRetryable and silently lose telemetry. The DisableExporterShutdownOnDrain hooks were reserved in Phase 1 for exactly this.
  • PeriodicMetricReader::ForceFlush(deadline) now blocks via a flush_completed_count_ cv handshake (zero=no-wait, negative=unbounded, positive=bounded). The worker bumps the counter every cycle so a ForceFlush call returns only after the in-flight cycle has completed.
  • Tracer::SwapProcessorAcrossTracers is a new manager-internal API used at startup to move every Tracer from the boot-time NoopProcessor to the BatchSpanProcessor in a single atomic step.
  • New helper file server/otlp_transport.cc + include/observability/otlp_transport.h: small MakeOtlpTransport(weak<UpstreamHttpClient>) factory so the OTLP exporter can post via the production upstream pool with the same connection pooling semantics as proxied traffic.
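
The shared-exporter rule described in Group 1 can be illustrated with a minimal sketch. The Exporter and Worker structs and the CoordinatedShutdown function are hypothetical stand-ins, not the PR's classes. Only the names DisableExporterShutdownOnDrain and SignalShutdown come from the description above, and their real signatures are assumed. A false return from Export() models kFailedNotRetryable.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for OtlpHttpExporter.
struct Exporter {
    bool shut_down = false;
    int shutdown_calls = 0;
    // Returns false once shut down (models kFailedNotRetryable).
    bool Export() const { return !shut_down; }
    void SignalShutdown() { shut_down = true; ++shutdown_calls; }
};

// Hypothetical stand-in for BatchSpanProcessor / PeriodicMetricReader.
class Worker {
public:
    explicit Worker(std::shared_ptr<Exporter> e) : exporter_(std::move(e)) {
        assert(exporter_ != nullptr);
    }
    // After this call the worker leaves exporter shutdown to the coordinator.
    void DisableExporterShutdownOnDrain() { shutdown_on_drain_ = false; }
    // Final drain: one last export, then (by default) shut the exporter down.
    bool Drain() {
        const bool ok = exporter_->Export();
        if (shutdown_on_drain_) exporter_->SignalShutdown();
        return ok;
    }
    const std::shared_ptr<Exporter>& exporter() const { return exporter_; }

private:
    std::shared_ptr<Exporter> exporter_;
    bool shutdown_on_drain_ = true;
};

struct DrainResult {
    bool first_ok;
    bool second_ok;
};

// Sketch of the coordinator: when both workers share one exporter, disable
// their own shutdown hooks, drain both, then shut the exporter down once.
DrainResult CoordinatedShutdown(Worker& a, Worker& b) {
    const bool shared = a.exporter() == b.exporter();
    if (shared) {
        a.DisableExporterShutdownOnDrain();
        b.DisableExporterShutdownOnDrain();
    }
    DrainResult result{a.Drain(), b.Drain()};  // braces: evaluated left to right
    if (shared) a.exporter()->SignalShutdown();
    return result;
}
```

Without the coordinator, calling Drain() on two workers that share an exporter makes the second drain fail in this model, which is the telemetry loss the bullet describes.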

Group 8 — Propagator interface + native Jaeger + composite

  • Propagator virtual base class (include/observability/propagator.h). Replaces the static W3CPropagator::*Static API with an instance API: Extract(headers), Inject(ctx, headers) (map + vector overloads), StripOwnedHeaders(headers) (map + vector overloads), Name(). Strip-then-inject is the implementation contract — every concrete impl strips its owned headers before emitting fresh values, defending against client-supplied trace-header spoofing.
  • W3CPropagator final : public Propagator — refactored from static methods to instance methods. The static *Static forwarders are kept and [[deprecated]]-annotated for the migration window; existing call sites swept to the instance API (server/auth_upstream_http_client.cc, the test sweeps).
  • JaegerPropagator final : public Propagator (server/jaeger_propagator.cc) — owns the uber-trace-id header. Parse() accepts 16-hex (legacy 64-bit, left-padded with zeros to canonical 128-bit) or 32-hex trace-ids, 16-hex span-ids, parent-span-id field accepting a literal "0" for root spans (informational; gateway does not reconstruct the parent chain), and 1-2 hex chars of flags (only the sampled bit 0x01 is honored — debug/firehose are dropped). Inject() always emits the canonical 32-hex trace-id form with :0: for parent-span-id and %02x flags. ParseTraceIdHex uses a stack char buf[32] (no heap allocation in the hot extract path).
  • CompositePropagator final : public Propagator (server/composite_propagator.cc). Build(names) returns shared_ptr<const Propagator> so callers can't reach into children_. Throws std::invalid_argument on empty list or unknown name. Extract returns the first child that produced a valid context (precedence == config order). Inject calls every child so a single SpanContext is emitted in every wire format the operator configured. StripOwnedHeaders drops every child-owned header.
  • traces.propagators config field. New ObservabilityConfig::propagators (default ["w3c"], live-reloadable). ConfigLoader::Validate rejects empty list and unknown names via OBSERVABILITY_NAMESPACE::IsKnownPropagatorName. Recognised tokens: kPropagatorNameW3C, kPropagatorNameJaeger.
  • Manager-owned propagator with atomic snapshot. ObservabilityManager::propagator_ is a shared_ptr<const Propagator> swapped via std::atomic_store_explicit(release) / atomic_load_explicit(acquire) on Reload — same pattern as route_overrides_snapshot_. New requests immediately use the new composite; in-flight requests keep the propagator they were dispatched with. Reader sites (ObservabilityMiddleware::BuildRequestTraceContext, auth_upstream_http_client.cc) load through manager->propagator().
  • Hex helper deduplication. IsHexCharLower extracted to an inline helper in propagator.h. Both W3C and Jaeger parsers reject uppercase hex per W3C §3.2 (treating malformed inbound as untrusted is a security property, not pedantry).
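
The uber-trace-id rules listed for JaegerPropagator can be sketched as follows. This is an illustration under assumptions, not the PR's code. ParsedJaeger, ParseUberTraceId, and FormatUberTraceId are hypothetical names. The sketch uses std::string rather than the stack buffer the PR mentions, and it assumes the parent-span-id field may be any non-empty lowercase hex string (the description only confirms that a literal "0" is accepted).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical result type; the PR's actual SpanContext is not shown.
struct ParsedJaeger {
    std::string trace_id;  // canonical 32 lowercase hex chars
    std::string span_id;   // 16 lowercase hex chars
    bool sampled = false;
};

inline bool IsHexCharLower(char c) {
    return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f');
}

inline bool AllLowerHex(std::string_view s) {
    if (s.empty()) return false;
    for (char c : s) {
        if (!IsHexCharLower(c)) return false;
    }
    return true;
}

inline int HexValue(char c) {
    assert(IsHexCharLower(c));
    return c <= '9' ? c - '0' : c - 'a' + 10;
}

// Parses "{trace-id}:{span-id}:{parent-span-id}:{flags}".
std::optional<ParsedJaeger> ParseUberTraceId(std::string_view v) {
    std::array<std::string_view, 4> f;
    for (std::size_t i = 0; i < 3; ++i) {
        const std::size_t pos = v.find(':');
        if (pos == std::string_view::npos) return std::nullopt;
        f[i] = v.substr(0, pos);
        v.remove_prefix(pos + 1);
    }
    if (v.find(':') != std::string_view::npos) return std::nullopt;  // too many fields
    f[3] = v;

    if ((f[0].size() != 16 && f[0].size() != 32) || !AllLowerHex(f[0])) return std::nullopt;
    if (f[1].size() != 16 || !AllLowerHex(f[1])) return std::nullopt;
    if (!AllLowerHex(f[2])) return std::nullopt;  // informational only; "0" for root spans
    if (f[3].size() > 2 || !AllLowerHex(f[3])) return std::nullopt;

    ParsedJaeger out;
    out.trace_id.assign(32 - f[0].size(), '0');  // left-pad legacy 64-bit ids
    out.trace_id.append(f[0]);
    out.span_id = std::string(f[1]);
    int flags = 0;
    for (char c : f[3]) flags = flags * 16 + HexValue(c);
    out.sampled = (flags & 0x01) != 0;  // debug/firehose bits are dropped
    return out;
}

// Emits the canonical form: 32-hex trace-id, ":0:" parent, two-digit flags.
std::string FormatUberTraceId(const ParsedJaeger& c) {
    return c.trace_id + ":" + c.span_id + ":0:" + (c.sampled ? "01" : "00");
}
```

Rejecting uppercase hex and malformed field counts up front keeps every later step working on validated input, which matches the "treat malformed inbound as untrusted" stance in the list above.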

Documentation

  • docs/observability.md — new operator guide (~440 lines). Quick start, master/sub switches, sampler + OTLP push + propagator config, Prometheus pull, shutdown drain semantics, full live-reloadable vs restart-required matrix, troubleshooting (spans not appearing, /metrics 404, cardinality overflow, propagator validation errors), full configuration field reference, "Out of scope" listing the Phase 3 deferred items.

CI

  • obs_jaeger_propagator added to .github/workflows/ci.yml::build-linux-tsan-rest enumeration so the new suite is exercised under TSan on every PR.
  • obs_jaeger_propagator added to .github/workflows/weekly-valgrind.yml for memory-safety coverage of the parse/inject paths.
  • macOS subset and the tsan-heavy bucket intentionally untouched — obs_jaeger_propagator is pure parsing/injection logic with no socket / shutdown sequencing.

Test coverage

| Suite | Header | Tests added/swept |
| --- | --- | --- |
| obs_jaeger_propagator | test/observability_jaeger_propagator_test.h (new) | 17 tests across Extract / Inject / StripOwnedHeaders / Composite-fan-out |
| obs_propagator | test/observability_propagator_test.h | Swept to W3CPropagator{}.X(...) instance API; added TestW3CPropagatorImplementsInterface |
| obs_config | test/observability_config_test.h | 4 propagator-list tests (load / default / empty rejected / unknown rejected) |
| obs_mgr | test/observability_manager_test.h | TestManagerPropagatorReflectsConfig, TestManagerReloadSwapsPropagator, TestMiddlewareHonorsCompositePropagator, plus PMR-registration + FlushAll polymorphic-drain tests |
| obs_export | test/observability_export_pipeline_test.h | Shared-exporter shutdown coordinator tests, BSP+OTLP wire-format tests |
| obs_tracer | test/observability_tracer_test.h | SwapProcessorAcrossTracers semantics |
| obs_issue_inject | test/observability_issue_inject_test.h | Swept to instance-API propagator |

Final test sweep: 1209/1209 PASS (Phase 1 was 1170; +39 tests). All eight obs_* suites individually green.

make clean && make -j4 && ./test_runner
... 
Total Tests: 1209 | Passed: 1209 | Failed: 0
Success Rate: 100%
[SUCCESS] All tests passed!

What's NOT in this PR (deferred to Phase 3 follow-up)

| Group | Deliverable |
| --- | --- |
| 2 | Per-attempt CLIENT span on ProxyTransaction::AttemptCheckout; terminal callbacks finalize gated on IsKilledForShutdown; header rewrite + serialization moves from Start() to AttemptCheckout() so retries inject fresh traceparent |
| 3 | Auth-path traceparent injection via IssueTraceContext::propagator (gains const Propagator* field); auth.idp_check INTERNAL span around each IdP request lifecycle |
| 4 | ScheduleStopAfterCurrentResponse() self-handler shutdown helper |
| 5 | Kill-marshal CASE A/B: wires the kill_marshals_in_flight_ reserved counter |
| 6 | Full §7 metrics catalog: server / client / upstream pool / auth / rate-limit / CB / DNS / WS / self-metrics |
| 7 | Per-message WebSocket spans (gated by new traces.websocket_messages, default false) |

File counts

 32 files changed, 2298 insertions(+), 222 deletions(-)

| Surface | Count |
| --- | --- |
| Headers added | 2 (include/observability/otlp_transport.h, propagator.h heavily extended) |
| .cc files added | 3 (composite_propagator.cc, jaeger_propagator.cc, otlp_transport.cc) |
| Tests added | 1 new suite (obs_jaeger_propagator) + sweeps across 6 existing suites |
| Public docs added | 1 (docs/observability.md) |
| CI workflows touched | 2 (ci.yml, weekly-valgrind.yml) |

Test plan

  • Linux (gcc, clang, ASan+UBSan) all-suites — auto-picked-up by CI
  • Linux TSan (rest-bucket) — obs_jaeger_propagator enumerated in ci.yml
  • macOS subset — no edit needed (pure logic suite)
  • Weekly valgrind — obs_jaeger_propagator enumerated in weekly-valgrind.yml
  • Manual end-to-end smoke against an OpenTelemetry Collector (recommend: OTel Collector Contrib 0.115+ with otlphttp receiver) — confirm spans + metrics arrive
  • Manual smoke with a traceparent-bearing request and propagators: ["w3c", "jaeger"] configured — confirm both headers propagate downstream
  • SIGHUP propagators: ["w3c"] → ["w3c", "jaeger"], then confirm new requests get the new composite immediately
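
The SIGHUP check above relies on the snapshot swap described under Group 8: new requests see the new composite, while in-flight requests keep the one they loaded. A minimal sketch of that pattern follows. FakePropagator and Manager are hypothetical stand-ins, and only the use of std::atomic_store_explicit / std::atomic_load_explicit on a shared_ptr comes from the description. These free functions are deprecated since C++20 in favor of std::atomic<std::shared_ptr<T>>, so treat this as a C++17-style illustration.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the composite propagator: it only records the
// configured propagator names, in precedence order.
struct FakePropagator {
    std::vector<std::string> names;
};

// Hypothetical stand-in for the manager's propagator snapshot handling.
class Manager {
public:
    explicit Manager(std::vector<std::string> names) { Reload(std::move(names)); }

    // Readers take an immutable snapshot; it stays valid after a later Reload.
    std::shared_ptr<const FakePropagator> propagator() const {
        return std::atomic_load_explicit(&propagator_, std::memory_order_acquire);
    }

    // Builds a fresh immutable object and publishes it in one atomic store.
    void Reload(std::vector<std::string> names) {
        assert(!names.empty());  // mirrors the "empty list rejected" validation
        std::shared_ptr<const FakePropagator> next =
            std::make_shared<FakePropagator>(FakePropagator{std::move(names)});
        std::atomic_store_explicit(&propagator_, std::move(next),
                                   std::memory_order_release);
    }

private:
    std::shared_ptr<const FakePropagator> propagator_;
};
```

A request that called propagator() before the reload holds its own reference, so the old object is destroyed only after that request finishes with it.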


gemini-code-assist (bot) left a comment

Code Review

This pull request implements the Phase 2 OTLP push pipeline, enabling observability data export via OTLP/HTTP. Key changes include the introduction of a CompositePropagator to support multiple trace-context formats (W3C and Jaeger), the wiring of BatchSpanProcessor and PeriodicMetricReader in the HttpServer startup, and the addition of a coordinated shutdown mechanism to ensure clean exporter drainage. Several improvements were suggested regarding the Jaeger header parsing logic, header map iteration efficiency, and robust handling of asynchronous OTLP transport requests.

Comment thread server/jaeger_propagator.cc Outdated
Comment thread server/propagator.cc Outdated
Comment thread server/otlp_transport.cc
Comment thread server/otlp_transport.cc

chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f6b323c2fa

Comment thread server/otlp_transport.cc Outdated
Comment thread server/jaeger_propagator.cc Outdated

mwfj commented May 10, 2026

LGTM

mwfj merged commit d039208 into main May 10, 2026
6 checks passed
mwfj deleted the support-opentelemetry-observability-phase2 branch May 10, 2026 12:18
mwfj mentioned this pull request May 10, 2026
12 tasks