Support opentelemetry observability phase2#29
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements the Phase 2 OTLP push pipeline, enabling observability data export via OTLP/HTTP. Key changes include the introduction of a CompositePropagator to support multiple trace-context formats (W3C and Jaeger), the wiring of BatchSpanProcessor and PeriodicMetricReader in the HttpServer startup, and the addition of a coordinated shutdown mechanism to ensure clean exporter drainage. Several improvements were suggested regarding the Jaeger header parsing logic, header map iteration efficiency, and robust handling of asynchronous OTLP transport requests.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f6b323c2fa
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
LGTM |
OpenTelemetry Observability — Phase 2 (foundation)
Branch:
support-opentelemetry-observability-phase2→mainSummary
This PR ships the foundation half of the Phase 2 OpenTelemetry plan: the OTLP push pipeline + shutdown drain integration (Group 1) and the Propagator interface refactor + native Jaeger / Composite propagators (Group 8). Together these unblock the remaining Phase 2 surface — per-attempt CLIENT spans on the proxy, the
auth.idp_checkspan, the full §7 metrics catalog, kill-marshal CASE A/B, the self-handler shutdown helper, and per-message WebSocket spans — which will land in a separate Phase 3 PR (no production-facing code in Phase 3 depends on changes back to the interfaces this PR introduces).Phase 1 (PR #26) shipped the SDK primitives — TracerProvider, MeterProvider, BatchSpanProcessor, PeriodicMetricReader, OtlpHttpExporter, PrometheusExporter, ObservabilityManager finalize-CAS gate, four-phase shutdown — but the BSP/PMR were not actually wired into the request path's lifecycle:
FlushObservabilityForShutdownhad a TODO at the post-drain point, the PMR was constructed only in tests, the OTLP exporter had no production caller, and the propagator was a static-methodW3CPropagatorwith no extensibility. This PR closes those gaps end-to-end, then builds the propagator abstraction needed for multi-format extract/inject (W3C alongside Jaeger native).The PR is intentionally split here because the resulting diff is already at 32 files / +2298 LOC. Continuing through Groups 2–7 in the same PR would have pushed the diff well beyond reviewable size.
What's changed
Group 1 — OTLP push pipeline wired at startup + shutdown drain integration
HttpServer::MarkServerReady(server/http_server.cc) constructsotlp_upstream_http_client_+otlp_exporter_whentraces.exporter == "otlp_http"and atomically swaps aBatchSpanProcessorinto theTracerProviderviaObservabilityManager::SwapToBatchSpanProcessor. Whenmetrics.exporter == "otlp_http"is also configured, the same exporter instance is fed to a newPeriodicMetricReaderand registered with the manager viaRegisterMetricReader.ObservabilityManager::FlushAll(deadline)(server/observability_manager.cc) is the new polymorphic drain primitive.HttpServer::FlushObservabilityForShutdowncallsWaitForAllAsyncDrainfirst so in-flightFinalizeIfSnapshotcalls land in the BSP queue, thenFlushAll(deadline)blocks the BSP and PMR until both queues drain (or the deadline fires). Encapsulates the processor/reader internals so HttpServer doesn't reach across abstraction boundaries.OtlpHttpExportershared_ptr (the common case),ObservabilityManager::BeginShutdowncallsDisableExporterShutdownOnDrainon both before signalling them, then signalsSignalShutdown()on the exporter exactly once after both workers drain. Without this, the first worker to finish would callSignalShutdownon the shared exporter, causing the other's finalExport()to returnkFailedNotRetryable— silently losing telemetry. The hooks (DisableExporterShutdownOnDrain) were reserved in Phase 1 for exactly this.PeriodicMetricReader::ForceFlush(deadline)now blocks via aflush_completed_count_cv handshake (zero=no-wait, negative=unbounded, positive=bounded). The worker bumps the counter every cycle so aForceFlushcall returns only after the in-flight cycle has completed.Tracer::SwapProcessorAcrossTracersis a new manager-internal API used at startup to move every Tracer from the boot-time NoopProcessor to the BatchSpanProcessor in a single atomic step.server/otlp_transport.cc+include/observability/otlp_transport.h: smallMakeOtlpTransport(weak<UpstreamHttpClient>)factory so the OTLP exporter can post via the production upstream pool with the same connection pooling semantics as proxied traffic.Group 8 — Propagator interface + native Jaeger + composite
Propagatorvirtual base class (include/observability/propagator.h). Replaces the staticW3CPropagator::*StaticAPI with an instance API:Extract(headers),Inject(ctx, headers)(map + vector overloads),StripOwnedHeaders(headers)(map + vector overloads),Name(). Strip-then-inject is the implementation contract — every concrete impl strips its owned headers before emitting fresh values, defending against client-supplied trace-header spoofing.W3CPropagator final : public Propagator— refactored from static methods to instance methods. The static*Staticforwarders are kept and[[deprecated]]-annotated for the migration window; existing call sites swept to the instance API (server/auth_upstream_http_client.cc, the test sweeps).JaegerPropagator final : public Propagator(server/jaeger_propagator.cc) — owns theuber-trace-idheader.Parse()accepts 16-hex (legacy 64-bit, left-padded with zeros to canonical 128-bit) or 32-hex trace-ids, 16-hex span-ids,parent-span-idfield accepting a literal "0" for root spans (informational; gateway does not reconstruct the parent chain), and 1-2 hex chars of flags (only the sampled bit0x01is honored — debug/firehose are dropped).Inject()always emits the canonical 32-hex trace-id form with:0:for parent-span-id and%02xflags.ParseTraceIdHexuses a stackchar buf[32](no heap allocation in the hot extract path).CompositePropagator final : public Propagator(server/composite_propagator.cc).Build(names)returnsshared_ptr<const Propagator>so callers can't reach intochildren_. Throwsstd::invalid_argumenton empty list or unknown name.Extractreturns the first child that produced a valid context (precedence == config order).Injectcalls every child so a single SpanContext is emitted in every wire format the operator configured.StripOwnedHeadersdrops every child-owned header.traces.propagatorsconfig field. NewObservabilityConfig::propagators(default["w3c"], live-reloadable).ConfigLoader::Validaterejects empty list and unknown names viaOBSERVABILITY_NAMESPACE::IsKnownPropagatorName. Recognised tokens:kPropagatorNameW3C,kPropagatorNameJaeger.ObservabilityManager::propagator_is ashared_ptr<const Propagator>swapped viastd::atomic_store_explicit(release)/atomic_load_explicit(acquire)onReload— same pattern asroute_overrides_snapshot_. New requests immediately use the new composite; in-flight requests keep the propagator they were dispatched with. Reader sites (ObservabilityMiddleware::BuildRequestTraceContext,auth_upstream_http_client.cc) load throughmanager->propagator().IsHexCharLowerextracted to an inline helper inpropagator.h. Both W3C and Jaeger parsers reject uppercase hex per W3C §3.2 (treating malformed inbound as untrusted is a security property, not pedantry).Documentation
docs/observability.md— new operator guide (~440 lines). Quick start, master/sub switches, sampler + OTLP push + propagator config, Prometheus pull, shutdown drain semantics, full live-reloadable vs restart-required matrix, troubleshooting (spans not appearing,/metrics404, cardinality overflow, propagator validation errors), full configuration field reference, "Out of scope" listing the Phase 3 deferred items.CI
obs_jaeger_propagatoradded to.github/workflows/ci.yml::build-linux-tsan-restenumeration so the new suite is exercised under TSan on every PR.obs_jaeger_propagatoradded to.github/workflows/weekly-valgrind.ymlfor memory-safety coverage of the parse/inject paths.tsan-heavybucket intentionally untouched —obs_jaeger_propagatoris pure parsing/injection logic with no socket / shutdown sequencing.Test coverage
obs_jaeger_propagatortest/observability_jaeger_propagator_test.h(new)obs_propagatortest/observability_propagator_test.hW3CPropagator{}.X(...)instance API; addedTestW3CPropagatorImplementsInterfaceobs_configtest/observability_config_test.hobs_mgrtest/observability_manager_test.hTestManagerPropagatorReflectsConfig,TestManagerReloadSwapsPropagator,TestMiddlewareHonorsCompositePropagator, plus PMR-registration + FlushAll polymorphic-drain testsobs_exporttest/observability_export_pipeline_test.hobs_tracertest/observability_tracer_test.hSwapProcessorAcrossTracerssemanticsobs_issue_injecttest/observability_issue_inject_test.hFinal test sweep:
1209/1209 PASS(Phase 1 was 1170; +39 tests). All eight obs_* suites individually green.What's NOT in this PR (deferred to Phase 3 follow-up)
ProxyTransaction::AttemptCheckout; terminal callbacks finalize gated onIsKilledForShutdown; header rewrite + serialization moves fromStart()toAttemptCheckout()so retries inject freshtraceparentIssueTraceContext::propagator(gainsconst Propagator*field);auth.idp_checkINTERNAL span around each IdP request lifecycleScheduleStopAfterCurrentResponse()self-handler shutdown helperkill_marshals_in_flight_reserved countertraces.websocket_messages, defaultfalse)File counts
include/observability/otlp_transport.h, propagator.h heavily extended).ccfiles addedcomposite_propagator.cc,jaeger_propagator.cc,otlp_transport.cc)obs_jaeger_propagator) + sweeps across 6 existing suitesdocs/observability.md)ci.yml,weekly-valgrind.yml)Test plan
obs_jaeger_propagatorenumerated inci.ymlobs_jaeger_propagatorenumerated inweekly-valgrind.yml0.115+withotlphttpreceiver) — confirm spans + metrics arrivetraceparent-bearing request andpropagators: ["w3c", "jaeger"]configured — confirm both headers propagate downstreampropagators: ["w3c"]→["w3c", "jaeger"]— confirm new requests get the new composite immediatelyReferences
91b7d60)docs/observability.md