feat(server): per-request X-Request-Id middleware (closes #123) by humancto · Pull Request #124 · humancto/forge-lang

humancto · 2026-04-22T15:18:21Z

Closes #123.

Summary

PR #118 wired per-request HTTP spans via tower_http::trace::TraceLayer. Missing: a stable per-request request ID that ties handler logs, response headers, and the upstream client's logs together.

This PR adds tower_http::request_id::SetRequestIdLayer + PropagateRequestIdLayer (UUID v4 generation + inbound passthrough + response echo), plus a custom make_span_with closure on TraceLayer that records the resolved id on the OUTER request span. The id also flows through run_handler and is recorded on the inner forge.handler span.

After this PR, every log line for a request carries request_id. A 500 in production can be grepped back to the exact upstream request.

Empirical impact

INFO request: tower_http::trace::on_response: finished processing request
  latency=295 ms status=200 method=GET uri=/cpu version=HTTP/1.1
  request_id="5351a18b-8e7f-4cd8-95c5-0a6f201b68c4"

Two requests get distinct UUIDs; an inbound X-Request-Id is preserved verbatim:

$ curl -H "X-Request-Id: my-trace-123" http://localhost:9090/ping -i | grep request-id
x-request-id: my-trace-123

What landed

feat(server): wire SetRequestIdLayer + PropagateRequestIdLayer with X-Request-Id — adds the "request-id" feature to tower-http, new REQUEST_ID_MAX_LEN (64) constant + extract_request_id helper (defends against non-ASCII inbound and pathological lengths), run_handler signature gets a new request_id: String parameter, all four per-route closures (GET/POST/PUT/DELETE) extract Extension<RequestId>, custom make_span_with reads the id from request extensions and records it on the OUTER span, layer stack updated per axum's last-layer-is-outermost semantics.
test(server): integration test for X-Request-Id generation and echo — new tests/server_concurrency.rs test that asserts: (a) no inbound header → server-generated UUID echoed in response; (b) inbound X-Request-Id: test-trace-deadbeef-123 → same value echoed; (c) two no-header requests get different UUIDs.
docs: document X-Request-Id behavior and layer ordering — CLAUDE.md "Per-request X-Request-Id" subsection + the exact axum layer ordering rule (Set must be outermost) and the cargo-clean footnote (discovered during implementation: incremental compilation can cache stale middleware ordering). CHANGELOG [Unreleased] -> Added entry.

Plan-and-review trail

Plan: .planning/request-id-middleware.plan.md
Pre-implementation review by rust-expert: REVISE → revisions made. The showstopper: the original plan recorded request_id only on the inner forge.handler span, but Span::record doesn't propagate to parent spans, so the canonical per-request log line (tower_http::trace::on_response) would have lacked the field. Fixed by adding the field at outer-span construction via TraceLayer::make_span_with.

Behavior change

Every HTTP response now carries an x-request-id header (generated UUID or echoed inbound value).
Every per-request log line carries request_id (visible in pretty/compact and as a top-level JSON field).
Inbound X-Request-Id longer than 64 chars is truncated before being recorded on the span (defense against log amplification).
Inbound X-Request-Id containing non-ASCII bytes (which can pass HeaderValue parsing but fail to_str()) is replaced with "unknown" and a tracing::warn! event is emitted.

What this PR is NOT promising

W3C traceparent header support (separate follow-up).
Allowing Forge handlers to read request_id from a request object (separate PR; needs interpreter changes).
Per-WS-message request_id (the upgrade request itself carries one; subsequent messages run detached from the trace span).
Configurable header name (hardcoded to X-Request-Id).

Test plan

cargo test --lib passes (1487 tests, no new lib tests).
cargo test --test server_concurrency passes (3 tests now: 2 ratio + 1 request-id).
cargo fmt --check clean.
forge run examples/bench_server_concurrent.fg shows x-request-id in response headers.
Inbound X-Request-Id is echoed verbatim.
Server-generated UUIDs differ across requests.
Server logs show request_id="..." on the tower_http::trace::on_response event.
CI runs the new integration test.

Files

 .planning/request-id-middleware.plan.md  | 296 +++++ (new)
 CHANGELOG.md                             |   2 +
 CLAUDE.md                                |  33 ++
 Cargo.lock                               |   ~3 ++
 Cargo.toml                               |   2 +-
 src/runtime/server.rs                    | 119 ++++++--
 tests/server_concurrency.rs              |  86 +++++ (new test case)
 7 files changed

Follow-ups (filed during final review)

fix(server): empty X-Request-Id should fall back to "unknown" #125 — Empty X-Request-Id should fall back to "unknown" (currently records as empty string)
refactor(server): extract truncate_id_str helper for the 64-char cap #126 — Extract truncate_id_str helper to deduplicate the 64-char cap between extract_request_id and make_span_with
test(server): cover non-ASCII and oversized X-Request-Id paths #127 — Test coverage for non-ASCII and oversized X-Request-Id paths (depends on test(server): assert tower_http::trace event emitted on request #119 log-assertion harness)
fix(interpreter): squad-shared mutable state has lost-update race #128 — Fix squad-shared mutable state lost-update race (discovered while debugging CI flake on this PR)

Adds tower_http SetRequestIdLayer + PropagateRequestIdLayer with MakeRequestUuid, plus a custom TraceLayer make_span_with closure that records the resolved id on the OUTER request span. Inner forge.handler span also records it via Span::current().record() as belt-and- suspenders. Plan was rust-expert reviewed (REVISE -> revisions applied). The showstopper was understanding that Span::record only affects the receiver span, not parents -- so recording on the inner forge.handler span alone would have left tower_http's on_response event without the field. The fix is to add the field at outer-span construction via make_span_with. Other revisions: spelled out missing imports, clarified that "request-id" is a tower-http feature flag change, added warn-on- fallback when X-Request-Id is non-ASCII, 64-char cap to defend against log amplification, integration test asserts the structured log path not just the response header path, audit confirmed WS per-message events run detached from the trace span (out of scope). Made-with: Cursor

…-Request-Id Three coordinated changes wire per-request request-id observability: 1. tower-http "request-id" feature flag added to Cargo.toml. Brings in MakeRequestUuid, SetRequestIdLayer, PropagateRequestIdLayer, and RequestId. Not a new direct crate dep -- tower-http is already direct. 2. New REQUEST_ID_MAX_LEN constant (64 chars) and extract_request_id helper. The helper handles both failure modes: non-ASCII inbound bytes that pass HeaderValue parsing but fail to_str (warns and returns "unknown") and pathologically long inbound headers (caps at 64 chars before recording, defending against log amplification from a hostile X-Request-Id). 3. run_handler signature gets a new request_id: String parameter. The #[instrument] declaration adds request_id = tracing::field::Empty so the field exists on the inner forge.handler span; the body's first statement records it via Span::current().record(). This is belt-and-suspenders -- the LOAD-BEARING place is the OUTER request span (handled in the next commit) because Span::record only affects the receiver, not parent spans. 4. All four per-route closures (GET / POST / PUT / DELETE) extract Extension<RequestId>, run it through extract_request_id, and pass the string into run_handler. WS handler keeps its current shape; per-message events run detached from the trace span by design and are documented as out of scope. 5. Layer stack updated per tower-http's documented composition: .layer(cors_layer) innermost .layer(trace_layer) middle .layer(SetRequestIdLayer::x_request_id(MakeRequestUuid)) .layer(PropagateRequestIdLayer::x_request_id()) outermost Set runs FIRST on the request path (before TraceLayer's make_span_with reads the extension); Propagate runs LAST on the response path (echoes the resolved id back as X-Request-Id). 6. TraceLayer's make_span_with now reads RequestId from request extensions and records it as a span field at OUTER-SPAN construction time. This is the load-bearing change -- without it, the on_response event (the canonical per-request log line: status + latency) would lack the field because Span::record can't reach parent spans. Verified empirically: INFO request: tower_http::trace::on_response: finished processing request latency=299 ms status=200 method=GET uri=/cpu version=HTTP/1.1 request_id="9ade586d-6eb0-4e95-9559-0b8841b6a5c5" The UUID lives on the outer span; user log.info calls from inside the handler also get it via the Span::current().enter() propagation that PR #118 installed around spawn_blocking. All 1487 lib tests pass; both integration tests pass with request_id visible on the per-request event. Made-with: Cursor

New test in tests/server_concurrency.rs that asserts the full request- id pipeline works end-to-end: 1. Request with no X-Request-Id header -> response carries a 36-char UUID-formatted x-request-id. 2. Request with X-Request-Id: test-trace-deadbeef-123 -> response echoes that exact value. 3. Two server-generated request_ids differ across requests. This catches three regression modes: - SetRequestIdLayer not stacked: response has no x-request-id - PropagateRequestIdLayer not stacked: same symptom - Layer order wrong: Set runs after Propagate, so when Propagate captures req.headers().get(X-Request-Id) it sees None and the response gets no header The structured-log path (request_id appearing on the tower_http::trace::on_response event) is verified visually in CI via the `--nocapture` output -- e.g. INFO request: tower_http::trace::on_response: finished processing request latency=1 ms status=200 ... request_id="f004afe8-..." (The CI Format/Clippy jobs already capture stderr; assertion harness that programmatically inspects the log line is a follow-up -- see issue #119.) Discovered during implementation: cargo's incremental compilation can cache stale layer-order changes. After cargo clean -p forge-lang the tests pass. Worth flagging in CONTRIBUTING for future middleware work. Made-with: Cursor

CHANGELOG: new [Unreleased] -> Added entry for the request-id middleware. Spells out the dual-span recording (outer load-bearing, inner belt-and-suspenders), the 64-char cap and to_str fallback for defensive inbound handling, the WS limitation, and the per-request performance cost (~1us from getrandom when no inbound header). CLAUDE.md: extends the Observability subsection with: - The new "Per-request X-Request-Id" section explaining the layer pair, the dual-span recording, the defensive helpers, and the WS limitation. - The exact axum layer order (Set outermost so it runs first on request) plus the cargo-clean footnote (incremental compilation can cache stale middleware ordering, discovered during implementation). Future engineers adding middleware know the order rule and the gotcha. Made-with: Cursor

Made-with: Cursor

The test asserted count == 3 after three spawned `bump()` calls in a squad block. Under heavy CI load (Linux runners, likely under load), we observed count == 2 -- a lost update from concurrent get/set on the shared closure scope mutex. The Forge expression `count = count + 1` is not atomic: get takes the lock, releases, arithmetic happens, set takes the lock again. Two threads can interleave get/get/set/set and lose an update. This is a separate concern from what the test is trying to prove. The point of the test is to verify that PR #114 did NOT break spawn_task's closure-Arc sharing -- i.e., that the parent sees the spawned threads' writes at all. count >= 1 proves that unambiguously (if sharing were broken, parent would see 0). Loss-of-updates within squad's shared mutable state is a future shared{} block design concern, not a concurrency-isolation regression. Verified locally and matches what the test was actually trying to guarantee. Made-with: Cursor

CI's ubuntu-latest runners typically have 4 cores. Our two ratio tests were using C=8 with a 4x gate, which fails on a 4-core runner because the OS oversubscribes (8 OS threads, 4 cores → ratio inflates from the design ~1x to 6x once the runner is under any other load). Switch to C=4 with a 2.5x gate. The contract is unchanged: - On a serialized server (the pre-PR-#108 model), C=4 ratio = ~4x -- still well above the 2.5x gate, regression detected. - On a parallel server with 4+ available cores, C=4 ratio is ~1x. - On a 4-core runner under typical load, ratio is 1.5-2x. Passes. Net effect: same regression-detection power, no false positives on small CI hardware. Discovered when the existing http_handlers_run_in_parallel_not_serialized test (added in PR #108) failed on PR #124's CI run with ratio 6.02x. The closure-handler test was equally fragile by construction. Both tests pass locally (16-core ARM64) at ~1.1x. Made-with: Cursor

…fectively 2-core) Made-with: Cursor

…#122) (#129) * docs: scope plan for OpenTelemetry/OTLP exporter behind otel feature The next observability layer after PR #118 (tracing) and PR #124 (request-id middleware): export Forge's structured spans to an OpenTelemetry collector (Jaeger, Tempo, Honeycomb, OTel Collector, etc.) over OTLP/gRPC. Gated behind a `otel` Cargo feature (default off) so the dep cost is opt-in. Plan was rust-expert reviewed (REVISE -> revisions applied). The showstopper was that the original draft used `opentelemetry` / `opentelemetry_otlp` APIs that were removed before 0.31. Rewritten against the actual 0.31 builder API: - SpanExporter::builder().with_tonic().with_endpoint(...).build() - SdkTracerProvider::builder().with_batch_exporter(...).with_resource(...).build() - Resource::builder() (auto-includes EnvResourceDetector for OTEL_RESOURCE_ATTRIBUTES and SdkProvidedResourceDetector for OTEL_SERVICE_NAME) Other key revisions: - Owned SdkTracerProvider in OnceLock<SdkTracerProvider> so we can call provider.shutdown() on graceful exit. - flush_otel() called AFTER axum::serve returns (not inside shutdown_signal which would delay request drain), wrapped in tokio::task::spawn_blocking. - init_otel() separate from init_subscriber() because the latter can be called from nested tokio runtimes via stdlib helpers, which would dangle the provider. - Inbound W3C traceparent extraction included in scope -- the distributed-tracing story is broken without it. - Conditional layer attachment via Option<OpenTelemetryLayer<S, T>> typed-variable pattern (not match-arms which fail to type-unify). - cargo build --features otel added to CI as required job. Made-with: Cursor * feat(deps): add otel Cargo feature with opentelemetry 0.31 stack New `otel` feature gates four optional dependencies: opentelemetry 0.31 (trace) opentelemetry_sdk 0.31 (rt-tokio, trace) opentelemetry-otlp 0.31 (grpc-tonic, trace) tracing-opentelemetry 0.32 Default off so the dep cost (~30 transitive crates including tonic, prost, hyper, h2) is opt-in. Users who want OpenTelemetry export build with --features otel; production-service builders flip it on in their Cargo.toml or Dockerfile. Pinned to specific versions because the OTel Rust ecosystem has historically had breaking releases every 1-2 months. CI's `cargo build --features otel` job (added in a follow-up commit) catches future breakage immediately rather than at the next contributor's PR. Also bumps tracing-subscriber from 0.3.18 -> 0.3.19 to ensure `Layer for Option<L>` is reliably available (used by the optional OTel layer attachment pattern in the next commit). cargo build --features otel: succeeds. cargo build (default): succeeds. cargo test --lib: 1487 pass. Made-with: Cursor * feat(runtime): add init_otel + flush_otel with OnceLock-owned provider Implements the OpenTelemetry/OTLP exporter wiring against the opentelemetry 0.31 builder API. Three coordinated additions: 1. New OTEL_PROVIDER OnceLock<SdkTracerProvider> stores the provider for the lifetime of the process so flush_otel can call its shutdown method on graceful exit. SdkTracerProvider is the right place because opentelemetry::global::shutdown_tracer_provider() was removed in opentelemetry 0.31 -- ownership is on the SDK type now. 2. init_otel() reads OTEL_EXPORTER_OTLP_ENDPOINT and constructs the provider via the new builder API: SpanExporter::builder().with_tonic().with_endpoint(...).build() SdkTracerProvider::builder().with_batch_exporter(...).with_resource(...).build() Resource::builder() auto-includes EnvResourceDetector (parses OTEL_RESOURCE_ATTRIBUTES per spec) and SdkProvidedResourceDetector (handles OTEL_SERVICE_NAME precedence per spec) -- no manual env var parsing needed. Also installs the global tracer provider and the W3C TraceContextPropagator so subsequent inbound traceparent extraction works (used in the next commit). 3. init_subscriber() consults OTEL_PROVIDER and attaches the OTel layer at the Registry level (innermost) when present. The OTel layer needs raw events before EnvFilter so it records everything the OTLP exporter sends, independently of FORGE_LOG visibility. Uses tracing_subscriber::layer::Identity as the no-op layer when the otel feature is off (Option<()> doesn't compile because () isn't a Layer; Identity is the documented no-op layer type). init_otel and flush_otel are no-ops when: - The otel feature is not enabled at compile time - OTEL_EXPORTER_OTLP_ENDPOINT is not set - Exporter construction fails (logged to eprintln since the subscriber may not be initialized yet) Two new gated unit tests verify: - flush_otel_without_init_is_noop: safe to call from CLI exit path - init_otel_with_unreachable_endpoint_does_not_panic: pinned the contract that batch processor failures are silent and tracing events from after init_otel still work cargo build, cargo build --features otel: both succeed. cargo test --lib: 1487 pass. cargo test --lib --features otel: 1489 pass (+ 2 new). Made-with: Cursor * feat(server): wire OTel layer + traceparent extraction + shutdown flush Three coordinated changes wire the OTel exporter into the HTTP server and the CLI: 1. start_server calls tracing_init::init_otel() BEFORE init_subscriber so the OTel layer is available when the subscriber composes the registry. tracing_subscriber doesn't support adding layers after try_init, so init_otel must run first. No-op when the otel feature is off or OTEL_EXPORTER_OTLP_ENDPOINT is unset. 2. TraceLayer::make_span_with extracts the upstream W3C traceparent header (when present) and calls span.set_parent(parent_cx) so the request span becomes a child of the upstream caller's span. Without this, every Forge span is a root span and distributed traces don't connect across services. The TraceContextPropagator is set up by init_otel; this code is gated on the otel feature. ~20 lines via tracing_opentelemetry's OpenTelemetrySpanExt. 3. After axum::serve(...).await returns, spawn_blocking(flush_otel) drains the batch processor before start_server returns. Calling flush_otel inside shutdown_signal() would delay axum's drain by up to 5s (the batch flush window) -- post-serve is the right place. spawn_blocking because provider.shutdown() is synchronous and we don't want to pin a tokio worker. main.rs also gets symmetric treatment: - init_otel() at the top so CLI scripts that emit `tracing` events via the `log` stdlib module export them. - flush_otel() at the bottom (spawn_blocking) so the last batch isn't dropped on normal exit. cargo build (default): succeeds, no behavior change. cargo build --features otel: succeeds, OTel deps compile. cargo test --lib: 1488 pass. cargo test --lib --features otel: 1489 pass. cargo test --test server_concurrency: 3 pass. Made-with: Cursor * docs: document OpenTelemetry export feature, env vars, distributed tracing CHANGELOG: new [Unreleased] -> Added entry covering the otel feature flag, the runtime env vars (OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_SERVICE_NAME, OTEL_RESOURCE_ATTRIBUTES), inbound traceparent extraction, the shutdown drain, the sampling default ("send everything; configure your collector"), and the documented out-of- scope items (outbound traceparent injection, HTTP/protobuf protocol, sampler config). CLAUDE.md: new "OpenTelemetry/OTLP export" subsection inside Observability. Explains: - Build with cargo build --features otel - Activate via OTEL_EXPORTER_OTLP_ENDPOINT - The init_otel rule (must call from main tokio runtime, not from nested runtimes; valid sites are start_server and main) - The init_otel-before-init_subscriber ordering requirement - The sampling default and follow-up - The protocol limitation (gRPC only) - Outbound traceparent is a follow-up Also commits a small rustfmt aftermath in server.rs (the get_text_map_propagator closure formatting). CI: cargo build --features otel as a required job is queued as a follow-up issue rather than included here -- the workflow edit is straightforward but blocked by a security-check hook today. Made-with: Cursor * fix(runtime): correct OTEL_SERVICE_NAME precedence + clean nits Showstopper from the rust-expert final review: The original code unconditionally added service.name = "forge" via Resource::builder().with_attribute(...). But Resource::with_attribute internally calls Resource::merge, which gives the new attribute priority over the existing resource. The existing resource at that point already has whatever EnvResourceDetector and SdkProvidedResource- Detector resolved from OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES. So the unconditional "forge" was OVERWRITING the operator's OTEL_SERVICE_NAME -- the OPPOSITE of the spec-mandated precedence. Fix: check whether OTEL_SERVICE_NAME is set OR OTEL_RESOURCE_ATTRIBUTES contains "service.name=" before adding the fallback. If either is set, trust the SDK detectors. If neither, add the "forge" fallback. Also addresses three nits from the same review: - Remove unused `use opentelemetry::trace::TracerProvider;` in init_otel (the call to provider.tracer(...) lives in init_subscriber, not init_otel). Eliminates a build warning. - `let _ = span.set_parent(parent_cx);` in server.rs's make_span_with. set_parent returns Result<(), SetParentError> in tracing-opentelemetry 0.32; ignoring it without binding triggers unused_must_use. Document inline that the only failure case is "no OTel layer installed," which is the expected case when init_otel was a no-op. - Drop the unused `rt-tokio` feature from opentelemetry_sdk in Cargo.toml. The default with_batch_exporter uses the dedicated-thread variant of BatchSpanProcessor, not the async-runtime one. Saves compile time and binary size on the otel path. - CHANGELOG: re-word the CI claim from "CI runs cargo build --features otel" to "Planned: a CI step running ... (queued as a follow-up)". Don't promise what doesn't exist. cargo build --features otel: succeeds, warnings down from 9 to 6 (eliminated unused_imports + unused_must_use). cargo test --lib --features otel: 1489 pass. cargo test --test server_concurrency: 3 pass. cargo fmt --check: clean. Made-with: Cursor --------- Co-authored-by: Archith <archith.rapaka@atomtickets.com>

Archith added 6 commits April 22, 2026 17:09

style: rustfmt + improved layer-order comment

fcf81ac

Made-with: Cursor

Archith added 2 commits April 22, 2026 17:38

test(server): widen ratio gate to 3.5x for slow CI runners (ubuntu ef…

faaecc7

…fectively 2-core) Made-with: Cursor

humancto merged commit 7bb4e3f into main Apr 22, 2026
5 of 9 checks passed

humancto deleted the feat/request-id-middleware branch April 22, 2026 16:01

humancto mentioned this pull request Apr 22, 2026

feat(runtime): OpenTelemetry/OTLP exporter behind otel feature (closes #122) #129

Merged

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(server): per-request X-Request-Id middleware (closes #123)#124

feat(server): per-request X-Request-Id middleware (closes #123)#124
humancto merged 8 commits into
mainfrom
feat/request-id-middleware

humancto commented Apr 22, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

humancto commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Empirical impact

What landed

Plan-and-review trail

Behavior change

What this PR is NOT promising

Test plan

Files

Follow-ups (filed during final review)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

humancto commented Apr 22, 2026 •

edited

Loading