perf: v1.1.0-beta.9 — 3.3M H2 rps, context leak fix, zero-alloc frame fast path#108

Merged
FumingPower3925 merged 5 commits into main from perf/beta9
Mar 25, 2026

@FumingPower3925
Contributor

Summary

Profile-driven optimization loop targeting HTTP/2 throughput and io_uring parity with epoll. Two performance commits plus one infrastructure commit.

Commit 1: Fix H2 context leak + 9 optimizations (+190% io_uring H2)

  • Fix H2 context pool leak: acquireContext cached contexts on ephemeral H2 streams, causing them to leak (never returned to sync.Pool). This was 87% of all H2 allocations. Fix: only cache on H1 streams (persistent keep-alive connections). H2 inline handlers use InlineCachedCtx instead.
  • io_uring: reorder H2 drain before dirty list (reduces pipeline stalls)
  • Async H2 adapter: use manual HPACK content-length encoding (~140ns/req)
  • H1: fused cachedStatus200Date block (one append for status + date)
  • io_uring: conditional immediate Submit (skip when CQEs already visible)
  • Pre-allocate Stream.Headers capacity for H2 (avoid first-use allocation)
  • Persistent HPACK emit function per Processor (eliminate per-request closure)
  • Reorder IsCancelled after canRunInline (avoid atomic on hot inline path)
  • Remove redundant nil check in acquireContext
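
The context leak fix can be illustrated with a minimal sketch. Only `acquireContext` and `sync.Pool` come from this PR; the `Stream` and `Context` types, the `persistent` and `cachedCtx` fields, and `releaseContext` are hypothetical stand-ins, so treat this as an illustration of the pattern rather than the actual implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// Context is a hypothetical stand-in for the per-request context.
type Context struct {
	stream *Stream
}

// Stream is a hypothetical stand-in. persistent marks an H1 keep-alive
// stream that is reused across requests; H2 streams leave it false.
type Stream struct {
	persistent bool
	cachedCtx  *Context
}

var ctxPool = sync.Pool{New: func() any { return new(Context) }}

// acquireContext reuses a cached context only on persistent (H1) streams.
// Ephemeral (H2) streams always borrow from the pool, so releaseContext
// can hand the context back when the request ends.
func acquireContext(s *Stream) *Context {
	if s.persistent && s.cachedCtx != nil {
		return s.cachedCtx
	}
	c := ctxPool.Get().(*Context)
	c.stream = s
	if s.persistent {
		s.cachedCtx = c // lives as long as the connection
	}
	return c
}

// releaseContext returns pooled contexts to the pool and leaves cached ones
// attached to their stream. It reports whether the context was pooled.
func releaseContext(c *Context) bool {
	if c.stream.cachedCtx == c {
		return false
	}
	c.stream = nil
	ctxPool.Put(c)
	return true
}

func main() {
	h1 := &Stream{persistent: true}
	h2 := &Stream{}

	a := acquireContext(h1)
	fmt.Println("H1 context returned to pool:", releaseContext(a))
	fmt.Println("H1 context reused:", acquireContext(h1) == a)
	fmt.Println("H2 context returned to pool:", releaseContext(acquireContext(h2)))
}
```

In this sketch, caching on an ephemeral stream would make `releaseContext` skip the `Put`, which is the kind of leak the first bullet describes.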

Commit 2: Zero-alloc HEADERS fast path (+12% io_uring H2)

Bypass x/net framer for common HEADERS frames (END_HEADERS set, no PADDED/PRIORITY, not during CONTINUATION). The framer's *HeadersFrame allocation was 98% of remaining H2 allocations (2.5 GB/20s at 3M rps). The fast path reads the 9-byte frame header directly, extracts streamID/flags/payload, and passes them to ProcessRawHeaders — all RFC 7540 validations preserved.
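
A rough sketch of the eligibility check described above, based on the RFC 7540 frame layout (24-bit length, 8-bit type, 8-bit flags, one reserved bit, 31-bit stream ID). The function name `tryFastHeaders` and the `rawHeaders` type are hypothetical; the real code hands the extracted fields to `ProcessRawHeaders`, whose signature is not shown in this PR, and protocol validations (stream ID rules, frame size limits) are assumed to happen there.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	frameHeaderLen = 9
	typeHeaders    = 0x1  // RFC 7540 section 6.2
	flagEndHeaders = 0x4  // END_HEADERS
	flagPadded     = 0x8  // PADDED
	flagPriority   = 0x20 // PRIORITY
)

// rawHeaders is a hypothetical view into the receive buffer.
// payload aliases buf, so nothing is copied or allocated.
type rawHeaders struct {
	streamID uint32
	flags    byte
	payload  []byte
}

// tryFastHeaders reports whether buf starts with a complete HEADERS frame
// that qualifies for the fast path. On success it returns the parsed view
// and the number of bytes consumed; otherwise the caller falls back to the
// general framer.
func tryFastHeaders(buf []byte, inContinuation bool) (rawHeaders, int, bool) {
	if inContinuation || len(buf) < frameHeaderLen {
		return rawHeaders{}, 0, false
	}
	length := int(buf[0])<<16 | int(buf[1])<<8 | int(buf[2])
	typ, flags := buf[3], buf[4]
	if typ != typeHeaders ||
		flags&flagEndHeaders == 0 ||
		flags&(flagPadded|flagPriority) != 0 {
		return rawHeaders{}, 0, false
	}
	total := frameHeaderLen + length
	if len(buf) < total {
		return rawHeaders{}, 0, false // frame not fully received yet
	}
	// Mask off the reserved high bit of the stream identifier.
	streamID := binary.BigEndian.Uint32(buf[5:9]) & 0x7fffffff
	return rawHeaders{streamID, flags, buf[frameHeaderLen:total]}, total, true
}

func main() {
	// HEADERS frame: 3-byte payload, END_STREAM|END_HEADERS, stream 1.
	frame := []byte{0, 0, 3, 0x1, 0x5, 0, 0, 0, 1, 0x82, 0x84, 0x87}
	h, n, ok := tryFastHeaders(frame, false)
	fmt.Println(ok, n, h.streamID, len(h.payload))
}
```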

Infrastructure: Three-way benchmark comparison

CloudBenchmarkSplit now builds 3 binaries (main, HEAD savepoint, current working tree) and runs a 9-pass interleaved schedule for three-way comparison. Enables tracking both total improvement (vs main) and incremental improvement (vs last commit).
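
One way to picture a 9-pass interleaved schedule over three binaries is three rounds with a rotated start, so each binary runs once in every position. The exact ordering used by CloudBenchmarkSplit is not stated here; the rotation below and the name `interleavedSchedule` are assumptions for illustration only.

```go
package main

import "fmt"

// interleavedSchedule returns rounds*len(binaries) passes. Each round runs
// every binary once, and the starting binary rotates between rounds so that
// warm-up and drift effects are spread evenly across the binaries.
func interleavedSchedule(binaries []string, rounds int) []string {
	n := len(binaries)
	out := make([]string, 0, n*rounds)
	for r := 0; r < rounds; r++ {
		for i := 0; i < n; i++ {
			out = append(out, binaries[(r+i)%n])
		}
	}
	return out
}

func main() {
	// Labels follow the PR text: main, HEAD savepoint, current working tree.
	for i, b := range interleavedSchedule([]string{"main", "savepoint", "current"}, 3) {
		fmt.Printf("pass %d: %s\n", i+1, b)
	}
}
```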

Cloud Benchmark Results (arm64 c7g.2xlarge, split server/client)

| Config | Before beta.9 | After beta.9 | Delta |
|---|---|---|---|
| io_uring H2 | 930K rps | 3.30M rps | +255% |
| epoll H2 | 2.37M rps | 3.33M rps | +41% |
| io_uring vs epoll H2 gap | -56 to -69% | ±1% | Eliminated |
| H2/H1 ratio | 160-400% | 570-577% | H2 is 5.7x H1 |
| H1 (all engines) | ~580K rps | ~590K rps | Stable |
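
The Delta column for the two H2 rows follows from the usual relative-change formula, (after - before) / before. A small check using the rounded figures from the table (the helper name `pctDelta` is only illustrative):

```go
package main

import "fmt"

// pctDelta returns the relative change from before to after, in percent.
func pctDelta(before, after float64) float64 {
	return (after - before) / before * 100
}

func main() {
	// Throughput in millions of requests per second, as listed in the table.
	fmt.Printf("io_uring H2: %+.0f%%\n", pctDelta(0.93, 3.30))
	fmt.Printf("epoll H2:    %+.0f%%\n", pctDelta(2.37, 3.33))
}
```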

Profile Analysis (post-optimization)

  • H1: 81% kernel syscall-bound, 0 allocations on hot path, fully optimized
  • H2: 18% kernel + 27% HPACK decode + 27% handler/response + 13% sync.Pool + 9% stream mgmt
  • H2 allocations: 18 KB total over 20s (down from 2.5 GB before zero-alloc fast path)

Test plan

  • mage fullCompliance — all 4 phases pass (unit tests with race detector, 9 fuzz targets, h1spec + h2spec 142/146, conformance matrix 9 engine×protocol combos, integration tests)
  • mage cloudBenchmarkSplit — 9-pass interleaved A/B/C on arm64 c7g.2xlarge
  • mage cloudProfileSplit — CPU + allocation profiles on 4 configs
  • h2spec: 146/146 on io_uring and epoll (4 known std engine failures unchanged)
  • Zero H1 regression across all engine×objective combinations

Tag EC2 instances with Project=celeris-mage and KeyPair=<run-key> so
instances from different runs/branches/projects are distinguishable.
Add cleanup scope logging to make it clear which resources are being
terminated.

Safety audit confirms: all instance termination uses explicit IDs
tracked from launch — no tag-based discovery or bulk operations that
could affect other workloads sharing the same AWS account.
…_uring H2)

9 profile-driven optimizations targeting H2 allocation hotspots and
io_uring pipeline stalls:

1. io_uring: reorder H2 drain before dirty list (reduces pipeline stalls)
2. Async H2 adapter: manual HPACK content-length (~140ns/req savings)
3. H1: fused status 200 + date cached block (one append vs two)
4. acquireContext: remove redundant nil check
5. io_uring: conditional immediate Submit (skip when CQEs ready)
6. Fix H2 context pool leak: only cache Context on H1 streams.
   H2 streams are ephemeral — caching caused contexts to leak (never
   returned to pool). This was 87% of H2 allocations.
7. Pre-allocate Stream.Headers capacity for H2 (avoid first-use alloc)
8. Persistent HPACK emit function per Processor (eliminate per-request
   closure allocation in header decode)
9. Reorder IsCancelled after canRunInline (avoid atomic on hot inline path)
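
Item 3 (the fused status and date block) can be sketched as follows. The name `cachedStatus200Date` appears in the PR description; how the block is refreshed and stored is not shown, so the atomic pointer, the once-per-second refresh, and the helper names here are assumptions.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// httpDateLayout is the HTTP date format (same layout as net/http.TimeFormat).
const httpDateLayout = "Mon, 02 Jan 2006 15:04:05 GMT"

// cachedStatus200Date holds "HTTP/1.1 200 OK\r\nDate: <date>\r\n" as one
// prebuilt block. It must be refreshed at least once before use.
var cachedStatus200Date atomic.Pointer[[]byte]

// refreshStatus200Date rebuilds the block for the given Unix second.
// A server would call this from a once-per-second ticker, off the hot path.
func refreshStatus200Date(unixSec int64) {
	b := make([]byte, 0, 64)
	b = append(b, "HTTP/1.1 200 OK\r\nDate: "...)
	b = time.Unix(unixSec, 0).UTC().AppendFormat(b, httpDateLayout)
	b = append(b, "\r\n"...)
	cachedStatus200Date.Store(&b)
}

// appendStatus200Date writes the status line and Date header with a single
// append instead of one append for each.
func appendStatus200Date(dst []byte) []byte {
	return append(dst, *cachedStatus200Date.Load()...)
}

func main() {
	refreshStatus200Date(time.Now().Unix())
	resp := appendStatus200Date(make([]byte, 0, 256))
	resp = append(resp, "Content-Length: 0\r\n\r\n"...)
	fmt.Printf("%q\n", resp)
}
```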

Infrastructure: three-way benchmark comparison (main vs savepoint vs current)
in CloudBenchmarkSplit for incremental optimization tracking.

Cloud benchmark results (arm64 c7g.2xlarge, split server/client):

  io_uring H2:  930K → 2.73M rps  (+190-244%)
  epoll H2:    2.37M → 2.95M rps  (+23-27%)
  H1:           ~580K → ~578K rps  (stable, within noise)

  io_uring vs epoll H2 gap: -56-69% → -7.2-8.1%
  H2/H1 ratio: 470-510% (H2 is now 5x faster than H1)
…ng H2)

Add a zero-allocation HEADERS frame fast path in ProcessH2 that bypasses
the x/net/http2 framer for the common case: HEADERS frames with
END_HEADERS set, no PADDED, no PRIORITY, not during CONTINUATION.

The x/net framer allocates a *HeadersFrame struct per ReadFrame call,
which was 98% of remaining H2 allocations (2.5 GB/20s at 3M rps).
The fast path reads the 9-byte frame header directly from the recv
buffer, extracts streamID/flags/payload, and passes them to a new
ProcessRawHeaders method on the Processor that performs all RFC 7540
validations without allocating intermediate structs.

Complex frames (PADDED, PRIORITY, CONTINUATION, non-HEADERS types)
fall through to the existing x/net framer path unchanged.

Cloud benchmark results (arm64 c7g.2xlarge, split server/client):

  io_uring H2:  2.90M → 3.24M rps  (+11.5-12.3%)
  epoll H2:     3.09M → 3.26M rps  (+4.7-5.6%)
  H1:            ~566K → ~564K rps  (stable)

  io_uring vs epoll H2 gap: ELIMINATED (within ±0.6%)
  H2/H1 ratio: 570-577% (H2 is ~5.7x faster than H1)

h2spec: 146/146 on io_uring and epoll (no new failures)
- Add highlights section with headline numbers (3.3M H2 rps, 590K H1 rps)
- Update benchmarks to cloud results from arm64 c7g.2xlarge
- Add multishot recv, zero-alloc HEADERS, inline H2 handlers to feature matrix
- Update methodology (wrk + h2load, 9-pass interleaved)
- Update SECURITY.md: only >= 1.1.0 is supported
@FumingPower3925 FumingPower3925 self-assigned this Mar 25, 2026
@FumingPower3925 FumingPower3925 merged commit e17d32f into main Mar 25, 2026
10 checks passed
@FumingPower3925 FumingPower3925 deleted the perf/beta9 branch March 25, 2026 07:13