perf: v1.1.0-beta.9 — 3.3M H2 rps, context leak fix, zero-alloc frame fast path by FumingPower3925 · Pull Request #108 · goceleris/celeris

FumingPower3925 · 2026-03-25T07:05:10Z

Summary

Profile-driven optimization loop targeting HTTP/2 throughput and io_uring parity with epoll. Two performance commits plus one infrastructure commit.

Commit 1: Fix H2 context leak + 9 optimizations (+190% io_uring H2)

Fix H2 context pool leak: acquireContext cached contexts on ephemeral H2 streams, causing them to leak (never returned to sync.Pool). This was 87% of all H2 allocations. Fix: only cache on H1 streams (persistent keep-alive connections). H2 inline handlers use InlineCachedCtx instead.
io_uring: reorder H2 drain before dirty list (reduces pipeline stalls)
Async H2 adapter: use manual HPACK content-length encoding (~140ns/req)
H1: fused cachedStatus200Date block (one append for status + date)
io_uring: conditional immediate Submit (skip when CQEs already visible)
Pre-allocate Stream.Headers capacity for H2 (avoid first-use allocation)
Persistent HPACK emit function per Processor (eliminate per-request closure)
Reorder IsCancelled after canRunInline (avoid atomic on hot inline path)
Remove redundant nil check in acquireContext
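
The context leak fix above can be sketched roughly as follows. This is a minimal illustration of the "only cache on persistent streams" rule, not the code from this PR: `Context`, `Stream`, the `persistent` field, `ctxPool`, and `releaseContext` are hypothetical stand-ins, and only the name `acquireContext` comes from the description.

```go
package main

import (
	"fmt"
	"sync"
)

// Context is a hypothetical stand-in for the per-request context.
type Context struct {
	Path string
}

// Stream is a hypothetical stream record. persistent is true for H1
// keep-alive connections and false for short-lived H2 streams.
type Stream struct {
	persistent bool
	cachedCtx  *Context
}

var ctxPool = sync.Pool{New: func() any { return new(Context) }}

// acquireContext reuses the context cached on a persistent stream, or
// takes one from the pool. It caches only on persistent streams, so an
// ephemeral stream never pins a context.
func acquireContext(s *Stream) *Context {
	if s.persistent && s.cachedCtx != nil {
		return s.cachedCtx
	}
	c := ctxPool.Get().(*Context)
	if s.persistent {
		s.cachedCtx = c
	}
	return c
}

// releaseContext resets the context and returns it to the pool unless
// the stream keeps it cached for the next request.
func releaseContext(s *Stream, c *Context) {
	*c = Context{}
	if s.persistent && s.cachedCtx == c {
		return
	}
	ctxPool.Put(c)
}

func main() {
	h1 := &Stream{persistent: true}
	a := acquireContext(h1)
	releaseContext(h1, a)
	fmt.Println("H1 reuses cached context:", acquireContext(h1) == a)

	h2 := &Stream{}
	b := acquireContext(h2)
	releaseContext(h2, b)
	fmt.Println("H2 stream holds no context:", h2.cachedCtx == nil)
}
```

In this sketch the leak corresponds to caching unconditionally: an H2 stream would keep its context in `cachedCtx`, the release path would skip `Put`, and the context would be dropped with the stream instead of being reused.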

Commit 2: Zero-alloc HEADERS fast path (+12% io_uring H2)

Bypass x/net framer for common HEADERS frames (END_HEADERS set, no PADDED/PRIORITY, not during CONTINUATION). The framer's *HeadersFrame allocation was 98% of remaining H2 allocations (2.5 GB/20s at 3M rps). The fast path reads the 9-byte frame header directly, extracts streamID/flags/payload, and passes them to ProcessRawHeaders — all RFC 7540 validations preserved.
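
For illustration, the eligibility check for the fast path might look like the sketch below. It follows the 9-byte frame header layout from RFC 7540 section 4.1 (24-bit length, 8-bit type, 8-bit flags, 31-bit stream ID) and the HEADERS flag values from section 6.2. The function name `tryFastHeaders` and its signature are assumptions made for this example; the actual `ProcessH2` and `ProcessRawHeaders` code is not shown here, and further validation (frame size limits, stream state) is assumed to happen downstream.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	frameHeaderLen   = 9
	frameTypeHeaders = 0x1
	flagEndHeaders   = 0x4
	flagPadded       = 0x8
	flagPriority     = 0x20
)

// tryFastHeaders inspects buf without allocating. It reports ok=true only
// for a complete HEADERS frame with END_HEADERS set, no PADDED or PRIORITY
// flag, a non-zero stream ID, and no CONTINUATION sequence in progress.
// Anything else should fall through to the regular framer.
func tryFastHeaders(buf []byte, inContinuation bool) (streamID uint32, flags byte, payload []byte, ok bool) {
	if inContinuation || len(buf) < frameHeaderLen {
		return 0, 0, nil, false
	}
	length := int(buf[0])<<16 | int(buf[1])<<8 | int(buf[2])
	typ := buf[3]
	flags = buf[4]
	// The top bit of the stream ID field is reserved and must be masked off.
	streamID = binary.BigEndian.Uint32(buf[5:9]) & 0x7fffffff

	if typ != frameTypeHeaders || streamID == 0 {
		return 0, 0, nil, false
	}
	if flags&flagEndHeaders == 0 || flags&(flagPadded|flagPriority) != 0 {
		return 0, 0, nil, false
	}
	if len(buf) < frameHeaderLen+length {
		return 0, 0, nil, false // frame not fully received yet
	}
	// The payload is a sub-slice of buf, so no copy is made.
	return streamID, flags, buf[frameHeaderLen : frameHeaderLen+length], true
}

func main() {
	// HEADERS, END_STREAM|END_HEADERS, stream 1, 3-byte payload.
	frame := []byte{0, 0, 3, 0x1, 0x5, 0, 0, 0, 1, 'a', 'b', 'c'}
	id, fl, p, ok := tryFastHeaders(frame, false)
	fmt.Println(id, fl, string(p), ok)
}
```

Returning a sub-slice of the receive buffer is what keeps this path allocation-free; the caller has to finish with the payload before the buffer is reused.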

Infrastructure: Three-way benchmark comparison

CloudBenchmarkSplit now builds 3 binaries (main, HEAD savepoint, current working tree) and runs a 9-pass interleaved schedule for three-way comparison. Enables tracking both total improvement (vs main) and incremental improvement (vs last commit).
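
As a rough sketch of what an interleaved schedule could look like, the example below rotates through the three binaries so that each one gets three of the nine passes. The exact ordering used by CloudBenchmarkSplit is not described in this PR, so the round-robin order and the `interleaved` helper are assumptions made for illustration only.

```go
package main

import "fmt"

// interleaved returns a run order of the given length that cycles through
// the binaries round-robin, so slow drift on the host (thermal state,
// noisy neighbours) is spread across all of them instead of one.
func interleaved(binaries []string, passes int) []string {
	if len(binaries) == 0 || passes <= 0 {
		return nil
	}
	order := make([]string, 0, passes)
	for i := 0; i < passes; i++ {
		order = append(order, binaries[i%len(binaries)])
	}
	return order
}

func main() {
	order := interleaved([]string{"main", "savepoint", "current"}, 9)
	for i, name := range order {
		fmt.Printf("pass %d: %s\n", i+1, name)
	}
}
```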

Cloud Benchmark Results (arm64 c7g.2xlarge, split server/client)

Config	Before beta.9	After beta.9	Delta
io_uring H2	930K rps	3.30M rps	+255%
epoll H2	2.37M rps	3.33M rps	+41%
io_uring vs epoll H2 gap	-56 to -69%	±1%	Eliminated
H2/H1 ratio	160-400%	570-577%	H2 is 5.7x H1
H1 (all engines)	~580K rps	~590K rps	Stable
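
The Delta column is the relative change from the "before" to the "after" figure. A small helper (the name `pctDelta` is only for this example) recomputes the rounded deltas from the table's own numbers:

```go
package main

import "fmt"

// pctDelta returns the relative change from before to after, in percent.
func pctDelta(before, after float64) float64 {
	return (after - before) / before * 100
}

func main() {
	fmt.Printf("io_uring H2: %+.0f%%\n", pctDelta(930e3, 3.30e6))
	fmt.Printf("epoll H2:    %+.0f%%\n", pctDelta(2.37e6, 3.33e6))
	fmt.Printf("H1:          %+.1f%%\n", pctDelta(580e3, 590e3))
}
```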

Profile Analysis (post-optimization)

H1: 81% kernel syscall-bound, 0 allocations on hot path, fully optimized
H2: 18% kernel + 27% HPACK decode + 27% handler/response + 13% sync.Pool + 9% stream mgmt
H2 allocations: 18 KB total over 20s (down from 2.5 GB before zero-alloc fast path)

Test plan

mage fullCompliance — all 4 phases pass (unit tests with race detector, 9 fuzz targets, h1spec + h2spec 142/146, conformance matrix 9 engine×protocol combos, integration tests)
mage cloudBenchmarkSplit — 9-pass interleaved A/B/C on arm64 c7g.2xlarge
mage cloudProfileSplit — CPU + allocation profiles on 4 configs
h2spec: 146/146 on io_uring and epoll (4 known std engine failures unchanged)
Zero H1 regression across all engine×objective combinations

Tag EC2 instances with Project=celeris-mage and KeyPair=<run-key> so instances from different runs/branches/projects are distinguishable. Add cleanup scope logging to make it clear which resources are being terminated. Safety audit confirms: all instance termination uses explicit IDs tracked from launch — no tag-based discovery or bulk operations that could affect other workloads sharing the same AWS account.

…_uring H2)

9 profile-driven optimizations targeting H2 allocation hotspots and io_uring pipeline stalls:

1. io_uring: reorder H2 drain before dirty list (reduces pipeline stalls)
2. Async H2 adapter: manual HPACK content-length (~140ns/req savings)
3. H1: fused status 200 + date cached block (one append vs two)
4. acquireContext: remove redundant nil check
5. io_uring: conditional immediate Submit (skip when CQEs ready)
6. Fix H2 context pool leak: only cache Context on H1 streams. H2 streams are ephemeral, so caching caused contexts to leak (never returned to pool). This was 87% of H2 allocations.
7. Pre-allocate Stream.Headers capacity for H2 (avoid first-use alloc)
8. Persistent HPACK emit function per Processor (eliminate per-request closure allocation in header decode)
9. Reorder IsCancelled after canRunInline (avoid atomic on hot inline path)

Infrastructure: three-way benchmark comparison (main vs savepoint vs current) in CloudBenchmarkSplit for incremental optimization tracking.

Cloud benchmark results (arm64 c7g.2xlarge, split server/client):

io_uring H2: 930K → 2.73M rps (+190-244%)
epoll H2: 2.37M → 2.95M rps (+23-27%)
H1: ~580K → ~578K rps (stable, within noise)
io_uring vs epoll H2 gap: -56-69% → -7.2-8.1%
H2/H1 ratio: 470-510% (H2 is now 5x faster than H1)

…ng H2)

Add a zero-allocation HEADERS frame fast path in ProcessH2 that bypasses the x/net/http2 framer for the common case: HEADERS frames with END_HEADERS set, no PADDED, no PRIORITY, not during CONTINUATION.

The x/net framer allocates a *HeadersFrame struct per ReadFrame call, which was 98% of remaining H2 allocations (2.5 GB/20s at 3M rps). The fast path reads the 9-byte frame header directly from the recv buffer, extracts streamID/flags/payload, and passes them to a new ProcessRawHeaders method on the Processor that performs all RFC 7540 validations without allocating intermediate structs.

Complex frames (PADDED, PRIORITY, CONTINUATION, non-HEADERS types) fall through to the existing x/net framer path unchanged.

Cloud benchmark results (arm64 c7g.2xlarge, split server/client):

io_uring H2: 2.90M → 3.24M rps (+11.5-12.3%)
epoll H2: 3.09M → 3.26M rps (+4.7-5.6%)
H1: ~566K → ~564K rps (stable)
io_uring vs epoll H2 gap: ELIMINATED (within ±0.6%)
H2/H1 ratio: 570-577% (H2 is ~5.7x faster than H1)
h2spec: 146/146 on io_uring and epoll (no new failures)

- Add highlights section with headline numbers (3.3M H2 rps, 590K H1 rps)
- Update benchmarks to cloud results from arm64 c7g.2xlarge
- Add multishot recv, zero-alloc HEADERS, inline H2 handlers to feature matrix
- Update methodology (wrk + h2load, 9-pass interleaved)
- Update SECURITY.md: only >= 1.1.0 is supported

FumingPower3925 added 5 commits March 24, 2026 17:19

style: fix gofmt alignment in processor.go and stream.go

9ee532c

FumingPower3925 self-assigned this Mar 25, 2026

FumingPower3925 merged commit e17d32f into main Mar 25, 2026
10 checks passed

FumingPower3925 deleted the perf/beta9 branch March 25, 2026 07:13
