feat: v1.4.0 — native PostgreSQL + Redis drivers, H2C upgrade, EventLoopProvider #236
FumingPower3925 merged 90 commits into main
Foundational interfaces that allow database/cache drivers to share the HTTP server's event-loop workers. Drivers register file descriptors on a specific worker via RegisterConn and receive callbacks on data arrival. Includes ErrQueueFull, ErrUnknownFD, and ErrSwitchingNotFrozen sentinel errors. No implementation in this commit — just the contract. Closes #113
Implements EventLoopProvider on the epoll engine. Adds a per-loop driverConns map (fd-indexed, parallel to the HTTP conn table) gated by a hasDriverConns atomic flag — the HTTP hot path pays a single atomic load when no drivers are registered.
- RegisterConn/UnregisterConn/Write with one-write-in-flight serialization (mirrors the PR #36 send-queue fix).
- EPOLLIN edge-triggered, EPOLLOUT level-triggered to avoid missed wakeups under write contention.
- TOCTOU-safe fd-collision check under driverMu.
- shutdownDrivers fires onClose on engine teardown.
Closes #114
Implements EventLoopProvider on the io_uring engine via new CQE user-data tags (udDriverRecv/Send/Close at 0x10–0x12, non-overlapping with existing HTTP tags).
- Driver actions (Register/Unregister/Write) post to a worker-owned action queue and wake via the shared h2EventFD; only the worker goroutine submits SQEs (preserves SQ single-issuer).
- One SEND in-flight per FD (mirrors the PR #36 invariant) using a dedicated driverConn.sending flag + writeBuf/sendBuf swap.
- Single-shot RECV per driverConn (no provided-buffer ring); avoids conflict with HTTP's multishot path.
- Inflight-op counter guards UnregisterConn's ASYNC_CANCEL CQE ordering against in-flight RECV/SEND completions — prevents use-after-free on dc.buf.
- shutdownDrivers fires onClose on engine teardown.
Closes #115
Adaptive engine implements EventLoopProvider by delegating to the active sub-engine. WorkerLoop panics if FreezeSwitching is not held (driver FDs cannot migrate between epoll/io_uring tables). Exposes ErrSwitchingNotFrozen for drivers that attempt to register without first freezing the engine switch.
Minimal event loop used by drivers when no celeris Server is registered. Linux uses a stripped-down epoll worker (same primitives as engine/epoll but no accept/HTTP parsing); non-Linux falls back to goroutine-per-conn via net.FileConn.
- WriteAndPoll sync fast path: the caller goroutine does a direct write(2) + 3-phase read (spin → poll(0) → poll(1ms blocking)) to avoid goroutine-hop latency for localhost DB/cache round trips. recvMu serializes with the event loop's onRecv callback.
- WriteAndPollMulti for pipelined protocols (e.g. Redis Pipeline) — single write, poll-drain until isDone.
- EPOLLOUT level-triggered re-arms on EAGAIN (slow-consumer backpressure).
- registry.go exposes Resolve(ServerProvider) with a refcounted package-level standalone Loop; returns the Server's provider if registered, else the standalone fallback.
The io_uring standalone path is present but not selected by default: its SINGLE_ISSUER constraint conflicts with the sync fast path's caller-goroutine reads. Kept as dead code for a future follow-up (#232 area).
Closes #116
Server.EventLoopProvider() returns the active engine's provider (epoll, io_uring, or adaptive), or nil for engines that don't implement the interface (std). Drivers use this to route DB I/O through the HTTP server's worker event loops for per-CPU affinity.
H1 clients can now upgrade an HTTP/1.1 connection to HTTP/2 over cleartext via Connection: Upgrade, HTTP2-Settings + Upgrade: h2c.
- protocol/h1: the parser detects the three-token Upgrade handshake. Rejects ambiguous Upgrade values (e.g. "websocket, h2c") to disambiguate from the WebSocket path.
- resource.Config.EnableH2Upgrade *bool with protocol-dependent defaults (Auto→true, H2C/HTTP1→false) propagated into H1State.
- internal/conn/upgrade.go: UpgradeInfo + ErrUpgradeH2C sentinel + DecodeHTTP2Settings (RawURL + URL fallback).
- ProcessH1 writes 101 Switching Protocols and returns ErrUpgradeH2C without invoking the handler (the handler runs later on H2 stream 1).
- protocol/h2/stream: Manager.ApplySetting exposed; Processor gains InjectStreamHeaders, which opens stream 1 from the H1 headers without an HPACK round trip.
- NewH2StateFromUpgrade constructs the H2State post-101: applies client SETTINGS from HTTP2-Settings, emits the server preface, injects stream 1 with H1→H2 pseudo-headers, dispatches the handler. Includes the rewritten header-copy path that forces strings out of the H1 recv buffer (prevents use-after-free when the driver layer reuses the buffer).
Closes #117 #118 #119 #120
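The three-token detection can be illustrated with a standalone check; the helper name and header plumbing are hypothetical (the real parser works on raw bytes), but the rules match the commit: Connection must carry both the Upgrade and HTTP2-Settings tokens, Upgrade must be exactly "h2c", and the HTTP2-Settings header must be present:

```go
package main

import (
	"fmt"
	"strings"
)

// wantsH2CUpgrade is an illustrative sketch of the handshake check.
// An ambiguous Upgrade list like "websocket, h2c" is rejected so the
// WebSocket path stays unambiguous.
func wantsH2CUpgrade(connection, upgrade, http2Settings string) bool {
	if strings.TrimSpace(upgrade) != "h2c" { // rejects "websocket, h2c"
		return false
	}
	if http2Settings == "" { // the base64url SETTINGS payload must be present
		return false
	}
	hasUpgrade, hasSettings := false, false
	for _, tok := range strings.Split(connection, ",") {
		switch strings.ToLower(strings.TrimSpace(tok)) {
		case "upgrade":
			hasUpgrade = true
		case "http2-settings":
			hasSettings = true
		}
	}
	return hasUpgrade && hasSettings
}

func main() {
	fmt.Println(wantsH2CUpgrade("Upgrade, HTTP2-Settings", "h2c", "AAMAAABkAARAAAAA"))       // true
	fmt.Println(wantsH2CUpgrade("Upgrade, HTTP2-Settings", "websocket, h2c", "AAMAAABkAARAAAAA")) // false
}
```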
Both the epoll and io_uring engines detect ErrUpgradeH2C from ProcessH1 and switch the connection from H1 to H2 state:
- Release H1State, construct H2State via NewH2StateFromUpgrade.
- Feed UpgradeInfo.Remaining through ProcessH2 synchronously (the H2 client preface may arrive in the same TCP segment as the H1 Upgrade request).
- Flush writes explicitly after switchToH2 so the 101 response + server preface + stream-1 reply reach the client promptly.
test/spec/h2c_upgrade_test.go adds integration coverage across the iouring + epoll engines: happy path, POST with body, subsequent streams 3/5/7, config variations, invalid settings, missing Connection token, preface-in-same-segment, preface-split-across-reads, 1 MB body.
Closes #121 #122
Internal primitives shared by the PostgreSQL and Redis drivers:
- Bridge: lock-guarded FIFO ring buffer of pending requests, power-of-two capacity, O(1) enqueue/pop. Both the PG and Redis wire protocols guarantee in-order responses on one connection, so a single ring suffices.
- Pool[C]: generic worker-affinity connection pool. Per-worker idle lists (lock-free fast path), a shared overflow pool, and a semaphore-based wait queue (matching database/sql.DB SetMaxOpenConns semantics). Acquire blocks with a ctx deadline instead of the old immediate ErrPoolExhausted.
- Backoff: exponential with jitter (shared by PG reconnect and Redis PubSub reconnect).
- Health sweep: ticker-driven eviction of expired / idle-too-long connections.
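The power-of-two ring idea can be sketched in a few lines: because capacity is a power of two, head and tail are monotonically increasing counters indexed with a bitmask instead of a modulo, and both operations are O(1). This is a minimal unguarded sketch (the real Bridge is lock-guarded and surfaces ErrQueueFull; a single ring per conn works precisely because responses come back in request order):

```go
package main

import "fmt"

// ring is a minimal FIFO backed by a power-of-two slice; head/tail grow
// monotonically and wrap via `& (len-1)` instead of `%`.
type ring[T any] struct {
	buf        []T
	head, tail int
}

func newRing[T any](capPow2 int) *ring[T] { return &ring[T]{buf: make([]T, capPow2)} }

func (r *ring[T]) len() int { return r.tail - r.head }

// enqueue returns false when full — the real Bridge surfaces ErrQueueFull here.
func (r *ring[T]) enqueue(v T) bool {
	if r.len() == len(r.buf) {
		return false
	}
	r.buf[r.tail&(len(r.buf)-1)] = v
	r.tail++
	return true
}

func (r *ring[T]) pop() (T, bool) {
	var zero T
	if r.head == r.tail {
		return zero, false
	}
	v := r.buf[r.head&(len(r.buf)-1)]
	r.head++
	return v, true
}

func main() {
	r := newRing[int](4)
	for i := 1; i <= 4; i++ {
		r.enqueue(i)
	}
	fmt.Println(r.enqueue(5)) // false: full
	a, _ := r.pop()
	b, _ := r.pop()
	fmt.Println(a, b) // FIFO order: 1 2
}
```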
PostgreSQL v3 frontend/backend protocol:
- message.go: zero-alloc Reader/Writer for the 1-byte type + 4-byte length frame. StartupMessage/CancelRequest/SSLRequest variants (no type byte).
- startup.go + scram.go: connection handshake with Trust, Cleartext, MD5, and SCRAM-SHA-256 (PBKDF2 via stdlib crypto/pbkdf2; RFC 7677 test vectors pass). GSS/SSPI/Kerberos explicitly rejected.
- query.go: Simple Query 'Q' flow. PGError parsing (severity, SQLSTATE, message, detail, hint, position). Defers tag string materialization — RowsAffected is zero-alloc.
- extended.go: Parse/Bind/Describe/Execute/Sync/Close. Append-style message builders (AppendParse, AppendBind, ...) write into the Writer buffer with no per-message snapshot. Supports SkipParse for reusing named prepared statements.
- copy.go: CopyInState/CopyOutState + binary header/trailer + text-format row encoder (with escape handling).
- types.go + types_time.go + types_numeric.go + types_array.go: OID codec registry. Built-ins cover bool, int2/4/8, float4/8, text/varchar, bytea, uuid, jsonb, date, timestamp(tz), numeric, and common array types. Infinity sentinels for date/timestamp (binary and text). Floor correction for pre-epoch dates. Zero-alloc decode into pgRows' per-request slab.
Fuzz tests for Reader + types; seed corpus under testdata/fuzz.
Closes #123 #124 #125 #126 #127 #128
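The framing is simple enough to sketch: a regular v3 message is one type byte followed by a big-endian int32 length that counts itself plus the payload (but not the type byte). A clarity-first sketch (the real Reader/Writer are zero-alloc; helper names here are illustrative):

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// appendMessage frames a PostgreSQL v3 message: 1-byte type, then a
// big-endian int32 length covering itself + payload (type byte excluded).
func appendMessage(dst []byte, typ byte, payload []byte) []byte {
	dst = append(dst, typ)
	dst = binary.BigEndian.AppendUint32(dst, uint32(4+len(payload)))
	return append(dst, payload...)
}

var errIncomplete = errors.New("pgproto: incomplete frame")

// readMessage parses one frame, returning type, payload, and the remaining
// bytes. A short buffer yields errIncomplete so the caller can read more.
func readMessage(buf []byte) (typ byte, payload, rest []byte, err error) {
	if len(buf) < 5 {
		return 0, nil, buf, errIncomplete
	}
	n := binary.BigEndian.Uint32(buf[1:5])
	if len(buf) < 1+int(n) {
		return 0, nil, buf, errIncomplete // split read: retry after more data
	}
	return buf[0], buf[5 : 1+n], buf[1+n:], nil
}

func main() {
	// Simple Query 'Q': SQL text + NUL terminator.
	frame := appendMessage(nil, 'Q', append([]byte("SELECT 1"), 0))
	typ, payload, rest, err := readMessage(frame)
	fmt.Println(string(typ), string(payload[:len(payload)-1]), len(rest), err)
}
```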
…rows
Full PostgreSQL driver on top of the v3 protocol layer.
- driver.go + connector.go + dsn.go: sql.Driver registered as "celeris-postgres". DSN supports URL + key=value forms. sslmode=require returns ErrSSLNotSupported with an actionable message pointing at the v1.5.0 TLS spike (#232).
- conn.go: pgConn implements driver.Conn + extended interfaces (ConnBeginTx, ConnPrepareContext, QueryerContext, ExecerContext, Pinger, SessionResetter, Validator, async.Conn). Sync→async bridge: the handler goroutine encodes + writes + blocks on doneCh; the event loop parses the response and signals completion. WriteAndPoll sync fast path eliminates context switches for localhost queries. Re-prepare-on-miss (SQLSTATE 26000) after DISCARD ALL.
- stmt.go + rows.go + result.go + tx.go: database/sql facades.
- cancel.go: PG CancelRequest via a separate short-lived TCP conn with a bounded 5s timeout (independent of the caller ctx).
- lru.go: per-conn prepared-statement cache.
- pool.go: direct Pool API (postgres.Open + WithEngine) bypassing database/sql. Rows.Next() + Scan(...any) matches sql.Rows. QueryRow + Row.Scan with sql.ErrNoRows. Tx with savepoints. CopyFrom/CopyTo with the text-format row encoder. Lazy streaming rows: buffer up to 64 rows, promote to a bounded channel (cap 64) for larger result sets — no OOM on million-row queries, no channel-alloc cost on single-row queries.
- Public types.go exposes *Conn (an alias for *pgConn) so sql.Conn.Raw users can reach Savepoint/ReleaseSavepoint/RollbackTo.
convertAssign supports sql.Scanner (sql.NullString/NullInt64/pgtype.Inet and custom types), typed primitives, and NULL. Pool.Result implements driver.Result (LastInsertId returns an error pointing at RETURNING). Rows.Err() tracks iteration errors. A sessionDirty flag elides DISCARD ALL on ResetSession when the conn only ran simple queries — avoids one round trip per pool return on the hot path.
Closes #129 #130
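The per-conn prepared-statement cache from lru.go can be sketched as a classic map + doubly-linked-list LRU keyed by SQL text. Everything below is an illustrative shape, not the actual implementation; the eviction return value models the point where the real driver would send a wire-level Close for the evicted statement name:

```go
package main

import (
	"container/list"
	"fmt"
)

// stmtLRU caches SQL text → server-side statement name, evicting the least
// recently used entry at capacity.
type stmtLRU struct {
	cap   int
	order *list.List               // front = most recently used
	items map[string]*list.Element // sql → element holding *entry
}

type entry struct{ sql, name string }

func newStmtLRU(cap int) *stmtLRU {
	return &stmtLRU{cap: cap, order: list.New(), items: map[string]*list.Element{}}
}

// get returns the cached statement name and bumps recency.
func (c *stmtLRU) get(sql string) (string, bool) {
	el, ok := c.items[sql]
	if !ok {
		return "", false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).name, true
}

// put caches sql→name and returns the evicted statement name, if any,
// so the caller can send a Close('S') for it on the wire.
func (c *stmtLRU) put(sql, name string) (evicted string) {
	if el, ok := c.items[sql]; ok {
		c.order.MoveToFront(el)
		el.Value.(*entry).name = name
		return ""
	}
	if c.order.Len() == c.cap {
		old := c.order.Remove(c.order.Back()).(*entry)
		delete(c.items, old.sql)
		evicted = old.name
	}
	c.items[sql] = c.order.PushFront(&entry{sql, name})
	return evicted
}

func main() {
	c := newStmtLRU(2)
	c.put("SELECT 1", "s1")
	c.put("SELECT 2", "s2")
	c.get("SELECT 1")                    // bump s1
	fmt.Println(c.put("SELECT 3", "s3")) // evicts s2 (least recently used)
	_, ok := c.get("SELECT 2")
	fmt.Println(ok) // false
}
```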
Zero-allocation RESP parser supporting both RESP2 and the RESP3 types (null _, bool #, double ,, bigint (, blob-error !, verbatim =, set ~, map %, attribute |, push >).
- Value struct with typed fields (Str aliases Reader.buf for bulk/simple/verbatim/blob-err). Array/Map use a sync.Pool-backed slice pool.
- Reader.Next returns ErrIncomplete on partial frames without advancing the cursor — safe to Feed more and retry.
- MaxBulkLen (512 MiB) + MaxArrayLen (128 M) guards prevent DoS from malicious servers advertising huge lengths.
- parseUint overflow check; parseInt accepts math.MinInt64.
- Writer.AppendCommand1..5 arity-specific builders avoid the variadic slice allocation that dominated pipeline alloc profiles (was 32% of Pipeline1K allocs).
- FuzzReader + FuzzRoundTrip seeded with hand-crafted RESP2/3 frames.
Closes #132
Redis client API on top of the RESP parser. Typed commands for all common operations — strings (Get/Set/Incr/Decr/Append/...), hashes (HGet/HSet/HIncrBy/...), lists (LPush/LPop/LRange/LRem/...), sets (SAdd/SMembers/SInter/SUnion/SDiff/...), sorted sets (ZAdd/ZRange/ZRank/ZScore/ZIncrBy/...), keys (Expire/TTL/Type/...), scripting (Eval/EvalSHA/ScriptLoad), scan iterator (SCAN).
- RedisState event-loop state machine + ProcessRedis(data). FIFO request/response matching via async.Bridge.
- HELLO 3 negotiation with RESP2 fallback (AUTH + SELECT). WithForceRESP2() escape hatch for ElastiCache-classic-shaped servers that advertise 6.x but reject HELLO. Releases the HELLO request on fallback (fixes a pool leak on Redis <6.0).
- WriteAndPoll sync fast path: single-command round trips bypass the event loop for localhost latency.
- Client.Do/DoString/DoInt/DoBool/DoSlice escape hatch for commands outside the typed surface.
- OnPush callback for RESP3 client-tracking push frames received on command connections (otherwise silently dropped).
- resetSession elides DISCARD when !dirty (avoids a round trip on the hot path).
- Context cancellation closes the connection — the pending response arriving on a poisoned conn is drained via drainWithError, preventing desync with the next command issued on a fresh conn.
- Expire(ttl=0) calls Persist (was: silent clamp to 1s).
- WithHealthCheckInterval wires async.Pool health sweeps.
- Nil-safe Client.Close.
Closes #133
- Pipeline: single write for all buffered commands, FIFO response matching via async.Bridge. Typed deferred cmd handles (StringCmd/IntCmd/StatusCmd/FloatCmd/BoolCmd) that resolve via (Pipeline, idx) — pipeline-owned so the struct survives slice growth. Release() returns the Pipeline to a sync.Pool; typed cmd handles become invalid (return ErrClosed — orphan guard).
- Sync pipeline fast path: direct read/parse on the caller goroutine, zero per-response allocation when the result set fits in one TCP chunk. Slab-based copy-detach for the string payloads. Dropped Pipeline1K from 1320 → 3 allocs.
- Tx (TxPipeline): MULTI … EXEC variant on a pinned conn with Watch/Unwatch support, ErrTxAborted on null EXEC.
- Backpressure-aware context cancellation populates per-cmd errors before closing the conn (was: zero Values + no indication of cancellation). maxSlabRetain shrinks oversized slabs on Release to bound memory from one-off large pipelines.
Closes #134
- PubSub pins a dedicated connection from the pubsubPool (command connections cannot be reused — push mode).
- Subscribe/PSubscribe/Unsubscribe/PUnsubscribe with a mu-serialized subscription set as the source of truth.
- Auto-reconnect on conn drop via the onConnDrop hook → runs a reconnectLoop goroutine with async.Backoff (50ms → 5s, jittered), replays the subscription set with a single pipelined SUBSCRIBE + PSUBSCRIBE. Messages during the outage are lost (at-most-once, documented).
- deliver() + closeMsgCh() serialize on ps.mu (no send-on-closed-channel panic).
- Nil-safe conn reference during reconnect — subscribe failures set ps.conn = nil before closing.
Closes #135
Wraps async.Pool[*redisConn] with Redis-specific dial and health checks. Separate cmd and pubsub pools (push-mode conns cannot be reused for commands).
- WithEngine(ServerProvider) resolves to the HTTP server's EventLoopProvider (integrated mode) or falls back to the standalone mini event loop.
- HealthCheckInterval defaults to 30s (tunable); MaxOpen/MaxIdlePerWorker/MaxLifetime/MaxIdleTime follow database/sql-style semantics.
- Bounded 5-retry acquire on stale-conn hits (previously unbounded recursion).
- Pool error messages include MaxOpen context on exhaustion.
Closes #136
ClusterClient:
- CRC16 (XMODEM) slot computation with {tag} hash tag support.
- [16384]*clusterNode O(1) slot → node routing; background
CLUSTER SLOTS refresh every 60s + on-demand after MOVED.
- MOVED/ASK redirect handling (max 3 retries). ASK sends
ASKING on a pinned conn (via pinnedConnKey context) so the
next command lands on the same connection — fixes a subtle
bug where a pooled ASKING + pooled command could hit
different conns.
- Multi-key commands (DEL/EXISTS) fan out per-node sub-calls
in parallel.
- ClusterPipeline groups commands by slot, executes per-node
sub-pipelines in parallel, retries MOVED/ASK affected
commands on refresh.
- ClusterTx with same-slot validation (ErrCrossSlot on
mismatch; hash tags colocate keys). ClusterClient.Watch
with the same guard.
- ReadOnly mode: reads routed to replicas (round-robin) with
READONLY handshake; falls back to primary on replica failure.
- RouteByLatency: per-node RTT measured each refresh; picks
lowest-latency node for reads.
- Shard channels (Redis 7+): SSubscribe/SPublish +
smessage/ssubscribe/sunsubscribe recognition in the state
machine + shard-aware reconnect replay.
SentinelClient:
- Master discovery via SENTINEL get-master-addr-by-name
+ ROLE verification.
- Auto-failover: subscription to +switch-master on a
sentinel conn; atomic primary swap under RWMutex.
- dialMaster retries 3× with backoff on failover; marks
client unhealthy (ErrSentinelUnhealthy) if all retries fail
instead of silently reusing the stale master.
Closes #233 #234 #235
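The slot computation above follows the Redis Cluster specification: HASH_SLOT = CRC16(key) mod 16384 using CRC-16/XMODEM, and if the key contains a non-empty substring between the first '{' and the next '}', only that substring is hashed — which is how hash tags colocate the keys of a ClusterTx on one slot. A self-contained sketch (function names are illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// crc16 implements CRC-16/XMODEM (poly 0x1021, init 0x0000), the checksum
// Redis Cluster specifies for key→slot mapping.
func crc16(data []byte) uint16 {
	var crc uint16
	for _, b := range data {
		crc ^= uint16(b) << 8
		for i := 0; i < 8; i++ {
			if crc&0x8000 != 0 {
				crc = crc<<1 ^ 0x1021
			} else {
				crc <<= 1
			}
		}
	}
	return crc
}

// hashSlot applies the {tag} rule: hash only the first non-empty substring
// between '{' and '}' if one exists, else the whole key.
func hashSlot(key string) uint16 {
	if open := strings.IndexByte(key, '{'); open >= 0 {
		if n := strings.IndexByte(key[open+1:], '}'); n > 0 {
			key = key[open+1 : open+1+n]
		}
	}
	return crc16([]byte(key)) % 16384
}

func main() {
	fmt.Printf("%#x\n", crc16([]byte("123456789"))) // 0x31c3: the XMODEM check value
	// Same tag → same slot, so DEL/MULTI over both keys stays single-node.
	fmt.Println(hashSlot("{user1000}.following") == hashSlot("{user1000}.followers")) // true
}
```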
Example* functions for Client.NewClient + Get/Set/Pipeline/Subscribe + the Do escape hatch + TxPipeline. Examples follow the godoc convention and render under the package "Examples" tab.
Build-tagged //go:build postgres, env-gated by CELERIS_PG_DSN. Docker Compose spins up postgres:16. Covers auth (MD5, SCRAM-SHA-256, Trust), simple + extended query, type round-trips (all built-in OIDs + arrays), transactions + isolation levels, savepoints, COPY IN/OUT, error handling, cancel, pool affinity, concurrency (1000 goroutines × 100 queries), and large result sets. Closes #131 (conformance portion)
Build-tagged //go:build redis, env-gated by CELERIS_REDIS_ADDR (+ optional CELERIS_REDIS_PASSWORD for AUTH). Docker Compose spins up redis:7.2. Covers all data structures (strings, hashes, lists, sets, sorted sets, keys), pipelines (incl. mid-stream failure), pub/sub (patterns + unsubscribe + reconnect via CLIENT KILL TYPE pubsub), transactions + Watch, pool affinity + overflow + idle cleanup, RESP2 vs RESP3 HELLO negotiation, AUTH variants. Closes #137 (conformance portion)
Modeled after h2spec / Autobahn|Testsuite: an external spec verifier that speaks the PG wire protocol directly (raw TCP + driver/postgres/protocol) without going through database/sql. 51 tests organized by spec section:
- Startup: version negotiation, SSLRequest, CancelRequest, malformed startup handling.
- Auth: SCRAM-SHA-256 full handshake + bad-password failure.
- Simple Query: SELECT/INSERT/Error/MultiStatement/NULL/LargeResult (100K rows)/TransactionStatus byte.
- Extended Query: Parse/Bind/Describe/Execute/Sync/Close, error-during-Bind recovery, portal suspension.
- COPY: text-in, binary-in, out, fail, wrong-format, large-out (10K rows).
- Error handling: all PGError fields, NoticeResponse, RFQ recovery after error.
- Wire framing: zero-length payload, split reads, back-to-back messages.
- Type round-trips: all built-ins + NULL + arrays + 1 MB values + infinity sentinels for date and timestamp.
- Lifecycle: Terminate, idle timeout, Cancel.
Invoked via `mage pgSpec` (gated by CELERIS_PG_DSN).
Modeled after pgspec / h2spec: an external spec verifier that speaks RESP directly (raw TCP + driver/redis/protocol). 62 tests + 4 fuzz targets organized by spec section:
- RESP2 types: all 5 base types + null/empty edge cases.
- RESP3 types: all 11 new types (null, bool, double, bigint, blob error, verbatim, set, map, attribute, push).
- Command protocol: inline, multi-bulk, pipelines (incl. mid-stream failure), max args, large bulks, unknown command.
- AUTH + SELECT variants.
- Pub/Sub: subscribe/message/pattern/unsubscribe/multi-channel/PING-in-pubsub/non-sub-command-during-sub/RESP3 push format.
- Transactions: MULTI/EXEC/DISCARD/empty EXEC/queued-error/EXECABORT/Watch/Unwatch.
- Wire edge cases: split reads, 10K PING back-to-back, binary keys with NUL/CRLF, max bulk size, concurrent conns, CLIENT SETNAME round-trip, attribute-prefixed reply, integer overflow, push-on-cmd-conn.
- Fuzz: FuzzRESPParse, FuzzRESPRoundTrip, FuzzRESP3Types, FuzzBulkBoundary.
Invoked via `mage redisSpec` (gated by CELERIS_REDIS_ADDR).
Darwin-runnable test that proves the headline v1.4.0 architecture:
- Starts a celeris Server (std engine, std-engine-compatible).
- Spins up in-process fake PG and Redis servers.
- Registers handlers at /db and /cache that open a driver Pool with WithEngine(server) and run a query on each request.
- Issues real HTTP requests and verifies responses.
The std engine doesn't implement EventLoopProvider, so the drivers fall back to the standalone loop (the documented path). On Linux with epoll/iouring the WithEngine call picks up the engine's native provider — same code path, real per-worker affinity.
Separate go.mod submodules (replace celeris ../../..) so competitor libs don't pollute the main module's dependencies.
- test/drivercmp/postgres/ — mirrored benchmarks for celeris vs pgx vs lib/pq across SelectOne (sql.DB + direct Pool), Select1000 rows (text + binary), InsertPrepared, Transaction, PoolContention, ParallelQuery, CopyIn_1M_Rows, and integrated net/http handler latency (celeris Server + pgxpool + net/http).
- test/drivercmp/redis/ — celeris vs go-redis: Get, Set, MGet10, Pipeline10/100/1000/10000, Parallel GET, PubSub1to1 latency.
Each benchmark reports ns/op + B/op + allocs/op; the goal is pgx/go-redis parity on single commands and significant wins on parallel and pipeline paths. Results captured in the PR body.
mage_driver.go exposes:
- TestIntegration — go test ./test/integration/…
- H2CCompliance — H2C upgrade integration tests
- TestDriver {postgres|redis} — conformance suite (env-gated)
- PGSpec / RedisSpec — protocol compliance suites
- BaselineBench {eventloop|h2c|postgres|redis} — snapshot bench
to results/<ts>-baseline-<subsys>/ with env.json + bench.txt
- DriverProfile {driver} {bench} — runs CPU/heap/mutex/block
pprof capture + top.txt
- DriverBench {postgres|redis} — full comparator suite
- PreBench — correctness gate (lint+test+spec+integration+
h2c+testDriver) before any profile/bench work
.github/workflows/ci.yml gains a driver-conformance job with
postgres:16 and redis:7.2 service containers running the
conformance suites with -tags postgres/redis gates.
…uccessful reply
The per-slot sub-pipeline used the direct-extraction fast path where successful
results are stored in pc.str + pc.scalar with pc.direct=true and pc.val left
nil. The harvester only read from pc.val, so every successful cluster
pipeline GET returned a zero protocol.Value{}.
- Preserve the original command kind when forwarding to the sub-pipeline
instead of forcing every command to kindString.
- Harvest the direct path into a synthesized protocol.Value via new
directToValue helper. String bytes are copied out of pc.str because it
aliases p.strSlab which the deferred p.Release() reclaims.
- Handle all kinds (string, status, int, float, bool) so INCR/SET/etc.
return correct typed values from cluster pipelines too.
Adds TestClusterPipelineReturnsValues which fails without the fix
(results[0].Str is empty) and passes after.
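The copy-detach requirement in the fix — pc.str aliases p.strSlab, which the deferred Release() hands back for reuse — comes down to byte-slice aliasing. A toy reproduction (the slab type and helper names are illustrative, not the driver's):

```go
package main

import "fmt"

// slab hands out sub-slices that ALIAS its backing array, so a result kept
// past slab reuse silently changes underneath the caller. The fix's
// copy-detach (modeled by detach) avoids that.
type slab struct{ buf []byte }

// place writes v into the slab and returns an aliasing sub-slice.
func (s *slab) place(v string) []byte {
	s.buf = append(s.buf[:0], v...)
	return s.buf[:len(v)]
}

// detach copies the bytes out so the result survives slab reuse.
func detach(b []byte) []byte { return append([]byte(nil), b...) }

func main() {
	s := &slab{buf: make([]byte, 0, 16)}

	aliased := s.place("hello")
	copied := detach(aliased)

	s.place("WORLD") // simulates Release() + slab reuse by the next pipeline

	fmt.Println(string(aliased)) // WORLD — corrupted: reads the reused slab
	fmt.Println(string(copied))  // hello — safe: detached copy
}
```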
…ne conn
The per-command ASK recovery path in ClusterPipeline.execRound did two separate n.client.Do() calls — one for ASKING and one for the original command. Each n.client.Do acquires its own conn from the pool, so the ASKING and the command could land on different TCP connections. Redis requires both to arrive on the SAME conn; otherwise the migrating slot's owner responds with MOVED and the pipeline fails.
Fix mirrors the existing single-command doASK pattern (cluster.go:534): acquire a conn, run ASKING on it, propagate the conn into the follow-up Do via pinnedConnKey, then release.
Adds TestClusterPipelineASKPinsConn — it models real Redis's strict ASKING semantics in the fake server (per-conn ASKING flag; GET returns MOVED if the flag isn't set on that conn). A race goroutine triggered immediately after ASKING replies hammers the pool with concurrent PINGs, reliably stealing the released conn before the buggy code can re-acquire it. The test fails without the fix (MOVED error) and passes with it.
…TURNING
simpleExec and doExtendedExec do not allocate req.colsCh — Exec paths
only need the CommandComplete tag and never stream. However, the shared
dispatch handlers (reqSimple, reqExtended) called promoteToStreaming
unconditionally once len(head.rows) crossed streamThreshold (64), and
promoteToStreaming ends with close(req.colsCh). Closing a nil channel
panicked on the event-loop worker goroutine.
Real trigger: Exec("INSERT INTO t SELECT ... RETURNING id") where the
SELECT produces >= 64 rows. PG streams DataRows, the Exec caller never
reads them, and the panic kills the reader loop.
Fix:
- Dispatch sites now guard on head.colsCh != nil before promoting — Exec
paths (colsCh == nil) skip the promotion entirely. The buffered rows
are simply dropped when the request finishes (the caller only reads
the tag via simple.TagBytes / extended.Tag).
- Add a defensive nil-guard inside promoteToStreaming as defense in
depth, so a future caller can't hit the same trap.
Adds TestPgExecReturningLargeResult: fake PG server replies with 128
DataRows + CommandComplete for a simpleExec. Without the fix the test
panics ("close of nil channel") in promoteToStreaming. With the fix the
test cleanly returns RowsAffected=128.
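The trap and the guard are easy to demonstrate in isolation: closing a nil channel always panics, and on an event-loop worker that panic kills the whole reader loop. A minimal sketch (promote models the guarded dispatch-site check; the signature is illustrative):

```go
package main

import "fmt"

// promote mirrors the fixed dispatch-site shape: skip promotion entirely
// when no streaming channel was ever allocated (the Exec path).
func promote(colsCh chan []string, rows [][]string) (promoted bool) {
	if colsCh == nil {
		return false // Exec path: colsCh was never allocated
	}
	for _, r := range rows {
		colsCh <- r
	}
	close(colsCh)
	return true
}

// closeNil shows the unguarded failure mode: close(nil) panics.
func closeNil() (panicked bool) {
	defer func() { panicked = recover() != nil }()
	var ch chan []string // nil, as on the Exec path
	close(ch)            // panics: "close of nil channel"
	return false
}

func main() {
	fmt.Println(closeNil())        // true: the unguarded path panics
	fmt.Println(promote(nil, nil)) // false: the guard skips promotion
	ch := make(chan []string, 1)
	fmt.Println(promote(ch, [][]string{{"id"}})) // true: streaming path unaffected
}
```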
Four error branches in the driver's dialConn implementations called syscall.Close(fd) while the *os.File wrapper returned from tcp.File() was still reachable with its runtime finalizer armed. When GC later fires the finalizer, it calls syscall.Close(fd) a SECOND time — potentially on an unrelated fd the kernel has already reassigned to another open socket. Classic phantom-close bug.
Affected sites (replaced syscall.Close(fd) with file.Close(), which closes the kernel fd AND disarms the finalizer — the same pattern already used correctly on the SetNonblock error branch just above each):
- driver/postgres/conn.go:568 — NumWorkers == 0 branch
- driver/postgres/conn.go:610 — RegisterConn failure branch
- driver/redis/conn.go:156 — NumWorkers == 0 branch
- driver/redis/conn.go:186 — RegisterConn failure branch
Adds two regression tests (TestPgDialConnNoPhantomCloseOnError and TestRedisDialConnNoPhantomCloseOnError) that force the NumWorkers == 0 branch with a stub Provider and run repeated GC afterward. While finalizer timing makes this impossible to assert deterministically, the tests at least pin the correct error path and document the intent. The primary safeguard remains code inspection + comments at each site — directly testing a finalizer-driven double close would require fd-recycle injection.
When the memcached client is opened without WithEngine(srv), skip the
mini-loop entirely. Each conn keeps the live *net.TCPConn and does
Write + Read directly on the caller's goroutine via Go's netpoll
(which parks the G on EPOLLIN transparently).
The mini-loop path is a net loss for standalone request/response
workloads:
Profile of nethttp + celerismc (mini-loop path) at 74k rps:
WriteAndPoll 36.4s cum (33% CPU)
├─ flushLocked 16.6s (15% — the actual write syscall)
├─ Phase B polls 6.1s (5.4% — poll(0) x 16)
├─ EpollCtl MOD 2.7s (2.4% — mask + unmask per op)
├─ Phase A read 1.5s (1.3%)
└─ recvMu/overhead ~9s (~8%)
The last ~11s (~10% of total CPU) is overhead the mini-loop adds on
top of what gomc pays. gomc just calls net.Conn.Write/Read; Go's
netpoll handles EPOLLIN transparently, no per-op syscalls beyond
write+read. Direct mode matches that shape exactly.
Matrix on MS-R1 (MC cells only), matrix 12 → matrix 13:
nethttp + celerismc 74,689 → 87,556 (+17.2%, gomc: 86,282 — win)
gin + celerismc 73,977 → 88,707 (+19.9%, gomc: 87,589 — win)
chi + celerismc 71,352 → 87,420 (+22.5%, gomc: 83,664 — win)
echo + celerismc 73,960 → 88,447 (+19.6%, gomc: 86,602 — win)
celerismc now beats gomemcache on every foreign HTTP server tested.
p99 latency also improves — ~10ms → ~6ms per row.
Engine-integrated path (WithEngine supplied) unchanged: the mini-loop
is still used so DB conns colocate with the celeris HTTP engine's
LockOSThread'd worker. Two modes coexist behind the same mcConn
struct (useDirect discriminates); existing tests and the engine-
integrated bench cells are unaffected.
Implementation:
- NewClient without WithEngine → newDirectPool instead of newPool
- dialDirectMemcachedConn skips eventloop.Resolve, keeps *net.TCPConn
- execText / execBinary / execBinaryMulti branch on useDirect
- Close branches on useDirect
Two post-commit CI failures on 0c4b6d8:
1. Lint errcheck: `defer c.tcp.SetDeadline(time.Time{})` in execTextDirect / execBinaryDirect / execBinaryMultiDirect didn't check the returned error. Wrap in `defer func() { _ = ... }()`.
2. TestWriteAndPollSyncPath flaked again under -race on ubuntu CI. The previous fix (a synchronous peer write before WriteAndPoll) wasn't enough — on loaded runners the written bytes haven't always surfaced on our end of the socketpair by the time Phase A's single non-blocking read runs, so Phase A misses and the test times out in Phase C. Add a brief poll(50ms) as a deterministic hand-off signal (independent of timer resolution / scheduler jitter) before calling WriteAndPoll — it asserts the buffer is actually readable, then runs the exact same WriteAndPoll test.
An earlier attempt to always route through direct net.TCPConn mode (even under WithEngine) regressed the celeris-engine + celerismc cell catastrophically: 65k → 34k rps (-48%).
Root cause: the handler runs on the celeris HTTP engine's LockOSThread'd worker G. When that handler calls net.TCPConn.Read (direct mode), Go's netpoll parks the G on EPOLLIN. Parking a locked G triggers stoplockedm + startlockedm — the same futex-storm pathology that WriteAndPollBusy was introduced to avoid in the first place.
Revert the blanket switch: mc uses direct mode in standalone only (cfg.Engine == nil) and falls back to mini-loop + WriteAndPollBusy when WithEngine is supplied. The big +17–21% wins on the foreign-HTTP cells (matrix 13) are preserved; the celeris-engine cells return to their matrix-13 numbers (~64–69k). The shared-event-loop promise of WithEngine(srv) is honored for mc by colocating conns on the engine's worker via the mini-loop sync path, which is futex-safe for locked callers.
…ration gap
Validated on MSR1 bare metal, celeris-epoll + celerismc:
Matrix 16 baseline (inline handler + mini-loop): 64,147 rps
CELERIS_ASYNC_HANDLERS=1 (async + mini-loop): 53,877 rps (-16%)
CELERIS_ASYNC_HANDLERS=1 + CELERIS_MC_FORCE_DIRECT=1: 105,267 rps (+64%)
Context:
The celeris HTTP engine's workers are runtime.LockOSThread'd (for
SINGLE_ISSUER on io_uring, CPU affinity on epoll). Handlers run inline
on those locked worker goroutines. When a handler blocks on DB I/O:
- Inline + mini-loop: handler does unix.Poll on the locked M. P is
detached during the syscall but no unlocked Gs exist to use it, so
the P sits idle. Other FDs on this worker wait until handler
returns. Measured throughput: 64k rps (NumWorkers × 1/RTT bound).
- Async + mini-loop: handler runs on a spawned unlocked G. It blocks
in unix.Poll which still ties up an M (Go can't park G on a bare
syscall). Go spawns more Ms, context-switch overhead eats the
parallelism benefit. Regression to 54k.
- Async + direct (net.Conn.Read): handler on unlocked G reads via
Go's netpoll. netpoll parks the G efficiently — no M is blocked,
no new Ms spawned. The worker is free to service other FDs while
the G waits for EPOLLIN. Throughput jumps to 105k — BEATING every
other config in the matrix (foreign HTTP + celerismc: 88k; foreign
HTTP + gomc: 87k).
This commit lands the two env-gated knobs that demonstrated the effect:
- CELERIS_ASYNC_HANDLERS=1 on the epoll engine: dispatches HTTP1
handlers to goroutines, serialized per-conn via detachMu. Worker
returns to epoll_wait immediately after dispatching. Non-async
path unchanged — zero overhead when the flag is off.
- CELERIS_MC_FORCE_DIRECT=1 on the memcached driver: uses the direct
net.Conn path even when WithEngine(srv) is supplied. Safe only on
async-engine path (direct on a locked M would futex-storm via
netpoll's G parking).
Both are experimental and NOT production-ready:
- Error / close paths are best-effort
- No support for HTTP/2 handlers (only HTTP1 dispatched)
- H1State mutations race with concurrent dispatches (single-
connection serial clients are OK; pipelined requests are not)
- CELERIS_MC_FORCE_DIRECT without CELERIS_ASYNC_HANDLERS regresses
celeris-engine cells (direct on locked M = futex storm)
A proper implementation in v1.4.x:
- Config.AsyncHandlers as a first-class Server option
- Per-conn input buffer for pipelined requests
- Driver-side signaling so direct mode activates automatically when
the caller is on an async-dispatched G (pprof label or context key)
- Extension to io_uring engine (requires SQE hand-off from handler
goroutine back to the SINGLE_ISSUER worker)
…ch + netpoll I/O
Config.AsyncHandlers is now a first-class Server option (default: false).
When set AND the engine is epoll (or std, which is always async natively),
the engine dispatches HTTP1 handlers to spawned goroutines instead of
running them inline on the LockOSThread'd worker. Drivers opened with
WithEngine(srv) auto-detect this via eventloop.IsAsyncServer() and
switch their I/O path to match the caller's Go-runtime shape:
Caller shape | Driver I/O | Why
-------------------|-------------------------|--------------------------
Inline (locked M) | mini-loop sync/busy | net.Conn.Read on locked M
| | futex-storms via netpoll
Async (unlocked G) | direct net.TCPConn | Go netpoll parks the G
| | cleanly, no M blocked
Standalone (no | mc: direct, redis/pg: | mc direct is faster
engine) | mini-loop | standalone; redis's tiny
| | responses favor mini-loop
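The selection table above can be sketched as a single decision function — names here (`chooseIOPath`, the string results) are illustrative, not the driver's real API:

```go
package main

import "fmt"

// chooseIOPath mirrors the caller-shape table: hasEngine says whether
// the pool was opened WithEngine, asyncEngine whether that engine
// dispatches handlers to unlocked goroutines. "direct" means plain
// net.TCPConn reads via Go netpoll; "mini-loop" means the driver's
// sync/busy-poll loop.
func chooseIOPath(driver string, hasEngine, asyncEngine bool) string {
	switch {
	case hasEngine && asyncEngine:
		// Async dispatch: the caller G is unlocked, netpoll parks it cleanly.
		return "direct"
	case hasEngine:
		// Inline on a LockOSThread'd worker: net.Conn.Read would futex-storm.
		return "mini-loop"
	default:
		// Standalone: memcached is faster direct; redis/pg's tiny
		// responses favor the mini-loop's sync spin.
		if driver == "memcached" {
			return "direct"
		}
		return "mini-loop"
	}
}

func main() {
	fmt.Println(chooseIOPath("redis", true, true))        // async engine
	fmt.Println(chooseIOPath("redis", true, false))       // sync engine
	fmt.Println(chooseIOPath("memcached", false, false))  // standalone
}
```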
Implementation:
* celeris.Config: new AsyncHandlers bool (doc'd with trade-offs).
Propagates through resource.Config into engine bootstrapping.
* celeris.Server.AsyncHandlers(): honors the flag only when the engine
actually implements async dispatch (currently Epoll + Std; iouring and
adaptive return false so drivers don't hand themselves the direct
path and futex-storm on a locked worker).
* engine/epoll: Loop.async bool, set from Config.AsyncHandlers (OR'd
with CELERIS_ASYNC_HANDLERS env var for diagnostic overrides). In
drainRead, when async && HTTP1, copy the read bytes and spawn a
handler goroutine that holds cs.detachMu around ProcessH1 + inline
flush. Worker returns to epoll_wait immediately. Zero overhead on
the non-async path.
* driver/internal/eventloop: new AsyncHandlerProvider interface; new
IsAsyncServer helper that drivers call to detect the dispatch mode.
* driver/memcached: client auto-selects direct mode when Engine==nil
OR IsAsyncServer(Engine) is true.
* driver/redis: client auto-selects direct mode only when
IsAsyncServer(Engine) is true; standalone and sync-engine paths use
mini-loop (redis's tiny GET responses measurably favor mini-loop's
sync spin over net.Conn.Read + netpoll wake). Cmd pool direct;
pubsub always uses mini-loop because unsolicited push frames need
event-driven delivery.
* driver/postgres: pool tracks asyncEngine; useBusySync is disabled on
async-dispatched engines so the handler G can yield via
runtime.Gosched between Phase B polls (cheap on an unlocked G;
futex-storm on a locked M — which is why busy-path exists).
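The AsyncHandlerProvider / IsAsyncServer detection described in the bullets above can be sketched like this; the interface method name follows Server.AsyncHandlers() from the commit, but the wiring is an assumption, not the real eventloop package:

```go
package main

import "fmt"

// AsyncHandlerProvider sketches the detection interface: engines that
// dispatch handlers to spawned (unlocked) goroutines report true.
type AsyncHandlerProvider interface {
	AsyncHandlers() bool
}

// IsAsyncServer returns false for a nil engine or one that does not
// implement the interface, so drivers safely default to the sync path.
func IsAsyncServer(engine any) bool {
	p, ok := engine.(AsyncHandlerProvider)
	return ok && p.AsyncHandlers()
}

// epollEngine stands in for an engine honoring Config.AsyncHandlers.
type epollEngine struct{ async bool }

func (e *epollEngine) AsyncHandlers() bool { return e.async }

// legacyEngine stands in for an engine without async dispatch support.
type legacyEngine struct{}

func main() {
	fmt.Println(IsAsyncServer(&epollEngine{async: true})) // true
	fmt.Println(IsAsyncServer(&legacyEngine{}))           // false
	fmt.Println(IsAsyncServer(nil))                       // false
}
```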
MSR1 bare-metal validation (celeris-epoll with ASYNC=1, full matrix
re-run inflight at commit time; partial early rows):
celeris-epoll + celerisredis ~86k (matrix 16: 82.6k, +4%)
celeris-epoll + goredis ~64k (matrix 16: 24.4k, +164%)
celeris-epoll + celerispg ~52k (matrix 16: 43.7k, +19%)
celeris-epoll + pgx ~46k (matrix 16: 39.4k, +19%)
celeris-epoll + celerismc ~105k (matrix 16: 64.1k, +64%)
celeris-epoll + gomc ~89k (matrix 16: 46.2k, +95%)
celeris-epoll + celerismc at 105k is the single fastest cell in the
entire 36-config matrix — beating nethttp + celerismc (89k) by 19%
and nethttp + gomc (88k) by 20%. celeris-epoll + goredis at 64k is
a 2.6× jump — async+driver-in-netpoll rescues non-celeris drivers too.
iouring and adaptive paths: behavior unchanged (Server.AsyncHandlers()
reports false on those engines even when config is set), matching
matrix 16 numbers exactly so no regressions.
Not yet covered in this commit:
* iouring async dispatch — requires SQE hand-off from handler back to
the SINGLE_ISSUER worker. Tracked as v1.4.x follow-up.
* PG direct mode — PG startup is a multi-round protocol (SCRAM-SHA-
256 challenge, etc.) that needs driver-specific plumbing for
direct mode. PG still uses mini-loop under async but with the
yielding sync path instead of busy-poll, closing the worst of the
pre-fix regression (-14% → +19%).
Goroutine-per-conn dispatch: each HTTP1 conn buffers incoming bytes under asyncInMu and spawns a single dispatch goroutine that drains the buffer, running ProcessH1 under detachMu. ProcessH1's built-in offset loop handles pipelined requests in order; responses land on writeBuf in request order before the flush.

Previously (epoll) spawned a goroutine per read-batch, which let pipelined bursts race on detachMu. Now the per-conn invariant is enforced by asyncRun — only one goroutine alive per conn at a time, and the next batch only spawns after the previous cleared asyncRun under the same mutex that guards the input buffer.

io_uring: async dispatch now works under SINGLE_ISSUER by reusing the existing detachMu + detachQueue + eventfd machinery. Handler-goroutine writes land in writeBuf; after ProcessH1, the FD is queued on detachQueue, the worker picks it up via the eventfd wakeup and submits SEND SQEs from its own thread. closeConn now signals asyncClosed.Store(true) so the dispatch goroutine exits at its next iteration. Server.AsyncHandlers() now returns true for IOUring too.

Cross-cut fixes so detachMu != nil no longer implies "truly detached": writeCap / sendCap / timeout-scan / shutdown / closeConn now gate on h1State.Detached, which is only set when OnDetach fires (WS/SSE). Async-mode conns pre-allocate detachMu in acquireConnState without triggering those branches.

Adds test/integration/pipeline_test.go exercising ordering under both AsyncHandlers=true and false.
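The one-dispatcher-per-conn invariant can be sketched as below — a minimal model, assuming illustrative names (`conn`, `feed`, `dispatch`), where the run flag is cleared under the same mutex that guards the input buffer:

```go
package main

import (
	"fmt"
	"sync"
)

// conn models the per-conn invariant: at most one dispatch goroutine
// alive per connection, guarded by the same mutex as the input buffer.
type conn struct {
	mu       sync.Mutex
	inBuf    []byte
	running  bool
	wg       sync.WaitGroup
	consumed []byte // what the dispatcher processed, in order
}

// feed is the worker side: append bytes, then spawn the dispatcher
// only if one is not already running.
func (c *conn) feed(b []byte) {
	c.mu.Lock()
	c.inBuf = append(c.inBuf, b...)
	if !c.running {
		c.running = true
		c.wg.Add(1)
		go c.dispatch()
	}
	c.mu.Unlock()
}

// dispatch drains the buffer; it clears running under the mutex before
// exiting, so the next feed can safely spawn a successor.
func (c *conn) dispatch() {
	defer c.wg.Done()
	for {
		c.mu.Lock()
		if len(c.inBuf) == 0 {
			c.running = false
			c.mu.Unlock()
			return
		}
		batch := c.inBuf
		c.inBuf = nil
		c.mu.Unlock()
		c.consumed = append(c.consumed, batch...) // stand-in for ProcessH1
	}
}

func main() {
	c := &conn{}
	c.feed([]byte("GET /1 "))
	c.feed([]byte("GET /2 "))
	c.wg.Wait()
	fmt.Printf("%s\n", c.consumed)
}
```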
… engine

Symmetric to memcached/redis direct mode. Standalone pools and pools opened WithEngine on an async engine now dial *net.TCPConn directly and drive reads from the caller goroutine via Go's netpoll — no mini-loop involvement, no LockOSThread, no futex storm.
- writeRaw(data) uniform helper: tcp.Write under directMu or loop.Write via the mini-loop. All 15 c.loop.Write(c.fd, ...) call sites use it.
- driveDirect(ctx, req): tight tcp.Read → onRecv loop until req.doneAtom fires or ctx cancels. Includes a non-blocking MSG_DONTWAIT peek so loopback-fast responses skip the netpoll G-park wakeup.
- waitForQueryRows direct branch: buffers rows (syncMode pinned so dispatch never promotes to streaming, which would deadlock on the caller goroutine).
- dialDirectConn: TCP dial, SetNoDelay, SyscallConn-captured fd for peek, doStartup runs synchronously via the new drive path.
- Close: direct-mode closes via tcp.Close with bounded write deadline; loop-mode path unchanged.
- COPY FROM/TO guarded: ErrDirectModeUnsupported surfaces when called on a direct-mode conn. CopyInResponse / CopyOutResponse require event-loop-driven unsolicited delivery, which has no reader in the direct model.
- Pool.dial routes to dialDirectConn when !hasEngine OR asyncEngine, matching the rule memcached uses today.
One non-blocking syscall.Recvfrom(MSG_DONTWAIT) before each tcp.Read in execDirect / execManyDirect (redis) and execTextDirect / execBinaryDirect / execBinaryMultiDirect (memcached). Loopback-fast responses (10-byte GET, +PONG, small memcached VALUEs) land in the recv buffer before tcp.Read is called; the peek catches them with a single syscall and skips the ~1-2µs netpoll G-park wakeup.

One peek per iteration (not a tight spin) — repeated MSG_DONTWAIT would re-introduce the P-hogging regression that a bounded spin avoids. Fd cached at dial time via SyscallConn().Control.
Post-review follow-up: close remaining integration gaps (W1-W4)
Three additional commits land on top of
What this closes
Verification
…hold cs

Data race detected by -race in TestHTTP1PipeliningAsync/async: worker's closeConn -> releaseConnState was resetting cs fields concurrently with the async dispatch goroutine's asyncInBuf/asyncRun/asyncClosed writes.

Root cause: my earlier split between 'detached' (detachMu != nil) and 'trulyDetached' (h1State.Detached) let releaseConnState run for async-mode conns even though their dispatch goroutine still held a cs reference. Restore the original invariant — any goroutine-holding conn skips the pool return. GC collects cs once the goroutine exits.

CloseH1 gating (trulyDetached) is kept: async-mode conns still own H1 state because no middleware goroutine is holding it open past Detach. Only the release path now uses the broader 'detached' flag.
Earlier W2 commit guarded COPY with ErrDirectModeUnsupported because direct mode has no event-loop goroutine driving onRecv — copyReady / doneCh would never fire. That regressed 5 conformance tests.

Fix: spawn a short-lived reader goroutine (startDirectReader) for the duration of each copy operation. The reader pumps tcp.Read → onRecv with a 50ms read deadline so it periodically checks the stop channel; the caller goroutine remains the sole writer of CopyData frames (tcp.Write concurrent with tcp.Read on another goroutine is safe).

Final wait in copyFrom / copyTo uses select on doneCh/ctx.Done in direct mode instead of c.wait — c.wait's driveDirect would spawn a second concurrent tcp.Read and race the reader goroutine. The background reader also fails the request chain via c.failAll on unexpected EOF, so transport errors surface cleanly through doneCh rather than hanging the caller.
…atch

pprof on msr1 (aarch64, celeris-epoll+celerisredis ASYNC=1) showed 'go l.runAsyncHandler(cs)' at 450ms / 13.82s CPU = 3.3% — every request was re-spawning the dispatch goroutine. On keep-alive load with per-conn request gaps, asyncInBuf would drain, asyncRun went false, the goroutine exited, and the next read spawned a fresh one.

Fix: add sync.Cond so the dispatch goroutine parks on asyncCond.Wait when asyncInBuf is empty rather than exiting. Worker signals after each append. Goroutine lives until closeConn broadcasts via asyncClosed + Cond.Broadcast.

Also double-buffer asyncInBuf/asyncOutBuf so the goroutine's swap on pickup doesn't force the worker to re-allocate on the next append. Drops the dataCopy intermediate (was one heap alloc per request) — the worker now appends cs.buf bytes directly into asyncInBuf, and the goroutine swaps out before ProcessH1 (dropping the cs.buf aliasing risk).

MSR1 matrix impact (aarch64, 12c, 256 conns, ASYNC=1):

| cell                           | before  | after   | delta  |
|--------------------------------|---------|---------|--------|
| celeris-epoll + celerisredis   | 85,134  | 100,322 | +17.8% |
| celeris-iouring + celerisredis | 99,962  | 115,981 | +16.0% |
| celeris-epoll + goredis        | 65,082  | 76,358  | +17.3% |
| celeris-iouring + goredis      | 80,295  | 96,041  | +19.6% |
| celeris-epoll + celerispg      | 55,361  | 61,740  | +11.5% |
| celeris-iouring + celerispg    | 58,745  | 66,411  | +13.1% |
| celeris-epoll + celerismc      | 103,467 | 108,147 | +4.5%  |
| celeris-iouring + celerismc    | 112,865 | 118,668 | +5.1%  |

celeris-epoll + celerisredis: now +9.2% vs nethttp+celerisredis (was -6.1% before). The epoll-async redis gap is closed. New matrix leader: celeris-iouring + celerismc = 118,668 rps (+35% vs nethttp+celerismc = 87,876).

Validated on msr1 (Linux 6.6.10-cix, aarch64): 62/62 packages pass -race, all spec/conformance suites pass (pgspec/redisspec/mcspec + conformance/postgres/redis/memcached + H1 RFC 9112).
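The parked-dispatcher reuse can be sketched with a sync.Cond as below — a minimal model with illustrative names (`dispatcher`, `feed`, `close`), draining any pending input before honoring close:

```go
package main

import (
	"fmt"
	"sync"
)

// dispatcher models the reuse fix: instead of exiting when the input
// buffer drains, the goroutine parks on a sync.Cond and the worker
// signals after each append. One goroutine per conn for its lifetime.
type dispatcher struct {
	mu     sync.Mutex
	cond   *sync.Cond
	inBuf  []byte
	closed bool
	out    []byte // stand-in for processed output
	done   chan struct{}
}

func newDispatcher() *dispatcher {
	d := &dispatcher{done: make(chan struct{})}
	d.cond = sync.NewCond(&d.mu)
	go d.run()
	return d
}

// feed is the worker side: append, then signal the parked goroutine.
func (d *dispatcher) feed(b []byte) {
	d.mu.Lock()
	d.inBuf = append(d.inBuf, b...)
	d.mu.Unlock()
	d.cond.Signal()
}

// close mirrors closeConn: set the flag, broadcast so the parked
// goroutine observes it, then wait for the goroutine to exit.
func (d *dispatcher) close() {
	d.mu.Lock()
	d.closed = true
	d.mu.Unlock()
	d.cond.Broadcast()
	<-d.done
}

func (d *dispatcher) run() {
	defer close(d.done)
	for {
		d.mu.Lock()
		for len(d.inBuf) == 0 && !d.closed {
			d.cond.Wait() // park instead of exiting; no respawn cost
		}
		if len(d.inBuf) == 0 { // closed and fully drained
			d.mu.Unlock()
			return
		}
		batch := d.inBuf
		d.inBuf = nil
		d.mu.Unlock()
		d.out = append(d.out, batch...) // stand-in for ProcessH1
	}
}

func main() {
	d := newDispatcher()
	d.feed([]byte("req1 "))
	d.feed([]byte("req2 "))
	d.close()
	fmt.Printf("%s\n", d.out)
}
```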
doExtendedQuery sent Parse(if first)+Bind+Describe+Execute+Sync for every extended query. For cached prepared statements (autoCache + stmtCache hit), the row description is already known from the initial prepare, so the portal Describe is redundant — the server still returns RowDescription each call, costing 7 bytes on the wire and one protocol-state transition per query.

Pass cached columns through the call chain (QueryContext -> doExtendedQuery), pre-populate req.columns + req.extended.Columns, set HasDescribe=false so the state machine transitions BindComplete -> ExecuteResult directly. The ExtendedQueryState machine already supported HasDescribe=false; no state-machine change needed.

FormatCode fixup: prepare-time Describe returns FormatCode=0 (text) because Postgres doesn't decide the output encoding until Execute receives the resultFormats vector. We pass [FormatBinary] in the Execute, so we shallow-copy the cached ColumnDesc slice and overwrite FormatCode to FormatBinary — keeps decode on the fast binary path and leaves the stmtCache's slice pristine for reuse.

Measured impact is small (net +~1% RPS, noise-level on MSR1) — the 27% CPU in tcp.Write on the hot PG cell is syscall fixed cost, not per-byte. Keeping the change for correctness and a slightly smaller wire footprint; it saves one server state transition per query, which reduces tail latency on PG cells (p50 3645µs -> 3606µs). All pgspec/conformance suites pass on Linux (msr1, Postgres 16).
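The FormatCode fixup — shallow-copy the cached slice, override the field, leave the cache pristine — can be sketched as follows; `ColumnDesc` here is a minimal stand-in, not the driver's real type:

```go
package main

import "fmt"

// ColumnDesc is a minimal stand-in for cached column metadata; only
// FormatCode matters for this sketch (0 = text, 1 = binary).
type ColumnDesc struct {
	Name       string
	FormatCode int16
}

const FormatBinary int16 = 1

// binaryColumns shallow-copies the cached slice and overwrites
// FormatCode, keeping the stmtCache's slice pristine for reuse while
// the per-query copy decodes on the binary fast path.
func binaryColumns(cached []ColumnDesc) []ColumnDesc {
	cols := make([]ColumnDesc, len(cached))
	copy(cols, cached)
	for i := range cols {
		cols[i].FormatCode = FormatBinary
	}
	return cols
}

func main() {
	// Prepare-time Describe returns text format (FormatCode=0).
	cached := []ColumnDesc{{Name: "id"}, {Name: "name"}}
	cols := binaryColumns(cached)
	// The copy is binary; the cached slice is untouched.
	fmt.Println(cols[0].FormatCode, cached[0].FormatCode) // 1 0
}
```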
Two regression tests for the async dispatch path:
TestAsyncHandlerGoroutineReuse — spins up the engine with
AsyncHandlers=true, opens one keep-alive conn, sends 100 serial
requests with 500µs idle gaps between them, and asserts that the
runtime goroutine count only grew by ≤5 between first-request
baseline and last request. Before the sync.Cond reuse fix, each
idle-then-resume batch respawned the dispatch goroutine, which
would drive that delta well above tolerance.
TestAsyncHandlerCloseWakesGoroutine — opens one conn, sends one
request, closes. Asserts that within 2s the goroutine count
returns to within +2 of the pre-test baseline. This exercises
closeConn's asyncCond.Broadcast path — without it, the parked
dispatch goroutine would leak until GC finalized the connState,
which can be tens of seconds under a busy test suite.
Also cleaned up three leftover references to internal plan jargon
("W4") in public struct comments — replaced with descriptive text.
Updated runAsyncHandler's doc comment to reflect the reuse model
(the stale comment still said "exit when buffer empty" from the
pre-Cond implementation).
Post-review updates (commits since the last summary at 55967ef)
Seven additional commits landed after the initial W1-W4 summary. Chronological:
Measured impact on MS-R1 (aarch64, 12c, 256 conns, 8s per cell, post-fix clean run)
Driver cells under
The epoll + celerisredis cell flipped from -6.1% vs
Async vs sync — full matrix context
A broader async-vs-sync comparison across pure-CPU handlers also landed. Honest finding: async and sync are complementary, not substitutes.
Removing sync would regress plaintext from 428k → 288k rps on
Tracked as #239 — v1.5.0 spike for per-route
Known issue (follow-up tracked)
CI status
41/41 checks green on
Adaptive was excluded from Server.AsyncHandlers() out of a worry that the engine's hot-swap between epoll and iouring could invalidate direct-mode driver conns mid-flight. That concern was based on a false premise — direct-mode drivers don't register FDs with the engine (they dial net.TCPConn and drive reads on the caller goroutine via Go netpoll). adaptive.performSwitch already refuses to switch while any driver-registered FDs exist, and direct-mode drivers contribute zero, so a switch is a no-op for them.

What the old gate actually did: Config.AsyncHandlers=true enabled the async dispatch path in the engine (both epoll and iouring workers honor it), but Server.AsyncHandlers() returned false for Adaptive, so drivers opened WithEngine(srv) saw IsAsyncServer=false and picked the mini-loop busy-poll sync path. Handlers ran on unlocked spawned Gs, and 256 concurrent busy-poll Gs starved CPU — which regressed 3 celeris-native driver cells on the matrix (celerisredis -7.2%, celerispg -4.5%, celerismc -11.0% vs ASYNC=0).

Flipping the gate makes drivers pick their direct-mode path, same as they do on pure epoll or iouring. Direct-mode drivers go through Go netpoll and are engine-agnostic — they keep working regardless of which sub-engine is active, and neither participate in nor block a switch.

MS-R1 impact (aarch64, 12c, 256 conns, 8s per cell):

celeris-adaptive + celerisredis ASYNC=1: 64,206 -> 77,592 (+20.8%)
celeris-adaptive + celerispg    ASYNC=1: 41,864 -> 63,217 (+51.0%)
celeris-adaptive + celerismc    ASYNC=1: 54,959 -> 89,912 (+63.6%)

Adaptive now matches celeris-epoll's async numbers within a few percent across the driver matrix.
The test failed intermittently on ubuntu-latest with the message "WriteAndPoll returned ok=false; sync fast path not engaged".

Root cause: the worker goroutine's epoll_wait could observe the POLLIN edge from the pre-staged peer write and consume the 4 bytes via handleReadable before WriteAndPoll took recvMu and masked EPOLLIN. handleReadable called the registered onRecv (a no-op in the old test), leaving WriteAndPoll's phases all returning EAGAIN.

The race is real and legitimate — the worker consuming data on an EPOLLIN edge before the caller's WriteAndPoll arrives is expected behavior, not a bug. What the test is actually asserting is that the data round-trips correctly under the sync fast-path design, not that Phase A specifically wins every race.

Change the RegisterConn callback from a discard to an append into the same buffer WriteAndPoll would populate, so both paths (worker consumed OR WriteAndPoll consumed) deliver into `got`. Poll for "pong" for up to 100ms after WriteAndPoll returns so the worker-dispatch case has time to complete.

20/20 passes on Linux aarch64 under -race after the fix.
Addresses the issues from the honest review:

1. #240 — Panic recover in async dispatch goroutine. runAsyncHandler in epoll + iouring now wraps its loop body in defer recover(). A panicking user handler no longer crashes the entire server. Logs the stack trace, marks the conn closed, and force-closes the fd so the worker's close path tears down state from its own goroutine. Symmetric to routerAdapter's sync-path safety net.

2. asyncInBuf DoS cap (maxPendingInputBytes = 4 MiB). A client pipelining requests faster than the dispatch goroutine can drain them would otherwise grow asyncInBuf without bound. Symmetric with the existing maxPendingBytes cap on the output side; drainRead closes the conn when the append would exceed the cap. Applied in both epoll and iouring.

3. #241 — PG direct-mode COPY cancel no longer orphans the request. copyFrom / copyTo direct-mode paths now route final-wait ctx.Done through awaitDirectWithCancel, which sends CancelRequest and waits bounded (30s) for the server's Error+RFQ before returning. Without this, a canceled COPY left req in the pending queue and the next query would pop it — wire-format desync.

4. Direct-mode result buffer cap (maxDirectResultBytes = 64 MiB). Direct mode pins syncMode=true so streaming never promotes. A huge SELECT would buffer every row in req.rowSlab. Now fails with ErrResultTooBig once accumulated bytes cross the cap; the caller gets a typed error with actionable remediation (paginate with LIMIT/OFFSET or use a non-async pool for streaming).

5. PG LISTEN/UNLISTEN/NOTIFY guarded in direct mode. Direct-mode conns have no background reader between queries, so NotificationResponse messages would be silently dropped. simpleQuery / simpleExec / simpleExecNoTag now detect these statements and return ErrDirectModeUnsupported with a clear workaround hint. Added isListenOrUnlisten() helper for prefix detection with whitespace + comment skip.

6. H2 + AsyncHandlers: documentation-level warning at engine start. When both flags are set, engines now log that async dispatch is HTTP/1.1-only; H2 conns still run inline on the worker. No behavior change, but surfaces the limitation instead of a silent per-conn-type inconsistency.

7. Backpressure end-to-end test skeleton (skipped by default). Loopback TCP auto-tuning on Linux makes deterministically exercising the maxPendingBytes path hard without sysctl tuning. The test body documents the shape and runs under GOTEST_BACKPRESSURE=1. The defensive paths are exercised in production via Autobahn 9.1.6 (WS 16 MiB frames through the detached 64 MiB cap).

All 62 packages pass go test -race on Linux aarch64 (msr1).
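The recover wrapper from item 1 can be sketched as below — a minimal model where `safeDispatch` and `teardown` are illustrative names standing in for runAsyncHandler's loop body and the asyncClosed + detachQueue + eventfd signalling path:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// safeDispatch wraps a handler invocation in recover: a panicking
// user handler is logged with its stack and the connection is torn
// down, instead of crashing the whole server.
func safeDispatch(handler func(), teardown func()) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
			fmt.Printf("handler panic: %v\n%s", r, debug.Stack())
			teardown() // worker closes the fd from its own goroutine
		}
	}()
	handler()
	return false
}

func main() {
	torn := false
	ok := safeDispatch(func() { panic("user handler bug") }, func() { torn = true })
	fmt.Println("panicked:", ok, "conn torn down:", torn)
}
```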
startTestEngine waits for workers to be ready before returning. The 3s deadline was tight; on GitHub Actions' shared Azure VMs io_uring ring setup (NewRingCPU, SINGLE_ISSUER init, NUMA bind, SQPOLL thread creation) can legitimately take 3-5s when the host is loaded. The older flake showed 'engine did not start in time' as a single-run false-positive that passed on rerun. 15s covers the tail without hiding real failures — if a worker actually fails to initialize, it'll show up immediately via the error channel, not by timing out on the startup check.
1. PG Describe-skip: the hasDescribe gate was `len(cachedCols) == 0`, which treated zero-column prepared statements as "no cache" and defeated the optimization. Changed to `cachedCols == nil`.

2. iouring drainDetachQueue now checks asyncClosed and calls closeConn. The error/panic path in runAsyncHandler already enqueues cs + sets asyncClosed, but drainDetachQueue only called markDirty — the FD/connState stayed zombie until the next handleRecv. Matching fix applied in epoll's drainDetachQueue.

3. epoll async error + panic paths no longer call unix.Close from the dispatch goroutine. That raced with the worker's drainRead holding l.conns[fd]. Replaced with asyncClosed + detachQueue enqueue + eventfd signal; the worker's drainDetachQueue picks up the teardown on its own goroutine.

4. Graceful shutdown now joins dispatch goroutines. Added asyncWG sync.WaitGroup on both Loop (epoll) and Worker (iouring); Add on spawn, Done via defer in runAsyncHandler, Wait at the tail of the engine shutdown. Prevents dispatch Gs from touching connState after the engine claims to have stopped.

5. CVE-2023-44487 Rapid Reset mitigated in H2. The processor now tracks the RST_STREAM count in a sliding one-second window; a sustained burst > rstBurstMax (200) triggers GOAWAY with ENHANCE_YOUR_CALM and closes the connection. Honest clients reset a handful of streams per second; 200/s is well above legitimate patterns and well below the thousands/s needed to amplify the attack.

6. H1 MaxHeaderSize reduced 16 MiB -> 64 KiB (nginx-class default), and new MaxHeaderCount = 200 rejects the thousands-of-tiny-headers DoS that would stay under the byte cap. 64 KiB covers verbose proxy chains; 16 MiB was a slow-loris amplifier. New sentinel ErrTooManyHeaders; existing ErrHeadersTooLarge unchanged.
H2 hardening:
- HPACK decoder now enforces SetMaxStringLength(64 KiB), matching H1 MaxHeaderSize. Prevents a single HEADERS frame from growing the decode target unboundedly.
- Framer initial max read size 16 KiB (RFC 9113 default) instead of hard-coded 1 MiB; new Parser.SetMaxReadFrameSize method so the processor can apply the negotiated SETTINGS_MAX_FRAME_SIZE.
- PRIORITY frames rejected when the stream ID is >2048 past the last client stream — prevents unbounded priority-tree growth via phantom-stream flood.

Redis driver:
- Cluster refreshTopology now fully resets slots/replicas maps at the start of every refresh. Previously the per-range reset left stale replicas for slots that dropped out, and replica appends accumulated across overlapping range entries (resharding window).
- MOVED redirect on attempt > 0 now also refreshes topology (was skipped), eliminating the MOVED-loop-until-background-tick bug.
- Cluster redirect loop bounds changed from an attempt-based range to an explicit maxAttempts = MaxRedirects+1 so the documented redirect count is honored (was off by one).
- Sentinel reconnect no longer appends to sentinelConns unboundedly; stale entries are closed and the slice is capped at one entry.

Postgres driver:
- ParseRowDescriptionInto rejects column count > 1600 (PG's own MaxHeapAttributeNumber). Previously a server-supplied int16 up to 32767 forced multi-MB allocations per RowDescription.
- dropPreparedAsync tracked via pgConn.closeWG; Close() joins the WaitGroup so the background DEALLOCATE G cannot outlive the conn. Early-exit if c.closed is already set.

Redis RESP:
- readBulk's dead rewind removed — Next() already rewinds to pre-tag on ErrIncomplete; readBulk's post-tag rewind was redundant and its comment was misleading.

Server lifecycle:
- StartWithContext / StartWithListenerAndContext no longer leak the shutdown-watcher goroutine on Listen error. Added a listenDone chan the main flow closes; the watcher selects on ctx.Done || listenDone.
- Engine.Shutdown docs for epoll/iouring now correctly state that shutdown is ctx-driven (context cancel → Listen returns → worker shutdown runs asyncWG.Wait), not something Shutdown() does itself.
Summary of defensive hardening and polish swept in the final v1.4.0 pre-tag pass.

Postgres driver
- Defer dropPreparedAsync goroutines through closeWG so Close() waits for background DEALLOCATEs rather than racing with pgConn teardown.
- Replace `len(cachedCols) == 0` with a nil-check in doExtendedQuery so cached-but-empty RowDescriptions correctly skip the Describe step.
- Bound all bare <-req.doneCh waits in COPY error paths with a 30s awaitDoneBounded closure to prevent hung CopyIn/CopyOut unwinds.
- Case-insensitive, word-boundary SQL keyword detection for isListenOrUnlisten + isCacheableQuery (hasKeywordPrefix helper) to avoid false positives on column names like `selected`.
- Move ErrDirectModeUnsupported from pool.go to errors.go alongside the other exported sentinels; prefix all scan convertTo errors with "celeris-postgres: scan: " for consistency.
- dsn.go now warns to stderr on sslmode=prefer/allow (previously a silent downgrade to plaintext) so operators see the change.
- protocol/scram.go: enforce the RFC 7677 minimum iteration count (4096) and zero saltedPassword/authMessage/clientFirst/serverFirst/serverKey/serverSig/password after handleServerFinal.
- protocol/query.go: reject RowDescription with >1600 columns (PG's MAX_TUPLE_ATTR) to guard against malformed server input.

Redis driver
- cluster.refreshTopology now fully resets slots/replicas at start; MOVED always refreshes topology (not only attempt==0) so stale routes don't persist through redirect storms.
- Explicit maxAttempts = MaxRedirects+1 loop replaces the range form after removing the attempt-gated refresh branch.
- sentinel.subscribeLoop closes stale sentinelConns and caps the slice to a single live entry to prevent conn leaks on reconnect.
- commands.asStringSlice/asStringMap return ErrNil on TyNull (previously nil, which callers couldn't distinguish from "empty").
- protocol/resp.go: drop dead rewind code in readBulk.

Memcached driver
- protocol/text.parseUint overflow check uses (maxU64-digit)/10 to detect the last-digit overflow case without false negatives.

Error-prefix consistency
- Normalize all user-facing error prefixes in driver/redis and driver/memcached from the "celeris/redis:" / "celeris/memcached:" slash form to the "celeris-redis:" / "celeris-memcached:" hyphen form matching driver/postgres. No test strings assert on the old prefix. Internal packages (async/pool, eventloop) keep the slash form since they're not user-facing.

Engine async dispatch
- engine/epoll/loop.go: runAsyncHandler now panic-recovers and signals the worker via detachQueue + eventfd instead of calling unix.Close(cs.fd) from the handler goroutine (cross-thread FD close was racy against io_uring SQE submission).
- Graceful shutdown awaits asyncWG so detached handler goroutines finish before Shutdown() returns.

Config / server
- ReadTimeout and WriteTimeout defaults: 300s → 60s (slow-loris hardening; matches nginx client_header_timeout / client_body_timeout).
- Validate() flags a Listener + explicit Addr conflict, but only when Addr has a concrete non-zero port — `:0` stays valid since callers intentionally delegate port selection to the pre-bound listener.
- server.go: Version constant bumped to "1.4.0" (was stuck at "1.3.4").

celeristest
- WithCookie godoc now explicitly notes no escaping of semicolons or CR/LF in the value; tests needing malformed cookie headers should use WithHeader directly.

Test fixes
- resource/config_test.go TestWithDefaults now expects the 60s ReadTimeout/WriteTimeout matching the new defaults.
Summary
v1.4.0 ships five workstreams as a single coherent release:
database/sql driver, direct Pool, streaming rows, SCRAM-SHA-256, COPY FROM/TO (feat: PostgreSQL wire protocol message framing #123 – test: PostgreSQL driver conformance + benchmark vs pgx #131). Plus:
- pgspec/redisspec/mcspec — protocol-compliance suites against real servers.
- Config.AsyncHandlers — opt-in async dispatch so third-party drivers (goredis, pgx, gomemcache) don't futex-storm the LockOSThread'd workers.

MS-R1 matrix — headline numbers
Captured 2026-04-19 on MS-R1 (CIX CP8180, 12-core aarch64, Linux 6.6.10-cix, Postgres 15, Redis 7.2, Memcached 1.6.16).
loadgen -connections 256 -duration 8s.

HTTP-layer (pure-CPU handlers, no driver)
celeris-iouring sync wins plaintext (+37% over fiber, +131% over the other four competitors). fiber wins the /chain cell (314k vs 269k) because its zero-middleware routing path is tight; celeris's chain includes recovery + logger + requestid + cors + timeout which is closer to a "realistic production" stack.
Driver-layer (celeris HTTP + DB round-trip)
Best celeris cell vs best competitor for the same DB:
Matrix leader:
celeris-iouring + celerismc + AsyncHandlers=true = 101,902 rps — +33.5% over the best nethttp combination.

AsyncHandlers — when to use which
Config.AsyncHandlers: true)

Guidance: if your handler touches a DB or cache via any Go driver (third-party or celeris-native), set
AsyncHandlers: true. If your handler is pure-CPU (plaintext, JSON from preallocated data, pure computation), leave the default. Per-route control is tracked in #239 (v1.5.0 spike).

One caveat:
celeris-adaptive with AsyncHandlers=true currently regresses 3 cells (celerisredis -7.2%, celerispg -4.5%, celerismc -11.0%) — the engine-side async flag + driver-side sync-path mismatch on Adaptive. Actively being debugged; fix coming before the v1.4.0 merge or punted to v1.4.1.

What's included
- driver/, engine/, internal/conn/, test/, and .github/workflows/.
- doc.go for every new package + runnable Example functions.
- test/integration/pipeline_test.go).
- engine/epoll/async_reuse_test.go).

Scope decisions
sslmode=require / rediss:// are rejected with actionable error messages. For managed cloud DB services (RDS, CloudSQL, ElastiCache) TLS is required — this release is for VPC/loopback deployments until TLS lands. ClusterTx is shipped.

Test plan
- go test ./... -race on darwin (62/62 packages)
- go test ./... -race on Linux aarch64 (MS-R1) — 62/62 packages
- AsyncHandlers=false and true