feat: v1.4.0 — native PostgreSQL + Redis drivers, H2C upgrade, EventLoopProvider #236
FumingPower3925 merged 90 commits into main
Foundational interfaces that allow database/cache drivers to share the HTTP server's event-loop workers. Drivers register file descriptors on a specific worker via RegisterConn and receive callbacks on data arrival. Includes ErrQueueFull, ErrUnknownFD, and ErrSwitchingNotFrozen sentinel errors. No implementation in this commit — just the contract. Closes #113
Implements EventLoopProvider on the epoll engine. Adds a per-loop driverConns map (fd-indexed, parallel to the HTTP conn table) gated by a hasDriverConns atomic flag — the HTTP hot path pays a single atomic load when no drivers are registered.
- RegisterConn/UnregisterConn/Write with one-write-in-flight serialization (mirrors the PR #36 send-queue fix).
- EPOLLIN edge-triggered, EPOLLOUT level-triggered to avoid missed wakeups under write contention.
- TOCTOU-safe fd-collision check under driverMu.
- shutdownDrivers fires onClose on engine teardown.
Closes #114
Implements EventLoopProvider on the io_uring engine via new CQE user-data tags (udDriverRecv/Send/Close at 0x10–0x12, non-overlapping with existing HTTP tags).
- Driver actions (Register/Unregister/Write) post to a worker-owned action queue and wake via the shared h2EventFD; only the worker goroutine submits SQEs (preserves SQ single-issuer).
- One SEND in-flight per FD (mirrors the PR #36 invariant) using a dedicated driverConn.sending flag + writeBuf/sendBuf swap.
- Single-shot RECV per driverConn (no provided-buffer ring); avoids conflict with HTTP's multishot path.
- Inflight-op counter guards UnregisterConn's ASYNC_CANCEL CQE ordering against in-flight RECV/SEND completions — prevents use-after-free on dc.buf.
- shutdownDrivers fires onClose on engine teardown.
Closes #115
Adaptive engine implements EventLoopProvider by delegating to the active sub-engine. WorkerLoop panics if FreezeSwitching is not held (driver FDs cannot migrate between epoll/io_uring tables). Exposes ErrSwitchingNotFrozen for drivers that attempt to register without first freezing the engine switch.
Minimal event loop used by drivers when no celeris Server is registered. Linux uses a stripped-down epoll worker (same primitives as engine/epoll but no accept/HTTP parsing); non-Linux falls back to goroutine-per-conn via net.FileConn.
- WriteAndPoll sync fast path: the caller goroutine does a direct write(2) + 3-phase read (spin → poll(0) → poll(1ms blocking)) to avoid goroutine-hop latency for localhost DB/cache round trips. recvMu serializes with the event loop's onRecv callback.
- WriteAndPollMulti for pipelined protocols (e.g. Redis Pipeline) — single write, poll-drain until isDone.
- EPOLLOUT level-triggered re-arms on EAGAIN (slow-consumer backpressure).
- registry.go exposes Resolve(ServerProvider) with a refcounted package-level standalone Loop; returns the Server's provider if registered, else the standalone fallback.
The io_uring standalone path is present but not selected by default: its SINGLE_ISSUER constraint conflicts with the sync fast path's caller-goroutine reads. Kept as dead code for a future follow-up (#232 area).
Closes #116
Server.EventLoopProvider() returns the active engine's provider (epoll, io_uring, or adaptive), or nil for engines that don't implement the interface (std). Drivers use this to route DB I/O through the HTTP server's worker event loops for per-CPU affinity.
H1 clients can now upgrade an HTTP/1.1 connection to HTTP/2 over cleartext via Connection: Upgrade, HTTP2-Settings + Upgrade: h2c.
- protocol/h1: the parser detects the three-token Upgrade handshake. Rejects ambiguous Upgrade values (e.g. "websocket, h2c") to disambiguate from the WebSocket path.
- resource.Config.EnableH2Upgrade *bool with protocol-dependent defaults (Auto→true, H2C/HTTP1→false) propagated into H1State.
- internal/conn/upgrade.go: UpgradeInfo + ErrUpgradeH2C sentinel + DecodeHTTP2Settings (RawURL + URL fallback).
- ProcessH1 writes 101 Switching Protocols and returns ErrUpgradeH2C without invoking the handler (the handler runs later on H2 stream 1).
- protocol/h2/stream: Manager.ApplySetting exposed; Processor gains InjectStreamHeaders, which opens stream 1 from the H1 headers without an HPACK round trip.
- NewH2StateFromUpgrade constructs the H2State post-101: applies client SETTINGS from HTTP2-Settings, emits the server preface, injects stream 1 with H1→H2 pseudo-headers, dispatches the handler. Includes the rewritten header-copy path that forces strings out of the H1 recv buffer (prevents use-after-free when the driver layer reuses the buffer).
Closes #117 #118 #119 #120
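The three-token detection can be illustrated with a standalone check; the helper name and header plumbing are hypothetical (the real parser works on raw bytes), but the rules match the commit: Connection must carry both the Upgrade and HTTP2-Settings tokens, Upgrade must be exactly "h2c", and the HTTP2-Settings header must be present:

```go
package main

import (
	"fmt"
	"strings"
)

// wantsH2CUpgrade is an illustrative sketch of the handshake check.
// An ambiguous Upgrade list like "websocket, h2c" is rejected so the
// WebSocket path stays unambiguous.
func wantsH2CUpgrade(connection, upgrade, http2Settings string) bool {
	if strings.TrimSpace(upgrade) != "h2c" { // rejects "websocket, h2c"
		return false
	}
	if http2Settings == "" { // the base64url SETTINGS payload must be present
		return false
	}
	hasUpgrade, hasSettings := false, false
	for _, tok := range strings.Split(connection, ",") {
		switch strings.ToLower(strings.TrimSpace(tok)) {
		case "upgrade":
			hasUpgrade = true
		case "http2-settings":
			hasSettings = true
		}
	}
	return hasUpgrade && hasSettings
}

func main() {
	fmt.Println(wantsH2CUpgrade("Upgrade, HTTP2-Settings", "h2c", "AAMAAABkAARAAAAA"))       // true
	fmt.Println(wantsH2CUpgrade("Upgrade, HTTP2-Settings", "websocket, h2c", "AAMAAABkAARAAAAA")) // false
}
```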
Both the epoll and io_uring engines detect ErrUpgradeH2C from ProcessH1 and switch the connection from H1 to H2 state:
- Release H1State, construct H2State via NewH2StateFromUpgrade.
- Feed UpgradeInfo.Remaining through ProcessH2 synchronously (the H2 client preface may arrive in the same TCP segment as the H1 Upgrade request).
- Flush writes explicitly after switchToH2 so the 101 response + server preface + stream-1 reply reach the client promptly.
test/spec/h2c_upgrade_test.go adds integration coverage across the iouring + epoll engines: happy path, POST with body, subsequent streams 3/5/7, config variations, invalid settings, missing Connection token, preface-in-same-segment, preface-split-across-reads, 1 MB body.
Closes #121 #122
Internal primitives shared by the PostgreSQL and Redis drivers:
- Bridge: lock-guarded FIFO ring buffer of pending requests, power-of-two capacity, O(1) enqueue/pop. Both the PG and Redis wire protocols guarantee in-order responses on one connection, so a single ring suffices.
- Pool[C]: generic worker-affinity connection pool. Per-worker idle lists (lock-free fast path), a shared overflow pool, and a semaphore-based wait queue (matching database/sql.DB SetMaxOpenConns semantics). Acquire blocks with a ctx deadline instead of the old immediate ErrPoolExhausted.
- Backoff: exponential with jitter (shared by PG reconnect and Redis PubSub reconnect).
- Health sweep: ticker-driven eviction of expired / idle-too-long connections.
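The power-of-two ring idea can be sketched in a few lines: because capacity is a power of two, head and tail are monotonically increasing counters indexed with a bitmask instead of a modulo, and both operations are O(1). This is a minimal unguarded sketch (the real Bridge is lock-guarded and surfaces ErrQueueFull; a single ring per conn works precisely because responses come back in request order):

```go
package main

import "fmt"

// ring is a minimal FIFO backed by a power-of-two slice; head/tail grow
// monotonically and wrap via `& (len-1)` instead of `%`.
type ring[T any] struct {
	buf        []T
	head, tail int
}

func newRing[T any](capPow2 int) *ring[T] { return &ring[T]{buf: make([]T, capPow2)} }

func (r *ring[T]) len() int { return r.tail - r.head }

// enqueue returns false when full — the real Bridge surfaces ErrQueueFull here.
func (r *ring[T]) enqueue(v T) bool {
	if r.len() == len(r.buf) {
		return false
	}
	r.buf[r.tail&(len(r.buf)-1)] = v
	r.tail++
	return true
}

func (r *ring[T]) pop() (T, bool) {
	var zero T
	if r.head == r.tail {
		return zero, false
	}
	v := r.buf[r.head&(len(r.buf)-1)]
	r.head++
	return v, true
}

func main() {
	r := newRing[int](4)
	for i := 1; i <= 4; i++ {
		r.enqueue(i)
	}
	fmt.Println(r.enqueue(5)) // false: full
	a, _ := r.pop()
	b, _ := r.pop()
	fmt.Println(a, b) // FIFO order: 1 2
}
```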
PostgreSQL v3 frontend/backend protocol:
- message.go: zero-alloc Reader/Writer for the 1-byte type + 4-byte length frame. StartupMessage/CancelRequest/SSLRequest variants (no type byte).
- startup.go + scram.go: connection handshake with Trust, Cleartext, MD5, and SCRAM-SHA-256 (PBKDF2 via stdlib crypto/pbkdf2; RFC 7677 test vectors pass). GSS/SSPI/Kerberos explicitly rejected.
- query.go: Simple Query 'Q' flow. PGError parsing (severity, SQLSTATE, message, detail, hint, position). Defers tag string materialization — RowsAffected is zero-alloc.
- extended.go: Parse/Bind/Describe/Execute/Sync/Close. Append-style message builders (AppendParse, AppendBind, ...) write into the Writer buffer with no per-message snapshot. Supports SkipParse for reusing named prepared statements.
- copy.go: CopyInState/CopyOutState + binary header/trailer + text-format row encoder (with escape handling).
- types.go + types_time.go + types_numeric.go + types_array.go: OID codec registry. Built-ins cover bool, int2/4/8, float4/8, text/varchar, bytea, uuid, jsonb, date, timestamp(tz), numeric, and common array types. Infinity sentinels for date/timestamp (binary and text). Floor correction for pre-epoch dates. Zero-alloc decode into pgRows' per-request slab.
Fuzz tests for Reader + types; seed corpus under testdata/fuzz.
Closes #123 #124 #125 #126 #127 #128
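The framing is simple enough to sketch: a regular v3 message is one type byte followed by a big-endian int32 length that counts itself plus the payload (but not the type byte). A clarity-first sketch (the real Reader/Writer are zero-alloc; helper names here are illustrative):

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// appendMessage frames a PostgreSQL v3 message: 1-byte type, then a
// big-endian int32 length covering itself + payload (type byte excluded).
func appendMessage(dst []byte, typ byte, payload []byte) []byte {
	dst = append(dst, typ)
	dst = binary.BigEndian.AppendUint32(dst, uint32(4+len(payload)))
	return append(dst, payload...)
}

var errIncomplete = errors.New("pgproto: incomplete frame")

// readMessage parses one frame, returning type, payload, and the remaining
// bytes. A short buffer yields errIncomplete so the caller can read more.
func readMessage(buf []byte) (typ byte, payload, rest []byte, err error) {
	if len(buf) < 5 {
		return 0, nil, buf, errIncomplete
	}
	n := binary.BigEndian.Uint32(buf[1:5])
	if len(buf) < 1+int(n) {
		return 0, nil, buf, errIncomplete // split read: retry after more data
	}
	return buf[0], buf[5 : 1+n], buf[1+n:], nil
}

func main() {
	// Simple Query 'Q': SQL text + NUL terminator.
	frame := appendMessage(nil, 'Q', append([]byte("SELECT 1"), 0))
	typ, payload, rest, err := readMessage(frame)
	fmt.Println(string(typ), string(payload[:len(payload)-1]), len(rest), err)
}
```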
…rows
Full PostgreSQL driver on top of the v3 protocol layer.
- driver.go + connector.go + dsn.go: sql.Driver registered as "celeris-postgres". DSN supports URL + key=value forms. sslmode=require returns ErrSSLNotSupported with an actionable message pointing at the v1.5.0 TLS spike (#232).
- conn.go: pgConn implements driver.Conn + extended interfaces (ConnBeginTx, ConnPrepareContext, QueryerContext, ExecerContext, Pinger, SessionResetter, Validator, async.Conn). Sync→async bridge: the handler goroutine encodes + writes + blocks on doneCh; the event loop parses the response and signals completion. WriteAndPoll sync fast path eliminates context switches for localhost queries. Re-prepare-on-miss (SQLSTATE 26000) after DISCARD ALL.
- stmt.go + rows.go + result.go + tx.go: database/sql facades.
- cancel.go: PG CancelRequest via a separate short-lived TCP conn with a bounded 5s timeout (independent of the caller ctx).
- lru.go: per-conn prepared-statement cache.
- pool.go: direct Pool API (postgres.Open + WithEngine) bypassing database/sql. Rows.Next() + Scan(...any) matches sql.Rows. QueryRow + Row.Scan with sql.ErrNoRows. Tx with savepoints. CopyFrom/CopyTo with the text-format row encoder. Lazy streaming rows: buffer up to 64 rows, promote to a bounded channel (cap 64) for larger result sets — no OOM on million-row queries, no channel-alloc cost on single-row queries.
- Public types.go exposes *Conn (an alias for *pgConn) so sql.Conn.Raw users can reach Savepoint/ReleaseSavepoint/RollbackTo.
convertAssign supports sql.Scanner (sql.NullString/NullInt64/pgtype.Inet and custom types), typed primitives, and NULL. Pool.Result implements driver.Result (LastInsertId returns an error pointing at RETURNING). Rows.Err() tracks iteration errors. A sessionDirty flag elides DISCARD ALL on ResetSession when the conn only ran simple queries — avoids one round trip per pool return on the hot path.
Closes #129 #130
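The per-conn prepared-statement cache from lru.go can be sketched as a classic map + doubly-linked-list LRU keyed by SQL text. Everything below is an illustrative shape, not the actual implementation; the eviction return value models the point where the real driver would send a wire-level Close for the evicted statement name:

```go
package main

import (
	"container/list"
	"fmt"
)

// stmtLRU caches SQL text → server-side statement name, evicting the least
// recently used entry at capacity.
type stmtLRU struct {
	cap   int
	order *list.List               // front = most recently used
	items map[string]*list.Element // sql → element holding *entry
}

type entry struct{ sql, name string }

func newStmtLRU(cap int) *stmtLRU {
	return &stmtLRU{cap: cap, order: list.New(), items: map[string]*list.Element{}}
}

// get returns the cached statement name and bumps recency.
func (c *stmtLRU) get(sql string) (string, bool) {
	el, ok := c.items[sql]
	if !ok {
		return "", false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).name, true
}

// put caches sql→name and returns the evicted statement name, if any,
// so the caller can send a Close('S') for it on the wire.
func (c *stmtLRU) put(sql, name string) (evicted string) {
	if el, ok := c.items[sql]; ok {
		c.order.MoveToFront(el)
		el.Value.(*entry).name = name
		return ""
	}
	if c.order.Len() == c.cap {
		old := c.order.Remove(c.order.Back()).(*entry)
		delete(c.items, old.sql)
		evicted = old.name
	}
	c.items[sql] = c.order.PushFront(&entry{sql, name})
	return evicted
}

func main() {
	c := newStmtLRU(2)
	c.put("SELECT 1", "s1")
	c.put("SELECT 2", "s2")
	c.get("SELECT 1")                    // bump s1
	fmt.Println(c.put("SELECT 3", "s3")) // evicts s2 (least recently used)
	_, ok := c.get("SELECT 2")
	fmt.Println(ok) // false
}
```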
Zero-allocation RESP parser supporting both RESP2 and the RESP3 types (null _, bool #, double ,, bigint (, blob-error !, verbatim =, set ~, map %, attribute |, push >).
- Value struct with typed fields (Str aliases Reader.buf for bulk/simple/verbatim/blob-err). Array/Map use a sync.Pool-backed slice pool.
- Reader.Next returns ErrIncomplete on partial frames without advancing the cursor — safe to Feed more and retry.
- MaxBulkLen (512 MiB) + MaxArrayLen (128 M) guards prevent DoS from malicious servers advertising huge lengths.
- parseUint overflow check; parseInt accepts math.MinInt64.
- Writer.AppendCommand1..5 arity-specific builders avoid the variadic slice allocation that dominated pipeline alloc profiles (was 32% of Pipeline1K allocs).
- FuzzReader + FuzzRoundTrip seeded with hand-crafted RESP2/3 frames.
Closes #132
Redis client API on top of the RESP parser. Typed commands for all common operations — strings (Get/Set/Incr/Decr/Append/...), hashes (HGet/HSet/HIncrBy/...), lists (LPush/LPop/LRange/LRem/...), sets (SAdd/SMembers/SInter/SUnion/SDiff/...), sorted sets (ZAdd/ZRange/ZRank/ZScore/ZIncrBy/...), keys (Expire/TTL/Type/...), scripting (Eval/EvalSHA/ScriptLoad), scan iterator (SCAN).
- RedisState event-loop state machine + ProcessRedis(data). FIFO request/response matching via async.Bridge.
- HELLO 3 negotiation with RESP2 fallback (AUTH + SELECT). WithForceRESP2() escape hatch for ElastiCache-classic-shaped servers that advertise 6.x but reject HELLO. Releases the HELLO request on fallback (fixes a pool leak on Redis <6.0).
- WriteAndPoll sync fast path: single-command round trips bypass the event loop for localhost latency.
- Client.Do/DoString/DoInt/DoBool/DoSlice escape hatch for commands outside the typed surface.
- OnPush callback for RESP3 client-tracking push frames received on command connections (otherwise silently dropped).
- resetSession elides DISCARD when !dirty (avoids a round trip on the hot path).
- Context cancellation closes the connection — the pending response arriving on a poisoned conn is drained via drainWithError, preventing desync with the next command issued on a fresh conn.
- Expire(ttl=0) calls Persist (was: silent clamp to 1s).
- WithHealthCheckInterval wires async.Pool health sweeps.
- Nil-safe Client.Close.
Closes #133
- Pipeline: single write for all buffered commands, FIFO response matching via async.Bridge. Typed deferred cmd handles (StringCmd/IntCmd/StatusCmd/FloatCmd/BoolCmd) that resolve via (Pipeline, idx) — pipeline-owned so the struct survives slice growth. Release() returns the Pipeline to a sync.Pool; typed cmd handles become invalid (return ErrClosed — orphan guard).
- Sync pipeline fast path: direct read/parse on the caller goroutine, zero per-response allocation when the result set fits in one TCP chunk. Slab-based copy-detach for the string payloads. Dropped Pipeline1K from 1320 → 3 allocs.
- Tx (TxPipeline): MULTI … EXEC variant on a pinned conn with Watch/Unwatch support, ErrTxAborted on null EXEC.
- Backpressure-aware context cancellation populates per-cmd errors before closing the conn (was: zero Values + no indication of cancellation). maxSlabRetain shrinks oversized slabs on Release to bound memory from one-off large pipelines.
Closes #134
- PubSub pins a dedicated connection from the pubsubPool (command connections cannot be reused — push mode).
- Subscribe/PSubscribe/Unsubscribe/PUnsubscribe with a mu-serialized subscription set as the source of truth.
- Auto-reconnect on conn drop via the onConnDrop hook → runs a reconnectLoop goroutine with async.Backoff (50ms → 5s, jittered), replays the subscription set with a single pipelined SUBSCRIBE + PSUBSCRIBE. Messages during the outage are lost (at-most-once, documented).
- deliver() + closeMsgCh() serialize on ps.mu (no send-on-closed-channel panic).
- Nil-safe conn reference during reconnect — subscribe failures set ps.conn = nil before closing.
Closes #135
Wraps async.Pool[*redisConn] with Redis-specific dial and health checks. Separate cmd and pubsub pools (push-mode conns cannot be reused for commands).
- WithEngine(ServerProvider) resolves to the HTTP server's EventLoopProvider (integrated mode) or falls back to the standalone mini event loop.
- HealthCheckInterval defaults to 30s (tunable); MaxOpen/MaxIdlePerWorker/MaxLifetime/MaxIdleTime follow database/sql-style semantics.
- Bounded 5-retry acquire on stale-conn hits (previously unbounded recursion).
- Pool error messages include MaxOpen context on exhaustion.
Closes #136
ClusterClient:
- CRC16 (XMODEM) slot computation with {tag} hash tag support.
- [16384]*clusterNode O(1) slot → node routing; background
CLUSTER SLOTS refresh every 60s + on-demand after MOVED.
- MOVED/ASK redirect handling (max 3 retries). ASK sends
ASKING on a pinned conn (via pinnedConnKey context) so the
next command lands on the same connection — fixes a subtle
bug where a pooled ASKING + pooled command could hit
different conns.
- Multi-key commands (DEL/EXISTS) fan out per-node sub-calls
in parallel.
- ClusterPipeline groups commands by slot, executes per-node
sub-pipelines in parallel, retries MOVED/ASK affected
commands on refresh.
- ClusterTx with same-slot validation (ErrCrossSlot on
mismatch; hash tags colocate keys). ClusterClient.Watch
with the same guard.
- ReadOnly mode: reads routed to replicas (round-robin) with
READONLY handshake; falls back to primary on replica failure.
- RouteByLatency: per-node RTT measured each refresh; picks
lowest-latency node for reads.
- Shard channels (Redis 7+): SSubscribe/SPublish +
smessage/ssubscribe/sunsubscribe recognition in the state
machine + shard-aware reconnect replay.
SentinelClient:
- Master discovery via SENTINEL get-master-addr-by-name
+ ROLE verification.
- Auto-failover: subscription to +switch-master on a
sentinel conn; atomic primary swap under RWMutex.
- dialMaster retries 3× with backoff on failover; marks
client unhealthy (ErrSentinelUnhealthy) if all retries fail
instead of silently reusing the stale master.
Closes #233 #234 #235
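The slot computation above follows the Redis Cluster specification: HASH_SLOT = CRC16(key) mod 16384 using CRC-16/XMODEM, and if the key contains a non-empty substring between the first '{' and the next '}', only that substring is hashed — which is how hash tags colocate the keys of a ClusterTx on one slot. A self-contained sketch (function names are illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// crc16 implements CRC-16/XMODEM (poly 0x1021, init 0x0000), the checksum
// Redis Cluster specifies for key→slot mapping.
func crc16(data []byte) uint16 {
	var crc uint16
	for _, b := range data {
		crc ^= uint16(b) << 8
		for i := 0; i < 8; i++ {
			if crc&0x8000 != 0 {
				crc = crc<<1 ^ 0x1021
			} else {
				crc <<= 1
			}
		}
	}
	return crc
}

// hashSlot applies the {tag} rule: hash only the first non-empty substring
// between '{' and '}' if one exists, else the whole key.
func hashSlot(key string) uint16 {
	if open := strings.IndexByte(key, '{'); open >= 0 {
		if n := strings.IndexByte(key[open+1:], '}'); n > 0 {
			key = key[open+1 : open+1+n]
		}
	}
	return crc16([]byte(key)) % 16384
}

func main() {
	fmt.Printf("%#x\n", crc16([]byte("123456789"))) // 0x31c3: the XMODEM check value
	// Same tag → same slot, so DEL/MULTI over both keys stays single-node.
	fmt.Println(hashSlot("{user1000}.following") == hashSlot("{user1000}.followers")) // true
}
```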
Example* functions for Client.NewClient + Get/Set/Pipeline/Subscribe + the Do escape hatch + TxPipeline. Examples follow the godoc convention and render under the package "Examples" tab.
Build-tagged //go:build postgres, env-gated by CELERIS_PG_DSN. Docker Compose spins up postgres:16. Covers auth (MD5, SCRAM-SHA-256, Trust), simple + extended query, type round-trips (all built-in OIDs + arrays), transactions + isolation levels, savepoints, COPY IN/OUT, error handling, cancel, pool affinity, concurrency (1000 goroutines × 100 queries), and large result sets. Closes #131 (conformance portion)
Build-tagged //go:build redis, env-gated by CELERIS_REDIS_ADDR (+ optional CELERIS_REDIS_PASSWORD for AUTH). Docker Compose spins up redis:7.2. Covers all data structures (strings, hashes, lists, sets, sorted sets, keys), pipelines (incl. mid-stream failure), pub/sub (patterns + unsubscribe + reconnect via CLIENT KILL TYPE pubsub), transactions + Watch, pool affinity + overflow + idle cleanup, RESP2 vs RESP3 HELLO negotiation, AUTH variants. Closes #137 (conformance portion)
Modeled after h2spec / Autobahn|Testsuite: an external spec verifier that speaks the PG wire protocol directly (raw TCP + driver/postgres/protocol) without going through database/sql. 51 tests organized by spec section:
- Startup: version negotiation, SSLRequest, CancelRequest, malformed startup handling.
- Auth: SCRAM-SHA-256 full handshake + bad-password failure.
- Simple Query: SELECT/INSERT/Error/MultiStatement/NULL/LargeResult (100K rows)/TransactionStatus byte.
- Extended Query: Parse/Bind/Describe/Execute/Sync/Close, error-during-Bind recovery, portal suspension.
- COPY: text-in, binary-in, out, fail, wrong-format, large-out (10K rows).
- Error handling: all PGError fields, NoticeResponse, RFQ recovery after error.
- Wire framing: zero-length payload, split reads, back-to-back messages.
- Type round-trips: all built-ins + NULL + arrays + 1 MB values + infinity sentinels for date and timestamp.
- Lifecycle: Terminate, idle timeout, Cancel.
Invoked via `mage pgSpec` (gated by CELERIS_PG_DSN).
Modeled after pgspec / h2spec: an external spec verifier that speaks RESP directly (raw TCP + driver/redis/protocol). 62 tests + 4 fuzz targets organized by spec section:
- RESP2 types: all 5 base types + null/empty edge cases.
- RESP3 types: all 11 new types (null, bool, double, bigint, blob error, verbatim, set, map, attribute, push).
- Command protocol: inline, multi-bulk, pipelines (incl. mid-stream failure), max args, large bulks, unknown command.
- AUTH + SELECT variants.
- Pub/Sub: subscribe/message/pattern/unsubscribe/multi-channel/PING-in-pubsub/non-sub-command-during-sub/RESP3 push format.
- Transactions: MULTI/EXEC/DISCARD/empty EXEC/queued-error/EXECABORT/Watch/Unwatch.
- Wire edge cases: split reads, 10K PING back-to-back, binary keys with NUL/CRLF, max bulk size, concurrent conns, CLIENT SETNAME round-trip, attribute-prefixed reply, integer overflow, push-on-cmd-conn.
- Fuzz: FuzzRESPParse, FuzzRESPRoundTrip, FuzzRESP3Types, FuzzBulkBoundary.
Invoked via `mage redisSpec` (gated by CELERIS_REDIS_ADDR).
Darwin-runnable test that proves the headline v1.4.0 architecture:
- Starts a celeris Server (std engine, std-engine-compatible).
- Spins up in-process fake PG and Redis servers.
- Registers handlers at /db and /cache that open a driver Pool with WithEngine(server) and run a query on each request.
- Issues real HTTP requests and verifies responses.
The std engine doesn't implement EventLoopProvider, so the drivers fall back to the standalone loop (the documented path). On Linux with epoll/iouring the WithEngine call picks up the engine's native provider — same code path, real per-worker affinity.
Separate go.mod submodules (replace celeris ../../..) so competitor libs don't pollute the main module's dependencies.
- test/drivercmp/postgres/ — mirrored benchmarks for celeris vs pgx vs lib/pq across SelectOne (sql.DB + direct Pool), Select1000 rows (text + binary), InsertPrepared, Transaction, PoolContention, ParallelQuery, CopyIn_1M_Rows, and integrated net/http handler latency (celeris Server + pgxpool + net/http).
- test/drivercmp/redis/ — celeris vs go-redis: Get, Set, MGet10, Pipeline10/100/1000/10000, Parallel GET, PubSub1to1 latency.
Each benchmark reports ns/op + B/op + allocs/op; the goal is pgx/go-redis parity on single commands and significant wins on parallel and pipeline paths. Results captured in the PR body.
mage_driver.go exposes:
- TestIntegration — go test ./test/integration/…
- H2CCompliance — H2C upgrade integration tests
- TestDriver {postgres|redis} — conformance suite (env-gated)
- PGSpec / RedisSpec — protocol compliance suites
- BaselineBench {eventloop|h2c|postgres|redis} — snapshot bench
to results/<ts>-baseline-<subsys>/ with env.json + bench.txt
- DriverProfile {driver} {bench} — runs CPU/heap/mutex/block
pprof capture + top.txt
- DriverBench {postgres|redis} — full comparator suite
- PreBench — correctness gate (lint+test+spec+integration+
h2c+testDriver) before any profile/bench work
.github/workflows/ci.yml gains a driver-conformance job with
postgres:16 and redis:7.2 service containers running the
conformance suites with -tags postgres/redis gates.
…uccessful reply
The per-slot sub-pipeline used the direct-extraction fast path where successful
results are stored in pc.str + pc.scalar with pc.direct=true and pc.val left
nil. The harvester only read from pc.val, so every successful cluster
pipeline GET returned a zero protocol.Value{}.
- Preserve the original command kind when forwarding to the sub-pipeline
instead of forcing every command to kindString.
- Harvest the direct path into a synthesized protocol.Value via new
directToValue helper. String bytes are copied out of pc.str because it
aliases p.strSlab which the deferred p.Release() reclaims.
- Handle all kinds (string, status, int, float, bool) so INCR/SET/etc.
return correct typed values from cluster pipelines too.
Adds TestClusterPipelineReturnsValues which fails without the fix
(results[0].Str is empty) and passes after.
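The copy-detach requirement in the fix — pc.str aliases p.strSlab, which the deferred Release() hands back for reuse — comes down to byte-slice aliasing. A toy reproduction (the slab type and helper names are illustrative, not the driver's):

```go
package main

import "fmt"

// slab hands out sub-slices that ALIAS its backing array, so a result kept
// past slab reuse silently changes underneath the caller. The fix's
// copy-detach (modeled by detach) avoids that.
type slab struct{ buf []byte }

// place writes v into the slab and returns an aliasing sub-slice.
func (s *slab) place(v string) []byte {
	s.buf = append(s.buf[:0], v...)
	return s.buf[:len(v)]
}

// detach copies the bytes out so the result survives slab reuse.
func detach(b []byte) []byte { return append([]byte(nil), b...) }

func main() {
	s := &slab{buf: make([]byte, 0, 16)}

	aliased := s.place("hello")
	copied := detach(aliased)

	s.place("WORLD") // simulates Release() + slab reuse by the next pipeline

	fmt.Println(string(aliased)) // WORLD — corrupted: reads the reused slab
	fmt.Println(string(copied))  // hello — safe: detached copy
}
```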
…ne conn
The per-command ASK recovery path in ClusterPipeline.execRound did two separate n.client.Do() calls — one for ASKING and one for the original command. Each n.client.Do acquires its own conn from the pool, so the ASKING and the command could land on different TCP connections. Redis requires both to arrive on the SAME conn; otherwise the migrating slot's owner responds with MOVED and the pipeline fails.
Fix mirrors the existing single-command doASK pattern (cluster.go:534): acquire a conn, run ASKING on it, propagate the conn into the follow-up Do via pinnedConnKey, then release.
Adds TestClusterPipelineASKPinsConn — it models real Redis's strict ASKING semantics in the fake server (per-conn ASKING flag; GET returns MOVED if the flag isn't set on that conn). A race goroutine triggered immediately after ASKING replies hammers the pool with concurrent PINGs, reliably stealing the released conn before the buggy code can re-acquire it. The test fails without the fix (MOVED error) and passes with it.
…TURNING
simpleExec and doExtendedExec do not allocate req.colsCh — Exec paths
only need the CommandComplete tag and never stream. However, the shared
dispatch handlers (reqSimple, reqExtended) called promoteToStreaming
unconditionally once len(head.rows) crossed streamThreshold (64), and
promoteToStreaming ends with close(req.colsCh). Closing a nil channel
panicked on the event-loop worker goroutine.
Real trigger: Exec("INSERT INTO t SELECT ... RETURNING id") where the
SELECT produces >= 64 rows. PG streams DataRows, the Exec caller never
reads them, and the panic kills the reader loop.
Fix:
- Dispatch sites now guard on head.colsCh != nil before promoting — Exec
paths (colsCh == nil) skip the promotion entirely. The buffered rows
are simply dropped when the request finishes (the caller only reads
the tag via simple.TagBytes / extended.Tag).
- Add a defensive nil-guard inside promoteToStreaming as defense in
depth, so a future caller can't hit the same trap.
Adds TestPgExecReturningLargeResult: fake PG server replies with 128
DataRows + CommandComplete for a simpleExec. Without the fix the test
panics ("close of nil channel") in promoteToStreaming. With the fix the
test cleanly returns RowsAffected=128.
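The trap and the guard are easy to demonstrate in isolation: closing a nil channel always panics, and on an event-loop worker that panic kills the whole reader loop. A minimal sketch (promote models the guarded dispatch-site check; the signature is illustrative):

```go
package main

import "fmt"

// promote mirrors the fixed dispatch-site shape: skip promotion entirely
// when no streaming channel was ever allocated (the Exec path).
func promote(colsCh chan []string, rows [][]string) (promoted bool) {
	if colsCh == nil {
		return false // Exec path: colsCh was never allocated
	}
	for _, r := range rows {
		colsCh <- r
	}
	close(colsCh)
	return true
}

// closeNil shows the unguarded failure mode: close(nil) panics.
func closeNil() (panicked bool) {
	defer func() { panicked = recover() != nil }()
	var ch chan []string // nil, as on the Exec path
	close(ch)            // panics: "close of nil channel"
	return false
}

func main() {
	fmt.Println(closeNil())        // true: the unguarded path panics
	fmt.Println(promote(nil, nil)) // false: the guard skips promotion
	ch := make(chan []string, 1)
	fmt.Println(promote(ch, [][]string{{"id"}})) // true: streaming path unaffected
}
```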
Four error branches in the driver's dialConn implementations called syscall.Close(fd) while the *os.File wrapper returned from tcp.File() was still reachable with its runtime finalizer armed. When GC later fires the finalizer, it calls syscall.Close(fd) a SECOND time — potentially on an unrelated fd the kernel has already reassigned to another open socket. Classic phantom-close bug.
Affected sites (replaced syscall.Close(fd) with file.Close(), which closes the kernel fd AND disarms the finalizer — the same pattern already used correctly on the SetNonblock error branch just above each):
- driver/postgres/conn.go:568 — NumWorkers == 0 branch
- driver/postgres/conn.go:610 — RegisterConn failure branch
- driver/redis/conn.go:156 — NumWorkers == 0 branch
- driver/redis/conn.go:186 — RegisterConn failure branch
Adds two regression tests (TestPgDialConnNoPhantomCloseOnError and TestRedisDialConnNoPhantomCloseOnError) that force the NumWorkers == 0 branch with a stub Provider and run repeated GC afterward. While finalizer timing makes this impossible to assert deterministically, the tests at least pin the correct error path and document the intent. The primary safeguard remains code inspection + comments at each site — directly testing a finalizer-driven double close would require fd-recycle injection.
When the memcached client is opened without WithEngine(srv), skip the
mini-loop entirely. Each conn keeps the live *net.TCPConn and does
Write + Read directly on the caller's goroutine via Go's netpoll
(which parks the G on EPOLLIN transparently).
The mini-loop path is a net loss for standalone request/response
workloads:
Profile of nethttp + celerismc (mini-loop path) at 74k rps:
WriteAndPoll 36.4s cum (33% CPU)
├─ flushLocked 16.6s (15% — the actual write syscall)
├─ Phase B polls 6.1s (5.4% — poll(0) x 16)
├─ EpollCtl MOD 2.7s (2.4% — mask + unmask per op)
├─ Phase A read 1.5s (1.3%)
└─ recvMu/overhead ~9s (~8%)
The last ~11s (~10% of total CPU) is overhead the mini-loop adds on
top of what gomc pays. gomc just calls net.Conn.Write/Read; Go's
netpoll handles EPOLLIN transparently, no per-op syscalls beyond
write+read. Direct mode matches that shape exactly.
Matrix on MS-R1 (MC cells only), matrix 12 → matrix 13:
nethttp + celerismc 74,689 → 87,556 (+17.2%, gomc: 86,282 — win)
gin + celerismc 73,977 → 88,707 (+19.9%, gomc: 87,589 — win)
chi + celerismc 71,352 → 87,420 (+22.5%, gomc: 83,664 — win)
echo + celerismc 73,960 → 88,447 (+19.6%, gomc: 86,602 — win)
celerismc now beats gomemcache on every foreign HTTP server tested.
p99 latency also improves — ~10ms → ~6ms per row.
Engine-integrated path (WithEngine supplied) unchanged: the mini-loop
is still used so DB conns colocate with the celeris HTTP engine's
LockOSThread'd worker. Two modes coexist behind the same mcConn
struct (useDirect discriminates); existing tests and the engine-
integrated bench cells are unaffected.
Implementation:
- NewClient without WithEngine → newDirectPool instead of newPool
- dialDirectMemcachedConn skips eventloop.Resolve, keeps *net.TCPConn
- execText / execBinary / execBinaryMulti branch on useDirect
- Close branches on useDirect
Two post-commit CI failures on 0c4b6d8:
1. Lint errcheck: `defer c.tcp.SetDeadline(time.Time{})` in execTextDirect / execBinaryDirect / execBinaryMultiDirect didn't check the returned error. Wrap in `defer func() { _ = ... }()`.
2. TestWriteAndPollSyncPath flaked again under -race on ubuntu CI. The previous fix (a synchronous peer write before WriteAndPoll) wasn't enough — on loaded runners the written bytes haven't always surfaced on our end of the socketpair by the time Phase A's single non-blocking read runs, so Phase A misses and the test times out in Phase C. Add a brief poll(50ms) as a deterministic hand-off signal (independent of timer resolution / scheduler jitter) before calling WriteAndPoll — it asserts the buffer is actually readable, then runs the exact same WriteAndPoll test.
An earlier attempt to always route through direct net.TCPConn mode (even under WithEngine) regressed the celeris-engine + celerismc cell catastrophically: 65k → 34k rps (-48%).
Root cause: the handler runs on the celeris HTTP engine's LockOSThread'd worker G. When that handler calls net.TCPConn.Read (direct mode), Go's netpoll parks the G on EPOLLIN. Parking a locked G triggers stoplockedm + startlockedm — the same futex-storm pathology that WriteAndPollBusy was introduced to avoid in the first place.
Revert the blanket switch: mc uses direct mode in standalone only (cfg.Engine == nil) and falls back to mini-loop + WriteAndPollBusy when WithEngine is supplied. The big +17–21% wins on the foreign-HTTP cells (matrix 13) are preserved; the celeris-engine cells return to their matrix-13 numbers (~64–69k). The shared-event-loop promise of WithEngine(srv) is honored for mc by colocating conns on the engine's worker via the mini-loop sync path, which is futex-safe for locked callers.
…ration gap
Validated on MSR1 bare metal, celeris-epoll + celerismc:
Matrix 16 baseline (inline handler + mini-loop): 64,147 rps
CELERIS_ASYNC_HANDLERS=1 (async + mini-loop): 53,877 rps (-16%)
CELERIS_ASYNC_HANDLERS=1 + CELERIS_MC_FORCE_DIRECT=1: 105,267 rps (+64%)
Context:
The celeris HTTP engine's workers are runtime.LockOSThread'd (for
SINGLE_ISSUER on io_uring, CPU affinity on epoll). Handlers run inline
on those locked worker goroutines. When a handler blocks on DB I/O:
- Inline + mini-loop: handler does unix.Poll on the locked M. P is
detached during the syscall but no unlocked Gs exist to use it, so
the P sits idle. Other FDs on this worker wait until handler
returns. Measured throughput: 64k rps (NumWorkers × 1/RTT bound).
- Async + mini-loop: handler runs on a spawned unlocked G. It blocks
in unix.Poll which still ties up an M (Go can't park G on a bare
syscall). Go spawns more Ms, context-switch overhead eats the
parallelism benefit. Regression to 54k.
- Async + direct (net.Conn.Read): handler on unlocked G reads via
Go's netpoll. netpoll parks the G efficiently — no M is blocked,
no new Ms spawned. The worker is free to service other FDs while
the G waits for EPOLLIN. Throughput jumps to 105k — BEATING every
other config in the matrix (foreign HTTP + celerismc: 88k; foreign
HTTP + gomc: 87k).
This commit lands the two env-gated knobs that demonstrated the effect:
- CELERIS_ASYNC_HANDLERS=1 on the epoll engine: dispatches HTTP1
handlers to goroutines, serialized per-conn via detachMu. Worker
returns to epoll_wait immediately after dispatching. Non-async
path unchanged — zero overhead when the flag is off.
- CELERIS_MC_FORCE_DIRECT=1 on the memcached driver: uses the direct
net.Conn path even when WithEngine(srv) is supplied. Safe only on
async-engine path (direct on a locked M would futex-storm via
netpoll's G parking).
Both are experimental and NOT production-ready:
- Error / close paths are best-effort
- No support for HTTP/2 handlers (only HTTP1 dispatched)
- H1State mutations race with concurrent dispatches (single-
connection serial clients are OK; pipelined requests are not)
- CELERIS_MC_FORCE_DIRECT without CELERIS_ASYNC_HANDLERS regresses
celeris-engine cells (direct on locked M = futex storm)
A proper implementation in v1.4.x:
- Config.AsyncHandlers as a first-class Server option
- Per-conn input buffer for pipelined requests
- Driver-side signaling so direct mode activates automatically when
the caller is on an async-dispatched G (pprof label or context key)
- Extension to io_uring engine (requires SQE hand-off from handler
goroutine back to the SINGLE_ISSUER worker)
…ch + netpoll I/O
Config.AsyncHandlers is now a first-class Server option (default: false).
When set AND the engine is epoll (or std, which is always async natively),
the engine dispatches HTTP1 handlers to spawned goroutines instead of
running them inline on the LockOSThread'd worker. Drivers opened with
WithEngine(srv) auto-detect this via eventloop.IsAsyncServer() and
switch their I/O path to match the caller's Go-runtime shape:
Caller shape | Driver I/O | Why
-------------------|-------------------------|--------------------------
Inline (locked M) | mini-loop sync/busy | net.Conn.Read on locked M
| | futex-storms via netpoll
Async (unlocked G) | direct net.TCPConn | Go netpoll parks the G
| | cleanly, no M blocked
Standalone (no | mc: direct, redis/pg: | mc direct is faster
engine) | mini-loop | standalone; redis's tiny
| | responses favor mini-loop
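The selection table above can be sketched as a single decision function — names here (`chooseIOPath`, the string results) are illustrative, not the driver's real API:

```go
package main

import "fmt"

// chooseIOPath mirrors the caller-shape table: hasEngine says whether
// the pool was opened WithEngine, asyncEngine whether that engine
// dispatches handlers to unlocked goroutines. "direct" means plain
// net.TCPConn reads via Go netpoll; "mini-loop" means the driver's
// sync/busy-poll loop.
func chooseIOPath(driver string, hasEngine, asyncEngine bool) string {
	switch {
	case hasEngine && asyncEngine:
		// Async dispatch: the caller G is unlocked, netpoll parks it cleanly.
		return "direct"
	case hasEngine:
		// Inline on a LockOSThread'd worker: net.Conn.Read would futex-storm.
		return "mini-loop"
	default:
		// Standalone: memcached is faster direct; redis/pg's tiny
		// responses favor the mini-loop's sync spin.
		if driver == "memcached" {
			return "direct"
		}
		return "mini-loop"
	}
}

func main() {
	fmt.Println(chooseIOPath("redis", true, true))        // async engine
	fmt.Println(chooseIOPath("redis", true, false))       // sync engine
	fmt.Println(chooseIOPath("memcached", false, false))  // standalone
}
```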
Implementation:
* celeris.Config: new AsyncHandlers bool (doc'd with trade-offs).
Propagates through resource.Config into engine bootstrapping.
* celeris.Server.AsyncHandlers(): honors the flag only when the engine
actually implements async dispatch (currently Epoll + Std; iouring and
adaptive return false so drivers don't hand themselves the direct
path and futex-storm on a locked worker).
* engine/epoll: Loop.async bool, set from Config.AsyncHandlers (OR'd
with CELERIS_ASYNC_HANDLERS env var for diagnostic overrides). In
drainRead, when async && HTTP1, copy the read bytes and spawn a
handler goroutine that holds cs.detachMu around ProcessH1 + inline
flush. Worker returns to epoll_wait immediately. Zero overhead on
the non-async path.
* driver/internal/eventloop: new AsyncHandlerProvider interface; new
IsAsyncServer helper that drivers call to detect the dispatch mode.
* driver/memcached: client auto-selects direct mode when Engine==nil
OR IsAsyncServer(Engine) is true.
* driver/redis: client auto-selects direct mode only when
IsAsyncServer(Engine) is true; standalone and sync-engine paths use
mini-loop (redis's tiny GET responses measurably favor mini-loop's
sync spin over net.Conn.Read + netpoll wake). Cmd pool direct;
pubsub always uses mini-loop because unsolicited push frames need
event-driven delivery.
* driver/postgres: pool tracks asyncEngine; useBusySync is disabled on
async-dispatched engines so the handler G can yield via
runtime.Gosched between Phase B polls (cheap on an unlocked G;
futex-storm on a locked M — which is why busy-path exists).
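The AsyncHandlerProvider / IsAsyncServer detection described in the bullets above can be sketched like this; the interface method name follows Server.AsyncHandlers() from the commit, but the wiring is an assumption, not the real eventloop package:

```go
package main

import "fmt"

// AsyncHandlerProvider sketches the detection interface: engines that
// dispatch handlers to spawned (unlocked) goroutines report true.
type AsyncHandlerProvider interface {
	AsyncHandlers() bool
}

// IsAsyncServer returns false for a nil engine or one that does not
// implement the interface, so drivers safely default to the sync path.
func IsAsyncServer(engine any) bool {
	p, ok := engine.(AsyncHandlerProvider)
	return ok && p.AsyncHandlers()
}

// epollEngine stands in for an engine honoring Config.AsyncHandlers.
type epollEngine struct{ async bool }

func (e *epollEngine) AsyncHandlers() bool { return e.async }

// legacyEngine stands in for an engine without async dispatch support.
type legacyEngine struct{}

func main() {
	fmt.Println(IsAsyncServer(&epollEngine{async: true})) // true
	fmt.Println(IsAsyncServer(&legacyEngine{}))           // false
	fmt.Println(IsAsyncServer(nil))                       // false
}
```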
MSR1 bare-metal validation (celeris-epoll with ASYNC=1, full matrix
re-run inflight at commit time; partial early rows):
celeris-epoll + celerisredis ~86k (matrix 16: 82.6k, +4%)
celeris-epoll + goredis ~64k (matrix 16: 24.4k, +164%)
celeris-epoll + celerispg ~52k (matrix 16: 43.7k, +19%)
celeris-epoll + pgx ~46k (matrix 16: 39.4k, +19%)
celeris-epoll + celerismc ~105k (matrix 16: 64.1k, +64%)
celeris-epoll + gomc ~89k (matrix 16: 46.2k, +95%)
celeris-epoll + celerismc at 105k is the single fastest cell in the
entire 36-config matrix — beating nethttp + celerismc (89k) by 19%
and nethttp + gomc (88k) by 20%. celeris-epoll + goredis at 64k is
a 2.6× jump — async+driver-in-netpoll rescues non-celeris drivers too.
iouring and adaptive paths: behavior unchanged (Server.AsyncHandlers()
reports false on those engines even when config is set), matching
matrix 16 numbers exactly so no regressions.
Not yet covered in this commit:
* iouring async dispatch — requires SQE hand-off from handler back to
the SINGLE_ISSUER worker. Tracked as v1.4.x follow-up.
* PG direct mode — PG startup is a multi-round protocol (SCRAM-SHA-
256 challenge, etc.) that needs driver-specific plumbing for
direct mode. PG still uses mini-loop under async but with the
yielding sync path instead of busy-poll, closing the worst of the
pre-fix regression (-14% → +19%).
Goroutine-per-conn dispatch: each HTTP1 conn buffers incoming bytes under asyncInMu and spawns a single dispatch goroutine that drains the buffer, running ProcessH1 under detachMu. ProcessH1's built-in offset loop handles pipelined requests in order; responses land on writeBuf in request order before the flush.

Previously (epoll) spawned a goroutine per read-batch, which let pipelined bursts race on detachMu. Now the per-conn invariant is enforced by asyncRun — only one goroutine alive per conn at a time, and the next batch only spawns after the previous cleared asyncRun under the same mutex that guards the input buffer.

io_uring: async dispatch now works under SINGLE_ISSUER by reusing the existing detachMu + detachQueue + eventfd machinery. Handler-goroutine writes land in writeBuf; after ProcessH1, the FD is queued on detachQueue, the worker picks it up via the eventfd wakeup and submits SEND SQEs from its own thread. closeConn now signals asyncClosed.Store(true) so the dispatch goroutine exits at its next iteration. Server.AsyncHandlers() now returns true for IOUring too.

Cross-cut fixes so detachMu != nil no longer implies "truly detached": writeCap / sendCap / timeout-scan / shutdown / closeConn now gate on h1State.Detached, which is only set when OnDetach fires (WS/SSE). Async-mode conns pre-allocate detachMu in acquireConnState without triggering those branches.

Adds test/integration/pipeline_test.go exercising ordering under both AsyncHandlers=true and false.
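The one-dispatcher-per-conn invariant can be sketched as below — a minimal model, assuming illustrative names (`conn`, `feed`, `dispatch`), where the run flag is cleared under the same mutex that guards the input buffer:

```go
package main

import (
	"fmt"
	"sync"
)

// conn models the per-conn invariant: at most one dispatch goroutine
// alive per connection, guarded by the same mutex as the input buffer.
type conn struct {
	mu       sync.Mutex
	inBuf    []byte
	running  bool
	wg       sync.WaitGroup
	consumed []byte // what the dispatcher processed, in order
}

// feed is the worker side: append bytes, then spawn the dispatcher
// only if one is not already running.
func (c *conn) feed(b []byte) {
	c.mu.Lock()
	c.inBuf = append(c.inBuf, b...)
	if !c.running {
		c.running = true
		c.wg.Add(1)
		go c.dispatch()
	}
	c.mu.Unlock()
}

// dispatch drains the buffer; it clears running under the mutex before
// exiting, so the next feed can safely spawn a successor.
func (c *conn) dispatch() {
	defer c.wg.Done()
	for {
		c.mu.Lock()
		if len(c.inBuf) == 0 {
			c.running = false
			c.mu.Unlock()
			return
		}
		batch := c.inBuf
		c.inBuf = nil
		c.mu.Unlock()
		c.consumed = append(c.consumed, batch...) // stand-in for ProcessH1
	}
}

func main() {
	c := &conn{}
	c.feed([]byte("GET /1 "))
	c.feed([]byte("GET /2 "))
	c.wg.Wait()
	fmt.Printf("%s\n", c.consumed)
}
```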
… engine

Symmetric to memcached/redis direct mode. Standalone pools and pools opened WithEngine on an async engine now dial *net.TCPConn directly and drive reads from the caller goroutine via Go's netpoll — no mini-loop involvement, no LockOSThread, no futex storm.
- writeRaw(data) uniform helper: tcp.Write under directMu or loop.Write via the mini-loop. All 15 c.loop.Write(c.fd, ...) call sites use it.
- driveDirect(ctx, req): tight tcp.Read → onRecv loop until req.doneAtom fires or ctx cancels. Includes a non-blocking MSG_DONTWAIT peek so loopback-fast responses skip the netpoll G-park wakeup.
- waitForQueryRows direct branch: buffers rows (syncMode pinned so dispatch never promotes to streaming, which would deadlock on the caller goroutine).
- dialDirectConn: TCP dial, SetNoDelay, SyscallConn-captured fd for peek, doStartup runs synchronously via the new drive path.
- Close: direct-mode closes via tcp.Close with bounded write deadline; loop-mode path unchanged.
- COPY FROM/TO guarded: ErrDirectModeUnsupported surfaces when called on a direct-mode conn. CopyInResponse / CopyOutResponse require event-loop-driven unsolicited delivery, which has no reader in the direct model.
- Pool.dial routes to dialDirectConn when !hasEngine OR asyncEngine, matching the rule memcached uses today.
One non-blocking syscall.Recvfrom(MSG_DONTWAIT) before each tcp.Read in execDirect / execManyDirect (redis) and execTextDirect / execBinaryDirect / execBinaryMultiDirect (memcached). Loopback-fast responses (10-byte GET, +PONG, small memcached VALUEs) land in the recv buffer before tcp.Read is called; the peek catches them with a single syscall and skips the ~1-2µs netpoll G-park wakeup.

One peek per iteration (not a tight spin) — repeated MSG_DONTWAIT would re-introduce the P-hogging regression that a bounded spin avoids. Fd cached at dial time via SyscallConn().Control.
Post-review follow-up: close remaining integration gaps (W1-W4)
Three additional commits land on top of
What this closes
Verification
…hold cs

Data race detected by -race in TestHTTP1PipeliningAsync/async: worker's closeConn -> releaseConnState was resetting cs fields concurrently with the async dispatch goroutine's asyncInBuf/asyncRun/asyncClosed writes.

Root cause: my earlier split between 'detached' (detachMu != nil) and 'trulyDetached' (h1State.Detached) let releaseConnState run for async-mode conns even though their dispatch goroutine still held a cs reference. Restore the original invariant — any goroutine-holding conn skips the pool return. GC collects cs once the goroutine exits.

CloseH1 gating (trulyDetached) is kept: async-mode conns still own H1 state because no middleware goroutine is holding it open past Detach. Only the release path now uses the broader 'detached' flag.
Earlier W2 commit guarded COPY with ErrDirectModeUnsupported because direct mode has no event-loop goroutine driving onRecv — copyReady / doneCh would never fire. That regressed 5 conformance tests.

Fix: spawn a short-lived reader goroutine (startDirectReader) for the duration of each copy operation. The reader pumps tcp.Read → onRecv with a 50ms read deadline so it periodically checks the stop channel; the caller goroutine remains the sole writer of CopyData frames (tcp.Write concurrent with tcp.Read on another goroutine is safe).

Final wait in copyFrom / copyTo uses select on doneCh/ctx.Done in direct mode instead of c.wait — c.wait's driveDirect would spawn a second concurrent tcp.Read and race the reader goroutine. The background reader also fails the request chain via c.failAll on unexpected EOF, so transport errors surface cleanly through doneCh rather than hanging the caller.
…atch

pprof on msr1 (aarch64, celeris-epoll+celerisredis ASYNC=1) showed 'go l.runAsyncHandler(cs)' at 450ms / 13.82s CPU = 3.3% — every request was re-spawning the dispatch goroutine. On keep-alive load with per-conn request gaps, asyncInBuf would drain, asyncRun went false, the goroutine exited, and the next read spawned a fresh one.

Fix: add sync.Cond so the dispatch goroutine parks on asyncCond.Wait when asyncInBuf is empty rather than exiting. Worker signals after each append. Goroutine lives until closeConn broadcasts via asyncClosed + Cond.Broadcast.

Also double-buffer asyncInBuf/asyncOutBuf so the goroutine's swap on pickup doesn't force the worker to re-allocate on the next append. Drops the dataCopy intermediate (was one heap alloc per request) — the worker now appends cs.buf bytes directly into asyncInBuf, and the goroutine swaps out before ProcessH1 (dropping the cs.buf aliasing risk).

MSR1 matrix impact (aarch64, 12c, 256 conns, ASYNC=1):

| cell                           | before  | after   | delta  |
|--------------------------------|---------|---------|--------|
| celeris-epoll + celerisredis   | 85,134  | 100,322 | +17.8% |
| celeris-iouring + celerisredis | 99,962  | 115,981 | +16.0% |
| celeris-epoll + goredis        | 65,082  | 76,358  | +17.3% |
| celeris-iouring + goredis      | 80,295  | 96,041  | +19.6% |
| celeris-epoll + celerispg      | 55,361  | 61,740  | +11.5% |
| celeris-iouring + celerispg    | 58,745  | 66,411  | +13.1% |
| celeris-epoll + celerismc      | 103,467 | 108,147 | +4.5%  |
| celeris-iouring + celerismc    | 112,865 | 118,668 | +5.1%  |

celeris-epoll + celerisredis: now +9.2% vs nethttp+celerisredis (was -6.1% before). The epoll-async redis gap is closed. New matrix leader: celeris-iouring + celerismc = 118,668 rps (+35% vs nethttp+celerismc = 87,876).

Validated on msr1 (Linux 6.6.10-cix, aarch64): 62/62 packages pass -race, all spec/conformance suites pass (pgspec/redisspec/mcspec + conformance/postgres/redis/memcached + H1 RFC 9112).
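The parked-dispatcher reuse can be sketched with a sync.Cond as below — a minimal model with illustrative names (`dispatcher`, `feed`, `close`), draining any pending input before honoring close:

```go
package main

import (
	"fmt"
	"sync"
)

// dispatcher models the reuse fix: instead of exiting when the input
// buffer drains, the goroutine parks on a sync.Cond and the worker
// signals after each append. One goroutine per conn for its lifetime.
type dispatcher struct {
	mu     sync.Mutex
	cond   *sync.Cond
	inBuf  []byte
	closed bool
	out    []byte // stand-in for processed output
	done   chan struct{}
}

func newDispatcher() *dispatcher {
	d := &dispatcher{done: make(chan struct{})}
	d.cond = sync.NewCond(&d.mu)
	go d.run()
	return d
}

// feed is the worker side: append, then signal the parked goroutine.
func (d *dispatcher) feed(b []byte) {
	d.mu.Lock()
	d.inBuf = append(d.inBuf, b...)
	d.mu.Unlock()
	d.cond.Signal()
}

// close mirrors closeConn: set the flag, broadcast so the parked
// goroutine observes it, then wait for the goroutine to exit.
func (d *dispatcher) close() {
	d.mu.Lock()
	d.closed = true
	d.mu.Unlock()
	d.cond.Broadcast()
	<-d.done
}

func (d *dispatcher) run() {
	defer close(d.done)
	for {
		d.mu.Lock()
		for len(d.inBuf) == 0 && !d.closed {
			d.cond.Wait() // park instead of exiting; no respawn cost
		}
		if len(d.inBuf) == 0 { // closed and fully drained
			d.mu.Unlock()
			return
		}
		batch := d.inBuf
		d.inBuf = nil
		d.mu.Unlock()
		d.out = append(d.out, batch...) // stand-in for ProcessH1
	}
}

func main() {
	d := newDispatcher()
	d.feed([]byte("req1 "))
	d.feed([]byte("req2 "))
	d.close()
	fmt.Printf("%s\n", d.out)
}
```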
doExtendedQuery sent Parse(if first)+Bind+Describe+Execute+Sync for every extended query. For cached prepared statements (autoCache + stmtCache hit), the row description is already known from the initial prepare, so the portal Describe is redundant — the server still returns RowDescription each call, costing 7 bytes on the wire and one protocol-state transition per query.

Pass cached columns through the call chain (QueryContext -> doExtendedQuery), pre-populate req.columns + req.extended.Columns, set HasDescribe=false so the state machine transitions BindComplete -> ExecuteResult directly. The ExtendedQueryState machine already supported HasDescribe=false; no state-machine change needed.

FormatCode fixup: prepare-time Describe returns FormatCode=0 (text) because Postgres doesn't decide the output encoding until Execute receives the resultFormats vector. We pass [FormatBinary] in the Execute, so we shallow-copy the cached ColumnDesc slice and overwrite FormatCode to FormatBinary — keeps decode on the fast binary path and leaves the stmtCache's slice pristine for reuse.

Measured impact is small (net +~1% RPS, noise-level on MSR1) — the 27% CPU in tcp.Write on the hot PG cell is syscall fixed cost, not per-byte. Keeping the change for correctness and a slightly smaller wire footprint; it saves one server state transition per query, which reduces tail latency on PG cells (p50 3645µs -> 3606µs). All pgspec/conformance suites pass on Linux (msr1, Postgres 16).
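The FormatCode fixup — shallow-copy the cached slice, override the field, leave the cache pristine — can be sketched as follows; `ColumnDesc` here is a minimal stand-in, not the driver's real type:

```go
package main

import "fmt"

// ColumnDesc is a minimal stand-in for cached column metadata; only
// FormatCode matters for this sketch (0 = text, 1 = binary).
type ColumnDesc struct {
	Name       string
	FormatCode int16
}

const FormatBinary int16 = 1

// binaryColumns shallow-copies the cached slice and overwrites
// FormatCode, keeping the stmtCache's slice pristine for reuse while
// the per-query copy decodes on the binary fast path.
func binaryColumns(cached []ColumnDesc) []ColumnDesc {
	cols := make([]ColumnDesc, len(cached))
	copy(cols, cached)
	for i := range cols {
		cols[i].FormatCode = FormatBinary
	}
	return cols
}

func main() {
	// Prepare-time Describe returns text format (FormatCode=0).
	cached := []ColumnDesc{{Name: "id"}, {Name: "name"}}
	cols := binaryColumns(cached)
	// The copy is binary; the cached slice is untouched.
	fmt.Println(cols[0].FormatCode, cached[0].FormatCode) // 1 0
}
```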
Two regression tests for the async dispatch path:
TestAsyncHandlerGoroutineReuse — spins up the engine with
AsyncHandlers=true, opens one keep-alive conn, sends 100 serial
requests with 500µs idle gaps between them, and asserts that the
runtime goroutine count only grew by ≤5 between first-request
baseline and last request. Before the sync.Cond reuse fix, each
idle-then-resume batch respawned the dispatch goroutine, which
would drive that delta well above tolerance.
TestAsyncHandlerCloseWakesGoroutine — opens one conn, sends one
request, closes. Asserts that within 2s the goroutine count
returns to within +2 of the pre-test baseline. This exercises
closeConn's asyncCond.Broadcast path — without it, the parked
dispatch goroutine would leak until GC finalized the connState,
which can be tens of seconds under a busy test suite.
Also cleaned up three leftover references to internal plan jargon
("W4") in public struct comments — replaced with descriptive text.
Updated runAsyncHandler's doc comment to reflect the reuse model
(the stale comment still said "exit when buffer empty" from the
pre-Cond implementation).
Post-review updates (commits since the last summary at 55967ef)
Seven additional commits landed after the initial W1-W4 summary. Chronological:
Measured impact on MS-R1 (aarch64, 12c, 256 conns, 8s per cell, post-fix clean run)
Driver cells under
The epoll + celerisredis cell flipped from -6.1% vs
Async vs sync — full matrix context
A broader async-vs-sync comparison across pure-CPU handlers also landed. Honest finding: async and sync are complementary, not substitutes.
Removing sync would regress plaintext from 428k → 288k rps on
Tracked as #239 — v1.5.0 spike for per-route
Known issue (follow-up tracked)
CI status
41/41 checks green on
Adaptive was excluded from Server.AsyncHandlers() out of a worry that the engine's hot-swap between epoll and iouring could invalidate direct-mode driver conns mid-flight. That concern was based on a false premise — direct-mode drivers don't register FDs with the engine (they dial net.TCPConn and drive reads on the caller goroutine via Go netpoll). adaptive.performSwitch already refuses to switch while any driver-registered FDs exist, and direct-mode drivers contribute zero, so a switch is a no-op for them.

What the old gate actually did: Config.AsyncHandlers=true enabled the async dispatch path in the engine (both epoll and iouring workers honor it), but Server.AsyncHandlers() returned false for Adaptive, so drivers opened WithEngine(srv) saw IsAsyncServer=false and picked the mini-loop busy-poll sync path. Handlers ran on unlocked spawned Gs, and 256 concurrent busy-poll Gs starved CPU — which regressed 3 celeris-native driver cells on the matrix (celerisredis -7.2%, celerispg -4.5%, celerismc -11.0% vs ASYNC=0).

Flipping the gate makes drivers pick their direct-mode path, same as they do on pure epoll or iouring. Direct-mode drivers go through Go netpoll and are engine-agnostic — they keep working regardless of which sub-engine is active, and neither participate in nor block a switch.

MS-R1 impact (aarch64, 12c, 256 conns, 8s per cell):

celeris-adaptive + celerisredis ASYNC=1: 64,206 -> 77,592 (+20.8%)
celeris-adaptive + celerispg    ASYNC=1: 41,864 -> 63,217 (+51.0%)
celeris-adaptive + celerismc    ASYNC=1: 54,959 -> 89,912 (+63.6%)

Adaptive now matches celeris-epoll's async numbers within a few percent across the driver matrix.
The test failed intermittently on ubuntu-latest with the message "WriteAndPoll returned ok=false; sync fast path not engaged".

Root cause: the worker goroutine's epoll_wait could observe the POLLIN edge from the pre-staged peer write and consume the 4 bytes via handleReadable before WriteAndPoll took recvMu and masked EPOLLIN. handleReadable called the registered onRecv (a no-op in the old test), leaving WriteAndPoll's phases all returning EAGAIN.

The race is real and legitimate — the worker consuming data on an EPOLLIN edge before the caller's WriteAndPoll arrives is expected behavior, not a bug. What the test is actually asserting is that the data round-trips correctly under the sync fast-path design, not that Phase A specifically wins every race.

Change the RegisterConn callback from a discard to an append into the same buffer WriteAndPoll would populate, so both paths (worker consumed OR WriteAndPoll consumed) deliver into `got`. Poll for "pong" for up to 100ms after WriteAndPoll returns so the worker-dispatch case has time to complete.

20/20 passes on Linux aarch64 under -race after the fix.
Addresses the issues from the honest review:

1. #240 — Panic recover in async dispatch goroutine. runAsyncHandler in epoll + iouring now wraps its loop body in defer recover(). A panicking user handler no longer crashes the entire server. Logs the stack trace, marks the conn closed, and force-closes the fd so the worker's close path tears down state from its own goroutine. Symmetric to routerAdapter's sync-path safety net.

2. asyncInBuf DoS cap (maxPendingInputBytes = 4 MiB). A client pipelining requests faster than the dispatch goroutine can drain them would otherwise grow asyncInBuf without bound. Symmetric with the existing maxPendingBytes cap on the output side; drainRead closes the conn when the append would exceed the cap. Applied in both epoll and iouring.

3. #241 — PG direct-mode COPY cancel no longer orphans the request. copyFrom / copyTo direct-mode paths now route final-wait ctx.Done through awaitDirectWithCancel, which sends CancelRequest and waits bounded (30s) for the server's Error+RFQ before returning. Without this, a canceled COPY left req in the pending queue and the next query would pop it — wire-format desync.

4. Direct-mode result buffer cap (maxDirectResultBytes = 64 MiB). Direct mode pins syncMode=true so streaming never promotes. A huge SELECT would buffer every row in req.rowSlab. Now fails with ErrResultTooBig once accumulated bytes cross the cap; the caller gets a typed error with actionable remediation (paginate with LIMIT/OFFSET or use a non-async pool for streaming).

5. PG LISTEN/UNLISTEN/NOTIFY guarded in direct mode. Direct-mode conns have no background reader between queries, so NotificationResponse messages would be silently dropped. simpleQuery / simpleExec / simpleExecNoTag now detect these statements and return ErrDirectModeUnsupported with a clear workaround hint. Added isListenOrUnlisten() helper for prefix detection with whitespace + comment skip.

6. H2 + AsyncHandlers: documentation-level warning at engine start. When both flags are set, engines now log that async dispatch is HTTP/1.1-only; H2 conns still run inline on the worker. No behavior change, but surfaces the limitation instead of a silent per-conn-type inconsistency.

7. Backpressure end-to-end test skeleton (skipped by default). Loopback TCP auto-tuning on Linux makes deterministically exercising the maxPendingBytes path hard without sysctl tuning. The test body documents the shape and runs under GOTEST_BACKPRESSURE=1. The defensive paths are exercised in production via Autobahn 9.1.6 (WS 16 MiB frames through the detached 64 MiB cap).

All 62 packages pass go test -race on Linux aarch64 (msr1).
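The recover wrapper from item 1 can be sketched as below — a minimal model where `safeDispatch` and `teardown` are illustrative names standing in for runAsyncHandler's loop body and the asyncClosed + detachQueue + eventfd signalling path:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// safeDispatch wraps a handler invocation in recover: a panicking
// user handler is logged with its stack and the connection is torn
// down, instead of crashing the whole server.
func safeDispatch(handler func(), teardown func()) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
			fmt.Printf("handler panic: %v\n%s", r, debug.Stack())
			teardown() // worker closes the fd from its own goroutine
		}
	}()
	handler()
	return false
}

func main() {
	torn := false
	ok := safeDispatch(func() { panic("user handler bug") }, func() { torn = true })
	fmt.Println("panicked:", ok, "conn torn down:", torn)
}
```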
startTestEngine waits for workers to be ready before returning. The 3s deadline was tight; on GitHub Actions' shared Azure VMs io_uring ring setup (NewRingCPU, SINGLE_ISSUER init, NUMA bind, SQPOLL thread creation) can legitimately take 3-5s when the host is loaded. The older flake showed 'engine did not start in time' as a single-run false-positive that passed on rerun. 15s covers the tail without hiding real failures — if a worker actually fails to initialize, it'll show up immediately via the error channel, not by timing out on the startup check.
1. PG Describe-skip: the hasDescribe gate was `len(cachedCols) == 0`, which treated zero-column prepared statements as "no cache" and defeated the optimization. Changed to `cachedCols == nil`.

2. iouring drainDetachQueue now checks asyncClosed and calls closeConn. The error/panic path in runAsyncHandler already enqueues cs + sets asyncClosed, but drainDetachQueue only called markDirty — the FD/connState stayed zombie until the next handleRecv. Matching fix applied in epoll's drainDetachQueue.

3. epoll async error + panic paths no longer call unix.Close from the dispatch goroutine. That raced with the worker's drainRead holding l.conns[fd]. Replaced with asyncClosed + detachQueue enqueue + eventfd signal; the worker's drainDetachQueue picks up the teardown on its own goroutine.

4. Graceful shutdown now joins dispatch goroutines. Added asyncWG sync.WaitGroup on both Loop (epoll) and Worker (iouring); Add on spawn, Done via defer in runAsyncHandler, Wait at the tail of the engine shutdown. Prevents dispatch Gs from touching connState after the engine claims to have stopped.

5. CVE-2023-44487 Rapid Reset mitigated in H2. The processor now tracks the RST_STREAM count in a sliding one-second window; a sustained burst > rstBurstMax (200) triggers GOAWAY with ENHANCE_YOUR_CALM and closes the connection. Honest clients reset a handful of streams per second; 200/s is well above legitimate patterns and well below the thousands/s needed to amplify the attack.

6. H1 MaxHeaderSize reduced 16 MiB -> 64 KiB (nginx-class default), and new MaxHeaderCount = 200 rejects the thousands-of-tiny-headers DoS that would stay under the byte cap. 64 KiB covers verbose proxy chains; 16 MiB was a slow-loris amplifier. New sentinel ErrTooManyHeaders; existing ErrHeadersTooLarge unchanged.
H2 hardening:
- HPACK decoder now enforces SetMaxStringLength(64 KiB), matching H1 MaxHeaderSize. Prevents a single HEADERS frame from growing the decode target unboundedly.
- Framer initial max read size 16 KiB (RFC 9113 default) instead of hard-coded 1 MiB; new Parser.SetMaxReadFrameSize method so the processor can apply the negotiated SETTINGS_MAX_FRAME_SIZE.
- PRIORITY frames rejected when the stream ID is >2048 past the last client stream — prevents unbounded priority-tree growth via phantom-stream flood.

Redis driver:
- Cluster refreshTopology now fully resets slots/replicas maps at the start of every refresh. Previously the per-range reset left stale replicas for slots that dropped out, and replica appends accumulated across overlapping range entries (resharding window).
- MOVED redirect on attempt > 0 now also refreshes topology (was skipped), eliminating the MOVED-loop-until-background-tick bug.
- Cluster redirect loop bounds changed from an attempt-based range to an explicit maxAttempts = MaxRedirects+1 so the documented redirect count is honored (was off by one).
- Sentinel reconnect no longer appends to sentinelConns unboundedly; stale entries are closed and the slice is capped at one entry.

Postgres driver:
- ParseRowDescriptionInto rejects column count > 1600 (PG's own MaxHeapAttributeNumber). Previously a server-supplied int16 up to 32767 forced multi-MB allocations per RowDescription.
- dropPreparedAsync tracked via pgConn.closeWG; Close() joins the WaitGroup so the background DEALLOCATE G cannot outlive the conn. Early-exit if c.closed is already set.

Redis RESP:
- readBulk's dead rewind removed — Next() already rewinds to pre-tag on ErrIncomplete; readBulk's post-tag rewind was redundant and its comment was misleading.

Server lifecycle:
- StartWithContext / StartWithListenerAndContext no longer leak the shutdown-watcher goroutine on Listen error. Added a listenDone chan the main flow closes; the watcher selects on ctx.Done || listenDone.
- Engine.Shutdown docs for epoll/iouring now correctly state that shutdown is ctx-driven (context cancel → Listen returns → worker shutdown runs asyncWG.Wait), not something Shutdown() does itself.
Summary of defensive hardening and polish swept in the final v1.4.0 pre-tag pass.

Postgres driver
- Defer dropPreparedAsync goroutines through closeWG so Close() waits for background DEALLOCATEs rather than racing with pgConn teardown.
- Replace `len(cachedCols) == 0` with a nil-check in doExtendedQuery so cached-but-empty RowDescriptions correctly skip the Describe step.
- Bound all bare <-req.doneCh waits in COPY error paths with a 30s awaitDoneBounded closure to prevent hung CopyIn/CopyOut unwinds.
- Case-insensitive, word-boundary SQL keyword detection for isListenOrUnlisten + isCacheableQuery (hasKeywordPrefix helper) to avoid false positives on column names like `selected`.
- Move ErrDirectModeUnsupported from pool.go to errors.go alongside the other exported sentinels; prefix all scan convertTo errors with "celeris-postgres: scan: " for consistency.
- dsn.go now warns to stderr on sslmode=prefer/allow (previously a silent downgrade to plaintext) so operators see the change.
- protocol/scram.go: enforce the RFC 7677 minimum iteration count (4096) and zero saltedPassword/authMessage/clientFirst/serverFirst/serverKey/serverSig/password after handleServerFinal.
- protocol/query.go: reject RowDescription with >1600 columns (PG's MAX_TUPLE_ATTR) to guard against malformed server input.

Redis driver
- cluster.refreshTopology now fully resets slots/replicas at start; MOVED always refreshes topology (not only attempt==0) so stale routes don't persist through redirect storms.
- Explicit maxAttempts = MaxRedirects+1 loop replaces the range form after removing the attempt-gated refresh branch.
- sentinel.subscribeLoop closes stale sentinelConns and caps the slice to a single live entry to prevent conn leaks on reconnect.
- commands.asStringSlice/asStringMap return ErrNil on TyNull (previously nil, which callers couldn't distinguish from "empty").
- protocol/resp.go: drop dead rewind code in readBulk.

Memcached driver
- protocol/text.parseUint overflow check uses (maxU64-digit)/10 to detect the last-digit overflow case without false negatives.

Error-prefix consistency
- Normalize all user-facing error prefixes in driver/redis and driver/memcached from the "celeris/redis:" / "celeris/memcached:" slash form to the "celeris-redis:" / "celeris-memcached:" hyphen form matching driver/postgres. No test strings assert on the old prefix. Internal packages (async/pool, eventloop) keep the slash form since they're not user-facing.

Engine async dispatch
- engine/epoll/loop.go: runAsyncHandler now panic-recovers and signals the worker via detachQueue + eventfd instead of calling unix.Close(cs.fd) from the handler goroutine (cross-thread FD close was racy against io_uring SQE submission).
- Graceful shutdown awaits asyncWG so detached handler goroutines finish before Shutdown() returns.

Config / server
- ReadTimeout and WriteTimeout defaults: 300s → 60s (slow-loris hardening; matches nginx client_header_timeout / client_body_timeout).
- Validate() flags a Listener + explicit Addr conflict, but only when Addr has a concrete non-zero port — `:0` stays valid since callers intentionally delegate port selection to the pre-bound listener.
- server.go: Version constant bumped to "1.4.0" (was stuck at "1.3.4").

celeristest
- WithCookie godoc now explicitly notes no escaping of semicolons or CR/LF in the value; tests needing malformed cookie headers should use WithHeader directly.

Test fixes
- resource/config_test.go TestWithDefaults now expects the 60s ReadTimeout/WriteTimeout matching the new defaults.
Summary
v1.4.0 ships five workstreams as a single coherent release:
database/sql driver, direct Pool, streaming rows, SCRAM-SHA-256, COPY FROM/TO (feat: PostgreSQL wire protocol message framing #123 – test: PostgreSQL driver conformance + benchmark vs pgx #131). Plus:
- pgspec/redisspec/mcspec — protocol-compliance suites against real servers.
- Config.AsyncHandlers — opt-in async dispatch so third-party drivers (goredis, pgx, gomemcache) don't futex-storm the LockOSThread'd workers.

MS-R1 matrix — headline numbers
Captured 2026-04-19 on MS-R1 (CIX CP8180, 12-core aarch64, Linux 6.6.10-cix, Postgres 15, Redis 7.2, Memcached 1.6.16).
loadgen -connections 256 -duration 8s.

HTTP-layer (pure-CPU handlers, no driver)
celeris-iouring sync wins plaintext (+37% over fiber, +131% over the other four competitors). fiber wins the /chain cell (314k vs 269k) because its zero-middleware routing path is tight; celeris's chain includes recovery + logger + requestid + cors + timeout which is closer to a "realistic production" stack.
Driver-layer (celeris HTTP + DB round-trip)
Best celeris cell vs best competitor for the same DB:
Matrix leader:
celeris-iouring + celerismc + AsyncHandlers=true = 101,902 rps — +33.5% over the best nethttp combination.

AsyncHandlers — when to use which
Config.AsyncHandlers: true)

Guidance: if your handler touches a DB or cache via any Go driver (third-party or celeris-native), set
AsyncHandlers: true. If your handler is pure-CPU (plaintext, JSON from preallocated data, pure computation), leave the default. Per-route control is tracked in #239 (v1.5.0 spike).

One caveat:
celeris-adaptive with AsyncHandlers=true currently regresses 3 cells (celerisredis -7.2%, celerispg -4.5%, celerismc -11.0%) — the engine-side async flag + driver-side sync-path mismatch on Adaptive. Actively being debugged; fix coming before the v1.4.0 merge or punted to v1.4.1.

What's included
- driver/, engine/, internal/conn/, test/, and .github/workflows/.
- doc.go for every new package + runnable Example functions.
- test/integration/pipeline_test.go).
- engine/epoll/async_reuse_test.go).

Scope decisions
sslmode=require / rediss:// are rejected with actionable error messages. For managed cloud DB services (RDS, CloudSQL, ElastiCache) TLS is required — this release is for VPC/loopback deployments until TLS lands. ClusterTx is shipped.

Test plan
- go test ./... -race on darwin (62/62 packages)
- go test ./... -race on Linux aarch64 (MS-R1) — 62/62 packages
- AsyncHandlers=false and true