The memory usage percentage panel divided mygramdb_memory_used_bytes by mygramdb_memory_system_total_bytes without reconciling the differing label sets: the numerator has a type="total" label that the denominator lacks, causing the Prometheus query engine to return no samples. Use `ignoring(type)` on the division so the two vectors match on the remaining labels.
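The fix can be illustrated with the panel expression before and after (the metric names come from the text; the exact surrounding expression in the dashboard is an assumption):

```promql
# Before: the numerator carries type="total" but the denominator has no
# type label, so vector matching finds no pairs and the panel shows no data.
mygramdb_memory_used_bytes{type="total"} / mygramdb_memory_system_total_bytes

# After: ignore the extra label during matching so the remaining labels align.
mygramdb_memory_used_bytes{type="total"}
  / ignoring(type) mygramdb_memory_system_total_bytes * 100
```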
Replace the blocking one-thread-per-connection model with an event-driven
reactor that uses epoll (Linux) or kqueue (macOS) for readiness
notification and a bounded per-connection write queue with inline
fast-path plus EPOLLOUT fallback for output.
The blocking path stays available behind api.tcp.io_model="blocking" so
the feature flag can gate rollout, but the default is now "reactor".
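A rough sketch of what such a platform-agnostic readiness interface looks like (illustrative only — the real EventMultiplexer's names and signatures may differ):

```cpp
#include <cstdint>
#include <vector>

namespace reactor {

// Readiness flags delivered by the backend (epoll or kqueue).
enum Event : uint32_t { kReadable = 1, kWritable = 2, kError = 4, kHangup = 8 };

struct ReadyEvent {
  int fd;
  uint32_t events;  // bitmask of Event flags
};

// Abstract interface the event loop programs against; epoll_multiplexer
// and kqueue_multiplexer would each implement it for their platform.
class EventMultiplexer {
 public:
  virtual ~EventMultiplexer() = default;
  virtual bool Add(int fd, uint32_t interest) = 0;     // register a new fd
  virtual bool Modify(int fd, uint32_t interest) = 0;  // e.g. arm/disarm write
  virtual bool Remove(int fd) = 0;
  // Blocks up to timeout_ms; fills `out` and returns the event count.
  virtual int Poll(std::vector<ReadyEvent>& out, int timeout_ms) = 0;
};

}  // namespace reactor
```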
Core components
- src/server/reactor/event_multiplexer.{h,cpp}: platform-agnostic
interface for readiness events
- src/server/reactor/epoll_multiplexer.{h,cpp}: level-triggered epoll
implementation (Linux)
- src/server/reactor/kqueue_multiplexer.{h,cpp}: kqueue implementation
(macOS/BSD)
- src/server/io_reactor.{h,cpp}: single-threaded event loop owning the
multiplexer, connection registration, arm/disarm-write, graceful
shutdown, and per-fd close callbacks
- src/server/reactor_connection.{h,cpp}: per-connection state with a
drain-task pattern, non-blocking write queue (std::deque plus
front_offset tracking for partial head sends), the "clear-then-recheck"
reschedule idiom, and per-connection max_write_queue_bytes cap for
slow-reader backpressure
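The write-queue bookkeeping described above can be sketched as a simplified standalone model (not the actual ReactorConnection code; the cap check and the writer signature are assumptions for illustration):

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <string>

// Sketch of the partial-send bookkeeping: a deque of pending responses plus
// a front_offset into the head entry, so a short write() resumes mid-buffer
// instead of re-sending bytes already accepted by the kernel.
class WriteQueue {
 public:
  // Enforces the per-connection byte cap (slow-reader backpressure).
  bool Enqueue(std::string payload, std::size_t max_bytes) {
    if (queued_bytes_ + payload.size() > max_bytes) return false;
    queued_bytes_ += payload.size();
    queue_.push_back(std::move(payload));
    return true;
  }

  // `writer` mimics send(): returns bytes accepted, or -1 for EWOULDBLOCK.
  // Returns true once the queue fully drained; false means the caller
  // should arm EPOLLOUT and retry when the socket becomes writable.
  bool Drain(const std::function<long(const char*, std::size_t)>& writer) {
    while (!queue_.empty()) {
      const std::string& head = queue_.front();
      long n = writer(head.data() + front_offset_, head.size() - front_offset_);
      if (n < 0) return false;  // kernel buffer full
      front_offset_ += static_cast<std::size_t>(n);
      queued_bytes_ -= static_cast<std::size_t>(n);
      if (front_offset_ == head.size()) {  // head fully sent
        queue_.pop_front();
        front_offset_ = 0;
      }
    }
    return true;
  }

  std::size_t queued_bytes() const { return queued_bytes_; }

 private:
  std::deque<std::string> queue_;
  std::size_t front_offset_ = 0;  // bytes of queue_.front() already sent
  std::size_t queued_bytes_ = 0;
};
```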
Integration
- ConnectionAcceptor now dispatches accepted fds to a reactor handler
when io_model="reactor"; the existing connection_handler_ callback
stays in place for the blocking fallback (to be removed later after
an observation window)
- TcpServer wires IoReactor lifecycle into Start/Stop, propagates
max_write_queue_bytes, and increments/decrements active_connections
from the reactor accept and close callbacks so ServerStats remains
accurate under the new default
- ServerLifecycleManager drains the reactor ahead of the thread pool on
shutdown
- config.h/config.cpp/config-schema.json/config_help.cpp/
runtime_variable_manager.cpp add api.tcp.max_write_queue_bytes
(default 16 MiB, immutable at runtime) and flip io_model default to
"reactor"
- src/utils/error.h adds 6016-6023 for reactor init/poll/register/modify/
remove/queue-full/already-open failure modes
Thread pool auto-size floor revert
- src/server/thread_pool.cpp: drop the emergency hw*4/64-worker floor
that was added to mitigate blocking-mode starvation. With the reactor
default a single event-loop thread plus a small worker pool can serve
thousands of persistent connections, so the original max(hw*2, 4)
formula is restored.
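The restored sizing formula amounts to the following (a minimal sketch, assuming `hw` in the text means `std::thread::hardware_concurrency()`):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>

// Restored auto-size formula: two workers per hardware thread with a
// floor of four, replacing the emergency hw*4 / 64-worker floor.
std::size_t DefaultWorkerCount() {
  const std::size_t hw = std::thread::hardware_concurrency();  // may be 0 if unknown
  return std::max<std::size_t>(hw * 2, 4);
}
```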
Unit tests
- tests/server/reactor/event_multiplexer_test.cpp with MockEventMultiplexer:
register/modify/remove/poll semantics against a deterministic fake so
the reactor's event-loop logic is covered without touching real fds.
- tests/server/io_reactor_test.cpp: start/stop lifecycle, Register after
Stop is rejected, ArmWrite before Register returns the right error,
close-callback fires exactly once per fd.
- tests/server/reactor_connection_test.cpp: drain-task "clear-then-recheck"
reschedule, EnqueueResponse cap enforcement, partial-send front_offset
tracking, write_armed transitions, close during in-flight drain.
Integration tests
- tests/integration/server/reactor_integration_test.cpp: end-to-end
happy-path (SEARCH/INFO), many concurrent clients, and two new tests:
* WriteBackpressureHandledGracefully: small cap (64 KiB) plus a
slow reader (SO_RCVBUF=4096) must log reactor_write_queue_overflow
and force-close the slow client without stalling four parallel
fast clients.
* ManyIdleConnectionsDoNotBlockActiveClient: thread pool kWorkers=8
plus thousands of idle persistent clients must still serve a new
active client in under 500 ms, proving the reactor avoids the
one-thread-per-connection starvation failure mode.
The Connect() helper uses poll() instead of select() so the test works
past the FD_SETSIZE=1024 cliff on macOS when opening many fds.
- tests/integration/server/reactor_starvation_regression_test.cpp: three
tests that pin the reactor's invariant (LateClientServedUnderLoad,
NoBusyErrorUnderSustainedLoad) plus a blocking-mode negative control
that still exhibits starvation, so we can tell if anything regresses.
- tests/integration/server/thread_pool_saturation_test.cpp: existing
saturation test migrated into its own binary with the
OneWorkerPerConnectionStarvesQueuedWork assertion inverted
(LateClientServedDespitePinnedIdleClientsInDefaultMode) now that
reactor is the default; the blocking-mode legacy behaviour is
preserved by passing io_model="blocking" in two dedicated cases.
Dual-mode CI
- tests/integration/server/test_io_model_override.h: reads the
MYGRAMDB_TEST_IO_MODEL env var and flips ServerConfig::io_model in the
test fixture when set.
- end_to_end_test.cpp / multi_table_test.cpp / verify_text_test.cpp call
ApplyIoModelOverride() in SetUp.
- tests/integration/server/CMakeLists.txt registers a second ctest entry
for each of the three binaries with ENVIRONMENT
MYGRAMDB_TEST_IO_MODEL=blocking so blocking-mode coverage keeps
running alongside the reactor default.
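The override helper pattern can be sketched like this (hypothetical types; the real test_io_model_override.h may differ):

```cpp
#include <cstdlib>
#include <string>

// Stand-in for the real ServerConfig used by the test fixtures.
struct ServerConfigSketch {
  std::string io_model = "reactor";  // the new default
};

// Flip the configured io_model when MYGRAMDB_TEST_IO_MODEL is set, so
// the same test binary can be registered twice in ctest with different
// ENVIRONMENT values and exercise both I/O paths.
void ApplyIoModelOverride(ServerConfigSketch& config) {
  if (const char* v = std::getenv("MYGRAMDB_TEST_IO_MODEL")) {
    config.io_model = v;
  }
}
```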
Fix two latent regressions from the reactor-default flip, both caught by
new regression tests in reactor_integration_test.cpp that fail on the
pre-fix code.
Gap 1 — rate limiting was silently disabled
- The blocking ConnectionIOHandler path enforced api.rate_limiting via
TcpServer::HandleConnection (getpeername + AllowRequest + close on
rejection). The reactor path had no equivalent: reactor_connection.cpp
and io_reactor.cpp contained zero references to `rate_limit`, so any
user who had `api.rate_limiting.enable = true` got unmetered traffic
the moment the default io_model flipped to "reactor".
- Fix: the reactor_handler lambda in TcpServer::Start now captures a
RateLimiter* (or null if rate limiting is disabled / the acceptor is a
UDS acceptor), does getpeername() on each accepted fd, extracts the
peer IPv4 address, and calls AllowRequest() before Register(). On
rejection it returns false so ConnectionAcceptor::AcceptLoop emits
SERVER_BUSY and closes the fd, matching the existing accept-side
backpressure path.
- UDS bypass: the blocking path used the "unix-local" sentinel to skip
rate limiting for Unix-domain clients. Here we compute the bypass once
at Start() time (rate_limiter_ptr = null whenever the acceptor is UDS)
rather than on every accept.
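A toy token-bucket sketch of the AllowRequest() semantics the regression test relies on (the capacity/refill naming comes from the text; the real RateLimiter refills on a clock and is thread-safe, both omitted here):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Per-peer token bucket: each accepted fd spends one token; an empty
// bucket means the caller rejects the connection (SERVER_BUSY + close).
class RateLimiterSketch {
 public:
  RateLimiterSketch(uint32_t capacity, uint32_t refill_per_tick)
      : capacity_(capacity), refill_(refill_per_tick) {}

  bool AllowRequest(const std::string& peer_ip) {
    auto it = buckets_.emplace(peer_ip, capacity_).first;  // new peers start full
    if (it->second == 0) return false;
    --it->second;
    return true;
  }

  void Tick() {  // periodic refill, capped at capacity
    for (auto& entry : buckets_) {
      uint32_t& tokens = entry.second;
      tokens = (tokens + refill_ > capacity_) ? capacity_ : tokens + refill_;
    }
  }

 private:
  uint32_t capacity_;
  uint32_t refill_;
  std::unordered_map<std::string, uint32_t> buckets_;
};
```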
Gap 2 — UDS acceptor could not start at all
- TcpServer::Start unconditionally created a *secondary*
`unix_acceptor_` whenever config_.unix_socket_path was non-empty. But
the primary acceptor_ constructed in ServerLifecycleManager::InitAcceptor
receives the same ServerConfig, and ConnectionAcceptor::Start checks
`unix_socket_path.empty()` before the TCP branch — so the primary was
*already* in UDS mode and had already bound the path. The secondary
then hit its own stale-socket probe on the same path and failed
TcpServer::Start() with "Another server is already listening on: …".
- No integration test exercised unix_socket_path end to end, so this
never surfaced. Once a test does, every TcpServer start with UDS
enabled fails deterministically.
- Fix: delete the dead unix_acceptor_ code path entirely (tcp_server.h
field, Start() setup block, Stop() teardown, IsRunning() check). The
primary acceptor already routes UDS client fds through the
reactor_handler installed on it (connection_acceptor.cpp
reactor_handler_ priority at AcceptLoop), so UDS flows through the
reactor for free.
Tests
- tests/integration/server/reactor_integration_test.cpp adds three
regression guards that all failed on the pre-fix code:
RateLimitEnforcedInReactorMode: capacity=2, refill=0; third
connection must be closed without a response.
MaxQueryLengthEnforcedInReactorMode: a 4 KiB SEARCH query must not
return OK RESULTS. (This test already passed on pre-fix code
because the downstream query parser rejects oversized terms;
kept as a regression guard for the enforcement pipeline.)
UnixSocketServedUnderReactorDefault: AF_UNIX client connects, sends
INFO, expects "OK INFO". Proves UDS end-to-end works under the
reactor default and guards against re-breaking it.
- Full fast suite: 2175 / 2175 pass.
The reactor is now the default TCP path and already handles rate
limiting plus UDS routing. With nothing of value left in the legacy
code, delete it wholesale. Behaviour is unchanged: every connection now
flows through IoReactor + ReactorConnection on every supported
platform.
Production
- src/config/config.{h,cpp}, config-schema.json, config_help.cpp,
runtime_variable_manager.cpp: drop the api.tcp.io_model field, schema
enum, help entry, and runtime variable getter/registry.
- src/server/server_types.h: drop ServerConfig::io_model.
- src/app/server_orchestrator.cpp: drop the io_model assignment.
- src/server/connection_io_handler.{h,cpp}: deleted; the blocking
per-connection recv/process/send loop has no remaining caller.
- src/server/tcp_server.{h,cpp}:
* delete TcpServer::HandleConnection, the connection_contexts_ map
and its mutex (per-connection state lives in ReactorConnection);
* Start() always creates and starts the IoReactor — failure
(including kNetworkReactorUnsupported) propagates instead of
falling back to blocking;
* remove the reactor_active branch and the SetConnectionHandler
else-arm.
- src/server/connection_acceptor.{h,cpp}: drop ConnectionHandler typedef,
SetConnectionHandler, the connection_handler_ field, and the
thread-pool-submit branch in AcceptLoop. A missing reactor_handler_
is now a fatal misconfiguration logged from AcceptLoop rather than a
silent fallback.
- src/server/thread_pool.cpp: drop a stale comment referencing the
blocking model.
- src/server/CMakeLists.txt: drop connection_io_handler.cpp.
- src/utils/error.h: kNetworkReactorUnsupported (6016) is preserved —
it now propagates from TcpServer::Start instead of being caught.
Tests
- tests/server/connection_io_handler_test.cpp: deleted (8 tests for the
removed class).
- tests/server/connection_acceptor_unix_test.cpp: rewritten to install a
ReactorHandler stub (returns true after counting + closing the fd)
instead of the removed SetConnectionHandler.
- tests/integration/server/test_io_model_override.h: deleted along with
every #include and ApplyIoModelOverride() call site in
end_to_end_test.cpp, multi_table_test.cpp (5 calls), and
verify_text_test.cpp (4 calls).
- tests/integration/server/CMakeLists.txt: drop the three .BlockingMode
add_test entries (multi_table, end_to_end, verify_text) and the
target_include_directories that only existed for the override helper.
- tests/integration/server/thread_pool_saturation_test.cpp: drop the
io_model parameter from StartServer; delete the two blocking-mode
cases (LargerThreadPoolRemovesStarvationInBlockingMode and
QueueOverflowTriggersBusyResponseInBlockingMode) — their failure mode
no longer exists.
- tests/integration/server/reactor_starvation_regression_test.cpp: drop
the io_model parameter and delete the T3 negative-control case
(BlockingModeStarvesUnderSameLoad).
- tests/integration/server/reactor_integration_test.cpp: drop the
io_model="reactor" assignments, RequireReactor() helper, and
IsReactorActiveForTest() call sites.
- tests/server/CMakeLists.txt: drop the connection_io_handler_test
target.
Validation
- cmake --build build -j8 → exit 0
- ctest -LE "LOAD|SLOW" -j4 → 100% tests passed, 0 failed out of 2164.
Baseline 2175 → 2164 = 11 tests removed (8 ConnectionIOHandler
unit tests + 2 blocking thread-pool saturation tests + 1 T3 blocking
starvation control), no new failures.
- Reactor starvation tests: 4/4 pass.
With the thread-pool-submit branch in ConnectionAcceptor::AcceptLoop
gone, the thread_pool_ member is no longer referenced anywhere —
accepted fds are handed off inline to the ReactorHandler on the accept
thread. Remove the dead member, the constructor parameter, the forward
declaration, and the `server/thread_pool.h` include.
- src/server/connection_acceptor.{h,cpp}: drop ThreadPool forward decl,
drop the ThreadPool* constructor parameter (now `explicit
ConnectionAcceptor(ServerConfig)`), drop the thread_pool_ field, drop
the null check in the constructor, drop the include.
- src/server/server_lifecycle_manager.{h,cpp}: ServerLifecycleManager::
InitAcceptor no longer needs a thread_pool argument; the matching call
in InitializeComponents is updated to pass nothing.
- tests/server/connection_acceptor_unix_test.cpp: drop the local
`ThreadPool pool(2, 100)` declarations in all 6 fixtures, drop the
matching `pool.Shutdown()` calls, drop the include. Each test now
constructs `ConnectionAcceptor acceptor(config)` directly.
Fast suite: 2164 / 2164 pass.
Also gitignore: add `.backup/` so a local-only scratch directory is
never accidentally tracked.
The blocking ConnectionIOHandler path is gone; all connections flow
through IoReactor + ReactorConnection on epoll/kqueue. Update the
architecture docs and the load test docstring to match.
docs/ja/architecture.md and docs/en/architecture.md
- Replace the ConnectionIOHandler subsection with new IoReactor and
  ReactorConnection subsections describing the event loop, the
  EventMultiplexer abstraction, the per-fd drain task pattern, the
  non-blocking write queue, and per-connection `max_write_queue_bytes`
  slow-reader backpressure.
- Rewrite ConnectionAcceptor's "Features" bullets: it no longer
  dispatches into the thread pool; it hands accepted fds inline to
  IoReactor::Register on the accept thread.
- Add an "IoReactor Event Loop Thread (single)" node to the thread
  model diagram and link it from the main process node. The worker pool
  description is updated to "runs per-connection drain tasks".
- "ServerLifecycleManager initialization order": ConnectionAcceptor no
  longer depends on ThreadPool; add IoReactor as step 8, created in
  TcpServer::Start() and depending on ThreadPool + RequestDispatcher.
- "Between Acceptor and Handlers" section: rewrite the request flow as
  ConnectionAcceptor → IoReactor::Register → ReactorConnection (event
  loop) → drain task on ThreadPool → RequestDispatcher → handler →
  EnqueueResponse() with EPOLLOUT fallback.
e2e/tests/load/test_load.py
- The module docstring previously claimed pipelined commands were
  processed by ConnectionIOHandler "in a while loop". Update it to
  describe ReactorConnection's drain task framing the `\r\n`-delimited
  requests.
Historical release notes (docs/releases/v1.1.0.md, v1.3.5.md) still
reference the old class names — left intact since they document past
releases.
A client that does `send(...); shutdown(SHUT_WR)` followed by `recv()` on
the still-open read side was losing its response under the reactor I/O
path. The blocking ConnectionIOHandler historically handled this case
fine (it kept reading from the socket until EOF and flushed the
response), so this is a regression from the switch to the reactor I/O
path, uncovered by
e2e/tests/load/test_connection_stress.py::test_half_close_write.
Root cause — there were two bugs compounding each other:
1. src/server/io_reactor.cpp DispatchEvent treated
`reactor::event::kHangup` (EV_EOF on kqueue, EPOLLRDHUP on epoll) as
a *fatal* error alongside `kError`, short-circuiting straight to
OnError() without ever calling OnReadable(). The half-close raises
the kHangup flag on the *same* readable event as the payload bytes,
so the reactor tore the connection down before reading the request.
Fix: kError stays fatal, but kHangup now falls through to
OnReadable(), which already knows how to drain to EOF.
2. src/server/reactor_connection.cpp OnReadable, on recv()==0, set the
`closing_` flag. EnqueueResponse refuses to enqueue writes when
`closing_` is set, so even if the drain task had run, the response
would have been dropped on the floor. Fix: introduce a distinct
`read_eof_` atomic for "peer has stopped writing" semantics.
OnReadable sets `read_eof_` (not `closing_`) on orderly EOF and
still extracts buffered frames into the drain queue. The drain
task runs to completion, EnqueueResponse accepts the response
(since `closing_` is false), the write queue drains via the inline
fast path, and only then does the drain task set `closing_` and
Unregister. OnWritable has matching handling for the case where
the last response went out via the EPOLLOUT fallback path.
Also: subsequent OnReadable calls while `read_eof_` is set no longer
issue further recv() syscalls — the data side is definitely empty and
we're waiting for the drain task to close us.
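The two-flag distinction can be reduced to a minimal state sketch (illustrative only — not the actual ReactorConnection members or call sites):

```cpp
#include <atomic>

// read_eof: "peer sent FIN, no more requests will arrive".
// closing:  "connection is being torn down, refuse new writes".
// Conflating them is exactly the pre-fix bug: setting closing on
// recv()==0 made EnqueueResponse drop the not-yet-written response.
struct HalfCloseState {
  std::atomic<bool> read_eof{false};
  std::atomic<bool> closing{false};

  void OnOrderlyEof() { read_eof = true; }  // recv() returned 0

  bool EnqueueAllowed() const { return !closing.load(); }

  // Drain task finished writing the last queued response; only now is
  // it safe to start teardown and Unregister the fd.
  void OnDrainComplete() {
    if (read_eof.load()) closing = true;
  }
};
```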
Regression test: tests/integration/server/reactor_integration_test.cpp
adds `HalfCloseStillReceivesResponse`, which mirrors the e2e test case:
send INFO, shutdown(SHUT_WR), read back response bytes with raw recv().
Fails on the pre-fix code, passes with the fix.
Validation:
- Fast suite: 2165 / 2165 pass (baseline 2164 + the new regression
guard).
- e2e suite (e2e/tests/): 195 passed, 4 skipped, 0 failed (previously
1 failed on test_half_close_write).
The reactor I/O model commits landed without a final clang-format pass, leaving the tree drifted from the project style config (ColumnLimit: 120, AllowShortIfStatementsOnASingleLine: Never). This commit is purely `clang-format -i` output over the affected files; no semantic changes.
Four small, independent improvements surfaced during code review of the
reactor I/O path:
- epoll/kqueue Poll() buffer grows on demand. Both backends now double
  their scratch buffer up to a fixed 4 KiB-entry cap whenever a Poll()
  fills the current capacity, so high-concurrency bursts are not
  fragmented across multiple Poll() rounds waiting for the next tick.
  Growth is monotonic; the buffer never shrinks.
- ReactorConnection::OnWritable is flattened. The empty-queue branch
  previously wrapped the disarm + half-close teardown in a nested block
  with a fall-through `return true` that doubled as both the
  partial-drain and the fully-drained-but-not-closing return. Splitting
  the partial-drain early-return out removes the head-scratcher without
  changing semantics.
- IoReactor::Register closes a narrow race with Stop(). Stop() sets
  running_=false, clears connections_, and then blocks on
  mux_lifecycle_ exclusive. If a concurrent Register interleaved
  between its initial running_ check and the mux_->Add call, the
  emplace had already been cleared by Stop() and the caller would
  receive a spurious success whose fd was never actually tracked.
  Register now re-checks running_ while still holding mux_lifecycle_
  shared and rolls back the Add on the losing side.
- kMaxReadBufferBytes contract documented. Clarify in the header that
  this constant is an OOM safety rail, not per-query size enforcement —
  that responsibility belongs to the downstream query parser
  (api.max_query_length). Deliberately decoupled from config so that
  lowering max_query_length at runtime cannot drop well-formed but
  large requests that are still in flight on an existing connection.
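The grow-on-saturation policy in the first point can be sketched as a pure function (the 4096-entry cap comes from the text; the real backends resize an events array in place):

```cpp
#include <cstddef>

// Double the scratch capacity whenever a Poll() filled it, never shrink,
// and stop growing at a fixed cap so a burst cannot balloon memory.
std::size_t NextPollCapacity(std::size_t current, std::size_t events_returned) {
  constexpr std::size_t kCap = 4096;               // fixed entry cap
  if (events_returned < current) return current;   // burst fit: keep capacity
  std::size_t doubled = current * 2;
  return doubled > kCap ? kCap : doubled;
}
```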
The Register race-guard return statement was manually wrapped at column 80 but fits on a single line under the project's 120-column limit. clang-format --dry-run --Werror (= make format-check, which CI runs) was flagging the mismatch. No semantic change.
Release-prep documentation update for v1.5.3 (reactor I/O model rollout
plus three hardening fixes):
- CHANGELOG.md: add [1.5.3] - 2026-04-12 section covering the reactor
  I/O model (epoll/kqueue), slow-reader backpressure, four bug fixes
  (half-close drain regression, rate limiting silently disabled, UDS
  acceptor broken, Grafana PromQL), and the removal of the blocking I/O
  path; fix previously missing/stale [Unreleased] and [1.5.2] compare
  links at the bottom of the file
- docs/releases/v1.5.3.md: new detailed release notes (new file)
- docs/releases/README.md: promote v1.5.3 to the Latest release entry
The WriteBackpressureHandledGracefully test uses a deadline to detect
FIN after the reactor force-closes the slow-reader fd on write-queue
overflow (64 KiB cap). The original 5s deadline was too tight under
coverage instrumentation, which slows request dispatch by 3-10x and in
turn delays both tripping the cap and the close-propagation path after
Unregister. The Linux Build, Test and Coverage job was the only CI job
to fail on PR #6 (develop->main, v1.5.3); the non-coverage Linux Build
and Test job passed. Logs confirmed reactor_write_queue_overflow fired
correctly, but the test gave up at exactly the 5s deadline before FIN
arrived.
- Bump the deadline from 5s to 15s
- Update the EXPECT_TRUE assertion message to reflect the new deadline
- Add a comment explaining the generous deadline and the coverage
  context
No change to pass-path behavior; a passing run finishes in well under a
second on uninstrumented builds.
Summary
Release v1.5.3 — reactor I/O model becomes the default (and only) TCP path, plus hardening fixes.
Added
- api.tcp.max_write_queue_bytes (default 16 MiB)
Fixed
- shutdown(SHUT_WR) + recv() clients now receive their response
- api.rate_limiting.enable = true is now honored
- broken secondary unix_acceptor_ removed
- ignoring(type) reconciles label sets in the Grafana memory panel
Changed
- ConnectionIOHandler, api.tcp.io_model, and the BlockingMode ctests
  are all deleted
- thread pool auto-size formula restored to max(hw*2, 4)
Testing
- event_multiplexer_test, io_reactor_test, reactor_connection_test
- test_half_close_write now passes
Test plan
- release.yml builds RPM/DEB packages on tag push
- docker-release.yml publishes ghcr.io/libraz/mygram-db:1.5.3
Detailed release notes: docs/releases/v1.5.3.md