From bef75d064c4541cc96285711f0ca97de06cb5d2a Mon Sep 17 00:00:00 2001 From: "Sean T. Allen" Date: Tue, 14 Apr 2026 20:46:36 -0400 Subject: [PATCH] Fix statement timeout dropped when timer event subscription fails MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit lori 0.14.1 replaces the silent-cancel behavior for user-timer ASIO event subscription failures with a new `_on_timer_failure` callback. Without an override, the driver would keep running an in-flight query unbounded by its configured statement timeout. Rearm the timer with the in-flight operation's original duration so a transient subscription failure (e.g. `ENOMEM` from `epoll_ctl`/`kevent`) no longer drops the timeout. Rearm is best-effort; if the rearm's own subscription fails, the timeout is still lost for that operation. Also override `_on_idle_timer_failure` with `_IllegalState()` — postgres never arms lori's idle timer, so the callback firing would indicate a contract violation. Closes #222 --- ...rearm-statement-timeout-on-asio-failure.md | 3 + CLAUDE.md | 8 +- corral.json | 2 +- postgres/_test.pony | 1 + postgres/_test_statement_timeout.pony | 101 ++++++++++++++++++ postgres/session.pony | 81 +++++++++++++- 6 files changed, 190 insertions(+), 6 deletions(-) create mode 100644 .release-notes/rearm-statement-timeout-on-asio-failure.md diff --git a/.release-notes/rearm-statement-timeout-on-asio-failure.md b/.release-notes/rearm-statement-timeout-on-asio-failure.md new file mode 100644 index 0000000..a4b9253 --- /dev/null +++ b/.release-notes/rearm-statement-timeout-on-asio-failure.md @@ -0,0 +1,3 @@ +## Fix statement timeout dropped when timer event subscription fails + +A statement timeout could silently disappear if the kernel returned an error when the driver's internal timer tried to register itself with the I/O event loop (for example, `ENOMEM` under sustained resource pressure). The operation would then run without a timeout even though one had been requested. The driver now detects this failure and rearms the timer with the original duration, so a transient registration failure no longer drops the timeout. Recovery is best-effort: if the rearm itself fails to register, the timeout is still lost for that operation. diff --git a/CLAUDE.md b/CLAUDE.md index 2d305f6..baac369 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -19,7 +19,7 @@ SSL version is mandatory. Tests run with `--sequential`. Integration tests requi ## Dependencies - `ponylang/ssl` 2.0.1 (MD5 password hashing, SCRAM-SHA-256 crypto primitives via `ssl/crypto`, SSL/TLS via `ssl/net`) -- `ponylang/lori` 0.13.1 (TCP networking, STARTTLS support) +- `ponylang/lori` 0.14.1 (TCP networking, STARTTLS support) Managed via `corral`. @@ -91,7 +91,7 @@ This design makes illegal state transitions call `_IllegalState()` (panic) by de Only one operation is in-flight at a time. The queue serializes execution. `query_queue`, `query_state`, `backend_pid`, `backend_secret_key`, `codec_registry`, and `statement_timer: (lori.TimerToken | None)` are non-underscore-prefixed fields on `_SessionLoggedIn` because other types in this package need cross-type access (Pony private fields are type-private). `codec_registry: CodecRegistry` is received from the `Session` constructor and threaded through the state machine (`_SessionUnopened` → `_ConnectableState.on_connected` → `_SessionSSLNegotiating`/`_SessionConnected` → `_AuthenticableState.on_authentication_ok`/`on_authentication_sasl` → `_SessionSCRAMAuthenticating`/`_SessionLoggedIn`), then passed to all `_RowsBuilder` call sites. On shutdown, `_SessionLoggedIn.on_shutdown` cancels any active statement timer, then calls `query_state.drain_in_flight()` to let the in-flight state handle its own queue item (skipping notification if `on_error_response` already notified the receiver), then drains remaining queued items with `SessionClosed`. This prevents double-notification when `close()` arrives between ErrorResponse and ReadyForQuery delivery. -**Statement timeout:** `execute`, `prepare`, `copy_in`, `copy_out`, `stream`, and `pipeline` accept an optional `statement_timeout: (lori.TimerDuration | None)` parameter. When provided, `_QueryReady.try_run_query()` starts a one-shot timer (via `lori.set_timer()`) after sending wire messages. Each in-flight state's `on_ready_for_query()` cancels the timer (via `lori.cancel_timer()`) before transitioning to `_QueryReady`. For pipelines, the timer is cancelled only on the final `ReadyForQuery`. On expiry, `_SessionLoggedIn.on_timer()` sends a `CancelRequest` via `_CancelSender`, producing the same cancellation flow as `session.cancel()`. +**Statement timeout:** `execute`, `prepare`, `copy_in`, `copy_out`, `stream`, and `pipeline` accept an optional `statement_timeout: (lori.TimerDuration | None)` parameter. When provided, `_QueryReady.try_run_query()` starts a one-shot timer (via `lori.set_timer()`) after sending wire messages. Each in-flight state's `on_ready_for_query()` cancels the timer (via `lori.cancel_timer()`) before transitioning to `_QueryReady`. For pipelines, the timer is cancelled only on the final `ReadyForQuery`. On expiry, `_SessionLoggedIn.on_timer()` sends a `CancelRequest` via `_CancelSender`, producing the same cancellation flow as `session.cancel()`. If the timer's ASIO event subscription fails after `set_timer` succeeded (e.g. `ENOMEM` from `kevent`/`epoll_ctl`), lori fires `_on_timer_failure`, which `Session` routes to `_SessionState.on_timer_failure`. `_SessionLoggedIn.on_timer_failure` reads the in-flight queue item's `statement_timeout` and rearms the timer with the original duration so transient subscription failures do not silently drop the timeout. All other states no-op; `Session._on_idle_timer_failure` panics via `_IllegalState()` because postgres never arms lori's idle timer. **Query cancellation:** `session.cancel()` requests cancellation of the currently executing query by opening a separate TCP connection via `_CancelSender` and sending a `CancelRequest`. The `cancel` method on `_SessionState` follows the same "never illegal" contract as `close` — it is a no-op in all states except `_SessionLoggedIn`, where it fires only when a query is in flight (not in `_QueryReady` or `_QueryNotReady`). Cancellation is best-effort; the server may or may not honor it. If cancelled, the query's `ResultReceiver` receives `pg_query_failed` with an `ErrorResponseMessage` (SQLSTATE 57014). @@ -198,9 +198,9 @@ Codec-based decoding via `CodecRegistry.decode(oid, format_code, data)`. Extende Tests live in the main `postgres/` package (private test classes), organized across multiple files by concern (`_test_*.pony`). The `Main` test actor in `_test.pony` is the single test registry that lists all tests. Read the individual test files for per-test details. -**Conventions**: `_test.pony` contains shared helpers (`_ConnectionTestConfiguration` for env vars, `_ConnectTestNotify`/`_AuthenticateTestNotify` reused by other files). `_test_response_parser.pony` contains `_Incoming*TestMessage` builder classes that construct raw protocol bytes for mock servers across all test files. `_test_mock_message_reader.pony` contains `_MockMessageReader` for extracting complete PostgreSQL frontend messages from TCP data in mock servers. `_test_codecs.pony` contains unit tests for binary/text codecs, `CodecRegistry`, `_ParamEncoder`, and binary-format bind wire format. `_test_array.pony` contains unit tests for `_ArrayOidMap`, binary/text array decode, `PgArray` equality/string, `_ArrayEncoder` roundtrips, `_ParamEncoder` with `PgArray`, `_FrontendMessage.bind()` with `PgArray`, `_FieldDataEq` extraction, `_NumericBinaryCodec` encode roundtrip, `CodecRegistry` array methods, and integration tests for array SELECT/roundtrip. `_test_composite.pony` contains unit tests for `PgComposite` construction/access/equality/string, `CodecRegistry.with_composite_type` registration validation, binary/text composite decode, `_CompositeEncoder` roundtrips, `_ParamEncoder` with `PgComposite`, `_FrontendMessage.bind()` with `PgComposite`, composite-in-array and array-in-composite roundtrips, `_FieldDataEq` with composites, and integration tests for composite SELECT (binary/text) and INSERT/SELECT roundtrip. `_test_statement_timeout.pony` contains unit tests for statement timeout timer lifecycle (fires on slow query, cancelled on normal completion) and integration test for pg_sleep with timeout. `_test_connection_timeout.pony` contains unit test for connection timeout (connects to non-routable address, verifies `ConnectionFailedTimeout` is reported). `_test_protocol_violation.pony` contains unit tests for `ProtocolViolation` routing: parse failures during SCRAM and logged-in (with and without an in-flight query), wrong-state server messages pre-auth, mid-SCRAM, and logged-in (with and without an in-flight query), and per-receiver coverage of the in-flight states that have their own receiver interfaces — `copy_in`, `copy_out`, `prepare`, `stream`, and `pipeline` (each test triggers the violation immediately after the client-side wire message, so the in-flight state is entered but the server hasn't responded with a CopyInResponse/CopyOutResponse/ParseComplete/etc. yet; the pipeline case also exercises the current-query-gets-`ProtocolViolation`/remaining-queries-get-`SessionClosed` contract). The pre-auth parse-failure case (`_TestHandlingJunkMessages`) and the SSL-negotiation junk-byte case (`_TestSSLNegotiationJunkResponse`) live inline in `_test_session.pony` and `_test_ssl.pony` respectively. `_test_connection_closed.pony` contains unit tests for peer-initiated TCP close at every state boundary: `_SessionSSLNegotiating` (in both `SSLRequired` and `SSLPreferred` modes), `_SessionConnected` pre-auth, `_SessionSCRAMAuthenticating`, `_SessionLoggedIn` idle, `_SessionLoggedIn` with an in-flight simple query, `_SessionLoggedIn` with an in-flight pipeline, `_SessionLoggedIn` after `AuthenticationOk` but before the first `ReadyForQuery` (state `_QueryNotReady`), plus per-receiver coverage for the in-flight states — `execute` (extended query), `prepare`, `copy_in`, `copy_out`, `stream`, and `close_statement`. Two tests use exactly-once counters: the in-flight simple-query test catches regressions that double-deliver the query failure, and the `ErrorResponse`-followed-by-close test verifies that `_error` set by `on_error_response` prevents `on_closed` from double-delivering. +**Conventions**: `_test.pony` contains shared helpers (`_ConnectionTestConfiguration` for env vars, `_ConnectTestNotify`/`_AuthenticateTestNotify` reused by other files). `_test_response_parser.pony` contains `_Incoming*TestMessage` builder classes that construct raw protocol bytes for mock servers across all test files. `_test_mock_message_reader.pony` contains `_MockMessageReader` for extracting complete PostgreSQL frontend messages from TCP data in mock servers. `_test_codecs.pony` contains unit tests for binary/text codecs, `CodecRegistry`, `_ParamEncoder`, and binary-format bind wire format. `_test_array.pony` contains unit tests for `_ArrayOidMap`, binary/text array decode, `PgArray` equality/string, `_ArrayEncoder` roundtrips, `_ParamEncoder` with `PgArray`, `_FrontendMessage.bind()` with `PgArray`, `_FieldDataEq` extraction, `_NumericBinaryCodec` encode roundtrip, `CodecRegistry` array methods, and integration tests for array SELECT/roundtrip. `_test_composite.pony` contains unit tests for `PgComposite` construction/access/equality/string, `CodecRegistry.with_composite_type` registration validation, binary/text composite decode, `_CompositeEncoder` roundtrips, `_ParamEncoder` with `PgComposite`, `_FrontendMessage.bind()` with `PgComposite`, composite-in-array and array-in-composite roundtrips, `_FieldDataEq` with composites, and integration tests for composite SELECT (binary/text) and INSERT/SELECT roundtrip. `_test_statement_timeout.pony` contains unit tests for statement timeout timer lifecycle (fires on slow query, cancelled on normal completion, rearmed on simulated ASIO subscription failure via `Session._test_trigger_on_timer_failure`) and integration test for pg_sleep with timeout. `_test_connection_timeout.pony` contains unit test for connection timeout (connects to non-routable address, verifies `ConnectionFailedTimeout` is reported). `_test_protocol_violation.pony` contains unit tests for `ProtocolViolation` routing: parse failures during SCRAM and logged-in (with and without an in-flight query), wrong-state server messages pre-auth, mid-SCRAM, and logged-in (with and without an in-flight query), and per-receiver coverage of the in-flight states that have their own receiver interfaces — `copy_in`, `copy_out`, `prepare`, `stream`, and `pipeline` (each test triggers the violation immediately after the client-side wire message, so the in-flight state is entered but the server hasn't responded with a CopyInResponse/CopyOutResponse/ParseComplete/etc. yet; the pipeline case also exercises the current-query-gets-`ProtocolViolation`/remaining-queries-get-`SessionClosed` contract). The pre-auth parse-failure case (`_TestHandlingJunkMessages`) and the SSL-negotiation junk-byte case (`_TestSSLNegotiationJunkResponse`) live inline in `_test_session.pony` and `_test_ssl.pony` respectively. `_test_connection_closed.pony` contains unit tests for peer-initiated TCP close at every state boundary: `_SessionSSLNegotiating` (in both `SSLRequired` and `SSLPreferred` modes), `_SessionConnected` pre-auth, `_SessionSCRAMAuthenticating`, `_SessionLoggedIn` idle, `_SessionLoggedIn` with an in-flight simple query, `_SessionLoggedIn` with an in-flight pipeline, `_SessionLoggedIn` after `AuthenticationOk` but before the first `ReadyForQuery` (state `_QueryNotReady`), plus per-receiver coverage for the in-flight states — `execute` (extended query), `prepare`, `copy_in`, `copy_out`, `stream`, and `close_statement`. Two tests use exactly-once counters: the in-flight simple-query test catches regressions that double-deliver the query failure, and the `ErrorResponse`-followed-by-close test verifies that `_error` set by `on_error_response` prevents `on_closed` from double-delivering. -**Ports**: Mock server tests use ports in the 7669–7759 range and 9667–9668. **Port 7680 is reserved by Windows** (Update Delivery Optimization) and will fail to bind on WSL2 — do not use it. Composite integration tests don't use mock servers (they use the real PostgreSQL containers). +**Ports**: Mock server tests use ports in the 7669–7760 range and 9667–9668. **Port 7680 is reserved by Windows** (Update Delivery Optimization) and will fail to bind on WSL2 — do not use it. Composite integration tests don't use mock servers (they use the real PostgreSQL containers). ## Supported PostgreSQL Features diff --git a/corral.json b/corral.json index a8f912f..18044cd 100644 --- a/corral.json +++ b/corral.json @@ -5,7 +5,7 @@ "deps": [ { "locator": "github.com/ponylang/lori.git", - "version": "0.14.0" + "version": "0.14.1" }, { "locator": "github.com/ponylang/ssl.git", diff --git a/postgres/_test.pony b/postgres/_test.pony index c083d1f..eae93e5 100644 --- a/postgres/_test.pony +++ b/postgres/_test.pony @@ -156,6 +156,7 @@ actor \nodoc\ Main is TestList test(_TestCancelSSLPgSleep) test(_TestConnectionTimeoutFires) test(_TestStatementTimeoutFires) + test(_TestStatementTimeoutRearmOnTimerFailure) test(_TestStatementTimeoutCancelledOnCompletion) test(_TestStatementTimeoutPgSleep) test(_TestResponseParserAuthenticationSASLMessage) diff --git a/postgres/_test_statement_timeout.pony b/postgres/_test_statement_timeout.pony index 60a1f38..75bdb64 100644 --- a/postgres/_test_statement_timeout.pony +++ b/postgres/_test_statement_timeout.pony @@ -194,6 +194,107 @@ actor \nodoc\ _TimeoutTestServer // the statement timeout timer will fire. end +class \nodoc\ iso _TestStatementTimeoutRearmOnTimerFailure is UnitTest + """ + Verifies that when the statement timer's ASIO event subscription fails, + the driver rearms the timer with the in-flight operation's original + duration instead of silently dropping the timeout. The client simulates + the failure via `_test_trigger_on_timer_failure` immediately after + dispatching a query; the rearmed timer fires and a CancelRequest arrives + on the second connection. + """ + fun name(): String => + "StatementTimeout/RearmOnTimerFailure" + + fun apply(h: TestHelper) => + let host = "127.0.0.1" + let port = "7760" + + let listener = _TimeoutRearmTestListener( + lori.TCPListenAuth(h.env.root), + host, + port, + h) + + h.dispose_when_done(listener) + h.long_test(5_000_000_000) + +actor \nodoc\ _TimeoutRearmTestClient is (SessionStatusNotify & ResultReceiver) + let _h: TestHelper + + new create(h: TestHelper) => + _h = h + + be pg_session_connection_failed(s: Session, + reason: ConnectionFailureReason) + => + _h.fail("Unable to establish connection.") + _h.complete(false) + + be pg_session_authenticated(session: Session) => + // Execute a query with a 200ms timeout, then immediately simulate an + // ASIO subscription failure on the statement timer. The session + // processes `execute` first (arms the original timer), then the + // simulation (cancels the original token and rearms with 200ms). The + // rearmed timer fires and sends a CancelRequest on a second connection. + match lori.MakeTimerDuration(200) + | let d: lori.TimerDuration => + session.execute(SimpleQuery("SELECT pg_sleep(100)"), this + where statement_timeout = d) + session._test_trigger_on_timer_failure() + | let _: ValidationFailure => + _h.fail("Failed to create TimerDuration.") + _h.complete(false) + end + + be pg_query_result(session: Session, result: Result) => + None + + be pg_query_failed(session: Session, query: Query, + failure: (ErrorResponseMessage | ClientQueryError)) + => + None + +actor \nodoc\ _TimeoutRearmTestListener is lori.TCPListenerActor + var _tcp_listener: lori.TCPListener = lori.TCPListener.none() + let _server_auth: lori.TCPServerAuth + let _h: TestHelper + let _host: String + let _port: String + var _connection_count: USize = 0 + + new create(listen_auth: lori.TCPListenAuth, + host: String, + port: String, + h: TestHelper) + => + _host = host + _port = port + _h = h + _server_auth = lori.TCPServerAuth(listen_auth) + _tcp_listener = lori.TCPListener(listen_auth, host, port, this) + + fun ref _listener(): lori.TCPListener => + _tcp_listener + + fun ref _on_accept(fd: U32): _TimeoutTestServer => + _connection_count = _connection_count + 1 + let server = _TimeoutTestServer(_server_auth, fd, _h, + _connection_count > 1) + _h.dispose_when_done(server) + server + + fun ref _on_listening() => + let session = Session( + ServerConnectInfo(lori.TCPConnectAuth(_h.env.root), _host, _port), + DatabaseConnectInfo("postgres", "postgres", "postgres"), + _TimeoutRearmTestClient(_h)) + _h.dispose_when_done(session) + + fun ref _on_listen_failure() => + _h.fail("Unable to listen") + _h.complete(false) + class \nodoc\ iso _TestStatementTimeoutCancelledOnCompletion is UnitTest """ Verifies that when a query completes before its statement_timeout, the diff --git a/postgres/session.pony b/postgres/session.pony index 748bb5f..9159409 100644 --- a/postgres/session.pony +++ b/postgres/session.pony @@ -214,6 +214,33 @@ actor Session is (lori.TCPConnectionActor & lori.ClientLifecycleEventReceiver) fun ref _on_timer(token: lori.TimerToken) => state.on_timer(this, token) + fun ref _on_timer_failure() => + state.on_timer_failure(this) + + be _test_trigger_on_timer_failure() => + """ + Test-only entry point. Lori fires `_on_timer_failure` when the statement + timer's ASIO event subscription fails, a condition that requires kernel + resource pressure (e.g. ENOMEM from `kevent`/`epoll_ctl`) to trigger + organically. This behavior simulates the callback so unit tests can cover + the rearm path. + + Lori's real dispatch path cancels the user timer before invoking + `_on_timer_failure`, so `set_timer` inside the callback succeeds. To + match that invariant, cancel the active statement timer (if any) before + invoking the callback. + """ + match state + | let li: _SessionLoggedIn => + li.cancel_statement_timer(this) + end + _on_timer_failure() + + fun ref _on_idle_timer_failure() => + // postgres never arms lori's idle timer, so this callback firing is a + // contract violation. + _IllegalState() + fun ref _on_connected() => state.on_connected(this) @@ -585,6 +612,9 @@ class ref _SessionSSLNegotiating fun ref on_timer(s: Session ref, token: lori.TimerToken) => None + fun ref on_timer_failure(s: Session ref) => + None + fun ref close(s: Session ref) => _shutdown(s) @@ -1106,6 +1136,37 @@ class _SessionLoggedIn is _AuthenticatedState _CancelSender(s.server_connect_info(), backend_pid, backend_secret_key) end + fun ref on_timer_failure(s: Session ref) => + // Lori cancelled the timer before firing this callback, so the token we + // held is stale. Rearm using the in-flight operation's original duration + // so a transient ASIO subscription failure doesn't silently drop the + // statement timeout. COUPLING: the per-variant timeout extraction must + // stay in sync with `_QueryReady.try_run_query` — adding a new + // `_Queued*` variant requires updating both sites. + statement_timer = None + try + let timeout = match \exhaustive\ query_queue(0)? + | let qry: _QueuedQuery => qry.statement_timeout + | let prep: _QueuedPrepare => prep.statement_timeout + | let _: _QueuedCloseStatement => None + | let ci: _QueuedCopyIn => ci.statement_timeout + | let co: _QueuedCopyOut => co.statement_timeout + | let sq: _QueuedStreamingQuery => sq.statement_timeout + | let pl: _QueuedPipeline => pl.statement_timeout + end + match timeout + | let d: lori.TimerDuration => + match s._connection().set_timer(d) + | let t: lori.TimerToken => statement_timer = t + | let _: lori.SetTimerError => None + end + end + end + // Empty queue is an invariant violation for a real lori dispatch (the + // timer is only armed while a queue item is at the head), but the + // interface's "never illegal, silently ignore" contract still applies. + // Silently no-op rather than panic. + fun ref cancel_statement_timer(s: Session ref) => match statement_timer | let t: lori.TimerToken => @@ -1604,7 +1665,10 @@ class _QueryReady is _QueryNoQueryInFlight // Set statement timeout timer if configured on the dispatched item. // The queue item is still at index 0 — dequeuing happens in each - // in-flight state's on_ready_for_query. + // in-flight state's on_ready_for_query. COUPLING: the per-variant + // timeout extraction must stay in sync with + // `_SessionLoggedIn.on_timer_failure`, which rearms this timer on + // ASIO subscription failure. let timeout = match \exhaustive\ li.query_queue(0)? | let qry: _QueuedQuery => qry.statement_timeout | let prep: _QueuedPrepare => prep.statement_timeout @@ -2918,6 +2982,15 @@ interface _SessionState illegal state — it should be silently ignored when not applicable. """ + fun ref on_timer_failure(s: Session ref) + """ + Lori reported that the statement timeout timer's ASIO event subscription + failed. Like `on_timer`, this should never be an illegal state — it + should be silently ignored when not applicable. `_SessionLoggedIn` + (the only state that arms the statement timer) rearms using the + in-flight operation's original duration; every other state is a no-op. + """ + fun ref cancel(s: Session ref) """ The client requested query cancellation. Like `close`, this should never @@ -3243,6 +3316,9 @@ trait _ConnectedState is _NotConnectableState fun ref on_timer(s: Session ref, token: lori.TimerToken) => None + fun ref on_timer_failure(s: Session ref) => + None + fun ref cancel(s: Session ref) => None @@ -3311,6 +3387,9 @@ trait _UnconnectedState is (_NotAuthenticableState & _NotAuthenticated) fun ref on_timer(s: Session ref, token: lori.TimerToken) => None + fun ref on_timer_failure(s: Session ref) => + None + fun ref cancel(s: Session ref) => None