Skip to content

Add keepalives and safe recreation to session management#7

Merged
merlimat merged 2 commits intooxia-db:mainfrom
merlimat:fix/session-keepalives
Apr 15, 2026
Merged

Add keepalives and safe recreation to session management#7
merlimat merged 2 commits intooxia-db:mainfrom
merlimat:fix/session-keepalives

Conversation

@merlimat
Copy link
Copy Markdown
Contributor

Summary

  • Sends periodic KeepAlive RPCs from a background thread so ephemeral state tied to a client session stays alive on the server.
  • Teaches SessionManager to transparently recreate sessions that have been closed, while making on_session_closed race-safe so a late callback from an already-replaced session cannot drop a live replacement.
  • Adds tests/sessions_test.py with deterministic fake stubs; the whole new suite runs in well under a second.

This is an alternative to #6 that addresses the review feedback from both Copilot and Codex.

Heartbeat cadence

_resolve_heartbeat_interval_ms() picks a cadence of roughly session_timeout_ms / 10, floored at 2s, and — crucially — capped strictly below session_timeout_ms. The cap is the fix for the small-timeout foot-gun flagged on #6: with max(session_timeout_ms // 10, 2_000) alone, any session_timeout_ms < 2000 makes the loop sleep past the timeout, so the local-expiry check fires before a single heartbeat is ever sent. The cadence is also injectable for tests, which is how sessions_test.py stays fast and deterministic without sleeping for wall-clock seconds.

Race fix in on_session_closed

Session.close() now notifies the manager before running the potentially slow close_session RPC, so replacement sessions created during the RPC don't race with the callback. And on_session_closed only evicts when the tracked entry is still the same object:

if self.sessions_by_shard.get(session.shard_id()) is session:
    del self.sessions_by_shard[session.shard_id()]

Without this, a late callback from an old session could evict a newer replacement that get_session() had installed during the close-RPC window, leaving a live session with a running heartbeat thread completely unmanaged.

Test plan

  • pytest tests/sessions_test.py — 12 passed in ~0.35s, run 5x with no flakes
  • pytest tests/compare_test.py tests/sessions_test.py — 13 passed
  • python -c "import oxia" — no import regressions
  • Integration run (client_test.py) against a real Oxia server — requires Docker locally, please run on CI

Send periodic KeepAlive RPCs from a background thread so that ephemeral
state tied to a client session does not get garbage-collected by the
server, and teach SessionManager to transparently recreate sessions that
have been closed (locally expired or explicitly torn down).

The heartbeat cadence defaults to roughly one tenth of the session
timeout, floored at 2s and capped strictly below the session timeout so
that short-timeout configurations still get at least one heartbeat
before the local expiry check fires. The cadence is also injectable for
tests.

SessionManager.on_session_closed now verifies the tracked session is
still the one that closed before evicting, so that a late callback from
an already-replaced session cannot drop a live replacement and leak its
heartbeat thread.

A new tests/sessions_test.py drives the heartbeat loop deterministically
via fake stubs and a short injected interval, so the whole suite runs in
well under a second.

Signed-off-by: Matteo Merli <mmerli@apache.org>
@merlimat merlimat force-pushed the fix/session-keepalives branch from 00b7ac0 to 99c88c6 Compare April 15, 2026 22:45
The previous `session_timeout_ms < 2` guard rejected only the degenerate
case. The real constraint is that the default heartbeat cadence has a
2000ms floor, so a timeout below that cannot fit a single default
heartbeat before the local expiry check fires.

Now `_resolve_heartbeat_interval_ms` enforces
`session_timeout_ms >= _DEFAULT_HEARTBEAT_FLOOR_MS` only when the caller
did not supply an explicit `heartbeat_interval_ms`. Callers that pass a
custom interval (including the test suite) can still configure tight
timing deterministically.

Signed-off-by: Matteo Merli <mmerli@apache.org>
@merlimat merlimat mentioned this pull request Apr 15, 2026
@merlimat merlimat merged commit 348fdb0 into oxia-db:main Apr 15, 2026
2 checks passed
merlimat added a commit to merlimat/oxia-client-python that referenced this pull request Apr 16, 2026
…db#7, oxia-db#9)

Bug oxia-db#7: put() used `type(value) is str` (rejects str subclasses) and
`bytes(str(value), ...)` (redundant str()). Non-str/non-bytes values
were forwarded to protobuf, erroring deep in serialization.

Extracted _coerce_value(): isinstance checks, clear TypeError for
invalid types, str.encode() instead of redundant bytes(str(...)).

Bug oxia-db#9: `raise oxia.ex.KeyNotFound` (class) was inconsistent with
the rest of the file which uses `raise KeyNotFound()` (instance).
Both are legal Python but the instance form is conventional.

Bug oxia-db#10 (non-atomic assignment swap) was a false positive — the
existing _parse_assignments already holds self._lock during both the
clear and repopulate. Dropped from the fix list.

Signed-off-by: Matteo Merli <mmerli@apache.org>
merlimat added a commit to merlimat/oxia-client-python that referenced this pull request Apr 16, 2026
…db#7, oxia-db#9)

Bug oxia-db#7: put() used `type(value) is str` (rejects str subclasses) and
`bytes(str(value), ...)` (redundant str()). Non-str/non-bytes values
were forwarded to protobuf, erroring deep in serialization.

Extracted _coerce_value(): isinstance checks, clear TypeError for
invalid types, str.encode() instead of redundant bytes(str(...)).

Bug oxia-db#9: `raise oxia.ex.KeyNotFound` (class) was inconsistent with
the rest of the file which uses `raise KeyNotFound()` (instance).
Both are legal Python but the instance form is conventional.

Bug oxia-db#10 (non-atomic assignment swap) was a false positive — the
existing _parse_assignments already holds self._lock during both the
clear and repopulate. Dropped from the fix list.

Signed-off-by: Matteo Merli <mmerli@apache.org>
merlimat added a commit that referenced this pull request Apr 16, 2026
… (#19)

Bug #7: put() used `type(value) is str` (rejects str subclasses) and
`bytes(str(value), ...)` (redundant str()). Non-str/non-bytes values
were forwarded to protobuf, erroring deep in serialization.

Extracted _coerce_value(): isinstance checks, clear TypeError for
invalid types, str.encode() instead of redundant bytes(str(...)).

Bug #9: `raise oxia.ex.KeyNotFound` (class) was inconsistent with
the rest of the file which uses `raise KeyNotFound()` (instance).
Both are legal Python but the instance form is conventional.

Bug #10 (non-atomic assignment swap) was a false positive — the
existing _parse_assignments already holds self._lock during both the
clear and repopulate. Dropped from the fix list.

Signed-off-by: Matteo Merli <mmerli@apache.org>
@merlimat merlimat deleted the fix/session-keepalives branch April 22, 2026 15:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant