pageserver: refine tenant_id->shard lookup #7762
3060 tests run: 2933 passed, 0 failed, 127 skipped (full report)
Flaky tests (3)
Postgres 16
Code coverage* (full report)
* collected from Rust tests only
The comment gets automatically updated with the latest test results.
146df8f at 2024-05-16T08:39:00.354Z
Interesting failure: while we're waiting for InProgress, we are holding a gate, which blocks a location_conf call on the tenant: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7762/9099973490/index.html#suites/140824de6e814b5b1ae2b622c3f67840/7bae027ac87aa2ef

Need to think about what happens if a page request handler is holding a gate on an ancestor shard during a split.
## Problem

This is tech debt from when shard splitting was implemented, to handle more nicely the edge case of a client reconnect at the moment of the split.

During shard splits, there were edge cases where we could incorrectly return NotFound to a getpage@lsn request, prompting an unwanted reconnect/backoff from the client.

It is already the case that parent shards during splits are marked InProgress before child shards are created, so `resolve_attached_shard` will not match on them, thereby implicitly preferring child shards (good).

However, we were not doing any elegant handling of InProgress in general: `get_active_tenant_with_timeout` was previously mostly dead code: it was inspecting the slot found by `resolve_attached_shard` and maybe waiting for InProgress, but that path is never taken because since ef7c9c2 the resolve function only ever returns attached slots.

Closes: #7044

## Summary of changes

- Change return value of `resolve_attached_shard` to distinguish between true NotFound case, and the case where we skipped slots that were InProgress.
- Rework `get_active_tenant_with_timeout` to loop over calling `resolve_attached_shard`, waiting if it sees an InProgress result.

The resulting behavior during a shard split is:

- If we look up a shard early in split when parent is InProgress but children aren't created yet, we'll wait for the parent to be shut down. This corresponds to the part of the split where we wait for LSNs to catch up: so a small delay to the request, but a clean enough handling.
- If we look up a shard while child shards are already present, we will match on those shards rather than the parent, as intended.
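
The lookup-and-wait behavior described above can be illustrated with a simplified, synchronous sketch. This is not the pageserver code: `Slot`, `Resolved`, `LookupError`, `TenantMap`, and the `(tenant_id, shard_number, shard_count)` key layout are hypothetical stand-ins, the mapping from page key to shard is omitted (the first attached shard of the tenant is returned), and a `Condvar` stands in for whatever mechanism the real code uses to wait on an InProgress slot.

```rust
use std::collections::BTreeMap;
use std::sync::{Condvar, Mutex};
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a tenant slot; the real pageserver types differ.
#[derive(Debug, PartialEq)]
enum Slot {
    Attached(String), // name of an attached shard
    InProgress,       // slot is mid-transition, e.g. a parent shard during a split
}

/// Three-way lookup result: distinguishes a true NotFound from the case
/// where only InProgress slots were seen (and skipped).
#[derive(Debug, PartialEq)]
enum Resolved {
    Found(String),
    InProgress,
    NotFound,
}

#[derive(Debug, PartialEq)]
enum LookupError {
    NotFound,
    Timeout,
}

/// Slots keyed by (tenant_id, shard_number, shard_count); assumed layout.
type Slots = BTreeMap<(u32, u8, u8), Slot>;

/// Return the first attached shard of the tenant, skipping InProgress slots
/// but remembering that at least one was skipped.
fn resolve_attached_shard(slots: &Slots, tenant_id: u32) -> Resolved {
    let mut skipped_in_progress = false;
    for (_, slot) in slots.range((tenant_id, 0, 0)..=(tenant_id, u8::MAX, u8::MAX)) {
        match slot {
            Slot::Attached(name) => return Resolved::Found(name.clone()),
            Slot::InProgress => skipped_in_progress = true,
        }
    }
    if skipped_in_progress {
        Resolved::InProgress
    } else {
        Resolved::NotFound
    }
}

/// Slot map plus a condition variable that writers notify after changing slots.
struct TenantMap {
    slots: Mutex<Slots>,
    changed: Condvar,
}

/// Loop over the resolve step: return immediately on Found or NotFound,
/// and wait (bounded by `timeout`) whenever the result is InProgress.
fn get_active_tenant_with_timeout(
    map: &TenantMap,
    tenant_id: u32,
    timeout: Duration,
) -> Result<String, LookupError> {
    let deadline = Instant::now() + timeout;
    let mut slots = map.slots.lock().unwrap();
    loop {
        let resolved = resolve_attached_shard(&slots, tenant_id);
        match resolved {
            Resolved::Found(name) => return Ok(name),
            Resolved::NotFound => return Err(LookupError::NotFound),
            Resolved::InProgress => {
                let remaining = deadline.saturating_duration_since(Instant::now());
                if remaining.is_zero() {
                    return Err(LookupError::Timeout);
                }
                // Releases the lock while waiting; re-resolve after any wakeup.
                let (guard, _) = map.changed.wait_timeout(slots, remaining).unwrap();
                slots = guard;
            }
        }
    }
}
```

In this sketch, a tenant whose only slot is an InProgress parent makes the caller wait rather than fail with NotFound, and once an attached child slot exists it is returned even though the parent slot is still InProgress.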