docs/rfcs: add RFC for fast tenant migration/failover #5029
2520 tests run: 2403 passed, 0 failed, 117 skipped.
The comment gets automatically updated with the latest test results.
1569446 at 2023-09-27T10:12:51.854Z
Read up to but not including `### Database schema for locations`.
I'm good with the high-level procedure.
My main worry design-wise is the AttachedStale state and how it will interact with eviction.
I'd love to see it removed from this RFC and added in a later RFC.
I'll create a stacked PR with some editorial fixes.
Please pull in my editorial fixes from here: #5185
Some quick comments
Updates:
Sorry for the slew of comments. By and large, I'm good with this design though.
Co-authored-by: Christian Schwarz <christian@neon.tech>
Good to go!
…ocations (#5299)

## Problem

These changes are part of building seamless tenant migration, as described in the RFC:
- #5029

## Summary of changes

- A new configuration type `LocationConf` supersedes `TenantConfOpt` for storing a tenant's configuration in the pageserver repo dir. It contains `TenantConfOpt`, as well as a new `mode` attribute that describes what kind of location this is (secondary, attached, attachment mode etc). It is written to a file called `config-v1` instead of `config` -- this prepares us for neatly making any other profound changes to the format of the file in future. Forward compat for existing pageserver code is achieved by writing out both old and new style files. Backward compat is achieved by checking for the old-style file if the new one isn't found.
- The `TenantMap` type changes, to hold `TenantSlot` instead of just `Tenant`. The `Tenant` type continues to be used for attached tenants only. Tenants in other states (such as secondaries) are represented by a different variant of `TenantSlot`.
- Where `Tenant` & `Timeline` used to hold an `Arc<Mutex<TenantConfOpt>>`, they now hold a reference to an `AttachedTenantConf`, which includes the extra information from `LocationConf`. This enables them to know the current attachment mode.
- The attachment mode is used as an advisory input to decide whether to do compaction and GC (`AttachedStale` is meant to avoid doing uploads, `AttachedMulti` is meant to avoid doing deletions).
- A new HTTP API is added at `PUT /tenants/<tenant_id>/location_config` to drive new location configuration. This provides a superset of the functionality of attach/detach/load/ignore:
  - Attaching a tenant is just configuring it in an attached state
  - Detaching a tenant is configuring it to a detached state
  - Loading a tenant is just the same as attaching it
  - Ignoring a tenant is the same as configuring it into Secondary with warm=false (i.e. retain the files on disk but do nothing else).

Caveats:
- `AttachedMulti` tenants don't do compaction in this PR, but they do in the follow on #5397
- Concurrent updates to the `location_config` API are not handled elegantly in this PR, a better mechanism is added in the follow on #5367
- Secondary mode is just a placeholder in this PR: the code to upload heatmaps and do downloads on secondary locations will be added in a later PR (but that shouldn't change any external interfaces)

Closes: #5379

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
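To illustrate two points from the commit message above, here is a minimal Python sketch (the pageserver itself is not written in Python). It shows how the four legacy operations could map onto a single location configuration, and how a reader could prefer `config-v1` and fall back to `config`. The mode strings, the `warm` key, the dictionary shape, and all function names are assumptions made for illustration; they are not the pageserver's actual request schema or code.

```python
import tempfile
from pathlib import Path
from typing import Tuple

# Hypothetical mode names, loosely following the states named in the commit message.
ATTACHED = "Attached"
SECONDARY = "Secondary"
DETACHED = "Detached"


def legacy_op_to_location_config(op: str) -> dict:
    """Map a legacy tenant operation to an illustrative location_config body."""
    if op in ("attach", "load"):
        # Loading a tenant is described as the same as attaching it.
        return {"mode": ATTACHED}
    if op == "detach":
        return {"mode": DETACHED}
    if op == "ignore":
        # Retain the files on disk but do nothing else.
        return {"mode": SECONDARY, "warm": False}
    raise ValueError(f"unknown legacy operation: {op}")


def write_config(tenant_dir: Path, legacy_body: str, new_body: str) -> None:
    """Forward compat: write both the old-style and the new-style file."""
    (tenant_dir / "config").write_text(legacy_body)
    (tenant_dir / "config-v1").write_text(new_body)


def read_config(tenant_dir: Path) -> Tuple[str, str]:
    """Backward compat: prefer config-v1, fall back to the old-style config."""
    new_path = tenant_dir / "config-v1"
    if new_path.exists():
        return ("config-v1", new_path.read_text())
    return ("config", (tenant_dir / "config").read_text())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tenant_dir = Path(tmp)
        write_config(tenant_dir, "old-format", "new-format")
        print(read_config(tenant_dir))
        print(legacy_op_to_location_config("ignore"))
```

Writing both files costs a second write per update, but it lets older pageserver code keep reading `config` during a rollback, which is the forward-compatibility property the commit message describes.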
Problem
Currently we don't have a way to migrate tenants from one pageserver to another without risking a gap in availability.
Summary of changes
This follows on from #4919
Migrating tenants between pageservers is essential to operating a service
at scale, in several contexts:
database and they need to migrate to a pageserver with more capacity.
Currently, a tenant may be migrated by attaching it to a new node,
re-configuring endpoints to use the new node, and then later detaching it from the old node. This is safe once generation numbers are implemented, but does not meet
our seamless/fast/efficient goals: