diff --git a/docs/rfcs/026-warm-secondary.md b/docs/rfcs/026-warm-secondary.md
new file mode 100644
index 000000000000..46fc521b6ad2
--- /dev/null
+++ b/docs/rfcs/026-warm-secondary.md
@@ -0,0 +1,496 @@
# Fast tenant transfers for high availability

- Author: john@neon.tech
- Created on 2023-08-11
- Implemented on ..

## Summary

The preceding generation numbers RFC may be thought of as "making tenant
transfers safe". Following that, this RFC is about how those transfers are
to be done _quickly_ and _efficiently_.

In this context, doing a migration "quickly" means doing it with the smallest
possible window of disruption to workloads, in the event that we are migrating
because of a failure.

This is accomplished by introducing two high-level changes:

- A dual-attached state for tenants, used in a control-plane-orchestrated
  migration procedure that preserves availability during a migration.
- Warm secondary locations for tenants, where on-disk content is primed
  for a fast migration of the tenant from its current attachment to this
  secondary location.

## Motivation

Migrating tenants between pageservers is essential to operating a service
at scale, in several contexts:

1. Responding to a pageserver node failure by migrating tenants to other pageservers
2. Balancing load and capacity across pageservers, for example when a user
   expands their database and it needs to migrate to a pageserver with more capacity.
3. Restarting pageservers for upgrades and maintenance

In all of these cases, users expect the service to minimize any loss of
availability. In the case of proactive, intentional movement, users reasonably
expect absolutely no noticeable loss of availability. If transfers have a user
impact (as they currently do), we are discouraged from otherwise useful
proactive tenant movement, e.g. for balancing load. Therefore, making tenant
movement low-impact for users doesn't just make failovers faster: it unlocks
the ability to do more continuous management of workload on pageservers using
proactive migrations from one healthy node to another.

Currently, a tenant may be re-attached by detaching it from an old node and
attaching it to a new one. Once the generation numbers RFC is implemented,
it is also safe to attach a tenant to a new pageserver without detaching
it from the old pageserver (for example if the old pageserver is unresponsive,
stuck in a terminating state, or network partitioned).

Let us consider the pageserver client's view (i.e. postgres's view) when we do
such a detach/attach cycle:

- Between the detach and attach, no reads may be serviced.
- After attach, LSNs above the remote consistent LSN will not be readable
  until the new pageserver has replayed the WAL up to the present point.
- After attach, reads will have much higher latency until the new pageserver
  has populated its local disk with any layers that the client is reading from.
  For a client doing random reads, this will require loading at least the total
  size of the database in image layers, plus whatever delta layers are required.

Those availability windows are substantial, and the purpose of this RFC is
to define how to close these gaps in two ways:

- Remove the inter-attachment availability gap by using multi-attachment of
  tenants during a migration.
- Remove the cold-cache availability/latency gap by adding secondary
  locations for tenants that will have pre-warmed caches.
## Non Goals (if relevant)

- We do not aim to have the pageservers fail over if the
  control plane is unavailable.
- On unplanned migrations (node failures), we do not aim to prevent a small,
  bounded window of read unavailability of very recent LSNs (because the
  postgres page cache usually contains such pages, we do not expect them to
  be read frequently from the pageserver).

## Impacted components (e.g. pageserver, safekeeper, console, etc)

Pageserver, control plane

## Proposed implementation

Since the Generation Numbers RFC enables safe multi-attachment of tenants, we
may exploit this for seamless migration.

The high level flow is:

- Attach the tenant to both old and new pageservers.
- Wait for the new pageserver to catch up to an LSN at least as recent as the
  old one's.
- Update Endpoints to use the new pageserver.
- Detach the tenant from the old pageserver.

In addition to that migration flow, the new pageserver will have a warm cache,
by being identified as a "secondary warm location" by the control plane, and
continuously downloading layers from S3 in the background in case it needs to
receive an incoming migration quickly.

### Cutover procedure

The exact steps for the cleanest possible cutover depend on whether we are
migrating from a healthy node or from a degraded/unresponsive node. However,
to avoid defining distinct procedures for these cases, we will define one
procedure with certain optional/advisory steps. The correctness-critical parts
of the procedure are the same for any tenant migration.

To accommodate the need for some attachments to behave differently while
multi-attached, we will break up the current "attached" state into two states
(sketched in code below):

- **AttachedSingle**: our current definition of "attached": the attached node
  has exclusive control over the tenant.
- **AttachedMulti**: uploads are permitted but deletions are blocked.

If a node is in AttachedSingle and witnesses a higher generation number for
the tenant, it will automatically switch into AttachedMulti. For a node with a
stale generation number, AttachedMulti is equivalent to failing the
pre-deletion checks introduced in the Generation Numbers RFC, but at a higher
level, where we avoid even trying to delete anything or to compact remote
data.

A node in AttachedMulti with a stale generation number will also avoid doing
any S3 uploads, since it knows that future generations will not see uploads
made in a stale generation. This last part is not necessary for correctness,
but avoids writing useless objects to S3.

The AttachedMulti state is needed to preserve availability of reads on node A
during the migration: otherwise node B might execute deletions that affect
objects that node A's index (in its generation) still references. It is not
necessary for data integrity: if the control plane put two nodes concurrently
into AttachedSingle, the generation number safety rules would avoid any
corruption, but the attachment with the older generation could see its remote
objects deleted by the attachment with the newer generation, and thereby
become unavailable for reads.
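As a minimal sketch of the two attachment modes and the generation-witnessing
rule described above, assuming hypothetical type names (`Attachment`,
`Generation`) rather than the pageserver's actual internals:

```rust
/// Minimal sketch (hypothetical types) of the attachment states above.
/// Attachment generation numbers come from the Generation Numbers RFC.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Generation(u32);

#[derive(Clone, Copy, PartialEq, Eq)]
enum AttachmentMode {
    /// Exclusive control: uploads, deletions and remote compaction allowed.
    AttachedSingle,
    /// Co-attached during a migration: uploads permitted, deletions blocked.
    AttachedMulti,
}

struct Attachment {
    mode: AttachmentMode,
    /// The generation this node was attached in.
    generation: Generation,
    /// The highest generation this node has witnessed for the tenant.
    latest_seen: Generation,
}

impl Attachment {
    /// Called when we learn of a newer generation, e.g. via the notification
    /// RPC in the migration procedure below.
    fn witness_generation(&mut self, seen: Generation) {
        self.latest_seen = self.latest_seen.max(seen);
        if self.latest_seen > self.generation {
            // Our generation is stale: drop to AttachedMulti so that we
            // never even attempt deletions or remote compaction.
            self.mode = AttachmentMode::AttachedMulti;
        }
    }

    fn may_delete_remote(&self) -> bool {
        self.mode == AttachmentMode::AttachedSingle
    }

    /// Uploads in a stale generation are harmless but useless: no future
    /// generation will ever read them, so skip them.
    fn may_upload(&self) -> bool {
        self.latest_seen <= self.generation
    }
}
```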
For a migration from old node A to new node B (sketched in code below):

1. (Optional, may fail) RPC to node A requesting that it flush to S3.
2. Increment the attachment generation number and attach node B in state
   AttachedMulti.
3. (Optional, may fail) RPC to node A notifying it of the new generation. If
   this step fails, node A will still eventually fail to do deletions when it
   fails to validate its attachment generation number (see the Generation
   Numbers RFC). If this step succeeds, node A will stop all S3 writes, since
   its generation is stale, but remain available for reads.
4. (Optional) If node A was available in the previous step, RPC to node B
   requesting that it download the full latest image layer, and wait for this
   to complete before proceeding.
5. Enter a catch-up polling loop, where we read the latest visible LSN from
   both nodes, and proceed when node B's LSN catches up to node A's LSN, or
   when node A does not respond.
6. Update Endpoints to use node B instead of node A.
7. RPC to node B to set state AttachedSingle.
8. (Optional, may fail) RPC to node A to detach.

We have two different user experiences depending on whether node A was
responsive:

- The guaranteed end state, if node B is healthy, is that the endpoint ends up
  talking to a pageserver that will eventually serve all its reads once it
  catches up with the WAL.
- The happy-path end state is that we waited until node B caught up with node
  A before cutting over, so there was no degradation in the endpoint's
  experience associated with the cutover: any LSN gap it sees is no worse on
  node B than it was on node A.

The "optional" steps can be proactively skipped if the control plane believes
node A is offline, so that we avoid N different tenant migrations all trying
and failing to execute the optional RPCs.
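To make the sequencing concrete, here is a rough control-plane sketch of the
eight steps. Everything here is hypothetical (the `ControlPlane` trait and its
methods are invented for illustration); only the ordering and the
optional/required distinction come from the procedure above.

```rust
use std::time::Duration;

// Hypothetical stand-ins for control plane machinery.
#[derive(Clone, Copy)] struct TenantId(u128);
#[derive(Clone, Copy)] struct NodeId(u64);
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)] struct Lsn(u64);
#[derive(Clone, Copy)] struct Generation(u32);
enum Mode { AttachedSingle, AttachedMulti }

trait ControlPlane {
    async fn flush_to_s3(&self, node: NodeId, t: TenantId) -> anyhow::Result<()>;
    async fn increment_generation(&self, t: TenantId) -> anyhow::Result<Generation>;
    async fn attach(&self, node: NodeId, t: TenantId, g: Generation, m: Mode) -> anyhow::Result<()>;
    async fn notify_generation(&self, node: NodeId, t: TenantId, g: Generation) -> anyhow::Result<()>;
    async fn download_hot_layers(&self, node: NodeId, t: TenantId) -> anyhow::Result<()>;
    async fn visible_lsn(&self, node: NodeId, t: TenantId) -> anyhow::Result<Lsn>;
    async fn update_endpoints(&self, t: TenantId, node: NodeId) -> anyhow::Result<()>;
    async fn set_mode(&self, node: NodeId, t: TenantId, m: Mode) -> anyhow::Result<()>;
    async fn detach(&self, node: NodeId, t: TenantId) -> anyhow::Result<()>;
}

async fn migrate(cp: &impl ControlPlane, t: TenantId, a: NodeId, b: NodeId) -> anyhow::Result<()> {
    let _ = cp.flush_to_s3(a, t).await;                          // 1. optional
    let gen = cp.increment_generation(t).await?;                 // 2.
    cp.attach(b, t, gen, Mode::AttachedMulti).await?;
    let a_alive = cp.notify_generation(a, t, gen).await.is_ok(); // 3. optional
    if a_alive {
        cp.download_hot_layers(b, t).await?;                     // 4. optional
        // 5. Catch-up loop: stop early if node A stops responding.
        while let Ok(a_lsn) = cp.visible_lsn(a, t).await {
            if cp.visible_lsn(b, t).await? >= a_lsn { break; }
            tokio::time::sleep(Duration::from_millis(500)).await;
        }
    }
    cp.update_endpoints(t, b).await?;                            // 6.
    cp.set_mode(b, t, Mode::AttachedSingle).await?;              // 7.
    let _ = cp.detach(a, t).await;                               // 8. optional
    Ok(())
}
```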
#### Recovering from failures during cutover

We may be unlucky: node B may fail while we are cutting over. The most likely
reason for this would be a cascading failure, where we have overloaded a node
that we are trying to migrate work to. We should endeavor to avoid this, but
also design to handle it safely.

If, when starting a migration, we save the nodes' generation numbers in the
control plane database, then the control plane's reconciliation loop may
reliably "notice" if a node that was participating in a migration is no longer
alive.

We may _not_ "fail back" to make node A an AttachedSingle under its original
generation number, because node B may have already written its own metadata to
remote storage, and any subsequent generation would end up reading node B's
metadata instead of node A's.

We may ask node A to do a quick detach/attach cycle to transition from its old
generation to a new generation, but that is only useful if node A is available
and healthy (there was probably some reason we were migrating _away_ from it
to begin with). If nodes A and B are both fenced, then there is nothing to
recover from, and we can simply put a new node directly into AttachedSingle
with a new generation number. In this context "fenced" means that a node's
generation number is higher than the generation number we recorded at the
start of the migration.

Given that destination node failures during a migration will be rare events,
rather than providing a fine-grained set of recovery plans for all possible
failure states, we may adopt a simple model with three procedures depending on
the state of node A and the purpose of the migration:

- If node A was offline (failover case), then immediately select a new
  pageserver for the tenant, and attach it in AttachedSingle. It will download
  from S3 and try to catch up with the WAL: there will be some unavoidable
  availability gap for clients, since we suffered a double node failure (node
  A failed, then node B failed while we were migrating node A's work).
- If we were migrating for a node evacuation, and node A is still online, then
  ensure endpoints are still directed at node A, identify a new migration
  destination based on the same pageserver selection logic used for new
  tenants, attach to that new destination in AttachedMulti, and resume the
  migration procedure.
- If we were migrating because of a specific API request asking to move a
  tenant to a particular node, then detach from node A and re-attach in state
  AttachedSingle. This will have a brief availability gap, but that is
  tolerable because this scenario is very rare (a node failure while we were
  manually migrating a tenant). We could in principle avoid this gap with a
  special case in the pageserver code to fast-forward an attachment's
  generation without detaching, but it is not good value-for-money in
  complexity.

At the end of the migration operation:

- the tenant will be available, assuming there was at least one healthy
  pageserver with enough capacity available.
- the actual location where it ends up depends on whether there was a failure
  during the migration, and whether the operation was a node evacuation or an
  explicit point-to-point movement of a tenant.
- the attached pageserver with the highest generation number will be in state
  AttachedSingle.

## Warm secondary locations

**Terminology note**: the term "warm standby" is overloaded (it also has a
meaning in postgres), so we avoid it here and refer to "warm secondary
locations".

We introduce the concept of a warm secondary location, meaning a pageserver
to which a tenant may be attached very quickly due to having pre-warmed
content on local disk.

A tenant may have multiple warm secondary locations. If this set includes the
currently attached pageserver, then the attachment "wins", and the node will
only act as a warm secondary location once it is no longer attached.

The secondary location's job is to serve reads **with the same quality of
service as the original location was serving them around the time of a
migration**. This does not mean the secondary location needs the whole
database downloaded:

- If the old location has not served any read requests recently, then the
  warm secondary may not need any local content at all: it can "be idle" just
  as well as the primary was being idle, without any local data.
- If the old location's pattern of client reads was concentrated on certain
  layers, then the secondary only needs to preload those layers, not all
  content for the tenant.

**Terminology note**: an "attachment" still means a pageserver that is
attached for reads and writes: a warm secondary location is not considered
attached, and does not have an attachment generation number. The tenant is
only attached to one pageserver (unless it is double-attached during a
migration).

## Layer heat map

The attached pageserver is responsible for notifying all secondary locations
of enough information to decide which layers to download to their local
storage.

This could be done by simply sending a list of layers that the attached node
deems hot enough, but that would prevent the secondary from applying its own
policy about how much to download from each tenant. For example, a secondary
may have a storage limitation that motivates only downloading the hottest
layers from all tenants.

The heat map may initially be quite simple: a collection of layer names, with
a heat for each one. The heat may be a time-decaying counter of page reads to
the layer. The heat map only includes layers which have a nonzero counter,
where some threshold is set for clamping counters to zero when they decay past
it. This threshold may be set based on a time policy, such as keeping
something warm for 10 minutes, i.e. ensuring that a layer which had 1 read 11
minutes ago would have a zero heat.
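A sketch of what this simple heat map could look like, with an exponentially
decaying read counter. The half-life and clamp constants are assumed tuning
knobs, chosen here so that a single read decays to zero heat in roughly ten
minutes, matching the example above:

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical per-layer entry maintained by the attached pageserver.
struct LayerHeat {
    /// Decayed count of page reads hitting this layer.
    reads: f64,
    /// When `reads` was last updated.
    last_access: SystemTime,
}

// Tuning knobs (assumptions): with a 120s half-life and this clamp, a single
// read decays to zero heat after roughly ten minutes.
const HALF_LIFE: Duration = Duration::from_secs(120);
const ZERO_CLAMP: f64 = 0.04;

impl LayerHeat {
    fn record_read(&mut self, now: SystemTime) {
        self.reads = self.current_heat(now) + 1.0;
        self.last_access = now;
    }

    /// Exponentially decay the counter; heats past the clamp threshold count
    /// as zero and are omitted from the published heat map.
    fn current_heat(&self, now: SystemTime) -> f64 {
        let elapsed = now.duration_since(self.last_access).unwrap_or_default();
        let halves = elapsed.as_secs_f64() / HALF_LIFE.as_secs_f64();
        let heat = self.reads * 0.5_f64.powf(halves);
        if heat < ZERO_CLAMP { 0.0 } else { heat }
    }
}

/// The heat map published to secondaries: layer names with nonzero heat.
struct HeatMap {
    layers: Vec<(String, f64)>,
}
```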
Secondary locations may combine the heat map with knowledge of layer sizes to
determine a "cost/benefit ratio" for pre-downloading a layer, and prioritize
downloads based on that.

Over time, the heat map may be evolved to account for a more complex
understanding of access patterns and costs.

## Local attachment metadata

To facilitate the finer-grained states that a tenant may have within a
pageserver (beyond just being attached or not) without requiring
synchronization with the control plane on startup, a metadata file will be
written to the tenant's directory in local storage, updated when events such
as attach/detach happen. This file is needed:

- to distinguish between AttachedSingle and AttachedMulti;
- to retain a memory of the latest generation seen, as well as the attached
  generation;
- to distinguish between warm secondary locations and attachments: these will
  use the same layer files in the same directory hierarchy, so that when we
  switch between secondary status and attached status, we don't move anything
  around.

This might later evolve into more of a manifest of the contents of local
storage, but for now it is just an O(1) record of the details of the
attachment, now that an attachment carries more state than a simple
attached/detached flag.

## Secondary download/housekeeping logic

Secondary warm locations run a simple loop (sketched below), implemented
separately from the main `Tenant` type, which represents attached tenants:

- Periodically RPC to the attached pageserver (discovered via the storage
  broker) to retrieve the layer heat map.
- Select any "hot enough" layers to download, if there is sufficient free
  disk space.
- Download those layers.
- Download the latest `index_part.json` files.
- Check if any layers currently on disk are no longer referenced by the
  tenant's metadata, and delete them.
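The housekeeping loop might look roughly like the following. Every interface
here (`SecondaryTenant` and its methods) is a hypothetical stand-in; the body
just mirrors the bulleted steps:

```rust
use std::time::Duration;

// Hypothetical interfaces; the steps mirror the bulleted loop above.
struct LayerName(String);
struct HeatMapEntry { name: LayerName, heat: f64 }

trait SecondaryTenant {
    /// Find the attached pageserver via the storage broker and fetch its heat map.
    async fn fetch_heatmap(&self) -> anyhow::Result<Vec<HeatMapEntry>>;
    fn has_free_space(&self) -> bool;
    fn has_layer(&self, l: &LayerName) -> bool;
    async fn download_layer(&self, l: &LayerName) -> anyhow::Result<()>;
    /// Download the latest index_part.json for each timeline; returns the
    /// set of layers still referenced by remote metadata.
    async fn refresh_indices(&self) -> anyhow::Result<Vec<LayerName>>;
    fn local_layers(&self) -> Vec<LayerName>;
    fn delete_layer(&self, l: &LayerName) -> anyhow::Result<()>;
}

async fn keep_warm(tenant: &impl SecondaryTenant) -> anyhow::Result<()> {
    loop {
        // Download the hottest layers first, bounded by available disk space.
        let mut heatmap = tenant.fetch_heatmap().await?;
        heatmap.sort_by(|a, b| b.heat.total_cmp(&a.heat));
        for entry in &heatmap {
            if !tenant.has_free_space() { break; }
            if !tenant.has_layer(&entry.name) {
                tenant.download_layer(&entry.name).await?;
            }
        }
        // Evict local layers that remote metadata no longer references.
        let referenced = tenant.refresh_indices().await?;
        for local in tenant.local_layers() {
            if !referenced.iter().any(|r| r.0 == local.0) {
                tenant.delete_layer(&local)?;
            }
        }
        tokio::time::sleep(Duration::from_secs(60)).await; // assumed period
    }
}
```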
## Configuring secondary locations

A new API, `/v1/tenants/:tenant_id/configure`, will be added, to enable
configuring a particular tenant's behavior on a node in a more general way
than just attaching or detaching it (an example call is sketched below).

Sending a DELETE to this endpoint will cause the pageserver to stop holding a
warm cache for the tenant. A DELETE will get a `400` response if the tenant is
currently attached to this node: the control plane must detach it first.

## Promotion/demotion from warm secondary to/from attachment

No special behavior is required from the control plane. The pageserver will
remember what was previously done with the `/configure` API, such that if a
tenant was configured as a secondary location and then attached, a subsequent
detach will revert it to a secondary location and the cache keep-warm loop
will reactivate.

If a node is in use as a secondary location, then a normal
`/v1/tenants/:tenant_id/attach` API request may be made, and the pageserver
will internally manage the transition from running the keep-warm loop to
running a fully attached tenant.
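For illustration, a control-plane call to the new endpoint might look like
this. The endpoint path and the DELETE semantics come from this RFC; the HTTP
verb and the request body shape are assumptions, since the payload format is
not specified here:

```rust
use serde_json::json;

/// Hypothetical sketch: configure a tenant as a warm secondary on
/// `pageserver`. The body shape and PUT verb are assumptions.
async fn configure_secondary(
    client: &reqwest::Client,
    pageserver: &str,
    tenant_id: &str,
) -> anyhow::Result<()> {
    let url = format!("{pageserver}/v1/tenants/{tenant_id}/configure");
    let resp = client
        .put(url)
        .json(&json!({ "mode": "Secondary" })) // assumed payload
        .send()
        .await?;
    // A DELETE to the same URL would drop the warm cache (400 if attached).
    anyhow::ensure!(resp.status().is_success(), "configure failed: {}", resp.status());
    Ok(())
}
```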
## Correctness

### Keep-warm downloads are advisory only

We avoid relying on secondaries for any correctness guarantees, other than
that the layers a secondary has downloaded will match the layers written to
remote storage.

More specifically:

- The secondary is not required to download everything in the heat map: it
  may autonomously deprioritize this work and/or reclaim disk space.
- The secondary is not required to meet any freshness requirement for data or
  metadata: when a tenant is attached, it is the attachment that is required
  to synchronize with remote metadata before proceeding.

### Preserving read availability on the old pageserver during migration

Generation numbers provide storage safety for attaching to a new pageserver
without detaching the old one, but one extra change is needed to preserve
availability on the previous attachment while this is going on: the new
pageserver must not do any deletions until we no longer want the old
pageserver to be able to serve reads.

This corresponds to the period where we have two attachments that may both
serve reads, but the old attachment has more recent LSNs readable than the
new attachment.

The old attachment is implicitly prevented from doing deletions because the
new attachment has gained a higher generation number.

## Alternatives considered

### Pageserver-granularity failover

Instead of migrating tenants individually, we could have entire spare nodes,
and on a node death, move all its work to one of these spares.

This approach is avoided for several reasons:

- we would still need fine-grained tenant migration for other purposes, such
  as balancing load.
- by spreading the spare capacity over many peers rather than one spare node,
  those peers may use the capacity for other purposes until it is needed to
  handle migrated tenants, e.g. for keeping a deeper cache of their attached
  tenants.

### Cold secondary locations

We could implement all the migration logic without implementing warm
secondaries: this would not hurt availability during planned migrations, but
it would make unplanned migrations (i.e. failover on node death) unacceptably
slow, as peers would have to download hundreds of GiB from S3 before resuming
service.

### Hot secondary locations

Instead of our "warm" secondary locations that only download remote data, we
could use "hot" secondary locations that continuously replay the WAL to local
storage. This would provide even faster transfers, at the cost of imposing
double load on the safekeepers all the time (not just during a migration),
and requiring more local disk bandwidth to stream all WAL writes into delta
L0 layers and compact them.

A hot secondary location would have a faster cutover, but this only matters
in the event of a total pageserver failure: for planned migrations where the
original pageserver is online, warm migrations present the same level of
availability and performance to clients; they just take slightly longer to
complete.

### Readonly during migration

We could simplify migrations by making both the previous and new nodes go
into a readonly state, then flush remote content from the previous node, then
activate the attachment on the secondary node.

The downside to this approach is a potentially large gap in readability of
recent LSNs while loading data onto the new node. To avoid this, it is
worthwhile to incur the extra cost of double-replaying the WAL onto the old
and new nodes' local storage during a transfer.

### Warm Secondary Locations

The cutover process defined above is correct and preserves availability, but
it may not be fast: if the new pageserver doesn't have anything in its local
cache, then completing a migration for a large database will take time
proportional to the database's size.

Fast cutovers are important when responding to an unexpected node failure, or
when migrating load to respond to a spike.

Our architecture relies on having a good cache hit rate on local disk when
serving reads and doing compaction, so a fast failover may only be
accomplished by having a similarly good cache hit rate on the new location,
by pre-populating it with data.

### Control Plane Changes

This section is speculative; the exact implementation may vary.

#### Database schema

Currently, the `Project` object in the control plane database schema only
carries a single pageserver ID. This will be extended to represent multiple
attachments, and to represent a list of locations.

Each location should record (see the sketch below):

- u32: attachment generation number (may be null if never attached)
- enum(attached/attaching/detached): whether currently attached. The
  attaching state means we are pending sending an `/attach` call to the
  pageserver.
- bool: whether in use by endpoints
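A hypothetical Rust-side mirror of that per-location record (the real
representation would be a database schema; names here are illustrative):

```rust
/// Illustrative mirror of the per-location record described above.
struct TenantLocation {
    pageserver_id: u64,
    /// Attachment generation number; None if this location was never attached.
    generation: Option<u32>,
    state: LocationState,
    /// Whether endpoints currently point at this location.
    in_use_by_endpoints: bool,
}

enum LocationState {
    Attached,
    /// We have yet to send the `/attach` call to the pageserver.
    Attaching,
    Detached,
}
```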
#### Driving migrations

Currently, there is a `tenant_migrate` endpoint in the v2 management API,
which is implemented as a series of individual `Operation` records that
suspend the endpoint, disable ("ignore") the tenant on the original server,
attach it to the new one, resume the compute, and finally detach from the old
server.

We may adapt that existing structure to the new migration procedure, with a
new migration `Operation` that encloses multiple steps -- the procedure
described in this RFC would be quite awkward to implement as separate steps,
since if there are failures we may change our mind about the ultimate
destination of the migration.

It may make sense to have a separate `Operation` type for bulk movements, so
that when draining a node prior to maintenance, we can dynamically choose
where to send tenants based on load over a series of batches, rather than
having to choose all destinations up front.

Migration operations may have two variants:

- Transient: move attachments, but leave warm secondary locations behind: we
  expect that the original server will be used again.
- Permanent: move attachments away, and also try to find new warm secondary
  locations for tenants that were using this pageserver. The end state is
  that this pageserver has no attachments and is also not a location for any
  tenant.

#### Handling unresponsive pageservers

The `Project` object should have a column to track whether a migration is in
progress: during a migration, the `Operation` for the migration is
responsible for all attach/detach calls out to pageservers. Outside of that
operation, some reconciliation process should identify pageservers that have
stale attachments: e.g. if we migrated away from an unavailable pageserver
and it later becomes available, we should detach the tenant from it.

Migrations treat detaching from the old pageserver as optional, so after a
migration triggered by node failure, we may leave the Project in a state
where the "previous pageserver still attached" column is true.

Some background process will be needed to reconcile this when the pageserver
eventually comes back online. Also, if we add a way to administratively
delete a pageserver in the control plane, that path will need to clean up any
such stale attachment records.

#### Interaction with Endpoints

Currently, endpoints are suspended and then restarted during a migration: the
endpoints must be adapted to accept a configuration change at runtime.

This RFC only deals with pageservers: runtime modification of endpoint
configuration is out of scope, but is believed to be a realistic change to
make in the near future.

#### Managing warm secondary locations

Warm secondary locations operate similarly to attachments: to set one, it
must be written to the database, and then an RPC is sent to the pageserver in
question to synchronize it.

When creating tenants, a warm secondary location may be selected using the
same logic as picking the pageserver to attach to, and the attached
pageserver should also be set as a warm secondary.

When detaching tenants, the control plane may indicate inline to the
pageserver whether the tenant should be retained as a warm location or
completely removed, depending on whether the detaching pageserver still
appears in the list of locations in the control plane database.

### Reliability, failure modes and corner cases (if relevant)

### Interaction/Sequence diagram (if relevant)

### Scalability (if relevant)

### Security implications (if relevant)

### Unresolved questions (if relevant)

## Alternative implementation (if relevant)

## Pros/cons of proposed approaches (if relevant)

## Definition of Done (if relevant)