
docs: add RFC for remote storage generation numbers #4919

Merged
merged 75 commits into main from jcsp/rfc-generation-numbers
Aug 30, 2023

Conversation

jcsp
Contributor

@jcsp jcsp commented Aug 7, 2023

Summary

A scheme of logical "generation numbers" for pageservers and their attachments is proposed, along with
changes to the remote storage format to include these generation numbers in S3 keys.

Using the control plane as the issuer of these generation numbers enables strong anti-split-brain
properties in the pageserver cluster without implementing a consensus mechanism directly
in the pageservers.
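
To illustrate the idea, here is a minimal sketch (the key layout below is hypothetical; the RFC defines the actual format): objects written under different generations get distinct keys, so a stale writer cannot overwrite a newer attachment's objects.

```rust
// Illustrative sketch only: remote object keys carry the attachment's
// generation, so writers from different generations never overwrite
// each other's objects. The real key scheme is specified in the RFC.
fn remote_index_key(tenant_id: &str, timeline_id: &str, generation: u32) -> String {
    // Hypothetical layout for illustration.
    format!("tenants/{tenant_id}/timelines/{timeline_id}/index_part.json-{generation:08x}")
}

fn main() {
    // Two pageservers attached in different generations write distinct keys,
    // so a stale writer cannot clobber the newer node's index.
    println!("{}", remote_index_key("tenant-a", "tl-1", 6));
    println!("{}", remote_index_key("tenant-a", "tl-1", 7));
}
```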

Motivation

Currently, the pageserver's remote storage format does not provide a mechanism for addressing
split brain conditions that may happen when replacing a node during failover or when migrating
a tenant from one pageserver to another. From a remote storage perspective, a split brain condition
occurs whenever two nodes both think they have the same tenant attached, and both can write to S3. This
can happen in the case of a network partition, pathologically long delays (e.g. suspended VM), or software
bugs.

This blocks robust implementation of failover from unresponsive pageservers, due to the risk that
the unresponsive pageserver is still writing to S3.

@jcsp jcsp added the c/storage/pageserver (Component: storage: pageserver) and t/tech_design_rfc (Issue type: tech design RFC) labels Aug 7, 2023
@jcsp jcsp requested review from kelvich and problame August 7, 2023 15:46
@github-actions

github-actions bot commented Aug 7, 2023

1624 tests run: 1548 passed, 0 failed, 76 skipped (full report)


The comment gets automatically updated with the latest test results
cf564f3 at 2023-08-30T08:18:32.984Z :recycle:

Contributor

@LizardWizzard LizardWizzard left a comment

Thanks for writing this up!

I think with some tweaking this can fly.

IMO, for easier reading it's a good idea to add more references and Mermaid diagrams outlining the key operations involved.

Additionally, I think this lacks a comparison with previously discussed approaches. What are the downsides of this one compared to the others?

Also, a discussion started in Slack and @knizhnik made some points, so I'd reference them here: https://neondb.slack.com/archives/C05EPKK8K6G/p1691478137757059?thread_ts=1691423240.177529&cid=C05EPKK8K6G

@knizhnik
Contributor

knizhnik commented Aug 8, 2023

IMHO this proposal requires quite a lot of changes in all Neon components: PS, SK, control plane, ...
It introduces a new file name format. There are a lot of questions and concerns (see my and @LizardWizzard's comments).

At the same time, data corruption is not such a frequent event (unlike crashes, split brain, ... and many other causes of temporary PS unavailability). During the last two years it has happened only once. Certainly, once we have more servers there will be more data corruption incidents, but they should still be quite rare.

Also, please take into account that spawning a new PS is a very expensive operation, because we lose all the storage contents of the old PS and have to reload them from S3.

So maybe we should try to find some simpler solution to the problem?
The classical approach is leasing: the PS is granted the leader role by the control plane for some period of time. After this period expires, it has to request a new lease from the control plane. This timeout can be quite big, for example 10 seconds or maybe even a minute. The idea is that we should not spawn a new PS automatically without human approval, and it should be done only if we are completely sure that there is no chance of recovering the original PS. So such a decision in any case cannot (and should not) be taken immediately.

So if the PS seems to be dead, we send an alert but do nothing until the lease timeout expires. After the timeout expires, the control plane can either automatically spawn a new PS or wait until it is done explicitly by a DBA. In any case, at that moment the lease of the old PS has already expired, and it would have to send a new lease request to the control plane (and would get a negative response).

IMHO such an approach requires fewer changes in Neon and is safer, because it excludes the situation of two or more concurrently running PSs. Certainly, we should take into account clock measurement error and the fact that a PS cannot be stopped immediately; for example, the control plane should wait not for the lease timeout but for timeout*2. But that is a technical detail.
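
A minimal sketch of this leasing idea, assuming a hypothetical `Lease` type and timings; this is the alternative suggested in this comment, not what the RFC proposes:

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

// Hypothetical lease handed out by the control plane: the pageserver may only
// act on a tenant while the lease is still valid, and must renew it before expiry.
struct Lease {
    granted_at: Instant,
    duration: Duration,
}

impl Lease {
    fn is_valid(&self) -> bool {
        self.granted_at.elapsed() < self.duration
    }
}

fn main() {
    let lease = Lease { granted_at: Instant::now(), duration: Duration::from_secs(10) };
    assert!(lease.is_valid());
    sleep(Duration::from_millis(50));
    // Before any S3 write or deletion, the pageserver would re-check (and renew)
    // the lease; the control plane would only re-attach the tenant elsewhere
    // after roughly 2x the lease duration has elapsed without a renewal,
    // to allow for clock error and shutdown latency.
    println!("lease still valid: {}", lease.is_valid());
}
```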

@LizardWizzard
Contributor

I agree with @knizhnik that the required changes are quite invasive. There also needs to be a way to handle existing layers/index parts that lack the new suffixes.

After the timeout expires, the control plane can either automatically spawn a new PS or wait until it is done explicitly by a DBA

I think it can be a bit simpler: IMO we should always have some spare capacity to distribute tenants from a single failing pageserver, because the current process of spawning a pageserver takes a lot of time (tens of minutes at best).

Also see this thread in Slack where I've tried to measure the impact of a reattach operation on a running pgbench.

@jcsp
Contributor Author

jcsp commented Aug 10, 2023

There also needs to be a way to handle existing layers/index parts that lack the new suffixes.

Added a section "On-disk backward compatibility", along with various other aspects in the "Operational impact" section.

@jcsp
Contributor Author

jcsp commented Aug 10, 2023

Inlining some of the questions from the Slack thread here:

Locating the right version of index_part.json: it looks like the PS needs to poll S3 for all generations of this file, starting from the one assigned to this PS and decrementing until reaching 0 or until some previous generation is found. Certainly there are not expected to be many generations, but I still do not like such a polling solution.

I've updated the "pageserver::Tenant startup changes" section to make clear that the listing is just a fallback, and in normal operation the
pageserver will be able to directly GET the most recent index based on knowledge of what the last attachment's generation numbers were. We
can either pass that around in the control plane, or fetch it from the storage broker (in latest doc, the storage broker has generation
information propagated from safekeeper Pageserverfeedback messages.
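
Roughly, the startup lookup could look like the sketch below; `get_object` and `list_generations` are hypothetical stand-ins for the remote storage client, and the key format is illustrative:

```rust
// Hedged sketch of the startup lookup described above: try a direct GET at the
// last known generation, and only fall back to listing when that is missing.
fn load_index_part(
    known_prev_generation: Option<u32>,
    get_object: impl Fn(&str) -> Option<Vec<u8>>,
    list_generations: impl Fn() -> Vec<u32>,
) -> Option<Vec<u8>> {
    // Fast path: in normal operation we know the previous attachment's
    // generation and can GET its index directly.
    if let Some(generation) = known_prev_generation {
        if let Some(body) = get_object(&format!("index_part.json-{generation:08x}")) {
            return Some(body);
        }
    }
    // Fallback (rare path, not a routine poll): list the index objects and
    // load the one from the most recent generation found.
    let latest = list_generations().into_iter().max()?;
    get_object(&format!("index_part.json-{latest:08x}"))
}
```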

Assume that there was a long split-brain, i.e. the datacenter in which the old PS is located is unavailable. We start a new PS with a new generation, but after some time it crashes or becomes unavailable. At the same time, the connection to the first datacenter is recovered and the first PS is now accessible. But it can't continue working: in any case we need to shut it down and start a new PS with a new generation. Maybe such protection from "resurrection" is really good, but ...
It is not clear from the RFC how the SK will be notified about a new generation of the PS. Should it poll the control plane? Or will it receive notifications from the control plane? Should the generation be included in every message sent from the PS to the SK? At least this requires protocol changes.

Yeah, this was something I had left kind of vague. I've changed this so that the safekeeper's behaviors are only optimizations, and the pageserver itself is responsible for not publishing an updated remote_consistent_lsn until it validates its generation number, similar to what was already proposed for object deletions.

Then, since the safekeeper part is just an optional optimization, we can lazily propagate generation numbers as part of PageserverFeedback and stash them in SafekeeperTimelineInfo on the broker.
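
As a purely illustrative sketch (type and field names here are hypothetical, not the actual PageserverFeedback or broker schema), the propagation is just piggybacking the generation on the existing feedback and caching it on the broker:

```rust
// Hypothetical stand-in for a feedback message from pageserver to safekeeper.
struct FeedbackMessage {
    remote_consistent_lsn: u64,
    // The pageserver's current attachment generation, piggybacked on feedback
    // so the broker (and thus other readers) can learn it lazily.
    generation: u32,
}

// Hypothetical stand-in for the per-timeline state published on the broker.
struct BrokerTimelineInfo {
    // Stashed from the most recent feedback; used only as an optimization,
    // never as the source of truth for correctness.
    last_seen_generation: Option<u32>,
}

fn on_feedback(info: &mut BrokerTimelineInfo, msg: &FeedbackMessage) {
    let _ = msg.remote_consistent_lsn; // other fields handled elsewhere
    info.last_seen_generation = Some(msg.generation);
}
```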

How will the old PS be forced to terminate? One answer proposed in the RFC is that it will get an error from the SK when the SK detects that the PS belongs to an older generation. But what if there are no changes and the SK is not receiving WAL? Will the old PS continue to serve tenants forever?

The old PS will remain available until it encounters some operation it can't do because of its generation (like trying to delete objects, or trying to update its remote_consistent_lsns). If it had no traffic at all it might never do these things, and would stay running until it was terminated externally (e.g. when an EC2 termination completed).

Endpoints won't be talking to the old PS because the control plane will have informed them of the new PS. During the transition, it is still safe for endpoints to talk to the old PS, because its data doesn't diverge from the new PS: they're both consuming the same WAL.

The purpose of this RFC is to make such scenarios data-safe: actually killing the node can take seconds, minutes or hours, but it won't do any harm. But usually it would just
be seconds.

Pending deletion assumes that there is some way to determine that the old server is dead. If so, why can't this check be used before we launch a new PS? Certainly, in the case of a long network partition, it may take some time. But most likely all the other components (computes, SK, ...) are also hosted in the same datacenter, so they will not be available either.

The deletion changes don't assume that we can check the old node is dead: they just require the new node to verify that its generation ID is current.
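
In sketch form (the names here are illustrative, not the RFC's API), the rule is simply to validate the generation with the control plane before executing any queued deletions:

```rust
// Illustrative sketch of the deletion rule described above.
struct PendingDeletions {
    // Generation the deleting pageserver held when it queued these objects.
    generation: u32,
    objects: Vec<String>,
}

fn execute_deletions(
    pending: PendingDeletions,
    // Hypothetical callback: asks the control plane whether this generation
    // is still the latest one issued for the tenant.
    generation_is_current: impl Fn(u32) -> bool,
    delete_object: impl Fn(&str),
) {
    if !generation_is_current(pending.generation) {
        // A newer attachment exists: its index may still reference these
        // objects, so we must not delete them. Drop the list instead.
        return;
    }
    for key in &pending.objects {
        delete_object(key);
    }
}
```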

Contributor

@kelvich kelvich left a comment

Thank you for writing this up!

After a relatively careful reading of this RFC I haven't spotted any immediate problems or safety violations. The proposed changes aren't deeply invasive and it looks like they could be done in a reasonable time frame. So +1 from my side on proceeding with this approach.

One item that I'm interested in discussing is #4919 (comment). It looks like we can reduce migration latency and simplify things by dropping the reliance on "no stolen attachments without increasing node generation" and checking attachment generations only on demand, when a tenant wants to delete some data.

Contributor

@problame problame left a comment

Re-read top-to-bottom up to but not including "Timeline/Branch creation".

I really like where this is moving, congrats!

This review contains a lot of suggestions to align on use of wording.

Hope you're aware of the "Add suggestion to batch" feature :)


Will read the remainder tomorrow.

problame previously approved these changes Aug 25, 2023
@problame problame dismissed their stale review August 25, 2023 08:19

mis-clicked

@problame
Contributor

Damn, accidentally pushed to this branch, force-pushing to remove the commits.

(v2 since [I accidentally merged the first version](#4919 (comment)) of this PR)

In a PR stacked onto [your RFC](#4919) so that there aren't as many conversations :)

Please leave the commits separate (merge by merge commit / rebase & merge)

deletion queue: radical simplification

1. Change the flow to just assume a persistent queue.
2. Skip the detail of DeletionList objects and re-writing
   them; instead, orient the whole flow around the entries.
3. Move the trick where we detect generation-1 to a "future
   optimization" section.
4. Remove the on-disk format spec. I know I asked for it,
   but I think it's best to have it in the deletion queue PR,
   not here.
5. Remove all the between-the-lines hints that this
   deletion queue is going to have low filesystem perf needs.
   Nobody cares; we could probably append & fsync for each
   entry and the NVMes would bear the load just fine (see the
   sketch below).

With the version of the deletion queue that is written here, I am
confident that it's a correct optimization.
And I find it substantially easier to follow, since the unimportant
details of how we persist it most efficiently are gone.
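
As a minimal sketch of that point (the path and entry format are illustrative only), a persistent queue can be as simple as an append-only file with a per-entry fsync:

```rust
use std::fs::OpenOptions;
use std::io::Write;

// Naive persistent deletion queue: append one line per queued object deletion
// and fsync before acknowledging. Even this per-entry fsync is expected to be
// cheap enough on NVMe, per the commit message above.
fn enqueue_deletion(path: &str, generation: u32, object_key: &str) -> std::io::Result<()> {
    let mut f = OpenOptions::new().create(true).append(true).open(path)?;
    writeln!(f, "{generation:08x} {object_key}")?;
    // Durability before acknowledging the deletion to the caller.
    f.sync_all()
}
```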
Contributor

@problame problame left a comment

I think this has come out great!

  • I don't see any correctness issues.
  • I don't think we're painting ourselves into a corner.
  • I think the recent (last few days) additions make this RFC quite easily digestible:
    • simplifications to a single generation number type
    • concept of re-attach
    • editorial changes to how the deletion queue is explained in this RFC 🤓

Thanks for enduring my countless review iterations. Truly great and impressive work! :)

@jcsp jcsp merged commit 382473d into main Aug 30, 2023
32 checks passed
@jcsp jcsp deleted the jcsp/rfc-generation-numbers branch August 30, 2023 08:49
jcsp added a commit that referenced this pull request Sep 28, 2023
## Problem

Currently we don't have a way to migrate tenants from one pageserver to another without risking a gap in availability.

## Summary of changes

This follows on from #4919

Migrating tenants between pageservers is essential to operating a service at scale, in several contexts:

1. Responding to a pageserver node failure by migrating tenants to other pageservers
2. Balancing load and capacity across pageservers, for example when a user expands their database and they need to migrate to a pageserver with more capacity.
3. Restarting pageservers for upgrades and maintenance

Currently, a tenant may be migrated by attaching it to a new node, re-configuring endpoints to use the new node, and then later detaching it from the old node. This is safe once [generation numbers](025-generation-numbers.md) are implemented, but it does not meet our seamless/fast/efficient goals:

Co-authored-by: Christian Schwarz <christian@neon.tech>