
docs: add RFC for remote storage generation numbers #4919

Merged
merged 75 commits into main from jcsp/rfc-generation-numbers
Aug 30, 2023

Conversation

jcsp
Contributor

@jcsp jcsp commented Aug 7, 2023

Summary

A scheme of logical "generation numbers" for pageservers and their attachments is proposed, along with
changes to the remote storage format to include these generation numbers in S3 keys.

Using the control plane as the issuer of these generation numbers enables strong anti-split-brain
properties in the pageserver cluster without implementing a consensus mechanism directly
in the pageservers.
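
To illustrate the idea, here is a minimal sketch (the key layout below is hypothetical; the RFC defines the actual format): objects written under different generations get distinct keys, so a stale writer cannot overwrite a newer attachment's objects.

```rust
// Illustrative sketch only: remote object keys carry the attachment's
// generation, so writers from different generations never overwrite
// each other's objects. The real key scheme is specified in the RFC.
fn remote_index_key(tenant_id: &str, timeline_id: &str, generation: u32) -> String {
    // Hypothetical layout for illustration.
    format!("tenants/{tenant_id}/timelines/{timeline_id}/index_part.json-{generation:08x}")
}

fn main() {
    // Two pageservers attached in different generations write distinct keys,
    // so a stale writer cannot clobber the newer node's index.
    println!("{}", remote_index_key("tenant-a", "tl-1", 6));
    println!("{}", remote_index_key("tenant-a", "tl-1", 7));
}
```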

Motivation

Currently, the pageserver's remote storage format does not provide a mechanism for addressing
split brain conditions that may happen when replacing a node during failover or when migrating
a tenant from one pageserver to another. From a remote storage perspective, a split brain condition
occurs whenever two nodes both think they have the same tenant attached, and both can write to S3. This
can happen in the case of a network partition, pathologically long delays (e.g. suspended VM), or software
bugs.

This blocks robust implementation of failover from unresponsive pageservers, due to the risk that
the unresponsive pageserver is still writing to S3.

@jcsp jcsp added the c/storage/pageserver (Component: storage: pageserver) and t/tech_design_rfc (Issue type: tech design RFC) labels Aug 7, 2023
@jcsp jcsp requested review from kelvich and problame August 7, 2023 15:46
@github-actions

github-actions bot commented Aug 7, 2023

1624 tests run: 1548 passed, 0 failed, 76 skipped (full report)


The comment gets automatically updated with the latest test results
cf564f3 at 2023-08-30T08:18:32.984Z :recycle:

Contributor

@LizardWizzard LizardWizzard left a comment

Thanks for writing this up!

I think with some tweaking this can fly.

IMO, for easier reading it's a good idea to add more references and Mermaid diagrams outlining the key operations involved.

Additionally, I think this lacks a comparison with previously discussed approaches. What are the downsides of this one compared to the others?

Also, a discussion started in Slack and @knizhnik made some points, so I'd reference them here: https://neondb.slack.com/archives/C05EPKK8K6G/p1691478137757059?thread_ts=1691423240.177529&cid=C05EPKK8K6G

@knizhnik
Contributor

knizhnik commented Aug 8, 2023

IMHO this proposal requires quite a lot of changes in all Neon components: PS, SK, control plane, ...
It introduces a new file name format. There are a lot of questions and concerns (see my and @LizardWizzard's comments).

At the same time, data corruption is not such a frequent event (unlike crashes, split brain, ... and many other causes of temporary PS unavailability). During the last two years it has happened only once. Certainly, once we have more servers there will be more data corruption incidents, but they should still be quite rare.

Also, please take into account that spawning a new PS is a very expensive operation, because we lose all the storage contents of the old PS and have to reload them from S3.

So maybe we should try to find some simpler solution to the problem?
The classical approach is leasing: the PS is granted the leader role by the control plane for some period of time. After this period expires, it has to request a new lease from the control plane. This timeout can be quite big, for example 10 seconds or maybe even a minute. The idea is that we should not spawn a new PS automatically without human approval, and it should be done only if we are completely sure that there is no chance of recovering the original PS. So such a decision in any case cannot (and should not) be taken immediately.

So if the PS seems to be dead, we send an alert but do nothing until the lease timeout expires. After the timeout expires, the control plane can either automatically spawn a new PS or wait until it is done explicitly by a DBA. In any case, at that moment the lease of the old PS has already expired, and it would have to send a new lease request to the control plane (and would get a negative response).

IMHO such an approach requires fewer changes in Neon and is safer, because it excludes the situation of two or more concurrently running PSs. Certainly, we should take into account clock measurement error and the fact that a PS cannot be stopped immediately; for example, the control plane should wait not for the lease timeout but for timeout*2. But that is a technical detail.
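
A minimal sketch of this leasing idea, assuming a hypothetical `Lease` type and timings; this is the alternative suggested in this comment, not what the RFC proposes:

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

// Hypothetical lease handed out by the control plane: the pageserver may only
// act on a tenant while the lease is still valid, and must renew it before expiry.
struct Lease {
    granted_at: Instant,
    duration: Duration,
}

impl Lease {
    fn is_valid(&self) -> bool {
        self.granted_at.elapsed() < self.duration
    }
}

fn main() {
    let lease = Lease { granted_at: Instant::now(), duration: Duration::from_secs(10) };
    assert!(lease.is_valid());
    sleep(Duration::from_millis(50));
    // Before any S3 write or deletion, the pageserver would re-check (and renew)
    // the lease; the control plane would only re-attach the tenant elsewhere
    // after roughly 2x the lease duration has elapsed without a renewal,
    // to allow for clock error and shutdown latency.
    println!("lease still valid: {}", lease.is_valid());
}
```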

@LizardWizzard
Contributor

I agree with @knizhnik that the required changes are quite invasive. There also needs to be a way to handle existing layers/index parts that lack the new suffixes.

After the timeout expires, the control plane can either automatically spawn a new PS or wait until it is done explicitly by a DBA

I think it can be a bit simpler: IMO we should always have some spare capacity to distribute tenants from a single failing pageserver, because the current process of spawning a pageserver takes a lot of time (tens of minutes at best).

Also see this thread in Slack where I've tried to measure the impact of a reattach operation on a running pgbench.

@jcsp
Contributor Author

jcsp commented Aug 10, 2023

There also needs to be a way to handle existing layers/index parts that lack the new suffixes.

Added a section "On-disk backward compatibility", along with various other aspects in the "Operational impact" section.

@jcsp
Contributor Author

jcsp commented Aug 10, 2023

Inlining some of the questions from the Slack thread here:

Locating the right version of index_part.json: it looks like the PS needs to poll S3 for all generations of this file, starting from the one assigned to this PS and decrementing until reaching 0 or until some previous generation is found. Certainly there are not expected to be many generations, but I still do not like such a polling solution.

I've updated the "pageserver::Tenant startup changes" section to make clear that the listing is just a fallback, and in normal operation the
pageserver will be able to directly GET the most recent index based on knowledge of what the last attachment's generation numbers were. We
can either pass that around in the control plane, or fetch it from the storage broker (in latest doc, the storage broker has generation
information propagated from safekeeper Pageserverfeedback messages.
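
Roughly, the startup lookup could look like the sketch below; `get_object` and `list_generations` are hypothetical stand-ins for the remote storage client, and the key format is illustrative:

```rust
// Hedged sketch of the startup lookup described above: try a direct GET at the
// last known generation, and only fall back to listing when that is missing.
fn load_index_part(
    known_prev_generation: Option<u32>,
    get_object: impl Fn(&str) -> Option<Vec<u8>>,
    list_generations: impl Fn() -> Vec<u32>,
) -> Option<Vec<u8>> {
    // Fast path: in normal operation we know the previous attachment's
    // generation and can GET its index directly.
    if let Some(generation) = known_prev_generation {
        if let Some(body) = get_object(&format!("index_part.json-{generation:08x}")) {
            return Some(body);
        }
    }
    // Fallback (rare path, not a routine poll): list the index objects and
    // load the one from the most recent generation found.
    let latest = list_generations().into_iter().max()?;
    get_object(&format!("index_part.json-{latest:08x}"))
}
```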

Assume that there was a long split-brain, i.e. the datacenter in which the old PS is located is unavailable. We start a new PS with a new generation, but after some time it crashes or becomes unavailable. At the same time, the connection to the first datacenter is recovered and the first PS is now accessible. But it can't continue working: in any case we need to shut it down and start a new PS with a new generation. Maybe such protection from "resurrection" is really good, but ...
It is not clear from the RFC how the SK will be notified about a new generation of the PS. Should it poll the control plane? Or will it receive notifications from the control plane? Should the generation be included in every message sent from the PS to the SK? At least this requires protocol changes.

Yeah, this was something I had left kind of vague. I've changed this so that the safekeeper's behaviors are only optimizations, and the pageserver itself is responsible for not publishing an updated remote_consistent_lsn until it validates its generation number, similar to what was already proposed for object deletions.

Then, since the safekeeper part is just an optional optimization, we can lazily propagate generation numbers as part of PageserverFeedback and stash them in SafekeeperTimelineInfo on the broker.
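
As a purely illustrative sketch (type and field names here are hypothetical, not the actual PageserverFeedback or broker schema), the propagation is just piggybacking the generation on the existing feedback and caching it on the broker:

```rust
// Hypothetical stand-in for a feedback message from pageserver to safekeeper.
struct FeedbackMessage {
    remote_consistent_lsn: u64,
    // The pageserver's current attachment generation, piggybacked on feedback
    // so the broker (and thus other readers) can learn it lazily.
    generation: u32,
}

// Hypothetical stand-in for the per-timeline state published on the broker.
struct BrokerTimelineInfo {
    // Stashed from the most recent feedback; used only as an optimization,
    // never as the source of truth for correctness.
    last_seen_generation: Option<u32>,
}

fn on_feedback(info: &mut BrokerTimelineInfo, msg: &FeedbackMessage) {
    let _ = msg.remote_consistent_lsn; // other fields handled elsewhere
    info.last_seen_generation = Some(msg.generation);
}
```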

How will the old PS be forced to terminate? One answer proposed in the RFC is that it will get an error from the SK when the SK detects that the PS belongs to an older generation. But what if there are no changes and the SK is not receiving WAL? Will the old PS continue to serve tenants forever?

The old PS will remain available until it encounters some operation it can't do because of its generation (like trying to delete objects, or trying to update its remote_consistent_lsns). If it had no traffic at all it might never do these things, and would stay running until it was terminated externally (e.g. when an EC2 termination completed).

Endpoints won't be talking to the old PS because the control plane will have informed them of the new PS. During the transition, it is still safe for endpoints to talk to the old PS, because its data doesn't diverge from the new PS: they're both consuming the same WAL.

The purpose of this RFC is to make such scenarios data-safe: actually killing the node can take seconds, minutes or hours, but it won't do any harm. But usually it would just
be seconds.

Pending deletion assumes that there is some way to determine that the old server is dead. If so, why can't this check be used before we launch a new PS? Certainly, in the case of a long network partition, it may take some time. But most likely all the other components (computes, SK, ...) are also hosted in the same datacenter, so they will not be available either.

The deletion changes don't assume that we can check the old node is dead: they just require the new node to verify that its generation ID is current.
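
In sketch form (the names here are illustrative, not the RFC's API), the rule is simply to validate the generation with the control plane before executing any queued deletions:

```rust
// Illustrative sketch of the deletion rule described above.
struct PendingDeletions {
    // Generation the deleting pageserver held when it queued these objects.
    generation: u32,
    objects: Vec<String>,
}

fn execute_deletions(
    pending: PendingDeletions,
    // Hypothetical callback: asks the control plane whether this generation
    // is still the latest one issued for the tenant.
    generation_is_current: impl Fn(u32) -> bool,
    delete_object: impl Fn(&str),
) {
    if !generation_is_current(pending.generation) {
        // A newer attachment exists: its index may still reference these
        // objects, so we must not delete them. Drop the list instead.
        return;
    }
    for key in &pending.objects {
        delete_object(key);
    }
}
```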

Contributor

@kelvich kelvich left a comment

Thank you for writing this up!

After a relatively careful reading of this RFC I haven't spotted any immediate problems or safety violations. The proposed changes aren't deeply invasive and it looks like they could be done in a reasonable time frame. So +1 from my side on proceeding with this approach.

One item that I'm interested in discussing is #4919 (comment). It looks like we can reduce migration latency and simplify things by dropping the reliance on "no stolen attachments without increasing node generation" and checking attachment generations only on demand, when a tenant wants to delete some data.

Contributor

@problame problame left a comment

Re-read top-to-bottom up to but not including "Timeline/Branch creation".

I really like where this is moving, congrats!

This review contains a lot of suggestions to align on use of wording.

Hope you're aware of the "Add suggestion to batch" feature :)


Will read the remainder tomorrow.

problame previously approved these changes Aug 25, 2023
@problame problame dismissed their stale review August 25, 2023 08:19

mis-clicked

@problame
Contributor

Damn, accidentally pushed to this branch, force-pushing to remove the commits.

(v2 since [I accidentally merged the first version](#4919 (comment)) of this PR)

In a PR stacked onto [your RFC](#4919) so that there aren't as many conversations :)

Please leave the commits separate (merge by merge commit / rebase & merge)

deletion queue: radical simplification

1. Change the flow to just assume a persistent queue.
2. Skip the detail of DeletionList objects and re-writing
   them; instead, orient the whole flow around the entries.
3. Move the trick where we detect generation-1 to a "future
   optimization" section.
4. Remove the on-disk format spec. I know I asked for it,
   but I think it's best to have it in the deletion queue PR,
   not here.
5. Remove all the between-the-lines hints that this
   deletion queue is going to have low filesystem perf needs.
   Nobody cares; we could probably append & fsync for each
   entry and the NVMes would bear the load just fine (see the
   sketch below).

With the version of the deletion queue that is written here, I am
confident that it's a correct optimization.
And I find it substantially easier to follow, since the unimportant
details of how we persist it most efficiently are gone.
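
As a minimal sketch of that point (the path and entry format are illustrative only), a persistent queue can be as simple as an append-only file with a per-entry fsync:

```rust
use std::fs::OpenOptions;
use std::io::Write;

// Naive persistent deletion queue: append one line per queued object deletion
// and fsync before acknowledging. Even this per-entry fsync is expected to be
// cheap enough on NVMe, per the commit message above.
fn enqueue_deletion(path: &str, generation: u32, object_key: &str) -> std::io::Result<()> {
    let mut f = OpenOptions::new().create(true).append(true).open(path)?;
    writeln!(f, "{generation:08x} {object_key}")?;
    // Durability before acknowledging the deletion to the caller.
    f.sync_all()
}
```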
Contributor

@problame problame left a comment

I think this has come out great!

  • I don't see any correctness issues.
  • I don't think we're painting ourselves into a corner.
  • I think the recent (last few days) additions make this RFC quite easily digestible:
    • simplifications to a single generation number type
    • concept of re-attach
    • editorial changes to how the deletion queue is explained in this RFC 🤓

Thanks for enduring my countless review iterations. Truly great and impressive work! :)

@jcsp jcsp merged commit 382473d into main Aug 30, 2023
32 checks passed
@jcsp jcsp deleted the jcsp/rfc-generation-numbers branch August 30, 2023 08:49
jcsp added a commit that referenced this pull request Sep 28, 2023
## Problem

Currently we don't have a way to migrate tenants from one pageserver to another without risking a gap in availability.

## Summary of changes

This follows on from #4919

Migrating tenants between pageservers is essential to operating a service at scale, in several contexts:

1. Responding to a pageserver node failure by migrating tenants to other pageservers
2. Balancing load and capacity across pageservers, for example when a user expands their database and they need to migrate to a pageserver with more capacity.
3. Restarting pageservers for upgrades and maintenance

Currently, a tenant may be migrated by attaching it to a new node, re-configuring endpoints to use the new node, and then later detaching it from the old node. This is safe once [generation numbers](025-generation-numbers.md) are implemented, but it does not meet our seamless/fast/efficient goals:

Co-authored-by: Christian Schwarz <christian@neon.tech>