
Propagate apply_lsn from SK to PS to prevent GC from collecting objects which may be still requested by replica #6357

Closed
wants to merge 33 commits

Conversation

knizhnik
Contributor

Problem

A lagging replica can request objects which are beyond the PiTR boundary and have therefore already been collected by GC.

#6211

Summary of changes

Take standby_flush_lsn into account when calculating gc_cutoff.
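
A minimal sketch of the intended behavior, with hypothetical names and a simplified Lsn type (the real logic lives in the pageserver GC path): the planned cutoff is clamped so it never advances past the flush LSN reported for an active standby.

use std::cmp::min;

// Simplified stand-in for the pageserver's Lsn type.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Lsn(u64);

impl Lsn {
    const INVALID: Lsn = Lsn(0);
}

// Hypothetical helper: clamp the GC cutoff to the standby flush LSN
// propagated from the safekeepers, if one is known.
fn effective_gc_cutoff(planned_cutoff: Lsn, standby_flush_lsn: Lsn) -> Lsn {
    if standby_flush_lsn == Lsn::INVALID {
        planned_cutoff // no active replica reported, GC proceeds as usual
    } else {
        min(planned_cutoff, standby_flush_lsn)
    }
}

fn main() {
    assert_eq!(effective_gc_cutoff(Lsn(1000), Lsn::INVALID), Lsn(1000));
    assert_eq!(effective_gc_cutoff(Lsn(1000), Lsn(700)), Lsn(700));
}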

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@knizhnik knizhnik requested review from a team as code owners January 15, 2024 13:24
@knizhnik knizhnik requested review from arssher, jcsp and a team and removed request for a team January 15, 2024 13:24

github-actions bot commented Jan 15, 2024

2406 tests run: 2290 passed, 0 failed, 116 skipped (full report)


Flaky tests (3)

Postgres 16

  • test_secondary_mode_eviction: debug
  • test_compute_pageserver_connection_stress: release

Postgres 14

  • test_replication_lag: debug

Code coverage (full report)

  • functions: 54.2% (11279 of 20812 functions)
  • lines: 81.2% (63446 of 78157 lines)

The comment gets automatically updated with the latest test results
32f4a56 at 2024-02-07T07:45:12.191Z :recycle:

@jcsp
Collaborator

jcsp commented Jan 16, 2024

High level comments:

  • It looks like hot standby instances can effectively prevent GC if they get stuck: should we be worried about that? Would it make sense to apply some LSN threshold such that if a hot standby is too far behind, we will GC anyway rather than letting it hold us back?
  • What happens if a timeline initially has a hot standby, and then later does not? Is there a mechanism to reset standby_flush_lsn to enable the pageserver to proceed with GC?
  • This PR is missing tests.

@@ -1358,6 +1360,7 @@ impl Timeline {

compaction_lock: tokio::sync::Mutex::default(),
gc_lock: tokio::sync::Mutex::default(),
standby_flush_lsn: AtomicLsn::new(0),
Collaborator


Using Lsn::INVALID would make it more obvious that this is the same as the default value in SkTimelineInfo.

Contributor Author


The AtomicLsn constructor accepts a u64 parameter, and AtomicLsn::new(Lsn::INVALID.0) looks really strange IMHO.

}
}
if reply.flush_lsn != Lsn::INVALID {
if reply_agg.flush_lsn != Lsn::INVALID {
Collaborator


Since INVALID is defined as zero, wouldn't it be equivalent to just always do an Lsn::min here, rather than checking for INVALID?

Contributor Author

@knizhnik knizhnik Jan 17, 2024


Unfortunately that doesn't work: since 0 is the minimum u64 value, flush_lsn would never be assigned a larger value without this check. An alternative is to use Lsn::MAX as the default, in which case we could avoid this check, but that may cause problems in other places which expect Lsn::INVALID rather than Lsn::MAX.

So I preserved the same logic as already used for PS feedback.
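
A minimal sketch of this aggregation rule, with simplified, hypothetical types: because Lsn::INVALID is 0, an unconditional min would pin the aggregate at 0, so replies whose flush_lsn was never set have to be skipped.

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Lsn(u64);

impl Lsn {
    const INVALID: Lsn = Lsn(0);
}

// Fold one standby reply into the aggregate, ignoring unset (INVALID == 0)
// values; a plain min would always collapse the aggregate to 0.
fn aggregate_flush_lsn(agg: Lsn, reply: Lsn) -> Lsn {
    match (agg != Lsn::INVALID, reply != Lsn::INVALID) {
        (true, true) => Lsn(agg.0.min(reply.0)),
        (false, true) => reply,
        _ => agg,
    }
}

fn main() {
    let replies = [Lsn::INVALID, Lsn(500), Lsn(300), Lsn::INVALID];
    let agg = replies.iter().copied().fold(Lsn::INVALID, aggregate_flush_lsn);
    assert_eq!(agg, Lsn(300)); // unset replies do not drag the minimum to 0
}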

@knizhnik
Contributor Author

High level comments:

  • It looks like hot standby instances can effectively prevent GC if they get stuck: should we be worried about that? Would it make sense to apply some LSN threshold such that if a hot standby is too far behind, we will GC anyway rather than letting it hold us back?

That was my original concern, and the reason I didn't want replicas to hold back GC at the PS.
The current status is the following: the SK collects LSNs only for active replicas (those attached to the SK).
If a replica is detached, it cannot hold back GC at the PS.

So the question is: can there be an active replica with a huge lag behind the master?
It can happen if the replica and the master have different configurations (manually chosen by the user or because of autoscaling)
and there is a huge workload on the master, so that the replica cannot catch up.

I do not see this as a realistic scenario.
But I think that adding a threshold for the maximum standby lag is a good idea.

  • What happens if a timeline initially has a hot standby, and then later does not? Is there a mechanism to reset standby_flush_lsn to enable the pageserver to proceed with GC?

As I wrote above, only active replicas are taken into account.

  • This PR is missing tests.

It is not so easy to somehow slow down the replica to cause replication lag. Certainly we can stop the replica, but then it will not be taken into account. And once it is restarted, it should catch up quite soon, unless there are a lot of changes it has to apply. That means the test would have to run long enough to accumulate this much WAL.

@jcsp
Collaborator

jcsp commented Jan 17, 2024

What happens if a timeline initially has a hot standby, and then later does not? Is there a mechanism to reset standby_flush_lsn to enable the pageserver to proceed with GC?
As I wrote above, only active replicas are taken into account.

Can you explain more? When a replica ceases to be active, how does the standby_flush_lsn in the pageserver's memory get reset?

This PR is missing tests.
It is not so easy to somehow slow down the replica to cause replication lag. Certainly we can stop the replica, but then it will not be taken into account. And once it is restarted, it should catch up quite soon, unless there are a lot of changes it has to apply. That means the test would have to run long enough to accumulate this much WAL.

You may need some test-specific way to pause a replica, like a SIGSTOP, so that it remains logically active but delays advancing the offset.

@knizhnik
Contributor Author

What happens if a timeline initially has a hot standby, and then later does not? Is there a mechanism to reset standby_flush_lsn to enable the pageserver to proceed with GC?
As I wrote above, only active replicas are taken into account.

Can you explain more? When a replica ceases to be active, how does the standby_flush_lsn in the pageserver's memory get reset?

So the workflow is the following:

  1. The replica sends a standby reply to the SK, containing the replica's write/flush/apply LSNs
  2. The SK aggregates min(standby_reply) LSNs
  3. The SK includes min(standby_flush_lsn) in the SK's timeline state, which is sent to the broker
  4. The broker pushes the timeline info to the PS
  5. The PS stores the received standby_flush_lsn in its timeline info
  6. GC compares new_gc_cutoff with the stored standby_flush_lsn

So if there are no active replicas, then Lsn::INVALID=0 is propagated to the PS, and the PS doesn't take it into account when calculating new_gc_cutoff.

This PR is missing tests.
It is not so easy to somehow slow down the replica to cause replication lag. Certainly we can stop the replica, but then it will not be taken into account. And once it is restarted, it should catch up quite soon, unless there are a lot of changes it has to apply. That means the test would have to run long enough to accumulate this much WAL.

You may need some test-specific way to pause a replica, like a SIGSTOP, so that it remains logically active but delays advancing the offset.

I will think more about it. The problem is that the code receiving and replaying WAL on the replica is Postgres core code, and I do not want to change it just to make it possible to create a test. In principle we could add such "throttling" in the WAL page filtering (done in our extension), but adding an extra GUC and throttling code just for testing is IMHO overkill.

@knizhnik knizhnik force-pushed the propagate_reply_flush_lsn_from_sk_to_ps branch from e7ea8bb to 4242f79 Compare January 19, 2024 08:05
@knizhnik
Contributor Author

The test I added: without this PR it reproduces the "tried to request a page version that was garbage collected" error.

secondary_lsn = secondary.safe_psql_scalar(
"SELECT pg_last_wal_replay_lsn()", log_query=False
)
balance = secondary.safe_psql_scalar("select sum(abalance) from pgbench_accounts")
Collaborator


There is no assertion in this test: it looks like it will pass as long as pgbench runs. I guess this is where you need to add something that makes the replica slow, and then some check that GC doesn't advance.

Contributor Author


An assertion is not needed: if the requested page cannot be reconstructed because the requested LSN is beyond the GC cutoff, the query is terminated with an error and the test fails without any assertions.
This happens without this PR, and it still happens with versions 14/15.
I am investigating it now.

Collaborator


Okay - can you please add a comment to the test that explains what it is testing, and how it would fail if something is wrong.

@knizhnik knizhnik force-pushed the propagate_reply_flush_lsn_from_sk_to_ps branch from 64f23d5 to 8de53a6 Compare January 22, 2024 13:05
@knizhnik knizhnik requested a review from a team as a code owner January 22, 2024 13:05
@knizhnik knizhnik requested review from save-buffer and removed request for a team January 22, 2024 13:05
@knizhnik knizhnik force-pushed the propagate_reply_flush_lsn_from_sk_to_ps branch from 171793a to 9bee043 Compare January 23, 2024 14:27
@knizhnik knizhnik changed the title Propagate repply_flush_lsn from SK to PS to prevent GC from collecting objects which may be still requested by replica Propagate apply_lsn from SK to PS to prevent GC from collecting objects which may be still requested by replica Jan 23, 2024
@knizhnik knizhnik force-pushed the propagate_reply_flush_lsn_from_sk_to_ps branch from 1568c14 to 16e196f Compare January 23, 2024 17:31
Contributor

@save-buffer save-buffer left a comment


I gave it a look-through, don't have any issues. Seems that we're just piping the horizon through from compute to pageserver, and that's the majority of the change.

@knizhnik knizhnik force-pushed the propagate_reply_flush_lsn_from_sk_to_ps branch from 6007fad to f6c6808 Compare January 25, 2024 06:57
@arssher
Contributor

arssher commented Jan 29, 2024

High level notes:

  • <lsn, latest> -> <lsn, horizon> switch generally looks good and very important for standbys, but it is a breaking protocol change. We either need to hack around this or restart all computes during release (not appealing).
  • CombineHotStanbyFeedbacks fix looks good.
  • Aggregation of StandbyReply at safekeepers itself is also ok.
  • However, I don't like the standby_horizon machinery. It will flip-flop at the PS because only a safekeeper with a connected standby will send a non-zero value to the broker; I guess that to do this reasonably we would need to augment the LSN with a timestamp, so that the pageserver would respect only a reasonably recent horizon hold-off request. But more importantly, on second look I strongly doubt this mechanism is needed at all. While we have indeed seen "tried to request a page version that was garbage collected" errors in practice, these were likely due to several bugs we had in the past when the standby got stuck, plus the improper 'horizon' handling fixed here. Our default gc horizon is 1 week AFAIR; a replica lagging more than that is clearly not sane and needs investigation.

Mostly note to myself: haven't checked yet if hs feedback is processed at compute at all, last time I checked it wasn't.

@knizhnik
Contributor Author

High level notes:

  • <lsn, latest> -> <lsn, horizon> switch generally looks good and very important for standbys, but it is a breaking protocol change. We either need to hack around this or restart all computes during release (not appealing).

I agree. I didn't think much about upgrade issues.
Maybe we should just extend the protocol with a new message (something like get_page_in_lsn_range) so that the PS can accept both new and old requests.

  • However, I don't like the standby_horizon machinery. It will flip-flop at the PS because only a safekeeper with a connected standby will send a non-zero value to the broker;

It seems to be not a bug but a feature. We do not want offline replicas to somehow suspend GC. If a replica is restarted, then it restarts at the most recent LSN.

I guess that to do this reasonably we would need to augment the LSN with a timestamp, so that the pageserver would respect only a reasonably recent horizon hold-off request.

I do not understand why we need a timestamp here.
What is most critical for us is the LSN delta. Assume we have a replica with lag=N. It means that we do not allow GC at the PS to reclaim up to N bytes of data (maybe with some multiplier). Is that acceptable or not? It depends on N. If N is small enough (e.g. < 1Gb), then it doesn't matter whether the time lag is a minute, an hour or a day: we can keep an extra Gb in PS storage. But if N=1Tb, then it is definitely not acceptable, and once again the time lag is not important.

There is currently a hardcoded constant:

MAX_STANDBY_LAG: u64 = 1024 * 1024 * 1024; // 1Gb

which limits reported replica lag:

                if reply.apply_lsn != Lsn::INVALID
                    && self.agg_ps_feedback.last_received_lsn < reply.apply_lsn + MAX_STANDBY_LAG
                {
                    if reply_agg.apply_lsn != Lsn::INVALID {
                        reply_agg.apply_lsn = Lsn::min(reply_agg.apply_lsn, reply.apply_lsn);
                    } else {
                        reply_agg.apply_lsn = reply.apply_lsn;
                    }
                }

Maybe it would be better to replace it with a PS parameter, but I do not understand why we need anything more than such a check.

But more importantly, on second look I strongly doubt this mechanism is needed at all. While we have indeed seen "tried to request a page version that was garbage collected" errors in practice, these were likely due to several bugs we had in the past when the standby got stuck, plus the improper 'horizon' handling fixed here. Our default gc horizon is 1 week AFAIR; a replica lagging more than that is clearly not sane and needs investigation.

I also do not understand what could cause such a large replication lag. Most likely it means some error, which hopefully is already fixed.

But on the other hand, this standby_horizon mechanism actually costs nothing: it doesn't introduce any extra overhead. The only drawback is that it can suspend GC, but only in case of a large replication lag, which, as you mentioned, should not normally happen. And if it still manages to happen, it is better to occupy more space on disk than to abort queries by reporting an error.

Mostly note to myself: haven't checked yet if hs feedback is processed at compute at all, last time I checked it wasn't.

I have checked it: it is processed. Please note that hot_standby_feedback needs to be set.

@@ -62,7 +62,7 @@ typedef struct
typedef struct
{
NeonMessageTag tag;
bool latest; /* if true, request latest page version */
XLogRecPtr horizon; /* upper boundary for page LSN */
Contributor


Saying something here about the 0 special case would be good, or adding a reference to neon_get_horizon.

Contributor Author


I have removed the comment because I no longer consider 0 a special case.
get_page now specifies an LSN interval, and 0 is a valid value for the interval's lower boundary.
It is not treated specially at the PS.

}
let last_record_lsn = timeline.get_last_record_lsn();
let request_horizon = if horizon == Lsn::INVALID {
lsn
Contributor


This removes the comment about the special lsn == 0 case. While we don't connect the pageserver directly to compute anymore, I believe this hack is still valid when we get the initial basebackup, so it's worth keeping the comment, adapting it.

Contributor Author


See above

// anyway)
}
let last_record_lsn = timeline.get_last_record_lsn();
let request_horizon = if horizon == Lsn::INVALID {
Contributor


This is quite unobvious; can't we do that on the compute side (set lsn == request_lsn if the latter is 0)?

Contributor Author


This was done specially for convenience in tests: there are a lot of tests which originally used latest=false.
Passing a copy of lsn as horizon in all these tests seems very inconvenient,
so I prefer to add a comment here rather than requiring a copy of the LSN to be passed.
And it seems quite logical:

  • Lsn::MAX stands for the latest version
  • any valid LSN specifies an upper boundary for the LSN
  • Lsn::INVALID (0) moves the upper boundary down to the lower boundary = lsn

@knizhnik
Contributor Author

knizhnik commented Jan 30, 2024

I have made the following changes after discussion with @arssher:

  1. It is not possible to distinguish a reply with a zero horizon received from an SK which lost its connection with the replica from a reply from an SK which was never connected to any replica. @arssher suggested rejecting an invalid (0) standby horizon at the PS and using a timeout to expire its value, so that if nothing refreshes standby_horizon with a non-zero value, it no longer affects GC. I implemented this scheme, but then decided that yet another hardcoded timeout is not so good. This is why I just reset standby_horizon at each GC iteration, so the timeout is effectively equal to gc_period (see the sketch after this list).
  2. Since the PS should keep working with old clients that send the latest flag instead of horizon in the get_page request (sorry for the name collision, but do not confuse this horizon with the standby horizon), and we do not support protocol versions, I added a new command, GetLatestPage, with the old tag (2), and GetPage is given a new tag (4). So now the PS can serve both old and new clients.
  3. Trying to address the problem with the lack of active XIDs at replica startup, I slightly changed the startup procedure for the replica: it is no longer assumed that the node was cleanly shut down. So, in theory, the Postgres hot standby state machine should take care of it.
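
A rough sketch of the reset-per-GC-iteration idea from item 1, using hypothetical field and method names rather than the actual Timeline code: GC reads the current standby_horizon, applies it to the cutoff, and then zeroes it, so a horizon that is never refreshed stops holding GC back after one gc_period.

use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical, heavily simplified stand-in for the pageserver Timeline.
struct Timeline {
    // Standby horizon pushed from the safekeepers via the broker;
    // 0 means "no active replica reported".
    standby_horizon: AtomicU64,
}

impl Timeline {
    // One GC iteration: honor the horizon if it was refreshed since the last
    // iteration, then reset it so a stale value cannot hold GC back forever.
    fn gc_cutoff(&self, planned_cutoff: u64) -> u64 {
        let horizon = self.standby_horizon.swap(0, Ordering::Relaxed);
        if horizon != 0 {
            planned_cutoff.min(horizon)
        } else {
            planned_cutoff
        }
    }
}

fn main() {
    let tl = Timeline { standby_horizon: AtomicU64::new(700) };
    assert_eq!(tl.gc_cutoff(1000), 700);  // horizon holds GC back once
    assert_eq!(tl.gc_cutoff(1000), 1000); // not refreshed: GC proceeds
}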

@hlinnaka
Contributor

hlinnaka commented Feb 8, 2024

I don't understand why all the compute <-> pageserver protocol changes are needed. Can you explain? (I remember we talked about that in a 1:1 call last week, but I cannot remember it now, and it would be good to have written down in comments anyway)

@knizhnik
Contributor Author

knizhnik commented Feb 9, 2024

I don't understand why all the compute <-> pageserver protocol changes are needed. Can you explain? (I remember we talked about that in a 1:1 call last week, but I cannot remember it now, and it would be good to have written down in comments anyway)

I briefly explained it in the comment in pagestore_smgr.c:

/*
 * There are three kinds of get_page requests:
 * 1. Master compute: get the latest page, not older than the specified LSN (horizon=Lsn::MAX)
 * 2. RO replica: get the latest page, not newer than the WAL position the replica has already applied (horizon=GetXLogReplayRecPtr(NULL))
 * 3. Snapshot: get the latest page, not newer than the specified LSN (horizon=request_lsn)
 */

The problem itself is explained in more detail in Slack:
https://neondb.slack.com/archives/C036U0GRMRB/p1705932864991419

So I assume that two cases are clear:

  1. Normal RW node: it always requests the latest version of the page, because nobody except it can update pages.
  2. Static RO node (snapshot): it always requests a version of the page not greater than the snapshot LSN.

In both cases we do not need a range, just the "latest" flag.
The most difficult case is a hot-standby replica. It receives and applies WAL concurrently with the PS. The replica can be ahead of the PS or vice versa. This is why the replica cannot request the latest version of the page: if the PS is ahead of the replica, it would get a "future" version of the page. But we also cannot request a version of the page with an LSN not greater than some specified value (as in the static replica case), because we use the "last written LSN" cache to estimate when the page was last updated. If the page was not updated for a long time, then requesting the page with this LSN may cause the PS to retrieve a version of the page which was already collected by GC.

Originally I thought that the problem was related only to GC, and that if GC is disabled then there is no problem with hot-standby replicas. But it looks like that is not true:
#6674
In that case the problem is caused by accessing the FSM when a heap page is updated: if, to fetch the FSM page, we use the last written LSN of the MAIN fork page, then we can get an error at the PS.

So to handle case 3 we need to pass a range of LSNs: the lower boundary specifies the estimated LSN of the last page update (so that we do not make the PS wait for the most recent updates to be applied), and the upper boundary specifies the current apply position of the replica, to prevent the PS from sending too-young pages.

With such a range, determining the start LSN position for the lookup in the layer map becomes even simpler than now with the "latest" flag: it is just max(lower_boundary, min(last_record_lsn, upper_boundary)).
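
For illustration, a tiny sketch of that clamp with a hypothetical helper and plain u64 LSNs:

// Pick the LSN used for the layer-map lookup from the requested
// [lower_boundary, upper_boundary] range and the pageserver's last_record_lsn.
fn lookup_lsn(lower_boundary: u64, upper_boundary: u64, last_record_lsn: u64) -> u64 {
    lower_boundary.max(last_record_lsn.min(upper_boundary))
}

fn main() {
    // Replica applied up to 900, page last updated around 500, PS ingested 1200:
    // read the page as of LSN 900, never a "future" version.
    assert_eq!(lookup_lsn(500, 900, 1200), 900);
    // PS has only ingested up to 800: read at 800, still within the range.
    assert_eq!(lookup_lsn(500, 900, 800), 800);
    // Primary compute: upper boundary is effectively Lsn::MAX, so read the latest.
    assert_eq!(lookup_lsn(500, u64::MAX, 1200), 1200);
}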

@hlinnaka
Contributor

hlinnaka commented Feb 9, 2024

Ok, so the protocol changes are completely unrelated to the topic of this PR, propagating apply_lsn from SK to PS. Please split that off to a separate PR.

@knizhnik
Contributor Author

Ok, so the protocol changes are completely unrelated to the topic of this PR, propagating apply_lsn from SK to PS. Please split that off to a separate PR.

#6357

@knizhnik knizhnik mentioned this pull request Feb 15, 2024
5 tasks
@brendan-stephens

@knizhnik, et al, thanks for your work on this issue.
I can see this has split off a bit into various subcomponents.
Do all of the items here need to be in place for the primary "request a page version that was garbage collected" issue? Or has that been addressed and the rest are improvements?
I have a few customers who have been persisting their replicas up to try and avoid this issue.

@kelvich
Contributor

kelvich commented Mar 18, 2024

I see some changes in protocol. While deploying we will have older compute images talking to newer pageservers and safekeepers. Would that be a problem?

@petuhovskiy can you please review this one? (Ideas on how to split it up into a safer series of patches and how to increase test coverage are welcome.)

@kelvich
Contributor

kelvich commented Mar 18, 2024

Also, #6718 contains some of the protocol changes as well.

@skyzh, when talking about the last compute image rollout you mentioned explicit protocol versioning for the compute <> pageserver protocol, to avoid assuming that each release has some breaking (non-backward-compatible) change. Should we add a protocol version right here? Then in a follow-up PR we can add some tests to check for API breakage and relax the compute image selection rules in the control plane.

@knizhnik
Contributor Author

The protocol changes (sending an LSN range) were extracted from this ticket into #6718.
Propagation of the LSN doesn't require any protocol changes.
My plan is the following: first merge #6718, and then I will rebase this ticket so that it contains only the changes needed to correctly propagate the LSN from the replica to the PS.

@kelvich
Contributor

kelvich commented Mar 18, 2024

OK, got it. Then @petuhovskiy or @arssher, can you please take a look at #6718 instead?

@petuhovskiy
Member

Took a quick look at #6718; it's about the compute<->PS protocol change and I don't have expertise on that. But I'll take a look at this PR after it gets rebased.

@petuhovskiy petuhovskiy self-requested a review March 18, 2024 12:12
@skyzh
Member

skyzh commented Mar 18, 2024

Also note that current prod runs a compute node release from 3 weeks ago. Better to wait for it to catch up with our latest release before adding more things to the compute, so that it's easier to roll back in case something goes wrong.

@andreasscherbaum
Contributor

current prod runs a release of compute node from 3 weeks ago

We definitely need to unblock/unpin Compute first and see if it's stable. Only then can we merge this in and release it.

@knizhnik After moving the protocol changes into #6718, are there any other breaking changes in this PR which require a Compute restart?

This PR touches 30+ code files. Is it possible to break it down into smaller patches which can be rolled out in 2 or 3 steps, each building on the previous one?

@knizhnik
Contributor Author

Replaced with #7368

@knizhnik knizhnik closed this Apr 12, 2024
knizhnik pushed a commit that referenced this pull request May 14, 2024
…ts which may be still requested by replica

refer #6211 #6357
arssher pushed a commit that referenced this pull request May 20, 2024
…ts which may be still requested by replica

refer #6211 #6357
10 participants