
Propagate apply_lsn from SK to PS to prevent GC from collecting objects which may be still requested by replica #6357

Closed
wants to merge 33 commits

Conversation

knizhnik
Contributor

Problem

A lagging replica can request objects which are beyond the PiTR boundary and have therefore already been collected by GC.

#6211

Summary of changes

Take standby_flush_lsn into account when calculating gc_cutoff.
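
A minimal sketch of the intended behavior, with hypothetical names and a simplified Lsn type (the real logic lives in the pageserver GC path): the planned cutoff is clamped so it never advances past the flush LSN reported for an active standby.

use std::cmp::min;

// Simplified stand-in for the pageserver's Lsn type.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Lsn(u64);

impl Lsn {
    const INVALID: Lsn = Lsn(0);
}

// Hypothetical helper: clamp the GC cutoff to the standby flush LSN
// propagated from the safekeepers, if one is known.
fn effective_gc_cutoff(planned_cutoff: Lsn, standby_flush_lsn: Lsn) -> Lsn {
    if standby_flush_lsn == Lsn::INVALID {
        planned_cutoff // no active replica reported, GC proceeds as usual
    } else {
        min(planned_cutoff, standby_flush_lsn)
    }
}

fn main() {
    assert_eq!(effective_gc_cutoff(Lsn(1000), Lsn::INVALID), Lsn(1000));
    assert_eq!(effective_gc_cutoff(Lsn(1000), Lsn(700)), Lsn(700));
}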

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@knizhnik knizhnik requested review from a team as code owners January 15, 2024 13:24
@knizhnik knizhnik requested review from arssher, jcsp and a team and removed request for a team January 15, 2024 13:24

github-actions bot commented Jan 15, 2024

2406 tests run: 2290 passed, 0 failed, 116 skipped (full report)


Flaky tests (3)

Postgres 16

  • test_secondary_mode_eviction: debug
  • test_compute_pageserver_connection_stress: release

Postgres 14

  • test_replication_lag: debug

Code coverage (full report)

  • functions: 54.2% (11279 of 20812 functions)
  • lines: 81.2% (63446 of 78157 lines)

The comment gets automatically updated with the latest test results
32f4a56 at 2024-02-07T07:45:12.191Z :recycle:

@jcsp
Collaborator

jcsp commented Jan 16, 2024

High level comments:

  • It looks like hot standby instances can effectively prevent GC if they get stuck: should we be worried about that? Would it make sense to apply some LSN threshold such that if a hot standby is too far behind, we will GC anyway rather than letting it hold us back?
  • What happens if a timeline initially has a hot standby, and then later does not? Is there a mechanism to reset standby_flush_lsn to enable the pageserver to proceed with GC?
  • This PR is missing tests.

@@ -1358,6 +1360,7 @@ impl Timeline {

compaction_lock: tokio::sync::Mutex::default(),
gc_lock: tokio::sync::Mutex::default(),
standby_flush_lsn: AtomicLsn::new(0),
Collaborator


Using Lsn::INVALID would make it more obvious that this is the same as the default value in SkTimelineInfo.

Contributor Author


The AtomicLsn constructor accepts a u64 parameter, and AtomicLsn::new(Lsn::INVALID.0) looks really strange IMHO.

}
}
if reply.flush_lsn != Lsn::INVALID {
if reply_agg.flush_lsn != Lsn::INVALID {
Collaborator


Since INVALID is defined as zero, wouldn't it be equivalent to just always do an Lsn::min here, rather than checking for INVALID?

Contributor Author

@knizhnik knizhnik Jan 17, 2024


Unfortunately that doesn't work: since 0 is the minimum u64 value, flush_lsn would never be assigned a larger value without this check. An alternative is to use Lsn::MAX as the default, in which case we could avoid this check, but that may cause problems in other places which expect Lsn::INVALID rather than Lsn::MAX.

So I preserved the same logic as already used for PS feedback.
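
A minimal sketch of this aggregation rule, with simplified, hypothetical types: because Lsn::INVALID is 0, an unconditional min would pin the aggregate at 0, so replies whose flush_lsn was never set have to be skipped.

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Lsn(u64);

impl Lsn {
    const INVALID: Lsn = Lsn(0);
}

// Fold one standby reply into the aggregate, ignoring unset (INVALID == 0)
// values; a plain min would always collapse the aggregate to 0.
fn aggregate_flush_lsn(agg: Lsn, reply: Lsn) -> Lsn {
    match (agg != Lsn::INVALID, reply != Lsn::INVALID) {
        (true, true) => Lsn(agg.0.min(reply.0)),
        (false, true) => reply,
        _ => agg,
    }
}

fn main() {
    let replies = [Lsn::INVALID, Lsn(500), Lsn(300), Lsn::INVALID];
    let agg = replies.iter().copied().fold(Lsn::INVALID, aggregate_flush_lsn);
    assert_eq!(agg, Lsn(300)); // unset replies do not drag the minimum to 0
}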

@knizhnik
Contributor Author

High level comments:

  • It looks like hot standby instances can effectively prevent GC if they get stuck: should we be worried about that? Would it make sense to apply some LSN threshold such that if a hot standby is too far behind, we will GC anyway rather than letting it hold us back?

That was my original concern, and the reason I didn't want replicas to hold back GC at the PS.
The current status is the following: the SK collects LSNs only for active replicas (those attached to the SK).
If a replica is detached, it cannot hold back GC at the PS.

So the question is: can there be an active replica with a huge lag behind the master?
It can happen if the replica and the master have different configurations (manually chosen by the user or because of autoscaling)
and there is a huge workload on the master, so that the replica cannot catch up.

I do not see this as a realistic scenario.
But I think that adding a threshold for the maximum standby lag is a good idea.

  • What happens if a timeline initially has a hot standby, and then later does not? Is there a mechanism to reset standby_flush_lsn to enable the pageserver to proceed with GC?

As I wrote above, only active replicas are taken into account.

  • This PR is missing tests.

It is not so easy to somehow slow down the replica to cause replication lag. Certainly we can stop the replica, but then it will not be taken into account. And once it is restarted, it should catch up quite soon, unless there are a lot of changes it has to apply. That means the test would have to run long enough to accumulate this much WAL.

@jcsp
Collaborator

jcsp commented Jan 17, 2024

What happens if a timeline initially has a hot standby, and then later does not? Is there a mechanism to reset standby_flush_lsn to enable the pageserver to proceed with GC?
As I wrote above, only active replicas are taken into account.

Can you explain more? When a replica ceases to be active, how does the standby_flush_lsn in the pageserver's memory get reset?

This PR is missing tests.
It is not so easy to somehow slow down the replica to cause replication lag. Certainly we can stop the replica, but then it will not be taken into account. And once it is restarted, it should catch up quite soon, unless there are a lot of changes it has to apply. That means the test would have to run long enough to accumulate this much WAL.

You may need some test-specific way to pause a replica, like a SIGSTOP, so that it remains logically active but delays advancing the offset.

@knizhnik
Contributor Author

What happens if a timeline initially has a hot standby, and then later does not? Is there a mechanism to reset standby_flush_lsn to enable the pageserver to proceed with GC?
As I wrote above, only active replicas are taken into account.

Can you explain more? When a replica ceases to be active, how does the standby_flush_lsn in the pageserver's memory get reset?

So the workflow is the following:

  1. The replica sends a standby reply to the SK, containing the replica's write/flush/apply LSNs
  2. The SK aggregates min(standby_reply) LSNs
  3. The SK includes min(standby_flush_lsn) in the SK's timeline state, which is sent to the broker
  4. The broker pushes the timeline info to the PS
  5. The PS stores the received standby_flush_lsn in its timeline info
  6. GC compares new_gc_cutoff with the stored standby_flush_lsn

So if there are no active replicas, then Lsn::INVALID=0 is propagated to the PS, and the PS doesn't take it into account when calculating new_gc_cutoff.

This PR is missing tests.
It is not so easy to somehow slow down the replica to cause replication lag. Certainly we can stop the replica, but then it will not be taken into account. And once it is restarted, it should catch up quite soon, unless there are a lot of changes it has to apply. That means the test would have to run long enough to accumulate this much WAL.

You may need some test-specific way to pause a replica, like a SIGSTOP, so that it remains logically active but delays advancing the offset.

I will think more about it. The problem is that the code receiving and replaying WAL on the replica is Postgres core code, and I do not want to change it just to make it possible to create a test. In principle we could add such "throttling" in the WAL page filtering (done in our extension), but adding an extra GUC and throttling code just for testing is IMHO overkill.

@knizhnik knizhnik force-pushed the propagate_reply_flush_lsn_from_sk_to_ps branch from e7ea8bb to 4242f79 Compare January 19, 2024 08:05
@knizhnik
Contributor Author

The test I added: without this PR it reproduces the "tried to request a page version that was garbage collected" error.

secondary_lsn = secondary.safe_psql_scalar(
"SELECT pg_last_wal_replay_lsn()", log_query=False
)
balance = secondary.safe_psql_scalar("select sum(abalance) from pgbench_accounts")
Collaborator


There is no assertion in this test: it looks like it will pass as long as pgbench runs. I guess this is where you need to add something that makes the replica slow, and then some check that GC doesn't advance.

Contributor Author


An assertion is not needed: if the requested page cannot be reconstructed because the requested LSN is beyond the GC cutoff, the query is terminated with an error and the test fails without any assertions.
This happens without this PR, and it still happens with versions 14/15.
I am investigating it now.

Collaborator


Okay - can you please add a comment to the test that explains what it is testing, and how it would fail if something is wrong.

@knizhnik knizhnik force-pushed the propagate_reply_flush_lsn_from_sk_to_ps branch from 64f23d5 to 8de53a6 Compare January 22, 2024 13:05
@knizhnik knizhnik requested a review from a team as a code owner January 22, 2024 13:05
@knizhnik knizhnik requested review from save-buffer and removed request for a team January 22, 2024 13:05
@knizhnik knizhnik force-pushed the propagate_reply_flush_lsn_from_sk_to_ps branch from 171793a to 9bee043 Compare January 23, 2024 14:27
@knizhnik knizhnik changed the title Propagate repply_flush_lsn from SK to PS to prevent GC from collecting objects which may be still requested by replica Propagate apply_lsn from SK to PS to prevent GC from collecting objects which may be still requested by replica Jan 23, 2024
@knizhnik knizhnik force-pushed the propagate_reply_flush_lsn_from_sk_to_ps branch from 1568c14 to 16e196f Compare January 23, 2024 17:31
Contributor

@save-buffer save-buffer left a comment


I gave it a look-through, don't have any issues. Seems that we're just piping the horizon through from compute to pageserver, and that's the majority of the change.

@knizhnik knizhnik force-pushed the propagate_reply_flush_lsn_from_sk_to_ps branch from 6007fad to f6c6808 Compare January 25, 2024 06:57
@arssher
Contributor

arssher commented Jan 29, 2024

High level notes:

  • <lsn, latest> -> <lsn, horizon> switch generally looks good and very important for standbys, but it is a breaking protocol change. We either need to hack around this or restart all computes during release (not appealing).
  • CombineHotStanbyFeedbacks fix looks good.
  • Aggregation of StandbyReply at safekeepers itself is also ok.
  • However, I don't like the standby_horizon machinery. It will flip-flop at the PS because only a safekeeper with a connected standby will send a non-zero value to the broker; I guess that to do this reasonably we would need to augment the LSN with a timestamp, so that the pageserver would respect only a reasonably recent horizon hold-off request. But more importantly, on second look I strongly doubt this mechanism is needed at all. While we have indeed seen "tried to request a page version that was garbage collected" errors in practice, these were likely due to several bugs we had in the past when the standby got stuck, plus the improper 'horizon' handling fixed here. Our default gc horizon is 1 week AFAIR; a replica lagging more than that is clearly not sane and needs investigation.

Mostly note to myself: haven't checked yet if hs feedback is processed at compute at all, last time I checked it wasn't.

@knizhnik
Contributor Author

High level notes:

  • <lsn, latest> -> <lsn, horizon> switch generally looks good and very important for standbys, but it is a breaking protocol change. We either need to hack around this or restart all computes during release (not appealing).

I agree. I didn't think much about upgrade issues.
Maybe we should just extend the protocol with a new message (something like get_page_in_lsn_range) so that the PS can accept both new and old requests.

  • However, I don't like the standby_horizon machinery. It will flip-flop at the PS because only a safekeeper with a connected standby will send a non-zero value to the broker;

It seems to be not a bug but a feature. We do not want offline replicas to somehow suspend GC. If a replica is restarted, then it restarts at the most recent LSN.

I guess that to do this reasonably we would need to augment the LSN with a timestamp, so that the pageserver would respect only a reasonably recent horizon hold-off request.

I do not understand why we need a timestamp here.
What is most critical for us is the LSN delta. Assume we have a replica with lag=N. It means that we do not allow GC at the PS to reclaim up to N bytes of data (maybe with some multiplier). Is that acceptable or not? It depends on N. If N is small enough (e.g. < 1Gb), then it doesn't matter whether the time lag is a minute, an hour or a day: we can keep an extra Gb in PS storage. But if N=1Tb, then it is definitely not acceptable, and once again the time lag is not important.

There is currently a hardcoded constant:

MAX_STANDBY_LAG: u64 = 1024 * 1024 * 1024; // 1Gb

which limits reported replica lag:

                if reply.apply_lsn != Lsn::INVALID
                    && self.agg_ps_feedback.last_received_lsn < reply.apply_lsn + MAX_STANDBY_LAG
                {
                    if reply_agg.apply_lsn != Lsn::INVALID {
                        reply_agg.apply_lsn = Lsn::min(reply_agg.apply_lsn, reply.apply_lsn);
                    } else {
                        reply_agg.apply_lsn = reply.apply_lsn;
                    }
                }

Maybe it would be better to replace it with a PS parameter, but I do not understand why we need anything more than such a check.

But more importantly, on second look I strongly doubt this mechanism is needed at all. While we have indeed seen "tried to request a page version that was garbage collected" errors in practice, these were likely due to several bugs we had in the past when the standby got stuck, plus the improper 'horizon' handling fixed here. Our default gc horizon is 1 week AFAIR; a replica lagging more than that is clearly not sane and needs investigation.

I also do not understand what could cause such a large replication lag. Most likely it means some error, which hopefully is already fixed.

But on the other hand, this standby_horizon mechanism actually costs nothing: it doesn't introduce any extra overhead. The only drawback is that it can suspend GC, but only in case of a large replication lag, which, as you mentioned, should not normally happen. And if it still manages to happen, it is better to occupy more space on disk than to abort queries by reporting an error.

Mostly note to myself: haven't checked yet if hs feedback is processed at compute at all, last time I checked it wasn't.

I have checked it: it is processed. Please note that hot_standby_feedback needs to be set.

@@ -62,7 +62,7 @@ typedef struct
typedef struct
{
NeonMessageTag tag;
bool latest; /* if true, request latest page version */
XLogRecPtr horizon; /* upper boundary for page LSN */
Contributor


Saying something here about the 0 special case would be good, or adding a reference to neon_get_horizon.

Contributor Author


I have removed the comment because I no longer consider 0 a special case.
get_page now specifies an LSN interval, and 0 is a valid value for the interval's lower boundary.
It is not treated specially at the PS.

}
let last_record_lsn = timeline.get_last_record_lsn();
let request_horizon = if horizon == Lsn::INVALID {
lsn
Contributor


This removes the comment about the special lsn == 0 case. While we don't connect the pageserver directly to compute anymore, I believe this hack is still valid when we get the initial basebackup, so it's worth keeping the comment, adapting it.

Contributor Author


See above

// anyway)
}
let last_record_lsn = timeline.get_last_record_lsn();
let request_horizon = if horizon == Lsn::INVALID {
Contributor


This is quite unobvious; can't we do that on the compute side (set lsn == request_lsn if the latter is 0)?

Contributor Author


This was done specially for convenience in tests: there are a lot of tests which originally used latest=false.
Passing a copy of lsn as horizon in all these tests seems very inconvenient,
so I prefer to add a comment here rather than requiring a copy of the LSN to be passed.
And it seems quite logical:

  • Lsn::MAX stands for the latest version
  • any valid LSN specifies an upper boundary for the LSN
  • Lsn::INVALID (0) moves the upper boundary down to the lower boundary = lsn

@knizhnik
Contributor Author

knizhnik commented Jan 30, 2024

I have made the following changes after discussion with @arssher:

  1. It is not possible to distinguish a reply with a zero horizon received from an SK which lost its connection with the replica from a reply from an SK which was never connected to any replica. @arssher suggested rejecting an invalid (0) standby horizon at the PS and using a timeout to expire its value, so that if nothing refreshes standby_horizon with a non-zero value, it no longer affects GC. I implemented this scheme, but then decided that yet another hardcoded timeout is not so good. This is why I just reset standby_horizon at each GC iteration, so the timeout is effectively equal to gc_period (see the sketch after this list).
  2. Since the PS should keep working with old clients that send the latest flag instead of horizon in the get_page request (sorry for the name collision, but do not confuse this horizon with the standby horizon), and we do not support protocol versions, I added a new command, GetLatestPage, with the old tag (2), and GetPage is given a new tag (4). So now the PS can serve both old and new clients.
  3. Trying to address the problem with the lack of active XIDs at replica startup, I slightly changed the startup procedure for the replica: it is no longer assumed that the node was cleanly shut down. So, in theory, the Postgres hot standby state machine should take care of it.
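
A rough sketch of the reset-per-GC-iteration idea from item 1, using hypothetical field and method names rather than the actual Timeline code: GC reads the current standby_horizon, applies it to the cutoff, and then zeroes it, so a horizon that is never refreshed stops holding GC back after one gc_period.

use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical, heavily simplified stand-in for the pageserver Timeline.
struct Timeline {
    // Standby horizon pushed from the safekeepers via the broker;
    // 0 means "no active replica reported".
    standby_horizon: AtomicU64,
}

impl Timeline {
    // One GC iteration: honor the horizon if it was refreshed since the last
    // iteration, then reset it so a stale value cannot hold GC back forever.
    fn gc_cutoff(&self, planned_cutoff: u64) -> u64 {
        let horizon = self.standby_horizon.swap(0, Ordering::Relaxed);
        if horizon != 0 {
            planned_cutoff.min(horizon)
        } else {
            planned_cutoff
        }
    }
}

fn main() {
    let tl = Timeline { standby_horizon: AtomicU64::new(700) };
    assert_eq!(tl.gc_cutoff(1000), 700);  // horizon holds GC back once
    assert_eq!(tl.gc_cutoff(1000), 1000); // not refreshed: GC proceeds
}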

@hlinnaka
Contributor

hlinnaka commented Feb 8, 2024

I don't understand why all the compute <-> pageserver protocol changes are needed. Can you explain? (I remember we talked about that in a 1:1 call last week, but I cannot remember it now, and it would be good to have written down in comments anyway)

@knizhnik
Contributor Author

knizhnik commented Feb 9, 2024

I don't understand why all the compute <-> pageserver protocol changes are needed. Can you explain? (I remember we talked about that in a 1:1 call last week, but I cannot remember it now, and it would be good to have written down in comments anyway)

I briefly explained it in the comment in pagestore_smgr.c:

/*
 * There are three kinds of get_page requests:
 * 1. Master compute: get the latest page, not older than the specified LSN (horizon=Lsn::MAX)
 * 2. RO replica: get the latest page, not newer than the WAL position the replica has already applied (horizon=GetXLogReplayRecPtr(NULL))
 * 3. Snapshot: get the latest page, not newer than the specified LSN (horizon=request_lsn)
 */

The problem itself is explained in more detail in Slack:
https://neondb.slack.com/archives/C036U0GRMRB/p1705932864991419

So I assume that two cases are clear:

  1. Normal RW node: it always requests the latest version of the page, because nobody except it can update pages.
  2. Static RO node (snapshot): it always requests a version of the page not greater than the snapshot LSN.

In both cases we do not need a range, just the "latest" flag.
The most difficult case is a hot-standby replica. It receives and applies WAL concurrently with the PS. The replica can be ahead of the PS or vice versa. This is why the replica cannot request the latest version of the page: if the PS is ahead of the replica, it would get a "future" version of the page. But we also cannot request a version of the page with an LSN not greater than some specified value (as in the static replica case), because we use the "last written LSN" cache to estimate when the page was last updated. If the page was not updated for a long time, then requesting the page with this LSN may cause the PS to retrieve a version of the page which was already collected by GC.

Originally I thought that the problem was related only to GC, and that if GC is disabled then there is no problem with hot-standby replicas. But it looks like that is not true:
#6674
In that case the problem is caused by accessing the FSM when a heap page is updated: if, to fetch the FSM page, we use the last written LSN of the MAIN fork page, then we can get an error at the PS.

So to handle case 3 we need to pass a range of LSNs: the lower boundary specifies the estimated LSN of the last page update (so that we do not make the PS wait for the most recent updates to be applied), and the upper boundary specifies the current apply position of the replica, to prevent the PS from sending too-young pages.

With such a range, determining the start LSN position for the lookup in the layer map becomes even simpler than now with the "latest" flag: it is just max(lower_boundary, min(last_record_lsn, upper_boundary)).
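
For illustration, a tiny sketch of that clamp with a hypothetical helper and plain u64 LSNs:

// Pick the LSN used for the layer-map lookup from the requested
// [lower_boundary, upper_boundary] range and the pageserver's last_record_lsn.
fn lookup_lsn(lower_boundary: u64, upper_boundary: u64, last_record_lsn: u64) -> u64 {
    lower_boundary.max(last_record_lsn.min(upper_boundary))
}

fn main() {
    // Replica applied up to 900, page last updated around 500, PS ingested 1200:
    // read the page as of LSN 900, never a "future" version.
    assert_eq!(lookup_lsn(500, 900, 1200), 900);
    // PS has only ingested up to 800: read at 800, still within the range.
    assert_eq!(lookup_lsn(500, 900, 800), 800);
    // Primary compute: upper boundary is effectively Lsn::MAX, so read the latest.
    assert_eq!(lookup_lsn(500, u64::MAX, 1200), 1200);
}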

@hlinnaka
Contributor

hlinnaka commented Feb 9, 2024

Ok, so the protocol changes are completely unrelated to the topic of this PR, propagating apply_lsn from SK to PS. Please split that off to a separate PR.

@knizhnik
Contributor Author

Ok, so the protocol changes are completely unrelated to the topic of this PR, propagating apply_lsn from SK to PS. Please split that off to a separate PR.

#6357

@knizhnik knizhnik mentioned this pull request Feb 15, 2024
5 tasks
@brendan-stephens

@knizhnik, et al, thanks for your work on this issue.
I can see this has split off a bit into various subcomponents.
Do all of the items here need to be in place for the primary "request a page version that was garbage collected" issue? Or has that been addressed and the rest are improvements?
I have a few customers who have been persisting their replicas up to try and avoid this issue.

@kelvich
Contributor

kelvich commented Mar 18, 2024

I see some changes in protocol. While deploying we will have older compute images talking to newer pageservers and safekeepers. Would that be a problem?

@petuhovskiy can you please review this one? (Ideas on how to split it up into a safer series of patches and how to increase test coverage are welcome.)

@kelvich
Contributor

kelvich commented Mar 18, 2024

Also, #6718 contains some of the protocol changes as well.

@skyzh, when talking about the last compute image rollout you mentioned explicit protocol versioning for the compute <> pageserver protocol, to avoid assuming that each release has some breaking (non-backward-compatible) change. Should we add a protocol version right here? Then in a follow-up PR we can add some tests to check for API breakage and relax the compute image selection rules in the control plane.

@knizhnik
Contributor Author

The protocol changes (sending an LSN range) were extracted from this ticket into #6718.
Propagation of the LSN doesn't require any protocol changes.
My plan is the following: first merge #6718, and then I will rebase this ticket so that it contains only the changes needed to correctly propagate the LSN from the replica to the PS.

@kelvich
Contributor

kelvich commented Mar 18, 2024

OK, got it. Then @petuhovskiy or @arssher, can you please take a look at #6718 instead?

@petuhovskiy
Member

Took a quick look at #6718; it's about the compute<->PS protocol change and I don't have expertise on that. But I'll take a look at this PR after it gets rebased.

@petuhovskiy petuhovskiy self-requested a review March 18, 2024 12:12
@skyzh
Member

skyzh commented Mar 18, 2024

Also note that current prod runs a compute node release from 3 weeks ago. Better to wait for it to catch up with our latest release before adding more things to the compute, so that it's easier to roll back in case something goes wrong.

@andreasscherbaum
Contributor

current prod runs a release of compute node from 3 weeks ago

We definitely need to unblock/unpin Compute first and see if it's stable. Only then can we merge this in and release it.

@knizhnik After moving the protocol changes into #6718, are there any other breaking changes in this PR which require a Compute restart?

This PR touches 30+ code files. Is it possible to break it down into smaller patches which can be rolled out in 2 or 3 steps, each building on the previous one?

@knizhnik
Contributor Author

Replaced with #7368

@knizhnik knizhnik closed this Apr 12, 2024
knizhnik pushed a commit that referenced this pull request May 14, 2024
…ts which may be still requested by replica

refer #6211 #6357
arssher pushed a commit that referenced this pull request May 20, 2024
…ts which may be still requested by replica

refer #6211 #6357
10 participants