
Epic: physical replication #6211

Open
2 of 9 tasks
vadim2404 opened this issue Dec 21, 2023 · 74 comments
Assignees
Labels
c/compute Component: compute, excluding postgres itself t/bug Issue Type: Bug t/Epic Issue type: Epic

Comments

@vadim2404
Contributor

vadim2404 commented Dec 21, 2023

Summary

The original issue we hit was:

page server returned error: tried to request a page version that was garbage collected. requested at C/1E923DE0 gc cutoff C/23B3DF00

but then the scope grew quickly. This is the Epic to track the main physical replication work.

Tasks

  1. c/compute t/bug
    knizhnik
  2. c/compute t/bug
    knizhnik
  3. c/compute

Related Epics:

@vadim2404 vadim2404 added t/bug Issue Type: Bug c/compute Component: compute, excluding postgres itself labels Dec 21, 2023
@knizhnik
Contributor

I am now thinking about how this can be done.

  • The replica receives WAL from the safekeeper.
  • The master compute knows nothing about the presence of a replica: there is no replication slot on the master.
  • A replica can be arbitrarily lagged, suspended, ... It may not contact either the SK or the PS for an arbitrarily long time.
  • There are also no replication slots on the SK, so the SK has no knowledge of all existing replicas and their WAL positions.

So what can we do?

  1. We can create a replication slot at the master. This slot would be persisted using the AUX_KEY mechanism (right now it works only for logical slots, but that can be changed), and by applying this WAL record the PS would learn the replica's position. It is not clear who would advance this slot if replication is performed from the SK. In principle, the SK could send this position in some feedback message to the PS, but that looks pretty ugly.
  2. The SK should explicitly notify the PS about the current position of all replicas. It is not obvious how to report this position to the PS, which currently just receives the WAL stream from the SK. Should it be a special message in the SK<->PS protocol? Or should the SK generate a WAL record with the replica position (it is not clear which LSN such a record should be assigned so that it can be included in the stream of existing WAL records)? As mentioned above, the SK has no information about all replicas, so the lack of such a message doesn't mean that there is no replica with some old LSN.
  3. The replica should notify the PS itself (by means of some special message). The problem is that the replica can be offline and not send any requests to the PS.
  4. In addition to PITR, we can also have a max_replica_lag parameter. If a replica exceeds this value, it is disabled (see the sketch below).
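For illustration only, here is a rough sketch of how option 4 could combine with a known replica position when computing the GC cutoff. The names (the Lsn newtype, gc_cutoff, max_replica_lag) are made up for this sketch and are not the actual pageserver API; the LSNs reuse the numbers from the error quoted at the top.

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Lsn(u64);

// Clamp the GC cutoff by the slowest known replica, but never retain more
// than `max_replica_lag` bytes of WAL history for it.
fn gc_cutoff(
    last_record_lsn: Lsn,               // head of WAL known to the PS
    pitr_cutoff: Lsn,                   // cutoff computed from the PITR window
    min_replica_apply_lsn: Option<Lsn>, // slowest known replica, if any
    max_replica_lag: u64,               // maximum lag (in bytes) we agree to retain
) -> Lsn {
    let lag_floor = Lsn(last_record_lsn.0.saturating_sub(max_replica_lag));
    match min_replica_apply_lsn {
        // Hold GC back to the replica position, but not further than the lag floor.
        Some(replica_lsn) => pitr_cutoff.min(replica_lsn.max(lag_floor)),
        // No replica known: plain PITR behaviour.
        None => pitr_cutoff,
    }
}

fn main() {
    let cutoff = gc_cutoff(
        Lsn(0x0C_2400_0000),       // hypothetical head of WAL
        Lsn(0x0C_23B3_DF00),       // C/23B3DF00, the gc cutoff from the error above
        Some(Lsn(0x0C_1E92_3DE0)), // C/1E923DE0, the LSN the replica requested
        256 * 1024 * 1024,         // max_replica_lag = 256 MiB
    );
    println!("gc cutoff: {:X?}", cutoff);
}

With these numbers, GC would be held back to the replica's position instead of the PITR cutoff, which is exactly what would have avoided the error above.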

@kelvich
Contributor

kelvich commented Jan 2, 2024

So basically we need to delay PITR for some amount of time for lagging replicas when they are enabled.

The replica should notify the PS itself (by means of some special message). The problem is that the replica can be offline and not send any requests to the PS.

That could be done with a time lease: the replica sends a message every 10 minutes; when the pageserver doesn't receive 3 messages in a row, it considers the replica disabled.
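A rough sketch of such a lease, with made-up types and the numbers from this comment (10-minute interval, 3 misses), just to pin down the idea:

use std::time::{Duration, Instant};

const LEASE_INTERVAL: Duration = Duration::from_secs(10 * 60); // replica pings every 10 minutes
const MAX_MISSED: u32 = 3;                                     // 3 missed pings => replica considered gone

struct ReplicaLease {
    last_ping: Instant,
    apply_lsn: u64, // position reported in the last ping
}

impl ReplicaLease {
    fn on_ping(&mut self, apply_lsn: u64) {
        self.last_ping = Instant::now();
        self.apply_lsn = apply_lsn;
    }

    /// Should GC still hold layers back for this replica?
    fn is_live(&self, now: Instant) -> bool {
        now.duration_since(self.last_ping) < LEASE_INTERVAL * MAX_MISSED
    }
}

fn main() {
    let mut lease = ReplicaLease { last_ping: Instant::now(), apply_lsn: 0 };
    lease.on_ping(0x0C_1E92_3DE0);
    println!("hold GC for LSN {:X}: {}", lease.apply_lsn, lease.is_live(Instant::now()));
}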

The SK should explicitly notify the PS about the current position of all replicas. It is not obvious how to report this position to the PS, which currently just receives the WAL stream from the SK. Should it be a special message in the SK<->PS protocol? Or should the SK generate a WAL record with the replica position (it is not clear which LSN such a record should be assigned so that it can be included in the stream of existing WAL records)? As mentioned above, the SK has no information about all replicas, so the lack of such a message doesn't mean that there is no replica with some old LSN.

Won't the usual feedback message help? IIRC we already have it for backpressure, and the pageserver also knows those LSNs via the storage broker.

@knizhnik
Contributor

knizhnik commented Jan 2, 2024

PITR is enforced at the PS, and information about the replica flush/apply position is available only at the SK. The problem is that the PS can be connected to one safekeeper (SK1) and the replica to some other safekeeper (SK2). The only components that know about all SKs are the compute and the broker. But the compute may be inactive (suspended) at the moment GC is performed by the PS, and involving the broker in the garbage-collection process on the PS seems to be overkill. Certainly the SKs can somehow interact with each other or through the walproposer, but that also seems too complicated and fragile.

@kelvich
Contributor

kelvich commented Jan 2, 2024

PITR is enforced at the PS, and information about the replica flush/apply position is available only at the SK. The problem is that the PS can be connected to one safekeeper (SK1) and the replica to some other safekeeper (SK2). The only components that know about all SKs are the compute and the broker. But the compute may be inactive (suspended) at the moment GC is performed by the PS, and involving the broker in the garbage-collection process on the PS seems to be overkill. Certainly the SKs can somehow interact with each other or through the walproposer, but that also seems too complicated and fragile.

Through the broker, the pageserver has information about LSNs on all safekeepers; that is how the pageserver decides which one to connect to. So a safekeeper can advertise the min feedback LSN out of all replicas connected to it (if any).
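As a sketch (made-up helper names, not the actual safekeeper or pageserver code), the reduction would be something like:

// Safekeeper side: the smallest apply LSN reported by standbys currently
// connected to this safekeeper, or None if no standby is attached.
fn min_standby_apply_lsn(standby_apply_lsns: &[u64]) -> Option<u64> {
    standby_apply_lsns.iter().copied().min()
}

// Pageserver side: the replica horizon for a timeline is the minimum over
// whatever all safekeepers advertised through the broker.
fn replica_horizon(per_safekeeper: &[Option<u64>]) -> Option<u64> {
    per_safekeeper.iter().flatten().copied().min()
}

fn main() {
    let sk1 = min_standby_apply_lsn(&[0x0C_1E92_3DE0, 0x0C_2000_0000]);
    let sk2 = min_standby_apply_lsn(&[]); // no replicas attached to this safekeeper
    println!("replica horizon: {:X?}", replica_horizon(&[sk1, sk2]));
}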

Also, most likely, we should use information from the broker when deciding which safekeeper to connect to on the replica. @arssher what do you think?

@arssher
Contributor

arssher commented Jan 2, 2024

Through the broker, the pageserver has information about LSNs on all safekeepers; that is how the pageserver decides which one to connect to. So a safekeeper can advertise the min feedback LSN out of all replicas connected to it (if any).

Yes, this seems to be the easiest way.

Also, most likely, we should use information from the broker when deciding which safekeeper to connect to on the replica. @arssher what do you think?

Not necessarily. A replica here is different from the pageserver because it costs something, so we're OK with keeping the standby -> safekeeper connection open all the time as long as the standby is alive, which means the standby can be the initiator of the connection. So what we do currently is just wire all safekeepers into primary_conninfo; if one is down, libpq will try another, etc. If the set of safekeepers changes we need to update the setting, but this is not hard (though it is not automated yet).
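For reference, that multi-host wiring looks roughly like this in the standby configuration (host names and port are made up for the example; libpq tries each host in the list until one connection succeeds):

primary_conninfo = 'host=safekeeper-0,safekeeper-1,safekeeper-2 port=5454,5454,5454'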

With the pageserver we can't do the same, because we don't want to keep live connections for all existing attached timelines, and the safekeeper learns about new data first, so it should be the initiator of the connection. Using the broker gives another advantage: the pageserver can have an active connection and, at the same time, up-to-date info about other safekeepers' positions, so it can choose better where to connect in complicated scenarios, e.g. when the connection to the current SK is fine but it is very slow for whatever reason. Similar, though less powerful, heuristics can be implemented without broker data (e.g. restart the connection if no new data arrives within some period).

Also, using the broker on the standby would likely be quite nontrivial because it is gRPC, and I'm not even sure a C gRPC library exists. So it looks like significant work without much gain.

@arssher
Contributor

arssher commented Jan 2, 2024

On a related note, I'm also very suspicious that the original issue is caused by this -- "doubt that replica lags for 7 days" -- me too. Looking at metrics to understand the standby position would be very useful, but pg_last_wal_replay_lsn is likely not collected :(

@knizhnik
Contributor

knizhnik commented Jan 2, 2024

OK, so to summarize all of the above:

  1. Information about the replica apply position can be obtained by the PS from the broker (it is still not quite clear to me how frequently this information is updated).
  2. The problem is most likely not caused by replication lag, but by some bug in tracking VM updates on either the compute or the PS side. Since the problem is reproduced only on the replica, it is most likely a bug in the compute, particularly in performing redo in the compute. The PS does not know whether a get_page request comes from the master or a replica, so the problem is unlikely to be there. But there is one important difference: the master issues get_page requests with the latest option (takes the latest LSN), while the replica uses latest=false (see the sketch below).
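For reference, the difference in item 2 is roughly the following (an illustrative shape only, not the exact pagestream wire format):

#[allow(dead_code)]
struct GetPageRequest {
    latest: bool, // primary: true  => newest page version the PS has
                  // replica: false => page exactly at the request LSN below
    lsn: u64,     // the replica's replay position; this is the LSN that can
                  // fall behind the GC cutoff and trigger the error above
    rel: (u32, u32, u32, u8), // spcnode, dbnode, relnode, forknum
    blkno: u32,
}

fn main() {
    // A replica that has replayed up to C/1E923DE0 asks for block 0 of rel 1663/40960/826.
    let _from_replica = GetPageRequest {
        latest: false,
        lsn: 0x0C_1E92_3DE0,
        rel: (1663, 40960, 826, 0),
        blkno: 0,
    };
}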

@knizhnik
Contributor

knizhnik commented Jan 2, 2024

One of the problems with requesting information about the replica position from the broker is that it is available only as long as the replica is connected to one of the SKs. If the replica is suspended, this information is not available. As far as I understand, only the control plane has information about all replicas. But it is not desirable to:

  • involve the control plane in the GC process
  • block GC until all replicas are online
  • remember the current state of all replicas in some shared storage

@vadim2404
Contributor Author

Under investigation (will most probably slip to next week).

@arssher
Contributor

arssher commented Jan 2, 2024

One of the problems with requesting information about the replica position from the broker is that it is available only as long as the replica is connected to one of the SKs.

Yes, but as Stas wrote somewhere, it's mostly OK to keep the data only as long as the replica is around. A newly seeded replica shouldn't lag significantly. Well, there is probably also the case of a standby pinned to an LSN, but that can be addressed separately.

@knizhnik
Contributor

knizhnik commented Jan 2, 2024

A newly seeded replica shouldn't lag significantly.

My concern is that a replica can be suspended because of inactivity.
I wonder how we protect a replica from scale-to-zero now (if there are no active requests to the replica).

@vadim2404
Contributor Author

Recently, @arssher turned off suspension for computes that have logical replication subscribers:
a41c412

@knizhnik, you can adjust this part for RO endpoints. In compute_ctl, the compute type (R/W or R/O) is known.

@vadim2404
Contributor Author

@knizhnik to check why the replica needs to download the WAL.

@kelvich
Contributor

kelvich commented Jan 9, 2024

My concern is that a replica can be suspended because of inactivity.
Do not suspend a read-only replica if it has applied some WAL within some time interval (e.g. 5 minutes); this can be checked using last_flush_lsn.
Periodically wake up the read-only node so that it can connect to the master and get updates. The wakeup period should be several times larger than the suspend interval (otherwise it makes no sense to suspend the replica at all). It may also be useful to periodically wake up not only read-only replicas but any other suspended nodes; such computes would have a chance to perform some bookkeeping work, e.g. autovacuum. I do not think that waking a node once per hour for 5 minutes can significantly affect cost (for users).

Hm, how did we end up here? A replica should be suspended due to inactivity. A new start will begin from the latest LSN, so I'm not sure why replica suspension is relevant.

There are two open questions now:

  • why the replica lags so much; that shouldn't happen and is the most pressing issue
  • how we delay GC in the case of a legitimately lagging replica. The broker approach sounds reasonable (no replica == no need to hold GC). The control plane doesn't know about replica LSNs and shouldn't know about them.

@knizhnik
Contributor

Sorry, my concerns about read-only replica suspension (when there are no active queries) seem to be irrelevant.
Unlike a "standard" read-only replica in vanilla Postgres, we do not need to replay all WAL when activating a suspended replica. The pageserver just creates a basebackup at the most recent LSN for launching this replica, and I have verified that it is really done this way now.

So a lagging replica cannot be caused by replica suspension. Quite the opposite: suspending and restarting a replica should cause it to "catch up" with the master. A large replication lag between master and replica must be caused by some other reason. Actually, I see only two possible reasons:

  1. The replica applies WAL more slowly than the master produces it, for example because the replica runs on a less powerful VM than the master.
  2. There was some error processing WAL at the replica which stalled replication. It can be related to the problem recently fixed by @arssher (alignment of segments sent to the replica on a page boundary).

Are there links to the projects suffering from this problem? Can we include them in this ticket?

Concerning the approach described above: take information about the replica LSN from the broker and use it to restrict the PITR boundary, preventing GC from removing layers that may still be accessed by the replica. There are two kinds of LSNs maintained by the SK: the last committed LSN returned in the response to append requests, and the triple of LSNs (write/flush/apply) included in hot-standby feedback and collected by the SK as the min across all subscribers (PS and replicas). I wonder whether the broker can currently provide access to both of these LSNs. @arssher?
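To make the two kinds concrete, here is an illustrative summary (made-up struct names, not the actual safekeeper or broker types):

// Illustrative only: the two kinds of positions described above, per timeline
// on one safekeeper.

#[allow(dead_code)]
struct StandbyFeedback {
    write_lsn: u64,
    flush_lsn: u64,
    apply_lsn: u64, // GC on the pageserver would be held back by this one
}

#[allow(dead_code)]
struct TimelinePositions {
    /// Last committed LSN, acknowledged back to the walproposer.
    commit_lsn: u64,
    /// Min of the write/flush/apply feedback over all subscribers (PS and replicas), if any.
    min_standby_feedback: Option<StandbyFeedback>,
}

fn main() {
    let _positions = TimelinePositions {
        commit_lsn: 0x0C_23B3_DF00,
        min_standby_feedback: Some(StandbyFeedback {
            write_lsn: 0x0C_1E92_3DE0,
            flush_lsn: 0x0C_1E92_3DE0,
            apply_lsn: 0x0C_1E92_3DE0,
        }),
    };
}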

@arssher
Contributor

arssher commented Jan 10, 2024

I wonder whether the broker can currently provide access to both of these LSNs.

Not everything is published right now, but this is trivial to add; see SafekeeperTimelineInfo.

@vadim2404
Contributor Author

status update: in review

@vadim2404
Contributor Author

to review it with @MMeent

@vadim2404
Contributor Author

@arssher to review the PR

@ItsWadams

Hey All - a customer just asked about this in an email thread with me about pricing. Are there any updates we can provide them?

@vadim2404
Contributor Author

The problem was identified, and @knizhnik is working on fixing it.

But the fix requires time because it affects the compute, safekeeper, and pageserver. I expect we will merge and ship it in February.

@YanicNeon

We got a support case about this problem today (ZD #2219)

Keeping an eye on this thread

@acervantes23

@knizhnik what's the latest status on this issue?

@knizhnik
Contributor

knizhnik commented Feb 7, 2024

@knizhnik what's the latest status on this issue?

There is PR #6357 waiting for one more round of review.
There were also some problems with e2e tests: https://neondb.slack.com/archives/C03438W3FLZ/p1706868624273839
which are not yet resolved and where I need help from somebody familiar with e2e tests.

hlinnaka added a commit that referenced this issue May 6, 2024
Once all the computes in production have restarted, we can remove
protocol version 1 altogether.

See issue #6211.
hlinnaka added a commit that referenced this issue May 6, 2024
Once all the computes in production have restarted, we can remove
protocol version 1 altogether.

See issue #6211.
conradludgate pushed a commit that referenced this issue May 8, 2024
Once all the computes in production have restarted, we can remove
protocol version 1 altogether.

See issue #6211.
knizhnik pushed a commit that referenced this issue May 14, 2024
…ts which may be still requested by replica

refer #6211 #6357
arssher pushed a commit that referenced this issue May 20, 2024
…ts which may be still requested by replica

refer #6211 #6357
hlinnaka added a commit that referenced this issue May 21, 2024
Once all the computes in production have restarted, we can remove
protocol version 1 altogether.

See issue #6211.

This was done earlier already in commit 0115fe6, but reverted
before it was released to production in commit bbe730d because of
issue #7692. That issue was
fixed in commit 22afaea, so we are ready to change the default
again.
hlinnaka added a commit that referenced this issue May 22, 2024
Once all the computes in production have restarted, we can remove
protocol version 1 altogether.

See issue #6211.

This was done earlier already in commit 0115fe6, but reverted before
it was released to production in commit bbe730d because of issue
#7692. That issue was fixed
in commit 22afaea, so we are ready to change the default again.
@lubennikovaav
Contributor

Not sure if this is helpful, but I'll leave it here for the record.

Caught the error that was mentioned in the discussion above.
Compute version 5623

PG:2024-06-09 18:03:42.771 GMT ttid=4fa0435ad413cf44efb43eaad51c7814/245c619f9df3c8210152b7e254f26583 
sqlstate=XX000 [7104] FATAL:  too many KnownAssignedXids

PG:2024-06-09 18:03:42.771 GMT ttid=4fa0435ad413cf44efb43eaad51c7814/245c619f9df3c8210152b7e254f26583 
sqlstate=XX000 [7104] CONTEXT:  WAL redo at 0/DF4B5F70 for neon/HOT_UPDATE: old_xmax: 1775616, old_off: 37, 
old_infobits: [], flags: 0x20, new_xmax: 0, new_off: 39; blkref #0: rel 1663/40960/826, blk 0 FPW

https://neonprod.grafana.net/d/IJSLHBOnk/neon-logs-compute-nodes-by-compute-id?orgId=1&from=1717956222000&to=1717956223000&var-datasource=grafanacloud-logs&var-region=All&var-search=&var-exclude=string_to_exclude_from_search&var-vm_metrics_datasource=ZNX49CDVz&var-compute=compute-morning-breeze-a7plxe9n

@ololobus ololobus changed the title [Compute]: Issue with read replicas Epic: [Compute] Issue with read replicas Jun 20, 2024
@ololobus ololobus added the t/Epic Issue type: Epic label Jun 20, 2024
@ololobus ololobus changed the title Epic: [Compute] Issue with read replicas Epic: [Compute] Issues with read replicas Jun 20, 2024
@kelvich
Contributor

kelvich commented Jun 26, 2024

Status update:

Critical issues are fixed, the compute<>storage protocol v2 is rolled out everywhere, and we are monitoring the errors that customers hit.

Overall, none of that is critical, and I'm OK to proceed with the re-launch.

@ololobus ololobus changed the title Epic: [Compute] Issues with read replicas [Compute] Issues with read replicas Jul 1, 2024
@ololobus ololobus added t/Epic Issue type: Epic and removed t/Epic Issue type: Epic labels Jul 1, 2024
@ololobus ololobus changed the title [Compute] Issues with read replicas Epic: Issues with read replicas (physical replication) Jul 2, 2024
@ololobus
Member

ololobus commented Jul 2, 2024

This week's plan and update:

  • @knizhnik will look at Bug: walreceiver did not restart after erroring out #8172, although he is not yet sure how to reproduce it
  • regarding the RO configuration https://github.com/neondatabase/cloud/issues/14903
    • @tristan957 will run an analytic SQL query to figure out how many potentially affected ROs we have
    • @ololobus will figure out some easy fix for avoiding the RO misconfiguration (see Stas's comment in the issue); likely we will try to sync the RO GUCs with the RW ones
  • A few more items we briefly touched on; maybe @hlinnaka can comment on these too
    • Hot standby feedback: @knizhnik had a point that its absence can lead to replica lag as well (the relevant GUCs are shown below). I recall that Heikki was against enabling it. My gut feeling is that we need to fix the RO configuration first, then check whether we still have lagging replicas
    • Shutdown records / making a checkpoint at shutdown: @kelvich said it could be important, but I see that we still use immediate shutdown. Can that be a problem?
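For reference, the standby-side GUCs the hot-standby-feedback item is about (stock Postgres settings shown with their usual defaults; nothing Neon-specific implied here):

hot_standby_feedback = off          # when on, the standby reports its xmin upstream so vacuum does not remove rows it still needs
max_standby_streaming_delay = 30s   # how long WAL replay may be paused behind conflicting queries before they are cancelled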

@tristan957
Member

Shutdown records / making a checkpoint at shutdown: @kelvich said it could be important, but I see that we still use immediate shutdown. Can that be a problem?

Does anyone have context on why we shut down the way we currently do?

@ololobus
Member

ololobus commented Jul 2, 2024

Does anyone have context on why we shut down the way we currently do?

I originally used fast mode in the suspend path, but it creates a checkpoint, which takes longer and is unnecessary for durability. Then we changed it to immediate at some point. We also cannot fully rely on it anyway, as the compute can crash.

@knizhnik
Contributor

knizhnik commented Jul 2, 2024

The shutdown checkpoint saves pgstat information, information about replication slots, and other AUX files.
I am not sure about all the bad consequences if this information is not persisted.
At the very least, we will have to do a lot of redundant work on restart.

Making a checkpoint in the case of Neon should not be expensive, because we do not actually flush pages.
So it looks like there are good arguments for restoring "fast shutdown" mode.
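For reference, the three stock Postgres shutdown modes being discussed (see the server-shutdown docs linked in the commit below):

pg_ctl stop -D $PGDATA -m smart      # wait for clients to disconnect, then write a shutdown checkpoint
pg_ctl stop -D $PGDATA -m fast       # terminate clients, then write a shutdown checkpoint
pg_ctl stop -D $PGDATA -m immediate  # stop at once, no checkpoint; recovery runs on the next start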

@tristan957
Member

tristan957 commented Jul 2, 2024

Here is the SQL query that I am going to run across all production databases:

SELECT
  (
    SELECT
      COUNT(1)
    FROM
      endpoints
    WHERE
      type = 'read_only'
  ) AS total,
  COUNT(*) FILTER(
    WHERE
      ro.autoscaling_limit_min_cu < rw.autoscaling_limit_min_cu
      OR ro.autoscaling_limit_max_cu < rw.autoscaling_limit_max_cu
  ) AS all_laggards,
  COUNT(*) FILTER(
    WHERE
      ro.autoscaling_limit_min_cu < rw.autoscaling_limit_min_cu
  ) AS laggards_that_fail_precondition,
  COUNT(*) FILTER(
    WHERE
      ro.autoscaling_limit_max_cu < rw.autoscaling_limit_max_cu
  ) AS laggards_that_cant_scale,
  COUNT(*) FILTER(
    WHERE
      ro.autoscaling_limit_min_cu = rw.autoscaling_limit_min_cu
      AND ro.autoscaling_limit_max_cu = rw.autoscaling_limit_max_cu
  ) AS same_size,
  COUNT(*) FILTER(
    WHERE
      ro.autoscaling_limit_min_cu >= rw.autoscaling_limit_min_cu
      AND ro.autoscaling_limit_max_cu > rw.autoscaling_limit_max_cu
  ) AS higher_max_cu
FROM
  endpoints AS ro
JOIN
    endpoints AS rw
    ON ro.branch_id = rw.branch_id
WHERE
  ro.type = 'read_only'
  AND rw.type = 'read_write';

Stay tuned. It should tell us how our users currently use read-only endpoints.

@MMeent
Contributor

MMeent commented Jul 2, 2024

COUNT(CASE WHEN ... THEN 1 END)
Have you heard about window functions and aggregate filters?

COUNT(*) FILTER (WHERE ...) is much more concise and clean, IMO, and last time I checked it's faster too.

@tristan957
Member

I'll take a look

@kelvich kelvich changed the title Epic: Issues with read replicas (physical replication) Epic: physical replication Jul 3, 2024
ololobus added a commit that referenced this issue Jul 5, 2024
It also creates a shutdown checkpoint, which is important for
ROs to get a list of running xacts faster instead of going through
the CLOG.

See https://www.postgresql.org/docs/current/server-shutdown.html
for the list of modes and signals.

Related to #6211