
Epic: physical replication #6211

Open
2 of 9 tasks
vadim2404 opened this issue Dec 21, 2023 · 74 comments
Assignees
Labels
c/compute Component: compute, excluding postgres itself t/bug Issue Type: Bug t/Epic Issue type: Epic

Comments

@vadim2404
Contributor

vadim2404 commented Dec 21, 2023

Summary

The original issue we hit was:

page server returned error: tried to request a page version that was garbage collected. requested at C/1E923DE0 gc cutoff C/23B3DF00

but then the scope grew quickly. This is the Epic to track the main physical replication work.

Tasks

  1. c/compute t/bug
    knizhnik
  2. c/compute t/bug
    knizhnik
  3. c/compute

Related Epics:

@vadim2404 vadim2404 added t/bug Issue Type: Bug c/compute Component: compute, excluding postgres itself labels Dec 21, 2023
@knizhnik
Contributor

I am now thinking about how this can be done.

  • The replica receives WAL from the safekeeper.
  • The master compute knows nothing about the presence of a replica: there is no replication slot on the master.
  • A replica can be arbitrarily lagged, suspended, ... It may not contact either the SK or the PS for an arbitrarily long time.
  • There are also no replication slots on the SK, so the SK has no knowledge of all existing replicas and their WAL positions.

So what can we do?

  1. We can create a replication slot at the master. This slot would be persisted using the AUX_KEY mechanism (right now it works only for logical slots, but that can be changed), and by applying this WAL record the PS would learn the replica's position. It is not clear who would advance this slot if replication is performed from the SK. In principle, the SK could send this position in some feedback message to the PS, but that looks pretty ugly.
  2. The SK should explicitly notify the PS about the current position of all replicas. It is not obvious how to report this position to the PS, which currently just receives the WAL stream from the SK. Should it be a special message in the SK<->PS protocol? Or should the SK generate a WAL record with the replica position (it is not clear which LSN such a record should be assigned so that it can be included in the stream of existing WAL records)? As mentioned above, the SK has no information about all replicas, so the lack of such a message doesn't mean that there is no replica with some old LSN.
  3. The replica should notify the PS itself (by means of some special message). The problem is that the replica can be offline and not send any requests to the PS.
  4. In addition to PITR, we can also have a max_replica_lag parameter. If a replica exceeds this value, it is disabled (see the sketch below).
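For illustration only, here is a rough sketch of how option 4 could combine with a known replica position when computing the GC cutoff. The names (the Lsn newtype, gc_cutoff, max_replica_lag) are made up for this sketch and are not the actual pageserver API; the LSNs reuse the numbers from the error quoted at the top.

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Lsn(u64);

// Clamp the GC cutoff by the slowest known replica, but never retain more
// than `max_replica_lag` bytes of WAL history for it.
fn gc_cutoff(
    last_record_lsn: Lsn,               // head of WAL known to the PS
    pitr_cutoff: Lsn,                   // cutoff computed from the PITR window
    min_replica_apply_lsn: Option<Lsn>, // slowest known replica, if any
    max_replica_lag: u64,               // maximum lag (in bytes) we agree to retain
) -> Lsn {
    let lag_floor = Lsn(last_record_lsn.0.saturating_sub(max_replica_lag));
    match min_replica_apply_lsn {
        // Hold GC back to the replica position, but not further than the lag floor.
        Some(replica_lsn) => pitr_cutoff.min(replica_lsn.max(lag_floor)),
        // No replica known: plain PITR behaviour.
        None => pitr_cutoff,
    }
}

fn main() {
    let cutoff = gc_cutoff(
        Lsn(0x0C_2400_0000),       // hypothetical head of WAL
        Lsn(0x0C_23B3_DF00),       // C/23B3DF00, the gc cutoff from the error above
        Some(Lsn(0x0C_1E92_3DE0)), // C/1E923DE0, the LSN the replica requested
        256 * 1024 * 1024,         // max_replica_lag = 256 MiB
    );
    println!("gc cutoff: {:X?}", cutoff);
}

With these numbers, GC would be held back to the replica's position instead of the PITR cutoff, which is exactly what would have avoided the error above.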

@kelvich
Contributor

kelvich commented Jan 2, 2024

So basically we need to delay PITR for some amount of time for lagging replicas when they are enabled.

The replica should notify the PS itself (by means of some special message). The problem is that the replica can be offline and not send any requests to the PS.

That could be done with a time lease: the replica sends a message every 10 minutes; when the pageserver doesn't receive 3 messages in a row, it considers the replica disabled.
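A rough sketch of such a lease, with made-up types and the numbers from this comment (10-minute interval, 3 misses), just to pin down the idea:

use std::time::{Duration, Instant};

const LEASE_INTERVAL: Duration = Duration::from_secs(10 * 60); // replica pings every 10 minutes
const MAX_MISSED: u32 = 3;                                     // 3 missed pings => replica considered gone

struct ReplicaLease {
    last_ping: Instant,
    apply_lsn: u64, // position reported in the last ping
}

impl ReplicaLease {
    fn on_ping(&mut self, apply_lsn: u64) {
        self.last_ping = Instant::now();
        self.apply_lsn = apply_lsn;
    }

    /// Should GC still hold layers back for this replica?
    fn is_live(&self, now: Instant) -> bool {
        now.duration_since(self.last_ping) < LEASE_INTERVAL * MAX_MISSED
    }
}

fn main() {
    let mut lease = ReplicaLease { last_ping: Instant::now(), apply_lsn: 0 };
    lease.on_ping(0x0C_1E92_3DE0);
    println!("hold GC for LSN {:X}: {}", lease.apply_lsn, lease.is_live(Instant::now()));
}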

The SK should explicitly notify the PS about the current position of all replicas. It is not obvious how to report this position to the PS, which currently just receives the WAL stream from the SK. Should it be a special message in the SK<->PS protocol? Or should the SK generate a WAL record with the replica position (it is not clear which LSN such a record should be assigned so that it can be included in the stream of existing WAL records)? As mentioned above, the SK has no information about all replicas, so the lack of such a message doesn't mean that there is no replica with some old LSN.

Won't the usual feedback message help? IIRC we already have it for backpressure, and the pageserver also knows those LSNs via the storage broker.

@knizhnik
Contributor

knizhnik commented Jan 2, 2024

PITR is enforced at the PS, and information about the replica flush/apply position is available only at the SK. The problem is that the PS can be connected to one safekeeper (SK1) and the replica to some other safekeeper (SK2). The only components that know about all SKs are the compute and the broker. But the compute may be inactive (suspended) at the moment GC is performed by the PS, and involving the broker in the garbage-collection process on the PS seems to be overkill. Certainly the SKs can somehow interact with each other or through the walproposer, but that also seems too complicated and fragile.

@kelvich
Contributor

kelvich commented Jan 2, 2024

PITR is enforced at the PS, and information about the replica flush/apply position is available only at the SK. The problem is that the PS can be connected to one safekeeper (SK1) and the replica to some other safekeeper (SK2). The only components that know about all SKs are the compute and the broker. But the compute may be inactive (suspended) at the moment GC is performed by the PS, and involving the broker in the garbage-collection process on the PS seems to be overkill. Certainly the SKs can somehow interact with each other or through the walproposer, but that also seems too complicated and fragile.

Through the broker, the pageserver has information about LSNs on all safekeepers; that is how the pageserver decides which one to connect to. So a safekeeper can advertise the min feedback LSN out of all replicas connected to it (if any).
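As a sketch (made-up helper names, not the actual safekeeper or pageserver code), the reduction would be something like:

// Safekeeper side: the smallest apply LSN reported by standbys currently
// connected to this safekeeper, or None if no standby is attached.
fn min_standby_apply_lsn(standby_apply_lsns: &[u64]) -> Option<u64> {
    standby_apply_lsns.iter().copied().min()
}

// Pageserver side: the replica horizon for a timeline is the minimum over
// whatever all safekeepers advertised through the broker.
fn replica_horizon(per_safekeeper: &[Option<u64>]) -> Option<u64> {
    per_safekeeper.iter().flatten().copied().min()
}

fn main() {
    let sk1 = min_standby_apply_lsn(&[0x0C_1E92_3DE0, 0x0C_2000_0000]);
    let sk2 = min_standby_apply_lsn(&[]); // no replicas attached to this safekeeper
    println!("replica horizon: {:X?}", replica_horizon(&[sk1, sk2]));
}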

Also, most likely, we should use information from the broker when deciding which safekeeper to connect to on the replica. @arssher what do you think?

@arssher
Contributor

arssher commented Jan 2, 2024

Through the broker, the pageserver has information about LSNs on all safekeepers; that is how the pageserver decides which one to connect to. So a safekeeper can advertise the min feedback LSN out of all replicas connected to it (if any).

Yes, this seems to be the easiest way.

Also, most likely, we should use information from the broker when deciding which safekeeper to connect to on the replica. @arssher what do you think?

Not necessarily. A replica here is different from the pageserver because it costs something, so we're OK with keeping the standby -> safekeeper connection open all the time as long as the standby is alive, which means the standby can be the initiator of the connection. So what we do currently is just wire all safekeepers into primary_conninfo; if one is down, libpq will try another, etc. If the set of safekeepers changes we need to update the setting, but this is not hard (though it is not automated yet).
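For reference, that multi-host wiring looks roughly like this in the standby configuration (host names and port are made up for the example; libpq tries each host in the list until one connection succeeds):

primary_conninfo = 'host=safekeeper-0,safekeeper-1,safekeeper-2 port=5454,5454,5454'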

With the pageserver we can't do the same, because we don't want to keep live connections for all existing attached timelines, and the safekeeper learns about new data first, so it should be the initiator of the connection. Using the broker gives another advantage: the pageserver can have an active connection and, at the same time, up-to-date info about other safekeepers' positions, so it can choose better where to connect in complicated scenarios, e.g. when the connection to the current SK is fine but it is very slow for whatever reason. Similar, though less powerful, heuristics can be implemented without broker data (e.g. restart the connection if no new data arrives within some period).

Also, using the broker on the standby would likely be quite nontrivial because it is gRPC, and I'm not even sure a C gRPC library exists. So it looks like significant work without much gain.

@arssher
Contributor

arssher commented Jan 2, 2024

On a related note, I'm also very suspicious that the original issue is caused by this -- "doubt that replica lags for 7 days" -- me too. Looking at metrics to understand the standby position would be very useful, but pg_last_wal_replay_lsn is likely not collected :(

@knizhnik
Contributor

knizhnik commented Jan 2, 2024

OK, so to summarize all of the above:

  1. Information about the replica apply position can be obtained by the PS from the broker (it is still not quite clear to me how frequently this information is updated).
  2. The problem is most likely not caused by replication lag, but by some bug in tracking VM updates on either the compute or the PS side. Since the problem is reproduced only on the replica, it is most likely a bug in the compute, particularly in performing redo in the compute. The PS does not know whether a get_page request comes from the master or a replica, so the problem is unlikely to be there. But there is one important difference: the master issues get_page requests with the latest option (takes the latest LSN), while the replica uses latest=false (see the sketch below).
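For reference, the difference in item 2 is roughly the following (an illustrative shape only, not the exact pagestream wire format):

#[allow(dead_code)]
struct GetPageRequest {
    latest: bool, // primary: true  => newest page version the PS has
                  // replica: false => page exactly at the request LSN below
    lsn: u64,     // the replica's replay position; this is the LSN that can
                  // fall behind the GC cutoff and trigger the error above
    rel: (u32, u32, u32, u8), // spcnode, dbnode, relnode, forknum
    blkno: u32,
}

fn main() {
    // A replica that has replayed up to C/1E923DE0 asks for block 0 of rel 1663/40960/826.
    let _from_replica = GetPageRequest {
        latest: false,
        lsn: 0x0C_1E92_3DE0,
        rel: (1663, 40960, 826, 0),
        blkno: 0,
    };
}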

@knizhnik
Contributor

knizhnik commented Jan 2, 2024

One of the problems with requesting information about the replica position from the broker is that it is available only as long as the replica is connected to one of the SKs. If the replica is suspended, this information is not available. As far as I understand, only the control plane has information about all replicas. But it is not desirable to:

  • involve the control plane in the GC process
  • block GC until all replicas are online
  • remember the current state of all replicas in some shared storage

@vadim2404
Contributor Author

Under investigation (will most probably slip to next week).

@arssher
Contributor

arssher commented Jan 2, 2024

One of the problems with requesting information about the replica position from the broker is that it is available only as long as the replica is connected to one of the SKs.

Yes, but as Stas wrote somewhere, it's mostly OK to keep the data only as long as the replica is around. A newly seeded replica shouldn't lag significantly. Well, there is probably also the case of a standby pinned to an LSN, but that can be addressed separately.

@knizhnik
Contributor

knizhnik commented Jan 2, 2024

A newly seeded replica shouldn't lag significantly.

My concern is that a replica can be suspended because of inactivity.
I wonder how we protect a replica from scale-to-zero now (if there are no active requests to the replica).

@vadim2404
Contributor Author

Recently, @arssher turned off suspension for computes that have logical replication subscribers:
a41c412

@knizhnik, you can adjust this part for RO endpoints. In compute_ctl, the compute type (R/W or R/O) is known.

@vadim2404
Contributor Author

@knizhnik to check why the replica needs to download the WAL.

@kelvich
Contributor

kelvich commented Jan 9, 2024

My concern is that a replica can be suspended because of inactivity.
Do not suspend a read-only replica if it has applied some WAL within some time interval (e.g. 5 minutes); this can be checked using last_flush_lsn.
Periodically wake up the read-only node so that it can connect to the master and get updates. The wakeup period should be several times larger than the suspend interval (otherwise it makes no sense to suspend the replica at all). It may also be useful to periodically wake up not only read-only replicas but any other suspended nodes; such computes would have a chance to perform some bookkeeping work, e.g. autovacuum. I do not think that waking a node once per hour for 5 minutes can significantly affect cost (for users).

Hm, how did we end up here? A replica should be suspended due to inactivity. A new start will begin from the latest LSN, so I'm not sure why replica suspension is relevant.

There are two open questions now:

  • why the replica lags so much; that shouldn't happen and is the most pressing issue
  • how we delay GC in the case of a legitimately lagging replica. The broker approach sounds reasonable (no replica == no need to hold GC). The control plane doesn't know about replica LSNs and shouldn't know about them.

@knizhnik
Contributor

Sorry, my concerns about read-only replica suspension (when there are no active queries) seem to be irrelevant.
Unlike a "standard" read-only replica in vanilla Postgres, we do not need to replay all WAL when activating a suspended replica. The pageserver just creates a basebackup at the most recent LSN for launching this replica, and I have verified that it is really done this way now.

So a lagging replica cannot be caused by replica suspension. Quite the opposite: suspending and restarting a replica should cause it to "catch up" with the master. A large replication lag between master and replica must be caused by some other reason. Actually, I see only two possible reasons:

  1. The replica applies WAL more slowly than the master produces it, for example because the replica runs on a less powerful VM than the master.
  2. There was some error processing WAL at the replica which stalled replication. It can be related to the problem recently fixed by @arssher (alignment of segments sent to the replica on a page boundary).

Are there links to the projects suffering from this problem? Can we include them in this ticket?

Concerning the approach described above: take information about the replica LSN from the broker and use it to restrict the PITR boundary, preventing GC from removing layers that may still be accessed by the replica. There are two kinds of LSNs maintained by the SK: the last committed LSN returned in the response to append requests, and the triple of LSNs (write/flush/apply) included in hot-standby feedback and collected by the SK as the min across all subscribers (PS and replicas). I wonder whether the broker can currently provide access to both of these LSNs. @arssher?
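To make the two kinds concrete, here is an illustrative summary (made-up struct names, not the actual safekeeper or broker types):

// Illustrative only: the two kinds of positions described above, per timeline
// on one safekeeper.

#[allow(dead_code)]
struct StandbyFeedback {
    write_lsn: u64,
    flush_lsn: u64,
    apply_lsn: u64, // GC on the pageserver would be held back by this one
}

#[allow(dead_code)]
struct TimelinePositions {
    /// Last committed LSN, acknowledged back to the walproposer.
    commit_lsn: u64,
    /// Min of the write/flush/apply feedback over all subscribers (PS and replicas), if any.
    min_standby_feedback: Option<StandbyFeedback>,
}

fn main() {
    let _positions = TimelinePositions {
        commit_lsn: 0x0C_23B3_DF00,
        min_standby_feedback: Some(StandbyFeedback {
            write_lsn: 0x0C_1E92_3DE0,
            flush_lsn: 0x0C_1E92_3DE0,
            apply_lsn: 0x0C_1E92_3DE0,
        }),
    };
}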

@arssher
Contributor

arssher commented Jan 10, 2024

I wonder whether the broker can currently provide access to both of these LSNs.

Not everything is published right now, but this is trivial to add; see SafekeeperTimelineInfo.

@vadim2404
Contributor Author

status update: in review

@vadim2404
Contributor Author

to review it with @MMeent

@vadim2404
Contributor Author

@arssher to review the PR

@ItsWadams

Hey All - a customer just asked about this in an email thread with me about pricing. Are there any updates we can provide them?

@vadim2404
Contributor Author

The problem was identified, and @knizhnik is working on fixing it.

But the fix requires time because it affects the compute, safekeeper, and pageserver. I expect we will merge and ship it in February.

@YanicNeon

We got a support case about this problem today (ZD #2219)

Keeping an eye on this thread

@acervantes23

@knizhnik what's the latest status on this issue?

@knizhnik
Contributor

knizhnik commented Feb 7, 2024

@knizhnik what's the latest status on this issue?

There is PR #6357 waiting for one more round of review.
There were also some problems with e2e tests: https://neondb.slack.com/archives/C03438W3FLZ/p1706868624273839
which are not yet resolved and where I need help from somebody familiar with e2e tests.

hlinnaka added a commit that referenced this issue May 6, 2024
Once all the computes in production have restarted, we can remove
protocol version 1 altogether.

See issue #6211.
hlinnaka added a commit that referenced this issue May 6, 2024
Once all the computes in production have restarted, we can remove
protocol version 1 altogether.

See issue #6211.
conradludgate pushed a commit that referenced this issue May 8, 2024
Once all the computes in production have restarted, we can remove
protocol version 1 altogether.

See issue #6211.
knizhnik pushed a commit that referenced this issue May 14, 2024
…ts which may be still requested by replica

refer #6211 #6357
arssher pushed a commit that referenced this issue May 20, 2024
…ts which may be still requested by replica

refer #6211 #6357
hlinnaka added a commit that referenced this issue May 21, 2024
Once all the computes in production have restarted, we can remove
protocol version 1 altogether.

See issue #6211.

This was done earlier already in commit 0115fe6, but reverted
before it was released to production in commit bbe730d because of
issue #7692. That issue was
fixed in commit 22afaea, so we are ready to change the default
again.
hlinnaka added a commit that referenced this issue May 22, 2024
Once all the computes in production have restarted, we can remove
protocol version 1 altogether.

See issue #6211.

This was done earlier already in commit 0115fe6, but reverted before
it was released to production in commit bbe730d because of issue
#7692. That issue was fixed
in commit 22afaea, so we are ready to change the default again.
@lubennikovaav
Contributor

Not sure if this is helpful, but I'll leave it here for the record.

Caught the error that was mentioned in the discussion above.
Compute version 5623

PG:2024-06-09 18:03:42.771 GMT ttid=4fa0435ad413cf44efb43eaad51c7814/245c619f9df3c8210152b7e254f26583 
sqlstate=XX000 [7104] FATAL:  too many KnownAssignedXids

PG:2024-06-09 18:03:42.771 GMT ttid=4fa0435ad413cf44efb43eaad51c7814/245c619f9df3c8210152b7e254f26583 
sqlstate=XX000 [7104] CONTEXT:  WAL redo at 0/DF4B5F70 for neon/HOT_UPDATE: old_xmax: 1775616, old_off: 37, 
old_infobits: [], flags: 0x20, new_xmax: 0, new_off: 39; blkref #0: rel 1663/40960/826, blk 0 FPW

https://neonprod.grafana.net/d/IJSLHBOnk/neon-logs-compute-nodes-by-compute-id?orgId=1&from=1717956222000&to=1717956223000&var-datasource=grafanacloud-logs&var-region=All&var-search=&var-exclude=string_to_exclude_from_search&var-vm_metrics_datasource=ZNX49CDVz&var-compute=compute-morning-breeze-a7plxe9n

@ololobus ololobus changed the title [Compute]: Issue with read replicas Epic: [Compute] Issue with read replicas Jun 20, 2024
@ololobus ololobus added the t/Epic Issue type: Epic label Jun 20, 2024
@ololobus ololobus changed the title Epic: [Compute] Issue with read replicas Epic: [Compute] Issues with read replicas Jun 20, 2024
@kelvich
Contributor

kelvich commented Jun 26, 2024

Status update:

Critical issues are fixed, the compute<>storage protocol v2 is rolled out everywhere, and we are monitoring the errors that customers hit.

Overall, none of that is critical, and I'm OK to proceed with the re-launch.

@ololobus ololobus changed the title Epic: [Compute] Issues with read replicas [Compute] Issues with read replicas Jul 1, 2024
@ololobus ololobus added t/Epic Issue type: Epic and removed t/Epic Issue type: Epic labels Jul 1, 2024
@ololobus ololobus changed the title [Compute] Issues with read replicas Epic: Issues with read replicas (physical replication) Jul 2, 2024
@ololobus
Member

ololobus commented Jul 2, 2024

This week's plan and update:

  • @knizhnik will look at Bug: walreceiver did not restart after erroring out #8172, although he is not yet sure how to reproduce it
  • regarding the RO configuration https://github.com/neondatabase/cloud/issues/14903
    • @tristan957 will run an analytic SQL query to figure out how many potentially affected ROs we have
    • @ololobus will figure out some easy fix for avoiding the RO misconfiguration (see Stas's comment in the issue); likely we will try to sync the RO GUCs with the RW ones
  • A few more items we briefly touched on; maybe @hlinnaka can comment on these too
    • Hot standby feedback: @knizhnik had a point that its absence can lead to replica lag as well (the relevant GUCs are shown below). I recall that Heikki was against enabling it. My gut feeling is that we need to fix the RO configuration first, then check whether we still have lagging replicas
    • Shutdown records / making a checkpoint at shutdown: @kelvich said it could be important, but I see that we still use immediate shutdown. Can that be a problem?
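For reference, the standby-side GUCs the hot-standby-feedback item is about (stock Postgres settings shown with their usual defaults; nothing Neon-specific implied here):

hot_standby_feedback = off          # when on, the standby reports its xmin upstream so vacuum does not remove rows it still needs
max_standby_streaming_delay = 30s   # how long WAL replay may be paused behind conflicting queries before they are cancelled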

@tristan957
Member

Shutdown records / making a checkpoint at shutdown: @kelvich said it could be important, but I see that we still use immediate shutdown. Can that be a problem?

Does anyone have context on why we shut down the way we currently do?

@ololobus
Member

ololobus commented Jul 2, 2024

Does anyone have context on why we shut down the way we currently do?

I originally used fast mode in the suspend path, but it creates a checkpoint, which takes longer and is unnecessary for durability. Then we changed it to immediate at some point. We also cannot fully rely on it anyway, as the compute can crash.

@knizhnik
Contributor

knizhnik commented Jul 2, 2024

The shutdown checkpoint saves pgstat information, information about replication slots, and other AUX files.
I am not sure about all the bad consequences if this information is not persisted.
At the very least, we will have to do a lot of redundant work on restart.

Making a checkpoint in the case of Neon should not be expensive, because we do not actually flush pages.
So it looks like there are good arguments for restoring "fast shutdown" mode.
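For reference, the three stock Postgres shutdown modes being discussed (see the server-shutdown docs linked in the commit below):

pg_ctl stop -D $PGDATA -m smart      # wait for clients to disconnect, then write a shutdown checkpoint
pg_ctl stop -D $PGDATA -m fast       # terminate clients, then write a shutdown checkpoint
pg_ctl stop -D $PGDATA -m immediate  # stop at once, no checkpoint; recovery runs on the next start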

@tristan957
Member

tristan957 commented Jul 2, 2024

Here is the SQL query that I am going to run across all production databases:

SELECT
  (
    SELECT
      COUNT(1)
    FROM
      endpoints
    WHERE
      type = 'read_only'
  ) AS total,
  COUNT(*) FILTER(
    WHERE
      ro.autoscaling_limit_min_cu < rw.autoscaling_limit_min_cu
      OR ro.autoscaling_limit_max_cu < rw.autoscaling_limit_max_cu
  ) AS all_laggards,
  COUNT(*) FILTER(
    WHERE
      ro.autoscaling_limit_min_cu < rw.autoscaling_limit_min_cu
  ) AS laggards_that_fail_precondition,
  COUNT(*) FILTER(
    WHERE
      ro.autoscaling_limit_max_cu < rw.autoscaling_limit_max_cu
  ) AS laggards_that_cant_scale,
  COUNT(*) FILTER(
    WHERE
      ro.autoscaling_limit_min_cu = rw.autoscaling_limit_min_cu
      AND ro.autoscaling_limit_max_cu = rw.autoscaling_limit_max_cu
  ) AS same_size,
  COUNT(*) FILTER(
    WHERE
      ro.autoscaling_limit_min_cu >= rw.autoscaling_limit_min_cu
      AND ro.autoscaling_limit_max_cu > rw.autoscaling_limit_max_cu
  ) AS higher_max_cu
FROM
  endpoints AS ro
JOIN
    endpoints AS rw
    ON ro.branch_id = rw.branch_id
WHERE
  ro.type = 'read_only'
  AND rw.type = 'read_write';

Stay tuned. It should tell us how our users currently use read-only endpoints.

@MMeent
Contributor

MMeent commented Jul 2, 2024

COUNT(CASE WHEN ... THEN 1 END)
Have you heard about window functions and aggregate filters?

COUNT(*) FILTER (WHERE ...) is much more concise and clean, IMO, and last time I checked it's faster too.

@tristan957
Member

I'll take a look

@kelvich kelvich changed the title Epic: Issues with read replicas (physical replication) Epic: physical replication Jul 3, 2024
ololobus added a commit that referenced this issue Jul 5, 2024
It also creates a shutdown checkpoint, which is important for
ROs to get a list of running xacts faster instead of going through
the CLOG.

See https://www.postgresql.org/docs/current/server-shutdown.html
for the list of modes and signals.

Related to #6211