Restore running xacts from CLOG on replica startup #7288
Conversation
That's a pleasantly short patch. Still highly work-in-progress obviously, but I think this is promising. Issues / TODOs:
Unless we come up with some creative solution to the "too many KnownAssignedXids" problem, I think we can only use this in limited cases, when there are only a few in-progress (sub-)XIDs. But it might still be worth doing; that might cover 90% of the most common real-world scenarios, as long as we can fail gracefully or fall back to something else when it doesn't work out.
3000 tests run: 2885 passed, 0 failed, 115 skipped (code coverage collected from Rust tests only). Last updated for commit 9ce1930 at 2024-07-01T10:45:00.748Z.
There are still some other Neon-specific changes in xlog.c/xlogrecovery.c related to startup. I haven't yet considered moving this code to the neon extension; maybe it is possible. But it interacts tightly with the recovery state machine, so it may be non-trivial.
I do not quite understand why this problem is specific to this particular way of propagating information about running xacts at replica startup?
I think that it is more correct to say that we cannot use this mechanism only in a limited number of cases (thousands of subtransactions), because maxKnownAssignedXids = 64 * MaxBackends. But as I mentioned above, I think that if there is such a large number of active XIDs, then we get this error no matter which mechanism is used to solve the running-xacts problem.
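(For reference, that sizing is Postgres's own; paraphrased from proc.h and procarray.c, where the exact expression also counts prepared transactions:)

```c
/* Paraphrased from PostgreSQL's proc.h and procarray.c: each PGPROC
 * caches up to 64 subtransaction XIDs, and the KnownAssignedXids array
 * on a standby is sized for the worst case across all backends. */
#define PGPROC_MAX_CACHED_SUBXIDS  64
#define PROCARRAY_MAXPROCS         (MaxBackends + max_prepared_xacts)
#define TOTAL_MAX_CACHED_SUBXIDS \
	((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)

/* in CreateSharedProcArray(), when hot standby is enabled: */
procArray->maxKnownAssignedXids = TOTAL_MAX_CACHED_SUBXIDS;
```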
walingest changes look good, almost like a separate bugfix, but perhaps it does not have any effect without the accompanying postgres changes?
Yes, up-to-date value of
Discovered #7794 while working on this PR.
I added two test cases to this, which are failing. They demonstrate two problems with the known-xid overflow:
I think you could call RestoreRunningXactsFromClog() from |
So frankly speaking, I prefer to delay the refactoring of xlog.c and extracting the Neon-specific changes from it until some future time. Right now it is more critical to fix the real problem with running xacts and RO replicas. If we consider restoring this information from the CLOG to be the proper solution (or at least a temporary workaround until the CSN patch is committed in vanilla), then let's decide what to do with known-xids overflow: just throw an error, restrict the number of restored subxacts, ... I am ok with any of these solutions. As far as I understand, the vanilla approach to this problem is the following: if there is subxid overflow, then just do not produce the running-xacts WAL record and wait until there are fewer active subxacts. This approach should never cause "too many known xids" errors, but it can delay replica startup indefinitely. I am not sure we can do something similar in our case. IMHO a large number (thousands) of subxacts is a very rare use case, and replication is also not needed by everybody. So, combined together, the probability that some customer ever hits this problem is close to zero. Certainly it would be nice to have a solution which works in 100% of cases, and CSN is such a solution. But trying to solve the problem within the existing model seems to be just a waste of time.
+1, let's nail down what the behavior will be.
It's a bit more subtle than that: Postgres always creates the running-xacts record, but if the subxid caches have overflowed, then the running-xacts record doesn't include the sub-XIDs that had overflowed. In that case, the standby needs to wait until all the overflowed XIDs have completed. When it sees another running-xacts record with a higher oldestActiveXid, then it can start accepting queries.
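Roughly, that standby-side behavior lives in ProcArrayApplyRecoveryInfo() (a paraphrased sketch of PostgreSQL's procarray.c, not the verbatim code):

```c
/* Paraphrased from ProcArrayApplyRecoveryInfo() in procarray.c */
if (running->subxid_overflow)
{
	/* The record doesn't list all sub-XIDs, so we can't build an exact
	 * set of in-progress XIDs yet.  Stay "pending" until a later
	 * running-xacts record shows that all the unknown XIDs have
	 * completed. */
	standbyState = STANDBY_SNAPSHOT_PENDING;
	standbySnapshotPendingXmin = latestObservedXid;
}
else
{
	standbyState = STANDBY_SNAPSHOT_READY;
	standbySnapshotPendingXmin = InvalidTransactionId;
}
```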
Correct. To be clear, the infinite delay would happen in the case that you have one transaction open in the primary that you never commit. If you continuously start and end transactions, the standby will eventually start up, even if you have millions of transactions open all the time.
I'm thinking: If the list of XIDs scanned from CLOG doesn't fit in known-assigned XIDs, with enough free space so that we can be sure that we won't run out of space later during replay either, then start with that. Otherwise, bail out and wait for the next running-xacts record, like Postgres normally does.
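A minimal sketch of that bail-out (RestoreRunningXactsFromClog is the function from this PR; the 50% headroom threshold here is an assumption for illustration, not the value the patch uses):

```c
/* Hypothetical guard inside RestoreRunningXactsFromClog(): only pre-fill
 * KnownAssignedXids if plenty of space remains for the XIDs that will be
 * added later during WAL replay. */
if (nxids > maxKnownAssignedXids / 2)	/* threshold is an assumption */
{
	elog(LOG,
		 "too many in-progress XIDs (%d) to restore from CLOG, "
		 "waiting for the next running-xacts record instead",
		 nxids);
	return false;	/* fall back to normal hot-standby startup */
}
```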
(moving this comment from #7855 to here, to keep the discussion in one place): The tests that this adds are failing; that's expected. They demonstrate the two cases where this optimization fails: in one case the known-assigned xids area is overflowed at startup, so we wait for the running-xacts record, and in the other case it overflows later, during WAL replay, so we crash with
There are a bunch of FIXMEs, TODOs, XXX comments here. Need to address them. Two open items in particular:
Looked at it more. So right now the situation is the following:
If we restore running xids from CLOG, we can fill more than PGPROC_MAX_CACHED_SUBXIDS entries. If after replica start we start multiple transactions with subxids, then we can easily cause overflow of KnownAssignedXids. So it looks like the only safe limit for restoring running xacts from CLOG is

Also please note that all the arithmetic above takes into account only subxids. But KnownAssignedXids also includes normal (top-level) transaction XIDs, and their number can be larger than PGPROC_MAX_CACHED_SUBXIDS.
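Spelling that out (a back-of-the-envelope sketch; the constants are Postgres's, and the safe-limit formula just restates the reasoning above, it is not code from the patch):

```c
#include "storage/proc.h"	/* PGPROC_MAX_CACHED_SUBXIDS */

/* Back-of-the-envelope sketch of the safe-limit reasoning above;
 * not code from the patch. */
static int
SafeClogRestoreLimit(int max_backends, int max_prepared_xacts)
{
	/* KnownAssignedXids capacity on a standby (cf. procarray.c) */
	int		capacity = (PGPROC_MAX_CACHED_SUBXIDS + 1)
					   * (max_backends + max_prepared_xacts);

	/* During replay, each backend may later contribute up to
	 * PGPROC_MAX_CACHED_SUBXIDS cached subxids of its own, so leave
	 * room for that.  Note this is still optimistic: top-level XIDs
	 * consume the same budget and are not bounded by
	 * PGPROC_MAX_CACHED_SUBXIDS at all. */
	return capacity - max_backends * PGPROC_MAX_CACHED_SUBXIDS;
}
```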
@hlinnaka what are the further steps with this PR? I have changed my test to spawn a thread and perform the commit in that thread after some delay. I can't say that I am excited by this solution, because it depends on timing (sleeps), but I do not know a better one. I have to disable your

Concerning the safe value for the number of restored running xids, as I mentioned above it should be
There seems to be one issue with your approach of restoring running xacts from CLOG in neon_rm_startup.
If we look at the log of the first replica, we can see:
Please notice the 15-second pause before
Since there are no updates and hence no new WAL records, we have to wait 15 seconds until a running-xacts record is generated. The situation with replica 2 is even worse: according to the log, recovery is not started on it at all (there is no "redo starts" record in the log), although the log is otherwise similar to that of the first replica. The only difference is that they reach the consistency point at different LSNs. First replica:
Second replica:
So we wait 15 seconds for the first replica to start up (until we get a running-xacts record).
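(For context, the 15 seconds matches the bgwriter's snapshot interval on an otherwise idle primary; paraphrased from PostgreSQL's bgwriter.c:)

```c
/* Paraphrased from PostgreSQL's bgwriter.c: on an idle primary, a
 * running-xacts record is only emitted every LOG_SNAPSHOT_INTERVAL_MS. */
#define LOG_SNAPSHOT_INTERVAL_MS 15000

if (XLogStandbyInfoActive() && !RecoveryInProgress())
{
	TimestampTz now = GetCurrentTimestamp();
	TimestampTz timeout =
		TimestampTzPlusMilliseconds(last_snapshot_ts,
									LOG_SNAPSHOT_INTERVAL_MS);

	/* only log if the interval elapsed and WAL has advanced meanwhile */
	if (now >= timeout &&
		last_snapshot_lsn <= GetLastImportantRecPtr())
	{
		last_snapshot_lsn = LogStandbySnapshot();
		last_snapshot_ts = now;
	}
}
```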
So my suggestion is to call RestoreRunningXactsFromClog from neon_smgr instead.
Oh. Darn. Ok, yeah, we need to call the function earlier then.
Yeah, I don't see any good way to use the existing hooks for this. +1 for registering
This makes it much more convenient to use in the common case that you want to flush all the WAL. (Passing pg_current_wal_insert_lsn() as the argument doesn't work, for the reasons explained in the comments: we need to back off to the beginning of a page if the previous record ended at a page boundary.) I plan to use this to fix the issue that Arseny Sher called out at #7288 (comment)
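The page-boundary back-off it mentions is, roughly, the following (an illustrative sketch using Postgres's WAL constants; the function name is made up, this is not the committed code, and it ignores the long page header used at segment boundaries):

```c
#include "access/xlog_internal.h"	/* XLOG_BLCKSZ, SizeOfXLogShortPHD */
#include "access/xlogdefs.h"		/* XLogRecPtr */

/* Illustrative sketch: when the previous record ended exactly at a page
 * boundary, the insert position points just past the next page's header,
 * but those header bytes are only written when another record is
 * inserted.  Waiting for them to be flushed could block forever, so back
 * off to the page boundary instead. */
static XLogRecPtr
BackOffToPageBoundary(XLogRecPtr lsn)
{
	if (lsn % XLOG_BLCKSZ == SizeOfXLogShortPHD)
		lsn -= SizeOfXLogShortPHD;
	return lsn;
}
```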
We have one pretty serious MVCC visibility bug with hot standby replicas. We incorrectly treat any transactions that are in progress in the primary, when the standby is started, as aborted. That can break MVCC for queries running concurrently in the standby. It can also lead to hint bits being set incorrectly, and that damage can last until the replica is restarted.

The fundamental bug was that we treated any replica start as starting from a shut down server. The fix for that is straightforward: we need to set 'wasShutdown = false' in InitWalRecovery() (see changes in the postgres repo).

However, that introduces a new problem: with wasShutdown = false, the standby will not open up for queries until it receives a running-xacts WAL record from the primary. That's correct, and that's how Postgres hot standby always works. But it's a problem for Neon, because:

* It changes the historical behavior for existing users. Currently, the standby immediately opens up for queries, so if they now need to wait, we can break existing use cases that were working fine (assuming you don't hit the MVCC issues).

* The problem is much worse for Neon than it is for standalone PostgreSQL, because in Neon, we can start a replica from an arbitrary LSN. In standalone PostgreSQL, the replica always starts WAL replay from a checkpoint record, and the primary arranges things so that there is always a running-xacts record soon after each checkpoint record. You can still hit this issue with PostgreSQL if you have a transaction with lots of subtransactions running in the primary, but it's pretty rare in practice.

To mitigate that, we introduce another way to collect the running-xacts information at startup, without waiting for the running-xacts WAL record: we can scan the CLOG for XIDs that haven't been marked as committed or aborted. It has limitations with subtransactions too, but should mitigate the problem for most users.

See #7236.

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
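A minimal sketch of the CLOG-scan idea (TransactionIdGetStatus and friends are Postgres's own CLOG API; the loop is an illustration, not the patch, and it glosses over sub-committed XIDs):

```c
#include "access/clog.h"	/* TransactionIdGetStatus, TRANSACTION_STATUS_* */
#include "access/transam.h"	/* TransactionIdAdvance, TransactionIdPrecedes */

/* Illustration: collect XIDs in [oldest, next) whose CLOG status is
 * still "in progress", i.e. neither committed nor aborted.  These are
 * the transactions that were running when the primary's WAL ended. */
static int
CollectInProgressXids(TransactionId oldest, TransactionId next,
					  TransactionId *xids, int max_xids)
{
	int			nxids = 0;
	TransactionId xid = oldest;

	while (TransactionIdPrecedes(xid, next) && nxids < max_xids)
	{
		XLogRecPtr	lsn;

		if (TransactionIdGetStatus(xid, &lsn) == TRANSACTION_STATUS_IN_PROGRESS)
			xids[nxids++] = xid;
		TransactionIdAdvance(xid);
	}
	return nxids;
}
```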
The 'running' boolean was replaced with a semaphore in commit f0e2bb7, but this initialization was missed. Remove it so that if a test tries to access it, you get an error rather than always claiming that the endpoint is not running. Spotted by Arseny at #7288 (comment)
The pageserver tracks the latest XID seen in the WAL, in the nextXid field of the "checkpoint" key-value pair. To reduce churn on that single storage key, it's not tracked exactly: when we advance it, we always advance it to the next multiple of 1024 XIDs. That way, we only need to insert a new checkpoint value into the storage every 1024 transactions.

However, read-only replicas now scan the CLOG at startup to find any XIDs that haven't been explicitly aborted or committed, and treat them as still in-progress (PR #7288). When we bump up the nextXid counter by 1024, all those skipped XIDs look like in-progress XIDs to a read replica. There's a limited amount of space for tracking in-progress XIDs, so there's now more cost to skipping XIDs.

We had a case in production where a read replica did not start up, because the primary had gone through many restart cycles without writing any running-xacts or checkpoint WAL records, and each restart added almost 1024 "orphaned" XIDs that had to be tracked as in-progress in the replica. As soon as the primary writes a running-xacts or checkpoint record, the orphaned XIDs can be removed from the in-progress XIDs list and the problem resolves, but if those records are not written, the orphaned XIDs just accumulate.

We should work harder to make sure that a running-xacts or checkpoint record is written at primary startup or shutdown. But at the same time, we can just make XID_CHECKPOINT_INTERVAL smaller, to consume fewer XIDs in such scenarios. That means we will generate more versions of the checkpoint key-value pair in the storage, but we haven't seen any problems with that, so it's probably fine to go from 1024 to 128.
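The rounding in question, paraphrased in C (the pageserver itself is Rust; XID_CHECKPOINT_INTERVAL is the constant this message names, with 128 as the new value):

```c
#include <stdint.h>

/* Paraphrase: advance the tracked nextXid to the next multiple of
 * XID_CHECKPOINT_INTERVAL, so the checkpoint key-value pair is updated
 * at most once per interval of XIDs. */
#define XID_CHECKPOINT_INTERVAL 128	/* previously 1024 */

static uint32_t
AdvanceNextXid(uint32_t seen_xid)
{
	return (seen_xid / XID_CHECKPOINT_INTERVAL + 1) * XID_CHECKPOINT_INTERVAL;
}
```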