Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[6X Only] Fix missing xlog files with orphaned prepared transactions after cras… #10654

Merged
merged 1 commit into from Aug 27, 2020

Conversation

@paul-guo-
Copy link
Member

@paul-guo- paul-guo- commented Aug 17, 2020

…h recovery

After crash recovery finishes, the startup process will create an
end-of-recovery checkpoint. The checkpoint will recycle/remove xlog files
according to orphaned prepared transaction LSNs, replication slot data, etc.
The orphaned prepared transaction LSN data (TwoPhaseState->prepXacts, etc) for
checkpoint are populated in the startup process RecoverPreparedTransactions(),
but the function is called after the end-of-recovery checkpoint creation so
xlog files with orphaned prepared transactions might be recycled/removed. This
can cause "requested WAL segment pg_xlog/000000010000000000000009 has already
been removed" kind of error when bringing up the crashed primary. For example
if you run 'gprecoverseg -a -v', you might be able to see failure with stack as
below. In more details, this happens when running the single mode postgres in
pg_rewind.
2 0x5673a1 postgres (xlogutils.c:572)
3 0x567b2f postgres read_local_xlog_page (xlogutils.c:870)
4 0x5658f3 postgres (xlogreader.c:503)
5 0x56518a postgres XLogReadRecord (xlogreader.c:226)
6 0x54d725 postgres RecoverPreparedTransactions (twophase.c:1955)
7 0x55b0b7 postgres StartupXLOG (xlog.c:7760)
8 0xb0087f postgres InitPostgres (postinit.c:704)
9 0x97a97c postgres PostgresMain (postgres.c:4829)

Fixing this by prescanning the prepared transaction data from xlogs directly
for checkpoint creation if it's still InRecovery (i.e. when
TwoPhaseState->prepXacts is not populated).

Please note that "orphaned" might be temporary (usually happens on an
environment with heavy write-operation load) or permanent unless master dtx
recovery (implies a product bug).

Copy link
Contributor

@asimrp asimrp left a comment

This is a very good catch! What complicates the matters is Greenplum's special way of handling prepared transactions state. Let me try to summarize my different places we record this information so far:

  1. Shared memory referenced by TwoPhaseState records transactions in prepared but not committed/aborted state.
  2. The prepared_transaction_agg_state records the list of currently prepared transactions in WAL, as an extension of checkpoint record. Each item in the list is of the form <xid, prepare_lsn>. This is created during checkpoint. Upstream (PostgreSQL 9.4) records prepared transactions in two phase state file instead and the file is fsync during checkpoint.
  3. The crashRecoverPostCheckpointPreparedTransactions_map_ht hash table. It is local to a process. The entries are of the form <xid, prepare_lsn>. As far as I can tell, it's used by startup process during crash recovery and once recovery is complete, it's used by the checkpointer process.
  4. The above hash table is populated during redo of prepare WAL records.
  5. At startup, when the checkpoint record is read for the first time, the startup process populates this hash table based on the prepared xacts recorded in the checkpoint record. See SetupCheckpointPreparedTransactionList().
  6. On standby, the startup / replay process populates this hash table but it is not used by the checkpointer process.

This patch fixes a bug where end-of-recovery checkpoint removes WAL segments containing prepared records that are still needed. The points 4 and 5 above seem to indicate the prepared transaction information is already available in the crashRecoverPostCheckpointPreparedTransactions_map_ht hash table. The WAL pages need not be read again. What do you think?

CHECKPOINT

-- shutdown primary and make sure the segment is down
-1U: select pg_ctl((SELECT datadir from gp_segment_configuration c where c.role='p' and c.content=0), 'stop', 'immediate');
Copy link
Contributor

@asimrp asimrp Aug 24, 2020

If the segment is restarted instead of stop, would it be suffice? That is, can we avoid running gprecoverseg (and then rebalance)? Crash recovery upon restart should also create the desired end-of-recovery checkpoint.

Copy link
Member Author

@paul-guo- paul-guo- Aug 26, 2020

Yes, 'immediate' restart could trigger the issue also (I quickly double-confirmed by modifying the test) but I'd still keep the current test logic since the current test logic is more like what we saw in a real product environment and also I want to test a bit on gprecoverseg on this kind of scenario.

Copy link
Contributor

@asimrp asimrp Aug 26, 2020

My preference is to keep tests as simple as possible, it helps understand the problem clearly and it also reduces flakiness.

-- issue a checkpoint since a new checkpoint depends on previous checkpoint.redo
-- for xlog file recycling/removing.
checkpoint;
CHECKPOINT
Copy link
Contributor

@asimrp asimrp Aug 24, 2020

This checkpoint should retain all WAL up to 0/2C0AB170, the oldest prepared transaction.

void
getTwoPhasePreparedTransactionData(prepared_transaction_agg_state **ptas)
{
if (InRecovery)
Copy link
Contributor

@asimrp asimrp Aug 24, 2020

Pedantic comment: my preference is to use RecoveryInProgress() instead of InRecovery flag because this code can be executed by startup process but also by checkpointer process, but InRecovey flag is meaningful only in the startup process.

Copy link
Member Author

@paul-guo- paul-guo- Aug 26, 2020

getTwoPhasePreparedTransactionDataInRecovery() is expected to run in startup only. In startup, once InRecovery is false we depend on getTwoPhasePreparedTransactionDataNotInRecovery() for checkpointer. startup code RecoverPreparedTransactions() will recover the data structure for getTwoPhasePreparedTransactionDataNotInRecovery()

Copy link
Contributor

@asimrp asimrp Aug 26, 2020

Right. On a slightly different note, the name *NotInRecovery() is not very appealing. The new name can be avoided if we refactor getTwoPhasePreparedTransactionData() like so:

--- a/src/backend/access/transam/twophase.c                                                                                                                                                                                                   
+++ b/src/backend/access/transam/twophase.c                                                                                                                                                                                                   
@@ -2211,6 +2211,12 @@ getTwoPhasePreparedTransactionData(prepared_transaction_agg_state **ptas)
                                                                                                                                                                                                                                              
        Assert(*ptas == NULL);                                                                                                                                                                                                                
                                                                                                                                                                                                                                              
+       if (InRecovery)                                                                                                                                                                                                                       
+       {                                                                                                                                                                                                                                     
+               getTwoPhasePreparedTransactionDataInRecovery(ptas);                                                                                                                                                                           
+               return;                                                                                                                                                                                                                       
+       }                                                                                                                                                                                                                                     
+                                                                                                                                                                                                                                             
        TwoPhaseAddPreparedTransactionInit(ptas, &maxCount);                                                                                                                                                                                  
                                                                                                                                                                                                                                              
        for (int i = 0; i < numberOfPrepareXacts; i++)                                                                                                                                                                                        

/* Deconstruct header */
hdr = (TwoPhaseFileHeader *) XLogRecGetData(tfRecord);

TwoPhaseAddPreparedTransaction(ptas, &maxCount, hdr->xid, &tfXLogRecPtr);
Copy link
Contributor

@asimrp asimrp Aug 24, 2020

To be clear, what I'm trying to suggest is why cannot we use the xid from the hash table entry. The entry is of type:

typedef struct prpt_map
{
  TransactionId xid;
  XLogRecPtr    xlogrecptr;
} prpt_map;

Copy link
Member Author

@paul-guo- paul-guo- Aug 26, 2020

Yes, you are right. Thank. I copied the code from RecoverPreparedTransactions() and did pruning but that still made things complex.

@paul-guo-
Copy link
Member Author

@paul-guo- paul-guo- commented Aug 26, 2020

This is a very good catch! What complicates the matters is Greenplum's special way of handling prepared transactions state. Let me try to summarize my different places we record this information so far:

  1. Shared memory referenced by TwoPhaseState records transactions in prepared but not committed/aborted state.

Yes.

  1. The prepared_transaction_agg_state records the list of currently prepared transactions in WAL, as an extension of checkpoint record. Each item in the list is of the form <xid, prepare_lsn>. This is created during checkpoint. Upstream (PostgreSQL 9.4) records prepared transactions in two phase state file instead and the file is fsync during checkpoint.

Yes.

  1. The crashRecoverPostCheckpointPreparedTransactions_map_ht hash table. It is local to a process. The entries are of the form <xid, prepare_lsn>. As far as I can tell, it's used by startup process during crash recovery and once recovery is complete, it's used by the checkpointer process.

crashRecoverPostCheckpointPreparedTransactions_map_ht is mainly used on startup only. But checkpointer process does not use that.

They are used on primaries for prepare and commit/abort prepare also but they seems to be useless.

EndPrepare()
FinishPreparedTransaction()
  1. The above hash table is populated during redo of prepare WAL records.

Yes. Besides checkpint xlog redo.

  1. At startup, when the checkpoint record is read for the first time, the startup process populates this hash table based on the prepared xacts recorded in the checkpoint record. See SetupCheckpointPreparedTransactionList().

Yes.

  1. On standby, the startup / replay process populates this hash table but it is not used by the checkpointer process.

This patch fixes a bug where end-of-recovery checkpoint removes WAL segments containing prepared records that are still needed. The points 4 and 5 above seem to indicate the prepared transaction information is already available in the crashRecoverPostCheckpointPreparedTransactions_map_ht hash table. The WAL pages need not be read again. What do you think?

crashRecoverPostCheckpointPreparedTransactions_map_ht just includes xid + recptr. If the mirror is promoted we still need the wal of orphaned prepared transactions for commit/abort prepare.

asimrp
asimrp approved these changes Aug 26, 2020
Copy link
Contributor

@asimrp asimrp left a comment

Thank you for incorporating the feedback.

They are used on primaries for prepare and commit/abort prepare also but they seems to be useless.

Let's remove the usage of crashRecoverPostCheckpointPreparedTransactions_map_ht from EndPrepare() and FinishPreparedTransaction(). It causes unnecessary confusion. Additionally, Assert(AmStartupProcess()) can be added to all functions that make use of this hash table.

src/backend/access/transam/twophase.c Show resolved Hide resolved
void
getTwoPhasePreparedTransactionData(prepared_transaction_agg_state **ptas)
{
if (InRecovery)
Copy link
Contributor

@asimrp asimrp Aug 26, 2020

Right. On a slightly different note, the name *NotInRecovery() is not very appealing. The new name can be avoided if we refactor getTwoPhasePreparedTransactionData() like so:

--- a/src/backend/access/transam/twophase.c                                                                                                                                                                                                   
+++ b/src/backend/access/transam/twophase.c                                                                                                                                                                                                   
@@ -2211,6 +2211,12 @@ getTwoPhasePreparedTransactionData(prepared_transaction_agg_state **ptas)
                                                                                                                                                                                                                                              
        Assert(*ptas == NULL);                                                                                                                                                                                                                
                                                                                                                                                                                                                                              
+       if (InRecovery)                                                                                                                                                                                                                       
+       {                                                                                                                                                                                                                                     
+               getTwoPhasePreparedTransactionDataInRecovery(ptas);                                                                                                                                                                           
+               return;                                                                                                                                                                                                                       
+       }                                                                                                                                                                                                                                     
+                                                                                                                                                                                                                                             
        TwoPhaseAddPreparedTransactionInit(ptas, &maxCount);                                                                                                                                                                                  
                                                                                                                                                                                                                                              
        for (int i = 0; i < numberOfPrepareXacts; i++)                                                                                                                                                                                        

@asimrp
Copy link
Contributor

@asimrp asimrp commented Aug 26, 2020

In the commit message, can we please add a note on why this is a 6X-only change?

…h recovery

After crash recovery finishes, the startup process will create an
end-of-recovery checkpoint. The checkpoint will recycle/remove xlog files
according to orphaned prepared transaction LSNs, replication slot data, etc.
The orphaned prepared transaction LSN data (TwoPhaseState->prepXacts, etc) for
checkpoint are populated in the startup process RecoverPreparedTransactions(),
but the function is called after the end-of-recovery checkpoint creation so
xlog files with orphaned prepared transactions might be recycled/removed. This
can cause "requested WAL segment pg_xlog/000000010000000000000009 has already
been removed" kind of error when bringing up the crashed primary. For example
if you run 'gprecoverseg -a -v', you might be able to see failure with stack as
below. In more details, this happens when running the single mode postgres in
pg_rewind.
	2    0x5673a1 postgres <symbol not found> (xlogutils.c:572)
	3    0x567b2f postgres read_local_xlog_page (xlogutils.c:870)
	4    0x5658f3 postgres <symbol not found> (xlogreader.c:503)
	5    0x56518a postgres XLogReadRecord (xlogreader.c:226)
	6    0x54d725 postgres RecoverPreparedTransactions (twophase.c:1955)
	7    0x55b0b7 postgres StartupXLOG (xlog.c:7760)
	8    0xb0087f postgres InitPostgres (postinit.c:704)
	9    0x97a97c postgres PostgresMain (postgres.c:4829)

Fixing this by prescanning the prepared transaction data from xlogs directly
for checkpoint creation if it's still InRecovery (i.e. when
TwoPhaseState->prepXacts is not populated).

Please note that "orphaned" might be temporary (usually happens on an
environment with heavy write-operation load) or permanent unless master dtx
recovery (implies a product bug).

This is a gpdb 6 only issue. On gpdb 7, state files are used to store prepared
transactions during checkpoint so we do not keep the wal files with orphaned
prepared transactions and thus we won't encounter this issue.

Reviewed-by: Asim R P <pasim@vmware.com>
@paul-guo- paul-guo- merged commit dafe0c8 into greenplum-db:6X_STABLE Aug 27, 2020
2 of 3 checks passed
@paul-guo- paul-guo- deleted the 6x_checkpoint branch Aug 27, 2020
@paul-guo-
Copy link
Member Author

@paul-guo- paul-guo- commented Aug 27, 2020

Pushed into 6X_STABLE. Thanks for reviewing.

@hidva
Copy link
Contributor

@hidva hidva commented Mar 23, 2021

How does Checkpointer in mirror obtain the TwoPhasePreparedTransactionData? When mirror is promoted to primary and fast_promoted is not available, we will invoke RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT) to do the end-of-recovery checkpoint, it means that CreateCheckPoint() will be invoked in the checkpointer process of mirror, and getTwoPhasePreparedTransactionData() in the checkpointer process will return an empty set of TwoPhasePreparedTransactionData. So if the mirror that has been promoted to primary panic when it receives T message for the first time, all TwoPhasePreparedTransactionData will be lost.

@paul-guo-
Copy link
Member Author

@paul-guo- paul-guo- commented Mar 24, 2021

How does Checkpointer in mirror obtain the TwoPhasePreparedTransactionData? When mirror is promoted to primary and fast_promoted is not available, we will invoke RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT) to do the end-of-recovery checkpoint, it means that CreateCheckPoint() will be invoked in the checkpointer process of mirror, and getTwoPhasePreparedTransactionData() in the checkpointer process will return an empty set of TwoPhasePreparedTransactionData. So if the mirror that has been promoted to primary panic when it receives T message for the first time, all TwoPhasePreparedTransactionData will be lost.

@hidva You are right the fix does not seem to apply to !fast_promoted (i.e. fallback_promote) case, but no worry since fallback_promote is not used in GP6. For GP7 and above, there is no such issue.

@hidva
Copy link
Contributor

@hidva hidva commented Mar 24, 2021

fallback_promote

Is it possible for ReadCheckpointRecord() in if (fast_promote) to return NULL?

@paul-guo-
Copy link
Member Author

@paul-guo- paul-guo- commented Mar 25, 2021

@hidva I take back my yesterday comment. I read the context of this PR and code carefully today. This patch is to fix the issue that in the single-mode postgres launched by pg_rewind, see comment in the commit message.

In more details, this happens when running the single mode postgres in
pg_rewind.
2 0x5673a1 postgres (xlogutils.c:572)
3 0x567b2f postgres read_local_xlog_page (xlogutils.c:870)
4 0x5658f3 postgres (xlogreader.c:503)
5 0x56518a postgres XLogReadRecord (xlogreader.c:226)
6 0x54d725 postgres RecoverPreparedTransactions (twophase.c:1955)
7 0x55b0b7 postgres StartupXLOG (xlog.c:7760)
8 0xb0087f postgres InitPostgres (postinit.c:704)
9 0x97a97c postgres PostgresMain (postgres.c:4829)

For this an End-Of-Recovery checkpoint is created after crash recovery finishes, so surely there is promote. I should have mentioned the scenario at the beginning of this PR.

Another fix for "requested WAL segment pg_xlog/000000010000000000000003 has already been removed" kind of issue is restartpoint(checkpoint) relevant. See 4d282cb. FYI.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants