Online repl checkpoint #546

knizhnik · 2024-12-03T15:47:48Z

Checkpoint replication state not only in the shutdown checkpoint, but also in online checkpoints.

tristan957 · 2024-12-06T18:23:31Z

So @MMeent added this comment:

+/*
+ * NEON:  we use logical records to persist information of about slots, origins, relation map...
+ * If it is done inside shutdown checkpoint, then Postgres panics: "concurrent write-ahead log activity while database system is shutting down"
+ * So it before checkpoint REDO position is determined.
+ */

It sounds like the PR as written wouldn't account for this panic? Should we keep everything the same, but just move the call to CheckPointSnapBuild() after the if/else in CheckPointReplicationState()?

tristan957 · 2024-12-06T18:46:20Z

Actually, after thinking about this more, it looks good. Let's get a review from Matthias too.

tristan957 · 2024-12-06T18:48:07Z

A more descriptive commit message would also be useful for future git archaeologists.

knizhnik · 2024-12-06T18:53:25Z

> So @MMeent added this comment:
>
> +/*
> + * NEON:  we use logical records to persist information of about slots, origins, relation map...
> + * If it is done inside shutdown checkpoint, then Postgres panics: "concurrent write-ahead log activity while database system is shutting down"
> + * So it before checkpoint REDO position is determined.
> + */
>
> It sounds like the PR as written wouldn't account for this panic? Should we keep everything the same, but just move the call to CheckPointSnapBuild() after the if/else in CheckPointReplicationState()?

I am just restoring the behaviour we have in all other versions of Postgres (pg14-16).
I added this if-check in pg17 to minimize the number of generated AUX files.
But I now think that it is not correct: if we do not save slot info in online checkpoints, then we can lose it if the compute crashes.

In any case, I think it is a good idea to have the same code in all PG versions.

hlinnaka

This is pretty confusing. It was confusing even before this PR, but this isn't helping.

The function is called "CheckPointReplicationState", but CheckPointRelationMap() has nothing to do with replication.
CheckPointRelationMap() and CheckPointReplicationOrigin() don't write any WAL records, so they don't need this special treatment. They can be called from CheckPointGuts() just like in upstream.
In a shutdown checkpoint, all these functions are now being called twice, once in the PreCheckPointGuts() stage and again in CheckPointGuts(). That seems unnecessary, but also wrong; won't those functions PANIC again when they try to write the WAL records?

knizhnik · 2024-12-10T17:01:02Z

Yes, CheckPointRelationMap() has no relation to replication.
But it is called before CheckPointReplicationSlots and CheckPointLogicalRewriteHeap in CheckPointGuts, which is why I thought the later ones might somehow depend on it and that it is better to preserve the order.

CheckPointReplicationOrigin() writes the "pg_logical/replorigin_checkpoint" file. But we do not persist it using the AUX mechanism. Actually, I don't remember why. Maybe it is simply not needed, because origins are WAL-logged in any case.

It is a real bug that CheckPointReplicationState is called twice: I forgot to copy the corresponding check (which is present in pg 14/15/16). I will fix it.
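
For readers following along, here is a minimal toy model of the double call discussed above. It is a sketch under stated assumptions, not the actual patch: every sketch_* name is hypothetical, and the guard condition is only a guess at the shape of the check that pg 14/15/16 are said to have. It assumes a shutdown checkpoint runs the replication-state step early, before the REDO position is determined, and that the later CheckPointGuts() stage must then skip it.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model only, not Postgres source. All sketch_* names are hypothetical.
 */

/* How many times the replication-state step ran in the current checkpoint. */
static int sketch_repl_state_calls = 0;
/* How many of those runs happened after the REDO position was determined. */
static int sketch_late_calls = 0;
/* Set once the modelled checkpoint has determined its REDO position. */
static bool sketch_redo_determined = false;

/* Stand-in for CheckPointReplicationState(), which may emit WAL records. */
static void sketch_checkpoint_replication_state(void)
{
    sketch_repl_state_calls++;
    if (sketch_redo_determined)
        sketch_late_calls++;
}

/* Stand-in for the CheckPointGuts() stage, which runs after REDO is fixed. */
static void sketch_checkpoint_guts(bool is_shutdown, bool guarded)
{
    /* Assumed guard: skip the step if the shutdown path already ran it. */
    if (!(guarded && is_shutdown))
        sketch_checkpoint_replication_state();
}

/* Runs one modelled checkpoint and returns how often the step was called. */
int sketch_run_checkpoint(bool is_shutdown, bool guarded)
{
    sketch_repl_state_calls = 0;
    sketch_late_calls = 0;
    sketch_redo_determined = false;

    /* Shutdown checkpoints run the step early, before REDO is determined. */
    if (is_shutdown)
        sketch_checkpoint_replication_state();

    sketch_redo_determined = true;
    sketch_checkpoint_guts(is_shutdown, guarded);

    /* Every modelled checkpoint runs the step at least once. */
    assert(sketch_repl_state_calls >= 1);
    return sketch_repl_state_calls;
}

/* Calls made after REDO was determined during the last modelled checkpoint. */
int sketch_late_call_count(void)
{
    return sketch_late_calls;
}
```

In this model, an unguarded shutdown checkpoint runs the step twice, and the second run happens after REDO is determined, which is the situation the review comment warns would PANIC. With the guard, shutdown and online checkpoints each run the step exactly once.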

knizhnik · 2024-12-10T17:21:42Z

I committed a fix for the double call of CheckPointReplicationState.
So now pg17 is doing exactly the same as all the other versions.

Do you think that we should include in this PR all the other changes: moving CheckPointReplicationOrigin() and CheckPointRelationMap() out of CheckPointReplicationState?

I agree with you that they are not writing WAL, so there is no need to call them here.
But as I wrote above, I prefer to preserve the order in which these functions are called in CheckPointGuts, and CheckPointReplicationOrigin() definitely is related to replication state, although the file it writes is not WAL-logged.

I prefer to leave it as it is now or create a separate PR for it, because it would affect all other Postgres versions.

hlinnaka · 2024-12-17T16:26:25Z

> CheckPointReplicationOrigin() writes the "pg_logical/replorigin_checkpoint" file. But we do not persist it using the AUX mechanism. Actually, I don't remember why. Maybe it is simply not needed, because origins are WAL-logged in any case.

Related: neondatabase/neon#8620

hlinnaka

> I committed a fix for the double call of CheckPointReplicationState. So now pg17 is doing exactly the same as all the other versions.
>
> Do you think that we should include in this PR all the other changes: moving CheckPointReplicationOrigin() and CheckPointRelationMap() out of CheckPointReplicationState?
>
> I agree with you that they are not writing WAL, so there is no need to call them here. But as I wrote above, I prefer to preserve the order in which these functions are called in CheckPointGuts, and CheckPointReplicationOrigin() definitely is related to replication state, although the file it writes is not WAL-logged.
>
> I prefer to leave it as it is now or create a separate PR for it, because it would affect all other Postgres versions.

Ok, let's commit this, to bring all the Postgres versions to the same state, and open a separate PR for the other cleanup.

## Problem

See https://neondb.slack.com/archives/C04DGM6SMTM/p1733180965970089

Replication state is checkpointed only by shutdown checkpoint. It means that replication snapshots are not removed till compute shutdown.

## Summary of changes

Checkpoint replication state during online checkpoint

Related Postgres PR: neondatabase/postgres#546

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>

knizhnik mentioned this pull request Dec 3, 2024

Online checkpoint replication state neondatabase/neon#9976

Merged

tristan957 approved these changes Dec 6, 2024

tristan957 requested a review from MMeent December 6, 2024 18:47

hlinnaka reviewed Dec 10, 2024

knizhnik force-pushed the online_repl_checkpoint branch from 9845a53 to 4671da9 Compare December 11, 2024 12:14

knizhnik requested a review from hlinnaka December 12, 2024 13:59

knizhnik added 3 commits December 16, 2024 08:33

Online checkpoint replication state

e29d64b

Fix int->bool conversion

21f9464

Prevent duble call of CheckPointReplicationState

7e3f397

knizhnik force-pushed the online_repl_checkpoint branch from 4671da9 to 7e3f397 Compare December 16, 2024 06:33

hlinnaka approved these changes Dec 17, 2024

knizhnik merged commit 7e3f397 into REL_17_STABLE_neon Dec 18, 2024
1 check passed

knizhnik deleted the online_repl_checkpoint branch December 18, 2024 09:37
