Fix/consider timeline when reaching secondary #695
Conversation
src/bin/pg_autoctl/fsm_transition.c (outdated)

```c
}

/* finally, check that we're on the same timeline as the new primary */
return standby_check_timeline_with_upstream(postgres);
```
Do we really need this here? This function is used for the transition from secondary to catchingup. I think it's enough to only do it in the transition from catchingup back to secondary.
Good point. Because we use the same transition function in both places, and there's a lot of common code, I just added a check this way:

```c
/*
 * Finally, check that we're on the same timeline as the new primary when
 * assigned secondary as a goal state. This transition function is also
 * used when going from secondary to catchingup, as the primary might have
 * changed also in that situation.
 */
if (keeper->state.assigned_role == SECONDARY_STATE)
{
    return standby_check_timeline_with_upstream(postgres);
}
```

```c
#define InvalidXLogRecPtr 0
#define XLogRecPtrIsInvalid(r) ((r) == InvalidXLogRecPtr)

#define PG_AUTOCTL_MAX_TIMELINES 1024
```
It seems dangerous to put a limit on this, since the number of timelines could reasonably be very high. It might be better to use malloc instead.
I would like to avoid malloc, and with it having to track who's responsible for calling free; here we don't have to think about the lifetime of the “cache”. It's trivial and quite cheap to enlarge that array, even tenfold, if you think we need to do that.
At the moment the array (given 64-bit alignment) is going to be 128kB of memory...
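To illustrate the trade-off, here is a minimal sketch of the fixed-array approach. The `TimelineHistoryEntry` layout below is hypothetical (not pg_auto_failover's actual struct); the point is that a static array sidesteps malloc/free ownership entirely:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PG_AUTOCTL_MAX_TIMELINES 1024

/* hypothetical entry layout, for sizing purposes only */
typedef struct TimelineHistoryEntry
{
    uint32_t tli;               /* timeline number */
    uint64_t begin;             /* switch point of the parent timeline */
    uint64_t end;               /* end LSN, or InvalidXLogRecPtr (0) */
} TimelineHistoryEntry;

/* a static array: no malloc, no free, no lifetime to reason about */
static TimelineHistoryEntry history[PG_AUTOCTL_MAX_TIMELINES];

size_t
timeline_history_size(void)
{
    return sizeof(history);
}
```

Growing such a cache is then a one-line change to the `#define`, at the cost of a fixed (and modest) memory footprint.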
src/bin/pg_autoctl/primary_standby.c (outdated)

```c
NodeAddress *primaryNode = &(replicationSource->primaryNode);

/* pg_control_checkpoint() returns the local timeline, update it now */
if (!pgsql_checkpoint(&(postgres->sqlClient)))
```
Creating a checkpoint can be quite heavy. If we only do this to get the current timeline id, maybe instead we should use some other function, such as pg_walfile_name or pg_control_data: https://www.postgresql.org/message-id/3FB63E54-403A-440E-88CB-CFA6FEF3E0F7%40icloud.com
We can grab pg_stat_wal_receiver.received_tli; I believe that having received it locally is what matters in our case. If we have to restart (crash recovery or normal startup), we will be able to catch up with the new tli; and more importantly, in case of a promotion, we can replay the WAL files and switch to the new TLI then.
Note that using pg_walfile_name requires knowing which WAL file is the most recently received, and should give us the same information as pg_stat_wal_receiver, I believe. Also, pg_control_data needs a checkpoint too, and shows the same value as the pg_control_checkpoint view I was using before.
```c
    return false;
}
else if (upstreamTimeline == localTimeline)
```
Can we also check that the point at which the timeline split happened is the same?
Do we need to? I don't think so. If yes, then we would also need to parse the timeline history file that's local to the standby to grab that piece of information. This file is named 00000004.history for timeline 4, and the standby grabs it from the primary as part of the streaming replication protocol anyway...
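For reference, the timeline history file mentioned here is plain text with one line per timeline switch: parent timeline, switch-point LSN, and a free-form reason. A sketch of parsing such a line, where `parse_history_line` is an illustrative helper and not actual pg_autoctl code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse one line of a timeline history file (e.g. "00000004.history"),
 * whose lines look like: "3<TAB>0/3000158<TAB>no recovery target specified".
 * Illustrative sketch only, not pg_autoctl's actual parser.
 */
static bool
parse_history_line(const char *line, uint32_t *parentTLI, uint64_t *switchLSN)
{
    unsigned int parent = 0;
    unsigned int hi = 0;
    unsigned int lo = 0;

    /* the switch point is an LSN printed as two hex halves: "X/Y" */
    if (sscanf(line, "%u\t%X/%X", &parent, &hi, &lo) != 3)
    {
        return false;
    }

    *parentTLI = parent;
    *switchLSN = ((uint64_t) hi << 32) | lo;
    return true;
}
```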
Force-pushed e3a5dca to 174e98a
```sql
select pg_is_in_recovery(),
       coalesce(rep.sync_state, '') as sync_state,
       case when pg_is_in_recovery()
       then coalesce(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn())
```
Why do we use pg_last_wal_receive_lsn here? It seems pg_last_wal_replay_lsn is the one that best reflects what we want.
If memory serves, that's because it's the LSN that we can reach when restarting Postgres before a promotion, or for REPORT_LSN. It's safely on disk even if it's not applied yet; applying might lag behind because of read-only traffic.
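Since pg_lsn values travel as text of the form "X/Y" (two hex halves), the coalesce in the query above can be mirrored client-side. Here `parse_lsn` and `pick_reported_lsn` are hypothetical helpers sketching the "prefer received, fall back to replayed" choice, not pg_autoctl's actual code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* parse a pg_lsn text value like "16/B374D848" into a 64-bit integer */
static bool
parse_lsn(const char *text, uint64_t *lsn)
{
    unsigned int hi = 0;
    unsigned int lo = 0;

    if (text == NULL || sscanf(text, "%X/%X", &hi, &lo) != 2)
    {
        return false;
    }

    *lsn = ((uint64_t) hi << 32) | lo;
    return true;
}

/*
 * Prefer the received LSN (already safe on disk) and fall back to the
 * replayed LSN, mirroring coalesce(pg_last_wal_receive_lsn(),
 * pg_last_wal_replay_lsn()) in the SQL query. NULL inputs stand in for
 * SQL NULLs.
 */
static uint64_t
pick_reported_lsn(const char *receiveLSN, const char *replayLSN)
{
    uint64_t lsn = 0;

    if (parse_lsn(receiveLSN, &lsn))
    {
        return lsn;
    }
    if (parse_lsn(replayLSN, &lsn))
    {
        return lsn;
    }
    return 0;                   /* InvalidXLogRecPtr */
}
```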
I tried this out and I think it's good to merge, since it's an improvement over what we have now. I'm not entirely sure if received_tli is the right timeline to use though. I guess it could be that it did receive the timeline, but hasn't actually applied it because of some error.
However, it seems that there's currently no way to get the applied timeline: https://www.postgresql.org/message-id/flat/20190723180518.635ac554%40firost
When a node reports state SECONDARY it can be the target of a failover. Because we want a single straight line history in streaming replication, we need to ensure that standby nodes are on the same timeline as their upstream node before transitioning to SECONDARY.
What matters is that we have received the most recent timeline from the primary node both in case of promotion and in case of restart and crash recovery; so use that value from the streaming replication stats rather than force a checkpoint to obtain it.
Force-pushed ab53dda to 5c1201c
Check that we are on the same timeline as our upstream node.
When a node reports state SECONDARY it can be the target of a failover.
Because we want a single straight line history in streaming replication, we
need to ensure that standby nodes are on the same timeline as their upstream
node before transitioning to SECONDARY.
See #683