Conversation

@DimCitus DimCitus (Collaborator) commented May 4, 2021

Check that we are on the same timeline as our upstream node.

When a node reports state SECONDARY, it can be the target of a failover.
Because we want a single, straight-line history in streaming replication, we
need to ensure that standby nodes are on the same timeline as their upstream
node before transitioning to SECONDARY.

See #683
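At its core, the check being added boils down to comparing the standby's local timeline with the timeline of its upstream node before reporting SECONDARY. A minimal sketch of that idea (the function and parameter names here are assumptions for illustration, not the actual pg_auto_failover code):

#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch only: a standby should only report SECONDARY once its local
 * timeline matches the timeline of its upstream node.
 */
static bool
standby_is_on_upstream_timeline(uint32_t localTimeline, uint32_t upstreamTimeline)
{
	/* timeline 0 is invalid, so never conclude anything from it */
	return localTimeline != 0 && localTimeline == upstreamTimeline;
}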

@DimCitus DimCitus added the bug (Something isn't working) and Size: S (Effort Estimate: Small) labels May 4, 2021
@DimCitus DimCitus added this to the Sprint 2021 W16 W17 milestone May 4, 2021
@DimCitus DimCitus requested a review from JelteF May 4, 2021 09:33
@DimCitus DimCitus self-assigned this May 4, 2021
}

/* finally, check that we're on the same timeline as the new primary */
return standby_check_timeline_with_upstream(postgres);
Contributor

Do we really need this here? This function is used for the transition from secondary to catchingup. I think it's enough to only do it in the transition from catchingup back to secondary.

Collaborator Author

Good point. Because we use the same transition function in both places, and there's a lot of common code, I just added the check this way:

	/*
	 * Finally, check that we're on the same timeline as the new primary when
	 * assigned secondary as a goal state. This transition function is also
	 * used when going from secondary to catchingup, as the primary might have
	 * changed also in that situation.
	 */
	if (keeper->state.assigned_role == SECONDARY_STATE)
	{
		return standby_check_timeline_with_upstream(postgres);
	}

#define InvalidXLogRecPtr 0
#define XLogRecPtrIsInvalid(r) ((r) == InvalidXLogRecPtr)

#define PG_AUTOCTL_MAX_TIMELINES 1024
Contributor

It seems dangerous to put a limit on this, since the number of timelines could reasonably be very high. It might be better to use malloc instead.

Collaborator Author

I would like to avoid malloc, and then having to track who's responsible for calling free, as much as possible; here we don't have to think about the lifetime of the “cache”. It's trivial and quite cheap to enlarge that array even tenfold, if you think we need to do that?

At the moment the array (given 64-bit alignment) is going to be 128 kB of memory...
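For illustration, a statically sized timeline cache along these lines sidesteps the malloc/free ownership question entirely; the struct layout and field names below are assumptions for the sketch, not the actual pg_auto_failover definitions:

#include <stdint.h>

#define PG_AUTOCTL_MAX_TIMELINES 1024

/* one entry per timeline we know about; sketched fields only */
typedef struct TimelineHistoryEntry
{
	uint32_t tli;      /* timeline id */
	uint64_t begin;    /* first LSN on this timeline */
	uint64_t end;      /* LSN where this timeline was forked, if any */
} TimelineHistoryEntry;

/* fixed-size "cache": no malloc, no free, same lifetime as its owner */
typedef struct TimelineHistory
{
	int count;
	TimelineHistoryEntry history[PG_AUTOCTL_MAX_TIMELINES];
} TimelineHistory;

If the limit ever proved too small, bumping PG_AUTOCTL_MAX_TIMELINES is the one-line change alluded to above.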

NodeAddress *primaryNode = &(replicationSource->primaryNode);

/* pg_control_checkpoint() returns the local timeline, update it now */
if (!pgsql_checkpoint(&(postgres->sqlClient)))
Contributor

Creating a checkpoint can be quite heavy. If we only do this to get the current timeline id, maybe instead we should use some other function, such as pg_walfile_name or pg_control_data: https://www.postgresql.org/message-id/3FB63E54-403A-440E-88CB-CFA6FEF3E0F7%40icloud.com

Collaborator Author

We can grab pg_stat_wal_receiver.received_tli; I believe that having received the timeline locally is what matters in our case. If we have to restart (crash recovery or normal startup), we will be able to catch up with the new TLI, and more importantly, in case of a promotion we can replay the WAL files and switch to the new TLI then.

Note that using pg_walfile_name requires knowing which WAL file was most recently received, and should give us the same information as pg_stat_wal_receiver, I believe; and pg_control_data needs a checkpoint too, and shows the same value as the pg_control_checkpoint() function I was using before.
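For reference, the query behind that idea is small; a sketch (illustrative formatting, not necessarily the exact query the patch ships) of reading the received timeline from the walreceiver stats:

/*
 * Sketch: fetch the timeline the WAL receiver got from the upstream node.
 * pg_stat_wal_receiver only has a row while a walreceiver process is
 * active, so callers need to handle an empty result.
 */
const char *sql =
	"select received_tli from pg_stat_wal_receiver";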


return false;
}
else if (upstreamTimeline == localTimeline)
Contributor

Can we also check that the point where the timeline split happened is the same?

Collaborator Author (@DimCitus, May 6, 2021)

Do we need to? I don't think so. If yes, then we need to also parse the timeline history file that's local to the standby to grab that piece of information, and this file is named 00000004.history for timeline 4, and the standby grabs it from the primary as part of the streaming replication protocol anyway...
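For context, a timeline history file records one line per fork, with the parent timeline, the switch-point LSN, and a reason; the contents of a 00000004.history file look roughly like this (the LSN values below are purely illustrative):

1	0/1700028	no recovery target specified
2	0/23000A8	no recovery target specified
3	0/3000148	no recovery target specified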

@DimCitus DimCitus force-pushed the fix/consider-timeline-when-reaching-secondary branch from e3a5dca to 174e98a on May 6, 2021 16:49
"select pg_is_in_recovery(),"
" coalesce(rep.sync_state, '') as sync_state,"
" case when pg_is_in_recovery()"
" then coalesce(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn())"
Contributor

Why do we use pg_last_wal_receive_lsn here? It seems pg_last_wal_replay_lsn is the one that best reflects what we want.

Collaborator Author

If memory serves, that's because it's the LSN that we can reach when restarting Postgres before a promotion, or for REPORT_LSN. It's safely on disk even if it's not applied yet; applying might lag because of read-only traffic.
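To make the distinction concrete, here is a sketch of a query (illustrative only, not part of the patch) showing how far replay can lag behind what has already been received and written to the standby's local WAL:

/*
 * Illustrative only: on a standby, the received LSN is what is already
 * written to local WAL, while the replay LSN is how far recovery has
 * applied it; replay can lag behind, e.g. under read-only query conflicts.
 */
const char *sql =
	"select pg_last_wal_receive_lsn() as received_lsn,"
	"       pg_last_wal_replay_lsn() as replayed_lsn,"
	"       pg_wal_lsn_diff(pg_last_wal_receive_lsn(),"
	"                       pg_last_wal_replay_lsn()) as replay_lag_bytes";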

Contributor

@JelteF JelteF left a comment

I tried this out and I think it's good to merge, since it's an improvement over what we have now. I'm not entirely sure if received_tli is the right timeline to use though. I guess it could be that it did receive the timeline, but hasn't actually applied it because of some error.

However, it seems that there's currently no way to get the applied timeline: https://www.postgresql.org/message-id/flat/20190723180518.635ac554%40firost

DimCitus added 7 commits May 26, 2021 15:57
When a node reports state SECONDARY, it can be the target of a failover.
Because we want a single, straight-line history in streaming replication, we
need to ensure that standby nodes are on the same timeline as their upstream
node before transitioning to SECONDARY.
What matters is that we have received the most recent timeline from the
primary node both in case of promotion and in case of restart and crash
recovery; so use that value from the streaming replication stats rather than
force a checkpoint to obtain it.
@DimCitus DimCitus force-pushed the fix/consider-timeline-when-reaching-secondary branch from ab53dda to 5c1201c on May 26, 2021 13:57
@DimCitus DimCitus merged commit 40c60b4 into master May 26, 2021
@DimCitus DimCitus deleted the fix/consider-timeline-when-reaching-secondary branch May 26, 2021 15:41