
Retry pgrewind #14

Closed · wants to merge 4 commits

Conversation


@dyson dyson commented Aug 19, 2019

We want to implement cascading replication, which requires retrying pg_rewind to avoid expensive pg_basebackups. This is important for our backup project and will also be useful for other use cases in the future.

With cascading replication, the old sync is rebooted to repoint its recovery at the new primary. While the new sync is rebooting, the old master will try to resync against it, which causes the resync to fail. The keeper then falls back to pg_basebackup, meaning the old master doesn't recover for several hours.

When implementing the pg_rewind retry functionality, there were a number of considerations and design decisions:

  • We think exponential backoff is not an ideal strategy for allocating the retries, because we have a reasonable expectation that the first tries may fail. With exponential backoff we would retry many times at the start (when we are likely to be failing) and fewer times towards the end of the timeout period, when we expect to succeed. Instead we set a pg_rewind interval and timeout so that the retries are distributed evenly across the timeout period, and we add jitter in case many clients are retrying at the same time (a sketch of this loop follows this list). This could be revisited later if our use case changes and we have many nodes trying to pg_rewind from a single node, causing performance issues; even then, we hope that appropriate configuration of the interval and timeout (and perhaps the jitter) would be sufficient.

  • During the retry period (especially if it is long) the host we need to pg_rewind from could change underneath us, and all subsequent pg_rewind retries would then never succeed. To prevent this, we re-read the host to sync from before each retry.

  • During the retry period an operator (or some other process) may cause the keeper to restart. We could keep track of where in the timeout period the retries are, but this would add complexity without a compelling enough reason that we could think of. If the keeper is restarted, the retry logic simply starts again from scratch.

  • There is the possibility that a keeper may not be able to rewind for a known reason. In this case, an operator might want to force a pg_basebackup. With the current implementation this would involve patching the cluster spec so that the pg_rewind timeout or interval is set to 0s, then restarting the keeper. Once the pg_basebackup has started, the cluster spec can be patched back. The disadvantage is that if any other keepers restart during this period, they are likely to also do a pg_basebackup. For the moment we think this won't come up often enough to pose much risk, but it is something to keep in mind.
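Roughly, the retry loop described above might look like the sketch below. It is illustrative only: `resolveFollowedHost` and `runPgRewind` are hypothetical placeholders (not stolon's actual keeper code), and the interval and timeout are assumed to be non-zero.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// Placeholders for the real keeper logic; names and signatures are illustrative.
func resolveFollowedHost() string { return "10.0.0.2" }

func runPgRewind(ctx context.Context, host string) error {
	return errors.New("source instance not ready") // pretend it keeps failing
}

// retryPgRewind spreads attempts evenly across the timeout window (no
// exponential backoff), adds a little jitter so many keepers don't retry in
// lockstep, and re-reads the host to rewind from before every attempt.
func retryPgRewind(ctx context.Context, interval, timeout time.Duration) error {
	jitter := time.Duration(rand.Int63n(int64(interval) / 10))
	intervalTick := time.Tick(interval + jitter)
	timeoutTick := time.After(timeout)

	var lastErr error
Rewind:
	for true {
		// The followed host may have changed while we were waiting, so it is
		// resolved again on every attempt.
		if lastErr = runPgRewind(ctx, resolveFollowedHost()); lastErr == nil {
			return nil
		}
		select {
		case <-intervalTick:
			continue // next evenly spaced attempt
		case <-timeoutTick:
			break Rewind // the label lets us break the loop, not just the select
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("pg_rewind did not succeed within %s: %v", timeout, lastErr)
}

func main() {
	// Demo values only; real intervals/timeouts come from the cluster spec.
	fmt.Println(retryPgRewind(context.Background(), time.Second, 5*time.Second))
}
```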

Jira issue: https://gocardless.atlassian.net/browse/CI-54

Allow the setting of a pg_rewind interval and timeout so that the
retry behaviour of pg_rewind can be customised. The default values if
these are not set in the clusterspec are 0s, which results in the
current default behaviour of only trying pg_rewind once if
usePgrewind is set to true.
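To make the 0s defaults concrete, here is a small sketch of the gate this commit message describes; the names and shape are hypothetical, not stolon's actual code.

```go
package main

import (
	"fmt"
	"time"
)

// resyncPlan sketches the behaviour described above: pg_rewind is only used
// when usePgrewind is set, and a 0s interval or timeout keeps the existing
// single-attempt behaviour before falling back to pg_basebackup.
func resyncPlan(usePgrewind bool, interval, timeout time.Duration) string {
	switch {
	case !usePgrewind:
		return "pg_basebackup"
	case interval == 0 || timeout == 0:
		return "pg_rewind once, then pg_basebackup on failure"
	default:
		return "retry pg_rewind until timeout, then pg_basebackup"
	}
}

func main() {
	// Defaults: interval and timeout unset (0s) -> a single pg_rewind attempt.
	fmt.Println(resyncPlan(true, 0, 0))
	fmt.Println(resyncPlan(true, 30*time.Second, 5*time.Minute))
}
```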
@dyson dyson force-pushed the dyson-retry-pgrewind branch 3 times, most recently from 679da0c to fc51965 on August 20, 2019 11:53
@mos3abof mos3abof changed the title from "WIP: retry pgrewind" to "Retry pgrewind" on Aug 20, 2019
@mos3abof mos3abof marked this pull request as ready for review August 20, 2019 13:24
intervalTick := time.Tick(interval + jitter)

Rewind:
for true {
@dyson (Author)

Using `for true` instead of just `for` seems to be the convention in the codebase.

@@ -379,6 +379,8 @@ func (s *Sentinel) setDBSpecFromClusterSpec(cd *cluster.ClusterData) {
db.Spec.RequestTimeout = *clusterSpec.RequestTimeout
db.Spec.MaxStandbys = *clusterSpec.MaxStandbys
db.Spec.UsePgrewind = *clusterSpec.UsePgrewind
db.Spec.PgrewindInterval = *clusterSpec.PgrewindInterval
@dyson dyson Aug 20, 2019


The case of these new config options is a little sad but just following the convention used elsewhere.
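For readers unfamiliar with the convention being referenced, here is a purely illustrative sketch of how the new fields follow the existing `Pgrewind` casing seen in `UsePgrewind`, while the JSON keys in the cluster spec use `pgrewind...` (see the documentation table further down). The types, tags and the `Duration` alias are assumptions, not stolon's actual `ClusterSpec`.

```go
// Illustrative only: not the real stolon ClusterSpec definition.
package cluster

import "time"

// Duration stands in for whatever duration wrapper the real spec uses.
type Duration = time.Duration

type ClusterSpec struct {
	UsePgrewind        *bool     `json:"usePgrewind,omitempty"`
	PgrewindInterval   *Duration `json:"pgrewindInterval,omitempty"`
	PgrewindTimeout    *Duration `json:"pgrewindTimeout,omitempty"`
	PgrewindCheckpoint *bool     `json:"pgrewindCheckpoint,omitempty"`
}
```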

GC has implemented cascading replication, which requires retrying
pg_rewind to avoid expensive pg_basebackups.

With cascading replication, the old sync is rebooted to repoint its
recovery at the new primary. While the new sync is rebooting, the old
master will try to resync against it, which causes the resync to fail.
This falls back to pg_basebackup, meaning the old master doesn't
recover for several hours.
Make forcing a checkpoint on the primary before performing a
pg_rewind a cluster spec option so that it can be turned on or off.
Add pgrewindInterval, pgrewindTimeout, and pgrewindCheckpoint to the
cluster spec documentation.
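As background on the checkpoint option: pg_rewind reads the source server's control file, which is only refreshed at checkpoint time, so immediately after a promotion it can still report the old timeline and the rewind can fail. A hedged sketch of the forced-checkpoint step follows; the connection string, data directory, driver choice and flags are illustrative, not stolon's actual implementation.

```go
// Sketch only: force a checkpoint on the primary before invoking pg_rewind,
// roughly what a pgrewindCheckpoint=true option would do.
package main

import (
	"database/sql"
	"log"
	"os/exec"

	_ "github.com/lib/pq" // assumed Postgres driver for this sketch
)

func main() {
	const sourceConn = "host=new-primary.example port=5432 user=replicator dbname=postgres sslmode=disable"

	db, err := sql.Open("postgres", sourceConn)
	if err != nil {
		log.Fatalf("open source connection: %v", err)
	}
	defer db.Close()

	// The control file on the source is only refreshed at checkpoint time,
	// so force one before pg_rewind inspects it.
	if _, err := db.Exec("CHECKPOINT"); err != nil {
		log.Fatalf("force checkpoint: %v", err)
	}

	cmd := exec.Command("pg_rewind",
		"--target-pgdata=/var/lib/postgresql/data", // hypothetical data dir
		"--source-server="+sourceConn,
	)
	out, err := cmd.CombinedOutput()
	log.Printf("pg_rewind output:\n%s", out)
	if err != nil {
		log.Fatalf("pg_rewind failed: %v", err)
	}
}
```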
@rnaveiras rnaveiras self-assigned this Aug 23, 2019
@@ -27,6 +27,9 @@ Some options in a running cluster specification can be changed to update the des
| additionalWalSenders | number of additional wal_senders in addition to the ones internally defined by stolon, useful to provide enough wal senders for external standbys (changing this value requires an instance restart) | no | uint16 | 5 |
| additionalMasterReplicationSlots | a list of additional physical replication slots to be created on the master postgres instance. They will be prefixed with `stolon_` (like internal replication slots used for standby replication) to make them "namespaced" from other replication slots. Replication slots starting with `stolon_` and not defined here (and not used for standby replication) will be dropped from the master instance. | no | []string | null |
| usePgrewind | try to use pg_rewind for faster instance resyncronization. | no | bool | false |
| pgrewindInterval | interval to wait until next pg_rewind try | no | string (duration) | 0s |
| pgrewindTimeout | time after which we stop trying to pg_rewind | no | string (duration) | 0s |
| pgrewindCheckpoint | force checkpoint on primary before performing pg_rewind | no | bool | false |


This is named differently from the PR that adds this configuration parameter: sorintlab#644

I think maybe we should leave the removal of this hardcoding out of this PR?

@rnaveiras rnaveiras removed their assignment Aug 27, 2019
lawrencejones added a commit that referenced this pull request Aug 27, 2019
1. #14

Retry pg_rewind for up to 5m after the first attempt. This allows us to
handle the followed database booting for a moment before falling back to
the more expensive pg_basebackup.

This comes as a crappier alternative to [1], where we were creating
proper configuration values. We want to try this before we attempt to
create a more mature setup, as we intend to leave that work for whenever
we upstream our cascading replication.
@lawrencejones

We've decided to hardcode the retry for the moment in this change: #16

We intend to come back to this PR when we upstream the cascading replication changes, which will force some kind of solution for handling rewind failures.

@lawrencejones lawrencejones deleted the dyson-retry-pgrewind branch August 27, 2019 13:31
VSpike pushed a commit that referenced this pull request Mar 21, 2022
This pull request was closed.