Retry pgrewind #14
Allow the setting of a pg_rewind interval and timeout so that the retry behaviour of pg_rewind can be customised. The default values, if these are not set in the cluster spec, are 0s, which preserves the current default behaviour of trying pg_rewind only once when usePgrewind is set to true.
Force-pushed from 679da0c to fc51965
Force-pushed from fc51965 to 94f09db
```go
intervalTick := time.Tick(interval + jitter)

Rewind:
	for true {
```
Using `for true` instead of just `for` seems to be the convention in the codebase.
```diff
@@ -379,6 +379,8 @@ func (s *Sentinel) setDBSpecFromClusterSpec(cd *cluster.ClusterData) {
 	db.Spec.RequestTimeout = *clusterSpec.RequestTimeout
 	db.Spec.MaxStandbys = *clusterSpec.MaxStandbys
 	db.Spec.UsePgrewind = *clusterSpec.UsePgrewind
+	db.Spec.PgrewindInterval = *clusterSpec.PgrewindInterval
```
The case of these new config options is a little sad but just following the convention used elsewhere.
Force-pushed from 94f09db to bb68edb
GC has implemented cascading replication, which requires retrying pg_rewind to avoid expensive pg_basebackups. With cascading replication, the old sync is rebooted to repoint its recovery at the new primary. While the new sync is rebooting, the old master will try resyncing against it, which causes the resync to fail. This falls back to pg_basebackup, meaning the old master doesn't recover for several hours.
Force-pushed from bb68edb to 2ecac75
Make forcing a checkpoint on the primary before performing pg_rewind a cluster spec option so that it can be turned on or off.
Add pgrewindInterval, pgrewindTimeout, and pgrewindCheckpoint to the cluster spec documentation.
Force-pushed from 2ecac75 to 821e93c
```diff
@@ -27,6 +27,9 @@ Some options in a running cluster specification can be changed to update the des
 | additionalWalSenders | number of additional wal_senders in addition to the ones internally defined by stolon, useful to provide enough wal senders for external standbys (changing this value requires an instance restart) | no | uint16 | 5 |
 | additionalMasterReplicationSlots | a list of additional physical replication slots to be created on the master postgres instance. They will be prefixed with `stolon_` (like internal replication slots used for standby replication) to make them "namespaced" from other replication slots. Replication slots starting with `stolon_` and not defined here (and not used for standby replication) will be dropped from the master instance. | no | []string | null |
 | usePgrewind | try to use pg_rewind for faster instance resyncronization. | no | bool | false |
+| pgrewindInterval | interval to wait until next pg_rewind try | no | string (duration) | 0s |
+| pgrewindTimeout | time after which we stop trying to pg_rewind | no | string (duration) | 0s |
+| pgrewindCheckpoint | force checkpoint on primary before performing pg_rewind | no | bool | false |
```
This is named differently to the PR to add this configuration parameter: sorintlab#644
I think maybe we should leave the removal of this hardcoding out of this PR?
1. #14 Retry pg_rewind for up to 5m after the first attempt. This allows us to handle the followed database booting for a moment before falling back to the more expensive pg_basebackup. This comes as a crappier alternative to [1], where we were creating proper configuration values. We want to try this before we attempt to create a more mature setup, as we intend to leave that work for whenever we upstream our cascading replication.
We've decided to hardcode the retry for the moment in this change: #16. We intend to come back to this PR when we upstream the cascading replication changes, which will force some kind of solution for handling pg_rewind failures.
We want to implement cascading replication, which requires retrying pg_rewind to avoid expensive pg_basebackups. This is important for our backup project and will also be useful for other use cases in the future.
With cascading replication, the old sync is rebooted to repoint its recovery at the new primary. While the new sync is rebooting, the old master will try resyncing against it, which causes the resync to fail. This falls back to pg_basebackup, meaning the old master doesn't recover for several hours.
When implementing the pg_rewind retry functionality there were a number of considerations and design decisions:
We think exponential backoff is not an ideal strategy for allocating the retries because we have a reasonable expectation that the first tries may fail. If we were to use exponential backoff we would be trying many times at the start (when we may be failing) and then fewer times towards the end of the timeout period when we expect to succeed. Instead we set a pg_rewind interval and timeout so that we can distribute the retries evenly across the timeout period. We also introduce jitter in case many clients are trying at the same time. This could be revisited at a later date if our use case changes and we have many nodes trying to pg_rewind from a single node causing performance issues. It's still hoped in this situation that appropriate configuration of the interval and timeout (and perhaps modifying the jitter) would be sufficient.
During the retry period (especially if it is long) the host we need to pg_rewind from could change from underneath us and all subsequent pg_rewind retries would never succeed. We read the host to sync from before each retry to prevent this.
During the retry period an operator (or some other process) may cause the keeper to restart. We could keep track of where in the timeout period the retries are, but this would add complexity without a compelling enough reason that we could think of. If the keeper is restarted, the retry logic starts again from scratch.
There is the possibility that a keeper may not be able to rewind for a known reason. In this case, an operator might want to force a pg_basebackup. With the current implementation this would involve patching the cluster spec so that the pg_rewind timeout or interval is set to 0s, then restarting the keeper. Once the pg_basebackup has started, the cluster spec can be patched back. This has the disadvantage that if any other keepers restart during this period, they are likely to also do a pg_basebackup. For the moment we think this won't come up often enough to pose much risk, but it is something to keep in mind.
Jira issue: https://gocardless.atlassian.net/browse/CI-54