Reattempt Dqlite start-up instead of worker restart #16129
Merged
JUJU-4510
We previously introduced back-stop behaviour for the Dqlite cluster whereby if we fail to start the local node, we request API server details and wait. If we get a message indicating that we are the last remaining node, we reconfigure the cluster. However, if we get a message indicating other cluster members, we return an error from the worker, resulting in a restart by the dependency engine.
It turns out it is possible to get into the latter situation when Dqlite is starting and does not process cluster changes quickly enough. This is under investigation, but it makes more sense just to retry starting Dqlite instead of throwing an error.
The same behaviour will result, but with less disruption to the worker graph. It may also speed entry into HA.
Included are some cherry picks from main for test reorganisation.
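The change in control flow described above can be sketched roughly as follows. This is an illustrative model only, not the code in this PR: `startWithRetry`, `clusterStatus`, `errNotReady`, the callback parameters and the `maxAttempts` bound are all hypothetical names and assumptions introduced for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotReady is a hypothetical error standing in for a failed node start.
var errNotReady = errors.New("dqlite node not ready")

// clusterStatus is a hypothetical summary of the API server details message.
type clusterStatus int

const (
	lastNode     clusterStatus = iota // we are the last remaining node
	otherMembers                      // other cluster members were reported
)

// startWithRetry models the back-stop loop. On a failed start it checks the
// reported cluster status. If we are the last node it reconfigures the
// cluster; if other members exist it simply reattempts the start rather
// than returning an error (which would cause a worker restart).
// The maxAttempts bound is an assumption made to keep the sketch finite.
func startWithRetry(
	start func() error,
	status func() clusterStatus,
	reconfigure func() error,
	maxAttempts int,
) (int, error) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err := start(); err == nil {
			return attempt, nil
		}
		switch status() {
		case lastNode:
			if err := reconfigure(); err != nil {
				return attempt, fmt.Errorf("reconfiguring cluster: %w", err)
			}
		case otherMembers:
			// Old behaviour: return an error here and let the dependency
			// engine restart the worker. New behaviour: fall through and
			// reattempt node start-up on the next iteration.
		}
	}
	return maxAttempts, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, errNotReady)
}
```

In this model, a start function that fails twice and then succeeds returns on the third attempt without the caller ever seeing an error, which is the "same behaviour, less disruption" outcome the description aims for.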
QA steps
This cannot be replicated consistently. When enabling HA, if establishing the cluster takes more than a minute, you will see the log message "unable to reconcile current controller and Dqlite cluster status; reattempting node start-up" instead of the worker returning an error.
Documentation changes
None.
Bug reference
In service of https://bugs.launchpad.net/juju/+bug/2015371.