dbchecker: don't delete database file; just back it up #3587
Deleting the database file also makes the server forget its cluster ID and server ID. On restart the DB then generates a new cluster ID and server ID, which causes split-brain: the other servers in the cluster see a new member at the same IP address, but the old member is never removed. They then either reject the new member or believe the cluster has more members than it should.
When resetting the RAFT cluster state, the DBs on all masters must be removed at the same time, and all servers must be restarted at the same time so that they don't keep the old cluster ID and server IDs. The dbchecker cannot do that, because it runs on a single machine and can't coordinate between nodes.
Deleting the DB just makes the problem worse, and we've seen it cause split-brain in a number of cases where the DBs could have recovered.
https://issues.redhat.com/browse/OCPBUGS-11705