Bug 1903660: Handle pruning of unhealthy db files on disk #406
Conversation
In some corner cases, when the DB pods are brought up by the daemonset, the ovn db file may already exist on disk. However, it may be in a state where it does not list itself as a valid raft node, or it never joined the existing raft cluster and therefore has no valid remote server addresses in the local instance. Our daemonset code assumes that if the db file exists, it has the right raft information present so that it can sync with the other db instances and rebuild the db. In the edge case described above this doesn't hold true, and eventually the master/db pods crash-loop continuously.

This change relies on the periodic cluster status check to ensure that the local db is part of the cluster (or is at least a candidate for the cluster). If the cluster status command errors out on 10 consecutive retries, the db file is deleted and the ovsdb-server container is killed so that the daemonset re-initializes it.

Signed-off-by: Aniket Bhat <anbhat@redhat.com>
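For illustration only, the pruning flow can be sketched roughly as in the Go snippet below. This is not the code in this PR (the actual change lives in the daemonset scripts); the DB path, control socket, check interval, and the `clusterStatusOK` helper are placeholder assumptions, and only the cluster/status check, the 10-failure threshold, and the prune-then-exit step mirror the description above.

```go
// Illustrative sketch: periodically check raft cluster membership and prune
// the local DB file after repeated failures. Paths and interval are assumed.
package main

import (
	"log"
	"os"
	"os/exec"
	"time"
)

const (
	dbFile        = "/etc/ovn/ovnnb_db.db"      // assumed on-disk NB DB location
	ctlSocket     = "/var/run/ovn/ovnnb_db.ctl" // assumed ovsdb-server control socket
	maxFailures   = 10                          // consecutive failures before pruning
	checkInterval = 30 * time.Second            // assumed check period
)

// clusterStatusOK runs `ovs-appctl cluster/status`; it errors out when the
// local instance is not (and cannot become) a member of the raft cluster.
func clusterStatusOK() bool {
	return exec.Command("ovs-appctl", "-t", ctlSocket,
		"cluster/status", "OVN_Northbound").Run() == nil
}

func main() {
	failures := 0
	for range time.Tick(checkInterval) {
		if clusterStatusOK() {
			failures = 0
			continue
		}
		failures++
		log.Printf("cluster/status failed (%d/%d consecutive)", failures, maxFailures)
		if failures < maxFailures {
			continue
		}
		// Prune the unhealthy DB file and ask ovsdb-server to exit; the
		// daemonset restarts the container and re-initializes a fresh DB
		// that can join the existing raft cluster.
		if err := os.Remove(dbFile); err != nil && !os.IsNotExist(err) {
			log.Printf("failed to remove %s: %v", dbFile, err)
			continue
		}
		if err := exec.Command("ovs-appctl", "-t", ctlSocket, "exit").Run(); err != nil {
			log.Printf("failed to stop ovsdb-server: %v", err)
		}
		failures = 0
	}
}
```

Resetting the failure counter after a successful check (or after pruning) keeps transient errors, e.g. during a leader election, from triggering an unnecessary DB wipe.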
@alexanderConstantinescu: This pull request references Bugzilla bug 1903660, which is valid. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: abhat, alexanderConstantinescu. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest
/retest Please review the full test history for this PR and help us cut down flakes.
2 similar comments
/retest Please review the full test history for this PR and help us cut down flakes.
/retest Please review the full test history for this PR and help us cut down flakes.
@alexanderConstantinescu: The following tests failed, say /retest to rerun them:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@alexanderConstantinescu: Some pull requests linked via external trackers have merged. The following pull requests linked via external trackers have not merged: these pull requests must merge or be unlinked from the Bugzilla bug in order for it to move to the next state. Once unlinked, request a bug refresh with /bugzilla refresh. Bugzilla bug 1903660 has not been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @abhat @trozet
- What this PR does and why is it needed
- Special notes for reviewers
- How to verify it
- Description for the changelog