Bug 1903660: Don't error when expected master node amount is not met #954
Conversation
@alexanderConstantinescu: This pull request references Bugzilla bug 1903660, which is valid. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This seems okay, but I'd also like to look into why the ovndbchecker didn't handle this case correctly. Basically, we should be able to handle this kind of churn. I know it's not pretty, but, ultimately, we need to deal with it.
I already mentioned why, according to my findings, that didn't happen; see: #896 (comment)
Force-pushed from 5fb93cf to c1afb29
Also, reduce the timeout used for waiting for all master nodes to boot. On certain cluster deployments, such as the assisted installer deployments that motivated this PR, this condition will never be met. So as not to hold the reconciliation loop in vain, we can dynamically reduce it. Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
We want to consistently initialize our NB / SB DB raft cluster on the same node, so as to reduce the risk of running into a split-brain situation. To do this, we annotate network.operator.openshift.io with a dedicated field which sets the first OVN raft initiator and is used during all subsequent reconciliations (as long as that node still exists). Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
Force-pushed from c1afb29 to 16a2fef
/lgtm We do need to figure out what the gaps are with the daemonset scripts not being able to handle rapid roll-outs.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: abhat, alexanderConstantinescu The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest Please review the full test history for this PR and help us cut down flakes.
5 similar comments
/test e2e-gcp-ovn
@alexanderConstantinescu: All pull requests linked via external trackers have merged: Bugzilla bug 1903660 has been moved to the MODIFIED state. In response to this:
PR #896 tried to unblock the assisted installer team by reverting commit a14003d from #421. However, the level of entropy introduced by removing that commit can in some situations bring the OVN cluster down. Essentially, if the expected number of master nodes is not met during the initial cluster deployment, we risk rolling out successive versions of the OVN DB cluster in a very short time span. Since the NB / SB DB raft clusters do not have time to stabilize before the new version rolls out, we risk doing bad things in some cases.
This PR does things as before, except we don't block the cluster deployment if the master node replica count is not met. Thus we still wait until all master nodes that should be up are up. However, if they are not, we proceed with the deployment of the initial cluster, and then roll out a second version once the new master node(s) join. This has been tested by the assisted installer team using PR #934, and it works for them.
We will still annotate network.operator.openshift.io with the raft join point, as #896 did, for cases such as the assisted installer use case.

/assign @squeed @abhat