Bug 1903660: Don't error when expected master node amount is not met #954
Conversation
@alexanderConstantinescu: This pull request references Bugzilla bug 1903660, which is valid. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This seems okay, but I'd also like to look into why the ovndbchecker didn't handle this case correctly. Basically, we should be able to handle this kind of churn. I know it's not pretty, but, ultimately, we need to deal with it.
I already mentioned why, according to my findings, that didn't happen; see: #896 (comment)
Force-pushed from 5fb93cf to c1afb29
Also, reduce the timeout used for waiting for all master nodes to boot. On certain cluster deployments, such as the assisted installer deployments that motivated this PR, this condition will never be met. So as not to hold the reconciliation loop in vain, we can dynamically reduce it. Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
We want to consistently initialize our NB / SB DB raft cluster on the same node, so as to reduce the risk of running into a split-brain situation. To do this, we annotate network.operator.openshift.io with a dedicated field which sets the first OVN raft initiator and is used during all subsequent reconciliations (as long as that node still exists). Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
Force-pushed from c1afb29 to 16a2fef
/lgtm We do need to figure out what the gaps are with the daemonset scripts not being able to handle rapid roll-outs.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: abhat, alexanderConstantinescu The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest Please review the full test history for this PR and help us cut down flakes.
5 similar comments
/test e2e-gcp-ovn
@alexanderConstantinescu: All pull requests linked via external trackers have merged: Bugzilla bug 1903660 has been moved to the MODIFIED state. In response to this:
PR #896 tried to unblock the assisted installer team by reverting commit a14003d from #421. However, the level of entropy introduced by removing that commit can in some situations bring the OVN cluster down. Essentially, if the expected number of master nodes is not met during the initial cluster deployment, we risk rolling out successive versions of the OVN DB cluster in a very short time span. Since the NB / SB DB raft clusters do not have time to stabilize before the new version rolls out, we risk doing bad things in some cases.
This PR does things as before, except we don't block the cluster deployment if the master node replica count is not met. Thus we still wait until all master nodes that should be up are up. However, if they are not, we proceed with the deployment of the initial cluster, and then roll out a second version once the new master node(s) join. This has been tested by the assisted installer team using PR #934, and it works for them.
We will still annotate network.operator.openshift.io with the raft join point, as #896 did, for cases such as the assisted installer use case.

/assign @squeed @abhat