Bug 1903660: Don't error when expected master node amount is not met #954

Merged

alexanderConstantinescu
Contributor

PR #896 tried to unblock the assisted installer team by reverting commit a14003d from #421. However, the level of entropy introduced by removing that commit can, in some situations, bring the OVN cluster down. Essentially, if the expected number of master nodes is not met during the initial cluster deployment, we risk rolling out successive versions of the OVN DB cluster in a very short time span. Since the NB / SB DB raft clusters do not have time to stabilize before the new version rolls out, we risk doing bad things in some cases.

This PR behaves as before, except that we no longer block the cluster deployment if the master node replica count is not met. We still wait for all expected master nodes to come up. However, if they do not, we proceed with the deployment of the initial cluster and then roll out the second version once the new master node(s) join. This has been tested by the assisted installer team using PR #934, and it works for them.

We will still annotate network.operator.openshift.io with the raft join point as #896 did, for cases such as the assisted installer use case.

/assign @squeed @abhat

@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. label Jan 15, 2021
@openshift-ci-robot
Contributor

@alexanderConstantinescu: This pull request references Bugzilla bug 1903660, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1903660: Don't error when expected master node amount is not met

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Jan 15, 2021
@squeed
Contributor

squeed commented Jan 18, 2021

This seems okay, but I'd also like to look into why the ovndbchecker didn't handle this case correctly.

Basically, we should be able to handle this kind of churn. I know it's not pretty, but, ultimately, we need to deal with it.

@alexanderConstantinescu
Contributor Author

This seems okay, but I'd also like to look into why the ovndbchecker didn't handle this case correctly.

Basically, we should be able to handle this kind of churn. I know it's not pretty, but, ultimately, we need to deal with it.

I already explained why, according to my findings, that didn't happen; see: #896 (comment)

Also, reduce the timeout used for waiting for all master nodes
to boot. On certain cluster deployments, such as the genesis for
this PR (assisted installer deployments), this condition will
never be met. So as not to hold up the reconciliation loop in
vain, we can dynamically reduce it.

Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>

We want to consistently initialize our NB / SB DB raft cluster on the same node
so as to reduce the risk of running into a split-brain situation. To do this, we
annotate network.operator.openshift.io with a dedicated field which sets the first
OVN raft initiator and is used during all succeeding reconciliations (as long as that
node still exists).

Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
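
The initiator selection described in the commit message above could look something like the sketch below. It is only an illustration under stated assumptions: `pickInitiator` is a hypothetical name, the recorded value stands in for whatever is stored in the annotation on network.operator.openshift.io, and falling back to the lexicographically smallest node name is an arbitrary choice made here to keep the result deterministic.

```go
package main

import "fmt"

// pickInitiator returns the node that should initialize the NB / SB DB raft
// cluster. Hypothetical helper, not the operator's code: "recorded" stands
// in for the value previously stored in the annotation ("" if none).
func pickInitiator(recorded string, masters []string) string {
	if len(masters) == 0 {
		return "" // nothing to initialize on yet
	}
	// Reuse the recorded initiator as long as that node still exists.
	for _, m := range masters {
		if m == recorded {
			return recorded
		}
	}
	// No usable record: choose deterministically (assumption: the
	// lexicographically smallest name) so repeated calls agree.
	first := masters[0]
	for _, m := range masters[1:] {
		if m < first {
			first = m
		}
	}
	return first
}

func main() {
	masters := []string{"master-2", "master-0", "master-1"}
	chosen := pickInitiator("", masters)
	fmt.Println("first reconciliation picks:", chosen)
	// The caller would persist "chosen" and pass it back next time.
	fmt.Println("next reconciliation picks:", pickInitiator(chosen, masters))
}
```

The caller is expected to persist the returned name and pass it back on the next reconciliation, so the same node keeps being used until it disappears.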
@abhat
Contributor

abhat commented Jan 27, 2021

/lgtm

We do need to figure out what the gaps are with the daemonset scripts not being able to handle rapid roll-outs.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 27, 2021
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhat, alexanderConstantinescu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 27, 2021
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

5 similar comments

@alexanderConstantinescu
Contributor Author

/test e2e-gcp-ovn
/test e2e-metal-ipi-ovn-ipv6

@openshift-merge-robot openshift-merge-robot merged commit 68f3d89 into openshift:master Jan 28, 2021
@openshift-ci-robot
Contributor

@alexanderConstantinescu: All pull requests linked via external trackers have merged:

Bugzilla bug 1903660 has been moved to the MODIFIED state.

In response to this:

Bug 1903660: Don't error when expected master node amount is not met

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.

6 participants