Bug 1854402: Improve bootstrap reliability on heterogeneous UPI network configurations #385
@ironcladlou: No Bugzilla bug is referenced in the title of this pull request. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
7262533 to 3998873
3998873 to 056df08
@ironcladlou: This pull request references Bugzilla bug 1854402, which is invalid:
/retest
/hold We believe this is breaking ovirt. On 4.6 and master jobs we are seeing:
We started looking into it; please hold the merge to avoid breaking CI.
Will be fixed in openshift/release#10197
#388 is also required; I'll need to work up a combined backport. That should fix the ovirt issue as well.
/hold cancel
/lgtm
/bugzilla refresh
@hexfusion: This pull request references Bugzilla bug 1854402, which is valid. The bug has been moved to the POST state. 6 validation(s) were run on this bug
cc @sdodson
/hold Need other PR as well.
@ironcladlou please confirm this includes everything required. Looks ok a79e547
@hexfusion this PR rolls up all the commits from 4.6 into one, so it represents a self-contained backport of everything we did there. /hold cancel
Sorry @sdodson I think we are OK now.
Actually, I don't know why the first commit here wasn't squashed, which makes me anxious I'm missing something. Let me audit one more time in depth. /hold
…ions

Before this change, bootstrap IP discovery assumed that the first address of the unicast interface must be the bootstrap IP. This assumption doesn't always hold in the face of user-defined interfaces and addresses whose ordering isn't guaranteed. When the assumptions are broken and the incorrect bootstrap IP is selected, bootstrapping fails because quorum cannot be established.

This change improves the accuracy of bootstrap IP discovery by more flexibly accounting for a wider variety of possible network interface configurations.

An IP is now considered the bootstrap IP if all of the following are true.

For IPv4:
* The IP is contained by the machine CIDR defined in the cluster configuration
* On bare metal platforms, the IP is not the API or DNS VIP in the cluster configuration

For IPv6, the same must be true in addition to the following:
* The IP is not deprecated
* The IP is routable according to at least one non-default route

This work is adapted from https://github.com/openshift/baremetal-runtimecfg/blob/master/pkg/utils/utils.go.
…tions

Before this patch, the new bootstrap IP discovery mechanism would fail bootstrapping if no IP could be intelligently discovered. A side effect of that is effectively validating the machine network CIDR by asserting the bootstrap IP's ability to be discovered within it. Because there may still be edge cases where we fail to detect but where the old assumption to choose the "first IP" would still work, we could introduce an undue burden to fix all existing uses of machine network CIDR even when our fallback could continue to work in those cases.

This patch adds a fallback behavior so that when intelligent discovery fails, the first listed IP is selected with a warning, preserving the original discovery behavior.

This does effectively mean that clusters can still fail to bootstrap if even the first IP assumption is wrong, but we can presumably use those failures to further improve detection.

A worthwhile future improvement would be to find a way to more loudly and clearly surface to the user when we're blindly guessing about the IP, as the resulting downstream failure may obscure the source of failure if bootkube logs are lost.
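The fallback described in this commit message could be sketched roughly as follows. This is a minimal illustration and not the code from the PR: `pickBootstrapIP`, the `accept` callback, and the `guessed` return value are hypothetical names, and returning an error for an empty address list is an assumption the commit message does not state.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// pickBootstrapIP returns the first address accepted by the discovery rules.
// If no address is accepted, it falls back to the first listed address, logs
// a warning, and reports guessed=true so the caller can surface the guess.
// (Hypothetical helper; the error on an empty list is an assumption.)
func pickBootstrapIP(addrs []string, accept func(string) bool) (ip string, guessed bool, err error) {
	if len(addrs) == 0 {
		return "", false, errors.New("no candidate addresses found")
	}
	for _, a := range addrs {
		if accept(a) {
			return a, false, nil
		}
	}
	log.Printf("warning: no address matched the discovery rules; guessing first listed address %s", addrs[0])
	return addrs[0], true, nil
}

func main() {
	addrs := []string{"10.0.0.5", "192.168.1.10"}
	// Stand-in for the real discovery rules (machine CIDR, VIPs, and so on).
	inMachineNetwork := func(a string) bool { return a == "192.168.1.10" }

	ip, guessed, err := pickBootstrapIP(addrs, inMachineNetwork)
	fmt.Println(ip, guessed, err)
}
```

Returning the `guessed` flag alongside the address is one possible way to address the "surface it more loudly" concern raised above, since the caller can decide how prominently to report a blind guess.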
a79e547 to 68ef6a9
Okay, I re-picked the commits from master and now I'm confident this is in sync. /hold cancel
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: hexfusion, ironcladlou The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest Please review the full test history for this PR and help us cut down flakes.
@ironcladlou: All pull requests linked via external trackers have merged: openshift/cluster-etcd-operator#385. Bugzilla bug 1854402 has been moved to the MODIFIED state.
Before this change, bootstrap IP discovery assumed that the first address of the
unicast interface must be the bootstrap IP. This assumption doesn't always hold
in the face of user-defined interfaces and addresses whose ordering isn't
guaranteed. When the assumptions are broken and the incorrect bootstrap IP is
selected, bootstrapping fails because quorum cannot be established.
This change improves the accuracy of bootstrap IP discovery by more flexibly
accounting for a wider variety of possible network interface configurations.
An IP is now considered the bootstrap IP if all of the following are true.
For IPv4:
* The IP is contained by the machine CIDR defined in the cluster configuration
* On bare metal platforms, the IP is not the API or DNS VIP in the cluster configuration
For IPv6, the same must be true in addition to the following:
* The IP is not deprecated
* The IP is routable according to at least one non-default route
This work is adapted from https://github.com/openshift/baremetal-runtimecfg/blob/master/pkg/utils/utils.go.
Backport of #384 and #388
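The selection rules in the description can be read as a per-address predicate. The sketch below is an illustration under stated assumptions, not the code from this PR or from baremetal-runtimecfg: the `candidate` type, `isBootstrapIP`, and `check` are hypothetical names, and the `Deprecated` and `Routable` flags are assumed to have been gathered elsewhere (for example from the host's address and route tables).

```go
package main

import (
	"fmt"
	"net/netip"
)

// candidate is a hypothetical record for one address found on the bootstrap host.
type candidate struct {
	Addr       netip.Addr
	Deprecated bool // IPv6 only: the address is marked deprecated
	Routable   bool // IPv6 only: covered by at least one non-default route
}

// isBootstrapIP applies the rules from the description to one candidate.
// vips holds the API and DNS VIPs on bare metal; pass nil on other platforms.
func isBootstrapIP(c candidate, machineCIDR netip.Prefix, vips []netip.Addr) bool {
	addr := c.Addr.Unmap() // treat IPv4-mapped IPv6 addresses as IPv4
	if !machineCIDR.Contains(addr) {
		return false
	}
	for _, vip := range vips {
		if vip.Unmap() == addr {
			return false
		}
	}
	if addr.Is6() {
		return !c.Deprecated && c.Routable
	}
	return true
}

// check is a string-based convenience wrapper used for quick experiments.
func check(ip, cidr string, deprecated, routable bool, vips ...string) bool {
	parsed := make([]netip.Addr, 0, len(vips))
	for _, v := range vips {
		parsed = append(parsed, netip.MustParseAddr(v))
	}
	c := candidate{Addr: netip.MustParseAddr(ip), Deprecated: deprecated, Routable: routable}
	return isBootstrapIP(c, netip.MustParsePrefix(cidr), parsed)
}

func main() {
	fmt.Println(check("192.168.1.10", "192.168.1.0/24", false, false))
	fmt.Println(check("192.168.1.5", "192.168.1.0/24", false, false, "192.168.1.5"))
	fmt.Println(check("fd00::10", "fd00::/64", true, true))
}
```

Keeping the rules in a pure predicate like this makes them easy to exercise against unusual interface layouts without touching a real host.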