Bug 1846093: Improve bootstrap reliability on heterogeneous UPI network configurations #384
Still need to play with this in a new cluster's bootstrapping environment; here's how it looks on an already solvent IPv4 GCP master node:
What role should machineCIDR play in this discovery? |
@hexfusion @celebdor the latest commit here refactors the detection to add an additional filter that ensures candidate IPs are contained in the machine CIDR. One caveat here is that I'm not sure if machineCIDR is always present. In the current code, I'm assuming that if machineCIDR is unavailable, no machine CIDR containment check will happen (but we do now produce a warning). The net effect in that case is that it's theoretically possible to select an IP which is routable but outside the unknown machine CIDR. I'm not entirely sure what the right assumption here should be. Is machine CIDR truly optional? How should the product react when machine CIDR is unknown here? |
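The containment filter described in this comment could look roughly like the sketch below. It is not the code from this PR: `filterByMachineCIDR` and `parseIPs` are hypothetical names, and the behaviour for an empty machine CIDR (warn and skip the check) simply mirrors what the comment describes for this revision.

```go
package main

import (
	"fmt"
	"log"
	"net"
)

// parseIPs is a small hypothetical helper that turns address strings into
// net.IP values, skipping anything that does not parse.
func parseIPs(addrs ...string) []net.IP {
	var ips []net.IP
	for _, a := range addrs {
		if ip := net.ParseIP(a); ip != nil {
			ips = append(ips, ip)
		}
	}
	return ips
}

// filterByMachineCIDR keeps only the candidates contained in machineCIDR.
// Assumption: when machineCIDR is empty, no containment check happens and a
// warning is logged, so a routable IP outside the real machine network could
// still be selected.
func filterByMachineCIDR(candidates []net.IP, machineCIDR string) ([]net.IP, error) {
	if machineCIDR == "" {
		log.Printf("warning: machine CIDR is unknown; skipping containment check")
		return candidates, nil
	}
	_, network, err := net.ParseCIDR(machineCIDR)
	if err != nil {
		return nil, fmt.Errorf("invalid machine CIDR %q: %w", machineCIDR, err)
	}
	var kept []net.IP
	for _, ip := range candidates {
		if network.Contains(ip) {
			kept = append(kept, ip)
		}
	}
	return kept, nil
}

func main() {
	kept, err := filterByMachineCIDR(parseIPs("10.0.0.5", "192.168.1.7"), "10.0.0.0/16")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("candidates inside the machine CIDR:", kept)
}
```

The open question in the comment is exactly the first branch: returning the unfiltered candidates keeps bootstrapping going, at the cost of possibly picking an address outside the machine network.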
great questions and ones that need to be answered before this merges |
Looking at the installer interfaces I think it's mandatory: https://github.com/openshift/installer/blob/master/pkg/types/validation/installconfig.go#L249 It's marked as optional in the install-config schema, but AFAICS that's only to enable backwards compatibility with machineNetwork and the now-deprecated machineCIDR interface https://github.com/openshift/installer/blob/master/pkg/types/installconfig.go#L213 https://github.com/openshift/installer/blob/master/pkg/types/installconfig.go#L234 |
Thank you for digging that up. Would it make sense then for us to fail fast and loudly if the machine CIDR is empty? |
/retest |
97532aa to 20fe257
Latest revision disentangles render from all bootstrap IP determination by moving all that stuff into the bootstrap_* files. Also cleaned up that code and added a fail-fast check which asserts machine CIDR is required. |
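A fail-fast check of the kind mentioned here might be sketched as follows. The names `requireMachineCIDR` and `errMachineCIDRRequired` are illustrative assumptions, not identifiers from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// errMachineCIDRRequired is a hypothetical sentinel error for a missing
// machine CIDR.
var errMachineCIDRRequired = errors.New("machine CIDR is required for bootstrap IP discovery")

// requireMachineCIDR fails immediately when the machine CIDR is empty or
// malformed instead of silently continuing without a containment check.
func requireMachineCIDR(machineCIDR string) (*net.IPNet, error) {
	if machineCIDR == "" {
		return nil, errMachineCIDRRequired
	}
	_, network, err := net.ParseCIDR(machineCIDR)
	if err != nil {
		return nil, fmt.Errorf("invalid machine CIDR %q: %w", machineCIDR, err)
	}
	return network, nil
}

func main() {
	if _, err := requireMachineCIDR(""); err != nil {
		fmt.Println("refusing to continue:", err)
	}
	if network, err := requireMachineCIDR("10.0.0.0/16"); err == nil {
		fmt.Println("using machine network", network)
	}
}
```

Compared with warning and continuing, this trades tolerance for a clear, early error, which matches the "fail fast and loudly" suggestion earlier in the thread.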
f767ab1 to e2449fd
I still really owe this work some unit tests, but it's ready for some detailed review. So far testing has been manual. |
cc @celebdor would love any feedback and fact checking you can spare on this one |
@ironcladlou looks good in general, a few questions.
e2449fd to 49bf483
@ironcladlou: This pull request references Bugzilla bug 1846093, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
Added a couple of unit tests. During the course of that I had to do a little refactoring and also discovered a bit of brittle logic around ipv6 handling (using length check to detect the family can be flaky depending on how the address struct is allocated; I switched to using @hexfusion ptal |
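To illustrate why a length check is a fragile way to detect the address family in Go: `net.ParseIP` returns IPv4 addresses in the 16-byte form, while `To4()` yields the 4-byte form, so `len(ip)` depends on how the value was produced. The comment above is cut off before naming the replacement, so the `To4()`-based check below is only one common alternative and not necessarily what this PR uses. The function names are hypothetical.

```go
package main

import (
	"fmt"
	"net"
)

// isIPv4ByLength is the brittle approach: it only works when the address
// happens to be stored in its 4-byte form.
func isIPv4ByLength(ip net.IP) bool {
	return len(ip) == net.IPv4len
}

// isIPv4 asks the net package whether the address has an IPv4 representation,
// regardless of how the underlying slice was allocated.
func isIPv4(ip net.IP) bool {
	return ip.To4() != nil
}

// familyChecks parses addr and reports what each check concludes.
func familyChecks(addr string) (byLength, byTo4 bool) {
	ip := net.ParseIP(addr)
	return isIPv4ByLength(ip), isIPv4(ip)
}

func main() {
	parsed := net.ParseIP("10.0.0.5") // stored in the 16-byte form
	compact := parsed.To4()           // same address in the 4-byte form
	fmt.Println("lengths:", len(parsed), len(compact))

	for _, addr := range []string{"10.0.0.5", "fd00::1"} {
		byLength, byTo4 := familyChecks(addr)
		fmt.Printf("%s: length check says IPv4=%t, To4 check says IPv4=%t\n", addr, byLength, byTo4)
	}
}
```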
/retest |
quick check of aws[1] failure etcd bootstrapped fine. |
hmm CI looks hosed :( |
/retest |
/test e2e-gcp-upgrade |
The metal failure could be something legit with ipv6, need to investigate |
@stbenjam we are nervous to merge this with metal failing. cc @romfreiman |
Okay, I had a chance to dig in to the metal-ipi failures, and the bootstrap IP seems to have been successfully identified and etcd is coming up — I didn't dig further into what might be the root cause of the overall failures. I need to rebase now, so I'll do that and squash again and go through another round of CI tests. |
…ions

Before this change, bootstrap IP discovery assumed that the first address of the unicast interface must be the bootstrap IP. This assumption doesn't always hold in the face of user-defined interfaces and addresses whose ordering isn't guaranteed. When the assumptions are broken and the incorrect bootstrap IP is selected, bootstrapping fails because quorum cannot be established.

This change improves the accuracy of bootstrap IP discovery by more flexibly accounting for a wider variety of possible network interface configurations. An IP is now considered the bootstrap IP if all of the following are true.

For IPv4:
* The IP is contained by the machine CIDR defined in the cluster configuration
* On bare metal platforms, the IP is not the API or DNS VIP in the cluster configuration

For IPv6, the same must be true in addition to the following:
* The IP is not deprecated
* The IP is routable according to at least one non-default route

This work is adapted from https://github.com/openshift/baremetal-runtimecfg/blob/master/pkg/utils/utils.go.
74ad18f to 9340f51
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: hexfusion, ironcladlou
/retest Please review the full test history for this PR and help us cut down flakes. |
@ironcladlou: The following tests failed, say
@ironcladlou: All pull requests linked via external trackers have merged: openshift/cluster-etcd-operator#384. Bugzilla bug 1846093 has been moved to the MODIFIED state.
Before this change, bootstrap IP discovery assumed that the first address of the
unicast interface must be the bootstrap IP. This assumption doesn't always hold
in the face of user-defined interfaces and addresses whose ordering isn't
guaranteed. When the assumptions are broken and the incorrect bootstrap IP is
selected, bootstrapping fails because quorum cannot be established.
This change improves the accuracy of bootstrap IP discovery by more flexibly
accounting for a wider variety of possible network interface configurations.
An IP is now considered the bootstrap IP if all of the following are true.
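For IPv4:
* The IP is contained by the machine CIDR defined in the cluster configuration
* On bare metal platforms, the IP is not the API or DNS VIP in the cluster configuration
For IPv6, the same must be true in addition to the following:
* The IP is not deprecated
* The IP is routable according to at least one non-default route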
This work is adapted from https://github.com/openshift/baremetal-runtimecfg/blob/master/pkg/utils/utils.go.