Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1854402: Improve bootstrap reliability on heterogeneous UPI network configurations #385

Conversation

ironcladlou
Copy link
Contributor

@ironcladlou ironcladlou commented Jun 24, 2020

Before this change, bootstrap IP discovery assumed that the first address of the
unicast interface must be the bootstrap IP. This assumption doesn't always hold
in the face of user-defined interfaces and addresses whose ordering isn't
guaranteed. When the assumptions are broken and the incorrect bootstrap IP is
selected, bootstrapping fails because quorum cannot be established.

This change improves the accuracy of bootstrap IP discovery by more flexibly
accounting for a wider variety of possible network interface configurations.

An IP is now considered the bootstrap IP if all of the following are true.

For IPv4:

The IP is contained by the machine CIDR defined in the cluster configuration
On bare metal platforms, the IP is not the API or DNS VIP in the cluster configuration
For IPv6, the same must be true in addition to the following:

The IP is not deprecated
The IP is routable according at least one non-default route
This work is adapted from https://github.com/openshift/baremetal-runtimecfg/blob/master/pkg/utils/utils.go.

Backport of #384 and #388

@openshift-ci-robot
Copy link

@ironcladlou: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

[release-4.5] WIP: make bootstrap IP discovery more robust

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ironcladlou ironcladlou force-pushed the bootstrap-ip-discovery-4.5 branch 2 times, most recently from 7262533 to 3998873 Compare July 2, 2020 12:07
@openshift-ci-robot
Copy link

@ironcladlou: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

[release-4.5] Improve bootstrap reliability on heterogeneous UPI network configurations

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ironcladlou ironcladlou changed the title [release-4.5] WIP: make bootstrap IP discovery more robust [release-4.5] Improve bootstrap reliability on heterogeneous UPI network configurations Jul 2, 2020
@openshift-ci-robot
Copy link

@ironcladlou: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

[release-4.5] Improve bootstrap reliability on heterogeneous UPI network configurations

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ironcladlou ironcladlou changed the title [release-4.5] Improve bootstrap reliability on heterogeneous UPI network configurations [release-4.5] Bug 1854402: Improve bootstrap reliability on heterogeneous UPI network configurations Jul 7, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Jul 7, 2020
@openshift-ci-robot
Copy link

@ironcladlou: This pull request references Bugzilla bug 1854402, which is invalid:

  • expected dependent Bugzilla bug 1846093 to be in one of the following states: MODIFIED, ON_QA, VERIFIED, but it is POST instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

[release-4.5] Bug 1854402: Improve bootstrap reliability on heterogeneous UPI network configurations

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ironcladlou
Copy link
Contributor Author

/retest

@Gal-Zaidman
Copy link

/hold

We believe this is breaking ovirt, on 4.6 and master jobs we are seeing:

 bootkube.sh[2212]: I0712 14:57:55.775916       1 bootstrap_ip.go:146] Filtered address 127.0.0.1/8 lo
 bootkube.sh[2212]: I0712 14:57:55.775925       1 bootstrap_ip.go:146] Filtered address ::1/128
 bootkube.sh[2212]: I0712 14:57:55.775932       1 bootstrap_ip.go:146] Filtered address 192.168.217.113/24 ens3
 bootkube.sh[2212]: I0712 14:57:55.775938       1 bootstrap_ip.go:146] Filtered address fe80::9723:56b:8bf4:4f76/64
 bootkube.sh[2212]: I0712 14:57:55.775954       1 bootstrap_ip.go:188] Found routable IPs []
 bootkube.sh[2212]: F0712 14:57:55.775965       1 render.go:63] couldn't find any bootstrap node IPs

We started looking into it, please wait with the merge to avoid breaking CI

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 12, 2020
@Gal-Zaidman
Copy link

/hold

We believe this is breaking ovirt, on 4.6 and master jobs we are seeing:

 bootkube.sh[2212]: I0712 14:57:55.775916       1 bootstrap_ip.go:146] Filtered address 127.0.0.1/8 lo
 bootkube.sh[2212]: I0712 14:57:55.775925       1 bootstrap_ip.go:146] Filtered address ::1/128
 bootkube.sh[2212]: I0712 14:57:55.775932       1 bootstrap_ip.go:146] Filtered address 192.168.217.113/24 ens3
 bootkube.sh[2212]: I0712 14:57:55.775938       1 bootstrap_ip.go:146] Filtered address fe80::9723:56b:8bf4:4f76/64
 bootkube.sh[2212]: I0712 14:57:55.775954       1 bootstrap_ip.go:188] Found routable IPs []
 bootkube.sh[2212]: F0712 14:57:55.775965       1 render.go:63] couldn't find any bootstrap node IPs

We started looking into it, please wait with the merge to avoid breaking CI

Will be fixed on openshift/release#10197

@ironcladlou
Copy link
Contributor Author

#388 is also required; I'll need to work up a combined backport. That should fix the ovirt issue as well.

@ironcladlou
Copy link
Contributor Author

/hold cancel
/bugzilla refresh

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 15, 2020
@openshift-ci-robot
Copy link

@ironcladlou: This pull request references Bugzilla bug 1854402, which is invalid:

  • expected dependent Bugzilla bug 1846093 to be in one of the following states: VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), but it is ON_QA instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/hold cancel
/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link

@ironcladlou: This pull request references Bugzilla bug 1854402, which is invalid:

  • expected dependent Bugzilla bug 1846093 to be in one of the following states: VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), but it is ON_QA instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

[release-4.5] Bug 1854402: Improve bootstrap reliability on heterogeneous UPI network configurations

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hexfusion
Copy link
Contributor

/lgtm
/approve

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jul 17, 2020
@hexfusion
Copy link
Contributor

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Jul 17, 2020
@openshift-ci-robot
Copy link

@hexfusion: This pull request references Bugzilla bug 1854402, which is valid. The bug has been moved to the POST state.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.z) matches configured target release for branch (4.5.z)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
  • dependent bug Bugzilla bug 1846093 is in the state VERIFIED, which is one of the valid states (VERIFIED, RELEASE_PENDING, CLOSED (ERRATA))
  • dependent Bugzilla bug 1846093 targets the "4.6.0" release, which is one of the valid target releases: 4.6.0, 4.6.z
  • bug has dependents

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hexfusion hexfusion changed the title [release-4.5] Bug 1854402: Improve bootstrap reliability on heterogeneous UPI network configurations Bug 1854402: Improve bootstrap reliability on heterogeneous UPI network configurations Jul 17, 2020
@hexfusion
Copy link
Contributor

cc @sdodson
High prio for metal

@hexfusion
Copy link
Contributor

/hold

Need other PR as well.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 17, 2020
@sdodson sdodson added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Jul 17, 2020
@hexfusion
Copy link
Contributor

@ironcladlou please confirm this includes everything required.

Looks ok a79e547

@sdodson sdodson removed the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Jul 17, 2020
@ironcladlou
Copy link
Contributor Author

@hexfusion this PR rolls up all the commits from 4.6 into one, so it represents a self-contained backport of everything we did there.

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 17, 2020
@hexfusion
Copy link
Contributor

Sorry @sdodson I think we are OK now.

@sdodson sdodson added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Jul 17, 2020
@ironcladlou
Copy link
Contributor Author

Actually, I don't know why the first commit here wasn't squashed, which makes me anxious I'm missing something — let me audit one more time in depth.

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 17, 2020
…ions

Before this change, bootstrap IP discovery assumed that the first address of the
unicast interface must be the bootstrap IP. This assumption doesn't always hold
in the face of user-defined interfaces and addresses whose ordering isn't
guaranteed. When the assumptions are broken and the incorrect bootstrap IP is
selected, bootstrapping fails because quorum cannot be established.

This change improves the accuracy of bootstrap IP discovery by more flexibly
accounting for a wider variety of possible network interface configurations.

An IP is now considered the bootstrap IP if all of the following are true.

For IPv4:

* The IP is contained by the machine CIDR defined in the cluster configuration
* On bare metal platforms, the IP is not the API or DNS VIP in the cluster configuration

For IPv6, the same must be true in addition to the following:

* The IP is not deprecated
* The IP is routable according at least one non-default route

This work is adapted from https://github.com/openshift/baremetal-runtimecfg/blob/master/pkg/utils/utils.go.
…tions

Before this patch, the new bootstrap IP discovery mechanism would fail bootstrapping
if no IP could be intelligently discovered. A side-effect of that is effectively
validating the machine network CIDR by asserting the bootstrap IPs ability to be
discovered within it. Because there may still be edge cases where we fail to detect
but where the old assumption to choose the "first IP" would still work, we could
introduce an undue burden to fix all existing uses of machine network CIDR even
when our fallback could continue to work in those cases.

This patch adds a fallback behavior so that when intelligent discovery fails, the
first listed IP is selected with a warning, preserving the original discovery
behavior.

This does effectively mean that clusters can still fail to bootstrap if even the
first IP assumption is wrong, but we can presumably use those failures to further
improve detection.

A worthwhile future improvement would be to find a way to more loudly and clearly
surface to the user when we're blindly guessing about the IP, as the resulting
downstream failure may obfuscate the source of failure if bootkube logs are lost.
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Jul 17, 2020
@ironcladlou
Copy link
Contributor Author

Okay, I re-picked the commits from master and now I'm confident this is in sync.

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 17, 2020
@hexfusion
Copy link
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 17, 2020
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hexfusion, ironcladlou

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit b8b3d85 into openshift:release-4.5 Jul 17, 2020
@openshift-ci-robot
Copy link

@ironcladlou: All pull requests linked via external trackers have merged: openshift/cluster-etcd-operator#385. Bugzilla bug 1854402 has been moved to the MODIFIED state.

In response to this:

Bug 1854402: Improve bootstrap reliability on heterogeneous UPI network configurations

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants