Bug 1881143: Backport handle azure vips for pods #2111
Conversation
The original introduction of this service used the MCO image for `gcpRoutesController`, which happens to work because we didn't have a separate image reference for it, and plumbing the image substitution through all the abstraction layers in the code is certainly not obvious. Prep for openshift#2011, which wants to generalize the GCP work to also handle Azure; it was confusing that `machine-config-daemon-pull.service` was referencing an image with a GCP-specific name.
This PR does the following things:
- Rename gcp-routes-controller to apiserver-watcher, since it is generic
- Remove the obsolete service-management mode from gcp-routes-controller
- Change the downfile directory from /run/gcp-routes to /run/cloud-routes
- Write $VIP.up as well as $VIP.down
- Add an Azure routes script that fixes hairpin

Background: Azure hosts cannot hairpin back to themselves over a load balancer. Thus, we need to redirect traffic to the apiserver VIP back to ourselves via iptables, but only while our local apiserver is running. The apiserver-watcher drops a $VIP.up or $VIP.down file depending on the state of the apiserver, and we then add or remove iptables rules that short-circuit the load balancer accordingly. Unlike GCP, we don't need to do this for external traffic, only for local clients.
Azure hosts cannot hairpin back to themselves over a load balancer, so we need to redirect traffic to the apiserver VIP back to ourselves via iptables, but only while the local apiserver is running. The original fix added support for host-network traffic only; by adding the same iptables rules in PREROUTING, we apply the same redirect to traffic from pods. Signed-off-by: Antonio Ojea <aojea@redhat.com>
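The shape of the fix can be sketched as follows. This is not the actual MCO script: the VIP, the port, and the dry-run `echo` are illustrative assumptions. It shows why the same redirect rule has to appear in two chains: locally generated (host-network) packets traverse OUTPUT but never PREROUTING, while forwarded pod packets traverse PREROUTING but never OUTPUT.

```shell
#!/bin/bash
# Hypothetical sketch of the hairpin workaround; not the real MCO script.
# Rules are echoed rather than applied so the sketch runs unprivileged.
VIP="${VIP:-10.0.0.4}"                    # example internal API LB VIP
APISERVER_PORT="${APISERVER_PORT:-6443}"

add_redirect_rules() {
    # OUTPUT catches host-network clients on this node; PREROUTING catches
    # traffic coming from pods. Both redirect VIP:6443 to the local
    # apiserver, short-circuiting the load balancer that cannot hairpin.
    local chain
    for chain in OUTPUT PREROUTING; do
        echo iptables -w -t nat -A "$chain" -d "$VIP" -p tcp \
            --dport "$APISERVER_PORT" -j REDIRECT --to-ports "$APISERVER_PORT"
    done
}

add_redirect_rules
```

Dropping the `echo` (and running as root) would install the rules; the matching teardown when the local apiserver goes down would use `-D` in place of `-A`.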
@aojea: This pull request references Bugzilla bug 1881143, which is invalid:
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: aojea The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/test e2e-azure
/retest
@deads2k this does not seem an easy backport; can you tag somebody with a better understanding of the machine-config operator who can guide me through?
/hold
@cgwalters I'm finding some difficulty with this backport. It seems to depend on other commits and I'm afraid of breaking something; can you advise on how to proceed?
Hm... this is a large change to land in a released version. I would prefer we ship 4.6 first and only backport once we have clear signal that there are no other issues. Anyone running 4.5 who is affected can then also upgrade to 4.6 anyway.
/hold
based on above.
The issue we're having is that upgrades from levels that don't have this fix on Azure are unreliable. This means that a lack of a backport to 4.5 makes upgrades to 4.6 not go smoothly; in fact, they go worse than upgrades from 4.5.z to 4.5.z+1, because the internal LB is expected to work in 4.6 and the MCO goes last. They don't get stuck, but they do face multi-minute gaps in endpoint establishment. CI and upgrades on 4.6 look good; the endpoint checker has reported no failures since the changes were merged. How much more signal are you looking for?
@cgwalters Your reasoning seems sound; what does your preferred timeline look like? Merge at 4.6 GA? One week after GA? Keep in mind that the pipeline introduces some lag, so it's about 14 days from the time this merges until it's shipped. @aojea Can you link to all of the upstream PRs so it's easier to determine how long these changes have soaked in the master branch?
/retest |
@aojea: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
It seems that this is breaking the Azure and GCP jobs; they are failing consistently at installation time: https://prow.ci.openshift.org/pr-history/?org=openshift&repo=machine-config-operator&pr=2111 I think the backport is not straightforward. These changes seem to depend on changes introduced in 4.6, and I don't have enough knowledge of the MCO to track them down 😓
/cc @squeed |
@aojea: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/close
@aojea: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@aojea: This pull request references Bugzilla bug 1881143. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
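The $VIP.up/$VIP.down downfile protocol described in the PR description can be sketched as a small sync function. This is an illustrative assumption rather than the real azure routes script: the echoed actions stand in for the actual iptables add/remove, and the directory follows the /run/cloud-routes convention from the change list (made overridable here so the sketch runs unprivileged).

```shell
#!/bin/bash
# Illustrative sketch of consuming the apiserver-watcher downfiles.
# $RUN_DIR holds files named <VIP>.up or <VIP>.down; real rule add/remove
# is replaced by echo so the sketch can run without root.
RUN_DIR="${RUN_DIR:-/run/cloud-routes}"

sync_rules() {
    local f name state vip
    for f in "$RUN_DIR"/*.up "$RUN_DIR"/*.down; do
        [ -e "$f" ] || continue          # a glob with no matches is skipped
        name="$(basename "$f")"
        state="${name##*.}"              # suffix: "up" or "down"
        vip="${name%.*}"                 # everything before the suffix
        if [ "$state" = up ]; then
            echo "add redirect rules for $vip"
        else
            echo "remove redirect rules for $vip"
        fi
    done
}
```

In the real service this would run each time the watcher flips a file for a VIP; the sketch only needs `sync_rules` invoked after the directory changes.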
How to verify it: install on Azure and ensure connections to the internal API load balancer are reliable, both when the local apiserver process is running and when it is stopped.