
Bug 1881143: Backport handle azure vips for pods #2111

Closed
wants to merge 3 commits

Conversation

aojea
Contributor

@aojea aojea commented Sep 21, 2020

This PR does the following things:

- Rename gcp-routes-controller to apiserver-watcher, since it is generic
- Remove the obsolete service-management mode from gcp-routes-controller
- Change the downfile directory from /run/gcp-routes to /run/cloud-routes
- Write $VIP.up as well as $VIP.down
- Add an Azure routes script that fixes hairpin.

Background: Azure hosts cannot hairpin back to themselves over a load balancer. Thus, we need to redirect traffic to the apiserver VIP to ourselves via iptables. However, we should only do this when the local apiserver is running.

The apiserver-watcher drops a $VIP.up or $VIP.down file, depending on the state of the apiserver. Then we add or remove iptables rules that short-circuit the load balancer.

Unlike GCP, we don't need to do this for external traffic, only for local clients.
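The mechanism described above can be sketched roughly as follows. This is not the actual MCO script; the directory, port, rule shape, and function name are illustrative assumptions based on the description (the real implementation may use different chains or targets):

```shell
#!/bin/bash
# Sketch only: react to the $VIP.up / $VIP.down marker files that the
# apiserver-watcher drops, adding or removing a redirect rule accordingly.
RUN_DIR="${RUN_DIR:-/run/cloud-routes}"
IPT="${IPT:-iptables}"   # set IPT=echo for a dry run

sync_vip() {
    local vip="$1"
    if [ -f "${RUN_DIR}/${vip}.up" ]; then
        # Local apiserver is healthy: short-circuit the load balancer by
        # redirecting VIP-bound traffic back to the local apiserver.
        $IPT -t nat -A OUTPUT -d "${vip}" -p tcp --dport 6443 -j REDIRECT --to-ports 6443
    elif [ -f "${RUN_DIR}/${vip}.down" ]; then
        # Local apiserver is down: remove the rule so traffic goes via the LB.
        $IPT -t nat -D OUTPUT -d "${vip}" -p tcp --dport 6443 -j REDIRECT --to-ports 6443
    fi
}
```

Running with `IPT=echo` prints the rules instead of installing them, which makes the file-driven add/remove behavior easy to inspect without root.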

  • How to verify it
    Install on Azure and ensure connections to the internal API load balancer are reliable, both when the local apiserver process is running and when it is stopped.


cgwalters and others added 3 commits September 21, 2020 17:47
The original introduction of this service probably used `gcpRoutesController`
which happens to be the same as the MCO image because we didn't have a
reference to it, and plumbing the image substitution through all
the abstraction layers in the code is certainly not obvious.

Prep for openshift#2011
which wants to abstract the GCP work to also handle Azure and it
was confusing that `machine-config-daemon-pull.service` was referencing
an image with a GCP name.
Azure hosts cannot hairpin back to themselves over a load balancer, so traffic to the apiserver VIP must be redirected to ourselves via iptables while the local apiserver is running.

Previously, support was added for host-network traffic only. By adding the same iptables rules to the PREROUTING chain, we apply the fix to pod traffic as well.

Signed-off-by: Antonio Ojea <aojea@redhat.com>
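The idea in that last commit can be sketched as follows: rules in the nat OUTPUT chain only match locally generated (host-network) traffic, while packets from pods enter the host stack from a veth and traverse PREROUTING instead, so the redirect must exist in both chains. The chain-traversal behavior is standard iptables; the port, target, and function name below are illustrative assumptions, not the PR's exact rules:

```shell
# Sketch only: install the VIP redirect for both host-network and pod clients.
IPT="${IPT:-iptables}"   # set IPT=echo for a dry run

add_vip_rules() {
    local vip="$1"
    # Host-network clients: locally generated packets traverse nat OUTPUT.
    $IPT -t nat -A OUTPUT -d "${vip}" -p tcp --dport 6443 -j REDIRECT --to-ports 6443
    # Pod clients: packets arriving from pod veth interfaces traverse nat
    # PREROUTING, never OUTPUT, so the same redirect is needed there too.
    $IPT -t nat -A PREROUTING -d "${vip}" -p tcp --dport 6443 -j REDIRECT --to-ports 6443
}
```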
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Sep 21, 2020
@openshift-ci-robot
Contributor

@aojea: This pull request references Bugzilla bug 1881143, which is invalid:

  • expected the bug to target the "4.5.z" release, but it targets "4.5.0" instead
  • expected dependent Bugzilla bug 1868158 to be in one of the following states: VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), but it is ON_QA instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1881143: Backport handle azure vips for pods

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aojea
To complete the pull request process, please assign yuqi-zhang
You can assign the PR to them by writing /assign @yuqi-zhang in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@aojea
Contributor Author

aojea commented Sep 21, 2020

/test e2e-azure

@deads2k
Contributor

deads2k commented Sep 21, 2020

/retest

@aojea
Contributor Author

aojea commented Sep 21, 2020

@deads2k this does not seem an easy backport. Could you tag somebody with a better understanding of the machine-config operator who can guide me through it?

@aojea
Contributor Author

aojea commented Sep 21, 2020

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 21, 2020
@aojea
Contributor Author

aojea commented Sep 22, 2020

@cgwalters I'm finding this backport difficult; it seems to depend on other commits and I'm afraid of breaking something. Can you advise on how to proceed?

@cgwalters
Member

Hm...this is a large change to land in a released version. I would prefer we ship 4.6 first and only backport once we have a clear signal that there are no other issues. Anyone running 4.5 who is affected can then also upgrade to 4.6 anyway.

@ashcrow ashcrow requested review from yuqi-zhang and removed request for ashcrow September 22, 2020 15:18
Member

@cgwalters cgwalters left a comment


/hold
based on above.

@deads2k
Contributor

deads2k commented Sep 23, 2020

Hm...this is a large change to land in a released version. I would prefer we ship 4.6 first and only backport once we have a clear signal that there are no other issues. Anyone running 4.5 who is affected can then also upgrade to 4.6 anyway.

The issue we're having is that upgrades on Azure from levels that don't have this fix are unreliable. This means that the lack of a backport to 4.5 keeps upgrades to 4.6 from going smoothly. In fact, they go worse than upgrades from 4.5.z to 4.5.z+1, because the internal LB is expected to work in 4.6 and the MCO goes last. They don't get stuck, but they do face multi-minute gaps in endpoint establishment.

CI and upgrades on 4.6 look good; the endpoint checker has reported no failures since the changes were merged. How much more signal are you looking for?

@sdodson
Member

sdodson commented Sep 23, 2020

@cgwalters Your reasoning seems sound; what does your preferred timeline look like? Merge at 4.6 GA? One week after GA? Keep in mind that the pipeline introduces some lag, so it's about 14 days from the time this merges until it's shipped.

@aojea Can you link to all of the upstream PRs so it's easier to determine how long these changes have soaked in master branch?

@sdodson
Member

sdodson commented Sep 23, 2020

Nevermind, I'll dig. This is #2089 #2011 and #2061 with the youngest of them having merged 7 days ago.

@deads2k
Contributor

deads2k commented Sep 23, 2020

/retest

@openshift-ci-robot
Contributor

@aojea: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/unit 0f644c0 link /test unit
ci/prow/e2e-gcp-op 0f644c0 link /test e2e-gcp-op
ci/prow/e2e-azure 0f644c0 link /test e2e-azure
ci/prow/e2e-upgrade 0f644c0 link /test e2e-upgrade

Full PR test history. Your PR dashboard.


@aojea
Contributor Author

aojea commented Sep 24, 2020

@aojea Can you link to all of the upstream PRs so it's easier to determine how long these changes have soaked in master branch?

It seems that this is breaking the Azure and GCP jobs; they are failing consistently at installation time:

https://prow.ci.openshift.org/pr-history/?org=openshift&repo=machine-config-operator&pr=2111

I think the backport is not straightforward; it seems these changes depend on some changes introduced in 4.6 ... and I don't have enough knowledge of the MCO to be able to track them down 😓

@aojea
Contributor Author

aojea commented Oct 5, 2020

/cc @squeed

@openshift-merge-robot
Contributor

@aojea: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-agnostic-upgrade 0f644c0 link /test e2e-agnostic-upgrade

Full PR test history. Your PR dashboard.


@aojea
Contributor Author

aojea commented Oct 28, 2020

/close
I don't feel that this is going to move forward without changes in the MCO 😓

@openshift-ci-robot
Contributor

@aojea: Closed this PR.

In response to this:

/close
I don't feel that this is going to move forward without changes in the MCO 😓


@openshift-ci-robot
Contributor

@aojea: This pull request references Bugzilla bug 1881143. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.

In response to this:

Bug 1881143: Backport handle azure vips for pods


Labels
bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.

7 participants