
OPNET-303,OCPBUGS-25744: Remove weights from ingress check script #3698

Merged
merged 1 commit into openshift:master on Mar 25, 2024

Conversation

cybertron
Member

Previously, because we had weights set for both check scripts for ingress, keepalived would always assign the ingress VIP somewhere, even if there were no ingress controllers available. While normally this is not a problem, this behavior is not ideal in circumstances such as remote workers where one worker per remote subnet will take the VIP because they can't coordinate with the other subnets.

The advantage of removing the weight from the check script is that if there is no ingress controller running on the node it will never take the VIP, even if no other node in the cluster has taken it. This allows us to deploy remote workers without the extra step of disabling keepalived. As long as the ingress pods are deployed to the correct nodes keepalived will handle assigning the VIP on its own, even across subnets that can't communicate directly with each other.

- What I did

- How to verify it

- Description for the changelog
Changed keepalived config so nodes without an ingress pod will go to a fault state and never take the ingress VIP, even if no other node has the VIP. This simplifies deployment in remote worker scenarios.
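
For reference, a minimal sketch of what this change looks like in keepalived terms (all names, paths, addresses, and values below are illustrative assumptions, not the actual rendered template):

```
# Illustrative sketch only -- script name, interface, and addresses are
# assumptions. A check script *with* a weight only raises or lowers the
# instance's priority when it passes or fails, so some node always wins
# the VIP election. With no weight, a failing check instead drives the
# instance into FAULT state, and that node can never take the VIP.
vrrp_script chk_ingress {
    script "/etc/keepalived/chk_ingress.sh"
    interval 1
    # weight 50   <-- removed by this change
}

vrrp_instance ingress_vip {
    state BACKUP
    interface br-ex
    virtual_router_id 50
    priority 40
    virtual_ipaddress {
        192.168.111.4/32
    }
    track_script {
        chk_ingress
    }
}
```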

@openshift-ci-robot
Contributor

openshift-ci-robot commented May 5, 2023

@cybertron: This pull request references OPNET-303 which is a valid jira issue.

In response to this:

(PR description, quoted above.)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 5, 2023
@cybertron
Member Author

/test e2e-metal-ipi
/test e2e-vsphere-upgrade
/test e2e-openstack

@cybertron
Member Author

/test e2e-metal-ipi

Keepalived seems to be working correctly, so I'm not sure why some of the tests failed.

@cybertron
Member Author

/test e2e-metal-ipi

@tsorya
Contributor

tsorya commented Sep 14, 2023

You need to remove the remote worker logic too in order for it to work as expected, no?

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 14, 2023
@cybertron cybertron changed the title OPNET-303: Remove weights from ingress check script OPNET-303,OCPBUGS-25744: Remove weights from ingress check script Dec 20, 2023
@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Dec 20, 2023
@openshift-ci-robot
Contributor

openshift-ci-robot commented Dec 20, 2023

@cybertron: This pull request references OPNET-303 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

This pull request references Jira Issue OCPBUGS-25744, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

(PR description, quoted above.)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cybertron
Member Author

/remove-lifecycle stale

We recently had a bug report come in that I believe this will fix, so I think we should move forward with it, regardless of whether we use it for remote workers or not.

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 20, 2023
@cybertron
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Dec 20, 2023
@openshift-ci-robot
Contributor

openshift-ci-robot commented Dec 20, 2023

@cybertron: This pull request references OPNET-303 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

This pull request references Jira Issue OCPBUGS-25744, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @zhaozhanqi

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cybertron
Member Author

/test e2e-metal-ipi

@cybertron
Member Author

I rebased this because it was failing locally for me the same way it did in CI. After the rebase it worked, so hopefully it will here too.

/test e2e-metal-ipi
/test e2e-metal-ipi-ovn-ipv6
/test e2e-metal-ipi-ovn-dualstack

@cybertron
Member Author

/hold

This is legitimately failing. While the cluster will deploy fine, the keepalived liveness probe is failing and causing a crash loop.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 16, 2024
Previously, because we had weights set for both check scripts for
ingress, keepalived would always assign the ingress VIP somewhere,
even if there were no ingress controllers available. While normally
this is not a problem, this behavior is not ideal in circumstances
such as remote workers where one worker per remote subnet will take
the VIP because they can't coordinate with the other subnets.

The advantage of removing the weight from the check script is that
if there is no ingress controller running on the node it will never
take the VIP, even if no other node in the cluster has taken it.
This allows us to deploy remote workers without the extra step of
disabling keepalived. As long as the ingress pods are deployed to
the correct nodes keepalived will handle assigning the VIP on its
own, even across subnets that can't communicate directly with each
other.

This also requires a modification to the keepalived liveness probe
because previously we considered a FAULT state to be a failure. Now
that we expect some nodes to be in a FAULT state, we can't use that
logic anymore. Since for our purposes we only need to verify that
keepalived is functioning, the probe now looks for an expected line
in the output instead of a line that indicates an error.
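
In probe terms, the new check is roughly the following sketch (the state file path and the matched string are illustrative assumptions, not the real implementation):

```sh
#!/bin/bash
# Old logic (removed): fail the probe when the output contains a line
# indicating an error, such as a FAULT transition. FAULT is now a
# legitimate state for nodes without an ingress pod, so that check
# would crash-loop them.
#
# New logic: fail only when the line we expect from a healthy
# keepalived process is missing entirely.
if grep -q "VRRP_Instance" /var/run/keepalived/keepalived.state; then
    exit 0   # expected output present; keepalived is functioning
fi
exit 1       # expected line absent; let the probe restart the container
```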
@cybertron
Member Author

/hold cancel
/test e2e-metal-ipi
/test e2e-vsphere

Should be working now, and passed the openstack job so looks good.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 17, 2024
@cybertron
Member Author

/cc @mkowalski

@openshift-ci openshift-ci bot requested a review from mkowalski January 18, 2024 20:57
@mkowalski
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 19, 2024
@zhaozhanqi

Used cluster-bot to do pre-merge testing on vSphere. When the worker holding the ingress VIP is deleted, the VIP fails over to another worker that is running an ingress controller, and the deleted worker no longer holds the ingress VIP.

$ oc delete node zzhao45sp13-5zhmk-worker-0-w7rml
node "zzhao45sp13-5zhmk-worker-0-w7rml" deleted
$ oc get pod -n openshift-ingress -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP                NODE                               NOMINATED NODE   READINESS GATES
router-default-85498d88c9-flngn   1/1     Running   0          95m   192.168.221.188   zzhao45sp13-5zhmk-worker-0-w7rml   <none>           <none>
router-default-85498d88c9-rw77p   1/1     Running   0          28m   192.168.221.204   zzhao45sp13-5zhmk-worker-0-6hf8k   <none>           <none>
$ oc get pod -n openshift-ingress -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP                NODE                               NOMINATED NODE   READINESS GATES
router-default-85498d88c9-cpr6h   1/1     Running   0          21s   192.168.221.228   zzhao45sp13-5zhmk-worker-0-lm678   <none>           <none>
router-default-85498d88c9-rw77p   1/1     Running   0          30m   192.168.221.204   zzhao45sp13-5zhmk-worker-0-6hf8k   <none>           <none>
$ oc debug node/zzhao45sp13-5zhmk-worker-0-lm678
Starting pod/zzhao45sp13-5zhmk-worker-0-lm678-debug-zrgcv ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.221.228
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# ssh -i /tmp/a core@192.168.221.188
Red Hat Enterprise Linux CoreOS 415.92.202401200105-0
  Part of OpenShift 4.15, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.15/architecture/architecture-rhcos.html

---
Last login: Mon Jan 22 04:43:00 2024 from 192.168.221.228
[core@zzhao45sp13-5zhmk-worker-0-w7rml ~]$ ip a s br-ex
6: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 00:50:56:bd:f7:54 brd ff:ff:ff:ff:ff:ff
    inet 192.168.221.188/24 brd 192.168.221.255 scope global dynamic noprefixroute br-ex
       valid_lft 3581sec preferred_lft 3581sec
    inet 169.254.169.2/29 brd 169.254.169.7 scope global br-ex
       valid_lft forever preferred_lft forever
    inet6 fe80::7ec5:2998:4cc4:477c/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Jan 22, 2024
@openshift-ci-robot
Contributor

openshift-ci-robot commented Jan 22, 2024

@cybertron: This pull request references OPNET-303 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

This pull request references Jira Issue OCPBUGS-25744, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @zhaozhanqi

In response to this:

(PR description, quoted above.)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@cybertron
Member Author

/assign @yuqi-zhang

@cybertron
Member Author

/retest-required

Single node job should be fixed now.

Contributor

openshift-ci bot commented Feb 28, 2024

@cybertron: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-scos-e2e-gcp-ovn-upgrade | a1aa798 | link | false | /test okd-scos-e2e-gcp-ovn-upgrade |
| ci/prow/e2e-alibabacloud-ovn | a1aa798 | link | false | /test e2e-alibabacloud-ovn |
| ci/prow/e2e-vsphere-upgrade | a1aa798 | link | false | /test e2e-vsphere-upgrade |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | 67c2849 | link | false | /test e2e-metal-ipi-ovn-ipv6 |
| ci/prow/e2e-metal-ipi-ovn-dualstack | 67c2849 | link | false | /test e2e-metal-ipi-ovn-dualstack |
| ci/prow/okd-scos-e2e-aws-ovn | f0f13f5 | link | false | /test okd-scos-e2e-aws-ovn |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cybertron
Member Author

/retest-required

And it looks like the AWS job has now been fixed too.

Contributor

openshift-ci bot commented Mar 25, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cybertron, mkowalski, sinnykumari

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 25, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit f398f28 into openshift:master Mar 25, 2024
18 of 19 checks passed
@openshift-ci-robot
Contributor

@cybertron: Jira Issue OCPBUGS-25744: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-25744 has been moved to the MODIFIED state.

In response to this:

(PR description, quoted above.)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@cybertron
Member Author

/cherry-pick release-4.15

@openshift-cherrypick-robot

@cybertron: new pull request created: #4290

In response to this:

/cherry-pick release-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot
Contributor

Fix included in accepted release 4.16.0-0.nightly-2024-03-28-223620
