OPNET-303,OCPBUGS-25744: Remove weights from ingress check script #3698
Conversation
@cybertron: This pull request references OPNET-303 which is a valid jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test e2e-metal-ipi
/test e2e-metal-ipi Keepalived seems to be working correctly so I'm not clear why some of the tests failed.
/test e2e-metal-ipi
You need to remove the remote worker logic too in order for it to work as expected, no?
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now, please do so with /close. /lifecycle stale
@cybertron: This pull request references OPNET-303 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. This pull request references Jira Issue OCPBUGS-25744, which is invalid.
The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
/remove-lifecycle stale We recently had a bug report come in that I believe this will fix, so I think we should move forward with it, regardless of whether we use it for remote workers or not.
/jira refresh
@cybertron: This pull request references OPNET-303 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. This pull request references Jira Issue OCPBUGS-25744, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
Requesting review from QA contact: In response to this:
/test e2e-metal-ipi
Force-pushed from a1aa798 to 67c2849.
I rebased this because it was failing locally for me the same way it did in CI. After rebase it worked, so hopefully it will here too. /test e2e-metal-ipi
/hold This is legitimately failing. While the cluster will deploy fine, the keepalived liveness probe is failing and causing a crash loop.
Previously, because we had weights set for both check scripts for ingress, keepalived would always assign the ingress VIP somewhere, even if there were no ingress controllers available. While normally this is not a problem, this behavior is not ideal in circumstances such as remote workers, where one worker per remote subnet will take the VIP because they can't coordinate with the other subnets.

The advantage of removing the weight from the check script is that if there is no ingress controller running on the node it will never take the VIP, even if no other node in the cluster has taken it. This allows us to deploy remote workers without the extra step of disabling keepalived. As long as the ingress pods are deployed to the correct nodes, keepalived will handle assigning the VIP on its own, even across subnets that can't communicate directly with each other.

This also requires a modification to the keepalived liveness probe, because previously we considered a FAULT state to be a failure. Now that we expect some nodes to be in a FAULT state we can't use that logic anymore. Since for our purposes we just need to verify that keepalived is functioning, the probe now looks for an expected line in the output instead of a line that indicates an error.
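The liveness-probe change described above can be sketched in shell. This is not the actual probe script from this PR; the sample log lines, script name, and marker strings are hypothetical stand-ins chosen to illustrate the before/after logic:

```shell
#!/bin/sh
# Hypothetical stand-in for keepalived's output on a node whose ingress
# check script fails (no ingress controller running locally).
sample_output="Sending gratuitous ARP on ens3 for 192.0.2.10
VRRP_Script(chk_ingress) failed
(ingress_vip) Entering FAULT STATE"

# Old probe logic: fail whenever a FAULT line appears. With the new config,
# healthy nodes without an ingress pod sit in FAULT by design, so this
# would cause a crash loop.
if echo "$sample_output" | grep -q "FAULT"; then
  old_result="unhealthy"
else
  old_result="healthy"
fi

# New probe logic: only require evidence that keepalived itself is running
# and evaluating its check scripts, regardless of VRRP state.
if echo "$sample_output" | grep -q "VRRP_Script"; then
  new_result="healthy"
else
  new_result="unhealthy"
fi

echo "old=$old_result new=$new_result"
```

On this sample, the old logic would report the node unhealthy while the new logic reports it healthy, which is exactly the behavior change the probe needed.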
Force-pushed from 67c2849 to f0f13f5.
/hold cancel Should be working now, and passed the openstack job so looks good.
/cc @mkowalski
/lgtm
Used cluster-bot to do pre-merge testing on vSphere. When deleting the worker that holds the ingress VIP, the VIP fails over to another worker with an ingress controller, and the deleted worker no longer has the ingress VIP.
/label qe-approved
@cybertron: This pull request references OPNET-303 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. This pull request references Jira Issue OCPBUGS-25744, which is valid. 3 validation(s) were run on this bug.
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/assign @yuqi-zhang
/retest-required Single node job should be fixed now.
@cybertron: The following tests failed, say /retest to rerun all failed tests or run /test all to rerun all tests:
Full PR test history. Your PR dashboard. I understand the commands that are listed here.
/retest-required And it looks like the AWS job has now been fixed too.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cybertron, mkowalski, sinnykumari The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Merged f398f28 into openshift:master.
@cybertron: Jira Issue OCPBUGS-25744: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-25744 has been moved to the MODIFIED state. In response to this:
/cherry-pick release-4.15
@cybertron: new pull request created: #4290 In response to this:
Fix included in accepted release 4.16.0-0.nightly-2024-03-28-223620 |
- What I did
Previously, because we had weights set for both check scripts for ingress, keepalived would always assign the ingress VIP somewhere, even if there were no ingress controllers available. While normally this is not a problem, this behavior is not ideal in circumstances such as remote workers where one worker per remote subnet will take the VIP because they can't coordinate with the other subnets.
The advantage of removing the weight from the check script is that if there is no ingress controller running on the node it will never take the VIP, even if no other node in the cluster has taken it. This allows us to deploy remote workers without the extra step of disabling keepalived. As long as the ingress pods are deployed to the correct nodes keepalived will handle assigning the VIP on its own, even across subnets that can't communicate directly with each other.
- How to verify it
- Description for the changelog
Changed keepalived config so nodes without an ingress pod will go to a fault state and never take the ingress VIP, even if no other node has the VIP. This simplifies deployment in remote worker scenarios.
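The config change can be sketched as a keepalived fragment. The script path, check-script name, instance name, and former weight value below are illustrative assumptions, not the actual template rendered by the machine-config-operator:

```
# Illustrative keepalived fragment; identifiers and values are hypothetical.
vrrp_script chk_ingress {
    script "/etc/keepalived/chk_ingress.sh"  # exits non-zero when no ingress pod is local
    interval 1
    # Before this change, a line like "weight 50" was present here. With a
    # weight, a failing check only lowered the node's VRRP priority, so some
    # node always ended up claiming the VIP. With the weight removed, a
    # failing check drives the instance into FAULT state, and the node will
    # never claim the ingress VIP.
}

vrrp_instance ingress_vip {
    track_script {
        chk_ingress
    }
    # interface, virtual_ipaddress, priority, etc. omitted
}
```

In keepalived's model, a weighted check adjusts priority within the election, while an unweighted check gates participation entirely, which is what makes the FAULT-state behavior described above possible.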