Regression in winkernel proxier that causes stale load balancing proxy rules #112836
/sig windows
/triage accepted
/sig network
/milestone v1.26
/reopen
@marosset: Reopened this issue in response to the /reopen command above.
Backports are merged and patches were released.
@jsturtevant: Closing this issue in response to the comment above.
What happened?
There is a regression in v1.24.0 (and above) that causes stale HNS load balancer proxy rules whenever a backend pod is deleted. Each subsequent deletion leaves behind an additional external VIP load balancing rule that references endpoints which no longer exist.
This can cause intermittent connectivity issues and timeouts if a stale load balancing rule is matched and redirects traffic to an endpoint that no longer exists.
We are under the impression this may have been introduced here. (Credit to @sbangari for identifying this.)
What did you expect to happen?
I expect all load balancing rules created by the winkernel proxier (i.e. HNS load balancers) to reference valid backends (i.e. HNS endpoints).
How can we reproduce it (as minimally and precisely as possible)?
Anything else we need to know?
This issue can be discovered by establishing multiple connections to the service after pods have been deleted; some of the requests will fail.
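For example, a repeated-connection loop along the following lines can surface the intermittent failures (a sketch only; `<service-vip>` and the port are placeholders for your service's external VIP):

```powershell
# Sketch: open repeated connections to the service VIP and report failures.
# Replace <service-vip> and the port with your service's actual values.
1..50 | ForEach-Object {
    try {
        Invoke-WebRequest -Uri "http://<service-vip>:80/" -UseBasicParsing -TimeoutSec 5 | Out-Null
        Write-Host "request $_ ok"
    } catch {
        # Requests matched by a stale load balancing rule will time out or fail here.
        Write-Host "request $_ FAILED: $($_.Exception.Message)"
    }
}
```

With the regression present, a fraction of the requests fail even though healthy endpoints exist.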
This issue can be monitored using the following script:
https://raw.githubusercontent.com/daschott/SDN/patch-1/Kubernetes/windows/debug/networkhealth.ps1
Execute as follows:
.\networkhealth.ps1 -OutputMode Stdout
It will print something along the lines of:
10/3/2022 8:18:28 PM <my_node_name> @{Problem=Detected 1 stale VIPs <my_service_vip> }
Another way to inspect this manually is using the following PowerShell:
In the $refs output, you would see multiple entries (indicating there are duplicate HNS load balancer proxy rules with the same VIP) and that some of the references are invalid (i.e. not showing up in the Get-HnsEndpoint output).
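A minimal sketch of that manual check, assuming the hns.psm1 helper module (from the microsoft/SDN repository) is available on the node and that each policy's References entries end in an endpoint ID:

```powershell
# Sketch of a manual stale-rule check on a Windows node.
# Assumes hns.psm1 (microsoft/SDN repo) is present in the current directory.
Import-Module .\hns.psm1

# IDs of all endpoints HNS currently knows about.
$endpointIds = (Get-HnsEndpoint).Id

# Walk the load balancer policies and flag references to missing endpoints.
foreach ($lb in Get-HnsPolicyList) {
    # Assumption: each reference has the form ".../endpoints/<id>".
    $refs  = $lb.References | ForEach-Object { $_.Split('/')[-1] }
    $stale = $refs | Where-Object { $endpointIds -notcontains $_ }
    if ($stale) {
        Write-Host "Policy $($lb.Id) references missing endpoints: $stale"
    }
}
```

On a healthy node this prints nothing; with the regression, deleted pods' endpoint IDs show up as missing references.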
Workaround
Restart-Service kubeproxy
This will remove all the rules and re-create them.
Kubernetes version
1.24.0 and above
Cloud provider
Azure Kubernetes Service, but likely impacting others.
OS version
Windows Server 2019, Windows Server 2022
Install tools
N/A
Container runtime (CRI) and version (if applicable)
N/A
Related plugins (CNI, CSI, ...) and versions (if applicable)
N/A