Regression in winkernel proxier that causes stale load balancing proxy rules #112836

Closed
daschott opened this issue Oct 3, 2022 · 8 comments · Fixed by #112837
Labels
kind/bug: Categorizes issue or PR as related to a bug.
priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
sig/network: Categorizes an issue or PR as relevant to SIG Network.
sig/windows: Categorizes an issue or PR as relevant to SIG Windows.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

@daschott
Contributor

daschott commented Oct 3, 2022

What happened?

There is a regression in v1.24.0 and later that leaves stale HNS load balancer proxy rules behind whenever a backend pod is deleted. Each subsequent deletion leaves an additional external VIP load balancing rule that references endpoints which no longer exist.

This can cause intermittent connectivity issues and timeouts when a stale load balancing rule is matched and redirects traffic to an endpoint that no longer exists.

We believe this may have been introduced here. (Credit to @sbangari for identifying this.)

What did you expect to happen?

I expect all load balancing rules created by the winkernel proxier (i.e. HNS load balancers) to reference valid backends (i.e. HNS endpoints).

How can we reproduce it (as minimally and precisely as possible)?

  1. Create a Kubernetes cluster v1.24 or higher with Windows nodes.
  2. Create a Kubernetes Service with type "LoadBalancer" on Windows.
  3. Delete one or more pods of the Service.
  4. Observe that a stale load balancing rule is left behind, referencing an endpoint that was already deleted.

Anything else we need to know?

This issue can be detected by opening multiple connections to the Service after pods have been deleted. Some of the requests will fail.
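
A minimal sketch of such a connectivity check, written in Python for illustration only. The function name `probe` and its parameters are hypothetical and not part of any Kubernetes tooling; it assumes the Service exposes a plain TCP port. Each attempt opens a fresh connection, so every attempt is a new flow:

```python
import socket


def probe(host, port, attempts=20, timeout=2.0):
    """Open `attempts` separate TCP connections; return (successes, failures)."""
    ok = 0
    failed = 0
    for _ in range(attempts):
        try:
            # A new connection per attempt; it is closed again immediately.
            with socket.create_connection((host, port), timeout=timeout):
                ok += 1
        except OSError:
            # Covers timeouts, refused connections and unreachable hosts.
            failed += 1
    return ok, failed
```

For example, calling `probe("<my_service_vip>", 80)` from a client returns the number of successful and failed connection attempts. A non-zero failure count after deleting pods is consistent with the behavior described above, although it does not prove that a stale rule is the cause.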

This issue can be monitored using the following script:
https://raw.githubusercontent.com/daschott/SDN/patch-1/Kubernetes/windows/debug/networkhealth.ps1

Execute as follows:
.\networkhealth.ps1 -OutputMode Stdout

It will print something along the lines of:
10/3/2022 8:18:28 PM <my_node_name> @{Problem=Detected 1 stale VIPs <my_service_vip> }

Another way to inspect this manually is using the following PowerShell:

$refs = get-hnspolicylist | select @{Name="VIP"; Expression={$_.Policies.VIPs}}, References | ? VIP -Like "my_svc_vip" | Select References
get-hnsendpoint | select IPAddress, ID

In $refs, you will see multiple entries (indicating duplicate HNS load balancer proxy rules with the same VIP), and some of the references will be invalid (i.e. they do not appear in the get-hnsendpoint output).
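
The cross-check described above can be sketched as follows. This Python snippet only illustrates the logic and is not part of kube-proxy, HNS, or the script linked earlier. It assumes the policy lists and endpoint IDs have already been exported into plain dictionaries and strings, and that each reference has the form "/endpoints/<endpoint ID>". The function names and the sample data are hypothetical:

```python
from collections import defaultdict


def find_duplicate_vips(policy_lists):
    """Return the VIPs that appear in more than one load balancer rule."""
    counts = defaultdict(int)
    for rule in policy_lists:
        for vip in rule["VIPs"]:
            counts[vip] += 1
    return sorted(vip for vip, n in counts.items() if n > 1)


def find_stale_rules(policy_lists, endpoint_ids):
    """Return {vip: [reference, ...]} for references to endpoints that no longer exist."""
    known = {eid.lower() for eid in endpoint_ids}
    stale = defaultdict(list)
    for rule in policy_lists:
        for ref in rule["References"]:
            # Assumed reference format: "/endpoints/<endpoint ID>"
            ref_id = ref.rsplit("/", 1)[-1].lower()
            if ref_id not in known:
                for vip in rule["VIPs"]:
                    stale[vip].append(ref)
    return dict(stale)


# Hypothetical sample data: two rules share one VIP, and one rule
# still references endpoint "bbb", which has been deleted.
policy_lists = [
    {"VIPs": ["10.0.0.10"], "References": ["/endpoints/aaa", "/endpoints/bbb"]},
    {"VIPs": ["10.0.0.10"], "References": ["/endpoints/aaa"]},
]
endpoint_ids = ["AAA"]

print(find_duplicate_vips(policy_lists))             # ['10.0.0.10']
print(find_stale_rules(policy_lists, endpoint_ids))  # {'10.0.0.10': ['/endpoints/bbb']}
```

IDs are compared case-insensitively here as a precaution, since the two sources may not print them in the same case.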

Workaround

  1. Force all pods of a given Service onto a single node. This still produces duplicate load balancer entries, but connections should all succeed because the correct rule should be matched.
  2. Use a Kubernetes version earlier than v1.24.0.
  3. Run Restart-Service kubeproxy. This removes all the rules and re-creates them.

Kubernetes version

1.24.0 and above

Cloud provider

Azure Kubernetes Service, but likely impacting others.

OS version

Windows Server 2019, Windows Server 2022

Install tools

N/A

Container runtime (CRI) and version (if applicable)

N/A

Related plugins (CNI, CSI, ...) and versions (if applicable)

N/A

@daschott added the kind/bug label Oct 3, 2022
@k8s-ci-robot added the needs-sig and needs-triage labels Oct 3, 2022
@daschott
Contributor Author

daschott commented Oct 3, 2022

/sig windows

@k8s-ci-robot added the sig/windows label and removed the needs-sig label Oct 3, 2022
@jsturtevant
Contributor

/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label Oct 3, 2022
@jsturtevant
Contributor

/sig network

@marosset
Contributor

marosset commented Oct 4, 2022

/milestone v1.26
/priority critical-urgent

@k8s-ci-robot added the priority/critical-urgent label Oct 4, 2022
@k8s-ci-robot added this to the v1.26 milestone Oct 4, 2022
@marosset
Contributor

marosset commented Oct 4, 2022

/reopen
Let's keep this open until backport PRs merge and the fix is available in v1.24 and v1.25 builds.

@k8s-ci-robot reopened this Oct 4, 2022
@k8s-ci-robot
Contributor

@marosset: Reopened this issue.

@jsturtevant
Contributor

backports are merged and patches were released
/close

@k8s-ci-robot
Contributor

@jsturtevant: Closing this issue.
