Regression in winkernel proxier that causes stale load balancing proxy rules #112836

daschott · 2022-10-03T20:42:58Z

What happened?

There is a regression in v1.24.0 and above that leaves stale HNS load balancer proxy rules behind whenever a backend pod is deleted. Each subsequent deletion leaves behind an additional external VIP load balancing rule that references endpoints which no longer exist.

This can cause occasional connectivity issues and timeouts when a stale load balancing rule is matched and redirects traffic to an endpoint that no longer exists.

We believe this may have been introduced here. (Credit to @sbangari for identifying this.)

What did you expect to happen?

I expect all load balancing rules created by the winkernel proxier (aka HNS load balancer) to reference valid backends (aka HNS endpoints).

How can we reproduce it (as minimally and precisely as possible)?

1. Create a Kubernetes cluster v1.24 or higher with Windows nodes.
2. Create a Kubernetes Service with type "LoadBalancer" on Windows.
3. Delete one or more pods of the Service.
4. Observe a stale load balancing rule left behind that references an endpoint which was already deleted.

Anything else we need to know?

This issue can be discovered by establishing multiple connections to the service after pods have been deleted. Some of the requests will fail.

This issue can be monitored using the following script:
https://raw.githubusercontent.com/daschott/SDN/patch-1/Kubernetes/windows/debug/networkhealth.ps1

Execute as follows:
.\networkhealth.ps1 -OutputMode Stdout

It will print something along the lines of:
10/3/2022 8:18:28 PM <my_node_name> @{Problem=Detected 1 stale VIPs <my_service_vip> }

Another way to inspect this manually is using the following PowerShell:

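# Collect the endpoint references of every HNS load balancer policy whose VIP matches the service VIP
$refs = Get-HnsPolicyList | Select-Object @{Name="VIP"; Expression={$_.Policies.VIPs}}, References | Where-Object VIP -Like "my_svc_vip" | Select-Object References
# List all live HNS endpoints for comparison
Get-HnsEndpoint | Select-Object IPAddress, ID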

In $refs, you would see multiple entries (indicating that there are duplicate HNS load balancer proxy rules with the same VIP), and some of the references are invalid (i.e. they do not show up in the Get-HnsEndpoint output).
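The manual comparison above can be automated. The following is a minimal sketch, not the logic of networkhealth.ps1: Get-StaleReference is a hypothetical helper, and it assumes each reference is a path whose last segment is the endpoint ID (for example "/endpoints/<ID>"). Verify that assumption against the actual Get-HnsPolicyList output on your node.

```powershell
# Hypothetical helper (not part of the original report): returns every
# reference that does not correspond to a live endpoint ID.
function Get-StaleReference {
    param(
        [string[]]$References,   # assumed format: "/endpoints/<ID>"
        [string[]]$EndpointIds   # IDs as listed by Get-HnsEndpoint
    )
    foreach ($ref in $References) {
        # Assumption: the endpoint ID is the last path segment of the reference.
        $id = ($ref -split '/')[-1]
        # -notcontains compares strings case-insensitively.
        if ($EndpointIds -notcontains $id) {
            $ref
        }
    }
}
```

On an affected node, with $refs populated as shown above, it could be called as Get-StaleReference -References $refs.References -EndpointIds (Get-HnsEndpoint).ID. Any output would indicate references to endpoints that no longer exist.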

Workaround

Force all pods of a given service onto a single node. This still produces duplicate load balancer entries, but all connections should succeed because the correct rule should be matched.
Use a Kubernetes version earlier than v1.24.0.
Run Restart-Service kubeproxy. This removes all the rules and re-creates them.

Kubernetes version

1.24.0 and above

Cloud provider

Azure Kubernetes Service, but likely impacting others.

OS version

Windows Server 2019, Windows Server 2022

Install tools

N/A

Container runtime (CRI) and version (if applicable)

N/A

Related plugins (CNI, CSI, ...) and versions (if applicable)

N/A


daschott · 2022-10-03T20:43:23Z

/sig windows

jsturtevant · 2022-10-03T21:08:01Z

/triage accepted

jsturtevant · 2022-10-03T21:08:16Z

/sig network

marosset · 2022-10-04T16:44:34Z

/milestone v1.26
/priority critical-urgent

marosset · 2022-10-04T22:16:56Z

/reopen
Let's keep this open until backport PRs merge and the fix is available in v1.24 and v1.25 builds.

k8s-ci-robot · 2022-10-04T22:17:02Z

@marosset: Reopened this issue.

In response to this:

/reopen
Let's keep this open until backport PRs merge and the fix is available in v1.24 and v1.25 builds.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jsturtevant · 2022-11-01T16:52:59Z

backports are merged and patches were released
/close

k8s-ci-robot · 2022-11-01T16:53:06Z

@jsturtevant: Closing this issue.

In response to this:

backports are merged and patches were released
/close


daschott added the kind/bug Categorizes issue or PR as related to a bug. label Oct 3, 2022

k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 3, 2022

k8s-ci-robot added sig/windows Categorizes an issue or PR as relevant to SIG Windows. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 3, 2022

daschott mentioned this issue Oct 3, 2022

Fix winkernel proxier setting the wrong HNS loadbalancer ID for ingresss IP #112837

Merged

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 3, 2022

k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Oct 3, 2022

AbelHu mentioned this issue Oct 4, 2022

[BUG] Regression Windows issue in k8s v1.24. Please do not upgrade your Windows clusters to k8s v1.24 Azure/AKS#3246

Closed

k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Oct 4, 2022

k8s-ci-robot added this to the v1.26 milestone Oct 4, 2022

k8s-ci-robot closed this as completed in #112837 Oct 4, 2022

k8s-ci-robot reopened this Oct 4, 2022

k8s-ci-robot closed this as completed Nov 1, 2022
