Delay in processing HNS LB policies on kube-proxy start on Windows nodes results in unreachable services #109162

LP0101 · 2022-03-30T18:00:31Z

What happened?

When starting Windows nodes with a high number of HNS LB policies/rules on the cluster, there is a delay in processing them. This leaves services unreachable during the delay, which takes about half a minute per policy. This can be substantial given enough rules.

This occurs both when restarting kube-proxy and when rebooting the host. Once the system reaches a state where all the policy lists are processed, incremental updates to the services (e.g. endpoint changes) are handled fine.

What did you expect to happen?

HNS policies should not cause a large delay for Windows nodes.

How can we reproduce it (as minimally and precisely as possible)?

With a large number of HNS policies in place, restart kube-proxy on a Windows node.

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.7", GitCommit:"1f86634ff08f37e54e8bfcd86bc90b61c98f84d4", GitTreeState:"clean", BuildDate:"2021-11-17T14:41:19Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.7", GitCommit:"f74784f1eaf1e02b651778d6ee2df1ae5ee729ae", GitTreeState:"clean", BuildDate:"2022-03-10T07:58:41Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

Azure AKS

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

neolit123 · 2022-03-30T20:41:13Z

/sig windows network

jsturtevant · 2022-03-31T22:06:47Z

/triage accepted

There are more details on timing in the fix that @daschott opened: #109124.

When doing a sync of Services on a new node joining the cluster, HNS is queried for state on every endpoint in a service, which is expensive. When iterating over thousands of services, this can take hours (!).

The fix proposed by @daschott gets the HNS state once per sync instead of once per endpoint. With this plus a fix in the Windows OS, the sync is reduced to minutes on WS 2019 and ~1 min on WS 2022.
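To illustrate the difference between querying per endpoint and querying once per sync, here is a minimal sketch. It is not the code from #109124: `fakeHNS`, `endpointInfo`, `syncPerEndpoint`, and `syncWithSnapshot` are hypothetical names made up for this example, and the fake client only counts calls instead of talking to HNS.

```go
package main

import "fmt"

// endpointInfo is a hypothetical, simplified view of an HNS endpoint.
type endpointInfo struct {
	ID string
	IP string
}

// fakeHNS stands in for the real HNS API (assumption: not the actual client).
// It counts how many times it is queried.
type fakeHNS struct {
	endpoints []endpointInfo
	calls     int
}

// listEndpoints simulates one expensive round trip to HNS.
func (h *fakeHNS) listEndpoints() []endpointInfo {
	h.calls++
	return h.endpoints
}

// syncPerEndpoint queries HNS again for every endpoint of every service.
// It returns how many service endpoints were found in HNS.
func syncPerEndpoint(h *fakeHNS, services map[string][]string) int {
	found := 0
	for _, ips := range services {
		for _, ip := range ips {
			for _, ep := range h.listEndpoints() {
				if ep.IP == ip {
					found++
					break
				}
			}
		}
	}
	return found
}

// syncWithSnapshot queries HNS once per sync, indexes the result by IP,
// and resolves every service endpoint against that in-memory snapshot.
func syncWithSnapshot(h *fakeHNS, services map[string][]string) int {
	byIP := make(map[string]endpointInfo)
	for _, ep := range h.listEndpoints() {
		byIP[ep.IP] = ep
	}
	found := 0
	for _, ips := range services {
		for _, ip := range ips {
			if _, ok := byIP[ip]; ok {
				found++
			}
		}
	}
	return found
}

func main() {
	services := map[string][]string{
		"svc-a": {"10.0.0.1", "10.0.0.2"},
		"svc-b": {"10.0.0.3"},
	}
	eps := []endpointInfo{
		{ID: "ep1", IP: "10.0.0.1"},
		{ID: "ep2", IP: "10.0.0.2"},
		{ID: "ep3", IP: "10.0.0.3"},
	}

	slow := &fakeHNS{endpoints: eps}
	syncPerEndpoint(slow, services)

	fast := &fakeHNS{endpoints: eps}
	syncWithSnapshot(fast, services)

	fmt.Printf("per-endpoint queries: %d, snapshot queries: %d\n", slow.calls, fast.calls)
}
```

In this toy setup the per-endpoint variant issues one query per service endpoint (3 here), while the snapshot variant issues exactly one per sync regardless of how many services exist. The trade-off is that the snapshot can go stale within a sync, so it would need to be refreshed at the start of each one.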

zhiweiv · 2022-06-09T02:20:07Z

@daschott @jsturtevant
For the Windows OS fix mentioned, do you know when it will arrive in Windows Server 2019?

LP0101 added the kind/bug label Mar 30, 2022

k8s-ci-robot added the needs-sig and needs-triage labels Mar 30, 2022

daschott mentioned this issue Mar 30, 2022

winkernel proxier cache HNS data to improve syncProxyRules performance #109124

Merged

k8s-ci-robot added the sig/windows and sig/network labels and removed the needs-sig label Mar 30, 2022

thockin assigned jsturtevant Mar 31, 2022

k8s-ci-robot added the triage/accepted label and removed the needs-triage label Mar 31, 2022

k8s-ci-robot closed this as completed in #109124 May 4, 2022

jsturtevant mentioned this issue Jun 30, 2022

Windows hns loadbalancers disappear intermittently #110849

Closed

daschott mentioned this issue Sep 9, 2022

REQUEST: New membership for daschott kubernetes/org#3671

Closed

