Many ConfigMaps and Pods slow down cluster, until it becomes unavailable (since 1.12) #74412
On v1.12, the kubelet logs will contain messages like this: […] and the kube-controller-manager logs: […]
/sig scalability
@kubernetes/sig-scalability-bugs
@qmfrederik - a couple of questions:
You can't run more than 110 pods on a node at once. So I'm assuming you create the jobs roughly at once and then wait as they proceed: we first schedule ~100 of their pods (there are ~10 system pods already running on that node), and then, as those finish, new pods get scheduled onto the node. Am I right? @kubernetes/sig-node-bugs @yujuhong - FYI
We have encountered this problem too (#74302). It is related to the maximum number of concurrent HTTP/2 streams and the change to the kubelet's ConfigMap manager. By default, the HTTP/2 server in kube-apiserver allows 250 concurrent streams per connection, and from version 1.13.x on, every ConfigMap consumes at least one stream for its watch in the kubelet. Once the limit is hit, the kubelet gets stuck communicating with kube-apiserver and the node becomes NotReady.
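For context, here is a minimal standalone sketch — not kube-apiserver source; the port, cert paths, and handler are made up — of how a Go HTTP/2 server's per-connection stream cap is configured with golang.org/x/net/http2. kube-apiserver exposes the equivalent knob as its --http2-max-streams-per-connection flag, and every long-lived watch occupies one stream:

```go
// Minimal sketch (not kube-apiserver code): configuring an HTTP/2
// server's per-connection stream limit. Every long-lived watch request
// holds one stream for its lifetime, so 250 concurrent watches
// saturate a single connection.
package main

import (
	"fmt"
	"log"
	"net/http"

	"golang.org/x/net/http2"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/watch", func(w http.ResponseWriter, r *http.Request) {
		// A watch-like handler: block until the client goes away,
		// holding one HTTP/2 stream the whole time.
		<-r.Context().Done()
	})

	srv := &http.Server{Addr: ":8443", Handler: mux} // placeholder port

	// 250 is the default discussed above; kube-apiserver's
	// --http2-max-streams-per-connection flag feeds the equivalent setting.
	if err := http2.ConfigureServer(srv, &http2.Server{MaxConcurrentStreams: 250}); err != nil {
		log.Fatal(err)
	}

	fmt.Println("serving with a 250-stream-per-connection cap")
	// HTTP/2 requires TLS here; cert paths are placeholders.
	log.Fatal(srv.ListenAndServeTLS("server.crt", "server.key"))
}
```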
@wojtek-t To answer your questions:
What @YueHonghui says seems consistent with what I'm experiencing. So it appears that the kubelet is still watching ConfigMaps of pods which have completed, and ultimately you hit the maximum concurrent HTTP/2 stream limit. Would it make sense for the kubelet to stop watching a ConfigMap once the pod which consumes it has completed?
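To make the mechanics concrete, here is a hedged client-go sketch (it assumes a reachable cluster; the kubeconfig path and ConfigMap name are placeholders): each Watch call holds one HTTP/2 stream on the apiserver connection until Stop() is called, which is why watches that are never stopped accumulate.

```go
// Sketch: one client-go watch = one HTTP/2 stream, held until Stop().
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Watching a single ConfigMap occupies one stream for as long as
	// the watch lives ("job-config" is a hypothetical name).
	w, err := client.CoreV1().ConfigMaps("default").Watch(context.Background(), metav1.ListOptions{
		FieldSelector: "metadata.name=job-config",
	})
	if err != nil {
		panic(err)
	}

	// Consume events while the pod using the ConfigMap is running...
	for ev := range w.ResultChan() {
		fmt.Println("event:", ev.Type)
		break
	}

	// ...and release the stream once the pod has completed. The bug
	// discussed here is, in effect, this call never happening.
	w.Stop()
}
```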
@YueHonghui - hmm.. I thought that once we hit the per-connection limit (250), we would automatically open a new connection...
It should stop watching: […]
Are there any logs, metrics, ... I can capture to see whether the kubelet actually stops watching? Like a metric for concurrent HTTP/2 connections or something similar?
Unfortunately it's not easy...
@wojtek-t Thanks, I'll give that a try (probably tomorrow) and let you know.
/cc @deads2k @lavalamp @MikeSpreitzer this issue might be suggesting that policing/prioritizing requests merely by user/groups is not sufficient for cluster robustness. We should also classify these requests with finer granularity within components like the kubelet, according to verbs/resources.
@yue9944882 This issue is due to the golang bug that limits HTTP/2 connections for no reason, no?
yes, to clarify, i mean we can probably limit the WATCH connections (under 250) for the kubelet at the server side, to "make room" for the PATCH calls from the client side. will this help the case?
@yue9944882 @wojtek-t @lavalamp I have posted a goroutine stacktrace of the kubelet to #74302 (comment). In the case we encountered, the kubelet doesn't use a new connection to communicate with kube-apiserver when it hits the limit of max concurrent streams. This seems to be due to a golang bug in the HTTP/2 connection pool implementation.
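To illustrate the described behavior, here is a self-contained sketch (my construction, not kubelet code): an h2c server capped at 3 concurrent streams, and a Go HTTP/2 client whose fourth request stalls instead of opening a second connection. Newer golang.org/x/net versions open extra connections by default, so StrictMaxConcurrentStreams is set to emulate the older behavior discussed here.

```go
// Demo: with all streams on the single connection busy, the transport
// queues the next request instead of dialing a second connection.
package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"time"

	"golang.org/x/net/http2"
	"golang.org/x/net/http2/h2c"
)

func main() {
	// Watch-like handler: send headers, then hold the stream open.
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		w.(http.Flusher).Flush()
		<-r.Context().Done()
	})

	// Plaintext HTTP/2 (h2c) server capped at 3 concurrent streams,
	// standing in for the apiserver's 250-stream default.
	srv := httptest.NewServer(h2c.NewHandler(handler, &http2.Server{MaxConcurrentStreams: 3}))
	defer srv.Close()

	client := &http.Client{
		Transport: &http2.Transport{
			AllowHTTP: true, // speak h2c for the demo
			DialTLS: func(network, addr string, _ *tls.Config) (net.Conn, error) {
				return net.Dial(network, addr)
			},
			// Emulate the older transport behavior that bit the kubelet:
			// queue on the busy connection instead of dialing a new one.
			StrictMaxConcurrentStreams: true,
		},
	}

	// Occupy all three streams with long-lived "watches" (response
	// bodies are deliberately never closed, keeping the streams busy).
	for i := 0; i < 3; i++ {
		go client.Get(srv.URL)
	}
	time.Sleep(500 * time.Millisecond)

	// A fourth request finds no free stream and times out -- just like
	// the kubelet's node-status updates once watches ate every stream.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	req, _ := http.NewRequestWithContext(ctx, http.MethodGet, srv.URL, nil)
	_, err := client.Do(req)
	fmt.Println("4th request:", err) // expect: context deadline exceeded
}
```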
@yue9944882 : what I am hearing here is that the problem is not the apiservers being overloaded but rather a connection management problem in the kubelet. So this is not begging to be solved by traffic policing in the apiservers. |
yes
@wojtek-t I took the time to reproduce this (on v1.13.3) with logging enabled (v=3) on the API server. Even after all jobs have completed (i.e. the only running pods are the kube-system pods), roughly every 10 seconds ~230 new log entries (which seems awfully close to 250) for ConfigMap watches appear. Here are a couple of cycles, simplified:
Here are the full logs: kube-api-server-logs.txt.gz. So it appears that, somewhere, at least one process did not stop watching. Happy to run further tests / provide additional information if it helps you.
Heh... - I think I know where the problem is.
And this one is triggered only by pod deletion. The problem is that pods that are owned by Jobs are not deleted when they complete (they are only eventually garbage-collected). So it seems there are two problems here:
Also:
@yue9944882 - this won't help in general, because it may be valid to have more than 250 connections (if there are more than that many different secrets/configmaps).
@wojtek-t
Yeah - but that doesn't solve the problem of previous releases... |
One thought, besides the actual fixes themselves: we should add scale tests for the per-node secrets/configmaps limit, so we can catch such issues in the future. At least a kubelet integration test.
Definitely. Where would the best place to do that be? |
I would think https://github.com/kubernetes/kubernetes/tree/master/test/e2e_node? Actually, our density test in its current form allows setting secretsPerPod - https://github.com/kubernetes/kubernetes/blob/master/test/e2e/scalability/density.go#L550. If we change that to a higher value, we would test this case. cc @wojtek-t @krzysied - Can we make such a change in cluster-loader?
If those are using replicasets then the secrets will not be unique per pod, right? Unique pods, with unique secrets, would more reliably exercise this.
You're right. It's probably better to do that (though I think the chances of overlapping secrets on a node will be quite low in the case of our large-cluster tests with many RCs). Also, thinking a bit more, it seems like we'll need to test two things:
The first one seems more important than the second.
it looks like since we vendor …
The watch-based strategy has a couple of bugs: 1) golang's HTTP/2 client blocks when the per-connection stream limit is reached, and 2) the kubelet does not clean up watches for terminated pods. This patch configures the cache-based strategy. Once golang 1.12 is in use and the kubelet patch is merged, we can go back to the watch-based strategy. ref: kubernetes/kubernetes#74412 ref: kubernetes/kubernetes#74412 (comment)
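For reference, here is a hedged sketch of the workaround above: rendering a KubeletConfiguration that selects the cache-based strategy. The types come from the published k8s.io/kubelet/config/v1beta1 package (the exact constant name is my best recollection, so treat it as an assumption).

```go
// Sketch: emit a kubelet config fragment selecting the cache-based
// ConfigMap/Secret change detection strategy.
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kubeletv1beta1 "k8s.io/kubelet/config/v1beta1"
	"sigs.k8s.io/yaml"
)

func main() {
	cfg := kubeletv1beta1.KubeletConfiguration{
		TypeMeta: metav1.TypeMeta{
			APIVersion: "kubelet.config.k8s.io/v1beta1",
			Kind:       "KubeletConfiguration",
		},
		// "Cache" avoids holding one long-lived watch (one HTTP/2
		// stream) per ConfigMap/Secret; "Watch" is the strategy that
		// triggers this issue.
		ConfigMapAndSecretChangeDetectionStrategy: kubeletv1beta1.TTLCacheChangeDetectionStrategy,
	}

	out, err := yaml.Marshal(cfg)
	if err != nil {
		panic(err)
	}
	// The output can serve as the kubelet's --config file.
	fmt.Print(string(out))
}
```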
@liggitt @qmfrederik Hello 👋 I'd like to remind you that code freeze starts today, PST EOD! ❄️ 🏔️ As far as I can see, #74781 has been punted to 1.15, and the somewhat related PR #71501 doesn't have a milestone. There is still #74809, but that doesn't seem to fix this issue. Is this issue still relevant to 1.14, or should it be moved to another milestone?
@liggitt: Closing this issue.
yes, it is in progress
Sorry - it was unfortunate that I was OOO exactly when this was discovered. Reverting (given that it was a 3-line change) seemed like the most reasonable option. Regarding testing this, I actually don't think we need a large cluster.
The application-apply of the stx-openstack application on simplex configurations has been failing since the barbican chart was added to the application. The failure was due to lost node status messages from the kubelet to the kube-apiserver, which causes the node to be marked NotReady and endpoints to be removed. The root cause is the kubernetes bug here: kubernetes/kubernetes#74412

In short, the addition of the barbican chart added enough new secrets/configmaps that the kubelet hit the limit of http2-max-streams-per-connection. As done upstream, the fix is to change the following kubelet config: configMapAndSecretChangeDetectionStrategy (from Watch to Cache).

Change-Id: Ic816a91984c4fb82546e4f43b5c83061222c7d05
Closes-bug: 1820928
Signed-off-by: Bart Wensley <barton.wensley@windriver.com>
Upgrading from kubernetes 1.13.5 to 1.15.0 meant the config needed to be updated to handle whatever was deprecated or dropped in 1.14 and 1.15.

1) Removed "ConfigMapAndSecretChangeDetectionStrategy = Watch", reported by kubernetes/kubernetes#74412, because this was a golang deficiency and is fixed by the newer version of golang.
2) Enforced the kubernetes 1.15.3 version.
3) Updated v1alpha3 to v1beta2, since alpha3 was dropped in 1.14. Changed fields for beta1 and beta2 are mentioned in these docs:
https://godoc.org/k8s.io/kubernetes/cmd/kubeadm/app/apis/kubeadm/v1beta1
https://godoc.org/k8s.io/kubernetes/cmd/kubeadm/app/apis/kubeadm/v1beta2
4) cgroup validation checking now includes the pids subfolder.
5) Updated ceph-config-helper to be v1.15 kubernetes compatible, which means the stx-openstack version check needed to be increased.

Change-Id: Ibe3d5960c5dee1d217d01fbb56c785581dd1b42c
Story: 2005860
Task: 35841
Depends-On: https://review.opendev.org/#/c/671150
Signed-off-by: Al Bailey <Al.Bailey@windriver.com>
What happened:
I schedule multiple jobs in my cluster. Each job uses a different ConfigMap which contains the configuration for that job.
This worked well on version 1.11 of Kubernetes. After upgrading to 1.12 or 1.13, I noticed that doing this causes the cluster to slow down significantly, up to the point where nodes are marked NotReady and no new work is scheduled.
For example, consider a scenario in which I schedule, on a single-node cluster, 400 jobs, each with its own ConfigMap, that print "Hello World".
On v1.11, it takes about 10 minutes for the cluster to process all jobs. New jobs can be scheduled.
On v1.12 and v1.13, it takes about 60 minutes for the cluster to process all jobs. After this, no new jobs can be scheduled.
What you expected to happen:
I did not expect this scenario to make my nodes unavailable in Kubernetes 1.12 and 1.13; I would have expected the behavior I observed in 1.11.
How to reproduce it (as minimally and precisely as possible):
The easiest way seems to be to schedule, on a single-node cluster, about 300 jobs, as in the sketch below:
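The exact manifest is not preserved in this capture, so the following is a hedged reconstruction in Go using client-go (the image, names, namespace, and kubeconfig path are assumptions): N jobs, each consuming its own ConfigMap and printing "Hello World".

```go
// Hedged reproducer sketch: 300 jobs, each with its own ConfigMap.
package main

import (
	"context"
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	for i := 0; i < 300; i++ {
		name := fmt.Sprintf("hello-%d", i)

		// One ConfigMap per job -- under the watch-based strategy each
		// one costs the kubelet a watch (one HTTP/2 stream).
		_, err = client.CoreV1().ConfigMaps("default").Create(ctx, &corev1.ConfigMap{
			ObjectMeta: metav1.ObjectMeta{Name: name},
			Data:       map[string]string{"greeting": "Hello World"},
		}, metav1.CreateOptions{})
		if err != nil {
			panic(err)
		}

		// The job's pod reads its greeting from its own ConfigMap.
		_, err = client.BatchV1().Jobs("default").Create(ctx, &batchv1.Job{
			ObjectMeta: metav1.ObjectMeta{Name: name},
			Spec: batchv1.JobSpec{
				Template: corev1.PodTemplateSpec{
					Spec: corev1.PodSpec{
						RestartPolicy: corev1.RestartPolicyNever,
						Containers: []corev1.Container{{
							Name:    "hello",
							Image:   "busybox",
							Command: []string{"sh", "-c", `echo "$GREETING"`},
							Env: []corev1.EnvVar{{
								Name: "GREETING",
								ValueFrom: &corev1.EnvVarSource{
									ConfigMapKeyRef: &corev1.ConfigMapKeySelector{
										LocalObjectReference: corev1.LocalObjectReference{Name: name},
										Key:                  "greeting",
									},
								},
							}},
						}},
					},
				},
			},
		}, metav1.CreateOptions{})
		if err != nil {
			panic(err)
		}
	}
	fmt.Println("created 300 jobs, each consuming its own ConfigMap")
}
```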
I can consistently reproduce this issue in a VM-based environment, which I configure using Vagrant. You can find the full setup here: https://github.com/qmfrederik/k8s-job-repro
Anything else we need to know?:
Happy to provide further information as needed
Environment:
Kubernetes version (kubectl version): v1.12 through v1.13
OS (cat /etc/os-release): 18.04.1 LTS (Bionic Beaver)
Kernel (uname -a): Linux vagrant 4.15.0-29-generic #31-Ubuntu SMP Tue Jul 17 15:39:52 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux