Possible regression in pod-startup-time #42000
@kubernetes/sig-scalability-bugs
@k8s-oncall - FYI, I will look into it tomorrow (it's something different from the current failure, which should hopefully be fixed by #42003)
FYI, this happened to me too in the afternoon when I was running my own cluster within the kubernetes-scale project. My hypothesis is that there were way too many machines running in the project at that time (~5000 from the gke-large-cluster test and 60 from the usual kubemark-2000 test). So we had 5 MIGs from gke-large (in us-east1-a) and 1 MIG from kubemark-2000 (in us-central1-f). We probably have a limit of 5 MIGs (either per zone, in which case some other MIG might've been running in us-east1-a, or per project, I'm not sure). So maybe it was either that, or we had some network bandwidth shortage?
kubemark-500 is running in a completely different project - it's 100% unrelated. Regarding kubernetes-scale - we have much higher quotas there and it's definitely fine to run both. Additionally, there is isolation between networks. It really seems like a regression to me.
So we did a lot of investigation today together with @gmarek, and the findings are:
The way we detect whether it is a bad run is by looking at these lines:
A 99th percentile higher than 2s here is definitely a bad sign (it is generally below 1s). The fact that it is high clearly suggests this is a regression in kubelet code. The only PR merged within the previous 5 runs that looks related is this one: @derekwaynecarr @justinsb @vishh @sjenning ^^ I ran kubemark-500 locally 5 times with this PR reverted and all 5 runs were good (i.e. the 99th percentile never went above 1s).
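For illustration, here is a minimal sketch of the kind of threshold check described above - a hypothetical helper, not the actual e2e framework code:

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the p-th percentile (0-100) of the given latencies
// using the nearest-rank method. Hypothetical helper for illustration only.
func percentile(latencies []time.Duration, p float64) time.Duration {
	sorted := append([]time.Duration(nil), latencies...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := int(math.Ceil(p/100.0*float64(len(sorted)))) - 1
	if idx < 0 {
		idx = 0
	}
	return sorted[idx]
}

func main() {
	// Illustrative pod startup latencies collected during a density run.
	samples := []time.Duration{
		700 * time.Millisecond,
		850 * time.Millisecond,
		900 * time.Millisecond,
		2600 * time.Millisecond,
	}
	p99 := percentile(samples, 99)
	if p99 > 2*time.Second {
		fmt.Printf("bad run: perc99 pod startup latency = %v (> 2s)\n", p99)
	} else {
		fmt.Printf("ok: perc99 pod startup latency = %v\n", p99)
	}
}
```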
The last 2 runs are also pretty bad:
So I have one more observation - it seems that whether the run is good or bad depends on whether we first run the load test or the density test (we run both as part of each test run). It doesn't change the fact that it's a regression, but maybe this will help with understanding it?
I have a PR in flight to ensure a pod cgroup is deleted prior to the pod being deleted. Right now the cleanup happens in the background, which may be causing really broad hierarchies?
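As a rough illustration of the synchronous vs. background cleanup being discussed, here is a sketch with hypothetical stand-in types - this is not the real kubelet API:

```go
package main

import (
	"fmt"
	"time"
)

// cgroupManager is a stand-in for the kubelet's pod cgroup manager.
// All names here are hypothetical, not the real kubelet code.
type cgroupManager struct{}

func (cgroupManager) Destroy(name string) error {
	fmt.Println("destroying pod cgroup:", name)
	return nil
}

type pod struct{ uid string }

func stopContainers(p pod) error { return nil }

// deletePodSync removes the pod cgroup *before* the pod is considered
// deleted, so stale per-pod cgroups cannot pile up into a broad hierarchy.
func deletePodSync(cm cgroupManager, p pod) error {
	if err := stopContainers(p); err != nil {
		return err
	}
	if err := cm.Destroy("pod-" + p.uid); err != nil {
		return err
	}
	fmt.Println("pod deleted:", p.uid)
	return nil
}

// deletePodAsync is the "background" variant: the pod is reported deleted
// immediately and the cgroup cleanup is merely queued, which can leave many
// orphaned cgroups around while a test is churning pods on the node.
func deletePodAsync(cm cgroupManager, p pod) {
	go func() {
		_ = cm.Destroy("pod-" + p.uid)
	}()
	fmt.Println("pod deleted (cgroup cleanup still pending):", p.uid)
}

func main() {
	cm := cgroupManager{}
	_ = deletePodSync(cm, pod{uid: "1234"})
	deletePodAsync(cm, pod{uid: "5678"})
	time.Sleep(100 * time.Millisecond) // give the background cleanup a chance to run
}
```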
Which runs better if first? Load or density?
If I'm running density in a loop, I don't see any problems (both with and without your PR). It seems that something is left over from the load test, either in hollow-kubelet or hollow-kubeproxy. These are not objects per se, because we are deleting the whole namespaces (and waiting until they are deleted), but maybe kube-proxy has some backlog of work that it is still doing? Or something like that?
So yeah - running the load test just before density makes it reproducible. We talked with @derekwaynecarr and for now we are going to disable the flag for cgroups. That should solve the regression issue. In the background, I will be working on making those tests more independent.
@wojtek-t - opened PR
@wojtek-t what is the
@sjenning , poke @mffiedler to blast out a cluster for you ;-) You'll likely never see it without a kubemark-500 equiv.
Thanks @timothysc :-) we've got one up and running.
@wojtek-t just ran kubemark on my own cluster. Is 2 seconds e2e normal or high?
@sjenning - 2s is normal. Though - see my comment above: #42000 (comment)
@wojtek-t yes, that is what I'm doing next. Just wondering what a baseline acceptable value was.
Between 2s and 2.5s is normal. Values above 3s suggest a regression.
Ohh - to clarify, the "between 2s and 2.5s being normal" I'm talking about is this:
@wojtek-t so after a load test, then another density test, I could reproduce it: perc99 of 3.42s e2e, with perc100 being 4.52s. Now to figure out how to get the metrics out of the hollow kubelet.
@gmarek ooooooh... no that was not clear. not to me at least. that changes what i'm looking for. thanks for clarifying.
The load test somehow causes a huge CPU usage increase on both hollow-kubelets and hollow-kubeproxies, which stays very high (read: 100%) for quite some time.
@gmarek @wojtek-t @derekwaynecarr ok i'm finally catching up with everyone else. Without PR #41753, the Burstable QoS level cgroup has HOWEVER, @ncdc and I are looking into a situation where my main cluster nodes are chewing 100% CPU while both the main cluster and kubemark cluster are idle after the kubemark test is over.
Yup - this is exactly what we observed. We don't know why it takes so long for components to become idle again though:(
It looks like the kubelet is spinning its wheels constantly trying to remount various pod volumes for some reason
so i am ok disabling pod cgroups for now until burstable cpu shares are set as expected, but we need kubemark to stop hot looping.
It's not just kubemark. It's the kubelet too
Here's a gist of 10 seconds of log data from the kubelet grepping for a single secret: https://gist.github.com/ncdc/8ddb83d5376b472e6bb84bf416a2b3e1
This secret is not mutating in etcd at all
This sounds bad.
@vishh -- do we still want to disable pod cgroups until burstable shares for safety reasons?
I feel it doesn't matter as long as the tests don't flake because of pod cgroups. The QoS cgroups update PR will also land soon, so let's avoid a revert if we can.
Just saw this issue. FYI: @yujuhong @Random-Liu and I were debugging the extra startup latency this morning, and we believe the root cause was identified after switching to CRI: #42081. But the fundamental issue is the artificial API QPS limit (default: 5) applied to all clients, because the API server's scalability limits are unknown. Any change at the node that generates more requests could cause a regression until we fix the API server scalability issue.
that limit is client-side, not server-side... do we know of a specific client that is being broadly shared and hitting the limit? (we should see "Throttling request took ..." messages in logs)
Kubelet only uses one apiserver client. We did see a lot of "Throttling", some even up to 30 seconds.
Yeah, the kubelet is defaulting to 5 (https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/componentconfig/v1alpha1/defaults.go#L354-L357). Other components have different defaults (the scheduler defaults to 50, the controller-manager to 20). We can certainly revisit the kubelet default if we need to.
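For reference, the limit in question is an ordinary client-side setting on the Kubernetes client config. A minimal sketch using client-go (kubeconfig path and values are illustrative, not the kubelet's actual wiring):

```go
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load a kubeconfig; the path is illustrative.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}

	// Client-side rate limit: with QPS=5, requests beyond ~5/s are queued
	// locally, which shows up as "Throttling request took ..." log lines.
	cfg.QPS = 5
	cfg.Burst = 10

	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Printf("created %T with QPS=%v Burst=%v\n", clientset, cfg.QPS, cfg.Burst)
}
```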
We're aware that the kubelet API QPS limit is quite low and that it's the main bottleneck for kubelet throughput right now. And yes, this means that any change in the kubelet's requirements for contacting the API server will be very visible and have a big impact on our performance tests - every change gets multiplied by 5000 nodes, so even a 1 QPS change per kubelet is a 5k QPS change for the API server (i.e. it's way easier/safer to double the QPS limit for e.g. controller-manager than to add even 1 QPS for the kubelet :() - which is exactly why we noticed a need for #42081. That being said, this was unrelated to this issue. We were running those tests in non-CRI mode (i.e. base cluster kubelets were running in non-CRI mode), and IIUC hollow-kubelets are still using the Docker fake, not the CRI one (@shyamjvs @Random-Liu). Indeed, we saw exactly the same error after #42081 was merged: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-kubemark-500-gce/2822/build-log.txt It's unclear to me whether this is a problem only with mocks, or with the real components as well. We'd need to run a load test on a real big cluster to check that.
@gmarek The kubernetes version is v1.7.0-alpha.0.201+46b20acba22d9a (46b20ac), which was merged before #42081
I have a fix for the underlying problem locally. I need to fix unit tests and will send a PR out for review. The problem is present both in the real kube-proxy and in our mock, so the fix should visibly benefit the real kube-proxy too (and the latency of propagating endpoints). I will try to send it by Monday morning PST to have it merged in 1.6.
The PR is out: #42108. It saves 2/3 of the CPU & memory allocations of the "non-iptables"-related code in kube-proxy.
OK, it seems that it was caused by the same thing.
The summary is that:
So cgroups just changed things, but the underlying problem was CPU starvation.
thx @wojtek-t
…pu_usage Automatic merge from submit-queue (batch tested with PRs 40746, 41699, 42108, 42174, 42093) Switch kube-proxy to informers & save 2/3 of cpu & memory of non-iptables related code. Fix kubernetes#42000 This PR should be a no-op from the behavior perspective. It changes KubeProxy to use the standard "informer" framework instead of a combination of reflector + undelta store. This significantly reduces the CPU usage of kube-proxy and the number of memory allocations. Previously, on every endpoints/services update, we were copying __all__ endpoints/services at least 3 times; now it is copied once (which should also be removed in the future). In Kubemark-500, hollow-proxies were processing the backlog from the load test for an hour after the test finished. With this change, they keep up with the load. @thockin @ncdc @derekwaynecarr
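For readers unfamiliar with the informer pattern the PR moves kube-proxy to, here is a rough sketch written against current client-go. Names, the resync period, and the kubeconfig path are illustrative; the actual PR wires this into kube-proxy's config handlers rather than a standalone program:

```go
package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Path is illustrative.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// A shared informer maintains a local cache of Endpoints and delivers
	// per-object change notifications to handlers, instead of the old
	// reflector + undelta store approach that copied the full endpoints
	// list on every update.
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Minute)
	endpointsInformer := factory.Core().V1().Endpoints().Informer()
	endpointsInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			ep := newObj.(*v1.Endpoints)
			fmt.Printf("endpoints updated: %s/%s\n", ep.Namespace, ep.Name)
		},
	})

	stopCh := make(chan struct{})
	factory.Start(stopCh)
	cache.WaitForCacheSync(stopCh, endpointsInformer.HasSynced)

	select {} // a real proxy would run its sync loop here instead of blocking
}
```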
In the last 10 runs of kubemark-500, 4 of them failed with "too high pod-startup-time".
We've never had problems with it before.
The failures are:
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-kubemark-500-gce/2777
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-kubemark-500-gce/2782
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-kubemark-500-gce/2784
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-kubemark-500-gce/2785
Should be investigated.
@shyamjvs @gmarek