
Possible regression in pod-startup-time #42000

Closed · wojtek-t opened this issue Feb 23, 2017 · 48 comments

Labels: kind/failing-test, kind/flake, sig/scalability
Milestone: v1.6

wojtek-t added the kind/flake, kind/failing-test, and sig/scalability labels on Feb 23, 2017
wojtek-t added this to the v1.6 milestone on Feb 23, 2017
@wojtek-t (Member Author) commented:

@kubernetes/sig-scalability-bugs

@wojtek-t (Member Author) commented:

@k8s-oncall - FYI, I will look into it tomorrow (it's something different from the current failure, which should hopefully be fixed by #42003).

@shyamjvs (Member) commented Feb 23, 2017:

FYI, this happened to me too in the afternoon when I was running my own cluster within the kubernetes-scale project. The density test was just taking way too long (even though my cluster was only 3 nodes and 10 hollow-nodes), so I brought down the cluster and ran one in my personal project.

My hypothesis is that there were way too many machines running in the project at that time (~5000 from the gke-large-cluster test and 60 from the usual kubemark-2000 test). So we had 5 MIGs from gke-large (in us-east1-a) and 1 MIG from kubemark-2000 (in us-central1-f). We probably have a limit of 5 MIGs (either per zone, in which case some other MIG might've been running in us-east1-a, or per project, I'm not sure). So maybe it was either that, or some network bandwidth shortage?

@wojtek-t (Member Author) commented:

kubemark-500 is running in a completely different project - it's 100% unrelated.

Regarding kubernetes-scale - we have much higher quotas there, and it's definitely fine to run both. Additionally, there is isolation between the networks.

It really seems like a regression to me.

@wojtek-t (Member Author) commented:

We did a lot of investigation today together with @gmarek, and the findings are:

The way we detect whether a run is bad is by looking at these lines:

I0222 10:00:41.620] Feb 22 10:00:41.613: INFO: 10% worst run-after-schedule latencies: ...
I0222 10:00:41.621] Feb 22 10:00:41.613: INFO: perc50: 788.444452ms, perc90: 1.347778773s, perc99: 2.771830267s

A 99th percentile higher than 2s here is definitely a bad sign (it is generally below 1s).
There are a lot of such runs after this point.
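For reference, a throwaway sketch (an editorial addition, not part of the test harness; the sample line and the 2s threshold are taken from this comment) of pulling the perc99 out of such a build-log line:

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

func main() {
	// Run-after-schedule latency line as it appears in the build log quoted above.
	line := "I0222 10:00:41.621] Feb 22 10:00:41.613: INFO: perc50: 788.444452ms, perc90: 1.347778773s, perc99: 2.771830267s"

	// Extract the perc99 duration and flag the run if it exceeds the 2s threshold.
	re := regexp.MustCompile(`perc99: ([0-9.]+m?s)`)
	m := re.FindStringSubmatch(line)
	if m == nil {
		fmt.Println("no perc99 found")
		return
	}
	p99, err := time.ParseDuration(m[1])
	if err != nil {
		panic(err)
	}
	fmt.Printf("perc99 = %v, bad run: %v\n", p99, p99 > 2*time.Second)
}
```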

The fact that this is high clearly suggests a regression in kubelet code.
@kubernetes/sig-node-bugs

The only PR merged within the previous 5 runs that looks related is this one:
#41349

@derekwaynecarr @justinsb @vishh @sjenning ^^

I tried running kubemark-500 locally 5 times with this PR reverted, and all 5 runs were good (i.e. I didn't see the 99th percentile go above 1s).
So I think we should revert this PR.

@wojtek-t (Member Author) commented:

One more observation - it seems that whether a run is good or bad depends on whether we run the load test or the density test first (we run both as part of each test run).

It doesn't change the fact that it's a regression, but maybe it will help with understanding it?

@derekwaynecarr (Member) commented:

I have a PR in flight to ensure a pod's cgroup is deleted before the pod itself is deleted. Right now that happens in the background, which may be causing really broad hierarchies?

@derekwaynecarr (Member) commented:

Which runs better if first? Load or density?

@wojtek-t (Member Author) commented:

If I'm running density in a loop, I don't see any problems (both with and without your PR).

It seems that something is left over from the load test, either in the hollow-kubelet or the hollow-kubeproxy. It's not objects per se, because we delete the whole namespaces (and wait until they are deleted), but maybe kube-proxy has some backlog of work that it is still doing? Or something like that?

@wojtek-t (Member Author) commented:

So yeah - running the load test just before density makes it reproducible.

We talked with @derekwaynecarr, and for now we are going to disable the flag for cgroups. That should solve the regression.

In the background, I will be working on making those tests more independent.

@derekwaynecarr (Member) commented:

@wojtek-t - opened a PR.

@sjenning (Contributor) commented:

@wojtek-t what is the ginkgo.focus passed to test/kubemark/run-e2e-tests.sh for "load" and "density" tests?

@timothysc (Member) commented:

@sjenning , poke @mffiedler to blast out a cluster for you ;-)

You'll likely never see it without a kubemark-500 equiv.

@ncdc (Member) commented Feb 24, 2017:

Thanks @timothysc :-) we've got one up and running.

@sjenning (Contributor) commented Feb 24, 2017:

@wojtek-t just ran kubemark on my own cluster.

I ran "Density [Feature:Performance] should allow starting 30 pods per node using { ReplicationController} with 0 secrets and 0 daemons" and got:

worst schedule latencies
perc50: 9.307743ms, perc90: 13.122051ms, perc99: 24.634236ms
worst run-after-schedule latencies
perc50: 200.033329ms, perc90: 601.934257ms, perc99: 731.414727ms
worst watch latencies
perc50: 1.025306815s, perc90: 1.695499114s, perc99: 2.046010868s
worst scheduled-to-end total latencies
perc50: 1.129049573s, perc90: 1.737352444s, perc99: 2.043592652s
worst e2e total latencies
perc50: 1.138982067s, perc90: 1.744747286s, perc99: 2.056566743s

Is 2 seconds e2e normal or high?

@wojtek-t (Member Author) commented:

@sjenning - 2s is normal.

Though - see my comment above: #42000 (comment)
It doesn't seem possible to repro it just by running density; it is reproducible by first running load.go and then running density after it.

@sjenning (Contributor) commented:

@wojtek-t yes, that is what I'm doing next. I was just wondering what a baseline acceptable value is.

@wojtek-t (Member Author) commented:

Between 2s and 2.5s is normal. Values above 3s suggest a regression.

@wojtek-t (Member Author) commented:

Ohh - to clarify: the "between 2s and 2.5s is normal" refers to the "worst e2e total latencies".

@sjenning (Contributor) commented Feb 24, 2017:

@wojtek-t so after a load test, then another density test, I could reproduce it:

worst schedule latencies
perc50: 9.392381ms, perc90: 12.978804ms, perc99: 31.069847ms
worst run-after-schedule latencies
perc50: 751.525304ms, perc90: 1.36334556s, perc99: 2.045595845s
worst watch latencies
perc50: 1.0551617s, perc90: 1.669326098s, perc99: 2.231286216s
worst scheduled-to-end total latencies
perc50: 1.732232327s, perc90: 2.496441259s, perc99: 3.413937383s
worst e2e total latencies
perc50: 1.741176838s, perc90: 2.504717641s, perc99: 3.422459192s

perc99 of 3.42s e2e with perc100 being 4.52s

Now to figure out how to get the metrics out of the hollow kubelet.

@sjenning (Contributor) commented:

@gmarek ooooooh... no, that was not clear - not to me at least. That changes what I'm looking for. Thanks for clarifying.

@gmarek (Contributor) commented Feb 24, 2017:

The load test somehow causes a huge CPU usage increase on both the hollow-kubelets and the hollow-kubeproxies, which stays very high (read: 100%) for quite some time.

@sjenning (Contributor) commented:

@gmarek @wojtek-t @derekwaynecarr OK, I'm finally catching up with everyone else.

Without PR #41753, the Burstable QoS level cgroup has cpu.shares = 1024. This means that, under contention with guaranteed pods, all of the burstable pods together (i.e. the hollow-nodes) would proportionally get the same CPU as one guaranteed pod with cpu: 1024m. So CPU starvation could be happening where it wasn't happening before.

HOWEVER, @ncdc and I are looking into a situation where my main cluster nodes are chewing 100% CPU while both the main cluster and the kubemark cluster are idle, after the kubemark test is over:

top - 21:22:11 up  3:54,  2 users,  load average: 102.95, 104.47, 108.35
Tasks: 1151 total,   9 running, 1142 sleeping,   0 stopped,   0 zombie
%Cpu(s): 94.0 us,  5.6 sy,  0.3 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem:  30839500 total, 15158632 used, 15680868 free,   555004 buffers
KiB Swap:        0 total,        0 used,        0 free.  3452916 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND         
 1720 root      20   0 2940092 196976  46016 S  33.8  0.6  92:32.99 kubelet         
17196 root      20   0  718424 144548  39892 S  22.6  0.5  14:09.96 kubemark        
20098 root      20   0  528972 182228  39916 S  22.6  0.6  14:02.18 kubemark        
19417 root      20   0  663044 171052  40652 S  22.3  0.6  14:13.43 kubemark        
19795 root      20   0  574196 156776  40700 S  22.0  0.5  14:09.56 kubemark        
19324 root      20   0  699200 157816  40216 S  19.7  0.5  14:05.28 kubemark        
14722 root      20   0  648736 172952  39796 S  19.4  0.6  14:11.93 kubemark        
15225 root      20   0  710852 163300  40196 R  19.4  0.5  14:07.59 kubemark        
21050 root      20   0  717580 159276  40192 S  19.0  0.5  13:58.52 kubemark        
18086 root      20   0  580224 156096  40852 S  18.4  0.5  14:00.11 kubemark        
18928 root      20   0  633344 151272  40264 S  18.4  0.5  14:00.83 kubemark        
10610 root      20   0  677400 192628  40376 S  18.0  0.6  13:59.43 kubemark        
13056 root      20   0  784836 158268  39332 R  18.0  0.5  13:57.14 kubemark        
20527 root      20   0  637180 159680  40076 R  17.7  0.5  14:04.54 kubemark        
17073 root      20   0  567936 156192  40476 S  17.1  0.5  14:00.54 kubemark        
15857 root      20   0  593792 157932  40016 S  16.7  0.5  13:57.08 kubemark        
17770 root      20   0  563196 153116  40440 R  16.7  0.5  14:05.02 kubemark        
19072 root      20   0  644368 156604  40276 S  16.7  0.5  14:13.21 kubemark        
19751 root      20   0  662204 162348  40508 S  16.7  0.5  14:10.16 kubemark        
21147 root      20   0  628756 152296  39828 R  16.7  0.5  14:10.16 kubemark        
16148 root      20   0  576932 166704  40236 R  16.4  0.5  14:03.58 kubemark        
14554 root      20   0  723444 170592  40204 S  16.1  0.6  14:09.89 kubemark        
22356 root      20   0  523912 170952  39884 S  15.7  0.6  14:03.93 kubemark        
15128 root      20   0  644248 157684  40788 S  15.1  0.5  14:09.08 kubemark        
20825 root      20   0  649720 157160  40428 S  15.1  0.5  14:12.02 kubemark        
21803 root      20   0  702768 152988  40392 S  15.1  0.5  14:05.38 kubemark        
18455 root      20   0  719432 162592  40720 S  14.8  0.5  14:00.64 kubemark        
19365 root      20   0  724780 159392  39660 S  14.8  0.5  14:10.65 kubemark        
 8554 root      20   0  589996 170820  40248 S  14.4  0.6  14:06.76 kubemark        
12798 root      20   0  563772 153904  40076 S  14.4  0.5  14:03.53 kubemark        
14603 root      20   0  646892 162208  39424 S  14.4  0.5  13:55.67 kubemark        
15075 root      20   0  782860 163996  39696 S  14.4  0.5  14:08.49 kubemark   

@gmarek (Contributor) commented Feb 24, 2017:

Yup - this is exactly what we observed. We don't know why it takes so long for the components to become idle again, though :(

@ncdc (Member) commented Feb 24, 2017:

It looks like the kubelet is spinning its wheels constantly trying to remount various pod volumes for some reason

@derekwaynecarr (Member) commented:

So I am OK with disabling pod cgroups for now, until burstable cpu shares are set as expected, but we need kubemark to stop hot-looping.

@ncdc (Member) commented Feb 24, 2017:

It's not just kubemark. It's the kubelet too

@ncdc (Member) commented Feb 24, 2017:

Here's a gist of 10 seconds of log data from the kubelet grepping for a single secret: https://gist.github.com/ncdc/8ddb83d5376b472e6bb84bf416a2b3e1

@ncdc (Member) commented Feb 24, 2017:

This secret is not mutating in etcd at all

@vishh (Contributor) commented Feb 24, 2017:

It looks like the kubelet is spinning its wheels constantly trying to remount various pod volumes for some reason

This sounds bad.

@vishh (Contributor) commented Feb 24, 2017:

@ncdc #42081

@ncdc (Member) commented Feb 24, 2017:

Thanks @vishh. @sjenning you might try pulling that PR, doing a new build, and creating a new cluster to see if that helps (you'll have to tear down and recreate the cluster).

@derekwaynecarr (Member) commented:

@vishh -- do we still want to disable pod cgroups until burstable shares are set as expected, for safety reasons?

@vishh (Contributor) commented Feb 24, 2017 via email.

@dchen1107 (Member) commented:

Just saw this issue. FYI: @yujuhong, @Random-Liu and I were debugging the extra startup latency this morning, and we believe the root cause was identified after switching to CRI: #42081

But the fundamental issue is the artificial API QPS limit (default: 5) applied to all clients, because the API server's scalability limits are unknown. Any change at the node that generates more requests might cause a regression like this if we don't fix the API server scalability issue.

@liggitt (Member) commented Feb 25, 2017:

But the fundamental issue is the artificial API QPS limit (default: 5) applied to all clients, because the API server's scalability limits are unknown. Any change at the node that generates more requests might cause a regression like this if we don't fix the API server scalability issue.

that limit is client-side, not server-side... do we know of a specific client that is being broadly shared and hitting the limit? (we should see "Throttling request took ..." messages in logs)

@Random-Liu (Member) commented:

that limit is client-side, not server-side... do we know of a specific client that is being broadly shared and hitting the limit? (we should see "Throttling request took ..." messages in logs)

Kubelet only uses one apiserver client. We did see a lot of "Throttling" messages, some even up to 30 seconds.

@liggitt (Member) commented Feb 25, 2017:

Yeah, the kubelet defaults to 5 (https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/componentconfig/v1alpha1/defaults.go#L354-L357)

Other components have different defaults (the scheduler defaults to 50, the controller-manager to 20). We can certainly revisit the kubelet default if we need to.
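To make the client-side nature of that limit concrete, here is a minimal sketch (assumed values and a placeholder host, not the actual kubelet wiring) of where the throttle lives in client-go:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// The rate limit is attached to the client configuration, not enforced by
	// the apiserver: every request made through clients built from this config
	// waits on the same token bucket, which is what produces the
	// "Throttling request took ..." messages when it backs up.
	cfg := &rest.Config{
		Host:  "https://127.0.0.1:6443", // placeholder apiserver address
		QPS:   5,                        // the kubelet default being discussed
		Burst: 10,
	}

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	_ = client // every request from this clientset shares the 5 QPS budget

	fmt.Printf("client-side limit: %v QPS (burst %d)\n", cfg.QPS, cfg.Burst)
}
```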

@gmarek (Contributor) commented Feb 25, 2017:

We're aware that the kubelet API QPS limit is quite low now, and that it's the main bottleneck for kubelet throughput right now. And yes, this means that any change in the kubelets' requirements for contacting the API server will be very visible and have a big impact on our performance tests - every change gets multiplied by 5000, so even a 1 QPS change per kubelet is a 5k QPS change for the API server (i.e. it's way easier/safer to double the QPS limit for e.g. the controller-manager than to add even 1 QPS for the kubelet :() - which is exactly why we noticed the need for #42081.

That being said, this is unrelated to this issue. We were running those tests in non-CRI mode (i.e. base cluster kubelets were running in non-CRI mode), and IIUC hollow-kubelets are still using the Docker fake, not the CRI one (@shyamjvs @Random-Liu). Indeed, we saw exactly the same error after #42081 was merged: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-kubemark-500-gce/2822/build-log.txt

It's unclear to me whether this is a problem only with the mocks, or with the real components as well. We'd need to run a load test on a real big cluster to check that.

@Random-Liu (Member) commented Feb 25, 2017:

@gmarek
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-kubemark-500-gce/2822 doesn't include #42081 yet, I believe.

The kubernetes version is v1.7.0-alpha.0.201+46b20acba22d9a (46b20ac), which was merged before #42081.

@wojtek-t (Member Author) commented:

I have a fix for the underlying problem locally. I need to fix the unit tests and then I will send out a PR for review.

The problem is present both in the real kube-proxy and in our mock, so the fix should visibly benefit the real kube-proxy too (and the latency of propagating endpoints).

I will try to send it by Monday morning PST to have it merged in 1.6.

@wojtek-t (Member Author) commented:

The PR is out: #42108

It saves 2/3 of the CPU & memory allocations of the "non-iptables" related code in kube-proxy.

@gmarek (Contributor) commented Feb 27, 2017:

OK, it seems that it was caused by the same thing.

gmarek closed this as completed Feb 27, 2017
@timothysc (Member) commented:

@gmarek What is the concise summary here? There are a number of issues posed, while @wojtek-t's changes are on the proxy. It's not clear to an outside observer what the root cause is.

@wojtek-t (Member Author) commented:

The summary is that:

  • when we run the load test first (without my changes), the kubeproxies are still processing its backlog during the subsequent density test
  • with cgroups enabled, this affects the kubelet's ability to start pods, due to lack of CPU.

So cgroups just changed things, but the underlying problem was CPU starvation.

@timothysc (Member) commented:

thx @wojtek-t

copejon pushed a commit to copejon/kubernetes that referenced this issue Feb 28, 2017
…pu_usage

Automatic merge from submit-queue (batch tested with PRs 40746, 41699, 42108, 42174, 42093)

Switch kube-proxy to informers & save 2/3 of cpu & memory of non-iptables related code.

Fix kubernetes#42000

This PR should be a no-op from the behavior perspective.
It changes KubeProxy to use the standard "informer" framework instead of a combination of reflector + undelta store.

This significantly reduces kube-proxy's CPU usage and its number of memory allocations.
Previously, on every endpoints/service update, we were copying __all__ endpoints/services at least 3 times; now it is once (which should also be removed in the future).

In Kubemark-500, hollow-proxies were processing backlog from the load test for an hour after the test had finished. With this change, kube-proxy keeps up with the load.
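For anyone unfamiliar with the pattern referred to here, a minimal sketch of a shared Endpoints informer (illustrative only, not the actual kube-proxy code; the kubeconfig path is a placeholder):

```go
package main

import (
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The informer maintains a local cache and delivers incremental add/update/delete
	// events, instead of handing the consumer a fresh copy of every object on each change.
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	endpointsInformer := factory.Core().V1().Endpoints().Informer()
	endpointsInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { _, _ = obj.(*v1.Endpoints) },
		UpdateFunc: func(oldObj, newObj interface{}) { _, _ = newObj.(*v1.Endpoints) },
		DeleteFunc: func(obj interface{}) {},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	cache.WaitForCacheSync(stop, endpointsInformer.HasSynced)

	select {} // keep processing events
}
```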

@thockin @ncdc @derekwaynecarr