
Possible regression in pod-startup-time #42000

Closed · wojtek-t opened this issue Feb 23, 2017 · 48 comments

Labels: kind/failing-test, kind/flake, sig/scalability
Milestone: v1.6

wojtek-t added the kind/flake, kind/failing-test, and sig/scalability labels on Feb 23, 2017
wojtek-t added this to the v1.6 milestone on Feb 23, 2017
@wojtek-t (Member Author) commented:

@kubernetes/sig-scalability-bugs

@wojtek-t (Member Author) commented:

@k8s-oncall - FYI, I will look into it tomorrow (it's something different from the current failure, which should hopefully be fixed by #42003).

@shyamjvs (Member) commented Feb 23, 2017:

FYI, this happened to me too in the afternoon when I was running my own cluster within the kubernetes-scale project. The density test was just taking way too long (even though my cluster was only 3 nodes and 10 hollow-nodes), so I brought down the cluster and ran one in my personal project.

My hypothesis is that there were way too many machines running in the project at that time (~5000 from the gke-large-cluster test and 60 from the usual kubemark-2000 test). So we had 5 MIGs from gke-large (in us-east1-a) and 1 MIG from kubemark-2000 (in us-central1-f). We probably have a limit of 5 MIGs (either per zone, in which case some other MIG might've been running in us-east1-a, or per project, I'm not sure). So maybe it was either that, or some network bandwidth shortage?

@wojtek-t (Member Author) commented:

kubemark-500 is running in a completely different project - it's 100% unrelated.

Regarding kubernetes-scale - we have much higher quotas there, and it's definitely fine to run both. Additionally, there is isolation between the networks.

It really seems like a regression to me.

@wojtek-t (Member Author) commented:

We did a lot of investigation today together with @gmarek, and the findings are:

The way we detect whether a run is bad is by looking at these lines:

I0222 10:00:41.620] Feb 22 10:00:41.613: INFO: 10% worst run-after-schedule latencies: ...
I0222 10:00:41.621] Feb 22 10:00:41.613: INFO: perc50: 788.444452ms, perc90: 1.347778773s, perc99: 2.771830267s

A 99th percentile higher than 2s here is definitely a bad sign (it is generally below 1s).
There are a lot of such runs after this point.
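For reference, a throwaway sketch (an editorial addition, not part of the test harness; the sample line and the 2s threshold are taken from this comment) of pulling the perc99 out of such a build-log line:

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

func main() {
	// Run-after-schedule latency line as it appears in the build log quoted above.
	line := "I0222 10:00:41.621] Feb 22 10:00:41.613: INFO: perc50: 788.444452ms, perc90: 1.347778773s, perc99: 2.771830267s"

	// Extract the perc99 duration and flag the run if it exceeds the 2s threshold.
	re := regexp.MustCompile(`perc99: ([0-9.]+m?s)`)
	m := re.FindStringSubmatch(line)
	if m == nil {
		fmt.Println("no perc99 found")
		return
	}
	p99, err := time.ParseDuration(m[1])
	if err != nil {
		panic(err)
	}
	fmt.Printf("perc99 = %v, bad run: %v\n", p99, p99 > 2*time.Second)
}
```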

The fact that this is high clearly suggests a regression in kubelet code.
@kubernetes/sig-node-bugs

The only PR merged within the previous 5 runs that looks related is this one:
#41349

@derekwaynecarr @justinsb @vishh @sjenning ^^

I tried running kubemark-500 locally 5 times with this PR reverted, and all 5 runs were good (i.e. I didn't see the 99th percentile go above 1s).
So I think we should revert this PR.

@wojtek-t (Member Author) commented:

One more observation - it seems that whether a run is good or bad depends on whether we run the load test or the density test first (we run both as part of each test run).

It doesn't change the fact that it's a regression, but maybe it will help with understanding it?

@derekwaynecarr (Member) commented:

I have a PR in flight to ensure a pod's cgroup is deleted before the pod itself is deleted. Right now that happens in the background, which may be causing really broad hierarchies?

@derekwaynecarr (Member) commented:

Which runs better if first? Load or density?

@wojtek-t (Member Author) commented:

If I'm running density in a loop, I don't see any problems (both with and without your PR).

It seems that something is left over from the load test, either in the hollow-kubelet or the hollow-kubeproxy. It's not objects per se, because we delete the whole namespaces (and wait until they are deleted), but maybe kube-proxy has some backlog of work that it is still doing? Or something like that?

@wojtek-t (Member Author) commented:

So yeah - running the load test just before density makes it reproducible.

We talked with @derekwaynecarr, and for now we are going to disable the flag for cgroups. That should solve the regression.

In the background, I will be working on making those tests more independent.

@derekwaynecarr (Member) commented:

@wojtek-t - opened a PR.

@sjenning (Contributor) commented:

@wojtek-t what is the ginkgo.focus passed to test/kubemark/run-e2e-tests.sh for "load" and "density" tests?

@timothysc (Member) commented:

@sjenning , poke @mffiedler to blast out a cluster for you ;-)

You'll likely never see it without a kubemark-500 equiv.

@ncdc (Member) commented Feb 24, 2017:

Thanks @timothysc :-) we've got one up and running.

@sjenning (Contributor) commented Feb 24, 2017:

@wojtek-t just ran kubemark on my own cluster.

I ran "Density [Feature:Performance] should allow starting 30 pods per node using { ReplicationController} with 0 secrets and 0 daemons" and got:

worst schedule latencies
perc50: 9.307743ms, perc90: 13.122051ms, perc99: 24.634236ms
worst run-after-schedule latencies
perc50: 200.033329ms, perc90: 601.934257ms, perc99: 731.414727ms
worst watch latencies
perc50: 1.025306815s, perc90: 1.695499114s, perc99: 2.046010868s
worst scheduled-to-end total latencies
perc50: 1.129049573s, perc90: 1.737352444s, perc99: 2.043592652s
worst e2e total latencies
perc50: 1.138982067s, perc90: 1.744747286s, perc99: 2.056566743s

Is 2 seconds e2e normal or high?

@wojtek-t (Member Author) commented:

@sjenning - 2s is normal.

Though - see my comment above: #42000 (comment)
It doesn't seem possible to repro it just by running density; it is reproducible by first running load.go and then running density after it.

@sjenning (Contributor) commented:

@wojtek-t yes, that is what I'm doing next. I was just wondering what a baseline acceptable value is.

@wojtek-t (Member Author) commented:

Between 2s and 2.5s is normal. Values above 3s suggest a regression.

@wojtek-t (Member Author) commented:

Ohh - to clarify: the "between 2s and 2.5s is normal" refers to the "worst e2e total latencies".

@sjenning (Contributor) commented Feb 24, 2017:

@wojtek-t so after a load test, then another density test, I could reproduce it:

worst schedule latencies
perc50: 9.392381ms, perc90: 12.978804ms, perc99: 31.069847ms
worst run-after-schedule latencies
perc50: 751.525304ms, perc90: 1.36334556s, perc99: 2.045595845s
worst watch latencies
perc50: 1.0551617s, perc90: 1.669326098s, perc99: 2.231286216s
worst scheduled-to-end total latencies
perc50: 1.732232327s, perc90: 2.496441259s, perc99: 3.413937383s
worst e2e total latencies
perc50: 1.741176838s, perc90: 2.504717641s, perc99: 3.422459192s

perc99 of 3.42s e2e with perc100 being 4.52s

Now to figure out how to get the metrics out of the hollow kubelet.

@sjenning (Contributor) commented:

@gmarek ooooooh... no, that was not clear - not to me at least. That changes what I'm looking for. Thanks for clarifying.

@gmarek (Contributor) commented Feb 24, 2017:

The load test somehow causes a huge CPU usage increase on both the hollow-kubelets and the hollow-kubeproxies, which stays very high (read: 100%) for quite some time.

@sjenning (Contributor) commented:

@gmarek @wojtek-t @derekwaynecarr OK, I'm finally catching up with everyone else.

Without PR #41753, the Burstable QoS level cgroup has cpu.shares = 1024. This means that, under contention with guaranteed pods, all of the burstable pods together (i.e. the hollow-nodes) would proportionally get the same CPU as one guaranteed pod with cpu: 1024m. So CPU starvation could be happening where it wasn't happening before.

HOWEVER, @ncdc and I are looking into a situation where my main cluster nodes are chewing 100% CPU while both the main cluster and the kubemark cluster are idle, after the kubemark test is over:

top - 21:22:11 up  3:54,  2 users,  load average: 102.95, 104.47, 108.35
Tasks: 1151 total,   9 running, 1142 sleeping,   0 stopped,   0 zombie
%Cpu(s): 94.0 us,  5.6 sy,  0.3 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem:  30839500 total, 15158632 used, 15680868 free,   555004 buffers
KiB Swap:        0 total,        0 used,        0 free.  3452916 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND         
 1720 root      20   0 2940092 196976  46016 S  33.8  0.6  92:32.99 kubelet         
17196 root      20   0  718424 144548  39892 S  22.6  0.5  14:09.96 kubemark        
20098 root      20   0  528972 182228  39916 S  22.6  0.6  14:02.18 kubemark        
19417 root      20   0  663044 171052  40652 S  22.3  0.6  14:13.43 kubemark        
19795 root      20   0  574196 156776  40700 S  22.0  0.5  14:09.56 kubemark        
19324 root      20   0  699200 157816  40216 S  19.7  0.5  14:05.28 kubemark        
14722 root      20   0  648736 172952  39796 S  19.4  0.6  14:11.93 kubemark        
15225 root      20   0  710852 163300  40196 R  19.4  0.5  14:07.59 kubemark        
21050 root      20   0  717580 159276  40192 S  19.0  0.5  13:58.52 kubemark        
18086 root      20   0  580224 156096  40852 S  18.4  0.5  14:00.11 kubemark        
18928 root      20   0  633344 151272  40264 S  18.4  0.5  14:00.83 kubemark        
10610 root      20   0  677400 192628  40376 S  18.0  0.6  13:59.43 kubemark        
13056 root      20   0  784836 158268  39332 R  18.0  0.5  13:57.14 kubemark        
20527 root      20   0  637180 159680  40076 R  17.7  0.5  14:04.54 kubemark        
17073 root      20   0  567936 156192  40476 S  17.1  0.5  14:00.54 kubemark        
15857 root      20   0  593792 157932  40016 S  16.7  0.5  13:57.08 kubemark        
17770 root      20   0  563196 153116  40440 R  16.7  0.5  14:05.02 kubemark        
19072 root      20   0  644368 156604  40276 S  16.7  0.5  14:13.21 kubemark        
19751 root      20   0  662204 162348  40508 S  16.7  0.5  14:10.16 kubemark        
21147 root      20   0  628756 152296  39828 R  16.7  0.5  14:10.16 kubemark        
16148 root      20   0  576932 166704  40236 R  16.4  0.5  14:03.58 kubemark        
14554 root      20   0  723444 170592  40204 S  16.1  0.6  14:09.89 kubemark        
22356 root      20   0  523912 170952  39884 S  15.7  0.6  14:03.93 kubemark        
15128 root      20   0  644248 157684  40788 S  15.1  0.5  14:09.08 kubemark        
20825 root      20   0  649720 157160  40428 S  15.1  0.5  14:12.02 kubemark        
21803 root      20   0  702768 152988  40392 S  15.1  0.5  14:05.38 kubemark        
18455 root      20   0  719432 162592  40720 S  14.8  0.5  14:00.64 kubemark        
19365 root      20   0  724780 159392  39660 S  14.8  0.5  14:10.65 kubemark        
 8554 root      20   0  589996 170820  40248 S  14.4  0.6  14:06.76 kubemark        
12798 root      20   0  563772 153904  40076 S  14.4  0.5  14:03.53 kubemark        
14603 root      20   0  646892 162208  39424 S  14.4  0.5  13:55.67 kubemark        
15075 root      20   0  782860 163996  39696 S  14.4  0.5  14:08.49 kubemark   

@gmarek (Contributor) commented Feb 24, 2017:

Yup - this is exactly what we observed. We don't know why it takes so long for the components to become idle again, though :(

@ncdc (Member) commented Feb 24, 2017:

It looks like the kubelet is spinning its wheels constantly trying to remount various pod volumes for some reason

@derekwaynecarr (Member) commented:

So I am OK with disabling pod cgroups for now, until burstable cpu shares are set as expected, but we need kubemark to stop hot-looping.

@ncdc (Member) commented Feb 24, 2017:

It's not just kubemark. It's the kubelet too

@ncdc (Member) commented Feb 24, 2017:

Here's a gist of 10 seconds of log data from the kubelet grepping for a single secret: https://gist.github.com/ncdc/8ddb83d5376b472e6bb84bf416a2b3e1

@ncdc (Member) commented Feb 24, 2017:

This secret is not mutating in etcd at all

@vishh (Contributor) commented Feb 24, 2017:

It looks like the kubelet is spinning its wheels constantly trying to remount various pod volumes for some reason

This sounds bad.

@vishh (Contributor) commented Feb 24, 2017:

@ncdc #42081

@ncdc (Member) commented Feb 24, 2017:

Thanks @vishh. @sjenning you might try pulling that PR, doing a new build, and creating a new cluster to see if that helps (you'll have to tear down and recreate the cluster).

@derekwaynecarr (Member) commented:

@vishh -- do we still want to disable pod cgroups until burstable shares are set as expected, for safety reasons?

@vishh (Contributor) commented Feb 24, 2017 via email.

@dchen1107 (Member) commented:

Just saw this issue. FYI: @yujuhong, @Random-Liu and I were debugging the extra startup latency this morning, and we believe the root cause was identified after switching to CRI: #42081

But the fundamental issue is the artificial API QPS limit (default: 5) applied to all clients, because the API server's scalability limits are unknown. Any change at the node that generates more requests might cause a regression like this if we don't fix the API server scalability issue.

@liggitt (Member) commented Feb 25, 2017:

But the fundamental issue is the artificial API QPS limit (default: 5) applied to all clients, because the API server's scalability limits are unknown. Any change at the node that generates more requests might cause a regression like this if we don't fix the API server scalability issue.

that limit is client-side, not server-side... do we know of a specific client that is being broadly shared and hitting the limit? (we should see "Throttling request took ..." messages in logs)

@Random-Liu (Member) commented:

that limit is client-side, not server-side... do we know of a specific client that is being broadly shared and hitting the limit? (we should see "Throttling request took ..." messages in logs)

Kubelet only uses one apiserver client. We did see a lot of "Throttling" messages, some even up to 30 seconds.

@liggitt (Member) commented Feb 25, 2017:

Yeah, the kubelet defaults to 5 (https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/componentconfig/v1alpha1/defaults.go#L354-L357)

Other components have different defaults (the scheduler defaults to 50, the controller-manager to 20). We can certainly revisit the kubelet default if we need to.
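To make the client-side nature of that limit concrete, here is a minimal sketch (assumed values and a placeholder host, not the actual kubelet wiring) of where the throttle lives in client-go:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// The rate limit is attached to the client configuration, not enforced by
	// the apiserver: every request made through clients built from this config
	// waits on the same token bucket, which is what produces the
	// "Throttling request took ..." messages when it backs up.
	cfg := &rest.Config{
		Host:  "https://127.0.0.1:6443", // placeholder apiserver address
		QPS:   5,                        // the kubelet default being discussed
		Burst: 10,
	}

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	_ = client // every request from this clientset shares the 5 QPS budget

	fmt.Printf("client-side limit: %v QPS (burst %d)\n", cfg.QPS, cfg.Burst)
}
```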

@gmarek (Contributor) commented Feb 25, 2017:

We're aware that the kubelet API QPS limit is quite low now, and that it's the main bottleneck for kubelet throughput right now. And yes, this means that any change in the kubelets' requirements for contacting the API server will be very visible and have a big impact on our performance tests - every change gets multiplied by 5000, so even a 1 QPS change per kubelet is a 5k QPS change for the API server (i.e. it's way easier/safer to double the QPS limit for e.g. the controller-manager than to add even 1 QPS for the kubelet :() - which is exactly why we noticed the need for #42081.

That being said, this is unrelated to this issue. We were running those tests in non-CRI mode (i.e. base cluster kubelets were running in non-CRI mode), and IIUC hollow-kubelets are still using the Docker fake, not the CRI one (@shyamjvs @Random-Liu). Indeed, we saw exactly the same error after #42081 was merged: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-kubemark-500-gce/2822/build-log.txt

It's unclear to me whether this is a problem only with the mocks, or with the real components as well. We'd need to run a load test on a real big cluster to check that.

@Random-Liu (Member) commented Feb 25, 2017:

@gmarek
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-kubemark-500-gce/2822 doesn't include #42081 yet, I believe.

The kubernetes version is v1.7.0-alpha.0.201+46b20acba22d9a (46b20ac), which was merged before #42081.

@wojtek-t (Member Author) commented:

I have a fix for the underlying problem locally. I need to fix the unit tests and then I will send out a PR for review.

The problem is present both in the real kube-proxy and in our mock, so the fix should visibly benefit the real kube-proxy too (and the latency of propagating endpoints).

I will try to send it by Monday morning PST to have it merged in 1.6.

@wojtek-t (Member Author) commented:

The PR is out: #42108

It saves 2/3 of the CPU & memory allocations of the "non-iptables" related code in kube-proxy.

@gmarek (Contributor) commented Feb 27, 2017:

OK, it seems that it was caused by the same thing.

gmarek closed this as completed Feb 27, 2017
@timothysc (Member) commented:

@gmarek What is the concise summary here? There are a number of issues posed, while @wojtek-t's changes are on the proxy. It's not clear to an outside observer what the root cause is.

@wojtek-t (Member Author) commented:

The summary is that:

  • when we run the load test first (without my changes), the kubeproxies are still processing its backlog during the subsequent density test
  • with cgroups enabled, this affects the kubelet's ability to start pods, due to lack of CPU.

So cgroups just changed things, but the underlying problem was CPU starvation.

@timothysc (Member) commented:

thx @wojtek-t

copejon pushed a commit to copejon/kubernetes that referenced this issue Feb 28, 2017
…pu_usage

Automatic merge from submit-queue (batch tested with PRs 40746, 41699, 42108, 42174, 42093)

Switch kube-proxy to informers & save 2/3 of cpu & memory of non-iptables related code.

Fix kubernetes#42000

This PR should be a no-op from the behavior perspective.
It changes KubeProxy to use the standard "informer" framework instead of a combination of reflector + undelta store.

This significantly reduces kube-proxy's CPU usage and its number of memory allocations.
Previously, on every endpoints/service update, we were copying __all__ endpoints/services at least 3 times; now it is once (which should also be removed in the future).

In Kubemark-500, hollow-proxies were processing backlog from the load test for an hour after the test had finished. With this change, kube-proxy keeps up with the load.
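For anyone unfamiliar with the pattern referred to here, a minimal sketch of a shared Endpoints informer (illustrative only, not the actual kube-proxy code; the kubeconfig path is a placeholder):

```go
package main

import (
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The informer maintains a local cache and delivers incremental add/update/delete
	// events, instead of handing the consumer a fresh copy of every object on each change.
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	endpointsInformer := factory.Core().V1().Endpoints().Informer()
	endpointsInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { _, _ = obj.(*v1.Endpoints) },
		UpdateFunc: func(oldObj, newObj interface{}) { _, _ = newObj.(*v1.Endpoints) },
		DeleteFunc: func(obj interface{}) {},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	cache.WaitForCacheSync(stop, endpointsInformer.HasSynced)

	select {} // keep processing events
}
```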

@thockin @ncdc @derekwaynecarr