Scheduler CPU usage hurting scalability? #70708

Closed
jberkus opened this Issue Nov 6, 2018 · 29 comments

Comments

@jberkus

jberkus commented Nov 6, 2018

CI testing upgraded to golang 1.12.2 around 4am last night.

After that, the next two gce-100 test runs failed: https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability/19422

This was followed by two runs whose runtime was slightly (2 min) above average for the test: https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability/19424

We need to verify whether the golang upgrade had a measurable effect on performance, or whether this is just a continuation of issue #69600. Unfortunately, the perf dashboard is no help here (issue on that to come).

/kind flake
/priority important-soon
/sig scalability

attn:
@wojtek-t
@shyamjvs
@cblecker
@AishSundar

@AishSundar

Contributor

AishSundar commented Nov 6, 2018

FWIW, there were runs before the golang upgrade that ventured into the 43m mark as well, but they were an aberration rather than the norm.

https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability/19395

We are watching the job closely to see if the timing comes down at all or not.

@jberkus

jberkus commented Nov 6, 2018

So times are continuing to average higher -- almost 10% longer runs than prior to the golang change.

Sadly, we can't get better granularity from the perf dashboard, so at this point we don't know the cause, and it looks like a blocker.

Escalating priority.

/priority critical-urgent

@jberkus

jberkus commented Nov 6, 2018

/remove-priority important-soon

@jberkus

jberkus commented Nov 6, 2018

/assign @wojtek-t

@AishSundar

Contributor

AishSundar commented Nov 6, 2018

@wojtek-t This is currently blocking the Beta cut for 1.13. Please investigate this as a priority. Thanks

@BenTheElder

Member

BenTheElder commented Nov 7, 2018

CI testing upgraded to golang 1.12.2 around 4am last night.

1.11.2* ?

@wojtek-t

Member

wojtek-t commented Nov 7, 2018

@jberkus @AishSundar - I looked into the tests and I don't see a correlation with the golang upgrade.
The metrics look fine to me.

The problem that we observe is that the scheduler is using more CPU now.
It started happening more than a week ago, so it's not related to the 1.11.2 golang upgrade (I don't know when we upgraded to 1.11.1, though).

I looked a bit into scheduler changes from the last 2 weeks or so, and the only PRs that looked potentially a little bit suspicious are:
#70605
#70366

@bsalamat :

  1. does the scheduler benchmark gather CPU usage so that we can verify that? [I'm afraid it doesn't; just wanted to be sure. A rough sketch of the kind of measurement I mean is below.]
  2. are you aware of any change in the last couple of weeks that may result in higher CPU usage?

@jberkus @AishSundar - I don't consider this to be serious enough to be a beta release blocker.
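
For reference, a minimal sketch (hypothetical code, not the actual scheduler_perf benchmark) of what gathering CPU usage in a Go benchmark could look like; `scheduleOnePod` is a stand-in for the real scheduling call, and the file would live alongside the benchmark as a `*_test.go` file:

```go
// Hypothetical sketch only: measure the process's user+system CPU time
// (via getrusage, Linux) around the benchmark loop and log it per
// scheduling attempt.
package bench

import (
	"syscall"
	"testing"
)

// cpuSeconds returns the total user+system CPU time this process has consumed so far.
func cpuSeconds() float64 {
	var ru syscall.Rusage
	if err := syscall.Getrusage(syscall.RUSAGE_SELF, &ru); err != nil {
		return 0
	}
	sec := func(tv syscall.Timeval) float64 {
		return float64(tv.Sec) + float64(tv.Usec)/1e6
	}
	return sec(ru.Utime) + sec(ru.Stime)
}

// scheduleOnePod is a placeholder so the sketch compiles on its own.
func scheduleOnePod() {}

func BenchmarkSchedulingCPU(b *testing.B) {
	start := cpuSeconds()
	for i := 0; i < b.N; i++ {
		scheduleOnePod()
	}
	perOp := (cpuSeconds() - start) / float64(b.N)
	b.Logf("cpu-seconds/op: %g", perOp)
}
```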

@wojtek-t

Member

wojtek-t commented Nov 7, 2018

I'm not lowering the priority, as I would like to see the results of today's gce-scale-performance suite run.

@AishSundar

Contributor

AishSundar commented Nov 7, 2018

Thanks @wojtek-t for the investigation. Go was updated to 1.11.1 on Oct 3rd; the last Go 1.10.4 update was on Sep 18th.

One question towards the Beta cut: should we hold Beta until today's gce-scale-performance suite runs, or do you think we can proceed with Beta before that, given the signal we got from the gce-100 job?

@bsalamat

Contributor

bsalamat commented Nov 7, 2018

It is likely caused by #70366. I was afraid that it could potentially cause higher CPU usage in our benchmarks where there are not many unschedulable pods. We should revert it.

@AishSundar

Contributor

AishSundar commented Nov 7, 2018

@bsalamat thanks for the quick response. Do you want to use this issue to track the revert PR? I am assigning this to you for now.

/assign @bsalamat

@AishSundar

Contributor

AishSundar commented Nov 8, 2018

Thanks for the revert PR @bsalamat.

@wojtek-t meanwhile we didn't get a run of the large scale-performance job today. Can you please look into what's going on? Thanks

@BenTheElder

Member

BenTheElder commented Nov 8, 2018

meanwhile we didn't get a run of the large scale-performance job today. Can you please look into what's going on? Thanks

I looked into this; the job pod was evicted, and the rescheduled pod is still running AFAICT.

We've seen this before, but not recently. I'd say that's a broader prow / k8s issue. Filing an issue to do more to mitigate it.

@AishSundar

Contributor

AishSundar commented Nov 8, 2018

Thanks @BenTheElder, that's mighty helpful. How do we get the job back in a good state? Can we assume the rescheduled pod is good and that the run will complete?

@BenTheElder

Member

BenTheElder commented Nov 8, 2018

Not sure; it might be OK. In the past we had issues with the previous pod's resources breaking the rescheduled pod. It should either be in an OK state now, or the next run will be.

Metadata for 252 (2018-11-07 14:22 PST) and 251 (2018-11-07 00:01 PST) are both pod / ProwJob ID 4bb6bf52-e263-11e8-bcaa-0a580a6c0209, so it rescheduled about 2 hours in.

@krzyzacy

Member

krzyzacy commented Nov 8, 2018

we can make a GIANT dedicated node for scalability jobs...
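
For what it's worth, a rough sketch (hypothetical, using the k8s.io/api corev1 types; the label, taint, and image names here are made up) of how a scalability job pod could be pinned to such a dedicated pool with a nodeSelector plus a toleration for a matching taint:

```go
// Hypothetical illustration only, not the actual test-infra config:
// keep scalability jobs on a dedicated node pool by combining a
// nodeSelector with a toleration for a taint placed on that pool.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func scalabilityPodSpec() corev1.PodSpec {
	return corev1.PodSpec{
		// Only nodes labeled for scalability jobs are eligible.
		NodeSelector: map[string]string{
			"dedicated": "scalability", // assumed label on the dedicated pool
		},
		// Tolerate the taint that keeps everything else off those nodes.
		Tolerations: []corev1.Toleration{{
			Key:      "dedicated",
			Operator: corev1.TolerationOpEqual,
			Value:    "scalability",
			Effect:   corev1.TaintEffectNoSchedule,
		}},
		Containers: []corev1.Container{{
			Name:  "scalability-test",
			Image: "gcr.io/k8s-testimages/kubekins-e2e", // placeholder image name
		}},
	}
}

func main() {
	fmt.Printf("%+v\n", scalabilityPodSpec())
}
```

The dedicated pool would carry the matching `dedicated=scalability:NoSchedule` taint (names assumed here) so that no other workloads land on it.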

@AishSundar

Contributor

AishSundar commented Nov 8, 2018

@wojtek-t #70776, which reverts the problematic feature, merged yesterday and we have some new runs: https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-scalability-100

We also got a new large scale run for 11/7: https://testgrid.k8s.io/sig-release-master-blocking#gce-master-scale-performance

Can you check that it looks OK from a performance and CPU-consumption standpoint and then close this issue? Thanks

@jberkus jberkus changed the title from Did golang upgrade hurt scalability measureably? to Scheduler CPU usage hurting scalability? Nov 9, 2018

@jberkus

jberkus commented Nov 9, 2018

Not solved; we're getting some gce-100 failures because of this now, for example this one:

https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability/19500

@jberkus

jberkus commented Nov 9, 2018

/remove-priority important-soon
/milestone v1.13

@k8s-ci-robot k8s-ci-robot added this to the v1.13 milestone Nov 9, 2018

@jberkus

jberkus commented Nov 9, 2018

It looks like this flake started getting a lot more frequent on either 10/4 or 10/5: https://storage.googleapis.com/k8s-gubernator/triage/index.html?date=2018-10-10&text=kube-scheduler%20is%20using

@bsalamat

Contributor

bsalamat commented Nov 9, 2018

I am reverting the other suspicious PR as well.

@AishSundar

Contributor

AishSundar commented Nov 10, 2018

Looks like the latest revert #70837 worked! The gce-cos-master-scalability-100 job has been stable over the last day.

@wojtek-t can you check and close this issue if you deem it fixed?

@AishSundar

Contributor

AishSundar commented Nov 13, 2018

@wojtek-t are we OK to close this issue for 1.13? We had newer runs of scale-correctness, scale-performance, and scalability-100 in the past few days. From looking at the overall timing, there seems to be no regression, but can you look at the perf dashboards and advise accordingly?

@mariantalla

Contributor

mariantalla commented Nov 14, 2018

Hey @wojtek-t and @bsalamat, just a heads-up that this recently failed. Here are the testgrid and gubernator links.

It's probably a flake (I think so because the diff shows only a changelog change), but it may be worth a look as the error message is interesting.

Thoughts?

@bsalamat

Contributor

bsalamat commented Nov 14, 2018

There have not been any notable changes recently, and the test has passed after the failure. I looked at the error message; it was about a longer startup time than expected. Given that the tests have passed since then, I guess it was a flake.

@jberkus

jberkus commented Nov 15, 2018

We are still seeing this flake in gce-cos-scalability-100, but at this point I'm going to close this issue in favor of a new issue focused only on the current flake: #71760

/close

@k8s-ci-robot

Contributor

k8s-ci-robot commented Nov 15, 2018

@jberkus: Closing this issue.

In response to this:

We are still seeing this flake in gce-cos-scalability-100, but at this point I'm going to close this issue in favor of a new issue focused only on the current flake: #71760

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
