Remove 20x factor in garbage-collector qps #63657

Merged
merged 1 commit into kubernetes:master from shyamjvs:remove-gc-qps-bump on May 10, 2018

Conversation

@shyamjvs
Member

shyamjvs commented May 10, 2018

Fixes #63610

I was discussing this offline with @wojtek-t. Of the two options:

  • Increasing the qps 20x to compensate for the earlier "bloated" qps (this implicitly assumes that we have ~20 resource types in the cluster)
  • Keeping the qps the same to make sure that we don't overwhelm the apiserver with the newly re-adjusted qps (which is what seems to happen in our performance tests, where we mostly have just ~1 resource type)

we agreed that the latter seems less risky, as it's probably better to have the GC be slower than to make our API call latencies shoot up.
That said, we can try to increase it later if that's justifiable.

cc @kubernetes/sig-api-machinery-misc @deads2k @wojtek-t

GC is now bound by QPS (it wasn't before) and so if you need more QPS to avoid ratelimiting GC, you'll have to set it.
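(A note on the release note above: the sketch below shows the client-go knobs it refers to. This is not the controller-manager code itself; the kubeconfig path and the QPS/burst numbers are placeholders, and in the real controller manager these values come from the --kube-api-qps and --kube-api-burst flags.)

```go
// Minimal sketch, assuming a standard client-go setup: rest.Config.QPS and
// rest.Config.Burst are the client-side rate limits that now directly cap the
// garbage collector's API traffic, since this PR removes the 20x multiplier.
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path; adjust for your environment.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}

	// Raise the client-side limits if GC calls are getting rate-limited.
	config.QPS = 100   // sustained requests per second (illustrative value)
	config.Burst = 150 // short-term burst allowance (illustrative value)

	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	fmt.Printf("client ready (%T) with QPS=%v burst=%v\n", clientset, config.QPS, config.Burst)
}
```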
@wojtek-t

Member

wojtek-t commented May 10, 2018

@kubernetes/sig-api-machinery-misc @deads2k
To add to what @shyamjvs already wrote - I think the fact that GC was effectively able to reach hundreds of QPS was a bug (not a feature). This PR effectively fixes that bug (potentially making GC slower than it was before).
Once we fix the regression, we can consider bumping it in a more controlled way.

@wojtek-t

Member

wojtek-t commented May 10, 2018

/approve no-issue

[I will let @deads2k at least take a look and comment]

@shyamjvs

Member

shyamjvs commented May 10, 2018

So from the currently running 100-node presubmit, where the density test has already finished, it looks like this fix worked:

Top latency metric: {Resource:pods Subresource: Verb:DELETE Scope:namespace Latency:{Perc50:4.29ms Perc90:6.828ms Perc99:14.858ms Perc100:0s} Count:6232}

It went down all the way from >2s to <15ms :)

(Ref: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/63657/pull-kubernetes-e2e-gce-100-performance/5318)

@deads2k

Contributor

deads2k commented May 10, 2018

I think the fact that GC was effectively able to reach hundreds of QPS limits was a bug (not a feature)

I agree, but our QPS is so imaginary it's hard to decide the "correct" behavior.

I updated the release note to clearly indicate what this change does.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label May 10, 2018

@k8s-ci-robot

Contributor

k8s-ci-robot commented May 10, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, shyamjvs, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-merge-robot

Contributor

k8s-merge-robot commented May 10, 2018

/test all [submit-queue is verifying that this PR is safe to merge]

@wojtek-t

Member

wojtek-t commented May 10, 2018

but our QPS is so imaginary it's hard to decide the "correct" behavior.

That is also true :)
Thanks, @deads2k!

@shyamjvs

Member

shyamjvs commented May 10, 2018

Thanks @deads2k.

@wojtek-t

Member

wojtek-t commented May 10, 2018

/retest

@k8s-merge-robot

Contributor

k8s-merge-robot commented May 10, 2018

Automatic merge from submit-queue (batch tested with PRs 63424, 63657). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-merge-robot k8s-merge-robot merged commit d5ab355 into kubernetes:master May 10, 2018

14 of 16 checks passed

Submit Queue: Required GitHub CI test is not green: pull-kubernetes-e2e-kops-aws
pull-kubernetes-e2e-kops-aws: Job triggered.
cla/linuxfoundation: shyamjvs authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-cross: Skipped
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-e2e-gke: Skipped
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce: Job succeeded.
pull-kubernetes-local-e2e: Skipped
pull-kubernetes-local-e2e-containerized: Skipped
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.

@shyamjvs shyamjvs deleted the shyamjvs:remove-gc-qps-bump branch May 10, 2018

@caesarxuchao

Member

caesarxuchao commented May 18, 2018

The old dynamic client uses one REST client per group version, not per resource (https://github.com/kubernetes/kubernetes/blob/v1.10.2/staging/src/k8s.io/client-go/dynamic/client_pool.go#L102).

core/v1, apps/v1, extensions/v1beta1, batch/v1 are the group versions the garbage collector actively works with, so 4*5 qps is close enough to restore the old GC performance (@deads2k sorry I didn't realize the 20x qps would cause problems when I reviewed your PR).
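(To make that arithmetic concrete, here is a simplified sketch, not the real client_pool.go: each group version gets its own client and therefore its own token bucket, so the aggregate ceiling scales with the number of active group versions. The 5 qps per client is the assumption from the comment above.)

```go
// Simplified illustration of one rate limiter per group version, mirroring the
// old dynamic client's one-REST-client-per-group-version behaviour.
package main

import (
	"fmt"
	"sync"

	"golang.org/x/time/rate"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// perGVPool hands out one rate limiter per GroupVersion, standing in for the
// per-group-version REST clients of the old dynamic client pool.
type perGVPool struct {
	mu       sync.Mutex
	qps      float64
	limiters map[schema.GroupVersion]*rate.Limiter
}

func newPerGVPool(qps float64) *perGVPool {
	return &perGVPool{qps: qps, limiters: make(map[schema.GroupVersion]*rate.Limiter)}
}

func (p *perGVPool) limiterFor(gv schema.GroupVersion) *rate.Limiter {
	p.mu.Lock()
	defer p.mu.Unlock()
	if l, ok := p.limiters[gv]; ok {
		return l
	}
	l := rate.NewLimiter(rate.Limit(p.qps), 1)
	p.limiters[gv] = l
	return l
}

func main() {
	// The four group versions the comment lists as active users of the GC.
	active := []schema.GroupVersion{
		{Group: "", Version: "v1"},
		{Group: "apps", Version: "v1"},
		{Group: "extensions", Version: "v1beta1"},
		{Group: "batch", Version: "v1"},
	}

	pool := newPerGVPool(5) // assumed 5 qps per client, as in the comment

	// Each group version draws from its own bucket, so the aggregate ceiling
	// is roughly 4 * 5 = 20 qps rather than a single shared 5 qps.
	total := 0.0
	for _, gv := range active {
		total += float64(pool.limiterFor(gv).Limit())
	}
	fmt.Printf("aggregate ceiling across %d group versions: ~%.0f qps\n", len(active), total)
}
```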
