fix #51135 make CFS quota period configurable #63437

Merged
merged 1 commit into from Sep 1, 2018

Conversation

@szuecs
Contributor

szuecs commented May 4, 2018

What this PR does / why we need it:

This PR makes it possible for users to change the CFS quota period from the default 100ms to some other value between 1µs and 1s.
#51135 shows that multiple production users have serious issues running reasonable workloads in Kubernetes: the 100ms CFS quota period adds far too much latency.

Which issue(s) this PR fixes:
Fixes #51135

Special notes for your reviewer:

Release note:

Adds a kubelet parameter and config option to change CFS quota period from the default 100ms to some other value between 1µs and 1s. This was done to improve response latencies for workloads running in clusters with guaranteed and burstable QoS classes.  
@liggitt

Member

liggitt commented May 4, 2018

@derekwaynecarr

Member

derekwaynecarr commented May 4, 2018

The value used today is the default in many OS distributions.

There are many users running with 100ms today that could see a change in behavior.

I was not at KubeCon, so I missed some context for the discussion. I prefer we have a kubelet flag to tweak the desired cfs_period_us setting on a Linux host rather than hard-coding it. Red Hat has customers that disable CFS quota entirely via the existing flag. I think it's reasonable to pair that flag with the additional option to tweak the default CFS period. I do not think we need to let it be tweaked per pod.

@dchen1107 @vishh

/hold

@dims

Member

dims commented May 4, 2018

@derekwaynecarr The "cpu-cfs-quota" command line flag in kubelet, right? So we would have a "cpu-cfs-quota-period" flag to go along with it?

@derekwaynecarr

Member

derekwaynecarr commented May 4, 2018

@dims

Member

dims commented May 4, 2018

@szuecs looking forward to an update in this PR with the suggestion above

@derekwaynecarr

Member

derekwaynecarr commented May 5, 2018

see my comment here for why a flag is preferred. Not all nodes are tuned for latency.

#51135 (comment)

@szuecs

Contributor Author

szuecs commented May 5, 2018

Thanks for all your comments. I will probably work on it next week. We have a short week, so I am not sure whether it will take longer than that.

@k8s-ci-robot k8s-ci-robot added size/L and removed size/XS labels May 7, 2018

@szuecs

Contributor Author

szuecs commented May 7, 2018

@derekwaynecarr I added the CLI flag and config option. I hope that I added it everywhere it is needed.

I tested that:

  • tests run and report ok: % go test ./pkg/kubelet/cm -v
  • I can locally build kubelet: % make WHAT=cmd/kubelet

Let me know if I have to change anything.

@vishh

Member

vishh commented May 7, 2018

/hold
I'd like for us to reach a consensus on #51135 (comment) prior to merging this patch.

@szuecs szuecs changed the title from "fix #51135 set default quota period to 5ms based on user experience" to "fix #51135 make CFS quota period configurable" May 7, 2018

@dims

Member

dims commented May 8, 2018

@szuecs I think you need to set the default value in the SetDefaults_KubeletConfiguration method too:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/kubeletconfig/v1beta1/defaults.go#L151-L153

@dims

Member

dims commented May 8, 2018

/ok-to-test

@dims

Member

dims commented May 8, 2018

@derekwaynecarr @vishh Do we count this as a new feature? If so, does this feature need an alpha feature gate?

@dims

Member

dims commented May 8, 2018

@derekwaynecarr @vishh Also, do we leave the current default as-is? (when nothing is specified in the command line)

@dims

Member

dims commented May 8, 2018

@szuecs

  • if you run "hack/update-bazel.sh", that should take care of the verify job failure (and the bazel-test job failure too)
  • the fix for e2e failure should have already gone in, but you may have to rebase to master

So, please squash the commits (for easier review), rebase to master and we should get all green (cross my fingers)

thanks,
Dims

@szuecs szuecs force-pushed the szuecs:fix/51135-set-saneer-default-cpu.cfs_period branch from 8bba451 to f0b2f0d May 9, 2018

@szuecs szuecs force-pushed the szuecs:fix/51135-set-saneer-default-cpu.cfs_period branch from 5f849d5 to 938d979 Sep 1, 2018

fix #51135 make CFS quota period configurable, adds a cli flag and config option to kubelet to be able to set cpu.cfs_period and defaults to 100ms as before.

It requires enabling the feature gate CustomCPUCFSQuotaPeriod.

Signed-off-by: Sandor Szücs <sandor.szuecs@zalando.de>

@szuecs szuecs force-pushed the szuecs:fix/51135-set-saneer-default-cpu.cfs_period branch from 938d979 to 588d280 Sep 1, 2018

@szuecs

Contributor Author

szuecs commented Sep 1, 2018

/retest

@dims

Member

dims commented Sep 1, 2018

reapplying lgtm from @derekwaynecarr (lost in rebase)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Sep 1, 2018

@k8s-ci-robot

Contributor

k8s-ci-robot commented Sep 1, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, dims, szuecs, timothysc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-merge-robot

Contributor

k8s-merge-robot commented Sep 1, 2018

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-merge-robot

Contributor

k8s-merge-robot commented Sep 1, 2018

Automatic merge from submit-queue (batch tested with PRs 63437, 68081). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

@k8s-merge-robot k8s-merge-robot merged commit 147520f into kubernetes:master Sep 1, 2018

5 of 18 checks passed

Submit Queue: Required Github CI test is not green: pull-kubernetes-bazel-build
pull-kubernetes-bazel-build: Job triggered.
pull-kubernetes-bazel-test: Job triggered.
pull-kubernetes-e2e-gce: Job triggered.
pull-kubernetes-e2e-gce-100-performance: Job triggered.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job triggered.
pull-kubernetes-e2e-kops-aws: Job triggered.
pull-kubernetes-integration: Job triggered.
pull-kubernetes-kubemark-e2e-gce-big: Job triggered.
pull-kubernetes-local-e2e-containerized: Job triggered.
pull-kubernetes-node-e2e: Job triggered.
pull-kubernetes-typecheck: Job triggered.
pull-kubernetes-verify: Job triggered.
cla/linuxfoundation: szuecs authorized
pull-kubernetes-cross: Skipped
pull-kubernetes-e2e-gke: Skipped
pull-kubernetes-e2e-kubeadm-gce: Skipped
pull-kubernetes-local-e2e: Skipped

@k8s-ci-robot

Contributor

k8s-ci-robot commented Sep 2, 2018

@szuecs: The following tests failed, say /retest to rerun them all:

Test name | Commit | Details | Rerun command
pull-kubernetes-local-e2e | 0a3c4db | link | /test pull-kubernetes-local-e2e
pull-kubernetes-e2e-gce | 588d280 | link | /test pull-kubernetes-e2e-gce

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@hjacobs


hjacobs commented Sep 2, 2018

@dashpole

Contributor

dashpole commented Sep 6, 2018

This has broken the release blocking alphafeatures test suite: https://k8s-testgrid.appspot.com/sig-release-1.12-blocking#gce-cos-1.12-alphafeatures

The kubelet is crashlooping:
server.go:207] invalid configuration: CPUCFSQuotaPeriod (--cpu-cfs-quota-period) {0s} must be between 1usec and 1sec, inclusive

Tracking bug for release-blocking failures: #68313

@guineveresaenger

Contributor

guineveresaenger commented Sep 6, 2018

@dashpole @derekwaynecarr @dims please give this your immediate attention re: #68313.

@@ -154,6 +155,9 @@ func SetDefaults_KubeletConfiguration(obj *KubeletConfiguration) {
    if obj.CPUCFSQuota == nil {
        obj.CPUCFSQuota = utilpointer.BoolPtr(true)
    }
    if obj.CPUCFSQuotaPeriod == nil && obj.FeatureGates[string(features.CPUCFSQuotaPeriod)] {

@dashpole

dashpole Sep 6, 2018

Contributor

I don't know if this actually works... We generally do not feature gate the defaulting of flags. We generally only place the feature gate around code we don't want to run, rather than around configuration.

@derekwaynecarr

derekwaynecarr Sep 6, 2018

Member

This does look like a problem. The flag should always be defaulted; it just shouldn’t be used unless the feature is enabled. Apologies for missing this in review.

@dashpole

Contributor

dashpole commented Sep 7, 2018

Fix is out: #68386

k8s-merge-robot added a commit that referenced this pull request Sep 7, 2018

Merge pull request #68386 from dashpole/fix_alphafeatures
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

Remove feature gate from kubelet defaulting

**What this PR does / why we need it**:
Fixes a release-blocking test: https://k8s-testgrid.appspot.com/sig-release-1.12-blocking#gce-cos-1.12-alphafeatures
Regression added by #63437
This solution was discussed on slack in the sig-release channel
This should be targeted for 1.12

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Issue #68313

**Special notes for your reviewer**:
/hold
testing to make sure this fixes the issue
Using: `make test-e2e-node FOCUS=ImageGCNoEviction SKIP= PARALLELISM=1 REMOTE=true TEST_ARGS='--feature-gates=CustomCPUCFSQuotaPeriod=true'` to reproduce the issue, as it runs a test with the feature gate enabled.

**Release note**:
```release-note
NONE
```

/assign @dims @derekwaynecarr 
/sig node
/kind bug
/priority critical-urgent

@szuecs szuecs referenced this pull request Jan 30, 2019

Closed

REQUEST: New membership for szuecs #433

6 of 6 tasks complete