Fix computing of cpu nano core usage #74933

Merged
merged 1 commit into from
Mar 5, 2019

Conversation

yujuhong
Contributor

@yujuhong yujuhong commented Mar 5, 2019

CRI runtimes do not supply cpu nano core usage as it is not part of CRI
stats. However, there are upstream components that still rely on such
stats to function. The previous fix was faulty because multiple
callers could compete to update the stats, causing
inconsistent/incoherent metrics. This change instead creates a
separate call for updating the usage, and relies on the eviction manager,
which runs periodically, to trigger the updates. The caveat is that if
the eviction manager is completely turned off, nothing will compute the
usage.
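
For readers unfamiliar with the approach, here is a minimal sketch of the idea described above: readers only look up a cached value, while a single update call diffs two cumulative CPU samples and guards against a zero interval. All names here (`cpuSample`, `usageCache`, `Get`, `Update`) are hypothetical and chosen for illustration; this is not the code in `cri_stats_provider.go`.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// cpuSample is a hypothetical stand-in for one CPU stats reading.
type cpuSample struct {
	timestampNs int64  // time of the reading, in nanoseconds
	coreNanoSec uint64 // cumulative CPU time consumed, in core-nanoseconds
}

// usageRecord pairs the last sample with the rate derived from it.
type usageRecord struct {
	sample    cpuSample
	nanoCores *uint64 // nil until two samples have been seen
}

// usageCache is a hypothetical per-container cache (not the kubelet's type).
type usageCache struct {
	mu      sync.RWMutex
	records map[string]usageRecord
}

func newUsageCache() *usageCache {
	return &usageCache{records: make(map[string]usageRecord)}
}

// Get returns the cached rate and never modifies the cache, so any number
// of readers can call it without disturbing the baseline sample.
func (c *usageCache) Get(id string) *uint64 {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.records[id].nanoCores
}

// Update stores cur and returns the rate since the previous sample.
// It is meant to be called from one periodic caller only.
func (c *usageCache) Update(id string, cur cpuSample) (*uint64, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	prev, ok := c.records[id]
	if !ok {
		// First sample: nothing to diff against yet.
		c.records[id] = usageRecord{sample: cur}
		return nil, nil
	}
	elapsed := cur.timestampNs - prev.sample.timestampNs
	if elapsed <= 0 {
		// Guard against dividing by a zero or negative interval.
		return nil, errors.New("zero or negative interval between samples")
	}
	if cur.coreNanoSec < prev.sample.coreNanoSec {
		// Counter went backwards; restart from a fresh baseline.
		c.records[id] = usageRecord{sample: cur}
		return nil, nil
	}
	delta := cur.coreNanoSec - prev.sample.coreNanoSec
	// Core-nanoseconds per nanosecond is "cores"; scale by 1e9 for nanocores.
	rate := uint64(float64(delta) / float64(elapsed) * 1e9)
	c.records[id] = usageRecord{sample: cur, nanoCores: &rate}
	return &rate, nil
}

func main() {
	c := newUsageCache()
	c.Update("ctr", cpuSample{timestampNs: 0, coreNanoSec: 0})
	// Ten seconds later the container has used five core-seconds.
	c.Update("ctr", cpuSample{timestampNs: 10_000_000_000, coreNanoSec: 5_000_000_000})
	if v := c.Get("ctr"); v != nil {
		fmt.Println(*v) // 500000000 nanocores, i.e. half a core
	}
}
```

Because only `Update` moves the baseline, two readers arriving a few microseconds apart cannot shrink the interval toward zero, which is the kind of interference the description attributes to the previous fix.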

What type of PR is this?

/kind bug


What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #74667

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@yujuhong yujuhong added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Mar 5, 2019
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 5, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yujuhong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet labels Mar 5, 2019
@yujuhong yujuhong added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed area/kubelet needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 5, 2019
@yujuhong
Contributor Author

yujuhong commented Mar 5, 2019

@k8s-ci-robot
Contributor

@yujuhong: GitHub didn't allow me to request PR reviews from the following users: hpandeycodeit.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @michmike @PatrickLang @neolit123 @adelina-t @feiskyer @hpandeycodeit

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Member

@neolit123 neolit123 left a comment


thanks a lot for looking into this @yujuhong
found a couple of minor nits.

i still have the concern that we have reduced precision on Windows, but let's see if it's not a problem.

@adelina-t do you think you will be able to test this patch by building a custom kubelet?

wget https://github.com/kubernetes/kubernetes/pull/74933.diff && git apply 74933.diff
KUBE_BUILD_PLATFORMS=windows/amd64 make all WHAT=cmd/kubelet
# result should be at '_output/local/bin/windows/amd64/kubelet.exe'
# 'git reset --hard HEAD' to remove the patch

pkg/kubelet/stats/cri_stats_provider.go (two outdated review comments, resolved)
@neolit123
Member

/sig windows

@k8s-ci-robot k8s-ci-robot added the sig/windows Categorizes an issue or PR as relevant to SIG Windows. label Mar 5, 2019
@michmike michmike added this to 1.14 Release Blocking (Windows GA, gMSA alpha) in SIG-Windows Mar 5, 2019
Contributor

@michmike michmike left a comment


/LGTM
fix the couple of nits that were pointed out and this is good

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 5, 2019
@michmike
Contributor

michmike commented Mar 5, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 5, 2019
@michmike
Contributor

michmike commented Mar 5, 2019

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 5, 2019
@yujuhong
Contributor Author

yujuhong commented Mar 5, 2019

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 5, 2019
@yujuhong
Contributor Author

yujuhong commented Mar 5, 2019

@yujuhong, did you run this one in GCE CI?

Our CI job tracks the master branch right now, so not yet. I manually tested and see no kubelet crash, and also kubectl top pods reporting meaningful numbers.

@adelina-t
Contributor

adelina-t commented Mar 5, 2019

@yujuhong

Our CI job tracks the master branch right now, so not yet. I manually tested and see no kubelet crash, and also kubectl top pods reporting meaningful numbers.

Yeah, I was thinking manual runs :) I have jobs running as we speak, seeing consistent runs as well, but haven't seen logs yet, will look tomorrow.

@dashpole
Contributor

dashpole commented Mar 5, 2019

This should work well, and is as clean a way of implementing something like this as I think we will find. I'll expand on @yujuhong's comment above to document it here.

The eviction manager does not run if:

  • The list of hard eviction thresholds is explicitly set to empty
  • No soft eviction thresholds are set
  • The feature gate LocalStorageCapacityIsolation is explicitly disabled.

/lgtm

@yujuhong
Contributor Author

yujuhong commented Mar 5, 2019

Performance of the /stats/summary endpoint is pretty bad compared to Linux, but not sure if that's important. This does definitely fix the panic issue. Usage during the load test was consistently low.

@benmoss could you help file an issue about the performance for future reference? Thanks!

@dashpole thanks for reviewing.

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 5, 2019
@yujuhong
Contributor Author

yujuhong commented Mar 5, 2019

i still have the concern that we have reduced precision on Windows, but let's see if it's not a problem.

The reduced precision should not cause problems with the current interval (updating every 10s), but it's worth more investigation in the future.
@neolit123 could you file an issue summarizing your findings, so that we can come back and look at them later? Thanks!

@neolit123
Member

@neolit123 could you file an issue summarizing your findings, so that we can come back and look at them later? Thanks!

sure, here it is:
#74999

@k8s-ci-robot k8s-ci-robot merged commit 4d1b830 into kubernetes:master Mar 5, 2019
SIG-Windows automation moved this from 1.14 Release Blocking (Windows GA, gMSA alpha) to Done (v.1.14) Mar 5, 2019

if err != nil {
// This should not happen. Log now to raise visibility
klog.Errorf("failed updating cpu usage nano core: %v", err)

Should nil be returned in this case ?

@SleepyBrett

Is this going to get cherry picked back into the 1.13 and others?

@mattjmcnaughton
Contributor

Is this going to get cherry picked back into the 1.13 and others?

I think that's a good idea :) I answered in your issue.

@SleepyBrett

Didn't seem to fix my problem on 1.14.1

k8s-ci-robot added a commit that referenced this pull request May 9, 2019
…of-#73659-#74933-upstream-release-1.13

Automated cherry pick of #73659: Kubelet: add usageNanoCores from CRI stats provider #74933: Fix computing of cpu nano core usage
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/windows Categorizes an issue or PR as relevant to SIG Windows. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Divide by zero panic in getContainerUsageNanocores