-
Notifications
You must be signed in to change notification settings - Fork 38.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix computing of cpu nano core usage #74933
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: yujuhong The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
e9dea60
to
d4aa0fa
Compare
@yujuhong: GitHub didn't allow me to request PR reviews from the following users: hpandeycodeit. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks a lot for looking into this @yujuhong
found a couple of minor nits.
i still have the concern that we have reduced precision on Windows, but let's see it it's not a problem.
@adelina-t do you think you will be able to test this patch by building a custom kubelet?
wget https://github.com/kubernetes/kubernetes/pull/74933.diff && git apply 74933.diff
KUBE_BUILD_PLATFORMS=windows/amd64 make all WHAT=cmd/kubelet
# result should be at '_output/local/bin/windows/amd64/kubelet.exe'
# 'git reset --hard HEAD' to remove the patch
/sig windows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/LGTM
fix the couple of nits that were pointed out and this is good
CRI runtimes do not supply cpu nano core usage as it is not part of CRI stats. However, there are upstream components that still rely on such stats to function. The previous fix was faulty because the multiple callers could compete and update the stats, causing inconsistent/incoherent metrics. This change, instead, creates a separate call for updating the usage, and rely on eviction manager, which runs periodically, to trigger the updates. The caveat is that if eviction manager is completley turned off, no one would compute the usage.
d4aa0fa
to
191666d
Compare
/lgtm |
/hold cancel |
/hold |
Our CI job tracks the master branch right now, so not yet. I manually tested and see no kubelet crash, and also |
Yeah, I was thinking manual runs :) I have jobs running as we speak, seeing consistent runs as well, but haven't seen logs yet, will look tomorrow. |
This should work well, and is as clean a way of implementing something like this as I think we will find. Ill expand on @yujuhong comment above to document it here. The eviction manager does not run if:
/lgtm |
@benmoss could you help file an issue about the performance for future reference? Thanks! @dashpole thanks for reviewing. /hold cancel |
The reduced precision should not cause problems with the current interval (updating every 10s), but it's worth more investigation in the future. |
sure, here it is: |
|
||
if err != nil { | ||
// This should not happen. Log now to raise visiblity | ||
klog.Errorf("failed updating cpu usage nano core: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should nil be returned in this case ?
Is this going to get cherry picked back into the 1.13 and others? |
I think that's a good idea :) I answered in your issue. |
Didn't seem to fix my problem on 1.14.1 |
CRI runtimes do not supply cpu nano core usage as it is not part of CRI
stats. However, there are upstream components that still rely on such
stats to function. The previous fix was faulty because the multiple
callers could compete and update the stats, causing
inconsistent/incoherent metrics. This change, instead, creates a
separate call for updating the usage, and rely on eviction manager,
which runs periodically, to trigger the updates. The caveat is that if
eviction manager is completley turned off, no one would compute the
usage.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #74667
Special notes for your reviewer:
Does this PR introduce a user-facing change?: