Disable AcceleratorUsage Metrics initial kep #1868
/assign |
### Graduation Criteria

#### Alpha Graduation
* Feature Flag is present.
The awkward part of this process is that enabling this feature flag is recommended for all deployers at this phase. I think we should document and promote immediate usage of this flag even though it's alpha.
@brancz has sig-monitoring encountered a similar scenario? Usage of these metrics actually impairs the ability to consume what is measured.
A situation exactly like this hasn't come up. Generally speaking, though, we have the metrics stability framework in place and all of these metrics are alpha metrics, so in theory you wouldn't even need a feature flag under that framework; you could just remove them.
I will let @dashpole review and remove the hold, but I agree with this change. My only wish is that we work hard to promote enabling the alpha feature gate as a best practice, given how it impacts usage of the device it monitors. I cannot think of a prior situation like this.
/approve
/hold
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, RenaudWasTaken
Overall LGTM, but will leave @dashpole to take a look.
0009496 to 21dc414
#### Alpha -> Beta Graduation

* Sufficient heads up has been given (1 year) to users.
IMO, we can move to Beta after a single release, but should give a long-ish period before moving to GA. Most users won't notice this until it is enabled by default, and thus if we want to give users time to adapt and migrate to the daemonset, it would be between Beta and GA, where they can explicitly opt-out of the deprecation.
I'm in favor of doing that! Is that something others would be in favor of :) ?
ping @derekwaynecarr @dchen1107
Thanks !
+1
The PR is ready to go, @dashpole should we figure out the beta graduation strategy in the next release?
What do you think?
Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>
21dc414 to 6b4b9d0
sounds great!
What type of PR is this?
/kind feature
What this PR does / why we need it: Adds the "Disable AcceleratorUsage Metrics" KEP
TLDR: Add a FeatureGate to Disable the AcceleratorUsage Metrics collected by Kubelet.
Kubelet should no longer be collecting metrics from devices as the path decided by Sig-node is to use the pod resources API and have device vendors expose metrics through their own metrics container.
Some context that I left out of the KEP: because the Kubelet holds an open handle on the NVIDIA driver, it blocks any infrastructure interaction with the driver (e.g. removing or updating it). In other words, no action involving the NVIDIA driver can be taken without killing the kubelet.
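For readers unfamiliar with how a gate like this typically behaves, here is a minimal, self-contained sketch of the idea. It is not the kubelet's code: the `FeatureGates` map, the `Collector` type, the `EnabledCollectors` helper, and the gate name string are all hypothetical stand-ins chosen for illustration. The only thing taken from this PR is the intent that turning the gate on stops accelerator metrics from being collected.

```go
package main

import "fmt"

// FeatureGates is a hypothetical stand-in for a feature gate registry:
// gate name -> enabled.
type FeatureGates map[string]bool

// Gate name assumed for illustration; check the KEP for the real one.
const DisableAcceleratorUsageMetrics = "DisableAcceleratorUsageMetrics"

// Collector is a hypothetical metrics source. Accelerator marks sources
// that would need to hold a handle on a device driver.
type Collector struct {
	Name        string
	Accelerator bool
}

// EnabledCollectors returns the collectors that should run for the given
// gates. Accelerator collectors are skipped when the gate is on, so no
// device handle would be opened on their behalf.
func EnabledCollectors(gates FeatureGates, all []Collector) []Collector {
	var out []Collector
	for _, c := range all {
		if c.Accelerator && gates[DisableAcceleratorUsageMetrics] {
			continue
		}
		out = append(out, c)
	}
	return out
}

func main() {
	all := []Collector{
		{Name: "cpu"},
		{Name: "memory"},
		{Name: "accelerator", Accelerator: true},
	}
	gates := FeatureGates{DisableAcceleratorUsageMetrics: true}
	for _, c := range EnabledCollectors(gates, all) {
		fmt.Println(c.Name)
	}
}
```

With the gate left unset, the sketch returns every collector, which mirrors an opt-in alpha phase where the default behavior does not change.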
/cc @dashpole
/sig node
/stage alpha
Does this PR introduce a user-facing change?: Yes!
Relevant enhancement issue: #1867
Finally note that I'm still fairly new with the process and might have overlooked some fields or mis-answered them because of a lack of understanding :)
Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>