
Expose metrics about resource requests and limits that represent the pod model #1748

Open
smarterclayton opened this issue May 7, 2020 · 51 comments
Assignees
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. stage/stable Denotes an issue tracking an enhancement targeted for Stable/GA status

Comments

@smarterclayton
Contributor

smarterclayton commented May 7, 2020

Enhancement Description

  • Administrators of Kube need a metric that correctly describes resource requests from pods and hides the complexities of the evolving pod resource model.
  • Kubernetes Enhancement Proposal: kep: Pod resource metrics #1916
  • Primary contact (assignee): @smarterclayton
  • Responsible SIGs: sig-instrumentation, sig-node, sig-scheduling
  • Enhancement target (which target equals which milestone):
    • Alpha release target (1.20)
    • Beta release target (1.21)
    • Stable release target (1.22)
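The "complexities of the evolving pod resource model" mentioned above can be sketched roughly as follows. This is a simplified, assumed illustration (not the actual scheduler code): the effective request for one resource is the larger of the sum of all regular container requests and the largest single init container request, plus any pod overhead.

```python
# Simplified sketch (assumed, not the actual scheduler code) of the pod
# resource model the proposed metrics are meant to hide from admins.

def effective_request(containers, init_containers, overhead=0):
    """Effective pod request for a single resource, e.g. CPU in millicores."""
    regular = sum(containers)               # regular containers run concurrently
    init = max(init_containers, default=0)  # init containers run one at a time
    return max(regular, init) + overhead

# Two app containers (100m + 200m) with a 500m init container: the pod
# effectively requests 500m from the scheduler, not 300m.
print(effective_request([100, 200], [500]))  # 500
```

Naively summing container requests (as older dashboards tended to do) would report 300m here, which is exactly the kind of discrepancy the proposed metrics are meant to eliminate.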
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label May 7, 2020
@smarterclayton
Contributor Author

smarterclayton commented May 7, 2020

/sig instrumentation
/sig node
/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 7, 2020
@harshanarayana

harshanarayana commented May 14, 2020

Hey there @smarterclayton -- 1.19 Enhancements shadow here. I wanted to check in and see if you think this Enhancement will be graduating in 1.19?

In order to have this be part of the release:

  1. The KEP PR must be merged in an implementable state
  2. The KEP must have test plans
  3. The KEP must have graduation criteria.

The current release schedule is:

  • Monday, April 13: Week 1 - Release cycle begins
  • Tuesday, May 19: Week 6 - Enhancements Freeze
  • Thursday, June 25: Week 11 - Code Freeze
  • Thursday, July 9: Week 14 - Docs must be completed and reviewed
  • Tuesday, August 4: Week 17 - Kubernetes v1.19.0 released

If you do, I'll add it to the 1.19 tracking sheet (http://bit.ly/k8s-1-19-enhancements). Once coding begins please list all relevant k/k PRs in this issue so they can be tracked properly. 👍

Thanks!

@smarterclayton
Contributor Author

smarterclayton commented May 15, 2020

I don't think we'll make implementable and merged by Tuesday, so this should be targeted for 1.20.

@harshanarayana

harshanarayana commented May 18, 2020

Hey @smarterclayton, thanks for confirming the inclusion state. I've marked the Enhancement as Deferred in the Tracker and am updating the milestone accordingly.

/milestone v1.20

@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone May 18, 2020
@harshanarayana harshanarayana added the tracked/no Denotes an enhancement issue is NOT actively being tracked by the Release Team label May 18, 2020
@fejta-bot

fejta-bot commented Aug 16, 2020

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 16, 2020
@palnabarun
Member

palnabarun commented Sep 1, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 1, 2020
@kikisdeliveryservice
Member

kikisdeliveryservice commented Sep 11, 2020

Hi @smarterclayton !

Enhancements Lead here, do you still intend to target this for alpha in 1.20?

Thanks!
Kirsten

@smarterclayton
Contributor Author

smarterclayton commented Sep 21, 2020

Yes, this is targeting alpha for 1.20, assuming we can close the remaining questions in the KEP.

@kikisdeliveryservice kikisdeliveryservice added tracked/yes Denotes an enhancement issue is actively being tracked by the Release Team stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status and removed tracked/no Denotes an enhancement issue is NOT actively being tracked by the Release Team labels Sep 21, 2020
@kikisdeliveryservice
Member

kikisdeliveryservice commented Sep 21, 2020

Thanks Clayton!!

As a reminder, by Enhancements Freeze (October 6th), KEPs must be:

  • merged in an implementable state (yours is provisional)
  • must have test plans (missing)
  • must have graduation criteria (missing)

Best,
Kirsten

I also added the PR link to the issue description; we can update it again once merged.

@mikejoh

mikejoh commented Sep 29, 2020

Hi @smarterclayton 👋!

I'm one of the Enhancement shadows for the 1.20 release cycle. This is a friendly reminder that the Enhancements Freeze is on the 6th of October; I'm repeating the requirements needed by then:

  • The KEP must be merged in an implementable state.
    • It's provisional at the moment, and I see that there's active work ongoing.
  • The KEP must have test plans.
    • This is missing at the moment.
  • The KEP must have graduation criteria.
    • This is also missing at the moment.

Thanks!

@smarterclayton
Contributor Author

smarterclayton commented Sep 29, 2020

Thanks for the reminder, updated those. Will be working with the sig.

@kikisdeliveryservice
Member

kikisdeliveryservice commented Oct 2, 2020

The current PR looks complete from an enhancements freeze POV; we'll monitor to see if it merges in time.

@kikisdeliveryservice
Member

kikisdeliveryservice commented Oct 7, 2020

Hi @smarterclayton

Enhancements Freeze is now in effect. Unfortunately, your KEP PR has not yet merged. If you wish to be included in the 1.20 Release, please submit an Exception Request as soon as possible.

Best,
Kirsten
1.20 Enhancements Lead

@kikisdeliveryservice kikisdeliveryservice added tracked/no Denotes an enhancement issue is NOT actively being tracked by the Release Team and removed tracked/yes Denotes an enhancement issue is actively being tracked by the Release Team labels Oct 7, 2020
@kikisdeliveryservice kikisdeliveryservice removed this from the v1.20 milestone Oct 7, 2020
@smarterclayton
Contributor Author

smarterclayton commented Sep 2, 2021

Canvassing the community to get feedback before GA promotion.

https://www.reddit.com/r/kubernetes/comments/pgn4sj/feedback_wanted_on_pod_resource_metrics_before_ga/

@k8s-triage-robot

k8s-triage-robot commented Dec 1, 2021

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 1, 2021
@ehashman
Member

ehashman commented Dec 2, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 2, 2021
@k8s-triage-robot

k8s-triage-robot commented Mar 2, 2022

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 2, 2022
@Huang-Wei
Member

Huang-Wei commented Mar 24, 2022

@smarterclayton I just realized this metric has a pod label, which IMO increases the cardinality a lot and puts pressure on the scraper side. Did you hear any concern/feedback from the users? Per the KEP, all the goals can be satisfied by removing the pod dimension, since the metric's primary goal is to give a high-level overview of aggregated pods' reqs/limits. A pod-level metric doesn't seem that common. WDYT?
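The scale of the cardinality concern raised above can be made concrete with a rough back-of-envelope. All numbers and label shapes below are illustrative assumptions, not measurements from any real cluster:

```python
# Hypothetical cardinality estimate: with a `pod` label, the series count
# grows linearly with pod count; without it, it is bounded by the number
# of aggregation buckets (e.g. namespaces). Illustrative only.

def series_count(pods, namespaces, resources=2, with_pod_label=True):
    """Estimated time series for a request metric over `resources` dimensions."""
    if with_pod_label:
        return pods * resources        # one series per (pod, resource)
    return namespaces * resources      # one series per (namespace, resource)

# 5000 pods across 50 namespaces, tracking cpu + memory:
print(series_count(5000, 50))                         # 10000
print(series_count(5000, 50, with_pod_label=False))   # 100
```

A two-orders-of-magnitude difference like this is why a pod-level label is a meaningful scrape-side cost, even if (as discussed later in the thread) the endpoint is optional.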

@k8s-triage-robot

k8s-triage-robot commented Apr 23, 2022

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 23, 2022
@logicalhan
Contributor

logicalhan commented May 12, 2022

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 12, 2022
@logicalhan
Contributor

logicalhan commented May 12, 2022

@smarterclayton any plans to graduate this to beta?

@logicalhan
Contributor

logicalhan commented May 12, 2022

If there is no one working on this, we will have to deprecate and remove this stuff. Alternatively, we will need to find someone to graduate this.

@k8s-triage-robot

k8s-triage-robot commented Aug 10, 2022

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 10, 2022
@dashpole
Contributor

dashpole commented Sep 8, 2022

/lifecycle frozen
I think we might need to look for a new owner to drive this work in 1.26

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 8, 2022
@logicalhan
Contributor

logicalhan commented Sep 22, 2022

/lifecycle frozen

@Huang-Wei
Member

Huang-Wei commented Sep 23, 2022

@dashpole @logicalhan do you happen to find some volunteers to continue the work?

@smarterclayton
Contributor Author

smarterclayton commented Sep 26, 2022

Oh man, we didn't take this to beta?! This is my fault. Let me talk to @dgrisonnet, who pinged me about it a day ago. Originally the delay was gathering feedback from admins doing capacity planning, and I had been working with a few people on leveraging it more widely.

The use I was most familiar with was OpenShift, where we replaced the dashboards that were using the (old, incorrect, incomplete) kube-state-metrics data with these metrics. Among the folks who made the change there was general agreement that the new metrics were superior, and that the cost of the cardinality was worth it to replace the generally incorrect metrics from kube-state-metrics (at the time we felt that completely replicating the pod resource model code in ksm was not appropriate, and this was a better solution). The next phase was getting community user input on building metric-based capacity dashboards and whether the dimensions worked for the audience. I did a few analyses when planning out e2e CI runs and found the metrics provided better human visibility for comparing bulk "used vs requested".

@smarterclayton
Contributor Author

smarterclayton commented Sep 26, 2022

@Huang-Wei re:

I just realized this metric has a pod label, which IMO increase the cardinality a lot and yield a pressure on the scraper side. Did you hear any concern/feedback from the users? Per the KEP, all the goals can be satisfied by removing the pod dimension as in terms of a metric, its primary goal is to give a high-level overview on aggregated pods' reqs/limits. Pod-level metric doesn't seem that common. WDYT?

The original intent was to allow admins to build capacity planning dashboards, and to pair the request metrics with pod-level resource usage metrics (CPU, memory, etc.), so the intent was very much to have a pod dimension. Do we have a proposal to remove or make optional the pod-level CPU or memory consumption dimensions? If so, such a change would apply to this metric as well, but note that this is already an optional endpoint for users who are concerned about cardinality.
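The pairing described above — joining request metrics with usage metrics on the shared pod label to get a "used vs requested" view — can be sketched as follows. The metric shapes and numbers are assumed for illustration, not the actual exported schema:

```python
# Illustrative sketch (assumed shapes, not real metric output): join
# per-pod usage samples with per-pod request metrics on the shared pod
# key to compute a "used vs requested" ratio per pod.

def utilization(usage_by_pod, request_by_pod):
    """usage/request ratio for pods present in both series sets."""
    return {
        pod: usage_by_pod[pod] / request_by_pod[pod]
        for pod in usage_by_pod.keys() & request_by_pod.keys()
        if request_by_pod[pod] > 0
    }

usage = {"web-1": 150, "web-2": 80}      # e.g. CPU millicores in use
requests = {"web-1": 200, "web-2": 100}  # effective requests for the same pods
ratios = utilization(usage, requests)
print(sorted(ratios.items()))  # [('web-1', 0.75), ('web-2', 0.8)]
```

This join is only possible if both series carry the pod label, which is why dropping the pod dimension from the request metric would break the capacity-dashboard use case.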

@smarterclayton
Contributor Author

smarterclayton commented Sep 26, 2022

To clarify - this has been in beta since 1.21 (#1748 (comment)). Was there some belief that it was not beta?

Going to GA would be the last step; I'm happy to push that over the line with @dgrisonnet.

@dgrisonnet
Member

dgrisonnet commented Sep 26, 2022

I also thought this was still in Alpha for some reason even though we have a label marking the stability 😅

Yet let's try to get this over the finish line and gather some feedback from users to know whether they encountered any issues with these new metrics.

/assign @smarterclayton @dgrisonnet

@dgrisonnet
Member

dgrisonnet commented Sep 26, 2022

Do we have a proposal to remove or make optional pod level cpu consumption or memory consumption dimensions? If so, such a change would apply to this metric as well, but as this is already an optional endpoint for users who are concerned about cardinality.

We already have a cardinality protection mechanism in Kubernetes: https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/2305-metrics-cardinality-enforcement, so users could already tweak the dimensions if needed.
That, plus the fact that the endpoint is optional, makes it sound fairly safe to expose without having to worry about potential cardinality explosions.
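The core idea of that cardinality-enforcement KEP can be sketched as follows. The enforcement logic below is a simplified, assumed approximation, not the actual component-base implementation:

```python
# Simplified, assumed sketch of label allow-list enforcement: for labels
# that have a configured allow-list, values outside the list are collapsed
# into a single bucket, bounding the metric's worst-case cardinality.

def enforce_allow_list(labels, allow):
    """Return labels with disallowed values replaced by a catch-all bucket."""
    return {
        name: value if name not in allow or value in allow[name] else "unexpected"
        for name, value in labels.items()
    }

allow = {"pod": {"important-pod-1", "important-pod-2"}}
print(enforce_allow_list({"pod": "random-pod-xyz", "resource": "cpu"}, allow))
# {'pod': 'unexpected', 'resource': 'cpu'}
```

With a mechanism like this, the pod label can stay in the metric while an operator who cares about scrape cost bounds how many distinct values it may take.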

@dgrisonnet
Member

dgrisonnet commented Sep 29, 2022

I opened kubernetes/kube-state-metrics#1846 in kube-state-metrics to make the transition to the kube-scheduler metrics.
