Bug 1780405: Gather ~10 metrics that tell us workloads that are being used #579
@derekwaynecarr @bparees this is the first attempt to capture a small number of high-value metrics around usage. We obviously can't get everything, but this answers some fundamental questions. Suggestions for other super-critical metrics (as seen by the whole product) would be useful.
/lgtm
/hold
Until openshift/telemeter#270 is merged
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: smarterclayton, squat
The full list of commands accepted by this bot can be found here.
The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Actually, we don't need to hold: the PR you referenced targets 4.3 and this one targets master, and the whitelisting PR for master has already been merged in telemeter. /hold cancel
/retest
/retest Please review the full test history for this PR and help us cut down flakes.
8 similar comments
assets/prometheus-k8s/rules.yaml
Outdated
- expr: count(count (kube_pod_restart_policy{type!="Always",namespace!~"openshift-.+"})
    by (namespace,pod))
  record: cluster:usage:pods:terminal:workload:sum
- expr: sum(max (kubelet_containers_per_pod_count_sum) by (instance))
Nit:
- - expr: sum(max (kubelet_containers_per_pod_count_sum) by (instance))
+ - expr: sum(max(kubelet_containers_per_pod_count_sum) by (instance))
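For readers less familiar with nested PromQL aggregations, here is a minimal Python sketch of what the two expressions in the diff above compute. The sample series, label values, and function names are made up for illustration, and this only mimics the aggregation semantics; it is not how Prometheus evaluates rules.

```python
import re

# Hypothetical kube_pod_restart_policy series (one per pod); values are invented.
restart_policy_samples = [
    {"namespace": "shop", "pod": "job-a", "type": "Never"},
    {"namespace": "shop", "pod": "job-b", "type": "OnFailure"},
    {"namespace": "shop", "pod": "web-1", "type": "Always"},
    {"namespace": "openshift-monitoring", "pod": "probe", "type": "Never"},
]

# Hypothetical kubelet_containers_per_pod_count_sum series; values are invented.
# node-1 appears twice to show why max() is taken per instance before summing.
kubelet_samples = [
    {"instance": "node-1", "value": 40.0},
    {"instance": "node-1", "value": 42.0},
    {"instance": "node-2", "value": 10.0},
]


def terminal_workload_pods(samples):
    # Mimics: count(count(metric{type!="Always",namespace!~"openshift-.+"}) by (namespace,pod))
    groups = set()
    for s in samples:
        if s["type"] == "Always":
            continue
        # PromQL regex matchers are fully anchored, hence fullmatch.
        if re.fullmatch(r"openshift-.+", s["namespace"]):
            continue
        # The inner count() yields one series per distinct (namespace, pod) pair.
        groups.add((s["namespace"], s["pod"]))
    # The outer count() counts those series.
    return len(groups)


def containers_total(samples):
    # Mimics: sum(max(metric) by (instance))
    per_instance = {}
    for s in samples:
        current = per_instance.get(s["instance"], float("-inf"))
        per_instance[s["instance"]] = max(current, s["value"])
    return sum(per_instance.values())


print(terminal_workload_pods(restart_policy_samples))
print(containers_total(kubelet_samples))
```

In this toy data only the two non-`Always` pods outside `openshift-*` namespaces are counted, and each instance contributes its largest value to the total.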
/retest Please review the full test history for this PR and help us cut down flakes.
1 similar comment
@adambkaplan FYI, with this change telemeter will include metrics on the number of build objects on clusters.
/hold
9917388 to c826eba
New changes are detected. LGTM label has been removed.
These are the first metrics that show rough details of what is going on in a cluster, so that we can assess rough usage. They capture scale (the number of each workload type) and give a small hint about usage type (are people using DCs or deployments, statefulsets or jobs). The container count can be contrasted with the pod count to estimate the approximate number of containers per pod, which is also useful. For a cardinality of 10, we get much better insight into whether the cluster is in use. We can approximate the infrastructure resource count against the user's workload resource count by looking at empty clusters, but sheer scale helps us determine better values.
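As a rough illustration of the arithmetic described here, the sketch below derives containers per pod and the workload mix from per-cluster totals. The function names, dictionary keys, and numbers are hypothetical and do not correspond to record names in this PR.

```python
from typing import Dict, Optional


def containers_per_pod(container_total: float, pod_total: float) -> Optional[float]:
    # Approximate containers per pod; None when there are no pods to divide by.
    if pod_total <= 0:
        return None
    return container_total / pod_total


def workload_shares(counts: Dict[str, int]) -> Dict[str, float]:
    # Fraction of each workload type, a hint about what kind of usage the cluster sees.
    total = sum(counts.values())
    if total == 0:
        return {kind: 0.0 for kind in counts}
    return {kind: n / total for kind, n in counts.items()}


# Invented example values for a single cluster.
print(containers_per_pod(52, 26))
print(workload_shares({"deployment": 3, "deploymentconfig": 1}))
```

An empty cluster returns `None` for the ratio instead of raising, so it can be told apart from a cluster that reports real workloads.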
c826eba to 27edb32
Fixed, applying label
/cherry-pick release-4.3
@smarterclayton: once the present PR merges, I will cherry-pick it on top of release-4.3 in a new PR and assign it to you. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@smarterclayton: #579 failed to apply on top of branch "release-4.3":
In response to this:
@smarterclayton: All pull requests linked via external trackers have merged. Bugzilla bug 1780405 has been moved to the MODIFIED state. In response to this: