Bug 1780405: Gather ~10 metrics that tell us workloads that are being used #579

smarterclayton · 2019-12-05T01:58:29Z

These are the first metrics that show rough details of what is going on in
a cluster so that we can assess rough usage. They capture scale (the number
of each workload type) and give us a small hint about usage type
(are people using DCs or deployments, statefulsets or jobs). The container
count can be used to estimate approximate containers per pod when
compared against pod counts, which is also useful.

For a cardinality of about 10, we get much better insight into whether a
cluster is in use. We can approximate infrastructure resource count against the
user's workload resource count by looking at empty clusters, but sheer
scale helps us determine better values.

smarterclayton · 2019-12-05T01:59:29Z

@derekwaynecarr @bparees this is the first attempt to capture a small number of high value metrics around usage. We obviously can't get everything, but this answers some fundamental questions. Suggestions on other super critical metrics (as seen by the whole product) might be useful.

squat

/lgtm
/hold

Until openshift/telemeter#270 is merged

openshift-ci-robot · 2019-12-05T08:28:59Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton, squat

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [squat]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

lilic · 2019-12-05T08:47:11Z

Actually, we don't need to hold: the PR you referenced targets 4.3 and this one targets master. The whitelisting PR for master has already been merged in telemeter master.

/hold cancel

lilic · 2019-12-05T08:47:18Z

/retest

openshift-bot · 2019-12-05T08:58:44Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2019-12-05T10:03:20Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2019-12-05T10:16:33Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2019-12-05T11:20:44Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2019-12-05T11:37:29Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2019-12-05T12:25:53Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2019-12-05T12:38:54Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2019-12-05T12:51:54Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2019-12-05T13:31:27Z

/retest

Please review the full test history for this PR and help us cut down flakes.

lilic · 2019-12-05T13:46:36Z

assets/prometheus-k8s/rules.yaml

+    - expr: count(count (kube_pod_restart_policy{type!="Always",namespace!~"openshift-.+"})
+        by (namespace,pod))
+      record: cluster:usage:pods:terminal:workload:sum
+    - expr: sum(max (kubelet_containers_per_pod_count_sum) by (instance))


Nit:

Suggested change

- expr: sum(max (kubelet_containers_per_pod_count_sum) by (instance))

- expr: sum(max(kubelet_containers_per_pod_count_sum) by (instance))
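For readers unfamiliar with the aggregation in this expression, here is a minimal sketch in Python of what `sum(max(...) by (instance))` computes: the largest sample per `instance` label, then the total across instances. The suggested change alters only the whitespace after `max`. The helper name `sum_of_max_by` and the sample label values are hypothetical, chosen for illustration, and are not taken from this PR.

```python
from collections import defaultdict


def sum_of_max_by(samples, label):
    """Emulate PromQL sum(max(metric) by (label)) over an instant vector.

    samples is a list of (labels_dict, value) pairs. Unlike PromQL, which
    returns an empty result for no input series, this sketch returns 0.
    """
    per_group = defaultdict(lambda: float("-inf"))
    for labels, value in samples:
        # A series without the label is grouped under the empty string.
        key = labels.get(label, "")
        per_group[key] = max(per_group[key], value)
    return sum(per_group.values())


# Hypothetical samples: two series for node-a collapse to one value via max.
samples = [
    ({"instance": "node-a:10250", "job": "kubelet"}, 12.0),
    ({"instance": "node-a:10250", "job": "kubelet-dup"}, 12.0),
    ({"instance": "node-b:10250", "job": "kubelet"}, 7.0),
]

total_containers = sum_of_max_by(samples, "instance")
print(total_containers)
```

Taking the per-instance maximum first means that a duplicated series for the same instance is not counted twice before the values are summed.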

openshift-bot · 2019-12-05T14:00:45Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2019-12-05T14:36:02Z

/retest

Please review the full test history for this PR and help us cut down flakes.

bparees · 2019-12-05T14:38:43Z

@adambkaplan fyi, there will be metrics about number of build objects on clusters in telemeter w/ this change

bparees · 2019-12-05T14:51:43Z

/hold
per @lilic's comment about failing tests

openshift-ci-robot · 2019-12-05T16:26:24Z

New changes are detected. LGTM label has been removed.


smarterclayton · 2019-12-05T21:01:00Z

Fixed, applying label

smarterclayton · 2019-12-05T21:01:34Z

/cherry-pick release-4.3

openshift-cherrypick-robot · 2019-12-05T21:01:35Z

@smarterclayton: once the present PR merges, I will cherry-pick it on top of release-4.3 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.3

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-cherrypick-robot · 2019-12-05T21:02:55Z

@smarterclayton: #579 failed to apply on top of branch "release-4.3":

error: Failed to merge in the changes.
Using index info to reconstruct a base tree...
M	assets/prometheus-k8s/rules.yaml
M	jsonnet/rules.jsonnet
M	pkg/manifests/bindata.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/manifests/bindata.go
CONFLICT (content): Merge conflict in pkg/manifests/bindata.go
Auto-merging jsonnet/rules.jsonnet
Auto-merging assets/prometheus-k8s/rules.yaml
Patch failed at 0001 jsonnet: Gather ~10 metrics that tell us workloads that are being used

In response to this:

/cherry-pick release-4.3


openshift-ci-robot · 2019-12-05T21:38:44Z

@smarterclayton: All pull requests linked via external trackers have merged. Bugzilla bug 1780405 has been moved to the MODIFIED state.

In response to this:

Bug 1780405: Gather ~10 metrics that tell us workloads that are being used


openshift-ci-robot requested review from brancz and s-urbaniak December 5, 2019 01:58

openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Dec 5, 2019

squat approved these changes Dec 5, 2019


openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 5, 2019

openshift-ci-robot assigned squat Dec 5, 2019

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 5, 2019

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 5, 2019

openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 5, 2019

lilic reviewed Dec 5, 2019


openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 5, 2019

smarterclayton force-pushed the gather_usage branch from 9917388 to c826eba Compare December 5, 2019 16:26

openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Dec 5, 2019

openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 5, 2019

smarterclayton force-pushed the gather_usage branch from c826eba to 27edb32 Compare December 5, 2019 19:23

smarterclayton added lgtm Indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Dec 5, 2019

openshift-merge-robot merged commit 80312c8 into openshift:master Dec 5, 2019

smarterclayton changed the title ~~jsonnet: Gather ~10 metrics that tell us workloads that are being used~~ Bug 1780405: Gather ~10 metrics that tell us workloads that are being used Dec 5, 2019
