Add scheduler feature usage metrics #838

damemi · 2020-07-02T15:43:14Z

Adds cluster_legacy_scheduler_policy (openshift/cluster-kube-scheduler-operator#250) and cluster_master_schedulable (openshift/cluster-kube-scheduler-operator#168) metrics

This is important for us because the upstream scheduler is migrating from Policy configs (setting predicates/priorities) to a new Profile config, and is planning to remove Policy support entirely in the coming releases.

This requires us to react (see openshift/origin#25203, openshift/cluster-kube-scheduler-operator#255, openshift/api#677) if we wish to continue supporting the ability to configure the scheduler algorithm through KSO.

The current usage percentage of Policy configs, as reported by cluster_legacy_scheduler_policy, will help us determine whether Policy support is worth continuing. cluster_master_schedulable, which we assume is a reasonably widely used setting, will serve as the benchmark against which Policy config usage is compared (and will also provide usage information for that feature).

I added CHANGELOG entry for this change.
No user facing changes, so no entry in CHANGELOG was needed.

damemi · 2020-07-02T15:49:44Z

/cc @deads2k

smarterclayton · 2020-07-02T15:55:02Z

Documentation/data-collection.md

@@ -209,6 +209,12 @@ data:
    # Possible values are bootstrap|cloud-resources|monitoring|authentication|products|solution-explorer|deletion|complete.
    # This metric is used by OCM to detect when an RHMI installation is complete & ready to use i.e. rhmi_status{stage='complete'}
    - '{__name__="rhmi_status"}'
+    # (workloads-team, @openshift/openshift-group-b) cluster_legacy_scheduler_policy reports whether the scheduler operator is


Does the metric itself have a description that explains the labels and values? If so, copy it here (along with why we collect it). This entry should really just be a copy of the core metric's description. Without the possible values, a customer or user might ask "but what is it reporting?", and we want that answer to be easy to find.

Both are "boolean" gauges that report only 0 or 1 (https://github.com/openshift/cluster-kube-scheduler-operator/blob/master/pkg/operator/configmetrics/configmetrics.go#L15-L21). I updated the description to say so.
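To illustrate what a "boolean" gauge means here, the following is a minimal sketch, not the operator's actual code. The `schedulerConfig` type, its fields, and the `boolGauge` helper are hypothetical names introduced for illustration, and the rule that a non-empty Policy name means "legacy Policy in use" is an assumption.

```go
package main

import "fmt"

// schedulerConfig is a hypothetical stand-in for the cluster's scheduler
// settings; it is not a type from the operator.
type schedulerConfig struct {
	PolicyName         string // assumed non-empty when a custom Policy config is set
	MastersSchedulable bool
}

// boolGauge maps a boolean setting to the 0/1 value a "boolean" gauge reports.
func boolGauge(enabled bool) float64 {
	if enabled {
		return 1
	}
	return 0
}

func main() {
	cfg := schedulerConfig{PolicyName: "custom-policy", MastersSchedulable: false}

	// Each metric is a single gauge whose value is either 0 or 1.
	fmt.Printf("cluster_legacy_scheduler_policy %v\n", boolGauge(cfg.PolicyName != ""))
	fmt.Printf("cluster_master_schedulable %v\n", boolGauge(cfg.MastersSchedulable))
}
```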

smarterclayton · 2020-07-02T15:55:49Z

In description add how many series you expect.

Also, in general, your commit message and title should roughly contain some of this info (so someone doesn't have to look at the PR later)

damemi · 2020-07-02T16:12:59Z

> In description add how many series you expect.

@smarterclayton both of these metrics just report 0 or 1 for the cluster. Is that 1 or 2 series?

metalmatze · 2020-07-02T16:58:03Z

That would be one series with a value of 0 or 1, I suppose. 😊

metalmatze · 2020-07-02T17:00:25Z

Just out of curiosity: do you want to later count the number of clusters that have this enabled versus the ones that have it disabled? Trying to understand the use case 😉

damemi · 2020-07-02T17:01:53Z

> Just out of curiosity: do you want to later count the number of clusters that have this enabled versus the ones that have it disabled? Trying to understand the use case 😉

@metalmatze that's exactly what we want to gather: the percentage of clusters that have these features enabled, to weigh against the effort required to continue supporting them.
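A minimal sketch of that calculation, assuming one 0/1 sample per reporting cluster. The `enabledPercent` helper and the sample values are made up for illustration and are not real usage data.

```go
package main

import "fmt"

// enabledPercent returns the percentage of clusters whose boolean gauge
// reports 1. values holds one sample per cluster; an empty input yields 0.
func enabledPercent(values []float64) float64 {
	if len(values) == 0 {
		return 0
	}
	enabled := 0
	for _, v := range values {
		if v == 1 {
			enabled++
		}
	}
	return 100 * float64(enabled) / float64(len(values))
}

func main() {
	// Made-up samples: one entry per cluster.
	legacyPolicy := []float64{0, 0, 1, 0}
	mastersSchedulable := []float64{1, 0, 1, 0}

	// The second figure serves as the benchmark for the first.
	fmt.Printf("legacy scheduler policy: %.1f%%\n", enabledPercent(legacyPolicy))
	fmt.Printf("masters schedulable:     %.1f%%\n", enabledPercent(mastersSchedulable))
}
```

Because both gauges only take the values 0 and 1, the same figure is simply the mean of the samples multiplied by 100.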

The metrics 'cluster_legacy_scheduler_policy' and 'cluster_master_schedulable' will be used to measure the usage of the custom Policy config and mastersSchedulable settings in kube-scheduler-operator. These are important to determine if it is necessary to continue supporting Policy config as upstream migrates to Plugins.

s-urbaniak · 2020-07-03T04:47:39Z

approval from @smarterclayton given in https://coreos.slack.com/archives/C0VMT03S5/p1593703067141800?thread_ts=1593701332.131000&cid=C0VMT03S5
the metrics here are shipped as core OpenShift and are already part of the product
Total number of series: 2 (cluster_legacy_scheduler_policy: 1, cluster_master_schedulable: 1)
Follows naming best practices: cluster_master_schedulable implies a bool value, cluster_legacy_scheduler_policy seems a bit off for a bool value

s-urbaniak · 2020-07-03T04:48:12Z

@openshift/openshift-team-monitoring (cc @brancz ) from my perspective lgtm modulo the naming, asking for another set of eyes.

s-urbaniak · 2020-07-03T04:53:10Z

/approve

brancz · 2020-07-03T08:34:17Z

As these are gauges, I'm fine with the naming. Leaving final lgtm to @smarterclayton though to give a chance to verify if his comment was addressed in a way he expected.

damemi · 2020-07-07T17:36:39Z

bump @smarterclayton does this look good to you?

damemi · 2020-07-07T18:54:28Z

/retest

smarterclayton · 2020-07-08T21:25:16Z

This LGTM, feel free to tag.

brancz · 2020-07-09T07:25:33Z

/lgtm
/retest

openshift-ci-robot · 2020-07-09T07:25:50Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: brancz, damemi, s-urbaniak

Needs approval from an approver in each of these files:

~~OWNERS~~ [brancz,s-urbaniak]

openshift-bot · 2020-07-09T08:19:09Z

/retest