Add metrics for volume scheduling operations #59529
Conversation
/ok-to-test
pkg/scheduler/metrics/metrics.go (outdated):

			Buckets: prometheus.ExponentialBuckets(1000, 2, 15),
		},
	)
	VolumeBindingPredicateLatency = prometheus.NewHistogram(
@bsalamat, what do you think about having a metric on a specific predicate? This predicate has the potential to be slow because it has to iterate through all PVs.
Having metrics for a specific predicate is not consistent with the rest of our code base, but it's fine for now while we are finding a better solution for dynamic resource binding.
pkg/scheduler/metrics/metrics.go (outdated):

	@@ -87,6 +87,30 @@ var (
			Name: "total_preemption_attempts",
			Help: "Total preemption attempts in the cluster till now",
		})
	AssumeBindVolumesLatency = prometheus.NewHistogram(
I know the function name is assume and bind, but I think it makes more sense for the metric to be named just "AssumeVolumesLatency" because the function doesn't actually do the bind operation, it just queues it.
Done
pkg/scheduler/scheduler.go (outdated):

	@@ -350,6 +352,8 @@ func (sched *Scheduler) bindVolumesWorker() {
			Status: v1.ConditionFalse,
			Reason: reason,
		})

		metrics.BindPodVolumeLatency.Observe(metrics.SinceInMicroseconds(start))
It may also be good to add a metric to count VolumeBindingFailed. I don't expect it to happen often.
@msau42 Done
I do not know why.
@wackxu can you also add metrics to pkg/controller/volume/persistentvolume/scheduler_binder_cache.go? I want to track how many times a binding was added to the cache, and how many times it was removed from the cache. This could help detect potential memory leaks.
Regarding your import failures: because the change adds an import to the predicates pkg, it triggers the allowed-import check. You can modify the import restrictions file, like this.
(force-pushed 25ec499 to e4e0612)
@msau42 Done, PTAL
	@@ -59,6 +60,8 @@ func (c *podBindingCache) DeleteBindings(pod *v1.Pod) {
		podName := getPodName(pod)
		delete(c.bindings, podName)

		metrics.VolumeBindingDeleteFromSchedulerBinderCache.Inc()
I think it would be better to only record when it's actually removed (it could have already been removed previously)
	@@ -72,6 +75,8 @@ func (c *podBindingCache) UpdateBindings(pod *v1.Pod, node string, bindings []*b
		c.bindings[podName] = nodeBinding
	}
	nodeBinding[node] = bindings

	metrics.VolumeBindingAddToSchedulerBinderCache.Inc()
And here, only record when it's actually added.
@msau42 Done, PTAL
	@@ -71,6 +75,10 @@ func (c *podBindingCache) UpdateBindings(pod *v1.Pod, node string, bindings []*b
			nodeBinding = nodeBindings{}
			c.bindings[podName] = nodeBinding
		}
		if _, ok := nodeBinding[node]; !ok {
			metrics.VolumeBindingAddToSchedulerBinderCache.Inc()
I think this can just be updated in the block above, since the metric is based on the pod name, not the node.
Done
	@@ -1588,6 +1591,7 @@ func (c *VolumeBindingChecker) predicate(pod *v1.Pod, meta algorithm.PredicateMe
		return false, failReasons, nil
	}

	metrics.VolumeBindingPredicateLatency.Observe(metrics.SinceInMicroseconds(start))
I think we should also record latency when predicate returns failReasons because it still has to do lots of calculations in FindPodVolumes.
pkg/scheduler/metrics/metrics.go (outdated):

	VolumeBindingFailed = prometheus.NewCounter(
		prometheus.CounterOpts{
			Subsystem: schedulerSubsystem,
			Name:      "total_volume_binding_failed",
I think the convention is to have count or latency at the end of the name, so something like "volume_binding_error_count"
pkg/scheduler/scheduler.go (outdated):

	@@ -298,6 +299,7 @@ func (sched *Scheduler) assumeAndBindVolumes(assumed *v1.Pod, host string) error
		}
		return err
	}
	metrics.AssumeVolumesLatency.Observe(metrics.SinceInMicroseconds(start))
We should observe the latency even when there is an error.
	@@ -56,6 +73,8 @@ type PVCLister interface {
	func Register(pvLister PVLister, pvcLister PVCLister) {
		registerMetrics.Do(func() {
			prometheus.MustRegister(newPVAndPVCCountCollector(pvLister, pvcLister))
			prometheus.MustRegister(VolumeBindingAddToSchedulerBinderCache)
I think the registration here is not going to work. This registration is called in the PV controller process, but the volume binding cache is a library used by the scheduler process. So I think the library needs to export a register method that the scheduler will call.
These registration calls should be removed from this file, because this is called for PV controller process, not scheduler process.
@wackxu thanks for working on this. I just wanted to check if you will still have time to continue on this?
@msau42 Sorry for the delay, will update later
This is looking really good, thanks! Just one minor nit:

	// RegisterForScheduler is used for scheduler, because the volume binding cache is a library
	// used by scheduler process.
	func RegisterForScheduler() {
Can you call this "RegisterVolumeSchedulingMetrics" to be more clear?
/lgtm
/assign @bsalamat
	@@ -21,6 +21,7 @@ import (
		"time"

		"github.com/prometheus/client_golang/prometheus"
		"k8s.io/kubernetes/pkg/controller/volume/persistentvolume"
Any chance we could separate the metrics in a different package and import only the metrics package instead of the whole persistent volume controller?
@wackxu would you be able to look into this?
We already import k8s.io/kubernetes/pkg/controller/volume/persistentvolume in the scheduler, and the metric is used only by scheduler code such as the scheduler_binder files. I think we can extract a separate dir for those files in a follow-up PR.
That sounds reasonable.
/test all
Tests are more than 96 hours old. Re-running tests.
@wackxu a lot of refactoring has gone into 1.12, so this change needs to be rebased. Do you still have time to work on this?
@msau42 Sorry for the delay, will update today
/approve
	VolumeBindingFailed = prometheus.NewCounter(
		prometheus.CounterOpts{
			Subsystem: VolumeSchedulerSubsystem,
			Name:      "binding_error_total",
Can we also have error count for assume and predicates too? (similar to stage latency)
Sorry for the delay, updated
/lgtm Thanks for continuing to work on this!
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, msau42, wackxu. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
/retest

1 similar comment

/retest
What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #56162

Special notes for your reviewer:
/assign @msau42

Release note: