Adding prometheus metrics for ASB #497

shawn-hurley · 2017-10-16T17:57:23Z

Describe what this PR does and why we need it:
Adds Prometheus metrics for the ASB
Changes proposed in this pull request

Metrics package to manage the Prometheus metrics
Adding new endpoint from apiserver /metrics

Does this PR depend on another PR (Use this to track when PRs should be merged)
depends-on
N/A
Which issue this PR fixes (This will close that issue when PR gets merged)
N/A

djzager

LGTM with one question.

djzager · 2017-10-16T18:16:12Z

pkg/app/app.go

@@ -335,11 +337,14 @@ func (a *App) Start() {
 	daHandler := handler.NewHandler(a.broker, a.log.Logger, a.config.Broker, clusterURL, providers, rules)

 	if clusterURL == "/" {
-		genericserver.Handler.NonGoRestfulMux.HandlePrefix("/", daHandler)
+		genericserver.Handler.NonGoRestfulMux.HandlePrefix("/", prometheus.InstrumentHandler("ansible-service-broker", daHandler))


This may make my ignorance clear, but could you not update daHandler above to something like:

daHandler := prometheus.InstrumentHandler(
	"ansible-service-broker",
	handler.NewHandler(a.broker, a.log.Logger, a.config.Broker, clusterURL, providers, rules),
)

I bet you could, and that would probably be much more clear :) let me double check 👍

djzager · 2017-10-16T18:18:19Z

Does this need a Bugzilla? I'm not sure whether it is being included in 3.7.

shawn-hurley · 2017-10-16T18:22:19Z

(8) [CM-OPS-Tools] Ansible Service Broker Prometheus endpoint coverage

jwmatthews · 2017-10-16T18:34:46Z

This PR does not need a Bugzilla. It is one of two Trello cards approved to be worked on after the general feature-complete date for 3.7. The other card is the service instance work.

Both features must be completed prior to this Friday.

jwmatthews · 2017-10-16T19:19:46Z

@jcantrill @smarterclayton @liggitt
CC: @pweil-

Do you have any input on this PR? It's our first cut at adding Prometheus metrics for the Ansible Broker, as per card: https://trello.com/c/Cruuo5Vl

This is a sample of the output returned:
https://hastebin.com/onaqimepux.coffeescript

Endpoint is accessed by:

curl -k -H "Authorization: bearer `oc whoami -t`" https://asb-1338-ansible-service-broker.172.17.0.1.nip.io/metrics

This assumes that the logged-in user has 'cluster-debugger-role' for accessing this endpoint.

Notes here on the metrics we are gathering:
https://docs.google.com/document/d/1ui57sb3kMf2HEt6LelfuDEGjrAhbSaD4NhYPljecyRE/edit

jmrodri

Some questions and suggestions for metric names.

jmrodri · 2017-10-16T19:31:41Z

pkg/apb/svc_acct.go

@@ -54,6 +55,7 @@ func (s *ServiceAccountManager) CreateApbSandbox(
 	executionContext ExecutionContext,
 	apbRole string,
 ) (string, error) {
+	metrics.SandboxCreated()


Do we actually mark created before it is really created? What happens if this method returns with an error?

Nothing bad would happen, but our metrics would say we created a sandbox when creation actually failed. Moving the call down to the point where the sandbox has actually been created makes sense to me.

jmrodri · 2017-10-16T19:31:59Z

pkg/apb/svc_acct.go

@@ -283,6 +285,7 @@ func (s *ServiceAccountManager) DestroyApbSandbox(executionContext ExecutionCont
 	// "If there is an error, it will be of type *PathError"
 	// We don't care, because it's gone
 	os.Remove(filePathFromHandle(executionContext.PodName))
+	metrics.SandboxDeleted()


It makes sense to do this once things have actually been deleted.

jmrodri · 2017-10-16T19:32:49Z

pkg/app/app.go


 	if clusterURL == "/" {
 		genericserver.Handler.NonGoRestfulMux.HandlePrefix("/", daHandler)
 	} else {
 		genericserver.Handler.NonGoRestfulMux.HandlePrefix(fmt.Sprintf("%v/", clusterURL), daHandler)
 	}

+	defaultMetrics := routes.DefaultMetrics{}
+	defaultMetrics.Install(genericserver.Handler.NonGoRestfulMux)


What is this doing?

I guess we couldn't just add a metrics endpoint like we did with /v2 and the dev endpoints?

https://github.com/kubernetes/apiserver/blob/master/pkg/server/routes/metrics.go#L31

We totally could. This way we get the metrics that the apiserver sets up by default. Some have to do with apiserver audit events, and those I would not want to set up on my own. The others (golang process metrics) we could set up on our own. I figured that since we already have apiserver in, we might as well use the stuff it provides for "free".

jmrodri · 2017-10-16T19:33:17Z

pkg/app/app.go

+	daHandler := prometheus.InstrumentHandler(
+		"ansible-service-broker",
+		handler.NewHandler(a.broker, a.log.Logger, a.config.Broker, clusterURL, providers, rules),
+	)


This makes sense, a "middleware" :) handler around our handler.

jmrodri · 2017-10-16T19:35:32Z

pkg/metrics/metrics.go

+func recoverMetricPanic() {
+	if r := recover(); r != nil {
+		log.Errorf("Recovering from metric function - %v", r)
+	}


What does recover do? What are we recovering?
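As background for the question: `recover` is a Go builtin that stops a panic and returns the panic value, but only when it is called directly by a deferred function. The sketch below shows how a helper like the one in the diff would be used. `safeInc` is a hypothetical metric hook, and `fmt` stands in for the broker's logger.

```go
package main

import "fmt"

// recoverMetricPanic mirrors the helper in the diff, with fmt standing in for
// the broker's logger. It must be invoked via defer for recover to take effect.
func recoverMetricPanic() {
	if r := recover(); r != nil {
		fmt.Printf("Recovering from metric function - %v\n", r)
	}
}

// safeInc is a hypothetical metric hook whose update step may panic.
// It reports whether update ran to completion; after a recovered panic the
// named result keeps its zero value, false.
func safeInc(update func()) (ok bool) {
	defer recoverMetricPanic()
	update()
	return true
}

func main() {
	fmt.Println(safeInc(func() {}))
	fmt.Println(safeInc(func() { panic("metric not registered") }))
	fmt.Println("broker keeps running")
}
```

The effect is that a misbehaving metric call is logged and skipped instead of taking down the goroutine that was doing real broker work.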

jmrodri · 2017-10-16T19:46:17Z

pkg/broker/deprovision_subscriber.go


 	go func() {
 		d.log.Info("Listening for deprovision messages")
 		for {
 			msg := <-msgBuffer
+			var dmsg *DeprovisionMsg
+			metrics.RemoveDeprovisionJob()


Same comment as RemoveProvisionJob.

jmrodri · 2017-10-16T19:46:53Z

pkg/metrics/metrics.go

+}
+
+// AddProvisionJob - Add a provision job to the counter.
+func AddProvisionJob() {


What about ProvisionJobStarted?

jmrodri · 2017-10-16T19:47:04Z

pkg/metrics/metrics.go

+}
+
+// AddDeprovisionJob - Add a deprovision job to the counter.
+func AddDeprovisionJob() {


DeprovisionJobStarted

jmrodri · 2017-10-16T19:47:13Z

pkg/metrics/metrics.go

+}
+
+// RemoveProvisionJob - Remove a provision job to the counter.
+func RemoveProvisionJob() {


ProvisionJobFinished

jmrodri · 2017-10-16T19:47:25Z

pkg/metrics/metrics.go

+}
+
+// RemoveDeprovisionJob - Remove a deprovision job to the counter.
+func RemoveDeprovisionJob() {


DeprovisionJobFinished or something like PopDeprovisionJob (of course make it consistent with provision) :)

rthallisey · 2017-10-18T14:59:18Z

pkg/metrics/metrics.go

+)
+
+func init() {
+	prometheus.MustRegister(sandboxCreated)


This panics if there's any error. If we want error checking around this, we should use prometheus.Register. It does the same thing but returns an error.

I think the pod should probably blow up if we can't expose metrics. That seems like the right thing to do if people are going to use the metrics to determine whether things are wrong.

This is different from a single metric not working (which we guard with the recover function). Here the whole Prometheus client is not working, which is why I would want the broker to error loudly and early. Thoughts?

I'm always a proponent of loud and early failure for systems deemed important enough. Curious, is it possible to configure metrics as enabled/disabled?

We still are going to error loudly and early. It's just like how we check if we can connect to the etcd endpoint. But, we will have control over the error message so we can have a smoother exit.
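The difference between the two approaches can be sketched with a simplified registry. Everything here is an assumption made for illustration: the `registry` type only rejects duplicate names and is not the Prometheus client, and the metric names are hypothetical. The point is that an error-returning `Register` lets the caller choose the exit message, where a `MustRegister` style call would panic.

```go
package main

import "fmt"

// registry is a hypothetical, much simplified stand-in for the Prometheus
// default registry: it only rejects duplicate names.
type registry struct{ names map[string]bool }

// Register returns an error instead of panicking, like prometheus.Register.
func (r *registry) Register(name string) error {
	if r.names[name] {
		return fmt.Errorf("metric %q already registered", name)
	}
	r.names[name] = true
	return nil
}

// registerAll stops at the first failure so the caller decides how to exit.
func registerAll(r *registry, names ...string) error {
	for _, n := range names {
		if err := r.Register(n); err != nil {
			return fmt.Errorf("unable to set up metrics: %v", err)
		}
	}
	return nil
}

func main() {
	r := &registry{names: map[string]bool{}}
	// The repeated name forces a registration failure for the demonstration.
	if err := registerAll(r, "sandbox_created", "sandbox_deleted", "sandbox_created"); err != nil {
		// Still loud and early, but with a message the broker controls.
		// A real broker would log this and exit with a non-zero status.
		fmt.Println("fatal:", err)
	}
}
```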

rthallisey · 2017-10-18T15:36:02Z

pkg/metrics/metrics.go

+			Help:      "Counter of how many times the specs have been reset.",
+		})
+
+	provisionJob = prometheus.NewGauge(


I think another good metric is a counter of the number of things provisioned and deprovisioned. Ditto bind and unbind.

Can you add to https://docs.google.com/document/d/1ui57sb3kMf2HEt6LelfuDEGjrAhbSaD4NhYPljecyRE/edit#

I think that we can add better metrics for 3.8, but need to get some basics up today.

Do you feel super strongly that we should have them for the initial PR?

Makes sense to me to get a baseline and then iterate once we have some feedback on what is useful in the wild.

IMO keeping track of the number of binds/unbinds is important. It's one of the four APIs we track and care about. We can add better metrics for 3.8, but to me this seems like a core metric we want.

rthallisey · 2017-10-18T15:41:16Z

pkg/metrics/metrics.go

+)
+
+var (
+	sandboxCreated = prometheus.NewCounter(


Why do we want to count all the namespaces created? To me this metric means 'the total number of actions the broker has performed'.

I think sandbox is specific; this will give us some nice indirect metrics. Say we have a lot of sandbox creations and no sandbox deletions: then things are going very wrong somewhere. This does not count actions: bind and unbind are configured by default not to launch APBs, so they will not be counted in this metric.

I think sandbox is specific; this will give us some nice indirect metrics. Say we have a lot of
sandbox creations and no sandbox deletions: then things are going very wrong somewhere.

It's a good metric to be aware that the sandboxes are being cleaned up, but we should do that with a gauge instead of two counters.

This does not count actions: bind and unbind are configured by default not to launch APBs, so
they will not be counted in this metric.

We can still count them if you increment in the if block.

ansible-service-broker/pkg/broker/broker.go

Line 865 in 81381b1

if a.brokerConfig.LaunchApbOnBind {

eriknelson · 2017-10-18T16:11:01Z

pkg/metrics/metrics.go

+}
+
+// SandboxCreated - Counter for how many sandbox created.
+func SandboxCreated() {


+1 for these hooks.

eriknelson

Don't have much to add that hasn't already been said. Looks like this will be really useful! 👍

shawn-hurley · 2017-10-18T17:28:05Z

@rthallisey Can you double check that the agreed upon changes are what you were thinking?

* Adding prometheus metrics for ASB
* updating based on PR comments.
* updating based on PR comments
* updating based on PR comments
* fixing typos

openshift-ci-robot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) Oct 16, 2017

djzager approved these changes Oct 16, 2017

shawn-hurley added the feature and needs-review labels Oct 16, 2017

shawn-hurley assigned jmrodri, eriknelson and rthallisey Oct 16, 2017

jmrodri unassigned jmrodri, eriknelson and rthallisey Oct 16, 2017

jmrodri requested review from jmrodri, eriknelson and rthallisey October 16, 2017 19:22

jmrodri suggested changes Oct 16, 2017

shawn-hurley force-pushed the prometheus-metrics branch from d92c25f to 45d0060 Compare October 17, 2017 17:45

jmrodri approved these changes Oct 17, 2017

rthallisey suggested changes Oct 18, 2017

eriknelson reviewed Oct 18, 2017

eriknelson approved these changes Oct 18, 2017

shawn-hurley added 4 commits October 18, 2017 12:49

Adding prometheus metrics for ASB

6397365

updating based on PR comments.

5b1a3f1

updating based on PR comments

6ef0017

updating based on PR comments

5b081f2

shawn-hurley force-pushed the prometheus-metrics branch from 45d0060 to 5b081f2 Compare October 18, 2017 17:26

fixing typos

f128d6d

rthallisey approved these changes Oct 18, 2017

rthallisey merged commit bd4f352 into openshift:master Oct 18, 2017
