Adding prometheus metrics for ASB #497

Merged
merged 5 commits into from
Oct 18, 2017

Conversation

shawn-hurley
Contributor

Describe what this PR does and why we need it:
Adds Prometheus metrics for the ASB
Changes proposed in this pull request

  • Adds a metrics package to manage the Prometheus metrics
  • Adds a new /metrics endpoint, served through the apiserver

Does this PR depend on another PR (Use this to track when PRs should be merged)
depends-on
N/A
Which issue this PR fixes (This will close that issue when PR gets merged)
N/A

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 16, 2017
Member

@djzager djzager left a comment

LGTM with one question.

pkg/app/app.go Outdated
@@ -335,11 +337,14 @@ func (a *App) Start() {
daHandler := handler.NewHandler(a.broker, a.log.Logger, a.config.Broker, clusterURL, providers, rules)

if clusterURL == "/" {
-genericserver.Handler.NonGoRestfulMux.HandlePrefix("/", daHandler)
+genericserver.Handler.NonGoRestfulMux.HandlePrefix("/", prometheus.InstrumentHandler("ansible-service-broker", daHandler))
Member

This may make my ignorance clear, but could you not update daHandler above to something like:

daHandler := prometheus.InstrumentHandler(
    "ansible-service-broker",
    handler.NewHandler(a.broker, a.log.Logger, a.config.Broker, clusterURL, providers, rules),
)

Contributor Author

I bet you could, and that would probably be much clearer :) Let me double-check 👍

@djzager
Member

djzager commented Oct 16, 2017

Does this need a Bugzilla? I'm not sure whether it is being included in 3.7.

@shawn-hurley
Contributor Author

@jwmatthews
Member

This PR does not need a Bugzilla. It is one of two Trello cards approved for work after the general feature-complete date for 3.7. The other card is the service instance work.

Both features must be completed prior to this Friday.

@jwmatthews
Member

jwmatthews commented Oct 16, 2017

@jcantrill @smarterclayton @liggitt
CC: @pweil-

Do you have any input on this PR? It's our first cut at adding Prometheus metrics for the Ansible Broker, as per card: https://trello.com/c/Cruuo5Vl

This is a sample of the output returned:
https://hastebin.com/onaqimepux.coffeescript

Endpoint is accessed by:

curl -k -H "Authorization: bearer `oc whoami -t`" https://asb-1338-ansible-service-broker.172.17.0.1.nip.io/metrics 

This assumes that the logged-in user has the 'cluster-debugger-role' needed to access this endpoint.

Notes here on the metrics we are gathering:
https://docs.google.com/document/d/1ui57sb3kMf2HEt6LelfuDEGjrAhbSaD4NhYPljecyRE/edit

Contributor

@jmrodri jmrodri left a comment

Some questions and suggestions for metric names.

@@ -54,6 +55,7 @@ func (s *ServiceAccountManager) CreateApbSandbox(
executionContext ExecutionContext,
apbRole string,
) (string, error) {
metrics.SandboxCreated()
Contributor

Do we actually mark created before it is really created? What happens if this method returns with an error?

Contributor Author

Nothing bad would happen: our metrics would say we created a sandbox when it actually failed to create. Moving the call down to the point where the sandbox has actually been created makes sense to me.

@@ -283,6 +285,7 @@ func (s *ServiceAccountManager) DestroyApbSandbox(executionContext ExecutionCont
// "If there is an error, it will be of type *PathError"
// We don't care, because it's gone
os.Remove(filePathFromHandle(executionContext.PodName))
metrics.SandboxDeleted()
Contributor

It makes sense to do this once things have actually been deleted.


if clusterURL == "/" {
genericserver.Handler.NonGoRestfulMux.HandlePrefix("/", daHandler)
} else {
genericserver.Handler.NonGoRestfulMux.HandlePrefix(fmt.Sprintf("%v/", clusterURL), daHandler)
}

defaultMetrics := routes.DefaultMetrics{}
defaultMetrics.Install(genericserver.Handler.NonGoRestfulMux)
Contributor

What is this doing?

Contributor

I guess we couldn't just add a metrics endpoint like we did with /v2 and the dev endpoints?

Contributor Author

https://github.com/kubernetes/apiserver/blob/master/pkg/server/routes/metrics.go#L31

We totally could. This way we get the metrics that the apiserver sets up by default. Some have to do with apiserver audit events; those I would not want to set up on my own. But the others (golang process metrics) we could set up on our own. I figured that since we already have the apiserver in, we might as well use the stuff it provides for "free".

daHandler := prometheus.InstrumentHandler(
"ansible-service-broker",
handler.NewHandler(a.broker, a.log.Logger, a.config.Broker, clusterURL, providers, rules),
)
Contributor

This makes sense, a "middleware" :) handler around our handler.

func recoverMetricPanic() {
if r := recover(); r != nil {
log.Errorf("Recovering from metric function - %v", r)
}
Contributor

What does recover do? What are we recovering?
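For context on the question: in Go, `recover` stops a panic that is unwinding the current goroutine, but only when it is called directly by a deferred function. The sketch below mirrors the helper from the diff, printing instead of using the broker's logger; `recordMetric` is a hypothetical caller added for illustration.

```go
package main

import "fmt"

// recoverMetricPanic mirrors the helper in the diff. It must itself be the
// deferred function so that recover() sees the in-flight panic.
func recoverMetricPanic() {
	if r := recover(); r != nil {
		fmt.Printf("Recovering from metric function - %v\n", r)
	}
}

// recordMetric (hypothetical) runs a metric update and keeps a panic in that
// update from propagating to the caller.
func recordMetric(update func()) {
	defer recoverMetricPanic()
	update()
}

func main() {
	recordMetric(func() { panic("metric not registered") })
	fmt.Println("broker still running")
}
```

The effect is that a misbehaving metric update is logged and dropped, while the request or job that triggered it continues.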


go func() {
d.log.Info("Listening for deprovision messages")
for {
msg := <-msgBuffer
var dmsg *DeprovisionMsg
metrics.RemoveDeprovisionJob()
Contributor

Same comment as RemoveProvisionJob.

}

// AddProvisionJob - Add a provision job to the counter.
func AddProvisionJob() {
Contributor

What about ProvisionJobStarted?

}

// AddDeprovisionJob - Add a deprovision job to the counter.
func AddDeprovisionJob() {
Contributor

DeprovisionJobStarted

}

// RemoveProvisionJob - Remove a provision job to the counter.
func RemoveProvisionJob() {
Contributor

ProvisionJobFinished

}

// RemoveDeprovisionJob - Remove a deprovision job to the counter.
func RemoveDeprovisionJob() {
Contributor

DeprovisionJobFinished or something like PopDeprovisionJob (of course make it consistent with provision) :)

)

func init() {
prometheus.MustRegister(sandboxCreated)
Contributor

This panics if there's any error. If we want to have error checking around this we should use prometheus.Register. It's the same thing, but returns error.

Contributor Author

I think the pod should probably blow up if we can't expose metrics; this seems like the right thing to do if people are going to use the metrics to determine whether things are wrong.

This is different from a single metric not working (which we guard against with the recover function); here the whole Prometheus client is not working. That is why I would want the broker to error loudly and early. Thoughts?

Contributor

I'm always a proponent of loud and early failure for systems deemed important enough. Curious, is it possible to configure metrics as enabled/disabled?

Contributor

We still are going to error loudly and early. It's just like how we check if we can connect to the etcd endpoint. But, we will have control over the error message so we can have a smoother exit.
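A rough sketch of the trade-off in this thread, using a stand-in registry instead of the Prometheus client: registration that returns an error lets startup code still fail early, but with a message it controls. The `registry` type, `registerAll`, and the metric names are assumptions made for illustration only.

```go
package main

import (
	"errors"
	"fmt"
)

// registry is a hypothetical stand-in for a metrics registry.
type registry struct{ names map[string]bool }

// register returns an error for a duplicate name instead of panicking.
func (r *registry) register(name string) error {
	if r.names[name] {
		return errors.New("duplicate metric: " + name)
	}
	r.names[name] = true
	return nil
}

// registerAll stops at the first failure and wraps it with startup context.
func registerAll(r *registry, names ...string) error {
	for _, n := range names {
		if err := r.register(n); err != nil {
			return fmt.Errorf("metrics setup failed: %w", err)
		}
	}
	return nil
}

func main() {
	r := &registry{names: map[string]bool{}}
	if err := registerAll(r, "sandbox_created", "sandbox_created"); err != nil {
		// A real service would log this and exit non-zero here.
		fmt.Println(err)
	}
}
```

Either approach fails at startup; the difference is whether the process dies with a panic trace or with a message chosen by the caller.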

Help: "Counter of how many times the specs have been reset.",
})

provisionJob = prometheus.NewGauge(
Contributor

I think another good metric is a counter of the number of things provisioned and deprovisioned. Ditto bind and unbind.

Contributor Author

Can you add that to https://docs.google.com/document/d/1ui57sb3kMf2HEt6LelfuDEGjrAhbSaD4NhYPljecyRE/edit#

I think that we can add better metrics for 3.8, but need to get some basics up today.

Do you feel super strongly that we should have them for the initial PR?

Contributor

Makes sense to me to get a baseline and then iterate once we have some feedback on what is useful in the wild.

Contributor

IMO keeping track of the number of binds/unbinds is important. It's one of the four APIs we track and care about. We can add better metrics for 3.8, but to me this seems like a core metric we want.

)

var (
sandboxCreated = prometheus.NewCounter(
Contributor

Why do we want to count all the namespaces created? To me this metric means 'the total number of actions the broker has performed'.

Contributor Author

I think sandbox is specific. This will give us some nice indirection metrics: say, if we have a lot of sandbox creations and no sandbox deletions, then things are going very wrong somewhere. This is not a count of actions; bind and unbind are configured by default to not launch APBs and therefore will not be counted in this metric.

Contributor

I think sandbox is specific, this will give us some nice indirection metrics, say if we have a lot of
sandbox creations and no sandbox deletions then things are going very wrong somewhere.

It's a good metric to be aware that the sandboxes are being cleaned up, but we should do that with a gauge instead of two counters.

This is not the actions, bind and unbind are configured by default to not launch APB's and
therefore will not be counted in this metric.

We can still count them if you increment in the if block.

if a.brokerConfig.LaunchApbOnBind {

}

// SandboxCreated - Counter for how many sandbox created.
func SandboxCreated() {
Contributor

+1 for these hooks.

Contributor

@eriknelson eriknelson left a comment

Don't have much to add that hasn't already been said. Looks like this will be really useful! 👍

@shawn-hurley
Contributor Author

@rthallisey Can you double check that the agreed upon changes are what you were thinking?

@rthallisey rthallisey merged commit bd4f352 into openshift:master Oct 18, 2017
shawn-hurley added a commit to shawn-hurley/ansible-service-broker that referenced this pull request Oct 19, 2017
* Adding prometheus metrics for ASB

* updating based on PR comments.

* updating based on PR comments

* updating based on PR comments

* fixing typos
jianzhangbjz pushed a commit to jianzhangbjz/ansible-service-broker that referenced this pull request May 17, 2018
* Adding prometheus metrics for ASB

* updating based on PR comments.

* updating based on PR comments

* updating based on PR comments

* fixing typos
Labels
feature needs-review size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
7 participants