-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding prometheus metrics for ASB #497
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with one question.
pkg/app/app.go
Outdated
@@ -335,11 +337,14 @@ func (a *App) Start() { | |||
daHandler := handler.NewHandler(a.broker, a.log.Logger, a.config.Broker, clusterURL, providers, rules) | |||
|
|||
if clusterURL == "/" { | |||
genericserver.Handler.NonGoRestfulMux.HandlePrefix("/", daHandler) | |||
genericserver.Handler.NonGoRestfulMux.HandlePrefix("/", prometheus.InstrumentHandler("ansible-service-broker", daHandler)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may make it clear my ignorance. Could you not update daHandler
above to something like:
daHandler := prometheus.InstrumentHandler(
"ansible-service-broker",
handler.NewHandler(a.broker, a.log.Logger, a.config.Broker, clusterURL, providers, rules),
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I bet you could, and that would probably be much more clear :) let me double check 👍
Does this need a bugzilla since I'm not sure if it is being included in 3.7. |
This PR does not need a bugzilla, it is one of two trello cards which have approval for being worked on post the general feature complete date 3.7. Other card is the service instance work. Both features must be completed prior to this Friday. |
@jcantrill @smarterclayton @liggitt do you have any input on this PR. It's our first cut at adding prometheus metrics for the Ansible Broker as per card: https://trello.com/c/Cruuo5Vl This is a sample of the output returned: Endpoint is accessed by:
Assuming that the logged in user has 'cluster-debugger-role' for accessing this endpoint. Notes here on the metrics we are gathering: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some questions and suggestions for metric names.
pkg/apb/svc_acct.go
Outdated
@@ -54,6 +55,7 @@ func (s *ServiceAccountManager) CreateApbSandbox( | |||
executionContext ExecutionContext, | |||
apbRole string, | |||
) (string, error) { | |||
metrics.SandboxCreated() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we actually mark created before it is really created? What happens if this method returns with an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nothing bad would happen, our metrics would say we created a sandbox when it actually failed to create the sandbox, I think that moving it down to once it is actually created makes sense to me.
pkg/apb/svc_acct.go
Outdated
@@ -283,6 +285,7 @@ func (s *ServiceAccountManager) DestroyApbSandbox(executionContext ExecutionCont | |||
// "If there is an error, it will be of type *PathError" | |||
// We don't care, because it's gone | |||
os.Remove(filePathFromHandle(executionContext.PodName)) | |||
metrics.SandboxDeleted() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to do it once things have actually been deleted.
|
||
if clusterURL == "/" { | ||
genericserver.Handler.NonGoRestfulMux.HandlePrefix("/", daHandler) | ||
} else { | ||
genericserver.Handler.NonGoRestfulMux.HandlePrefix(fmt.Sprintf("%v/", clusterURL), daHandler) | ||
} | ||
|
||
defaultMetrics := routes.DefaultMetrics{} | ||
defaultMetrics.Install(genericserver.Handler.NonGoRestfulMux) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this doing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we couldn't just add a metrics endpoint like we did with /v2 and the dev endpoints?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/kubernetes/apiserver/blob/master/pkg/server/routes/metrics.go#L31
We totally could, we get metrics that the apiserver sets up by default. Some have to do with apiserver audit events, those I would not want to set up on my own. But the others, (golang process metrics) we could set up on our own. Figured we already have apiserver in might as well use the stuff it provides for "free".
daHandler := prometheus.InstrumentHandler( | ||
"ansible-service-broker", | ||
handler.NewHandler(a.broker, a.log.Logger, a.config.Broker, clusterURL, providers, rules), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense, a "middleware" :) handler around our handler.
pkg/metrics/metrics.go
Outdated
func recoverMetricPanic() { | ||
if r := recover(); r != nil { | ||
log.Errorf("Recovering from metric function - %v", r) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does recover
do? What are we recovering?
pkg/broker/deprovision_subscriber.go
Outdated
|
||
go func() { | ||
d.log.Info("Listening for deprovision messages") | ||
for { | ||
msg := <-msgBuffer | ||
var dmsg *DeprovisionMsg | ||
metrics.RemoveDeprovisionJob() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as RemoveProvisionJob.
pkg/metrics/metrics.go
Outdated
} | ||
|
||
// AddProvisionJob - Add a provision job to the counter. | ||
func AddProvisionJob() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about ProvisionJobStarted?
pkg/metrics/metrics.go
Outdated
} | ||
|
||
// AddDeprovisionJob - Add a deprovision job to the counter. | ||
func AddDeprovisionJob() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DeprovisionJobStarted
pkg/metrics/metrics.go
Outdated
} | ||
|
||
// RemoveProvisionJob - Remove a provision job to the counter. | ||
func RemoveProvisionJob() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ProvisionJobFinished
pkg/metrics/metrics.go
Outdated
} | ||
|
||
// RemoveDeprovisionJob - Remove a deprovision job to the counter. | ||
func RemoveDeprovisionJob() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DeprovisionJobFinished
or something like PopDeprovisionJob
(of course make it consistent with provision) :)
d92c25f
to
45d0060
Compare
pkg/metrics/metrics.go
Outdated
) | ||
|
||
func init() { | ||
prometheus.MustRegister(sandboxCreated) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This panics if there's any error. If we want to have error checking around this we should use prometheus.Register
. It's the same thing, but returns error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the pod should probably blow up if we can't expose metrics, this seems like the right thing to do if people are going to use the metrics to determine if things are wrong.
This is different than a metric not working (guarding with the recover function), this is all of the Prometheus client is not working. I think this is why I would want to error the broker loudly and early, thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm always a proponent of loud and early failure for systems deemed important enough. Curious, is it possible to configure metrics as enabled/disabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still are going to error loudly and early. It's just like how we check if we can connect to the etcd endpoint. But, we will have control over the error message so we can have a smoother exit.
Help: "Counter of how many times the specs have been reset.", | ||
}) | ||
|
||
provisionJob = prometheus.NewGauge( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think another good metric is a counter of the number of things provisioned and deprovisioned. Ditto bind and unbind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add to https://docs.google.com/document/d/1ui57sb3kMf2HEt6LelfuDEGjrAhbSaD4NhYPljecyRE/edit#
I think that we can add better metrics for 3.8, but need to get some basics up today.
Do you feel super strongly that we should have them for the initial PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me to get a baseline and then iterate once we have some feedback on what is useful in the wild.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO keeping track of number of bind/unbind is important. It's one the four APIs we track and care about. We can add better metrics for 3.8, but to me this seems like a core metric we want.
pkg/metrics/metrics.go
Outdated
) | ||
|
||
var ( | ||
sandboxCreated = prometheus.NewCounter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we want to count all the namespaces created? To me this metric means 'the total number of actions the broker has performed'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think sandbox is specific, this will give us some nice indirection metrics, say if we have a lot of sandbox creations and no sandbox deletions then things are going very wrong somewhere. This is not the actions, bind and unbind are configured by default to not launch APB's and therefore will not be counted in this metric.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think sandbox is specific, this will give us some nice indirection metrics, say if we have a lot of
sandbox creations and no sandbox deletions then things are going very wrong somewhere.
It's a good metric to be aware that the sandboxes are being cleaned up, but we should do that with a gauge instead of two counters.
This is not the actions, bind and unbind are configured by default to not launch APB's and
therefore will not be counted in this metric.
We can still count them if you increment in the if block.
ansible-service-broker/pkg/broker/broker.go
Line 865 in 81381b1
if a.brokerConfig.LaunchApbOnBind { |
} | ||
|
||
// SandboxCreated - Counter for how many sandbox created. | ||
func SandboxCreated() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for these hooks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't have much to add that hasn't already been said. Looks like this will be really useful! 👍
45d0060
to
5b081f2
Compare
@rthallisey Can you double check that the agreed upon changes are what you were thinking? |
* Adding prometheus metrics for ASB * updating based on PR comments. * updating based on PR comments * updating based on PR comments * fixing typos
* Adding prometheus metrics for ASB * updating based on PR comments. * updating based on PR comments * updating based on PR comments * fixing typos
* Adding prometheus metrics for ASB * updating based on PR comments. * updating based on PR comments * updating based on PR comments * fixing typos
Describe what this PR does and why we need it:
Adds Prometheus metrics for the ASB
Changes proposed in this pull request
apiserver
/metrics
Does this PR depend on another PR (Use this to track when PRs should be merged)
depends-on
N/A
Which issue this PR fixes (This will close that issue when PR gets merged)
N/A