Add prometheus metrics to internal controller #132

Merged (11 commits) on Nov 2, 2018

Conversation

@JoelSpeed (Contributor) commented Sep 3, 2018

Fixes #119

Adds a Prometheus metrics server and some metrics to the controller runtime to allow users to see inside the controllers they are building.

Metrics Added:

  • Reconcile Errors: Counter of how many reconcile errors have occurred (per controller)
  • Queue Length: How many items are in the reconcile queue (per controller)
  • Reconcile Time: Histogram metric for how long reconciles are taking (per controller)

Changes to operation:

  • Manager creates a listener in New which binds to the given address (default :8080)
  • Manager creates a prometheus registry which will be served on the listener
  • Manager .Start() starts serving the prometheus registry from the manager
  • Manager .Add() adds metrics from the runnable to the manager's registry
  • Manager exposes .AddMetrics() to add more metrics to the registry
    • This is where kubebuilder consumers will add metrics: in their .add() method they can register their metrics with the Manager
  • Controllers embed a Metrics struct
    • Metrics struct exposes GetCollectors() to allow Manager.Add() to register metrics from the controller
    • NewController creates a new set of metrics for use within this particular instance of the controller
    • processNextWorkItem updates the metrics within the Controller's Metrics struct

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 3, 2018
@JoelSpeed (Contributor, Author)

/assign @DirectXMan12

@DirectXMan12 (Contributor) left a comment

Thanks for the PR!

So, my general comment would be:

Is there a good reason to not just have a global or semi-global controller-runtime registry that people can use? At the very least, I think it's probably better here to just plumb through a Prometheus registry that things can get a handle to/get injected, and then register metrics against that.

Is there a compelling use case for the extra abstraction layers on top of that?

@JoelSpeed (Contributor, Author)

@DirectXMan12 I'm not following your last comment, sorry. Thanks for getting back quick though!

I've added a more in depth description to the top of this PR which describes the changes I've made, perhaps we could work from that to coordinate our understanding of the problem?

Is there a good reason to not just have a global or semi-global controller-runtime registry that people can use?

I think the registry within the Manager is semi-global, is it not? Any Controller added to the Manager will have its metrics registered to this instance of the Manager, right? Do you mean to make it some exported global within some package, so users would import a package and register to that?

At the very least, I think it's probably better here to just plumb through a Prometheus registry that things can get a handle to/get injected, and then register metrics against that.

I believe the PR already solves that: the Manager's registry has two points of access, GetRegistry and AddMetrics, either of which would be reachable within the controller implementations, so kubebuilder users could register metrics to the Manager's registry when they add their reconcile function to the controller within their .add method. This is how I saw it being used, anyhow.

Is there a compelling use case for the extra abstraction layers on top of that?

Which abstraction are you talking about? I'm lost at this point 😅

@DirectXMan12 (Contributor)

The extra abstraction that I'm talking about is the last two bullet points. I don't see much of a reason not to just have a global controller-runtime prometheus registry (or use the default, maybe?), and if people want to add new metrics, they can just import sigs.k8s.io/controller-runtime/pkg/metrics and then metrics.Registry.MustRegister(...).
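For reference, the registration pattern being suggested here would look roughly like the sketch below; the metric itself and its package name are invented for illustration, and metrics.Registry is the global registry as it exists in the controller-runtime metrics package today:

package mymetrics // hypothetical consumer package

import (
    "github.com/prometheus/client_golang/prometheus"

    "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// widgetsReconciled is an illustrative custom metric; name and help text are made up.
var widgetsReconciled = prometheus.NewCounter(prometheus.CounterOpts{
    Name: "widgets_reconciled_total",
    Help: "Total number of widgets reconciled",
})

func init() {
    // Register against the controller-runtime global registry rather than
    // the prometheus default registry.
    metrics.Registry.MustRegister(widgetsReconciled)
}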

@JoelSpeed (Contributor, Author)

@DirectXMan12 Got ya! I added this extra abstraction after discussing globals vs non-globals with a colleague. We just went for our preference, which is to avoid globals as much as possible, but I'm happy to modify the PR to make a global sigs.k8s.io/controller-runtime/pkg/metrics if you'd prefer.

The reason I'm suggesting not to use the default prometheus registry is in case you have multiple packages all setting up metrics handling capabilities within one binary. For instance, we plan to run a webhook alongside one of our controllers which sets up its own metrics endpoint using the global prometheus registry (part of the framework). I'm not personally a fan, as both the webhook and the controllers would then serve the same set of metrics if they both used the global registry.

I don't know if that's a common problem, but to me, not using the default prometheus registry seems cleaner. Happy to meet in the middle with a controller-runtime registry global if that's what you think is the best approach?

@DirectXMan12 (Contributor)

The reason I'm suggesting not to use the default prometheus registry is in case you have multiple packages all setting up metrics handling capabilities within one binary

Yeah, I agree with that part :-)

I don't know if that's a common problem, but to me, not using the default prometheus registry seems cleaner. Happy to meet in the middle with a controller-runtime registry global if that's what you think is the best approach?

The generally accepted pattern for prometheus is using globals of some variety, AFAIK, so a controller-runtime-global registry might be the best approach. Then, if people want to serve it off the default endpoint, you should just be able to use prometheus.Gatherers to wrap it.
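For reference, a sketch of the prometheus.Gatherers wrapping mentioned above, assuming the controller-runtime registry is exposed as ctrlmetrics.Registry; the function name and address handling are illustrative:

package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"

    ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// serveCombinedMetrics exposes both the prometheus default registry and the
// controller-runtime registry on a single /metrics endpoint.
func serveCombinedMetrics(addr string) error {
    gatherers := prometheus.Gatherers{
        prometheus.DefaultGatherer,
        ctrlmetrics.Registry,
    }
    mux := http.NewServeMux()
    mux.Handle("/metrics", promhttp.HandlerFor(gatherers, promhttp.HandlerOpts{}))
    return http.ListenAndServe(addr, mux)
}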

@JoelSpeed (Contributor, Author)

@DirectXMan12 I've added a couple of commits:

The first rewrites the metrics registry to be a global, as we previously discussed.

The second adds some tests and modifies the internal metrics to make sure that you can actually register multiple controllers with one manager. The way I was doing it before, you'd only ever be able to add one, so I've added test cases for this and made sure it all works with multiple controller instances.

@DirectXMan12 (Contributor) left a comment

Ultimately, it's still unclear to me why we want all the extra structs/interfaces/per-controller metrics objects here. Can you elaborate a bit? It's not like the client-go ones, which get used to skip metrics collection or to avoid pulling in a Prometheus dependency...

import "github.com/prometheus/client_golang/prometheus"

// Metrics holds the prometheus metrics used internally by the controller
type Metrics struct {
Contributor:

it's unclear to me why we're not just defining these somewhere as is the normal Prometheus pattern.

Contributor:

and then later on doing

ctrlmetrics.QueueLength.WithLabelValues(...).Observe

Contributor Author:

I was considering the fact that each controller runs in a separate goroutine; perhaps it is better to have one instance of each metric per controller to reduce cross-thread communication.

Happy to make the change as you've suggested if you're confident that writing to prometheus metrics is thread-safe. I wasn't sure whether they were or not, so I went for what I thought was the safer option.

Contributor:

yeah, Observe and friends are thread-safe:

All exported functions and methods are safe to be used concurrently unless specified otherwise.

https://godoc.org/github.com/prometheus/client_golang/prometheus
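For reference, the "normal Prometheus pattern" discussed in this thread looks roughly like the sketch below: collectors defined once at package level and shared across controller goroutines. The metric names, help strings, and label sets here are illustrative rather than exactly what merged.

// Package metrics holds the shared, package-level collectors.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
    // ReconcileErrors counts reconcile errors, labelled by controller.
    ReconcileErrors = prometheus.NewCounterVec(prometheus.CounterOpts{
        Name: "controller_runtime_reconcile_errors_total",
        Help: "Total number of reconcile errors per controller",
    }, []string{"controller"})

    // ReconcileTime records how long each reconcile takes, per controller.
    ReconcileTime = prometheus.NewHistogramVec(prometheus.HistogramOpts{
        Name: "controller_runtime_reconcile_time_seconds",
        Help: "Length of time per reconcile per controller",
    }, []string{"controller"})

    // QueueLength reports the current workqueue depth, per controller.
    QueueLength = prometheus.NewGaugeVec(prometheus.GaugeOpts{
        Name: "controller_runtime_reconcile_queue_length",
        Help: "Length of reconcile queue per controller",
    }, []string{"controller"})
)

A controller worker then records into these shared collectors, e.g. ReconcileTime.WithLabelValues(controllerName).Observe(duration.Seconds()), relying on the thread-safety guarantee quoted above.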

})
mux := http.NewServeMux()
mux.Handle("/metrics", handler)
server := http.Server{
Contributor:

We might want to consider using the existing Kubernetes machinery for this. Can be a follow-up PR.

Contributor Author:

Would you like me to add a TODO?

Contributor:

yes, please

Contributor:

If a user adds webhooks, then we have a separate listener for the webhook. I am assuming the default ports don't conflict.

Contributor Author:

I've double-checked: the other listener defaults to 443, whereas this one defaults to 8080.

@JoelSpeed (Contributor, Author)

@DirectXMan12 I've changed it all to be globals and also rebased and squashed a bunch of the commits. Does this look more sensible now?

One other thing I've changed: the metrics serving now has a 0 option to stop the listener being created in the first place. I've used this in tests, where it was getting messy. I'm not sure if this is the best way to implement this; any ideas?

@droot (Contributor) commented Sep 20, 2018

@JoelSpeed coming late to the party. Will def. take a look at it tomorrow.

@DirectXMan12 (Contributor)

Have a review in progress now that I'm back, but GitHub isn't letting me leave review comments (looks like a temporary blip in their system). I'll try again this afternoon.

@DirectXMan12 (Contributor) left a comment

nit inline, otherwise looks good.

As for disabling listening, not sure there's an easy way if we want to have a nice default, except maybe having a separate option.

// Shutdown the server when stop is closed
select {
case <-stop:
server.Shutdown(context.TODO())
Contributor:

context.Background(), unless this is actually a TODO
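A minimal sketch of the suggested change, assuming the server variable from the hunk above and the manager's stop channel; the package name and logging are illustrative:

package metricsserver // hypothetical

import (
    "context"
    "log"
    "net/http"
)

// shutdownOnStop shuts the metrics server down gracefully once stop closes.
func shutdownOnStop(server *http.Server, stop <-chan struct{}) {
    <-stop
    // context.Background() rather than context.TODO(): there is genuinely no
    // parent context here; it is not a placeholder awaiting a real one.
    if err := server.Shutdown(context.Background()); err != nil {
        log.Printf("error shutting down metrics server: %v", err)
    }
}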

@JoelSpeed (Contributor, Author)

@DirectXMan12 I fixed the context.Background() thing.

As an FYI, I chose 0 to disable serving of metrics, since this seems to fit with the way other Kubernetes components handle disabling serving.

E.g. --secure-port on kube-apiserver:

The port on which to serve HTTPS with authentication and authorization. If 0, don't serve HTTPS at all.
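For reference, disabling the metrics listener from a consumer might look like the sketch below; the MetricsBindAddress field name follows later (pre-v0.16) controller-runtime releases rather than necessarily matching this PR, and newer releases move it under Metrics.BindAddress:

package main

import (
    ctrl "sigs.k8s.io/controller-runtime"
)

func newManagerWithoutMetrics() (ctrl.Manager, error) {
    // "0" disables the metrics listener entirely, mirroring the
    // kube-apiserver --secure-port=0 convention quoted above.
    return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        MetricsBindAddress: "0",
    })
}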

@DirectXMan12 (Contributor)

ack, that makes sense

/approve

fix the rebase issue, then I'll lgtm this

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: DirectXMan12, JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Sep 27, 2018

// ReconcileErrors is a prometheus counter metrics which holds the total
// number of errors from the Reconciler
ReconcileErrors = prometheus.NewCounterVec(prometheus.CounterOpts{
@lilic commented Sep 28, 2018

Besides ReconcileErrors, I would suggest also adding ReconcileTotal. That way we can see if the rate of errors was too high over the past 5 minutes.


small typo in the suggested new metric name, should probably be ReconcileTotal

Contributor Author:

@droot @DirectXMan12 Do you think this would be a worthwhile metric to integrate into this PR?
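For reference, a counter along the lines lilic suggests might look like this sketch; the metric name and labels are illustrative:

// An addition to the package-level collectors sketched earlier.
package metrics

import "github.com/prometheus/client_golang/prometheus"

// ReconcileTotal counts every reconciliation, successful or not, so that an
// error ratio can be derived from ReconcileErrors over a time window.
var ReconcileTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
    Name: "controller_runtime_reconcile_total",
    Help: "Total number of reconciliations per controller",
}, []string{"controller"})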

Gopkg.lock Outdated
packages = [
"prometheus",
"prometheus/promhttp",
]
pruneopts = "UT"
revision = "c5b7fccd204277076155f10851dad72b76a49317"
version = "v0.8.0"

There are some nice features in master that are not in this release; maybe now or in the future this could be changed to track master instead, as this release is old.

Contributor:

I would recommend using the latest stable release, not master.

One more point: since this is a direct dependency, please add a dep constraint in Gopkg.toml so that dep gets this hint while resolving deps for a kubebuilder project.

@JoelSpeed (Contributor, Author)

@DirectXMan12 I've rebased and squashed the last couple of commits into earlier ones; it should be ready to go now.

@hasbro17 (Contributor)

@JoelSpeed Any update on this? Seems like one of the tests timed out.

@k8s-ci-robot added the size/XXL label and removed the size/XL label on Oct 31, 2018
@JoelSpeed (Contributor, Author)

@droot Apologies for the delay. I've resolved the conflict and pinned the prometheus dependency to v0.9.0 in the Gopkg.toml

@droot added the lgtm label on Oct 31, 2018
@k8s-ci-robot removed the lgtm label on Nov 1, 2018
@JoelSpeed (Contributor, Author)

@droot @DirectXMan12 I have been looking at some of our metrics today and realised that, at present, this implementation in controller-runtime doesn't serve metrics for non-leader pods, so you will see targets down in Prometheus. This is probably worth addressing at some point, but I wanted to ask whether you'd like me to fix that before merging this, or follow up and add a TODO for it.

@droot (Contributor) commented Nov 1, 2018

@droot @DirectXMan12 I have been looking at some of our metrics today and realised that, at present, this implementation in controller-runtime doesn't serve metrics for non-leader pods, so you will see targets down in Prometheus. This is probably worth addressing at some point, but I wanted to ask whether you'd like me to fix that before merging this, or follow up and add a TODO for it.

I am OK with fixing that as a follow-up PR. Please file an issue for that.

@DirectXMan12 (Contributor)

/lgtm

@k8s-ci-robot added the lgtm label on Nov 2, 2018
@k8s-ci-robot merged commit 7748cf9 into kubernetes-sigs:master on Nov 2, 2018
justinsb pushed a commit to justinsb/controller-runtime referencing this pull request on Dec 7, 2018: Add prometheus metrics to internal controller
DirectXMan12 pushed a commit referencing this pull request on Jan 31, 2020: Add support for creating core type controllers