
feat: add prom metrics for connector->server requests #3200

Merged: 2 commits into main from mxyng/connector-metrics on Sep 19, 2022

Conversation

@mxyng (Collaborator) commented on Sep 12, 2022

Summary

Observe the requests the connector makes to the server and report them as Prometheus metrics.
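For context, here is a minimal sketch of the kind of metric such a change could report, assuming a histogram registered from the connector package; the metric name, namespace, labels, and registration shown here are illustrative, not necessarily what this PR merges.

    package connector

    import "github.com/prometheus/client_golang/prometheus"

    // requestDuration is an illustrative histogram for connector->server
    // requests, labeled by HTTP method, path, and response status.
    var requestDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
        Namespace: "infra",
        Subsystem: "connector",
        Name:      "request_duration_seconds",
        Help:      "Duration of requests from the connector to the server.",
    }, []string{"method", "path", "status"})

    func init() {
        // Register with the default registry so the series are exported
        // alongside the connector's other metrics.
        prometheus.MustRegister(requestDuration)
    }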

Checklist

  • Wrote appropriate unit tests
  • Considered security implications of the change
  • Updated associated docs where necessary
  • Updated associated configuration where necessary
  • Change is backwards compatible if it needs to be (user can upgrade without manual steps?)
  • Nothing sensitive logged
  • Considered data migrations for smooth upgrades

Related Issues

Resolves #2664

@dnephin (Contributor) left a comment


The change seems reasonable to me. I guess someone can use this to solve #2664 by building an alert that checks the time since the last 200 response code from this metric?

I think this metric is valuable for other purposes, but generally how I would expect to monitor something like #2664 would be with a gauge metric that emits the last time the reconciler ran successfully. If the reconcile loop is failing for any reason (not just calls to the infra API), that's a signal that the role bindings are out of date. I think monitoring the success of the reconcile operation is the most accurate signal. I'd use the gauge metric in a monitor that checks whether the value is older than some threshold and sends an alert. Not a blocker, just something to consider.
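As a rough sketch of that alternative (the metric name, reconcile function, and ticker interval below are hypothetical and not part of this PR; imports of context, time, and prometheus assumed, registration omitted), the gauge only advances on success, so staleness becomes the alert signal:

    // lastReconcileSuccess holds the Unix time of the most recent successful
    // reconcile; an alert can fire when time() - this value exceeds a threshold.
    var lastReconcileSuccess = prometheus.NewGauge(prometheus.GaugeOpts{
        Namespace: "infra",
        Subsystem: "connector",
        Name:      "last_successful_reconcile_timestamp_seconds",
        Help:      "Unix time of the last successful role binding reconcile.",
    })

    func runReconcileLoop(ctx context.Context, reconcile func(context.Context) error) {
        ticker := time.NewTicker(30 * time.Second)
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-ticker.C:
                if err := reconcile(ctx); err != nil {
                    continue // on failure the gauge simply stops advancing
                }
                lastReconcileSuccess.SetToCurrentTime()
            }
        }
    }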

@mxyng (Collaborator, Author) commented on Sep 14, 2022

@dnephin you're right that this does not capture the sync status of the cluster's roles but that was never the intent. My understanding of #2664 is that it's specifically concerned with the connectivity between the server and the connector, which tracking requests will address. The sync status of cluster roles is out of scope.

api/client.go (outdated), comment on lines 120 to 122:
    if client.ObserveFunc != nil {
        client.ObserveFunc(start, req, resp)
    }
Contributor:

Nice! I really like this approach for being able to instrument the api client.
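For illustration, a caller could wire this hook into a metric such as the requestDuration histogram sketched under Summary. This is an assumption about usage rather than the PR's exact wiring (imports of strconv, time, and net/http assumed), and note that the signature later grows an err parameter, as the snippet further down shows.

    client.ObserveFunc = func(start time.Time, req *http.Request, resp *http.Response) {
        status := ""
        if resp != nil {
            status = strconv.Itoa(resp.StatusCode)
        }
        requestDuration.WithLabelValues(req.Method, req.URL.Path, status).
            Observe(time.Since(start).Seconds())
    }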

@mxyng force-pushed the mxyng/connector-metrics branch 2 times, most recently from e000c63 to e914921, on September 15, 2022 02:58
internal/connector/connector.go (outdated):
    ObserveFunc: func(start time.Time, request *http.Request, response *http.Response, err error) {
        errorLabel := ""
        if err != nil {
            errorLabel = err.Error()
Contributor:

Won't this be too high cardinality, since these are effectively infinite error strings?

I was thinking of just a true or false value for the error label.

Collaborator (Author):

Technically yes. My concern with setting just true or false is that it provides no information as to why the request failed. With the status code, there's at least some context. Setting a true or false error label is almost like setting status to -1 on error instead of a meaningful HTTP code.

Contributor:

That's true. We could attempt to detect the error and reduce it to a few well-known constant values, but I'm not sure it's worth the effort right now. We can always add that later as a separate errorClass or some other label.

Generally I would not expect metrics to tell me about the error. If I see the number of errors increasing I would go to logs to figure out more details about the errors.

Collaborator (Author):

That's true. It's nice when metrics tell you what the problem is, but it's not necessary. In this case, I'd prefer setting status to -1, since status and error=true are (probably) mutually exclusive; adding a separate label seems unnecessary.
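A minimal sketch of what that resolution could look like, reusing the illustrative requestDuration histogram from the Summary section: transport errors map to a sentinel status of -1 instead of a free-form error label (the merged change may differ in detail).

    ObserveFunc: func(start time.Time, request *http.Request, response *http.Response, err error) {
        // A transport error means there is no response; record a sentinel
        // status of -1 rather than a high-cardinality error string.
        status := "-1"
        if err == nil && response != nil {
            status = strconv.Itoa(response.StatusCode)
        }
        requestDuration.WithLabelValues(request.Method, request.URL.Path, status).
            Observe(time.Since(start).Seconds())
    },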

@mxyng mxyng merged commit 03afc71 into main Sep 19, 2022
@mxyng mxyng deleted the mxyng/connector-metrics branch September 19, 2022 21:45