Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emit metrics for requests to the Kubernetes control plane #118

Merged
merged 3 commits into from
Jan 17, 2024

Conversation

charleskorn
Copy link
Contributor

This PR adds support for emitting latency metrics for requests to the Kubernetes control plane.


resp, err := k.next.RoundTrip(req)
duration := time.Since(start)
instrument.ObserveWithExemplar(req.Context(), k.hist.WithLabelValues(req.URL.EscapedPath(), req.Method, strconv.Itoa(resp.StatusCode)), duration.Seconds())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love using the URL path as a label, I'm concerned about a cardinality explosion here. I'd much rather have something like the object type, so we have metrics for operations like "patch any StatefulSet" and "list all Pods" - but there doesn't seem to be a reliable way to get that information at this level.

One mitigating factor is that there aren't too many unique URLs that a rollout-operator instance will call. In my testing, this results in a set of buckets for each of:

  • GET pods in namespace
  • GET certificate
  • GET validating webhook configurations
  • GET mutating webhook configurations
  • GET statefulsets in namespace
  • PATCH statefulset (one set per statefulset)

One option would be to add some custom instrumentation around each client API call, but this seems easy to forget. Another option that would reduce the impact would be to use native histograms for this.

Interested in opinions and other ideas from reviewers.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have a pre-defined subset of URLs/paths that we want to track (such as the ones listed), and "other" for ones we don't care about or expect. We could also take a subsection of the path to group similar ones together.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the hesitation to include URLs as a label but like you mentioned, it's a small and known number of URLs and we only run a single rollout-operator per namespace. I'm not too worried about the cardinality but Joshua's suggestion seems pretty reasonable if you are.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can go with what's here for now and revisit it if it turns out to be worse than expected.

@charleskorn charleskorn marked this pull request as ready for review January 15, 2024 04:58
@charleskorn charleskorn requested a review from a team as a code owner January 15, 2024 04:58

resp, err := k.next.RoundTrip(req)
duration := time.Since(start)
instrument.ObserveWithExemplar(req.Context(), k.hist.WithLabelValues(req.URL.EscapedPath(), req.Method, strconv.Itoa(resp.StatusCode)), duration.Seconds())

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have a pre-defined subset of URLs/paths that we want to track (such as the ones listed), and "other" for ones we don't care about or expect. We could also take a subsection of the path to group similar ones together.

Copy link
Contributor

@56quarters 56quarters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM


resp, err := k.next.RoundTrip(req)
duration := time.Since(start)
instrument.ObserveWithExemplar(req.Context(), k.hist.WithLabelValues(req.URL.EscapedPath(), req.Method, strconv.Itoa(resp.StatusCode)), duration.Seconds())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the hesitation to include URLs as a label but like you mentioned, it's a small and known number of URLs and we only run a single rollout-operator per namespace. I'm not too worried about the cardinality but Joshua's suggestion seems pretty reasonable if you are.

"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
"k8s.io/client-go/rest"
"net/http"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you reorg this imports how we do in Mimir? stdlib, 3rd party, local packages

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened the issue #120 since this has shown up in other PRs as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in b5eae5a

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...and I've opened #121 to fix #120.

@charleskorn charleskorn merged commit 8db9791 into main Jan 17, 2024
6 checks passed
@charleskorn charleskorn deleted the charleskorn/kubernetes-client-metrics branch January 17, 2024 04:55
charleskorn added a commit that referenced this pull request Jan 17, 2024
charleskorn added a commit that referenced this pull request Jan 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants