Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjust bundled Prometheus to scrape for only essential metrics #1805

Merged
merged 1 commit into from
Nov 21, 2022

Conversation

thomasvn
Copy link
Member

@thomasvn thomasvn commented Nov 15, 2022

What does this PR change?

The scrape_config titled kubernetes-service-endpoints currently attempts to scrape all service endpoints available in the cluster which include the prometheus.io/scrape: true annotation

Because Kubecost only needs metrics from kubecost-kube-state-metrics, kubecost-prometheus-node-exporter, and kubecost-network-costs ... Prometheus is scraping more than it needs to. This results in duplicate metrics, and errors when attempting to scrape a service endpoint which it doesn't have permissions to scrape.

This PR adds a filter in the kubernetes-service-endpoints scrape_config so that it is only scraping for the metrics required by Kubecost.

prometheus

Does this PR rely on any other PRs?

No

How does this PR impact users? (This is the kind of thing that goes in release notes!)

If users were querying any metrics from Kubecost’s bundled prometheus that are not listed here: https://github.com/kubecost/docs/blob/main/user-metrics.md, they will be affected.

Links to Issues or ZD tickets this PR addresses or fixes

How was this PR tested?

Using the following values.yaml, I verified on the Prometheus server that all metrics endpoints required by Kubecost were still being scraped, while all excess metrics endpoints (e.g. kube-dns) were no longer scraped.

kubectl port-forward svc/kubecost-prometheus-server 8080:80

# values.yaml
kube-state-metrics:
  enabled: true
nodeExporter:
  enabled: true
networkCosts:
  enabled: true

Have you made an update to documentation?

No

@jessegoodier
Copy link
Collaborator

Love this. testing now.

@AjayTripathy
Copy link
Contributor

This is all we need for now, but as a note we need to add the DCGM exporter when we reintroduce GPU usage metrics: https://docs.google.com/document/d/1fsbV55wTbpfWy4m9_FVETqHgtTMahfUa2w5MzV9gngc/edit

@AjayTripathy
Copy link
Contributor

Approved, @jessegoodier @thomasvn you all feel good to merge this?

@thomasvn
Copy link
Member Author

Good to merge. After further review, here are some future todo items:

  • Test the Nvidia GPU metric DCGM_FI_DEV_GPU_UTIL once the feature is reintroduced. Reintroduce gpuRequestAverage and gpuUsageAverage to the Allocation API Schema #1787
  • Consider removing kubernetes-service-endpoints-slow scrape_config because this only applies to the annotation prometheus.io/scrape-slow, which is not used by Kubecost targets. Slower scrape intervals can be configured in Kubecost’s .Values.prometheus.server.global.scrape_interval.
  • Consider removing the kubernetes-nodes scrape_config. I believe its metrics are not currently used by Kubecost (I may be wrong).

@thomasvn thomasvn marked this pull request as ready for review November 19, 2022 02:13
@Adam-Stack-PM
Copy link

So cool @thomasvn. Thanks for pushing this forward.

@jessegoodier
Copy link
Collaborator

Approved, @jessegoodier @thomasvn you all feel good to merge this?

yes, working in all of my tests. much cleaner. nice work @thomasvn

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants