
Scraping Metrics from a CronJob #1338

Closed
yutkin opened this issue Jan 9, 2024 · 6 comments
Labels
kind/feature (Categorizes issue or PR as related to a new feature.) · lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.)

Comments

yutkin commented Jan 9, 2024

Is your feature request related to a problem? Please describe.
We were running the descheduler as a Deployment but decided to switch to a CronJob because we want to run it only during nighttime. However, the CronJob pod finishes within a few seconds, which is not enough time for it to be scraped by Prometheus.

Describe the solution you'd like
I don't have a solution, but perhaps a CLI flag could be introduced that configures how long to keep the descheduler up and running. It would allow keeping the pod alive, for example for 15s, which is enough for it to be scraped by Prometheus. But I am open to other suggestions.

Describe alternatives you've considered
Running the descheduler as a Deployment; however, that only allows specifying a period, not the exact time at which to run.

What version of the descheduler are you using?
descheduler version: 0.29

yutkin added the kind/feature label on Jan 9, 2024
damemi (Contributor) commented Jan 16, 2024

This is a problem that unfortunately affects metrics collection for any short-lived workload (I was working on the same issue for serverless recently). So it's not just a descheduler issue, and I don't think a deadline setting like you proposed is technically the right solution. I don't know if the Prometheus community has come to a broader solution for this type of problem.

Ultimately, short-lived workloads benefit from exporting their metrics to a listening server, rather than following the Prometheus model of waiting to be scraped by one. This is how OpenTelemetry metrics work: when a workload shuts down, all metrics still in memory are flushed to the collection endpoint.
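As an illustration only (not descheduler code), a minimal Go sketch of that push-and-flush pattern with the OpenTelemetry SDK could look like the following; the collector endpoint, meter name, and counter are hypothetical:

```go
// Illustrative sketch: a short-lived job that records a counter and flushes it
// over OTLP/gRPC on shutdown, assuming a collector is reachable at the endpoint below.
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	ctx := context.Background()

	// Exporter pushes metrics to an OTLP endpoint (e.g. an OpenTelemetry Collector).
	exp, err := otlpmetricgrpc.New(ctx,
		otlpmetricgrpc.WithEndpoint("otel-collector:4317"), // hypothetical endpoint
		otlpmetricgrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exp)),
	)
	// Shutdown flushes any metrics still in memory, which is what makes this
	// pattern work for a job that exits after a few seconds.
	defer func() {
		if err := provider.Shutdown(ctx); err != nil {
			log.Printf("metrics flush failed: %v", err)
		}
	}()

	meter := provider.Meter("descheduler-job-example")
	evictions, err := meter.Int64Counter("evictions_total") // hypothetical metric name
	if err != nil {
		log.Fatal(err)
	}
	evictions.Add(ctx, 1) // record work done during the run
}
```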

So to really address this, I think we should consider updating our metrics implementation to use OpenTelemetry. We already use OTel for traces, so there is some benefit to using it for metrics as well. The good news is that we could do this without breaking existing Prometheus users.

@yutkin Unfortunately this still doesn't fix your problem, because you're using a Prometheus server to scrape the endpoint. But if we implement OTel metrics, you could run an OpenTelemetry Collector with an OTLP receiver and a Prometheus exporter, then point your Prometheus agent at that endpoint.
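For reference, a minimal Collector configuration along those lines could look like the sketch below; the listen ports are the common defaults and would need to match your setup:

```yaml
# Sketch of a Collector pipeline: receive OTLP metrics from the job,
# expose them on a metrics endpoint for Prometheus to scrape.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889   # Prometheus scrapes this target

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```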

seanmalloy (Member) commented:

Here is another option:

Prometheus has a Pushgateway for handling this: https://github.com/prometheus/pushgateway. I'm not super familiar with the Pushgateway, but I believe the descheduler code would need to be updated to add an option to push metrics when running as a Job or CronJob.
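As a rough sketch of what that option could look like (hypothetical gateway URL, job name, and metric, not actual descheduler code), the client_golang push package can push registered collectors at the end of a run:

```go
// Illustrative sketch of pushing metrics to a Prometheus Pushgateway at the
// end of a Job/CronJob run.
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	evictions := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "descheduler_evictions_total", // hypothetical metric name
		Help: "Pods evicted during this run.",
	})
	evictions.Add(3) // record work done during the run

	// Push replaces all metrics for this job/grouping on the gateway.
	pusher := push.New("http://pushgateway:9091", "descheduler") // hypothetical URL and job name
	if err := pusher.Collector(evictions).
		Grouping("instance", "nightly-cronjob").
		Push(); err != nil {
		log.Fatalf("could not push to Pushgateway: %v", err)
	}
}
```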

k8s-triage-robot commented:

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Apr 15, 2024
k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on May 15, 2024
k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot (Contributor) commented:

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot closed this as not planned on Jun 14, 2024