
Scraping Metrics from a CronJob #1338

Closed
yutkin opened this issue Jan 9, 2024 · 6 comments
Labels
kind/feature (Categorizes issue or PR as related to a new feature.) · lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.)

Comments

yutkin commented Jan 9, 2024

Is your feature request related to a problem? Please describe.
We were running the descheduler as a Deployment but decided to switch to a CronJob because we want to run it only during nighttime. However, the CronJob pod finishes within a few seconds, which is not enough time for it to be scraped by Prometheus.

Describe the solution you'd like
I don't have a solution, but perhaps a CLI flag could be introduced that configures how long to keep the descheduler up and running. It would allow keeping the pod alive, for example for 15s, which is enough for it to be scraped by Prometheus. But I am open to other suggestions.

Describe alternatives you've considered
Running the descheduler as a Deployment; however, that only allows specifying a period, not the exact time at which to run.

What version of the descheduler are you using?
descheduler version: 0.29

yutkin added the kind/feature label on Jan 9, 2024
damemi (Contributor) commented Jan 16, 2024

This is a problem that unfortunately affects metrics collection for any short-lived workload (I was working on the same issue for serverless recently). So it's not just a descheduler issue, and I don't think a deadline setting like you proposed is technically the right solution. I don't know if the Prometheus community has come to a broader solution for this type of problem.

Ultimately, short-lived workloads benefit from exporting their metrics to a listening server, rather than following the Prometheus model of waiting to be scraped by one. This is how OpenTelemetry metrics work: when a workload shuts down, all metrics still in memory are flushed to the collection endpoint.
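As an illustration only (not descheduler code), a minimal Go sketch of that push-and-flush pattern with the OpenTelemetry SDK could look like the following; the collector endpoint, meter name, and counter are hypothetical:

```go
// Illustrative sketch: a short-lived job that records a counter and flushes it
// over OTLP/gRPC on shutdown, assuming a collector is reachable at the endpoint below.
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	ctx := context.Background()

	// Exporter pushes metrics to an OTLP endpoint (e.g. an OpenTelemetry Collector).
	exp, err := otlpmetricgrpc.New(ctx,
		otlpmetricgrpc.WithEndpoint("otel-collector:4317"), // hypothetical endpoint
		otlpmetricgrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exp)),
	)
	// Shutdown flushes any metrics still in memory, which is what makes this
	// pattern work for a job that exits after a few seconds.
	defer func() {
		if err := provider.Shutdown(ctx); err != nil {
			log.Printf("metrics flush failed: %v", err)
		}
	}()

	meter := provider.Meter("descheduler-job-example")
	evictions, err := meter.Int64Counter("evictions_total") // hypothetical metric name
	if err != nil {
		log.Fatal(err)
	}
	evictions.Add(ctx, 1) // record work done during the run
}
```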

So to really address this, I think we should consider updating our metrics implementation to use OpenTelemetry. We already use OTel for traces, so there is some benefit to using it for metrics as well. The good news is that we could do this without breaking existing Prometheus users.

@yutkin Unfortunately this still doesn't fix your problem, because you're using a Prometheus server to scrape the endpoint. But if we implement OTel metrics, you could run an OpenTelemetry Collector with an OTLP receiver and a Prometheus exporter, then point your Prometheus agent at that endpoint.
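For reference, a minimal Collector configuration along those lines could look like the sketch below; the listen ports are the common defaults and would need to match your setup:

```yaml
# Sketch of a Collector pipeline: receive OTLP metrics from the job,
# expose them on a metrics endpoint for Prometheus to scrape.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889   # Prometheus scrapes this target

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```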

seanmalloy (Member) commented:

Here is another option:

Prometheus has a Pushgateway for handling this: https://github.com/prometheus/pushgateway. I'm not super familiar with the Pushgateway, but I believe the descheduler code would need to be updated to add an option to push metrics when running as a Job or CronJob.
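As a rough sketch of what that option could look like (hypothetical gateway URL, job name, and metric, not actual descheduler code), the client_golang push package can push registered collectors at the end of a run:

```go
// Illustrative sketch of pushing metrics to a Prometheus Pushgateway at the
// end of a Job/CronJob run.
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	evictions := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "descheduler_evictions_total", // hypothetical metric name
		Help: "Pods evicted during this run.",
	})
	evictions.Add(3) // record work done during the run

	// Push replaces all metrics for this job/grouping on the gateway.
	pusher := push.New("http://pushgateway:9091", "descheduler") // hypothetical URL and job name
	if err := pusher.Collector(evictions).
		Grouping("instance", "nightly-cronjob").
		Push(); err != nil {
		log.Fatalf("could not push to Pushgateway: %v", err)
	}
}
```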

k8s-triage-robot commented:

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Apr 15, 2024
k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on May 15, 2024
k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot (Contributor) commented:

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot closed this as not planned on Jun 14, 2024