
Gaps in K8up metrics #976

Open
simu opened this issue May 29, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@simu

simu commented May 29, 2024

Description

Currently, K8up only emits time series for schedules for which it has seen at least one Job in the matching completion state. For example, k8up_jobs_successful_counter only has time series for schedules that have had at least one successful job since the last K8up restart.

For schedules with a relatively low frequency (e.g. once per day), this can leave significant gaps in the metric in Prometheus. These gaps confuse Prometheus functions such as rate(), which can otherwise compensate for counter resets caused by pod restarts.
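For illustration, a typical query over this counter might look like the following sketch (the namespace label name is an assumption, not taken from K8up's actual metric definitions):

```promql
# Rate of successful K8up jobs per namespace. If the
# k8up_jobs_successful_counter series is absent until the first
# successful job, this expression returns no data for that
# namespace instead of 0, and rate() cannot detect counter resets
# across the gap.
sum by (namespace) (rate(k8up_jobs_successful_counter[1d]))
```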

Additional Context

No response

Logs

No response

Expected Behavior

K8up initializes the counter metrics (k8up_jobs_failed_counter, k8up_jobs_successful_counter, and k8up_jobs_total) with value 0 for all job types and all namespaces in which a Schedule exists immediately after startup.
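A minimal sketch of the requested behavior, using hypothetical label names (namespace, jobType) and only one of the three counters for brevity. In the real operator this would presumably amount to touching every label combination on the client_golang CounterVec at startup (WithLabelValues(...).Add(0)), which creates the series at value 0; the map-based version below only illustrates the idea without the Prometheus dependency:

```go
package main

import (
	"fmt"
	"sort"
)

// initJobCounters pre-initializes one counter series per namespace and job
// type to 0, so the time series exist from startup rather than appearing
// only after the first matching job. Label names are hypothetical.
func initJobCounters(namespaces, jobTypes []string) map[string]float64 {
	counters := make(map[string]float64)
	for _, ns := range namespaces {
		for _, jt := range jobTypes {
			// With client_golang this would be:
			//   jobsSuccessful.WithLabelValues(ns, jt).Add(0)
			key := fmt.Sprintf(`k8up_jobs_successful_counter{namespace=%q,jobType=%q}`, ns, jt)
			counters[key] = 0
		}
	}
	return counters
}

func main() {
	// Simulate startup for one namespace containing a Schedule.
	metrics := initJobCounters([]string{"team-a"}, []string{"backup", "prune"})

	// Print in exposition-format style, sorted for stable output.
	keys := make([]string, 0, len(metrics))
	for k := range metrics {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s %g\n", k, metrics[k])
	}
}
```

With this in place, Prometheus scrapes a 0-valued sample for every schedule's namespace immediately after a K8up restart, so rate() sees the reset instead of a gap.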

Steps To Reproduce

  1. Create a Schedule
  2. Check K8up's /metrics endpoint and observe that there is no k8up_jobs_* time series for the namespace of the new Schedule until a first job runs.

Version of K8up

v2.7.2

Version of Kubernetes

v1.27.13

Distribution of Kubernetes

OpenShift 4

@simu simu added the bug Something isn't working label May 29, 2024
@mhutter
Contributor

mhutter commented May 29, 2024

Just to confirm my understanding: the problem would be fixed if K8up emitted those metrics, with labels for all existing schedules, at a value of 0?

@simu
Author

simu commented May 30, 2024

Just to confirm my understanding: the problem would be fixed if K8up emitted those metrics, with labels for all existing schedules, at a value of 0?

Yes, if I understand Prometheus's behavior correctly, emitting metrics with labels for all existing schedules and a value of 0 until the first job is observed would let Prometheus correctly identify the counter resets.
