Occasional double counting of job completions in the job_controller_job_finished_total metric
#112873
Labels
kind/bug
sig/apps
triage/accepted
What happened?
Occasionally, job completions are counted twice (or possibly more than twice) in the job_controller_job_finished_total metric. This happens for about 20% of jobs, though the rate may depend on the specifics of the Job, and it results in an overestimation of the number of finished jobs.
What did you expect to happen?
The job_controller_job_finished_total metric reflects the actual number of completed jobs.
How can we reproduce it (as minimally and precisely as possible)?
1. Set up a kind cluster with a Prometheus monitoring stack (for example, follow this: https://medium.com/@charled.breteche/kind-fix-missing-prometheus-operator-targets-1a1ff5d8c8ad).
2. Check the number of completed jobs by inspecting the job_controller_job_finished_total metric.
3. Create and delete a job multiple times in a loop.
Example job:
Example repro script:
4. Check the number of completed jobs by inspecting the job_controller_job_finished_total metric and the job
Anything else we need to know?
Yes, in the kube-controller-manager logs we find the following log lines, which are related and probably indicate repeated updates that result in double counting.
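To put a number on the overcount seen in the reproduction steps, the metric can be scraped before and after the loop and its increase compared with the number of jobs actually run. The following is only a minimal sketch, assuming the metric is available in the Prometheus text exposition format; the helper names (sum_metric, overcount) and the sample label values are hypothetical and are not taken from the original report or from any Kubernetes tooling.

```python
import re

METRIC = "job_controller_job_finished_total"

# Matches one sample line: metric name, optional {labels}, then the value.
# Simplification: assumes label values do not contain a '}' character.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def sum_metric(exposition_text, metric=METRIC):
    """Sum every sample of `metric` across all of its label sets."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match and match.group(1) == metric:
            total += float(match.group(3))
    return total


def overcount(before_text, after_text, jobs_run):
    """Return how many more completions the metric recorded than jobs run."""
    delta = sum_metric(after_text) - sum_metric(before_text)
    return delta - jobs_run


if __name__ == "__main__":
    # Hypothetical scrapes taken before and after running 10 jobs.
    before = METRIC + '{completion_mode="NonIndexed",result="succeeded"} 10\n'
    after = METRIC + '{completion_mode="NonIndexed",result="succeeded"} 22\n'
    print(overcount(before, after, jobs_run=10))
```

A result of zero would mean the metric matches the number of jobs run; a positive result is the number of extra completions recorded.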
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)