Occasional double counting of job completions in the job_controller_job_finished_total metric #112873

Closed
mimowo opened this issue Oct 5, 2022 · 4 comments · Fixed by #112948
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/apps Categorizes an issue or PR as relevant to SIG Apps. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@mimowo
Contributor

mimowo commented Oct 5, 2022

What happened?

Occasionally, job completions are counted twice (or possibly more than twice) in the job_controller_job_finished_total metric. This happens for about 20% of jobs, although the rate may depend on the Job spec. As a result, the metric overestimates the number of finished jobs.

What did you expect to happen?

The job_controller_job_finished_total metric should reflect the actual number of completed jobs.

How can we reproduce it (as minimally and precisely as possible)?

  1. Set up a kind cluster with the Prometheus monitoring stack (for example, by following https://medium.com/@charled.breteche/kind-fix-missing-prometheus-operator-targets-1a1ff5d8c8ad).

  2. Check the number of completed jobs by inspecting the job_controller_job_finished_total metric.

  3. Create and delete a job multiple times in a loop.

Example job:

apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-job
  labels:
    jobgroup: indexedjob
spec:
  completions: 5
  parallelism: 3
  completionMode: Indexed
  template:
    metadata:
      labels:
        jobgroup: indexedjob
    spec:
      restartPolicy: Never
      initContainers:
      - name: input
        image: 'docker.io/library/bash'
        command:
        - "bash"
        - "-c"
        - |
          items=(foo bar baz qux xyz)
          echo ${items[$JOB_COMPLETION_INDEX]} > /input/data.txt
        volumeMounts:
        - mountPath: /input
          name: input
      containers:
      - name: 'worker'
        image: 'docker.io/library/busybox'
        command:
        - "rev"
        - "/input/data.txt"
        volumeMounts:
        - mountPath: /input
          name: input
      volumes:
      - name: input
        emptyDir: {}
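
For readers less familiar with Indexed Jobs, here is a minimal Python sketch of what the example Job computes for each completion index. The function name `worker_output` is hypothetical and only mirrors the two containers above; it does not touch a cluster.

```python
# Same list as in the init container's bash script.
ITEMS = ["foo", "bar", "baz", "qux", "xyz"]


def worker_output(completion_index):
    """Mimic one pod of the example Job for a given JOB_COMPLETION_INDEX."""
    # Init container: echo ${items[$JOB_COMPLETION_INDEX]} > /input/data.txt
    data = ITEMS[completion_index]
    # Worker container: rev /input/data.txt (reverses the line)
    return data[::-1]


# completions: 5 means indexes 0 through 4 each run once.
outputs = [worker_output(i) for i in range(5)]
print(outputs)
```

Each of the 5 completions is independent, so a single run of this Job should add exactly one to the finished-jobs count, regardless of how many pods it started.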

Example repro script:

import os
import time

# Scratch file that captures the output of `kubectl get jobs`.
TMP_KUBE_BUILD = "/tmp/cmd-0123"

def exec_cmd(cmd):
    """Echo a shell command, run it, and return its exit status."""
    print("----")
    print(cmd)
    exit_code = os.system(cmd)
    print("exit code: ", exit_code)
    print("----")
    return exit_code

def is_complete():
    """Return True if `kubectl get jobs` shows indexed-job with 5/5 completions."""
    exec_cmd("kubectl get jobs | tee %s" % TMP_KUBE_BUILD)
    with open(TMP_KUBE_BUILD) as file:
        for line in file:
            if "indexed-job" in line and "5/5" in line:
                return True
    return False

def await_complete():
    """Poll every 5 seconds until the job reports completion."""
    is_complete_tmp = False
    while not is_complete_tmp:
        is_complete_tmp = is_complete()
        time.sleep(5.0)

if __name__ == "__main__":
    # Create the job, wait for it to finish, and delete it, 10 times in a row.
    for i in range(1, 11):
        print("=============================== START: " + str(i))
        if exec_cmd("kubectl create -f job-indexed.yaml") != 0:
            continue
        await_complete()
        if exec_cmd("kubectl delete -f job-indexed.yaml") != 0:
            continue
        print("=============================== END: " + str(i))

  4. Check the number of completed jobs again by inspecting the job_controller_job_finished_total metric and compare the increase with the number of jobs the script ran.
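
The before/after comparison in the steps above can be sketched roughly as follows. This assumes you have saved the controller's metrics in Prometheus text exposition format before and after running the loop; the helper names `sum_metric` and `overcount`, the label names, and the sample values are made up for illustration and are not output from a real cluster.

```python
import re

METRIC = "job_controller_job_finished_total"

# Matches a sample line: name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)")


def sum_metric(exposition_text, name=METRIC):
    """Sum all samples of `name` across every label combination."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match and match.group(1) == name:
            total += float(match.group(3))
    return total


def overcount(before_text, after_text, jobs_run):
    """Return how many more completions the metric recorded than jobs were run."""
    delta = sum_metric(after_text) - sum_metric(before_text)
    return delta - jobs_run


# Hypothetical snapshots (illustrative values only).
before = 'job_controller_job_finished_total{completion_mode="Indexed",result="succeeded"} 3\n'
after = (
    "# TYPE job_controller_job_finished_total counter\n"
    'job_controller_job_finished_total{completion_mode="Indexed",result="succeeded"} 15\n'
)
print(overcount(before, after, jobs_run=10))
```

A result of 0 means the metric matches the number of jobs run; a positive result is the number of extra completions that were counted.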

Anything else we need to know?

Yes. The kube-controller-manager logs contain the following lines, which appear related and probably indicate repeated status updates that result in double counting.

E1005 07:37:12.511132       1 job_controller.go:545] syncing job: tracking status: removing uncounted pods from status: Operation cannot be fulfilled on jobs.batch "indexed-job": the object has been modified; please apply your changes to the latest version and try again
E1005 07:38:04.577649       1 job_controller.go:545] syncing job: tracking status: removing uncounted pods from status: Operation cannot be fulfilled on jobs.batch "indexed-job": the object has been modified; please apply your changes to the latest version and try again
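
To illustrate the hypothesis (and only the hypothesis), here is a toy Python model of how a retried sync could inflate a counter. It is not the job controller's code; `FakeAPIServer`, `sync_job`, and `run_until_synced` are hypothetical names, and whether this ordering is the actual root cause is not established in this report.

```python
class ConflictError(Exception):
    """Stands in for 'the object has been modified; please apply your changes...'."""


class FakeAPIServer:
    """Rejects the first `conflicts` status writes, then accepts one."""

    def __init__(self, conflicts):
        self.conflicts = conflicts
        self.finished = False

    def update_status(self):
        if self.conflicts > 0:
            self.conflicts -= 1
            raise ConflictError("the object has been modified")
        self.finished = True


def sync_job(api, metric, record_before_write):
    """One sync pass: record the completion and write the status, in either order."""
    if record_before_write:
        metric["finished"] += 1  # counted even if the write below fails
        api.update_status()
    else:
        api.update_status()
        metric["finished"] += 1  # counted only after a successful write


def run_until_synced(api, metric, record_before_write):
    """Retry the sync on conflict, as a controller work queue would."""
    while not api.finished:
        try:
            sync_job(api, metric, record_before_write)
        except ConflictError:
            pass  # requeue and sync again
    return metric["finished"]


print(run_until_synced(FakeAPIServer(conflicts=1), {"finished": 0}, True))
print(run_until_synced(FakeAPIServer(conflicts=1), {"finished": 0}, False))
```

In this model, one conflict makes the "record first" ordering count the same job twice, while the "write first" ordering counts it once. That matches the symptom described above, but it is a sketch under stated assumptions rather than a diagnosis.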

Kubernetes version

$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.0", GitCommit:"4ce5a8954017644c5420bae81d72b09b735c21f0", GitTreeState:"clean", BuildDate:"2022-05-03T13:46:05Z", GoVersion:"go1.18.1", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.0", GitCommit:"a866cbe2e5bbaa01cfd5e969aa3e033f3282a8a2", GitTreeState:"clean", BuildDate:"2022-09-01T23:30:43Z", GoVersion:"go1.19", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

Reproducible with kind


@mimowo mimowo added the kind/bug Categorizes issue or PR as related to a bug. label Oct 5, 2022
@mimowo
Contributor Author

mimowo commented Oct 5, 2022

/assign

@mimowo
Contributor Author

mimowo commented Oct 5, 2022

FYI @alculquicondor

@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 5, 2022
@mimowo
Contributor Author

mimowo commented Oct 5, 2022

/sig apps

@k8s-ci-robot k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Oct 5, 2022
@alculquicondor
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 5, 2022