
--wait-for-jobs does not fails if job failed #9285

Closed

antonio-antuan opened this issue Jan 27, 2021 · 18 comments · Fixed by #9313 or #9950 · May be fixed by #11289
Comments

@antonio-antuan

Output of helm version:
version.BuildInfo{Version:"v3.5.0", GitCommit:"32c22239423b3b4ba6706d450bd044baffdcf9e6", GitTreeState:"clean", GoVersion:"go1.15.6"}

Output of kubectl version:
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.4", GitCommit:"d360454c9bcd1634cf4cc52d1867af5491dc9c5f", GitTreeState:"archive", BuildDate:"2020-11-25T13:19:56Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:04:18Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

Cloud Provider/Platform (AKS, GKE, Minikube etc.):

I have this template:

apiVersion: batch/v1
kind: Job
metadata:
  name: backend-tests-job
  labels:
    app-name: tests
    app-service: backend-tests
spec:
  backoffLimit: 0
  template:
    metadata:
      name: backend-tests
    spec:
      restartPolicy: Never
      imagePullSecrets:
        - name: {{ .Values.imagePullSecrets }}

      containers:
      - name: tests
        image: {{.Values.image.repository}}/{{ .Values.appName }}:{{ .Values.appVersion }}
        imagePullPolicy: Always
        command: ["/usr/bin/make"]
        args: ["test", "PYTEST_ARGS=${ARGS}", "VERSION=${VERSION}"]
        env:
          - name: ARGS
            value: {{ .Values.pytestArgs | quote }}
          - name: VERSION
            value: {{ .Values.version | quote }}
        resources:
            {{- toYaml .Values.resources | nindent 12 }}

I install it with this command:
helm install CHART --wait-for-jobs --wait --timeout 10m

Sometimes the job fails (because of tests, yep :) ), does not restart (because of backoffLimit: 0), and then helm just... waits.
I think helm should not wait if the job has failed and will not restart.

@yxxhero
Member

yxxhero commented Jan 28, 2021

I see. With --timeout 10m, what happens in helm when the time is up? @AcLr

@antonio-antuan
Author

Yes, the install fails. But I expected the failure to happen when the job failed. Actually backend-tests-job fails within several seconds; it does not restart because of backoffLimit: 0, nothing else happens, and helm just waits out the rest of the timeout without any progress. I think helm install could fail much earlier.

@yxxhero
Member

yxxhero commented Jan 28, 2021

I understand what you mean now! @AcLr

@antonio-antuan
Author

Ok, here is the output. The job has already failed and helm still waits:

apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    meta.helm.sh/release-name: tests
    meta.helm.sh/release-namespace: tests
  creationTimestamp: "2021-01-28T15:49:21Z"
  labels:
    app-name: tests
    app-service: backend-tests
    app.kubernetes.io/managed-by: Helm
  managedFields:
  - apiVersion: batch/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:meta.helm.sh/release-name: {}
          f:meta.helm.sh/release-namespace: {}
        f:labels:
          .: {}
          f:app-name: {}
          f:app-service: {}
          f:app.kubernetes.io/managed-by: {}
      f:spec:
        f:backoffLimit: {}
        f:completions: {}
        f:parallelism: {}
        f:template:
          f:metadata:
            f:labels:
              .: {}
              f:app-name: {}
              f:app-service: {}
            f:name: {}
          f:spec:
            f:containers:
              k:{"name":"tests"}:
                .: {}
                f:args: {}
                f:command: {}
                f:env:
                  .: {}
                  k:{"name":"ARGS"}:
                    .: {}
                    f:name: {}
                    f:value: {}
                  k:{"name":"VERSION"}:
                    .: {}
                    f:name: {}
                    f:value: {}
                f:image: {}
                f:imagePullPolicy: {}
                f:name: {}
                f:resources:
                  .: {}
                  f:limits:
                    .: {}
                    f:memory: {}
                  f:requests:
                    .: {}
                    f:cpu: {}
                    f:memory: {}
                f:terminationMessagePath: {}
                f:terminationMessagePolicy: {}
            f:dnsPolicy: {}
            f:imagePullSecrets:
              .: {}
              k:{"name":"[HIDDEN]"}:
                .: {}
                f:name: {}
            f:restartPolicy: {}
            f:schedulerName: {}
            f:securityContext: {}
            f:terminationGracePeriodSeconds: {}
    manager: Go-http-client
    operation: Update
    time: "2021-01-28T15:49:21Z"
  - apiVersion: batch/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
          .: {}
          k:{"type":"Failed"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
        f:failed: {}
        f:startTime: {}
    manager: kube-controller-manager
    operation: Update
    time: "2021-01-28T15:49:28Z"
  name: backend-tests-job
  namespace: tests
  resourceVersion: "24637583"
  selfLink: /apis/batch/v1/namespaces/tests/jobs/backend-tests-job
  uid: d1e216fb-fba1-479e-8c5e-20db8d7b8ed1
spec:
  backoffLimit: 0
  completions: 1
  parallelism: 1
  selector:
    matchLabels:
      controller-uid: d1e216fb-fba1-479e-8c5e-20db8d7b8ed1
  template:
    metadata:
      creationTimestamp: null
      labels:
        app-name: tests
        app-service: backend-tests
        controller-uid: d1e216fb-fba1-479e-8c5e-20db8d7b8ed1
        job-name: backend-tests-job
      name: backend-tests
    spec:
      containers:
      - args:
        - test
        - PYTEST_ARGS=${ARGS}
        - VERSION=${VERSION}
        command:
        - /usr/bin/make
        env:
        - name: ARGS
          value: qwewqewew
        - name: VERSION
          value: test-version
        image: [HIDDEN]
        imagePullPolicy: Always
        name: tests
        resources:
          limits:
            memory: 4Gi
          requests:
            cpu: "1"
            memory: 512Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: [HIDDEN]
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  conditions:
  - lastProbeTime: "2021-01-28T15:49:28Z"
    lastTransitionTime: "2021-01-28T15:49:28Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
  failed: 1
  startTime: "2021-01-28T15:49:21Z"
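The status block in the dump above already carries the terminal signal helm could act on: a condition of type Failed with status "True" and reason BackoffLimitExceeded. A minimal sketch of reading that condition, using simplified stand-in types for illustration rather than the real batchv1 structs:

```go
package main

import "fmt"

// jobCondition is a simplified stand-in for batchv1.JobCondition,
// assumed here for illustration only.
type jobCondition struct {
	Type   string // e.g. "Failed" or "Complete"
	Status string // "True", "False", or "Unknown"
	Reason string // e.g. "BackoffLimitExceeded"
}

// jobFailed reports whether the condition list contains a true
// "Failed" condition, like the one in the dump above, and returns
// its reason if so.
func jobFailed(conds []jobCondition) (bool, string) {
	for _, c := range conds {
		if c.Type == "Failed" && c.Status == "True" {
			return true, c.Reason
		}
	}
	return false, ""
}

func main() {
	// Mirrors the condition from the dumped Job status.
	conds := []jobCondition{
		{Type: "Failed", Status: "True", Reason: "BackoffLimitExceeded"},
	}
	failed, reason := jobFailed(conds)
	fmt.Println(failed, reason) // prints "true BackoffLimitExceeded"
}
```

A terminal condition like this is distinct from the failed-pod counter: once it is set, no amount of further polling will change the outcome.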

@yxxhero
Member

yxxhero commented Jan 29, 2021

According to the program's logic, helm's behavior is correct. @AcLr

@antonio-antuan
Author

That's great! Don't you want to change the behavior? :)

And could you please not remove your messages? Currently my messages look a bit weird :)

@yxxhero
Member

yxxhero commented Jan 29, 2021

My English is poor!

> That's great! Don't you want to change the behavior? :)
>
> And could you please not remove your messages? Currently my messages look a bit weird :)

@bacongobbler

@yxxhero
Member

yxxhero commented Feb 10, 2021

I guess #9313 is what you need. @AcLr

@antonio-antuan
Author

I checked #9313 and my problem is not fixed: helm still waits when the job fails. And yes, I double-checked backoffLimit: 0.

@selfuryon

selfuryon commented Feb 23, 2021

I also checked it and I still have the same problem with helm waiting on a failed job :(
I just added a debug print:

func (w *waiter) jobReady(job *batchv1.Job) bool {
	fmt.Println("job.Status.Failed:", job.Status.Failed, "   ,*job.Spec.BackoffLimit:", *job.Spec.BackoffLimit)
	if job.Status.Failed > *job.Spec.BackoffLimit {
		w.log("Job is failed: %s/%s", job.GetNamespace(), job.GetName())
		return false
	}
	if job.Status.Succeeded < *job.Spec.Completions {
		w.log("Job is not completed: %s/%s", job.GetNamespace(), job.GetName())
		return false
	}
	return true
}

And I still see it waiting until the timeout:

helm/bin/helm install gg-tests .cicd/deploy/k8s/tests --wait --wait-for-jobs --timeout 10m
job.Status.Failed: 0    ,*job.Spec.BackoffLimit: 0
job.Status.Failed: 0    ,*job.Spec.BackoffLimit: 0
job.Status.Failed: 0    ,*job.Spec.BackoffLimit: 0
job.Status.Failed: 0    ,*job.Spec.BackoffLimit: 0
job.Status.Failed: 0    ,*job.Spec.BackoffLimit: 0
job.Status.Failed: 0    ,*job.Spec.BackoffLimit: 0
job.Status.Failed: 0    ,*job.Spec.BackoffLimit: 0
job.Status.Failed: 0    ,*job.Spec.BackoffLimit: 0
job.Status.Failed: 1    ,*job.Spec.BackoffLimit: 0
job.Status.Failed: 1    ,*job.Spec.BackoffLimit: 0
job.Status.Failed: 1    ,*job.Spec.BackoffLimit: 0
job.Status.Failed: 1    ,*job.Spec.BackoffLimit: 0
job.Status.Failed: 1    ,*job.Spec.BackoffLimit: 0
job.Status.Failed: 1    ,*job.Spec.BackoffLimit: 0
job.Status.Failed: 1    ,*job.Spec.BackoffLimit: 0
job.Status.Failed: 1    ,*job.Spec.BackoffLimit: 0
job.Status.Failed: 1    ,*job.Spec.BackoffLimit: 0
job.Status.Failed: 1    ,*job.Spec.BackoffLimit: 0
job.Status.Failed: 1    ,*job.Spec.BackoffLimit: 0
job.Status.Failed: 1    ,*job.Spec.BackoffLimit: 0
job.Status.Failed: 1    ,*job.Spec.BackoffLimit: 0
job.Status.Failed: 1    ,*job.Spec.BackoffLimit: 0
job.Status.Failed: 1    ,*job.Spec.BackoffLimit: 0
...
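The log above shows the readiness check returning false on every poll even after Failed exceeds the backoff limit: in the waiter, a false return simply means "not ready yet, try again", so a permanently failed job is re-checked until the timeout expires. A minimal sketch of that polling contract, with simplified stand-in types and a poll counter instead of Helm's actual waiter and clock:

```go
package main

import "fmt"

// jobStatus is a simplified stand-in for the batchv1.Job fields the
// check reads; it is an assumption for illustration, not Helm's type.
type jobStatus struct {
	Failed       int32
	Succeeded    int32
	BackoffLimit int32
	Completions  int32
}

// isReady mirrors the boolean contract of the jobReady snippet above:
// it returns false both for "still running" and for "permanently
// failed", so the caller cannot tell the two cases apart.
func isReady(s jobStatus) bool {
	if s.Failed > s.BackoffLimit {
		return false // failed for good, but the caller only sees "not ready"
	}
	if s.Succeeded < s.Completions {
		return false // genuinely still in progress
	}
	return true
}

// waitForJob models the poll loop: false means "try again", so a
// failed job is re-checked until maxPolls (the --timeout analogue)
// is exhausted.
func waitForJob(s jobStatus, maxPolls int) (int, bool) {
	for polls := 1; polls <= maxPolls; polls++ {
		if isReady(s) {
			return polls, true
		}
	}
	return maxPolls, false
}

func main() {
	failed := jobStatus{Failed: 1, BackoffLimit: 0, Completions: 1}
	polls, ready := waitForJob(failed, 5)
	fmt.Println(polls, ready) // burns every poll: prints "5 false"
}
```

With a plain bool there is no channel for "stop waiting, this cannot recover", which matches the repeated Failed: 1 lines in the log.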

@bacongobbler bacongobbler linked a pull request Feb 23, 2021 that will close this issue
@antonio-antuan
Author

As mentioned above, the problem is not fixed by #9313.

@jdolitsky jdolitsky reopened this Mar 17, 2021
@dindurthy

dindurthy commented Jun 10, 2021

This is a feature request. Right now, in jobReady a job failure is treated the same way as a job that is not yet successful. What would be great is if a failure short-circuited the polling. A failed job dooms the upgrade, so there is no point in waiting.
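One way to implement that short-circuit is to widen the readiness check's return type so a terminal failure surfaces as an error the wait loop can abort on. This is a sketch under stated assumptions (the same simplified stand-in types as above, not the real *batchv1.Job), not the exact change from #9950:

```go
package main

import (
	"errors"
	"fmt"
)

// jobStatus is a simplified stand-in for the Job fields the check
// needs; an assumption for illustration, not Helm's real type.
type jobStatus struct {
	Name         string
	Failed       int32
	Succeeded    int32
	BackoffLimit int32
	Completions  int32
}

// errJobFailed marks a terminal failure no further polling can fix.
var errJobFailed = errors.New("job failed")

// jobReady returns (ready, err): a non-nil err tells the wait loop
// to abort immediately instead of retrying until the timeout.
func jobReady(s jobStatus) (bool, error) {
	if s.Failed > s.BackoffLimit {
		return false, fmt.Errorf("job %s: %w (backoff limit reached)", s.Name, errJobFailed)
	}
	if s.Succeeded < s.Completions {
		return false, nil // still running: keep polling
	}
	return true, nil
}

func main() {
	s := jobStatus{Name: "backend-tests-job", Failed: 1, BackoffLimit: 0, Completions: 1}
	if _, err := jobReady(s); err != nil {
		fmt.Println("abort wait:", err)
	}
}
```

The (bool, error) split keeps the three states distinct: ready, not ready yet, and failed beyond recovery, which is exactly the distinction the plain bool cannot express.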

jeffrosenberg pushed a commit to jeffrosenberg/helm that referenced this issue Jul 16, 2021
Closes helm#9285

Signed-off-by: Rosenberg, Jeff <jeff.rosenberg@icfnext.com>
@jeffrosenberg
Contributor

I've pushed a PR that I hope fixes this: #9950. It's possible that it's not that simple; I'm open to feedback 🙂

@jeffrosenberg
Contributor

@jdolitsky @bacongobbler I'm new to contributing to this project. I have a PR open but am not sure what I need to do to get a review on it; is it okay to just request your review?

@github-actions

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

@egorchepiga

egorchepiga commented Dec 23, 2021

@jdolitsky @bacongobbler Could you please approve #9950?
Could you also reopen this issue? As mentioned above, the problem is not fixed by #9313.

@wirwolf

wirwolf commented Sep 29, 2022

I have the same problem. Waiting for a fix.

@jouve
Contributor

jouve commented Oct 29, 2022

#9950 fixes this issue for me too :)

wahabmk pushed a commit to wahabmk/helm that referenced this issue Nov 6, 2023
Closes helm#9285

Signed-off-by: Rosenberg, Jeff <jeff.rosenberg@icfnext.com>
Signed-off-by: Wahab Ali <wahabalimk@gmail.com>