Cronjobs - failedJobsHistoryLimit not reaping state Error #53331

Open
civik opened this issue Oct 2, 2017 · 50 comments
Labels
area/workload-api/cronjob, area/workload-api/job, kind/bug, lifecycle/frozen, sig/apps

Comments

@civik

civik commented Oct 2, 2017

/kind bug
/sig apps

Cronjob limits were defined in #52390; however, it doesn't appear that failedJobsHistoryLimit reaps cronjob pods that end up in a state of Error.

 kubectl get pods --show-all | grep cronjob | grep Error | wc -l
 566

Cronjob had failedJobsHistoryLimit set to 2
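
For illustration only (this is not the actual spec from the report; the name and schedule are hypothetical), a minimal CronJob with that limit would look roughly like:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: cronjob-example        # hypothetical
spec:
  schedule: "*/5 * * * *"      # hypothetical
  failedJobsHistoryLimit: 2    # keep only the 2 most recent failed jobs
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: cronjob-example
            image: busybox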

Environment:

  • Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.6", GitCommit:"4bc5e7f9a6c25dc4c03d4d656f2cefd21540e28c", GitTreeState:"clean", BuildDate:"2017-09-15T08:51:09Z", GoVersion:"go1.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.4", GitCommit:"d6f433224538d4f9ca2f7ae19b252e6fcb66a3ae", GitTreeState:"clean", BuildDate:"2017-05-19T18:33:17Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}

  • OS (e.g. from /etc/os-release):

Centos7.3

  • Kernel (e.g. uname -a):

4.4.83-1.el7.elrepo.x86_64 #1 SMP Thu Aug 17 09:03:51 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 2, 2017
@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Oct 2, 2017
@civik civik changed the title failedJobsHistoryLimit not reaping state Error Cronjobs - failedJobsHistoryLimit not reaping state Error Oct 2, 2017
@civik
Author

civik commented Oct 4, 2017

/sig apps

@k8s-ci-robot k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Oct 4, 2017
@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Oct 4, 2017
@dims
Member

dims commented Oct 21, 2017

cc @soltysh

@soltysh
Contributor

soltysh commented Jan 17, 2018

@civik @imiskolee can you folks provide a situation where your pod failed in a cronjob? I'm specifically interested in the phase of the pod (see the official docs).
The only one that comes to mind off the top of my head is when I specify a wrong image, for example. In that case the pod is not failed but pending, which means neither of the controllers (job nor cronjob) can qualify this execution as a failed one and do anything about it. So no removal can actually happen.

There are a few possible approaches to this problem:

  1. Set activeDeadlineSeconds for the job, which will fail the job after it has exceeded that duration.
  2. Ensure backoffLimit is set, which controls the number of retries after which a job is failed, although in the particular example I gave (with the wrong pull spec) this won't help.

Personally, I usually combine the two for tighter control (see the sketch below).
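
A minimal sketch of combining the two (the name, schedule, and values are placeholders; batch/v1beta1 was the CronJob API version at the time). Both fields belong to the Job spec, so they go under jobTemplate.spec:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: example-cron             # hypothetical
spec:
  schedule: "*/5 * * * *"        # hypothetical
  failedJobsHistoryLimit: 2
  jobTemplate:
    spec:
      activeDeadlineSeconds: 300 # fail the job once it has been active for 5 minutes
      backoffLimit: 3            # fail the job after 3 failed pod retries
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: example-cron
            image: busybox
            args: ["/bin/false"]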

@soltysh
Contributor

soltysh commented Jan 17, 2018

I've also created #58384 to discuss the start timeout for a job.

@civik
Author

civik commented Feb 13, 2018

@soltysh Thanks for the update. I'm thinking the issues I'm seeing are due to jobs that create another pod with restartPolicy set to OnFailure or Always, which then goes into CrashLoopBackOff. The job will happily keep stamping out pods that sit in a restart loop. Is there some sort of timer that could be set on the parent job that could kill anything it created on a failure?

@kow3ns kow3ns added this to Backlog in Workloads Feb 27, 2018
@soltysh
Contributor

soltysh commented Mar 8, 2018

@civik iiuc your job is creating another pod, in which case there's no controller owning your pod. In that case you have two options:

  1. set the OwnerRef (see the sketch below), but that will only remove the pod when the owning pod/job is being removed
  2. manually clean your pods
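
Purely as an illustration of option 1 (every name and the uid here are placeholders; in practice the uid must be the real UID of the owning job or pod):

apiVersion: v1
kind: Pod
metadata:
  name: worker-created-by-job                   # hypothetical pod spawned from inside the job
  ownerReferences:
  - apiVersion: batch/v1
    kind: Job
    name: my-cronjob-job                        # hypothetical owning job
    uid: 00000000-0000-0000-0000-000000000000   # must match the owner's actual UID
    controller: true
    blockOwnerDeletion: true
spec:
  restartPolicy: Never
  containers:
  - name: worker
    image: busybox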

@mcronce

mcronce commented Mar 11, 2018

I'm seeing this happen as well (1.7.3) - successfulJobsHistoryLimit (set to 2) works fine, but failedJobsHistoryLimit (set to 5) will end up with hundreds of pods in CrashLoopBackOff until eventually it hits my nodes' resource limits and then they just stack up in Pending

@soltysh
Contributor

soltysh commented Mar 16, 2018

Pending pods are not failed ones and thus the controller won't be able to clean them.

@KIVagant

KIVagant commented Mar 17, 2018

Same problem for me: I've got ~8000 pods in state "Error" when failedJobsHistoryLimit was set to 5.
The cronjob had a wrong environment variable, so the containers failed while trying to start at the application level. From the K8s side the configuration was OK, but an internal application error led to this situation.

@mcronce

mcronce commented Mar 17, 2018

@soltysh Correct - however, it should be reaping the ones in Error and CrashLoopBackOff. If it does that correctly, the cluster's resource limits aren't exhausted and they never stack up in Pending.

@soltysh
Contributor

soltysh commented Mar 19, 2018

@KIVagant @mcronce can you give me the YAML of the pod status you're seeing in the Error state? CrashLoopBackOff is specific, but unfortunately it does not give a definite answer that the pod failed. If you look carefully through the pod status you'll see it's in the waiting state: scheduled, initialized, and waiting for further actions. Nowhere in the code do we have any special casing for situations such as this one, and I'm hesitant to add that to the job controller as well. I'll try to bring this discussion to the next sig-apps meeting and see what the outcome is.

@mcronce

mcronce commented Mar 19, 2018

@soltysh Right now I don't have any, I've been manually clearing them with a little bash one-liner for a while. Next time I experience it, though, I'll grab the YAML and paste it here. Thanks!

@KIVagant

KIVagant commented Mar 20, 2018

Same for me; I've already fixed the root cause for the failed pods and cleared all of them. I can reproduce the situation, but right now I have a much bigger problem with the cluster and kops, so maybe later.

@garethlewin

garethlewin commented Apr 17, 2018

@soltysh here are the results of describe and get -o yaml with a bunch of stuff removed (tried to keep just what is relevant)

Status:         Failed
Containers:
  kollector:
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 17 Apr 2018 04:50:26 -0700
      Finished:     Tue, 17 Apr 2018 04:50:27 -0700
    Ready:          False
    Restart Count:  0
Conditions:
  Type           Status
  Initialized    True 
  Ready          False 
  PodScheduled   True 
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.alpha.kubernetes.io/notReady:NoExecute for 300s
                 node.alpha.kubernetes.io/unreachable:NoExecute for 300s
apiVersion: v1
kind: Pod
metadata:
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
spec:
  containers:
  restartPolicy: Never
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2018-04-17T11:50:25Z
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: 2018-04-17T11:50:25Z
    message: 'containers with unready status: [kollector]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: 2018-04-17T11:50:25Z
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://7f9ca3488d4e714f1264620b2385cbf2b8ced40de26e6f5a0ec22e73385701ed
    lastState: {}
    name: kollector
    ready: false
    restartCount: 0
    state:
      terminated:
        containerID: docker://7f9ca3488d4e714f1264620b2385cbf2b8ced40de26e6f5a0ec22e73385701ed
        exitCode: 1
        finishedAt: 2018-04-17T11:50:27Z
        reason: Error
        startedAt: 2018-04-17T11:50:26Z
  phase: Failed
  qosClass: Burstable
  startTime: 2018-04-17T11:50:25Z

@soltysh
Copy link
Contributor

soltysh commented May 10, 2018

Apparently there's #62382 which I fixed in #63650. Maybe you're hitting that?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 8, 2018
@civik
Author

civik commented Aug 24, 2018

/remove-lifecycle stale

I think this might still be an active issue impacting operators. Can anyone confirm whether this was fixed by #63650? I don't have an environment in which to test this right now.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 24, 2018
@soltysh
Contributor

soltysh commented Aug 29, 2018

@civik nope, the linked PR is for handling backoffs, not for addressing problems with the Error state, which is far more complicated, as I said before.

@mrak

mrak commented Sep 5, 2019

/reopen

We are still seeing this on 1.13 and 1.14

@k8s-ci-robot
Contributor

@mrak: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

We are still seeing this on 1.13 and 1.14

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@2rs2ts
Contributor

2rs2ts commented Sep 5, 2019

/reopen

Seeing this on 1.14.6 ATM

@k8s-ci-robot
Contributor

@2rs2ts: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Seeing this on 1.14.6 ATM

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@2rs2ts
Contributor

2rs2ts commented Sep 5, 2019

@civik can you reopen this?

@gitnik

gitnik commented Sep 30, 2019

Hey guys,
what's the state of this issue? Since it wasn't reopened I am assuming it was maybe fixed? But we are still seeing this issue in 1.14

@2rs2ts
Contributor

2rs2ts commented Oct 9, 2019

It was probably not fixed, people just ghost on their own issues :/

@vincent-pli

Seems the issue is still there:

// Finished jobs are split into successful/failed buckets for history-limit cleanup;
// jobs without a Complete or Failed condition fall into neither bucket, so they are
// never counted against failedJobsHistoryLimit.
for _, job := range js {
	isFinished, finishedStatus := getFinishedStatus(&job)
	if isFinished && finishedStatus == batchv1.JobComplete {
		successfulJobs = append(successfulJobs, job)
	} else if isFinished && finishedStatus == batchv1.JobFailed {
		failedJobs = append(failedJobs, job)
	}
}

@2rs2ts
Contributor

2rs2ts commented Jun 30, 2020

Should I file a duplicate issue since the OP has not reopened the issue?

@alejandrox1
Contributor

reopening this because i see a lot of attempts to do so (only org members can use prow commands).
/reopen

@k8s-ci-robot
Contributor

@alejandrox1: Reopened this issue.

In response to this:

reopening this because i see a lot of attempts to do so (only org members can use prow commands).
/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Jul 14, 2020
Workloads automation moved this from Done to Backlog Jul 14, 2020
@alejandrox1
Contributor

gonna freeze this until someone wants to volunteer to work on this
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Jul 14, 2020
@ohthehugemanatee

ohthehugemanatee commented Aug 6, 2020

I thought I ran into this with an easy-to-reproduce example... but in the end it validates that .spec.backoffLimit works as intended. I note that the other examples with information to reproduce all happen before default .spec.backoffLimit was introduced.

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: curl
spec:
  schedule: "0 * * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: curl
            image: buildpack-deps:curl
            args:
            - /bin/sh
            - -ec
            - curl http://some-service
          restartPolicy: Never

I made a mistake and forgot that some-service is listening on port 3000, not 80. So curl fails to connect and times out. Came back in the morning and I had 5 empty pods in status Error. Looks like the default .spec.backoffLimit value worked just fine for me. I suggest that addition is why we see a sharp dropoff in interest in this issue.

For future developers who feel that they've run into this problem:

  • the job resource has good debugging/logging information for you. Please include the output of kubectl describe job my-job-1596704400 in your post.
  • Please include the YAML spec of your (cron)job, as simplified as you can make it.

@2rs2ts
Contributor

2rs2ts commented Aug 24, 2020

The backoffLimit definitely has helped mitigate this but my company has 768 cronjobs in one of our production clusters :) It is a not-too-uncommon occurrence that we get support requests for cronjobs that haven't fired in a while because of this bug. We're on 1.17.8 now and we still get these requests from time to time.

@soltysh
Contributor

soltysh commented Sep 21, 2020

The problem with the Error state as presented in kubectl is that these are usually jobs that are still running. It's hard for the controller to speculate whether such an error state is permanent or temporary. Unless there's a clear Failed signal, the controller won't be able to differentiate between the two. So this is not quite a bug.

@2rs2ts
Contributor

2rs2ts commented Oct 7, 2020

We have jobs that don't restart when they get an error and they don't get reaped sometimes. So it does seem like a bug to me.

@soltysh
Contributor

soltysh commented Nov 26, 2020

Do you have an example yaml of such a failed pod?

@2rs2ts
Contributor

2rs2ts commented Dec 2, 2020

@soltysh if I find a repro case I will share it, however it'll be pretty heavily redacted (company secrets and all that) so I'm not sure how much help that'll be.

@gtorre

gtorre commented Jan 20, 2021

This is an issue for us as well:

provisioner-supervise-1607355000-jpn2q          0/1     Error       0          44d
provisioner-supervise-1607355000-lj6lp          0/1     Error       0          44d
provisioner-supervise-1607355000-pjnkr          0/1     Error       0          44d
provisioner-supervise-1607355000-szlpd          0/1     Error       0          44d
provisioner-supervise-1607355000-vfh9z          0/1     Error       0          44d
provisioner-supervise-1607355000-z4rsx          0/1     Error       0          44d
provisioner-supervise-1607355000-zh9vx          0/1     Error       0          44d
provisioner-supervise-1608060600-2vcsd          0/1     Error       0          35d
provisioner-supervise-1608060600-kckfl          0/1     Error       0          35d
provisioner-supervise-1608060600-mdqgp          0/1     Error       0          35d
provisioner-supervise-1608060600-nlgsg          0/1     Error       0          35d
provisioner-supervise-1608060600-zbws7          0/1     Error       0          35d
provisioner-supervise-1608060600-zvgmc          0/1     Error       0          35d
provisioner-supervise-1611159000-dss9j          0/1     Completed   0          9m3s

Our cronjob spec looks like this:

spec:
  schedule: "*/10 * * * *"
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 2

Per @soltysh's request in a previous comment, here is the JSON output of a failed pod:

{
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "creationTimestamp": "2020-12-07T15:31:31Z",
        "generateName": "provisioner-supervise-1607355000-",
        "labels": {
            "controller-uid": "8aa58562-fc22-4782-b94e-a2dcb6071328",
            "job-name": "provisioner-supervise-1607355000"
        },
        "managedFields": [
            {
                "apiVersion": "v1",
                "fieldsType": "FieldsV1",
                "fieldsV1": {
                    "f:metadata": {
                        "f:generateName": {},
                        "f:labels": {
                            ".": {},
                            "f:controller-uid": {},
                            "f:job-name": {}
                        },
                        "f:ownerReferences": {
                            ".": {},
                            "k:{\"uid\":\"8aa58562-fc22-4782-b94e-a2dcb6071328\"}": {
                                ".": {},
                                "f:apiVersion": {},
                                "f:blockOwnerDeletion": {},
                                "f:controller": {},
                                "f:kind": {},
                                "f:name": {},
                                "f:uid": {}
                            }
                        }
                    },
                    "f:spec": {
                        "f:containers": {
                            "k:{\"name\":\"provisioner-supervise\"}": {
                                ".": {},
                                "f:args": {},
                                "f:image": {},
                                "f:imagePullPolicy": {},
                                "f:name": {},
                                "f:resources": {
                                    ".": {},
                                    "f:limits": {
                                        ".": {},
                                        "f:cpu": {},
                                        "f:memory": {}
                                    },
                                    "f:requests": {
                                        ".": {},
                                        "f:cpu": {},
                                        "f:memory": {}
                                    }
                                },
                                "f:terminationMessagePath": {},
                                "f:terminationMessagePolicy": {}
                            }
                        },
                        "f:dnsPolicy": {},
                        "f:enableServiceLinks": {},
                        "f:restartPolicy": {},
                        "f:schedulerName": {},
                        "f:securityContext": {},
                        "f:terminationGracePeriodSeconds": {}
                    }
                },
                "manager": "kube-controller-manager",
                "operation": "Update",
                "time": "2020-12-07T15:31:31Z"
            },
            {
                "apiVersion": "v1",
                "fieldsType": "FieldsV1",
                "fieldsV1": {
                    "f:status": {
                        "f:conditions": {
                            "k:{\"type\":\"ContainersReady\"}": {
                                ".": {},
                                "f:lastProbeTime": {},
                                "f:lastTransitionTime": {},
                                "f:message": {},
                                "f:reason": {},
                                "f:status": {},
                                "f:type": {}
                            },
                            "k:{\"type\":\"Initialized\"}": {
                                ".": {},
                                "f:lastProbeTime": {},
                                "f:lastTransitionTime": {},
                                "f:status": {},
                                "f:type": {}
                            },
                            "k:{\"type\":\"Ready\"}": {
                                ".": {},
                                "f:lastProbeTime": {},
                                "f:lastTransitionTime": {},
                                "f:message": {},
                                "f:reason": {},
                                "f:status": {},
                                "f:type": {}
                            }
                        },
                        "f:containerStatuses": {},
                        "f:hostIP": {},
                        "f:phase": {},
                        "f:podIP": {},
                        "f:podIPs": {
                            ".": {},
                            "k:{\"ip\":\"some_ip\"}": {
                                ".": {},
                                "f:ip": {}
                            }
                        },
                        "f:startTime": {}
                    }
                },
                "manager": "kubelet",
                "operation": "Update",
                "time": "2020-12-07T15:31:42Z"
            }
        ],
        "name": "provisioner-supervise-1607355000-szlpd",
        "namespace": "some_namespace",
        "ownerReferences": [
            {
                "apiVersion": "batch/v1",
                "blockOwnerDeletion": true,
                "controller": true,
                "kind": "Job",
                "name": "provisioner-supervise-1607355000",
                "uid": "8aa58562-fc22-4782-b94e-a2dcb6071328"
            }
        ],
        "resourceVersion": "453999483",
        "selfLink": "/api/v1/namespaces/some_namespace/pods/provisioner-supervise-1607355000-szlpd",
        "uid": "9dab634b-e100-4847-b371-9125c65b615d"
    },
    "spec": {
        "containers": [
            {
                "args": [
                    "/bin/sh",
                    "-c",
                    "wget -SO - https://provisioner.example.net/endpoint"
                ],
                "image": "busybox",
                "imagePullPolicy": "Always",
                "name": "provisioner-supervise",
                "resources": {
                    "limits": {
                        "cpu": "500m",
                        "memory": "512Mi"
                    },
                    "requests": {
                        "cpu": "500m",
                        "memory": "512Mi"
                    }
                },
                "terminationMessagePath": "/dev/termination-log",
                "terminationMessagePolicy": "File",
                "volumeMounts": [
                    {
                        "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount",
                        "name": "some-token",
                        "readOnly": true
                    }
                ]
            }
        ],
        "dnsPolicy": "ClusterFirst",
        "enableServiceLinks": true,
        "nodeName": "kubernetes.example.net",
        "priority": 0,
        "restartPolicy": "Never",
        "schedulerName": "default-scheduler",
        "securityContext": {},
        "serviceAccount": "default",
        "serviceAccountName": "default",
        "terminationGracePeriodSeconds": 30,
        "tolerations": [
            {
                "effect": "NoExecute",
                "key": "node.kubernetes.io/not-ready",
                "operator": "Exists",
                "tolerationSeconds": 300
            },
            {
                "effect": "NoExecute",
                "key": "node.kubernetes.io/unreachable",
                "operator": "Exists",
                "tolerationSeconds": 300
            }
        ],
        "volumes": [
            {
                "name": "some-token",
                "secret": {
                    "defaultMode": 420,
                    "secretName": "some-token"
                }
            }
        ]
    },
    "status": {
        "conditions": [
            {
                "lastProbeTime": null,
                "lastTransitionTime": "2020-12-07T15:31:31Z",
                "status": "True",
                "type": "Initialized"
            },
            {
                "lastProbeTime": null,
                "lastTransitionTime": "2020-12-07T15:31:31Z",
                "message": "containers with unready status: [provisioner-supervise]",
                "reason": "ContainersNotReady",
                "status": "False",
                "type": "Ready"
            },
            {
                "lastProbeTime": null,
                "lastTransitionTime": "2020-12-07T15:31:31Z",
                "message": "containers with unready status: [provisioner-supervise]",
                "reason": "ContainersNotReady",
                "status": "False",
                "type": "ContainersReady"
            },
            {
                "lastProbeTime": null,
                "lastTransitionTime": "2020-12-07T15:31:31Z",
                "status": "True",
                "type": "PodScheduled"
            }
        ],
        "containerStatuses": [
            {
                "containerID": "docker://e19bfd01a16a63761b4e3370752c54af2854ef4a9e0a4af6fb94a0bd85befa43",
                "image": "busybox:latest",
                "imageID": "docker-pullable://busybox@sha256:bde48e1751173b709090c2539fdf12d6ba64e88ec7a4301591227ce925f3c678",
                "lastState": {},
                "name": "provisioner-supervise",
                "ready": false,
                "restartCount": 0,
                "started": false,
                "state": {
                    "terminated": {
                        "containerID": "docker://e19bfd01a16a63761b4e3370752c54af2854ef4a9e0a4af6fb94a0bd85befa43",
                        "exitCode": 1,
                        "finishedAt": "2020-12-07T15:31:41Z",
                        "reason": "Error",
                        "startedAt": "2020-12-07T15:31:41Z"
                    }
                }
            }
        ],
        "hostIP": "some_ip",
        "phase": "Failed",
        "podIP": "some_ip",
        "podIPs": [
            {
                "ip": "some_ip"
            }
        ],
        "qosClass": "Guaranteed",
        "startTime": "2020-12-07T15:31:31Z"
    }
}

@soltysh
Contributor

soltysh commented Jun 15, 2021

I think we have a problem in the job controller, not the cronjob controller. A situation somewhat similar to this one is described in #93783. In both cases the job controller will indefinitely try to complete a job, but either due to an error in the pod or other issues (quota, wrong pull spec, etc.) the pod will never start or will always fail. We would need a safety mechanism in the job controller that would eventually fail or pause a job that is permanently stuck.

@soltysh
Contributor

soltysh commented Jun 15, 2021

Hmm... I just tried with an explicitly failing job:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-job
spec:
  jobTemplate:
    metadata:
      name: my-job
    spec:
      template:
        metadata:
        spec:
          containers:
          - image: busybox
            name: my-job
            args:
            - "/bin/false"
          restartPolicy: OnFailure
  schedule: '*/1 * * * *'
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1

It does take a somewhat longer wait, but eventually the job controller fails the job; it just takes a significant amount of time until a pod reaches the error state.

What did the job look like in your situation, where the pod wasn't counted as failed?

@Rkapoor1707

I am facing the same issue: the cronjob pod errors out into CrashLoopBackOff due to some issue, and the subsequent pods just go into the Pending state.
I was able to resolve the CrashLoopBackOff itself, but I have to manually delete all the cron jobs to terminate the pods stuck in Pending. It would be good either to not create these pods in the first place while the previous jobs are failing or stuck, or to terminate them after a certain amount of time instead of spinning up new ones.

I tried setting both .spec.activeDeadlineSeconds and .spec.progressDeadlineSeconds in the cronjob but both did not work. I have backoffLimit set to 0 but that does not terminate any pods.

Has anyone been able to successfully test using another cron job to delete such stuck pods?

@alculquicondor
Member

I tried setting both .spec.activeDeadlineSeconds and .spec.progressDeadlineSeconds in the cronjob but both did not work.

Can you elaborate? Those are fields for the Job spec, so you have to put them as part of .jobTemplate.spec
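
A minimal sketch of that placement for activeDeadlineSeconds and backoffLimit (the name, schedule, and values are hypothetical):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-cron                    # hypothetical
spec:
  schedule: "*/10 * * * *"         # hypothetical
  jobTemplate:
    spec:
      activeDeadlineSeconds: 600   # Job-spec field: mark the job failed after 10 minutes
      backoffLimit: 0              # Job-spec field: do not retry failed pods
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: my-cron
            image: busybox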

@jdnurmi

jdnurmi commented Feb 23, 2023

Just for anyone else who runs across this and is confused: those fields apply to the job. If you care about cleaning up the pods, it's probably easiest to set ttlSecondsAfterFinished on the jobTemplate (a sketch follows below).
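
A minimal sketch of that suggestion (the name and schedule are hypothetical); deleting the finished job also cleans up the pods it owns:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-cron                      # hypothetical
spec:
  schedule: "0 * * * *"              # hypothetical
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 3600  # delete the job (and its pods) an hour after it finishes
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: my-cron
            image: busybox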
