Cronjobs - failedJobsHistoryLimit not reaping state Error
#53331
Comments
/sig apps
cc @soltysh
@civik @imiskolee can you folks provide a situation where your pod failed in a cronjob? I'm specifically interested in the phase of the pod (see the official docs). There are a few possible approaches to this problem:
Personally, I usually try to combine the two for tighter control.
I've also created #58384 to discuss the start timeout for a job.
@soltysh Thanks for the update. I'm thinking the issues I'm seeing are due to jobs that create another pod with restartPolicy set to OnFailure or Always, which then goes into CrashLoopBackOff. The job will happily keep stamping out pods that sit in a restart loop. Is there some sort of timer that could be set on the parent job that could kill anything it created on a failure?
@civik iiuc your job is creating another pod, in which case there's no controller owning your pod. In that case you have two options:
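The two options mentioned above were clipped in this copy of the thread. As a hedged sketch of Job-level controls that bound a permanently failing workload (not necessarily the exact options meant here), spec.backoffLimit caps pod retries and spec.activeDeadlineSeconds caps total runtime; the name and values below are made up for illustration:

apiVersion: batch/v1
kind: Job
metadata:
  name: bounded-example            # hypothetical name
spec:
  backoffLimit: 2                  # stop recreating failed pods after 2 retries
  activeDeadlineSeconds: 300       # fail the job and terminate its pods after 5 minutes
  template:
    spec:
      restartPolicy: Never         # each failure creates a new pod the controller can count
      containers:
      - name: worker               # hypothetical container name
        image: busybox
        args: ["/bin/false"]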
I'm seeing this happen as well (1.7.3) -
Pending pods are not failed ones and thus the controller won't be able to clean them.
Same problem for me: I've got ~8000 pods in state "Error" when failedJobsHistoryLimit was set to 5.
@soltysh Correct - however, it should be reaping the ones in
@KIVagant @mcronce can you give me the yaml of the pod status you're having in
@soltysh Right now I don't have any; I've been manually clearing them with a little bash one-liner for a while. Next time I experience it, though, I'll grab the YAML and paste it here. Thanks!
Same for me, I've already fixed the root cause for the failed pods and cleared all of them. I can reproduce the situation, but right now I have a much bigger problem with the cluster and kops, so maybe later.
@soltysh here are the results of
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale I think this might still be an active issue impacting operators. Can anyone confirm whether this was fixed by #63650? I don't have an environment in which to test this right now.
@civik nope, the linked PR is for handling backoffs, not to address problems with
/reopen We are still seeing this on
@mrak: You can't reopen an issue/PR unless you authored it or you are a collaborator. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen Seeing this on 1.14.6 ATM
@2rs2ts: You can't reopen an issue/PR unless you authored it or you are a collaborator. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@civik can you reopen this?
Hey guys, |
It was probably not fixed, people just ghost on their own issues :/
Seems the issue is still there: kubernetes/pkg/controller/cronjob/cronjob_controller.go, lines 162 to 168 in 7766e65
Should I file a duplicate issue since the OP has not reopened the issue? |
Reopening this because I see a lot of attempts to do so (only org members can use prow commands).
@alejandrox1: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Gonna freeze this until someone wants to volunteer to work on this.
I thought I ran into this with an easy-to-reproduce example... but in the end it validates that
I made a mistake and forgot that
For future developers who feel that they've run into this problem:
The problem with the Error state as presented in kubectl is that these are usually pods from jobs that are still running. It's hard for the controller to speculate whether such an error state is permanent or temporary. Unless there's a clear Failed signal, the controller won't be able to differentiate between those. So this is not quite a bug.
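A hypothetical, abbreviated pod status (not taken from this issue) may help illustrate the point: with restartPolicy: OnFailure the pod phase stays Running while the container crash-loops, so kubectl shows Error or CrashLoopBackOff even though there is no Failed phase for the controllers to act on:

status:
  phase: Running                 # the pod never reaches the Failed phase
  containerStatuses:
  - name: my-job                 # hypothetical container name
    restartCount: 4
    state:
      waiting:
        reason: CrashLoopBackOff # what kubectl may show in the STATUS column
    lastState:
      terminated:
        exitCode: 1
        reason: Error            # the "Error" seen between restarts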
We have jobs that don't restart when they get an error, and they don't get reaped sometimes. So it does seem like a bug to me.
Do you have an example yaml of such a failed pod?
@soltysh if I find a repro case I will share it, however it'll be pretty heavily redacted (company secrets and all that) so I'm not sure how much help that'll be.
This is an issue for us as well:
Our cronjob spec looks like this:
Per @soltysh's request in a previous comment, here is the json output of a failed pod:
I think we have a problem in the job controller, not the cronjob controller. A somewhat similar situation is described in #93783. In both cases the job controller will indefinitely try to complete a job, but either due to an error in the pod or other issues (quota, wrong pull spec, etc.) the pod will never start or will always fail. We would need a safety mechanism in the job controller which would eventually fail or pause a job that is permanently stuck.
Hmm... I just tried with an explicitly failing job:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-job
spec:
  jobTemplate:
    metadata:
      name: my-job
    spec:
      template:
        metadata:
        spec:
          containers:
          - image: busybox
            name: my-job
            args:
            - "/bin/false"
          restartPolicy: OnFailure
  schedule: '*/1 * * * *'
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1

It does take a somewhat longer wait, but eventually the job controller fails the job; it just takes a significant amount of time until a pod reaches the error state. What did the job look like in your situation, where the pod wasn't counted as failed?
I am facing the same issue: the cronjob pod errors out into CrashLoopBackOff due to some issue, and the following pods just go into Pending state. I tried setting both .spec.activeDeadlineSeconds and .spec.progressDeadlineSeconds in the cronjob, but neither worked. I have backoffLimit set to 0, but that does not terminate any pods. Has anyone been able to successfully test using another cron job to delete such stuck pods?
Can you elaborate? Those are fields for the Job spec, so you have to put them as part of .jobTemplate.spec.
Just for someone else who runs across this and is confused - those fields apply to the Job spec, not the CronJob spec itself.
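A minimal sketch of that placement (standard field names; the object name, schedule, and values are made up for illustration): the Job-level fields sit under the CronJob's .spec.jobTemplate.spec, not at the top level of the CronJob spec:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: cleanup-example            # hypothetical name
spec:
  schedule: '*/5 * * * *'
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      backoffLimit: 0              # Job spec field: do not retry failed pods
      activeDeadlineSeconds: 120   # Job spec field: hard cap on runtime
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: task             # hypothetical container name
            image: busybox
            args: ["/bin/true"]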
/kind bug
/sig apps
Cronjob limits were defined in #52390 - however it doesn't appear that failedJobsHistoryLimit will reap cronjob pods that end up in a state of Error.
The Cronjob had failedJobsHistoryLimit set to 2.
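For context, a minimal CronJob of the shape described (a hypothetical reconstruction for illustration, not the reporter's actual manifest) would look roughly like this, with failedJobsHistoryLimit set to 2:

apiVersion: batch/v2alpha1         # API group used by CronJobs on 1.6/1.7-era clusters
kind: CronJob
metadata:
  name: example-cronjob            # hypothetical name
spec:
  schedule: '*/1 * * * *'
  failedJobsHistoryLimit: 2        # the limit that is not reaping pods stuck in Error
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: task             # hypothetical container name
            image: busybox
            args: ["/bin/false"]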
Environment:
- Kubernetes version (kubectl version):
  Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.6", GitCommit:"4bc5e7f9a6c25dc4c03d4d656f2cefd21540e28c", GitTreeState:"clean", BuildDate:"2017-09-15T08:51:09Z", GoVersion:"go1.9", Compiler:"gc", Platform:"darwin/amd64"}
  Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.4", GitCommit:"d6f433224538d4f9ca2f7ae19b252e6fcb66a3ae", GitTreeState:"clean", BuildDate:"2017-05-19T18:33:17Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
- OS: CentOS 7.3
- Kernel (uname -a): 4.4.83-1.el7.elrepo.x86_64 #1 SMP Thu Aug 17 09:03:51 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux