Stopping cluster overnight prevents scheduled jobs from running after cluster startup. #42649
Comments
@kubernetes/sig-apps-bugs
There was a similar problem recently fixed in #36311, although in this particular use case it's hard to tell, from the controller's point of view, whether the cluster was down or something else broke. I could only suggest tweaking …
@soltysh if I set …
@lpreson up to …
@soltysh This could be a common approach for saving money in the cloud when running non-production clusters. Thank you for the interim suggestion, I will try this out.
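For anyone landing here: the workaround that comes up throughout this thread (and in the error message quoted below) is setting .spec.startingDeadlineSeconds on the CronJob. A minimal sketch of what that looks like, assuming a one-minute schedule; the name, image, and the 200-second value are illustrative, and the apiVersion depends on your cluster version:

```yaml
apiVersion: batch/v1            # batch/v1beta1 or batch/v2alpha1 on older clusters
kind: CronJob
metadata:
  name: nightly-example         # illustrative name
spec:
  schedule: "*/1 * * * *"
  # With startingDeadlineSeconds set, the controller only looks back this far
  # when counting missed schedules, so the count is bounded by
  # startingDeadlineSeconds / schedule interval (here 200s / 60s, i.e. a
  # handful at most) and can never reach the hard-coded limit of 100.
  startingDeadlineSeconds: 200
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: example
            image: busybox
            command: ["date"]
```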
With the current time frame I'm not sure. I'll be sweeping through my bugs over the next few weeks and I'll see what I can do.
Issues go stale after 90d of inactivity. Prevent issues from auto-closing with an …. If this issue is safe to close now please do so with …. Send feedback to sig-testing, kubernetes/test-infra and/or …
@lpreson - I was wondering whether setting .spec.startingDeadlineSeconds and/or setting .spec.restartPolicy to Forbid resolved this issue. We have a similar issue with 2 of our non-prod clusters that we shut down each night.
I'm still experiencing this on …
I've got the same problem with Minikube 0.25.0 running a Kubernetes 1.9.0 server in VirtualBox on my laptop. When I close the lid and the laptop suspends, the minikube VM obviously stops, and upon resume my scheduled job keeps failing with: Cannot determine if job needs to be started: Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew.
This is due to the hard-coded value for how many missed start times the cronjob controller can handle. It's reasonable to make it configurable, I guess.
Actually, I was thinking about it more this morning. Currently, if we exceed the artificial, hard-coded limit, the cronjob will always error out with no chance of recovering. Maybe we should error out as we do today and pause the cronjob to prevent it from erroring out further, while giving the user the ability to restart it. I'd like to hear what others think about such an approach.
A restart button was exactly what I was looking for in the dashboard! I couldn't find a way to do it and had to delete and recreate the job :(
I would prefer it to be configurable. I could then set it greater than the number of scheduled starts it would have had overnight while our EC2 instances are shut down. I need something automated; I was thinking about adding a check to my monitoring pod to see if the scheduled job was dead and then recreate it.
+1
This just bit us today, too. We had to work around the issue by deleting and re-adding part of our Helm chart on the cluster :( It would be great to be able to disable/re-enable a CronJob.
Not only would being able to restart the CronJob be nice, but a suspended CronJob should also be excluded from this error - in that case, the user has specified that they don't want the CronJob to run, so it should be okay for it to start again on schedule as soon as the user has resumed it. If the above is not a major change, I'd like to pick it up if someone else hasn't yet.
@dzoeteman suspension is already supported, see …
If I flip a "stuck" CronJob to …
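For reference, the suspend toggle being discussed is the .spec.suspend field on the CronJob; a sketch of the relevant fragment of the spec (whether flipping it actually un-sticks a CronJob that has hit the missed-start limit is exactly what is being asked above):

```yaml
# Fragment of a CronJob spec, not a complete manifest.
# With suspend: true the controller stops creating new Jobs for this CronJob;
# setting it back to false resumes scheduling.
spec:
  suspend: true   # flip to false to resume
```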
Same issue happens on Kubernetes 1.18.6.
Another fix for this same thing? #89397
This is being fixed in the new controller, #93370.
Still experiencing it with minikube v1.16.0 (Kubernetes v1.20.0).
@Skybladev2 did you enable the …
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with …. Send feedback to sig-contributor-experience at kubernetes/community.
/remove-lifecycle stale
This should now be resolved with the new controller implementation.
@soltysh: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
* Fix for openshift-etcd-backup cronjob: For every CronJob, the CronJob controller checks how many schedules it missed between its last scheduled time and now. If there are more than 100 missed schedules, it does not start the job and logs an error. If startingDeadlineSeconds is set, the cronjob will survive a cluster downtime. See https://access.redhat.com/solutions/3667021 and kubernetes/kubernetes#42649 for details.
* update chart version
If the cluster is down for an extended period of time, the cronjob will fail until recreated. The behavior is fixed in CronV2, but we don't use it yet. For now, use this fix. See kubernetes/kubernetes#42649
FWIW - another use case besides shutting the cluster down is pausing the job. In GCP it's easy to pause a cron job, and we often do this. If the job is having an issue, I'll pause the prod container for hours while it's being debugged, and then run into this issue when trying to resume. We are using an old GKE/K8s version, so I am hoping it goes away when upgraded. "Run Now" still works for one-off runs, but then the cron does not resume after. I'm going to try restartPolicy=…
* Fix CronJob deadlock: apparently if we fail to start 100 jobs we just stop forever; this left my cluster in a state with no endpoint updates. See kubernetes/kubernetes#42649
* fix type
Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.):
No
What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):
CronJob
ScheduledJob
"Too many missed start times to list"
Is this a BUG REPORT or FEATURE REQUEST? (choose one):
Feature Request
Kubernetes version (use kubectl version):
Environment:
uname -a: Linux 4.9.11-1.el7.elrepo.x86_64 #1 SMP Sat Feb 18 18:16:50 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
What happened:
The cluster is shut down nightly to reduce costs. On startup, the cluster refuses to run scheduled jobs.
The controller manager errors with:
E0307 15:40:23.754617 1 controller.go:163] Cannot determine if default/ needs to be started: Too many missed start times to list
What you expected to happen:
The cluster resumes running scheduled jobs after being restored.
How to reproduce it (as minimally and precisely as possible):
Set a regular schedule such as
*/1 * * * *
Shut the cluster down for 100 scheduled executions of the job (100 minutes), then start it back up.
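For convenience, a minimal CronJob manifest matching the reproduction steps above; the name and image are illustrative, and the apiVersion depends on the cluster version (newer clusters use batch/v1):

```yaml
apiVersion: batch/v1        # older clusters: batch/v1beta1 or batch/v2alpha1
kind: CronJob
metadata:
  name: every-minute        # illustrative name
spec:
  schedule: "*/1 * * * *"   # one scheduled start per minute of downtime
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: date
            image: busybox
            command: ["date"]
```

After the cluster has been down for more than 100 schedule intervals, the controller manager logs the error shown above and stops creating Jobs for this CronJob until it is recreated (or, per the workaround discussed earlier in the thread, startingDeadlineSeconds is set).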
Anything else we need to know:
This seems to be related to:
Function: getRecentUnmetScheduleTimes
v1.4: https://github.com/kubernetes/kubernetes/blob/release-1.4/pkg/controller/scheduledjob/utils.go#L169