Fix CronJob missed start time handling #81557
Conversation
Hi @praseodym. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with the appropriate command. Once the patch is verified, the new status will be reflected by the corresponding label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: praseodym The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/assign @janetkuo
/ok-to-test
/retest
Force-pushed from 0b82197 to 047c1d0
/assign @soltysh
/assign @mortent
It's not actually clear to me that we should do this.
/assign @kow3ns
```go
	}
	return nil, nil
}
if next := sched.Next(earliestTime); !next.After(now) {
```
This is wrong, and the previous logic was trying to prevent us from doing this. Imagine the situation described in the comment you removed: at 5pm Friday something is borked and you only notice the problem on Tuesday; in the meantime 80 jobs were held up, and now we'll suddenly start creating all 80 of those jobs, which is wrong. If you want to fix it, I'd prefer we emit an event that there was a problem and we missed X starts, and pick only the last date. The current logic does not have that protection.
This PR will not behave the way you describe, because we now set `Status.LastScheduleTime` to the actual start time instead of the original scheduled time.
In your example, treating this as an hourly cron job: the Friday 6pm job will get scheduled after the problem is fixed, say at Tuesday 10:30am, but the next run after that will be at Tuesday 11am. All other missed executions will be skipped.
The tests have been changed to reflect this behaviour as well.
> Other missed executions will be skipped.
Just skipping and moving on is not good practice; it may result in unpredictable situations.
Making the number of missed starts configurable might be better.
@bbbmj It's still better than not running the cronjob at all unless it's deleted and recreated (which is what happens now), so making this configurable should be done as a separate PR.
@praseodym I see your point, apologies for the initial misunderstanding. I see a lot of benefit in your approach, but at the same time this fix will break backwards compatibility. We need to ensure that the previous behaviour is somehow configurable, maybe through an additional field (i.e. how many missed jobs can be ignored, or whether a cronjob should be paused after a significant failure). Can we start with the latter, which seems to be the easiest and is both backwards compatible and at the same time will give users more control over getting a cronjob back into shape? I've added this topic to the sig-apps agenda for the next meeting (according to my calendar it should be Jan 13th). Let's discuss the best possible approach there and we can continue from there.
I’d be against introducing additional complexity in the codebase; this PR (in my opinion) is currently mostly cleanup (see e.g. #73169, which this PR also fixes).
Furthermore, I don’t see anyone relying on the current behaviour of CronJobs breaking after 100 missed executions (they’re not paused — they break and need to be completely recreated). The discussion in #42649 also doesn’t contain any comments defending the current behaviour (it does contain references to workarounds that should make any Kubernetes maintainer cry). Finally, the KEP for Graduating CronJob to GA (unfortunately not merged) also lists this issue as a bug, not as an enhancement.
I propose we move forward with this PR as-is, and include a clear release note instead.
Rebased
This removes the arbitrary limit of 100 missed start times and will instead always schedule a job when an execution was missed and the `Spec.StartingDeadlineSeconds` period has not yet passed (if set). To prevent starting multiple jobs if multiple start times were missed, `Status.LastScheduleTime` is now set to the actual start time of the job instead of the original scheduled start time. The "missed starting window" warning message is removed because it was broken (see kubernetes#73169).
Force-pushed from 2ae22f1 to a9ef15d
This issue was first raised in kubernetes#73169. In the past, we've seen that when the cronjob controller's syncAll loop time exceeds startingDeadlineSeconds, deadlines get missed. However, we currently do not get any notifications or telemetry on when deadlines are missed. Allowing the `MissedStart` event emission code path is the first step to getting telemetry on missed deadlines. There have been attempts to fix this by removing the "TooLate (MissedStart)" check wholesale:

1. kubernetes#81557 (open)
2. kubernetes#74058 (closed)

I don't believe this check should be removed, since the user should be notified when their deadline is missed.

Changes:

- test when startingDeadline elapses, should count as an unmet start time
- make tests expect `MissedStart` event when startingDeadline is missed
- fix "TooLate (MissedStart)" condition never being hit
I believe the site will also need to be updated for this. https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-job-limitations and https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/#starting-deadline reference the old behavior.
@2rs2ts that's correct; as mentioned in the initial PR message, I will do this as soon as this is merged. If there's another procedure for this please let me know!
@praseodym maybe I'm misreading, but there are two docs I linked and your summary only links one of them. Maybe I should have said "this doc also", but I was just listing the places where it occurs, to be complete.
@praseodym @soltysh Can we go ahead and merge this PR? Or is there something else blocking us?
This PR still needs to be approved by a reviewer before it can be merged.
Does #93370 obsolete this PR? I am confused about it.
That PR introduces a new v2 controller, but still includes the 100 missed start times limit 'feature' from the v1 controller.
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so. Send feedback to sig-contributor-experience at kubernetes/community.
/remove-lifecycle stale
This is not true at this point.
@praseodym: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@soltysh: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Why did this get closed?
Sounds like they're expecting people to upgrade to the v2 cronjob controller to get the fix? Seems a bit rough.
Could you please point out how the v2 cronjob controller gets the fix? It still includes the 100 missed start times limit 'feature' from the v1 controller.
@b0b0haha I don't know, that was over a year ago. Maybe there had been discussion somewhere else about reimplementing this fix for the v2 controller, but it sounds like that never happened. I see comments from people thinking that the "feature" had been fixed (i.e. removed) in the new implementation, but maybe that is mistaken: #42649 (comment)
What type of PR is this?
/kind bug
/sig apps
/area workload-api/cronjob
/priority important-soon
What this PR does / why we need it:
This removes the arbitrary limit of 100 missed start times and will instead always schedule a job when an execution was missed and the `Spec.StartingDeadlineSeconds` period has not yet passed (if set). To prevent starting multiple jobs if multiple start times were missed, `Status.LastScheduleTime` is now set to the actual start time of the job instead of the original scheduled start time. The "missed starting window" warning message is removed because it was broken (see #73169).
Which issue(s) this PR fixes:
Fixes #42649
Fixes #73169
Special notes for your reviewer:
See #42649 for numerous examples of problems caused by the current arbitrary limit of 100 missed start times, which is easily hit in real-world usage. For example, with a cron job set to execute every minute, 101 minutes of controller downtime (due to maintenance or some bug in the controller) already causes the cron job to permanently fail.
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
The above doc page will need to be updated if this PR is merged, I will create a PR in https://github.com/kubernetes/website once this is approved.