
Fix CronJob missed start time handling #81557

Conversation

praseodym
Contributor

@praseodym praseodym commented Aug 17, 2019

What type of PR is this?

/kind bug
/sig apps
/area workload-api/cronjob
/priority important-soon

What this PR does / why we need it:

This removes the arbitrary limit of 100 missed start times and will instead always schedule a job when an execution was missed and the Spec.StartingDeadlineSeconds period has not yet passed (if set).

To prevent starting multiple jobs if multiple start times were missed, Status.LastScheduleTime is now set to the actual start time of the job instead of the original scheduled start time.

The "missed starting window" warning message is removed because it was broken (see #73169).

Which issue(s) this PR fixes:

Fixes #42649
Fixes #73169

Special notes for your reviewer:

See #42649 for numerous examples of problems caused by the current arbitrary limit of 100 missed start times, which is easily hit in real-world usage. For example, with a cron job set to execute every minute, 101 minutes of controller downtime (due to maintenance or some bug in the controller) already causes the cron job to permanently fail.
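
For contrast, here is a hedged sketch of the pre-PR behaviour described above (an approximation, not the actual utils.go code): missed start times are enumerated one by one, and the controller gives up once more than 100 have accumulated.

```go
package main

import (
	"fmt"
	"time"
)

// Schedule is a stand-in for the cron schedule interface (see the sketch above).
type Schedule interface {
	Next(after time.Time) time.Time
}

// everyMinute is a toy Schedule that fires at the top of every minute (UTC).
type everyMinute struct{}

func (everyMinute) Next(after time.Time) time.Time {
	return after.Truncate(time.Minute).Add(time.Minute)
}

// missedStartTimes approximates the old bounded enumeration: after more than
// 100 missed ticks it returns an error and the CronJob never runs again.
func missedStartTimes(sched Schedule, earliestTime, now time.Time) ([]time.Time, error) {
	var starts []time.Time
	for t := sched.Next(earliestTime); !t.After(now); t = sched.Next(t) {
		starts = append(starts, t)
		if len(starts) > 100 {
			return nil, fmt.Errorf("too many missed start times (> 100); set or decrease .spec.startingDeadlineSeconds or check clock skew")
		}
	}
	return starts, nil
}

func main() {
	// 101 minutes of simulated controller downtime for an every-minute schedule.
	start := time.Date(2019, 8, 17, 12, 0, 0, 0, time.UTC)
	if _, err := missedStartTimes(everyMinute{}, start, start.Add(101*time.Minute)); err != nil {
		fmt.Println("cronjob is now permanently stuck:", err)
	}
}
```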

Does this PR introduce a user-facing change?:

Fix CronJob missed start time handling: arbitrary limit of 100 missed start times is removed. CronJob `Status.LastScheduleTime` is now set to the actual start time of the job instead of the original scheduled start time.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [Usage]: https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/

The above doc page will need to be updated if this PR is merged; I will create a PR in https://github.com/kubernetes/website once this is approved.

@k8s-ci-robot added the release-note, kind/bug, size/L, sig/apps, area/workload-api/cronjob, cncf-cla: yes, and needs-priority labels on Aug 17, 2019
@k8s-ci-robot
Contributor

Hi @praseodym. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-ok-to-test label on Aug 17, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: praseodym
To complete the pull request process, please assign janetkuo
You can assign the PR to them by writing /assign @janetkuo in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@praseodym
Contributor Author

/assign @janetkuo

@k8s-ci-robot added the priority/important-soon label and removed the needs-priority label on Aug 18, 2019
@neolit123
Member

/ok-to-test

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label on Aug 21, 2019
@praseodym
Contributor Author

/retest

@praseodym force-pushed the fix-cronjob-missed-start-time-handling branch from 0b82197 to 047c1d0 on August 22, 2019 20:17
@kow3ns
Member

kow3ns commented Sep 4, 2019

/assign @soltysh

@kow3ns
Member

kow3ns commented Sep 4, 2019

/assign @mortent

@kow3ns
Member

kow3ns commented Sep 4, 2019

It's not actually clear to me that we should do this.

@kow3ns
Member

kow3ns commented Sep 4, 2019

/assign @kow3ns

pkg/controller/cronjob/utils.go: three outdated review threads (resolved)
}
return nil, nil
}
if next := sched.Next(earliestTime); !next.After(now) {
Contributor


This is wrong, and the previous logic was trying to prevent us from doing this. Imagine the situation described in the comment you removed: at 5pm Friday something is borked and you only notice the problem on Tuesday; in the meantime 80 jobs were held up, and now we'll suddenly start creating all 80 of those jobs, which is wrong. If you want to fix it, I'd prefer we emit an event saying there was a problem and we missed X starts, and pick only the last date. The current logic does not have that protection.

Contributor Author

@praseodym praseodym Sep 12, 2019


This PR will not work in the way you describe because we now set Status.LastScheduleTime to the actual start time instead of the original scheduled time.

In your example, if we consider this an hourly cron job: the Friday 6pm job will get scheduled after fixing the problem, let’s say Tuesday 10:30am, but the next run after that will be Tuesday 11am. Other missed executions will be skipped.

The tests have been changed to reflect this behaviour as well.
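
A tiny stand-in calculation for the hourly example in this reply (the timestamps are illustrative): because `Status.LastScheduleTime` is recorded as the actual catch-up start, the schedule simply resumes at the next tick.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical hourly CronJob: the catch-up job actually starts Tuesday 10:30am.
	actualStart := time.Date(2019, 9, 10, 10, 30, 0, 0, time.UTC)
	// Computing the next run from the actual start time (now the recorded
	// LastScheduleTime) yields Tuesday 11:00am; the other missed executions
	// between Friday 6pm and Tuesday 10am are simply skipped.
	next := actualStart.Truncate(time.Hour).Add(time.Hour)
	fmt.Println("next run:", next) // 2019-09-10 11:00:00 +0000 UTC
}
```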

@bbbmj


Other missed executions will be skipped.

Just skipping and moving on is not good practice; it may result in unpredictable situations.

Making the number of missed starts configurable might be better.


@bbbmj It's still better than the current behaviour, where the cronjob never runs again unless it's deleted and recreated, so making this configurable should be done as a separate PR.

Contributor


@praseodym I see your point, apologies for the initial misunderstanding. I see a lot of benefit in your approach, but at the same time this fix will break backwards compatibility. We need to ensure that the previous behaviour is somehow configurable, maybe through an additional field (in other words, how many missed jobs can be ignored, or whether a cronjob should be paused after a significant failure). Can we start with the latter, which seems the easiest, is backwards compatible, and gives users more control over getting a cronjob back into shape? I've added this topic to the sig-apps agenda for the next meeting (according to my calendar it should be Jan 13th). Let's discuss the best possible approach there and continue from that point.

Contributor Author


I'd be against introducing additional complexity in the codebase; this PR (in my opinion) is mostly cleanup (see e.g. #73169, which it also fixes).

Furthermore, I don't see anyone relying on the current behaviour of CronJobs breaking after 100 missed executions (they're not paused; they break and need to be completely recreated). The discussion in #42649 also doesn't contain any comments defending the current behaviour (it does contain references to workarounds that should make any Kubernetes maintainer cry). Finally, the KEP for graduating CronJob to GA (unfortunately not merged) also lists this issue as a bug, not an enhancement.

I propose we move forward with this PR as-is and include a clear release note.

@praseodym
Contributor Author

Rebased

This removes the arbitrary limit of 100 missed start times and will
instead always schedule a job when an execution was missed and
the `Spec.StartingDeadlineSeconds` period has not yet passed (if set).

To prevent starting multiple jobs if multiple start times were missed,
`Status.LastScheduleTime` is now set to the actual start time of the job
instead of the original scheduled start time.

The "missed starting window" warning message is removed because it was
broken (see kubernetes#73169).
@praseodym force-pushed the fix-cronjob-missed-start-time-handling branch from 2ae22f1 to a9ef15d on August 19, 2020 19:29
paulnivin pushed a commit to lyft/kubernetes that referenced this pull request Aug 24, 2020
This issue was first raised in kubernetes#73169.

In the past, we've seen that when the cronjob controller's syncAll loop
time exceeds startingDeadlineSeconds, deadlines get missed.
However, we currently do not get any notifications or telemetry on when
deadlines are missed.

Allowing the `MissedStart` event emission code path is the first step
to getting telemetry on missed deadlines.

There have been attempts to fix this by removing the "TooLate
(MissedStart)" check wholesale:

1. kubernetes#81557 (open)
2. kubernetes#74058 (closed)

I don't believe this check should be removed since the user should be
notified when their deadline is missed.

Changes:
- test when startingDeadline elapses, should count as an unmet start time

- make tests expect MissedStart event when startingDeadline is missed

- fix TooLate (MissedStart) condition never being hit
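
For context, here is a hedged sketch of what emitting such a MissedStart warning event could look like using client-go's event recorder; the CronJob object, reason string, and message are assumptions drawn from the commit message above, not the actual lyft patch.

```go
package main

import (
	"fmt"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/record"
)

func main() {
	// FakeRecorder stands in for the recorder the controller would be wired with.
	recorder := record.NewFakeRecorder(10)
	cj := &batchv1.CronJob{ObjectMeta: metav1.ObjectMeta{Namespace: "default", Name: "example"}}

	missed := time.Now().Add(-2 * time.Minute)
	// Hypothetical event: surfaces the missed deadline so users (and telemetry
	// scraping events) can see it, instead of the job silently not running.
	recorder.Eventf(cj, corev1.EventTypeWarning, "MissedStart",
		"missed scheduled start time %s; .spec.startingDeadlineSeconds has elapsed",
		missed.Format(time.RFC3339))

	fmt.Println(<-recorder.Events)
}
```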
@2rs2ts
Contributor

2rs2ts commented Sep 9, 2020

@praseodym
Contributor Author

@2rs2ts that's correct; as mentioned in the initial PR message, I will do this as soon as this is merged. If there's another procedure for this, please let me know!

paulnivin pushed a commit to lyft/kubernetes that referenced this pull request Sep 16, 2020
@2rs2ts
Contributor

2rs2ts commented Oct 7, 2020

@praseodym Maybe I'm misreading, but there are two docs I linked and your summary only links one of them. Maybe I should have said "this doc also"; I was just listing the places where it occurs, to be complete.

@Guldanx

Guldanx commented Oct 30, 2020

@praseodym @soltysh Can we go ahead and merge this PR? Or is there something else blocking us?

@praseodym
Contributor Author

@praseodym @soltysh Can we go ahead and merge this PR? Or is there something else blocking us?

This PR still needs to be approved by a reviewer before it can be merged.

@2rs2ts
Contributor

2rs2ts commented Nov 18, 2020

Does #93370 obsolete this PR? I am confused about it.

@praseodym
Contributor Author

Does #93370 obsolete this PR? I am confused about it.

That PR introduces a new v2 controller, but still includes the 100 missed start times limit 'feature' from the v1 controller.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Feb 16, 2021
@unixfox

unixfox commented Feb 16, 2021

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Feb 16, 2021
@soltysh
Contributor

soltysh commented Feb 25, 2021

That PR introduces a new v2 controller, but still includes the 100 missed start times limit 'feature' from the v1 controller.

This is not true at this point.
/close

@k8s-ci-robot
Contributor

@praseodym: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-rebase label on Feb 25, 2021
@k8s-ci-robot
Contributor

@soltysh: Closed this PR.

In response to this:

That PR introduces a new v2 controller, but still includes the 100 missed start times limit 'feature' from the v1 controller.

This is not true at this point.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@unixfox

unixfox commented Feb 25, 2021

Why did this get closed?

@2rs2ts
Contributor

2rs2ts commented Feb 25, 2021

Sounds like they're expecting people to upgrade to the v2 cronjob controller to get the fix? Seems a bit rough

MHBauer pushed a commit to lyft/kubernetes that referenced this pull request Sep 3, 2021
@b0b0haha

Sounds like they're expecting people to upgrade to the v2 cronjob controller to get the fix? Seems a bit rough

Could you please point out how the v2 cronjob controller gets the fix? It still includes the 100 missed start times limit 'feature' from the v1 controller.
https://github.com/alaypatel07/kubernetes/blob/8d7dd4415e28bded77667ce857e1c58016f9ab3a/pkg/controller/cronjob/utils.go#L194-L214

@2rs2ts
Contributor

2rs2ts commented Mar 15, 2022

@b0b0haha I don't know; that was over a year ago. Maybe there had been discussion somewhere else about reimplementing this fix for the v2 controller, but it sounds like that never happened.

I see comments from people thinking that the "feature" had been fixed (i.e. removed) in the new implementation, but maybe that is mistaken: #42649 (comment)
