
Fix CronJob missed start time handling #81557

Conversation

praseodym
Contributor

@praseodym praseodym commented Aug 17, 2019

What type of PR is this?

/kind bug
/sig apps
/area workload-api/cronjob
/priority important-soon

What this PR does / why we need it:

This removes the arbitrary limit of 100 missed start times and will instead always schedule a job when an execution was missed and the Spec.StartingDeadlineSeconds period has not yet passed (if set).

To prevent starting multiple jobs if multiple start times were missed, Status.LastScheduleTime is now set to the actual start time of the job instead of the original scheduled start time.

The "missed starting window" warning message is removed because it was broken (see #73169).

Which issue(s) this PR fixes:

Fixes #42649
Fixes #73169

Special notes for your reviewer:

See #42649 for numerous examples of problems caused by the current arbitrary limit of 100 missed start times, which is easily hit in real-world usage. For example, with a cron job set to execute every minute, 101 minutes of controller downtime (due to maintenance or some bug in the controller) already causes the cron job to permanently fail.
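
For contrast, here is a hedged sketch of the pre-PR behaviour described above (an approximation, not the actual utils.go code): missed start times are enumerated one by one, and the controller gives up once more than 100 have accumulated.

```go
package main

import (
	"fmt"
	"time"
)

// Schedule is a stand-in for the cron schedule interface (see the sketch above).
type Schedule interface {
	Next(after time.Time) time.Time
}

// everyMinute is a toy Schedule that fires at the top of every minute (UTC).
type everyMinute struct{}

func (everyMinute) Next(after time.Time) time.Time {
	return after.Truncate(time.Minute).Add(time.Minute)
}

// missedStartTimes approximates the old bounded enumeration: after more than
// 100 missed ticks it returns an error and the CronJob never runs again.
func missedStartTimes(sched Schedule, earliestTime, now time.Time) ([]time.Time, error) {
	var starts []time.Time
	for t := sched.Next(earliestTime); !t.After(now); t = sched.Next(t) {
		starts = append(starts, t)
		if len(starts) > 100 {
			return nil, fmt.Errorf("too many missed start times (> 100); set or decrease .spec.startingDeadlineSeconds or check clock skew")
		}
	}
	return starts, nil
}

func main() {
	// 101 minutes of simulated controller downtime for an every-minute schedule.
	start := time.Date(2019, 8, 17, 12, 0, 0, 0, time.UTC)
	if _, err := missedStartTimes(everyMinute{}, start, start.Add(101*time.Minute)); err != nil {
		fmt.Println("cronjob is now permanently stuck:", err)
	}
}
```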

Does this PR introduce a user-facing change?:

Fix CronJob missed start time handling: arbitrary limit of 100 missed start times is removed. CronJob `Status.LastScheduleTime` is now set to the actual start time of the job instead of the original scheduled start time.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [Usage]: https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/

The above doc page will need to be updated if this PR is merged; I will create a PR in https://github.com/kubernetes/website once this is approved.

@k8s-ci-robot added the release-note, kind/bug, size/L, sig/apps, area/workload-api/cronjob, cncf-cla: yes, and needs-priority labels on Aug 17, 2019
@k8s-ci-robot
Contributor

Hi @praseodym. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-ok-to-test label on Aug 17, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: praseodym
To complete the pull request process, please assign janetkuo
You can assign the PR to them by writing /assign @janetkuo in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@praseodym
Contributor Author

/assign @janetkuo

@k8s-ci-robot added the priority/important-soon label and removed the needs-priority label on Aug 18, 2019
@neolit123
Member

/ok-to-test

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label on Aug 21, 2019
@praseodym
Contributor Author

/retest

@praseodym force-pushed the fix-cronjob-missed-start-time-handling branch from 0b82197 to 047c1d0 on August 22, 2019 20:17
@kow3ns
Member

kow3ns commented Sep 4, 2019

/assign @soltysh

@kow3ns
Member

kow3ns commented Sep 4, 2019

/assign @mortent

@kow3ns
Member

kow3ns commented Sep 4, 2019

It's not actually clear to me that we should do this.

@kow3ns
Member

kow3ns commented Sep 4, 2019

/assign @kow3ns

pkg/controller/cronjob/utils.go: three outdated review threads (resolved)
}
return nil, nil
}
if next := sched.Next(earliestTime); !next.After(now) {
Contributor


This is wrong, and the previous logic was trying to prevent us from doing this. Imagine the situation described in the comment you removed: at 5pm Friday something is borked and you only notice the problem on Tuesday; in the meantime 80 jobs were held up, and now we'll suddenly start creating all 80 of those jobs, which is wrong. If you want to fix it, I'd prefer we emit an event saying there was a problem and we missed X starts, and pick only the last date. The current logic does not have that protection.

Contributor Author

@praseodym praseodym Sep 12, 2019


This PR will not work in the way you describe because we now set Status.LastScheduleTime to the actual start time instead of the original scheduled time.

In your example, if we consider this an hourly cron job: the Friday 6pm job will get scheduled after fixing the problem, let’s say Tuesday 10:30am, but the next run after that will be Tuesday 11am. Other missed executions will be skipped.

The tests have been changed to reflect this behaviour as well.
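
A tiny stand-in calculation for the hourly example in this reply (the timestamps are illustrative): because `Status.LastScheduleTime` is recorded as the actual catch-up start, the schedule simply resumes at the next tick.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical hourly CronJob: the catch-up job actually starts Tuesday 10:30am.
	actualStart := time.Date(2019, 9, 10, 10, 30, 0, 0, time.UTC)
	// Computing the next run from the actual start time (now the recorded
	// LastScheduleTime) yields Tuesday 11:00am; the other missed executions
	// between Friday 6pm and Tuesday 10am are simply skipped.
	next := actualStart.Truncate(time.Hour).Add(time.Hour)
	fmt.Println("next run:", next) // 2019-09-10 11:00:00 +0000 UTC
}
```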

@bbbmj


Other missed executions will be skipped.

Just skipping and moving on is not good practice; it may result in unpredictable situations.

Making the number of missed starts configurable might be better.


@bbbmj It's still better than the current behaviour, where the cronjob never runs again unless it's deleted and recreated, so making this configurable should be done as a separate PR.

Contributor


@praseodym I see your point, apologies for the initial misunderstanding. I see a lot of benefit in your approach, but at the same time this fix will break backwards compatibility. We need to ensure that the previous behaviour is somehow configurable, maybe through an additional field (in other words, how many missed jobs can be ignored, or whether a cronjob should be paused after a significant failure). Can we start with the latter, which seems the easiest, is backwards compatible, and gives users more control over getting a cronjob back into shape? I've added this topic to the sig-apps agenda for the next meeting (according to my calendar it should be Jan 13th). Let's discuss the best possible approach there and continue from that point.

Contributor Author


I'd be against introducing additional complexity in the codebase; this PR (in my opinion) is mostly cleanup (see e.g. #73169, which it also fixes).

Furthermore, I don't see anyone relying on the current behaviour of CronJobs breaking after 100 missed executions (they're not paused; they break and need to be completely recreated). The discussion in #42649 also doesn't contain any comments defending the current behaviour (it does contain references to workarounds that should make any Kubernetes maintainer cry). Finally, the KEP for graduating CronJob to GA (unfortunately not merged) also lists this issue as a bug, not an enhancement.

I propose we move forward with this PR as-is and include a clear release note.

@praseodym
Contributor Author

Rebased

This removes the arbitrary limit of 100 missed start times and will
instead always schedule a job when an execution was missed and
the `Spec.StartingDeadlineSeconds` period has not yet passed (if set).

To prevent starting multiple jobs if multiple start times were missed,
`Status.LastScheduleTime` is now set to the actual start time of the job
instead of the original scheduled start time.

The "missed starting window" warning message is removed because it was
broken (see kubernetes#73169).
@praseodym force-pushed the fix-cronjob-missed-start-time-handling branch from 2ae22f1 to a9ef15d on August 19, 2020 19:29
paulnivin pushed a commit to lyft/kubernetes that referenced this pull request Aug 24, 2020
This issue was first raised in kubernetes#73169.

In the past, we've seen that when the cronjob controller's syncAll loop
time exceeds startingDeadlineSeconds, deadlines get missed.
However, we currently do not get any notifications or telemetry on when
deadlines are missed.

Allowing the `MissedStart` event emission code path is the first step
to getting telemetry on missed deadlines.

There have been attempts to fix this by removing the "TooLate
(MissedStart)" check wholesale:

1. kubernetes#81557 (open)
2. kubernetes#74058 (closed)

I don't believe this check should be removed since the user should be
notified when their deadline is missed.

Changes:
- test when startingDeadline elapses, should count as an unmet start time

- make tests expect MissedStart event when startingDeadline is missed

- fix TooLate (MissedStart) condition never being hit
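
For context, here is a hedged sketch of what emitting such a MissedStart warning event could look like using client-go's event recorder; the CronJob object, reason string, and message are assumptions drawn from the commit message above, not the actual lyft patch.

```go
package main

import (
	"fmt"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/record"
)

func main() {
	// FakeRecorder stands in for the recorder the controller would be wired with.
	recorder := record.NewFakeRecorder(10)
	cj := &batchv1.CronJob{ObjectMeta: metav1.ObjectMeta{Namespace: "default", Name: "example"}}

	missed := time.Now().Add(-2 * time.Minute)
	// Hypothetical event: surfaces the missed deadline so users (and telemetry
	// scraping events) can see it, instead of the job silently not running.
	recorder.Eventf(cj, corev1.EventTypeWarning, "MissedStart",
		"missed scheduled start time %s; .spec.startingDeadlineSeconds has elapsed",
		missed.Format(time.RFC3339))

	fmt.Println(<-recorder.Events)
}
```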
@2rs2ts
Contributor

2rs2ts commented Sep 9, 2020

@praseodym
Contributor Author

@2rs2ts that's correct; as mentioned in the initial PR message, I will do this as soon as this is merged. If there's another procedure for this, please let me know!

paulnivin pushed a commit to lyft/kubernetes that referenced this pull request Sep 16, 2020
@2rs2ts
Contributor

2rs2ts commented Oct 7, 2020

@praseodym Maybe I'm misreading, but there are two docs I linked and your summary only links one of them. Maybe I should have said "this doc also"; I was just listing the places where it occurs, to be complete.

@Guldanx

Guldanx commented Oct 30, 2020

@praseodym @soltysh Can we go ahead and merge this PR? Or is there something else blocking us?

@praseodym
Contributor Author

@praseodym @soltysh Can we go ahead and merge this PR? Or is there something else blocking us?

This PR still needs to be approved by a reviewer before it can be merged.

@2rs2ts
Contributor

2rs2ts commented Nov 18, 2020

Does #93370 obsolete this PR? I am confused about it.

@praseodym
Contributor Author

Does #93370 obsolete this PR? I am confused about it.

That PR introduces a new v2 controller, but still includes the 100 missed start times limit 'feature' from the v1 controller.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Feb 16, 2021
@unixfox

unixfox commented Feb 16, 2021

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Feb 16, 2021
@soltysh
Contributor

soltysh commented Feb 25, 2021

That PR introduces a new v2 controller, but still includes the 100 missed start times limit 'feature' from the v1 controller.

This is not true at this point.
/close

@k8s-ci-robot
Contributor

@praseodym: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-rebase label on Feb 25, 2021
@k8s-ci-robot
Contributor

@soltysh: Closed this PR.

In response to this:

That PR introduces a new v2 controller, but still includes the 100 missed start times limit 'feature' from the v1 controller.

This is not true at this point.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@unixfox

unixfox commented Feb 25, 2021

Why did this get closed?

@2rs2ts
Contributor

2rs2ts commented Feb 25, 2021

Sounds like they're expecting people to upgrade to the v2 cronjob controller to get the fix? Seems a bit rough

MHBauer pushed a commit to lyft/kubernetes that referenced this pull request Sep 3, 2021
@b0b0haha

Sounds like they're expecting people to upgrade to the v2 cronjob controller to get the fix? Seems a bit rough

Could you please point out how the v2 cronjob controller gets the fix? It still includes the 100 missed start times limit 'feature' from the v1 controller.
https://github.com/alaypatel07/kubernetes/blob/8d7dd4415e28bded77667ce857e1c58016f9ab3a/pkg/controller/cronjob/utils.go#L194-L214

@2rs2ts
Contributor

2rs2ts commented Mar 15, 2022

@b0b0haha I don't know; that was over a year ago. Maybe there had been discussion somewhere else about reimplementing this fix for the v2 controller, but it sounds like that never happened.

I see comments from people thinking that the "feature" had been fixed (i.e. removed) in the new implementation, but maybe that is mistaken: #42649 (comment)
