-
Notifications
You must be signed in to change notification settings - Fork 38.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not list CronJob unmet starting times beyond deadline #39608
Do not list CronJob unmet starting times beyond deadline #39608
Conversation
Hi @peay. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
if schedulingDeadline.After(earliestTime) { | ||
earliestTime = schedulingDeadline | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you prepare test cases covering this new flow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, will do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated the PR with tests.
@@ -163,7 +171,7 @@ func getRecentUnmetScheduleTimes(sj batch.CronJob, now time.Time) ([]time.Time, | |||
// but less than "lots". | |||
if len(starts) > 100 { | |||
// We can't get the most recent times so just return an empty slice | |||
return []time.Time{}, fmt.Errorf("Too many missed start times to list") | |||
return []time.Time{}, fmt.Errorf("Too many missed start times (decrease StartingDeadlineSeconds or check clock)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're touching this I'd suggest something like:
Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew.
@kubernetes/sig-apps-misc fyi |
b0cefa8
to
d141a43
Compare
This LGTM, I've updated the title, to drop the WIP prefix. @k8s-bot ok to test |
@k8s-bot kubemark e2e test this |
Jenkins Kubemark GCE e2e failed for commit d141a43. Full PR test history. cc @peay The magic incantation to run this job again is Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
@k8s-bot kubemark e2e test this |
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge] |
Automatic merge from submit-queue |
Automatic merge from submit-queue (batch tested with PRs 40196, 40143, 40277) Emit warning event when CronJob cannot determine starting time **What this PR does / why we need it**: In #39608, we've modified the error message for when a CronJob has too many unmet starting times to enumerate to figure out the next starting time. This makes it more "actionable", and the user can now set a deadline to avoid running into this. However, the error message is still only controller level AFAIK and thus not exposed to the user. From his perspective, there is no way to tell why the CronJob is not scheduling the next instance. The PR adds a warning event in addition to the error in the controller manager's log. **Which issue this PR fixes**: This is an addition to PR #39608 regarding #36311. **Special notes for your reviewer**: cc @soltysh **Release note**: ```release-note ```
…to-list Automatic merge from submit-queue Do not list CronJob unmet starting times beyond deadline **What this PR does / why we need it**: See kubernetes#36311. `getRecentUnmetScheduleTimes` gives up after 100 unmet times to avoid wasting too much CPU or memory generating all the times, as it generates them sequentially. When concurrency is forbidden, this is conceptually un-necessary: we only need the last unmet start time. This suggests that when concurrency is forbidden, we could generate times by going backward in time from now. This is not very practical as CronJob currently relies on a package that only provides `Next` and no `Prev`. Hand-cooking a `Prev` does not seem like a good idea. I could submit a PR to the cron library to add a `Prev` method, and use that when concurrency is forbidden through something like `getLastUnmetScheduleTime`. This would be `O(1)` and there would be no limit involved. (edit: actually, even for the other concurrency settings, we only start the last unmet start times -- there is a `TODO` in the controller to actually start all of them, but that is not implemented at the moment. This means the solution would apply, at least temporarily, to all concurrency settings). cc @soltysh what do you think? In the meantime, I would suggest to do something simple. Currently, the user has no way to configure anything to ensure that his CronJob will not get stuck if one job takes more that 100 unmet times. `getRecentUnmetScheduleTimes` starts with an initial time corresponding to the last start (or to the creation of the CronJob, if nothing has started yet). However, when `StartingDeadlineSeconds` is set, the controller will not start anything that is older than the deadline, so if the last start is way beyond the deadline, we are generating potentially lots of unmet start times that will not be considered by the scheduler for scheduling anyway. Consider a job running every minute, where the last instance has taken 120 minutes. This means there are more than 100 unmet times when we start counting from the last start time. **The PR makes `getRecentUnmetScheduleTimes` only consider times that do not fall beyond the deadline.** Here, the CronJob can be configured with a `StartingDeadlineSeconds` of, say, 10 minutes. After the 120min job has run, `getRecentUnmetScheduleTimes` will only consider the times in the last 10 minutes from now, and will not get stuck. As a side note on the max. number of unmet times to use as limits in terms of CPU used by the controller: I have run a quick benchmark on my i7 mac. Schedules corresponding to "once a week" tend to be more expensive to generate unmet times for. Just FYI. ``` +--------------+---------------+--------------+ | SCHEDULE | MISSED STARTS | TIMING | +--------------+---------------+--------------+ | */1 * * * ? | 100 | 383.645µs | | */30 * * * ? | 100 | 354.765µs | | 30 1 * * ? | 100 | 1.065124ms | | 30 1 * * 0 | 100 | 1.80034ms | | */1 * * * ? | 500 | 1.341365ms | | */30 * * * ? | 500 | 1.814441ms | | 30 1 * * ? | 500 | 8.475012ms | | 30 1 * * 0 | 500 | 10.020613ms | | */1 * * * ? | 1000 | 2.551697ms | | */30 * * * ? | 1000 | 4.075813ms | | 30 1 * * ? | 1000 | 17.674945ms | | 30 1 * * 0 | 1000 | 19.149324ms | | */1 * * * ? | 10000 | 25.725531ms | | */30 * * * ? | 10000 | 87.520022ms | | 30 1 * * ? | 10000 | 174.29216ms | | 30 1 * * 0 | 10000 | 196.565748ms | +--------------+---------------+--------------+ ``` using ```.go package main import ( "fmt" "time" "os" "strconv" "github.com/robfig/cron" "github.com/olekukonko/tablewriter" ) func timeSchedule(schedule string, iterations int) (time.Duration) { sched, err := cron.ParseStandard(schedule) if err != nil { panic(fmt.Sprintf("Unparseable schedule: %s", err)) } start := time.Now() t := time.Now() for i := 1; i <= iterations; i++ { t = sched.Next(t) } return time.Since(start) } func main() { table := tablewriter.NewWriter(os.Stdout) table.SetHeader([]string{"Schedule", "Missed starts", "Timing"}) schedules := []string{"*/1 * * * ?", "*/30 * * * ?", "30 1 * * ?", "30 1 * * 0"} iteration_nums := []int{100, 500, 1000, 10000} for _, iterations := range iteration_nums { for _, schedule := range schedules { table.Append([]string{schedule, strconv.Itoa(iterations), timeSchedule(schedule, iterations).String()}) } } table.Render() } ``` **Which issue this PR fixes**: fixes kubernetes#36311 **Special notes for your reviewer**: **Release note**: ```release-note ```
…ot-start Automatic merge from submit-queue (batch tested with PRs 40196, 40143, 40277) Emit warning event when CronJob cannot determine starting time **What this PR does / why we need it**: In kubernetes#39608, we've modified the error message for when a CronJob has too many unmet starting times to enumerate to figure out the next starting time. This makes it more "actionable", and the user can now set a deadline to avoid running into this. However, the error message is still only controller level AFAIK and thus not exposed to the user. From his perspective, there is no way to tell why the CronJob is not scheduling the next instance. The PR adds a warning event in addition to the error in the controller manager's log. **Which issue this PR fixes**: This is an addition to PR kubernetes#39608 regarding kubernetes#36311. **Special notes for your reviewer**: cc @soltysh **Release note**: ```release-note ```
What this PR does / why we need it:
See #36311.
getRecentUnmetScheduleTimes
gives up after 100 unmet times to avoid wasting too much CPU or memory generating all the times, as it generates them sequentially.When concurrency is forbidden, this is conceptually un-necessary: we only need the last unmet start time. This suggests that when concurrency is forbidden, we could generate times by going backward in time from now. This is not very practical as CronJob currently relies on a package that only provides
Next
and noPrev
. Hand-cooking aPrev
does not seem like a good idea. I could submit a PR to the cron library to add aPrev
method, and use that when concurrency is forbidden through something likegetLastUnmetScheduleTime
. This would beO(1)
and there would be no limit involved.(edit: actually, even for the other concurrency settings, we only start the last unmet start times -- there is a
TODO
in the controller to actually start all of them, but that is not implemented at the moment. This means the solution would apply, at least temporarily, to all concurrency settings).cc @soltysh what do you think?
In the meantime, I would suggest to do something simple. Currently, the user has no way to configure anything to ensure that his CronJob will not get stuck if one job takes more that 100 unmet times.
getRecentUnmetScheduleTimes
starts with an initial time corresponding to the last start (or to the creation of the CronJob, if nothing has started yet). However, whenStartingDeadlineSeconds
is set, the controller will not start anything that is older than the deadline, so if the last start is way beyond the deadline, we are generating potentially lots of unmet start times that will not be considered by the scheduler for scheduling anyway.Consider a job running every minute, where the last instance has taken 120 minutes. This means there are more than 100 unmet times when we start counting from the last start time.
The PR makes
getRecentUnmetScheduleTimes
only consider times that do not fall beyond the deadline. Here, the CronJob can be configured with aStartingDeadlineSeconds
of, say, 10 minutes. After the 120min job has run,getRecentUnmetScheduleTimes
will only consider the times in the last 10 minutes from now, and will not get stuck.As a side note on the max. number of unmet times to use as limits in terms of CPU used by the controller: I have run a quick benchmark on my i7 mac. Schedules corresponding to "once a week" tend to be more expensive to generate unmet times for. Just FYI.
using
Which issue this PR fixes: fixes #36311
Special notes for your reviewer:
Release note: