
Do not list CronJob unmet starting times beyond deadline #39608

Merged

Conversation

@peay (Contributor) commented Jan 9, 2017

**What this PR does / why we need it**:

See #36311. `getRecentUnmetScheduleTimes` gives up after 100 unmet times to avoid wasting too much CPU or memory generating all the times, as it generates them sequentially.

When concurrency is forbidden, this is conceptually unnecessary: we only need the last unmet start time. This suggests that when concurrency is forbidden, we could generate times by going backward in time from now. This is not very practical, as CronJob currently relies on a package that only provides `Next` and no `Prev`, and hand-rolling a `Prev` does not seem like a good idea. I could submit a PR to the cron library to add a `Prev` method, and use that when concurrency is forbidden through something like `getLastUnmetScheduleTime`. This would be `O(1)` and there would be no limit involved.
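For illustration, a minimal sketch of that backward lookup. The `prevSchedule` interface and the `Prev` signature are hypothetical, since robfig/cron only provides `Next` today:

```go
package sketch

import (
	"time"

	"github.com/robfig/cron"
)

// prevSchedule is hypothetical: robfig/cron only exposes Next today,
// so this interface imagines the proposed Prev addition.
type prevSchedule interface {
	cron.Schedule
	// Prev returns the latest activation time strictly before t.
	Prev(t time.Time) time.Time
}

// getLastUnmetScheduleTime sketches the O(1) variant: one backward step
// from now instead of walking forward from the last start time.
func getLastUnmetScheduleTime(sched prevSchedule, earliestTime, now time.Time) *time.Time {
	t := sched.Prev(now)
	if !t.After(earliestTime) {
		return nil // nothing unmet since the last start
	}
	return &t
}
```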

(Edit: actually, even for the other concurrency settings, we only start the last unmet start time -- there is a `TODO` in the controller to actually start all of them, but that is not implemented at the moment. This means the solution would apply, at least temporarily, to all concurrency settings.)

cc @soltysh what do you think?

In the meantime, I would suggest doing something simple. Currently, the user has no way to configure anything to ensure that their CronJob will not get stuck if one job spans more than 100 unmet start times.

`getRecentUnmetScheduleTimes` starts from an initial time corresponding to the last start (or to the creation of the CronJob, if nothing has started yet). However, when `StartingDeadlineSeconds` is set, the controller will not start anything older than the deadline, so if the last start is way beyond the deadline, we are potentially generating lots of unmet start times that the controller will not consider for starting anyway.

Consider a job running every minute, where the last instance has taken 120 minutes. This means there are more than 100 unmet times (one per minute of the overrun) when we start counting from the last start time.

**The PR makes `getRecentUnmetScheduleTimes` only consider times that do not fall beyond the deadline.** Here, the CronJob can be configured with a `StartingDeadlineSeconds` of, say, 10 minutes. After the 120-minute job has run, `getRecentUnmetScheduleTimes` will only consider the times in the last 10 minutes from now, and will not get stuck.
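A minimal sketch of the clamping idea, with simplified names (the actual change lives in `getRecentUnmetScheduleTimes`; see the review snippet further down):

```go
package sketch

import "time"

// clampEarliestTime moves the enumeration start forward to
// now - StartingDeadlineSeconds, so that unmet times the controller
// would refuse to start anyway are never generated. Simplified names,
// not the controller's exact code.
func clampEarliestTime(earliestTime, now time.Time, startingDeadlineSeconds *int64) time.Time {
	if startingDeadlineSeconds != nil {
		schedulingDeadline := now.Add(-time.Duration(*startingDeadlineSeconds) * time.Second)
		if schedulingDeadline.After(earliestTime) {
			earliestTime = schedulingDeadline
		}
	}
	return earliestTime
}
```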

As a side note on the maximum number of unmet times to use as a limit, in terms of CPU used by the controller: I have run a quick benchmark on my i7 Mac. Schedules corresponding to "once a week" tend to be more expensive to generate unmet times for. Just FYI.

```
+--------------+---------------+--------------+
|   SCHEDULE   | MISSED STARTS |    TIMING    |
+--------------+---------------+--------------+
| */1 * * * ?  |           100 | 383.645µs    |
| */30 * * * ? |           100 | 354.765µs    |
| 30 1 * * ?   |           100 | 1.065124ms   |
| 30 1 * * 0   |           100 | 1.80034ms    |
| */1 * * * ?  |           500 | 1.341365ms   |
| */30 * * * ? |           500 | 1.814441ms   |
| 30 1 * * ?   |           500 | 8.475012ms   |
| 30 1 * * 0   |           500 | 10.020613ms  |
| */1 * * * ?  |          1000 | 2.551697ms   |
| */30 * * * ? |          1000 | 4.075813ms   |
| 30 1 * * ?   |          1000 | 17.674945ms  |
| 30 1 * * 0   |          1000 | 19.149324ms  |
| */1 * * * ?  |         10000 | 25.725531ms  |
| */30 * * * ? |         10000 | 87.520022ms  |
| 30 1 * * ?   |         10000 | 174.29216ms  |
| 30 1 * * 0   |         10000 | 196.565748ms |
+--------------+---------------+--------------+
```

using

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"

	"github.com/olekukonko/tablewriter"
	"github.com/robfig/cron"
)

// timeSchedule measures how long it takes to generate `iterations`
// successive activation times for the given cron schedule.
func timeSchedule(schedule string, iterations int) time.Duration {
	sched, err := cron.ParseStandard(schedule)
	if err != nil {
		panic(fmt.Sprintf("unparseable schedule: %s", err))
	}

	start := time.Now()
	t := start
	for i := 0; i < iterations; i++ {
		t = sched.Next(t)
	}
	return time.Since(start)
}

func main() {
	table := tablewriter.NewWriter(os.Stdout)
	table.SetHeader([]string{"Schedule", "Missed starts", "Timing"})

	schedules := []string{"*/1 * * * ?", "*/30 * * * ?", "30 1 * * ?", "30 1 * * 0"}
	iterationNums := []int{100, 500, 1000, 10000}

	for _, iterations := range iterationNums {
		for _, schedule := range schedules {
			table.Append([]string{
				schedule,
				strconv.Itoa(iterations),
				timeSchedule(schedule, iterations).String(),
			})
		}
	}
	table.Render()
}
```
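(For reference: reproducing the table should only require fetching the two dependencies above, e.g. with `go get`, and running the file with `go run`.)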

**Which issue this PR fixes**: fixes #36311

**Special notes for your reviewer**:

**Release note**:

```release-note
```

@k8s-ci-robot (Contributor)

Hi @peay. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with @k8s-bot ok to test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot added the `cncf-cla: yes` label (indicates the PR's author has signed the CNCF CLA) Jan 9, 2017
@k8s-reviewable: This change is Reviewable

@k8s-github-robot added the `size/S` and `release-note-label-needed` labels Jan 9, 2017
@soltysh assigned soltysh and unassigned bprashanth Jan 12, 2017
@soltysh added the `release-note-none` label and removed `release-note-label-needed` Jan 12, 2017
@soltysh (Contributor) commented Jan 12, 2017

@peay sorry for the long wait. I like your idea, especially that we explicitly say that anything above `startingDeadlineSeconds` is considered failed.

```go
	if schedulingDeadline.After(earliestTime) {
		earliestTime = schedulingDeadline
	}
}
```
Contributor:

Can you prepare test cases covering this new flow?

Contributor Author:

Yes, will do

Contributor Author:

I have updated the PR with tests.

```diff
@@ -163,7 +171,7 @@ func getRecentUnmetScheduleTimes(sj batch.CronJob, now time.Time) ([]time.Time,
 		// but less than "lots".
 		if len(starts) > 100 {
 			// We can't get the most recent times so just return an empty slice
-			return []time.Time{}, fmt.Errorf("Too many missed start times to list")
+			return []time.Time{}, fmt.Errorf("Too many missed start times (decrease StartingDeadlineSeconds or check clock)")
```
Contributor:

If you're touching this I'd suggest something like:

> Too many missed start times (> 100). Set or decrease `.spec.startingDeadlineSeconds` or check clock skew.

@soltysh added the `sig/apps` and `area/batch` labels Jan 12, 2017
@soltysh (Contributor) commented Jan 12, 2017

@kubernetes/sig-apps-misc fyi

@peay force-pushed the cronjob-too-many-times-to-list branch from b0cefa8 to d141a43 on January 16, 2017
@k8s-github-robot added the `size/M` label and removed `size/S` Jan 16, 2017
@soltysh changed the title from "WIP: Do not list CronJob unmet starting times beyond deadline" to "Do not list CronJob unmet starting times beyond deadline" Jan 16, 2017
@soltysh (Contributor) commented Jan 16, 2017

This LGTM. I've updated the title to drop the WIP prefix.

@k8s-bot ok to test

@soltysh added the `lgtm` label ("Looks good to me", indicates that a PR is ready to be merged) Jan 16, 2017
@soltysh (Contributor) commented Jan 16, 2017

@k8s-bot kubemark e2e test this

@k8s-ci-robot (Contributor)

Jenkins Kubemark GCE e2e failed for commit d141a43. Full PR test history. cc @peay

The magic incantation to run this job again is @k8s-bot kubemark e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@soltysh (Contributor) commented Jan 17, 2017

@k8s-bot kubemark e2e test this

@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot merged commit 9d2fce7 into kubernetes:master Jan 17, 2017
k8s-github-robot pushed a commit that referenced this pull request Jan 23, 2017
Automatic merge from submit-queue (batch tested with PRs 40196, 40143, 40277)

Emit warning event when CronJob cannot determine starting time

**What this PR does / why we need it**:
In #39608, we've modified the error message for when a CronJob has too many unmet starting times to enumerate in order to figure out the next starting time. This makes it more "actionable", and the user can now set a deadline to avoid running into this. However, the error message is still only at the controller level AFAIK, and thus not exposed to the user. From their perspective, there is no way to tell why the CronJob is not scheduling the next instance.

The PR adds a warning event in addition to the error in the controller manager's log. 
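For context, a sketch of what emitting such an event can look like with the Kubernetes event recorder; the reason string, helper name, and (modern) import paths are illustrative, not necessarily what the PR uses:

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// emitTooManyMissedStarts surfaces the controller-level error to the
// user as a Warning event attached to the CronJob object, so it shows
// up in `kubectl describe`. Names and wiring here are illustrative.
func emitTooManyMissedStarts(recorder record.EventRecorder, cronJob runtime.Object, err error) {
	recorder.Eventf(cronJob, v1.EventTypeWarning, "FailedNeedsStart",
		"Cannot determine if job needs to be started: %v", err)
}
```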

**Which issue this PR fixes**: This is an addition to PR #39608  regarding #36311.

**Special notes for your reviewer**: cc @soltysh 

**Release note**:

```release-note
```
hiddeco pushed a commit to hiddeco/kubernetes that referenced this pull request Apr 6, 2020
hiddeco pushed a commit to hiddeco/kubernetes that referenced this pull request Apr 6, 2020
Labels
`area/batch` `area/workload-api/cronjob` `cncf-cla: yes` `lgtm` `release-note-none` `sig/apps` `size/M`
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Controller manager cannot determine if scheduledjob needs to be started
6 participants