
Do not list CronJob unmet starting times beyond deadline #39608

Merged

Conversation

@peay (Contributor) commented Jan 9, 2017

**What this PR does / why we need it**:

See #36311. `getRecentUnmetScheduleTimes` gives up after 100 unmet times to avoid wasting too much CPU or memory generating all the times, as it generates them sequentially.

When concurrency is forbidden, this is conceptually unnecessary: we only need the last unmet start time. This suggests that when concurrency is forbidden, we could generate times by going backward in time from now. This is not very practical, as CronJob currently relies on a package that only provides `Next` and no `Prev`, and hand-rolling a `Prev` does not seem like a good idea. I could submit a PR to the cron library to add a `Prev` method, and use that when concurrency is forbidden through something like `getLastUnmetScheduleTime`. This would be `O(1)` and there would be no limit involved.
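For illustration, a minimal sketch of that backward lookup. The `prevSchedule` interface and the `Prev` signature are hypothetical, since robfig/cron only provides `Next` today:

```go
package sketch

import (
	"time"

	"github.com/robfig/cron"
)

// prevSchedule is hypothetical: robfig/cron only exposes Next today,
// so this interface imagines the proposed Prev addition.
type prevSchedule interface {
	cron.Schedule
	// Prev returns the latest activation time strictly before t.
	Prev(t time.Time) time.Time
}

// getLastUnmetScheduleTime sketches the O(1) variant: one backward step
// from now instead of walking forward from the last start time.
func getLastUnmetScheduleTime(sched prevSchedule, earliestTime, now time.Time) *time.Time {
	t := sched.Prev(now)
	if !t.After(earliestTime) {
		return nil // nothing unmet since the last start
	}
	return &t
}
```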

(Edit: actually, even for the other concurrency settings, we only start the last unmet start time -- there is a `TODO` in the controller to actually start all of them, but that is not implemented at the moment. This means the solution would apply, at least temporarily, to all concurrency settings.)

cc @soltysh what do you think?

In the meantime, I would suggest doing something simple. Currently, the user has no way to configure anything to ensure that their CronJob will not get stuck if one job spans more than 100 unmet start times.

`getRecentUnmetScheduleTimes` starts from an initial time corresponding to the last start (or to the creation of the CronJob, if nothing has started yet). However, when `StartingDeadlineSeconds` is set, the controller will not start anything older than the deadline, so if the last start is way beyond the deadline, we are potentially generating lots of unmet start times that the controller will not consider for starting anyway.

Consider a job running every minute, where the last instance has taken 120 minutes. This means there are more than 100 unmet times (one per minute of the overrun) when we start counting from the last start time.

**The PR makes `getRecentUnmetScheduleTimes` only consider times that do not fall beyond the deadline.** Here, the CronJob can be configured with a `StartingDeadlineSeconds` of, say, 10 minutes. After the 120-minute job has run, `getRecentUnmetScheduleTimes` will only consider the times in the last 10 minutes from now, and will not get stuck.
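A minimal sketch of the clamping idea, with simplified names (the actual change lives in `getRecentUnmetScheduleTimes`; see the review snippet further down):

```go
package sketch

import "time"

// clampEarliestTime moves the enumeration start forward to
// now - StartingDeadlineSeconds, so that unmet times the controller
// would refuse to start anyway are never generated. Simplified names,
// not the controller's exact code.
func clampEarliestTime(earliestTime, now time.Time, startingDeadlineSeconds *int64) time.Time {
	if startingDeadlineSeconds != nil {
		schedulingDeadline := now.Add(-time.Duration(*startingDeadlineSeconds) * time.Second)
		if schedulingDeadline.After(earliestTime) {
			earliestTime = schedulingDeadline
		}
	}
	return earliestTime
}
```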

As a side note on the maximum number of unmet times to use as a limit, in terms of CPU used by the controller: I have run a quick benchmark on my i7 Mac. Schedules corresponding to "once a week" tend to be more expensive to generate unmet times for. Just FYI.

```
+--------------+---------------+--------------+
|   SCHEDULE   | MISSED STARTS |    TIMING    |
+--------------+---------------+--------------+
| */1 * * * ?  |           100 | 383.645µs    |
| */30 * * * ? |           100 | 354.765µs    |
| 30 1 * * ?   |           100 | 1.065124ms   |
| 30 1 * * 0   |           100 | 1.80034ms    |
| */1 * * * ?  |           500 | 1.341365ms   |
| */30 * * * ? |           500 | 1.814441ms   |
| 30 1 * * ?   |           500 | 8.475012ms   |
| 30 1 * * 0   |           500 | 10.020613ms  |
| */1 * * * ?  |          1000 | 2.551697ms   |
| */30 * * * ? |          1000 | 4.075813ms   |
| 30 1 * * ?   |          1000 | 17.674945ms  |
| 30 1 * * 0   |          1000 | 19.149324ms  |
| */1 * * * ?  |         10000 | 25.725531ms  |
| */30 * * * ? |         10000 | 87.520022ms  |
| 30 1 * * ?   |         10000 | 174.29216ms  |
| 30 1 * * 0   |         10000 | 196.565748ms |
+--------------+---------------+--------------+
```

using

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"

	"github.com/olekukonko/tablewriter"
	"github.com/robfig/cron"
)

// timeSchedule measures how long it takes to generate `iterations`
// successive activation times for the given cron schedule.
func timeSchedule(schedule string, iterations int) time.Duration {
	sched, err := cron.ParseStandard(schedule)
	if err != nil {
		panic(fmt.Sprintf("unparseable schedule: %s", err))
	}

	start := time.Now()
	t := start
	for i := 0; i < iterations; i++ {
		t = sched.Next(t)
	}
	return time.Since(start)
}

func main() {
	table := tablewriter.NewWriter(os.Stdout)
	table.SetHeader([]string{"Schedule", "Missed starts", "Timing"})

	schedules := []string{"*/1 * * * ?", "*/30 * * * ?", "30 1 * * ?", "30 1 * * 0"}
	iterationNums := []int{100, 500, 1000, 10000}

	for _, iterations := range iterationNums {
		for _, schedule := range schedules {
			table.Append([]string{
				schedule,
				strconv.Itoa(iterations),
				timeSchedule(schedule, iterations).String(),
			})
		}
	}
	table.Render()
}
```
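(For reference: reproducing the table should only require fetching the two dependencies above, e.g. with `go get`, and running the file with `go run`.)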

**Which issue this PR fixes**: fixes #36311

**Special notes for your reviewer**:

**Release note**:

```release-note
```

@k8s-ci-robot (Contributor)

Hi @peay. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with @k8s-bot ok to test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot added the `cncf-cla: yes` label (indicates the PR's author has signed the CNCF CLA) Jan 9, 2017
@k8s-reviewable: This change is Reviewable

@k8s-github-robot added the `size/S` and `release-note-label-needed` labels Jan 9, 2017
@soltysh assigned soltysh and unassigned bprashanth Jan 12, 2017
@soltysh added the `release-note-none` label and removed `release-note-label-needed` Jan 12, 2017
@soltysh (Contributor) commented Jan 12, 2017

@peay sorry for the long wait. I like your idea, especially that we explicitly say that anything above `startingDeadlineSeconds` is considered failed.

```go
	if schedulingDeadline.After(earliestTime) {
		earliestTime = schedulingDeadline
	}
}
```
Contributor:

Can you prepare test cases covering this new flow?

Contributor Author:

Yes, will do

Contributor Author:

I have updated the PR with tests.

```diff
@@ -163,7 +171,7 @@ func getRecentUnmetScheduleTimes(sj batch.CronJob, now time.Time) ([]time.Time,
 		// but less than "lots".
 		if len(starts) > 100 {
 			// We can't get the most recent times so just return an empty slice
-			return []time.Time{}, fmt.Errorf("Too many missed start times to list")
+			return []time.Time{}, fmt.Errorf("Too many missed start times (decrease StartingDeadlineSeconds or check clock)")
```
Contributor:

If you're touching this I'd suggest something like:

> Too many missed start times (> 100). Set or decrease `.spec.startingDeadlineSeconds` or check clock skew.

@soltysh added the `sig/apps` and `area/batch` labels Jan 12, 2017
@soltysh (Contributor) commented Jan 12, 2017

@kubernetes/sig-apps-misc fyi

@peay force-pushed the cronjob-too-many-times-to-list branch from b0cefa8 to d141a43 on January 16, 2017
@k8s-github-robot added the `size/M` label and removed `size/S` Jan 16, 2017
@soltysh changed the title from "WIP: Do not list CronJob unmet starting times beyond deadline" to "Do not list CronJob unmet starting times beyond deadline" Jan 16, 2017
@soltysh (Contributor) commented Jan 16, 2017

This LGTM. I've updated the title to drop the WIP prefix.

@k8s-bot ok to test

@soltysh added the `lgtm` label ("Looks good to me", indicates that a PR is ready to be merged) Jan 16, 2017
@soltysh (Contributor) commented Jan 16, 2017

@k8s-bot kubemark e2e test this

@k8s-ci-robot (Contributor)

Jenkins Kubemark GCE e2e failed for commit d141a43. Full PR test history. cc @peay

The magic incantation to run this job again is @k8s-bot kubemark e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@soltysh (Contributor) commented Jan 17, 2017

@k8s-bot kubemark e2e test this

@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot merged commit 9d2fce7 into kubernetes:master Jan 17, 2017
k8s-github-robot pushed a commit that referenced this pull request Jan 23, 2017
Automatic merge from submit-queue (batch tested with PRs 40196, 40143, 40277)

Emit warning event when CronJob cannot determine starting time

**What this PR does / why we need it**:
In #39608, we've modified the error message for when a CronJob has too many unmet starting times to enumerate in order to figure out the next starting time. This makes it more "actionable", and the user can now set a deadline to avoid running into this. However, the error message is still only at the controller level AFAIK, and thus not exposed to the user. From their perspective, there is no way to tell why the CronJob is not scheduling the next instance.

The PR adds a warning event in addition to the error in the controller manager's log. 
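For context, a sketch of what emitting such an event can look like with the Kubernetes event recorder; the reason string, helper name, and (modern) import paths are illustrative, not necessarily what the PR uses:

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// emitTooManyMissedStarts surfaces the controller-level error to the
// user as a Warning event attached to the CronJob object, so it shows
// up in `kubectl describe`. Names and wiring here are illustrative.
func emitTooManyMissedStarts(recorder record.EventRecorder, cronJob runtime.Object, err error) {
	recorder.Eventf(cronJob, v1.EventTypeWarning, "FailedNeedsStart",
		"Cannot determine if job needs to be started: %v", err)
}
```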

**Which issue this PR fixes**: This is an addition to PR #39608  regarding #36311.

**Special notes for your reviewer**: cc @soltysh 

**Release note**:

```release-note
```
hiddeco pushed a commit to hiddeco/kubernetes that referenced this pull request Apr 6, 2020
hiddeco pushed a commit to hiddeco/kubernetes that referenced this pull request Apr 6, 2020
Labels
`area/batch` `area/workload-api/cronjob` `cncf-cla: yes` `lgtm` `release-note-none` `sig/apps` `size/M`
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Controller manager cannot determine if scheduledjob needs to be started
6 participants