Add cronjob controller v2 #93370

Merged

Contributor

@alaypatel07 alaypatel07 commented Jul 23, 2020

What type of PR is this?
/kind feature

What this PR does / why we need it:
The new controller is built on informers instead of polling. This rework addresses the scale and performance issues of the old controller.
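A toy sketch of why event-driven processing scales better than polling. It uses only the standard library; `store`, `queue`, and `compare` are hypothetical names for illustration and are not part of this PR or of client-go. It assumes polling examines every object on each sync, while an informer-fed queue only sees objects that changed.

```go
package main

import "fmt"

// store is a toy stand-in for the API server: name -> schedule.
type store map[string]string

// pollOnce models the polling approach: each sync lists every object and
// examines it, whether or not it changed.
func pollOnce(s store) int {
	examined := 0
	for range s {
		examined++
	}
	return examined
}

// queue is a minimal deduplicating work queue keyed by object name, loosely
// modelled on what an informer event handler would feed.
type queue struct {
	pending map[string]bool
	order   []string
}

func newQueue() *queue { return &queue{pending: map[string]bool{}} }

// add enqueues a key unless it is already waiting to be processed.
func (q *queue) add(key string) {
	if !q.pending[key] {
		q.pending[key] = true
		q.order = append(q.order, key)
	}
}

// drain processes all queued keys and returns how many objects were examined.
func (q *queue) drain() int {
	examined := 0
	for _, k := range q.order {
		delete(q.pending, k)
		examined++
	}
	q.order = nil
	return examined
}

// compare returns the objects examined by each approach when `changed` of
// `total` objects each produce `eventsPerObject` change events.
func compare(total, changed, eventsPerObject int) (polled, eventDriven int) {
	s := store{}
	for i := 0; i < total; i++ {
		s[fmt.Sprintf("cj-%d", i)] = "*/5 * * * *"
	}
	q := newQueue()
	for i := 0; i < changed; i++ {
		for e := 0; e < eventsPerObject; e++ {
			q.add(fmt.Sprintf("cj-%d", i))
		}
	}
	return pollOnce(s), q.drain()
}

func main() {
	polled, eventDriven := compare(1000, 3, 2)
	fmt.Printf("polling examined %d objects; event-driven examined %d\n", polled, eventDriven)
}
```

The work done by the polling loop grows with the total number of CronJobs, while the queue-based loop grows only with the number of objects that actually changed.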

Special notes for your reviewer:
/assign @soltysh @wojtek-t

Does this PR introduce a user-facing change?:

Users can try the cronjob controller v2 by enabling the CronJobControllerV2 feature gate. It will become the default controller in a future release.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP] - https://github.com/kubernetes/enhancements/blob/master/keps/sig-apps/19-Graduate-CronJob-to-Stable/README.md

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress, kind/cleanup, size/XXL, do-not-merge/release-note-label-needed, cncf-cla: yes, needs-sig, needs-priority, and needs-ok-to-test labels Jul 23, 2020
@k8s-ci-robot
Contributor

Hi @alaypatel07. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the kind/api-change, sig/api-machinery, sig/apps, and triage/needs-information labels and removed the needs-sig label Jul 23, 2020
@alaypatel07
Contributor Author

/assign @soltysh

Contributor

@sftim sftim left a comment

Some early feedback

@fedebongio
Contributor

/assign @cheftako
Keeping the label until this is no longer WIP.

Contributor

@soltysh soltysh left a comment

I did a first pass; I'll look into this more when I'm back from my PTO.

if features.DefaultMutableFeatureGate.Enabled(features.CronjobController2) {
cj2c, err := cronjob.NewController2(ctx.InformerFactory.Batch().V1().Jobs(),
ctx.InformerFactory.Batch().V1beta1().CronJobs(),
ctx.ClientBuilder.ClientOrDie("cronjob-controller"),
Contributor

nit: use cronjob-controllerv2 to make it explicit in the metrics.

Contributor Author

+1, modified

Contributor

For other readers, we went with the same name, so that we can re-use the SA for this controller.

Contributor

> For other readers, we went with the same name, so that we can re-use the SA for this controller.

Please create a new SA and tie the existing rolebinding to it so we can figure out what is being used. This can be a follow-up (still in 1.20) to avoid making this PR larger.

cmd/kube-controller-manager/app/batch.go (outdated, resolved)
pkg/controller/cronjob/utils.go (outdated, resolved)
return nil, nil
default:
// multiple unmet start times, start the last one.
klog.V(4).Infof("Multiple unmet start times for %s so only starting last one", nameForLog)
Contributor

This won't work as intended, since getRecentUnmetScheduleTimes2 returns an error when there are 100+ missed start times. You need to handle that case differently or change that function.

pkg/controller/cronjob/cronjob_controller2.go (outdated, resolved)
pkg/controller/cronjob/cronjob_controller2.go (outdated, resolved)
recorder.Eventf(cj, v1.EventTypeWarning, "UnparseableSchedule", "unparseable schedule: %s : %s", cj.Spec.Schedule, err)
return nil, nil
}
times, err := getRecentUnmetScheduleTimes2(*cj, now, sched)
Contributor

Reading through this more, I don't think we need the separate method for this calculation.

Contributor Author

addressed this, PTAL, thanks

}
times, err := getRecentUnmetScheduleTimes2(*cj, now, sched)
switch {
case err != nil && len(times) == 0:
Contributor

Every error path in getRecentUnmetScheduleTimes2 returns an error and an empty slice, so all errors will match this case and never the others. You need to change this switch.

Contributor Author

Yes, but in getRecentUnmetScheduleTimes2 the only place the error is non-nil is when there are too many missed start times, so functionally I think this should give us the desired output.
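To make the point in this thread concrete, here is a minimal sketch under stated assumptions. `recentUnmetTimes`, `nextStart`, and `branchFor` are hypothetical stand-ins and not the code in this PR; a fixed interval replaces the parsed cron schedule, and the middle `case` arms are assumed, since the diff only shows the first case and the default. The sketch shows that when the helper returns an error together with an empty slice, the error case absorbs the 100+ situation, so the "start the last one" default never sees it.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTooManyMissed mirrors the "too many missed start times" condition.
var errTooManyMissed = errors.New("too many missed start times (> 100)")

// recentUnmetTimes is a hypothetical stand-in for getRecentUnmetScheduleTimes2.
// Like the function under review, it returns an error together with an empty
// slice once more than 100 start times have been missed.
func recentUnmetTimes(last, now time.Time, every time.Duration) ([]time.Time, error) {
	var times []time.Time
	for t := last.Add(every); !t.After(now); t = t.Add(every) {
		times = append(times, t)
		if len(times) > 100 {
			return nil, errTooManyMissed
		}
	}
	return times, nil
}

// nextStart follows the shape of the switch in the diff. Every error arrives
// with zero times, so the first case handles all errors and the default
// branch only ever sees between 2 and 100 missed times.
func nextStart(last, now time.Time, every time.Duration) (*time.Time, string) {
	times, err := recentUnmetTimes(last, now, every)
	switch {
	case err != nil && len(times) == 0:
		return nil, "error"
	case len(times) == 0:
		return nil, "nothing-due"
	case len(times) == 1:
		return &times[0], "single"
	default:
		// multiple unmet start times, start the last one
		return &times[len(times)-1], "latest-of-many"
	}
}

// branchFor reports which branch an hourly schedule takes after the given
// number of missed hours.
func branchFor(missedHours int) string {
	now := time.Date(2020, 7, 26, 12, 0, 0, 0, time.UTC)
	last := now.Add(-time.Duration(missedHours) * time.Hour)
	_, branch := nextStart(last, now, time.Hour)
	return branch
}

func main() {
	for _, missed := range []int{0, 1, 5, 150} {
		fmt.Printf("%d missed -> %s\n", missed, branchFor(missed))
	}
}
```

With 150 missed hourly runs the function takes the error branch and schedules nothing, which is the behavior the reviewer flagged.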

pkg/controller/cronjob/cronjob_controller2.go (outdated, resolved)
@alaypatel07 alaypatel07 reopened this Jul 26, 2020
@alaypatel07 alaypatel07 force-pushed the add-new-cronjob-controller branch 2 times, most recently from 971d9c7 to e74eea6 Compare July 26, 2020 21:53
@alaypatel07
Contributor Author

/retitle [WIP]: Add cronjob controller v2

@soltysh
Contributor

soltysh commented Nov 10, 2020

/retest

Contributor

@soltysh soltysh left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Nov 10, 2020
@alaypatel07
Contributor Author

/test pull-kubernetes-e2e-gce-alpha-features

@alaypatel07
Contributor Author

/test pull-kubernetes-e2e-gce-ubuntu-containerd

@deads2k
Contributor

deads2k commented Nov 11, 2020

/lgtm
/approve

@deads2k deads2k added this to the v1.20 milestone Nov 11, 2020
@liggitt
Member

liggitt commented Nov 11, 2020

/approve
for config API change

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alaypatel07, deads2k, liggitt, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved label Nov 11, 2020
@alaypatel07
Contributor Author

test/e2e/network/service.go:2587
Nov 11 17:37:26.272: failed to create replication controller with headless service:  in the namespace: services-6960
Unexpected error:
    <*errors.errorString | 0xc001b18c60>: {
        s: "1 containers failed which is more than allowed 0",
    }
    1 containers failed which is more than allowed 0
occurred
test/e2e/network/service.go:2602

This failure seems to be unrelated to the PR.

@alaypatel07
Contributor Author

/test pull-kubernetes-e2e-gce-ubuntu-containerd

@alaypatel07
Contributor Author

/retest

// CronJobControllerV2 controls whether the controller manager starts old cronjob
// controller or new one which is implemented with informers and delaying queue
//
// This feature is deprecated, and will be removed in v1.22.
Contributor

???

Contributor

We'll fix that in follow-ups 😉

@b0b0haha

Why would the v2 cronjob controller fix this? It still includes the 100 missed start times limit 'feature' from the v1 controller.
https://github.com/alaypatel07/kubernetes/blob/8d7dd4415e28bded77667ce857e1c58016f9ab3a/pkg/controller/cronjob/utils.go#L194-L214

@soltysh
Contributor

soltysh commented Apr 4, 2022

> Why would the v2 cronjob controller fix this? It still includes the 100 missed start times limit 'feature' from the v1 controller. https://github.com/alaypatel07/kubernetes/blob/8d7dd4415e28bded77667ce857e1c58016f9ab3a/pkg/controller/cronjob/utils.go#L194-L214

@b0b0haha if you look carefully through

if numberOfMissedSchedules > 100 {
	// An object might miss several starts. For example, if
	// controller gets wedged on friday at 5:01pm when everyone has
	// gone home, and someone comes in on tuesday AM and discovers
	// the problem and restarts the controller, then all the hourly
	// jobs, more than 80 of them for one hourly cronJob, should
	// all start running with no further intervention (if the cronJob
	// allows concurrency and late starts).
	//
	// However, if there is a bug somewhere, or incorrect clock
	// on controller's server or apiservers (for setting creationTimestamp)
	// then there could be so many missed start times (it could be off
	// by decades or more), that it would eat up all the CPU and memory
	// of this controller. In that case, we want to not try to list
	// all the missed start times.
	//
	// I've somewhat arbitrarily picked 100, as more than 80,
	// but less than "lots".
	recorder.Eventf(&cj, corev1.EventTypeWarning, "TooManyMissedTimes", "too many missed start times: %d. Set or decrease .spec.startingDeadlineSeconds or check clock skew", numberOfMissedSchedules)
	klog.InfoS("too many missed times", "cronjob", klog.KRef(cj.GetNamespace(), cj.GetName()), "missed times", numberOfMissedSchedules)
}
you'll notice that 100+ missed start times only logs the problem and reports a warning event, but the controller still proceeds to pick the most recent time to create a job.
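The warn-but-proceed behavior described above can be sketched roughly as follows. This is an illustration under assumptions, not the controller's code: `mostRecentStart` and `hourlyAfterHours` are hypothetical names, a fixed interval stands in for a cron expression, and a boolean stands in for the event and log call.

```go
package main

import (
	"fmt"
	"time"
)

// maxMissedBeforeWarning is the "somewhat arbitrary" threshold from the comment.
const maxMissedBeforeWarning = 100

// mostRecentStart is a hypothetical, simplified analogue of the quoted logic.
// It computes the result arithmetically, so the cost does not grow with the
// number of missed runs. It returns the most recent missed start time (nil if
// none is due), the number of missed starts, and whether a warning would fire.
func mostRecentStart(earliest, now time.Time, every time.Duration) (*time.Time, int64, bool) {
	if every <= 0 || !now.After(earliest) {
		return nil, 0, false
	}
	missed := int64(now.Sub(earliest) / every)
	if missed == 0 {
		return nil, 0, false
	}
	latest := earliest.Add(time.Duration(missed) * every)
	// Over the threshold we only flag a warning; the latest time is still returned.
	return &latest, missed, missed > maxMissedBeforeWarning
}

// hourlyAfterHours reports the outcome for an hourly schedule that has been
// idle for the given number of hours.
func hourlyAfterHours(hours int) (bool, int64, bool) {
	earliest := time.Date(2020, 11, 6, 17, 0, 0, 0, time.UTC)
	now := earliest.Add(time.Duration(hours) * time.Hour)
	t, missed, warned := mostRecentStart(earliest, now, time.Hour)
	return t != nil, missed, warned
}

func main() {
	// A long weekend (under the threshold) and a clock that is ten years off.
	for _, hours := range []int{0, 87, 24 * 365 * 10} {
		scheduled, missed, warned := hourlyAfterHours(hours)
		fmt.Printf("idle %dh: missed=%d scheduled=%v warned=%v\n", hours, missed, scheduled, warned)
	}
}
```

In both non-zero cases a start time is still produced; exceeding the threshold changes only whether the warning is raised.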

Labels
approved, area/apiserver, area/code-generation, area/test, cncf-cla: yes, kind/api-change, kind/cleanup, lgtm, ok-to-test, priority/backlog, release-note, sig/api-machinery, sig/apps, sig/auth, sig/testing, size/XXL, triage/needs-information