
Cronjobs stopped starting after Kube upgrade from `1.13` to `1.14` #82539

Closed
kraftaa opened this issue Sep 10, 2019 · 6 comments


@kraftaa

commented Sep 10, 2019

What happened:
Last week we upgraded Kubernetes from 1.13 to 1.14:

Client Version: v1.14.0
Server Version: v1.14.6-eks-5047ed

k get nodes shows that all nodes have version v1.14.6-eks-5047ed

After that, none of the CronJobs started, and they produced no error output (they worked before the upgrade).
The CronJobs use apiVersion batch/v1beta1:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: materialize-py
  namespace: datascience
  labels:
    app: athena
    task: materialize
spec:
  schedule: "*/1 */1 * * *"
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            app: athena
            task: materialize
          annotations:
            iam.amazonaws.com/role: ..our role..
        spec:
          restartPolicy: Never
          containers:
            - name: materialize-py
              image: ..our image..
              imagePullPolicy: Always
              command:
                - bash
                - "-c"
                - "sleep infinity"

I can still manually create and successfully run a Job with:

k -n datascience create job --from=cronjob/materialize-py onejob

It runs without any error.

What you expected to happen:
I expect CronJobs to start according to their schedule.
They all ran perfectly before the upgrade from 1.13 to 1.14.

How to reproduce it (as minimally and precisely as possible):

k -n datascience create -f materialize.yml

k -n datascience get cronjobs
NAME             SCHEDULE        SUSPEND   ACTIVE   LAST SCHEDULE   AGE
materialize-py   */1 */1 * * *   False     0        <none>          106s

No pods are created.
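
A few standard kubectl checks that can help confirm whether the controller is emitting events or creating Jobs at all (namespace and CronJob name taken from the example above; k is an alias for kubectl):

k -n datascience describe cronjob materialize-py
k -n datascience get jobs
k -n datascience get events --sort-by=.metadata.creationTimestamp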

Anything else we need to know?:

Environment:

Client Version: v1.14.0
Server Version: v1.14.6-eks-5047ed

AWS::EKS::Cluster
K3

thank you

/sig release

@kraftaa kraftaa added the kind/bug label Sep 10, 2019

@kraftaa kraftaa changed the title Cronjobs stopped starting after Kube update from `1.13` to `1.14` Cronjobs stopped starting after Kube upgrade from `1.13` to `1.14` Sep 10, 2019

@k8s-ci-robot

Contributor

commented Sep 10, 2019

@kraftaa: The label(s) sig/cronjobs cannot be applied. These labels are supported: api-review, community/discussion, community/maintenance, community/question, cuj/build-train-deploy, cuj/multi-user, platform/aws, platform/azure, platform/gcp, platform/minikube, platform/other

In response to this:

/sig cronjobs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added sig/release and removed needs-sig labels Sep 10, 2019

@Joseph-Irving

Contributor

commented Sep 10, 2019

/sig apps
You may have encountered this issue #77465
Once you've got over 500 Jobs in the cluster, CronJobs can no longer schedule; there's currently a cherry pick waiting to be merged into the 1.14 branch, #79178.
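
A quick way to check whether a cluster is over that threshold (assuming the k alias for kubectl used above) is to count Jobs and CronJobs across all namespaces, since the limit is cluster-wide rather than per namespace:

k get jobs --all-namespaces --no-headers | wc -l
k get cronjobs --all-namespaces --no-headers | wc -l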

@kraftaa

Author

commented Sep 10, 2019

@Joseph-Irving thank you!
I've seen that issue too. We checked our Jobs: their number exceeded 500, but even after we deleted many of them and went below 500, nothing changed (I deleted and re-created the CronJob to test).

Update: my bad, the total number of Jobs is > 500 across all namespaces.

I had misunderstood and thought that issue was already fixed; thank you for the information about the cherry pick, now I see it. I also checked the CloudWatch logs and saw the following (at about the time of the version upgrade):

I0906 17:44:55.773395 6 resource_quota_monitor.go:228] QuotaMonitor created object count evaluator for cronjobs.batch
I0906 17:44:55.813517 6 controllermanager.go:482] Starting "cronjob"
I0906 17:44:55.821051 6 controllermanager.go:497] Started "cronjob"
I0906 17:44:55.821083 6 cronjob_controller.go:94] Starting CronJob Manager
E0906 17:44:55.935079 6 cronjob_controller.go:117] expected type *batchv1.JobList, got type *internalversion.List

Looking at #79178, that's probably our case; checking the QuotaMonitor now.
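
For others on EKS, one way to pull the same controller-manager lines is to search the control-plane log group in CloudWatch Logs. This is only a sketch: it assumes control-plane logging is enabled for the cluster and that the log group follows the default /aws/eks/<cluster-name>/cluster naming, so substitute your own cluster name:

aws logs filter-log-events \
  --log-group-name /aws/eks/<cluster-name>/cluster \
  --filter-pattern '"cronjob_controller"' \
  --max-items 20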

Thank you!

@kraftaa

Author

commented Sep 11, 2019

Update: even after reducing the number of Jobs to 0, the CronJobs didn't fire. We reduced the number of CronJobs too, and then they started working as scheduled.

@liggitt

Member

commented Sep 12, 2019

The total number of CronJobs across all namespaces being > 500 is the issue. #79178 is now merged, and the v1.14.7 release (targeted for 2019-09-18) will resolve this.
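
Once a patched control plane is available, a quick way to confirm the fix is in place (the exact EKS platform suffix will differ) is to check that the server reports v1.14.7 or later:

k version --short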

/close

@k8s-ci-robot

Contributor

commented Sep 12, 2019

@liggitt: Closing this issue.

In response to this:

the total number of cronjobs across all namespaces being > 500 is the issue. #79178 is now merged and the v1.14.7 release will resolve this.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
