Infinite ImagePullBackOff CronJob results in resource leak #76570

Open · DarrienG opened this issue Apr 14, 2019 · 2 comments

DarrienG commented Apr 14, 2019

What happened:
A CronJob without a ConcurrencyPolicy or history limit that uses an image that doesn't exist will slowly consume almost all cluster resources. In our cluster we started hitting the pod limit on all of our nodes and lost the ability to schedule new pods.

What you expected to happen:
Even without a ConcurrencyPolicy, CronJob should probably behave like the other workload controllers. If I start a Deployment with X replicas and one of the containers in a pod hits ImagePullBackOff, the Deployment won't keep scheduling more pods on other nodes until it consumes all cluster resources.
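
For comparison, a Deployment pointed at the same nonexistent image (hypothetical manifest, just to illustrate the point) sits at exactly its replica count, with every pod stuck in ImagePullBackOff, and never creates more:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: imagepull-demo            # hypothetical name, for illustration only
spec:
  replicas: 3                     # upper bound: never more than 3 pods exist
  selector:
    matchLabels:
      app: imagepull-demo
  template:
    metadata:
      labels:
        app: imagepull-demo
    spec:
      containers:
      - name: hello
        image: darrienglasser.com/busybox:does-not-exist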

This is especially bad with CronJob because, unlike a Deployment, where the replica count imposes an upper bound on horizontal scale, a CronJob with no history limit and no ConcurrencyPolicy will slowly consume every resource on the cluster.

While this is up for debate, I would personally say that when a scheduled Job hits ImagePullBackOff, the CronJob shouldn't keep scheduling new pods. It should either kill the pod that is failing to pull and create a new one, or wait for the existing pod to pull the image successfully.

Worst case, it consumes all cluster resources; best case, there is a thundering herd of Jobs all rushing to completion when the image finally becomes available.
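
In the meantime, the only workaround I can think of (a sketch on my end, not something the controller does for you) is to cap concurrency and put a deadline on each Job so stuck pods get reaped instead of accumulating:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello-cron                 # hypothetical name
spec:
  schedule: "* * * * *"
  concurrencyPolicy: Forbid        # don't start a new Job while the previous one is still active
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1        # keep at most one failed Job around
  jobTemplate:
    spec:
      activeDeadlineSeconds: 120   # kill the Job (and its pod) if it hasn't finished in 2 minutes
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: hello
            image: darrienglasser.com/busybox:does-not-exist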

How to reproduce it (as minimally and precisely as possible):

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: imagepull-leak             # any name works
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never     # required in a Job pod template
          containers:
          - name: hello
            image: darrienglasser.com/busybox:does-not-exist   # image that can never be pulled

Deploy the above and wait. Your cluster will collapse over time.
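
To watch it happen, something like this is enough (filename is arbitrary):

kubectl apply -f cronjob.yaml
# one new Job/pod per minute, each stuck in ImagePullBackOff and never cleaned up
kubectl get jobs
kubectl get pods --watch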

Anything else we need to know?:
No

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.0", GitCommit:"641856db18352033a0d96dbc99153fa3b27298e5", GitTreeState:"clean", BuildDate:"2019-03-25T15:53:57Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.2", GitCommit:"cff46ab41ff0bb44d8584413b598ad8360ec1def", GitTreeState:"clean", BuildDate:"2019-01-10T23:28:14Z", GoVersion:"go1.11.4", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
    On prem. A number of high-powered nodes with Xeons (256Gi+ memory and the latest Xeon Gold processors).

  • OS (e.g: cat /etc/os-release):

core@k8s-node [23:36:33]~ $ cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1911.4.0
VERSION_ID=1911.4.0
BUILD_ID=2018-11-26-1924
PRETTY_NAME="Container Linux by CoreOS 1911.4.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"
  • Kernel (e.g. uname -a):
Linux k8s-node 4.14.81-coreos #1 SMP Mon Nov 26 18:51:57 UTC 2018 x86_64 Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz GenuineIntel GNU/Linux

DarrienG commented Apr 14, 2019

/sig scheduling


cizixs commented Apr 16, 2019

/sig apps
