
The safe-to-evict annotation prevents worker-node scale-down continuously #3183

Closed
@cdlliuy

Description


First of all, I used the safe-to-evict annotation to prevent my existing pods from being disrupted by scale-down. But with this annotation, the cluster autoscaler simply skips over the worker node during scale-down evaluation, and then NEW workloads carrying safe-to-evict land on that worker node again, so the worker node can't be removed at all ...

Below are the detailed experiments:

First, I added a LARGE Job workload to trigger scale-out. The annotation is added, and once the job is done, the pods are shown as Completed.

cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: busybox-job-600
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"  
    spec:
      restartPolicy: Never
      containers:
      - name: busybox
        image: busybox
        imagePullPolicy: IfNotPresent
        resources:
          requests:
            memory: "1Gi"
            cpu: "200m"
          limits:
            memory: "1Gi"
            cpu: "200m"
        command: ['sh', '-c', 'echo Container 1 is Running ; sleep 600']
  backoffLimit: 4
  parallelism: 50
EOF
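
To double-check that the annotation actually lands on the Job's pods and that they show Completed afterwards, a quick sanity check (assuming the default job-name label that the Job controller adds to its pods):

# Quick sanity check: list the Job's pods and grep for the annotation
kubectl get pods -l job-name=busybox-job-600
kubectl get pods -l job-name=busybox-job-600 -o yaml | grep safe-to-evict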

Then, I created a smaller CronJob workload to simulate continuous jobs jumping in with the safe-to-evict annotation.

cat <<EOF | kubectl apply -f -
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: busybox-cronjob
spec:
  schedule: "*/2 * * * *"
  jobTemplate:
    spec:
      parallelism: 10
      template:
        metadata:
          annotations:
            cluster-autoscaler.kubernetes.io/safe-to-evict: "false"        
        spec:
          restartPolicy: Never
          containers:
          - name: busybox
            image: busybox
            args:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster; sleep 30
            resources:
              requests:
                memory: "1Gi"
                cpu: "200m"
              limits:
                memory: "1Gi"
                cpu: "200m"
EOF
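
To watch where the CronJob pods land every two minutes, I check each pod together with the node it was scheduled onto (just an observation helper, not part of the repro itself):

# Show each busybox-cronjob pod and the node it was scheduled onto
kubectl get pods -o wide | grep busybox-cronjob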

The cluster autoscaler settings are:

      - command:
        - ./cluster-autoscaler
        - --v=4
        - --balance-similar-node-groups=true
        - --alsologtostderr=true
        - --stderrthreshold=info
        - --cloud-provider=IKS
        - --skip-nodes-with-local-storage=true
        - --skip-nodes-with-system-pods=true
        - --scale-down-unneeded-time=10m
        - --scale-down-delay-after-add=10m
        - --scale-down-delay-after-delete=10m
        - --scale-down-utilization-threshold=0.5
        - --scan-interval=1m
        - --expander=random
        - --max-inactivity=10m
        - --max-failing-time=15m
        - --leader-elect=false
        - --max-node-provision-time=120m
        - --ignore-daemonsets-utilization=false
        - --max-bulk-soft-taint-count=0
        - --max-bulk-soft-taint-time=10m

====================

Observations:

  • Once the BIG Job workload ended, the CPU/memory utilization dropped, but the worker nodes can't be removed due to the annotation (expected behavior):
I0603 07:04:25.173566       1 scale_down.go:462] Node 10.243.128.9 - cpu utilization 0.556266
I0603 07:04:25.173579       1 scale_down.go:466] Node 10.243.128.9 is not suitable for removal - cpu utilization too big (0.556266)
I0603 07:04:25.173592       1 scale_down.go:462] Node 10.243.128.16 - cpu utilization 0.121483
I0603 07:04:25.173602       1 scale_down.go:462] Node 10.243.128.17 - memory utilization 0.175274
I0603 07:04:25.173624       1 scale_down.go:462] Node 10.243.128.4 - cpu utilization 0.588747
I0603 07:04:25.173634       1 scale_down.go:466] Node 10.243.128.4 is not suitable for removal - cpu utilization too big (0.588747)
I0603 07:04:25.173819       1 scale_down.go:511] Finding additional 2 candidates for scale down.
I0603 07:04:25.173899       1 cluster.go:93] Fast evaluation: 10.243.128.16 for removal
I0603 07:04:25.173911       1 cluster.go:107] Fast evaluation: node 10.243.128.16 cannot be removed: pod annotated as not safe to evict present: busybox-cronjob-1591167840-qzrfr
I0603 07:04:25.173916       1 cluster.go:93] Fast evaluation: 10.243.128.17 for removal
I0603 07:04:25.173922       1 cluster.go:107] Fast evaluation: node 10.243.128.17 cannot be removed: pod annotated as not safe to evict present: busybox-cronjob-1591167840-nwdvl
I0603 07:04:25.173931       1 scale_down.go:548] 2 nodes found to be unremovable in simulation, will re-check them at 2020-06-03 07:09:23.562749096 +0000 UTC m=+22534.296804927
  • But because the CronJob workload keeps jumping in with the annotation, every worker node ends up running pods with the safe-to-evict: "false" annotation ... In the end, the cluster autoscaler can't scale down any worker node at all (quick check below).
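
A quick way to confirm that end state, i.e. that every worker node hosts at least one running pod with the annotation (assuming jq is available):

# Count running pods annotated safe-to-evict=false per node
kubectl get pods --all-namespaces -o json \
  | jq -r '.items[]
      | select(.metadata.annotations["cluster-autoscaler.kubernetes.io/safe-to-evict"] == "false")
      | select(.status.phase == "Running")
      | .spec.nodeName' \
  | sort | uniq -c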

=============

I expected that the cluster autoscaler could mark the worker node unschedulable to prevent further workloads from jumping in once it sees the safe-to-evict annotation attached.
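
For now, the only workaround I can see is doing that by hand, e.g. cordoning a scale-down candidate so the next CronJob run can't place new annotated pods there (just a manual sketch; 10.243.128.16 is one of the candidate nodes from the logs above):

# Manually mark a scale-down candidate as unschedulable so new
# safe-to-evict=false pods stop landing on it
kubectl cordon 10.243.128.16

# After the pods already on it finish, the node can be drained and
# removed by hand; uncordon if plans change
kubectl uncordon 10.243.128.16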

Any comment?

Labels: lifecycle/rotten (denotes an issue or PR that has aged beyond stale and will be auto-closed)