Description
First of all, I used the safe-to-evict annotation to prevent my existing pods from being disrupted by scale-down. But with this annotation, the cluster autoscaler simply skips the worker node when evaluating scale-down, and then NEW workloads carrying the safe-to-evict annotation land on that worker node again, so the worker node can't be removed at all ...
Below are the detailed experiments:
First, I created a LARGE Job workload to trigger scale-out. The annotation is added, and once the job is done, the pods are shown as Completed.
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: busybox-job-600
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      restartPolicy: Never
      containers:
      - name: busybox
        image: busybox
        imagePullPolicy: IfNotPresent
        resources:
          requests:
            memory: "1Gi"
            cpu: "200m"
          limits:
            memory: "1Gi"
            cpu: "200m"
        command: ['sh', '-c', 'echo Container 1 is Running ; sleep 600']
  backoffLimit: 4
  parallelism: 50
EOF
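(To double-check which node each job pod landed on and that they end up Completed; the job-name label is added automatically by the Job controller:)
kubectl get pods -l job-name=busybox-job-600 -o wide
kubectl get nodes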
Then, I created a smaller CronJob workload to simulate continuous jobs jumping in with the safe-to-evict annotation.
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: busybox-cronjob
spec:
  schedule: "*/2 * * * *"
  jobTemplate:
    spec:
      parallelism: 10
      template:
        metadata:
          annotations:
            cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
        spec:
          restartPolicy: Never
          containers:
          - name: busybox
            image: busybox
            args:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster; sleep 30
            resources:
              requests:
                memory: "1Gi"
                cpu: "200m"
              limits:
                memory: "1Gi"
                cpu: "200m"
EOF
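(The CronJob fires every 2 minutes and each run sleeps 30 seconds, so there are almost always some Running busybox-cronjob pods; to see which nodes they land on:)
kubectl get pods -o wide | grep busybox-cronjob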
The cluster autoscaler settings are:
- command:
  - ./cluster-autoscaler
  - --v=4
  - --balance-similar-node-groups=true
  - --alsologtostderr=true
  - --stderrthreshold=info
  - --cloud-provider=IKS
  - --skip-nodes-with-local-storage=true
  - --skip-nodes-with-system-pods=true
  - --scale-down-unneeded-time=10m
  - --scale-down-delay-after-add=10m
  - --scale-down-delay-after-delete=10m
  - --scale-down-utilization-threshold=0.5
  - --scan-interval=1m
  - --expander=random
  - --max-inactivity=10m
  - --max-failing-time=15m
  - --leader-elect=false
  - --max-node-provision-time=120m
  - --ignore-daemonsets-utilization=false
  - --max-bulk-soft-taint-count=0
  - --max-bulk-soft-taint-time=10m
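(These flags can be confirmed on the running autoscaler deployment in kube-system; the deployment name below is just a placeholder since it differs per install:)
kubectl -n kube-system get deployments | grep -i autoscaler
kubectl -n kube-system get deployment <autoscaler-deployment-name> -o jsonpath='{.spec.template.spec.containers[0].command}'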
====================
Observations:
- Once the BIG Job workload ended, the CPU/memory utilization dropped, but the worker nodes still can't be removed because of the annotation (expected behavior):
I0603 07:04:25.173566 1 scale_down.go:462] Node 10.243.128.9 - cpu utilization 0.556266
I0603 07:04:25.173579 1 scale_down.go:466] Node 10.243.128.9 is not suitable for removal - cpu utilization too big (0.556266)
I0603 07:04:25.173592 1 scale_down.go:462] Node 10.243.128.16 - cpu utilization 0.121483
I0603 07:04:25.173602 1 scale_down.go:462] Node 10.243.128.17 - memory utilization 0.175274
I0603 07:04:25.173624 1 scale_down.go:462] Node 10.243.128.4 - cpu utilization 0.588747
I0603 07:04:25.173634 1 scale_down.go:466] Node 10.243.128.4 is not suitable for removal - cpu utilization too big (0.588747)
I0603 07:04:25.173819 1 scale_down.go:511] Finding additional 2 candidates for scale down.
I0603 07:04:25.173899 1 cluster.go:93] Fast evaluation: 10.243.128.16 for removal
I0603 07:04:25.173911 1 cluster.go:107] Fast evaluation: node 10.243.128.16 cannot be removed: pod annotated as not safe to evict present: busybox-cronjob-1591167840-qzrfr
I0603 07:04:25.173916 1 cluster.go:93] Fast evaluation: 10.243.128.17 for removal
I0603 07:04:25.173922 1 cluster.go:107] Fast evaluation: node 10.243.128.17 cannot be removed: pod annotated as not safe to evict present: busybox-cronjob-1591167840-nwdvl
I0603 07:04:25.173931 1 scale_down.go:548] 2 nodes found to be unremovable in simulation, will re-check them at 2020-06-03 07:09:23.562749096 +0000 UTC m=+22534.296804927
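(The utilization figures in these logs are computed from pod resource requests versus node allocatable, so they can be cross-checked directly on a node, e.g.:)
kubectl describe node 10.243.128.9 | grep -A 8 'Allocated resources'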
- But because the CronJob workload keeps jumping in continuously with the annotation, all the worker nodes end up with Running pods carrying the safe-to-evict: "false" annotation ... In the end, the cluster autoscaler can't scale down any worker node at all.
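To confirm this, here is one way (using jq) to list the Running pods that carry the annotation and the nodes they sit on; with the cronjob above, every node shows up sooner or later:
kubectl get pods --all-namespaces -o json | jq -r '
  .items[]
  | select(.status.phase == "Running")
  | select(.metadata.annotations["cluster-autoscaler.kubernetes.io/safe-to-evict"] == "false")
  | "\(.spec.nodeName)\t\(.metadata.namespace)/\(.metadata.name)"' | sort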
=============
I expected the cluster autoscaler to mark the worker node as unschedulable when it sees the safe-to-evict annotation attached, to prevent further workloads from jumping in.
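To make the expectation concrete, this is roughly the manual equivalent: cordon or taint the node so no new workload can land on it (the node name and taint key below are just examples):
kubectl cordon 10.243.128.16
kubectl taint nodes 10.243.128.16 scale-down-candidate=true:NoSchedule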
Any comment?