DaemonSet fails to scale down during the rolling update when maxUnavailable=0 #118823

Closed
alebedev87 opened this issue Jun 23, 2023 · 12 comments · Fixed by #119317
Assignees
Labels
kind/bug: Categorizes issue or PR as related to a bug.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
sig/apps: Categorizes an issue or PR as relevant to SIG Apps.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@alebedev87

alebedev87 commented Jun 23, 2023

What happened?

I have a DaemonSet with the rolling update strategy. The maxSurge parameter is set to a non-zero value, which requires the maxUnavailable parameter to be 0. When I replace the toleration in the DaemonSet's pod template spec (the one that allowed the pods to be scheduled on the master nodes) with any other toleration, I see the new pods still trying to be scheduled on the master nodes and getting stuck in Pending. The old pods on nodes that are still tolerated may get recreated, but only if the rollout reaches them before a pod on a no-longer-tolerated node.

$ kubectl -n openshift-dns get ds dns-default -o yaml | yq .spec.updateStrategy
rollingUpdate:
  maxSurge: 10%
  maxUnavailable: 0
type: RollingUpdate

$ kubectl -n openshift-dns get ds dns-default -o yaml | yq .spec.template.spec.tolerations
- key: node-role.kubernetes.io/master
  operator: Exists
  
$ kubectl -n openshift-dns get pods  -o wide | grep dns-default
dns-default-6bfmf     2/2     Running   0          119m    10.129.0.40   ci-ln-sb5ply2-72292-qlhc8-master-2         <none>           <none>
dns-default-9cjdf     2/2     Running   0          2m35s   10.129.2.15   ci-ln-sb5ply2-72292-qlhc8-worker-c-m5wzq   <none>           <none>
dns-default-c6j9x     2/2     Running   0          119m    10.128.0.13   ci-ln-sb5ply2-72292-qlhc8-master-0         <none>           <none>
dns-default-fhqrs     2/2     Running   0          2m12s   10.131.0.29   ci-ln-sb5ply2-72292-qlhc8-worker-a-6q7hs   <none>           <none>
dns-default-lx2nf     2/2     Running   0          119m    10.130.0.15   ci-ln-sb5ply2-72292-qlhc8-master-1         <none>           <none>
dns-default-mmc78     2/2     Running   0          112m    10.128.2.7    ci-ln-sb5ply2-72292-qlhc8-worker-b-bpjdk   <none>           <none>

$ kubectl -n openshift-dns get ds dns-default -o yaml | yq .spec.template.spec.tolerations
- key: test-taint
  operator: Exists
  
$ kubectl -n openshift-dns get ds dns-default
NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR
dns-default   3         3         3       1            3           kubernetes.io/os=linux
  
$ kubectl -n openshift-dns get pods  -o wide | grep dns-default
dns-default-6bfmf     2/2     Running   0          124m    10.129.0.40   ci-ln-sb5ply2-72292-qlhc8-master-2         <none>           <none>
dns-default-76vjz     0/2     Pending   0          3m2s    <none>        <none>                                     <none>           <none>
dns-default-9cjdf     2/2     Running   0          7m24s   10.129.2.15   ci-ln-sb5ply2-72292-qlhc8-worker-c-m5wzq   <none>           <none>
dns-default-c6j9x     2/2     Running   0          124m    10.128.0.13   ci-ln-sb5ply2-72292-qlhc8-master-0         <none>           <none>
dns-default-fhqrs     2/2     Running   0          7m1s    10.131.0.29   ci-ln-sb5ply2-72292-qlhc8-worker-a-6q7hs   <none>           <none>
dns-default-lx2nf     2/2     Running   0          124m    10.130.0.15   ci-ln-sb5ply2-72292-qlhc8-master-1         <none>           <none>
dns-default-mmc78     2/2     Running   0          117m    10.128.2.7    ci-ln-sb5ply2-72292-qlhc8-worker-b-bpjdk   <none>           <none>

$ kubectl -n openshift-dns get pods dns-default-76vjz -o yaml | yq .spec.affinity.nodeAffinity
requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
    - matchFields:
        - key: metadata.name
          operator: In
          values:
            - ci-ln-sb5ply2-72292-qlhc8-master-0

What did you expect to happen?

I don't expect the new pods to be scheduled on nodes whose taints are no longer tolerated by the new DaemonSet template spec. The daemonset controller should just delete the old pods from the nodes which cannot be tolerated anymore. The old pods on the nodes which can still be tolerated should be recreated according to the rolling update parameters.

How can we reproduce it (as minimally and precisely as possible)?

On a Kubernetes cluster with 3 or more master nodes and 3 or more worker nodes (2 and 2 may be enough too, I guess):

  • Create a DaemonSet which (see the manifest sketch after this list)
    • has the RollingUpdate strategy with maxSurge=10% and maxUnavailable=0
    • has a toleration for the master node role taint: node-role.kubernetes.io/master
  • Wait until all the pods become ready
  • Update the DaemonSet, replacing the node-role.kubernetes.io/master toleration with any other (even a non-existent one)
  • Observe the new pod trying to be scheduled on a master node and getting stuck in the Pending state
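For illustration, a DaemonSet along these lines should reproduce the issue. This is a minimal sketch; the name, namespace, labels and image are placeholders, not taken from the original report:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: repro-ds              # placeholder name
  namespace: default
spec:
  selector:
    matchLabels:
      app: repro-ds
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 10%
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: repro-ds
    spec:
      tolerations:
        # initially tolerate the master taint so pods land on the master nodes
        - key: node-role.kubernetes.io/master
          operator: Exists
        # then update the DaemonSet, swapping the toleration above for e.g.:
        # - key: test-taint
        #   operator: Exists
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9   # placeholder image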

Anything else we need to know?

No response

Kubernetes version

Kubernetes Version: v1.26.5+7d22122
Kubernetes Version: v1.27.2+b451817

Cloud provider

Doesn't matter.

OS version

Doesn't matter.

Install tools

Container runtime (CRI) and version (if applicable)

CRI-O

Related plugins (CNI, CSI, ...) and versions (if applicable)

Doesn't matter.
@alebedev87 alebedev87 added the kind/bug Categorizes issue or PR as related to a bug. label Jun 23, 2023
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 23, 2023
@alebedev87
Author

/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 23, 2023
@olderTaoist
Contributor

The new pod is expected to stay Pending when you change to any other toleration, because the master nodes carry the node-role.kubernetes.io/master taint. As for why the old pod on the master node gets updated at all, that probably requires looking at how the DaemonSet controller is implemented.

@alebedev87
Author

@olderTaoist: Sorry, I don't quite understand what you are trying to say. Can you please rephrase it?

@atiratree
Member

This seems like a bug in the DS surge logic: the old pod is not being removed because the taint is NoSchedule only, and the surge logic presumes the existing pods can be surged over without re-checking the scheduling constraints.

@atiratree
Member

/remove-sig scheduling
/sig apps
/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Jun 28, 2023
@mochizuki875
Member

I'll try to fix it.

/assign

@mochizuki875
Member

mochizuki875 commented Jul 11, 2023

Hi, @alebedev87
I'm now working on this and I have one question.

You expect that:

The daemonset controller should just delete the old pods from the nodes which cannot be tolerated anymore.

However, if the master Nodes have the taint below (the default setting in Kubernetes), and the DaemonSet being updated has no toleration for it, what do you expect to happen?

 taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master

I think the old Pod should remain, neither updated nor deleted, according to the Taints and Tolerations mechanism.

For example:

Before Update

pod_on_master01_v1
pod_on_master02_v1
pod_on_master03_v1
pod_on_node01_v1
pod_on_node02_v1
pod_on_node03_v1

↓ Update DaemonSet

After Update

pod_on_master01_v1
pod_on_master02_v1
pod_on_master03_v1
pod_on_node01_v2
pod_on_node02_v2
pod_on_node03_v2

Going by the Taints and Tolerations section of the Kubernetes documentation, if the master Nodes have the taints below (including a NoExecute taint), and the DaemonSet being updated has no toleration for them, I think the old Pod should be deleted.

 taints:
  - effect: NoExecute
    key: node-role.kubernetes.io/control-plane
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane

For example:

Before Update

pod_on_master01_v1
pod_on_master02_v1
pod_on_master03_v1
pod_on_node01_v1
pod_on_node02_v1
pod_on_node03_v1

↓ Update DaemonSet

After Update

pod_on_node01_v2
pod_on_node02_v2
pod_on_node03_v2

What do you think about that?

@mochizuki875
Member

I've created #119317 according to the above.

@alebedev87
Author

@mochizuki875: I think I'm fine with the approach to keep the old pods on the master nodes if the master nodes don't have the NoExecute taint.
The only doubt I have is about the status of the DaemonSet. I noticed that it changes according to the scheduling options, including the tolerations. You can see in the description that the number of desired pods changes to 3 (only the workers). So, according to your proposal, we will see a difference between the status and the number of pods actually running (3 vs 6 if we take the example from the description).

@mochizuki875
Member

Hi @alebedev87

I think I'm fine with the approach to keep the old pods on the master nodes if the master nodes don't have the NoExecute taint.

Thanks for confirming.

The only doubt I have is about the status of the DaemonSet. I noticed that it changes according to the scheduling options, including the tolerations. You can see in the description that the number of desired pods changes to 3 (only the workers). So, according to your proposal, we will see a difference between the status and the number of pods actually running (3 vs 6 if we take the example from the description).

I've noticed that too.
In conclusion, I think the difference is not a problem, because:

  • all nodes which don't match the scheduling constraints should be excluded from the DaemonSet's targets
  • daemon pods running on excluded nodes should no longer be managed by the DaemonSet

FYI: The implementation of updating DaemonSet status is as below.

func (dsc *DaemonSetsController) updateDaemonSetStatus(ctx context.Context, ds *apps.DaemonSet, nodeList []*v1.Node, hash string, updateObservedGen bool) error {
    logger := klog.FromContext(ctx)
    logger.V(4).Info("Updating daemon set status")
    nodeToDaemonPods, err := dsc.getNodesToDaemonPods(ctx, ds, false)
    if err != nil {
        return fmt.Errorf("couldn't get node to daemon pod mapping for daemon set %q: %v", ds.Name, err)
    }

    var desiredNumberScheduled, currentNumberScheduled, numberMisscheduled, numberReady, updatedNumberScheduled, numberAvailable int
    now := dsc.failedPodsBackoff.Clock.Now()
    for _, node := range nodeList {
        shouldRun, _ := NodeShouldRunDaemonPod(node, ds)
        // ...

In this case, by updating the DaemonSet without the toleration, the 3 tainted masters are no longer targets of the DaemonSet, and the 3 daemon pods on them are no longer managed by it.
What happens to those 3 pods depends on the masters' taints (NoSchedule: they should remain, NoExecute: they should be deleted), and if we delete them manually, they should not come back.

You can see that the DaemonSet status reports desired=3, current=3 even though there are actually 6 daemon pods running.

So in this case, I think it is not a problem that the desired number of pods changes from 6 to 3 while 6 daemon pods keep running (the 3 pods on workers are managed by the DaemonSet, the others are not) after updating without the toleration.
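
As a rough illustration of that mismatch (field names as in the DaemonSet status code shown above; the numbers assume the 3-master/3-worker example and are not taken from the report), the status would look roughly like:

status:
  desiredNumberScheduled: 3   # only the 3 worker nodes still match the new template
  currentNumberScheduled: 3
  # the ready/up-to-date/available counters converge to 3 once the rollout completes;
  # the 3 old pods still running on the masters are not reflected in these counters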

You can see a similar situation if you update the DaemonSet with the OnDelete updateStrategy (see the sketch after this list):

  • Create new DaemonSet
  • It should be that:
    • desired: 6
    • running: 6
  • Update DaemonSet by OnDelete updateStrategy without tolerations
  • It should be that:
    • desired: 3
    • running: 6
  • If you delete one pod on master, it will be deleted and not come back
    • desired: 3
    • running: 5
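
For completeness, the OnDelete variant used in the comparison above differs only in the updateStrategy stanza (a sketch, not taken from the report):

spec:
  updateStrategy:
    type: OnDelete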

@atiratree
Member

In my opinion, we should remove the old pods from the old revision when the NoSchedule tolerations or taints change. I think the rollout aspect and the consistency of the DS take precedence over the scheduling aspect. We are interested in running the new revision, not in keeping old revisions running on currently unschedulable nodes. This applies to the rollingUpdate strategy, which should automate the transition between revisions and not bother users with manual removal of pods, as compared to the onDelete strategy.

Reproducibility/predictability is also an issue: if you delete the updated DS and create the same DS again, you should end up with the same state of the cluster.

Please also take a look at #119317 (review)

@mochizuki875
Member

mochizuki875 commented Jul 27, 2023

I think the rollout aspect and the consistency of the DS take precedence over the scheduling aspect

Reproducibility/predictability is also an issue: if you delete the updated DS and create the same DS again, you should end up with the same state of the cluster.

Thank you for your comment.
OK, understood :)
I'll fix my PR!
