DaemonSet fails to scale down during the rolling update when maxUnavailable=0 #118823

Closed
alebedev87 opened this issue Jun 23, 2023 · 12 comments · Fixed by #119317
Assignees
Labels
kind/bug: Categorizes issue or PR as related to a bug.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
sig/apps: Categorizes an issue or PR as relevant to SIG Apps.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@alebedev87

alebedev87 commented Jun 23, 2023

What happened?

I have a DaemonSet with the rolling update strategy. The maxSurge parameter is set to a non-zero value, which requires the maxUnavailable parameter to be 0. When I replace the toleration in the DaemonSet's pod template spec (the one that allowed the pods to be scheduled on the master nodes) with any other toleration, I see the new pods still trying to be scheduled on the master nodes and getting stuck in Pending. The old pods on nodes that are still tolerated may get recreated, but only if the rollout reaches them before a pod on a no-longer-tolerated node.

$ kubectl -n openshift-dns get ds dns-default -o yaml | yq .spec.updateStrategy
rollingUpdate:
  maxSurge: 10%
  maxUnavailable: 0
type: RollingUpdate

$ kubectl -n openshift-dns get ds dns-default -o yaml | yq .spec.template.spec.tolerations
- key: node-role.kubernetes.io/master
  operator: Exists
  
$ kubectl -n openshift-dns get pods  -o wide | grep dns-default
dns-default-6bfmf     2/2     Running   0          119m    10.129.0.40   ci-ln-sb5ply2-72292-qlhc8-master-2         <none>           <none>
dns-default-9cjdf     2/2     Running   0          2m35s   10.129.2.15   ci-ln-sb5ply2-72292-qlhc8-worker-c-m5wzq   <none>           <none>
dns-default-c6j9x     2/2     Running   0          119m    10.128.0.13   ci-ln-sb5ply2-72292-qlhc8-master-0         <none>           <none>
dns-default-fhqrs     2/2     Running   0          2m12s   10.131.0.29   ci-ln-sb5ply2-72292-qlhc8-worker-a-6q7hs   <none>           <none>
dns-default-lx2nf     2/2     Running   0          119m    10.130.0.15   ci-ln-sb5ply2-72292-qlhc8-master-1         <none>           <none>
dns-default-mmc78     2/2     Running   0          112m    10.128.2.7    ci-ln-sb5ply2-72292-qlhc8-worker-b-bpjdk   <none>           <none>

$ kubectl -n openshift-dns get ds dns-default -o yaml | yq .spec.template.spec.tolerations
- key: test-taint
  operator: Exists
  
$ kubectl -n openshift-dns get ds dns-default
NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR
dns-default   3         3         3       1            3           kubernetes.io/os=linux
  
$ kubectl -n openshift-dns get pods  -o wide | grep dns-default
dns-default-6bfmf     2/2     Running   0          124m    10.129.0.40   ci-ln-sb5ply2-72292-qlhc8-master-2         <none>           <none>
dns-default-76vjz     0/2     Pending   0          3m2s    <none>        <none>                                     <none>           <none>
dns-default-9cjdf     2/2     Running   0          7m24s   10.129.2.15   ci-ln-sb5ply2-72292-qlhc8-worker-c-m5wzq   <none>           <none>
dns-default-c6j9x     2/2     Running   0          124m    10.128.0.13   ci-ln-sb5ply2-72292-qlhc8-master-0         <none>           <none>
dns-default-fhqrs     2/2     Running   0          7m1s    10.131.0.29   ci-ln-sb5ply2-72292-qlhc8-worker-a-6q7hs   <none>           <none>
dns-default-lx2nf     2/2     Running   0          124m    10.130.0.15   ci-ln-sb5ply2-72292-qlhc8-master-1         <none>           <none>
dns-default-mmc78     2/2     Running   0          117m    10.128.2.7    ci-ln-sb5ply2-72292-qlhc8-worker-b-bpjdk   <none>           <none>

$ kubectl -n openshift-dns get pods dns-default-76vjz -o yaml | yq .spec.affinity.nodeAffinity
requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
    - matchFields:
        - key: metadata.name
          operator: In
          values:
            - ci-ln-sb5ply2-72292-qlhc8-master-0

What did you expect to happen?

I don't expect the new pods to be scheduled on nodes whose taints are no longer tolerated by the new DaemonSet template spec. The daemonset controller should just delete the old pods from the nodes which cannot be tolerated anymore. The old pods on the nodes which can still be tolerated should be recreated according to the rolling update parameters.

How can we reproduce it (as minimally and precisely as possible)?

On a Kubernetes cluster with 3 or more master nodes and 3 or more worker nodes (2 and 2 may be enough too, I guess):

  • Create a DaemonSet which (see the manifest sketch after this list)
    • has the RollingUpdate strategy with maxSurge=10% and maxUnavailable=0
    • has a toleration for the master node role taint: node-role.kubernetes.io/master
  • Wait until all the pods become ready
  • Update the DaemonSet, replacing the node-role.kubernetes.io/master toleration with any other (even a non-existent one)
  • Observe the new pod trying to be scheduled on a master node and getting stuck in the Pending state
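For illustration, a DaemonSet along these lines should reproduce the issue. This is a minimal sketch; the name, namespace, labels and image are placeholders, not taken from the original report:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: repro-ds              # placeholder name
  namespace: default
spec:
  selector:
    matchLabels:
      app: repro-ds
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 10%
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: repro-ds
    spec:
      tolerations:
        # initially tolerate the master taint so pods land on the master nodes
        - key: node-role.kubernetes.io/master
          operator: Exists
        # then update the DaemonSet, swapping the toleration above for e.g.:
        # - key: test-taint
        #   operator: Exists
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9   # placeholder image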

Anything else we need to know?

No response

Kubernetes version

Kubernetes Version: v1.26.5+7d22122
Kubernetes Version: v1.27.2+b451817

Cloud provider

Doesn't matter.

OS version

Doesn't matter.

Install tools

Container runtime (CRI) and version (if applicable)

CRI-O

Related plugins (CNI, CSI, ...) and versions (if applicable)

Doesn't matter.
@alebedev87 alebedev87 added the kind/bug Categorizes issue or PR as related to a bug. label Jun 23, 2023
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 23, 2023
@alebedev87
Author

/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 23, 2023
@olderTaoist
Contributor

The new pod is expected to stay Pending when you change to any other toleration, because the master nodes carry the node-role.kubernetes.io/master taint. As for why the old pod on the master node gets updated at all, that probably requires looking at how the DaemonSet controller is implemented.

@alebedev87
Author

@olderTaoist: Sorry, I don't quite understand what you are trying to say. Can you please rephrase it?

@atiratree
Member

This seems like a bug in the DS surge logic: the old pod is not being removed because the taint is NoSchedule only, and the surge logic presumes the existing pods can be surged over without re-checking the scheduling constraints.

@atiratree
Member

/remove-sig scheduling
/sig apps
/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Jun 28, 2023
@mochizuki875
Member

I'll try to fix it.

/assign

@mochizuki875
Member

mochizuki875 commented Jul 11, 2023

Hi, @alebedev87
I'm now working on this and I have one question.

You expect that:

The daemonset controller should just delete the old pods from the nodes which cannot be tolerated anymore.

However, if the master Nodes have the taint below (the default setting in Kubernetes), and the DaemonSet being updated has no toleration for it, what do you expect to happen?

 taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master

I think the old Pod should remain, neither updated nor deleted, according to the Taints and Tolerations mechanism.

For example:

Before Update

pod_on_master01_v1
pod_on_master02_v1
pod_on_master03_v1
pod_on_node01_v1
pod_on_node02_v1
pod_on_node03_v1

↓ Update DaemonSet

After Update

pod_on_master01_v1
pod_on_master02_v1
pod_on_master03_v1
pod_on_node01_v2
pod_on_node02_v2
pod_on_node03_v2

Going by the Taints and Tolerations section of the Kubernetes documentation, if the master Nodes have the taints below (including a NoExecute taint), and the DaemonSet being updated has no toleration for them, I think the old Pod should be deleted.

 taints:
  - effect: NoExecute
    key: node-role.kubernetes.io/control-plane
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane

For example:

Before Update

pod_on_master01_v1
pod_on_master02_v1
pod_on_master03_v1
pod_on_node01_v1
pod_on_node02_v1
pod_on_node03_v1

↓ Update DaemonSet

After Update

pod_on_node01_v2
pod_on_node02_v2
pod_on_node03_v2

What do you think about that?

@mochizuki875
Member

I've created #119317 according to the above.

@alebedev87
Author

@mochizuki875: I think I'm fine with the approach to keep the old pods on the master nodes if the master nodes don't have the NoExecute taint.
The only doubt I have is about the status of the DaemonSet. I noticed that it changes according to the scheduling options, including the tolerations. You can see in the description that the number of desired pods changes to 3 (only the workers). So, according to your proposal, we will see a difference between the status and the number of pods actually running (3 vs 6 if we take the example from the description).

@mochizuki875
Member

Hi @alebedev87

I think I'm fine with the approach to keep the old pods on the master nodes if the master nodes don't have the NoExecute taint.

Thanks for confirming.

The only doubt I have is about the status of the DaemonSet. I noticed that it changes according to the scheduling options, including the tolerations. You can see in the description that the number of desired pods changes to 3 (only the workers). So, according to your proposal, we will see a difference between the status and the number of pods actually running (3 vs 6 if we take the example from the description).

I've noticed that too.
In conclusion, I think the difference is not a problem, because:

  • all nodes which don't match the scheduling constraints should be excluded from the DaemonSet's targets
  • daemon pods running on excluded nodes should no longer be managed by the DaemonSet

FYI: The implementation of updating DaemonSet status is as below.

func (dsc *DaemonSetsController) updateDaemonSetStatus(ctx context.Context, ds *apps.DaemonSet, nodeList []*v1.Node, hash string, updateObservedGen bool) error {
    logger := klog.FromContext(ctx)
    logger.V(4).Info("Updating daemon set status")
    nodeToDaemonPods, err := dsc.getNodesToDaemonPods(ctx, ds, false)
    if err != nil {
        return fmt.Errorf("couldn't get node to daemon pod mapping for daemon set %q: %v", ds.Name, err)
    }

    var desiredNumberScheduled, currentNumberScheduled, numberMisscheduled, numberReady, updatedNumberScheduled, numberAvailable int
    now := dsc.failedPodsBackoff.Clock.Now()
    for _, node := range nodeList {
        shouldRun, _ := NodeShouldRunDaemonPod(node, ds)
        // ...

In this case, by updating the DaemonSet without the toleration, the 3 tainted masters are no longer targets of the DaemonSet, and the 3 daemon pods on them are no longer managed by it.
What happens to those 3 pods depends on the masters' taints (NoSchedule: they should remain, NoExecute: they should be deleted), and if we delete them manually, they should not come back.

You can see that the DaemonSet status reports desired=3, current=3 even though there are actually 6 daemon pods running.

So in this case, I think it is not a problem that the desired number of pods changes from 6 to 3 while 6 daemon pods keep running (the 3 pods on workers are managed by the DaemonSet, the others are not) after updating without the toleration.
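
As a rough illustration of that mismatch (field names as in the DaemonSet status code shown above; the numbers assume the 3-master/3-worker example and are not taken from the report), the status would look roughly like:

status:
  desiredNumberScheduled: 3   # only the 3 worker nodes still match the new template
  currentNumberScheduled: 3
  # the ready/up-to-date/available counters converge to 3 once the rollout completes;
  # the 3 old pods still running on the masters are not reflected in these counters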

You can see a similar situation if you update the DaemonSet with the OnDelete updateStrategy (see the sketch after this list):

  • Create new DaemonSet
  • It should be that:
    • desired: 6
    • running: 6
  • Update DaemonSet by OnDelete updateStrategy without tolerations
  • It should be that:
    • desired: 3
    • running: 6
  • If you delete one pod on master, it will be deleted and not come back
    • desired: 3
    • running: 5
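
For completeness, the OnDelete variant used in the comparison above differs only in the updateStrategy stanza (a sketch, not taken from the report):

spec:
  updateStrategy:
    type: OnDelete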

@atiratree
Member

In my opinion, we should remove the old pods from the old revision when the NoSchedule tolerations or taints change. I think the rollout aspect and the consistency of the DS take precedence over the scheduling aspect. We are interested in running the new revision, not in keeping old revisions running on currently unschedulable nodes. This applies to the rollingUpdate strategy, which should automate the transition between revisions and not bother users with manual removal of pods, as compared to the onDelete strategy.

Reproducibility/predictability is also an issue: if you delete the updated DS and create the same DS again, you should end up with the same state of the cluster.

Please also take a look at #119317 (review)

@mochizuki875
Member

mochizuki875 commented Jul 27, 2023

I think the rollout aspect and the consistency of the DS take precedence over the scheduling aspect

Reproducibility/predictability is also an issue: if you delete the updated DS and create the same DS again, you should end up with the same state of the cluster.

Thank you for your comment.
OK, understood :)
I'll fix my PR!
