
Update coredns rolling update strategy #10748

Merged

Conversation

tu1h
Member

@tu1h tu1h commented Dec 27, 2023

What type of PR is this?
/kind bug

What this PR does / why we need it:
In a two-node cluster (1 master, 1 worker), upgrading the cluster (e.g. from 1.26.0 to 1.26.7) eventually leaves a coredns pod stuck in the Pending state.

$ kubectl get -n kube-system pod | grep coredns
coredns-5ffbf89ff4-wpsl8                   0/1     Pending   0             35m
coredns-d77484b69-lwnc8                    1/1     Running   0             34m
coredns-f8cb45497-ptwn7                    1/1     Running   1 (35m ago)   36m
$ kubectl get -n kube-system rs | grep coredns
coredns-5ffbf89ff4                   1         1         0       36m
coredns-d77484b69                    1         1         1       53m
coredns-f8cb45497                    1         1         1       37m

This can be solved by increasing maxUnavailable; setting it to 1 is reasonable.
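
For reference, a minimal sketch of the resulting rolling update strategy in the coredns Deployment; the maxSurge value of 10% is taken from the describe output later in this thread, and the exact kubespray template may differ:

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1    # previously 0, which blocked the rollout on small clusters
      maxSurge: 10%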

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Set the `maxUnavailable` of the coredns rolling update strategy to 1

Signed-off-by: tu1h <lihai.tu@daocloud.io>
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 27, 2023
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Dec 27, 2023
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 28, 2023
@yankay yankay added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Dec 28, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tu1h, yankay

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@VannTen
Contributor

VannTen commented Dec 29, 2023

No objection to aligning ourselves with other coredns distributions, but I do have a question:
Could you clarify what's happening with those pending pods (maybe share the describe output)? Is that related to the dns-autoscaler?

Also, please complete the release note.

@tu1h
Member Author

tu1h commented Jan 2, 2024

I think it's caused by maxUnavailable being equal to 0. There are only two nodes in this cluster, and the upgrade cordons each node in turn, which evicts all the pods on the current node. Since coredns has a pod anti-affinity rule and a maxUnavailable of 0 for rolling updates, the new pod cannot be scheduled during the upgrade.
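
To illustrate, here is a simplified sketch of the two settings that conflict; the label selector and topology key are the usual coredns values and are assumptions about the exact kubespray manifest:

spec:
  strategy:
    rollingUpdate:
      maxUnavailable: 0      # old value: no existing replica may be taken down first
      maxSurge: 10%
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                k8s-app: kube-dns
            topologyKey: kubernetes.io/hostname   # at most one coredns pod per node

With two nodes and two running replicas, the anti-affinity leaves no node for the surge pod, and maxUnavailable: 0 forbids removing an old replica first, so the new pod stays Pending.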

Some details observed during the upgrade are described below.

Before upgrade:

$ kubectl logs -n kube-system dns-autoscaler-56dd9d6549-7ksd5
I0102 07:27:34.035819       1 autoscaler.go:49] Scaling Namespace: kube-system, Target: deployment/coredns
I0102 07:27:34.292225       1 autoscaler_server.go:157] ConfigMap not found: configmaps "dns-autoscaler" not found, will create one with default params
I0102 07:27:34.309545       1 k8sclient.go:148] Created ConfigMap dns-autoscaler in namespace kube-system
I0102 07:27:34.309609       1 plugin.go:50] Set control mode to linear
I0102 07:27:34.309641       1 linear_controller.go:60] ConfigMap version change (old:  new: 976) - rebuilding params
I0102 07:27:34.309658       1 linear_controller.go:61] Params from apiserver:
{"coresPerReplica":256,"min":2,"nodesPerReplica":16,"preventSinglePointFailure":true}
I0102 07:27:34.319510       1 k8sclient.go:273] Cluster status: SchedulableNodes[2], SchedulableCores[20]
I0102 07:27:34.319585       1 k8sclient.go:274] Replicas are not as expected : updating replicas from 1 to 2

After:

$ kubectl logs -n kube-system dns-autoscaler-56dd9d6549-ntc4w -f
I0102 07:49:31.121668       1 autoscaler.go:49] Scaling Namespace: kube-system, Target: deployment/coredns
I0102 07:49:31.377089       1 plugin.go:50] Set control mode to linear
I0102 07:49:31.377185       1 linear_controller.go:60] ConfigMap version change (old:  new: 976) - rebuilding params
I0102 07:49:31.377206       1 linear_controller.go:61] Params from apiserver:
{"coresPerReplica":256,"min":2,"nodesPerReplica":16,"preventSinglePointFailure":true}
$ kubectl describe deployments.apps coredns
..................
Replicas:               2 desired | 2 updated | 3 total | 2 available | 1 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  0 max unavailable, 10% max surge
..................
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    ReplicaSetUpdated
OldReplicaSets:  coredns-f8cb45497 (1/1 replicas created)
NewReplicaSet:   coredns-5ffbf89ff4 (2/2 replicas created)
Events:
  Type    Reason             Age    From                   Message
  ----    ------             ----   ----                   -------
  Normal  ScalingReplicaSet  32m    deployment-controller  Scaled up replica set coredns-d77484b69 to 1
  Normal  ScalingReplicaSet  32m    deployment-controller  Scaled up replica set coredns-d77484b69 to 2 from 1
  Normal  ScalingReplicaSet  12m    deployment-controller  Scaled up replica set coredns-f8cb45497 to 1
  Normal  ScalingReplicaSet  12m    deployment-controller  Scaled down replica set coredns-d77484b69 to 1 from 2
  Normal  ScalingReplicaSet  12m    deployment-controller  Scaled up replica set coredns-f8cb45497 to 2 from 1
  Normal  ScalingReplicaSet  11m    deployment-controller  Scaled down replica set coredns-f8cb45497 to 1 from 2
  Normal  ScalingReplicaSet  11m    deployment-controller  Scaled up replica set coredns-5ffbf89ff4 to 1 from 0
  Normal  ScalingReplicaSet  5m17s  deployment-controller  Scaled down replica set coredns-d77484b69 to 0 from 1
  Normal  ScalingReplicaSet  5m17s  deployment-controller  Scaled up replica set coredns-5ffbf89ff4 to 2 from 1
$ kubectl get -n kube-system rs | grep coredns
coredns-5ffbf89ff4                   2         2         1       12m
coredns-d77484b69                    0         0         0       33m
coredns-f8cb45497                    1         1         1       13m
$ kubectl -n kube-system describe rs coredns-5ffbf89ff4
.............
Annotations:    deployment.kubernetes.io/desired-replicas: 2
                deployment.kubernetes.io/max-replicas: 3
                deployment.kubernetes.io/revision: 3
Controlled By:  Deployment/coredns
Replicas:       2 current / 2 desired
Pods Status:    1 Running / 1 Waiting / 0 Succeeded / 0 Failed
.............
Events:
  Type    Reason            Age    From                   Message
  ----    ------            ----   ----                   -------
  Normal  SuccessfulCreate  13m    replicaset-controller  Created pod: coredns-5ffbf89ff4-jjvg2
  Normal  SuccessfulCreate  6m32s  replicaset-controller  Created pod: coredns-5ffbf89ff4-q94sd
$ kubectl -n kube-system describe rs coredns-d77484b69
.............
Annotations:    deployment.kubernetes.io/desired-replicas: 2
                deployment.kubernetes.io/max-replicas: 3
                deployment.kubernetes.io/revision: 1
Controlled By:  Deployment/coredns
Replicas:       0 current / 0 desired
Pods Status:    0 Running / 0 Waiting / 0 Succeeded / 0 Failed
.............
Events:
  Type     Reason            Age                From                   Message
  ----     ------            ----               ----                   -------
  Warning  FailedCreate      34m (x7 over 34m)  replicaset-controller  Error creating: pods "coredns-d77484b69-" is forbidden: error looking up service account kube-system/coredns: serviceaccount "coredns" not found
  Normal   SuccessfulCreate  34m                replicaset-controller  Created pod: coredns-d77484b69-vjxmd
  Normal   SuccessfulCreate  34m                replicaset-controller  Created pod: coredns-d77484b69-qzl2x
  Normal   SuccessfulCreate  20m                replicaset-controller  Created pod: coredns-d77484b69-ngnn6
  Normal   SuccessfulDelete  14m                replicaset-controller  Deleted pod: coredns-d77484b69-ngnn6
  Normal   SuccessfulCreate  12m                replicaset-controller  Created pod: coredns-d77484b69-vfbtp
  Normal   SuccessfulDelete  7m4s               replicaset-controller  Deleted pod: coredns-d77484b69-vfbtp
$ kubectl -n kube-system describe rs coredns-f8cb45497
.............
Annotations:    deployment.kubernetes.io/desired-replicas: 2
                deployment.kubernetes.io/max-replicas: 3
                deployment.kubernetes.io/revision: 2
Controlled By:  Deployment/coredns
Replicas:       1 current / 1 desired
Pods Status:    1 Running / 0 Waiting / 0 Succeeded / 0 Failed
.............
Events:
  Type    Reason            Age   From                   Message
  ----    ------            ----  ----                   -------
  Normal  SuccessfulCreate  14m   replicaset-controller  Created pod: coredns-f8cb45497-rjjbv
  Normal  SuccessfulCreate  14m   replicaset-controller  Created pod: coredns-f8cb45497-tgnsk
  Normal  SuccessfulDelete  13m   replicaset-controller  Deleted pod: coredns-f8cb45497-rjjbv

@VannTen
Contributor

VannTen commented Jan 8, 2024

Thanks for the explanation!
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 8, 2024
@k8s-ci-robot k8s-ci-robot merged commit ddf5c6e into kubernetes-sigs:master Jan 8, 2024
65 checks passed
ledroide pushed a commit to ledroide/kubespray that referenced this pull request Jan 9, 2024
Signed-off-by: tu1h <lihai.tu@daocloud.io>
@yankay yankay mentioned this pull request Jan 12, 2024
pedro-peter pushed a commit to pedro-peter/kubespray that referenced this pull request May 8, 2024
Signed-off-by: tu1h <lihai.tu@daocloud.io>