Scale kube-dns to multiple nodes #2

Closed
jakolehm opened this issue Mar 2, 2018 · 7 comments

jakolehm (Contributor) commented Mar 2, 2018

No description provided.

jakolehm added the enhancement label Mar 2, 2018
jakolehm added this to the 0.3 milestone Mar 5, 2018

SpComb (Contributor) commented Mar 5, 2018

Relevant pod template spec parts from the kube-dns deployment:

spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: beta.kubernetes.io/arch
                operator: In
                values:
                - amd64
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoSchedule
        key: node-role.kubernetes.io/master

There don't seem to be any node selectors/affinities that would limit which nodes the DNS pods get scheduled onto... they presumably end up on the master node because that's the first node that happens to be available. It should just be a matter of PATCHing .spec.replicas on the existing /apis/extensions/v1beta1/namespaces/kube-system/deployments/kube-dns (or shelling out to kubectl scale).
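
As a minimal sketch (assuming kubectl access to the cluster; the replica count of 2 is just illustrative), either of these would do it:

# shell out to kubectl scale
kubectl -n kube-system scale deployment/kube-dns --replicas=2

# or PATCH .spec.replicas on the deployment directly
kubectl -n kube-system patch deployment kube-dns --type=merge -p '{"spec":{"replicas":2}}'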

jakolehm (Contributor, Author) commented Mar 5, 2018

I think it should have self anti-affinity, see: kubernetes/kubernetes#57683

SpComb (Contributor) commented Mar 5, 2018

FWIW that PR was reverted in kubernetes/kubernetes#59357 due to the scaling issues in kubernetes/kubernetes#54164.

Also, that PR did not touch the kube-dns manifest used by kubeadm: https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/phases/addons/dns/manifests.go

SpComb (Contributor) commented Mar 5, 2018

kubeadm will update (PUT) the kube-dns deployment on upgrades, which will presumably lose any replicas/affinity changes made to it... so with the kubeadm kube-dns deployment, we would need to PATCH / kubectl scale it again after every upgrade as well?

An alternative (kubernetes/kubernetes#40063 (comment)) is to use the horizontal DNS autoscaling controller, with either --default-params={"linear":{"min":2}} (kubernetes/kubernetes#40281) or --default-params={"linear":{"min":1, "preventSinglePointFailure": true}} (kubernetes-sigs/cluster-proportional-autoscaler#23).
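
For illustration, the autoscaler side of that would look something like the following container spec in a kube-dns-autoscaler deployment (a sketch only; the image tag, ConfigMap name and linear parameters here are assumptions, not taken from this thread):

# cluster-proportional-autoscaler watching node/core counts and scaling kube-dns
containers:
- name: autoscaler
  image: k8s.gcr.io/cluster-proportional-autoscaler-amd64:1.1.2
  command:
  - /cluster-proportional-autoscaler
  - --namespace=kube-system
  - --configmap=kube-dns-autoscaler
  - --target=Deployment/kube-dns
  - --default-params={"linear":{"min":1,"preventSinglePointFailure":true}}
  - --logtostderr=true
  - --v=2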

SpComb (Contributor) commented Mar 5, 2018

The .spec.template.spec.affinity.podAntiAffinity with preferredDuringSchedulingIgnoredDuringExecution seems to be flawed: if all of the nodes temporarily go down, all of the kube-dns pods might end up getting scheduled back onto the one master node.

Once the other nodes go back online, I don't see what would end up rescheduling the other pod off the master node.

terom@terom-kube-master:~$ kubectl get nodes
NAME                STATUS    ROLES     AGE       VERSION
terom-kube-master   Ready     master    5d        v1.9.2
terom-kube-node1    Ready     <none>    4d        v1.9.2
terom-kube-node2    Ready     <none>    2d        v1.9.3
terom@terom-kube-master:~$ kubectl -n kube-system get deployments/kube-dns -o json | jq .spec.template.spec.affinity.podAntiAffinity
{
  "preferredDuringSchedulingIgnoredDuringExecution": [
    {
      "podAffinityTerm": {
        "labelSelector": {
          "matchExpressions": [
            {
              "key": "k8s-app",
              "operator": "In",
              "values": [
                "kube-dns"
              ]
            }
          ]
        },
        "topologyKey": "kubernetes.io/hostname"
      },
      "weight": 100
    }
  ]
}
terom@terom-kube-master:~$ kubectl -n kube-system get pods -o wide --selector k8s-app=kube-dns
NAME                       READY     STATUS    RESTARTS   AGE       IP          NODE
kube-dns-d9ddc5479-fg9tj   3/3       Running   0          12m       10.40.0.2   terom-kube-master
kube-dns-d9ddc5479-hhdj8   3/3       Running   0          15m       10.40.0.1   terom-kube-master

SpComb (Contributor) commented Mar 6, 2018

The pragmatic approach to this issue would be to make the number of DNS replicas a configurable parameter (maybe default to something sensible based on the number of nodes in the config?), and then PATCH the kubeadm-managed deployments/kube-dns to add .spec.replicas and .spec.template.spec.affinity.podAntiAffinity.requiredDuringSchedulingIgnoredDuringExecution. Those would presumably just need to be re-PATCHed after every kubeadm init/upgrade run.
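
A minimal sketch of such a PATCH (assuming a strategic merge patch via kubectl; the replica count and the exact anti-affinity term are illustrative):

# patch the kubeadm-managed deployment with a replica count and required anti-affinity
kubectl -n kube-system patch deployment kube-dns --patch '
spec:
  replicas: 2
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: k8s-app
                operator: In
                values:
                - kube-dns
            topologyKey: kubernetes.io/hostname
'

With required anti-affinity, a second replica would stay Pending rather than double up on the master when only one node is schedulable.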

Long-term I think the best idea would be to replace the problematic kubeadm-managed deployment with a daemonset using node labels for the DNS addon, but that would require more work?
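
Purely as an illustration of that daemonset idea (nothing here is taken from kubeadm; the node label is hypothetical and the pod spec is elided):

# hypothetical sketch: run the DNS addon as a DaemonSet pinned to labeled nodes
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-dns
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: kube-dns
  template:
    metadata:
      labels:
        k8s-app: kube-dns
    spec:
      nodeSelector:
        node-role.example.com/dns: "true"   # made-up label; nodes would be labeled explicitly
      # ...containers/tolerations as in the existing kube-dns deployment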

jakolehm (Contributor, Author) commented Mar 6, 2018

The pragmatic approach to this issue would be to make the number of DNS replicas a configurable parameter (maybe default to something sensible based on the number of nodes in the config?)

Yes, I think this is the way to go (for now).

Long-term I think the best idea would be to replace the problematic kubeadm-managed deployment with a daemonset using node labels for the DNS addon, but that would require more work?

Long-term solutions probably require contributions to kubeadm (to make it less hacky)?
