
Calico's Typha pods should prefer to run on masters, but tolerate running elsewhere #9608

Closed
seh opened this issue Jul 21, 2020 · 12 comments · Fixed by #9609

Comments

@seh
Contributor

seh commented Jul 21, 2020

Following up from #9240 (comment), we'd like to run more Calico Typha pods than we have master nodes, for clusters with 600 nodes or more. Unfortunately, we can't get more Typha pods scheduled than we have master nodes, even when we have other non-master nodes that could host Typha as well.

  1. What kops version are you running?
    Version 1.18.0-beta.2 (git-63f9ae1099)

  2. What Kubernetes version are you running?

  • Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-16T00:04:31Z", GoVersion:"go1.14.4", Compiler:"gc", Platform:"darwin/amd64"}
  • Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:51:04Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
  3. What cloud provider are you using?
    AWS EC2

  4. What commands did you run? What is the simplest way to reproduce this issue?
     Set typhaReplicas in the cluster spec:

     networking:
       calico:
         typhaReplicas: 4
  • Observe that three of the pods created by the Deployment's ReplicaSet get scheduled—one per master node—but the fourth one remains pending.
  • Observe that the pending pod's spec contains a "spec.nodeSelector" field.
spec:
  # ...
  nodeSelector:
    kubernetes.io/os: linux
    kubernetes.io/role: master
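
A rough way to make the observations above, assuming Typha's pods carry their usual k8s-app=calico-typha label in kube-system:

kubectl --namespace kube-system get pods --selector k8s-app=calico-typha --output wide
kubectl --namespace kube-system get pod <name-of-pending-typha-pod> --output yaml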
  5. What happened after the commands executed?
     One of the Typha pods remains pending, not placed by the Kubernetes scheduler, because each node can host at most one such pod and there are fewer master nodes than Typha pods.

  6. What did you expect to happen?
     The first three Typha pods would be scheduled on the master nodes, but the fourth Typha pod would be scheduled on a different, non-master node.

  7. Anything else we need to know?
    Using a node affinity rule in the Typha Deployment's pod spec template would allow us to express a preference for Typha running on a master node, but not a strict requirement. In this case, we'd use a preferredDuringSchedulingIgnoredDuringExecution entry with a match expression adapted from the existing node selector, and we'd remove the node selector.

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values:
            - linux
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: kubernetes.io/role
            operator: In
            values:
            - master
        weight: 75
@hakman
Member

hakman commented Jul 21, 2020

@KashifSaadat what do you think?

@seh
Contributor Author

seh commented Jul 21, 2020

Note that we should probably also accommodate the more modern node label used by kubeadm: node-role.kubernetes.io/master, just mandating that it exists, its value being immaterial.
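
As a sketch of what that could look like, the kubeadm-style label could be matched with an Exists operator in an additional preference term (the weight here is arbitrary, not necessarily what kops would ship):

spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: node-role.kubernetes.io/master
            operator: Exists
        weight: 75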

@seh
Contributor Author

seh commented Jul 23, 2020

Thank you for the fast resolution, @hakman!

@KashifSaadat
Contributor

Sorry for the delayed response! I ran into a similar issue, so I'm happy with the approach taken and the fix that has been merged in. 👍

@hakman
Member

hakman commented Jul 27, 2020

@KashifSaadat should I add this change to the Canal manifest as well? At the moment it is only in Calico.

@KashifSaadat
Contributor

Yes please, that would be great 👌

@tmjd
Contributor

tmjd commented Aug 18, 2020

What is the reason or use case for nodeSelect'ing or setting affinity to the master nodes for Typha?

@seh
Contributor Author

seh commented Aug 18, 2020

In clusters with a lot of worker node churn—especially with the cluster autoscaler dropping ASGs down to zero instances—placing Typha on the masters keeps the pods running consistently without holding up worker nodes unnecessarily.
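
For completeness, landing on the masters also assumes the pods tolerate the master taint; a minimal sketch, assuming the usual node-role.kubernetes.io/master:NoSchedule taint on kops masters:

tolerations:
- key: node-role.kubernetes.io/master
  operator: Exists
  effect: NoSchedule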

@tmjd
Contributor

tmjd commented Aug 19, 2020

What do you mean by "without holding up worker nodes unnecessarily"?

To clear up a possible point of confusion here, Typha temporarily being unavailable would not disrupt any pod traffic. The only disruption would be that new policy changes would be delayed until a new Typha pod was running.

@seh
Contributor Author

seh commented Aug 19, 2020

I've seen cases where the cluster autoscaler can't drop a worker node because a Typha pod is running on it. That Typha pod could run on one of the master nodes that are not subject to the autoscaler's control. The safe-to-evict annotation didn't work as expected. We wound up paying to run that worker node unnecessarily.

Ideally we'd scale the Typha Deployment with something like the horizontal proportional autoscaler based on the number of nodes in the cluster.
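
For context, that could look roughly like the cluster-proportional-autoscaler's linear mode pointed at the Typha Deployment; the name and ratios below are illustrative only, not something kops ships:

apiVersion: v1
kind: ConfigMap
metadata:
  name: calico-typha-autoscaler   # hypothetical name, for illustration
  namespace: kube-system
data:
  linear: |-
    {
      "nodesPerReplica": 200,
      "min": 3,
      "max": 10,
      "preventSinglePointFailure": true
    }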

@tmjd
Contributor

tmjd commented Aug 20, 2020

It sounds like this is just working around the safe-to-evict annotation not working.

Was there a bug submitted for kops about the safe-to-evict annotation not working? I tried searching for one and didn't find one that sounded like what you described.

@seh
Contributor Author

seh commented Aug 20, 2020

No, not quite. We don't want Typha running on worker nodes that are less likely to stay up than the master nodes. We used a preference in our node affinity. We prefer but don't require that Typha run on the masters. Since you can only run one Typha pod per node anyway, once you need more than three Typha pods, you may—but probably don't—want more than three masters. At that point, placing them on worker nodes is fine.

Regarding filing a defect, I wasn't using kops when I saw that problem with the cluster autoscaler failing to evict Typha's pods. I recall the problem having to do with the difference between the "critical" annotations and running the Typha pods at a high enough priority. At some point Tigera moved from using that deprecated annotation (maybe "scheduler.alpha.kubernetes.io/critical-pod") to using the "system-cluster-critical" priority class and the toleration for the "CriticalAddonsOnly" taint. Eviction worked before then, but I think the pod priority may be higher than the autoscaler will tolerate for its default eviction configuration. It's been about ten months since I last looked into it.
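
For reference, the fields in question look roughly like this in a Deployment's pod template; this is a sketch of the pattern only, not the exact manifest Tigera or kops ships:

spec:
  template:
    metadata:
      annotations:
        # Opt-in hint for the cluster autoscaler, as mentioned above.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      priorityClassName: system-cluster-critical
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists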
