
Calico's Typha pods should prefer to run on masters, but tolerate running elsewhere #9608

Closed
seh opened this issue Jul 21, 2020 · 12 comments · Fixed by #9609

Comments

@seh
Contributor

seh commented Jul 21, 2020

Following up from #9240 (comment), we'd like to run more Calico Typha pods than we have master nodes, for clusters with 600 nodes or more. Unfortunately, we can't get more Typha pods scheduled than we have master nodes, even when we have other non-master nodes that could host Typha as well.

  1. What kops version are you running?
    Version 1.18.0-beta.2 (git-63f9ae1099)

  2. What Kubernetes version are you running?

  • Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-16T00:04:31Z", GoVersion:"go1.14.4", Compiler:"gc", Platform:"darwin/amd64"}
  • Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:51:04Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
  3. What cloud provider are you using?
    AWS EC2

  4. What commands did you run? What is the simplest way to reproduce this issue?
     Set typhaReplicas in the cluster spec:

     networking:
       calico:
         typhaReplicas: 4
  • Observe that three of the pods created by the Deployment's ReplicaSet get scheduled—one per master node—but the fourth one remains pending.
  • Observe that the pending pod's spec contains a "spec.nodeSelector" field.
spec:
  # ...
  nodeSelector:
    kubernetes.io/os: linux
    kubernetes.io/role: master
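
A rough way to make the observations above, assuming Typha's pods carry their usual k8s-app=calico-typha label in kube-system:

kubectl --namespace kube-system get pods --selector k8s-app=calico-typha --output wide
kubectl --namespace kube-system get pod <name-of-pending-typha-pod> --output yaml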
  5. What happened after the commands executed?
     One of the Typha pods remains pending, not placed by the Kubernetes scheduler, because each node can host at most one such pod and there are fewer master nodes than Typha pods.

  6. What did you expect to happen?
     The first three Typha pods would be scheduled on the master nodes, but the fourth Typha pod would be scheduled on a different, non-master node.

  7. Anything else we need to know?
    Using a node affinity rule in the Typha Deployment's pod spec template would allow us to express a preference for Typha running on a master node, but not a strict requirement. In this case, we'd use a preferredDuringSchedulingIgnoredDuringExecution entry with a match expression adapted from the existing node selector, and we'd remove the node selector.

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values:
            - linux
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: kubernetes.io/role
            operator: In
            values:
            - master
        weight: 75
@hakman
Member

hakman commented Jul 21, 2020

@KashifSaadat what do you think?

@seh
Contributor Author

seh commented Jul 21, 2020

Note that we should probably also accommodate the more modern node label used by kubeadm: node-role.kubernetes.io/master, just mandating that it exists, its value being immaterial.
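
As a sketch of what that could look like, the kubeadm-style label could be matched with an Exists operator in an additional preference term (the weight here is arbitrary, not necessarily what kops would ship):

spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: node-role.kubernetes.io/master
            operator: Exists
        weight: 75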

@seh
Contributor Author

seh commented Jul 23, 2020

Thank you for the fast resolution, @hakman!

@KashifSaadat
Contributor

Sorry for the delayed response! I ran into a similar issue, so I'm happy with the approach taken and the fix that has been merged in. 👍

@hakman
Member

hakman commented Jul 27, 2020

@KashifSaadat should I add this change to the Canal manifest as well? At the moment it is only in Calico.

@KashifSaadat
Contributor

Yes please, that would be great 👌

@tmjd
Contributor

tmjd commented Aug 18, 2020

What is the reason or use case for nodeSelect'ing or setting affinity to the master nodes for Typha?

@seh
Contributor Author

seh commented Aug 18, 2020

In clusters with a lot of worker node churn—especially with the cluster autoscaler dropping ASGs down to zero instances—placing Typha on the masters keeps the pods running consistently without holding up worker nodes unnecessarily.
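
For completeness, landing on the masters also assumes the pods tolerate the master taint; a minimal sketch, assuming the usual node-role.kubernetes.io/master:NoSchedule taint on kops masters:

tolerations:
- key: node-role.kubernetes.io/master
  operator: Exists
  effect: NoSchedule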

@tmjd
Contributor

tmjd commented Aug 19, 2020

What do you mean by "without holding up worker nodes unnecessarily"?

To clear up a possible point of confusion here, Typha temporarily being unavailable would not disrupt any pod traffic. The only disruption would be that new policy changes would be delayed until a new Typha pod was running.

@seh
Contributor Author

seh commented Aug 19, 2020

I've seen cases where the cluster autoscaler can't drop a worker node because a Typha pod is running on it. That Typha pod could run on one of the master nodes that are not subject to the autoscaler's control. The safe-to-evict annotation didn't work as expected. We wound up paying to run that worker node unnecessarily.

Ideally we'd scale the Typha Deployment with something like the horizontal proportional autoscaler based on the number of nodes in the cluster.
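
For context, that could look roughly like the cluster-proportional-autoscaler's linear mode pointed at the Typha Deployment; the name and ratios below are illustrative only, not something kops ships:

apiVersion: v1
kind: ConfigMap
metadata:
  name: calico-typha-autoscaler   # hypothetical name, for illustration
  namespace: kube-system
data:
  linear: |-
    {
      "nodesPerReplica": 200,
      "min": 3,
      "max": 10,
      "preventSinglePointFailure": true
    }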

@tmjd
Contributor

tmjd commented Aug 20, 2020

It sounds like this is just working around the safe-to-evict annotation not working.

Was there a bug submitted for kops about the safe-to-evict annotation not working? I tried searching for one and didn't find one that sounded like what you described.

@seh
Contributor Author

seh commented Aug 20, 2020

No, not quite. We don't want Typha running on worker nodes that are less likely to stay up than the master nodes. We used a preference in our node affinity. We prefer but don't require that Typha run on the masters. Since you can only run one Typha pod per node anyway, once you need more than three Typha pods, you may—but probably don't—want more than three masters. At that point, placing them on worker nodes is fine.

Regarding filing a defect, I wasn't using kops when I saw that problem with the cluster autoscaler failing to evict Typha's pods. I recall the problem having to do with the difference between the "critical" annotations and running the Typha pods at a high enough priority. At some point Tigera moved from using that deprecated annotation (maybe "scheduler.alpha.kubernetes.io/critical-pod") to using the "system-cluster-critical" priority class and the toleration for the "CriticalAddonsOnly" taint. Eviction worked before then, but I think the pod priority may be higher than the autoscaler will tolerate for its default eviction configuration. It's been about ten months since I last looked into it.
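
For reference, the fields in question look roughly like this in a Deployment's pod template; this is a sketch of the pattern only, not the exact manifest Tigera or kops ships:

spec:
  template:
    metadata:
      annotations:
        # Opt-in hint for the cluster autoscaler, as mentioned above.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      priorityClassName: system-cluster-critical
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists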
