add anti affinity to virt pods #2089

Merged
merged 2 commits into kubevirt:master from the feature/virt-pods-affinity branch on Apr 11, 2019

Conversation

@ksimon1 (Member) commented Mar 6, 2019

What this PR does / why we need it:
This PR adds pod antiAffinity to the virt pods (virt-api, virt-controller).
When a cluster has 2 or more nodes and one of them dies, the virt pods from the dead node will not be rescheduled onto a node that already runs the same virt pod; e.g. a replacement virt-api will not be scheduled onto a node that already has a virt-api pod and will stay in Pending status instead.
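For illustration, the kind of anti-affinity stanza this adds to a pod template looks roughly like the following (a minimal sketch; the kubevirt.io: virt-api label and the hostname topology key are illustrative assumptions, not necessarily the exact values the operator sets):

  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            kubevirt.io: virt-api           # assumed label identifying virt-api pods
        topologyKey: kubernetes.io/hostname # at most one matching pod per node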
@rmohr, @davidvossel, @petrkotas, @stu-gott, @fabiand please review.

Which issue(s) this PR fixes:
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1671511

Special notes for your reviewer:

Release note:

Add pod antiAffinity to virt pods (virt-api, virt-controller)

@kubevirt-bot added the release-note and size/M labels on Mar 6, 2019
@rmohr (Member) left a comment

@ksimon1 looks good as a start. However, since no one seems to like my suggestion to use daemonsets (which would solve a lot of these problems; see #975 for the background), you will have to make the operator node-aware now. Otherwise it will happily spawn two virt-controller instances where one can't be scheduled and will never become ready. Another option is PreferredDuringSchedulingIgnoredDuringExecution. That would be fine for me in this PR, but it does not solve the underlying problem.

@ksimon1 force-pushed the feature/virt-pods-affinity branch from bc5b21a to b6a2f3e on March 6, 2019 08:28
@MarSik (Contributor) commented Mar 8, 2019

@rmohr should the operator be changed to report good status when at least one api and controller is up? Or would you prefer the DaemonSet with tolerations instead?

@ksimon1 followed the decision from the bugzilla (@stu-gott and @fabiand). Your issue #975 is a relevant source of information though. Pity it was not mentioned in the bug as well.

@ksimon1 (Member, Author) commented Mar 8, 2019

@MarSik I talked with @stu-gott and he said to update this work to something like: the operator should watch the nodes and, if numberOfNodes < numberOfVirtPods, limit the number of virt pods.

@stu-gott (Member) commented Mar 8, 2019

The trouble with using PreferredDuringSchedulingIgnoredDuringExecution is that it doesn't prevent multiple virt-controller or virt-api pods from ending up on the same node, which completely defeats the point of the PR. That means a node-aware operator would be needed if we're going to continue using deployments.

@rmohr did bring up a very good point in the Community Meeting this week: using a Deployment also schedules all pods in the same zone. In the case of losing an entire zone, this can lead to a delay as the pods are brought up in a different zone.

Overall I feel like a deployment is still the better strategy in the sense that there's really no need for a pod per node, but that's just my opinion.

@rmohr (Member) commented Mar 8, 2019

Overall I feel like a deployment is still the better strategy in the sense that there's really no need for a pod per node, but that's just my opinion.

It is a pod per master node. We would run on exactly the same nodes as the apiservers.

In the case of losing an entire zone, this can lead to a delay as the pods are brought up in a different zone.

Yes, just think about the registry being unavailable for some time. Our pods can't start, so you can't do anything with the VMIs, while the k8s control plane recovers almost immediately.

@stu-gott (Member) commented Mar 8, 2019

Ah! I hadn't noticed that NodeSelectors could be used with DaemonSets. https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/

Perhaps a Daemonset really is the superior choice here.
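To make that concrete, a DaemonSet pinned to master nodes would combine a nodeSelector with a matching toleration, roughly like this (a sketch only, assuming the common upstream node-role.kubernetes.io/master label and taint plus an illustrative kubevirt.io: virt-controller pod label and a placeholder image):

  apiVersion: apps/v1
  kind: DaemonSet
  metadata:
    name: virt-controller
  spec:
    selector:
      matchLabels:
        kubevirt.io: virt-controller            # assumed pod label
    template:
      metadata:
        labels:
          kubevirt.io: virt-controller
      spec:
        nodeSelector:
          node-role.kubernetes.io/master: ""    # schedule only onto master nodes
        tolerations:
        - key: node-role.kubernetes.io/master   # tolerate the master NoSchedule taint
          operator: Exists
          effect: NoSchedule
        containers:
        - name: virt-controller
          image: <virt-controller-image>        # placeholder image

One DaemonSet pod then runs per matching (master) node, scaling automatically as masters are added or removed.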

@rmohr (Member) commented Mar 8, 2019

Perhaps a Daemonset really is the superior choice here.

I would describe my preferences this way:

  1. If the master nodes are visible (not all on-demand clusters show you master nodes, even if you are admin, as I learned in "Let our HA configuration scale automatically when masters are added or deleted" #975), then clearly yes. It will automatically scale with the environment and provide the best uptime guarantee.
  2. If they are not visible, let users set replacement selectors and tolerations for the daemonset on the CR.
  3. If the user really doesn't want that, fall back to a Deployment. But even there we need the option for people to set at least additional tolerations. This provides the least uptime guarantee.

However, letting the operator scale a Deployment based on the nodes, or a subset of nodes via labels or such, is basically a Daemonset. At that point we can just use a Daemonset directly.

@davidvossel (Member) commented:

I think this might be getting a little crazy.

For the sake of this PR, can't we just have the operator automatically inject this anti-affinity rule for all deployments when more than a single node exists?

@kfox1111 (Contributor) commented Mar 8, 2019

Yeah, I personally would prefer deployments. Why? For large clusters, you may have way too many instances. This puts additional load on any lock management/leader election plumbing and wastes a significant portion of capacity. It also spreads secrets around to many more nodes. Deployments let you decouple the size of your allocation from the size of your cluster. Anti-affinity covers the higher-availability case without that coupling.

You could use node labels with a daemonset and a small pool, but then you must carefully manage that pool, whereas with deployments this remains optional.

@ksimon1 (Member, Author) commented Mar 11, 2019

@rmohr, @davidvossel, @stu-gott so what should the next steps be?
Update to a daemonset, or use @davidvossel's approach?

@SchSeba (Contributor) commented Mar 11, 2019

In my opinion the daemonset is the best solution for that

@davidvossel (Member) commented:

In my opinion the daemonset is the best solution for that

I'm interested in your viewpoint here. What has convinced you that a daemonset is a better fit than a deployment for this use case?

@SchSeba (Contributor) commented Mar 12, 2019

Hi @davidvossel

I'm interested in your viewpoint here. What has convinced you that a daemonset is a better fit than a deployment for this use case?

From my point of view, our control plane (virt-api and virt-controller) is just as important as the kubernetes control plane: if you don't have any available kube-api you can't start any pod, and the same is true if you don't have any virt-api.

Also, DaemonSets are more of an infrastructure object than Deployments (more privileges are required to start a DaemonSet). One example: if you try to drain a node that has a DaemonSet running on it, you get an error and the cluster admin needs to explicitly confirm that they are draining a node with an important process running on it (they explicitly know that they are taking down part of the control plane). With a Deployment that is not the case; the drain command just proceeds.

kubectl drain <node_name> --ignore-daemonsets

Another point is that it is much easier to enforce one pod per host with a DaemonSet than to start working with anti-affinity and node labels.

For large clusters, you may have way too many instances. This puts additional load on any lock management/leader election plumbing as well as wastes a significant portion of capacity.

Related to my understanding of kubevirt and my first comment: if the admin has a large number of master nodes (because they want their control plane to be highly available and able to lose a large number of nodes before the cluster becomes unavailable), we need to provide the same ability.
If their application consists of both virtual machines and pods and they know they can take down 2 out of 5 masters at the same time, we may become unavailable if we only run on 2 of those nodes (it will take time to restart our control plane on the other nodes).

@rmohr (Member) commented Mar 12, 2019

@SchSeba describes exactly my view.

For the sake of this PR, can't we just have the operator automatically inject this anti-affinity rule for all deployments when more than a single node exists?

@davidvossel You would have to investigate what kind of nodes they are. Master nodes, Worker nodes, special taints, ...

A properly configured DaemonSet solves exactly this (there are now pretty clear, standardized rules for which taints and labels exist on master nodes). Being able to fine-tune this needs to be provided anyway; I think we already agreed on that in another PR.

@fabiand (Member) commented Mar 12, 2019

Also, DaemonSets are more of an infrastructure object than Deployments (more privileges are required to start a DaemonSet).

Why would we want more privileges required for a user to deploy kubevirt?

One example: if you try to drain a node that has a DaemonSet running on it, you get an error and the cluster admin needs to explicitly confirm that they are draining a node with an important process running on it (they explicitly know that they are taking down part of the control plane). With a Deployment that is not the case; the drain command just proceeds.

Is this a pro or a con? :)
Why should we block a node drain?
Our non-node infra components, like controller, should not depend on specific nodes, nor lock the drain of such a node.
And if you combine this with deploy-ctlplane-on-every-node then you need exceptions to drain any node. No, to me this is an anti-pattern.

However, I do see value in colocating our control plane (if possible) on the same nodes that kubernetes is using for its own masters.

Now, as there is no agreement in this discussion, I wonder whether this rather simple fix should be blocked on the broader question of whether we want a broader change.

Let's please get this bug fixed and continue the discussion regardless.

@davidvossel (Member) commented:

thanks for the feedback @SchSeba

(They explicitly know that they are taking down part of the control plane.) With a Deployment that is not the case; the drain command just proceeds.

this is what disruption budgets are made for

Another point is that it is much easier to enforce one pod per host with a DaemonSet than to start working with anti-affinity and node labels.

Daemonsets require messing with the rules to ensure they are limited to master nodes as well. I don't think this is necessarily easier or harder.
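On the disruption-budget point, a minimal PodDisruptionBudget sketch for virt-controller could look like this (illustrative only; the selector label and minAvailable value are assumptions, not part of this PR):

  apiVersion: policy/v1beta1
  kind: PodDisruptionBudget
  metadata:
    name: virt-controller-pdb
  spec:
    minAvailable: 1                  # keep at least one virt-controller through voluntary disruptions such as node drains
    selector:
      matchLabels:
        kubevirt.io: virt-controller # assumed pod label

With this in place, kubectl drain refuses to evict the last ready virt-controller pod, giving a similar "admin must think first" safety net without switching to a DaemonSet.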

@ksimon1 (Member, Author) commented Mar 15, 2019

@fabiand, @davidvossel, @rmohr, @SchSeba, @stu-gott how should this bug be fixed?

@davidvossel (Member) commented:

how should this bug be fixed?

@rmohr @fabiand
always set the anti-affinity rule for the deployments and change the operator to only require a single pod in a deployment to be Ready? Does that sound reasonable?

@rmohr (Member) commented Mar 19, 2019

@davidvossel yep. Works for me for now.

@ksimon1 (Member, Author) commented Mar 27, 2019

@davidvossel So should it work as it is implemented in this PR? Or, when one of two nodes is down, should the number of required pods be decreased to the number of running nodes?

@rmohr (Member) commented Mar 27, 2019

@davidvossel So should it work as it is implemented in this PR? Or, when one of two nodes is down, should the number of required pods be decreased to the number of running nodes?

Even simpler. Just always leave it at 2. The operator should just check whether at least one is ready (not whether both are ready, which is what it does right now).

@davidvossel (Member) commented:

Even simpler. Just always leave it at 2. The operator should just check whether at least one is ready (not whether both are ready, which is what it does right now).

yup, that's fine with me

@fabiand (Member) commented Apr 2, 2019

Ping?

What needs to happen here, @ksimon1?

@ksimon1 (Member, Author) commented Apr 2, 2019

@fabiand Working on this now. I had a lot of other stuff to do.

@ksimon1 force-pushed the feature/virt-pods-affinity branch 3 times, most recently from b6546a2 to dc0b715 on April 8, 2019 08:29
@ksimon1 (Member, Author) commented Apr 8, 2019

@davidvossel, @rmohr please can you review?

@fabiand (Member) commented Apr 8, 2019

or maybe @mfranczy @slintes @petrkotas

@davidvossel (Member) commented:

ci test please

@davidvossel (Member) commented:

I'm seeing the operator update functional test fail in a couple of the lanes.

should be able to update kubevirt install with custom image tag

It's unclear to me right now if the failure is related to this PR or not. I'm re-running the tests.

Other than sorting out this test failure, the PR looks great to me.

@ksimon1 (Member, Author) commented Apr 10, 2019

@davidvossel that failing test is related to this PR. Take a cluster with 2 nodes: the test creates a kubevirt object, which creates 2 virt-apis, 2 virt-controllers, and so on. Then it updates the kubevirt object with a custom image tag, which spawns a new virt-controller (at that moment there are 3 virt-controllers). But since there are only 2 nodes, the third virt-controller can never start due to the pod anti-affinity, the kubevirt object stays in the Deploying phase forever, and the test times out after 160 seconds.
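For context, the third pod comes from the Deployment's rolling-update behaviour: with the default strategy the replacement pod is created before an old one is removed, roughly equivalent to the following settings for a 2-replica Deployment (a sketch of the defaults, assuming the virt deployments don't override them):

  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # one extra pod is created during the update...
      maxUnavailable: 0  # ...before any old pod is torn down

With a required anti-affinity rule and only 2 nodes, that surge pod has nowhere to go, so the rollout never completes.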

@davidvossel (Member) commented:

I see. So we can't update because there's no way to schedule the new pod on a single node cluster.

I'd be fine with us switching to "preferredDuringSchedulingIgnoredDuringExecution". That should resolve the issue for single-node clusters and naturally spread out the pods in multi-node clusters.
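For reference, the soft variant would look roughly like this in the pod template (a sketch; the weight and the kubevirt.io: virt-api label are illustrative assumptions):

  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100                          # strong preference, not a hard requirement
        podAffinityTerm:
          labelSelector:
            matchLabels:
              kubevirt.io: virt-api          # assumed label
          topologyKey: kubernetes.io/hostname

The scheduler then spreads the pods across nodes when it can, but still allows co-location when no other node fits (e.g. a single-node cluster, or a rolling update on a cluster where nodes == replicas).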

@fabiand (Member) commented Apr 10, 2019

I'd be fine with us switching to "preferredDuringSchedulingIgnoredDuringExecution". That should resolve the issue for single-node clusters and naturally spread out the pods in multi-node clusters.

If it is that way, then this sounds like a reasonable workaround, until we see a need to refine it.

@davidvossel (Member) left a comment

Looks good. Can you run 'make generate' and commit the results, please? That will make Travis pass. After that I'd just like to see the CI lanes pass before merging.

@ksimon1 (Member, Author) commented Apr 11, 2019

@davidvossel done; the failing tests are probably not related to this change.

@davidvossel merged commit 2427364 into kubevirt:master on Apr 11, 2019
@ksimon1 deleted the feature/virt-pods-affinity branch on April 12, 2019 08:11