add anti affinity to virt pods #2089

Merged
merged 2 commits into kubevirt:master from the feature/virt-pods-affinity branch on Apr 11, 2019

Conversation

@ksimon1 (Member) commented Mar 6, 2019

What this PR does / why we need it:
This PR adds pod antiAffinity to the virt pods (virt-api, virt-controller).
When a cluster has 2 or more nodes and one of them dies, the virt pods from the dead node will not be rescheduled onto a node that already runs the same virt pod; e.g. a replacement virt-api will not be scheduled onto a node that already has a virt-api pod and will stay in Pending status instead.
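For illustration, the kind of anti-affinity stanza this adds to a pod template looks roughly like the following (a minimal sketch; the kubevirt.io: virt-api label and the hostname topology key are illustrative assumptions, not necessarily the exact values the operator sets):

  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            kubevirt.io: virt-api           # assumed label identifying virt-api pods
        topologyKey: kubernetes.io/hostname # at most one matching pod per node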
@rmohr, @davidvossel, @petrkotas, @stu-gott, @fabiand please review.

Which issue(s) this PR fixes:
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1671511

Special notes for your reviewer:

Release note:

Add pod antiAffinity to virt pods (virt-api, virt-controller)

@kubevirt-bot added the release-note and size/M labels on Mar 6, 2019
@rmohr (Member) left a comment

@ksimon1 looks good as a start. However, since no one seems to like my suggestion to use daemonsets (which would solve a lot of these problems; see #975 for the background), you will have to make the operator node-aware now. Otherwise it will happily spawn two virt-controller instances where one can't be scheduled and will never become ready. Another option is PreferredDuringSchedulingIgnoredDuringExecution. That would be fine for me in this PR, but it does not solve the underlying problem.

@ksimon1 force-pushed the feature/virt-pods-affinity branch from bc5b21a to b6a2f3e on March 6, 2019 08:28
@MarSik (Contributor) commented Mar 8, 2019

@rmohr should the operator be changed to report good status when at least one api and controller is up? Or would you prefer the DaemonSet with tolerations instead?

@ksimon1 followed the decision from the bugzilla (@stu-gott and @fabiand). Your issue #975 is a relevant source of information though. Pity it was not mentioned in the bug as well.

@ksimon1 (Member, Author) commented Mar 8, 2019

@MarSik I talked with @stu-gott and he said to update this work to something like: the operator should watch the nodes and, if numberOfNodes < numberOfVirtPods, limit the number of virt pods.

@stu-gott (Member) commented Mar 8, 2019

The trouble with using PreferredDuringSchedulingIgnoredDuringExecution is that it doesn't prevent multiple virt-controller or virt-api pods from ending up on the same node, which completely defeats the point of the PR. That means a node-aware operator would be needed if we're going to continue using deployments.

@rmohr did bring up a very good point in the Community Meeting this week: using a Deployment also schedules all pods in the same zone. In the case of losing an entire zone, this can lead to a delay as the pods are brought up in a different zone.

Overall I feel like a deployment is still the better strategy in the sense that there's really no need for a pod per node, but that's just my opinion.

@rmohr (Member) commented Mar 8, 2019

Overall I feel like a deployment is still the better strategy in the sense that there's really no need for a pod per node, but that's just my opinion.

It is a pod per master node. We would run on exactly the same nodes as the apiservers.

In the case of losing an entire zone, this can lead to a delay as the pods are brought up in a different zone.

Yes, just think about the registry being unavailable for some time. Our pods can't start, so you can't do anything with the VMIs, while the k8s control plane recovers almost immediately.

@stu-gott (Member) commented Mar 8, 2019

Ah! I hadn't noticed that NodeSelectors could be used with DaemonSets. https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/

Perhaps a Daemonset really is the superior choice here.
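To make that concrete, a DaemonSet pinned to master nodes would combine a nodeSelector with a matching toleration, roughly like this (a sketch only, assuming the common upstream node-role.kubernetes.io/master label and taint plus an illustrative kubevirt.io: virt-controller pod label and a placeholder image):

  apiVersion: apps/v1
  kind: DaemonSet
  metadata:
    name: virt-controller
  spec:
    selector:
      matchLabels:
        kubevirt.io: virt-controller            # assumed pod label
    template:
      metadata:
        labels:
          kubevirt.io: virt-controller
      spec:
        nodeSelector:
          node-role.kubernetes.io/master: ""    # schedule only onto master nodes
        tolerations:
        - key: node-role.kubernetes.io/master   # tolerate the master NoSchedule taint
          operator: Exists
          effect: NoSchedule
        containers:
        - name: virt-controller
          image: <virt-controller-image>        # placeholder image

One DaemonSet pod then runs per matching (master) node, scaling automatically as masters are added or removed.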

@rmohr (Member) commented Mar 8, 2019

Perhaps a Daemonset really is the superior choice here.

I would describe my preferences this way:

  1. If the master nodes are visible (not all on-demand clusters show you master nodes, even if you are admin, as I learned in "Let our HA configuration scale automatically when masters are added or deleted" #975), then clearly yes. It will automatically scale with the environment and provide the best uptime guarantee.
  2. If they are not visible, let users set replacement selectors and tolerations for the daemonset on the CR.
  3. If the user really doesn't want that, fall back to a Deployment. But even there we need the option for people to set at least additional tolerations. This provides the least uptime guarantee.

However, letting the operator scale a Deployment based on the nodes, or a subset of nodes via labels or such, is basically a Daemonset. At that point we can just use a Daemonset directly.

@davidvossel (Member) commented:

I think this might be getting a little crazy.

For the sake of this PR, can't we just have the operator automatically inject this anti-affinity rule for all deployments when more than a single node exists?

@kfox1111 (Contributor) commented Mar 8, 2019

Yeah, I personally would prefer deployments. Why? For large clusters, you may have way too many instances. This puts additional load on any lock management/leader election plumbing and wastes a significant portion of capacity. It also spreads secrets around to many more nodes. Deployments let you decouple the size of your allocation from the size of your cluster. Anti-affinity covers the higher-availability case without that coupling.

You could use node labels with a daemonset and a small pool, but then you must carefully manage that pool, whereas with deployments this remains optional.

@ksimon1 (Member, Author) commented Mar 11, 2019

@rmohr, @davidvossel, @stu-gott so what should the next steps be?
Update to a daemonset, or use @davidvossel's approach?

@SchSeba (Contributor) commented Mar 11, 2019

In my opinion the daemonset is the best solution for that

@davidvossel (Member) commented:

In my opinion the daemonset is the best solution for that

I'm interested in your viewpoint here. What has convinced you that a daemonset is a better fit than a deployment for this use case?

@SchSeba (Contributor) commented Mar 12, 2019

Hi @davidvossel

I'm interested in your viewpoint here. What has convinced you that a daemonset is a better fit than a deployment for this use case?

From my point of view, our control plane (virt-api and virt-controller) is just as important as the kubernetes control plane: if you don't have any available kube-api you can't start any pod, and the same is true if you don't have any virt-api.

Also, DaemonSets are more of an infrastructure object than Deployments (more privileges are required to start a DaemonSet). One example: if you try to drain a node that has a DaemonSet running on it, you get an error and the cluster admin needs to explicitly confirm that they are draining a node with an important process running on it (they explicitly know that they are taking down part of the control plane). With a Deployment that is not the case; the drain command just proceeds.

kubectl drain <node_name> --ignore-daemonsets

Another point is that it is much easier to enforce one pod per host with a DaemonSet than to start working with anti-affinity and node labels.

For large clusters, you may have way too many instances. This puts additional load on any lock management/leader election plumbing as well as wastes a significant portion of capacity.

Related to my understanding of kubevirt and my first comment: if the admin has a large number of master nodes (because they want their control plane to be highly available and able to lose a large number of nodes before the cluster becomes unavailable), we need to provide the same ability.
If their application consists of both virtual machines and pods and they know they can take down 2 out of 5 masters at the same time, we may become unavailable if we only run on 2 of those nodes (it will take time to restart our control plane on the other nodes).

@rmohr (Member) commented Mar 12, 2019

@SchSeba describes exactly my view.

For the sake of this PR, can't we just have the operator automatically inject this anti-affinity rule for all deployments when more than a single node exists?

@davidvossel You would have to investigate what kind of nodes they are. Master nodes, Worker nodes, special taints, ...

A properly configured DaemonSet solves exactly this (there are now pretty clear, standardized rules for which taints and labels exist on master nodes). Being able to fine-tune this needs to be provided anyway; I think we already agreed on that in another PR.

@fabiand (Member) commented Mar 12, 2019

Also, DaemonSets are more of an infrastructure object than Deployments (more privileges are required to start a DaemonSet).

Why would we want more privileges required for a user to deploy kubevirt?

One example: if you try to drain a node that has a DaemonSet running on it, you get an error and the cluster admin needs to explicitly confirm that they are draining a node with an important process running on it (they explicitly know that they are taking down part of the control plane). With a Deployment that is not the case; the drain command just proceeds.

Is this a pro or a con? :)
Why should we block a node drain?
Our non-node infra components, like controller, should not depend on specific nodes, nor lock the drain of such a node.
And if you combine this with deploy-ctlplane-on-every-node then you need exceptions to drain any node. No, to me this is an anti-pattern.

However, I do see value in colocating our control plane (if possible) on the same nodes that kubernetes is using for its own masters.

Now, as there is no agreement in this discussion, I wonder whether this rather simple fix should be blocked on the broader question of whether we want a broader change.

Let's please get this bug fixed and continue the discussion regardless.

@davidvossel (Member) commented:

thanks for the feedback @SchSeba

(They explicitly know that they are taking down part of the control plane.) With a Deployment that is not the case; the drain command just proceeds.

this is what disruption budgets are made for

Another point is that it is much easier to enforce one pod per host with a DaemonSet than to start working with anti-affinity and node labels.

Daemonsets require messing with the rules to ensure they are limited to master nodes as well. I don't think this is necessarily easier or harder.
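On the disruption-budget point, a minimal PodDisruptionBudget sketch for virt-controller could look like this (illustrative only; the selector label and minAvailable value are assumptions, not part of this PR):

  apiVersion: policy/v1beta1
  kind: PodDisruptionBudget
  metadata:
    name: virt-controller-pdb
  spec:
    minAvailable: 1                  # keep at least one virt-controller through voluntary disruptions such as node drains
    selector:
      matchLabels:
        kubevirt.io: virt-controller # assumed pod label

With this in place, kubectl drain refuses to evict the last ready virt-controller pod, giving a similar "admin must think first" safety net without switching to a DaemonSet.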

@ksimon1 (Member, Author) commented Mar 15, 2019

@fabiand, @davidvossel, @rmohr, @SchSeba, @stu-gott how should this bug be fixed?

@davidvossel (Member) commented:

how should this bug be fixed?

@rmohr @fabiand
always set the anti-affinity rule for the deployments and change the operator to only require a single pod in a deployment to be Ready? Does that sound reasonable?

@rmohr (Member) commented Mar 19, 2019

@davidvossel yep. Works for me for now.

@ksimon1 (Member, Author) commented Mar 27, 2019

@davidvossel So should it work as it is implemented in this PR? Or, when one of two nodes is down, should the number of required pods be decreased to the number of running nodes?

@rmohr (Member) commented Mar 27, 2019

@davidvossel So should it work as it is implemented in this PR? Or, when one of two nodes is down, should the number of required pods be decreased to the number of running nodes?

Even simpler. Just always leave it at 2. The operator should just check whether at least one is ready (not whether both are ready, which is what it does right now).

@davidvossel (Member) commented:

Even simpler. Just always leave it at 2. The operator should just check whether at least one is ready (not whether both are ready, which is what it does right now).

yup, that's fine with me

@fabiand (Member) commented Apr 2, 2019

Ping?

What needs to happen here, @ksimon1?

@ksimon1 (Member, Author) commented Apr 2, 2019

@fabiand Working on this now. I had a lot of other stuff to do.

@ksimon1 force-pushed the feature/virt-pods-affinity branch 3 times, most recently from b6546a2 to dc0b715 on April 8, 2019 08:29
@ksimon1 (Member, Author) commented Apr 8, 2019

@davidvossel, @rmohr please can you review?

@fabiand (Member) commented Apr 8, 2019

or maybe @mfranczy @slintes @petrkotas

@davidvossel (Member) commented:

ci test please

@davidvossel (Member) commented:

I'm seeing the operator update functional test fail in a couple of the lanes.

should be able to update kubevirt install with custom image tag

It's unclear to me right now if the failure is related to this PR or not. I'm re-running the tests.

Other than sorting out this test failure, the PR looks great to me.

@ksimon1 (Member, Author) commented Apr 10, 2019

@davidvossel that failing test is related to this PR. Take a cluster with 2 nodes: the test creates a kubevirt object, which creates 2 virt-apis, 2 virt-controllers, and so on. Then it updates the kubevirt object with a custom image tag, which spawns a new virt-controller (at that moment there are 3 virt-controllers). But since there are only 2 nodes, the third virt-controller can never start due to the pod anti-affinity, the kubevirt object stays in the Deploying phase forever, and the test times out after 160 seconds.
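For context, the third pod comes from the Deployment's rolling-update behaviour: with the default strategy the replacement pod is created before an old one is removed, roughly equivalent to the following settings for a 2-replica Deployment (a sketch of the defaults, assuming the virt deployments don't override them):

  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # one extra pod is created during the update...
      maxUnavailable: 0  # ...before any old pod is torn down

With a required anti-affinity rule and only 2 nodes, that surge pod has nowhere to go, so the rollout never completes.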

@davidvossel (Member) commented:

I see. So we can't update because there's no way to schedule the new pod on a single node cluster.

I'd be fine with us switching to "preferredDuringSchedulingIgnoredDuringExecution". That should resolve the issue for single-node clusters and naturally spread out the pods in multi-node clusters.
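For reference, the soft variant would look roughly like this in the pod template (a sketch; the weight and the kubevirt.io: virt-api label are illustrative assumptions):

  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100                          # strong preference, not a hard requirement
        podAffinityTerm:
          labelSelector:
            matchLabels:
              kubevirt.io: virt-api          # assumed label
          topologyKey: kubernetes.io/hostname

The scheduler then spreads the pods across nodes when it can, but still allows co-location when no other node fits (e.g. a single-node cluster, or a rolling update on a cluster where nodes == replicas).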

@fabiand (Member) commented Apr 10, 2019

I'd be fine with us switching to "preferredDuringSchedulingIgnoredDuringExecution". That should resolve the issue for single-node clusters and naturally spread out the pods in multi-node clusters.

If it is that way, then this sounds like a reasonable workaround, until we see a need to refine it.

@davidvossel (Member) left a comment

Looks good. Can you run 'make generate' and commit the results, please? That will make Travis pass. After that I'd just like to see the CI lanes pass before merging.

@ksimon1 (Member, Author) commented Apr 11, 2019

@davidvossel done; the failing tests are probably not related to this change.

@davidvossel merged commit 2427364 into kubevirt:master on Apr 11, 2019
@ksimon1 deleted the feature/virt-pods-affinity branch on April 12, 2019 08:11