add anti affinity to virt pods #2089
Conversation
@ksimon1 looks good as a start. However, since no one seems to like my suggestion to use DaemonSets (which would solve a lot of these problems; see #975 for the background), you will have to make the operator node-aware now. Otherwise it will happily spawn two virt-controller instances where one can't be scheduled and will never become ready. Another option is `PreferredDuringSchedulingIgnoredDuringExecution`. That would be fine for me in this PR, but it does not solve the underlying problem.
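For reference, the hard rule under discussion can be built with the upstream API types roughly like this (a minimal sketch, not the PR's exact code; the `kubevirt.io` label value is an assumption about how the pods are labeled):

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newPodAntiAffinity returns a hard rule: no two pods carrying the
// given kubevirt.io label value may be scheduled onto the same node.
func newPodAntiAffinity(component string) *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
				// Treat each node (hostname) as one topology domain.
				TopologyKey: "kubernetes.io/hostname",
				LabelSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{"kubevirt.io": component},
				},
			}},
		},
	}
}
```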
@rmohr should the operator be changed to report a good status when at least one api and one controller are up? Or would you prefer the DaemonSet with tolerations instead? @ksimon1 followed the decision from the Bugzilla (@stu-gott and @fabiand). Your issue #975 is a relevant source of information, though; pity it was not mentioned in the bug as well.
@rmohr did bring up a very good point in the community meeting this week: using a Deployment also schedules all pods in the same zone. In the case of losing an entire zone, this can lead to a delay as the pods are brought up in a different zone. Overall I feel like a Deployment is still the better strategy, in the sense that there's really no need for a pod per node, but that's just my opinion.
It is a pod per master node. We would run on exactly the same nodes as the apiservers.
Yes, just think about the registry being unavailable for some time. Our pods can't start, so you can't do anything with the VMIs, while the k8s control plane recovers almost immediately.
Ah! I hadn't noticed that NodeSelectors could be used with DaemonSets (https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/). Perhaps a DaemonSet really is the superior choice here.
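For illustration, a DaemonSet pinned to master nodes could look roughly like this (a hypothetical sketch; the label scheme, image, and the `node-role.kubernetes.io/master` selector and toleration are assumptions about the cluster's conventions):

```go
package example

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newMasterDaemonSet builds a DaemonSet skeleton restricted to master
// nodes: the node selector limits scheduling to masters, and the
// toleration lets the pods run despite the standard master taint.
func newMasterDaemonSet(name, image string) *appsv1.DaemonSet {
	labels := map[string]string{"kubevirt.io": name}
	return &appsv1.DaemonSet{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: appsv1.DaemonSetSpec{
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					NodeSelector: map[string]string{"node-role.kubernetes.io/master": ""},
					Tolerations: []corev1.Toleration{{
						Key:      "node-role.kubernetes.io/master",
						Operator: corev1.TolerationOpExists,
						Effect:   corev1.TaintEffectNoSchedule,
					}},
					Containers: []corev1.Container{{Name: name, Image: image}},
				},
			},
		},
	}
}
```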
I would describe my preferences this way:
However, letting the operator scale a Deployment based on nodes, or on a subset of nodes via labels or similar, is basically a DaemonSet. In that case we can directly use a DaemonSet.
I think this might be getting a little crazy. For the sake of this PR, can't we just have the operator automatically inject this anti-affinity rule for all deployments when more than a single node exists?
Yeah, I personally would prefer deployments. Why? For large clusters, you may have way too many instances. This puts additional load on any lock management/leader election plumbing and wastes a significant portion of capacity. It also spreads secrets to many more nodes. Deployments let you decouple the size of your allocation from the size of your cluster. Anti-affinity covers the higher-availability case without the coupling. You could use node labels with a daemonset and a small pool, but then you must carefully manage it, instead of doing it optionally, as you can with deployments.
@rmohr, @davidvossel, @stu-gott so what should the next steps be?
In my opinion the daemonset is the best solution for that |
I'm interested in your viewpoint here. What has convinced you that a daemonset is a better fit than a deployment for this use case?
Hi @davidvossel
From my point of view, our control plane (virt-api and virt-controller) is just as important as the Kubernetes control plane: if you don't have any available kube-api you can't start any pod, and the same holds if you don't have any virt-api. Also, DaemonSets are more infrastructure objects than Deployments (more privileges are required to start a DaemonSet). One example: if you try to drain a node that has a DaemonSet running on it, you will get an error, and the cluster admin needs to explicitly approve draining a node that has an important process running on it (they explicitly know they are going to take down part of the control plane). With a Deployment that is not the case; the drain command will just proceed.
Another point is that it is much easier to enforce one pod per host with a DaemonSet than to start working with anti-affinity and node labels.
Related to my understanding of KubeVirt and my first comment: if the admin has a large number of master nodes (because they want their control plane to be very HA and to tolerate losing a large number of nodes before the cluster becomes unavailable), we need to provide the same ability.
@SchSeba describes exactly my view.
@davidvossel You would have to investigate what kind of nodes they are: master nodes, worker nodes, special taints, ... A properly configured DaemonSet solves exactly this (there are now pretty clear standardized rules for which taints and labels exist on master nodes). Being able to fine-tune this needs to be provided anyway; I think we agreed on that already in another PR.
Why would we want more privileges required for a user to deploy kubevirt?
Is this a pro or a con? :) However, I do see value in colocating our control plane (if possible) on the same nodes which Kubernetes is using for its own masters. Now, as there is no agreement in the discussion, I wonder if this rather simple fix should be blocked by the broader discussion of whether we want a broader change. Let's please get this bug fixed and continue the discussion regardless.
thanks for the feedback @SchSeba
this is what disruption budgets are made for
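Something along these lines would cover the drain concern (a hypothetical sketch, not code from this PR; the name and selector are made up for illustration):

```go
package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	policyv1 "k8s.io/api/policy/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// newDisruptionBudget refuses voluntary disruptions (e.g. a drain)
// that would leave fewer than one ready pod matching the selector.
func newDisruptionBudget(component string) *policyv1.PodDisruptionBudget {
	minAvailable := intstr.FromInt(1)
	return &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: component + "-pdb"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"kubevirt.io": component},
			},
		},
	}
}
```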
daemonsets require messing with the rules to ensure they are limited to master nodes as well. I don't think this is necessarily easier or harder.
@fabiand, @davidvossel, @rmohr, @SchSeba, @stu-gott how should this bug be fixed?
@davidvossel yep. Works for me for now.
@davidvossel So should it work as it is implemented in this PR? Or, when one node out of two is down, should the number of required pods be decreased to the number of running nodes?
Even simpler: just always leave it at 2. The operator should just check whether at least one is ready (and not whether both are ready, which is what it does right now).
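A minimal sketch of what that relaxed check could look like (illustrative only; the actual operator readiness logic is more involved):

```go
package example

import appsv1 "k8s.io/api/apps/v1"

// componentAvailable treats a component as available as soon as one
// replica is ready, instead of requiring all desired replicas.
func componentAvailable(d *appsv1.Deployment) bool {
	return d.Status.ReadyReplicas >= 1
}
```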
yup, that's fine with me
Ping? What needs to happen here, @ksimon1?
@fabiand Working on this now. I had a lot of other stuff to do.
@davidvossel, @rmohr please can you review?
or maybe @mfranczy @slintes @petrkotas
ci test please
I'm seeing the operator update functional test fail in a couple of the lanes.
It's unclear to me right now whether the failure is related to this PR or not. I'm re-running the tests. Other than sorting out this test failure, the PR looks great to me.
@davidvossel that failing test is related to this PR. Say you have 2 nodes: the test creates a KubeVirt object, which creates 2 virt-apis, 2 virt-controllers, and so on. Then it tries to update the KubeVirt object with a custom image tag, which spawns a new virt-controller (at that point there are 3 virt-controllers). But since you only have 2 nodes, the third virt-controller will never start due to pod anti-affinity, the KubeVirt object stays in the deploying phase forever, and the test times out after 160 seconds.
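For context, the transient third pod comes from the Deployment's rolling-update surge, which briefly runs old and new pods side by side. A sketch of the strategy knobs involved (illustrative values, not necessarily what KubeVirt sets):

```go
package example

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// rollingUpdateStrategy shows why an extra pod appears during an
// update: maxSurge lets the Deployment briefly exceed its replica
// count while old and new pods overlap. With replicas=2 on a 2-node
// cluster, that transient third pod can never satisfy a hard
// anti-affinity rule.
func rollingUpdateStrategy() appsv1.DeploymentStrategy {
	maxSurge := intstr.FromInt(1)       // one pod above the desired count
	maxUnavailable := intstr.FromInt(0) // never go below the desired count
	return appsv1.DeploymentStrategy{
		Type: appsv1.RollingUpdateDeploymentStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDeployment{
			MaxSurge:       &maxSurge,
			MaxUnavailable: &maxUnavailable,
		},
	}
}
```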
I see. So we can't update because there's no way to schedule the new pod on a single-node cluster. I'd be fine with us switching to `preferredDuringSchedulingIgnoredDuringExecution`. That should resolve the issue for single-node clusters and naturally spread out the pods in multi-node clusters.
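The soft variant would look roughly like this (again a sketch; same label assumption as above):

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newPreferredPodAntiAffinity is the soft variant: the scheduler tries
// to spread matching pods across nodes, but may still co-locate them
// when no other node fits, so single-node clusters and rolling updates
// keep working.
func newPreferredPodAntiAffinity(component string) *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			PreferredDuringSchedulingIgnoredDuringExecution: []corev1.WeightedPodAffinityTerm{{
				Weight: 1,
				PodAffinityTerm: corev1.PodAffinityTerm{
					TopologyKey: "kubernetes.io/hostname",
					LabelSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"kubevirt.io": component},
					},
				},
			}},
		},
	}
}
```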
If it is that way, then this sounds like a reasonable workaround, until we see a need to refine it.
Looks good. Can you run `make generate` and commit the results, please? That will make Travis pass. After that I'd just like to see the CI lanes pass before merging.
@davidvossel done; the failing tests are probably not related to this change
What this PR does / why we need it:
This PR adds pod antiAffinity to virt pods (virt-api, virt-controller).
When a user has 2 or more nodes and one dies, the virt pods from the dead node will not be scheduled on a node which already contains the same virt pods; e.g. a replacement virt-api pod will not be scheduled on a node which already has a virt-api pod and instead stays in Pending status.
@rmohr, @davidvossel, @petrkotas, @stu-gott, @fabiand please review.
Which issue(s) this PR fixes:
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1671511
Special notes for your reviewer:
Release note: