Allow custom tolerations with operator #1617
Comments
Hi @cosandr and thanks for the issue! For some plugins we support
@cosandr Are you running GPU workloads on a control plane node in production, or is this just for testing with a single-node setup? As that seems like quite an uncommon practice, do you have any other use case where tolerations would be useful?
The example is not from production, no. I would say it's relatively common to taint specialized nodes (for example the
Ok, that's a really good use case, and NFD actually supports tainting nodes with specific devices. I think we would want to support such an option in the operator too:
@tkatila, any comments? EDIT: I think it should be a per-plugin option, as some nodes might have multiple device types, and multiple different taints per node could be awkward.
#1571 discussed this area too. Perhaps we need to think through the cases.
I don't understand this. Can you clarify?
Yup, making it per CR seems like a good solution to me.
It's an experimental NFD feature: https://nfd.sigs.k8s.io/usage/customization-guide#node-tainting EDIT: Because it's still experimental, requires NFD to run with an enabling flag, and the NFD worker would also need a toleration, it may be better to start by just supporting user-specified tolerations in the operator (and adding NFD node tainting once NFD has that support enabled by default).
We can try them out, document their use, and maybe create examples. But I would keep them as optional/advanced scenarios.
@cosandr the operator takes the daemonset "base" at compile time (see
We also have other taints we put on certain nodes to restrict scheduling for specific workloads. Adding tolerations is a must since the device is present on those nodes.
@cosandr & @winromulus, a question or concern about this request: if the node has a taint and the plugin has a toleration, the workloads would also require the same toleration. Compared to just having the resource request, that feels bad from a user experience point of view. Is this something you'd be fine with?
@tkatila this is actually very much intended. If you need a node to run only certain workloads, you apply taints and give the workloads tolerations.
Thanks @winromulus. So to summarize: run the GPU plugin on all nodes with GPU hardware, regardless of the taints. Workloads request the GPU resource and have toleration(s) for the tainted node. I will look into adding tolerations support to the operator.
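To make the summary above concrete, here is a minimal Python sketch of how Kubernetes decides whether a pod's tolerations cover a node's taints. It is an illustration only, not code from the operator or the scheduler. The taint key `example.com/gpu` is hypothetical, and the function names are invented for this example. The matching rules follow the documented semantics: an empty toleration effect matches any effect, `Exists` ignores the value, and `Equal` is the default operator.

```python
def toleration_matches(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration tolerates a single taint."""
    # An empty effect on the toleration matches every taint effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint.get("effect"):
        return False

    key = toleration.get("key", "")
    operator = toleration.get("operator", "Equal")  # Equal is the default

    # An empty key is only valid with Exists and matches every taint key.
    if not key:
        return operator == "Exists"
    if key != taint.get("key"):
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def can_schedule(tolerations: list, taints: list) -> bool:
    """Return True if every hard taint on the node is tolerated."""
    # PreferNoSchedule is a soft preference, so it never blocks scheduling.
    hard = [t for t in taints if t.get("effect") in ("NoSchedule", "NoExecute")]
    return all(any(toleration_matches(tol, t) for tol in tolerations) for t in hard)


# Hypothetical taint placed on a dedicated GPU node.
gpu_taint = {"key": "example.com/gpu", "value": "true", "effect": "NoSchedule"}
# Toleration that both the plugin daemonset and the GPU workloads would carry.
gpu_toleration = {"key": "example.com/gpu", "operator": "Exists", "effect": "NoSchedule"}

print(can_schedule([gpu_toleration], [gpu_taint]))  # plugin or workload with the toleration
print(can_schedule([], [gpu_taint]))                # pod without any toleration
```

The same check applies to the plugin pod and to the workload pods, which is why both need the toleration once the node is tainted.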
Hi,
I'd like to deploy the gpu plugin with custom tolerations, i.e.
but this doesn't appear to be possible. The CRD supports node selectors but not tolerations.
My workaround is deploying the daemonset with kustomize and patching it afterwards to have my tolerations, but I'd like to switch to the operator if possible. I admit I haven't tried to patch the daemonset deployed by the operator; I assumed that's a bad idea and that the operator would eventually replace it.
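As a rough illustration of what the patching workaround amounts to, the Python sketch below merges extra tolerations into a DaemonSet manifest held as a plain dictionary. It is not the operator's implementation and not a kustomize patch. The daemonset name `example-gpu-plugin`, the taint key `example.com/gpu`, and the helper `with_tolerations` are all hypothetical. The only real detail relied on is that a DaemonSet keeps pod tolerations under `spec.template.spec.tolerations`.

```python
import copy


def with_tolerations(daemonset: dict, tolerations: list) -> dict:
    """Return a copy of a DaemonSet manifest with extra tolerations merged in."""
    patched = copy.deepcopy(daemonset)  # leave the caller's manifest untouched
    pod_spec = (
        patched.setdefault("spec", {})
        .setdefault("template", {})
        .setdefault("spec", {})
    )
    merged = list(pod_spec.get("tolerations") or [])
    for toleration in tolerations:
        # Skip entries that are already present so re-applying is a no-op.
        if toleration not in merged:
            merged.append(copy.deepcopy(toleration))
    pod_spec["tolerations"] = merged
    return patched


# Hypothetical base manifest, trimmed to the fields relevant here.
base = {
    "apiVersion": "apps/v1",
    "kind": "DaemonSet",
    "metadata": {"name": "example-gpu-plugin"},
    "spec": {"template": {"spec": {"nodeSelector": {"kubernetes.io/arch": "amd64"}}}},
}
# Hypothetical toleration for a tainted GPU node.
extra = [{"key": "example.com/gpu", "operator": "Exists", "effect": "NoSchedule"}]

patched = with_tolerations(base, extra)
print(patched["spec"]["template"]["spec"]["tolerations"])
```

A merge done outside the operator like this is exactly what a reconciling controller may revert, which is the reason for asking for a tolerations field in the CRD instead.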