Rename the key for ovs-cni.network.kubevirt.io/br1 #107
Comments
We cannot change the annotation name; the code that consumes the annotation (the resource injector) is not owned by us. Can you just omit the br1 resource name? It should still work as long as you have the bridge configured on all nodes with intel.com/mellanox_snic0.
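As a rough sketch of what that suggestion could look like: the `k8s.v1.cni.cncf.io/resourceName` annotation, the `intel.com/mellanox_snic0` pool and the `br1` bridge come from this thread, while the object name and the remaining CNI config fields are only illustrative.

```yaml
# Sketch: net-attach-def that advertises only the device-plugin pool.
# The ovs-cni.network.kubevirt.io/br1 resource is not requested, so every
# node offering intel.com/mellanox_snic0 must already have br1 configured.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: ovs-net                      # illustrative name
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/mellanox_snic0
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "ovs",
      "bridge": "br1"
    }
```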
@phoracek Yes, the OVS bridge name can be omitted from the net-attach-def, but there are some worker nodes which may not have that OVS bridge.
I would accept a PR for the marker to expose additional resource labels. However, you would need to introduce your own webhook injecting the resource request into Pods (like I said, we don't own the resource injector webhook). An alternative to the webhook would be to put the resource requirement on your Pods manually.
If you insert the resource request manually, we don't need any change to the marker.
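A minimal sketch of that manual alternative, assuming the marker exposes the bridge as `ovs-cni.network.kubevirt.io/br1` (the resource name from the issue title); the Pod, container, image and net-attach-def names are illustrative.

```yaml
# Sketch: the bridge resource request is added to the Pod by hand instead of
# by a webhook. Extended resources must have equal requests and limits.
apiVersion: v1
kind: Pod
metadata:
  name: smartnic-workload                      # illustrative name
  annotations:
    k8s.v1.cni.cncf.io/networks: ovs-net       # illustrative net-attach-def name
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest     # illustrative image
    resources:
      requests:
        intel.com/mellanox_snic0: "1"
        ovs-cni.network.kubevirt.io/br1: "1"
      limits:
        intel.com/mellanox_snic0: "1"
        ovs-cni.network.kubevirt.io/br1: "1"
```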
For Mellanox SmartNIC network attachment definitions there is no need to refer to the OVS bridge name, as the OVS bridge is connected to the Linux bond of the two SmartNIC PFs, while all the SmartNIC VFs are configured to be part of the device pool for the SmartNIC. OVS bridge and device pool are "synonymous", so referring to the VF device pool guarantees injection of the VF resource into the pod spec as well as scheduling of the pod to a suitable node. I assume the introduction of the OVS CNI marker resource was a "trick" to solve the scheduling of pods on heterogeneous K8s nodes without the need for node selector labels. Have you considered using simple (node selector) labels instead of a dummy bridge resource (with a new/enhanced webhook)?
We have considered node selectors. However, we wanted to reuse the existing mechanism available through the resource injector, so we don't need to introduce new webhooks.
In KubeVirt, we use the resource injector to handle scheduling for SR-IOV and Linux bridge (https://github.com/kubevirt/bridge-marker) too.
Raised a PR in the network resources injector.
@phoracek why did you use NFD [1] with Feature Detector Hooks? Can you explain why an OVS bridge is a countable resource?
Modelling OVS bridges as node features looks like an interesting alternative to the trick with countable bridge resources. However, there are two aspects of the current bridge resource approach that I would like to keep:
So can you explain how the marker works today? Does it create/remove bridges when a network is added to or deleted from a network attachment?
@moshe010 AFAIK the ovs-cni marker would expose an OVS bridge as a scalar resource with a capacity of 1000. As @JanScheurich mentioned, it would be more appropriate if the ovs-cni marker exposed OVS bridges as a node feature, and using a node selector label in the network resources injector would be an appropriate solution. @phoracek WDYT?
/cc @zshi-redhat
You meant to ask why we did not use it? We considered NFD in the beginning; however, it has a few drawbacks. It would require us to deploy the NFD components and to write a network-label-injector webhook. Writing webhooks is difficult due to certificate handling, and deploying NFD may be complicated since it is not a trivial component. Since the resource-based way of scheduling was already employed for SR-IOV, we decided to use it with the OVS and Linux bridge CNIs as well. All we need is a simple DaemonSet exposing the found bridges as resources on the node. The resource injector is already deployed on the cluster for us (handled through KubeVirt or SR-IOV). We chose to follow the simpler way, even though the API is maybe not as good as it would be with NFD.
I'm really hesitant to switch to NFD now. We don't have any issues with the current approach. If somebody is willing to contribute NFD support, including the injector webhook, I would be willing to discuss it. But like I said, I would be really hesitant unless we have a better reason than that it would look better. Note that you can decide to just deploy OVS CNI without the marker and handle the scheduling in any way you like. tl;dr: Why do we want to switch from the current approach?
Maybe a stupid question: Can't we change to using node-selector labels in a cheaper way?
There are no dummy bridge resources. We don't use a device plugin (DP). We merely expose the available bridge as a node resource, not a countable one: https://github.com/kubevirt/ovs-cni/blob/master/pkg/marker/marker.go#L115
I don't see what is so expensive here. If the resource injector works with labels, that would be great. But I don't see a big difference between exposing the value in a resource or a label field. But then, I don't know that much about Kubernetes internals.
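For concreteness, a sketch of how a bridge exposed by the marker could appear on a node object; the resource name comes from the issue title, the quantity of 1000 from the comments that follow, and everything else (node name, other capacity values) is illustrative.

```yaml
# Sketch of a Node status fragment after the marker has reported the bridge.
apiVersion: v1
kind: Node
metadata:
  name: worker-0                              # illustrative node name
status:
  capacity:
    cpu: "16"                                 # illustrative
    memory: 64Gi                              # illustrative
    intel.com/mellanox_snic0: "8"             # illustrative VF pool size
    ovs-cni.network.kubevirt.io/br1: "1000"
  allocatable:
    ovs-cni.network.kubevirt.io/br1: "1000"
```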
But maybe I just don't understand "The OVS CNI marker could just add bridge node-selector labels to the node resource instead of creating dummy bridge resources." properly.
If I'm not mistaken, each bridge is represented as a countable node resource with 1000 units. Yes, this works (as long as there are not more than 1000 ports per bridge), but it is still a trick, as there is no actual well-defined limit on the number of ports that can be connected to an OVS bridge.
It would just be conceptually cleaner than consuming pseudo-resources. I do understand the original reasoning for re-using the existing Network Resource Injector scheme, but perhaps we can improve that by allowing injection of labels as well.
It has an available amount of 1000, but it is not a list of 1000 devices as it would be in the case of a DP, where each item represents a device. Instead, it is 1000 as in 1 kilo, as if it were an amount of available memory. Therefore I believe there is not much overhead present.
The maximum limit is not an issue, is it? I doubt there would be 1000 connections to a single bridge on a node. It is simply an implicit "infinite".
It would absolutely be cleaner. I would like to hear more about the needed changes. If I have a clean Kubernetes cluster, what would I need to do to get an NFD-based solution running? How would we deliver an NFD hook, etc.? I'm not strictly against this, I just want to be sure the effort/outcome ratio is reasonable.
@phoracek: I didn't mean to imply that there was a lot of overhead with modelling a bridge as a resource. My thought was whether we could do the label approach without the dependency on NFD, by still relying on a (slightly enhanced) Network Resources Injector.
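A sketch of the label-based alternative being discussed here — entirely hypothetical, since neither the marker nor the Network Resources Injector does this today; the label key simply reuses the existing resource name, and the object names are illustrative.

```yaml
# Hypothetical: the marker labels the node instead of patching its capacity...
apiVersion: v1
kind: Node
metadata:
  name: worker-0                                 # illustrative node name
  labels:
    ovs-cni.network.kubevirt.io/br1: "true"      # hypothetical bridge label
---
# ...and an enhanced injector (or the user) adds a matching nodeSelector to
# Pods that reference the net-attach-def.
apiVersion: v1
kind: Pod
metadata:
  name: smartnic-workload                        # illustrative name
  annotations:
    k8s.v1.cni.cncf.io/networks: ovs-net         # illustrative net-attach-def name
spec:
  nodeSelector:
    ovs-cni.network.kubevirt.io/br1: "true"
  containers:
  - name: app
    image: registry.example.com/app:latest       # illustrative image
```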
(Sorry if I look annoyed and push back too hard, I don't mean to.) @JanScheurich could you summarize the benefits of such a change? Anything apart from getting rid of the countable API, which does not make sense for "uncountable" resources? BTW, even though we don't implement QoS now, the countable approach may be helpful once we decide to: let users request 1 GB of the throughput.
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /lifecycle stale
Stale issues rot after 30d of inactivity. If this issue is safe to close now, please do so with /lifecycle rotten
Rotten issues close after 30d of inactivity. /close
@kubevirt-bot: Closing this issue. In response to this:

> /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
We have a problem with net-attach-def objects when it is required to have both a `device` (for the device plugin) and a `bridge` (for ovs-cni) as resource names. This is mainly because both are described with the same `k8s.v1.cni.cncf.io/resourceName` key, so the most recent value replaces the old one (see the sketch below for an example).

Can we have a different annotation key for defining the OVS bridge name, like in the second manifest sketched below?

Then we can enhance the network resources injector to support both resources, and the pod spec can be injected with the appropriate resource requests/limits.
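An illustrative sketch of the conflict and of the proposed dedicated key: the annotation key `k8s.v1.cni.cncf.io/resourceName` and the resource names `intel.com/mellanox_snic0` and `ovs-cni.network.kubevirt.io/br1` come from the discussion, while the proposed key `ovs-cni.network.kubevirt.io/resourceName` and the object names are hypothetical placeholders.

```yaml
# Today: both resources would have to share one annotation key, so only one
# value can survive on the object.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: ovs-net                                  # illustrative name
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/mellanox_snic0
    # A second value for the same key would simply replace the first:
    # k8s.v1.cni.cncf.io/resourceName: ovs-cni.network.kubevirt.io/br1
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "ovs",
      "bridge": "br1"
    }
---
# Proposal (hypothetical key name): a dedicated annotation for the OVS bridge
# resource, so an enhanced injector could add both resource requests to Pods.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: ovs-net                                  # illustrative name
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/mellanox_snic0
    ovs-cni.network.kubevirt.io/resourceName: ovs-cni.network.kubevirt.io/br1   # hypothetical key
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "ovs",
      "bridge": "br1"
    }
```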