
Rename the key for ovs-cni.network.kubevirt.io/br1 #107

Closed

pperiyasamy opened this issue Feb 25, 2020 · 25 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@pperiyasamy
Collaborator

We have a problem with net-attach-def objects when both a device (for the device plugin) and a bridge (for ovs-cni) need to be declared as resource names.
This is mainly because both are described with the same k8s.v1.cni.cncf.io/resourceName key, so the most recent value replaces the earlier one.

For example:

cat <<EOF | kubectl create -f -
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: ovs-conf
  annotations:
    k8s.v1.cni.cncf.io/resourceName: ovs-cni.network.kubevirt.io/br1
    k8s.v1.cni.cncf.io/resourceName: intel.com/mellanox_snic0
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "ovs",
      "bridge": "br1",
      "vlan": 100
    }'
EOF

Can we have a different annotation key for defining the OVS bridge name, like below?

  annotations:
    k8s.v1.cni.cncf.io/bridgeName: ovs-cni.network.kubevirt.io/br1
    k8s.v1.cni.cncf.io/resourceName: intel.com/mellanox_snic0

Then we can enhance the network resources injector to support both resources, so the pod spec can be injected with the appropriate resource requests/limits.
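
For illustration, here is a rough sketch of what such an enhanced injector could add to a pod spec (the pod name, container name and image are made up, and the injected fields are an assumption about the proposed behaviour, not what the injector does today):

apiVersion: v1
kind: Pod
metadata:
  name: ovs-workload                      # hypothetical pod
  annotations:
    k8s.v1.cni.cncf.io/networks: ovs-conf
spec:
  containers:
  - name: app                             # hypothetical container
    image: registry.example.com/app       # placeholder image
    resources:
      requests:
        intel.com/mellanox_snic0: "1"            # VF from the device-plugin pool
        ovs-cni.network.kubevirt.io/br1: "1"     # bridge resource exposed by the ovs-cni marker
      limits:
        intel.com/mellanox_snic0: "1"
        ovs-cni.network.kubevirt.io/br1: "1"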

@phoracek
Member

We cannot change the annotation name; the code that consumes the annotation (the resource injector) is not owned by us. Can you just omit the br1 resource name? It should still work as long as you have the bridge configured on all nodes that have intel.com/mellanox_snic0.
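
For example, the same net-attach-def with the br1 resource name omitted would look roughly like this (a sketch derived from the definition above):

cat <<EOF | kubectl create -f -
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: ovs-conf
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/mellanox_snic0
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "ovs",
      "bridge": "br1",
      "vlan": 100
    }'
EOF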

@pperiyasamy
Collaborator Author

@phoracek Yes, the OVS bridge name can be omitted from the net-attach-def, but some worker nodes may not have that OVS bridge.
Hence the OVS bridge needs to be injected into the pod spec as a resource requirement so that the kube scheduler can place the pod on the right node.

@phoracek
Member

I would accept a PR for the marker to expose additional resource labels.

However, you would need to introduce your own webhook injecting the resource request to Pods (like I said, we don't own the resource injector webhook).

An alternative to the webhook would be to put the resource requirement on your Pods manually.
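
For example, the manual variant might look roughly like this (a sketch; pod name, container name and image are made up):

apiVersion: v1
kind: Pod
metadata:
  name: ovs-workload
  annotations:
    k8s.v1.cni.cncf.io/networks: ovs-conf
spec:
  containers:
  - name: app
    image: registry.example.com/app
    resources:
      limits:
        ovs-cni.network.kubevirt.io/br1: "1"   # bridge resource requested by hand instead of by a webhook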

@phoracek
Member

If you insert the resource request manually, we don't need any change to the marker.

@JanScheurich

For Mellanox SmartNIC network attachment definitions there is no need to refer to the OVS bridge name, as the OVS bridge is connected to the Linux bond of the two SmartNIC PFs, while all the SmartNIC VFs are configured to be part of the device pool for the SmartNIC. OVS bridge and device pool are "synonymous" so that referring to the VF device pool guarantees injection of the VF resource into the pod spec as well as scheduling of the pod to a suitable node.

I assume the introduction of the OVS CNI marker resource was a "trick" to solve the scheduling of pods on heterogeneous K8s nodes without the need for node selector labels. Have you considered using simple node selector labels instead of a dummy bridge resource (with a new/enhanced webhook)?

@phoracek
Member

We have considered node selectors. However, we wanted to reuse the existing mechanism available through the resource injector, so we don't need to reinvent new webhooks.

@phoracek
Member

In KubeVirt, we use the resource injector to handle scheduling for SR-IOV and Linux bridge (https://github.com/kubevirt/bridge-marker) too.

@pperiyasamy
Collaborator Author

Raised a PR in the network resources injector.

@moshe010
Contributor

moshe010 commented Mar 3, 2020

@phoracek why did you use nfd [1] with Feature Detector Hooks? Can you explain why an OVS bridge is a countable resource?
[1] - https://github.com/kubernetes-sigs/node-feature-discovery#feature-labels

@JanScheurich

Modelling OVS bridges as node features looks like an interesting alternative to the trick with countable bridge resources. However, there are two aspects of the current bridge resource approach that I would like to keep:

  1. The marker monitors OVS bridges on the nodes and creates new bridge resources as needed. Could the node feature detector do something similar?
  2. The Network Resources Injector automatically adds the bridge resource requirement to the pod spec, so applications do not need to specify node selector labels to solve the scheduling. Could we enhance the Network Resources Injector to also inject node selector labels?

@moshe010
Contributor

moshe010 commented Mar 3, 2020

So can you explain how the marker works today? Does it create/remove bridges when a network is added to/deleted from a network attachment?

@pperiyasamy
Collaborator Author

@moshe010 AFAIK the ovs-cni marker exposes an OVS bridge as a scalar resource with a capacity of 1k, and a net-attach-def object can refer to it via the k8s.v1.cni.cncf.io/resourceName annotation. This helps the k8s scheduler place the pod on a node where that particular OVS bridge exists, but the resource is never actually consumed by the device plugin or by multus/ovs-cni.
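
For illustration, this is roughly how the marker-exposed bridge would appear on a node (a sketch; the node name is hypothetical and the fields are simplified):

apiVersion: v1
kind: Node
metadata:
  name: worker-1
status:
  capacity:
    ovs-cni.network.kubevirt.io/br1: 1k     # scalar resource reported by the marker
  allocatable:
    ovs-cni.network.kubevirt.io/br1: 1k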

As @JanScheurich mentioned, it would be more appropriate if the ovs-cni marker exposed OVS bridges as a node feature and the network resources injector used node selector labels.

@phoracek WDYT ?

@pperiyasamy
Collaborator Author

pperiyasamy commented Mar 3, 2020

/cc @zshi-redhat

@phoracek
Member

phoracek commented Mar 3, 2020

@phoracek why did you use nfd [1] with Feature Detector Hooks? Can you explain why an OVS bridge is a countable resource?
[1] - https://github.com/kubernetes-sigs/node-feature-discovery#feature-labels

You meant to ask why we did not use it?

We considered NFD in the beginning; however, it has a few drawbacks. It would require us to deploy the NFD components and to write a network-label-injector webhook. Writing webhooks is difficult due to certificate handling, and deploying NFD may be complicated since it is not a trivial component.

Since the resource-based way of scheduling was already employed for SR-IOV, we decided to use it for the OVS and Linux bridge CNIs as well. All we need is a simple DaemonSet exposing the discovered bridges as resources on the node. The resource injector is already deployed on the cluster for us (handled through KubeVirt or SR-IOV).

We chose to follow the simpler way, even though the API is perhaps not as clean as it would be with NFD.

@moshe010 AFAIK the ovs-cni marker exposes an OVS bridge as a scalar resource with a capacity of 1k, and a net-attach-def object can refer to it via the k8s.v1.cni.cncf.io/resourceName annotation. This helps the k8s scheduler place the pod on a node where that particular OVS bridge exists, but the resource is never actually consumed by the device plugin or by multus/ovs-cni.

As @JanScheurich mentioned, it would be more appropriate if the ovs-cni marker exposed OVS bridges as a node feature and the network resources injector used node selector labels.

@phoracek WDYT ?

I'm really hesitant to switch to NFD now. We don't have any issues with the current approach. If somebody is willing to contribute NFD support, including the injector webhook, I would be willing to discuss it. But like I said, I would be really hesitant unless we have a better reason than that it would look better.

Note that you can decide to just deploy OVS CNI without the marker and handle the scheduling in any way you like.

tl;dr: Why do we want to switch from the current approach?

@JanScheurich

Maybe a stupid question: Can't we change to using node-selector labels in a cheaper way?
The OVS CNI marker could just add bridge node-selector labels to the node object instead of creating dummy bridge resources. And the Network Resources Injector webhook could be enhanced to check for label annotations in NADs and inject them as node-selector labels into the pod spec.
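
For illustration, a sketch of how that could look end to end (the label key and the injector behaviour below are assumptions for the example, not existing functionality):

# Node labelled by the marker instead of carrying a bridge resource
apiVersion: v1
kind: Node
metadata:
  name: worker-1                               # hypothetical node
  labels:
    ovs-cni.network.kubevirt.io/br1: "true"    # assumed label key
---
# Pod after an enhanced injector has turned the NAD's label annotation
# into a node selector
apiVersion: v1
kind: Pod
metadata:
  name: ovs-workload
  annotations:
    k8s.v1.cni.cncf.io/networks: ovs-conf
spec:
  nodeSelector:
    ovs-cni.network.kubevirt.io/br1: "true"
  containers:
  - name: app
    image: registry.example.com/app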

@phoracek
Member

phoracek commented Mar 3, 2020

There are no dummy bridge resources. We don't use a DP. We merely expose the available bridge as a node resource, not countable: https://github.com/kubevirt/ovs-cni/blob/master/pkg/marker/marker.go#L115 I don't see what is so expensive here.

If the resource injector works with labels, that would be great. But I don't see a big difference between exposing the value in resource or label field. But then, I don't know that much about Kubernetes internals.

@phoracek
Member

phoracek commented Mar 3, 2020

But maybe I just don't understand "The OVS CNI marker could just add bridge node-selector labels to the node resource instead of creating dummy bridge resources." properly.

@JanScheurich

There are no dummy bridge resources. We don't use a DP. We merely expose the available bridge as a node resource, not countable: https://github.com/kubevirt/ovs-cni/blob/master/pkg/marker/marker.go#L115 I don't see what is so expensive here.

If I'm not mistaken each bridge is represented as a countable node resource with 1000 units:
https://github.com/kubevirt/ovs-cni/blob/master/pkg/marker/marker.go#L34

Yes, this works (as long as there are not more than 1000 ports per bridge), but it is still a trick, as there is no actual well-defined limit on the number of ports that can be connected to an OVS bridge.

If the resource injector works with labels, that would be great. But I don't see a big difference between exposing the value in resource or label field. But then, I don't know that much about Kubernetes internals.

It would just be conceptually cleaner than consuming pseudo-resources. I do understand the original reasoning for re-using the existing Network Resource Injector scheme, but perhaps we can improve that by allowing injection of labels as well.

@phoracek
Member

phoracek commented Mar 3, 2020

If I'm not mistaken each bridge is represented as a countable node resource with 1000 units:
https://github.com/kubevirt/ovs-cni/blob/master/pkg/marker/marker.go#L34

It has an available amount of 1000, but it is not a list of 1000 devices, as it would be in the case of a DP where each item represents a device. Instead, it is 1000 as in 1 kilo, as if it were an amount of available memory. Therefore I believe there is not much overhead present.

Yes, this works (as long as there are not more than 1000 ports per bridge), but it is still a trick, as there is no actual well-defined limit on the number of ports that can be connected to an OVS bridge.

The maximum limit is not an issue, is it? I doubt there would be 1000 connections to a single bridge on a node. It is simply an implicit "infinite".

It would just be conceptually cleaner than consuming pseudo-resources. I do understand the original reasoning for re-using the existing Network Resource Injector scheme, but perhaps we can improve that by allowing injection of labels as well.

It would absolutely be cleaner. I would like to hear more about the needed changes. If I have a clean Kubernetes cluster, what would I need to do to get an NFD-based solution running? How would we deliver an NFD hook, etc.? I'm not strictly against this, I just want to be sure the effort/outcome ratio is reasonable.

@JanScheurich

@phoracek: I didn't mean to imply that there was a lot of overhead in modelling the bridge as a resource.

My thought was whether we could do the label approach without the dependency on NFD, still relying on a (slightly enhanced) Network Resources Injector.

@phoracek
Member

phoracek commented Mar 3, 2020

(Sorry if I look annoyed and push back too hard, I don't mean to)

@JanScheurich could you summarize the benefits of such a change? Anything apart from getting rid of the countable API which does not make sense for "uncountable" resources?

BTW, even though we don't implement QoS now, the countable approach may be helpful once we decide to. It would let users request, say, 1GB of throughput.

@kubevirt-bot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@kubevirt-bot kubevirt-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 1, 2020
@kubevirt-bot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

@kubevirt-bot kubevirt-bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 1, 2020
@kubevirt-bot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

@kubevirt-bot
Collaborator

@kubevirt-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
