
ipam: integrate w/ k8s multi-network defacto standard persistent IPs #279

Merged
merged 1 commit on Apr 18, 2024

Conversation

@maiqueb (Contributor) commented Mar 18, 2024

What this PR does / why we need it:
This PR adds a design document outlining how to integrate the persistent IPs feature of the kubernetes multi-networking de-facto standard into KubeVirt.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Checklist

This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.

Release note:

NONE

@kubevirt-bot kubevirt-bot added the dco-signoff: yes Indicates the PR's author has DCO signed all their commits. label Mar 18, 2024
@maiqueb maiqueb force-pushed the sdn-ipam-secondary-nets branch 3 times, most recently from 50bb44b to 0fa05e6 on March 19, 2024 12:37
@maiqueb (Contributor Author) commented Mar 19, 2024

/assign @AlonaKaplan

@EdDev (Member) commented Mar 20, 2024

/cc

/sig network

@kubevirt-bot kubevirt-bot requested a review from EdDev March 20, 2024 05:32
@maiqueb maiqueb force-pushed the sdn-ipam-secondary-nets branch 2 times, most recently from 4f6a8de to c191af7 on March 20, 2024 14:37
@EdDev (Member) left a comment:

Thank you for posting this interesting proposal.

In addition to the review comments, I was thinking about trying to propose an alternative that decouples Kubevirt from the IPClaim in both directions.

I'm not sure if it is best to discuss it in a thread here, but it can be a good initial start.

It is in the spirit of IPClaim having its own controller, but not being dependent on Kubevirt directly.
I will try to write something early next week or update back if I fail.


## Motivation
Virtual Machine owners want to offload IPAM from their custom solutions (e.g.
custom DHCP server running on their cluster network) to SDN.
Member:

Could you please clarify if this is about the primary pod network, secondary networks or both?

I think it is also important to specify if we are talking about a single interface or more.
For a multi-interface machine (be it physical or virtual), IP management for all the interfaces is tricky. E.g. having several interfaces with dhcp-client may result in an undesired network configuration (multi default routes, mix of dns entries, etc).

How existing clusters resolve the configuration challenges will help (e.g. with a DHCP server, cloud-init config, scripts, etc).

Contributor Author:

Could you please clarify if this is about the primary pod network, secondary networks or both?

Only secondary networks, as explicitly indicated on the goals (and non-goals) section.

I think it is also important to specify if we are talking about a single interface or more. For a multi-interface machine (be it physical or virtual), IP management for all the interfaces is tricky. E.g. having several interfaces with dhcp-client may result in an undesired network configuration (multi default routes, mix of dns entries, etc).

Well, for multiple ones.

That depends on the configuration provided by the cluster admin in the NAD (routes / DNS). It has nothing to do with this feature AFAIU.

How existing clusters resolve the configuration challenges will help (e.g. with a DHCP server, cloud-init config, scripts, etc).

I do not understand what you're asking for. Are you asking for anything here ?

Member:

Only secondary networks, as explicitly indicated on the goals (and non-goals) section.

When reading the motivation, I did not understand the actual need is for secondary networks.
I guess you could explain why this is not needed for the pod network but is needed for the secondaries.

Well, for multiple ones.

That depends on the configuration provided by the cluster admin in the NAD (routes / DNS). It has nothing to do with this feature AFAIU.

The feature is triggered by a need. I think it is worth expressing if the need is for a single one or more.
It may influence the solution proposed.

How existing clusters resolve the configuration challenges will help (e.g. with a DHCP server, cloud-init config, scripts, etc).

I do not understand what you're asking for. Are you asking for anything here ?

Yes, I am asking to explain how the legacy VM or physical machines handled this need.
That way, I can understand if you take a similar solution to Kubevirt or you invent a new solution to an old problem.
Given a problem/challenge, how it was solved so far in other platforms (including bare-metal) can help provide context and support this proposal.

Contributor Author:

Only secondary networks, as explicitly indicated on the goals (and non-goals) section.

When reading the motivation, I did not understand the actual need is for secondary networks. I guess you could explain why this is not needed for the pod network but is needed for the secondaries.

The pod network is managed / owned by kubernetes. A direct consequence of that is users get IPAM on it already. It's part of what it does. I.e. they don't have to manage IP addresses / configure DHCP servers / etc. The platform does that for them.

I hope this is clear enough.

Well, for multiple ones.
That depends on the configuration provided by the cluster admin in the NAD (routes / DNS). It has nothing to do with this feature AFAIU.

The feature is triggered by a need. I think it is worth expressing if the need is for a single one or more. It may influence the solution proposed.

It isn't specified, but I assume asking for IPAM on secondary interfaces means just that ... IPAM on secondary interfaces. I.e. every secondary interface can have IPAM depending on the config set by the user.

How existing clusters resolve the configuration challenges will help (e.g. with a DHCP server, cloud-init config, scripts, etc).

I do not understand what you're asking for. Are you asking for anything here ?

Yes, I am asking to explain how the legacy VM or physical machines handled this need. That way, I can understand if you take a similar solution to Kubevirt or you invent a new solution to an old problem. Given a problem/challenge, how it was solved so far in other platforms (including bare-metal) can help provide context and support this proposal.

I still don't understand. Are you asking how other platforms implement IPAM, or are you asking how other platforms get around not having IPAM ? The response to the latter is stated right there in the motivation - they use DHCP servers, and static IP addressing.

Member:

It would be nice if you added the information you explained very well here to the proposal text as well.

I still don't understand. Are you asking how other platforms implement IPAM, or are you asking how other platforms get around not having IPAM ? The response to the latter is stated right there in the motivation - they use DHCP servers, and static IP addressing.

I ask how other platforms solved the IPAM problem for multiple interfaces, if at all.
I can be specific if you prefer: Has Openstack assigned IP addresses to more than one interface in a VM, and how (all interfaces used DHCP, one used DHCP and all others static, all static, etc)?

I am just having trouble understanding how a machine with multiple DHCP clients can work, unless heavy restrictions are placed on them.

design-proposals/sdn-ipam-secondary-nets.md (resolved)
Comment on lines +63 to +74
Thus, we propose to introduce a new CRD to the k8snetworkplumbingwg, and make it
part of the Kubernetes multi-network defacto standard in this update proposal.
The `IPAMClaim` CRD was added to the Kubernetes multi-networking de-facto
[standard](https://github.com/k8snetworkplumbingwg/multi-net-spec/blob/master/v1.3/%5Bv1.3%5D%20Kubernetes%20Network%20Custom%20Resource%20Definition%20De-facto%20Standard.pdf)
version 1.3. The sections describing the CRD and how to use them are in sections
4.1.2.1.11 (network selection element update) and 8 (IPAMClaim CRD).
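
For illustration only - a minimal Go sketch of how a network selection element carrying a claim reference could be serialized into the pod's networks annotation. The `ipam-claim-reference` attribute is the one referenced throughout this thread; the network and claim names are illustrative, and the struct is a trimmed-down stand-in rather than the actual API type:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NetworkSelectionElement is a trimmed-down stand-in for the element defined
// in the multi-network de-facto standard; only the fields relevant to this
// discussion are shown.
type NetworkSelectionElement struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace,omitempty"`
	// IPAMClaimReference points at the IPAMClaim holding the persistent
	// allocation (the attribute added in v1.3, section 4.1.2.1.11).
	IPAMClaimReference string `json:"ipam-claim-reference,omitempty"`
}

func main() {
	elements := []NetworkSelectionElement{{
		Name:               "tenant-blue", // NAD name - illustrative
		Namespace:          "default",
		IPAMClaimReference: "vm-a.tenant-blue", // <vm name>.<logical network name>
	}}
	annotation, err := json.Marshal(elements)
	if err != nil {
		panic(err)
	}
	// This is the kind of value the workload controller would set on the
	// pod's k8s.v1.cni.cncf.io/networks annotation.
	fmt.Println(string(annotation))
}
```
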
Member:

This paragraph seems not to be related to Kubevirt, but to a given generic feature.
Referencing it as a dependency for the functionality, with a pointer for further reading, should be enough.

Please do mention in what release stage it is at the moment.

Contributor Author:

What exactly do you propose I do differently ? I am listing the standard we're following, pointing to it, and indicating the relevant portions of the standard, since I cannot point to particular sections of the pdf file.

All of that is IMHO required / helpful to the reader.

Member:

This section seems to propose the IPClaim CRD and its solution.
It is fine to give an overview of what it does and reference the relevant information.

All I am saying here is that you should place this in its own section and describe it as an existing service/solution which Kubevirt will use.

Comment on lines +73 to +80
The CNI plugin must be adapted to compute its IP pool not only from the live
pods in the cluster, but also from these CRs.
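
A rough sketch in Go of the pool computation described above, using simplified stand-in types rather than the actual OVN-Kubernetes code: the allocator is seeded from both live pod allocations and the addresses persisted in `IPAMClaim` status, so a stopped VM's addresses stay reserved.

```go
package main

import "fmt"

// Simplified stand-ins for the objects the CNI plugin would inspect.
type Pod struct{ IPs []string }
type IPAMClaim struct{ StatusIPs []string }

// reservedIPs unions the addresses held by live pods with the addresses
// persisted in IPAMClaim status, so allocations of stopped VMs stay reserved.
func reservedIPs(pods []Pod, claims []IPAMClaim) map[string]struct{} {
	reserved := map[string]struct{}{}
	for _, p := range pods {
		for _, ip := range p.IPs {
			reserved[ip] = struct{}{}
		}
	}
	for _, c := range claims {
		for _, ip := range c.StatusIPs {
			reserved[ip] = struct{}{}
		}
	}
	return reserved
}

func main() {
	pods := []Pod{{IPs: []string{"10.10.0.5"}}}
	claims := []IPAMClaim{{StatusIPs: []string{"10.10.0.9"}}} // VM currently stopped
	fmt.Println(len(reservedIPs(pods, claims)))               // 2 addresses kept out of the free pool
}
```
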
Member:

I am not sure how this is related to kubevirt. I guess explicitly mentioning how it relates to kubevirt will help.

Contributor Author:

I assume only indicating how to integrate a feature into kubevirt is probably not enough for the reader to understand how the feature works.

Thus - now and then - I try to indicate in a couple of sentences what the systems kubevirt is interacting with need to do.

I.e. just creating an ipam claim (and not putting IPs there) is not good enough. I guess the reader needs the information of who assigns the IPs, who stores the IPs in the claims, and what the claim is used for.

Member:

All I ask is to explain it from the Kubevirt perspective.
I agree we need some details to understand how it works, but we surely do not need everything. Even if you think it is better to have it all here, I suggest placing it in a specific section so it will not overload the Kubevirt part (i.e. "if you want more details, go to ...").

Comment on lines +76 to +88
### Configuring the feature
We envision this feature to be configurable per network, meaning the network
admin should enable the feature by enabling the `allowPersistentIPs` flag in the
CNI configuration for the secondary network (i.e. in the
`NetworkAttachmentDefinition` spec.config attribute).

A feature gate may (or may not) be required in the KubeVirt.
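
A hedged example of what parsing such a NAD `spec.config` could look like. The `allowPersistentIPs` knob is the one named in the excerpt; the CNI type and remaining values are illustrative placeholders, not a prescribed configuration:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// netConf mirrors only the portion of the NetworkAttachmentDefinition
// spec.config relevant here; allowPersistentIPs is the knob named in the
// proposal, the remaining values are illustrative.
type netConf struct {
	CNIVersion         string `json:"cniVersion"`
	Name               string `json:"name"`
	Type               string `json:"type"`
	AllowPersistentIPs bool   `json:"allowPersistentIPs,omitempty"`
}

func main() {
	// Example spec.config payload a network admin might author.
	config := `{"cniVersion":"0.4.0","name":"tenant-blue","type":"ovn-k8s-cni-overlay","allowPersistentIPs":true}`

	var conf netConf
	if err := json.Unmarshal([]byte(config), &conf); err != nil {
		panic(err)
	}
	// Persistent IPs would only be requested for networks where the flag is set.
	fmt.Println("persistent IPs enabled:", conf.AllowPersistentIPs)
}
```
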
Member:

Same, please refer to the IPClaim as a given 3rd party feature which kubevirt uses, not define it here.

Regarding the FG, I think it is required, just because of the change magnitude and the usage of a 3rd party early release stage (Alpha?) of that feature.

Contributor Author:

Same, please refer to the IPClaim as a given 3rd party feature which kubevirt uses, not define it here.

I honestly don't follow what you want. Please clarify, and if possible, suggest a change.

Regarding the FG, I think it is required, just because of the change magnitude and the usage of a 3rd party early release stage (Alpha?) of that feature.

I disagree. I think the magnitude of the feature is pretty small, and the feature is disabled by default on the networks. But I'll bend - if the community wants a feature gate, I'll give them one.

@dankenigsberg / @fabiand IIRC you've argued extensively against adding more feature gates. You may want to chime in here.

Member:

I honestly don't follow what you want. Please clarify, and if possible, suggest a change.

Quoting: We envision this feature to be configurable per network.
The feature Kubevirt is to use is a given; per my understanding it is per network, and you do not define it here. This is what I meant.
When reading, it sounds as if that feature is suggested and not already provided.
Moreover, I think it is worth explaining in simple words who provides the IPClaim service, e.g. Multus, CNI and a network provider. Providing the specific ones will help clarify that this is currently specific to OVN Kubernetes (but not limited to it, IIUC).

I disagree. I think the magnitude of the feature is pretty small, and the feature is disabled by default on the networks.

The integration to support this IP claim is pretty intrusive into Kubevirt IMO.
It touches Kubevirt controllers, requiring the project to know about two specific CNI parameters and a new CRD in its Alpha release stage.
I was under the impression that the FG has been negotiated already and is going to be added.

Member:

dankenigsberg / fabiand IIRC you've argued extensively against adding more feature gates. You may want to chime in here.

It depends on the size of the functionality, the risk of the code and the doubts about the new API. I did not read the current proposal, so I don't have an opinion yet. I must confess that after reading #251 I look at FGs more positively than before, as they are different from a redundant configurable. A Feature Gate is only a temporary thing if we are not sure that the feature would graduate.

Member:

As Eddy mentioned, since the feature resides on an alpha stage CRD I think it should be behind a FG.

Contributor Author:

OK, I can do that.


#### Hot-plug a VM interface
This flow is exactly the same as
Member:

Is the expected integration point at the controller the same?
Perhaps the integration point should be explicitly defined here.

Contributor Author:

Please help me understand your expectation. I don't understand what you mean with integration point.

Member:

The location in the VMI controller you intend to process the logic.
Or any other place you need to place logic at to reconcile this.

E.g. if you need to integrate this at the VM controller and not at the VMI controller, it matters.

Contributor Author:

I'm pretty sure the plan is to integrate at the VMI controller, given we also want to support IP stickiness for VMI migration.

Member:

I am not in favor of supporting it at the VMI level, but it will be interesting to understand the exact need and usefulness. The migration is covered if initiated from a VM object.

We have several features, like hotplug, that are supported only at the VM level.
I'm not seeing the added value of supporting VMI independently, but if in practice we do have such users with common usage, I should reconsider my position on this.

@oshoval commented Apr 2, 2024:

About whether to put the logic in the VM or VMI controller -
a few reasons why IMHO we should use the VMI controller:

  1. Only the VMI controller knows when the pod is created.
    Creating the claims there, just before the pod is created, removes the need to recreate / re-validate on every VM reconcile iteration, and it is a much better design IMHO.

  2. The VMI controller is the one that already reads the NAD, so it is more natural for it to do so, and I don't think we should add NAD reading to the VM controller as well (we need this info for the claims, and we also need the interface name scheme, which is calculated in the VMI controller).

  3. For hot (un)plug we would also need to sync the claims; the annotation update is the point that already calculates the difference that needs to be applied (and it is in the VMI controller). Note that it is a bit tricky (if operations fail), but it would be even trickier in the VM controller.

Member:

If supporting sticky IP for VMI is not complicated, I don't see a reason not to support it.
The hotplug/unplug feature is invoked by editing the VM object. The VMI cannot be edited by the user, so it is pretty obvious to the user that the feature won't work for a VMI only.
However, in the sticky IP case, the user will have the NAD with a persistent IP, use the OVN CNI and won't understand why the IP is not sticky.
It is worth adding a section to the doc explaining the VMI-only flows.

Contributor Author:

Ultimately it's @AlonaKaplan's call - do we treat VMIs as an API that's meant to be used by users, or not.

Do you have any specific opinion here ? It would make our life easier if we knew exactly what the boundaries are / where they are.

I do understand hotplug only works for VMs.

... and here I hear that there's a precedent for not implementing features for VMIs.

Contributor Author:

@AlonaKaplan the VMI only flows are pretty much the same.

The notable exceptions are:

  • hotplug is not supported (no hotplug on VMI)
  • restart is not supported (restart essentially spins up a new VMI)

Everything else is pretty much the same - with the notable exception the IPAMClaim's reference owner is a VMI, instead of a VM.

@oshoval please add more detail if you think it is important

Reply:

You covered it all.
Supporting VMI standalone is even easier than not supporting it, because we don't need to add a check for whether there is a VM owner when updating the annotation / creating claims.
The only condition that exists is: if there is no VM owner, set the ownerRef to the VMI, as you said.
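
A tiny Go sketch of the owner-selection rule described in this thread, using stand-in types rather than the actual KubeVirt code: the claim is owned by the VM when one exists, otherwise by the standalone VMI.

```go
package main

import "fmt"

// ownerRef is a trimmed-down stand-in for metav1.OwnerReference.
type ownerRef struct {
	Kind string
	Name string
	UID  string
}

// pickClaimOwner reflects the rule above: the IPAMClaim is owned by the VM
// when the VMI has a VM owner, and by the VMI itself for standalone VMIs.
func pickClaimOwner(vmOwner *ownerRef, vmi ownerRef) ownerRef {
	if vmOwner != nil {
		return *vmOwner
	}
	return vmi
}

func main() {
	vmi := ownerRef{Kind: "VirtualMachineInstance", Name: "vm-a", UID: "2222"}
	vm := ownerRef{Kind: "VirtualMachine", Name: "vm-a", UID: "1111"}

	fmt.Println(pickClaimOwner(&vm, vmi)) // regular VM flow: VM-owned claim
	fmt.Println(pickClaimOwner(nil, vmi)) // standalone VMI flow: VMI-owned claim
}
```
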

design-proposals/sdn-ipam-secondary-nets.md (resolved)
- the spec.network attribute

The `IPAMClaim` name will be created using the following pattern:
`<vm name>.<logical network name>`. The logical network name is the [name of the
Member:

  • I think I asked about how this is done previously, please try and add a ref to here, it may help the next reader.
  • What happens if there are two VMs with the same name in different namespaces? Is the namespace part of the IPClaim somehow?
  • Having a ref to a name is problematic and raceful. Most objects have a UID to resolve the race.
    Perhaps an attempt to explain this using an example may help: We unplug and then plug quickly an interface, in the meantime the IPClaim is not yet removed completely and on creation it is still there.
    The same can happen I guess when deleting a VM and quickly creating another one with the same name.

Contributor Author:

  • I think I asked about how this is done previously, please try and add a ref to here, it may help the next reader.
  • What happens if there are two VMs with the same name in different namespaces? Is the namespace part of the IPClaim somehow?

The ipam claims are created in the namespace of the VM.

  • Having a ref to a name is problematic and raceful. Most objects have a UID to resolve the race.

The owner references of the objects carry the UID. We ensure we're handling the proper object by comparing owner references.

Perhaps an attempt to explain this using an example may help: We unplug and then plug quickly an interface, in the meantime the IPClaim is not yet removed completely and on creation it is still there.
The same can happen I guess when deleting a VM and quickly creating another one with the same name.

We check the owner references of the VMI / IPAMClaims. They must match. If they don't, we return an error, which will cause the reconcile loop to be retried later. Eventual consistency will ensure we eventually get this right (eventually the k8s GC will remove the stale resource; then eventually the "new" version manages to be created). Nothing new here IMO.
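
A minimal Go sketch of the naming pattern and the owner-reference check described above (simplified stand-in types; the real code would compare `metav1.OwnerReference`s from the live objects):

```go
package main

import (
	"errors"
	"fmt"
)

// ownerRef is a trimmed-down stand-in for metav1.OwnerReference.
type ownerRef struct {
	Kind string
	Name string
	UID  string
}

// claimName follows the `<vm name>.<logical network name>` pattern from the proposal.
func claimName(vmName, logicalNetwork string) string {
	return fmt.Sprintf("%s.%s", vmName, logicalNetwork)
}

// validateClaimOwner mimics the reconcile-time check described above: if the
// IPAMClaim with the expected name is owned by a different (stale) UID, return
// an error so the loop requeues and retries after the GC removes the old claim.
func validateClaimOwner(claimOwner, expectedOwner ownerRef) error {
	if claimOwner.UID != expectedOwner.UID {
		return errors.New("IPAMClaim owned by a stale object; requeue and retry")
	}
	return nil
}

func main() {
	fmt.Println(claimName("vm-a", "tenant-blue")) // vm-a.tenant-blue

	stale := ownerRef{Kind: "VirtualMachine", Name: "vm-a", UID: "1111"}
	current := ownerRef{Kind: "VirtualMachine", Name: "vm-a", UID: "2222"}
	fmt.Println(validateClaimOwner(stale, current)) // error -> reconcile retried later
}
```
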

Member:

Please add this to the text.

Regarding the potential race, I understand you use the owner ref to match the UID.
What I am not sure about is how you manage to do the same for the hot{un}plug scenario.

Contributor Author:

I might be missing something, but I don't see the difference.

Hotplug / unplug will still happen for a VM; the IPAMClaim will still have an owner reference.

If they don't match, we throw an error, and will try to reconcile later. If it never manages to plug / unplug the interface .. well, we'll keep trying with exponential backoff.

Member:

From the network interface you ref a claim, that ref is by name and not UID.
If one plugs, unplugs, plugs, unplugs, etc. the same interface, I do not understand how you can differentiate between claims of the same interface.

The difference from the VM UID is that here the interfaces are changing but the UID pointing to the VM is the same.

Contributor Author:

I think we may be able to simply indicate the dimension is interface per VM.

As long as we adjust the user's expectation, we should be OK.

I think it can go either way, but I feel failing until the "old" is removed is more aligned with the existing flow.

Member:

Technically, I'm not sure the VMI controller has a way to know the interface was just hotplugged and therefore that an old IPClaim shouldn't exist.

Contributor Author:

@oshoval can you chime in ?

Reply:

We have this logic in the VMI controller,
so if the annotation is different (the previous one didn't exist and now it does) at updateMultusAnnotation, it means an interface was hotplugged, doesn't it?

@AlonaKaplan (Member) commented Apr 17, 2024:

It may be raceful if the IPAMClaim was created but for some reason updating the pod annotation failed.
Maybe checking if the claim has a deletion timestamp should be enough.
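
A rough Go sketch of the two checks discussed in this thread - detecting a hotplug by diffing the desired networks against the current ones, and treating a claim that already carries a deletion timestamp as stale. Stand-in types only, not the actual controller code:

```go
package main

import (
	"fmt"
	"time"
)

// hotpluggedNetworks returns the network names present in the desired networks
// annotation but missing from the current one, i.e. interfaces being hotplugged.
func hotpluggedNetworks(current, desired []string) []string {
	existing := map[string]bool{}
	for _, name := range current {
		existing[name] = true
	}
	var added []string
	for _, name := range desired {
		if !existing[name] {
			added = append(added, name)
		}
	}
	return added
}

// claimIsStale mirrors the suggestion above: a claim that already carries a
// deletion timestamp belongs to a previous plug of the interface and should
// not be reused; the reconcile loop waits for it to disappear.
func claimIsStale(deletionTimestamp *time.Time) bool {
	return deletionTimestamp != nil
}

func main() {
	fmt.Println(hotpluggedNetworks([]string{"tenant-blue"}, []string{"tenant-blue", "tenant-red"})) // [tenant-red]

	now := time.Now()
	fmt.Println(claimIsStale(&now)) // true -> requeue until the old claim is gone
}
```
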

Comment on lines 399 to 433
The feature is opt-in via the network-attachment-definition, (thus disabled by
default) and it does not impose any performance costs when disabled; as a
result, adding a feature gate is not recommended.
Member:

This seems to confuse the feature lifecycle with the operational control of a feature.
A FG is required for other reasons, but if the configuration parameter is needed, it needs to be specified here with its default setting.

Contributor Author:

Isn't "The feature is opt-in via the network-attachment-definition, (thus disabled bydefault)" basically that ?

Furthermore, I've already thoroughly indicated how it works in the Configuring the feature section. Doing it again here feels like repeating myself.

Member:

This is a FG section and you describe an operational configuration parameter.
I tried to explain the difference so you could decide how to adjust it.

Member:

As I mentioned in a previous comment, I agree with Eddy and think a FG is required.
Besides what I mentioned in the previous comment regarding residing on a v1alpha1 API, I prefer not to parse the NAD's config if it is not required.

Contributor Author:

Whoever creates the IPAMClaims must indicate which network the claim corresponds to. Without that, the CNI's IP pool cannot be reconciled.

Member:

I agree, I was talking about the FG. If it is disabled there is no need to parse the NAD config.

Contributor Author:

Ah !!! My bad.

Right.

default) and it does not impose any performance costs when disabled; as a
result, adding a feature gate is not recommended.

#### KubeVirt API changes
Member:

Please add another section about API dependencies.
I.e. the dependency on the IPClaim CRD and the CNI spec (in the NAD config).

Please also mention the FG and operational configuration with their details here if needed.

Contributor Author:

Is that really necessary ? I've explained that already in the Design - you've even put comments there. Here's what I said above:

Thus, we propose to introduce a new CRD to the k8snetworkplumbingwg, and make it
part of the Kubernetes multi-network defacto standard in this update proposal.
The IPAMClaim CRD was added to the Kubernetes multi-networking de-facto
standard version 1.3. The sections describing the CRD and how to use them are in sections
4.1.2.1.11 (network selection element update) and 8 (IPAMClaim CRD).

Member:

Seems like a good section to move some of the details to.

If the kubevirt client changes, this is also a good place to mention it.

@maiqueb (Contributor Author) commented Mar 22, 2024

Thank you for posting this interesting proposal.

In addition to the review comments, I was thinking about trying to propose an alternative that decouples Kubevirt from the IPClaim in both directions.

I'm not sure if it is best to discuss it in a thread here, but it can be a good initial start.

It is in the spirit of IPClaim having its own controller, but not being dependent on Kubevirt directly. I will try to write something early next week or update back if I fail.

I honestly don't see how that is possible without traversing the pod's owner reference tree upward until its root using unstructured ... which is ... very ugly. That was the only "generic" way I can think of tying the IPAMClaim to an arbitrary resource type that would own it.

I prefer to go for something simple, straight-forward, and easy to reason about for the one use case we have in front of us, instead of going for the gold and trying to envision workload types we have no requirements for (and don't even know what they might be).

But be my guest - try to come up with something.

@maiqueb maiqueb requested a review from EdDev March 22, 2024 12:25
Member:

This is an attempt to describe an alternative on how the integration with the IPClaim service can be done without an explicit dependency on the IPClaim or NetworkAttachmentDefinition from the client side.

The basic starting points are these:

  • The client application that wants to use sticky IP does not need access to any new object.
  • The client application expresses its desired state through the pod object.
  • The sticky IP service does not need to know or depend on any specific client.

IPClaim controller

The controller monitors (through an informer) all pods and NetworkAttachmentDefinition/s in the cluster.

Note: Instead of watching all pods/nads, the list may be filtered with a label that marks that the pod/nad is using this IP claim service.

The controller looks at the following pod information:

  • The (multus) network annotation in which the networks are specified with the ipam-claim-reference.
  • A dedicated (new) ipclaim-owner annotation that contains as a value a serialized list of owners to the claim [1].

[1] https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.29/#ownerreference-v1-meta

The IPClaim controller maintains the lifecycle of IPClaim objects, from creation to deletion:

  • When a pod is detected with one or more ipam-claim-reference, it will look up the referenced NAD and create an IPClaim accordingly (if the NAD specifies that it supports it).
    The IPClaim is created with the owner reference as appearing in the pod annotation (ipclaim-owner).
  • In case the pod network annotation changes, the relevant IPClaim will be added or removed (hotplug support) based on the NAD information.
  • In case the pod is deleted, no action is taken.
  • To assure the IPClaim will not be deleted before the owner object is removed, a finalizer can be added and the IPClaim controller will check the following before removing it:
    • Identify if any active pod has the mentioned owner ref in the annotation. If no such pod exists, the finalizer can be removed.

Note: It is assumed that the finalizer removal process is triggered when the GC marks the IPClaim object for deletion, i.e. the owner is indeed gone and the GC acted. At this stage, if there is no pod left, it means we can delete the claim.
If someone else deletes the IPClaim, we have a problem.
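
A minimal Go sketch of how the `ipclaim-owner` annotation described in this alternative might be populated. Note the annotation key and its shape are part of this proposal sketch, not an existing API; the owner values are placeholders:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ownerRef is a trimmed-down stand-in for metav1.OwnerReference.
type ownerRef struct {
	APIVersion string `json:"apiVersion"`
	Kind       string `json:"kind"`
	Name       string `json:"name"`
	UID        string `json:"uid"`
}

func main() {
	// Hypothetical annotation key from this alternative; the IPClaim
	// controller would copy these owner references onto the IPClaims it creates.
	const ipclaimOwnerAnnotation = "ipclaim-owner"

	owners := []ownerRef{{
		APIVersion: "kubevirt.io/v1",
		Kind:       "VirtualMachine",
		Name:       "vm-a",
		UID:        "placeholder-uid", // a real UID in practice
	}}
	value, err := json.Marshal(owners)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s: %s\n", ipclaimOwnerAnnotation, value)
}
```
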

Kubevirt

The client side, e.g. Kubevirt, will express the desired state as follows:

  • The VM controller, when the IP claim service is "enabled", will add the ipclaim-owner annotation to the VMI object with the value of itself. The VMI controller will just copy it as-is to the pod later on.
  • The VMI controller will add to the pod network annotation the ipam-claim-reference if the feature is "enabled", regardless if the NAD is actually going to support it or not. It says something like this: "If IPAM and sticky IP is set in the NAD, I want it sticky and here is the IPClaim obj name to use".
  • For hotplug/hotunplug, the pod network annotation will change the same as today with no special handling needed beyond the addition of the new CNI field.

Note: If there is a need to control the sticky IP option per VM interface/network, it should be done explicitly using a knob or policy.
By always adding the new ipam-claim-reference, we risk having problems with other CNI plugins that do not support it. If this is indeed the case, we will probably need to solve it with a webhook, the same way the SR-IOV operator/cni solves such things.

Summary

The suggested solution uses an IPClaim controller to monitor pods and have access to NADs. The pod annotation is the interface between the clients (e.g. Kubevirt) and the IPClaim controller, allowing each to be decoupled from the other.
An IPClaim mutation admitter can be used to mark the pod network annotation, on a per-interface, per-NAD basis, to set or unset the ipam-claim.

Contributor Author:

This is an attempt to describe an alternative on how the integration with the IPClaim service can be done without an explicit dependency on the IPClaim or NetworkAttachmentDefinition from the client side.

The basic starting points are these:

  • The client application that wants to use sticky IP does not need access to any new object.
  • The client application expresses its desired state through the pod object.
  • The sticky IP service does not need to know or depend on any specific client.

IPClaim controller

The controller monitors (through an informer) all pods and NetworkAttachmentDefinition/s in the cluster.

Note: Instead of watching all pods/nads, the list may be filtered with a label that marks that the pod/nad is using this IP claim service.

The controller looks at the following pod information:

  • The (multus) network annotation in which the networks are specified with the ipam-claim-reference.
  • A dedicated (new) ipclaim-owner annotation that contains as a value a serialized list of owners to the claim [1].

Who would be the owner of the IPAMClaim ? I assume it would be (exclusively) the VM.

[1] https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.29/#ownerreference-v1-meta

The IPClaim controller maintains the lifecycle of IPClaim objects, from creation to deletion:

  • When a pod is detected with one or more ipam-claim-reference, it will look up the referenced NAD and create an IPClaim accordingly (if the NAD specifies that it supports it).
    The IPClaim is created with the owner reference as appearing in the pod annotation (ipclaim-owner).

  • In case the pod network annotation changes, the relevant IPClaim will be added or removed (hotplug support) based on the NAD information.

  • In case the pod is deleted, no action is taken.

  • To assure the IPClaim will not be deleted before the owner object is removed, a finalizer can be added and the IPClaim controller will check the following before removing it:

    • Identify if any active pod has the mentioned owner ref in the annotation. If no such pod exists, the finalizer can be removed.

IIUC this means that on IPAMClaim removal you'd have to fish for any pod w/ this particular annotation - from all pods in the system, or, if we have a special label on the pods subject to this feature, from within that subset.

If none is found, you remove the finalizer.

Did I get this right ?

Note: It is assumed that the finalizer removal process is triggered when the GC marks the IPClaim object for deletion, i.e. the owner is indeed gone and the GC acted. At this stage, if there is no pod left, it means we can delete the claim.
If someone else deletes the IPClaim, we have a problem.

I would argue this is OK - for someone to be able to manually remove the claim, it's because they've manually removed the finalizer, thus, they're trying to shoot themselves in the foot.

Kubevirt

The client side, e.g. Kubevirt, will express the desired state as follows:

  • The VM controller, when the IP claim service is "enabled", will add the ipclaim-owner annotation to the VMI object with the value of itself. The VMI controller will just copy it as-is to the pod later on.

How does the VM controller know if the IP claim service (I assume that is the controller ... - but I'm not sure) is running ?

  • The VMI controller will add to the pod network annotation the ipam-claim-reference if the feature is "enabled", regardless if the NAD is actually going to support it or not. It says something like this: "If IPAM and sticky IP is set in the NAD, I want it sticky and here is the IPClaim obj name to use".
  • For hotplug/hotunplug, the pod network annotation will change the same as today with no special handling needed beyond the addition of the new CNI field.

Note: If there is a need to control the sticky IP option per VM interface/network, it should be done explicitly using a knob or policy.
By always adding the new ipam-claim-reference, we risk having problems with other CNI plugins that do not support it. If this is indeed the case, we will probably need to solve it with a webhook, the same way the SR-IOV operator/cni solves such things.

Quoting from CNI spec:

Plugins may define additional fields that they accept and may generate an error if called with unknown fields.

So we cannot simply have KubeVirt always request this feature (since it may break arbitrary CNIs), which would require us to use webhook based solutions. We know those are not a good fit since they will require looking past the object being "looked upon". I think a solution based on eventual consistency is required.

Going past that ... what workloads would you make "subject" of the webhook ? Would it be all pods w/ a certain label ? If so, which pods will have the label ? How will KubeVirt know which pods to label ?

I am assuming that without looking in the NAD config, you'd set it on all launcher pods, which will make all KubeVirt VMs subject to the webhook. Which in turn means that if the webhook goes down (for whatever reason, but let's say certificate rotation, because that happens) we'd pretty much make it impossible for new VMs to be created in the entire system.

Are these assumptions correct ? The real question here I guess is which pods will be subject to the webhook.

Summary

The suggested solution uses an IPClaim controller to monitor pods and have access to NADs. The pod annotation is the interface between the clients (e.g. Kubevirt) and the IPClaim controller, allowing each to be decoupled from the other. An IPClaim mutation admitter can be used to mark the pod network annotation, on a per-interface, per-NAD basis, to set or unset the ipam-claim.

Member:

Who would be the owner of the IPAMClaim ? I assume it would be (exclusively) the VM.

This sentence is taken from the IPClaim controller context, so at its base, that controller does not care who the owner is.
From the Kubevirt point of view, I would expect it to be the VM indeed, no different from the original proposal.

  • To assure the IPClaim will not be deleted before the owner object is removed, a finalizer can be added and the IPClaim controller will check the following before removing it:
    • Identify if any active pod has the mentioned owner ref in the annotation. If no such pod exists, the finalizer can be removed.

IIUC this would be on IPAMClaim removal you'd have to fish for any pod w/ this particular annotation. From all pods in the system. Or, if we have a special label on the pods subject to this feature, from within that subset.

If none is found, you remove the finalizer.

Did I get this right ?

Yes.
Usually an informer is present which caches changes on the cluster.
The solution can require the presence of a label on the pod for it to be active in this "game", part of the "protocol" when working with IPClaims. It is then up to the pod creator or webhook to declare it part of the IP claim playground.

The VM controller, when the IP claim service is "enabled", will add the ipclaim-owner annotation to the VMI object with the value of itself. The VMI controller will just copy it as-is to the pod later on.

How does the VM controller know if the IP claim service (I assume that is the controller ... - but I'm not sure) is running ?

I meant the desire to have the sticky IP service. Nothing assures it can be served.
If and when the controller runs, it will be served.

So we cannot simply have KubeVirt always request this feature (since it may break arbitrary CNIs), which would require us to use webhook based solutions. We know those are not a good fit since they will require looking past the object being "looked upon". I think a solution based on eventual consistency is required.

In this case, a webhook is the only solution I can think of that works without requiring other clients to know and have access to a NAD.
We already have a version of it working with SR-IOV [1]. It basically parses the pod annotation, fetches the NADs and patches the pod resources accordingly.

[1] https://github.com/openshift/sriov-dp-admission-controller/blob/master/pkg/webhook/webhook.go#L609

Going past that ... what workloads would you make "subject" of the webhook ? Would it be all pods w/ a certain label ? If so, which pods will have the label ? How will KubeVirt know which pods to label ?

Only pods that have a network annotation and marked by the creator to be part of the IP claim "game".
This is something the IPClaim solution can define so users will mark accordingly.

I am assuming that without looking in the NAD config, you'd set it on all launcher pods, which will make all KubeVirt VMs subject to the webhook. Which in turn means that if the webhook goes down (for whatever reason, but let's say certificate rotation, because that happens) we'd pretty much make it impossible for new VMs to be created in the entire system.

Per what I know, webhooks can be registered so their absence will not block anything.
As the controller and webhook can reside on the same executable, we can argue that even though something was requested, it may not have been served. So it is all about what you really desire to happen and how (or if) it can be fixed as part of a reconciliation.

If I try to compare to the current proposal, I guess it is equal to failing to create an IPClaim CR or to start the pod if the CNI fails to detect the IPClaim (or something else). If the webhook fails due to downtime, it means it needs to retry in the VMI reconcile loop.

Are these assumptions correct ? The real question here I guess is which pods will be subject to the webhook.

I think all your points are relevant here. The client needs to express the desire to use the IP claim, therefore it should follow the "rules of engagement" with this service and mark the pod accordingly so it will work.

Member:

We already have a version of it working with SR-IOV [1]. It basically parses the pod annotation, fetches the NADs and patches the pod resources accordingly.

[1] https://github.com/openshift/sriov-dp-admission-controller/blob/master/pkg/webhook/webhook.go#L609

This is the U/S project; I should have referenced it instead of the one above.

https://github.com/k8snetworkplumbingwg/network-resources-injector

Contributor Author:

Trimming this down so we can focus on the most important points. I hope I'm not leaving anything behind.

So we cannot simply have KubeVirt always request this feature (since it may break arbitrary CNIs), which would require us to use webhook based solutions. We know those are not a good fit since they will require looking past the object being "looked upon". I think a solution based on eventual consistency is required.

In this case, a webhook is the only solution I can think of that works without requiring other clients to know and have access to a NAD.
We already have a version of it working with SR-IOV [1]. It basically parses the pod annotation, fetches the NADs and patches the pod resources accordingly.

[1] https://github.com/openshift/sriov-dp-admission-controller/blob/master/pkg/webhook/webhook.go#L609

The fact they looked past the object they are mutating is not something that encourages me to do the same. That's inherently raceful.

Furthermore, we already maintain a couple of webhooks and had sub-par experiences with them. Adding a webhook should not be done lightly, and I would avoid it if possible. I am especially reticent about maintaining a webhook based solution.

Going past that ... what workloads would you make "subject" of the webhook ? Would it be all pods w/ a certain label ? If so, which pods will have the label ? How will KubeVirt know which pods to label ?

Only pods that have a network annotation and marked by the creator to be part of the IP claim "game".
This is something the IPClaim solution can define so users will mark accordingly.

Can you elaborate on this ? What does "marked by the creator" mean ? What's the mark ? A label ? How will the client (let's assume kubevirt) know which ones to "mark" without looking into the configuration ?

Per what I know, webhooks can be registered so their absence will not block anything.
As the controller and webhook can reside on the same executable, we can argue that even though something was requested, it may not have been served. So it is all about what you really desire to happen and how (or if) it can be fixed as part of a reconciliation.

You can indeed indicate the behavior if the webhook fails: you can ignore failure (which might cause the workload to be accepted without the IPAM claim being created), or you can fail, which will reject the workload. Then, you can have a controller on top to ensure the IPAMClaims exist as part of its sync loop.

... which begs the question of why have the webhook in the first place if I'm going to have to reconcile. I would rather push for only setting the ipam-claim-reference network selection element attribute based on the feature gate, and have it mean the user understands that the CNI they're using can accept (or at least tolerate ...) the existence of this attribute. I think this would be a fair compromise.

Unfortunately, IMHO this proposal addresses the wrong problem. It could work - at a glance. But it focuses on presenting a generic solution for things that run on pods, while all the requirements we have in front of us are related to VMs. Plus, there's an agreed enhancement proposal (including agreement from the sig-network maintainer) for this existing proposal. When we agreed to write this proposal we did it because we were discussing how to integrate with KubeVirt. This seems like a different solution, with a different scope (more ambitious).

IMO narrowing the scope down to VMs/VMIs is cleaner, and cheaper to develop and maintain (across the ecosystem) than what you're proposing - at the expense of forcing kubevirt to consume a CRD and read a configuration object ( ... which it already does). Your proposal does have the advantage of looking like it could work for more types of workloads than VMs. But again: no one asked for that.

Member:

The fact they looked past the object they are mutating is not something that encourages me to do the same. That's inherently raceful.

The system is asynchronous by definition; it is not more raceful than this proposal.
When the VMI controller reads the NAD, it also cannot assure that the NAD continues to exist a msec later.

The system will eventually reconcile as the pod creation will fail and the VMI controller will retry to create the pod and hopefully the NAD will be there.
In both solutions it will basically recover the same way.

Furthermore, we already maintain a couple of webhooks and had sub-par experiences with them. Adding a webhook should not be done lightly, and I would avoid it if possible. I am especially reticent about maintaining a webhook based solution.

The difficulty of developing and maintaining the service is a valid concern; however, its ability to integrate with potential clients is also a valid point to consider.
I find it hard to accept the level of administration rights you expect from a client, i.e. accessing NAD/s requires a level of trust I do not think you should depend on.

Only pods that have a network annotation and marked by the creator to be part of the IP claim "game".
This is something the IPClaim solution can define so users will mark accordingly.

Can you elaborate on this ? What does "marked by the creator" mean ? What's the mark ? A label ? How will the client (let's assume kubevirt) know which ones to "mark" without looking into the configuration ?

"Marked by creator" means that the entity that created the pod, needs to mark it somehow (e.g. label) to be considered for IPClaim. This is a protocol you can define as part of the "service".

Kubevirt will probably mark them all, as it wants its pods to be processed if an IP claim is requested.

You can indeed indicate the behavior if the webhook fails: you can ignore failure (which might cause the workload to be accepted without the IPAM claim being created), or you can fail, which will reject the workload. Then, you can have a controller on top to ensure the IPAMClaims exist as part of its sync loop.

... which begs the question of why have the webhook in the first place if I'm going to have to reconcile. I would rather push for only setting the ipam-claim-reference network selection element attribute based on the feature gate, and have it mean the user understands that the CNI they're using can accept (or at least tolerate ...) the existence of this attribute. I think this would be a fair compromise.

When I read this I think you confuse a FG with an operational configuration.
But I really lost track of what we discuss here.

Unfortunately, IMHO this proposal addresses the wrong problem. It could work - at a glance. But it focuses on presenting a generic solution for things that run on pods, while all the requirements we have in front of us are related to VMs. Plus, there's an agreed enhancement proposal (including agreement from the sig-network maintainer) for this existing proposal. When we agreed to write this proposal we did it because we were discussing how to integrate with KubeVirt. This seems like a different solution, with a different scope (more ambitious).

From my perspective, the original approved proposal has been focusing on the IP reservation from the network provider side, the "service". I think the solution indeed answers the needs of potential clients.
However, I do not think the integration with Kubevirt serves the needs of both projects well.

On one side, Kubevirt is expected to heavily integrate the IPClaim in its controllers and depend on admin-level resources (NetworkAttachmentDefinition).
From the IPClaim side, it makes it hard to use, as you expect the other clients to have all these permissions, which they may not.

As a reviewer of this integration proposal into Kubevirt, I am trying to raise my concerns and warn about the implications and cons. I think Kubevirt should not increase its dependency surface, especially not against something like a NAD. I was pushing towards extracting specialized logic out of core Kubevirt and integrating it from outside (e.g. network binding plugin). I find this moving in the opposite direction and therefore tried to suggest an alternative.
I have no veto on the final decision, the decision was and still is of the maintainer.

IMO narrowing the scope down to VMs/VMIs is cleaner, and cheaper to develop and maintain (across the ecosystem) than what you're proposing - at the expense of forcing kubevirt to consume a CRD and read a configuration object ( ... which it already does). Your proposal does have the advantage of looking like it could work for more types of workloads than VMs. But again: no one asked for that .

All fair points.
I personally find a self-sustained service with its own dedicated owners to be a more maintainable, safe and stable solution.

@maiqueb (Contributor Author) left a comment:

@EdDev please let's continue the discussion.

Comment on lines +126 to +151
`IPAMClaim` status. Finally, the CNI will configure the interface with these
IP addresses.
Contributor Author:

Routing and DNS configuration are not in scope of this feature, and are configured by the network admin.

This feature's scope is IMO quite clear - and listed in the objectives. Ensure an IP address survives a VM restart & migration.

Routing & DNS are configured by the network / cluster admin either on the NAD, or cloud-init.

I can add a short paragraph indicating VM owners can use cloud-init to configure the guest if it does not have a DHCP client.



#### Hot-plug a VM interface
This flow is exactly the same as
Contributor Author:

This is AFAIU a requirement coming from stakeholders.
@phoracek

It's a good point though: if hotplug is only supported at the VM level - this is news to me; when I implemented it, a VMI could be hotplugged / unplugged from - it may be one less thing to drop from the requirements.

I will not die on this hill fwiw.

When creating the pod, the OVN-Kubernetes IPAM module finds existing
`IPAMClaim`s for the workload. It will thus use those already reserved
allocations, instead of generating brand new allocations for the pod where the
encapsulating object will run. The migration scenario is similar.
Contributor Author:

Some packets will surely be dropped, but we have an epic and a way forward to mitigate it, by changing the OVS port-to-node association dynamically based on the state of the migration.

HyperShift already does this tracking, but for something different; IIRC, it configures the ARP proxy only once the destination pod has taken over.
@qinqon can you chime in?

#### Starting a (previously stopped) Virtual Machine
This flow is - from a CNI perspective - quite similar to the
[Creating a VM flow](#creating-a-virtual-machine):
1. the workload controller (KubeVirt) templates the pod, featuring the required
Contributor Author:

We'll fail, and the VM will not start until the old IPAMClaims are deleted.


## Motivation
Virtual Machine owners want to offload IPAM from their custom solutions (e.g.
custom DHCP server running on their cluster network) to SDN.
Contributor Author:

Only secondary networks, as explicitly indicated on the goals (and non-goals) section.

When reading the motivation, I did not understand the actual need is for secondary networks. I guess you could explain why this is not needed for the pod network but is needed for the secondaries.

The pod network is managed / owned by kubernetes. A direct consequence of that is users get IPAM on it already. It's part of what it does. I.e. they don't have to manage IP addresses / configure DHCP servers / etc. The platform does that for them.

I hope this is clear enough.

Well, for multiple ones.
That depends on the configuration provided by the cluster admin in the NAD (routes / DNS). It has nothing to do with this feature AFAIU.

The feature is triggered by a need. I think it is worth expressing if the need is for a single one or more. It may influence the solution proposed.

It isn't specified, but I assume asking for IPAM on secondary interfaces means just that ... IPAM on secondary interfaces. I.e. every secondary interface can have IPAM depending on the config set by the user.

How existing clusters resolve the configuration challenges will help (e.g. with a DHCP server, cloud-init config, scripts, etc).

I do not understand what you're asking for. Are you asking for anything here ?

Yes, I am asking you to explain how legacy VMs or physical machines handled this need. That way, I can understand whether you are bringing a similar solution into KubeVirt or inventing a new solution to an old problem. Given a problem/challenge, knowing how it was solved so far on other platforms (including bare-metal) can help provide context and support this proposal.

I still don't understand. Are you asking how other platforms implement IPAM, or are you asking how other platforms get around not having IPAM ? The response to the latter is stated right there in the motivation - they use DHCP servers, and static IP addressing.

Contributor Author:

Trimming this down so we can focus on the most important points. I hope I'm not leaving anything behind.

So we cannot simply have KubeVirt always request this feature (since it may break arbitrary CNIs), which would require us to use webhook based solutions. We know those are not a good fit since they will require looking past the object being "looked upon". I think a solution based on eventual consistency is required.

In this case, a webhook is the only solution I can think of that works without requiring other clients to know and have access to a NAD.
We already have a version of it working with SR-IOV [1]. It basically parses the pod annotation, fetches the NADs and patches the pod resources accordingly.

[1] https://github.com/openshift/sriov-dp-admission-controller/blob/master/pkg/webhook/webhook.go#L609

The fact that they looked past the object they are mutating is not something that encourages me to do the same. That's inherently racy.

Furthermore, we already maintain a couple of webhooks and had sub-par experiences with them. Adding a webhook should not be done lightly, and I would avoid it if possible. I am especially reticent about maintaining a webhook based solution.

Going past that ... what workloads would you make "subject" of the webhook? Would it be all pods with a certain label? If so, which pods will have the label? How will KubeVirt know which pods to label?

Only pods that have a network annotation and are marked by the creator to be part of the IP claim "game".
This is something the IPClaim solution can define, so users will mark accordingly.

Can you elaborate on this? What does "marked by the creator" mean? What's the mark? A label? How will the client (let's assume KubeVirt) know which ones to "mark" without looking into the configuration?

From what I know, webhooks can be registered so that their absence will not block anything.
As the controller and webhook can reside in the same executable, we can argue that even though something was requested, it may not have been served. So it is all about what you really desire to happen and how (or if) it can be fixed as part of a reconciliation.

You can indeed indicate the behavior if the webhook fails: you can ignore failure (which might cause the workload to be accepted without the IPAM claim being created), or you can fail, which will reject the workload. Then, you can have a controller on top to ensure the IPAMClaims exist as part of their sync loop.

... which begs the question of why have the webhook in the first place if I'm going to have to reconcile. I would rather push for only setting the ipam-claim-reference network selection element attribute based on the feature gate, and have it mean the user understands the CNI they're using can accept (or at least tolerate ...) the existence of this attribute. I think this would be a fair compromise.
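To make that compromise concrete, here is a sketch (all names and values are illustrative, not a final API) of the network selection element KubeVirt would set on the virt-launcher pod when the feature gate is enabled:

```yaml
# Sketch only: the multus network selection element on the virt-launcher pod,
# carrying the ipam-claim-reference attribute. All names are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: virt-launcher-vm1-abcde
  annotations:
    k8s.v1.cni.cncf.io/networks: |
      [
        {
          "namespace": "default",
          "name": "tenantblue",
          "ipam-claim-reference": "vm1.tenantblue"
        }
      ]
# pod spec omitted for brevity
```

The assumption here is exactly the tolerance mentioned above: a CNI that does not understand the attribute is expected to simply ignore it.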

Unfortunately, IMHO this proposal addresses the wrong problem. It could work - at a glance. But it focuses on presenting a generic solution for things that run on pods, while all the requirements we have in front of us are related to VMs. Plus, there's an agreed enhancement proposal (including agreement from the sig-network maintainer) for this existing proposal. When we agreed to write this proposal we did it because we were discussing how to integrate with KubeVirt. This seems like a different solution, with a different scope (more ambitious).

IMO narrowing the scope down to VMs/VMIs is cleaner, and cheaper to develop and maintain (across the ecosystem) than what you're proposing - at the expense of forcing KubeVirt to consume a CRD and read a configuration object (... which it already does). Your proposal does have the advantage of looking like it could work for more types of workloads than VMs. But again: no one asked for that.

@maiqueb maiqueb requested a review from EdDev March 27, 2024 15:39
Comment on lines +56 to +67
Given OVN-Kubernetes (the CNI plugin where this feature will be implemented)
operates at pod level (i.e. it does **not** know what a KubeVirt VM is), and its
source of truth is in essence the live pods on the cluster, we need to have
something in the data model representing this existing allocation when the pod
is deleted (the stopped VM scenario). If we don't, those IP addresses would be
allocatable for other workloads while the VM is stopped.

Since IPAMClaim is part of the de-facto standard, we can say "Given a CNI supporting IPAMClaim".

When creating the pod, the OVN-Kubernetes IPAM module finds existing
`IPAMClaim`s for the workload. It will thus use those already reserved
allocations, instead of generating brand new allocations for the pod where the
encapsulating object will run. The migration scenario is similar.

At HyperShift we skip IP configuration at the pod and deliver addresses directly to the VM using DHCP; this allows us to use bridge binding, since it's just L2.

To have the same gateway independently of the node where the VM is running (in ovn-k the gateway of the default pod network differs depending on the node the VM is running on), we use an ARP proxy, and that is always configured. The only thing that is configured once the pod has taken over is the point-to-point routes.

More details here https://github.com/ovn-org/ovn-kubernetes/blob/master/docs/design/live_migration.md

pods in the cluster, but also from these CRs.

### Configuring the feature
We envision this feature to be configurable per network, meaning the network
Member:

What if multiple NADs have the same network but different allowPersistentIPs?
Should it fail at the KubeVirt level? I guess the CNI won't check it.

Contributor Author:

That's not possible according to ovn-kubernetes definition of a network - i.e. the configuration must be the same in all NADs.

I can't speak for other CNIs though.

Contributor Author:

That's not possible according to ovn-kubernetes definition of a network - i.e. the configuration must be the same in all NADs.

I can't speak for other CNIs though.

I don't expect the client side to care about that. It should care about the setting in the NAD.

Comment on lines +76 to +88
### Configuring the feature
We envision this feature to be configurable per network, meaning the network
admin should enable the feature by enabling the `allowPersistentIPs` flag in the
CNI configuration for the secondary network (i.e. in the
`NetworkAttachmentDefinition` spec.config attribute).

A feature gate may (or may not) be required in the KubeVirt.
Member:

As Eddy mentioned, since the feature relies on an alpha-stage CRD, I think it should be behind a FG.

We envision this feature to be configurable per network, meaning the network
admin should enable the feature by enabling the `allowPersistentIPs` flag in the
CNI configuration for the secondary network (i.e. in the
`NetworkAttachmentDefinition` spec.config attribute).
Member:

Why is it part of the spec.config and not an attribute in the spec? Should the CNI look at the value of allowPersistentIPs?

Contributor Author:

Why is it part of the spec.config and not an attribute in the spec?

I might be missing something. An attribute in the spec of what ?

Should the CNI look at the value of allowPersistentIPs?

Yes, the CNI also checks for this.

Member:

I might be missing something. An attribute in the spec of what ?

Of the NAD. I wasn't sure the CNI is using the value of allowPersistentIPs.
If it is, it makes sense for it to be part of the config.

Contributor Author:

yes, the CNI plugin should also ensure the plugin configuration has the knob enabled.
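For reference, a rough sketch of such a NAD. The plugin-specific fields (type, topology, subnets, netAttachDefName) are illustrative and depend on the CNI in use; allowPersistentIPs is the knob being discussed:

```yaml
# Illustrative NetworkAttachmentDefinition with the persistent-IPs knob enabled.
# Plugin-specific fields are examples only; allowPersistentIPs is what matters here.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: tenantblue
  namespace: default
spec:
  config: |
    {
      "cniVersion": "0.4.0",
      "name": "tenantblue",
      "type": "ovn-k8s-cni-overlay",
      "topology": "layer2",
      "subnets": "192.0.2.0/24",
      "netAttachDefName": "default/tenantblue",
      "allowPersistentIPs": true
    }
```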


### Configuring the feature
We envision this feature to be configurable per network, meaning the network
admin should enable the feature by enabling the `allowPersistentIPs` flag in the
Member:

What if allowPersistentIPs is disabled but there is an IPClaim, or the other way around? Does the CNI look for the IPClaim and ignore the allowPersistentIPs value?

Contributor Author:

we're actually discussing right now ignore vs error.

I think it should error, because it's pretty much clear we won't give the user what they want.

Thus failing seems more "honest".

- an external controller creates the persistent IPs allocation

The first two options require KubeVirt to instruct which allocations are
relevant when creating the pod encapsulating the VM workload; the last one only
Member:

the last - it is written "the last", but it seems the paragraph is about the "the CNI plugin creates the persistent IPs allocation" option.
Please add the pros and cons of the "an external controller creates the persistent IPs allocation" approach.

Contributor Author:

It's listed in the alternatives section, in detail.

I'll see what I can do about it here.

2. the user migrates the VM
3. the interfaces marked as absent will not be templated on the destination pod
(i.e. the migration destination pod will not have those interfaces)
4. KubeVirt dettaches the interface from the live VM
Member:

Actually this is step 2. It may happen before the migration (depending on how fast the migration was invoked).

Contributor Author:

OK, when I wrote this unplug worked differently.

- the spec.network attribute

The `IPAMClaim` name will be created using the following pattern:
`<vm name>.<logical network name>`. The logical network name is the [name of the
Member:

Technically, I'm not sure the VMI controller has a way to know the interface was just hotplugged and that therefore an old IPClaim shouldn't exist.

network](https://kubevirt.io/api-reference/main/definitions.html#_v1_network)
in the KubeVirt API.

The `IPAMClaim.spec.network` must be the name of network as per the
Member:

In the VM startup diagram the CNI is doing - subnet = GetNAD(ipamClaim).Subnet.
If the IPClaim has only the network name, how does the IPAM CNI find the NAD(s)?

Contributor Author:

it uses the network selection element. Seems the diagram is wrong, good catch.

Member:

If the network selection element is used, why does the IPClaim need the network field? When is it used?

Contributor Author:

the CNI uses it when synchronizing its IP pool using IPAMClaims. It needs to know which network the allocation belongs to.

Member:

Do you mean in case the CNI crashes and restarts?

Contributor Author:

Yes.

Member:

I still don't get why the network name is needed on the IPAMClaim. If the IPAM CNI has an access to the pods and to the NADs it can get the network name.

Contributor Author:

Added an explanation about this.

apiVersion: k8s.cni.cncf.io/v1alpha1
kind: IPAMClaim
metadata:
name: vm1.tenantblue
Member:

The fact the kubevirt logical network name and the OVN network name are both tenantblue is confusing. Please use different names (and add the VM/VMI snippet).

Contributor Author:

Done.
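To make the renaming concrete, a sketch of a VM spec fragment (values invented for illustration) where the KubeVirt logical network name differs from the NAD name; with the naming pattern described earlier in the thread, the resulting claim would be called vm1.data-network:

```yaml
# Illustrative fragment of a VirtualMachine spec: the logical network is
# "data-network", backed by the NAD "default/tenantblue".
spec:
  template:
    spec:
      domain:
        devices:
          interfaces:
            - name: data-network
              bridge: {}
      networks:
        - name: data-network
          multus:
            networkName: default/tenantblue
```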

Comment on lines 399 to 433
The feature is opt-in via the network-attachment-definition, (thus disabled by
default) and it does not impose any performance costs when disabled; as a
result, adding a feature gate is not recommended.
Member:

As I mentioned in a previous comment, I agree with Eddy and think a FG is required.
Besides what I mentioned in the previous comment regarding relying on a v1alpha1 API, I prefer not to parse the NAD's config if it is not required.


Please refer to the diagram below to better understand the proposed workflow
for VM creation:
1. the user provisions a VM object
2. the KubeVirt controller creates an IPAMClaim (in the same namespace of the
Member:

Why?
Would kubevirt raise a warning if a VM is using a NAD in the default NS and has allowPersistentIPs?

note over KubeVirt: we only iterate Multus non-default networks
loop for network := range vm.spec.networks
note over KubeVirt, apiserver: claimName := <vmName>.<network.Name>
KubeVirt->>apiserver: createIPAMClaim(claimName)
Member:

The design suggested two options: 1. KubeVirt will create an IPClaim for any secondary interface. 2. KubeVirt will create the IPClaim according to the NAD's allowPersistentIPs.
The diagram leaves the impression that option 1 was chosen.

encapsulating object will run. The migration scenario is similar.

#### Removing a Virtual Machine
This flow is - from a CNI perspective - quite similar to the
Member:

I meant in the doc:)

That finalizer is removed once the vm is deleted.

what if we don't have a VM (VMI only)?


currently manages. **Only afterwards** will it update the corresponding
`IPAMClaim` with the generated IPs. Users can only rely / use the `IPAMClaim`
status for informational purposes.
4. this step occurs in parallel to step 3; KubeVirt templates the KubeVirt
Member:

So step 3 cannot be finished until step 4 is done?
I mean, OVN-K needs the pod with its network selection elements to complete step 3. Right?

Contributor Author:

The numbering is wrong ... first KubeVirt templates the pod, then the CNI creates the sandbox, and part of that is IP allocation. Once we have an allocated IP, we persist it on the IPAMClaim (if a claim is being referenced, and if the network allows it).

I'll fix this. Good catch.

element the name of the IPAMClaim to use. If we use a webhook, we could take
this a step further, and mutate the templated pod to feature the required
IPAMClaim reference in the network selection element. Without looking into the
OVN-K configuration (i.e. the NAD.spec.config) we would have to have all
Member:

Seems the end of the sentence was cut.

Contributor Author:

Right. Finished the sentence properly.

Is it OK now ?

OVN-K configuration (i.e. the NAD.spec.config) we would have to have all


account since the CNI plugin operates exclusively at pod level, it cannot
Member:

Seems the beginning of the sentence was cut.

Contributor Author:

Done.

The first two options require KubeVirt to indicate via the network selection
element the name of the `IPAMClaim` when creating the pod encapsulating the VM
workload; the third option, requires KubeVirt to send all the required
information to OVN-Kubernetes, and it will create the persistent allocation,
Member:

Can you please elaborate. What is "all the required information"?

Contributor Author:

Hm, it's the owner reference. I'll add it.

Good catch.

Contributor Author:

Done.

devices:
...
interfaces:
- name: data-network
Member:

tenantblue

note over CNI: wait until IPAMClaims associated

loop for ipamClaim := range ipamClaims
IPAM CNI->>IPAM CNI: subnet = GetNAD(ipamClaim).Subnet
Member:

The current diagram gives the feeling that the IPAM CNI is traversing the IPAMClaims and assigning them IPs, independently of the NetworkSelectionElement.

Contributor Author:

Right, I've tried to correct it in the last push.

Does it make sense now ?

encapsulating object will run. The migration scenario is similar.

#### Removing a Virtual Machine
This flow is - from a CNI perspective - quite similar to the
Member:

I still cannot find foreground and background deletion explained in the doc.
IIRC, foreground was a big issue when thinking about the design (what happens if the IPAMClaim is removed before the VM). I believe it is important to describe the problem and how the current design solves it in the doc. I would expect to see an explanation how do we prevent the IPAMClaim from being removed before the VM, VMI and virt-launcher pod.


@maiqueb maiqueb force-pushed the sdn-ipam-secondary-nets branch 2 times, most recently from ca239ef to be606e5 Compare April 17, 2024 16:58
#### Removing a Virtual Machine
This flow is - from a CNI perspective - quite similar to the
[Stopping a VM flow](#stopping-a-virtual-machine). The main difference is after
the VM is deleted, Kubernetes Garbage Collection will kick in, and remove the
Member:

In a foreground deletion -

  1. The VM is marked to be removed.
  2. Garbage collector tries to remove the IPAMClaim, since the IPAMClaim has a finalizer a deletion timestamp is added.
  3. The VM is never removed since it has a dependent object that wasn't removed.
  4. The IPAMClaim is not removed since the VM is not removed.

@oshoval can you please explain how this works?

@oshoval Apr 18, 2024:

Exactly - the VM has a finalizer that is removed only when:

  1. the VM is marked for deletion
  2. the VMI is gone

When [1] + [2] happen, the VM finalizer is removed; this is exactly when we also remove the IPAM claim finalizer:
https://github.com/kubevirt/kubevirt/blob/e87c12294ae31810fdb75a6fb6f6b6d25f1f3107/pkg/virt-controller/watch/vm.go#L2974
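Putting that ordering into a sketch: the intermediate state of the claim during a foreground VM deletion could look roughly like this (the finalizer name, UID and timestamps are made up for illustration; the real finalizer is whatever the KubeVirt controller registers):

```yaml
# Illustrative only: IPAMClaim marked for deletion but held by a finalizer until
# the VM is marked for deletion and the VMI is gone; only then is the finalizer
# removed and the claim actually garbage-collected.
apiVersion: k8s.cni.cncf.io/v1alpha1
kind: IPAMClaim
metadata:
  name: vm1.data-network
  namespace: default
  deletionTimestamp: "2024-04-18T08:47:00Z"        # set once GC processes the owner
  finalizers:
    - kubevirt.io/ip-claim-protection               # illustrative name
  ownerReferences:
    - apiVersion: kubevirt.io/v1
      kind: VirtualMachine
      name: vm1
      blockOwnerDeletion: true
      uid: 00000000-0000-0000-0000-000000000000     # placeholder
spec:
  network: tenantblue                               # network name the CNI uses to sync its pool
```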

Member @AlonaKaplan Apr 18, 2024:

Oh, I see. The finalizer is removed when the deletion timestamp is added and the VMI is gone. OK, thanks!

@maiqueb maiqueb force-pushed the sdn-ipam-secondary-nets branch 4 times, most recently from 28b458b to f1a78d4 Compare April 18, 2024 08:47
Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
@AlonaKaplan
Member

Thanks!

/approve

@kubevirt-bot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AlonaKaplan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 18, 2024
@oshoval

oshoval commented Apr 18, 2024

Thanks !

/lgtm

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Apr 18, 2024
@kubevirt-bot kubevirt-bot merged commit e42362c into kubevirt:main Apr 18, 2024
2 checks passed
maiqueb added a commit to maiqueb/ovn-kubernetes that referenced this pull request Jun 12, 2024
Document the persistent IPs for virtualization workloads feature.
This feature is described in detail in the following KubeVirt
enhancement - [0].

[0] - kubevirt/community#279

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
dtzhou2 pushed a commit to dtzhou2/ovn-kubernetes that referenced this pull request Jul 8, 2024
Document the persistent IPs for virtualization workloads feature.
This feature is described in detail in the following KubeVirt
enhancement - [0].

[0] - kubevirt/community#279

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>