
ipam: integrate w/ k8s multi-network defacto standard persistent IPs #279

Merged
merged 1 commit on Apr 18, 2024

Conversation

@maiqueb (Contributor) commented Mar 18, 2024

What this PR does / why we need it:
This PR adds a design document outlining how to integrate the persistent IPs feature of the kubernetes multi-networking de-facto standard into KubeVirt.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Checklist

This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.

Release note:

NONE

@kubevirt-bot kubevirt-bot added the dco-signoff: yes Indicates the PR's author has DCO signed all their commits. label Mar 18, 2024
@maiqueb maiqueb force-pushed the sdn-ipam-secondary-nets branch 3 times, most recently from 50bb44b to 0fa05e6 on March 19, 2024 12:37
@maiqueb (Contributor Author) commented Mar 19, 2024

/assign @AlonaKaplan

@EdDev (Member) commented Mar 20, 2024

/cc

/sig network

@kubevirt-bot kubevirt-bot requested a review from EdDev March 20, 2024 05:32
@maiqueb maiqueb force-pushed the sdn-ipam-secondary-nets branch 2 times, most recently from 4f6a8de to c191af7 on March 20, 2024 14:37
@EdDev (Member) left a comment:

Thank you for posting this interesting proposal.

In addition to the review comments, I was thinking about trying to propose an alternative that decouples Kubevirt from the IPClaim in both directions.

I'm not sure if it is best to discuss it in a thread here, but it can be a good initial start.

It is in the spirit of IPClaim having its own controller, but not being dependent on Kubevirt directly.
I will try to write something early next week or update back if I fail.


## Motivation
Virtual Machine owners want to offload IPAM from their custom solutions (e.g.
custom DHCP server running on their cluster network) to SDN.
Member:

Could you please clarify if this is about the primary pod network, secondary networks or both?

I think it is also important to specify if we are talking about a single interface or more.
For a multi-interface machine (be it physical or virtual), IP management for all the interfaces is tricky. E.g. having several interfaces with dhcp-client may result in an undesired network configuration (multi default routes, mix of dns entries, etc).

How existing clusters resolve the configuration challenges will help (e.g. with a DHCP server, cloud-init config, scripts, etc).

Contributor Author:

Could you please clarify if this is about the primary pod network, secondary networks or both?

Only secondary networks, as explicitly indicated on the goals (and non-goals) section.

I think it is also important to specify if we are talking about a single interface or more. For a multi-interface machine (be it physical or virtual), IP management for all the interfaces is tricky. E.g. having several interfaces with dhcp-client may result in an undesired network configuration (multi default routes, mix of dns entries, etc).

Well, for multiple ones.

That depends on the configuration provided by the cluster admin in the NAD (routes / DNS). It has nothing to do with this feature AFAIU.

How existing clusters resolve the configuration challenges will help (e.g. with a DHCP server, cloud-init config, scripts, etc).

I do not understand what you're asking for. Are you asking for anything here ?

Member:

Only secondary networks, as explicitly indicated on the goals (and non-goals) section.

When reading the motivation, I did not understand the actual need is for secondary networks.
I guess you could explain why this is not needed for the pod network but is needed for the secondaries.

Well, for multiple ones.

That depends on the configuration provided by the cluster admin in the NAD (routes / DNS). It has nothing to do with this feature AFAIU.

The feature is triggered by a need. I think it is worth expressing if the need is for a single one or more.
It may influence the solution proposed.

How existing clusters resolve the configuration challenges will help (e.g. with a DHCP server, cloud-init config, scripts, etc).

I do not understand what you're asking for. Are you asking for anything here ?

Yes, I am asking to explain how the legacy VM or physical machines handled this need.
That way, I can understand if you take a similar solution to Kubevirt or you invent a new solution to an old problem.
Given a problem/challenge, how it was solved so far in other platforms (including bare-metal) can help provide context and support this proposal.

Contributor Author:

Only secondary networks, as explicitly indicated on the goals (and non-goals) section.

When reading the motivation, I did not understand the actual need is for secondary networks. I guess you could explain why this is not needed for the pod network but is needed for the secondaries.

The pod network is managed / owned by kubernetes. A direct consequence of that is users get IPAM on it already. It's part of what it does. I.e. they don't have to manage IP addresses / configure DHCP servers / etc. The platform does that for them.

I hope this is clear enough.

Well, for multiple ones.
That depends on the configuration provided by the cluster admin in the NAD (routes / DNS). It has nothing to do with this feature AFAIU.

The feature is triggered by a need. I think it is worth expressing if the need is for a single one or more. It may influence the solution proposed.

It isn't specified, but I assume asking for IPAM on secondary interfaces means just that ... IPAM on secondary interfaces. I.e. every secondary interface can have IPAM depending on the config set by the user.

How existing clusters resolve the configuration challenges will help (e.g. with a DHCP server, cloud-init config, scripts, etc).

I do not understand what you're asking for. Are you asking for anything here ?

Yes, I am asking to explain how the legacy VM or physical machines handled this need. That way, I can understand if you take a similar solution to Kubevirt or you invent a new solution to an old problem. Given a problem/challenge, how it was solved so far in other platforms (including bare-metal) can help provide context and support this proposal.

I still don't understand. Are you asking how other platforms implement IPAM, or are you asking how other platforms get around not having IPAM ? The response to the latter is stated right there in the motivation - they use DHCP servers, and static IP addressing.

Member:

It would be nice if you added the information you explained very well here to the proposal text as well.

I still don't understand. Are you asking how other platforms implement IPAM, or are you asking how other platforms get around not having IPAM ? The response to the latter is stated right there in the motivation - they use DHCP servers, and static IP addressing.

I ask how other platforms solved the IPAM problem for multiple interfaces, if at all.
I can be specific if you prefer: Has Openstack assigned IP addresses to more than one interface in a VM, and how (all interfaces used DHCP, one used DHCP and all others static, all static, etc)?

I am just having trouble understanding how a machine with multiple DHCP clients can work, unless heavy restrictions are placed on them.

design-proposals/sdn-ipam-secondary-nets.md (resolved)
Comment on lines +63 to +74
Thus, we propose to introduce a new CRD to the k8snetworkplumbingwg, and make it
part of the Kubernetes multi-network defacto standard in this update proposal.
The `IPAMClaim` CRD was added to the Kubernetes multi-networking de-facto
[standard](https://github.com/k8snetworkplumbingwg/multi-net-spec/blob/master/v1.3/%5Bv1.3%5D%20Kubernetes%20Network%20Custom%20Resource%20Definition%20De-facto%20Standard.pdf)
version 1.3. The sections describing the CRD and how to use them are in sections
4.1.2.1.11 (network selection element update) and 8 (IPAMClaim CRD).
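
For illustration only - a minimal Go sketch of how a network selection element carrying a claim reference could be serialized into the pod's networks annotation. The `ipam-claim-reference` attribute is the one referenced throughout this thread; the network and claim names are illustrative, and the struct is a trimmed-down stand-in rather than the actual API type:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NetworkSelectionElement is a trimmed-down stand-in for the element defined
// in the multi-network de-facto standard; only the fields relevant to this
// discussion are shown.
type NetworkSelectionElement struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace,omitempty"`
	// IPAMClaimReference points at the IPAMClaim holding the persistent
	// allocation (the attribute added in v1.3, section 4.1.2.1.11).
	IPAMClaimReference string `json:"ipam-claim-reference,omitempty"`
}

func main() {
	elements := []NetworkSelectionElement{{
		Name:               "tenant-blue", // NAD name - illustrative
		Namespace:          "default",
		IPAMClaimReference: "vm-a.tenant-blue", // <vm name>.<logical network name>
	}}
	annotation, err := json.Marshal(elements)
	if err != nil {
		panic(err)
	}
	// This is the kind of value the workload controller would set on the
	// pod's k8s.v1.cni.cncf.io/networks annotation.
	fmt.Println(string(annotation))
}
```
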
Member:

This paragraph seems not to be related to Kubevirt, but to a given generic feature.
Referencing it as a dependency for the functionality, with a pointer for further reading, should be enough.

Please do mention in what release stage it is at the moment.

Contributor Author:

What exactly do you propose I do differently ? I am listing the standard we're following, pointing to it, and indicating the relevant portions of the standard, since I cannot point to particular sections of the pdf file.

All of that is IMHO required / helpful to the reader.

Member:

This section seems to propose the IPClaim CRD and its solution.
It is fine to give an overview of what it does and reference the relevant information.

All I am saying here is that you should place this in its own section and describe it as an existing service/solution which Kubevirt will use.

Comment on lines +73 to +80
The CNI plugin must be adapted to compute its IP pool not only from the live
pods in the cluster, but also from these CRs.
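
A rough sketch in Go of the pool computation described above, using simplified stand-in types rather than the actual OVN-Kubernetes code: the allocator is seeded from both live pod allocations and the addresses persisted in `IPAMClaim` status, so a stopped VM's addresses stay reserved.

```go
package main

import "fmt"

// Simplified stand-ins for the objects the CNI plugin would inspect.
type Pod struct{ IPs []string }
type IPAMClaim struct{ StatusIPs []string }

// reservedIPs unions the addresses held by live pods with the addresses
// persisted in IPAMClaim status, so allocations of stopped VMs stay reserved.
func reservedIPs(pods []Pod, claims []IPAMClaim) map[string]struct{} {
	reserved := map[string]struct{}{}
	for _, p := range pods {
		for _, ip := range p.IPs {
			reserved[ip] = struct{}{}
		}
	}
	for _, c := range claims {
		for _, ip := range c.StatusIPs {
			reserved[ip] = struct{}{}
		}
	}
	return reserved
}

func main() {
	pods := []Pod{{IPs: []string{"10.10.0.5"}}}
	claims := []IPAMClaim{{StatusIPs: []string{"10.10.0.9"}}} // VM currently stopped
	fmt.Println(len(reservedIPs(pods, claims)))               // 2 addresses kept out of the free pool
}
```
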
Member:

I am not sure how this is related to kubevirt. I guess explicitly mentioning how it relates to kubevirt will help.

Contributor Author:

I assume only indicating how to integrate a feature into kubevirt is probably not enough for the reader to understand how the feature works.

Thus - now and then - I try to indicate in a couple of sentences what the systems kubevirt is interacting with need to do.

I.e. just creating an ipam claim (and not putting IPs there) is not good enough. I guess the reader needs the information of who assigns the IPs, who stores the IPs in the claims, and what the claim is used for.

Member:

All I ask is to explain it from the Kubevirt perspective.
I agree we need some details to understand how it works, but we surely do not need everything. Even if you think it is better to have it all here, I suggest placing it in a specific section so it will not overload the Kubevirt part (i.e. "if you want more details, go to ...").

Comment on lines +76 to +88
### Configuring the feature
We envision this feature to be configurable per network, meaning the network
admin should enable the feature by enabling the `allowPersistentIPs` flag in the
CNI configuration for the secondary network (i.e. in the
`NetworkAttachmentDefinition` spec.config attribute).

A feature gate may (or may not) be required in the KubeVirt.
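
A hedged example of what parsing such a NAD `spec.config` could look like. The `allowPersistentIPs` knob is the one named in the excerpt; the CNI type and remaining values are illustrative placeholders, not a prescribed configuration:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// netConf mirrors only the portion of the NetworkAttachmentDefinition
// spec.config relevant here; allowPersistentIPs is the knob named in the
// proposal, the remaining values are illustrative.
type netConf struct {
	CNIVersion         string `json:"cniVersion"`
	Name               string `json:"name"`
	Type               string `json:"type"`
	AllowPersistentIPs bool   `json:"allowPersistentIPs,omitempty"`
}

func main() {
	// Example spec.config payload a network admin might author.
	config := `{"cniVersion":"0.4.0","name":"tenant-blue","type":"ovn-k8s-cni-overlay","allowPersistentIPs":true}`

	var conf netConf
	if err := json.Unmarshal([]byte(config), &conf); err != nil {
		panic(err)
	}
	// Persistent IPs would only be requested for networks where the flag is set.
	fmt.Println("persistent IPs enabled:", conf.AllowPersistentIPs)
}
```
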
Member:

Same, please refer to the IPClaim as a given 3rd party feature which kubevirt uses, not define it here.

Regarding the FG, I think it is required, just because of the change magnitude and the usage of a 3rd party early release stage (Alpha?) of that feature.

Contributor Author:

Same, please refer to the IPClaim as a given 3rd party feature which kubevirt uses, not define it here.

I honestly don't follow what you want. Please clarify, and if possible, suggest a change.

Regarding the FG, I think it is required, just because of the change magnitude and the usage of a 3rd party early release stage (Alpha?) of that feature.

I disagree. I think the magnitude of the feature is pretty small, and the feature is disabled by default on the networks. But I'll bend - if the community wants a feature gate, I'll give them one.

@dankenigsberg / @fabiand IIRC you've argued extensively against adding more feature gates. You may want to chime in here.

Member:

I honestly don't follow what you want. Please clarify, and if possible, suggest a change.

Quoting: We envision this feature to be configurable per network.
The feature Kubevirt is to use is a given; per my understanding it is per network, and you do not define it here. This is what I meant.
When reading, it sounds as if that feature is suggested and not already provided.
Moreover, I think it is worth explaining in simple words who provides the IPClaim service, e.g. Multus, CNI and a network provider. Providing the specific ones will help clarify that this is currently specific to OVN Kubernetes (but not limited to it, IIUC).

I disagree. I think the magnitude of the feature is pretty small, and the feature is disabled by default on the networks.

The integration to support this IP claim is pretty intrusive into Kubevirt IMO.
It touches Kubevirt controllers, requiring the project to know about two specific CNI parameters and a new CRD in its Alpha release stage.
I was under the impression that the FG has been negotiated already and is going to be added.

Member:

dankenigsberg / fabiand IIRC you've argued extensively against adding more feature gates. You may want to chime in here.

It depends on the size of the functionality, the risk of the code and the doubts about the new API. I did not read the current proposal, so I don't have an opinion yet. I must confess that after reading #251 I look at FGs more positively than before, as they are different from a redundant configurable. A Feature Gate is only a temporary thing if we are not sure that the feature would graduate.

Member:

As Eddy mentioned, since the feature resides on an alpha stage CRD I think it should be behind a FG.

Contributor Author:

OK, I can do that.


#### Hot-plug a VM interface
This flow is exactly the same as
Member:

Is the expected integration point at the controller the same?
Perhaps the integration point should be explicitly defined here.

Contributor Author:

Please help me understand your expectation. I don't understand what you mean with integration point.

Member:

The location in the VMI controller you intend to process the logic.
Or any other place you need to place logic at to reconcile this.

E.g. if you need to integrate this at the VM controller and not at the VMI controller, it matters.

Contributor Author:

I'm pretty sure the plan is to integrate at the VMI controller, given we also want to support IP stickiness for VMI migration.

Member:

I am not in favor of supporting it at the VMI level, but it will be interesting to understand the exact need and usefulness. The migration is covered if initiated from a VM object.

We have several features, like hotplug, that are supported only at the VM level.
I'm not seeing the added value of supporting VMI independently, but if in practice we do have such users with common usage, I should reconsider my position on this.

@oshoval commented Apr 2, 2024:

About whether to put the logic in the VM or VMI controller -
a few reasons why IMHO we should use the VMI controller:

  1. Only the VMI controller knows when the pod is created.
    Creating the claims there, just before the pod is created, removes the need to recreate / re-validate on every VM reconcile iteration, and it is a much better design IMHO.

  2. The VMI controller is the one that already reads the NAD, so it is more natural for it to do so, and I don't think we should add NAD reading to the VM controller as well (we need this info for the claims, and we also need the interface name scheme, which is calculated in the VMI controller).

  3. For hot (un)plug we would also need to sync the claims; the annotation update is the point that already calculates the difference that needs to be applied (and it is in the VMI controller). Note that it is a bit tricky (if operations fail), but it would be even trickier in the VM controller.

Member:

If supporting sticky IP for VMI is not complicated, I don't see a reason not to support it.
The hotplug/unplug feature is invoked by editing the VM object. The VMI cannot be edited by the user, so it is pretty obvious to the user that the feature won't work for a VMI only.
However, in the sticky IP case, the user will have the NAD with a persistent IP, use the OVN CNI and won't understand why the IP is not sticky.
It is worth adding a section to the doc explaining the VMI-only flows.

Contributor Author:

Ultimately it's @AlonaKaplan's call - do we treat VMIs as an API that's meant to be used by users, or not.

Do you have any specific opinion here ? It would make our life easier if we knew exactly what the boundaries are / where they are.

I do understand hotplug only works for VMs.

... and here I hear that there's a precedent for not implementing features for VMIs.

Contributor Author:

@AlonaKaplan the VMI only flows are pretty much the same.

The notable exceptions are:

  • hotplug is not supported (no hotplug on VMI)
  • restart is not supported (restart essentially spins up a new VMI)

Everything else is pretty much the same - with the notable exception the IPAMClaim's reference owner is a VMI, instead of a VM.

@oshoval please add more detail if you think it is important

Reply:

You covered it all.
Supporting VMI standalone is even easier than not supporting it, because we don't need to add a check for whether there is a VM owner when updating the annotation / creating claims.
The only condition that exists is: if there is no VM owner, set the ownerRef to the VMI, as you said.
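
A tiny Go sketch of the owner-selection rule described in this thread, using stand-in types rather than the actual KubeVirt code: the claim is owned by the VM when one exists, otherwise by the standalone VMI.

```go
package main

import "fmt"

// ownerRef is a trimmed-down stand-in for metav1.OwnerReference.
type ownerRef struct {
	Kind string
	Name string
	UID  string
}

// pickClaimOwner reflects the rule above: the IPAMClaim is owned by the VM
// when the VMI has a VM owner, and by the VMI itself for standalone VMIs.
func pickClaimOwner(vmOwner *ownerRef, vmi ownerRef) ownerRef {
	if vmOwner != nil {
		return *vmOwner
	}
	return vmi
}

func main() {
	vmi := ownerRef{Kind: "VirtualMachineInstance", Name: "vm-a", UID: "2222"}
	vm := ownerRef{Kind: "VirtualMachine", Name: "vm-a", UID: "1111"}

	fmt.Println(pickClaimOwner(&vm, vmi)) // regular VM flow: VM-owned claim
	fmt.Println(pickClaimOwner(nil, vmi)) // standalone VMI flow: VMI-owned claim
}
```
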

design-proposals/sdn-ipam-secondary-nets.md (resolved)
- the spec.network attribute

The `IPAMClaim` name will be created using the following pattern:
`<vm name>.<logical network name>`. The logical network name is the [name of the
Member:

  • I think I asked about how this is done previously, please try and add a ref to here, it may help the next reader.
  • What happens if there are two VMs with the same name in different namespaces? Is the namespace part of the IPClaim somehow?
  • Having a ref to a name is problematic and raceful. Most objects have a UID to resolve the race.
    Perhaps an attempt to explain this using an example may help: We unplug and then plug quickly an interface, in the meantime the IPClaim is not yet removed completely and on creation it is still there.
    The same can happen I guess when deleting a VM and quickly creating another one with the same name.

Contributor Author:

  • I think I asked about how this is done previously, please try and add a ref to here, it may help the next reader.
  • What happens if there are two VMs with the same name in different namespaces? Is the namespace part of the IPClaim somehow?

The ipam claims are created in the namespace of the VM.

  • Having a ref to a name is problematic and raceful. Most objects have a UID to resolve the race.

The owner references of the objects carry the UID. We ensure we're handling the proper object by comparing owner references.

Perhaps an attempt to explain this using an example may help: We unplug and then plug quickly an interface, in the meantime the IPClaim is not yet removed completely and on creation it is still there.
The same can happen I guess when deleting a VM and quickly creating another one with the same name.

We check the owner references of the VMI / IPAMClaims. They must match. If they don't, we return an error, which will cause the reconcile loop to be retried later. Eventual consistency will ensure we eventually get this right (eventually the k8s GC will remove the stale resource; then eventually the "new" version manages to be created). Nothing new here IMO.
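
A minimal Go sketch of the naming pattern and the owner-reference check described above (simplified stand-in types; the real code would compare `metav1.OwnerReference`s from the live objects):

```go
package main

import (
	"errors"
	"fmt"
)

// ownerRef is a trimmed-down stand-in for metav1.OwnerReference.
type ownerRef struct {
	Kind string
	Name string
	UID  string
}

// claimName follows the `<vm name>.<logical network name>` pattern from the proposal.
func claimName(vmName, logicalNetwork string) string {
	return fmt.Sprintf("%s.%s", vmName, logicalNetwork)
}

// validateClaimOwner mimics the reconcile-time check described above: if the
// IPAMClaim with the expected name is owned by a different (stale) UID, return
// an error so the loop requeues and retries after the GC removes the old claim.
func validateClaimOwner(claimOwner, expectedOwner ownerRef) error {
	if claimOwner.UID != expectedOwner.UID {
		return errors.New("IPAMClaim owned by a stale object; requeue and retry")
	}
	return nil
}

func main() {
	fmt.Println(claimName("vm-a", "tenant-blue")) // vm-a.tenant-blue

	stale := ownerRef{Kind: "VirtualMachine", Name: "vm-a", UID: "1111"}
	current := ownerRef{Kind: "VirtualMachine", Name: "vm-a", UID: "2222"}
	fmt.Println(validateClaimOwner(stale, current)) // error -> reconcile retried later
}
```
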

Member:

Please add this to the text.

Regarding the potential race, I understand you use the owner ref to match the UID.
What I am not sure about is how you manage to do the same for the hot{un}plug scenario.

Contributor Author:

I might be missing something, but I don't see the difference.

Hotplug / unplug will still happen for a VM; the IPAMClaim will still have an owner reference.

If they don't match, we throw an error, and will try to reconcile later. If it never manages to plug / unplug the interface .. well, we'll keep trying with exponential backoff.

Member:

From the network interface you ref a claim, that ref is by name and not UID.
If one plugs, unplugs, plugs, unplugs, etc. the same interface, I do not understand how you can differentiate between claims of the same interface.

The difference from the VM UID is that here the interfaces are changing but the UID pointing to the VM is the same.

Contributor Author:

I think we may be able to simply indicate the dimension is interface per VM.

As long as we adjust the user's expectation, we should be OK.

I think it can go either way, but I feel failing until the "old" is removed is more aligned with the existing flow.

Member:

Technically, I'm not sure the VMI controller has a way to know the interface was just hotplugged and therefore that an old IPClaim shouldn't exist.

Contributor Author:

@oshoval can you chime in ?

Reply:

We have this logic in the VMI controller,
so if the annotation is different (the previous one didn't exist and now it does) at updateMultusAnnotation, it means an interface was hotplugged, doesn't it?

@AlonaKaplan (Member) commented Apr 17, 2024:

It may be raceful if the IPAMClaim was created but for some reason updating the pod annotation failed.
Maybe checking if the claim has a deletion timestamp should be enough.
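
A rough Go sketch of the two checks discussed in this thread - detecting a hotplug by diffing the desired networks against the current ones, and treating a claim that already carries a deletion timestamp as stale. Stand-in types only, not the actual controller code:

```go
package main

import (
	"fmt"
	"time"
)

// hotpluggedNetworks returns the network names present in the desired networks
// annotation but missing from the current one, i.e. interfaces being hotplugged.
func hotpluggedNetworks(current, desired []string) []string {
	existing := map[string]bool{}
	for _, name := range current {
		existing[name] = true
	}
	var added []string
	for _, name := range desired {
		if !existing[name] {
			added = append(added, name)
		}
	}
	return added
}

// claimIsStale mirrors the suggestion above: a claim that already carries a
// deletion timestamp belongs to a previous plug of the interface and should
// not be reused; the reconcile loop waits for it to disappear.
func claimIsStale(deletionTimestamp *time.Time) bool {
	return deletionTimestamp != nil
}

func main() {
	fmt.Println(hotpluggedNetworks([]string{"tenant-blue"}, []string{"tenant-blue", "tenant-red"})) // [tenant-red]

	now := time.Now()
	fmt.Println(claimIsStale(&now)) // true -> requeue until the old claim is gone
}
```
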

Comment on lines 399 to 433
The feature is opt-in via the network-attachment-definition, (thus disabled by
default) and it does not impose any performance costs when disabled; as a
result, adding a feature gate is not recommended.
Member:

This seems to confuse the feature lifecycle with the operational control of a feature.
A FG is required for other reasons, but if the configuration parameter is needed, it needs to be specified here with its default setting.

Contributor Author:

Isn't "The feature is opt-in via the network-attachment-definition, (thus disabled bydefault)" basically that ?

Furthermore, I've already thoroughly indicated how it works in the Configuring the feature section. Doing it again here feels like repeating myself.

Member:

This is a FG section and you describe an operational configuration parameter.
I tried to explain the difference so you could decide how to adjust it.

Member:

As I mentioned in a previous comment, I agree with Eddy and think a FG is required.
Besides what I mentioned in the previous comment regarding residing on a v1alpha1 API, I prefer not to parse the NAD's config if it is not required.

Contributor Author:

Whoever creates the IPAMClaims must indicate which network the claim corresponds to. Without that, the CNI's IP pool cannot be reconciled.

Member:

I agree, I was talking about the FG. If it is disabled there is no need to parse the NAD config.

Contributor Author:

Ah !!! My bad.

Right.

default) and it does not impose any performance costs when disabled; as a
result, adding a feature gate is not recommended.

#### KubeVirt API changes
Member:

Please add another section about API dependencies.
I.e. the dependency on the IPClaim CRD and the CNI spec (in the NAD config).

Please also mention the FG and operational configuration with their details here if needed.

Contributor Author:

Is that really necessary ? I've explained that already in the Design - you've even put comments there. Here's what I said above:

Thus, we propose to introduce a new CRD to the k8snetworkplumbingwg, and make it
part of the Kubernetes multi-network defacto standard in this update proposal.
The IPAMClaim CRD was added to the Kubernetes multi-networking de-facto
standard version 1.3. The sections describing the CRD and how to use them are in sections
4.1.2.1.11 (network selection element update) and 8 (IPAMClaim CRD).

Member:

Seems like a good section to move some of the details to.

If the kubevirt client changes, this is also a good place to mention it.

@maiqueb (Contributor Author) commented Mar 22, 2024

Thank you for posting this interesting proposal.

In addition to the review comments, I was thinking about trying to propose an alternative that decouples Kubevirt from the IPClaim in both directions.

I'm not sure if it is best to discuss it in a thread here, but it can be a good initial start.

It is in the spirit of IPClaim having its own controller, but not being dependent on Kubevirt directly. I will try to write something early next week or update back if I fail.

I honestly don't see how that is possible without traversing the pod's owner reference tree upward until its root using unstructured ... which is ... very ugly. That was the only "generic" way I can think of tying the IPAMClaim to an arbitrary resource type that would own it.

I prefer to go for something simple, straight-forward, and easy to reason about for the one use case we have in front of us, instead of going for the gold and trying to envision workload types we have no requirements for (and don't even know what they might be).

But be my guest - try to come up with something.

@maiqueb maiqueb requested a review from EdDev March 22, 2024 12:25
Member:

This is an attempt to describe an alternative on how the integration with the IPClaim service can be done without an explicit dependency on the IPClaim or NetworkAttachmentDefinition from the client side.

The basic starting points are these:

  • The client application that wants to use sticky IP does not need access to any new object.
  • The client application expresses its desired state through the pod object.
  • The sticky IP service does not need to know or depend on any specific client.

IPClaim controller

The controller monitors (through an informer) all pods and NetworkAttachmentDefinition/s in the cluster.

Note: Instead of watching all pods/nads, the list may be filtered with a label that marks that the pod/nad is using this IP claim service.

The controller looks at the following pod information:

  • The (multus) network annotation in which the networks are specified with the ipam-claim-reference.
  • A dedicated (new) ipclaim-owner annotation that contains as a value a serialized list of owners to the claim [1].

[1] https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.29/#ownerreference-v1-meta

The IPClaim controller maintains the lifecycle of IPClaim objects, from creation to deletion:

  • When a pod is detected with one or more ipam-claim-reference, it will look up the referenced NAD and create an IPClaim accordingly (if the NAD specifies that it supports it).
    The IPClaim is created with the owner reference as appearing in the pod annotation (ipclaim-owner).
  • In case the pod network annotation changes, the relevant IPClaim will be added or removed (hotplug support) based on the NAD information.
  • In case the pod is deleted, no action is taken.
  • To assure the IPClaim will not be deleted before the owner object is removed, a finalizer can be added and the IPClaim controller will check the following before removing it:
    • Identify if any active pod has the mentioned owner ref in the annotation. If no such pod exists, the finalizer can be removed.

Note: It is assumed that the finalizer removal process is triggered when the GC marks the IPClaim object for deletion, i.e. the owner is indeed gone and the GC acted. At this stage, if there is no pod left, it means we can delete the claim.
If someone else deletes the IPClaim, we have a problem.
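
A minimal Go sketch of how the `ipclaim-owner` annotation described in this alternative might be populated. Note the annotation key and its shape are part of this proposal sketch, not an existing API; the owner values are placeholders:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ownerRef is a trimmed-down stand-in for metav1.OwnerReference.
type ownerRef struct {
	APIVersion string `json:"apiVersion"`
	Kind       string `json:"kind"`
	Name       string `json:"name"`
	UID        string `json:"uid"`
}

func main() {
	// Hypothetical annotation key from this alternative; the IPClaim
	// controller would copy these owner references onto the IPClaims it creates.
	const ipclaimOwnerAnnotation = "ipclaim-owner"

	owners := []ownerRef{{
		APIVersion: "kubevirt.io/v1",
		Kind:       "VirtualMachine",
		Name:       "vm-a",
		UID:        "placeholder-uid", // a real UID in practice
	}}
	value, err := json.Marshal(owners)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s: %s\n", ipclaimOwnerAnnotation, value)
}
```
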

Kubevirt

The client side, e.g. Kubevirt, will express the desired state as follows:

  • The VM controller, when the IP claim service is "enabled", will add the ipclaim-owner annotation to the VMI object with the value of itself. The VMI controller will just copy it as-is to the pod later on.
  • The VMI controller will add to the pod network annotation the ipam-claim-reference if the feature is "enabled", regardless if the NAD is actually going to support it or not. It says something like this: "If IPAM and sticky IP is set in the NAD, I want it sticky and here is the IPClaim obj name to use".
  • For hotplug/hotunplug, the pod network annotation will change the same as today with no special handling needed beyond the addition of the new CNI field.

Note: If there is a need to control the sticky IP option per VM interface/network, it should be done explicitly using a knob or policy.
By always adding the new ipam-claim-reference, we risk having problems with other CNI plugins that do not support it. If this is indeed the case, we will probably need to solve it with a webhook, the same way the SR-IOV operator/cni solves such things.

Summary

The suggested solution uses an IPClaim controller to monitor pods and have access to NADs. The pod annotation is the interface between the clients (e.g. Kubevirt) and the IPClaim controller, allowing each to be decoupled from the other.
An IPClaim mutation admitter can be used to mark the pod network annotation, on a per-interface, per-NAD basis, to set or unset the ipam-claim.

Contributor Author:

This is an attempt to describe an alternative on how the integration with the IPClaim service can be done without an explicit dependency on the IPClaim or NetworkAttachmentDefinition from the client side.

The basic starting points are these:

  • The client application that wants to use sticky IP does not need access to any new object.
  • The client application expresses its desired state through the pod object.
  • The sticky IP service does not need to know or depend on any specific client.

IPClaim controller

The controller monitors (through an informer) all pods and NetworkAttachmentDefinition/s in the cluster.

Note: Instead of watching all pods/nads, the list may be filtered with a label that marks that the pod/nad is using this IP claim service.

The controller looks at the following pod information:

  • The (multus) network annotation in which the networks are specified with the ipam-claim-reference.
  • A dedicated (new) ipclaim-owner annotation that contains as a value a serialized list of owners to the claim [1].

Who would be the owner of the IPAMClaim ? I assume it would be (exclusively) the VM.

[1] https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.29/#ownerreference-v1-meta

The IPClaim controller maintains the lifecycle of IPClaim objects, from creation to deletion:

  • When a pod is detected with one or more ipam-claim-reference, it will look up the referenced NAD and create an IPClaim accordingly (if the NAD specifies that it supports it).
    The IPClaim is created with the owner reference as appearing in the pod annotation (ipclaim-owner).

  • In case the pod network annotation changes, the relevant IPClaim will be added or removed (hotplug support) based on the NAD information.

  • In case the pod is deleted, no action is taken.

  • To assure the IPClaim will not be deleted before the owner object is removed, a finalizer can be added and the IPClaim controller will check the following before removing it:

    • Identify if any active pod has the mentioned owner ref in the annotation. If no such pod exists, the finalizer can be removed.

IIUC this means that on IPAMClaim removal you'd have to fish for any pod w/ this particular annotation - from all pods in the system, or, if we have a special label on the pods subject to this feature, from within that subset.

If none is found, you remove the finalizer.

Did I get this right ?

Note: It is assumed that the finalizer removal process is triggered when the GC marks the IPClaim object for deletion, i.e. the owner is indeed gone and the GC acted. At this stage, if there is no pod left, it means we can delete the claim.
If someone else deletes the IPClaim, we have a problem.

I would argue this is OK - for someone to be able to manually remove the claim, it's because they've manually removed the finalizer, thus, they're trying to shoot themselves in the foot.

Kubevirt

The client side, e.g. Kubevirt, will express the desired state as follows:

  • The VM controller, when the IP claim service is "enabled", will add the ipclaim-owner annotation to the VMI object with the value of itself. The VMI controller will just copy it as-is to the pod later on.

How does the VM controller know if the IP claim service (I assume that is the controller ... - but I'm not sure) is running ?

  • The VMI controller will add to the pod network annotation the ipam-claim-reference if the feature is "enabled", regardless if the NAD is actually going to support it or not. It says something like this: "If IPAM and sticky IP is set in the NAD, I want it sticky and here is the IPClaim obj name to use".
  • For hotplug/hotunplug, the pod network annotation will change the same as today with no special handling needed beyond the addition of the new CNI field.

Note: If there is a need to control the sticky IP option per VM interface/network, it should be done explicitly using a knob or policy.
By always adding the new ipam-claim-reference, we risk having problems with other CNI plugins that do not support it. If this is indeed the case, we will probably need to solve it with a webhook, the same way the SR-IOV operator/cni solves such things.

Quoting from CNI spec:

Plugins may define additional fields that they accept and may generate an error if called with unknown fields.

So we cannot simply have KubeVirt always request this feature (since it may break arbitrary CNIs), which would require us to use webhook based solutions. We know those are not a good fit since they will require looking past the object being "looked upon". I think a solution based on eventual consistency is required.

Going past that ... what workloads would you make "subject" of the webhook ? Would it be all pods w/ a certain label ? If so, which pods will have the label ? How will KubeVirt know which pods to label ?

I am assuming that without looking in the NAD config, you'd set it on all launcher pods, which will make all KubeVirt VMs subject to the webhook. Which in turn means that if the webhook goes down (for whatever reason, but let's say certificate rotation, because that happens) we'd pretty much make it impossible for new VMs to be created in the entire system.

Are these assumptions correct ? The real question here I guess is which pods will be subject to the webhook.

Summary

The suggested solution uses an IPClaim controller to monitor pods and have access to NADs. The pod annotation is the interface between the clients (e.g. Kubevirt) and the IPClaim controller, allowing each to be decoupled from the other. An IPClaim mutation admitter can be used to mark the pod network annotation, on a per-interface, per-NAD basis, to set or unset the ipam-claim.

Member:

Who would be the owner of the IPAMClaim ? I assume it would be (exclusively) the VM.

This sentence is taken from the IPClaim controller context, so at its base, that controller does not care who the owner is.
From the Kubevirt point of view, I would expect it to be the VM indeed, no different from the original proposal.

  • To assure the IPClaim will not be deleted before the owner object is removed, a finalizer can be added and the IPClaim controller will check the following before removing it:
    • Identify if any active pod has the mentioned owner ref in the annotation. If no such pod exists, the finalizer can be removed.

IIUC this would be on IPAMClaim removal you'd have to fish for any pod w/ this particular annotation. From all pods in the system. Or, if we have a special label on the pods subject to this feature, from within that subset.

If none is found, you remove the finalizer.

Did I get this right ?

Yes.
Usually an informer is present which caches changes on the cluster.
The solution can require the presence of a label on the pod for it to be active in this "game", part of the "protocol" when working with IPClaims. It is then up to the pod creator or webhook to declare it part of the IP claim playground.

The VM controller, when the IP claim service is "enabled", will add the ipclaim-owner annotation to the VMI object with the value of itself. The VMI controller will just copy it as-is to the pod later on.

How does the VM controller know if the IP claim service (I assume that is the controller ... - but I'm not sure) is running ?

I meant the desire to have the sticky IP service. Nothing assures it can be served.
If and when the controller runs, it will be served.

So we cannot simply have KubeVirt always request this feature (since it may break arbitrary CNIs), which would require us to use webhook based solutions. We know those are not a good fit since they will require looking past the object being "looked upon". I think a solution based on eventual consistency is required.

In this case, a webhook is the only solution I can think of that works without requiring other clients to know and have access to a NAD.
We already have a version of it working with SR-IOV [1]. It basically parses the pod annotation, fetches the NADs and patches the pod resources accordingly.

[1] https://github.com/openshift/sriov-dp-admission-controller/blob/master/pkg/webhook/webhook.go#L609

Going past that ... what workloads would you make "subject" of the webhook ? Would it be all pods w/ a certain label ? If so, which pods will have the label ? How will KubeVirt know which pods to label ?

Only pods that have a network annotation and marked by the creator to be part of the IP claim "game".
This is something the IPClaim solution can define so users will mark accordingly.

I am assuming that without looking in the NAD config, you'd set it on all launcher pods, which will make all KubeVirt VMs subject to the webhook. Which in turn means that if the webhook goes down (for whatever reason, but let's say certificate rotation, because that happens) we'd pretty much make it impossible for new VMs to be created in the entire system.

Per what I know, webhooks can be registered so their absence will not block anything.
As the controller and webhook can reside on the same executable, we can argue that even though something was requested, it may not have been served. So it is all about what you really desire to happen and how (or if) it can be fixed as part of a reconciliation.

If I try to compare to the current proposal, I guess it is equal to failing to create an IPClaim CR or to start the pod if the CNI fails to detect the IPClaim (or something else). If the webhook fails due to downtime, it means it needs to retry in the VMI reconcile loop.

Are these assumptions correct ? The real question here I guess is which pods will be subject to the webhook.

I think all your points are relevant here. The client needs to express the desire to use the IP claim, therefore it should follow the "rules of engagement" with this service and mark the pod accordingly so it will work.

Member:

We already have a version of it working with SR-IOV [1]. It basically parses the pod annotation, fetches the NADs and patches the pod resources accordingly.

[1] https://github.com/openshift/sriov-dp-admission-controller/blob/master/pkg/webhook/webhook.go#L609

This is the U/S project; I should have referenced it instead of the one above.

https://github.com/k8snetworkplumbingwg/network-resources-injector

Contributor Author:

Trimming this down so we can focus on the most important points. I hope I'm not leaving anything behind.

So we cannot simply have KubeVirt always request this feature (since it may break arbitrary CNIs), which would require us to use webhook based solutions. We know those are not a good fit since they will require looking past the object being "looked upon". I think a solution based on eventual consistency is required.

In this case, a webhook is the only solution I can think of that works without requiring other clients to know and have access to a NAD.
We already have a version of it working with SR-IOV [1]. It basically parses the pod annotation, fetches the NADs and patches the pod resources accordingly.

[1] https://github.com/openshift/sriov-dp-admission-controller/blob/master/pkg/webhook/webhook.go#L609

The fact they looked past the object they are mutating is not something that encourages me to do the same. That's inherently raceful.

Furthermore, we already maintain a couple of webhooks and had sub-par experiences with them. Adding a webhook should not be done lightly, and I would avoid it if possible. I am especially reticent about maintaining a webhook based solution.

Going past that ... what workloads would you make "subject" of the webhook ? Would it be all pods w/ a certain label ? If so, which pods will have the label ? How will KubeVirt know which pods to label ?

Only pods that have a network annotation and marked by the creator to be part of the IP claim "game".
This is something the IPClaim solution can define so users will mark accordingly.

Can you elaborate on this ? What does "marked by the creator" mean ? What's the mark ? A label ? How will the client (let's assume kubevirt) know which ones to "mark" without looking into the configuration ?

Per what I know, webhooks can be registered so their absence will not block anything.
As the controller and webhook can reside on the same executable, we can argue that even though something was requested, it may not have been served. So it is all about what you really desire to happen and how (or if) it can be fixed as part of a reconciliation.

You can indeed indicate the behavior if the webhook fails: you can ignore failure (which might cause the workload to be accepted without the IPAM claim being created), or you can fail, which will reject the workload. Then, you can have a controller on top to ensure the IPAMClaims exist as part of its sync loop.

... which begs the question of why have the webhook in the first place if I'm going to have to reconcile. I would rather push for only setting the ipam-claim-reference network selection element attribute based on the feature gate, and have it mean the user understands that the CNI they're using can accept (or at least tolerate ...) the existence of this attribute. I think this would be a fair compromise.

Unfortunately, IMHO this proposal addresses the wrong problem. It could work - at a glance. But it focuses on presenting a generic solution for things that run on pods, while all the requirements we have in front of us are related to VMs. Plus, there's an agreed enhancement proposal (including agreement from the sig-network maintainer) for this existing proposal. When we agreed to write this proposal we did it because we were discussing how to integrate with KubeVirt. This seems like a different solution, with a different scope (more ambitious).

IMO narrowing the scope down to VMs/VMIs is cleaner, and cheaper to develop and maintain (across the ecosystem) than what you're proposing - at the expense of forcing kubevirt to consume a CRD and read a configuration object ( ... which it already does). Your proposal does have the advantage of looking like it could work for more types of workloads than VMs. But again: no one asked for that.

Member:

The fact they looked past the object they are mutating is not something that encourages me to do the same. That's inherently raceful.

The system is asynchronous by definition; it is not more raceful than this proposal.
When the VMI controller reads the NAD, it also cannot assure that the NAD continues to exist a msec later.

The system will eventually reconcile as the pod creation will fail and the VMI controller will retry to create the pod and hopefully the NAD will be there.
In both solutions it will basically recover the same way.

Furthermore, we already maintain a couple of webhooks and had sub-par experiences with them. Adding a webhook should not be done lightly, and I would avoid it if possible. I am especially reticent about maintaining a webhook based solution.

The difficulty of developing and maintaining the service is a valid concern; however, its ability to integrate with potential clients is also a valid point to consider.
I find it hard to accept the level of administration rights you expect from a client, i.e. accessing NAD/s requires a level of trust I do not think you should depend on.

Only pods that have a network annotation and marked by the creator to be part of the IP claim "game".
This is something the IPClaim solution can define so users will mark accordingly.

Can you elaborate on this ? What does "marked by the creator" mean ? What's the mark ? A label ? How will the client (let's assume kubevirt) know which ones to "mark" without looking into the configuration ?

"Marked by creator" means that the entity that created the pod, needs to mark it somehow (e.g. label) to be considered for IPClaim. This is a protocol you can define as part of the "service".

Kubevirt will probably mark them all, as it wants its pods to be processed if an IP claim is requested.

You can indeed indicate the behavior if the webhook fails: you can ignore failure (which might cause the workload to be accepted without the IPAM claim being created), or you can fail, which will reject the workload. Then, you can have a controller on top to ensure the IPAMClaims exist as part of its sync loop.

... which begs the question of why have the webhook in the first place if I'm going to have to reconcile. I would rather push for only setting the ipam-claim-reference network selection element attribute based on the feature gate, and have it mean the user understands that the CNI they're using can accept (or at least tolerate ...) the existence of this attribute. I think this would be a fair compromise.

When I read this I think you confuse a FG with an operational configuration.
But I really lost track of what we discuss here.

Unfortunately, IMHO this proposal addresses the wrong problem. It could work - at a glance. But it focuses on presenting a generic solution for things that run on pods, while all the requirements we have in front of us are related to VMs. Plus, there's an agreed enhancement proposal (including agreement from the sig-network maintainer) for this existing proposal. When we agreed to write this proposal we did it because we were discussing how to integrate with KubeVirt. This seems like a different solution, with a different scope (more ambitious).

From my perspective, the original approved proposal has been focusing on the IP reservation from the network provider side, the "service". I think the solution indeed answers the needs of potential clients.
However, I do not think the integration with Kubevirt serves the needs of both projects well.

On one side, Kubevirt is expected to heavily integrate the IPClaim in its controllers and depend on admin-level resources (NetworkAttachmentDefinition).
From the IPClaim side, it makes it hard to use, as you expect the other clients to have all these permissions, which they may not.

As a reviewer of this integration proposal into Kubevirt, I am trying to raise my concerns and warn about the implications and cons. I think Kubevirt should not increase its dependency surface, especially not against something like a NAD. I was pushing towards extracting specialized logic out of core Kubevirt and integrating it from outside (e.g. network binding plugin). I find this moving in the opposite direction and therefore tried to suggest an alternative.
I have no veto on the final decision, the decision was and still is of the maintainer.

IMO narrowing the scope down to VMs/VMIs is cleaner, and cheaper to develop and maintain (across the ecosystem) than what you're proposing - at the expense of forcing kubevirt to consume a CRD and read a configuration object ( ... which it already does). Your proposal does have the advantage of looking like it could work for more types of workloads than VMs. But again: no one asked for that .

All fair points.
I personally find a self-sustained service with its own dedicated owners to be a more maintainable, safe and stable solution.

@maiqueb (Contributor Author) left a comment:

@EdDev please let's continue the discussion.

Comment on lines +126 to +151
`IPAMClaim` status. Finally, the CNI will configure the interface with these
IP addresses.
Contributor Author:

Routing and DNS configuration are not in scope of this feature, and are configured by the network admin.

This feature's scope is IMO quite clear - and listed in the objectives. Ensure an IP address survives a VM restart & migration.

Routing & DNS are configured by the network / cluster admin either on the NAD, or cloud-init.

I can add a short paragraph indicating VM owners can use cloud-init to configure the guest if it does not have a DHCP client.



#### Hot-plug a VM interface
This flow is exactly the same as
Contributor Author:

This is AFAIU a requirement coming from stakeholders.
@phoracek

It's a good point though: if hotplug is only supported at the VM level - this is news to me; when I implemented it, a VMI could be hotplugged / unplugged from - it may be one less thing to drop from the requirements.

I will not die on this hill fwiw.

When creating the pod, the OVN-Kubernetes IPAM module finds existing
`IPAMClaim`s for the workload. It will thus use those already reserved
allocations, instead of generating brand new allocations for the pod where the
encapsulating object will run. The migration scenario is similar.
Contributor Author:

Some packets will surely be dropped, but we have an epic and a way forward to mitigate it, by changing the OVS port-to-node association dynamically based on the state of the migration.

HyperShift already does this tracking, but for something different; IIRC, it configures the ARP proxy only once the destination pod has taken over.
@qinqon can you chime in?

#### Starting a (previously stopped) Virtual Machine
This flow is - from a CNI perspective - quite similar to the
[Creating a VM flow](#creating-a-virtual-machine):
1. the workload controller (KubeVirt) templates the pod, featuring the required
Contributor Author:

We'll fail, and the VM will not start until the old IPAMClaims are deleted.


## Motivation
Virtual Machine owners want to offload IPAM from their custom solutions (e.g.
custom DHCP server running on their cluster network) to SDN.
Contributor Author:

Only secondary networks, as explicitly indicated on the goals (and non-goals) section.

When reading the motivation, I did not understand the actual need is for secondary networks. I guess you could explain why this is not needed for the pod network but is needed for the secondaries.

The pod network is managed / owned by kubernetes. A direct consequence of that is users get IPAM on it already. It's part of what it does. I.e. they don't have to manage IP addresses / configure DHCP servers / etc. The platform does that for them.

I hope this is clear enough.

Well, for multiple ones.
That depends on the configuration provided by the cluster admin in the NAD (routes / DNS). It has nothing to do with this feature AFAIU.

The feature is triggered by a need. I think it is worth expressing if the need is for a single one or more. It may influence the solution proposed.

It isn't specified, but I assume asking for IPAM on secondary interfaces means just that ... IPAM on secondary interfaces. I.e. every secondary interface can have IPAM depending on the config set by the user.

How existing clusters resolve the configuration challenges will help (e.g. with a DHCP server, cloud-init config, scripts, etc).

I do not understand what you're asking for. Are you asking for anything here ?

Yes, I am asking you to explain how legacy VMs or physical machines handled this need. That way, I can understand whether you are bringing a similar solution into KubeVirt or inventing a new solution to an old problem. Given a problem/challenge, knowing how it was solved so far on other platforms (including bare-metal) can help provide context and support this proposal.

I still don't understand. Are you asking how other platforms implement IPAM, or are you asking how other platforms get around not having IPAM ? The response to the latter is stated right there in the motivation - they use DHCP servers, and static IP addressing.

Contributor Author:

Trimming this down so we can focus on the most important points. I hope I'm not leaving anything behind.

So we cannot simply have KubeVirt always request this feature (since it may break arbitrary CNIs), which would require us to use webhook based solutions. We know those are not a good fit since they will require looking past the object being "looked upon". I think a solution based on eventual consistency is required.

In this case, a webhook is the only solution I can think of that works without requiring other clients to know and have access to a NAD.
We already have a version of it working with SR-IOV [1]. It basically parses the pod annotation, fetches the NADs and patches the pod resources accordingly.

[1] https://github.com/openshift/sriov-dp-admission-controller/blob/master/pkg/webhook/webhook.go#L609

The fact that they looked past the object they are mutating is not something that encourages me to do the same. That's inherently racy.

Furthermore, we already maintain a couple of webhooks and had sub-par experiences with them. Adding a webhook should not be done lightly, and I would avoid it if possible. I am especially reticent about maintaining a webhook based solution.

Going past that ... what workloads would you make "subject" of the webhook? Would it be all pods with a certain label? If so, which pods will have the label? How will KubeVirt know which pods to label?

Only pods that have a network annotation and are marked by the creator to be part of the IP claim "game".
This is something the IPClaim solution can define, so users will mark accordingly.

Can you elaborate on this? What does "marked by the creator" mean? What's the mark? A label? How will the client (let's assume KubeVirt) know which ones to "mark" without looking into the configuration?

From what I know, webhooks can be registered so that their absence will not block anything.
As the controller and webhook can reside in the same executable, we can argue that even though something was requested, it may not have been served. So it is all about what you really desire to happen and how (or if) it can be fixed as part of a reconciliation.

You can indeed indicate the behavior if the webhook fails: you can ignore failure (which might cause the workload to be accepted without the IPAM claim being created), or you can fail, which will reject the workload. Then, you can have a controller on top to ensure the IPAMClaims exist as part of their sync loop.

... which begs the question of why have the webhook in the first place if I'm going to have to reconcile. I would rather push for only setting the ipam-claim-reference network selection element attribute based on the feature gate, and have it mean the user understands the CNI they're using can accept (or at least tolerate ...) the existence of this attribute. I think this would be a fair compromise.
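To make that compromise concrete, here is a sketch (all names and values are illustrative, not a final API) of the network selection element KubeVirt would set on the virt-launcher pod when the feature gate is enabled:

```yaml
# Sketch only: the multus network selection element on the virt-launcher pod,
# carrying the ipam-claim-reference attribute. All names are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: virt-launcher-vm1-abcde
  annotations:
    k8s.v1.cni.cncf.io/networks: |
      [
        {
          "namespace": "default",
          "name": "tenantblue",
          "ipam-claim-reference": "vm1.tenantblue"
        }
      ]
# pod spec omitted for brevity
```

The assumption here is exactly the tolerance mentioned above: a CNI that does not understand the attribute is expected to simply ignore it.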

Unfortunately, IMHO this proposal addresses the wrong problem. It could work - at a glance. But it focuses on presenting a generic solution for things that run on pods, while all the requirements we have in front of us are related to VMs. Plus, there's an agreed enhancement proposal (including agreement from the sig-network maintainer) for this existing proposal. When we agreed to write this proposal we did it because we were discussing how to integrate with KubeVirt. This seems like a different solution, with a different scope (more ambitious).

IMO narrowing the scope down to VMs/VMIs is cleaner, and cheaper to develop and maintain (across the ecosystem) than what you're proposing - at the expense of forcing KubeVirt to consume a CRD and read a configuration object (... which it already does). Your proposal does have the advantage of looking like it could work for more types of workloads than VMs. But again: no one asked for that.

@maiqueb maiqueb requested a review from EdDev March 27, 2024 15:39
Comment on lines +56 to +67
Given OVN-Kubernetes (the CNI plugin where this feature will be implemented)
operates at pod level (i.e. it does **not** know what a KubeVirt VM is), and its
source of truth is in essence the live pods on the cluster, we need to have
something in the data model representing this existing allocation when the pod
is deleted (the stopped VM scenario). If we don't, those IP addresses would be
allocatable for other workloads while the VM is stopped.

Since IPAMClaim is part of the de-facto standard, we can say "Given a CNI supporting IPAMClaim".

When creating the pod, the OVN-Kubernetes IPAM module finds existing
`IPAMClaim`s for the workload. It will thus use those already reserved
allocations, instead of generating brand new allocations for the pod where the
encapsulating object will run. The migration scenario is similar.

At HyperShift we skip IP configuration at the pod and deliver addresses directly to the VM using DHCP; this allows us to use bridge binding, since it's just L2.

To have the same gateway independently of the node where the VM is running (in ovn-k the gateway of the default pod network differs depending on the node the VM is running on), we use an ARP proxy, and that is always configured. The only thing that is configured once the pod has taken over is the point-to-point routes.

More details here https://github.com/ovn-org/ovn-kubernetes/blob/master/docs/design/live_migration.md

pods in the cluster, but also from these CRs.

### Configuring the feature
We envision this feature to be configurable per network, meaning the network
Member:

What if multiple NADs have the same network but different allowPersistentIPs?
Should it fail at the KubeVirt level? I guess the CNI won't check it.

Contributor Author:

That's not possible according to ovn-kubernetes definition of a network - i.e. the configuration must be the same in all NADs.

I can't speak for other CNIs though.

Contributor Author:

That's not possible according to ovn-kubernetes definition of a network - i.e. the configuration must be the same in all NADs.

I can't speak for other CNIs though.

I don't expect the client side to care about that. It should care about the setting in the NAD.

Comment on lines +76 to +88
### Configuring the feature
We envision this feature to be configurable per network, meaning the network
admin should enable the feature by enabling the `allowPersistentIPs` flag in the
CNI configuration for the secondary network (i.e. in the
`NetworkAttachmentDefinition` spec.config attribute).

A feature gate may (or may not) be required in the KubeVirt.
Member:

As Eddy mentioned, since the feature relies on an alpha-stage CRD, I think it should be behind a FG.

We envision this feature to be configurable per network, meaning the network
admin should enable the feature by enabling the `allowPersistentIPs` flag in the
CNI configuration for the secondary network (i.e. in the
`NetworkAttachmentDefinition` spec.config attribute).
Member:

Why is it part of the spec.config and not an attribute in the spec? Should the CNI look at the value of allowPersistentIPs?

Contributor Author:

Why is it part of the spec.config and not an attribute in the spec?

I might be missing something. An attribute in the spec of what ?

Should the CNI look at the value of allowPersistentIPs?

Yes, the CNI also checks for this.

Member:

I might be missing something. An attribute in the spec of what ?

Of the NAD. I wasn't sure the CNI is using the value of allowPersistentIPs.
If it is, it makes sense for it to be part of the config.

Contributor Author:

yes, the CNI plugin should also ensure the plugin configuration has the knob enabled.
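For reference, a rough sketch of such a NAD. The plugin-specific fields (type, topology, subnets, netAttachDefName) are illustrative and depend on the CNI in use; allowPersistentIPs is the knob being discussed:

```yaml
# Illustrative NetworkAttachmentDefinition with the persistent-IPs knob enabled.
# Plugin-specific fields are examples only; allowPersistentIPs is what matters here.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: tenantblue
  namespace: default
spec:
  config: |
    {
      "cniVersion": "0.4.0",
      "name": "tenantblue",
      "type": "ovn-k8s-cni-overlay",
      "topology": "layer2",
      "subnets": "192.0.2.0/24",
      "netAttachDefName": "default/tenantblue",
      "allowPersistentIPs": true
    }
```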


### Configuring the feature
We envision this feature to be configurable per network, meaning the network
admin should enable the feature by enabling the `allowPersistentIPs` flag in the
Member:

What if allowPersistentIPs is disabled but there is an IPClaim, or the other way around? Does the CNI look for the IPClaim and ignore the allowPersistentIPs value?

Contributor Author:

we're actually discussing right now ignore vs error.

I think it should error, because it's pretty much clear we won't give the user what they want.

Thus failing seems more "honest".

- an external controller creates the persistent IPs allocation

The first two options require KubeVirt to instruct which allocations are
relevant when creating the pod encapsulating the VM workload; the last one only
Member:

the last - it is written "the last", but it seems the paragraph is about the "the CNI plugin creates the persistent IPs allocation" option.
Please add the pros and cons of the "an external controller creates the persistent IPs allocation" approach.

Contributor Author:

It's listed in the alternatives section, in detail.

I'll see what I can do about it here.

2. the user migrates the VM
3. the interfaces marked as absent will not be templated on the destination pod
(i.e. the migration destination pod will not have those interfaces)
4. KubeVirt dettaches the interface from the live VM
Member:

Actually this is step 2. It may happen before the migration (depending on how fast the migration was invoked).

Contributor Author:

OK, when I wrote this unplug worked differently.

- the spec.network attribute

The `IPAMClaim` name will be created using the following pattern:
`<vm name>.<logical network name>`. The logical network name is the [name of the
Member:

Technically, I'm not sure the VMI controller has a way to know the interface was just hotplugged and that therefore an old IPClaim shouldn't exist.

network](https://kubevirt.io/api-reference/main/definitions.html#_v1_network)
in the KubeVirt API.

The `IPAMClaim.spec.network` must be the name of network as per the
Member:

In the VM startup diagram the CNI is doing - subnet = GetNAD(ipamClaim).Subnet.
If the IPClaim has only the network name, how does the IPAM CNI find the NAD(s)?

Contributor Author:

it uses the network selection element. Seems the diagram is wrong, good catch.

Member:

If the network selection element is used, why does the IPClaim need the network field? When is it used?

Contributor Author:

the CNI uses it when synchronizing its IP pool using IPAMClaims. It needs to know which network the allocation belongs to.

Member:

Do you mean in case the CNI crashes and restarts?

Contributor Author:

Yes.

Member:

I still don't get why the network name is needed on the IPAMClaim. If the IPAM CNI has an access to the pods and to the NADs it can get the network name.

Contributor Author:

Added an explanation about this.

apiVersion: k8s.cni.cncf.io/v1alpha1
kind: IPAMClaim
metadata:
name: vm1.tenantblue
Member:

The fact the kubevirt logical network name and the OVN network name are both tenantblue is confusing. Please use different names (and add the VM/VMI snippet).

Contributor Author:

Done.
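To make the renaming concrete, a sketch of a VM spec fragment (values invented for illustration) where the KubeVirt logical network name differs from the NAD name; with the naming pattern described earlier in the thread, the resulting claim would be called vm1.data-network:

```yaml
# Illustrative fragment of a VirtualMachine spec: the logical network is
# "data-network", backed by the NAD "default/tenantblue".
spec:
  template:
    spec:
      domain:
        devices:
          interfaces:
            - name: data-network
              bridge: {}
      networks:
        - name: data-network
          multus:
            networkName: default/tenantblue
```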

Comment on lines 399 to 433
The feature is opt-in via the network-attachment-definition, (thus disabled by
default) and it does not impose any performance costs when disabled; as a
result, adding a feature gate is not recommended.
Member:

As I mentioned in a previous comment, I agree with Eddy and think a FG is required.
Besides what I mentioned in the previous comment regarding relying on a v1alpha1 API, I prefer not to parse the NAD's config if it is not required.


Please refer to the diagram below to better understand the proposed workflow
for VM creation:
1. the user provisions a VM object
2. the KubeVirt controller creates an IPAMClaim (in the same namespace of the
Member:

Why?
Would kubevirt raise a warning if a VM is using a NAD in the default NS and has allowPersistentIPs?

note over KubeVirt: we only iterate Multus non-default networks
loop for network := range vm.spec.networks
note over KubeVirt, apiserver: claimName := <vmName>.<network.Name>
KubeVirt->>apiserver: createIPAMClaim(claimName)
Member:

The design suggested two options: 1. KubeVirt will create an IPClaim for any secondary interface. 2. KubeVirt will create the IPClaim according to the NAD's allowPersistentIPs.
The diagram leaves the impression that option 1 was chosen.

encapsulating object will run. The migration scenario is similar.

#### Removing a Virtual Machine
This flow is - from a CNI perspective - quite similar to the
Member:

I meant in the doc:)

That finalizer is removed once the vm is deleted.

what if we don't have a VM (VMI only)?


currently manages. **Only afterwards** will it update the corresponding
`IPAMClaim` with the generated IPs. Users can only rely / use the `IPAMClaim`
status for informational purposes.
4. this step occurs in parallel to step 3; KubeVirt templates the KubeVirt
Member:

So step 3 cannot be finished until step 4 is done?
I mean, OVN-K needs the pod with its network selection elements to complete step 3. Right?

Contributor Author:

The numbering is wrong ... first KubeVirt templates the pod, then the CNI creates the sandbox, and part of that is IP allocation. Once we have an allocated IP, we persist it on the IPAMClaim (if a claim is being referenced, and if the network allows it).

I'll fix this. Good catch.

element the name of the IPAMClaim to use. If we use a webhook, we could take
this a step further, and mutate the templated pod to feature the required
IPAMClaim reference in the network selection element. Without looking into the
OVN-K configuration (i.e. the NAD.spec.config) we would have to have all
Member:

Seems the end of the sentence was cut.

Contributor Author:

Right. Finished the sentence properly.

Is it OK now ?

OVN-K configuration (i.e. the NAD.spec.config) we would have to have all


account since the CNI plugin operates exclusively at pod level, it cannot
Member:

Seems the beginning of the sentence was cut.

Contributor Author:

Done.

The first two options require KubeVirt to indicate via the network selection
element the name of the `IPAMClaim` when creating the pod encapsulating the VM
workload; the third option, requires KubeVirt to send all the required
information to OVN-Kubernetes, and it will create the persistent allocation,
Member:

Can you please elaborate. What is "all the required information"?

Contributor Author:

Hm, it's the owner reference. I'll add it.

Good catch.

Contributor Author:

Done.

devices:
...
interfaces:
- name: data-network
Member:

tenantblue

note over CNI: wait until IPAMClaims associated

loop for ipamClaim := range ipamClaims
IPAM CNI->>IPAM CNI: subnet = GetNAD(ipamClaim).Subnet
Member:

The current diagram gives the feeling that the IPAM CNI is traversing the IPAMClaims and assigning them IPs, independently of the NetworkSelectionElement.

Contributor Author:

Right, I've tried to correct it in the last push.

Does it make sense now ?

encapsulating object will run. The migration scenario is similar.

#### Removing a Virtual Machine
This flow is - from a CNI perspective - quite similar to the
Member:

I still cannot find foreground and background deletion explained in the doc.
IIRC, foreground was a big issue when thinking about the design (what happens if the IPAMClaim is removed before the VM). I believe it is important to describe the problem and how the current design solves it in the doc. I would expect to see an explanation how do we prevent the IPAMClaim from being removed before the VM, VMI and virt-launcher pod.


@maiqueb maiqueb force-pushed the sdn-ipam-secondary-nets branch 2 times, most recently from ca239ef to be606e5 Compare April 17, 2024 16:58
#### Removing a Virtual Machine
This flow is - from a CNI perspective - quite similar to the
[Stopping a VM flow](#stopping-a-virtual-machine). The main difference is after
the VM is deleted, Kubernetes Garbage Collection will kick in, and remove the
Member:

In a foreground deletion -

  1. The VM is marked to be removed.
  2. Garbage collector tries to remove the IPAMClaim, since the IPAMClaim has a finalizer a deletion timestamp is added.
  3. The VM is never removed since it has a dependent object that wasn't removed.
  4. The IPAMClaim is not removed since the VM is not removed.

@oshoval can you please explain how this works?

@oshoval Apr 18, 2024:

Exactly - the VM has a finalizer that is removed only when:

  1. the VM is marked for deletion
  2. the VMI is gone

When [1] + [2] happen, the VM finalizer is removed; this is exactly when we also remove the IPAM claim finalizer:
https://github.com/kubevirt/kubevirt/blob/e87c12294ae31810fdb75a6fb6f6b6d25f1f3107/pkg/virt-controller/watch/vm.go#L2974
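Putting that ordering into a sketch: the intermediate state of the claim during a foreground VM deletion could look roughly like this (the finalizer name, UID and timestamps are made up for illustration; the real finalizer is whatever the KubeVirt controller registers):

```yaml
# Illustrative only: IPAMClaim marked for deletion but held by a finalizer until
# the VM is marked for deletion and the VMI is gone; only then is the finalizer
# removed and the claim actually garbage-collected.
apiVersion: k8s.cni.cncf.io/v1alpha1
kind: IPAMClaim
metadata:
  name: vm1.data-network
  namespace: default
  deletionTimestamp: "2024-04-18T08:47:00Z"        # set once GC processes the owner
  finalizers:
    - kubevirt.io/ip-claim-protection               # illustrative name
  ownerReferences:
    - apiVersion: kubevirt.io/v1
      kind: VirtualMachine
      name: vm1
      blockOwnerDeletion: true
      uid: 00000000-0000-0000-0000-000000000000     # placeholder
spec:
  network: tenantblue                               # network name the CNI uses to sync its pool
```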

Member @AlonaKaplan Apr 18, 2024:

Oh, I see. The finalizer is removed when the deletion timestamp is added and the VMI is gone. OK, thanks!

@maiqueb maiqueb force-pushed the sdn-ipam-secondary-nets branch 4 times, most recently from 28b458b to f1a78d4 Compare April 18, 2024 08:47
Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
@AlonaKaplan
Member

Thanks!

/approve

@kubevirt-bot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AlonaKaplan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 18, 2024
@oshoval

oshoval commented Apr 18, 2024

Thanks !

/lgtm

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Apr 18, 2024
@kubevirt-bot kubevirt-bot merged commit e42362c into kubevirt:main Apr 18, 2024
2 checks passed
maiqueb added a commit to maiqueb/ovn-kubernetes that referenced this pull request Jun 12, 2024
Document the persistent IPs for virtualization workloads feature.
This feature is described in detail in the following KubeVirt
enhancement - [0].

[0] - kubevirt/community#279

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
dtzhou2 pushed a commit to dtzhou2/ovn-kubernetes that referenced this pull request Jul 8, 2024
Document the persistent IPs for virtualization workloads feature.
This feature is described in detail in the following KubeVirt
enhancement - [0].

[0] - kubevirt/community#279

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>