
DHCP-less IPAM proposal #373

Open · wants to merge 3 commits into main
Conversation

@hardys (Member) commented Dec 19, 2023

Draft proposal to explore how we can enable the CAPM3 IPAM flow in a DHCP-less environment, following on from recent discussions on Slack

/cc @Rozzii @dtantsur

@metal3-io-bot added the do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.) and size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) labels Dec 19, 2023
@hardys changed the title from [WIP] DHCP-less proposal to [WIP] DHCP-less CAPM3/IPAM proposal Dec 19, 2023
@pierrecregut left a comment

Here are my comments. The main issues I see are that I do not think there is a real need for the intermediate CR Metal3PreprovisioningTemplate, and that we need to choose whether to go for a format-specific solution or a purely template-based one.

@hardys (Member, Author) commented Dec 21, 2023

Thanks for the review @pierrecregut - and apologies for not tagging you originally, I wasn't sure of your github ID but I see now I should have figured it out 😂

@Rozzii (Member) left a comment

I wanted to give feedback sooner but I was busy. At first glance these are my ideas related to the topic; of course we will discuss this proposal further on the 31st of January at the M3TM.

@pierrecregut commented
@hardys At some point the document was moved to the cluster-api-provider-metal3 directory. But the solution should work at the baremetal-operator level and should not imply the use of a cluster. I also would not have changed the title to include CAPM3 explicitly.

@hardys (Member, Author) commented Jan 17, 2024

@hardys At some point the document was moved to the cluster-api-provider-metal3 directory. But the solution should work at the baremetal-operator level and should not imply the use of a cluster. I also would not have changed the title to include CAPM3 explicitly.

@pierrecregut this was in response to our previous discussion and an attempt to constrain the scope of this proposal such that it's actionable - we are exploring some BMO/BMH API dependencies, but the main aim is to extend the current CAPM3 templating so the existing IPAM integration works in DHCP-less situations?

@tuminoid (Member) commented
Please rebase at a convenient time to get the new markdown linter check.

@hardys force-pushed the dhcp-less branch 2 times, most recently from 9f29eb4 to 3b00020 on January 25, 2024 16:38
@hardys (Member, Author) commented Jan 25, 2024

@hardys At some point the document was moved to the cluster-api-provider-metal3 directory. But the solution should work at the baremetal-operator level and should not imply the use of a cluster. I also would not have changed the title to include CAPM3 explicitly.

I moved the proposal back to the top level since it now contains changes for both CAPM3 and BMO, and I can adjust the PR title if that makes things clearer.

@hardys (Member, Author) commented Jan 25, 2024

/retitle DHCP-less IPAM proposal

@metal3-io-bot changed the title from [WIP] DHCP-less CAPM3/IPAM proposal to DHCP-less IPAM proposal Jan 25, 2024
@metal3-io-bot removed the do-not-merge/work-in-progress label Jan 25, 2024
@hardys (Member, Author) commented Jan 25, 2024

Thanks everyone for your feedback so far - I've reworked the proposal and tried to take account of all the comments; please let me know if there are any further suggestions or things that I missed.

@pierrecregut left a comment

@hardys To summarize my comments: the proposal is great, except maybe the attempt at reclaiming IPs, which looks complex. There are two sections where alternatives are described, and finally we should address how to give feedback when the template cannot be expanded for a host.

the pre-provisioning phase as described above. To reduce the required size of the pool, the IPClaim will be deleted after the
preprovisioning phase is completed, e.g. the BMH resource becomes available.

In the provisioning phase another pool, associated with a cluster, is used to template networkData as in the existing process.

Review comment:

So if I understand well, you want to use two network configurations for the IPA? One during inspection and another during provisioning? This seems complex and I thought that the same configuration would be used. If you really want to delete the claim, I would do it after the node is provisioned, but this also requires erasing the preprovisioningNetworkDataName (and what it points to), so that when we upgrade the node we regenerate the data with a usable IP. Is IP address reuse on the provisioning network worth the complexity?
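For illustration only, here is a minimal sketch of what that deferred cleanup could look like, assuming the current BareMetalHost API (spec.preprovisioningNetworkDataName, status.provisioning.state) and that releasing any associated IPClaim is handled separately; the helper name is hypothetical and this is a sketch of the reviewer's suggestion, not part of the proposal.

```go
// Sketch (illustrative only): once the host is provisioned, drop the
// preprovisioning network-data reference and delete the Secret it points to,
// so a later re-provision regenerates fresh data instead of reusing a stale IP.
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	metal3v1alpha1 "github.com/metal3-io/baremetal-operator/apis/metal3.io/v1alpha1"
)

func releasePreprovisioningData(ctx context.Context, c client.Client,
	host *metal3v1alpha1.BareMetalHost) error {

	// Only act once the host has reached the provisioned state.
	if host.Status.Provisioning.State != metal3v1alpha1.StateProvisioned ||
		host.Spec.PreprovisioningNetworkDataName == "" {
		return nil
	}
	// Delete the Secret the reference points to.
	secret := &corev1.Secret{ObjectMeta: metav1.ObjectMeta{
		Name:      host.Spec.PreprovisioningNetworkDataName,
		Namespace: host.Namespace,
	}}
	if err := c.Delete(ctx, secret); err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	// Clear the reference so new data is generated on the next provisioning cycle.
	host.Spec.PreprovisioningNetworkDataName = ""
	return c.Update(ctx, host) // the associated IPClaim would be released separately
}
```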

@hardys (Member, Author) replied

The first use-case, Preprovisioning - Common IPPool, covers the case where the same network configuration is used for IPA and during provisioning; I agree this is the simpler and probably more common case.

My understanding is that there is also a use-case where a separate network is used for the pre-provisioning phase, so I'm trying to cover that scenario here.

I agree that perhaps reclaiming the IPClaim after inspection introduces undue complexity, but if we don't do that the provisioning network address range potentially becomes a lot larger (number of deployed BMHs vs the concurrency allowed by BMO); perhaps that's a worthwhile tradeoff and we can consider the IPClaim deletion in future if it proves necessary?

@zaneb (Member) commented Apr 29, 2024

The use case where there'd be two is if you have hosts that are dynamically assigned to clusters (and the networks reconfigured at the same time). I'm not sure this exists in the wild, but it would be good to allow it.
In the more common case of the network staying the same, I think you really want the IP address not to change. The one in the inspection data is the one you want to have in the cluster. So I think ideally you only want to delete the preprovisioning IP claim at provisioning time, and only if a separate provisioned IP claim exists.
In fact, I think you have to keep the claim until after provisioning starts, because the host will continue to use its preprovisioning IP to talk to Ironic (and that is the only way the host can get provisioned). The "preprovisioning" phase extends up until the BMH hits the provisioned state (i.e. after Ironic reboots the host after writing the image to disk, which I think is when the node moves to active), not just after inspection.

a customized ISO, it can embed the configuration instead of Ironic to achieve the same result.

### Network configuration failure can't be reflected in status

Review comment:

I agree that this part is out of the scope of the proposal, and it is not worse than when the IPA fails because of a DHCP configuration problem. But we should address the feedback given when the templater fails to generate the preprovisioningNetworkData for a given bare metal host. Feedback at the level of the template would be strange if the reason is specific to a BMH. A condition on the BMH looks weird as this is not performed by the main controller. A solution would be to at least generate an event associated with the BMH.

@hardys (Member, Author) replied

Thanks this is a good point - there are probably a few error-paths to consider:

  • The baremetalhost.metal3.io/preprovisioningnetworkdata-reference value does not reference a valid resource (perhaps BMO should validate this, but only by checking that the resource exists?)
  • A problem with the DataTemplate resource - the schema is enforced by the API, but AFAICS for Metal3DataTemplate any other error will be reported via the resource status, so I think we'll do the same for Metal3PreprovisioningDataTemplate given they will likely share the same code
  • Some other problem in CAPM3 related to generating and setting the Secret - here we don't have a Machine resource during the pre-provisioning phase, so we will have to do as you suggest and generate an event associated with the BMH (see the sketch below)
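For that last error path, a minimal sketch (an assumption, not an agreed implementation) of how a templating controller could surface the failure as a Kubernetes Event on the BMH, assuming a controller-runtime based controller; the reconciler name and the event reason string are illustrative.

```go
// Minimal sketch: surface a per-host templating failure as a Warning Event on
// the BareMetalHost, so it shows up in `kubectl describe baremetalhost` even
// though the BMH is not owned by the templating controller.
package controllers

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"

	metal3v1alpha1 "github.com/metal3-io/baremetal-operator/apis/metal3.io/v1alpha1"
)

type PreprovisioningTemplateReconciler struct {
	// Typically obtained from mgr.GetEventRecorderFor("preprovisioning-template").
	Recorder record.EventRecorder
}

// reportTemplateError is a hypothetical helper; the reason string is illustrative.
func (r *PreprovisioningTemplateReconciler) reportTemplateError(
	host *metal3v1alpha1.BareMetalHost, err error) {
	r.Recorder.Event(host, corev1.EventTypeWarning,
		"PreprovisioningNetworkDataError", err.Error())
}
```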

There are several alternative approaches to solve the problem of preventing BMO from starting inspection immediately on BMH registration:

* Add a new BareMetalHost API `PreprovisioningNetworkDataRequired` which defaults to false, but when set to true will describe that the host cannot move from Registering -> Inspecting until `preprovisioningNetworkDataName` has been set.
* Create the BMH with a preprovisioningNetworkDataName pointing to an empty Secret. BMO refuses to start inspecting until the Secret contains some data.

Review comment:

The principle that necessary data can be specified as a non-existing resource, and that the BMH will just wait and not err until the data is provided, is nice. Even with the current approach, this could be used by a template custom resource that can generate the data but does not have the power to modify the BMH fields. And the principle can be generalized to other data (userData, networkData).
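A minimal sketch of that "wait, don't error" principle, assuming a controller-runtime client and the existing spec.preprovisioningNetworkDataName field; it also folds in the "empty Secret" alternative from the excerpt above by treating a Secret with no data as not ready yet. The helper name and requeue interval are illustrative, and this is not the actual BMO state-machine code.

```go
// Sketch: requeue (without erroring) while the referenced Secret does not
// exist or is still empty, so the BMH stays in Registering instead of
// failing inspection.
package controllers

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	metal3v1alpha1 "github.com/metal3-io/baremetal-operator/apis/metal3.io/v1alpha1"
)

func waitForPreprovisioningNetworkData(ctx context.Context, c client.Client,
	host *metal3v1alpha1.BareMetalHost) (ctrl.Result, error) {

	name := host.Spec.PreprovisioningNetworkDataName
	if name == "" {
		return ctrl.Result{}, nil // nothing referenced, proceed as today
	}
	secret := &corev1.Secret{}
	err := c.Get(ctx, types.NamespacedName{Namespace: host.Namespace, Name: name}, secret)
	if apierrors.IsNotFound(err) || (err == nil && len(secret.Data) == 0) {
		// Data not provided yet: wait rather than err.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	return ctrl.Result{}, err
}
```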

@hardys (Member, Author) commented Feb 6, 2024

To cross-reference, we had some good discussions related to this proposal in the community meetup call last week.

@Rozzii (Member) left a comment

For my part I am okay with everything, I just suggested a tiny change in one place; otherwise LGTM.

@metal3-io-bot (Contributor) commented Mar 17, 2024

@hardys: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
--- | --- | --- | --- | ---
shellcheck | ff02bb4 | link | true | /test shellcheck

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Draft proposal to explore how we can enable the CAPM3 IPAM flow in a DHCP-less
environment

Signed-off-by: Steven Hardy <steven.hardy@suse.com>
@Rozzii (Member) commented Mar 25, 2024

/approve

@Rozzii (Member) commented Mar 25, 2024

/lgtm

@metal3-io-bot added the lgtm (Indicates that a PR is ready to be merged.) label Mar 25, 2024
Rozzii previously approved these changes Apr 23, 2024
@Rozzii (Member) commented May 6, 2024

/cc @lentzi90 @kashifest please take a look

@metal3-io-bot (Contributor) commented
@Rozzii: GitHub didn't allow me to request PR reviews from the following users: please, take, a, look.

Note that only metal3-io members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @lentzi90 @kashifest please take a look

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@lentzi90 (Member) left a comment

Sorry for the much delayed review.
Thanks for the proposal and great work on the writeup!

I have two concerns/questions:

  1. Using an annotation may seem like an easy solution but I worry that it will not be nice in the long run. We get no schema validation (typos will be nasty to find) and no way of evolving gracefully with conversions between API versions. To me the annotation feels like we are bolting on something rather than building a proper feature with full support because of this.
    There were some alternatives described with the disadvantage that they would not handle the association of template to BMH. Why is this? Could we not have a proper BMH API for that?
  2. There are discussions ongoing around how to make CAPM3 support multi-tenancy. One of the use-cases I see is that the owner of the BMH does not have to be the same as the owner of the (Metal3)Machine. By handling the preprovisioningNetworkData through CAPM3 we would essentially tie them together even more and make multi-tenancy harder.
    My naive proposal would be that BMO should have the new controller that handles the preprovisioningNetworkData and CAPM3 should not know or care about this. If one team is owning the whole thing there is no difference, but if it is split up, I expect that the team owning the BMHs would also handle preprovisioning. What do you think about this?

@tuminoid (Member) commented
Hi @hardys, there are a few open review comments/questions here, are you planning on continuing/finishing this?

@hardys (Member, Author) commented Jun 4, 2024

Hi @hardys, there are a few open review comments/questions here, are you planning on continuing/finishing this?

Hi @tuminoid yes I would still like to finish this but have been busy with other tasks - I see @lentzi90 has some feedback so I will look into those, and perhaps we may need to have some discussions on the weekly call to reach consensus.

Add code review suggestion

Co-authored-by: Lennart Jern <lennart.jern@gmail.com>
@metal3-io-bot (Contributor) commented
New changes are detected. LGTM label has been removed.

@metal3-io-bot removed the lgtm label Jun 4, 2024
@metal3-io-bot (Contributor) commented
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Rozzii
Once this PR has been reviewed and has the lgtm label, please assign zaneb for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

1 similar comment

Add code review suggestion

Co-authored-by: Lennart Jern <lennart.jern@gmail.com>
@hardys (Member, Author) commented Jun 4, 2024

Sorry for the much delayed review. Thanks for the proposal and great work on the writeup!

I have two concerns/questions:

  1. Using an annotation may seem like an easy solution but I worry that it will not be nice in the long run. We get no schema validation (typos will be nasty to find) and no way of evolving gracefully with conversions between API versions. To me the annotation feels like we are bolting on something rather than building a proper feature with full support because of this.
    There were some alternatives described with the disadvantage that they would not handle the association of template to BMH. Why is this? Could we not have a proper BMH API for that?

Please could you check the previous thread where this was discussed with @pierrecregut and @zaneb in some detail - the consensus was that adding an annotation was a lightweight option which seems more appropriate given that the BMH doesn't consume this template (it's layered on top via another controller)

I'm open to revisiting that direction but it would be really helpful if you could articulate a specific alternative so we can understand the pros/cons vs the minimally invasive annotation choice previously agreed.

  1. There are discussions on-going around how to make CAPM3 support multi-tenancy. One of the use-cases I see is that the owner of the BMH does not have to be the same as the owner of the (Metal3)Machine. By handling the preprovisioningNetworkData through CAPM3 we would essentially tie them together even more and make multi-tenancy harder.

I'm not sure this is true, check the IPAM Scenario 2 - decoupled preprovisioning/provisioning IPPool scenario in this proposal, I think that describes exactly this use-case?

My naive proposal would be that BMO should have the new controller that handles the preprovisioningNetworkData and CAPM3 should not know or care about this. If one team is owning the whole thing there is no difference, but if it is split up, I expect that the team owning the BMHs would also handle preprovisioning. What do you think about this?

Logically what you describe makes sense, we could have a new controller as described in this proposal added to BMO instead of CAPM3, but practically my concern is it'll basically duplicate the existing CAPM3 controller which handles networkData - we'd need to find a way to handle that which won't make future refactoring (for example to enable templating of nmstate format as an alternative to cloud-init) more difficult.

@pierrecregut @zaneb what are your thoughts? Is it worth reworking this proposal again to consider e.g adding a preprovisioningDataTemplate field to the BMH directly, along with an optional/pluggable templating controller?

@pierrecregut commented
On one hand, I also agree that explicit fields are better than labels or annotations. I am not sure there is an issue with version migration as it is just a reference, but there is also the built-in documentation of the CRD that makes fields much more usable. On the other hand, it is also clear that there is no real consensus on the network format, so the field must be able to consume different templates, and therefore different resource kinds. A ref with a kind and a name may be fine. It is still true that the content will not be handled by the BareMetalHost controller itself but by another one, even if a default may be bundled in the BmhController pod.

@lentzi90 (Member) commented Jun 6, 2024

Sorry for the much delayed review. Thanks for the proposal and great work on the writeup!
I have two concerns/questions:

  1. Using an annotation may seem like an easy solution but I worry that it will not be nice in the long run. We get no schema validation (typos will be nasty to find) and no way of evolving gracefully with conversions between API versions. To me the annotation feels like we are bolting on something rather than building a proper feature with full support because of this.
    There were some alternatives described with the disadvantage that they would not handle the association of template to BMH. Why is this? Could we not have a proper BMH API for that?

Please could you check the previous thread where this was discussed with @pierrecregut and @zaneb in some detail - the consensus was that adding an annotation was a lightweight option which seems more appropriate given that the BMH doesn't consume this template (it's layered on top via another controller)

I'm open to revisiting that direction but it would be really helpful if you could articulate a specific alternative so we can understand the pros/cons vs the minimally invasive annotation choice previously agreed.

Thanks for the pointer, I have now gone through the thread and understand the reasoning better. I would like to suggest at minimum one change in the design, and also to give a couple of comparisons to alternatives that we could consider.

The minimum change I would like is that BMO should not care about the annotation. Instead, the preprovisioningNetworkDataName would be set to the name of the secret to be created. BMO would wait until it exists before inspection is attempted. The controller reading the annotation would be responsible for checking what name the secret should have and creating it.

This flow is identical to how cert-manager handles annotations on Ingress resources and should be familiar to many users. Note that this way of working is limited. It is not possible to request any special settings through the annotation since it is only a reference to another resource.
Because of this limitation, it is quite common to create Certificate resources directly instead. Translating this to Metal3, I think it should be expected that users may want to create explicit Metal3DataClaims or IPAddressClaims to get better control as well. I'm unsure if this would be possible with the proposal as it is?
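To make that cert-manager-style flow concrete, here is a rough sketch of the annotation-reading controller, assuming the baremetalhost.metal3.io/preprovisioningnetworkdata-reference annotation from this proposal, a hypothetical render() helper standing in for the real templating logic, and an assumed networkData key in the Secret; BMO itself never touches the annotation and simply waits for the named Secret to appear (as in the sketch earlier in this thread).

```go
// Rough sketch (assumptions noted above): read the template reference from
// the annotation, render the network data, and create the Secret whose name
// the BMH already declares in spec.preprovisioningNetworkDataName.
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	metal3v1alpha1 "github.com/metal3-io/baremetal-operator/apis/metal3.io/v1alpha1"
)

const templateAnnotation = "baremetalhost.metal3.io/preprovisioningnetworkdata-reference"

// render is a hypothetical stand-in for expanding the referenced template
// (e.g. filling in per-host addresses from an IPClaim) into network data.
func render(ctx context.Context, c client.Client, templateName string,
	host *metal3v1alpha1.BareMetalHost) ([]byte, error) {
	return []byte("# rendered network data for " + host.Name), nil
}

func reconcilePreprovisioningSecret(ctx context.Context, c client.Client,
	host *metal3v1alpha1.BareMetalHost) error {

	templateName, ok := host.Annotations[templateAnnotation]
	if !ok || host.Spec.PreprovisioningNetworkDataName == "" {
		return nil // nothing for this controller to do
	}

	networkData, err := render(ctx, c, templateName, host)
	if err != nil {
		return err
	}

	secret := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Name:      host.Spec.PreprovisioningNetworkDataName,
			Namespace: host.Namespace,
		},
		// Secret key name assumed here; BMO only needs the Secret to exist
		// with the expected data before it starts inspection.
		Data: map[string][]byte{"networkData": networkData},
	}
	if err := c.Create(ctx, secret); err != nil && !apierrors.IsAlreadyExists(err) {
		return err
	}
	return nil
}
```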

The second comparison I want to make is to PersistentVolume(Claims) and StatefulSets. I think this is also similar to what we want to achieve and it is solved in a different way. StatefulSets have volumeClaimTemplates that provide a way to dynamically create PersistentVolumeClaims for the Pods, that would then be filled from a StorageClass.

I think this has the advantage of giving the user better control than the annotation, while also automating claim creation. However, it would be a larger API change, and we do not have the same need for dynamically creating the claims since the BMHs don't scale. I still wanted to give this as an alternative to consider, but I am fine with the annotation as described above.

  1. There are discussions on-going around how to make CAPM3 support multi-tenancy. One of the use-cases I see is that the owner of the BMH does not have to be the same as the owner of the (Metal3)Machine. By handling the preprovisioningNetworkData through CAPM3 we would essentially tie them together even more and make multi-tenancy harder.

I'm not sure this is true, check the IPAM Scenario 2 - decoupled preprovisioning/provisioning IPPool scenario in this proposal, I think that describes exactly this use-case?

You are right! Somehow I missed that. It does address the use-case from the "user point of view". I.e. one user can manage the BMHs and preprovisioning, while another is "consuming" the BMHs for running workload. This is all good.

I still have a concern when thinking one step further though. If CAPM3 is going to handle the preprovisioningNetworkData and directly interact with the BMHs, then the user must delegate enough privileges to CAPM3 to be able to do this. My long term goal is to add an indirection so that CAPM3 would not need this access. The point is to avoid giving the "consumer" access to BMC credentials, Ironic endpoints and so on.

However, I admit that CAPM3 for now (even with the multi-tenancy proposal) needs this access so adding the preprovisioningNetworkData is just one more thing. If we can agree that CAPM3 will just read the annotation and create the secret, then that is good enough for me. My concern would be if CAPM3 would directly change the BMH object to add the preprovisioningNetworkData.

My naive proposal would be that BMO should have the new controller that handles the preprovisioningNetworkData and CAPM3 should not know or care about this. If one team is owning the whole thing there is no difference, but if it is split up, I expect that the team owning the BMHs would also handle preprovisioning. What do you think about this?

Logically what you describe makes sense, we could have a new controller as described in this proposal added to BMO instead of CAPM3, but practically my concern is it'll basically duplicate the existing CAPM3 controller which handles networkData - we'd need to find a way to handle that which won't make future refactoring (for example to enable templating of nmstate format as an alternative to cloud-init) more difficult.

Good point. I am fine ignoring this for now. As we progress with multi-tenancy and "BMH claims", we may need to revisit, but I don't want to block this proposal based on some imaginary future proposal and design.

@pierrecregut commented
Logically what you describe makes sense, we could have a new controller as described in this proposal added to BMO instead of CAPM3, but practically my concern is it'll basically duplicate the existing CAPM3 controller which handles networkData - we'd need to find a way to handle that which won't make future refactoring (for example to enable templating of nmstate format as an alternative to cloud-init) more difficult.

I am not sure code reuse (which is a Go issue) and the location of controllers need to be mixed. We could have a module (separate, to avoid linking the dependencies of the BMO controller and the CAPM3 controller) in the BMO repository that handles network templating with respect to a set of metadata sources. It would also contain the types describing networks with marshalling annotations. It would be used by the PreprovisioningTemplate controller in BMO with only the BareMetalHost as source, and in CAPM3 by the DataTemplate controller with the Machine, M3M and BareMetalHost as sources.
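A rough sketch of what such a shared module could look like, using Go's text/template purely as an illustration; the package name, function signature and source-map keys are hypothetical, and the real module would presumably also carry the network-data types mentioned above.

```go
// Hypothetical shared templating helper: one module that expands a
// network-data template against a set of metadata sources, reusable by both
// a BMO-side PreprovisioningTemplate controller (sources: the BareMetalHost
// only) and the CAPM3 DataTemplate controller (sources: Machine,
// Metal3Machine and BareMetalHost).
package networktemplate

import (
	"bytes"
	"text/template"
)

// Render expands tmpl against the named metadata sources, e.g.
// {"baremetalhost": bmh} or {"machine": m, "m3machine": m3m, "baremetalhost": bmh}.
func Render(tmpl string, sources map[string]interface{}) ([]byte, error) {
	t, err := template.New("networkdata").Option("missingkey=error").Parse(tmpl)
	if err != nil {
		return nil, err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, sources); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}
```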

Defining the preprovisioning configuration is really a responsibility of the teams handling the hardware, because its primary purpose is to talk to the Ironic instance they manage. They have no reason to deploy CAPM3 on their cluster if BMH and cluster management are separate.

There are several potential workarounds for this situation, but it would be preferable for a common solution
to be found which can benefit both Metal3 and the broader Ironic community; as such, solving this problem is considered
outside the scope of this proposal, but we will follow and engage with the
[Ironic community spec related to this problem](https://review.opendev.org/c/openstack/ironic-specs/+/906324/1/specs/approved/fix-vmedia-boot-config.rst)
@hardys (Member, Author) replied

After discussing upstream on IRC the following patches were mentioned which should help resolve the config-2 issue:

https://review.opendev.org/c/openstack/ironic-python-agent-builder/+/920184

https://review.opendev.org/c/openstack/ironic-python-agent-builder/+/915170
https://review.opendev.org/q/I9b74ec977fabc0a7f8ed6f113595a3f1624f6ee6

I will update to reflect this latest status.

Labels: size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)

9 participants