IPClaims missing for Machines #1354

Closed
lentzi90 opened this issue Dec 11, 2023 · 1 comment · Fixed by #1361
Labels: kind/bug, needs-triage

lentzi90 (Member) commented Dec 11, 2023

What steps did you take and what happened:

It sometimes happens that a Metal3Machine comes up without an IPClaim after a rolling upgrade of a MachineDeployment.

Update

I think we have pinpointed the issue!

Code path:

If the new Metal3Data is created before the old IPClaim is deleted, it can pick up the old IPClaim and render the secret from it before the deletion goes through.
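
To make the suspected race concrete, here is a minimal Go sketch of the lookup-by-name pattern. The helper name and structure are hypothetical, not the actual CAPM3 code; only the overall shape (get by deterministic name, reuse whatever is found) reflects the suspicion above.

```go
package sketch

import (
	"context"

	ipamv1 "github.com/metal3-io/ip-address-manager/api/v1alpha1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ensureIPClaim looks up an IPClaim by its deterministic name and reuses
// whatever it finds, creating a new claim only when nothing exists yet.
func ensureIPClaim(ctx context.Context, c client.Client, namespace, name string) (*ipamv1.IPClaim, error) {
	claim := &ipamv1.IPClaim{}
	err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, claim)
	if err == nil {
		// Suspected problem: this can be the previous rollout's IPClaim, which
		// is about to be deleted. Without a DeletionTimestamp check, the new
		// Metal3Data renders its secret from an address that will be released.
		return claim, nil
	}
	if !apierrors.IsNotFound(err) {
		return nil, err
	}
	// Nothing found: create a fresh claim for the new Metal3Data
	// (spec fields omitted for brevity).
	newClaim := &ipamv1.IPClaim{}
	newClaim.Name = name
	newClaim.Namespace = namespace
	return newClaim, c.Create(ctx, newClaim)
}
```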

Reproduction criteria:

  • The IPClaim name must be the same before and after the rollout (this can happen in at least two cases, sketched in code after this list), so that the controller finds the old claim when looking for the new one.
    • If BMH name based pre-allocation is enabled and the same BMH happens to be used for the old and new M3M, the IPClaim gets the name `<bmh-name>-<ippool-name>`.
    • If BMH name based pre-allocation is not used and the new Metal3Data happens to get the same index/name, the IPClaim gets the name `<m3d-name>-<ippool-name>`.
  • The IPClaim must not get the deletionTimestamp until after the new Metal3Data has gotten the IPAddress from it.
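
A hypothetical illustration of the two naming patterns (the real naming code in CAPM3 may look different, but the resulting names follow the patterns described above):

```go
// ipClaimName sketches how the deterministic claim name can repeat across a rollout.
func ipClaimName(bmhBasedPreallocation bool, bmhName, m3dName, ipPoolName string) string {
	if bmhBasedPreallocation {
		// Same BMH reused across the rollout => same claim name.
		return bmhName + "-" + ipPoolName
	}
	// Same Metal3Data index/name across the rollout => same claim name.
	return m3dName + "-" + ipPoolName
}
```

For example (names invented for illustration): if the old and new M3M both end up on BMH `worker-0` with IPPool `provisioning-pool`, both rollouts look for the IPClaim `worker-0-provisioning-pool`, so the new Metal3Data can find and reuse the old, not-yet-deleted claim.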

End of update

Second update

After trying to find out why (or whether) the IPClaim was not deleted before or together with the Metal3Data, I have concluded that the IPClaim should always be deleted before the finalizers are removed from the Metal3Data. So how come we still see this issue?
I think the answer is the cache. Everything except BMHs, ConfigMaps and Secrets is cached, so we could simply be getting a stale, cached IPClaim 🤦 (see the uncached-read sketch below).
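
To make the cache angle concrete, here is a minimal sketch of an uncached read, assuming a controller-runtime setup where the reader comes from `mgr.GetAPIReader()`. The helper and the DeletionTimestamp handling are illustrative assumptions, not necessarily what the eventual fix (#1361) does.

```go
package sketch

import (
	"context"

	ipamv1 "github.com/metal3-io/ip-address-manager/api/v1alpha1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getLiveIPClaim reads the IPClaim directly from the API server through an
// uncached client.Reader (e.g. mgr.GetAPIReader()), so a deletionTimestamp set
// just before the read is actually visible instead of a stale cached copy.
func getLiveIPClaim(ctx context.Context, apiReader client.Reader, key types.NamespacedName) (*ipamv1.IPClaim, bool, error) {
	claim := &ipamv1.IPClaim{}
	if err := apiReader.Get(ctx, key, claim); err != nil {
		return nil, false, err
	}
	if !claim.DeletionTimestamp.IsZero() {
		// The claim belongs to the previous rollout and is being deleted;
		// do not reuse it for the new Metal3Data.
		return claim, false, nil
	}
	return claim, true, nil
}
```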

End of second update

Unfortunately we have not been able to reproduce the issue yet.
Here is what we know so far:

  1. During an upgrade, a Metal3Machine comes up without IPClaims (or with some of them missing).
  2. It is unclear whether the IPClaim was there and then deleted, or whether it was never created.
  3. It is unclear whether the Metal3Data and Metal3DataClaim were also missing.
  4. Before this, we heard that one Machine was stuck deleting and was forcefully removed, possibly by deleting the Node.
  5. BMH name based pre-allocation is enabled but not used.
  6. Node reuse is disabled.
  7. Disk cleaning is disabled.
  8. There are multiple MachineDeployments, but only one is rolled out at a time, using scale-in with at most 1 unavailable.
  9. There are many IPPools, about 6 addresses per Machine, but they are only used for metadata (no networkData).
  10. A new IPPool was added as part of the upgrade.
  11. Each MachineDeployment has its own Metal3MachineTemplate and Metal3DataTemplate.
  12. New Metal3MachineTemplates and Metal3DataTemplates are created and used when rolling out (needed because of the new IPPool).
  13. The Metal3DataTemplates all use templateReference. The reference is unique per MachineDeployment and kept the same when rolling to a new Metal3DataTemplate.

What did you expect to happen:

All Metal3Machines that are configured to use IPClaims should have the proper IPClaims created.
Otherwise there is a risk of IP clashes and all kinds of strange behavior.

Anything else you would like to add:

We are still investigating and are not 100% sure that the issue is in CAPM3, but it is our best lead so far.
I'm creating this issue now to keep track of findings and link to potentially relevant known bugs.

Environment:

  • Cluster-api version: v1.5.3
  • Cluster-api-provider-metal3 version: v1.5.2
  • IP-address-manager: v1.5.2
  • Environment (metal3-dev-env or other): other
  • Kubernetes version (kubectl version): v1.28.3

Discovered or related issues:

/kind bug

@metal3-io-bot added the kind/bug and needs-triage labels Dec 11, 2023
@metal3-io-bot (Contributor) commented:

This issue is currently awaiting triage.
If Metal3.io contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
