Add bootstate endpoint to registry server #424

Merged
Nuckal777 merged 4 commits into main from enh/bootstate-endpoint on Nov 21, 2025

Conversation

@Nuckal777
Contributor

Proposed Changes

Fixes #399.

@github-actions github-actions bot added the size/L and enhancement (New feature or request) labels Aug 7, 2025
@Nuckal777 Nuckal777 force-pushed the enh/bootstate-endpoint branch from bd4e68c to 9a2c29f Compare August 7, 2025 15:35
@afritzler
Member

Adding a k8sClient to the registry server will make it harder in the future to factor out the registry and deploy it somewhere where it might not have access to the metal-operator API server. How about instead the ServerReconciler probes the /bootstate endpoint periodically since we are doing a periodic retry on all Server objects?

@Nuckal777
Contributor Author

How about instead the ServerReconciler probes the /bootstate endpoint periodically since we are doing a periodic retry on all Server objects?

This would introduce a gap in which the bootstate can be lost:

  • Server boots up, calls the bootstate endpoint and receives an HTTP 200.
  • Some time passes, while the bootstate only resides in memory.
  • Server controller reconciles the Server object and applies the condition.

Restarting the registry during the second step would lose the bootstate event.
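
The gap described above can be sketched as follows. This is only an illustration, assuming the registry keeps the boot state in a plain in-memory map; `bootStateStore`, `Record`, and `Probe` are hypothetical names and not taken from this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// bootStateStore is a hypothetical in-memory record of servers that have
// called /bootstate. Nothing in it is persisted.
type bootStateStore struct {
	mu     sync.Mutex
	booted map[string]bool
}

func newBootStateStore() *bootStateStore {
	return &bootStateStore{booted: make(map[string]bool)}
}

// Record stands in for what a /bootstate handler would do before
// answering with HTTP 200.
func (s *bootStateStore) Record(serverID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.booted[serverID] = true
}

// Probe stands in for what a periodic reconcile would ask the registry.
func (s *bootStateStore) Probe(serverID string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.booted[serverID]
}

func main() {
	registry := newBootStateStore()
	registry.Record("server-a") // step 1: the server boots and gets its 200

	// Step 2: the registry restarts before the next reconcile, so the
	// in-memory state starts out empty again.
	registry = newBootStateStore()

	// Step 3: the reconciler probes and no longer sees the boot event.
	fmt.Println(registry.Probe("server-a")) // prints "false"
}
```

Writing the condition at the moment the POST arrives avoids this window, at the cost of the registry needing API server access, which is the trade-off raised above.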

@Nuckal777 Nuckal777 force-pushed the enh/bootstate-endpoint branch from 9a2c29f to f32a257 Compare October 27, 2025 14:14
@hardikdr hardikdr self-requested a review October 28, 2025 10:28
Member

@hardikdr hardikdr left a comment

Thanks for the PR @Nuckal777, dropped some nits inline.

Thinking about it a little more, I’d propose the following, more holistic way of handling the first-boot flow. Instead of tying it to a single signal, we could make it more flexible: for example, let the boot-operator set conditions like IPXEScriptFetched and IgnitionDataFetched, and have the /bootstate POST update a BootStateReceived condition on the ServerBootConfig.

Then, the metal-operator could decide which of these conditions to treat as the actual boot completion, using a boot-completion-condition flag. This would also make it easier to support things like NetBootOnce and NetBootAlways policies on the ServerClaim side later, even when the /bootstate POST call is not configured in the Ignition. Wdyt?
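
A rough sketch of how such a selection could look, assuming the value of the boot-completion-condition flag is simply the name of a condition type. The `condition` struct is a local stand-in for the relevant fields of `metav1.Condition`, and `bootCompleted` is a hypothetical helper, not code from this PR.

```go
package main

import "fmt"

// condition mirrors only the fields of metav1.Condition needed here.
type condition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

// bootCompleted reports whether the condition named by completionType is
// present and True. A missing condition counts as "not completed".
func bootCompleted(conds []condition, completionType string) bool {
	for _, c := range conds {
		if c.Type == completionType {
			return c.Status == "True"
		}
	}
	return false
}

func main() {
	conds := []condition{
		{Type: "IPXEScriptFetched", Status: "True"},
		{Type: "IgnitionDataFetched", Status: "True"},
		{Type: "BootStateReceived", Status: "False"},
	}

	// The same ServerBootConfig is "booted" or not depending on the
	// condition the operator is configured to look at.
	fmt.Println(bootCompleted(conds, "IgnitionDataFetched")) // prints "true"
	fmt.Println(bootCompleted(conds, "BootStateReceived"))   // prints "false"
}
```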

conditionutils.UpdateStatus(metav1.ConditionTrue),
conditionutils.UpdateReason("BootStatePosted"),
conditionutils.UpdateMessage("Server successfully posted boot state"),
conditionutils.UpdateObserved(&server),
Member

It would be interesting to explore the trade-offs between having this condition on the Server vs. the ServerBootConfig CR. I am leaning somewhat more towards ServerBootConfig.

Contributor Author

Hmm, ServerBootConfiguration might be more fitting, as its lifetime is bound to a ServerClaim and the discovery boot. 🤔

Member

Perfect, I would say let's already move it to ServerBootConfig, and we can also discuss this in our PR review roulette soon.

@Nuckal777
Contributor Author

Instead of tying it to a single signal, we could make it more flexible...

Agree, iirc it was part of the initial discussion on the topic that the owner of a ServerClaim could choose what is considered to be a successful network boot.

@hardikdr
Member

hardikdr commented Nov 4, 2025

Instead of tying it to a single signal, we could make it more flexible...

Agree, iirc it was part of the initial discussion on the topic that the owner of a ServerClaim could choose what is considered to be a successful network boot.

Sure, having a knob on the ServerClaim would allow configuring it per claim. I would also consider the trade-offs against having a single common flag in the metal-operator.
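
The two options do not have to exclude each other. One possible shape, sketched under the assumption that a claim could carry an optional override and the operator flag acts as the default; `effectiveCompletionCondition` and both parameters are hypothetical and not part of this PR:

```go
package main

import "fmt"

// effectiveCompletionCondition picks the condition type that marks a boot
// as complete. A non-empty per-claim override wins; otherwise the
// operator-wide default applies.
func effectiveCompletionCondition(claimOverride, operatorDefault string) string {
	if claimOverride != "" {
		return claimOverride
	}
	return operatorDefault
}

func main() {
	// A claim without an override follows the operator-wide setting.
	fmt.Println(effectiveCompletionCondition("", "BootStateReceived"))
	// A claim with an override uses its own choice.
	fmt.Println(effectiveCompletionCondition("IgnitionDataFetched", "BootStateReceived"))
}
```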

@Nuckal777 Nuckal777 force-pushed the enh/bootstate-endpoint branch 2 times, most recently from 2118833 to 4c36b38 Compare November 7, 2025 19:27
Member

@hardikdr hardikdr left a comment

Tests seem to be failing.
Looks good otherwise.

@Nuckal777 Nuckal777 force-pushed the enh/bootstate-endpoint branch from 4c36b38 to 6f37a36 Compare November 14, 2025 14:53
Contributor

@defo89 defo89 left a comment

I think it looks good; I also find the suggestions from @hardikdr valuable.

For the future, we could consider adding a BootStateReceivedTimeout for cases where the registry fails to receive the boot state from the server (issues with the registry, or the server failed to boot), so that redeploying the server can be retried.
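
As a sketch of what such a check might look like, assuming the controller knows how long ago the boot was triggered; `bootStateTimedOut` and its parameters are hypothetical, and only the BootStateReceivedTimeout name comes from the comment above:

```go
package main

import (
	"fmt"
	"time"
)

// bootStateTimedOut reports whether a server has gone longer than timeout
// since its boot was triggered without the boot state being received.
// A server whose boot state was received never times out.
func bootStateTimedOut(sinceBootStart time.Duration, received bool, timeout time.Duration) bool {
	if received {
		return false
	}
	return sinceBootStart > timeout
}

func main() {
	timeout := 15 * time.Minute // illustrative value, not a proposed default

	fmt.Println(bootStateTimedOut(20*time.Minute, false, timeout)) // prints "true"
	fmt.Println(bootStateTimedOut(20*time.Minute, true, timeout))  // prints "false"
	fmt.Println(bootStateTimedOut(5*time.Minute, false, timeout))  // prints "false"
}
```

A reconciler could use a `true` result as the trigger for a redeploy attempt; how many retries to allow is left open here.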

@Nuckal777 Nuckal777 merged commit ea55fce into main Nov 21, 2025
14 of 16 checks passed
@github-project-automation github-project-automation bot moved this to Done in Roadmap Nov 21, 2025
@Nuckal777 Nuckal777 deleted the enh/bootstate-endpoint branch November 21, 2025 09:05
Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Define /bootstate call back endpoint in manager to track if Server successfully booted

4 participants