@smklein
Collaborator

@smklein smklein commented Apr 12, 2022

Bootstrap Agent

  • Stop relying on the "Unspecified" address for bootstrap addrs - instead, follow the advice of RFD 63, and allocate addresses explicitly based on the physical link MAC address. Admittedly, RFD 63 suggests using the SP MAC, but the physical link MAC is the one we have today.
    • Re-work some of the link-local multicast code, ensuring that the sender address properly advertises the bootstrap address, rather than an unspecified interface. Provide an integration test (illumos-only) to validate that this works how we want it to work - at least, until integration with Maghemite progresses far enough that multicast can be removed.
  • Avoid launching the Sled Agent "by default" - instead, keep the sled agent uninitialized until explicitly requested, alongside a whole /64 for which it should be responsible. Store configuration data to local storage within OMICRON_CONFIG_PATH to automatically launch these agents on reboot, after RSS setup.
  • Provide a PeerMonitorObserver structure that may be used to await the notification of new peers within the bootstrap agent.
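Deriving the bootstrap address from the link MAC can be sketched roughly as below. This is a minimal illustration, not the code in this PR: the `BOOTSTRAP_PREFIX` value, the function name `bootstrap_address`, and the exact placement of the MAC bytes are all assumptions made for the example.

```rust
use std::net::Ipv6Addr;

/// Hypothetical first segment for bootstrap addresses (an assumption for this
/// sketch; the real prefix is whatever the bootstrap agent defines).
const BOOTSTRAP_PREFIX: u16 = 0xfdb0;

/// Build a bootstrap address by embedding the six MAC bytes directly after
/// the prefix and using `::1` as the host part.
///
/// Because the result is a pure function of the MAC, every sled gets a
/// stable, explicit address instead of the unspecified address.
pub fn bootstrap_address(mac: [u8; 6]) -> Ipv6Addr {
    Ipv6Addr::new(
        BOOTSTRAP_PREFIX,
        u16::from_be_bytes([mac[0], mac[1]]),
        u16::from_be_bytes([mac[2], mac[3]]),
        u16::from_be_bytes([mac[4], mac[5]]),
        0,
        0,
        0,
        1,
    )
}
```

The point of the design is determinism: a sled that reboots derives the same address again without consulting any stored state.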

Sled Agent

  • Acquire initial address from subnet, avoid blocking on Nexus notification during setup.

RSS

  • Await the appearance of enough bootstrap agent peers before sending requests. This emulates an operator "seeing enough sleds come up" before deciding to initialize the rack.
  • Allocate subnets based on an arbitrary initial order of peers (presumably Nexus will take over this allocation later on).
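The two RSS steps above (wait for enough peers, then hand each one a /64) could look something like the following sketch. The function name `plan_sled_subnets`, the `min_peers` parameter, and the assumption that the rack owns a /56 whose next 8 bits select the sled are all hypothetical; the PR itself only says the order is arbitrary.

```rust
use std::collections::BTreeSet;
use std::net::Ipv6Addr;

/// Assign one /64 per peer out of a rack-wide /56 (assumed layout).
///
/// Returns `None` if fewer than `min_peers` peers have appeared yet, or if
/// there are more peers than the 256 subnets a /56 can hold. Each entry in
/// the result pairs a peer's bootstrap address with the base address of the
/// /64 it should manage.
pub fn plan_sled_subnets(
    rack_prefix: Ipv6Addr,
    peers: &BTreeSet<Ipv6Addr>,
    min_peers: usize,
) -> Option<Vec<(Ipv6Addr, Ipv6Addr)>> {
    if peers.len() < min_peers || peers.len() > 256 {
        return None;
    }
    let seg = rack_prefix.segments();
    let plan = peers
        .iter() // BTreeSet iterates in sorted order, so the plan is stable.
        .enumerate()
        .map(|(i, peer)| {
            // Keep the top 56 bits of the rack prefix; the sled index fills
            // the low 8 bits of the fourth segment.
            let fourth = (seg[3] & 0xff00) | i as u16;
            let subnet = Ipv6Addr::new(seg[0], seg[1], seg[2], fourth, 0, 0, 0, 0);
            (*peer, subnet)
        })
        .collect();
    Some(plan)
}
```

Sorting the peers is just one way to pick an "arbitrary" order; it has the side benefit that the same set of peers always yields the same plan.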

Configs (mostly in smf/)

  • Adjust all addresses to be within the subnet of the first sled. Also, condense the range of IPv6 addresses we're using for hardcoded stuff.
  • Remove the explicit addresses for the "bootstrap_agent" and "sled_agent" - these should be inferred at runtime.

TODO

  • Deal with idempotency conditions - what if RSS fails partway through operation?
  • Add more tests?
  • Take another swing at the PeerMonitorObserver structure; maybe it can be simplified.
  • Grep for old addresses (See: README) and update them

TODOs for later PRs (but which should be unblocked now)

Fixes #821

@smklein smklein changed the title from "Api to launch sled agent" to "[sled-agent] Sled Agent launched by bootstrap agent, derive bootstrap addresses, allocate sled /64s" Apr 12, 2022
@smklein smklein changed the title from "[sled-agent] Sled Agent launched by bootstrap agent, derive bootstrap addresses, allocate sled /64s" to "[sled-agent] Launch from bootstrap agent, derive bootstrap addresses, allocate sled /64s" Apr 12, 2022
Contributor

@jmpesp jmpesp left a comment

Looks good to me, some questions:

Comment on lines 464 to 467
// NOTE: This is a "point-of-no-return" -- before sending any requests
// to neighboring sleds, ensure that we've recorded our plan to durable
// storage. This way, if the RSS power-cycles, it can idempotently
// execute the same allocation plan.
Contributor

the way this is worded seems like it's not done yet. did create_plan perform a flush to durable storage?

Collaborator Author

I'll move the comment up, indicating that it is supposed to refer to the moment create_plan completes.

  • create_plan performs the write to durable storage.
  • It's not calling flush explicitly, but I believe it happens implicitly: tokio::fs::write calls std::fs::write, which itself calls File::create, and the created File object is dropped once the write completes.
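For reference, here is a minimal sketch of what an explicit durable write could look like with the standard library. The function name `write_plan_durably` is hypothetical and this is not the body of create_plan. It is worth noting that, per the `std::fs::File` documentation, dropping a `File` closes it and ignores any errors on close, and `sync_all` is the call that asks the OS to push data and metadata to the underlying storage; so if durability across a power cycle matters, an explicit sync is the safer assumption.

```rust
use std::fs::File;
use std::io::{self, Write};
use std::path::Path;

/// Write `contents` to `path` and explicitly sync before returning.
///
/// Unlike relying on `Drop`, any error from the sync is surfaced to the
/// caller instead of being silently discarded.
pub fn write_plan_durably(path: &Path, contents: &[u8]) -> io::Result<()> {
    let mut file = File::create(path)?;
    file.write_all(contents)?;
    // Ask the OS to flush file data and metadata to the storage device.
    file.sync_all()?;
    Ok(())
}
```

In async code the same shape is available through `tokio::fs::File`, which also exposes a `sync_all` method.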

Comment on lines 14 to 16
// This modifies global state of the target machine, creating
// an address named "bootstrap6", akin to what the bootstrap
// agent should do.
Contributor

is this cleaned up after the test?

Collaborator Author

It will now.
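One common way to guarantee this kind of cleanup in a Rust test, even when an assertion panics partway through, is a small drop guard. The sketch below is generic and hypothetical (`CleanupGuard` is not a type from this PR); in the integration test the closure would be whatever removes the "bootstrap6" address again.

```rust
/// Runs the stored closure when the guard goes out of scope, including
/// during unwinding after a failed assertion.
pub struct CleanupGuard<F: FnMut()> {
    cleanup: F,
}

impl<F: FnMut()> CleanupGuard<F> {
    pub fn new(cleanup: F) -> Self {
        CleanupGuard { cleanup }
    }
}

impl<F: FnMut()> Drop for CleanupGuard<F> {
    fn drop(&mut self) {
        // Undo the global state change made at the start of the test.
        (self.cleanup)();
    }
}
```

The guard should be bound to a named variable such as `_guard`; binding it to a bare `_` would drop it immediately and run the cleanup too early.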

@smklein smklein marked this pull request as ready for review April 18, 2022 19:28
Collaborator

@bnaecker bnaecker left a comment

This looks really good. I have a few clarifying comments and questions, but overall a +1 from me!

@bnaecker
Collaborator

Looks like you've got a test flake on the check-omicron-deployment step. I had the same last week, seems an OOM situation on macOS, bummer.

Otherwise, LGTM!

@bnaecker bnaecker self-requested a review April 19, 2022 16:18
@smklein smklein merged commit b7f7ea6 into main Apr 19, 2022
@smklein smklein deleted the api-to-launch-sled-agent branch April 19, 2022 20:16
iximeow added a commit that referenced this pull request May 8, 2025

Development

Successfully merging this pull request may close these issues.

[bootstrap agent] Should expose an API for the RSS to call, to initialize the Sled Agent

5 participants