Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[nexus] add basic support for expunged sled policy and decommissioned sled state #5032

Conversation

sunshowers
Copy link
Contributor

@sunshowers sunshowers commented Feb 8, 2024

This PR does a few things:

  • Migrates our current sled_provision_state to a new sled_policy enum, which has more information (per RFD 457). This PR implements the expunged state, not the graceful removal state.
  • Adds a sled_state enum, which describes Nexus's view of the sled. This PR adds the active and decommissioned states.
  • Adds internal code to move around between valid states.
  • Makes the blueprint execution code aware of sleds eligible for discretionary services.
  • Adds tests for all of this new stuff, as well as valid and invalid state transitions -- and also makes sure that if we do end up in an invalid state, things don't break down.

Not done here, but in future PRs (to try and keep this PR a manageable size):

Questions

  • Should we make sled_list hide decommissioned sleds by default?
  • Do we need any unique indexes on the sled table -- maybe something like "each non-decommissioned sled ID should have a unique serial number"?

TODO:

  • State transition tests.
  • Blueprint tests.
  • Update all the other places that need the policy.
  • Region allocation tests.
  • Ensure that sled state isn't decommissioned while trying to change sled policy.
  • DB migrations + tests to ensure migrations happen successfully.

Created using spr 1.3.5
Created using spr 1.3.6-beta.1
@sunshowers sunshowers marked this pull request as draft February 17, 2024 03:11
@sunshowers
Copy link
Contributor Author

Almost done here, just need to finish up the region allocation tests. (I'll do the db migrations after acceptance.)

Created using spr 1.3.6-beta.1
Copy link
Contributor

@andrewjstone andrewjstone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fantastic @sunshowers! Great work. Once you add the DB migrations I'll take a quick look and approve.

nexus/db-model/src/sled.rs Outdated Show resolved Hide resolved
nexus/db-model/src/sled_policy.rs Show resolved Hide resolved
nexus/db-model/src/sled_state.rs Show resolved Hide resolved
nexus/db-queries/src/db/datastore/sled.rs Outdated Show resolved Hide resolved
nexus/db-queries/src/db/datastore/sled.rs Outdated Show resolved Hide resolved
nexus/db-queries/src/db/datastore/sled.rs Outdated Show resolved Hide resolved
nexus/db-queries/src/db/datastore/sled.rs Show resolved Hide resolved
nexus/db-queries/src/db/datastore/sled.rs Show resolved Hide resolved
nexus/db-queries/src/db/datastore/sled.rs Show resolved Hide resolved
nexus/db-queries/src/db/datastore/sled.rs Show resolved Hide resolved
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
@sunshowers sunshowers changed the title [WIP] [nexus] add basic support for expunged sled policy and decommissioned sled state [nexus] add basic support for expunged sled policy and decommissioned sled state Feb 23, 2024
@sunshowers sunshowers marked this pull request as ready for review February 23, 2024 07:53
@sunshowers
Copy link
Contributor Author

All right, finally ready for review. I had to break up the migration into two, where the second one just drops the column and deletes the type. I believe this is required for idempotence reasons.

Would love it if folks had thoughts on the questions as well, but those aren't critical, and otherwise this should be ready to go. Thanks for your patience, I ended up spending a lot of time refactoring and chasing down bugs.

Copy link
Contributor

@jgallagher jgallagher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just some minor questions and nits.

schema/crdb/37.0.0/up02.sql Outdated Show resolved Hide resolved
nexus/db-model/src/sled_policy.rs Show resolved Hide resolved
nexus/db-queries/src/db/datastore/sled.rs Show resolved Hide resolved
nexus/db-queries/src/db/datastore/sled.rs Outdated Show resolved Hide resolved
nexus/db-queries/src/db/datastore/sled.rs Show resolved Hide resolved
nexus/deployment/src/planner.rs Outdated Show resolved Hide resolved
Created using spr 1.3.6-beta.1
@sunshowers sunshowers enabled auto-merge (squash) February 23, 2024 19:45
Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! Thanks for doing this. It's a big chunk of work.

nexus/db-queries/src/db/datastore/sled.rs Outdated Show resolved Hide resolved
nexus/deployment/src/planner.rs Outdated Show resolved Hide resolved
@@ -13,7 +13,7 @@ use omicron_common::api::external::SemverVersion;
///
/// This should be updated whenever the schema is changed. For more details,
/// refer to: schema/crdb/README.adoc
pub const SCHEMA_VERSION: SemverVersion = SemverVersion::new(36, 0, 0);
pub const SCHEMA_VERSION: SemverVersion = SemverVersion::new(37, 0, 1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have two schema versions in this PR?

Why do we use the micro number? I thought schema versions were always supposed to bump the major.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason we need two schema versions is that we have to make two separate transactions to achieve an idempotent result.

The order of operations we need to perform is:

  1. Add the new sled_policy and sled_state enum types.
  2. Add the new sled_policy and sled_state columns.
  3. Update sled_policy to the new value based on the provision_state column.
  4. Remove the provision_state column.
  5. Remove the sled_provision_state enum types.

1 and 2 are inherently idempotent, so those are fine.

3 by itself is also fine to repeat.

But consider what happens if step 4 occurs, then the process fails, and we go back to 1 again. We'd repeat 3, but the corresponding column no longer exists so it will fail.

Solving this can be done in one of two ways:

  1. Find a way to make 3 only perform the update if the provision_state column exists. This is possible, but somewhat complicated.
  2. Add a transaction commit between 3 and 4. This is the approach I ended up using.

The reason it's 37.0.0 and 37.0.1 is because both updates are part of the same PR. This follows the pattern in #4261.

Created using spr 1.3.6-beta.1
@sunshowers sunshowers enabled auto-merge (squash) February 23, 2024 22:50
@sunshowers sunshowers merged commit a6ef7f9 into main Feb 23, 2024
22 checks passed
@sunshowers sunshowers deleted the sunshowers/spr/wip-nexus-add-basic-support-for-expunged-sled-policy-and-decommissioned-sled-state branch February 23, 2024 23:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants