initial draft of blueprints #4804

Conversation
…ollection (and have a separate code path for the initial blueprint)
I incorporated feedback from @ahl and I think I've finished adding all the tests I'm planning to add here.
// It's kind of unfortunate to pull in such a complex and unstable type
// as "blueprint" this way, but we have really useful functionality
// (e.g., diff'ing) that's implemented on our local type.
Blueprint = nexus_types::deployment::Blueprint,
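For context, an entry like this lives inside a Progenitor client-generation macro. A hedged sketch of the surrounding invocation (the spec path and other settings here are illustrative assumptions, not the actual ones):

```rust
progenitor::generate_api!(
    // The spec path below is an assumption for illustration.
    spec = "../openapi/nexus-internal.json",
    replace = {
        // Substitute our local Blueprint type for the generated one so we
        // keep functionality (e.g., diff'ing) implemented on the local type.
        Blueprint = nexus_types::deployment::Blueprint,
    }
);
```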
@ahl installed some healthy fear of `replace` in me a while back - how painful would it be to impl `From` for blueprints instead of `replace`ing?
I share your and Adam's reluctance about `replace`. In this case, the functionality being shared (diff'ing blueprints) is pretty non-trivial and I think it would be fairly painful to impl the requisite `From`s: I think it would require implementing `From` for `Blueprint` and `OmicronZonesConfig`, which in turn requires `OmicronZoneConfig`, then `OmicronZoneType` (a big enum with a bunch of properties in many variants), `OmicronZoneDataset`, `SledRole`, `ZpoolName`, `NetworkInterface`, `NetworkInterfaceKind`, and `SourceNatConfig`. (I'm not positive, but I think that's right.)
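To give a flavor of the boilerplate involved, here's a minimal sketch with stand-in types (not the real omicron definitions): each conversion maps every field by hand, and every nested type needs its own `From` impl in turn.

```rust
// Stand-ins for a generated type and our local type.
mod generated {
    pub struct ZpoolName(pub String);
    pub struct OmicronZoneDataset {
        pub pool_name: ZpoolName,
    }
}
mod local {
    pub struct ZpoolName(pub String);
    pub struct OmicronZoneDataset {
        pub pool_name: ZpoolName,
    }
}

impl From<generated::ZpoolName> for local::ZpoolName {
    fn from(v: generated::ZpoolName) -> Self {
        Self(v.0)
    }
}

impl From<generated::OmicronZoneDataset> for local::OmicronZoneDataset {
    fn from(v: generated::OmicronZoneDataset) -> Self {
        // Every field is mapped by hand; nested types convert via their
        // own `From` impls, invoked through `into()`.
        Self { pool_name: v.pool_name.into() }
    }
}
```

Multiply that by the list of types above (including a big enum with many variants) and the tedium adds up.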
I think better approaches here might be to have Progenitor validate that your `replace` type really is compatible with the spec, or alternatively if there were a way to automatically generate `From` impls between two types that are identical (which would presumably break -- correctly -- if they weren't identical). It doesn't feel sustainable to me to either duplicate complex functionality on multiple types or hand-roll so many complex `From` impls. What do you think?
I actually had the opposite reaction, and didn't think it was unfortunate to use `replace` at all. However, clearly I am missing what is problematic here.
I think what sucks about using `replace` is that you lose the ability to notice when your local type doesn't match the spec. In turn that means we lose all of the benefit of Rust's strong static typing across the OpenAPI boundary.
Concretely: in the normal case where Progenitor generates the types for you, say we pull in a new version of the spec with a new required field. Progenitor's type will have that new field in it. So if we had a code path that creates an instance of that type (i.e., to make the request) it will now fail to compile. That's good: if we didn't know it before, we now know we've made an incompatible change to the spec. If we did know it, now's when we update our code to fill in the new value.
If on the other hand we use `replace` to supply the type, when we pull in the updated spec, everything still compiles -- there's no reason it wouldn't. Instead, that code path will fail at runtime when the server rejects the request because it's missing the required field. That really sucks.
Similarly on the output side, if we use `replace` to supply our own type, we don't find out at compile time when our type is incompatible. A particularly bad case would be an enum, where the server might add a new variant (which one might even consider a non-breaking change). But if the client sees it, it will fail to deserialize the response. By contrast, if we use the Progenitor-generated type, we'll fail to build if, say, we try to `match` on the enum and don't include the new variant.
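To illustrate with stand-in types (not an actual generated client): an exhaustive `match` on a generated enum turns a spec change into a compile error.

```rust
// Stand-in for a Progenitor-generated enum.
enum GeneratedState {
    V1,
    V2,
    // V3, // <- appears after regenerating from an updated spec
}

fn describe(s: &GeneratedState) -> &'static str {
    // This match is exhaustive today. The moment regeneration adds `V3`,
    // this function stops compiling, flagging the spec change immediately.
    match s {
        GeneratedState::V1 => "version 1",
        GeneratedState::V2 => "version 2",
    }
}
```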
In summary, using Progenitor types gets a lot of the benefit of Rust across the OpenAPI boundary. Using `replace` bypasses all of it (for that type). The downside of Progenitor types is that when there's non-trivial functionality on the original type, you either have to impl that functionality twice (sucks) or convert the type to your own type (okay for simple types in a few places, but kind of sucks in the large). For simple, stable types like `Name` I think the benefits of `replace` probably outweigh the risks. For complex types that are likely to change, I'm less sure (but still feel what I said earlier about this case).
> I think better approaches here might be to have Progenitor validate that your `replace` type really is compatible with the spec

This feels like it would be huge - it would allow us to use `replace` fearlessly, even on big / complex / volatile types like this one, I think?
> > I think better approaches here might be to have Progenitor validate that your `replace` type really is compatible with the spec
>
> This feels like it would be huge - it would allow us to use `replace` fearlessly, even on big / complex / volatile types like this one, I think?

Potentially, yeah. I spoke with Adam about this a bit. It's not clear how we could do this at compile time, but I imagine we might be able to do it at test-time.
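To make the test-time idea concrete, here's a minimal sketch using stand-in types (assuming serde, serde_json, and the uuid crate with its serde and v4 features): serialize the local type and deserialize the result as the generated one, so structural drift fails in CI rather than at runtime.

```rust
#[cfg(test)]
mod tests {
    use serde::{Deserialize, Serialize};

    // Stand-ins for the local type and the Progenitor-generated type.
    #[derive(Serialize, Deserialize)]
    struct LocalBlueprint {
        id: uuid::Uuid,
        comment: String,
    }
    #[derive(Serialize, Deserialize)]
    struct GeneratedBlueprint {
        id: uuid::Uuid,
        comment: String,
    }

    #[test]
    fn local_type_round_trips_through_generated_type() {
        let local = LocalBlueprint {
            id: uuid::Uuid::new_v4(),
            comment: "test".to_string(),
        };
        let json = serde_json::to_value(&local).unwrap();
        // If the two types have diverged (e.g., the spec grew a required
        // field), this deserialization fails and so does the test.
        let _generated: GeneratedBlueprint =
            serde_json::from_value(json).unwrap();
    }
}
```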
}

/// Describes a complete set of software and configuration for the system
// Blueprints are a fundamental part of how the system modifies itself. Each
Tiny nit, but this feels like it should all be part of the doc comment, I think? It's all extremely relevant to anyone working with or around `Blueprint`s.
I had that previously. But this is a publicly-exposed type, so the block comment is used for API consumers, not just Rust consumers. I remember discussing this months ago and deciding that unless there was a really compelling reason to try to support separate API docs and rustdoc for the same type, we'd just let the rustdoc reflect API docs.
Hmm, but this is an API type for the internal API, right? Do those exist? (To be fair, another question is whether we generate rustdocs for crates. I've done that locally occasionally but it's not a habit.)
I've been heavily using API docs for internal APIs (generated by hand with redoc). Then again I also use rustdocs via rust-analyzer. But I'm inclined to treat both APIs the same here -- part of the idea behind using OpenAPI everywhere was so that we could have a good experience using our own APIs through the patterns/processes/tooling we use for the external API. I find it annoying when I'm using the internal API docs and I get a lot of irrelevant Rust-isms.
Fair enough. I don't feel super strongly about this either way, happy to defer to you.
Dave, this looks great. It's a huge step forward. I particularly like all the blueprint diff stuff.
.map(|sled_row| {
    let sled_id = sled_row.id();
    let subnet = Ipv6Subnet::<SLED_PREFIX>::new(sled_row.ip());
    let zpools = zpools_by_sled_id
I feel like it's safe to catch this on the next planning round. The inventory collection will change and the new sled will show up. I think the only thing you could do if you checked here is go back and redo this context, which seems somewhat redundant/unnecessary.
// lying around). Recompute it based on what boundary servers
// currently exist.
let ntp_servers = self
    .parent_blueprint
This, like other parts of the builder that rely on the `parent_blueprint`, seems to implicitly assume that the `parent_blueprint` was successfully executed. That is why they are capable of performing "next steps". Am I correct here?

If I am correct then this may present a problem when a plan is not successfully executed, or partially executed. While conceptually I like the idea of one plan building off the previous, it seems like we may need to pass in more parameters where this may be problematic, or we need to use the inventory rather than the prior plan.
I wondered this too, but wasn't sure how to phrase the question, because I assume it's pretty reasonable we end up in this case:
- We have a current target blueprint that we have not yet achieved
- Something in the system changes such that we can never achieve the current target
Now what? We either need to be able to generate a child blueprint despite the parent not being successfully executed, or maybe we need some escape hatch to say "regenerate a new initial blueprint based on the latest collection"?
Yeah, I think eventually we're going to want to use both the parent blueprint and the most recent (valid) inventory.
This model does assume that the parent blueprint is currently the source of truth for the intended state of the system. That's why I added the constraints that (1) aside from the initial blueprint, blueprints get generated from the current target, and (2) a blueprint can only become the new target if its parent is [still] the target.
I think this particular code does not assume the parent blueprint has been executed. Suppose this sequence:
- we start with some initial blueprint B1
- we create a new blueprint B2 (parent is B1) where we provision a third boundary NTP zone
- before we start executing B2, we create a new blueprint B3 (parent is B2) that adds the "internal NTP" zone to a new sled
- this code will include in the config for the new internal NTP zone the boundary NTP zone added in B2
I think that's right, even though the new boundary NTP zone isn't necessarily deployed yet. It is one of the current boundary NTP zones right now, even if it's not up yet, and clients always need to handle dependencies not being up yet. Put differently: the new blueprint describes a new state of the system, and in that state, there is an internal NTP zone and it uses the boundary NTP zone from B2. It's fine if it takes a minute to get there because it can all be done in one execution step. (sort of a footnote: We may well want to only add things to DNS after they're up, but that doesn't really change this -- it just means that the internal DNS zone will fail to resolve the DNS name at first instead of resolving it to something that's not running NTP yet.)
All of the particulars here should be obviated by #4791 but it's certainly a general question that applies to everything here. Your comment has helped me think through this more. I think some principles on using the parent blueprint vs. inventory might be:
- When constructing the new blueprint, start with the parent blueprint and evolve it as needed. (Sounds obvious but it wasn't to me at first. Concretely: in the new blueprint, the set of Omicron zones for each sled starts with whatever was in the parent blueprint and not what's in the latest inventory precisely for the reason you mentioned: there might be other changes in the parent blueprint that haven't been executed yet, and that's fine -- we just want to make sure our new blueprint reflects those also.)
- When the new blueprint requires knowing what zones will exist in the new world (i.e., as in the case here, to construct a list of dependencies to put into a config block), you use the contents of the new blueprint itself. Here we use the parent blueprint instead because there are no changes: we don't touch boundary NTP zones yet. But if, say, we had deployed a new boundary NTP zone within the same blueprint, that'd be fine, too, and we'd want to include that in the list of boundary NTP zones that the internal NTP zones know about. (Again, this is a dicey example because there's not really a way to reconfigure the existing internal NTP zones -- we should do #4791 (NTP zone config could be more dynamic) and that would obviate this particular case.) There's a rough sketch of these first two principles after this list.
- When there are runtime dependencies in a sequence of operations, the planner might need to use the latest inventory collection to decide what to do, if anything. Examples include: when rolling out a bunch of new Nexus zones one at a time (phased rollout), we'll want to wait for execution of each blueprint that deploys a new zone. Or, right now, we really should wait for an internal NTP zone to show up in inventory before creating a plan that deploys Crucible zones. It's not really a problem in this PR because the plans are not generated automatically, but we'll need to update this before this is fully automated. (Another footnote: this is pretty quirky and I'm increasingly feeling like we should probably go do the real async `PUT /omicron-zones` instead, but in the meantime, this is the case and the planner needs to deal with it.) These will probably involve higher-level considerations, too: we don't want to remove a CockroachDB node, even if we have more than the minimum number, until we know that CockroachDB has come to rest in terms of replication. You could imagine, say, replacing a 3-node cluster by deploying 3 new nodes and removing the original 3. It's not safe to do that until we know the cluster's actually replicated to the 3 new nodes! So I expect we'll eventually add this sort of thing to inventory, too.
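Here's that rough sketch of the first two principles, using hypothetical simplified types rather than the real ones:

```rust
use std::collections::BTreeMap;
use uuid::Uuid;

// Hypothetical, simplified stand-ins for the real blueprint types.
#[derive(Clone, PartialEq, Eq)]
enum ZoneKind {
    BoundaryNtp,
    InternalNtp,
    Crucible,
}

#[derive(Clone)]
struct Zone {
    kind: ZoneKind,
}

#[derive(Clone)]
struct Blueprint {
    // Omicron zones, keyed by sled id.
    omicron_zones: BTreeMap<Uuid, Vec<Zone>>,
}

// Principle (1): the child blueprint starts as a copy of the parent, so it
// carries forward changes that may not have been executed yet.
fn begin_child(parent: &Blueprint) -> Blueprint {
    parent.clone()
}

// Principle (2): dependency lists are computed from the blueprint being
// built, so zones added earlier in the same planning pass are included.
fn boundary_ntp_zones(new_blueprint: &Blueprint) -> Vec<&Zone> {
    new_blueprint
        .omicron_zones
        .values()
        .flatten()
        .filter(|z| z.kind == ZoneKind::BoundaryNtp)
        .collect()
}
```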
This is definitely a really tricky area that I went back and forth on while working on this and will probably evolve, too.
> I wondered this too, but wasn't sure how to phrase the question, because I assume it's pretty reasonable we end up in this case:
>
> * We have a current target blueprint that we have not yet achieved
> * Something in the system changes such that we can never achieve the current target
>
> Now what? We either need to be able to generate a child blueprint despite the parent not being successfully executed, or maybe we need some escape hatch to say "regenerate a new initial blueprint based on the latest collection"?

Agreed. I think the conclusion is that we can't assume the parent blueprint has/will be executed. So far, I don't think that's been a difficult constraint.
Your third principle covers what I was worried about. Specifically if we started to deploy crucible zones assuming that the internal NTP zone was deployed already. If the planner takes into consideration the current inventory and that the parent blueprint may not have executed before generating the next blueprint, then we should be all set. This also means that the planner may very well be in a "wait for the target blueprint to complete" state for a good amount of time before it does anything. That seems normal though.
Ship it!
Changes look great - just one nit about a comment. Thanks!
fn sled_alloc_ip(&mut self, sled_id: Uuid) -> Result<Ipv6Addr, Error> {
    let sled_subnet = self.sled_resources(sled_id)?.subnet;

    // Work around rust-lang/rust-clippy#11935 and related issues.
This comment is no longer relevant since we reworked this
Thanks for the close reviews, @jgallagher and @andrewjstone!
This PR is the first step in creating a background task that is capable of taking a `Blueprint` and then reifying that blueprint into deployed or updated software. This PR uses the initial version of a Blueprint introduced in #4804. A basic executor that sends the related `OmicronZonesConfig` to the appropriate sled-agents for newly added sleds was created. A test is included that shows how a hypothetical planner for an `add-sled` workflow will deploy Omicron zones in a manner similar to RSS, where first the internal DNS zone is deployed and then the internal DNS and NTP zones are deployed. Deployment always contains all zones expected to be running on the sled-agent. Any zones running that are not included are expected to be shut down.
This replaces the in-memory blueprint storage added as a placeholder in #4804 with cockroachdb-backed tables. Both the tables and related queries are _heavily_ derived from the similar tables in the inventory system (particularly serializing omicron zones and their related properties). The tables are effectively identical as of this PR, but we opted to keep them separate because we expect them to diverge some over time (e.g., inventory might start collecting additional per-zone properties that don't exist for blueprints, such as uptime). The big exception to "basically the same as inventory" is the `bp_target` table which tracks the current (and past) target blueprint. Inserting into this table has some subtleties, and we use a CTE to check and enforce the invariants. This is the first diesel/CTE I've written; it's based on other similar CTEs in Nexus, but I'd still appreciate a particularly careful look there. Fixes #4793.
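For intuition, here's a minimal sketch of the target invariant with stand-in types; in the real implementation the check happens atomically inside the database via the CTE, not in application code like this.

```rust
use uuid::Uuid;

// Stand-in for the real blueprint type.
struct Blueprint {
    id: Uuid,
    parent_blueprint_id: Option<Uuid>,
}

// A blueprint may become the new target only if its parent is the current
// target; the initial blueprint (no parent) may only be the first target.
fn may_become_target(current_target: Option<Uuid>, candidate: &Blueprint) -> bool {
    match (current_target, candidate.parent_blueprint_id) {
        (None, None) => true,
        (Some(current), Some(parent)) => current == parent,
        _ => false,
    }
}
```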
This PR is the first step in creating a background task that is capable of taking a `Blueprint` and then reifying that blueprint into deployed or updated software. This PR uses the initial version of a Blueprint introduced in #4804. A basic executor that sends the related `OmicronZonesConfig` to the appropriate sled-agents for newly added sleds was created. A background task that loads the target `Blueprint` from the database and feeds it to the executor is also included, along with a test for each.
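A minimal sketch of the executor's shape, with hypothetical stand-in types and client (the real sled-agent client method name is an assumption here, based on the `PUT /omicron-zones` endpoint mentioned above):

```rust
use std::collections::BTreeMap;
use uuid::Uuid;

// Hypothetical stand-ins for the real types and sled-agent client.
struct OmicronZonesConfig {
    // ... full set of zones expected to run on the sled
}
struct SledAgentClient;
impl SledAgentClient {
    async fn omicron_zones_put(&self, _config: &OmicronZonesConfig) -> Result<(), String> {
        Ok(())
    }
}
struct Blueprint {
    omicron_zones: BTreeMap<Uuid, OmicronZonesConfig>,
}

// Push each sled's complete zone config; anything running on the sled but
// absent from the config is expected to be shut down.
async fn execute(
    blueprint: &Blueprint,
    clients: &BTreeMap<Uuid, SledAgentClient>,
) -> Result<(), String> {
    for (sled_id, config) in &blueprint.omicron_zones {
        let client = clients
            .get(sled_id)
            .ok_or_else(|| format!("no client for sled {sled_id}"))?;
        client.omicron_zones_put(config).await?;
    }
    Ok(())
}
```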
This PR provides an initial implementation for Blueprints, which we've variously called "update plans" or "deployment plans". RFD 418 (and the comments here) describe broadly how we expect this to work.
What's here is basically:

- `Blueprint`-related types

What's notably missing:
Importantly, none of the new stuff in this PR actually does anything unless a person goes out of their way to use the internal APIs to configure it. Even then, it only generates some structures in memory. The main reason to land it now is to unblock a bunch of follow-on work like the stuff mentioned above.
A few changes came along for the ride here:

- Some types are now used from `omicron_common` instead of using the client-generated types. This is more stuff in the spirit of #4754 (sled agent client could use some primitives from omicron_common), but these became surprisingly load-bearing in this case. A bunch of this PR is changes to a handful of lines in a bunch of files related to this (mostly removing unneeded calls to `into()`, `from()`, or `clone()`, or changing an import).

Some explicit questions for reviewers:
- omdb commands generally don't modify the system without something like a `-w`/`--destructive` flag (which I don't think exists). But here I'm adding some commands that do (sort of) modify the system. I filed #4805 (omdb could more safely support "write" commands) for better omdb support for this sort of thing. In the meantime, I feel like what's here is a reasonable step (because it doesn't actually do anything yet; when it does start doing something, we may well rip out these interfaces anyway; and I think there's a lot of benefit to having this tooling and I don't know where else to put it). But if people would rather I left that out I could remove the omdb stuff from this PR.

Status: I think the bulk of this is ready for review. I still want to add a bit to this PR: more automated tests plus some omdb commands for diff'ing a blueprint against another blueprint or an inventory collection. I hope those things won't change much of what's here but feel free to hold off on review if you're worried about that churn.