[nexus] Add a new user for service balancing #1234

Merged: 58 commits merged into main from background-work-user on Jun 24, 2022
Conversation

@smklein (Collaborator) commented on Jun 20, 2022:

[image]

@smklein (Collaborator, Author) commented on Jun 22, 2022:

> What do you think about me just adding a "Service Balancer" task - basically, exactly what this PR is doing, but rename the user?

> Sounds great! I like that name.

Done!

@smklein changed the title from "[nexus] Add a new user for background tasks" to "[nexus] Add a new user for service balancing" on Jun 22, 2022
@smklein (Collaborator, Author) commented on Jun 24, 2022:

Any other feedback before I merge this?

Base automatically changed from rack-id to main June 24, 2022 04:01
@davepacheco (Collaborator) left a comment:

I'm seeing a bunch of changes here that I don't remember from last time and that don't seem that related. (They're fine except for one I commented on below.) I'm guessing maybe these went into "main" in other PRs and this is an artifact of the way this PR was merged, or of GitHub's review mechanism, or something.

```diff
@@ -102,6 +102,8 @@ pub enum Database {
 pub struct DeploymentConfig {
     /// Uuid of the Nexus instance
     pub id: Uuid,
+    /// Uuid of the Rack where Nexus is executing.
+    pub rack_id: Uuid,
```
@davepacheco (Collaborator) commented:

Is this really supposed to be a rack, or is this the instance of the control plane (which we currently take to be the Fleet, although that's not quite right either)? (I think it'd be a bad idea to depend on rack_id == a unique identifier for the control plane and I've been trying to avoid that.)

@smklein (Collaborator, Author) replied:

It was intended to identify the rack on which this Nexus is running.

RSS initializes this value and transfers it to Nexus, as of #1217.

Note, the major change of that patch is the source of that rack ID, rather than the existence of a rack ID at all. Previously the rack ID was randomly generated by nexus in nexus/server/lib.rs:

```rust
let rack_id = Uuid::new_v4();
```

So, that change moves the source of the rack ID from "random every time nexus boots" to "set once during RSS initialization, then stored in the DB".
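
To make the before-and-after concrete, here's a minimal sketch. The `DeploymentConfig` fields mirror the diff above; the two helper functions are purely illustrative, not actual Nexus code:

```rust
use uuid::Uuid;

/// Mirrors the fields shown in the diff above (other fields elided).
pub struct DeploymentConfig {
    /// Uuid of the Nexus instance
    pub id: Uuid,
    /// Uuid of the Rack where Nexus is executing.
    pub rack_id: Uuid,
}

/// Before #1217: a fresh, meaningless rack ID on every Nexus boot.
fn rack_id_before() -> Uuid {
    Uuid::new_v4()
}

/// After #1217: the rack ID is chosen once during RSS initialization,
/// handed to Nexus via its deployment config, and stored in the DB.
fn rack_id_after(config: &DeploymentConfig) -> Uuid {
    config.rack_id
}
```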

@davepacheco (Collaborator) replied:

> Note, the major change of that patch is the source of that rack ID, rather than the existence of a rack ID at all.

Got it. If I'm reading the code in "main" correctly right now, it seems like maybe this can go away entirely? We store "rack_id" in the Nexus struct but only so that we can implement racks_list and rack_get. If we're moving the rack record to the database, then I imagine we don't need these any more?

@smklein (Collaborator, Author) replied:

We still need a way to instruct Nexus "which rack should we be querying from the DB" to check the RSS handoff described in RFD 278. This is currently done by passing this rack_id value, and checking it on boot.

Additionally, when performing background tasks in a subsequent PR (#1241), Nexus attempts to manipulate state which is local to the current rack - such as asking the question, "do I have enough CRDB instances on my rack?"
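
(As a concrete illustration of that per-rack check, here is a minimal sketch; the types and names are hypothetical stand-ins, not Omicron's actual datastore API.)

```rust
use uuid::Uuid;

/// Hypothetical service kinds the balancer cares about (illustrative only).
#[derive(Clone, Copy, PartialEq, Eq)]
enum ServiceKind {
    Cockroach,
    InternalDns,
}

/// Minimal stand-in for a provisioned-service record.
struct Service {
    kind: ServiceKind,
    rack_id: Uuid,
}

/// "Do I have enough CRDB instances on my rack?": count services of the
/// given kind on the given rack and report how many more are needed to
/// reach the redundancy target.
fn additional_needed(
    services: &[Service],
    kind: ServiceKind,
    rack_id: Uuid,
    target: usize,
) -> usize {
    let running = services
        .iter()
        .filter(|s| s.kind == kind && s.rack_id == rack_id)
        .count();
    target.saturating_sub(running)
}
```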

Some state is still larger than rack scope. For example, internal DNS servers are allocated within a subnet that is shared AZ-wide, so they are allocated without referencing the rack ID.

However, in general, I had the expectation that "each rack would be running at least one Nexus," so it could be in charge of managing rack-wide state. Is this wrong? Should a single Nexus be trying to ensure that all services are running across all racks?

@davepacheco (Collaborator) replied:

> We still need a way to instruct Nexus "which rack should we be querying from the DB" to check the RSS handoff described in RFD 278. This is currently done by passing this rack_id value, and checking it on boot.

Are you talking about this part of RFD 278:

> Nexus "remembers" that this handoff has completed by recording the service as initialized within CRDB. This is done by marking a column in the associated row of the "Rack" table.

I think the scope of this is "the control plane instance", not "the rack". If we had two racks, whether they were set up together or not, I think we're only going to go through the RFD 57 process (and Nexus handoff) once. So I think this state should probably be global in CockroachDB, not scoped to a rack.

> Additionally, when performing background tasks in a subsequent PR (#1241), Nexus attempts to manipulate state which is local to the current rack - such as asking the question, "do I have enough CRDB instances on my rack?"
>
> ...
>
> However, in general, I had the expectation that "each rack would be running at least one Nexus," so it could be in charge of managing rack-wide state. Is this wrong? Should a single Nexus be trying to ensure that all services are running across all racks?

I'd assumed it was the latter. The original goal was that Nexus instances are fungible at least within an AZ. That's also why RFD 248 considers "nexus" a single DNS service, rather than having each group-of-Nexuses-within-a-rack have its own DNS name.

I gather you're assuming (1) there's at least one Nexus in each rack and (2) Nexus instances are only responsible for managing their own rack. I think that's got several downsides. We can have multiple Nexus instances within a rack, so we still need to build some coordination to avoid them stomping over each other. Once we've built that, it's extra work (that we otherwise wouldn't need to do) to constrain their responsibilities to only cover their own rack. It's also extra work to enforce the deployment constraints and verify them with runtime monitoring. I think it'd be a lot simpler to say that the Nexus instances need to coordinate on changes (which they need to do anyway, as I mentioned) and then all we need to ensure is that there are enough Nexus instances in enough different racks to maintain our availability goals. That leaves the deployment system with much more flexibility about where to put them.

I think eliminating constraint (2) also has a big impact on efficiency. With that constraint, we basically need two Nexus instances in each rack to survive a single failure and we need three instances to survive a double failure. So a 10-rack deployment would need 30 Nexus instances to survive 2 failures. Without this constraint (i.e., letting Nexus instances operate on all racks, with coordination) you'd only need 5 instances to survive any two failures.

A contributor commented:

This is called a ClusterId in RFD 238.

@smklein (Collaborator, Author) replied on Jun 24, 2022:

I don't think I understand how you want me to eliminate this constraint in the current implementation.

To fully satisfy the one-to-many relationship of Nexuses to racks, we'd need to:

  • Build a concept of a fleet, and have that be populated somewhere
  • Ensure that when RSS performs initialization, it does not format CRDB instances from scratch, but rather, define the mechanism for integrating with the fleet-wide DB. This means going back to the design of RFD 278, because we'd need to distinguish the "adding a rack to an initialized fleet" case from "adding a rack to an uninitialized fleet".
  • The same "integration into the rack handoff logic", but for Clickhouse
  • Rebuild the mechanism for "service balancing" to work across all racks in a fleet

To be clear, I think that is the right long-term direction, but I'm also concerned it's a lot of work for something that isn't relevant for v1.

My prototype implementation of service management within Nexus can specify ServiceRedundancy as an enum, and it explicitly specifies Per-Rack, because that's all we have right now. When we approach the multi-rack case, we could change this redundancy value to "per-fleet" or "per-az" - but I hesitate to include that now, since we won't really be able to validate that without building out more genuine end-to-end multi-rack support.
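
For concreteness, here's a sketch of what that enum might look like; this is illustrative only, and the prototype's actual definition may differ:

```rust
/// Redundancy policy for a control-plane service (illustrative sketch).
enum ServiceRedundancy {
    /// Provision at least this many instances of the service on each rack.
    /// This is the only policy we can act on today.
    PerRack(u32),
    // Possible future variants once multi-rack support exists (hypothetical):
    // PerAz(u32),
    // PerFleet(u32),
}
```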

So, backing up: what's the action item here? Should I be deferring the RSS handoff patches, and instead favor deleting all locally-stored notions of "Rack ID"?

@davepacheco (Collaborator) replied:

It seems like I'm missing something important about RSS's involvement in a multi-rack world, so let's table this for now.

I remain a little worried about this:

> To fully satisfy the one-to-many relationship of Nexuses to racks, we'd need to:

I think we're absolutely going to need to support multiple Nexuses in the v1 product, even being only one rack. Otherwise, the control plane can't survive a single failure. I don't really understand the pieces you said have to happen. Maybe we can discuss at some point.

@smklein (Collaborator, Author) replied on Jun 24, 2022:

Ah, to be explicit, in the current implementation:

  • We already support "multiple Nexuses running on one rack".
  • We do not support "one Nexus controlling multiple racks", because our mechanism for populating services is based on ensuring that enough services exist within the rack identified by rack_id (i.e., within this Nexus's own rack). This can be changed in the future.

When RSS initializes a new rack, it executes operations like "create a new CRDB instance, and format all storage - resulting in an empty cluster".

For a single-rack use-case, this works fine - if we're initializing the rack, there is no other fleet-wide metadata storage to consider.

For a multi-rack scenario, it isn't always appropriate to initialize CRDB "from scratch" here. If other racks exist, we wouldn't want to create partitioned clusters of CRDB. As you referenced in RFD 61:

> [CRDB should be] Deployed as a single fleet (across multiple racks) with enough Instances in enough racks to satisfy availability and scale requirements. Sharding will be relevant here, but we know it’s important that a rack’s last known state is available in the database even when the rack itself is completely unavailable.

I assume we'd want some way of specifying: "here is the region, here is the existing fleet, if CRDB nodes already exist, join them instead of starting a totally new cluster".
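
As a sketch of the kind of knob that could express this, consider something like the following; the type and field names are hypothetical, not anything RSS has today:

```rust
use std::net::SocketAddr;

/// How RSS should bring up CockroachDB on a new rack (hypothetical sketch).
enum CockroachInit {
    /// No fleet-wide cluster exists yet: format storage and bootstrap a
    /// brand-new CRDB cluster.
    NewCluster,
    /// A fleet-wide cluster already exists: join its existing nodes rather
    /// than initializing from scratch.
    JoinExisting { peers: Vec<SocketAddr> },
}
```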

@smklein (Collaborator, Author) replied:

Filed #1276

@smklein smklein enabled auto-merge (squash) June 24, 2022 19:40
@smklein smklein disabled auto-merge June 24, 2022 19:40
@smklein smklein enabled auto-merge (squash) June 24, 2022 19:40
@smklein smklein merged commit 3ace6e7 into main Jun 24, 2022
@smklein smklein deleted the background-work-user branch June 24, 2022 21:15
leftwo added a commit that referenced this pull request Mar 29, 2024
Propolis changes:
Add `IntrPin::import_state` and migrate LPC UART pin states (#669)
Attempt to set WCE for raw file backends
Fix clippy/lint nits for rust 1.77.0

Crucible changes:
Correctly (and robustly) count bytes (#1237)
test-replay.sh fix name of DTrace script (#1235)
BlockReq -> BlockOp (#1234)
Simplify `BlockReq` (#1218)
DTrace, cmon, cleanup, retry downstairs connections at 10 seconds.
(#1231)
Remove `MAX_ACTIVE_COUNT` flow control system (#1217)

Crucible changes that were in Omicron but not in Propolis before this commit.
Return *410 Gone* if volume is inactive (#1232)
Update Rust crate opentelemetry to 0.22.0 (#1224)
Update Rust crate base64 to 0.22.0 (#1222)
Update Rust crate async-recursion to 1.1.0 (#1221)
Minor cleanups to extent implementations (#1230)
Update Rust crate http to 0.2.12 (#1220)
Update Rust crate reedline to 0.30.0 (#1227)
Update Rust crate rayon to 1.9.0 (#1226)
Update Rust crate nix to 0.28 (#1223)
Update Rust crate async-trait to 0.1.78 (#1219)
Various buffer optimizations (#1211)
Add low-level test for message encoding (#1214)
Don't let df failures ruin the buildomat tests (#1213)
Activate the NBD server's pseudo file (#1209)

---------

Co-authored-by: Alan Hanson <alan@oxide.computer>