[nexus] Add a new user for service balancing #1234

Merged: 58 commits merged into main from background-work-user on Jun 24, 2022
Conversation

@smklein (Collaborator) commented on Jun 20, 2022:

[image]

@smklein (Collaborator, Author) commented on Jun 22, 2022:

> What do you think about me just adding a "Service Balancer" task - basically, exactly what this PR is doing, but rename the user?

> Sounds great! I like that name.

Done!

@smklein changed the title from "[nexus] Add a new user for background tasks" to "[nexus] Add a new user for service balancing" on Jun 22, 2022
@smklein (Collaborator, Author) commented on Jun 24, 2022:

Any other feedback before I merge this?

Base automatically changed from rack-id to main June 24, 2022 04:01
@davepacheco (Collaborator) left a comment:

I'm seeing a bunch of changes here that I don't remember from last time and that don't seem that related. (They're fine except for one I commented on below.) I'm guessing maybe these went into "main" in other PRs and this is an artifact of the way this PR was merged, or of GitHub's review mechanism, or something.

```diff
@@ -102,6 +102,8 @@ pub enum Database {
 pub struct DeploymentConfig {
     /// Uuid of the Nexus instance
     pub id: Uuid,
+    /// Uuid of the Rack where Nexus is executing.
+    pub rack_id: Uuid,
```
@davepacheco (Collaborator) commented:

Is this really supposed to be a rack, or is this the instance of the control plane (which we currently take to be the Fleet, although that's not quite right either)? (I think it'd be a bad idea to depend on rack_id == a unique identifier for the control plane and I've been trying to avoid that.)

@smklein (Collaborator, Author) replied:

It was intended to identify the rack on which this Nexus is running.

RSS initializes this value and transfers it to Nexus, as of #1217.

Note, the major change of that patch is the source of that rack ID, rather than the existence of a rack ID at all. Previously the rack ID was randomly generated by nexus in nexus/server/lib.rs:

```rust
let rack_id = Uuid::new_v4();
```

So, that change moves the source of the rack ID from "random every time nexus boots" to "set once during RSS initialization, then stored in the DB".
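
To make the before-and-after concrete, here's a minimal sketch. The `DeploymentConfig` fields mirror the diff above; the two helper functions are purely illustrative, not actual Nexus code:

```rust
use uuid::Uuid;

/// Mirrors the fields shown in the diff above (other fields elided).
pub struct DeploymentConfig {
    /// Uuid of the Nexus instance
    pub id: Uuid,
    /// Uuid of the Rack where Nexus is executing.
    pub rack_id: Uuid,
}

/// Before #1217: a fresh, meaningless rack ID on every Nexus boot.
fn rack_id_before() -> Uuid {
    Uuid::new_v4()
}

/// After #1217: the rack ID is chosen once during RSS initialization,
/// handed to Nexus via its deployment config, and stored in the DB.
fn rack_id_after(config: &DeploymentConfig) -> Uuid {
    config.rack_id
}
```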

@davepacheco (Collaborator) replied:

> Note, the major change of that patch is the source of that rack ID, rather than the existence of a rack ID at all.

Got it. If I'm reading the code in "main" correctly right now, it seems like maybe this can go away entirely? We store "rack_id" in the Nexus struct but only so that we can implement racks_list and rack_get. If we're moving the rack record to the database, then I imagine we don't need these any more?

@smklein (Collaborator, Author) replied:

We still need a way to instruct Nexus "which rack should we be querying from the DB" to check the RSS handoff described in RFD 278. This is currently done by passing this rack_id value, and checking it on boot.

Additionally, when performing background tasks in a subsequent PR (#1241), Nexus attempts to manipulate state which is local to the current rack - such as asking the question, "do I have enough CRDB instances on my rack?"
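
(As a concrete illustration of that per-rack check, here is a minimal sketch; the types and names are hypothetical stand-ins, not Omicron's actual datastore API.)

```rust
use uuid::Uuid;

/// Hypothetical service kinds the balancer cares about (illustrative only).
#[derive(Clone, Copy, PartialEq, Eq)]
enum ServiceKind {
    Cockroach,
    InternalDns,
}

/// Minimal stand-in for a provisioned-service record.
struct Service {
    kind: ServiceKind,
    rack_id: Uuid,
}

/// "Do I have enough CRDB instances on my rack?": count services of the
/// given kind on the given rack and report how many more are needed to
/// reach the redundancy target.
fn additional_needed(
    services: &[Service],
    kind: ServiceKind,
    rack_id: Uuid,
    target: usize,
) -> usize {
    let running = services
        .iter()
        .filter(|s| s.kind == kind && s.rack_id == rack_id)
        .count();
    target.saturating_sub(running)
}
```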

Some state is still larger than rack scope. For example, internal DNS servers are allocated within a subnet that is shared AZ-wide, so they are allocated without referencing the rack ID.

However, in general, I had the expectation that "each rack would be running at least one Nexus," so it could be in charge of managing rack-wide state. Is this wrong? Should a single Nexus be trying to ensure that all services are running across all racks?

@davepacheco (Collaborator) replied:

> We still need a way to instruct Nexus "which rack should we be querying from the DB" to check the RSS handoff described in RFD 278. This is currently done by passing this rack_id value, and checking it on boot.

Are you talking about this part of RFD 278:

> Nexus "remembers" that this handoff has completed by recording the service as initialized within CRDB. This is done by marking a column in the associated row of the "Rack" table.

I think the scope of this is "the control plane instance", not "the rack". If we had two racks, whether they were set up together or not, I think we're only going to go through the RFD 57 process (and Nexus handoff) once. So I think this state should probably be global in CockroachDB, not scoped to a rack.

> Additionally, when performing background tasks in a subsequent PR (#1241), Nexus attempts to manipulate state which is local to the current rack - such as asking the question, "do I have enough CRDB instances on my rack?"
>
> ...
>
> However, in general, I had the expectation that "each rack would be running at least one Nexus," so it could be in charge of managing rack-wide state. Is this wrong? Should a single Nexus be trying to ensure that all services are running across all racks?

I'd assumed it was the latter. The original goal was that Nexus instances are fungible at least within an AZ. That's also why RFD 248 considers "nexus" a single DNS service, rather than having each group-of-Nexuses-within-a-rack have its own DNS name.

I gather you're assuming (1) there's at least one Nexus in each rack and (2) Nexus instances are only responsible for managing their own rack. I think that's got several downsides. We can have multiple Nexus instances within a rack, so we still need to build some coordination to avoid them stomping over each other. Once we've built that, it's extra work (that we otherwise wouldn't need to do) to constrain their responsibilities to only cover their own rack. It's also extra work to enforce the deployment constraints and verify them with runtime monitoring. I think it'd be a lot simpler to say that the Nexus instances need to coordinate on changes (which they need to do anyway, as I mentioned) and then all we need to ensure is that there are enough Nexus instances in enough different racks to maintain our availability goals. That leaves the deployment system with much more flexibility about where to put them.

I think eliminating constraint (2) also has a big impact on efficiency. With that constraint, we basically need two Nexus instances in each rack to survive a single failure and we need three instances to survive a double failure. So a 10-rack deployment would need 30 Nexus instances to survive 2 failures. Without this constraint (i.e., letting Nexus instances operate on all racks, with coordination) you'd only need 5 instances to survive any two failures.

A contributor commented:

This is called a ClusterId in RFD 238.

@smklein (Collaborator, Author) replied on Jun 24, 2022:

I don't think I understand how you want me to eliminate this constraint in the current implementation.

To fully satisfy the one-to-many relationship of Nexuses to racks, we'd need to:

  • Build a concept of a fleet, and have that be populated somewhere
  • Ensure that when RSS performs initialization, it does not format CRDB instances from scratch, but rather, define the mechanism for integrating with the fleet-wide DB. This means going back to the design of RFD 278, because we'd need to distinguish the "adding a rack to an initialized fleet" case from "adding a rack to an uninitialized fleet".
  • The same "integration into the rack handoff logic", but for Clickhouse
  • Rebuild the mechanism for "service balancing" to work across all racks in a fleet

To be clear, I think that is the right long-term direction, but I'm also concerned it's a lot of work for something that isn't relevant for v1.

My prototype implementation of service management within Nexus can specify ServiceRedundancy as an enum, and it explicitly specifies Per-Rack, because that's all we have right now. When we approach the multi-rack case, we could change this redundancy value to "per-fleet" or "per-az" - but I hesitate to include that now, since we won't really be able to validate that without building out more genuine end-to-end multi-rack support.
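
For concreteness, here's a sketch of what that enum might look like; this is illustrative only, and the prototype's actual definition may differ:

```rust
/// Redundancy policy for a control-plane service (illustrative sketch).
enum ServiceRedundancy {
    /// Provision at least this many instances of the service on each rack.
    /// This is the only policy we can act on today.
    PerRack(u32),
    // Possible future variants once multi-rack support exists (hypothetical):
    // PerAz(u32),
    // PerFleet(u32),
}
```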

So, backing up: what's the action item here? Should I be deferring the RSS handoff patches, and instead favor deleting all locally-stored notions of "Rack ID"?

@davepacheco (Collaborator) replied:

It seems like I'm missing something important about RSS's involvement in a multi-rack world, so let's table this for now.

I remain a little worried about this:

> To fully satisfy the one-to-many relationship of Nexuses to racks, we'd need to:

I think we're absolutely going to need to support multiple Nexuses in the v1 product, even being only one rack. Otherwise, the control plane can't survive a single failure. I don't really understand the pieces you said have to happen. Maybe we can discuss at some point.

@smklein (Collaborator, Author) replied on Jun 24, 2022:

Ah, to be explicit, in the current implementation:

  • We already support "multiple Nexuses running on one rack".
  • We do not support "one Nexus controlling multiple racks", because our mechanism for populating services is based on ensuring that enough services exist within the rack identified by rack_id (i.e., within this Nexus's own rack). This can be changed in the future.

When RSS initializes a new rack, it executes operations like "create a new CRDB instance, and format all storage - resulting in an empty cluster".

For a single-rack use-case, this works fine - if we're initializing the rack, there is no other fleet-wide metadata storage to consider.

For a multi-rack scenario, it isn't always appropriate to initialize CRDB "from scratch" here. If other racks exist, we wouldn't want to create partitioned clusters of CRDB. As you referenced in RFD 61:

> [CRDB should be] Deployed as a single fleet (across multiple racks) with enough Instances in enough racks to satisfy availability and scale requirements. Sharding will be relevant here, but we know it’s important that a rack’s last known state is available in the database even when the rack itself is completely unavailable.

I assume we'd want some way of specifying: "here is the region, here is the existing fleet, if CRDB nodes already exist, join them instead of starting a totally new cluster".
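
As a sketch of the kind of knob that could express this, consider something like the following; the type and field names are hypothetical, not anything RSS has today:

```rust
use std::net::SocketAddr;

/// How RSS should bring up CockroachDB on a new rack (hypothetical sketch).
enum CockroachInit {
    /// No fleet-wide cluster exists yet: format storage and bootstrap a
    /// brand-new CRDB cluster.
    NewCluster,
    /// A fleet-wide cluster already exists: join its existing nodes rather
    /// than initializing from scratch.
    JoinExisting { peers: Vec<SocketAddr> },
}
```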

@smklein (Collaborator, Author) replied:

Filed #1276

@smklein smklein enabled auto-merge (squash) June 24, 2022 19:40
@smklein smklein disabled auto-merge June 24, 2022 19:40
@smklein smklein enabled auto-merge (squash) June 24, 2022 19:40
@smklein smklein merged commit 3ace6e7 into main Jun 24, 2022
@smklein smklein deleted the background-work-user branch June 24, 2022 21:15
leftwo added a commit that referenced this pull request Mar 29, 2024
Propolis changes:
Add `IntrPin::import_state` and migrate LPC UART pin states (#669)
Attempt to set WCE for raw file backends
Fix clippy/lint nits for rust 1.77.0

Crucible changes:
Correctly (and robustly) count bytes (#1237)
test-replay.sh fix name of DTrace script (#1235)
BlockReq -> BlockOp (#1234)
Simplify `BlockReq` (#1218)
DTrace, cmon, cleanup, retry downstairs connections at 10 seconds.
(#1231)
Remove `MAX_ACTIVE_COUNT` flow control system (#1217)

Crucible changes that were in Omicron but not in Propolis before this commit.
Return *410 Gone* if volume is inactive (#1232)
Update Rust crate opentelemetry to 0.22.0 (#1224)
Update Rust crate base64 to 0.22.0 (#1222)
Update Rust crate async-recursion to 1.1.0 (#1221)
Minor cleanups to extent implementations (#1230)
Update Rust crate http to 0.2.12 (#1220)
Update Rust crate reedline to 0.30.0 (#1227)
Update Rust crate rayon to 1.9.0 (#1226)
Update Rust crate nix to 0.28 (#1223)
Update Rust crate async-trait to 0.1.78 (#1219)
Various buffer optimizations (#1211)
Add low-level test for message encoding (#1214)
Don't let df failures ruin the buildomat tests (#1213)
Activate the NBD server's pseudo file (#1209)

---------

Co-authored-by: Alan Hanson <alan@oxide.computer>