
Conversation

@smklein smklein commented Nov 16, 2022

Overview

What Changed in the Sled Agent

  • Sled Agent
    • A new get_zpools endpoint is exposed from the Sled Agent. This is invoked by RSS when figuring out where to provision datasets.
    • The UUID for the sled agent is removed from the config file (it's dynamic, and should not be shared among sleds)

What Changed in RSS

  • HardcodedSledRequest (and the corresponding entries in config-rss.toml) has been removed
  • A plan module was added, where plans for sled generation ("What sleds should get what addresses?") and service generation ("What services should run where?") are generated.
  • Refactor service and dataset initialization to insert entries into DNS
  • Invoke handoff_to_nexus, informing Nexus of all services previously owned by RSS.

What Changed in Nexus

  • Expand RackInitializationRequest to consider both services and datasets (rough sketch below)
  • dataset_put API removed -- beyond the initialization request, Nexus should be responsible for provisioning new datasets, not the sled agent.
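
For orientation, a rough sketch of what the expanded handoff payload might look like is below. The type and field names (ServicePutRequest, DatasetCreateRequest, etc.) are illustrative assumptions; the real definitions live in nexus/src/internal_api/params.rs and carry more detail.

```rust
use std::net::SocketAddrV6;
use uuid::Uuid;

// Hypothetical, simplified shapes -- not the exact structs from this PR.
pub struct ServicePutRequest {
    pub service_id: Uuid,
    pub sled_id: Uuid,
    pub address: SocketAddrV6,
    // ...plus a discriminant for which service this is (Nexus, Oximeter, ...).
}

pub struct DatasetCreateRequest {
    pub dataset_id: Uuid,
    pub zpool_id: Uuid,
    pub address: SocketAddrV6,
    // ...plus a discriminant for which dataset this is (CockroachDB, Crucible, ...).
}

/// Everything RSS hands to Nexus when declaring the rack initialized.
pub struct RackInitializationRequest {
    pub services: Vec<ServicePutRequest>,
    pub datasets: Vec<DatasetCreateRequest>,
}
```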

TO-DO list before sending out for review

  • Address "TODO: If nexus, add a pool?", by adding an IP pool for Nexus (Punted to: RSS: Collect and transfer IP pool information into Nexus #1958)
  • Pass x509 cert information alongside the RackInitializationRequest, store it in CRDB. This will be necessary to re-provision Nexus in follow-up PRs. See: nexus/src/internal_api/params.rs for more context. (Punted to: RSS: Collect and transfer x509 cert information into Nexus #1959)
  • Add tests for the address bump allocator: sled-agent/src/rack_setup/plan/service.rs
  • Audit the usage of config files by RSS
  • Address "TODO: Consider putting the rack subnet here?"
  • Address: "TODO" in sled-agent/src/rack_setup/service.rs for conversion to ServiceType

Fixes #1148
Part of #732
Part of #824

@smklein smklein marked this pull request as ready for review November 17, 2022 22:03
#
tar xf out/omicron-sled-agent.tar pkg/config-rss.toml
sed -e '/\[gateway\]/,/\[request\]/ s/^.*address =.*$/address = "192.168.1.199"/' \
sed -e 's/^# address =.*$/address = "192.168.1.199"/' \
Collaborator Author

The format of config-rss.toml changed, so this needed to change too.

This only wants to update the gateway address; this modification makes it correct once more.

Collaborator

Makes sense. Too bad there's not a CLI tool available here to structurally modify the toml instead.

Collaborator Author

Now that config-rss.toml has been shrinking, it may actually make sense to just have a totally distinct copy for the lab?

FYI to @andrewjstone , I do think when we get to the point where wicket actually talks to RSS to "let the operator dynamically set a profile (including Nexus IP info, the subnet, etc)", it'll be a great time to revisit "how we pass this info into RSS".

I think the gateway stuff won't be necessary at that point, since it's only needed for the OPTE hack?

Contributor

Totally!

}
}
}
return (config.nexus_external_address, 80).into();
Collaborator Author

We no longer list the "ServiceZoneRequest" for nexus in config-rss.toml -- now we specify the external IP, and let RSS do the rest of the provisioning.

pub pool_id: Uuid,

ip: ipnetwork::IpNetwork,
ip: ipv6::Ipv6Addr,
Collaborator Author

This is a single IP address on the underlay, not a whole network, so this feels a little more accurate.

id: Uuid,
pool_id: Uuid,
addr: SocketAddr,
addr: SocketAddrV6,
Collaborator Author

Here and elsewhere in the PR, we use Ipv6 variants instead of more generic Ip variants.

This should be accurate for the underlay, where we expect everything to be IPv6, and prevents some impossible-to-hit variant matching errors.
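
A minimal, self-contained illustration of the benefit (not code from this PR): with the generic SocketAddr, callers are forced to write a V4 arm that can never fire on the underlay, while SocketAddrV6 removes that case entirely.

```rust
use std::net::{Ipv6Addr, SocketAddr, SocketAddrV6};

fn underlay_port_generic(addr: SocketAddr) -> u16 {
    match addr {
        SocketAddr::V6(a) => a.port(),
        // Impossible on the underlay, but the type forces us to handle it.
        SocketAddr::V4(_) => unreachable!("underlay addresses are always IPv6"),
    }
}

fn underlay_port_v6(addr: SocketAddrV6) -> u16 {
    // No impossible variants left to match on.
    addr.port()
}

fn main() {
    let addr = SocketAddrV6::new(Ipv6Addr::LOCALHOST, 12221, 0, 0);
    assert_eq!(underlay_port_generic(SocketAddr::V6(addr)), underlay_port_v6(addr));
}
```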

Comment on lines -153 to -157
async fn dataset_put(
    rqctx: Arc<RequestContext<Arc<ServerContext>>>,
    path_params: Path<DatasetPathParam>,
    info: TypedBody<DatasetPutRequest>,
) -> Result<HttpResponseOk<DatasetPutResponse>, HttpError> {
Collaborator Author

After this PR, Nexus should be in charge of "when to provision datasets" (after the RSS handoff).

As a result, this endpoint is removed.

Comment on lines +29 to +40
// The number of Nexus instances to create from RSS.
const NEXUS_COUNT: usize = 1;

// The number of CRDB instances to create from RSS.
const CRDB_COUNT: usize = 1;

// TODO(https://github.com/oxidecomputer/omicron/issues/732): Remove
// when Nexus provisions Oximeter.
const OXIMETER_COUNT: usize = 1;
// TODO(https://github.com/oxidecomputer/omicron/issues/732): Remove
// when Nexus provisions Clickhouse.
const CLICKHOUSE_COUNT: usize = 1;
Collaborator Author

For each of these services, this is how we specify how many instances RSS should initialize.

Previously, these were hard-coded entries in config-rss.toml; eventually, this may become a configurable option as part of the interaction via wicket. For now, by moving the counts here, RSS can decide more dynamically where to place services.
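
As a hypothetical sketch of the kind of decision this enables (the real planner in sled-agent/src/rack_setup/plan/service.rs is more involved), the counts above can be spread across whichever sleds RSS has discovered rather than being pinned by a config file:

```rust
use uuid::Uuid;

// Hypothetical helper, not the PR's actual planner: assign `count` instances
// of a service to sleds in round-robin order.
fn place_round_robin(sled_ids: &[Uuid], count: usize) -> Vec<Uuid> {
    sled_ids.iter().cycle().take(count).copied().collect()
}

// e.g. place_round_robin(&sleds, NEXUS_COUNT) yields one target sled per Nexus instance.
```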

Comment on lines +151 to +163
// TODO(https://github.com/oxidecomputer/omicron/issues/732):
// We're currently waiting for ALL zpools to appear, so RSS can be
// responsible for provisioning Crucible datasets.
//
// Once this responsibility shifts to Nexus, we actually only
// need enough zpools to provision CRDB.
if zpools.len() < MINIMUM_ZPOOL_COUNT {
    return Err(BackoffError::transient(
        PlanError::SledInitialization(
            "Awaiting zpools".to_string(),
        ),
    ));
}
Collaborator Author

So here's the deal - eventually, Nexus should 100% be calling the shots on when to provision crucible datasets, and RSS shouldn't muck with this. This will all be sorted out in follow-up PRs.

AT THE MOMENT, however, if RSS doesn't do this, no one will. So RSS arbitrarily waits for three zpools to show up on a sled before saying "okay, go ahead, and I'll tell all your zpools to add a Crucible dataset". This is a hack! But it's hopefully a short-lived one, until Nexus is more fully in charge of this data.
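
In generic terms, "treat too-few-zpools as a transient error" boils down to a polling loop like the one below. This is an illustrative std/tokio sketch, not the omicron backoff machinery the code above actually uses.

```rust
use std::time::Duration;

// Illustrative only: wait until the sled reports enough zpools, then proceed.
async fn wait_for_zpools<F>(mut zpool_count: F, minimum: usize)
where
    F: FnMut() -> usize,
{
    while zpool_count() < minimum {
        // The real code returns BackoffError::transient and lets the retry
        // policy decide when to poll again.
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}
```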

"postgresql://root@[fd00:1122:3344:0101::3]:32221/omicron?sslmode=disable"
).unwrap(),
}
database: nexus_config::Database::FromDns,
Collaborator Author

As a nice perk, because we're setting the DNS record correctly, Nexus can connect to CRDB regardless of what IP it has.
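
Roughly speaking (hedged sketch; the real enum lives in nexus_config and may differ in detail), the configuration choice looks like this: instead of baking a CockroachDB URL into the Nexus config, Nexus can be told to resolve it through the internal DNS records RSS just created.

```rust
// Hypothetical, simplified version of the configuration variants:
pub enum Database {
    /// Resolve CockroachDB's address via the rack's internal DNS.
    FromDns,
    /// Use an explicit connection URL, as the hard-coded test config did.
    FromUrl { url: String },
}
```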

Comment on lines +555 to +559
/// Gets the sled's current list of all zpools.
pub async fn zpools_get(&self) -> Result<Vec<Zpool>, Error> {
    let zpools = self.inner.storage.get_zpools().await?;
    Ok(zpools)
}
Collaborator Author

This is a new endpoint being exposed from the sled agent so RSS can query for zpool information when deciding "where should CRDB go?"
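
For readers unfamiliar with the sled agent's HTTP layer, the corresponding dropshot endpoint would look roughly like the sketch below. The path, context type, and error conversion here are assumptions, and Zpool is the type returned by zpools_get above.

```rust
use dropshot::{endpoint, HttpError, HttpResponseOk, RequestContext};
use std::sync::Arc;

// Illustrative sketch only -- not the PR's exact handler.
#[endpoint {
    method = GET,
    path = "/zpools",
}]
async fn zpools_get(
    rqctx: Arc<RequestContext<SledAgent>>,
) -> Result<HttpResponseOk<Vec<Zpool>>, HttpError> {
    let sled_agent = rqctx.context();
    // Assumes the sled agent's error type converts into an HttpError.
    let zpools = sled_agent.zpools_get().await.map_err(HttpError::from)?;
    Ok(HttpResponseOk(zpools))
}
```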

# how-to-run.adoc for details on how to determine the value for your network.
mac = "00:0d:b9:54:fe:e4"

[[request]]
Collaborator Author

This is basically the point of this PR - this information can - and should be - dynamic. RSS will control it for now, and Nexus will take more responsibility in future PRs.

Collaborator

Very nice!


/// DNS Services to be instantiated.
#[serde(default, rename = "dns_service")]
pub dns_services: Vec<ServiceZoneRequest>,
Contributor

This is kinda nitpicky and certainly doesn't need to change as part of this PR, but it seems a little strange that this vec has the same type as services in the "make illegal states unrepresentable" sense. What happens if there's an item in this vec that doesn't have ZoneType::InternalDns, or an item in services that does? Would it make sense to split ServiceZoneRequest into two types if DNS zones are special enough to be split out here?

Collaborator Author

ServiceZoneRequest has an internal ZoneType field that allows for matching - it's mostly RSS that cares about the distinction, because it wants to launch DNS services before others.

I'll just lump all services into pub services, and make this distinction internally. I agree that it's odd to make illegal state representation possible here.

Done in 8d899c5
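
A minimal sketch of what "make the distinction internally" could look like, using made-up, pared-down types (the real ServiceZoneRequest and ZoneType carry much more information):

```rust
// Hypothetical minimal types, for illustration only.
enum ZoneType {
    InternalDns,
    Nexus,
    CockroachDb,
}

struct ServiceZoneRequest {
    zone_type: ZoneType,
    // ...addresses, dataset info, etc.
}

// Split a single combined `services` list so DNS zones can be launched first.
fn split_for_launch(
    services: Vec<ServiceZoneRequest>,
) -> (Vec<ServiceZoneRequest>, Vec<ServiceZoneRequest>) {
    services
        .into_iter()
        .partition(|s| matches!(s.zone_type, ZoneType::InternalDns))
}
```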

}

fn rss_plan_path() -> std::path::PathBuf {
fn rss_completed_plan_path() -> PathBuf {
Contributor

Tiny nit - based on the name I expected this to return rss-plan-complete.toml, not .marker. Maybe rss_plan_executed_path() or rss_completed_marker_path() would be clearer?

Collaborator Author

Done in d223fb9

for dataset in datasets {
    records
        .entry(dataset.srv())
        .or_insert_with(Vec::new)
Contributor

I was going to suggest .or_default() instead of .or_insert_with(..), but that would require putting a type annotation on records. I might still do that, but that puts me much more on the fence. Up to you.

Collaborator Author

Done in c53a0bf
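
For reference, the trade-off discussed above as a standalone sketch (the map and key names are made up):

```rust
use std::collections::HashMap;

fn main() {
    // With `or_insert_with`, the value type is inferred from `Vec::new`...
    let mut records = HashMap::new();
    records.entry("_cockroach._tcp").or_insert_with(Vec::new).push("AAAA");

    // ...while `or_default()` is terser but needs the type spelled out somewhere.
    let mut records2: HashMap<&str, Vec<&str>> = HashMap::new();
    records2.entry("_cockroach._tcp").or_default().push("AAAA");

    assert_eq!(records, records2);
}
```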

//!
//! ## Sled Plan
//!
//! When RSS starts, it presumably is executing on a single sled, and must
Contributor

I'm not sure what "it presumably is executing on a single sled" means or is getting at. There are instances of RSS running on both scrimlets, right? But only one of them will actually perform setup (whichever one the technician happens to connect to / plug in to).

Collaborator Author

Updated comment in 362f7e1.

I agree with your comment here - we expect the operator to only be initializing the sled via a connection through a single technician port.


impl Plan {
    pub async fn load(log: &Logger) -> Result<Option<Plan>, PlanError> {
        // If we already created a plan for this RSS to allocate
Contributor

I just want to note that eventually, this should probably be written to and retrieved from the bootstore for fault tolerance. The Handoff "switch" can go there also. Obviously there's a lot more work to do on the bootstore front, and I'm not looking to change anything here now (certainly not before MUP), but I wanted to leave a note.

Collaborator Author

I might need a better understanding of the bootstore - how is it expected to be replicated within the rack?

Contributor

Unfortunately bootstore is not well documented yet. It's a replacement for the more complicated reconfiguration protocol in RFD 238 section 5 and is alluded to in RFD 301 with some implied functionality. Once we get past MUP I'll write it up completely in section 5 of RFD 238 or a new RFD.

For now though, you can think of it the following way. There's a coordinator (based in the boot agent) that drives a 2-phase commit with input from Nexus or RSS, given initial setup or standard reconfiguration. Replication is to every sled with the exception of specific key-shares going only to their assigned replicas. Transport is over sprockets.

The intended uses are:

  • Storage of key shares and trust quorum metadata
  • Storage of enough networking information required to enable NTP and bring up CockroachDB
  • Storage of initial plan data from RSS that can be fed into Nexus/CockroachDB on the nexus handoff. This should allow the system to tolerate failures of this initial setup in a more robust manner and prevent conflicting plans in case of those versions. John and I discussed this a bit a while back, but right now it's just a rough idea in my head. Since the plan and key-shares are intimately tied together for rack init, I figured we'd just ship them together from RSS when the user pressed go.

Currently the protocol for the bootstore implementation has been started with a focus on storing key shares. There are some comments related to also storing the plan, but nothing implemented yet. Those changes should be relatively minor from the bootstore perspective.

Collaborator

Thanks for that summary @andrewjstone!

Collaborator

@davepacheco davepacheco left a comment

@smklein thanks so much for the detailed PR description and the PR comments on the code. That made this a lot easier to review. I have some thoughts above but I think they're pretty small.

There's so much goodness here. This makes the rack setup process feel so much more real.

use std::net::{Ipv6Addr, SocketAddrV6};
use uuid::Uuid;

/// Database representation of a Dataset.
Collaborator

Cool. The changes here make sense.

This is unrelated to this PR but I find it a little unintuitive that we call this a Dataset because it seems to encapsulate not just the ZFS dataset, but at least some aspects of the networking for the service running in a zone atop the dataset (its IP, its DNS record type and record name). Is that just historical or is there some intuition I'm missing?

Collaborator Author

That's a fair point; it's basically tightly coupling the concept of the "dataset" with "the service which manages the dataset".

This has been implemented because it fits our current "dataset-managing services", including:

  • CRDB
  • Crucible
  • Clickhouse

Though it's a fair point, it definitely won't apply to all "datasets", and it makes it awkward to describe datasets which aren't managed by an explicit service (such as a dump device, dataset for logs, etc).

Filed #2000

warn!(self.log, "Cannot look up rack: {}", e);
}
}
tokio::time::sleep(std::time::Duration::from_secs(2)).await;
Collaborator

nit: this seems like a hand-rolled version of wait_for_condition...maybe there's no value in commonizing it.

Collaborator Author

I would need to pull wait_for_condition out of test-utils (this is not test-only code) to re-use it, and I would need to modify it to take an infinite duration? I dunno.
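
For readers following along, the hand-rolled loop under discussion amounts to something like the sketch below; the closure here is a stand-in for the datastore lookup, not the crate's actual API.

```rust
use std::time::Duration;

// Illustrative shape of the "wait until the rack is initialized" loop.
async fn wait_for_rack_initialization<F>(lookup_initialized: F)
where
    F: Fn() -> Option<bool>,
{
    loop {
        match lookup_initialized() {
            Some(true) => return, // Rack is initialized; stop waiting.
            Some(false) => (),    // Not yet; try again shortly.
            None => (),           // Lookup failed; the real code logs a warning.
        }
        tokio::time::sleep(Duration::from_secs(2)).await;
    }
}
```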

match result {
Ok(rack) => {
if rack.initialized {
return;
Collaborator

nit: might be worth logging something here at "info" level

Collaborator Author

Added in 4ec1e28

rack_id: Uuid,
services: Vec<Service>,
datasets: Vec<Dataset>,
) -> UpdateResult<Rack> {
Collaborator

I think at one point in the RFD 278 discussion we talked about ensuring that if we got a second RSS initialization request, it must have the same parameters as the first one. Is that checked somehow here or are we deferring that work for now?

Collaborator Author

We aren't checking the parameters right now, but we do bail out early if we have ever completed a call to rack_set_initialized:

// Early exit if the rack has already been initialized.
let rack = rack_dsl::rack
    .filter(rack_dsl::id.eq(rack_id))
    .select(Rack::as_select())
    .get_result_async(&conn)
    .await
    .map_err(|e| {
        TxnError::CustomError(RackInitError::RackUpdate(
            PoolError::from(e),
        ))
    })?;
if rack.initialized {
    info!(log, "Early exit: Rack already initialized");
    return Ok(rack);
}

This means that we won't ever try to create conflicting state, but rather enforce a "first-one-wins" policy.

Comment on lines +70 to +72
/// A partially-initialized Nexus server, which exposes an internal interface,
/// but is not ready to receive external requests.
pub struct InternalServer<'a> {
Collaborator

I like this split. I wonder if the name here could better reflect its half-initialized state? The distinction between InternalServer and Server really seems to be that the former is a transient state during Nexus startup where we happen to not have set up the external server yet.

// Configure and run the Bootstrap server.
let bootstrap_config = BootstrapConfig {
id: config.id,
id: Uuid::new_v4(),
Collaborator

I think this is probably beyond the scope of this PR but I wonder if the uuids used for Bootstrap and Sled Agent should come from some hardware identity. Has that been discussed anywhere?

Collaborator Author

Yeah, this is touched upon in #945, admittedly more in the context of the bootstrap address.

use super::config::{HardcodedSledRequest, SetupServiceConfig as Config};
use crate::bootstrap::config::BOOTSTRAP_AGENT_PORT;
//! Rack Setup Service (RSS) implementation
Collaborator

Thank you for this comment! (I wonder if it belongs in sled-agent/src/rack_setup/lib.rs so it'd be more easily found?)

Collaborator Author

Do you mean sled-agent/src/rack_setup/mod.rs ? I'll add doc comments there to help discovery (a659390), but I think this comment is tightly coupled with this file, as the body of service.rs is the implementation described here.


@smklein smklein merged commit 27dfb62 into main Dec 2, 2022
@smklein smklein deleted the 278 branch December 2, 2022 14:48

