
Conversation

@smklein smklein commented Nov 16, 2022

Overview

What Changed in the Sled Agent

  • Sled Agent
    • A new get_zpools endpoint is exposed from the Sled Agent. This is invoked by RSS when figuring out where to provision datasets.
    • The UUID for the sled agent is removed from the config file (it's dynamic, and should not be shared among sleds)

What Changed in RSS

  • HardcodedSledRequest (and the corresponding entries in config-rss.toml) has been removed
  • A plan module was added, where plans for sled generation ("What sleds should get what addresses?") and service generation ("What services should run where?") are generated.
  • Refactor service and dataset initialization to insert entries into DNS
  • Invoke handoff_to_nexus, informing Nexus of all services previously owned by RSS.

What Changed in Nexus

  • Expand RackInitializationRequest to consider both services and datasets (rough sketch below)
  • dataset_put API removed -- beyond the initialization request, Nexus should be responsible for provisioning new datasets, not the sled agent.
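
For orientation, a rough sketch of what the expanded handoff payload might look like is below. The type and field names (ServicePutRequest, DatasetCreateRequest, etc.) are illustrative assumptions; the real definitions live in nexus/src/internal_api/params.rs and carry more detail.

```rust
use std::net::SocketAddrV6;
use uuid::Uuid;

// Hypothetical, simplified shapes -- not the exact structs from this PR.
pub struct ServicePutRequest {
    pub service_id: Uuid,
    pub sled_id: Uuid,
    pub address: SocketAddrV6,
    // ...plus a discriminant for which service this is (Nexus, Oximeter, ...).
}

pub struct DatasetCreateRequest {
    pub dataset_id: Uuid,
    pub zpool_id: Uuid,
    pub address: SocketAddrV6,
    // ...plus a discriminant for which dataset this is (CockroachDB, Crucible, ...).
}

/// Everything RSS hands to Nexus when declaring the rack initialized.
pub struct RackInitializationRequest {
    pub services: Vec<ServicePutRequest>,
    pub datasets: Vec<DatasetCreateRequest>,
}
```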

TO-DO list before sending out for review

  • Address "TODO: If nexus, add a pool?", by adding an IP pool for Nexus (Punted to: RSS: Collect and transfer IP pool information into Nexus #1958)
  • Pass x509 cert information alongside the RackInitializationRequest, store it in CRDB. This will be necessary to re-provision Nexus in follow-up PRs. See: nexus/src/internal_api/params.rs for more context. (Punted to: RSS: Collect and transfer x509 cert information into Nexus #1959)
  • Add tests for the address bump allocator: sled-agent/src/rack_setup/plan/service.rs
  • Audit the usage of config files by RSS
  • Address "TODO: Consider putting the rack subnet here?"
  • Address: "TODO" in sled-agent/src/rack_setup/service.rs for conversion to ServiceType

Fixes #1148
Part of #732
Part of #824

@smklein smklein marked this pull request as ready for review November 17, 2022 22:03
#
tar xf out/omicron-sled-agent.tar pkg/config-rss.toml
sed -e '/\[gateway\]/,/\[request\]/ s/^.*address =.*$/address = "192.168.1.199"/' \
sed -e 's/^# address =.*$/address = "192.168.1.199"/' \
Collaborator Author

The format of config-rss.toml changed, so this needed to change too.

This only wants to update the gateway address; this modification makes it correct once more.

Collaborator

Makes sense. Too bad there's not a CLI tool available here to structurally modify the toml instead.

Collaborator Author

Now that config-rss.toml has been shrinking, it may actually make sense to just have a totally distinct copy for the lab?

FYI to @andrewjstone , I do think when we get to the point where wicket actually talks to RSS to "let the operator dynamically set a profile (including Nexus IP info, the subnet, etc)", it'll be a great time to revisit "how we pass this info into RSS".

I think the gateway stuff won't be necessary at that point, since it's only needed for the OPTE hack?

Contributor

Totally!

}
}
}
return (config.nexus_external_address, 80).into();
Collaborator Author

We no longer list the "ServiceZoneRequest" for nexus in config-rss.toml -- now we specify the external IP, and let RSS do the rest of the provisioning.

pub pool_id: Uuid,

ip: ipnetwork::IpNetwork,
ip: ipv6::Ipv6Addr,
Collaborator Author

This is a single IP address on the underlay, not a whole network, so this feels a little more accurate.

id: Uuid,
pool_id: Uuid,
addr: SocketAddr,
addr: SocketAddrV6,
Collaborator Author

Here and elsewhere in the PR, we use Ipv6 variants instead of more generic Ip variants.

This should be accurate for the underlay, where we expect everything to be IPv6, and prevents some impossible-to-hit variant matching errors.
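
A minimal, self-contained illustration of the benefit (not code from this PR): with the generic SocketAddr, callers are forced to write a V4 arm that can never fire on the underlay, while SocketAddrV6 removes that case entirely.

```rust
use std::net::{Ipv6Addr, SocketAddr, SocketAddrV6};

fn underlay_port_generic(addr: SocketAddr) -> u16 {
    match addr {
        SocketAddr::V6(a) => a.port(),
        // Impossible on the underlay, but the type forces us to handle it.
        SocketAddr::V4(_) => unreachable!("underlay addresses are always IPv6"),
    }
}

fn underlay_port_v6(addr: SocketAddrV6) -> u16 {
    // No impossible variants left to match on.
    addr.port()
}

fn main() {
    let addr = SocketAddrV6::new(Ipv6Addr::LOCALHOST, 12221, 0, 0);
    assert_eq!(underlay_port_generic(SocketAddr::V6(addr)), underlay_port_v6(addr));
}
```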

Comment on lines -153 to -157
async fn dataset_put(
    rqctx: Arc<RequestContext<Arc<ServerContext>>>,
    path_params: Path<DatasetPathParam>,
    info: TypedBody<DatasetPutRequest>,
) -> Result<HttpResponseOk<DatasetPutResponse>, HttpError> {
Collaborator Author

After this PR, Nexus should be in charge of "when to provision datasets" (after the RSS handoff).

As a result, this endpoint is removed.

Comment on lines +29 to +40
// The number of Nexus instances to create from RSS.
const NEXUS_COUNT: usize = 1;

// The number of CRDB instances to create from RSS.
const CRDB_COUNT: usize = 1;

// TODO(https://github.com/oxidecomputer/omicron/issues/732): Remove
// when Nexus provisions Oximeter.
const OXIMETER_COUNT: usize = 1;
// TODO(https://github.com/oxidecomputer/omicron/issues/732): Remove
// when Nexus provisions Clickhouse.
const CLICKHOUSE_COUNT: usize = 1;
Collaborator Author

For each of these services, this is how we specify how many instances RSS should initialize.

Previously, these were hard-coded entries in config-rss.toml; eventually, this may become a configurable option as part of the interaction via wicket. For now, by moving the counts here, RSS can decide more dynamically where to place services.
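
As a hypothetical sketch of the kind of decision this enables (the real planner in sled-agent/src/rack_setup/plan/service.rs is more involved), the counts above can be spread across whichever sleds RSS has discovered rather than being pinned by a config file:

```rust
use uuid::Uuid;

// Hypothetical helper, not the PR's actual planner: assign `count` instances
// of a service to sleds in round-robin order.
fn place_round_robin(sled_ids: &[Uuid], count: usize) -> Vec<Uuid> {
    sled_ids.iter().cycle().take(count).copied().collect()
}

// e.g. place_round_robin(&sleds, NEXUS_COUNT) yields one target sled per Nexus instance.
```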

Comment on lines +151 to +163
// TODO(https://github.com/oxidecomputer/omicron/issues/732):
// We're currently waiting for ALL zpools to appear, so RSS can be
// responsible for provisioning Crucible datasets.
//
// Once this responsibility shifts to Nexus, we actually only
// need enough zpools to provision CRDB.
if zpools.len() < MINIMUM_ZPOOL_COUNT {
    return Err(BackoffError::transient(
        PlanError::SledInitialization(
            "Awaiting zpools".to_string(),
        ),
    ));
}
Collaborator Author

So here's the deal - eventually, Nexus should 100% be calling the shots on when to provision crucible datasets, and RSS shouldn't muck with this. This will all be sorted out in follow-up PRs.

AT THE MOMENT, however, if RSS doesn't do this, no one will. So RSS arbitrarily waits for three zpools to show up on a sled before saying "okay, go ahead, and I'll tell all your zpools to add a Crucible dataset". This is a hack! But it's hopefully a short-lived one, until Nexus is more fully in charge of this data.
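
In generic terms, "treat too-few-zpools as a transient error" boils down to a polling loop like the one below. This is an illustrative std/tokio sketch, not the omicron backoff machinery the code above actually uses.

```rust
use std::time::Duration;

// Illustrative only: wait until the sled reports enough zpools, then proceed.
async fn wait_for_zpools<F>(mut zpool_count: F, minimum: usize)
where
    F: FnMut() -> usize,
{
    while zpool_count() < minimum {
        // The real code returns BackoffError::transient and lets the retry
        // policy decide when to poll again.
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}
```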

"postgresql://root@[fd00:1122:3344:0101::3]:32221/omicron?sslmode=disable"
).unwrap(),
}
database: nexus_config::Database::FromDns,
Collaborator Author

As a nice perk, because we're setting the DNS record correctly, Nexus can connect to CRDB regardless of what IP it has.
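
Roughly speaking (hedged sketch; the real enum lives in nexus_config and may differ in detail), the configuration choice looks like this: instead of baking a CockroachDB URL into the Nexus config, Nexus can be told to resolve it through the internal DNS records RSS just created.

```rust
// Hypothetical, simplified version of the configuration variants:
pub enum Database {
    /// Resolve CockroachDB's address via the rack's internal DNS.
    FromDns,
    /// Use an explicit connection URL, as the hard-coded test config did.
    FromUrl { url: String },
}
```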

Comment on lines +555 to +559
/// Gets the sled's current list of all zpools.
pub async fn zpools_get(&self) -> Result<Vec<Zpool>, Error> {
    let zpools = self.inner.storage.get_zpools().await?;
    Ok(zpools)
}
Collaborator Author

This is a new endpoint being exposed from the sled agent so RSS can query for zpool information when deciding "where should CRDB go?"
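
For readers unfamiliar with the sled agent's HTTP layer, the corresponding dropshot endpoint would look roughly like the sketch below. The path, context type, and error conversion here are assumptions, and Zpool is the type returned by zpools_get above.

```rust
use dropshot::{endpoint, HttpError, HttpResponseOk, RequestContext};
use std::sync::Arc;

// Illustrative sketch only -- not the PR's exact handler.
#[endpoint {
    method = GET,
    path = "/zpools",
}]
async fn zpools_get(
    rqctx: Arc<RequestContext<SledAgent>>,
) -> Result<HttpResponseOk<Vec<Zpool>>, HttpError> {
    let sled_agent = rqctx.context();
    // Assumes the sled agent's error type converts into an HttpError.
    let zpools = sled_agent.zpools_get().await.map_err(HttpError::from)?;
    Ok(HttpResponseOk(zpools))
}
```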

# how-to-run.adoc for details on how to determine the value for your network.
mac = "00:0d:b9:54:fe:e4"

[[request]]
Collaborator Author

This is basically the point of this PR - this information can - and should be - dynamic. RSS will control it for now, and Nexus will take more responsibility in future PRs.

Collaborator

Very nice!


/// DNS Services to be instantiated.
#[serde(default, rename = "dns_service")]
pub dns_services: Vec<ServiceZoneRequest>,
Contributor

This is kinda nitpicky and certainly doesn't need to change as part of this PR, but it seems a little strange that this vec has the same type as services in the "make illegal states unrepresentable" sense. What happens if there's an item in this vec that doesn't have ZoneType::InternalDns, or an item in services that does? Would it make sense to split ServiceZoneRequest into two types if DNS zones are special enough to be split out here?

Collaborator Author

ServiceZoneRequest has an internal ZoneType field that allows for matching - it's mostly RSS that cares about the distinction, because it wants to launch DNS services before others.

I'll just lump all services into pub services, and make this distinction internally. I agree that it's odd to make illegal state representation possible here.

Done in 8d899c5
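
A minimal sketch of what "make the distinction internally" could look like, using made-up, pared-down types (the real ServiceZoneRequest and ZoneType carry much more information):

```rust
// Hypothetical minimal types, for illustration only.
enum ZoneType {
    InternalDns,
    Nexus,
    CockroachDb,
}

struct ServiceZoneRequest {
    zone_type: ZoneType,
    // ...addresses, dataset info, etc.
}

// Split a single combined `services` list so DNS zones can be launched first.
fn split_for_launch(
    services: Vec<ServiceZoneRequest>,
) -> (Vec<ServiceZoneRequest>, Vec<ServiceZoneRequest>) {
    services
        .into_iter()
        .partition(|s| matches!(s.zone_type, ZoneType::InternalDns))
}
```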

}

fn rss_plan_path() -> std::path::PathBuf {
fn rss_completed_plan_path() -> PathBuf {
Contributor

Tiny nit - based on the name I expected this to return rss-plan-complete.toml, not .marker. Maybe rss_plan_executed_path() or rss_completed_marker_path() would be clearer?

Collaborator Author

Done in d223fb9

for dataset in datasets {
    records
        .entry(dataset.srv())
        .or_insert_with(Vec::new)
Contributor

I was going to suggest .or_default() instead of .or_insert_with(..), but that would require putting a type annotation on records. I might still do that, but that puts me much more on the fence. Up to you.

Collaborator Author

Done in c53a0bf
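
For reference, the trade-off discussed above as a standalone sketch (the map and key names are made up):

```rust
use std::collections::HashMap;

fn main() {
    // With `or_insert_with`, the value type is inferred from `Vec::new`...
    let mut records = HashMap::new();
    records.entry("_cockroach._tcp").or_insert_with(Vec::new).push("AAAA");

    // ...while `or_default()` is terser but needs the type spelled out somewhere.
    let mut records2: HashMap<&str, Vec<&str>> = HashMap::new();
    records2.entry("_cockroach._tcp").or_default().push("AAAA");

    assert_eq!(records, records2);
}
```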

//!
//! ## Sled Plan
//!
//! When RSS starts, it presumably is executing on a single sled, and must
Contributor

I'm not sure what "it presumably is executing on a single sled" means or is getting at. There are instances of RSS running on both scrimlets, right? But only one of them will actually perform setup (whichever one the technician happens to connect to / plug in to).

Collaborator Author

Updated comment in 362f7e1.

I agree with your comment here - we expect the operator to only be initializing the sled via a connection through a single technician port.


impl Plan {
    pub async fn load(log: &Logger) -> Result<Option<Plan>, PlanError> {
        // If we already created a plan for this RSS to allocate
Contributor

I just want to note that eventually, this should probably be written to and retrieved from the bootstore for fault tolerance. The Handoff "switch" can go there also. Obviously there's a lot more work to do on the bootstore front, and I'm not looking to change anything here now (certainly not before MUP), but I wanted to leave a note.

Collaborator Author

I might need a better understanding of the bootstore - how is it expected to be replicated within the rack?

Contributor

Unfortunately bootstore is not well documented yet. It's a replacement for the more complicated reconfiguration protocol in RFD 238 section 5 and is alluded to in RFD 301 with some implied functionality. Once we get past MUP I'll write it up completely in section 5 of RFD 238 or a new RFD.

For now though, you can think of it the following way. There's a coordinator (based in the boot agent) that drives a 2-phase commit with input from Nexus or RSS, given initial setup or standard reconfiguration. Replication is to every sled with the exception of specific key-shares going only to their assigned replicas. Transport is over sprockets.

The intended uses are:

  • Storage of key shares and trust quorum metadata
  • Storage of enough networking information required to enable NTP and bring up CockroachDB
  • Storage of initial plan data from RSS that can be fed into Nexus/CockroachDB on the nexus handoff. This should allow the system to tolerate failures of this initial setup in a more robust manner and prevent conflicting plans in case of those versions. John and I discussed this a bit a while back, but right now it's just a rough idea in my head. Since the plan and key-shares are intimately tied together for rack init, I figured we'd just ship them together from RSS when the user pressed go.

Currently the protocol for the bootstore implementation has been started with a focus on storing key shares. There are some comments related to also storing the plan, but nothing implemented yet. Those changes should be relatively minor from the bootstore perspective.

Collaborator

Thanks for that summary @andrewjstone!

Collaborator

@davepacheco davepacheco left a comment

@smklein thanks so much for the detailed PR description and the PR comments on the code. That made this a lot easier to review. I have some thoughts above but I think they're pretty small.

There's so much goodness here. This makes the rack setup process feel so much more real.

use std::net::{Ipv6Addr, SocketAddrV6};
use uuid::Uuid;

/// Database representation of a Dataset.
Collaborator

Cool. The changes here make sense.

This is unrelated to this PR but I find it a little unintuitive that we call this a Dataset because it seems to encapsulate not just the ZFS dataset, but at least some aspects of the networking for the service running in a zone atop the dataset (its IP, its DNS record type and record name). Is that just historical or is there some intuition I'm missing?

Collaborator Author

That's a fair point; it's basically tightly coupling the concept of the "dataset" with "the service which manages the dataset".

This has been implemented because it fits our current "dataset-managing services", including:

  • CRDB
  • Crucible
  • Clickhouse

Though it's a fair point, it definitely won't apply to all "datasets", and it makes it awkward to describe datasets which aren't managed by an explicit service (such as a dump device, dataset for logs, etc).

Filed #2000

warn!(self.log, "Cannot look up rack: {}", e);
}
}
tokio::time::sleep(std::time::Duration::from_secs(2)).await;
Collaborator

nit: this seems like a hand-rolled version of wait_for_condition...maybe there's no value in commonizing it.

Collaborator Author

I would need to pull wait_for_condition out of test-utils (this is not test-only code) to re-use it, and I would need to modify it to take an infinite duration? I dunno.
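
For readers following along, the hand-rolled loop under discussion amounts to something like the sketch below; the closure here is a stand-in for the datastore lookup, not the crate's actual API.

```rust
use std::time::Duration;

// Illustrative shape of the "wait until the rack is initialized" loop.
async fn wait_for_rack_initialization<F>(lookup_initialized: F)
where
    F: Fn() -> Option<bool>,
{
    loop {
        match lookup_initialized() {
            Some(true) => return, // Rack is initialized; stop waiting.
            Some(false) => (),    // Not yet; try again shortly.
            None => (),           // Lookup failed; the real code logs a warning.
        }
        tokio::time::sleep(Duration::from_secs(2)).await;
    }
}
```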

match result {
Ok(rack) => {
if rack.initialized {
return;
Collaborator

nit: might be worth logging something here at "info" level

Collaborator Author

Added in 4ec1e28

rack_id: Uuid,
services: Vec<Service>,
datasets: Vec<Dataset>,
) -> UpdateResult<Rack> {
Collaborator

I think at one point in the RFD 278 discussion we talked about ensuring that if we got a second RSS initialization request, it must have the same parameters as the first one. Is that checked somehow here or are we deferring that work for now?

Collaborator Author

We aren't checking the parameters right now, but we do bail out early if we have ever completed a call to rack_set_initialized:

// Early exit if the rack has already been initialized.
let rack = rack_dsl::rack
    .filter(rack_dsl::id.eq(rack_id))
    .select(Rack::as_select())
    .get_result_async(&conn)
    .await
    .map_err(|e| {
        TxnError::CustomError(RackInitError::RackUpdate(
            PoolError::from(e),
        ))
    })?;
if rack.initialized {
    info!(log, "Early exit: Rack already initialized");
    return Ok(rack);
}

This means that we won't ever try to create conflicting state, but rather enforce a "first-one-wins" policy.

Comment on lines +70 to +72
/// A partially-initialized Nexus server, which exposes an internal interface,
/// but is not ready to receive external requests.
pub struct InternalServer<'a> {
Collaborator

I like this split. I wonder if the name here could better reflect its half-initialized state? The distinction between InternalServer and Server really seems to be that the former is a transient state during Nexus startup where we happen to not have set up the external server yet.

// Configure and run the Bootstrap server.
let bootstrap_config = BootstrapConfig {
id: config.id,
id: Uuid::new_v4(),
Collaborator

I think this is probably beyond the scope of this PR but I wonder if the uuids used for Bootstrap and Sled Agent should come from some hardware identity. Has that been discussed anywhere?

Collaborator Author

Yeah, this is touched upon in #945, admittedly more in the context of the bootstrap address.

use super::config::{HardcodedSledRequest, SetupServiceConfig as Config};
use crate::bootstrap::config::BOOTSTRAP_AGENT_PORT;
//! Rack Setup Service (RSS) implementation
Collaborator

Thank you for this comment! (I wonder if it belongs in sled-agent/src/rack_setup/lib.rs so it'd be more easily found?)

Collaborator Author

Do you mean sled-agent/src/rack_setup/mod.rs ? I'll add doc comments there to help discovery (a659390), but I think this comment is tightly coupled with this file, as the body of service.rs is the implementation described here.


@smklein smklein merged commit 27dfb62 into main Dec 2, 2022
@smklein smklein deleted the 278 branch December 2, 2022 14:48

