Skip to content

Conversation

@smklein
Copy link
Collaborator

@smklein smklein commented Jul 11, 2023

  • Removes all references to rpool/zones
  • Exclusively stores zone filesystems on the U.2, under an encrypted dataset
  • Wipes these datasets from the U.2s when parsing them on sled agent boot
  • Tangentially related: Removes a handful of low-quality Sled Agent tests, reliant heavily on mock interfaces (see [sled agent] Fakes are better than mocks; get rid of mocks #2422 , though we still have work to improve this test coverage).

Fixes #3533

Comment on lines -374 to -375
// Collect all datasets for ramdisk-based Oxide zones,
// if any exist.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We aren't using these, so stop trying to delete 'em

Comment on lines +24 to +31
#[derive(thiserror::Error, Debug)]
pub enum DestroyDatasetErrorVariant {
#[error("Dataset not found")]
NotFound,
#[error(transparent)]
Other(crate::ExecutionError),
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a more specific error here so that the caller can get a better idea of if we tried to delete a dataset that doesn't exist (I mean, go figure, but it's relevant now because we're doing this on boot for the zone datasets).

Comment on lines -344 to -345
// Before we start creating zones, we need to ensure that the
// necessary ZFS and Zone resources are ready.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this because we should be off the ramdisk.

Comment on lines +830 to +837
let mut rng = rand::rngs::StdRng::from_entropy();
let root = inner
.storage
.all_u2_mountpoints(ZONE_DATASET)
.await
.choose(&mut rng)
.ok_or_else(|| Error::U2NotFound)?
.clone();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do this in a couple spots, but our policy is roughly the same:

  • List all the U.2 "ZONE_DATASET" mountpoints
  • Pick one randomly

We aren't accounting for space here, but we also delete these on reboot

Comment on lines -23 to -27
#[cfg(not(test))]
use crate::instance::Instance;
#[cfg(test)]
use crate::instance::MockInstance as Instance;

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ripping off this band-aid a little early, but these mock-based tests are terrible (see: #2422 ) and rather than making them work with this U.2 integration, I got rid of some of them.

SWITCH_ZONE_BOOTSTRAP_IP,
vec![],
StorageManager::new(&log, storage_key_requester).await,
StorageResources::new_for_test(),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need access to the full StorageManager here -- we aren't going to be adding new disks here -- we just want to know about "what U.2s exist on the system", which is a smaller interface.

@smklein smklein marked this pull request as ready for review July 13, 2023 21:33
Copy link
Contributor

@andrewjstone andrewjstone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So nice to have this change go in!

@smklein smklein enabled auto-merge (squash) July 14, 2023 01:56
@smklein smklein merged commit f4ad52d into main Jul 14, 2023
@smklein smklein deleted the u2-zones branch July 14, 2023 16:46
@iliana iliana mentioned this pull request Jul 16, 2023
iliana added a commit that referenced this pull request Jul 16, 2023
1. Moving the zones onto the U.2 devices (#3557), real or synthetic,
results in the paths of all the zones changing, which results in the
paths of all their logs changing. Updated the deploy.sh job to look in
the new spot for logs, so that we can find:
2. The end-to-end test is failing[^1] because Nexus is returning a 500
on disk creation, because [Nexus cannot contact the Crucible
downstairs](https://buildomat.eng.oxide.computer/wg/0/artefact/01H5ED4P9ZPW22RMY4BEDV0X6Q/VZmMOazlZARWMoMr6qgqt59i4NHEwei5lZ4Ds8d5TJLKdbd2/01H5ED53S5T9XSX4PXS7K6GZ1S/01H5EGRG8XW9GWBQ6ZQXP93WPD/oxide-nexus:default.log?format=x-bunyan#L3759),
because [the Crucible agent is repeatedly panicking because it cannot
create a dataset, because the zpool is out of
space](https://buildomat.eng.oxide.computer/wg/0/artefact/01H5ED4P9ZPW22RMY4BEDV0X6Q/VZmMOazlZARWMoMr6qgqt59i4NHEwei5lZ4Ds8d5TJLKdbd2/01H5ED53S5T9XSX4PXS7K6GZ1S/01H5EGRF4V6N2XS8TXN2B6CK15/oxide-crucible-agent:default.log?format=x-bunyan#L93).
We attempt to rectify the issue by increasing the size of the synthetic
drives in create_virtual_hardware.sh.
3. It is possible that we are hitting this limit for the first time
because Crucible as of #3646 reserves more space.

(We should also switch the deploy job to using real disks, instead of
tmpfs, for these datasets. But that will not be part of this PR.)

[^1]: Not always; some commits are evidently lucky.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

zone logs should not be in rpool

3 participants