[sled agent] Store zone filesystems on U.2s, not the ramdisk #3557
| // Collect all datasets for ramdisk-based Oxide zones, | ||
| // if any exist. |
We aren't using these, so stop trying to delete 'em
| #[derive(thiserror::Error, Debug)] | ||
| pub enum DestroyDatasetErrorVariant { | ||
| #[error("Dataset not found")] | ||
| NotFound, | ||
| #[error(transparent)] | ||
| Other(crate::ExecutionError), | ||
| } | ||
|
|
We have a more specific error here so that the caller can tell when we tried to delete a dataset that doesn't exist. (Go figure, but it's relevant now because we delete the zone datasets on boot.)
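To illustrate why the distinct `NotFound` variant helps, here is a minimal sketch of a boot-time cleanup loop that treats a missing dataset as "already done". Everything in it is an assumption for illustration: `DestroyError` is a simplified stand-in for `DestroyDatasetErrorVariant` (the real enum derives `thiserror::Error` and wraps `crate::ExecutionError`), and `cleanup_on_boot` and its `destroy` callback are hypothetical names, not functions from this PR.

```rust
// Simplified, hypothetical stand-in for `DestroyDatasetErrorVariant`.
#[derive(Debug)]
pub enum DestroyError {
    // The dataset did not exist when we tried to destroy it.
    NotFound,
    // Any other failure (the real variant wraps `crate::ExecutionError`).
    Other(String),
}

// Hypothetical boot-time cleanup: try to destroy every named dataset.
// A dataset that is already gone is not an error; anything else is.
// Returns the number of datasets that were actually destroyed.
pub fn cleanup_on_boot<F>(names: &[&str], mut destroy: F) -> Result<usize, DestroyError>
where
    F: FnMut(&str) -> Result<(), DestroyError>,
{
    let mut destroyed = 0;
    for &name in names {
        match destroy(name) {
            Ok(()) => destroyed += 1,
            // Nothing to clean up for this dataset; keep going.
            Err(DestroyError::NotFound) => {}
            // A real failure: stop and report it to the caller.
            Err(e) => return Err(e),
        }
    }
    Ok(destroyed)
}
```

Without the separate variant, the caller would have to inspect an error message to tell "nothing to delete" apart from a genuine failure.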
| // Before we start creating zones, we need to ensure that the | ||
| // necessary ZFS and Zone resources are ready. |
Removing this because we should be off the ramdisk.
| let mut rng = rand::rngs::StdRng::from_entropy(); | ||
| let root = inner | ||
| .storage | ||
| .all_u2_mountpoints(ZONE_DATASET) | ||
| .await | ||
| .choose(&mut rng) | ||
| .ok_or_else(|| Error::U2NotFound)? | ||
| .clone(); |
We do this in a couple spots, but our policy is roughly the same:
- List all the U.2 "ZONE_DATASET" mountpoints
- Pick one randomly
We aren't accounting for space here, but we also delete these on reboot
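The policy above can be sketched without any external crates. This is only an approximation of the diff: the PR itself uses `rand`'s `choose` on the result of `all_u2_mountpoints(ZONE_DATASET)`, whereas this sketch swaps in the standard library's randomly seeded `RandomState` as a stand-in source of randomness. The function name `pick_zone_root` is hypothetical.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};
use std::path::PathBuf;

// Hypothetical helper: pick one candidate mountpoint at random.
// Free space is deliberately ignored, matching the policy described above.
// Returns `None` when no U.2 mountpoints are available.
pub fn pick_zone_root(mountpoints: &[PathBuf]) -> Option<PathBuf> {
    if mountpoints.is_empty() {
        return None;
    }
    // `RandomState::new()` is seeded with random keys, so hashing nothing
    // still yields an unpredictable value. This is a std-only stand-in
    // for `rand`, not a high-quality uniform sampler.
    let seed = RandomState::new().build_hasher().finish();
    let idx = (seed % mountpoints.len() as u64) as usize;
    Some(mountpoints[idx].clone())
}
```

The `None` case corresponds to the `Error::U2NotFound` path in the diff, which the caller reaches through `ok_or_else`.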
| #[cfg(not(test))] | ||
| use crate::instance::Instance; | ||
| #[cfg(test)] | ||
| use crate::instance::MockInstance as Instance; | ||
|
|
I'm ripping off this band-aid a little early, but these mock-based tests are terrible (see #2422), and rather than making them work with this U.2 integration, I got rid of some of them.
| SWITCH_ZONE_BOOTSTRAP_IP, | ||
| vec![], | ||
| StorageManager::new(&log, storage_key_requester).await, | ||
| StorageResources::new_for_test(), |
We don't need access to the full StorageManager here -- we aren't going to be adding new disks here -- we just want to know about "what U.2s exist on the system", which is a smaller interface.
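As a rough sketch of what a "smaller interface" could look like, the trait below exposes only the question "which U.2 mountpoints exist for this dataset". All names here (`U2Mountpoints`, `FixedStorage`, `count_candidate_roots`, and the example paths) are hypothetical; the PR's actual type is `StorageResources`, and its `all_u2_mountpoints` is async, which this sketch simplifies to a synchronous call.

```rust
use std::path::PathBuf;

// Hypothetical read-only view of storage: callers can list U.2
// mountpoints but cannot add or remove disks.
pub trait U2Mountpoints {
    fn all_u2_mountpoints(&self, dataset: &str) -> Vec<PathBuf>;
}

// Hypothetical fixed implementation, similar in spirit to a
// test-only constructor: the set of pools never changes.
pub struct FixedStorage {
    pub pool_roots: Vec<PathBuf>,
}

impl U2Mountpoints for FixedStorage {
    fn all_u2_mountpoints(&self, dataset: &str) -> Vec<PathBuf> {
        // Each pool contributes one mountpoint: <pool root>/<dataset>.
        self.pool_roots.iter().map(|root| root.join(dataset)).collect()
    }
}

// A consumer that depends only on the narrow interface.
pub fn count_candidate_roots(storage: &dyn U2Mountpoints, dataset: &str) -> usize {
    storage.all_u2_mountpoints(dataset).len()
}
```

A consumer written against the narrow trait can be exercised in tests with a fixed list of pools and no disk-management machinery.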
andrewjstone
left a comment
So nice to have this change go in!
1. Moving the zones onto the U.2 devices (#3557), real or synthetic, results in the paths of all the zones changing, which results in the paths of all their logs changing. Updated the deploy.sh job to look in the new spot for logs, so that we can find:
2. The end-to-end test is failing[^1] because Nexus is returning a 500 on disk creation, because [Nexus cannot contact the Crucible downstairs](https://buildomat.eng.oxide.computer/wg/0/artefact/01H5ED4P9ZPW22RMY4BEDV0X6Q/VZmMOazlZARWMoMr6qgqt59i4NHEwei5lZ4Ds8d5TJLKdbd2/01H5ED53S5T9XSX4PXS7K6GZ1S/01H5EGRG8XW9GWBQ6ZQXP93WPD/oxide-nexus:default.log?format=x-bunyan#L3759), because [the Crucible agent is repeatedly panicking because it cannot create a dataset, because the zpool is out of space](https://buildomat.eng.oxide.computer/wg/0/artefact/01H5ED4P9ZPW22RMY4BEDV0X6Q/VZmMOazlZARWMoMr6qgqt59i4NHEwei5lZ4Ds8d5TJLKdbd2/01H5ED53S5T9XSX4PXS7K6GZ1S/01H5EGRF4V6N2XS8TXN2B6CK15/oxide-crucible-agent:default.log?format=x-bunyan#L93). We attempt to rectify the issue by increasing the size of the synthetic drives in create_virtual_hardware.sh.
3. It is possible that we are hitting this limit for the first time because Crucible as of #3646 reserves more space.

(We should also switch the deploy job to using real disks, instead of tmpfs, for these datasets. But that will not be part of this PR.)

[^1]: Not always; some commits are evidently lucky.
rpool/zones
Fixes #3533