-
Notifications
You must be signed in to change notification settings - Fork 62
[sled-agent] Ensure that datasets get mounted #7887
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
leftwo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we find a mount point that we can't mount our dataset on, we see a message like this:
23:29:07.067Z WARN SledAgent (StorageResources): Disk::new failed
err = Dataset(EnsureDataset(EnsureDatasetError { name: "oxp_aa375496-52b4-494e-ac1b-e44804d88a63/crypt/debug", err: MountEncryptedFsFailed(CommandFailure(CommandFa
ilureInfo { command: "/usr/sbin/zfs mount -l oxp_aa375496-52b4-494e-ac1b-e44804d88a63/crypt/debug", status: ExitStatus(unix_wait_status(256)), stdout: "", stderr: "can
not mount '/pool/ext/aa375496-52b4-494e-ac1b-e44804d88a63/crypt/debug': directory is not empty\\n" })) }))
file = sled-storage/src/resources.rs:445
Should that be an ERRO and not a WARN?
It's pretty fatal in this situation, as other things are not mounted and, depending on what else is on that disk, it could be other zones we needed.
| logctx.cleanup_successful(); | ||
| } | ||
|
|
||
| async fn is_mounted(dataset: &DatasetName) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was worried about the unwrap's here, but it looks like this is a test only function.
Would it make sense to #[cfg(test)]?
Same with unmount() below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already in a mod test block, which is already compiled with #[cfg(test)]
| } | ||
|
|
||
| if !config.name.kind().zoned() && !old_dataset.mounted { | ||
| info!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When would we hit this (I don't see it in the error case I tested)?
Should this be a warning? It seems like if we hit this we just ignore it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is being called when "omicron_config" PUT is called, via datasets_ensure -> datsaet_ensure.
Basically: If someone tries to ask for a dataset, but it is not mounted, we'll try to ensure that it gets mounted.
This is necessary for my test below, where I do the following operations:
- Ask for a dataset to exist
- Unmount that dataset, manually
- Ask for that dataset to exist once more
- Observe that it becomes mounted
Is this likely in practice? no. But it's one case of "the config you asked for hasn't been 100% applied, so we should try to make it happen".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it makes no difference, then I would at least throw a warning or an error. If it ever does end up being called in production, then we would have a better chance of seeing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are a couple ways we could see this in production (one theoretical, one actual but unlikely). Both are related to the fact that we currently don't read the ledgered DatasetConfig on sled-agent startup, so as Sean said above we only enter this codepath in response to a PUT /omicron-config. (Not reading the ledgered dataset config is also the root cause of #7546, I believe.)
During a cold boot, sled-agent will try to mount:
- The
cryptdataset - The
crypt/debugdataset
and then as it starts up each zone, it will mount each zone's durable dataset (if it has one) and transient root dataset (always).
Back to the two ways we could go down this code path:
- (Entirely theoretical) If the
DatasetConfigwe get from Nexus contained datasets other thancrypt,crypt/debug, or a dataset related to a zone, we'd try to mount it here. I don't believeDatasetConfighas any such datasets, though. - (Practical but unlikely) If we got a
PUT /omicron-configafter sled-agent started accepting HTTP requests but before it had launched all its zones, we'd go through this codepath to mount any zone-related datasets that hadn't been mounted yet.
#7884 is also related: sled-agent won't mount support bundle datasets after a cold boot, but none of the above about reading the ledgered DatasetConfig will help with that, because the support bundle datasets aren't in that config.
TL;DR: We could see this in production in some weird cases. I think those cases are related to preexisting bugs (i.e., not reading DatasetConfig on launch and related fallout from that). These aren't particularly bad, and this PR doesn't make them any worse. We have ongoing work to address them, but that work shouldn't be an R14 blocker.
I could go either way on this - it is technically not fatal for the sled agent, but it does lead to us not using the disk? Regardless, I'm happy to make this an "error" to make it more visible in the logs? |
Yeah, it's fatal for the overall control plane if there are other things on this disk that would have been mounted, but not for sled agent specifically. I wanted it an error to make it easier to find for the poor soul debugging this. It might not be obvious what the problem is at first glances, and having an |
Done in a3a7215 |
leftwo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think we should line this up with 7888 and have them land in quick succession.
And, also be sure that we know what to do on dogfood (and colo) (and customers) so we are ready when these changes land on those racks.
Related to #7887 - we have seen that datasets nested within encrypted datasets are not automatically mounted on reboot. To mitigate this issue: whenever we perform an operation which attempts to read the dataset, ensure that it is mounted. To do this operation, some minor refactors were made to mounting-related code. Fixes #7884
Partial fix of #7874 and #4203
The
debugdataset has not been mounted post-reboot due to logic issues outlined in #7874 (comment). This PR resolves those issues, so that the sled agent attempts to mount non-zoned datasets.Note: This PR makes no attempt to delete existing crypt/debug directories.
If files exist under the desired mountpoint, this PR makes no attempt to remove them.