[sled-agent] Ensure that datasets get mounted #7887

smklein · 2025-03-27T23:07:50Z

Partial fix of #7874 and #4203

The debug dataset has not been mounted post-reboot due to logic issues outlined in #7874 (comment). This PR resolves those issues, so that the sled agent attempts to mount non-zoned datasets.

Note: This PR makes no attempt to delete existing crypt/debug directories.
If files exist under the desired mountpoint, this PR makes no attempt to remove them.

leftwo

When we find a mount point that we can't mount our dataset on, we see a message like this:

23:29:07.067Z WARN SledAgent (StorageResources): Disk::new failed
    err = Dataset(EnsureDataset(EnsureDatasetError { name: "oxp_aa375496-52b4-494e-ac1b-e44804d88a63/crypt/debug", err: MountEncryptedFsFailed(CommandFailure(CommandFa
ilureInfo { command: "/usr/sbin/zfs mount -l oxp_aa375496-52b4-494e-ac1b-e44804d88a63/crypt/debug", status: ExitStatus(unix_wait_status(256)), stdout: "", stderr: "can
not mount '/pool/ext/aa375496-52b4-494e-ac1b-e44804d88a63/crypt/debug': directory is not empty\\n" })) }))
    file = sled-storage/src/resources.rs:445

Should that be an ERRO and not a WARN?
It's pretty fatal in this situation, as other things are not mounted and, depending on what else is on that disk, it could be other zones we needed.

leftwo · 2025-03-28T23:35:55Z

sled-storage/src/manager.rs

        logctx.cleanup_successful();
    }

+    async fn is_mounted(dataset: &DatasetName) -> bool {


I was worried about the unwrap's here, but it looks like this is a test only function.
Would it make sense to #[cfg(test)]?

Same with unmount() below

This is already in a mod test block, which is already compiled with #[cfg(test)]

leftwo · 2025-03-28T23:40:40Z

sled-storage/src/manager.rs

        }
+
+        if !config.name.kind().zoned() && !old_dataset.mounted {
+            info!(


When would we hit this (I don't see it in the error case I tested)?
Should this be a warning? It seems like if we hit this we just ignore it.

This is being called when "omicron_config" PUT is called, via datasets_ensure -> datsaet_ensure.

Basically: If someone tries to ask for a dataset, but it is not mounted, we'll try to ensure that it gets mounted.

This is necessary for my test below, where I do the following operations:

Ask for a dataset to exist

Unmount that dataset, manually

Ask for that dataset to exist once more

Observe that it becomes mounted

Is this likely in practice? no. But it's one case of "the config you asked for hasn't been 100% applied, so we should try to make it happen".

If it makes no difference, then I would at least throw a warning or an error. If it ever does end up being called in production, then we would have a better chance of seeing it.

I think there are a couple ways we could see this in production (one theoretical, one actual but unlikely). Both are related to the fact that we currently don't read the ledgered DatasetConfig on sled-agent startup, so as Sean said above we only enter this codepath in response to a PUT /omicron-config. (Not reading the ledgered dataset config is also the root cause of #7546, I believe.)

During a cold boot, sled-agent will try to mount:

The crypt dataset

The crypt/debug dataset

and then as it starts up each zone, it will mount each zone's durable dataset (if it has one) and transient root dataset (always).

Back to the two ways we could go down this code path:

(Entirely theoretical) If the DatasetConfig we get from Nexus contained datasets other than crypt, crypt/debug, or a dataset related to a zone, we'd try to mount it here. I don't believe DatasetConfig has any such datasets, though.

(Practical but unlikely) If we got a PUT /omicron-config after sled-agent started accepting HTTP requests but before it had launched all its zones, we'd go through this codepath to mount any zone-related datasets that hadn't been mounted yet.

#7884 is also related: sled-agent won't mount support bundle datasets after a cold boot, but none of the above about reading the ledgered DatasetConfig will help with that, because the support bundle datasets aren't in that config.

TL;DR: We could see this in production in some weird cases. I think those cases are related to preexisting bugs (i.e., not reading DatasetConfig on launch and related fallout from that). These aren't particularly bad, and this PR doesn't make them any worse. We have ongoing work to address them, but that work shouldn't be an R14 blocker.

smklein · 2025-03-31T19:00:38Z

When we find a mount point that we can't mount our dataset on, we see a message like this:
23:29:07.067Z WARN SledAgent (StorageResources): Disk::new failed
    err = Dataset(EnsureDataset(EnsureDatasetError { name: "oxp_aa375496-52b4-494e-ac1b-e44804d88a63/crypt/debug", err: MountEncryptedFsFailed(CommandFailure(CommandFa
ilureInfo { command: "/usr/sbin/zfs mount -l oxp_aa375496-52b4-494e-ac1b-e44804d88a63/crypt/debug", status: ExitStatus(unix_wait_status(256)), stdout: "", stderr: "can
not mount '/pool/ext/aa375496-52b4-494e-ac1b-e44804d88a63/crypt/debug': directory is not empty\\n" })) }))
    file = sled-storage/src/resources.rs:445
Should that be an ERRO and not a WARN? It's pretty fatal in this situation, as other things are not mounted and, depending on what else is on that disk, it could be other zones we needed.

I could go either way on this - it is technically not fatal for the sled agent, but it does lead to us not using the disk?

Regardless, I'm happy to make this an "error" to make it more visible in the logs?

leftwo · 2025-03-31T20:39:25Z

When we find a mount point that we can't mount our dataset on, we see a message like this:
23:29:07.067Z WARN SledAgent (StorageResources): Disk::new failed
    err = Dataset(EnsureDataset(EnsureDatasetError { name: "oxp_aa375496-52b4-494e-ac1b-e44804d88a63/crypt/debug", err: MountEncryptedFsFailed(CommandFailure(CommandFa
ilureInfo { command: "/usr/sbin/zfs mount -l oxp_aa375496-52b4-494e-ac1b-e44804d88a63/crypt/debug", status: ExitStatus(unix_wait_status(256)), stdout: "", stderr: "can
not mount '/pool/ext/aa375496-52b4-494e-ac1b-e44804d88a63/crypt/debug': directory is not empty\\n" })) }))
    file = sled-storage/src/resources.rs:445
Should that be an ERRO and not a WARN? It's pretty fatal in this situation, as other things are not mounted and, depending on what else is on that disk, it could be other zones we needed.
I could go either way on this - it is technically not fatal for the sled agent, but it does lead to us not using the disk?

Regardless, I'm happy to make this an "error" to make it more visible in the logs?

Yeah, it's fatal for the overall control plane if there are other things on this disk that would have been mounted, but not for sled agent specifically.

I wanted it an error to make it easier to find for the poor soul debugging this. It might not be obvious what the problem is at first glances, and having an ERRO in the logs might get some attention.

smklein · 2025-03-31T20:51:15Z

I wanted it an error to make it easier to find for the poor soul debugging this. It might not be obvious what the problem is at first glances, and having an ERRO in the logs might get some attention.

Done in a3a7215

leftwo

I do think we should line this up with 7888 and have them land in quick succession.
And, also be sure that we know what to do on dogfood (and colo) (and customers) so we are ready when these changes land on those racks.

Related to #7887 - we have seen that datasets nested within encrypted datasets are not automatically mounted on reboot. To mitigate this issue: whenever we perform an operation which attempts to read the dataset, ensure that it is mounted. To do this operation, some minor refactors were made to mounting-related code. Fixes #7884

smklein added 4 commits March 27, 2025 16:06

[sled-agent] Ensure that datasets get mounted

4b50beb

docs

8627221

Update tests to less-confusing values

5fb7507

We dont have legacy mountpoints, but deal with them too

a3d8313

askfongjojo mentioned this pull request Mar 28, 2025

keep the system from running out of disk space #7875

Open

18 tasks

morlandi7 added this to the 14 milestone Mar 28, 2025

morlandi7 assigned smklein Mar 28, 2025

leftwo reviewed Mar 29, 2025

View reviewed changes

smklein added 2 commits March 31, 2025 10:55

Merge branch 'main' into do-more-mounting

a951a2b

Disk::new failing can be an error

a3a7215

leftwo approved these changes Mar 31, 2025

View reviewed changes

smklein added 2 commits April 1, 2025 08:34

Merge branch 'main' into do-more-mounting

b3a7bb4

Merge branch 'main' into do-more-mounting

c3d55db

jgallagher approved these changes Apr 3, 2025

View reviewed changes

smklein merged commit 58f4fc3 into main Apr 3, 2025
16 checks passed

smklein deleted the do-more-mounting branch April 3, 2025 20:38

smklein mentioned this pull request Apr 8, 2025

[sled-agent] Ensure Support Bundles datasets are mounted #7938

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[sled-agent] Ensure that datasets get mounted #7887

[sled-agent] Ensure that datasets get mounted #7887

Uh oh!

smklein commented Mar 27, 2025

Uh oh!

leftwo left a comment

Uh oh!

leftwo Mar 28, 2025

Uh oh!

smklein Mar 31, 2025

Uh oh!

leftwo Mar 28, 2025

Uh oh!

smklein Mar 31, 2025

Uh oh!

leftwo Mar 31, 2025

Uh oh!

jgallagher Apr 3, 2025

Uh oh!

smklein commented Mar 31, 2025

Uh oh!

leftwo commented Mar 31, 2025

Uh oh!

smklein commented Mar 31, 2025

Uh oh!

leftwo left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

[sled-agent] Ensure that datasets get mounted #7887

[sled-agent] Ensure that datasets get mounted #7887

Uh oh!

Conversation

smklein commented Mar 27, 2025

Uh oh!

leftwo left a comment

Choose a reason for hiding this comment

Uh oh!

leftwo Mar 28, 2025

Choose a reason for hiding this comment

Uh oh!

smklein Mar 31, 2025

Choose a reason for hiding this comment

Uh oh!

leftwo Mar 28, 2025

Choose a reason for hiding this comment

Uh oh!

smklein Mar 31, 2025

Choose a reason for hiding this comment

Uh oh!

leftwo Mar 31, 2025

Choose a reason for hiding this comment

Uh oh!

jgallagher Apr 3, 2025

Choose a reason for hiding this comment

Uh oh!

smklein commented Mar 31, 2025

Uh oh!

leftwo commented Mar 31, 2025

Uh oh!

smklein commented Mar 31, 2025

Uh oh!

leftwo left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants