Description
sled-agent's baked-in list of datasets that should exist on each U.2 zpool includes a parent .../crypt/zone dataset, which contains all transient zone filesystems as child datasets. This parent dataset is configured to be wiped on boot. This causes sled-agent to recursively destroy it, removing the parent and any leftover transient zone filesystems.
This is both fine and intentional: the zone filesystems are supposed to be transient, so clearing them out on a reboot (or even a sled-agent restart) ensures they are truly transient. However, the dataset ledger now includes properties for each dataset that are supposed to be persistent: every dataset has an ID (which gets set as the oxide:uuid zfs property), a compression level, and optionally may have a quota and/or a reservation. When sled-agent destroys the parent zone/ dataset and its children, it does not restore these properties when it recreates them. The properties will be restored the next time sled-agent gets a PUT /datasets request (e.g., from the blueprint executor RPW, if blueprint execution is enabled), but this shouldn't be necessary: sled-agent already has these properties in its ledger.
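To make the expected behavior concrete, here is a minimal in-memory sketch of "wipe, recreate, and reapply the ledgered properties". It is not sled-agent's actual code: `LedgerProps`, `ZfsProps`, `props_from_ledger`, and `wipe_and_recreate` are hypothetical names, and the maps stand in for real zfs calls and for the real ledger format.

```rust
use std::collections::BTreeMap;

/// Hypothetical view of the persistent properties the ledger records
/// for one dataset (names and types are assumptions for illustration).
#[derive(Clone, Debug, PartialEq)]
struct LedgerProps {
    /// Stored on the dataset as the `oxide:uuid` zfs property.
    id: String,
    compression: String,
    quota: Option<u64>,
    reservation: Option<u64>,
}

/// In-memory stand-in for the zfs properties set on a dataset.
type ZfsProps = BTreeMap<String, String>;

/// Build the properties a freshly recreated dataset should carry.
fn props_from_ledger(ledger: &LedgerProps) -> ZfsProps {
    let mut props = ZfsProps::new();
    props.insert("oxide:uuid".to_string(), ledger.id.clone());
    props.insert("compression".to_string(), ledger.compression.clone());
    if let Some(quota) = ledger.quota {
        props.insert("quota".to_string(), quota.to_string());
    }
    if let Some(reservation) = ledger.reservation {
        props.insert("reservation".to_string(), reservation.to_string());
    }
    props
}

/// Destroy and recreate one transient dataset, restoring whatever the
/// ledger says about it. A dataset with no ledger entry comes back with
/// no properties set, which mirrors today's behavior for every dataset.
fn wipe_and_recreate(
    datasets: &mut BTreeMap<String, ZfsProps>,
    ledger: &BTreeMap<String, LedgerProps>,
    name: &str,
) {
    datasets.remove(name);
    let props = ledger.get(name).map(props_from_ledger).unwrap_or_default();
    datasets.insert(name.to_string(), props);
}
```

The point of the sketch is only that the recreate path consults the ledger directly, instead of waiting for the next PUT /datasets to fill the properties back in.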
Today this bug has minimal practical impact, because we don't set quota/reservation values for transient zone roots, there is no consumer of the oxide:uuid properties other than inventory (and there are no consumers of those inventory properties except a human during debugging), and the compression level sled-agent chooses matches the compression level we assign in the ledger.
In the future, it will be critical that sled-agent restore these properties, particularly something like quota. For example, if we started to set quotas on transient datasets, sled-agent today would destroy+recreate that dataset on startup, fail to set its quota to the value specified in the ledger, and could then fill it with more data than it was supposed to hold. That would be bad enough, but the next time sled-agent received a PUT /datasets, it wouldn't be able to set the quota because the dataset would already be beyond it.
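The failure sequence can be illustrated with a toy model. This is a sketch under the assumption stated above (a quota cannot be set below what the dataset already uses); `Dataset`, `DatasetError`, `write`, and `set_quota` are hypothetical names and do not correspond to any sled-agent or zfs API.

```rust
/// Toy dataset that tracks only used bytes and an optional quota.
#[derive(Debug, Default)]
struct Dataset {
    used: u64,
    quota: Option<u64>,
}

#[derive(Debug, PartialEq)]
enum DatasetError {
    /// A write would push usage past the quota.
    QuotaExceeded,
    /// The requested quota is smaller than what is already used.
    QuotaBelowUsed { used: u64, requested: u64 },
}

impl Dataset {
    /// Write `bytes` of data, enforcing the quota if one is set.
    fn write(&mut self, bytes: u64) -> Result<(), DatasetError> {
        let new_used = self.used + bytes;
        if let Some(quota) = self.quota {
            if new_used > quota {
                return Err(DatasetError::QuotaExceeded);
            }
        }
        self.used = new_used;
        Ok(())
    }

    /// Set the quota; refused if the dataset is already beyond it.
    fn set_quota(&mut self, quota: u64) -> Result<(), DatasetError> {
        if self.used > quota {
            return Err(DatasetError::QuotaBelowUsed {
                used: self.used,
                requested: quota,
            });
        }
        self.quota = Some(quota);
        Ok(())
    }
}
```

In this model, a dataset recreated without its quota accepts a 200-byte write, after which `set_quota(100)` fails. If the quota is applied at creation time instead, the same write is rejected and the dataset never exceeds its limit.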
An idea we've bounced around several times is that sled-agent should merge its three reconfigurator-controlled PUT endpoints (disks, datasets, and zones). While doing this, we should strongly consider doing another thing we've bounced around: make this endpoint much more asynchronous. Validate and accept the new configuration then return immediately, and then internally to sled-agent, execute some kind of reconciler loop to ensure that the actual sled state is made consistent with the requested config. (These have been mentioned in at least #5086, which talks about making omicron-zones async+reconcile, and #7309, which talks about merging the endpoints to fix issues with ordering them.) If we had such a reconciler loop, we should be able to make use of it during sled-agent startup as well, and have it restore all expected state from the ledger, rather than the current behavior which is largely distinct from the ledgered config.
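A rough sketch of the shape this could take is below. Everything in it is an assumption for illustration: `SledConfig`, `PutError`, `Reconciler`, and the generation check are hypothetical, the config is reduced to a set of dataset names, and the "loop" is a single synchronous pass rather than a background task.

```rust
use std::collections::BTreeSet;

/// Hypothetical merged config, reduced here to the dataset names that
/// should exist plus a generation number used for validation.
#[derive(Clone, Debug, PartialEq)]
struct SledConfig {
    generation: u64,
    datasets: BTreeSet<String>,
}

#[derive(Debug, PartialEq)]
enum PutError {
    StaleGeneration { current: u64, requested: u64 },
}

#[derive(Debug)]
struct Reconciler {
    /// Last accepted config (what would be written to the ledger).
    desired: Option<SledConfig>,
    /// Stand-in for the datasets that actually exist on the sled.
    actual: BTreeSet<String>,
}

impl Reconciler {
    /// On startup, the ledgered config becomes the desired state, so the
    /// same reconcile pass restores it without waiting for a new PUT.
    fn from_ledger(ledgered: Option<SledConfig>) -> Self {
        Reconciler { desired: ledgered, actual: BTreeSet::new() }
    }

    /// PUT handler: validate, record the config, and return immediately.
    /// No sled state is changed here.
    fn put_config(&mut self, config: SledConfig) -> Result<(), PutError> {
        if let Some(current) = &self.desired {
            if config.generation < current.generation {
                return Err(PutError::StaleGeneration {
                    current: current.generation,
                    requested: config.generation,
                });
            }
        }
        self.desired = Some(config);
        Ok(())
    }

    /// One reconciliation pass: make actual state match desired state.
    /// Returns true if anything had to change.
    fn reconcile_once(&mut self) -> bool {
        let Some(desired) = &self.desired else {
            return false;
        };
        if self.actual == desired.datasets {
            return false;
        }
        self.actual = desired.datasets.clone();
        true
    }
}
```

With this split, the startup path and the PUT path differ only in where the desired config comes from; the code that makes the sled match it is shared.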
We discovered this while investigating #7543. I'll put notes from how we found this in a comment below.