Description
While working on some zone bundle improvements, I noticed that the debug datasets we create on the U.2s, which are used for cores and archived logs, are not actually mounted.
When the sled-agent starts up, we create a hierarchy of datasets on each U.2. These are structured like <pool_name>/crypt, with child datasets underneath, such as zone or debug. The <pool_name>/crypt/zone datasets have further children under them, which are the root filesystems for each zone we launch in the control plane. As an example, from the current dogfood rack, we have:
BRM42220009 # zfs list -Ho name -r oxp_d0584f4a-20ba-436d-a75b-7709e80deb79/crypt
oxp_d0584f4a-20ba-436d-a75b-7709e80deb79/crypt
oxp_d0584f4a-20ba-436d-a75b-7709e80deb79/crypt/debug
oxp_d0584f4a-20ba-436d-a75b-7709e80deb79/crypt/zone
oxp_d0584f4a-20ba-436d-a75b-7709e80deb79/crypt/zone/oxz_propolis-server_0cfa0ed5-8ff3-459f-bf22-25cda4faf68a
oxp_d0584f4a-20ba-436d-a75b-7709e80deb79/crypt/zone/oxz_propolis-server_2ee45a51-e813-40ac-92a1-b79e21b51310
oxp_d0584f4a-20ba-436d-a75b-7709e80deb79/crypt/zone/oxz_propolis-server_428b2b5c-b962-4e36-9ef8-4fbd9f2b657e
oxp_d0584f4a-20ba-436d-a75b-7709e80deb79/crypt/zone/oxz_propolis-server_4f73f6c1-99b6-41ee-9570-48a5a7af0f3d
oxp_d0584f4a-20ba-436d-a75b-7709e80deb79/crypt/zone/oxz_propolis-server_5e307afc-678c-4b01-9101-40fb1a0a84b0
oxp_d0584f4a-20ba-436d-a75b-7709e80deb79/crypt/zone/oxz_propolis-server_66298fe0-dc65-4a50-bfb6-5ce3feccea89
oxp_d0584f4a-20ba-436d-a75b-7709e80deb79/crypt/zone/oxz_propolis-server_677ebc1d-048e-424c-8a34-6364a0510bd3
oxp_d0584f4a-20ba-436d-a75b-7709e80deb79/crypt/zone/oxz_propolis-server_699ad227-6387-4acf-bb21-89cb00242143
oxp_d0584f4a-20ba-436d-a75b-7709e80deb79/crypt/zone/oxz_propolis-server_c3723727-0480-4f29-878f-ad8cb786845a
oxp_d0584f4a-20ba-436d-a75b-7709e80deb79/crypt/zone/oxz_propolis-server_e1328433-9194-4eed-993a-b57553200c0f
oxp_d0584f4a-20ba-436d-a75b-7709e80deb79/crypt/zone/oxz_propolis-server_e54bb9b7-4ccd-4e58-a686-2ed68d58b905
oxp_d0584f4a-20ba-436d-a75b-7709e80deb79/crypt/zone/oxz_propolis-server_ebe3de59-b867-470e-bb63-d93357ac5e7d
oxp_d0584f4a-20ba-436d-a75b-7709e80deb79/crypt/zone/oxz_propolis-server_fd81398c-2055-4352-8743-bdbf4d620213
The debug dataset is intended for cores, crash dumps, and also archived logs from the zones. Here is where it's supposed to be mounted:
BRM42220009 # zfs list -Ho name,mountpoint oxp_d0584f4a-20ba-436d-a75b-7709e80deb79/crypt/debug
oxp_d0584f4a-20ba-436d-a75b-7709e80deb79/crypt/debug /pool/ext/d0584f4a-20ba-436d-a75b-7709e80deb79/crypt/debug
And there are indeed directories there:
BRM42220009 # ls /pool/ext/d0584f4a-20ba-436d-a75b-7709e80deb79/crypt/debug
global oxz_crucible_2f294ca1-7a4f-468f-8966-2b7915804729 oxz_crucible_cf3b2d54-5e36-4c93-b44f-8bf36ac98071
oxz_clickhouse_aa646c82-c6d7-4d0c-8401-150130927759 oxz_crucible_5c8c244c-00dc-4b16-aa17-6d9eb4827fab oxz_crucible_ee8bce67-8f8e-4221-97b0-85f1860d66d0
oxz_cockroachdb_a3628a56-6f85-43b5-be50-71d8f0e04877 oxz_crucible_6cec1d60-5c1a-4c1b-9632-2b4bc76bd37c oxz_crucible_f65a6668-1aea-4deb-81ed-191fbe469328
oxz_crucible_04eef8aa-055c-42ab-bdb6-c982f63c9be0 oxz_crucible_7d5e942b-926c-442d-937a-76cc4aa72bf3 oxz_ntp_7529be1c-ca8b-441a-89aa-37166cc450df
oxz_crucible_1a77bd1d-4fd4-4d6c-a105-17f942d94ba6 oxz_crucible_8568c997-fbbb-46a8-8549-b78284530ffc
However, if we look at the dataset that those directories actually belong to, we see this:
BRM42220009 # zfs get -Ho name name /pool/ext/d0584f4a-20ba-436d-a75b-7709e80deb79/crypt/debug/oxz_clickhouse_aa646c82-c6d7-4d0c-8401-150130927759
oxp_d0584f4a-20ba-436d-a75b-7709e80deb79/crypt
I.e., they are in the base crypt dataset, not the expected one at crypt/debug. And in fact, the crypt/debug dataset does not show up in /etc/mnttab:
BRM42220009 # grep "oxp_d0584f4a-20ba-436d-a75b-7709e80deb79/crypt/debug" /etc/mnttab
BRM42220009 #
And just to double-check, the debug datasets are in fact not mounted:
BRM42220009 # zfs list -Ho name,mounted | grep crypt/debug
oxp_0e485ad3-04e6-404b-b619-87d4fea9f5ae/crypt/debug no
oxp_43efdd6d-7419-437a-a282-fc45bfafd042/crypt/debug no
oxp_4c157f35-865d-4310-9d81-c6259cb69293/crypt/debug no
oxp_62a4c68a-2073-42d0-8e49-01f5e8b90cd4/crypt/debug no
oxp_845ff39a-3205-416f-8bda-e35829107c8a/crypt/debug no
oxp_9b61d4b2-66f6-459f-86f4-13d0b8c5d6cf/crypt/debug no
oxp_b252b176-3974-436a-915b-60382b21eb76/crypt/debug no
oxp_b6bdfdaf-9c0d-4b74-926c-49ff3ed05562/crypt/debug no
oxp_d0584f4a-20ba-436d-a75b-7709e80deb79/crypt/debug no
oxp_fd82dcc7-00dd-4d01-826a-937a7d8238fb/crypt/debug no
Also, we appear to still be archiving logs into those debug directories, in the crypt dataset:
BRM42220009 # grep "DumpSetup" $(svcs -L sled-agent) | looker | tail -10
18:45:40.645Z INFO SledAgent (StorageManager): Archiving 1 log files from oxz_clickhouse_aa646c82-c6d7-4d0c-8401-150130927759 zone
file = sled-agent/src/storage/dump_setup.rs:612
18:45:43.103Z INFO SledAgent (StorageManager): Archiving 103 log files from oxz_propolis-server_699ad227-6387-4acf-bb21-89cb00242143 zone
file = sled-agent/src/storage/dump_setup.rs:612
18:45:43.246Z INFO SledAgent (StorageManager): Archiving 103 log files from oxz_propolis-server_08b1679a-68a1-479a-b59c-96a88427e19f zone
file = sled-agent/src/storage/dump_setup.rs:612
18:45:43.277Z INFO SledAgent (StorageManager): Archiving 103 log files from oxz_propolis-server_39579991-0ebc-411a-8057-3f3d73b422b1 zone
file = sled-agent/src/storage/dump_setup.rs:612
18:45:43.309Z INFO SledAgent (StorageManager): Archiving 103 log files from oxz_propolis-server_b75865d6-f068-4ddc-b260-b417c5940ca2 zone
file = sled-agent/src/storage/dump_setup.rs:612
BRM42220009 # date
Wed Oct 4 18:51:37 UTC 2023
And we can see there are many files in the archive directory implied by that last Propolis zone name:
BRM42220009 # ls -1 /pool/ext/**/crypt/debug/oxz_propolis-server_b75865d6-f068-4ddc-b260-b417c5940ca2 | grep -c propolis
132
BRM42220009 # find /pool/ext -type d -name oxz_propolis-server_b75865d6-f068-4ddc-b260-b417c5940ca2
/pool/ext/0e485ad3-04e6-404b-b619-87d4fea9f5ae/crypt/debug/oxz_propolis-server_b75865d6-f068-4ddc-b260-b417c5940ca2
There are also core files in those locations:
BRM42220009 # find /pool/ext -name "*core\.oxz_*" 2> /dev/null
/pool/ext/9b61d4b2-66f6-459f-86f4-13d0b8c5d6cf/crypt/debug/core.oxz_propolis-server_7cf0b20a-9f38-4518-a4df-4e60d2517685.propolis-server.17914.1693505891
^C
To summarize:
- The U.2 debug datasets exist, but are not mounted
- There appear to be existing directories that conflict with the mountpoint of those debug datasets
So it appears that those directories were created at some earlier point, which prevents ZFS from automounting the .../crypt/debug dataset over them. It's not totally clear to me what the right path forward is here. Deleting those directories is needed for the datasets to be automounted, but we also don't necessarily want to blow away any of the existing debug data.
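For anyone who wants to check other sleds for the same condition, here is a rough sketch of a detection helper. It is not part of sled-agent, and the function name find_shadowed_mountpoints is made up for illustration. It assumes its input is the tab-separated output of zfs list -Ho name,mounted,mountpoint, and it only reports candidates: datasets that are not mounted but whose mountpoint directory already has entries in it. It does not move, delete, or mount anything.

```shell
#!/bin/sh
# Hypothetical helper (not part of sled-agent): reads the tab-separated
# output of `zfs list -Ho name,mounted,mountpoint` on stdin and prints
# "dataset<TAB>mountpoint" for every dataset that is not mounted and
# whose mountpoint is an existing, non-empty directory.
find_shadowed_mountpoints() {
    _tab=$(printf '\t')
    while IFS="$_tab" read -r name mounted mountpoint; do
        # Only unmounted datasets are of interest.
        [ "$mounted" = "no" ] || continue
        # Skip mountpoints such as "none" or "legacy", or missing paths.
        [ -d "$mountpoint" ] || continue
        # Report the dataset if the directory already contains entries.
        if [ -n "$(ls -A "$mountpoint" 2>/dev/null)" ]; then
            printf '%s\t%s\n' "$name" "$mountpoint"
        fi
    done
}
```

On a sled this could be fed with something like zfs list -Ho name,mounted,mountpoint | find_shadowed_mountpoints. Any line it prints would be a dataset in the same state as the debug datasets above.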