Description
After a fresh deployment of a4x2, I found that many datasets in the system's initial blueprint have non-NULL (and misleading) address and port fields in the database. (This is very far removed from the initial symptoms so I'll jump to the root cause here and put the consequences / debugging process into a separate comment.) I think the problem is here:
omicron/sled-agent/src/rack_setup/service.rs
Lines 1542 to 1569 in c140817
for d in sled_config.datasets.datasets.values() {
    // Only the "Crucible" dataset needs to know the address
    let address = sled_config.zones.iter().find_map(|z| {
        if let BlueprintZoneType::Crucible(
            blueprint_zone_type::Crucible { address, dataset },
        ) = &z.zone_type
        {
            if &dataset.pool_name == d.name.pool() {
                return Some(*address);
            }
        };
        None
    });
    datasets.insert(
        d.id,
        BlueprintDatasetConfig {
            disposition: BlueprintDatasetDisposition::InService,
            id: d.id,
            pool: d.name.pool().clone(),
            kind: d.name.dataset().clone(),
            address,
            compression: d.inner.compression,
            quota: d.inner.quota,
            reservation: d.inner.reservation,
        },
    );
}
This code takes the `DatasetsConfig` that was generated during RSS and converts it into a `BlueprintDatasetsConfig` that will become the rack's initial blueprint. The blueprint struct has space for a socket address (IP address and TCP port), which is only used for one kind of dataset: the persistent dataset of a Crucible zone. That address is not in `DatasetsConfig`, so this code has to fill it in from the zone information. For each dataset in `DatasetsConfig`, it does this by looking for any zone of type "Crucible" on the same pool, without checking what kind of dataset it is populating. If it finds one, it populates the new `BlueprintDatasetConfig` for this dataset with the socket address of that Crucible zone. I think this is just wrong. As an example, my system has these datasets on this pool:
oxp_15b53b30-72cf-4edb-a7c4-325ee3f7c679/crucible
oxp_15b53b30-72cf-4edb-a7c4-325ee3f7c679/crypt/debug
oxp_15b53b30-72cf-4edb-a7c4-325ee3f7c679/crypt/zone
oxp_15b53b30-72cf-4edb-a7c4-325ee3f7c679/crypt/zone/oxz_crucible_049d9f96-6e06-43a0-a924-35146efd7b8c
oxp_15b53b30-72cf-4edb-a7c4-325ee3f7c679/crypt/zone/oxz_ntp_2b3c2cf8-bf97-4a7c-9327-712f1d589c7b
That's one Crucible zone's persistent dataset, a debug dataset, and a couple of transient zone root filesystems. In the initial blueprint, all of these have the same IP address and port (the one from the Crucible zone):
root@[fd00:1122:3344:102::3]:32221/omicron> select ip,port,id,kind,zone_name from bp_omicron_dataset where pool_id = '15b53b30-72cf-4edb-a7c4-325ee3f7c679' AND blueprint_id = '831679c9-26f8-4e3b-9873-e2522cfdc087';
ip | port | id | kind | zone_name
------------------------+-------+--------------------------------------+-----------+----------------------------------------------------
fd00:1122:3344:101::a | 32345 | 43a80037-e23f-44be-84eb-bb30bd1f539e | zone | oxz_ntp_2b3c2cf8-bf97-4a7c-9327-712f1d589c7b
fd00:1122:3344:101::a | 32345 | 6f610524-4329-4634-adab-ffbd6f65a653 | debug | NULL
fd00:1122:3344:101::a | 32345 | 801a8141-9e83-4cc0-9428-fb1db210657d | zone | oxz_crucible_049d9f96-6e06-43a0-a924-35146efd7b8c
fd00:1122:3344:101::a | 32345 | aff65822-a39d-4b21-9b1e-d94ba1688057 | crucible | NULL
fd00:1122:3344:101::a | 32345 | cac4df64-07ac-4266-9c73-822fb620ff9f | zone_root | NULL
(5 rows)
I believe this is wrong because the IP/port fields are supposed to be NULL for datasets other than a Crucible zone's persistent dataset. It's also misleading because if you didn't know that, you might reasonably think that the value for the NTP zone's dataset there is the IP of the NTP zone (for example), but it's not.
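One possible direction for a fix is to look up the Crucible zone's address only when the dataset being converted is itself of kind Crucible. The following is a minimal sketch of that check, not the actual patch: `DatasetKind`, `Dataset`, `ZoneType`, and `dataset_address` are simplified, hypothetical stand-ins for the real omicron types, and the real code may need to match on more than the pool name.

```rust
use std::net::SocketAddrV6;

// Hypothetical, simplified stand-ins for the real omicron types.
#[derive(Clone, Debug, PartialEq)]
enum DatasetKind {
    Crucible,
    Debug,
    TransientZoneRoot,
    TransientZone { name: String },
}

struct Dataset {
    pool: String,
    kind: DatasetKind,
}

enum ZoneType {
    // A Crucible zone with its socket address and the pool that holds
    // its persistent dataset.
    Crucible { address: SocketAddrV6, pool: String },
    Ntp,
}

/// Returns the address to record for dataset `d`.
///
/// Only a dataset whose own kind is `Crucible` gets an address. Every
/// other kind gets `None`, even when a Crucible zone lives on the same pool.
fn dataset_address(d: &Dataset, zones: &[ZoneType]) -> Option<SocketAddrV6> {
    if d.kind != DatasetKind::Crucible {
        return None;
    }
    zones.iter().find_map(|z| match z {
        ZoneType::Crucible { address, pool } if *pool == d.pool => {
            Some(*address)
        }
        _ => None,
    })
}
```

With this check in place, the debug and zone root datasets in the example above would get NULL for `ip` and `port`, and only the `crucible` row would carry the Crucible zone's address.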