[fm] Add simple disk diagnoser based on zpool health#10460
Conversation
| pub(super) fn analyze( | ||
| input: &Input, | ||
| builder: &mut SitrepBuilder<'_>, | ||
| ) -> anyhow::Result<()> { |
There was a problem hiding this comment.
So, the whole point of this PR is "to be able to build something here, and re-use it", but ironically the contents of this particular DE is particularly prone to change.
The "short version" of what we're doing:
- Look at inventory, DB state, old sitreps
- Make sure a case exists for each unhealthy zpool, with a corresponding "DiskFact"
- Close old cases if their zpools is now healthy (or expunged)
We're doing this with a jumble of indices, iterations, etc. I think those will change. I think this DE will grow to track other state about these disks. I think each of these cases will potentially grow to have different facts.
|
|
||
| /// Fetch all `fm_case_fact` rows belonging to cases in the given sitrep, | ||
| /// grouped by `case_id`. | ||
| async fn fm_case_facts_read_on_conn( |
There was a problem hiding this comment.
By reading facts alongside cases, there isn't really a need to mark "DE" on the fact table, so I removed it. It's redundant data anyway.
(Figured I'd mention this because it diverges slightly from the DB structure we talked about - but still sorts facts into case-specific buckets, so we can still "parse by the case DE type").
There was a problem hiding this comment.
Hm, I still think it's probably worth including in the DB record as a structured field, even if only for debugging reasons for now.
Also, at some point, I think we are going to probably have to figure out a way to allow multiple DEs to add facts to a case, although we don't have to cross that bridge yet. Consider the example of an ereport.data_loss.possible ereport indicating that a service processor has restarted and will need to be health-checked, as described in RFD 589. Suppose we have a trivial DE for handling data loss reports from SPs by doing a complete health check of that SP. This might open a case, and then request additional health checking of that DE, which might record some facts. Suppose one of those facts includes data that another DE would use to diagnose a fault. We should figure out how that flow will work, although we don't have to in this PR...
There was a problem hiding this comment.
Couldn't each of those DEs just make a duplicate copy of that fact in their own cases? that seems like it helps keep fact lifecycle scoped "per-case" which is what we want.
I really hesitate to include this data "just to have it" because then it means we need to handle the case where "fact.de != fact.case.de", which is an impossible data corruption case we could just avoid by omitting the column
There was a problem hiding this comment.
Specifically with your data-loss case: my main argument is that "facts are associated with cases", regardless of how they're generated.
So: in the case where we have "DE 1 which does something, but wants to write down a fact for a case managed by DE 2" - I think we can make this happen in-memory during sitrep construction, but on-disk, this could look like:
- DE1 has a case C1, queries for data
- (next sitrep) DE1 sees new data for C1, decides to open a case C2 for analysis by a different DE (DE2). It can also pass along a fact for C2
- On-disk: That fact is associated with C2. We could have a "comment" about how it was originally noticed by DE1/C1? But that origination doesn't really matter
There was a problem hiding this comment.
Yeah, what I am getting at here is the question of, how do we expect the "handing off" of data from one DE to analysis by another DE to work if we are expecting that the two DEs will never try to read facts from cases that they don't own.
I think the idea of having DE1 open a new case for DE2, with facts whose schemas come from DE2's fact schemas (as you described in #10460 (comment)), seems like a reasonable approach. That was precisely the kind of thing I was hoping to work out a design for, and I feel like this is a reasonable one.
a3cddcc to
26f2ade
Compare
andrewjstone
left a comment
There was a problem hiding this comment.
It's really exciting to see this coming together!
I think it makes sense to use JSON for payloads in the DB due to the explosion in types as discussed in chat. I wonder about the versioning strategy though. The DEs in Nexus are the only things that need to interpret payloads, but they are essentially client side versioned. During an update, Nexus will not understand new payloads. Do we plan to use a two-phase update model where reporters can't issue newly added reports until a second update, or will DE's just ignore payloads they can't understand?
Nexus is performing an atomic handoff from "old" to "new" before the database can be accessed, right? I don't think we need to worry about a mixed-version Nexus scenario - I believe we'll have "old Nexus, working with old data", then we'll perform handoff, and only worry about "new Nexus working with old + new data, which it can migrate" Regardless, there are a bunch of strategies we could use for doing "fact payload" schema migration:
|
26f2ade to
67b661f
Compare
The first fault management diagnosis engine: opens a case for any
non-Online zpool whose backing physical disk is currently in service
in the control plane, and closes it on recovery or expungement.
Supporting infrastructure introduced along the way:
- DiagnosisEngineKind::Disk variant (Rust + DB enum)
- fm_case_fact child table for per-engine state (one case has 0..N
immutable facts; stable UUIDs across sitreps; participates in
copy-forward + GC like other sitrep child tables)
- CaseBuilder::{add_fact, remove_fact, facts} API
- InServiceDisk nexus-types projection consumed by FM, populated from
the existing zpool_list_all_external_batched datastore method with
policy filtering done in the background task
Schema migration: add-disk-de-and-facts (version 260) adds the 'disk'
enum value and creates fm_case_fact.
67b661f to
793b1ec
Compare
Ah, I must be misunderstanding how payloads get populated. I was presuming that it's possible for the ingester of the payload to write to the database without actually knowing the format of the payload. But if we limit ingestion of new payloads until Nexus is updated, than I agree there is no problem. |
hawkw
left a comment
There was a problem hiding this comment.
Here's an incomplete review focusing on the database models and domain types; I haven't actually gotten as far as the actual diagnosis engine yet. I figured it would be more useful to leave a smaller review sooner rather than waiting to get to the "other half" of this PR.
|
|
||
| /// Fetch all `fm_case_fact` rows belonging to cases in the given sitrep, | ||
| /// grouped by `case_id`. | ||
| async fn fm_case_facts_read_on_conn( |
There was a problem hiding this comment.
Hm, I still think it's probably worth including in the DB record as a structured field, even if only for debugging reasons for now.
Also, at some point, I think we are going to probably have to figure out a way to allow multiple DEs to add facts to a case, although we don't have to cross that bridge yet. Consider the example of an ereport.data_loss.possible ereport indicating that a service processor has restarted and will need to be health-checked, as described in RFD 589. Suppose we have a trivial DE for handling data loss reports from SPs by doing a complete health check of that SP. This might open a case, and then request additional health checking of that DE, which might record some facts. Suppose one of those facts includes data that another DE would use to diagnose a fault. We should figure out how that flow will work, although we don't have to in this PR...
| let mut support_bundles_requested = Vec::new(); | ||
| let mut bundle_data_selections_requested = Vec::new(); | ||
| let mut case_ereports = Vec::new(); | ||
| let mut case_facts = Vec::new(); |
There was a problem hiding this comment.
would be nice to be able to with_capacity this to be as long as the case's facts map...but i also notice we are not doing this for any of the other ones so it's kinda fine i guess...
Split out from #10460 per review feedback. Renames the `Input::cases()` accessor to `Input::open_cases()`. The struct already tracked open and closed-copied-forward cases separately in private fields; this just makes the public accessor name reflect that, and adds a short doc comment pointing at the (crate-private) `closed_cases_copied_forward()` accessor for the other half.
| name: &str, | ||
| pattern: &str, | ||
| ) -> &mut Self { | ||
| pub fn variable_regex(&mut self, name: &str, pattern: &str) -> &mut Self { |
There was a problem hiding this comment.
huh, i guess rustfmt just decided it wanted to change this for...some reason?
hawkw
left a comment
There was a problem hiding this comment.
i'm starting to try and wrap my head around what the diagnosis engine is actually doing. overall, i think this seems good so far, but i want to spend some more time thinking through how it currently works, especially the way it intersects with some of our design decisions around how facts work...
| /// recorded zpool. There can be multiple facts in pathological cases | ||
| /// (e.g., two zpool ids on the same case after a hand-edit); the | ||
| /// diagnoser keeps all of them in its accounting. | ||
| zpool_unhealthy: BTreeMap<ZpoolUuid, Vec<(FactUuid, ZpoolHealth)>>, |
There was a problem hiding this comment.
personally i kind of feel like i might bite the bullet and make this iddqd-y, but i suppose that requires making a lot more structs
There was a problem hiding this comment.
also, nitpickily, i might consider calling it "unhealthy_zpools" or something, since it is an index of zpools by UUID...
There was a problem hiding this comment.
Updated to use iddqd. Not sure if I like this more or less, honestly?
| ) -> anyhow::Result<()> { | ||
| // The disk DE's primary key today is `zpool_id`, so we build a local | ||
| // index keyed by zpool. Future variants of `DiskFact` are welcome to | ||
| // derive their own secondary indices (e.g., by `sled_id` for FMD). |
There was a problem hiding this comment.
the sentence "the disk DE's primary key" feels a bit weird to me, i feel like when i see "primary key" i take that to mean we are talking about a database table..
There was a problem hiding this comment.
Changed the wording here; I really do care about "key" but PK is definitely DB terminology we don't need.
| // Index every zpool we observed in this inventory by ID, so we can | ||
| // distinguish "saw it, it's Online" from "didn't see it at all" below. | ||
| let observed: BTreeMap<ZpoolUuid, ZpoolHealth> = input | ||
| .inventory() | ||
| .sled_agents | ||
| .iter() | ||
| .flat_map(|sa| sa.zpools.iter()) | ||
| .map(|z| (z.id, z.health)) | ||
| .collect(); | ||
|
|
||
| // Currently-faulty, control-plane-managed zpools. | ||
| // | ||
| // Out-of-service zpools are intentionally ignored: a non-`Online` zpool | ||
| // whose disk has been expunged is no longer the control plane's concern. | ||
| let faulty: BTreeMap<ZpoolUuid, ZpoolHealth> = observed | ||
| .iter() | ||
| .filter(|(id, _)| in_service_by_zpool.contains_key(*id)) | ||
| .filter(|(_, h)| **h != ZpoolHealth::Online) | ||
| .map(|(id, h)| (*id, *h)) | ||
| .collect(); |
There was a problem hiding this comment.
hm, okay, this code all makes me feel like ZpoolHealth ought to be a struct with a ZpoolUuid in it, and use iddqd::IdOrdMap for these. Or, perhaps IdMap; I don't think we care about ordering here as we are not printing these out or serializing them.
| // Inspect parent-forwarded Disk cases from the input (i.e., the state | ||
| // copied from the parent sitrep — *not* the in-progress builder, which | ||
| // we will mutate below). Each case's facts are JSON blobs owned by this | ||
| // engine; deserialize each one as DiskFact. Skip (with a warning) any | ||
| // fact we can't read. |
There was a problem hiding this comment.
i don't think it's necessary to restate the explanation of what facts are in this comment, though maybe that's somewhat valuable if this DE is intended as a sort of prototypical example DE. This feels a bit like "claude decided to restate his prompt again" to me though, which always rubs me wrong...but maybe that's just because I'm an old man yelling at claudes.
| // points at zpools that are now Online or expunged. Closed cases are not | ||
| // copied forward, so their facts naturally drop with them. |
There was a problem hiding this comment.
comment about "closed cases" feels unnecessary to me, it's restating the semantics of sitreps which are documented at a higher level. not a big deal but yet again feels claudey in a way that makes me irrationally irritated 🤷♀️
| .close( | ||
| "all ZpoolUnhealthy facts have resolved (zpool back to \ | ||
| Online, or disk no longer in service)", | ||
| ); |
There was a problem hiding this comment.
is it possible to do this in a way where the comment closing the case could actually state the cause (either "zpool back to online" or "disk no longer in service"?)
| } | ||
| let any_still_unhealthy = | ||
| summary.zpool_unhealthy.keys().any(|zpool_id| { | ||
| in_service_by_zpool.contains_key(zpool_id) |
There was a problem hiding this comment.
just making sure I understand this correctly: if a zpool ID is not present in in_service_by_zpool, that is an EXPLICIT, POSITIVE INDICATION that the disk or the sled it is in has been expunged, correct? And if the disk has not been expunged, it will be present in in_service_by_zpool no matter what bad thing happens to it?
I might want a comment here explaining that, (maybe even instead of the above comment that "absence is not a recovery signal"). to the unfamiliar reader, this looks like it's checking for "absence from inventory" rather than "explicit signal of expungement", but my sense is that this is not actually what it's doing?
There was a problem hiding this comment.
looking at the input code, i am increasingly uncomfy about this and am starting to feel like i would rather see us detect that a disk has been decommissioned by maintaining a list of disks which are in the decommissioned state, and only closing the case if the disk is actually in that list, rather than checking that it is absent? but that might be because i don't know the semantics of the physical_disk/zpool tables well enough to know if what we're doing now is safe or not...
There was a problem hiding this comment.
Yes, I am relying on the contents of in_service_by_zpool - really, the contents of in_service_disks, we read during the preparation phase - to be the full set of in-service disks.
I explained in the comment above that we ignore "whether or not the thing is in inventory" as a signal about whether to update the disk - inventory is lossy, could have transient issues, etc.
Disk lifecycle goes through the following phases:
- Blueprint adds disk in-service
- Execution creates a PhysicalDisk row (with policy marked as "in-service", and state as "active")
- (Disk lifetime goes here)
- Blueprint later marks a disk as expunged
- Execution marks that PhysicalDisk row (policy is now "expunged")
- Expungement happens, and the "disk state" eventually is updated to "decommissioned"
- The PhysicalDisk row could presumably be deleted here. It isn't today, but it probably will be in the future. However, the
decommissioned_disk_cleaneris already deleting theZpoolrows! So it's basically just a matter of "CRDB rows are effectively deleted here, more cleanup will happen".
The current PR already finds pools/disks of interest by joining on zpool, and filters by "disks that are in-service". This means:
If a physical disk is in step (2) - (3), we'll treat it as "alive and observable". Otherwise: It either is being expunged, or has been expunged.
i would rather see us detect that a disk has been decommissioned by maintaining a list of disks which are in the decommissioned state, and only closing the case if the disk is actually in that list, rather than checking that it is absent?
Yeah this would be sorta problematic, because now we can't delete zpools / physical disk rows until we have confirmed that all their associated cases have "finished up"! Otherwise, if we do expungement before the case gets to close, it'll be stuck open forever, waiting to see a "positive signal of decommission" that will never arrive.
We hit a bunch of these backward dependencies when we tried going through expungement, and it's a real pain-in-the ass. I have a preference for - when we can - stating: "this is the set of all in-service stuff", which is what we're currently doing here
There was a problem hiding this comment.
okay, thanks for the explanation. i think this makes sense then!
| .expect("unreadable case should be copied forward"); | ||
| assert!( | ||
| unreadable.is_open(), | ||
| "unreadable case must not be closed by the diagnoser", |
There was a problem hiding this comment.
| "unreadable case must not be closed by the diagnoser", | |
| "unreadable case must not be closed by the diagnosis engine", |
hawkw
left a comment
There was a problem hiding this comment.
still trying to grok the de stuff
| if disk.disk_policy != PhysicalDiskPolicy::InService { | ||
| continue; | ||
| } |
There was a problem hiding this comment.
hm. so i am a bit worried here that we are not able to differentiate between disks which have been decommissioned and disks which have just been deleted from the table for "some reason" in the DE code that checks if an unhealthy disk is still in service? but maybe that's fine and we rely on the rest of the system to not mess with this in a way that will make us sad.
There was a problem hiding this comment.
I'm not sure how to reconcile this. The database state is our representation of "what disks we consider to be in-service or not". Even the blueprint is just intent; the database rows actually enact that policy.
Coping with "a disk that gets deleted from CRDB for some reason" is akin to coping with arbitrary database corruption IMO. I am not sure I can reasonably accept inputs from CRDB for this DE if we are trying to model things in a byzantine way.
There was a problem hiding this comment.
yeah, your answer to my subsequent comment made me feel a bit less sketchy about this. i wasn't super familiar with the lifecycle of the disk tables, so walking through it was helpful. i think this is fine, thank you for clarifying stuff!
| let zpools_and_disks = self | ||
| .datastore | ||
| .zpool_list_all_external_batched(opctx) | ||
| .await | ||
| .context("failed to load in-service control plane disks")?; |
There was a problem hiding this comment.
zpool_list_all_external_batched has a comment on it saying, essentially, that this can take a while. i kinda wonder if, since this and the ereport loading code both could load a lot of data and are basically completely isolated from each other, might we want to spawn separate tokio tasks to do the collection of those different inputs in parallel?
There was a problem hiding this comment.
(to be clear, this is not a blocker for this PR, just a thought)
There was a problem hiding this comment.
I think this is probably a good idea. I suspect basically all the "preparation" logic for reading from DB, potentially reading from clickhouse, etc, etc, etc, before we start analysis can be parallelized.
There was a problem hiding this comment.
Also, to be clear, not doing it in this PR
| /// All control plane managed disks | ||
| in_service_disks: Arc<IdOrdMap<InServiceDisk>>, |
There was a problem hiding this comment.
it looks like the list of in service disks never makes it into the AnalysisInputReport. do we want it to, or is that too much data to spit out in every Status object from every activation of the analysis task? might we want to at least summarize it with say, the number of in-service disks? that might help to spot some obviously weird things such as an analysis pass that loaded 0 in service disks...?
There was a problem hiding this comment.
Added in 6ae7bc6 (printing some basic UUID info)
| } | ||
| let any_still_unhealthy = | ||
| summary.zpool_unhealthy.keys().any(|zpool_id| { | ||
| in_service_by_zpool.contains_key(zpool_id) |
There was a problem hiding this comment.
looking at the input code, i am increasingly uncomfy about this and am starting to feel like i would rather see us detect that a disk has been decommissioned by maintaining a list of disks which are in the decommissioned state, and only closing the case if the disk is actually in that list, rather than checking that it is absent? but that might be because i don't know the semantics of the physical_disk/zpool tables well enough to know if what we're doing now is safe or not...
|
Couple thoughts on the DE in particular:
This may be justification for a "sitrep version of blippy/clippy". Slippy. Which validates "this case has facts which are relevant, parseable, and coherent". |
The first fault management diagnosis engine: opens a case for any
non-Online zpool whose backing physical disk is currently in service
in the control plane, and closes it on recovery or expungement.
Supporting infrastructure introduced along the way:
immutable facts; stable UUIDs across sitreps; participates in
copy-forward + GC like other sitrep child tables)
the existing zpool_list_all_external_batched datastore method with
policy filtering done in the background task