[fm] Add simple disk diagnoser based on zpool health by smklein · Pull Request #10460 · oxidecomputer/omicron

smklein · 2026-05-19T00:13:33Z

The first fault management diagnosis engine: opens a case for any
non-Online zpool whose backing physical disk is currently in service
in the control plane, and closes it on recovery or expungement.

Supporting infrastructure introduced along the way:

DiagnosisEngineKind::Disk variant (Rust + DB enum)
fm_case_fact child table for per-engine state (one case has 0..N
immutable facts; stable UUIDs across sitreps; participates in
copy-forward + GC like other sitrep child tables)
CaseBuilder::{add_fact, remove_fact, facts} API
InServiceDisk nexus-types projection consumed by FM, populated from
the existing zpool_list_all_external_batched datastore method with
policy filtering done in the background task

smklein · 2026-05-19T00:36:48Z

+pub(super) fn analyze(
+    input: &Input,
+    builder: &mut SitrepBuilder<'_>,
+) -> anyhow::Result<()> {


So, the whole point of this PR is "to be able to build something here, and re-use it", but ironically the contents of this particular DE is particularly prone to change.

The "short version" of what we're doing:

Look at inventory, DB state, old sitreps

Make sure a case exists for each unhealthy zpool, with a corresponding "DiskFact"

Close old cases if their zpools is now healthy (or expunged)

We're doing this with a jumble of indices, iterations, etc. I think those will change. I think this DE will grow to track other state about these disks. I think each of these cases will potentially grow to have different facts.

smklein · 2026-05-19T01:03:18Z


+    /// Fetch all `fm_case_fact` rows belonging to cases in the given sitrep,
+    /// grouped by `case_id`.
+    async fn fm_case_facts_read_on_conn(


By reading facts alongside cases, there isn't really a need to mark "DE" on the fact table, so I removed it. It's redundant data anyway.

(Figured I'd mention this because it diverges slightly from the DB structure we talked about - but still sorts facts into case-specific buckets, so we can still "parse by the case DE type").

Hm, I still think it's probably worth including in the DB record as a structured field, even if only for debugging reasons for now.

Also, at some point, I think we are going to probably have to figure out a way to allow multiple DEs to add facts to a case, although we don't have to cross that bridge yet. Consider the example of an ereport.data_loss.possible ereport indicating that a service processor has restarted and will need to be health-checked, as described in RFD 589. Suppose we have a trivial DE for handling data loss reports from SPs by doing a complete health check of that SP. This might open a case, and then request additional health checking of that DE, which might record some facts. Suppose one of those facts includes data that another DE would use to diagnose a fault. We should figure out how that flow will work, although we don't have to in this PR...

Couldn't each of those DEs just make a duplicate copy of that fact in their own cases? that seems like it helps keep fact lifecycle scoped "per-case" which is what we want.

I really hesitate to include this data "just to have it" because then it means we need to handle the case where "fact.de != fact.case.de", which is an impossible data corruption case we could just avoid by omitting the column

Specifically with your data-loss case: my main argument is that "facts are associated with cases", regardless of how they're generated.

So: in the case where we have "DE 1 which does something, but wants to write down a fact for a case managed by DE 2" - I think we can make this happen in-memory during sitrep construction, but on-disk, this could look like:

DE1 has a case C1, queries for data

(next sitrep) DE1 sees new data for C1, decides to open a case C2 for analysis by a different DE (DE2). It can also pass along a fact for C2

On-disk: That fact is associated with C2. We could have a "comment" about how it was originally noticed by DE1/C1? But that origination doesn't really matter

Yeah, what I am getting at here is the question of, how do we expect the "handing off" of data from one DE to analysis by another DE to work if we are expecting that the two DEs will never try to read facts from cases that they don't own.

I think the idea of having DE1 open a new case for DE2, with facts whose schemas come from DE2's fact schemas (as you described in #10460 (comment)), seems like a reasonable approach. That was precisely the kind of thing I was hoping to work out a design for, and I feel like this is a reasonable one.

andrewjstone

It's really exciting to see this coming together!

I think it makes sense to use JSON for payloads in the DB due to the explosion in types as discussed in chat. I wonder about the versioning strategy though. The DEs in Nexus are the only things that need to interpret payloads, but they are essentially client side versioned. During an update, Nexus will not understand new payloads. Do we plan to use a two-phase update model where reporters can't issue newly added reports until a second update, or will DE's just ignore payloads they can't understand?

smklein · 2026-05-19T16:27:40Z

During an update, Nexus will not understand new payloads. Do we plan to use a two-phase update model where reporters can't issue newly added reports until a second update, or will DE's just ignore payloads they can't understand?

Nexus is performing an atomic handoff from "old" to "new" before the database can be accessed, right? I don't think we need to worry about a mixed-version Nexus scenario - I believe we'll have "old Nexus, working with old data", then we'll perform handoff, and only worry about "new Nexus working with old + new data, which it can migrate"

Regardless, there are a bunch of strategies we could use for doing "fact payload" schema migration:

We could use the existing DB migration tools, to perform "data-only" migrations (look at all fm_case_fact rows, where diagnosis_engine = x, and where payload->variant = y, and re-write the payload).
We could rely on the re-generation of sitreps to have a phase where we load "old facts, and update them to new fact format". e.g. CaseFact::VariantFoo1 could be read, and in-memory updated to CaseFact::VariantFoo2, which gets written out in the next sitrep.

The first fault management diagnosis engine: opens a case for any non-Online zpool whose backing physical disk is currently in service in the control plane, and closes it on recovery or expungement. Supporting infrastructure introduced along the way: - DiagnosisEngineKind::Disk variant (Rust + DB enum) - fm_case_fact child table for per-engine state (one case has 0..N immutable facts; stable UUIDs across sitreps; participates in copy-forward + GC like other sitrep child tables) - CaseBuilder::{add_fact, remove_fact, facts} API - InServiceDisk nexus-types projection consumed by FM, populated from the existing zpool_list_all_external_batched datastore method with policy filtering done in the background task Schema migration: add-disk-de-and-facts (version 260) adds the 'disk' enum value and creates fm_case_fact.

andrewjstone · 2026-05-19T17:43:45Z

Nexus is performing an atomic handoff from "old" to "new" before the database can be accessed, right? I don't think we need to worry about a mixed-version Nexus scenario - I believe we'll have "old Nexus, working with old data", then we'll perform handoff, and only worry about "new Nexus working with old + new data, which it can migrate"

Ah, I must be misunderstanding how payloads get populated. I was presuming that it's possible for the ingester of the payload to write to the database without actually knowing the format of the payload. But if we limit ingestion of new payloads until Nexus is updated, than I agree there is no problem.

hawkw

Here's an incomplete review focusing on the database models and domain types; I haven't actually gotten as far as the actual diagnosis engine yet. I figured it would be more useful to leave a smaller review sooner rather than waiting to get to the "other half" of this PR.

hawkw · 2026-05-19T17:36:46Z


+    /// Fetch all `fm_case_fact` rows belonging to cases in the given sitrep,
+    /// grouped by `case_id`.
+    async fn fm_case_facts_read_on_conn(


Hm, I still think it's probably worth including in the DB record as a structured field, even if only for debugging reasons for now.

Also, at some point, I think we are going to probably have to figure out a way to allow multiple DEs to add facts to a case, although we don't have to cross that bridge yet. Consider the example of an ereport.data_loss.possible ereport indicating that a service processor has restarted and will need to be health-checked, as described in RFD 589. Suppose we have a trivial DE for handling data loss reports from SPs by doing a complete health check of that SP. This might open a case, and then request additional health checking of that DE, which might record some facts. Suppose one of those facts includes data that another DE would use to diagnose a fault. We should figure out how that flow will work, although we don't have to in this PR...

hawkw · 2026-05-19T18:28:36Z

        let mut support_bundles_requested = Vec::new();
        let mut bundle_data_selections_requested = Vec::new();
        let mut case_ereports = Vec::new();
+        let mut case_facts = Vec::new();


would be nice to be able to with_capacity this to be as long as the case's facts map...but i also notice we are not doing this for any of the other ones so it's kinda fine i guess...

Split out from #10460 per review feedback. Renames the `Input::cases()` accessor to `Input::open_cases()`. The struct already tracked open and closed-copied-forward cases separately in private fields; this just makes the public accessor name reflect that, and adds a short doc comment pointing at the (crate-private) `closed_cases_copied_forward()` accessor for the other half.

hawkw · 2026-05-20T18:33:53Z

-        name: &str,
-        pattern: &str,
-    ) -> &mut Self {
+    pub fn variable_regex(&mut self, name: &str, pattern: &str) -> &mut Self {


huh, i guess rustfmt just decided it wanted to change this for...some reason?

¯_(ツ)_/¯

hawkw

i'm starting to try and wrap my head around what the diagnosis engine is actually doing. overall, i think this seems good so far, but i want to spend some more time thinking through how it currently works, especially the way it intersects with some of our design decisions around how facts work...

hawkw · 2026-05-20T20:19:44Z

+    /// recorded zpool. There can be multiple facts in pathological cases
+    /// (e.g., two zpool ids on the same case after a hand-edit); the
+    /// diagnoser keeps all of them in its accounting.
+    zpool_unhealthy: BTreeMap<ZpoolUuid, Vec<(FactUuid, ZpoolHealth)>>,


personally i kind of feel like i might bite the bullet and make this iddqd-y, but i suppose that requires making a lot more structs

also, nitpickily, i might consider calling it "unhealthy_zpools" or something, since it is an index of zpools by UUID...

Updated to use iddqd. Not sure if I like this more or less, honestly?

hawkw · 2026-05-20T20:21:44Z

+) -> anyhow::Result<()> {
+    // The disk DE's primary key today is `zpool_id`, so we build a local
+    // index keyed by zpool. Future variants of `DiskFact` are welcome to
+    // derive their own secondary indices (e.g., by `sled_id` for FMD).


the sentence "the disk DE's primary key" feels a bit weird to me, i feel like when i see "primary key" i take that to mean we are talking about a database table..

Changed the wording here; I really do care about "key" but PK is definitely DB terminology we don't need.

hawkw · 2026-05-20T20:22:49Z

+    // Index every zpool we observed in this inventory by ID, so we can
+    // distinguish "saw it, it's Online" from "didn't see it at all" below.
+    let observed: BTreeMap<ZpoolUuid, ZpoolHealth> = input
+        .inventory()
+        .sled_agents
+        .iter()
+        .flat_map(|sa| sa.zpools.iter())
+        .map(|z| (z.id, z.health))
+        .collect();
+
+    // Currently-faulty, control-plane-managed zpools.
+    //
+    // Out-of-service zpools are intentionally ignored: a non-`Online` zpool
+    // whose disk has been expunged is no longer the control plane's concern.
+    let faulty: BTreeMap<ZpoolUuid, ZpoolHealth> = observed
+        .iter()
+        .filter(|(id, _)| in_service_by_zpool.contains_key(*id))
+        .filter(|(_, h)| **h != ZpoolHealth::Online)
+        .map(|(id, h)| (*id, *h))
+        .collect();


hm, okay, this code all makes me feel like ZpoolHealth ought to be a struct with a ZpoolUuid in it, and use iddqd::IdOrdMap for these. Or, perhaps IdMap; I don't think we care about ordering here as we are not printing these out or serializing them.

Done in 6ae7bc6

hawkw · 2026-05-20T20:24:10Z

+    // Inspect parent-forwarded Disk cases from the input (i.e., the state
+    // copied from the parent sitrep — *not* the in-progress builder, which
+    // we will mutate below). Each case's facts are JSON blobs owned by this
+    // engine; deserialize each one as DiskFact. Skip (with a warning) any
+    // fact we can't read.


i don't think it's necessary to restate the explanation of what facts are in this comment, though maybe that's somewhat valuable if this DE is intended as a sort of prototypical example DE. This feels a bit like "claude decided to restate his prompt again" to me though, which always rubs me wrong...but maybe that's just because I'm an old man yelling at claudes.

Dropping it.

hawkw · 2026-05-20T20:29:31Z

+    // points at zpools that are now Online or expunged. Closed cases are not
+    // copied forward, so their facts naturally drop with them.


comment about "closed cases" feels unnecessary to me, it's restating the semantics of sitreps which are documented at a higher level. not a big deal but yet again feels claudey in a way that makes me irrationally irritated 🤷‍♀️

hawkw · 2026-05-20T20:33:49Z

hawkw · 2026-05-20T20:39:10Z

+                .close(
+                    "all ZpoolUnhealthy facts have resolved (zpool back to \
+                     Online, or disk no longer in service)",
+                );


is it possible to do this in a way where the comment closing the case could actually state the cause (either "zpool back to online" or "disk no longer in service"?)

hawkw · 2026-05-20T20:42:25Z

+        }
+        let any_still_unhealthy =
+            summary.zpool_unhealthy.keys().any(|zpool_id| {
+                in_service_by_zpool.contains_key(zpool_id)


just making sure I understand this correctly: if a zpool ID is not present in in_service_by_zpool, that is an EXPLICIT, POSITIVE INDICATION that the disk or the sled it is in has been expunged, correct? And if the disk has not been expunged, it will be present in in_service_by_zpool no matter what bad thing happens to it?

I might want a comment here explaining that, (maybe even instead of the above comment that "absence is not a recovery signal"). to the unfamiliar reader, this looks like it's checking for "absence from inventory" rather than "explicit signal of expungement", but my sense is that this is not actually what it's doing?

looking at the input code, i am increasingly uncomfy about this and am starting to feel like i would rather see us detect that a disk has been decommissioned by maintaining a list of disks which are in the decommissioned state, and only closing the case if the disk is actually in that list, rather than checking that it is absent? but that might be because i don't know the semantics of the physical_disk/zpool tables well enough to know if what we're doing now is safe or not...

Yes, I am relying on the contents of in_service_by_zpool - really, the contents of in_service_disks, we read during the preparation phase - to be the full set of in-service disks.

I explained in the comment above that we ignore "whether or not the thing is in inventory" as a signal about whether to update the disk - inventory is lossy, could have transient issues, etc.

Disk lifecycle goes through the following phases:

Blueprint adds disk in-service

Execution creates a PhysicalDisk row (with policy marked as "in-service", and state as "active")

(Disk lifetime goes here)

Blueprint later marks a disk as expunged

Execution marks that PhysicalDisk row (policy is now "expunged")

Expungement happens, and the "disk state" eventually is updated to "decommissioned"

The PhysicalDisk row could presumably be deleted here. It isn't today, but it probably will be in the future. However, the decommissioned_disk_cleaner is already deleting the Zpool rows! So it's basically just a matter of "CRDB rows are effectively deleted here, more cleanup will happen".

The current PR already finds pools/disks of interest by joining on zpool, and filters by "disks that are in-service". This means:

If a physical disk is in step (2) - (3), we'll treat it as "alive and observable". Otherwise: It either is being expunged, or has been expunged.

i would rather see us detect that a disk has been decommissioned by maintaining a list of disks which are in the decommissioned state, and only closing the case if the disk is actually in that list, rather than checking that it is absent?

Yeah this would be sorta problematic, because now we can't delete zpools / physical disk rows until we have confirmed that all their associated cases have "finished up"! Otherwise, if we do expungement before the case gets to close, it'll be stuck open forever, waiting to see a "positive signal of decommission" that will never arrive.

We hit a bunch of these backward dependencies when we tried going through expungement, and it's a real pain-in-the ass. I have a preference for - when we can - stating: "this is the set of all in-service stuff", which is what we're currently doing here

okay, thanks for the explanation. i think this makes sense then!

hawkw · 2026-05-20T20:43:25Z

+            .expect("unreadable case should be copied forward");
+        assert!(
+            unreadable.is_open(),
+            "unreadable case must not be closed by the diagnoser",


Suggested change

"unreadable case must not be closed by the diagnoser",

"unreadable case must not be closed by the diagnosis engine",

updated in 6ae7bc6, here and elsewhere

hawkw

still trying to grok the de stuff

hawkw · 2026-05-20T20:52:51Z

+            if disk.disk_policy != PhysicalDiskPolicy::InService {
+                continue;
+            }


hm. so i am a bit worried here that we are not able to differentiate between disks which have been decommissioned and disks which have just been deleted from the table for "some reason" in the DE code that checks if an unhealthy disk is still in service? but maybe that's fine and we rely on the rest of the system to not mess with this in a way that will make us sad.

I'm not sure how to reconcile this. The database state is our representation of "what disks we consider to be in-service or not". Even the blueprint is just intent; the database rows actually enact that policy.

Coping with "a disk that gets deleted from CRDB for some reason" is akin to coping with arbitrary database corruption IMO. I am not sure I can reasonably accept inputs from CRDB for this DE if we are trying to model things in a byzantine way.

yeah, your answer to my subsequent comment made me feel a bit less sketchy about this. i wasn't super familiar with the lifecycle of the disk tables, so walking through it was helpful. i think this is fine, thank you for clarifying stuff!

hawkw · 2026-05-20T20:53:54Z

+        let zpools_and_disks = self
+            .datastore
+            .zpool_list_all_external_batched(opctx)
+            .await
+            .context("failed to load in-service control plane disks")?;


zpool_list_all_external_batched has a comment on it saying, essentially, that this can take a while. i kinda wonder if, since this and the ereport loading code both could load a lot of data and are basically completely isolated from each other, might we want to spawn separate tokio tasks to do the collection of those different inputs in parallel?

(to be clear, this is not a blocker for this PR, just a thought)

I think this is probably a good idea. I suspect basically all the "preparation" logic for reading from DB, potentially reading from clickhouse, etc, etc, etc, before we start analysis can be parallelized.

Also, to be clear, not doing it in this PR

hawkw · 2026-05-20T20:56:18Z

+    /// All control plane managed disks
+    in_service_disks: Arc<IdOrdMap<InServiceDisk>>,


it looks like the list of in service disks never makes it into the AnalysisInputReport. do we want it to, or is that too much data to spit out in every Status object from every activation of the analysis task? might we want to at least summarize it with say, the number of in-service disks? that might help to spot some obviously weird things such as an analysis pass that loaded 0 in service disks...?

Added in 6ae7bc6 (printing some basic UUID info)

hawkw · 2026-05-20T20:58:22Z

+        }
+        let any_still_unhealthy =
+            summary.zpool_unhealthy.keys().any(|zpool_id| {
+                in_service_by_zpool.contains_key(zpool_id)


looking at the input code, i am increasingly uncomfy about this and am starting to feel like i would rather see us detect that a disk has been decommissioned by maintaining a list of disks which are in the decommissioned state, and only closing the case if the disk is actually in that list, rather than checking that it is absent? but that might be because i don't know the semantics of the physical_disk/zpool tables well enough to know if what we're doing now is safe or not...

…port, comments

smklein · 2026-05-20T23:41:38Z

Couple thoughts on the DE in particular:

Since "cases" have no concept of identity aside from their facts - basically, "which disk are you working on" - theoretically, a single case could be full of facts about different disks. This would be... bad? Probably?
So there is kinda a concept of like, "what is the identity of the resource we are building a case about" that is kinda implied by facts today. Perhaps that's okay? It's flexible? but it also allows the data to model impossible situations.

This may be justification for a "sitrep version of blippy/clippy". Slippy. Which validates "this case has facts which are relevant, parseable, and coherent".

smklein commented May 19, 2026

View reviewed changes

smklein force-pushed the fm-disk-diagnoser branch 2 times, most recently from a3cddcc to 26f2ade Compare May 19, 2026 01:28

andrewjstone reviewed May 19, 2026

View reviewed changes

smklein force-pushed the fm-disk-diagnoser branch from 26f2ade to 67b661f Compare May 19, 2026 16:28

smklein force-pushed the fm-disk-diagnoser branch from 67b661f to 793b1ec Compare May 19, 2026 17:12

hawkw self-requested a review May 19, 2026 17:31

hawkw reviewed May 19, 2026

View reviewed changes

smklein added 5 commits May 19, 2026 12:07

Debuggable facts

dfabfb4

more logging

7555b0b

CaseFact -> Fact, add created_sitrep_id to facts, fm- prefix

bcf5b68

payload_to, deserialization error context

a415d27

case formatting

58c780f

smklein mentioned this pull request May 20, 2026

[fm] Rename Input::cases() to Input::open_cases() #10471

Merged

s/Disk/PhysicalDisk

bf5988f

merge

325e34b

hawkw reviewed May 20, 2026

View reviewed changes

Comment thread nexus/db-model/src/fm/case.rs Outdated

Comment thread nexus/db-model/src/fm/case.rs Outdated

smklein added 3 commits May 20, 2026 12:32

Reuse bulleted JSON formatter for fact payload display

b016d7c

Facts don't case about your prefixes?

aafbf9d

comments

d021ea9

hawkw reviewed May 20, 2026

View reviewed changes

smklein added 2 commits May 20, 2026 14:40

Guarantee at least one sitrep in omdb tests

d6467a1

disk -> physical_disk

b33b0e0

smklein added 2 commits May 20, 2026 15:34

IdOrdMap, ZpoolUnhealthyFact, adding initial observation, expanded re…

6ae7bc6

…port, comments

omdb output, struct naming

503fe79

Index by physical disk, refactor, check analysis report

5dfc4fd

		// points at zpools that are now Online or expunged. Closed cases are not
		// copied forward, so their facts naturally drop with them.

	"unreadable case must not be closed by the diagnoser",
	"unreadable case must not be closed by the diagnosis engine",

		/// All control plane managed disks
		in_service_disks: Arc<IdOrdMap<InServiceDisk>>,

Conversation

smklein commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

smklein May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

andrewjstone left a comment

Choose a reason for hiding this comment

Uh oh!

smklein commented May 19, 2026

Uh oh!

andrewjstone commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hawkw left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

hawkw left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

smklein commented May 19, 2026 •

edited

Loading

smklein May 19, 2026 •

edited

Loading

andrewjstone commented May 19, 2026 •

edited

Loading