[fm] Add simple disk diagnoser based on zpool health#10460

Open
smklein wants to merge 16 commits into main from fm-disk-diagnoser

smklein commented May 19, 2026

The first fault management diagnosis engine: opens a case for any
non-Online zpool whose backing physical disk is currently in service
in the control plane, and closes it on recovery or expungement.

Supporting infrastructure introduced along the way:

  • DiagnosisEngineKind::Disk variant (Rust + DB enum)
  • fm_case_fact child table for per-engine state (one case has 0..N
    immutable facts; stable UUIDs across sitreps; participates in
    copy-forward + GC like other sitrep child tables)
  • CaseBuilder::{add_fact, remove_fact, facts} API
  • InServiceDisk nexus-types projection consumed by FM, populated from
    the existing zpool_list_all_external_batched datastore method with
    policy filtering done in the background task

pub(super) fn analyze(
    input: &Input,
    builder: &mut SitrepBuilder<'_>,
) -> anyhow::Result<()> {
Collaborator Author

So, the whole point of this PR is "to be able to build something here, and re-use it", but ironically the contents of this particular DE are especially prone to change.

The "short version" of what we're doing:

  • Look at inventory, DB state, old sitreps
  • Make sure a case exists for each unhealthy zpool, with a corresponding "DiskFact"
  • Close old cases if their zpool is now healthy (or expunged)

We're doing this with a jumble of indices, iterations, etc. I think those will change. I think this DE will grow to track other state about these disks. I think each of these cases will potentially grow to have different facts.
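
The "short version" above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `ZpoolId`, `Health`, `Plan`, and `reconcile` are hypothetical stand-ins, and cases are reduced to the zpool they are about.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for the real `ZpoolUuid` type.
type ZpoolId = u32;

/// Hypothetical stand-in for the real `ZpoolHealth` type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Health {
    Online,
    Degraded,
}

/// What one analysis pass decides to do, keyed by zpool.
#[derive(Debug, Default, PartialEq, Eq)]
struct Plan {
    open: BTreeSet<ZpoolId>,
    close: BTreeSet<ZpoolId>,
}

/// `in_service`: zpools whose disk is in service (database state).
/// `observed`: zpool health from the latest inventory.
/// `open_cases`: zpools that already have an open case in the parent sitrep.
fn reconcile(
    in_service: &BTreeSet<ZpoolId>,
    observed: &BTreeMap<ZpoolId, Health>,
    open_cases: &BTreeSet<ZpoolId>,
) -> Plan {
    let mut plan = Plan::default();

    // Make sure a case exists for every unhealthy, in-service zpool.
    for (id, health) in observed {
        if *health != Health::Online
            && in_service.contains(id)
            && !open_cases.contains(id)
        {
            plan.open.insert(*id);
        }
    }

    // Close cases whose zpool recovered or left service. A zpool that is
    // merely missing from inventory keeps its case open.
    for id in open_cases {
        let out_of_service = !in_service.contains(id);
        let recovered = observed.get(id) == Some(&Health::Online);
        if out_of_service || recovered {
            plan.close.insert(*id);
        }
    }
    plan
}
```

In this sketch a zpool that has an open case, is still in service, and simply does not appear in inventory is left alone, which matches the later discussion of inventory being a lossy signal.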

Comment thread nexus/db-queries/src/db/datastore/fm.rs Outdated

/// Fetch all `fm_case_fact` rows belonging to cases in the given sitrep,
/// grouped by `case_id`.
async fn fm_case_facts_read_on_conn(
Collaborator Author

By reading facts alongside cases, there isn't really a need to mark "DE" on the fact table, so I removed it. It's redundant data anyway.

(Figured I'd mention this because it diverges slightly from the DB structure we talked about - but still sorts facts into case-specific buckets, so we can still "parse by the case DE type").

Member

Hm, I still think it's probably worth including in the DB record as a structured field, even if only for debugging reasons for now.

Also, at some point, I think we are going to probably have to figure out a way to allow multiple DEs to add facts to a case, although we don't have to cross that bridge yet. Consider the example of an ereport.data_loss.possible ereport indicating that a service processor has restarted and will need to be health-checked, as described in RFD 589. Suppose we have a trivial DE for handling data loss reports from SPs by doing a complete health check of that SP. This might open a case, and then request additional health checking of that DE, which might record some facts. Suppose one of those facts includes data that another DE would use to diagnose a fault. We should figure out how that flow will work, although we don't have to in this PR...

Collaborator Author

Couldn't each of those DEs just make a duplicate copy of that fact in their own cases? That seems like it helps keep fact lifecycle scoped "per-case", which is what we want.

I really hesitate to include this data "just to have it" because then it means we need to handle the case where "fact.de != fact.case.de", which is an impossible data corruption case we could just avoid by omitting the column

Collaborator Author

smklein, May 19, 2026

Specifically with your data-loss case: my main argument is that "facts are associated with cases", regardless of how they're generated.

So: in the case where we have "DE 1 which does something, but wants to write down a fact for a case managed by DE 2" - I think we can make this happen in-memory during sitrep construction, but on-disk, this could look like:

  • DE1 has a case C1, queries for data
  • (next sitrep) DE1 sees new data for C1, decides to open a case C2 for analysis by a different DE (DE2). It can also pass along a fact for C2
  • On-disk: That fact is associated with C2. We could have a "comment" about how it was originally noticed by DE1/C1? But that origination doesn't really matter

Member

Yeah, what I am getting at here is the question of, how do we expect the "handing off" of data from one DE to analysis by another DE to work if we are expecting that the two DEs will never try to read facts from cases that they don't own.

I think the idea of having DE1 open a new case for DE2, with facts whose schemas come from DE2's fact schemas (as you described in #10460 (comment)), seems like a reasonable approach. That was precisely the kind of thing I was hoping to work out a design for, and I feel like this is a reasonable one.

smklein force-pushed the fm-disk-diagnoser branch 2 times, most recently from a3cddcc to 26f2ade, May 19, 2026 01:28
Contributor

andrewjstone left a comment

It's really exciting to see this coming together!

I think it makes sense to use JSON for payloads in the DB due to the explosion in types as discussed in chat. I wonder about the versioning strategy though. The DEs in Nexus are the only things that need to interpret payloads, but they are essentially client side versioned. During an update, Nexus will not understand new payloads. Do we plan to use a two-phase update model where reporters can't issue newly added reports until a second update, or will DE's just ignore payloads they can't understand?

Collaborator Author

smklein commented May 19, 2026

> During an update, Nexus will not understand new payloads. Do we plan to use a two-phase update model where reporters can't issue newly added reports until a second update, or will DE's just ignore payloads they can't understand?

Nexus is performing an atomic handoff from "old" to "new" before the database can be accessed, right? I don't think we need to worry about a mixed-version Nexus scenario - I believe we'll have "old Nexus, working with old data", then we'll perform handoff, and only worry about "new Nexus working with old + new data, which it can migrate"

Regardless, there are a bunch of strategies we could use for doing "fact payload" schema migration:

  • We could use the existing DB migration tools, to perform "data-only" migrations (look at all fm_case_fact rows, where diagnosis_engine = x, and where payload->variant = y, and re-write the payload).
  • We could rely on the re-generation of sitreps to have a phase where we load "old facts, and update them to new fact format". e.g. CaseFact::VariantFoo1 could be read, and in-memory updated to CaseFact::VariantFoo2, which gets written out in the next sitrep.
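
The second strategy (upgrade old facts in memory while building the next sitrep) might look something like this sketch. The variant names `VariantFoo1` and `VariantFoo2` come from the comment above; the fields, the `upgrade` method, and `carry_forward` are invented for illustration and are not part of the PR.

```rust
/// Hypothetical fact payloads. Only the variant names come from the
/// discussion; the fields are made up for this example.
#[derive(Clone, Debug, PartialEq, Eq)]
enum CaseFact {
    /// Old format.
    VariantFoo1 { zpool: String },
    /// New format, which records one extra (optional) field.
    VariantFoo2 { zpool: String, health: Option<String> },
}

impl CaseFact {
    /// Upgrade an old-format fact to the newest format. Facts that are
    /// already in the newest format pass through unchanged.
    fn upgrade(self) -> CaseFact {
        match self {
            CaseFact::VariantFoo1 { zpool } => {
                CaseFact::VariantFoo2 { zpool, health: None }
            }
            newest @ CaseFact::VariantFoo2 { .. } => newest,
        }
    }
}

/// Facts as they would be written out in the next sitrep: everything read
/// from the parent sitrep is upgraded before being carried forward.
fn carry_forward(parent_facts: Vec<CaseFact>) -> Vec<CaseFact> {
    parent_facts.into_iter().map(CaseFact::upgrade).collect()
}
```

One consequence of this approach, at least in this sketch, is that old variants have to stay readable until every sitrep containing them has been superseded.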

smklein force-pushed the fm-disk-diagnoser branch from 26f2ade to 67b661f, May 19, 2026 16:28
The first fault management diagnosis engine: opens a case for any
non-Online zpool whose backing physical disk is currently in service
in the control plane, and closes it on recovery or expungement.

Supporting infrastructure introduced along the way:

- DiagnosisEngineKind::Disk variant (Rust + DB enum)
- fm_case_fact child table for per-engine state (one case has 0..N
  immutable facts; stable UUIDs across sitreps; participates in
  copy-forward + GC like other sitrep child tables)
- CaseBuilder::{add_fact, remove_fact, facts} API
- InServiceDisk nexus-types projection consumed by FM, populated from
  the existing zpool_list_all_external_batched datastore method with
  policy filtering done in the background task

Schema migration: add-disk-de-and-facts (version 260) adds the 'disk'
enum value and creates fm_case_fact.
smklein force-pushed the fm-disk-diagnoser branch from 67b661f to 793b1ec, May 19, 2026 17:12
hawkw self-requested a review May 19, 2026 17:31
Contributor

andrewjstone commented May 19, 2026

> Nexus is performing an atomic handoff from "old" to "new" before the database can be accessed, right? I don't think we need to worry about a mixed-version Nexus scenario - I believe we'll have "old Nexus, working with old data", then we'll perform handoff, and only worry about "new Nexus working with old + new data, which it can migrate"

Ah, I must be misunderstanding how payloads get populated. I was presuming that it's possible for the ingester of the payload to write to the database without actually knowing the format of the payload. But if we limit ingestion of new payloads until Nexus is updated, then I agree there is no problem.

Member

hawkw left a comment

Here's an incomplete review focusing on the database models and domain types; I haven't actually gotten as far as the actual diagnosis engine yet. I figured it would be more useful to leave a smaller review sooner rather than waiting to get to the "other half" of this PR.

Comment thread nexus/fm/src/builder/case.rs Outdated
Comment thread nexus/fm/src/builder/case.rs Outdated
Comment thread nexus/fm/src/builder/case.rs
Comment thread nexus/fm/src/builder/case.rs Outdated
Comment thread nexus/types/src/fm/case.rs Outdated
Comment thread nexus/types/src/fm/case.rs Outdated
Comment thread schema/crdb/fm-disk-de-and-facts/up2.sql
let mut support_bundles_requested = Vec::new();
let mut bundle_data_selections_requested = Vec::new();
let mut case_ereports = Vec::new();
let mut case_facts = Vec::new();
Member

would be nice to be able to with_capacity this to be as long as the case's facts map...but i also notice we are not doing this for any of the other ones so it's kinda fine i guess...

Comment thread nexus/fm/src/analysis_input.rs
smklein added a commit that referenced this pull request May 20, 2026
Split out from #10460 per review feedback.

Renames the `Input::cases()` accessor to `Input::open_cases()`. The
struct already tracked open and closed-copied-forward cases separately
in private fields; this just makes the public accessor name reflect
that, and adds a short doc comment pointing at the (crate-private)
`closed_cases_copied_forward()` accessor for the other half.
name: &str,
pattern: &str,
) -> &mut Self {
pub fn variable_regex(&mut self, name: &str, pattern: &str) -> &mut Self {
Member

huh, i guess rustfmt just decided it wanted to change this for...some reason?

Collaborator Author

¯_(ツ)_/¯

Comment thread nexus/fm/src/builder/case.rs
Comment thread nexus/types/src/fm/case.rs Outdated
Comment thread nexus/db-model/src/fm/case.rs Outdated
Comment thread nexus/db-model/src/fm/case.rs Outdated
Member

hawkw left a comment

i'm starting to try and wrap my head around what the diagnosis engine is actually doing. overall, i think this seems good so far, but i want to spend some more time thinking through how it currently works, especially the way it intersects with some of our design decisions around how facts work...

Comment thread nexus/fm/src/diagnosis/physical_disk.rs
Comment thread nexus/fm/src/diagnosis/disk.rs Outdated
/// recorded zpool. There can be multiple facts in pathological cases
/// (e.g., two zpool ids on the same case after a hand-edit); the
/// diagnoser keeps all of them in its accounting.
zpool_unhealthy: BTreeMap<ZpoolUuid, Vec<(FactUuid, ZpoolHealth)>>,
Member

personally i kind of feel like i might bite the bullet and make this iddqd-y, but i suppose that requires making a lot more structs

Member

also, nitpickily, i might consider calling it "unhealthy_zpools" or something, since it is an index of zpools by UUID...

Collaborator Author

Updated to use iddqd. Not sure if I like this more or less, honestly?

Comment thread nexus/fm/src/diagnosis/disk.rs Outdated
) -> anyhow::Result<()> {
// The disk DE's primary key today is `zpool_id`, so we build a local
// index keyed by zpool. Future variants of `DiskFact` are welcome to
// derive their own secondary indices (e.g., by `sled_id` for FMD).
Member

the sentence "the disk DE's primary key" feels a bit weird to me, i feel like when i see "primary key" i take that to mean we are talking about a database table..

Collaborator Author

Changed the wording here; I really do care about "key" but PK is definitely DB terminology we don't need.

Comment thread nexus/fm/src/diagnosis/physical_disk.rs Outdated
Comment on lines +46 to +65
// Index every zpool we observed in this inventory by ID, so we can
// distinguish "saw it, it's Online" from "didn't see it at all" below.
let observed: BTreeMap<ZpoolUuid, ZpoolHealth> = input
    .inventory()
    .sled_agents
    .iter()
    .flat_map(|sa| sa.zpools.iter())
    .map(|z| (z.id, z.health))
    .collect();

// Currently-faulty, control-plane-managed zpools.
//
// Out-of-service zpools are intentionally ignored: a non-`Online` zpool
// whose disk has been expunged is no longer the control plane's concern.
let faulty: BTreeMap<ZpoolUuid, ZpoolHealth> = observed
    .iter()
    .filter(|(id, _)| in_service_by_zpool.contains_key(*id))
    .filter(|(_, h)| **h != ZpoolHealth::Online)
    .map(|(id, h)| (*id, *h))
    .collect();
Member

hm, okay, this code all makes me feel like ZpoolHealth ought to be a struct with a ZpoolUuid in it, and use iddqd::IdOrdMap for these. Or, perhaps IdMap; I don't think we care about ordering here as we are not printing these out or serializing them.

Collaborator Author

Done in 6ae7bc6

Comment thread nexus/fm/src/diagnosis/disk.rs Outdated
Comment on lines +67 to +71
// Inspect parent-forwarded Disk cases from the input (i.e., the state
// copied from the parent sitrep — *not* the in-progress builder, which
// we will mutate below). Each case's facts are JSON blobs owned by this
// engine; deserialize each one as DiskFact. Skip (with a warning) any
// fact we can't read.
Member

i don't think it's necessary to restate the explanation of what facts are in this comment, though maybe that's somewhat valuable if this DE is intended as a sort of prototypical example DE. This feels a bit like "claude decided to restate his prompt again" to me though, which always rubs me wrong...but maybe that's just because I'm an old man yelling at claudes.

Collaborator Author

Dropping it.

Comment thread nexus/fm/src/diagnosis/disk.rs Outdated
Comment on lines +109 to +110
// points at zpools that are now Online or expunged. Closed cases are not
// copied forward, so their facts naturally drop with them.
Member

comment about "closed cases" feels unnecessary to me, it's restating the semantics of sitreps which are documented at a higher level. not a big deal but yet again feels claudey in a way that makes me irrationally irritated 🤷‍♀️

Collaborator Author

Removed

Member

j'adore!

Comment thread nexus/fm/src/diagnosis/disk.rs Outdated
Comment on lines +128 to +131
.close(
"all ZpoolUnhealthy facts have resolved (zpool back to \
Online, or disk no longer in service)",
);
Member

is it possible to do this in a way where the comment closing the case could actually state the cause (either "zpool back to online" or "disk no longer in service"?)
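
One way to let the closing comment state the actual cause is to compute a reason first and format the message from it. The sketch below is only an illustration of that idea; `Resolution`, `resolution`, and `close_comment` are hypothetical names, not APIs from this PR.

```rust
/// Why a case about an unhealthy zpool can be closed (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Resolution {
    BackOnline,
    OutOfService,
}

/// `in_service`: is the backing disk still in service?
/// `online`: did the latest inventory report the zpool as Online?
/// Returns `None` while the zpool is still unhealthy and in service.
fn resolution(in_service: bool, online: bool) -> Option<Resolution> {
    if !in_service {
        Some(Resolution::OutOfService)
    } else if online {
        Some(Resolution::BackOnline)
    } else {
        None
    }
}

/// Build a closing comment that names the specific cause.
fn close_comment(zpool: &str, why: Resolution) -> String {
    match why {
        Resolution::BackOnline => {
            format!("zpool {zpool} is back to Online")
        }
        Resolution::OutOfService => {
            format!("disk backing zpool {zpool} is no longer in service")
        }
    }
}
```

A case with several facts would need one resolution per fact; this sketch only covers the single-zpool situation.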

Comment thread nexus/fm/src/diagnosis/disk.rs Outdated
}
let any_still_unhealthy =
summary.zpool_unhealthy.keys().any(|zpool_id| {
in_service_by_zpool.contains_key(zpool_id)
Member

just making sure I understand this correctly: if a zpool ID is not present in in_service_by_zpool, that is an EXPLICIT, POSITIVE INDICATION that the disk or the sled it is in has been expunged, correct? And if the disk has not been expunged, it will be present in in_service_by_zpool no matter what bad thing happens to it?

I might want a comment here explaining that, (maybe even instead of the above comment that "absence is not a recovery signal"). to the unfamiliar reader, this looks like it's checking for "absence from inventory" rather than "explicit signal of expungement", but my sense is that this is not actually what it's doing?

Member

looking at the input code, i am increasingly uncomfy about this and am starting to feel like i would rather see us detect that a disk has been decommissioned by maintaining a list of disks which are in the decommissioned state, and only closing the case if the disk is actually in that list, rather than checking that it is absent? but that might be because i don't know the semantics of the physical_disk/zpool tables well enough to know if what we're doing now is safe or not...

Collaborator Author

Yes, I am relying on the contents of in_service_by_zpool - really, the contents of in_service_disks, we read during the preparation phase - to be the full set of in-service disks.

I explained in the comment above that we ignore "whether or not the thing is in inventory" as a signal about whether to update the disk - inventory is lossy, could have transient issues, etc.

Disk lifecycle goes through the following phases:

  1. Blueprint adds disk in-service
  2. Execution creates a PhysicalDisk row (with policy marked as "in-service", and state as "active")
  3. (Disk lifetime goes here)
  4. Blueprint later marks a disk as expunged
  5. Execution marks that PhysicalDisk row (policy is now "expunged")
  6. Expungement happens, and the "disk state" eventually is updated to "decommissioned"
  7. The PhysicalDisk row could presumably be deleted here. It isn't today, but it probably will be in the future. However, the decommissioned_disk_cleaner is already deleting the Zpool rows! So it's basically just a matter of "CRDB rows are effectively deleted here, more cleanup will happen".

The current PR already finds pools/disks of interest by joining on zpool, and filters by "disks that are in-service". This means:

If a physical disk is in step (2) - (3), we'll treat it as "alive and observable". Otherwise: It either is being expunged, or has been expunged.

> i would rather see us detect that a disk has been decommissioned by maintaining a list of disks which are in the decommissioned state, and only closing the case if the disk is actually in that list, rather than checking that it is absent?

Yeah this would be sorta problematic, because now we can't delete zpools / physical disk rows until we have confirmed that all their associated cases have "finished up"! Otherwise, if we do expungement before the case gets to close, it'll be stuck open forever, waiting to see a "positive signal of decommission" that will never arrive.

We hit a bunch of these backward dependencies when we tried going through expungement, and it's a real pain in the ass. I have a preference for - when we can - stating: "this is the set of all in-service stuff", which is what we're currently doing here
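
The lifecycle argument above can be condensed into a small sketch. The types below are hypothetical simplifications of the disk policy and state (not the real database model), with `None` modelling a row that has already been deleted in step 7.

```rust
/// Hypothetical, simplified disk policy.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Policy {
    InService,
    Expunged,
}

/// Hypothetical, simplified disk state.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum State {
    Active,
    Decommissioned,
}

/// A disk's database row; `None` means the row no longer exists.
type DiskRow = Option<(Policy, State)>;

/// The check described above: only steps (2)-(3) count as in service.
fn is_in_service(row: DiskRow) -> bool {
    matches!(row, Some((Policy::InService, _)))
}

/// The alternative: wait for a positive "decommissioned" signal.
/// Once the row is deleted, that signal can never be observed.
fn saw_decommission(row: DiskRow) -> bool {
    matches!(row, Some((_, State::Decommissioned)))
}
```

With a deleted row, `is_in_service` is false (so the case can close), while `saw_decommission` is also false (so a case waiting on it would stay open indefinitely).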

Member

okay, thanks for the explanation. i think this makes sense then!

Comment thread nexus/fm/src/diagnosis/disk.rs Outdated
.expect("unreadable case should be copied forward");
assert!(
unreadable.is_open(),
"unreadable case must not be closed by the diagnoser",
Member

Suggested change
- "unreadable case must not be closed by the diagnoser",
+ "unreadable case must not be closed by the diagnosis engine",

Collaborator Author

updated in 6ae7bc6, here and elsewhere

Member

hawkw left a comment

still trying to grok the de stuff

Comment thread nexus/fm/src/analysis_input.rs
Comment on lines +242 to +244
if disk.disk_policy != PhysicalDiskPolicy::InService {
    continue;
}
Member

hm. so i am a bit worried here that we are not able to differentiate between disks which have been decommissioned and disks which have just been deleted from the table for "some reason" in the DE code that checks if an unhealthy disk is still in service? but maybe that's fine and we rely on the rest of the system to not mess with this in a way that will make us sad.

Collaborator Author

I'm not sure how to reconcile this. The database state is our representation of "what disks we consider to be in-service or not". Even the blueprint is just intent; the database rows actually enact that policy.

Coping with "a disk that gets deleted from CRDB for some reason" is akin to coping with arbitrary database corruption IMO. I am not sure I can reasonably accept inputs from CRDB for this DE if we are trying to model things in a byzantine way.

Member

yeah, your answer to my subsequent comment made me feel a bit less sketchy about this. i wasn't super familiar with the lifecycle of the disk tables, so walking through it was helpful. i think this is fine, thank you for clarifying stuff!

Comment on lines +235 to +239
let zpools_and_disks = self
    .datastore
    .zpool_list_all_external_batched(opctx)
    .await
    .context("failed to load in-service control plane disks")?;
Member

zpool_list_all_external_batched has a comment on it saying, essentially, that this can take a while. i kinda wonder if, since this and the ereport loading code both could load a lot of data and are basically completely isolated from each other, might we want to spawn separate tokio tasks to do the collection of those different inputs in parallel?

Member

(to be clear, this is not a blocker for this PR, just a thought)

Collaborator Author

I think this is probably a good idea. I suspect basically all the "preparation" logic for reading from DB, potentially reading from clickhouse, etc, etc, etc, before we start analysis can be parallelized.

Collaborator Author

Also, to be clear, not doing it in this PR
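
For reference, the shape of the parallel-loading idea from this thread is roughly the following. This sketch deliberately uses scoped `std` threads rather than tokio tasks so that it has no dependencies; the loader functions are hypothetical placeholders for the real datastore queries, and an async version in Nexus would presumably look different.

```rust
use std::thread;

/// Hypothetical placeholder for loading in-service disks from the database.
fn load_in_service_disks() -> Vec<u32> {
    vec![1, 2, 3]
}

/// Hypothetical placeholder for loading ereports.
fn load_ereports() -> Vec<String> {
    vec!["ereport-a".to_string()]
}

/// Run both independent loaders concurrently and wait for both results
/// before analysis starts.
fn prepare_inputs() -> (Vec<u32>, Vec<String>) {
    thread::scope(|s| {
        let disks = s.spawn(load_in_service_disks);
        let ereports = s.spawn(load_ereports);
        (
            disks.join().expect("disk loader panicked"),
            ereports.join().expect("ereport loader panicked"),
        )
    })
}
```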

Comment on lines +43 to +44
/// All control plane managed disks
in_service_disks: Arc<IdOrdMap<InServiceDisk>>,
Member

it looks like the list of in service disks never makes it into the AnalysisInputReport. do we want it to, or is that too much data to spit out in every Status object from every activation of the analysis task? might we want to at least summarize it with say, the number of in-service disks? that might help to spot some obviously weird things such as an analysis pass that loaded 0 in service disks...?

Collaborator Author

Added in 6ae7bc6 (printing some basic UUID info)

Collaborator Author

smklein commented May 20, 2026

Couple thoughts on the DE in particular:

  • Since "cases" have no concept of identity aside from their facts - basically, "which disk are you working on" - theoretically, a single case could be full of facts about different disks. This would be... bad? Probably?
  • So there is kinda a concept of like, "what is the identity of the resource we are building a case about" that is kinda implied by facts today. Perhaps that's okay? It's flexible? but it also allows the data to model impossible situations.

This may be justification for a "sitrep version of blippy/clippy". Slippy. Which validates "this case has facts which are relevant, parseable, and coherent".
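
As a rough illustration of what such a validator could check, here is a sketch that flags facts that failed to parse and cases whose facts refer to more than one disk. Everything here is hypothetical (`ParsedFact`, `Lint`, `lint_case`); no such tool exists in this PR.

```rust
use std::collections::BTreeSet;

/// A fact after parsing: `Some(disk_id)` if the payload was readable,
/// `None` if it could not be parsed (hypothetical simplification).
type ParsedFact = Option<u32>;

/// Problems a sitrep linter might report about a single case.
#[derive(Debug, PartialEq, Eq)]
enum Lint {
    /// The fact at this position could not be parsed.
    UnparseableFact { index: usize },
    /// The case's facts refer to more than one disk.
    MixedDisks { disks: BTreeSet<u32> },
}

/// Check that a case's facts are parseable and all about the same disk.
fn lint_case(facts: &[ParsedFact]) -> Vec<Lint> {
    let mut lints = Vec::new();
    let mut disks = BTreeSet::new();
    for (index, fact) in facts.iter().enumerate() {
        match fact {
            Some(disk) => {
                disks.insert(*disk);
            }
            None => lints.push(Lint::UnparseableFact { index }),
        }
    }
    if disks.len() > 1 {
        lints.push(Lint::MixedDisks { disks });
    }
    lints
}
```

A case with no facts passes this check, since the PR description allows a case to have 0..N facts.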


3 participants