[nexus] fault management situation reports #9320

hawkw · 2025-10-30T18:22:58Z

RFD 603 proposes the fault management situation report, or sitrep, as the central data structure for the control plane's fault management subsystem. The design, which is discussed in much greater detail in that RFD, draws a lot of inspiration from the blueprint data structure in the Reconfigurator. Sitreps are generated by the planning phase in a plan-execute pattern. At any time, a single sitrep is considered current. To update the control plane's understanding of the state of the system, a new planning step takes the current sitrep along with other inputs and produces a new sitrep with the current sitrep as its parent. That sitrep may then be added to the version history of current sitreps if (and only if) its parent sitrep is still the current sitrep (i.e. the one with the highest version number currently stored in the sitrep history). This ensures that there is a single sequentially consistent history of sitreps. Sitreps generated based on outdated inputs (due to multiple Nexuses generating them concurrently, or a Nexus operating on state that is no longer up to date) may not be made current, and are discarded.
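To make the commit rule concrete, here is a rough in-memory sketch of the "parent must still be current" check. It is not the code on this branch: `SitrepHistory`, `try_commit`, `CommitError`, and the integer IDs are hypothetical stand-ins, and it assumes the very first sitrep has no parent.

```rust
/// Hypothetical stand-in for a sitrep UUID.
type SitrepId = u32;

#[derive(Debug, PartialEq)]
enum CommitError {
    /// The proposed sitrep's parent is no longer the current sitrep.
    ParentNotCurrent { current: Option<SitrepId> },
}

/// In-memory model of the sitrep history: index 0 holds version 1, and so on.
#[derive(Default)]
struct SitrepHistory {
    versions: Vec<SitrepId>,
}

impl SitrepHistory {
    /// The current sitrep is the one with the highest version number.
    fn current(&self) -> Option<SitrepId> {
        self.versions.last().copied()
    }

    /// Compare-and-swap: append `id` only if `parent` is still current.
    /// Returns the new version number on success.
    fn try_commit(
        &mut self,
        parent: Option<SitrepId>,
        id: SitrepId,
    ) -> Result<usize, CommitError> {
        let current = self.current();
        if parent != current {
            // Someone else committed a sitrep first; this one is discarded.
            return Err(CommitError::ParentNotCurrent { current });
        }
        self.versions.push(id);
        Ok(self.versions.len())
    }
}
```

If two planners both start from sitrep 10, only the first `try_commit(Some(10), ..)` succeeds; the second sees a different current sitrep and gets `ParentNotCurrent`.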

This branch adds the foundation of the sitrep subsystem. In particular, it includes the following:

- Database schemas for the fm_sitrep table, which stores metadata for sitreps, and the fm_sitrep_history table, which stores the version history
- Models and nexus_types types for the same
- Database queries for reading the current sitrep version, reading a sitrep by its ID, and for inserting sitreps, including the "compare and swap" CTE that ensures new versions may only be inserted if they descend directly from the current sitrep
- An fm_sitrep_loader task that loads the latest sitrep version and publishes it over a tokio::sync::watch channel (which is not presently consumed by other code)
- OMDB commands for looking at sitreps
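As a rough illustration of the loader's publish step, the sketch below uses only the standard library, with a mutex-guarded slot standing in for the tokio::sync::watch channel. `Published`, `LoadOutcome`, and `SitrepVersion` are hypothetical names, and refusing to publish a lower version is an assumption about the desired behavior rather than a description of the task on this branch.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical summary of what the loader reads from the database.
#[derive(Debug, Clone, PartialEq)]
struct SitrepVersion {
    version: u32,
    id: u32,
}

#[derive(Debug, PartialEq)]
enum LoadOutcome {
    Updated,
    Unchanged,
    /// The database reported an older version than the one already published.
    WentBackwards,
}

/// Shared slot holding the latest published sitrep. This is a std-only
/// stand-in for the sending half of a watch channel.
#[derive(Default)]
struct Published {
    latest: Mutex<Option<Arc<SitrepVersion>>>,
}

impl Published {
    /// Publish `loaded` only if it is newer than what readers already see.
    fn publish(&self, loaded: SitrepVersion) -> LoadOutcome {
        let mut slot = self.latest.lock().unwrap();
        let current_version = slot.as_ref().map(|s| s.version);
        match current_version {
            Some(v) if v == loaded.version => LoadOutcome::Unchanged,
            Some(v) if v > loaded.version => LoadOutcome::WentBackwards,
            _ => {
                *slot = Some(Arc::new(loaded));
                LoadOutcome::Updated
            }
        }
    }

    /// What a consumer would observe right now.
    fn latest(&self) -> Option<Arc<SitrepVersion>> {
        self.latest.lock().unwrap().clone()
    }
}
```

Handing out an `Arc` keeps reads cheap for consumers, which is roughly what a watch channel of shared sitreps would give you as well.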

Right now, a sitrep only contains its top-level metadata. Other tables for storing parts of the sitrep, such as cases and records for updating Problems, will be added later as more of the control plane fault management subsystem is implemented. Currently, no sitreps are ever created outside of tests, so this code won't really do anything yet. But, it's an important foundation for the rest of the FM work, so I wanted to get it up for review as soon as possible.

hawkw · 2025-10-31T19:39:59Z

Um, okay...the test failure on the Ubuntu buildomat job is making me kinda uncomfortable: it seems like something in the dropshot API test has overflowed its stack. This is disquieting because there isn't any change to Dropshot APIs on this branch, so I'm not sure if there's anything I've changed that has caused it to behave differently... I'll have to investigate further.

hawkw · 2025-10-31T19:45:14Z

Ah, apparently this was fixed on main yesterday. I've updated this branch.

nexus/db-model/src/fm.rs

dev-tools/omdb/src/bin/omdb/db/sitrep.rs

nexus/src/app/background/tasks/fm_sitrep_load.rs

nexus/db-queries/src/db/datastore/fm.rs

hawkw · 2025-11-01T18:38:17Z

The test failure in this build-and-test (helios) job looks like a flake. I've opened #9330 to track that.

nexus/src/app/background/tasks/fm_sitrep_load.rs

smklein · 2025-11-06T21:23:27Z

nexus/db-queries/src/db/datastore/fm.rs

+        // TODO(eliza): other sitrep records would be inserted here...
+
+        // Now, try to make the sitrep current.
+        let query = InsertSitrepVersionQuery { sitrep_id: sitrep.id() };


Now that I'm reading the rest of the GC PR, and I have more context...

I think you're getting away with this "non-transactional" sitrep creation largely because "Create Sitrep (non-atomically) always precedes this InsertSitrepVersionQuery".

Basically:

- (To make our orphan-scanning tools work) The top-level sitrep record must be inserted first, and must exist while any other rows could exist for the sitrep
- (To know that a proposed sitrep ID references a non-torn sitrep) We only run the "Make sitrep current" insertion query after assembling a sitrep ourselves

HOWEVER, this kinda means we can never create a sitrep and then have an opportunity to inspect it before we try to insert it into sitrep history. If we exposed some API to "try to make an arbitrary sitrep_id current", we wouldn't have a way to know whether that sitrep was fully or partially written.

IMO this is the biggest downside of the non-transactional sitrep insertion. I'm definitely not saying that "sitrep insertion" + "updating the sitrep version" should be made transactional -- I think it's good they're separate -- but making just "sitrep insertion" transactional (like we do with blueprints) might make this a more flexible API for inspection.

(This would also probably require making the sitrep deletion operation transactional too - which we also do with blueprints, FWIW)
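A toy model of the write ordering described in this comment, for anyone following along. Everything here is hypothetical (`Db`, `insert_and_commit`, the string child rows, and the `crash_after` knob that simulates a writer dying partway through); it only illustrates why the metadata row goes first and why only the writer that assembled the sitrep can know it is complete.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for a sitrep UUID.
type SitrepId = u32;

#[derive(Default)]
struct Db {
    /// Top-level metadata rows (the fm_sitrep table in this PR).
    sitreps: BTreeSet<SitrepId>,
    /// Hypothetical child rows (cases and so on), keyed by owning sitrep.
    children: BTreeMap<SitrepId, Vec<String>>,
    /// Committed history, oldest first.
    history: Vec<SitrepId>,
}

impl Db {
    /// Writes a sitrep non-transactionally, in three steps. Returns true
    /// only if the sitrep was fully written and made current.
    fn insert_and_commit(
        &mut self,
        id: SitrepId,
        parent: Option<SitrepId>,
        rows: &[&str],
        crash_after: Option<usize>,
    ) -> bool {
        // 1. Metadata row first, so a scanner can always find the sitrep.
        self.sitreps.insert(id);
        // 2. Child rows, one at a time; a crash here leaves a torn sitrep.
        for (i, row) in rows.iter().enumerate() {
            if crash_after == Some(i) {
                return false;
            }
            self.children.entry(id).or_default().push(row.to_string());
        }
        // 3. Only now, having written every row ourselves, try to commit.
        if self.history.last().copied() != parent {
            return false;
        }
        self.history.push(id);
        true
    }

    /// Invariant the ordering preserves: no child row lacks a metadata row.
    fn every_child_has_metadata(&self) -> bool {
        self.children.keys().all(|id| self.sitreps.contains(id))
    }
}
```

Note that nothing stored in `Db` distinguishes a torn sitrep from a complete one after the fact, which is the inspection problem raised above.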

hawkw · 2025-11-07

Hmm. I'm open to making the sitrep insertion and deletion operations transactional in order to permit inserting a sitrep without making it current.

However, I'm not sure if I understand the value of inserting a sitrep without making it current here. I understand that this is useful for the reconfigurator, because it allows you to inspect a blueprint without actually requesting the system execute that blueprint. I'm just not sure if I get why it's useful for FM.

In the reconfigurator case, I imagine the separation of these operations is generally used when a blueprint is manually constructed by a developer and inserted via a CLI tool, and then you would want to look at it before you decide to make it the target --- is that correct? I'm not sure if this will be as useful for FM development. A blueprint represents a desired system state, but a sitrep represents an observed system state. For reconfigurator development, I imagine it's very useful to be able to manually construct a desired state and watch the execution part of the system drive towards that state, even when the planner cannot automatically construct that blueprint. Here, on the other hand, constructing a sitrep from observations is most of what the automated FM system is trying to do, and I imagine that the more useful approach for manual testing and simulation is to construct fake observations and test that the expected sitrep is generated from those observations. So in that world, sitreps would only ever be inserted into the database by a running fault management planner/controller/thingy, and for it to insert one without also immediately trying to make it current would just be a bug.

Maybe I'm wrong here and I'm certainly open to being convinced this is a useful capability to have, but that's my thinking on the subject currently.

smklein · 2025-11-07

Yeah, I probably should have been more clear in my original comment: As long as we can safely perform cleanup (see: all the comments on the other PR) I think it's okay to do this work non-transactionally. I do think it tightly couples the "create sitrep" + "activate sitrep" steps together - which is something we may or may not want - but mostly wanted to call that out as:

- If we want to decouple those things, we should consider making this operation more transactional
- And if we don't care - or we simply don't see a need for this now - we don't need to!

hawkw · 2025-11-07

Yeah, I didn't really take that as you saying that we had to separate those operations. But, I wanted to make sure that I understood the rationale for why we might want to do that, so that I'm not missing something when I decide not to care about it.

hawkw · 2025-11-07T21:28:39Z

This build failure looks like it was a Buildomat internal error, btw.

hawkw · 2025-11-07T22:28:12Z

Huh, now the helios / build TUF repo job is hitting an error trying to clone the dmar_report git repo, which apparently doesn't exist?

When a Nexus attempts to commit a new fault management situation report to the sitrep history but fails to do so because another sitrep with the same parent has already been inserted, that sitrep is said to be _orphaned_. Records pertaining to it are left behind in the database, but it will not be accessed by the rest of the system. Thus, we must occasionally garbage-collect such sitreps. This branch adds a background task for doing so.

Depends on #9320

Co-authored-by: Sean Klein <sean@oxide.computer>
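A minimal sketch of what collecting orphans could look like, under stated assumptions: a sitrep is treated as orphaned when it is absent from the history and its parent is no longer the current sitrep (so it can never be committed), and child rows are deleted before the metadata row. `Db`, `orphans`, and `collect_garbage` are hypothetical names, not the background task from that branch.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a sitrep UUID.
type SitrepId = u32;

struct Db {
    /// Metadata rows: sitrep ID mapped to its parent's ID.
    sitreps: BTreeMap<SitrepId, Option<SitrepId>>,
    /// Hypothetical child rows, keyed by owning sitrep.
    children: BTreeMap<SitrepId, Vec<String>>,
    /// Committed history, oldest first.
    history: Vec<SitrepId>,
}

impl Db {
    /// Uncommitted sitreps whose parent is no longer current. A sitrep whose
    /// parent is still current may be mid-insert, so it is left alone.
    fn orphans(&self) -> Vec<SitrepId> {
        let current = self.history.last().copied();
        self.sitreps
            .iter()
            .filter(|(id, parent)| {
                !self.history.contains(id) && **parent != current
            })
            .map(|(id, _)| *id)
            .collect()
    }

    /// Deletes every orphan and returns how many were removed.
    fn collect_garbage(&mut self) -> usize {
        let orphans = self.orphans();
        for id in &orphans {
            // Child rows go first so the metadata row exists for as long as
            // any other row for the sitrep does.
            self.children.remove(id);
            self.sitreps.remove(id);
        }
        orphans.len()
    }
}
```

With history `[1, 2]`, an uncommitted sitrep whose parent is 1 is collected, while an uncommitted sitrep whose parent is 2 survives because it could still be made current.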

hawkw self-assigned this Oct 30, 2025

hawkw added the nexus Related to nexus label Oct 30, 2025

hawkw added 17 commits October 30, 2025 13:09

[nexus] initial schema for sitrep tables

65f6ffe

[nexus] sitrep types

be4b7ff

[nexus] finish loader task, other stuff

c3b36f0

[nexus] s/fm_sitrep_version/fm_sitrep_history

72a7c50

start on horrific CTE

5d0cfa9

finish sitrep insert CTE

9cf6b8d

wip omdb stuff

6565dcb

finish omdb sitrep cmds

05c8504

add bg task details

bef2996

add sitrep_loader to config tomls

585b39d

fix clap

c4acdac

finally update OMDB tests

18142e3

migrations

b9c5c07

commentary + a few API tweaks

6640ab6

clippy tidiness

218056e

fixup docs

e9d87b2

whoops make module public so the error is visible

4108899

hawkw force-pushed the eliza/fm-sitrep branch from e951689 to 4108899 Compare October 30, 2025 20:11

hawkw requested review from davepacheco, jgallagher and smklein October 30, 2025 20:14

you gotta remember to update the config tests

570dda5

Merge branch 'main' into eliza/fm-sitrep

350508b

smklein reviewed Oct 31, 2025

reorder fields

aae8d15

hawkw mentioned this pull request Nov 1, 2025

test failed in CI: test_disks_detached_when_instance_destroyed #9330

Open

add sitrep loader test

8522c68

hawkw mentioned this pull request Nov 4, 2025

[nexus] garbage collect orphaned FM sitreps #9335

Merged

hawkw added 3 commits November 4, 2025 10:24

check that the current version hasn't gone down

dfffd69

make OMDB's view of the current sitrep more consistent

649a327

rename model module

cb1813f

smklein reviewed Nov 6, 2025

nexus/src/app/background/tasks/fm_sitrep_load.rs (outdated, resolved)

smklein approved these changes Nov 6, 2025

smklein reviewed Nov 6, 2025

fix wrong test name (x__x)

555d534

Merge branch 'main' into eliza/fm-sitrep

fd8beb5

hawkw enabled auto-merge (squash) November 7, 2025 22:31

hawkw merged commit 2967351 into main Nov 8, 2025
16 checks passed

hawkw deleted the eliza/fm-sitrep branch November 8, 2025 00:52

hawkw added the fault-management Everything related to the fault-management initiative (RFD480 and others) label Nov 11, 2025
