Skip to content

Conversation

@hawkw
Copy link
Member

@hawkw hawkw commented Oct 30, 2025

RFD 603 proposes the fault management situation report, or sitrep, as the central data structure for the control plane's fault management subsystem. The design, which is discussed in much greater detail in that RFD, draws a lot of inspiration from the blueprint data structure in the Reconfigurator. Sitreps are generated by the planning phase in a plan-execute pattern. At any time, a single sitrep is considered current. Updating the control plane's understanding of the state of the system based on new inputs is done by a new planning step based on the current sitrep along with other inputs, and produces a new sitrep with the current sitrep as its parent. A sitrep may then be added to the version history of current sitreps if (and only if) its parent sitrep is still the current sitrep (i.e. the highest version number currently stored in the sitrep history). This ensures that there is a single sequentially consistent history of sitreps. Sitreps generated based on outdated inputs --- due to multiple Nexuses generating them concurrently, or a Nexus operating on state that is no longer up to date ---may not be made current, and are discarded.

This branch adds the foundation of the sitrep subsystem. In particular, it includes the following:

  • Database schemas for the fm_sitrep table, which stores metadata for sitreps, and the fm_sitrep_history table, which stores the version history
  • Models and nexus_types types for the same
  • Database queries for reading the current sitrep version, reading a sitrep by its ID, and for inserting sitreps, including the "compare and swap" CTE that ensures new versions may only be inserted if they descend directly from the current sitrep
  • A fm_sitrep_loader task that loads the latest sitrep version and publishes it over a tokio::sync::watch channel (which is not presently consumed by other code)
  • OMDB commands for looking at sitreps

Right now, a sitrep only contains its top-level metadata. Other tables for storing parts of the sitrep, such as cases and records for updating Problems, will be added later as more of the control plane fault management subsystem is implemented. Currently, no sitreps are ever created outside of tests, so this code won't really do anything yet. But, it's an important foundation for the ret of the FM work, so I wanted to get it up for review as soon as possible.

@hawkw hawkw self-assigned this Oct 30, 2025
@hawkw hawkw added the nexus Related to nexus label Oct 30, 2025
@hawkw
Copy link
Member Author

hawkw commented Oct 31, 2025

Um, okay...the test failure on the Ubuntu buildomat job is making me kinda uncomfortable: it seems like something in the dropshot API test has overflowed its stack. This is disquieting because there isn't any change to Dropshot APIs on this branch, so I'm not sure if there's anything I've changed that has caused it to behave differently... I'll have to investigate further.

@hawkw
Copy link
Member Author

hawkw commented Oct 31, 2025

Um, okay...the test failure on the Ubuntu buildomat job is making me kinda uncomfortable: it seems like something in the dropshot API test has overflowed its stack. This is disquieting because there isn't any change to Dropshot APIs on this branch, so I'm not sure if there's anything I've changed that has caused it to behave differently... I'll have to investigate further.

Ah, apparently this was fixed on main yesterday. i've updated this branch.

@hawkw
Copy link
Member Author

hawkw commented Nov 1, 2025

The test failure in this build-and-test (helios) job looks like a flake. I've opened #9330 to track that.

// TODO(eliza): other sitrep records would be inserted here...

// Now, try to make the sitrep current.
let query = InsertSitrepVersionQuery { sitrep_id: sitrep.id() };
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I'm reading the rest of the GC PR, and I have more context...

I think you're getting away with this "non-transactional" sitrep creation largely because "Create Sitrep (non-atomically) always precedes this InsertSitrepVersionQuery".

Basically:

  • (To make our orphan-scanning tools work) The top-level sitrep record must be inserted first, and must exist while any other rows could exist for the sitrep
  • (To know that a proposed sitrep ID references a non-torn sitrep) We only run the "Make sitrep current" insertion query after assembling a sitrep ourselves

HOWEVER, this kinda means we cannot create a sitrep, and ever have an opportunity to inspect it before we try to insert it into sitrep history. If we exposed some API to "try to make an arbitrary sitrep_id current", we wouldn't have a way to know whether it was fully or partially written.

IMO this is the biggest downside of the non-transactional sitrep insertion. I'm definitely not saying that "sitrep insertion" + "updating the sitrep version" should be made transactional -- I think it's good they're separate -- but making just "sitrep insertion" transactional (like we do with blueprints) might make this a more flexible API for inspection.

(This would also probably require making the sitrep deletion operation transactional too - which we also do with blueprints, FWIW)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I'm open to making the sitrep insertion and deletion operations transactional in order to permit inserting a sitrep without making it current.

However, I'm not sure if I understand the value of inserting a sitrep without making it current here. I understand that this is useful for the reconfigurator, because it allows you to inspect a blueprint without actually requesting the system execute that blueprint. I'm just not sure if I get why it's useful for FM.

In the reconfigurator case, I imagine the separation of these operations is generally used when a blueprint is manually constructed by a developer and inserted via a CLI tool, and then you would want to look at it before you decide to make it the target --- is that correct? I'm not sure if this will be as useful for FM development. A blueprint represents a desired system state, but a sitrep represents an observed system state. For reconfigurator development, I imagine it's very useful to be able to manually construct a desired state and watch the execution part of the system drive towards that state, even when the planner cannot automatically construct that blueprint. Here, on the other hand, constructing a sitrep from observations is most of what the automated FM system is trying to do, and I imagine that the more useful approach for manual testing and simulation is to construct fake observations and test that the expected sitrep is generated from those observations. So in that world, sitreps would only ever be inserted into the database by a running fault management planner/controller/thingy, and for it to insert one without also immediately trying to make it current would just be a bug.

Maybe I'm wrong here and I'm certainly open to being convinced this is a useful capability to have, but that's my thinking on the subject currently.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I probably should have been more clear in my original comment: As long as we can safely perform cleanup (see: all the comments on the other PR) I think it's okay to do this work non-transactionally. I do think it tightly couples the "create sitrep" + "activate sitrep" steps together - which is something we may or may not want - but mostly wanted to call that out as:

  • If we want to decouple those things...
  • ... we should consider making this operation more transactional

And if we don't care - or we simply don't see a need for this now - we don't need to!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I didn't really take that as you saying that we had to separate those operations. But, I wanted to make sure that I understood the rationale for why we might want to do that, to make sure I'm not missing something when I decided not to care about it.

@hawkw
Copy link
Member Author

hawkw commented Nov 7, 2025

This build failure looks like it was a Buildomat internal error, btw.

@hawkw
Copy link
Member Author

hawkw commented Nov 7, 2025

Huh, now the helios / build TUF repo job is hitting an error trying to clone the dmar_report git repo, which apparently doesn't exist?

@hawkw hawkw enabled auto-merge (squash) November 7, 2025 22:31
@hawkw hawkw merged commit 2967351 into main Nov 8, 2025
16 checks passed
@hawkw hawkw deleted the eliza/fm-sitrep branch November 8, 2025 00:52
hawkw added a commit that referenced this pull request Nov 10, 2025
When a Nexus attempts to commit a new fault management situation report
to the sitrep history but fails to do so because another sitrep with the
same parent has already been inserted, that sitrep is said to be
_orphaned_. Records pertaining to it are left behind in the database,
but it will not be accessed by the rest of the system. Thus, we must
occasionally garbage-collect such sitreps. This branch adds a background
task for doing so.

Depends on #9320

Co-authored-by: Sean Klein <sean@oxide.computer>
@hawkw hawkw added the fault-management Everything related to the fault-management initiative (RFD480 and others) label Nov 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

fault-management Everything related to the fault-management initiative (RFD480 and others) nexus Related to nexus

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants