-
Notifications
You must be signed in to change notification settings - Fork 62
[nexus] fault management situation reports #9320
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
e951689 to
4108899
Compare
|
Um, okay...the test failure on the Ubuntu buildomat job is making me kinda uncomfortable: it seems like something in the dropshot API test has overflowed its stack. This is disquieting because there isn't any change to Dropshot APIs on this branch, so I'm not sure if there's anything I've changed that has caused it to behave differently... I'll have to investigate further. |
Ah, apparently this was fixed on |
|
The test failure in this build-and-test (helios) job looks like a flake. I've opened #9330 to track that. |
| // TODO(eliza): other sitrep records would be inserted here... | ||
|
|
||
| // Now, try to make the sitrep current. | ||
| let query = InsertSitrepVersionQuery { sitrep_id: sitrep.id() }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I'm reading the rest of the GC PR, and I have more context...
I think you're getting away with this "non-transactional" sitrep creation largely because "Create Sitrep (non-atomically) always precedes this InsertSitrepVersionQuery".
Basically:
- (To make our orphan-scanning tools work) The top-level sitrep record must be inserted first, and must exist while any other rows could exist for the sitrep
- (To know that a proposed sitrep ID references a non-torn sitrep) We only run the "Make sitrep current" insertion query after assembling a sitrep ourselves
HOWEVER, this kinda means we cannot create a sitrep, and ever have an opportunity to inspect it before we try to insert it into sitrep history. If we exposed some API to "try to make an arbitrary sitrep_id current", we wouldn't have a way to know whether it was fully or partially written.
IMO this is the biggest downside of the non-transactional sitrep insertion. I'm definitely not saying that "sitrep insertion" + "updating the sitrep version" should be made transactional -- I think it's good they're separate -- but making just "sitrep insertion" transactional (like we do with blueprints) might make this a more flexible API for inspection.
(This would also probably require making the sitrep deletion operation transactional too - which we also do with blueprints, FWIW)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. I'm open to making the sitrep insertion and deletion operations transactional in order to permit inserting a sitrep without making it current.
However, I'm not sure if I understand the value of inserting a sitrep without making it current here. I understand that this is useful for the reconfigurator, because it allows you to inspect a blueprint without actually requesting the system execute that blueprint. I'm just not sure if I get why it's useful for FM.
In the reconfigurator case, I imagine the separation of these operations is generally used when a blueprint is manually constructed by a developer and inserted via a CLI tool, and then you would want to look at it before you decide to make it the target --- is that correct? I'm not sure if this will be as useful for FM development. A blueprint represents a desired system state, but a sitrep represents an observed system state. For reconfigurator development, I imagine it's very useful to be able to manually construct a desired state and watch the execution part of the system drive towards that state, even when the planner cannot automatically construct that blueprint. Here, on the other hand, constructing a sitrep from observations is most of what the automated FM system is trying to do, and I imagine that the more useful approach for manual testing and simulation is to construct fake observations and test that the expected sitrep is generated from those observations. So in that world, sitreps would only ever be inserted into the database by a running fault management planner/controller/thingy, and for it to insert one without also immediately trying to make it current would just be a bug.
Maybe I'm wrong here and I'm certainly open to being convinced this is a useful capability to have, but that's my thinking on the subject currently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I probably should have been more clear in my original comment: As long as we can safely perform cleanup (see: all the comments on the other PR) I think it's okay to do this work non-transactionally. I do think it tightly couples the "create sitrep" + "activate sitrep" steps together - which is something we may or may not want - but mostly wanted to call that out as:
- If we want to decouple those things...
- ... we should consider making this operation more transactional
And if we don't care - or we simply don't see a need for this now - we don't need to!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I didn't really take that as you saying that we had to separate those operations. But, I wanted to make sure that I understood the rationale for why we might want to do that, to make sure I'm not missing something when I decided not to care about it.
|
This build failure looks like it was a Buildomat internal error, btw. |
|
Huh, now the |
When a Nexus attempts to commit a new fault management situation report to the sitrep history but fails to do so because another sitrep with the same parent has already been inserted, that sitrep is said to be _orphaned_. Records pertaining to it are left behind in the database, but it will not be accessed by the rest of the system. Thus, we must occasionally garbage-collect such sitreps. This branch adds a background task for doing so. Depends on #9320 Co-authored-by: Sean Klein <sean@oxide.computer>
RFD 603 proposes the fault management situation report, or sitrep, as the central data structure for the control plane's fault management subsystem. The design, which is discussed in much greater detail in that RFD, draws a lot of inspiration from the blueprint data structure in the Reconfigurator. Sitreps are generated by the planning phase in a plan-execute pattern. At any time, a single sitrep is considered current. Updating the control plane's understanding of the state of the system based on new inputs is done by a new planning step based on the current sitrep along with other inputs, and produces a new sitrep with the current sitrep as its parent. A sitrep may then be added to the version history of current sitreps if (and only if) its parent sitrep is still the current sitrep (i.e. the highest version number currently stored in the sitrep history). This ensures that there is a single sequentially consistent history of sitreps. Sitreps generated based on outdated inputs --- due to multiple Nexuses generating them concurrently, or a Nexus operating on state that is no longer up to date ---may not be made current, and are discarded.
This branch adds the foundation of the sitrep subsystem. In particular, it includes the following:
fm_sitreptable, which stores metadata for sitreps, and thefm_sitrep_historytable, which stores the version historynexus_typestypes for the samefm_sitrep_loadertask that loads the latest sitrep version and publishes it over atokio::sync::watchchannel (which is not presently consumed by other code)Right now, a sitrep only contains its top-level metadata. Other tables for storing parts of the sitrep, such as cases and records for updating Problems, will be added later as more of the control plane fault management subsystem is implemented. Currently, no sitreps are ever created outside of tests, so this code won't really do anything yet. But, it's an important foundation for the ret of the FM work, so I wanted to get it up for review as soon as possible.