Skip to content

cpapi_instances_put delenda est #10290

@hawkw

Description

@hawkw

The current flow of VMM state information between Nexus and sled-agent is very unfortunate in the world of online update. At present, the same data (represented by the SledVmmState type, which currently lives in omicron_common) reaches Nexus through three different mechanisms:

  • As the return value of Nexus calls into sled-agent APIs that change the state of a VMM, with the resultant state
  • As the return value of Nexus calls into sled-agent's VMM GET API (called periodically by the instance_watcher background task)
  • As the body of the Nexus internal API's cpapi_instances_put API, which is called by sled-agent when the instance's state changes (which I shall refer to henceforth as the upcall)

The reason that this is bad and I should feel bad1 is that the existence of the upcall prevents basically all of the types involved in this from ever changing until some time in the glorious future after client-side versioning is actually implemented (see RFD 567. Because the upcall is part of the client-side-versioned Nexus API, we cannot simply move the various types involved in SledVmmState into the sled-agent API and version them normally. This is sad, because it makes otherwise simple changes that just wish to add a new field to SledVmmState or VmmRuntimeState or what have you (such as #10285) incredibly painful, requiring them to be smeared across multiple releases if they are even possible.

To fix this, I propose that cpapi_instances_put MUST BE DESTROYED. In particular, since all of the same data is available for Nexus to pull through the VMM GET API (which already exists and is server-side versioned, making it easy to change), we don't need to include the actual state in a Nexus upcall.

This means we could just remove cpapi_instances_put entirely, and rely on the periodic activation of the instance_watcher background task to eventually learn about VMM state changes. I would prefer not to do this, because it increases the latency by which changes in VMM state (such as crashing!) propagates to a place where it's visible to the user. This already takes a bit more time than I would like, and relying on periodic background task activations as the only way we learn about VMM states isn't gonna make that any better. So, giving the sled-agent a way to proactively notify Nexus that the VMM's state has changed is valuable. But, it doesn't really need to actually contain the VMM state as its body. Therefore, I think we should change this so that the endpoint which the sled-agent hits is just a "doorbell" which tells Nexus, "hey, please come and request the current state for this VMM UUID". I propose we should do this as follows:

  1. In the next release (R20):
    • move the various SledVmmState/VmmRuntimeState/etc types into from the omicron_common crate to sled-agent-types-versions
    • change the API definition for cpapi_instances_put to explicitly expect a sled_agent_types_versions::v1::SledVmmState.
    • add a new "doorbell" API that takes a VMM UUID and no body
    • implement the doorbell API by calling the sled-agent's VMM GET endpoint and then doing what cpapi_instances_put does with the state message currently.
    • change the current implementation of cpapi_instances_put in Nexus to just do the same thing as the doorbell and ignore the body
  2. In the release after (R21):
    • change sled-agent to just call the new doorbell API instead of cpapi_instances_put
    • CONSIGN cpapi_instances_put TO THE DUSTBIN OF HISTORY FOREVER

I believe that Greg and I discussed turning this API into a doorbell many moons ago, but we sadly missed our chance to do it before it became complicated. Oh well. Second-best time is now.

Footnotes

  1. Despite the fact that cpapi_instances_put existence predates my tenure at Oxide, I still take responsibility for not having fixed this back when it would have been easy to do.

Metadata

Metadata

Assignees

Labels

Sled AgentRelated to the Per-Sled Configuration and ManagementnexusRelated to nexus

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions