[helios-fusion][sled-agent] A fake Helios system for Sled Agent Testing #3948
The Sled Agent is responsible for managing all sled-local state on a system, which includes bootstrapping, managing networking, local storage, services, routing, and much more. Effectively: although Nexus is responsible for planning and distributing workloads, the Sled Agent is responsible for “making that state real” – even before Nexus is running.
As a consequence of this architecture, Nexus is fairly decoupled from the underlying host OS primitives, while the Sled Agent is extremely tightly coupled with the underlying host OS. As it exists today, Nexus can happily operate on more-or-less any operating system – although it expects other machines to provision zones and run illumos-specific commands, it does not perform those operations itself. In contrast, the Sled Agent is specifically tuned to run on Helios, and in particular, the stlouis branch of Helios. Although it may compile for other operating systems (left mostly as a developer convenience for rust-analyzer), it really only executes on this stlouis/Helios host.
Sidenote: The Sled Agent also contains an HTTP server which runs a “simulated” sled agent. To be precise, this interface “simulates a sled agent”, but it is mostly used for testing Nexus, as it shares little code with the “real” sled agent. This PR focuses on test coverage of the real sled agent.
Automated testing of the “real” Sled Agent has been notoriously difficult. Since the Sled Agent manages global state, it is not merely a program that runs in isolation from the rest of the system – rather, it operates on “global” state of the host OS (e.g., spinning up zones, managing networks/disks, etc.), operations which are not trivial to run in isolation. Until recently, most Sled Agent testing has been manual – it involves getting a real machine running illumos (hopefully with the stlouis branch of Helios on it, though there are other illumos configs that people have historically used – see RFD 411), trying to get it to a “clean slate”, running the installation scripts, and hoping that you got the configuration right.
Historically, the Sled Agent attempted to abstract away the host system using mockall, a mocking crate. In isolated cases this worked, but it presented significant challenges higher in the stack: the dependency on “mock interfaces” propagates upward, so every layer above a mocked component must set up its own “expectation” calls. Additionally, the decision of “what to mock” has been inconsistent across the codebase, resulting in a handful of mismatched “mock” layers that are difficult to use from a high-level perspective.
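The “expectations propagate upward” problem can be illustrated without mockall. In this hypothetical sketch (the trait, mock, and dataset names are illustrative, not taken from the actual codebase), a hand-rolled mock of a low-level `Zfs` trait forces every higher-level test to enumerate, in order, the exact low-level calls the code under test will make:

```rust
use std::cell::RefCell;

// Illustrative low-level trait (not the PR's actual API).
trait Zfs {
    fn create_dataset(&self, name: &str) -> Result<(), String>;
}

/// Mock that only permits calls it was told to expect, in order.
struct MockZfs {
    expected: RefCell<Vec<String>>,
}

impl Zfs for MockZfs {
    fn create_dataset(&self, name: &str) -> Result<(), String> {
        let mut expected = self.expected.borrow_mut();
        if expected.first().map(String::as_str) == Some(name) {
            expected.remove(0);
            Ok(())
        } else {
            Err(format!("unexpected dataset: {}", name))
        }
    }
}

/// A higher-level operation: to test it, the test author must already know
/// it creates exactly these datasets, in exactly this order.
fn setup_storage(zfs: &dyn Zfs) -> Result<(), String> {
    zfs.create_dataset("example/crypt")?;
    zfs.create_dataset("example/crypt/zone")?;
    Ok(())
}
```

Any test of a function that *calls* `setup_storage` must now also script these two expectations, and so on up the stack.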
Proposal
Intercept the commands from the Sled Agent to the underlying host OS. Today, these largely consist of CLI commands, though the same applies to calls into host libraries via FFI. For any operation within the Sled Agent that attempts to modify the host OS (e.g., provisioning zones, setting up routes, managing storage, etc.), we can roughly do the following: act on a “host” object, which uses dynamic dispatch to decide which “host backend” should be used. For a real system in deployment, these requests will be mapped to a thin, pass-through implementation (e.g., std::process::Command invocations will be executed, host libraries will be called directly). For systems under test, we can operate on this output however we’d like.
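A minimal sketch of this indirection, assuming illustrative names (`HostInterface`, `RealHost`, `zone_exists` are hypothetical, not the PR's actual types): callers hold a `dyn HostInterface`, and the deployment backend is a thin pass-through to `std::process::Command`:

```rust
use std::process::Command;

/// Trait through which the Sled Agent issues host-modifying commands.
/// (Hypothetical name; the real PR may slice this interface differently.)
trait HostInterface {
    /// Run a CLI command, returning its stdout on success.
    fn run(&self, program: &str, args: &[&str]) -> Result<String, String>;
}

/// Pass-through backend for a real deployment: commands actually execute.
struct RealHost;

impl HostInterface for RealHost {
    fn run(&self, program: &str, args: &[&str]) -> Result<String, String> {
        let output = Command::new(program)
            .args(args)
            .output()
            .map_err(|e| e.to_string())?;
        if output.status.success() {
            Ok(String::from_utf8_lossy(&output.stdout).into_owned())
        } else {
            Err(String::from_utf8_lossy(&output.stderr).into_owned())
        }
    }
}

/// Example caller: because it takes `&dyn HostInterface`, the backend
/// (real or fake) is chosen at runtime by whoever constructs the host.
fn zone_exists(host: &dyn HostInterface, name: &str) -> bool {
    host.run("zoneadm", &["list"])
        .map_or(false, |out| out.lines().any(|l| l.trim() == name))
}
```

On a non-illumos machine (or if the command fails for any reason), `zone_exists` simply reports `false`; in a test, the same call would be served by a fake backend instead.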
Provide an implementation of a “fake host” to track the ways in which the Sled Agent has modified a host. This is one possible backend of the “host” abstraction, which does not try to actually run any workload, but rather acts as a “record of the state that the Sled Agent can manage”. The interface here would be the same as the one used for the real host – a trait through which CLI commands or calls to host libraries can be issued – but rather than actually running on a real host, they would be recorded, and would return “fake responses” that look like they came from a real host system. Additionally, test-only interfaces could let tests arbitrarily modify this behavior. For example: we could allow callers to send commands to the “ZFS” subsystem, but choose particular commands for which we should always return errors. Similarly, we could emulate “pulling a disk out of a fake system” and monitor how the Sled Agent responds.
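One way such a fake backend could look (a hypothetical sketch; `FakeHost` and its fields are illustrative, not the PR's actual types): it records every command issued and lets a test force failures for chosen programs, e.g. everything routed to `zfs`:

```rust
use std::sync::Mutex;

// Same illustrative trait the real backend would implement.
trait HostInterface {
    fn run(&self, program: &str, args: &[&str]) -> Result<String, String>;
}

#[derive(Default)]
struct FakeHost {
    /// Every command the Sled Agent issued, in order.
    log: Mutex<Vec<String>>,
    /// Programs (e.g. "zfs") for which every call should return an error.
    fail_programs: Mutex<Vec<String>>,
}

impl HostInterface for FakeHost {
    fn run(&self, program: &str, args: &[&str]) -> Result<String, String> {
        let cmd = format!("{} {}", program, args.join(" "));
        self.log.lock().unwrap().push(cmd);
        if self
            .fail_programs
            .lock()
            .unwrap()
            .iter()
            .any(|p| p.as_str() == program)
        {
            return Err(format!("injected failure for `{}`", program));
        }
        // A fuller fake would synthesize plausible per-command output here
        // (e.g., a believable `zoneadm list` listing).
        Ok(String::new())
    }
}
```

A test can then inspect `log` to assert on exactly what the Sled Agent tried to do, and seed `fail_programs` to exercise error-handling paths.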
Pros:
Cons:
Nuances:
Example Tests
“I want to spin up a new service, what happens?”
“I want to test what happens if a propolis instance fails partway through setup - does Sled Agent crash?”
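The second scenario above could be sketched roughly like this (all names here are illustrative stand-ins, not the PR's actual API): a fake host fails partway through a multi-step service setup, and the test asserts the operation returns an error cleanly instead of crashing:

```rust
use std::cell::RefCell;

/// Minimal fake host: records commands, and optionally starts failing
/// after a configured number of successful calls (emulating a setup
/// that dies partway through).
struct ScriptedHost {
    issued: RefCell<Vec<String>>,
    fail_after: Option<usize>,
}

impl ScriptedHost {
    fn run(&self, cmd: &str) -> Result<(), String> {
        let mut issued = self.issued.borrow_mut();
        if self.fail_after.map_or(false, |n| issued.len() >= n) {
            return Err("injected mid-setup failure".to_string());
        }
        issued.push(cmd.to_string());
        Ok(())
    }
}

/// Stand-in for "spin up a new service": a sequence of host commands,
/// where `?` propagates a failure from any step.
fn start_service(host: &ScriptedHost) -> Result<(), String> {
    host.run("zfs create ...")?;
    host.run("zonecfg ...")?;
    host.run("zoneadm boot ...")?;
    Ok(())
}
```

With `fail_after: None` the test asserts all three commands were issued; with `fail_after: Some(1)` it asserts `start_service` surfaces an error after the first step rather than panicking.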
Validation:
Alternatives
Spin up a VM to run each Sled Agent test, rather than a “Fake Host”
Pros
Cons