@smklein commented on Aug 24, 2023

The Sled Agent is responsible for managing all sled-local state on a system, which includes bootstrapping, managing networking, local storage, services, routing, and much more. Effectively: although Nexus is responsible for planning and distributing workloads, the Sled Agent is responsible for “making that state real” – even before Nexus is running.

As a consequence of this architecture, Nexus is fairly decoupled from the underlying host OS primitives, while the Sled Agent is extremely tightly coupled with the underlying host OS. As it exists today, Nexus can happily operate on more-or-less any operating system – although it expects other machines to provision zones and run illumos-specific commands, it does not perform those operations itself. In contrast, the Sled Agent is specifically tuned to run on Helios, and in particular, the stlouis branch of Helios. Although it may compile for other operating systems (left mostly as a developer convenience for rust-analyzer), it really only executes on this stlouis/Helios host.

Sidenote: The Sled Agent also contains an HTTP server which runs a “simulated” sled agent. This interface simulates a sled agent, but it is used mostly for testing Nexus, and it shares little code with the “real” sled agent. This PR focuses on test coverage of the real sled agent.

Automated testing of the “real” Sled Agent has been notoriously difficult. Since the Sled Agent manages global state, it is not merely a program that runs in isolation from the rest of the system – rather, it operates on “global” state of the host OS (e.g., spinning up zones, managing networks/disks, etc.) which is not trivial to reproduce in isolation. Until recently, most Sled Agent testing has been manual – it involves getting a real machine running illumos (hopefully with the stlouis branch of Helios on it, though there are other illumos configurations that people have historically used – see RFD 411), trying to get it to a “clean slate”, running the installation scripts, and hoping that you got the configuration right.

Historically, the Sled Agent attempted to abstract away the host system using mockall, a mocking crate. In isolated cases this worked, but it presented significant challenges moving up the stack: the dependency on “mock interfaces” propagates upward, so every higher layer must also set up “expectation” calls. Additionally, the decision of “what to mock” has been inconsistent, resulting in a handful of mismatched “mock” layers that are difficult to use from a high-level perspective.

Two categories of tests are worth distinguishing:

  • We need automated tests that validate the Sled Agent on real hardware, confirming the behavior of the Sled Agent interacting with the real host OS on real Oxide hardware. This is the product! It should be validated, and doing so is a major priority for end-to-end validation. This is a valid category of tests, but not the focus of this PR.
  • The Sled Agent, however, is not the host OS. It is a single component of software which exists atop the host OS. As such, it would also be useful to validate “how the Sled Agent interfaces with the host system”, by intercepting and validating the commands sent to and received from the host. This category of tests is the focus of this PR.

Proposal

  1. Intercept the commands from the Sled Agent to the underlying host OS. Today, these largely consist of CLI commands, though the same applies to calls into host libraries via FFI. For any operation within the Sled Agent that attempts to modify the host OS (e.g., provisioning zones, setting up routes, managing storage, etc.), we can roughly do the following: act on a “host” object, which uses dynamic dispatch to decide which “host backend” should be used. For a real system in deployment, these requests will be mapped to a thin, pass-through implementation (e.g., std::process::Command invocations will be executed, host libraries will be called directly). For systems under test, we can operate on this output however we’d like. (A sketch of what this could look like follows this list.)

  2. Provide an implementation of a “fake host”, to track the ways in which the Sled Agent has modified a host. This is one possible backend of the “host” abstraction; it does not try to actually run any workload, but rather acts as a “record of the state that the Sled Agent can manage”. The interface here would be the same as the one used for the real host – a trait through which CLI commands or calls to host libraries can be issued – but rather than actually running on a real host, they would be recorded, and the backend would provide “fake responses” that look like a host system. Beyond that, extra test-only interfaces could let tests arbitrarily modify this behavior. For example: we could allow callers to send commands to the “ZFS” subsystem, but choose particular commands for which we should always return errors. Similarly, we could emulate “pulling a disk out of a fake system”, and monitor how the Sled Agent responds.
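To make both points concrete, here is a minimal sketch of what the “host” abstraction and its two backends could look like. Every name here (Host, RealHost, FakeHost, CommandOutput, run_command) is a hypothetical illustration, not the interface this PR actually proposes:

```rust
use std::collections::VecDeque;
use std::process::Command;
use std::sync::Mutex;

/// The outcome of a host-level command, independent of backend.
pub struct CommandOutput {
    pub status_ok: bool,
    pub stdout: String,
    pub stderr: String,
}

/// The trait through which the Sled Agent issues all host-modifying
/// operations; `dyn Host` lets callers swap backends at runtime.
pub trait Host: Send + Sync {
    fn run_command(&self, program: &str, args: &[&str]) -> std::io::Result<CommandOutput>;
}

/// Thin pass-through backend for production: actually executes the
/// command via std::process::Command.
pub struct RealHost;

impl Host for RealHost {
    fn run_command(&self, program: &str, args: &[&str]) -> std::io::Result<CommandOutput> {
        let output = Command::new(program).args(args).output()?;
        Ok(CommandOutput {
            status_ok: output.status.success(),
            stdout: String::from_utf8_lossy(&output.stdout).into_owned(),
            stderr: String::from_utf8_lossy(&output.stderr).into_owned(),
        })
    }
}

/// Test backend: records every command instead of executing it, and
/// replies with queued canned responses that "look like" a real host.
#[derive(Default)]
pub struct FakeHost {
    pub log: Mutex<Vec<String>>,
    pub canned: Mutex<VecDeque<CommandOutput>>,
}

impl Host for FakeHost {
    fn run_command(&self, program: &str, args: &[&str]) -> std::io::Result<CommandOutput> {
        self.log.lock().unwrap().push(format!("{program} {}", args.join(" ")));
        // Pop the next canned response, or synthesize a generic success.
        let response = self.canned.lock().unwrap().pop_front().unwrap_or(CommandOutput {
            status_ok: true,
            stdout: String::new(),
            stderr: String::new(),
        });
        Ok(response)
    }
}
```

The key property is that the Sled Agent would only ever hold a &dyn Host (or Arc&lt;dyn Host&gt;), so swapping the real backend for the fake one requires no changes above this layer.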

Pros:

  • This separation would provide a minimal interface over the “real host” commands – the overhead is simply “calling through an appropriate interface in the Host trait”, and underneath, there should be minimal host-specific code. Unlike the more general mock-based approach, this provides a more unified spot for abstracting access to the host.
  • Full control of host OS responses. If we wanted to control how a fake host responded to a very specific command (either through the CLI or a host library), we could, and we could return an arbitrary error response (a sketch of such error injection follows this list). For example, see: [sled-agent] Create an "Executor", which intercepts requests through std::process::Command #3442 (comment)
  • Composability: By building out a holistic fake host abstraction, we should be able to test arbitrary layers within the Sled Agent. Unlike mocks, this will give us the ability to test layers of the Sled Agent that cross interface boundaries. For one such example, see: [sled-agent] Create an "Executor", which intercepts requests through std::process::Command #3442 (comment)
  • Rapid iteration time: For areas of the fake host that are built out, it should be possible to rapidly re-compile and re-test the behavior of the Sled Agent against the fake host, without repackaging an entire Oxide system and deploying it on limited hardware.
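As an illustration of the error-injection point above, the fake host could carry a hook like the following hypothetical FailureInjector (all names invented): before synthesizing a success response, the fake host would consult check, and any command matching a registered predicate would fail with the registered stderr message.

```rust
use std::sync::Mutex;

/// A rule mapping a command-line predicate to a forced failure message,
/// e.g. "any `zfs create` invocation should report that the pool is full".
type FailureRule = Box<dyn Fn(&str) -> bool + Send>;

#[derive(Default)]
pub struct FailureInjector {
    rules: Mutex<Vec<(FailureRule, String)>>,
}

impl FailureInjector {
    /// Register a predicate; matching commands fail with `stderr_msg`.
    pub fn fail_when(
        &self,
        pred: impl Fn(&str) -> bool + Send + 'static,
        stderr_msg: &str,
    ) {
        self.rules
            .lock()
            .unwrap()
            .push((Box::new(pred), stderr_msg.to_string()));
    }

    /// Consulted by the fake host before producing a success response;
    /// returns the stderr message for the first matching rule, if any.
    pub fn check(&self, command_line: &str) -> Option<String> {
        self.rules
            .lock()
            .unwrap()
            .iter()
            .find(|(pred, _)| pred(command_line))
            .map(|(_, msg)| msg.clone())
    }
}

// Usage in a test (hypothetical): fail every `zfs create`, but let
// everything else succeed.
//
//     injector.fail_when(|cmd| cmd.starts_with("zfs create"), "cannot create: out of space");
```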

Cons:

  • Skew. The fake host interface would only be as valuable as its ability to emulate a real system, which means there will be additional validation work to ensure it looks like a real stlouis system. This gap can be somewhat mitigated by using “oracle testing” (running a “fake host” and a “real host” side-by-side, and validating that their outputs are identical for a particular set of inputs – see the sketch after this list). Regardless, this would require developer resources to build out.
  • Maintenance. Whenever a new mechanism for interfacing with the host OS is built out, we’d need to add it to the “host interface” trait and also add a “fake implementation” in the fake host, if we want testing fidelity.
  • Limits the ways in which we interface with our underlying host system. More complex host OS coupling, such as a heavy reliance on fork/exec and signals, would be more difficult to emulate. As a result, the cost to emulate and test such mechanisms may discourage their usage in the production sled agent. It is left as an exercise to the reader to determine if this is actually a “pro”, or indeed a “con”.
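To make the oracle-testing idea concrete, a check could look like the sketch below, reusing the hypothetical Host trait and CommandOutput type from the earlier sketch; zoneadm list is just an example of a read-only command whose output both backends should agree on:

```rust
// Issue the same read-only command to both backends and require
// byte-identical stdout. Assumes the hypothetical `Host` trait and
// `CommandOutput` type from the earlier sketch.
fn oracle_check(real: &dyn Host, fake: &dyn Host) -> std::io::Result<()> {
    let args = ["list", "-cp"];
    let real_out = real.run_command("zoneadm", &args)?;
    let fake_out = fake.run_command("zoneadm", &args)?;
    assert_eq!(
        real_out.stdout, fake_out.stdout,
        "fake host output diverged from the real host"
    );
    Ok(())
}
```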

Nuances:

  • Do we consider access to the host filesystem an interaction with the host? In many ways, “yes”, but full access to a filesystem can also be emulated effectively using a temporary directory. Sending all file access commands (read/write) through a “host interface” may be an onerous abstraction, and it may be preferable to refactor our access to storage so that tests can use the real filesystem of the test environment (see the sketch below).
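For illustration, that refactoring could amount to parameterizing components by a root directory and handing tests a throwaway one, e.g. via the tempfile crate. The ConfigStore type here is invented for the example:

```rust
use std::path::{Path, PathBuf};

/// A hypothetical component that writes service configuration files,
/// parameterized by a root directory rather than hardcoding paths.
struct ConfigStore {
    root: PathBuf,
}

impl ConfigStore {
    fn new(root: &Path) -> Self {
        Self { root: root.to_path_buf() }
    }

    fn write_service_config(&self, name: &str, body: &str) -> std::io::Result<()> {
        std::fs::write(self.root.join(format!("{name}.toml")), body)
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn writes_under_test_root() {
        // Real filesystem access, but scoped to a throwaway directory.
        let dir = tempfile::tempdir().unwrap();
        let store = ConfigStore::new(dir.path());
        store.write_service_config("oximeter", "port = 12223").unwrap();
        assert!(dir.path().join("oximeter.toml").exists());
    }
}
```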

Example Tests

“I want to spin up a new service, what happens?”

  • Sled Agent would make a variety of calls on the “Host” object to instantiate VNICs, IP addresses, zones, routes, etc.
  • These calls would all be issued to the “FakeHost” backend. As long as they’re layered appropriately (e.g., zonecfg references a VNIC that was previously created with the fake host), the commands should succeed.
  • Afterwards, the test can query the “FakeHost” backend directly, and confirm the service was initialized with the requested configuration (a sketch follows).
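Assuming the hypothetical FakeHost from the earlier sketch plus an invented ServiceManager entry point, such a test could look roughly like:

```rust
#[test]
fn new_service_configures_host() {
    // `FakeHost` is the hypothetical recording backend sketched earlier;
    // `ServiceManager` is an invented stand-in for the Sled Agent layer
    // under test.
    let host = FakeHost::default();
    let manager = ServiceManager::new(&host);
    manager.ensure_service("oximeter").unwrap();

    // Query the fake host directly: the recorded command log should show
    // a VNIC being created before the zone that references it.
    let log = host.log.lock().unwrap();
    let vnic_idx = log.iter().position(|c| c.starts_with("dladm create-vnic"));
    let zone_idx = log.iter().position(|c| c.starts_with("zonecfg"));
    assert!(vnic_idx.is_some() && zone_idx.is_some());
    // Both are `Some` here, so this compares the recorded positions.
    assert!(vnic_idx < zone_idx, "VNIC must exist before the zone references it");
}
```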

“I want to test what happens if a propolis instance fails partway through setup - does Sled Agent crash?”

  • We would spin up a “fake Nexus” HTTP server, to intercept the internal calls from the Sled Agent.
  • Sled Agent would make a variety of calls on the “Host” object to instantiate VNICs, IP addresses, zones, routes, etc. This is similar to the “new service” case, but exercising the “new instance” pathway.
  • These calls would be issued to the “FakeHost” backend once more.
  • When the “FakeHost” notices that a propolis zone has been started, it could create a “fake propolis” HTTP server which the Sled Agent could query.
  • The Sled Agent could then monitor this “fake propolis” HTTP server, and we could send back whatever responses we want from the “fake propolis” to get coverage of the Sled Agent handling (a sketch of such a server follows).
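A minimal, std-only sketch of the “fake propolis” piece is below; a real harness would presumably use a proper HTTP server (e.g., dropshot), and spawn_fake_propolis is an invented helper, but this shows the shape of the idea:

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpListener};
use std::thread;

/// Spawn a responder on an ephemeral port that answers every request with
/// a canned HTTP status line; returns the address to hand to the Sled
/// Agent as the (fake) propolis-server endpoint.
fn spawn_fake_propolis(canned_status: &'static str) -> std::io::Result<SocketAddr> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    thread::spawn(move || {
        for stream in listener.incoming() {
            let Ok(mut stream) = stream else { continue };
            // Drain (some of) the request, then reply with the canned status.
            let mut buf = [0u8; 1024];
            let _ = stream.read(&mut buf);
            let response =
                format!("HTTP/1.1 {canned_status}\r\nContent-Length: 0\r\n\r\n");
            let _ = stream.write_all(response.as_bytes());
        }
    });
    Ok(addr)
}

// e.g., spawn_fake_propolis("500 Internal Server Error") to simulate a
// propolis instance failing partway through setup.
```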

Validation:

  • We could confirm that the Sled Agent does not crash when an error comes back from the “fake propolis” HTTP server
  • We could confirm that the instance zone and VNICs are appropriately destroyed when this happens
  • We could confirm that the state of the instance is correctly reported upward to the “fake nexus” HTTP server

Alternatives

Spin up a VM to run each Sled Agent test, rather than a “Fake Host”

Pros

  • Arguably better signal fidelity – we’d deal with nested virtualization rather than a “fake host interface”
  • No “FakeHost” layer to maintain, nor a “Host” trait object to be passed around. We could continue just directly issuing commands to the host.

Cons

  • Each test would become fairly expensive, requiring that we spin up a virtual machine for each, and configure it as we’d expect the Sled Agent to look.
  • We wouldn’t have a direct test interface to manipulate “what the Sled Agent sees when it looks at the host” – rather, we’d need to modify the VMM to allow us to send back targeted responses.
