Closed
Labels
Update System (Replacing old bits with newer, cooler bits)
Description
This issue tracks the work mentioned in RFD 183. Resolving these issues will likely result in subsequent RFDs.
Creating and Hosting Packages
- Off-rack Update Service: Host a service where software bundles can be uploaded, stored, and downloaded. Likely TUF.
- Update Hosting: Provide storage / interfaces for downloading packages.
- Update Listing: Provide interfaces for listing and querying packages.
- Automated Tooling For Publishing Packages: Create tooling to automate the process by which artifacts can be added to the "Update Service".
- Packages of interest include...
- Helios Ramdisk
- Zone Images
- Images for SP / RoT
- Automated Testing For Packages: Create tests to confirm/deny interoperability between differently-versioned artifacts.
- Signing Infrastructure: We need a mechanism for packages to be signed as "from Oxide", which can be validated on the rack. This scheme should be transparent and documented, such that rack owners could plausibly replace components with their own software.
Getting Updates to the Rack
- Decide when to perform updates: Make Nexus self-sufficient, and able to decide when to update itself.
- Version-awareness: Make Nexus able to consider "desired" versions, and update to a reasonable choice while maintaining backwards compatibility. (The current implementation updates to whatever is latest, regardless of other software on the rack.)
- Rebalancing, liveness-awareness: Make Nexus able to rebalance workloads to enable upgrades that require service / sled reboots. This process must consider...
- Externally-facing Resources: Namely, live migration of virtual machines
- Internally-facing Services: Nexus, CRDB, Clickhouse, DNS servers, Oximeter, Crucible Downstairs, etc., must all maintain availability amid updates.
- Modifying Storage: Preparing for / executing DB schema changes
- Draining Sagas: As documented in RFD 289, ensure that sagas don't cross upgrade boundaries
- Downgrade: Define/implement a process for downgrade.
- Get the bundles into Nexus: Update Nexus's interface to expose an endpoint for uploading + instructing racks to update themselves (completed by [v2] TUF integration in Nexus + update artifact fetching by sled-agent #717).
- Store the bundles on-sled: The SQL representation of software bundles will likely need to be updated to include metadata referencing downloaded software versions, but the storage of the locally-downloaded binaries will likely live outside CockroachDB.
- Create a more holistic solution to storage management: In [v2] TUF integration in Nexus + update artifact fetching by sled-agent #717, we created a solution where artifacts are stored to /var/tmp/oxide_artifacts. However, these artifacts are never cleaned up, and they are neither limited nor accounted for when considering consumption of device storage relative to customer usage.
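As a rough illustration of the storage-management item above, here is a minimal sketch of one possible cleanup policy: keep artifact usage under a byte budget by evicting the least recently used artifacts that are not backing running software. Everything here (the `ArtifactEntry` type, its fields, `plan_eviction`, and the budget itself) is hypothetical and not taken from the existing sled-agent code; it only plans deletions and does not touch the filesystem.

```rust
/// Hypothetical record describing one downloaded artifact on a sled.
/// None of these names come from the existing code base.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ArtifactEntry {
    pub name: String,
    pub size_bytes: u64,
    /// Timestamp (seconds) at which the artifact was last used.
    pub last_used_secs: u64,
    /// True if currently running software depends on this artifact.
    pub in_use: bool,
}

/// Returns the names of artifacts to delete so that total usage fits
/// within `budget_bytes`, evicting least-recently-used artifacts first.
/// In-use artifacts are never selected, so the result may still exceed
/// the budget if in-use artifacts alone are larger than it.
pub fn plan_eviction(entries: &[ArtifactEntry], budget_bytes: u64) -> Vec<String> {
    let mut total: u64 = entries.iter().map(|e| e.size_bytes).sum();

    // Only artifacts that are not in use are eligible for eviction.
    let mut candidates: Vec<&ArtifactEntry> =
        entries.iter().filter(|e| !e.in_use).collect();
    // Oldest `last_used_secs` first.
    candidates.sort_by_key(|e| e.last_used_secs);

    let mut evict = Vec::new();
    for entry in candidates {
        if total <= budget_bytes {
            break;
        }
        total -= entry.size_bytes;
        evict.push(entry.name.clone());
    }
    evict
}
```

A real solution would also need to account for this usage against customer-visible storage, which this sketch does not attempt.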
Communicating Update Status to the API / Console
- Create APIs for inspecting versions of software, both at a component and "whole-rack" level (e.g., "what version of the API am I using")
- Create APIs for requesting particular versions of software. Presumably this is a mechanism by which downgrade could be requested, but also could be utilized for avoiding updates during critical service windows.
- Provide support in the Console for inspecting/requesting versions
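To make the "requesting particular versions" idea more concrete, here is a minimal sketch of how a version request might be resolved against the set of available versions. The `Version` and `VersionRequest` types and the `resolve_target` function are hypothetical, and the compatibility rule (same major version) is purely an assumption for illustration; the actual compatibility policy is still to be defined.

```rust
/// Hypothetical version triple; derived ordering compares major, then
/// minor, then patch.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Version {
    pub major: u32,
    pub minor: u32,
    pub patch: u32,
}

/// Hypothetical operator request for what the rack should run.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VersionRequest {
    /// Track the newest compatible release.
    Latest,
    /// Move to exactly this version; this may be a downgrade.
    Pinned(Version),
    /// Make no changes, e.g. during a critical service window.
    Hold,
}

/// Returns the version to move to, or `None` if no change should happen.
/// ASSUMPTION: "compatible" means "same major version as `current`".
pub fn resolve_target(
    current: Version,
    available: &[Version],
    request: VersionRequest,
) -> Option<Version> {
    match request {
        VersionRequest::Hold => None,
        VersionRequest::Pinned(v) => {
            // Only act if the pinned version exists and differs from current.
            if v != current && available.contains(&v) {
                Some(v)
            } else {
                None
            }
        }
        VersionRequest::Latest => available
            .iter()
            .copied()
            .filter(|v| v.major == current.major && *v > current)
            .max(),
    }
}
```

With this shape, downgrade is just a `Pinned` request for an older version, and a service window maps to `Hold`; whether those should be one API or several is an open design question.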
Pushing Updates from Nexus to Everything Else
- Expose APIs to Receive Updates: Within Sled Agent, SP, etc, "update targets" should expose an interface to be able to download and apply software bundles as instructed by Nexus.
- Coordinating Updates: Nexus applying updates to the rack should probably use an update schedule that doesn't power-cycle all sleds at once. We can certainly start with something simple here; once the initial system is wired up, we can start building systems to balance updates against live customer workload. See Update plans #764.
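As a starting point for the "something simple" schedule mentioned above, here is a minimal sketch of a rolling plan that never takes more than a fixed number of sleds down at once, and updates the least-loaded sleds first. The `SledInfo` type, its `running_instances` field, and `plan_rolling_update` are hypothetical names for illustration; ordering by instance count is only one possible heuristic for balancing against customer workload.

```rust
/// Hypothetical summary of a sled as seen by the update planner.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SledInfo {
    pub id: String,
    /// Number of customer instances currently running on the sled.
    pub running_instances: u32,
}

/// Splits sleds into ordered batches of at most `max_unavailable` sleds.
/// Batches are meant to be applied one after another, so that no more
/// than `max_unavailable` sleds are rebooting at any time. Sleds with
/// fewer running instances are scheduled earlier; ties keep input order.
/// A `max_unavailable` of zero is treated as one, so progress is possible.
pub fn plan_rolling_update(sleds: &[SledInfo], max_unavailable: usize) -> Vec<Vec<String>> {
    let batch_size = max_unavailable.max(1);

    let mut ordered: Vec<&SledInfo> = sleds.iter().collect();
    // `sort_by_key` is stable, so equally loaded sleds keep their order.
    ordered.sort_by_key(|s| s.running_instances);

    ordered
        .chunks(batch_size)
        .map(|chunk| chunk.iter().map(|s| s.id.clone()).collect())
        .collect()
}
```

This ignores live migration, internal service placement, and saga draining, all of which a real plan would have to consider before rebooting a sled.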
Validating that this Process Works
- Stand up a lab system using a minimally-defined update system
- Iterate on this system "without re-installing", to acquire empirical evidence of the update system's utility