Implement region replacement for Volumes #5683

Closed · wants to merge 6 commits

Commits on May 2, 2024

  1. Implement region replacement for Volumes

    When a disk is expunged, any region that was on that disk is assumed to
    be gone. A single disk expungement can put many Volumes into degraded
    states, as one of the three mirrors of a region set is now gone. Volumes
    that are degraded in this way remain degraded until a new region is
    swapped in, and the Upstairs performs the necessary repair operation
    (either through a Live Repair or Reconciliation). Nexus can only
    initiate these repairs - it does not participate in them, instead
    requesting that a Crucible Upstairs perform the repair.
    
    These repair operations can only be done by an Upstairs running as part
    of an activated Volume: either Nexus has to send this Volume to a Pantry
    and repair it there, or Nexus has to talk to a Propolis that has that
    active Volume. Further complicating things is that the Volumes in
    question can be activated and deactivated as a result of user action,
    namely starting and stopping Instances. This will interrupt any ongoing
    repair. This is ok! Both operations support being interrupted, but as a
    result it's then Nexus' job to continually monitor these repair
    operations and initiate further operations if the current one is
    interrupted.
    
    Nexus starts by creating region replacement requests, either manually or
    as a result of disk expungement. These region replacement requests go
    through the following states:
    
        Requested   <--
            |         |
            v         |
        Allocating  --

            |
            v

        Running     <--
            |         |
            v         |
        Driving     --

            |
            v

        ReplacementDone  <--
            |              |
            v              |
        Completing       --

            |
            v

        Completed
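
    As a rough sketch, the states above might map onto an enum like the
    following (hypothetical: the names mirror the diagram, and the real
    definition lives in Nexus' database model):

        // Hypothetical sketch of the request states from the diagram
        // above; not the actual Nexus db-model definition.
        #[derive(Debug, Clone, Copy, PartialEq, Eq)]
        enum RegionReplacementState {
            Requested,       // request created, no replacement region yet
            Allocating,      // start saga is allocating a new region
            Running,         // Volume updated, repair still needs driving
            Driving,         // drive saga is initiating/monitoring a repair
            ReplacementDone, // an Upstairs reported the repair is done
            Completing,      // finish saga is cleaning up the old region
            Completed,       // request fully processed
        }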
    
    A single saga invocation is not enough to continually make sure a Volume
    is being repaired, so region replacement is structured as a series of
    background tasks and saga invocations from those background tasks.
    
    Here's a high level summary:
    
    - a `region replacement` background task:
    
      - looks for disks that have been expunged and inserts region
        replacement requests into CRDB with state `Requested`
    
      - looks for all region replacement requests in state `Requested`
        (picking up new requests and requests that failed to transition to
        `Running`), and invokes a `region replacement start` saga.
    
    - the `region replacement start` saga:
    
      - transitions the request to state `Allocating`, blocking out other
        invocations of the same saga (see the compare-and-swap sketch
        after this list)
    
      - allocates a new replacement region
    
      - alters the Volume Construction Request by swapping out the old
        region for the replacement one (sketched after this list)
    
      - transitions the request to state `Running`
    
      - any unwind will transition the request back to the `Requested`
        state.
    
    - a `region replacement drive` background task:
    
      - looks for requests with state `Running`, and invokes the `region
        replacement drive` saga for those requests

      - looks for requests with state `ReplacementDone`, and invokes the
        `region replacement finish` saga for those requests (a sketch of
        this dispatch loop is shown below)
    
    - the `region replacement drive` saga will:
    
      - transition a request to state `Driving`, again blocking out other
        invocations of the same saga
    
      - check whether Nexus has taken an action to initiate a repair yet.
        If not, one is needed. If it _has_ previously initiated a repair
        operation, the state of the system is examined: is that operation
        still running? Has something changed? Further action may be
        required depending on this observation.

      - if an action is required, Nexus will prepare an action that will
        initiate either Live Repair or Reconciliation based on the current
        observed state of the system.

      - that action is then executed. If there was an error, the saga
        unwinds. If it was successful, it is recorded as a "repair step"
        in CRDB and will be checked the next time the saga runs.

      - if Nexus observed an Upstairs reporting that a repair was completed
        or not necessary, the request is placed into the `ReplacementDone`
        state; otherwise it is placed back into the `Running` state. If the
        saga unwinds, it unwinds back to the `Running` state.
    
    - finally, the `region replacement finish` saga will:
    
      - transition a request into `Completing`
    
      - delete the old region by deleting a transient Volume that refers to
        it (in the case where a sled or disk is actually physically gone,
        expunging that will trigger oxidecomputer#4331, which needs to be fixed!)
    
      - transition the request to the `Complete` state
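
    The "blocking out other invocations" above relies on conditional state
    transitions: a saga only proceeds if the request is still in the state
    it expects. A minimal compare-and-swap sketch of that idea (an
    in-memory stand-in; the real implementation is a conditional UPDATE
    against the request's row in CRDB):

        use std::collections::HashMap;
        use std::sync::Mutex;

        // In-memory stand-in for the CRDB-backed request table.
        struct RequestStore {
            states: Mutex<HashMap<u64, &'static str>>,
        }

        impl RequestStore {
            /// Move request `id` from `from` to `to` only if it is still
            /// in `from`; otherwise change nothing and report failure.
            fn transition(&self, id: u64, from: &str, to: &'static str) -> bool {
                let mut states = self.states.lock().unwrap();
                match states.get_mut(&id) {
                    Some(state) if *state == from => {
                        *state = to;
                        true
                    }
                    _ => false,
                }
            }
        }

        fn main() {
            let store = RequestStore {
                states: Mutex::new(HashMap::from([(1, "Requested")])),
            };
            // The first start saga invocation wins the transition...
            assert!(store.transition(1, "Requested", "Allocating"));
            // ...and a concurrent duplicate observes the change and bails.
            assert!(!store.transition(1, "Requested", "Allocating"));
        }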
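
    Likewise, the start saga's Volume Construction Request edit boils down
    to swapping one downstairs target address for another. A hedged sketch
    (the flat `targets` list and the addresses here are simplifications;
    the real code walks the full nested Volume Construction Request):

        /// Replace `old` with `new` in a region set's target list,
        /// returning whether a swap happened.
        fn swap_region_target(targets: &mut [String], old: &str, new: &str) -> bool {
            match targets.iter().position(|t| t.as_str() == old) {
                Some(i) => {
                    targets[i] = new.to_string();
                    true
                }
                None => false,
            }
        }

        fn main() {
            // Three mirrored downstairs targets; one was on the expunged disk.
            let mut targets = vec![
                "[fd00:1122:3344:101::5]:19000".to_string(),
                "[fd00:1122:3344:102::7]:19000".to_string(), // expunged
                "[fd00:1122:3344:103::6]:19000".to_string(),
            ];
            assert!(swap_region_target(
                &mut targets,
                "[fd00:1122:3344:102::7]:19000",
                "[fd00:1122:3344:104::8]:19000", // newly allocated region
            ));
        }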
    
    More detailed documentation is provided in each of the region
    replacement sagas' beginning docstrings.
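
    To make the task/saga split concrete, the drive background task's
    dispatch loop is conceptually just the following (hypothetical names;
    the real task submits sagas through Nexus' saga executor):

        #[derive(Clone, Copy)]
        enum State {
            Running,
            ReplacementDone,
        }

        struct Request {
            id: u64,
            state: State,
        }

        // Stand-ins for submitting the actual sagas.
        fn start_drive_saga(id: u64) {
            println!("invoking region replacement drive saga for {id}");
        }
        fn start_finish_saga(id: u64) {
            println!("invoking region replacement finish saga for {id}");
        }

        /// One activation of the drive background task: kick off whichever
        /// saga each outstanding request's state calls for. The sagas' own
        /// conditional state transitions make repeat activations safe.
        fn activate(requests: &[Request]) {
            for req in requests {
                match req.state {
                    State::Running => start_drive_saga(req.id),
                    State::ReplacementDone => start_finish_saga(req.id),
                }
            }
        }

        fn main() {
            activate(&[
                Request { id: 1, state: State::Running },
                Request { id: 2, state: State::ReplacementDone },
            ]);
        }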
    
    Testing was done manually in the Canada region, using the following
    test cases:
    
    - a disk needing repair is attached to an instance for the duration of
      the repair

    - a disk needing repair is attached to an instance that is migrated
      mid-repair

    - a disk needing repair is attached to an instance that is stopped
      mid-repair

    - a disk needing repair is attached to an instance that is stopped
      mid-repair, then started in the middle of the Pantry's repair
    
    - a detached disk needs repair
    
    - a detached disk needs repair, and is then attached to an instance that
      is then started
    
    - a sled is expunged, causing region replacement requests for all
      regions on it
    
    Fixes oxidecomputer#3886
    Fixes oxidecomputer#5191
    jmpesp committed May 2, 2024 (commit 300f4a8)

Commits on May 3, 2024

  1. mark_region_replacement_as_done should always work

    fix the case where `mark_region_replacement_as_done` wasn't changing
    the state of a request that had a drive saga running.
    jmpesp committed May 3, 2024 (commit b45ac93)

Commits on May 7, 2024

  1. 471f9b7
  2. 3449f97
  3. remove non-idempotent code

    jmpesp committed May 7, 2024 (commit 6ae24e5)

Commits on May 8, 2024

  1. 93b3e84