
OADP-145: Two-phase DC restore for restic #141

Merged: 1 commit, Jul 18, 2022

Conversation

@sseago (Contributor) commented May 10, 2022

When a backup uses default volumes to restic, disconnect DeploymentConfig
pods from their ReplicationController and label them for later cleanup.
For DeploymentConfigs, set replicas to zero (and paused to false)
and annotate/label them for later cleanup.

After the restore completes, a post-restore bash script must be run
to clean up the restic restore pods and return each DC to its prior configuration.

Note that this is currently only done when the backup sets defaultVolumesToRestic=true -- for the restic annotation case, we don't know in the DeploymentConfig plugin that the DC had its pods disconnected. To enable this case we may need to add another backup annotation to communicate the user's desire to disconnect all DC pods and scale down DCs.
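
For illustration, a backup that would trigger this handling could be created with the default-volumes-to-restic option; the names below are placeholders rather than values from this PR:

velero backup create <backup-name> --include-namespaces <app-namespace> --default-volumes-to-restic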

Also, note that this does not currently handle the case for post-restore hooks on DC pods. For restore-level hooks, this could be added similarly to what's being done w/ the defaultVolumesToRestic config. For pod annotations, the concerns are similar to restic annotations.

Also note that following a successful restore, the script below must be run in order to clean up the restore pods and return each DC to its prior configuration (script location in GitHub still pending):

#!/bin/bash
set -e

OADP_NAMESPACE=${OADP_NAMESPACE:=openshift-adp}

if [[ $# -ne 1 ]]; then
  echo "usage: ${BASH_SOURCE} restore-name"
  exit 1
fi

echo using OADP Namespace $OADP_NAMESPACE
echo restore $1

echo Deleting disconnected restore pods
oc delete pods -l openshift.io/disconnected-from-dc=$1

# For each DC labeled as modified by this restore, read the original replicas/paused
# values back from its annotations and patch them onto the DC spec.
for dc in $(oc get dc --all-namespaces -l openshift.io/replicas-modified=$1 -o jsonpath='{range .items[*]}{.metadata.namespace}{","}{.metadata.name}{","}{.metadata.annotations.openshift\.io/original-replicas}{","}{.metadata.annotations.openshift\.io/original-paused}{"\n"}{end}')
do
    IFS=',' read -ra dc_arr <<< "$dc"
    echo Found deployment ${dc_arr[0]}/${dc_arr[1]}, setting replicas: ${dc_arr[2]}, paused: ${dc_arr[3]}

    cat <<EOF | oc patch dc  -n ${dc_arr[0]} ${dc_arr[1]} --patch-file /dev/stdin
spec:
  replicas: ${dc_arr[2]}
  paused: ${dc_arr[3]}
EOF

done
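
For reference, once the Velero restore completes, a typical invocation would look something like the following (the filename is hypothetical, since the script's final location in GitHub was still pending):

OADP_NAMESPACE=openshift-adp ./dc-post-restore.sh <velero-restore-name>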

@sseago (Contributor Author) commented May 10, 2022

One additional documentation note: The post-restore script will fail for Velero restores with names exceeding 63 characters, since such names are not valid label values.
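
(Kubernetes label values are limited to 63 characters, which is why such restore names fail. A guard like the following could be added near the top of the script; this is only a sketch, not part of the PR:)

# Restore names longer than 63 characters cannot be used as label values
if [[ ${#1} -gt 63 ]]; then
  echo "restore name $1 exceeds 63 characters and cannot be used as a label value"
  exit 1
fi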

@kaovilai (Member):

This is cool

@sseago (Contributor Author) commented May 10, 2022

Note the bash script above has been added under doc/scripts in a PR to oadp-operator.

const (
	DCPodDeploymentLabel       string = "deployment"                        // identifies associated RC
	DCPodDeploymentConfigLabel string = "deploymentconfig"                  // identifies associated DC
	DCPodDisconnectedLabel     string = "openshift.io/disconnected-from-dc" // pod had DC labels removed on restore
Member:

I think pod labels should be oadp specific. Don't want to have cross-project label conflicts.

Member:

or openshift-velero-plugin specific. velero.openshift.io?

Contributor Author:

Is there a more specific label namespace we should be using for OADP labels? Looking at the current code, it seemed like we were using openshift.io and not something like oadp.openshift.io -- @dymurray what is the standard for OADP namespace for labels and annotations?

Contributor Author:

I know for migration use cases we started with openshift.io and eventually moved to migration.openshift.io -- Whatever we use, we should standardize on the appropriate label/annotation namespace for OADP in general. Currently we seem to be using only "openshift.io" and "migration.openshift.io" -- and nothing else.

@sseago (Contributor Author), May 11, 2022:

Hmm. I'm not sure I've ever seen label/annotation names formulated this way. Is there precedent? For MTC we used migration.openshift.io/foo rather than openshift.io/migration/foo

Member:

ok alternatively openshift.io/oadp_disconnected-from-dc

Member:

I don't like it nested, i like the namespacing. oadp.openshift.io/disconnected-from-dc looks good to me.

In general I like moving all our annotations/labels to oadp.openshift.io as that's the namespace our APIs live under.

Contributor Author:

@dymurray or should it be adp.openshift.io? I'm a bit unclear on when to use "oadp" vs "adp" -- i.e. namespace is "openshift-adp" presumably because we don't want "openshift" mentioned twice in the name (once spelled out, once in the acronym). Would the same apply to the ".openshift.io" namespacing?

Contributor Author:

Whatever it is, I'm assuming we'd use this for all new oadp label/annotation names going forward. "oadp.openshift.io" is probably clearest.

@dymurray changed the title from "Two-phase DC restore for restic" to "OADP-145: Two-phase DC restore for restic" on Jul 12, 2022
@sseago (Contributor Author) commented Jul 13, 2022

@dymurray I've updated the PR to reflect the proposed label name changes, as well as updating it to work with the new GetBackup method signature.

p.Log.Infof("[pod-restore] skipping restore of pod %s, has owner references, no restic backup, and no restore hooks", pod.Name)
return velero.NewRestoreItemActionExecuteOutput(input.Item).WithoutRestore(), nil
}

// If pod has both "deployment" and "deploymentconfig" labels, it belongs to a DeploymentConfig
// If defaultVolumesToRestic, remove these labels so that the DC won't immediately delete the restored pod.
Member:

So when does this pod get deleted? never?

So we end up with restored Pod + 1 DeploymentConfig pod?

Contributor Author:

It needs to be cleaned up post-restore. At restore, we don't get 2 pods since the DC is restored with replicas=0. Cleanup should be triggered by running the script added in openshift/oadp-operator#684 -- deletes these pods and scales DC replicas back up to original values.
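
For anyone following along, the intermediate state can be inspected with the labels used by the post-restore script above (the label namespace may still change per the naming discussion in this PR):

# Pods restored for restic but disconnected from their DC/RC
oc get pods --all-namespaces -l openshift.io/disconnected-from-dc=<restore-name>

# DCs restored with replicas=0, carrying their original values in annotations
oc get dc --all-namespaces -l openshift.io/replicas-modified=<restore-name>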

Contributor Author:

The underlying issue is that if you restore a pod and a DC that is not scaled to zero, then the restored pod is immediately deleted. The only way for restic restores or post-restore hooks to work with these pods is to short-circuit that action.

There are two ways to do it. The first approach was a docs-only fix: telling users to first run the restore excluding DCs and ReplicationControllers, then, once restic and hooks were done, restore those missing resources from the same backup. The docs-only fix was later rejected in favor of this more active approach, where disconnected pods are restored along with scaled-down DCs. The disconnection was needed to prevent the DC controller from immediately scaling these pods to zero, as called for by the modified restored DC.

The only other way around this problem is to have users move to Deployments before backup.
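
As a rough sketch, the rejected docs-only workflow mentioned above would have looked something like this (backup and restore names are illustrative, not from this PR):

# Phase 1: restore everything except DCs and ReplicationControllers so that
# restic restores and post-restore hooks can run against the standalone pods
velero restore create restore-phase-1 --from-backup <backup-name> \
  --exclude-resources deploymentconfigs.apps.openshift.io,replicationcontrollers

# Phase 2: once phase 1 is finished, restore the remaining resources from the same backup
velero restore create restore-phase-2 --from-backup <backup-name> \
  --include-resources deploymentconfigs.apps.openshift.io,replicationcontrollers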
