OADP-145: Two-phase DC restore for restic #141
Conversation
One additional documentation note: the post-restore script will fail for Velero restores with names exceeding 63 characters, since those names are not valid label values.
This is cool
Note the bash script above has been added in a PR to oadp-operator.
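The 63-character limit comes from the Kubernetes restriction on label values, which is why a long restore name can't be stored in (or selected by) a pod label. A minimal sketch of the length check (the helper name is hypothetical, and real label validation also restricts which characters are allowed):

```go
package main

import "fmt"

// Kubernetes label values are limited to 63 characters, so a Velero restore
// name longer than that cannot be used as a label value, which is why the
// post-restore cleanup script fails for such restores.
const maxLabelValueLen = 63

// validRestoreLabelValue is a hypothetical helper; it only checks length,
// not the full label-value character rules.
func validRestoreLabelValue(name string) bool {
	return len(name) <= maxLabelValueLen
}

func main() {
	fmt.Println(validRestoreLabelValue("my-restore"))
	// 77 characters, over the limit:
	fmt.Println(validRestoreLabelValue("this-restore-name-is-definitely-much-longer-than-sixty-three-characters-total"))
}
```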
velero-plugins/common/types.go
Outdated
const (
	DCPodDeploymentLabel       string = "deployment"                        // identifies associated RC
	DCPodDeploymentConfigLabel string = "deploymentconfig"                  // identifies associated DC
	DCPodDisconnectedLabel     string = "openshift.io/disconnected-from-dc" // pod had DC labels removed on restore
I think pod labels should be oadp specific. Don't want to have cross-project label conflicts.
or openshift-velero-plugin specific. velero.openshift.io?
Is there a more specific label namespace we should be using for OADP labels? Looking at the current code, it seemed like we were using openshift.io
and not something like oadp.openshift.io
-- @dymurray what is the standard for OADP namespace for labels and annotations?
I know for migration use cases we started with openshift.io
and eventually moved to migration.openshift.io
-- Whatever we use, we should standardize on the appropriate label/annotation namespace for OADP in general. Currently we seem to be using only "openshift.io" and "migration.openshift.io" -- and nothing else.
@sseago @kaovilai we use the label openshift.io/oadp
for OADP Operator resources https://github.com/openshift/oadp-operator/blob/master/api/v1alpha1/oadp_types.go#L31
Hmm. I'm not sure I've ever seen label/annotation names formulated this way. Is there precedent? For MTC we used migration.openshift.io/foo
rather than openshift.io/migration/foo
ok alternatively openshift.io/oadp_disconnected-from-dc
I don't like it nested; I like the namespacing. oadp.openshift.io/disconnected-from-dc
looks good to me.
In general I like moving all our annotations/labels to oadp.openshift.io
as that's the namespace our APIs live under.
@dymurray or should it be adp.openshift.io? I'm a bit unclear on when to use "oadp" vs "adp" -- i.e. namespace is "openshift-adp" presumably because we don't want "openshift" mentioned twice in the name (once spelled out, once in the acronym). Would the same apply to the ".openshift.io" namespacing?
Whatever it is, I'm assuming we'd use this for all new oadp label/annotation names going forward. "oadp.openshift.io" is probably clearest.
@dymurray I've updated the PR to reflect the proposed label name changes, as well as updating it to work with the new GetBackup method signature.
	p.Log.Infof("[pod-restore] skipping restore of pod %s, has owner references, no restic backup, and no restore hooks", pod.Name)
	return velero.NewRestoreItemActionExecuteOutput(input.Item).WithoutRestore(), nil
}

// If pod has both "deployment" and "deploymentconfig" labels, it belongs to a DeploymentConfig
// If defaultVolumesToRestic, remove these labels so that the DC won't immediately delete the
So when does this pod get deleted? never?
So we end up with restored Pod + 1 DeploymentConfig pod?
It needs to be cleaned up post-restore. At restore, we don't get 2 pods since the DC is restored with replicas=0. Cleanup should be triggered by running the script added in openshift/oadp-operator#684 -- deletes these pods and scales DC replicas back up to original values.
The underlying issue is that if you restore a pod alongside a DC that is not scaled to zero, the restored pod is immediately deleted. The only way for restic restores or post-restore hooks to work with these pods is to short-circuit that action.

There are two ways to do it. The first approach was a docs-only fix -- telling users to first run the restore excluding DCs and ReplicationControllers, then, once restic and hooks were done, restore those missing resources from the same backup. The docs-only fix was later rejected in favor of this more active approach, where disconnected pods are restored along with scaled-down DCs. The disconnection is needed to prevent the DC controller from immediately scaling these pods to zero, as called for by the modified restored DC. The only other way around this problem is to have users move to Deployments before backup.
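The disconnection step described above can be sketched as a pure label-map transformation: strip the keys the DC controller uses to claim the pod, and add a marker label so the post-restore script can find and delete the pod later. The marker key follows the thread's oadp.openshift.io proposal and may differ from the merged code.

```go
package main

import "fmt"

// disconnectPodFromDC sketches the two-phase restore's first phase for pods:
// remove the DC-controller labels so the controller won't immediately delete
// the restored pod, and add a marker label for post-restore cleanup.
func disconnectPodFromDC(labels map[string]string) {
	delete(labels, "deployment")       // identifies associated RC
	delete(labels, "deploymentconfig") // identifies associated DC
	labels["oadp.openshift.io/disconnected-from-dc"] = "true"
}

func main() {
	podLabels := map[string]string{
		"app":              "demo",
		"deployment":       "demo-1",
		"deploymentconfig": "demo",
	}
	disconnectPodFromDC(podLabels)
	fmt.Println(podLabels["deployment"] == "" && podLabels["deploymentconfig"] == "")
	fmt.Println(podLabels["oadp.openshift.io/disconnected-from-dc"])
}
```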
When using default volumes to restic, disconnect DeploymentConfig
pods from ReplicationController and label for later cleanup.
For DeploymentConfigs, set replicas to zero (and paused to false)
and annotate/label for later cleanup.
After restore completes, a post-restore bash script must be run
to clean up restic restore pods and restore DC to prior configuration.
Note that this is currently only done when the backup sets defaultVolumesToRestic=true -- for the restic annotation case, we don't know in the DeploymentConfig plugin that the DC had its pods disconnected. To enable this case we may need to add another backup annotation to communicate the user's desire to disconnect all DC pods and scale down DCs.
Also, note that this does not currently handle the case of post-restore hooks on DC pods. For restore-level hooks, this could be added similarly to what's being done with the defaultVolumesToRestic config. For pod annotations, the concerns are similar to those for restic annotations.
Also note that following a successful restore, the following script would need to be run in order to clean up the restore pods and restore the DC to its prior configuration (script location in github still pending):
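Since the script's location is still pending, here is a hedged sketch of the DC side of the cleanup: reading back the replica count saved at restore time so the DC can be scaled up to its prior configuration. The annotation key and function name are illustrative, not the plugin's actual identifiers.

```go
package main

import (
	"fmt"
	"strconv"
)

// savedReplicas reads the replica count recorded on a DeploymentConfig at
// restore time (when the DC was restored with replicas=0), so post-restore
// cleanup can scale it back up. The annotation key is hypothetical.
func savedReplicas(annotations map[string]string) (int32, bool) {
	v, ok := annotations["oadp.openshift.io/original-replicas"]
	if !ok {
		return 0, false
	}
	n, err := strconv.ParseInt(v, 10, 32)
	if err != nil {
		return 0, false
	}
	return int32(n), true
}

func main() {
	ann := map[string]string{"oadp.openshift.io/original-replicas": "3"}
	if n, ok := savedReplicas(ann); ok {
		fmt.Println(n)
	}
}
```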