Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1840531: Adding rollbackcopy subcommand #372

Conversation

retroflexer
Copy link
Contributor

@retroflexer retroflexer commented Jun 8, 2020

This work provides a container within the etcd pod to take periodic backups.

The immediate need for automated backups is to provide a path to upgrade from 4.4 to 4.5, with the knowledge that if the upgrade fails for some reason, there is no easy path to rollback, as the etcd version 3.4.x (used in OCP 4.5) is incompatible with the etcd version of 3.3.x (used in 4.4.x). Since restoring a backup is the safest way to rollback, it is important to have automated periodical backups to protect the users from unexpected data loss in such scenarios.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 8, 2020
@retroflexer retroflexer force-pushed the rollback-copy-subcommand branch 5 times, most recently from 898c3b8 to 0e0c3ba Compare June 10, 2020 12:28
Copy link
Contributor

@hexfusion hexfusion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good @retroflexer , few notes.

Although we definitely want the backup logic to eventually be in golang having both means we need to support both in 4.4. Just something to consider for bugs/logical changes etc.

Try to make the logical flow as simple as possible, adding unit tests will make review easier in general.

pkg/cmd/rollbackcopy/etcdclientutils.go Outdated Show resolved Hide resolved
pkg/cmd/rollbackcopy/etcdclientutils.go Outdated Show resolved Hide resolved
pkg/cmd/rollbackcopy/etcdclientutils.go Outdated Show resolved Hide resolved
pkg/cmd/rollbackcopy/etcdclientutils.go Outdated Show resolved Hide resolved
pkg/cmd/rollbackcopy/fileioutils.go Show resolved Hide resolved
Copy link
Contributor

@ironcladlou ironcladlou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a quick first pass, there's a lot more here to discuss.

bindata/etcd/pod.yaml Show resolved Hide resolved
bindata/etcd/pod.yaml Outdated Show resolved Hide resolved
pkg/cmd/rollbackcopy/etcdclientutils.go Outdated Show resolved Hide resolved
pkg/cmd/rollbackcopy/etcdclientutils.go Show resolved Hide resolved
pkg/cmd/rollbackcopy/fileioutils.go Outdated Show resolved Hide resolved
pkg/etcdcli/interfaces.go Outdated Show resolved Hide resolved
pkg/cmd/rollbackcopy/tarutils.go Show resolved Hide resolved
pkg/cmd/rollbackcopy/backuputils.go Show resolved Hide resolved
pkg/etcdcli/etcdcli.go Outdated Show resolved Hide resolved
@retroflexer retroflexer force-pushed the rollback-copy-subcommand branch 2 times, most recently from 860da5f to 4e7c15d Compare June 10, 2020 14:38
@retroflexer retroflexer changed the title [wip]: Adding rollbackcopy subcommand Bug 1840531: [wip]: Adding rollbackcopy subcommand Jun 10, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Jun 10, 2020
@openshift-ci-robot
Copy link

@retroflexer: This pull request references Bugzilla bug 1840531, which is invalid:

  • expected the bug to target the "4.6.0" release, but it targets "4.5.0" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1840531: [wip]: Adding rollbackcopy subcommand

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@retroflexer retroflexer changed the title Bug 1840531: [wip]: Adding rollbackcopy subcommand [wip]: Bug 1840531: Adding rollbackcopy subcommand Jun 10, 2020
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 10, 2020
@retroflexer retroflexer changed the title [wip]: Bug 1840531: Adding rollbackcopy subcommand [wip]: Bug 1846025: Adding rollbackcopy subcommand Jun 10, 2020
@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Jun 10, 2020
@openshift-ci-robot
Copy link

@retroflexer: This pull request references Bugzilla bug 1846025, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

[wip]: Bug 1846025: Adding rollbackcopy subcommand

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot removed the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Jun 10, 2020
@retroflexer
Copy link
Contributor Author

/bugzilla refresh

@openshift-ci-robot
Copy link

@retroflexer: This pull request references Bugzilla bug 1846025, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@retroflexer retroflexer force-pushed the rollback-copy-subcommand branch 5 times, most recently from 3abeeef to 1a28818 Compare June 11, 2020 03:13
@retroflexer retroflexer force-pushed the rollback-copy-subcommand branch 3 times, most recently from bbe17cf to 4c1a082 Compare June 16, 2020 15:59
@retroflexer
Copy link
Contributor Author

/test e2e-metal-ipi

@retroflexer
Copy link
Contributor Author

/retest

@retroflexer
Copy link
Contributor Author

/test e2e-aws

1 similar comment
@retroflexer
Copy link
Contributor Author

/test e2e-aws

@hexfusion
Copy link
Contributor

/lgtm
/approve

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 17, 2020
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hexfusion, retroflexer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 17, 2020
@retroflexer
Copy link
Contributor Author

retroflexer commented Jun 17, 2020

Verified the sanity of the backup taken by the rollbackcopy container by comparing it with the tar backup obtained from cluster-backup.sh script.

$ cat backupenv.json 
{"ClusterVersion":"4.4.0-0.ci-2020-06-16-152910","TimeStamp":"2020-06-17_121246"}

Extracted the static pod resources into a /tmp directory and compared the resources with  the backup taken using cluster-backup.sh:
$ sudo diff -r -s static-pod-resources /tmp/cluster-back-up/static-pod-resources/
Files static-pod-resources/kube-controller-manager-pod-9/configmaps/kube-controller-cert-syncer-kubeconfig/kubeconfig and /tmp/cluster-back-up/static-pod-resources/kube-controller-manager-pod-9/configmaps/kube-controller-cert-syncer-kubeconfig/kubeconfig are identical
Files static-pod-resources/kube-controller-manager-pod-9/configmaps/kube-controller-manager-pod/forceRedeploymentReason and /tmp/cluster-back-up/static-pod-resources/kube-controller-manager-pod-9/configmaps/kube-controller-manager-pod/forceRedeploymentReason are identical
Files static-pod-resources/kube-controller-manager-pod-9/configmaps/kube-controller-manager-pod/pod.yaml and /tmp/cluster-back-up/static-pod-resources/kube-controller-manager-pod-9/configmaps/kube-controller-manager-pod/pod.yaml are identical 
Files static-pod-resources/kube-controller-manager-pod-9/configmaps/kube-controller-manager-pod/version and /tmp/cluster-back-up/static-pod-resources/kube-controller-manager-pod-9/configmaps/kube-controller-manager-pod/version are identical
Files static-pod-resources/kube-controller-manager-pod-9/configmaps/serviceaccount-ca/ca-bundle.crt and /tmp/cluster-back-up/static-pod-resources/kube-controller-manager-pod-9/configmaps/serviceaccount-ca/ca-bundle.crt are identical
Files static-pod-resources/kube-controller-manager-pod-9/configmaps/service-ca/ca-bundle.crt and /tmp/cluster-back-up/static-pod-resources/kube-controller-manager-pod-9/configmaps/service-ca/ca-bundle.crt are identical
Files static-pod-resources/kube-controller-manager-pod-9/kube-controller-manager-pod.yaml and /tmp/cluster-back-up/static-pod-resources/kube-controller-manager-pod-9/kube-controller-manager-pod.yaml are identical
Files static-pod-resources/kube-controller-manager-pod-9/secrets/localhost-recovery-client-token/ca.crt and /tmp/cluster-back-up/static-pod-resources/kube-controller-manager-pod-9/secrets/localhost-recovery-client-token/ca.crt are identical
Files static-pod-resources/kube-controller-manager-pod-9/secrets/localhost-recovery-client-token/namespace and /tmp/cluster-back-up/static-pod-resources/kube-controller-manager-pod-9/secrets/localhost-recovery-client-token/namespace are identical
Files static-pod-resources/kube-controller-manager-pod-9/secrets/localhost-recovery-client-token/service-ca.crt and /tmp/cluster-back-up/static-pod-resources/kube-controller-manager-pod-9/secrets/localhost-recovery-client-token/service-ca.crt are identical
Files static-pod-resources/kube-controller-manager-pod-9/secrets/localhost-recovery-client-token/token and /tmp/cluster-back-up/static-pod-resources/kube-controller-manager-pod-9/secrets/localhost-recovery-client-token/token are identicalFiles static-pod-resources/kube-controller-manager-pod-9/secrets/service-account-private-key/service-account.key and /tmp/cluster-back-up/static-pod-resources/kube-controller-manager-pod-9/secrets/service-account-private-key/service-account.key are identical
Files static-pod-resources/kube-controller-manager-pod-9/secrets/service-account-private-key/service-account.pub and /tmp/cluster-back-up/static-pod-resources/kube-controller-manager-pod-9/secrets/service-account-private-key/service-account.pub are identical
Files static-pod-resources/kube-controller-manager-pod-9/secrets/serving-cert/tls.crt and /tmp/cluster-back-up/static-pod-resources/kube-controller-manager-pod-9/secrets/serving-cert/tls.crt are identical
Files static-pod-resources/kube-controller-manager-pod-9/secrets/serving-cert/tls.key and /tmp/cluster-back-up/static-pod-resources/kube-controller-manager-pod-9/secrets/serving-cert/tls.key are identical
Files static-pod-resources/kube-scheduler-pod-10/configmaps/config/config.yaml and /tmp/cluster-back-up/static-pod-resources/kube-scheduler-pod-10/configmaps/config/config.yaml are identical
Files static-pod-resources/kube-scheduler-pod-10/configmaps/kube-scheduler-cert-syncer-kubeconfig/kubeconfig and /tmp/cluster-back-up/static-pod-resources/kube-scheduler-pod-10/configmaps/kube-scheduler-cert-syncer-kubeconfig/kubeconfig are identical
Files static-pod-resources/kube-scheduler-pod-10/configmaps/kube-scheduler-pod/forceRedeploymentReason and /tmp/cluster-back-up/static-pod-resources/kube-scheduler-pod-10/configmaps/kube-scheduler-pod/forceRedeploymentReason are identical
Files static-pod-resources/kube-scheduler-pod-10/configmaps/kube-scheduler-pod/pod.yaml and /tmp/cluster-back-up/static-pod-resources/kube-scheduler-pod-10/configmaps/kube-scheduler-pod/pod.yaml are identical
Files static-pod-resources/kube-scheduler-pod-10/configmaps/kube-scheduler-pod/version and /tmp/cluster-back-up/static-pod-resources/kube-scheduler-pod-10/configmaps/kube-scheduler-pod/version are identical
Files static-pod-resources/kube-scheduler-pod-10/configmaps/scheduler-kubeconfig/kubeconfig and /tmp/cluster-back-up/static-pod-resources/kube-scheduler-pod-10/configmaps/scheduler-kubeconfig/kubeconfig are identical
Files static-pod-resources/kube-scheduler-pod-10/configmaps/serviceaccount-ca/ca-bundle.crt and /tmp/cluster-back-up/static-pod-resources/kube-scheduler-pod-10/configmaps/serviceaccount-ca/ca-bundle.crt are identical
Files static-pod-resources/kube-scheduler-pod-10/kube-scheduler-pod.yaml and /tmp/cluster-back-up/static-pod-resources/kube-scheduler-pod-10/kube-scheduler-pod.yaml are identical
Files static-pod-resources/kube-scheduler-pod-10/secrets/localhost-recovery-client-token/ca.crt and /tmp/cluster-back-up/static-pod-resources/kube-scheduler-pod-10/secrets/localhost-recovery-client-token/ca.crt are identical
Files static-pod-resources/kube-scheduler-pod-10/secrets/localhost-recovery-client-token/namespace and /tmp/cluster-back-up/static-pod-resources/kube-scheduler-pod-10/secrets/localhost-recovery-client-token/namespace are identical
Files static-pod-resources/kube-scheduler-pod-10/secrets/localhost-recovery-client-token/service-ca.crt and /tmp/cluster-back-up/static-pod-resources/kube-scheduler-pod-10/secrets/localhost-recovery-client-token/service-ca.crt are identical
Files static-pod-resources/kube-scheduler-pod-10/secrets/localhost-recovery-client-token/token and /tmp/cluster-back-up/static-pod-resources/kube-scheduler-pod-10/secrets/localhost-recovery-client-token/token are identical
Files static-pod-resources/kube-scheduler-pod-10/secrets/serving-cert/tls.crt and /tmp/cluster-back-up/static-pod-resources/kube-scheduler-pod-10/secrets/serving-cert/tls.crt are identical
Files static-pod-resources/kube-scheduler-pod-10/secrets/serving-cert/tls.key and /tmp/cluster-back-up/static-pod-resources/kube-scheduler-pod-10/secrets/serving-cert/tls.key are identical

Applied cluster-restore.sh to restore the pod to the database saved by the rollbackcopy. It successfully restored and a single node cluster came up:

sudo -E /usr/local/bin/cluster-restore.sh /etc/kubernetes/rollbackcopy/olderVersion.latest
[core@ip-10-0-184-186 backup]$ sudo -E /usr/local/bin/cluster-restore.sh 
/etc/kubernetes/rollbackcopy/olderVersion.latest
...stopping kube-apiserver-pod.yaml
...stopping kube-controller-manager-pod.yaml
...stopping kube-scheduler-pod.yaml
...stopping etcd-pod.yaml
Waiting for container etcd to stop
.complete
Waiting for container etcdctl to stop
complete
Waiting for container etcd-metrics to stop
complete
Waiting for container kube-controller-manager to stop
complete
Waiting for container kube-apiserver to stop
....................................................................................................................................complete
Waiting for container kube-scheduler to stop
complete
removing previous backup /var/lib/etcd-backup/member
Moving etcd data-dir /var/lib/etcd/member to /var/lib/etcd-backup
starting restore-etcd static pod
starting kube-apiserver-pod.yaml
static-pod-resources/kube-apiserver-pod-11/kube-apiserver-pod.yaml
starting kube-controller-manager-pod.yaml
static-pod-resources/kube-controller-manager-pod-9/kube-controller-manager-pod.yaml
starting kube-scheduler-pod.yaml
static-pod-resources/kube-scheduler-pod-10/kube-scheduler-pod.yaml

And the cluster came up with single node cluster successfully!

@hexfusion
Copy link
Contributor

This is a targeted bug fix, instead of committing to master/4.6 and reverting we would like to merge directly in 4.5 with the net as a 4.6 removal. cc @mfojtik once merged into 4.5 we will verify with QE tonight. If there is a problem we will revert. But this has been extensively tested in CI and manually.

@mfojtik mfojtik added staff-eng-approved Indicates a release branch PR has been approved by a staff engineer (formerly group/pillar lead). bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Jun 17, 2020
@hexfusion
Copy link
Contributor

/refresh

@openshift-ci-robot
Copy link

@retroflexer: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-azure 293c03d link /test e2e-azure
ci/prow/e2e-aws-disruptive 293c03d link /test e2e-aws-disruptive

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 8708739 into openshift:release-4.5 Jun 17, 2020
@openshift-ci-robot
Copy link

@retroflexer: All pull requests linked via external trackers have merged: . Bugzilla bug 1840531 has been moved to the MODIFIED state.

In response to this:

Bug 1840531: Adding rollbackcopy subcommand

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hexfusion
Copy link
Contributor

/cherry-pick release-4.4

@openshift-cherrypick-robot

@hexfusion: new pull request created: #379

In response to this:

/cherry-pick release-4.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged. staff-eng-approved Indicates a release branch PR has been approved by a staff engineer (formerly group/pillar lead).
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants