Add etcd 3.x minor version rollback support to migrate-if-needed.sh #59298

Merged
merged 1 commit into from Feb 13, 2018

Conversation

@jpbetz
Contributor

jpbetz commented Feb 3, 2018

Provide automatic etcd 3.x minor version downgrade when using the gcr.io/google_containers/etcd docker images to operate etcd.

Uses etcdctl snapshot save and etcdctl snapshot restore to safely downgrade etcd from 3.2->3.1 or 3.1->3.0. This is safe because the data storage file formats used by etcd have not changed between these versions.

Intended as a stop-gap until we can introduce more comprehensive downgrade support in etcd. The main limitation of this approach is that it is not able to perform zero downtime downgrades for HA clusters. For HA clusters, all members must be stopped and downgraded before the cluster may be restarted at the downgraded version.

Example usage:

  • Initially the etcd.manifest is set to gcr.io/google_containers/etcd:3.0.17, TARGET_VERSION=3.0.17
  • An upgrade to 3.1.11 is initiated.
  • etcd.manifest is updated to gcr.io/google_containers/etcd:3.1.11, TARGET_VERSION=3.1.11
  • etcd restarts and establishes 3.1 as its "cluster version"
  • For whatever reason, a downgrade is initiated
  • etcd.manifest is updated to gcr.io/google_containers/etcd:3.1.11, TARGET_VERSION=3.0.17
  • migrate-if-needed.sh detects that the current version (3.1.11) is newer than the target version, so it:
    • creates a snapshot using etcd & etcdctl 3.1.11
    • backs up the data dir
    • restores the snapshot using etcdctl 3.0.17 to create a replacement data dir
    • starts etcd 3.0.17

Note that while this will roll back to an earlier etcd version, the newer etcd gcr.io image version must continue to be used throughout the downgrade. Only TARGET_VERSION is downgraded.
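The rollback steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function names `minor_of`, `is_minor_rollback`, `save_snapshot` and `restore_snapshot` are hypothetical, the one-minor-version-at-a-time rule is an assumption drawn from the 3.2->3.1 and 3.1->3.0 cases named in the description, and the versioned binary path `/usr/local/bin/etcdctl-<version>` follows the snippet reviewed later in this conversation.

```shell
#!/bin/sh
# Hypothetical helpers; only the `etcdctl snapshot save/restore` subcommands
# and the versioned etcdctl path are taken from the PR itself.

# minor_of prints the minor component of a MAJOR.MINOR.PATCH version string.
minor_of() {
  echo "$1" | cut -d. -f2
}

# is_minor_rollback succeeds when $2 (target) is exactly one 3.x minor
# version below $1 (current). Assumption: one minor step per rollback.
is_minor_rollback() {
  [ "${1%%.*}" = "3" ] || return 1
  [ "${2%%.*}" = "3" ] || return 1
  [ "$(minor_of "$2")" -eq $(( $(minor_of "$1") - 1 )) ]
}

# save_snapshot: $1 = current version, $2 = snapshot file.
# Requires an etcd server of the current version listening on ETCD_PORT.
save_snapshot() {
  ETCDCTL_API=3 "/usr/local/bin/etcdctl-$1" \
    --endpoints "http://127.0.0.1:${ETCD_PORT}" snapshot save "$2"
}

# restore_snapshot: $1 = target version, $2 = snapshot file, $3 = data dir.
# The etcd server must already be stopped when this runs.
restore_snapshot() {
  mv "$3" "$3.bak" || return 1   # keep the old data dir as a backup
  ETCDCTL_API=3 "/usr/local/bin/etcdctl-$1" snapshot restore "$2" --data-dir "$3"
}
```

A caller would check `is_minor_rollback "${CURRENT_VERSION}" "${TARGET_VERSION}"`, run `save_snapshot`, stop etcd, and then run `restore_snapshot` before starting the target version.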

Test coverage was lacking for migrate-if-needed.sh, so this adds some container-level testing to the Makefile for migrating and rolling back. This surfaced a couple of bugs that are fixed by this PR as well.

cc @mml @lavalamp @wenjiaswe

Add automatic etcd 3.2->3.1 and 3.1->3.0 minor version rollback support to gcr.io/google_containers/etcd images. For HA clusters, all members must be stopped before performing a rollback.

@jpbetz jpbetz added this to the v1.10 milestone Feb 3, 2018

@jpbetz jpbetz self-assigned this Feb 3, 2018

@k8s-ci-robot k8s-ci-robot requested review from eparis and jbeda Feb 3, 2018

@@ -227,7 +262,7 @@ for step in ${SUPPORTED_VERSIONS}; do
 echo "Starting etcd ${step} in v3 mode failed"
 exit 1
 fi
-${ETCDCTL_CMD} rm --recursive "${ETCD_DATA_PREFIX}"
+${ETCDCTL_CMD} --endpoints "http://127.0.0.1:${ETCD_PORT}" rm --recursive "${ETCD_DATA_PREFIX}"

@jpbetz

jpbetz Feb 5, 2018

Contributor

@wojtek-t FYI. We found this when adding tests. It looks like this command wasn't running correctly because the --endpoints flag was not set to the port that start_etcd starts etcd on. Maybe we should adjust the logic to check for this data in all 3.x migrations and delete it if found, so that any clusters that have already upgraded to 3.1 get cleaned up?

@wojtek-t

wojtek-t Feb 9, 2018

Member

Makes sense. But let's do this in a separate PR.
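The bug discussed in this thread is that etcdctl was invoked without pointing it at the port the migration etcd was started on. A minimal sketch of a guard against that, assuming `ETCD_PORT` and `ETCDCTL_CMD` are set as in the diff above; the function names `local_endpoint` and `delete_v2_prefix` are hypothetical and not part of the PR:

```shell
#!/bin/sh
# local_endpoint prints the client URL of the locally started etcd, and fails
# loudly if ETCD_PORT is unset instead of falling back to etcdctl's defaults.
local_endpoint() {
  if [ -z "${ETCD_PORT:-}" ]; then
    echo "ETCD_PORT is not set" >&2
    return 1
  fi
  echo "http://127.0.0.1:${ETCD_PORT}"
}

# delete_v2_prefix mirrors the fixed line from the diff: it removes the given
# key prefix recursively, always passing an explicit --endpoints value.
delete_v2_prefix() {
  endpoint="$(local_endpoint)" || return 1
  "${ETCDCTL_CMD:-etcdctl}" --endpoints "${endpoint}" rm --recursive "$1"
}
```

Failing when the port is unknown keeps the cleanup from silently targeting a different etcd instance, or none at all.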

@wojtek-t wojtek-t self-assigned this Feb 5, 2018

}
# Rollback from "3.0.x" version in 'etcd3' mode to "2.2.1" version in 'etcd2' mode, if needed.
rollback_to_etcd2() {

@jpbetz

jpbetz Feb 6, 2018

Contributor

Body of this function is unchanged. The code was moved from inline script (below) to this function.

@@ -0,0 +1,68 @@
#!/bin/sh
# Copyright 2016 The Kubernetes Authors.

@jpbetz

jpbetz Feb 6, 2018

Contributor

Moved these functions to their own source file so they can be leveraged by the tests.

@mml

Better with tests and, yes, functions. Thanks!

cd $(TEMP_DIR)/rollback-etcd2
@echo "Starting $(ETCD2_ROLLBACK_NEW_TAG) etcd and writing some sample data."
docker run -ti -v $(TEMP_DIR)/rollback-etcd2:/var/etcd \

@mml

mml Feb 6, 2018

Contributor

Can we spell out the docker flags? e.g. --tty, --interactive? Maybe I just haven't run docker in too long.

@jpbetz

jpbetz Feb 7, 2018

Contributor

Updated.

test: test-rollback test-rollback-etcd2 test-migrate
all: build test
.PHONY: build push test-rollback test-rollback-etcd2 test-migrate

@mml

mml Feb 6, 2018

Contributor

test is phony too

@jpbetz

jpbetz Feb 7, 2018

Contributor

Thanks!

}
# Rollback from "3.0.x" version in 'etcd3' mode to "2.2.1" version in 'etcd2' mode, if needed.
rollback_to_etcd2() {

@mml

mml Feb 6, 2018

Contributor

The functions help. For clarity, can we move their definitions up to the top? It makes it easier to see what the flow control of the main script is.

@jpbetz

jpbetz Feb 7, 2018

Contributor

Sounds good.

@@ -163,6 +114,90 @@ if [ "${CURRENT_VERSION}" = "2.2.1" -a "${CURRENT_VERSION}" != "${TARGET_VERSION
echo "Backup done in ${BACKUP_DIR}"
fi
# Rollback to previous minor version of etcd 3.x, if needed.
# This approach has only been tested with 3.2 and earlier. If newer versions of etcd support this
# downgrade approach, please update the "${CURRENT_MINOR_VERSION} -le ..." check here.

@mml

mml Feb 6, 2018

Contributor

I don't see this "-le" check.

@jpbetz

jpbetz Feb 7, 2018

Contributor

Good catch. I've removed this comment. We now have the tests, which ensure the rollback flow won't get broken without being noticed.

-e "TARGET_VERSION=$(ETCD2_ROLLBACK_NEW_TAG)" \
-e "DATA_DIRECTORY=/var/etcd/data" \
gcr.io/google_containers/etcd-$(ARCH):$(REGISTRY_TAG) /bin/sh -c \
'/usr/local/bin/migrate-if-needed.sh && \

@wojtek-t

wojtek-t Feb 7, 2018

Member

I don't think this one is needed

@jpbetz

jpbetz Feb 7, 2018

Contributor

It's needed here to write version.txt.

ETCDCTL_API=3 /usr/local/bin/etcdctl-$(ETCD2_ROLLBACK_NEW_TAG) --endpoints http://127.0.0.1:$${ETCD_PORT} put /registry/k1 value1 && \
stop_etcd'
docker run --tty --interactive -v $(TEMP_DIR)/rollback-etcd2:/var/etcd \

@wojtek-t

wojtek-t Feb 7, 2018

Member

Can we merge this one with the previous?

@jpbetz

jpbetz Feb 7, 2018

Contributor

Yes. Done.

source /usr/local/bin/start-stop-etcd.sh && \
START_STORAGE=etcd3 START_VERSION=$(REGISTRY_TAG) start_etcd && \
ETCDCTL_API=3 /usr/local/bin/etcdctl --endpoints http://127.0.0.1:$${ETCD_PORT} put /registry/k1 value1 && \
stop_etcd'

@wojtek-t

wojtek-t Feb 7, 2018

Member

Why don't we need to write the version.txt file here?

@jpbetz

jpbetz Feb 7, 2018

Contributor

migrate-if-needed.sh writes it.
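Judging from the test assertion in this PR that compares the content of version.txt to `$(ROLLBACK_REGISTRY_TAG)/etcd3`, the file appears to hold a single `<version>/<storage>` value. A small sketch of reading and checking that value, assuming that format; the helper names `version_of`, `storage_of` and `check_version_file` are hypothetical:

```shell
#!/bin/sh
# Split a version.txt value such as "3.0.17/etcd3" into its two parts.
version_of() { echo "${1%%/*}"; }   # text before the first "/"
storage_of() { echo "${1##*/}"; }   # text after the last "/"

# check_version_file succeeds if file $1 exists and contains exactly "$2/$3",
# where $2 is the expected version and $3 the expected storage mode.
check_version_file() {
  [ -f "$1" ] && [ "$(cat "$1")" = "$2/$3" ]
}
```

For example, `check_version_file /var/etcd/data/version.txt 3.0.17 etcd3` would express the same check as the Makefile test without inlining the string comparison.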

docker run --tty --interactive -v $(TEMP_DIR)/rollback-test:/var/etcd \
gcr.io/google_containers/etcd-$(ARCH):$(REGISTRY_TAG) /bin/sh -c \
'[ $$(cat /var/etcd/data/version.txt) = $(ROLLBACK_REGISTRY_TAG)/etcd3 ] && \
grep -q value1 /var/etcd/keyspace.txt'

@wojtek-t

wojtek-t Feb 7, 2018

Member

Those two rollback tests seem pretty similar. Would it be possible to somehow share the code between them?

@jpbetz

jpbetz Feb 7, 2018

Contributor

I looked into it briefly, and the main reason I kept them separate is that we will be deleting the etcd2 support later this year; keeping the tests separate makes that deletion trivial.

-e "TARGET_VERSION=$${tag}" \
-e "DATA_DIRECTORY=/var/etcd/data" \
gcr.io/google_containers/etcd-$(ARCH):$(REGISTRY_TAG) /bin/sh -c \
"/usr/local/bin/migrate-if-needed.sh && \

@wojtek-t

wojtek-t Feb 7, 2018

Member

Seems unneeded.

@jpbetz

jpbetz Feb 7, 2018

Contributor

Needed to write an initial version.txt

gcr.io/google_containers/etcd-$(ARCH):$(REGISTRY_TAG) /bin/sh -c \
"/usr/local/bin/migrate-if-needed.sh && \
source /usr/local/bin/start-stop-etcd.sh && \
START_STORAGE=etcd2 START_VERSION=$${tag} start_etcd && \

@wojtek-t

wojtek-t Feb 7, 2018

Member

Why is this always etcd v2?

@jpbetz

jpbetz Feb 7, 2018

Contributor

Good catch. This is incorrect. Fixing.

ETCDCTL_CMD="/usr/local/bin/etcdctl-${TARGET_VERSION}"
NAME="etcd-$(hostname)"
ETCDCTL_API=3 ${ETCDCTL_CMD} snapshot restore "${SNAPSHOT_FILE}" \
--data-dir "${DATA_DIRECTORY}" --name "${NAME}" --initial-cluster "${NAME}=http://localhost:2380"

@wojtek-t

wojtek-t Feb 7, 2018

Member

This initial cluster will work only in non-HA mode, right?

@jpbetz

jpbetz Feb 7, 2018

Contributor

Yes. After this restore, the etcd cluster is then started using the --initial-cluster {{ etcd_cluster }} setting provided in the etcd.manifest. Looks like we also need to pass that setting in to migrate-if-needed.sh.

@jpbetz

jpbetz Feb 7, 2018

Contributor

Fixed.
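The point of this thread is that a hard-coded single-member `--initial-cluster` only suits non-HA setups, so the cluster setting needs to be passed in from the manifest. One way this could look, as a sketch only: the variable name `INITIAL_CLUSTER` and the function names are assumptions for illustration, and the fallback is the single-member value from the snippet under review.

```shell
#!/bin/sh
# initial_cluster_for prints the --initial-cluster value for member $1.
# INITIAL_CLUSTER (assumed name) carries the HA member list when provided;
# otherwise the single-member default from the reviewed snippet is used.
initial_cluster_for() {
  echo "${INITIAL_CLUSTER:-$1=http://localhost:2380}"
}

# restore_with_cluster: $1 = snapshot file, $2 = data dir, $3 = member name.
# ETCDCTL_CMD is assumed to point at the target version's etcdctl binary.
restore_with_cluster() {
  ETCDCTL_API=3 "${ETCDCTL_CMD:-etcdctl}" snapshot restore "$1" \
    --data-dir "$2" --name "$3" \
    --initial-cluster "$(initial_cluster_for "$3")"
}
```

Keeping the default in one small function makes the non-HA behavior explicit while letting an HA deployment override it through the environment.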

@jpbetz

Contributor

jpbetz commented Feb 7, 2018

Test failure is a parse error in etcd-empty-dir-cleanup.yaml which was not touched by this PR. Rebasing.

@jpbetz

Contributor

jpbetz commented Feb 8, 2018

@mml @wojtek-t All feedback has been applied. PTAL.

@wojtek-t

/approve no-issue

I will let @mml also take a look.

@jpbetz

Contributor

jpbetz commented Feb 9, 2018

@wojtek-t Regarding splitting --endpoints "http://127.0.0.1:${ETCD_PORT}" out into a separate PR: without that change the newly added tests fail, and splitting it out into a separate PR (or sequencing that PR to go first) is quite a bit of extra work. Would it be okay if we instead just open an issue explaining the bug that this fixes and note that this PR fixes that issue?

@wojtek-t

Member

wojtek-t commented Feb 9, 2018

> @wojtek-t For splitting --endpoints "http://127.0.0.1:${ETCD_PORT}" out into a separate PR- Without the change the newly added tests fail and splitting them out into a separate PR (or sequencing that PR to go first) is quite a bit of extra work. Okay if instead we just open an issue explaining the bug that this fixes and note that this PR fixes that issue?

I didn't mean splitting this change into a separate PR. I meant splitting out this part:

> Maybe we should adjust the logic to check for the data in all 3.x migrations and delete it if it is found to make sure any clusters that have already upgraded to 3.1 get cleaned up?

@jpbetz

Contributor

jpbetz commented Feb 9, 2018

@wojtek-t Ah, yes. Totally agree. Separate PR for that.

@mml

Contributor

mml commented Feb 12, 2018

/lgtm

Might want a followup issue for making sure these tests get run by CI.

Thanks again.

@mml

mml approved these changes Feb 12, 2018

@k8s-ci-robot

Contributor

k8s-ci-robot commented Feb 12, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jpbetz, mml, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these OWNERS Files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-merge-robot k8s-merge-robot removed the lgtm label Feb 12, 2018

@jpbetz jpbetz added the lgtm label Feb 13, 2018

@jpbetz

Contributor

jpbetz commented Feb 13, 2018

Clean rebase, reapplying lgtm

@k8s-merge-robot

Contributor

k8s-merge-robot commented Feb 13, 2018

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-ci-robot

Contributor

k8s-ci-robot commented Feb 13, 2018

@jpbetz: The following tests failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-kubernetes-e2e-kubeadm-gce-canary | 5fb7ecf | link | /test pull-kubernetes-e2e-kubeadm-gce-canary |
| pull-kubernetes-e2e-kops-aws | 746e247 | link | /test pull-kubernetes-e2e-kops-aws |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-merge-robot

Contributor

k8s-merge-robot commented Feb 13, 2018

Automatic merge from submit-queue (batch tested with PRs 59298, 59773, 59772). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-merge-robot k8s-merge-robot merged commit c1216df into kubernetes:master Feb 13, 2018

11 of 13 checks passed

pull-kubernetes-e2e-kops-aws Job failed.
Submit Queue Required Github CI test is not green: pull-kubernetes-e2e-kops-aws
cla/linuxfoundation jpbetz authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-cross Job succeeded.
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-gke Job succeeded.
pull-kubernetes-kubemark-e2e-gce Job succeeded.
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-unit Job succeeded.
pull-kubernetes-verify Job succeeded.

k8s-merge-robot added a commit that referenced this pull request Mar 1, 2018

Merge pull request #59834 from jpbetz/automated-cherry-pick-of-#59298-…
…origin-release-1.9

Automatic merge from submit-queue.

Automated cherry pick of #59298: Add etcd 3.x minor version rollback support to

Cherry pick of #59298 on release-1.9.

#59298: Add etcd 3.x minor version rollback support to

```release-note
Add automatic etcd 3.2->3.1 and 3.1->3.0 minor version rollback support to gcr.io/google_containers/etcd images. For HA clusters, all members must be stopped before performing a rollback.
```