New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reimplement migrate-if-needed.sh in go #61241

Merged
merged 1 commit into from Apr 11, 2018

Conversation

@jpbetz
Contributor

jpbetz commented Mar 15, 2018

The migrate-if-needed.sh script was already partially implemented in go (see the attachlease and rollback sub-dirs), but was still unnecessarily difficult to understand and test. This closely reimplements the original logic but with improved code structure, error handling and testing.

Where possible, go code that was previously executed as separate binaries is now statically linked into a single 'migrate' go cobra CLI app, which is then thinly wrapped bymigrate-if-needed.sh.

There are numerous additional improvements that need to be made, but will be submitted in future PRs. This PR is focused on achieving parity with the pre-existing functionality and introducing some much needed test coverage, in particular HA cluster upgrade test coverage.

It appears that the attachlease and rollback go binaries are no longer needed as standalones and so I have consolidated them into the new migrate go binary. Other than that, this change aims to be 100% backward compatible.

NONE
@jpbetz

This comment has been minimized.

Contributor

jpbetz commented Mar 15, 2018

done
echo "$(date +'%Y-%m-%d %H:%M:%S') Migration finished"
/usr/local/bin/migrate \

This comment has been minimized.

@wenjiaswe

wenjiaswe Mar 16, 2018

Contributor
  1. Did you miss binDir in the list? (line 49 https://github.com/kubernetes/kubernetes/pull/61241/files#diff-755f39aaeeec98fb826929445011dde0)
  2. Would it be better if they are in the same order as in migrateOpts struct in migrate.go? (minor)

This comment has been minimized.

@jpbetz

jpbetz Mar 17, 2018

Contributor

#1 - The default value is correct, we don't need to explicitly set it.
#2 - I don't feel strongly about it. The flags are also in a different order in the migrate -h output. We did keep the relative order of all flags and env vars throughout migrate-if-needed.sh, which seems like the most important place to be consistent.

This comment has been minimized.

@wenjiaswe

wenjiaswe Mar 19, 2018

Contributor

Sounds good.

// and etcdctl commands called via the shell.
type CombinedEtcdClient struct {
cfg *EtcdMigrateCfg
v2client clientv2.Client

This comment has been minimized.

@wenjiaswe

wenjiaswe Mar 16, 2018

Contributor

Is v2client supposed to be a pointer as well?

This comment has been minimized.

@jpbetz

jpbetz Mar 17, 2018

Contributor

Nope, it's an interface.

}
if !exists {
glog.Infof("data directory '%s' does not exist, creating it", path)
err := os.MkdirAll(path, 0755)

This comment has been minimized.

@cheftako

cheftako Mar 16, 2018

Member

I realize that the script your replacing was unspecific on the permissions for the directory being created. I believe we were probably creating it with 755 but it seems that the data directory containing things like secrets should not be world readable.

This comment has been minimized.

@jpbetz

jpbetz Mar 17, 2018

Contributor

Thanks for the nudge. I'll review what exactly happens in practice (in GCE/GKE) and see if we can restrict this to 0750

This comment has been minimized.

@mml

mml Mar 19, 2018

Contributor

Related question: does our umask impact this? (I think yes.) By default, GNU mkdir with no arguments uses 0777 (modified by umask). So the "no change" change would be to use 0777 and probably add a comment asking if we can/should tighten that up.

Unexpectedly restricting the permissions on update could cause an outage.

if err != nil {
return err
}
err = os.MkdirAll(backupDir, 0755)

This comment has been minimized.

@cheftako
}
err = w.SaveSnapshot(walpb.Snapshot{Index: hardstate.Commit, Term: hardstate.Term})
if err != nil {
glog.Fatal(err)
return err
}
w.Close()

This comment has been minimized.

@cheftako

cheftako Mar 16, 2018

Member

Are we deliberately trying to skip close() in the event of an error saving a snapshot? That seems wrong...

This comment has been minimized.

@jpbetz

jpbetz Mar 17, 2018

Contributor

Good catch, this is what we run in production today :P Fixing.

@@ -1,45 +0,0 @@
# Rollback workflow
Build it in this directory.

This comment has been minimized.

@wenjiaswe

wenjiaswe Mar 16, 2018

Contributor

Are you planning to add another README file as documentation?

This comment has been minimized.

@jpbetz

jpbetz Mar 17, 2018

Contributor

We'll consolidate docs to the cluster/images/etcd/README.md file. I'll do another pass on it shortly to make sure it's fleshed out.

return d.dir.Close()
}
// Initialize set the version.txt to the target version if the data

This comment has been minimized.

@wenjiaswe

wenjiaswe Mar 17, 2018

Contributor

I think this part correspond to the logic in

# If there is no data in DATA_DIRECTORY, this means that we are
, this means that we are, I like the smart new way, I am just not sure how obvious the new comment is for users, would everyone know that 1. if the data directory is empty then we don't need to do migration? 2. if version.txt in etcd data folder is set to target version then there is no migration? Maybe it's actually obvious to other people, ignore me then.

limitations under the License.
*/
package main

This comment has been minimized.

@wenjiaswe

wenjiaswe Mar 17, 2018

Contributor

Do we need a test for empty data directory?

version *EtcdVersionPair
match bool
}{
{"3.1.2/etcd3", &EtcdVersionPair{&EtcdVersion{3, 1, 2}, storageEtcd3}, true},

This comment has been minimized.

@wenjiaswe

wenjiaswe Mar 17, 2018

Contributor

Shall we add some more random cases, like completely out of nowhere number of etcd version and etcdstorage version, like 23.1.6 and etcd6? As well as nonserialized etcd version pair? I know the code handles these scenario, but just for completeness of the test.

This comment has been minimized.

@jpbetz

jpbetz Mar 20, 2018

Contributor

Sure, I've expanded the test coverage.

@lavalamp

This comment has been minimized.

Member

lavalamp commented Apr 5, 2018

@jpbetz

This comment has been minimized.

Contributor

jpbetz commented Apr 5, 2018

@lavalamp I've switched back to static and added a flag incase we ever want to change it from the default of 18629. Since etcd and etcd-events run in separate docker containers and that port is not exposed in etcd.manifest, they can safely use the same port concurrently in the master kubelet. @wojtek-t the fix to use peer port 2381 for events is in migrate-if-needed.sh.

@lavalamp

This comment has been minimized.

Member

lavalamp commented Apr 5, 2018

@jpbetz

This comment has been minimized.

Contributor

jpbetz commented Apr 5, 2018

/retest

@jpbetz

This comment has been minimized.

Contributor

jpbetz commented Apr 6, 2018

@wojtek-t @cheftako All feedback applied. PTAL.

@k8s-merge-robot k8s-merge-robot removed this from the v1.11 milestone Apr 6, 2018

@jpbetz jpbetz added this to the v1.11 milestone Apr 6, 2018

const (
versionFilename = "version.txt"
defaultPort uint64 = 18629

This comment has been minimized.

@wojtek-t

wojtek-t Apr 9, 2018

Member

Hmm - I still don't see solution for port conflicts between main etcd and event etcd.

I think you should at least set "--port" flag when you're calling "migrate binary in the script.

This comment has been minimized.

@jpbetz

jpbetz Apr 9, 2018

Contributor

No need. Since etcd and etcd events are run in separate docker containers, and neither of them expose 18629 to the host operating system, they can concurrently use this port within the docker containers. I'd done a couple sanity checks to make sure this works as expected.

This comment has been minimized.

@wojtek-t

wojtek-t Apr 10, 2018

Member

But we are using hostnetwork = true:
https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/manifests/etcd.manifest#L12

I think this is causing a problem...

This comment has been minimized.

@jpbetz

jpbetz Apr 10, 2018

Contributor

@wojtek-t Ohh. Thanks so much for catching this! I've populated the --port flag in migrate-if-needed.sh.

k8s-merge-robot added a commit that referenced this pull request Apr 9, 2018

Merge pull request #61956 from jpbetz/migrate-if-needed-ha-fix-1.9
Automatic merge from submit-queue.

Backport etcd.manifest fixes for HA clusters from #61241 to 1.9

Backport the `etcd.manifest` changes from #61241 to kubernetes 1.9. This fixes GCE configurations using HA etcd with k8s.gcr.io/etcd images built from #61241 (k8s.gcr.io/etcd:e.g. 3.1.13-0).

Note: I am not including the #60422 fix to use `host_ip` instead of `hostname` in this backport, which is required for running etcd 3.2+.  We might need that, but I'd like backport it in a separate PR if we do.

```release-note
Fix GCE etcd scripts to pass in all required parameters for the etcd migration utility to correctly perform HA upgrades and downgrades
```
@k8s-ci-robot

This comment has been minimized.

Contributor

k8s-ci-robot commented Apr 10, 2018

@jpbetz: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-kubemark-e2e-gce-big 20454bc link /test pull-kubernetes-kubemark-e2e-gce-big
pull-kubernetes-e2e-gce-100-performance 20454bc link /test pull-kubernetes-e2e-gce-100-performance

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@wojtek-t

This comment has been minimized.

Member

wojtek-t commented Apr 10, 2018

/approve no-issue

I'm fine with that - not LGTM-ing for now to give others one more change.

@wojtek-t

This comment has been minimized.

Member

wojtek-t commented Apr 10, 2018

/retest

@k8s-ci-robot

This comment has been minimized.

Contributor

k8s-ci-robot commented Apr 10, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheftako, jpbetz, mml, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wojtek-t

This comment has been minimized.

Member

wojtek-t commented Apr 11, 2018

I'm LGTM-ing it to unblock it and have more soaking time of it before release (though it probably doesn't really matter).

@wojtek-t

This comment has been minimized.

Member

wojtek-t commented Apr 11, 2018

/lgtm

@k8s-merge-robot

This comment has been minimized.

Contributor

k8s-merge-robot commented Apr 11, 2018

[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process

@cheftako @jpbetz @lavalamp @mml @wojtek-t

Pull Request Labels
  • sig/gcp: Pull Request will be escalated to these SIGs if needed.
  • priority/important-longterm: Escalate to the pull request owners; move out of the milestone after 1 attempt.
  • kind/feature: New functionality.
Help
@k8s-merge-robot

This comment has been minimized.

Contributor

k8s-merge-robot commented Apr 11, 2018

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-merge-robot k8s-merge-robot merged commit a8899b3 into kubernetes:master Apr 11, 2018

14 of 15 checks passed

Submit Queue Required Github CI test is not green: pull-kubernetes-e2e-gce
Details
cla/linuxfoundation jpbetz authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-cross Job succeeded.
Details
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-e2e-gke Job succeeded.
Details
pull-kubernetes-e2e-kops-aws Job succeeded.
Details
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce Job succeeded.
Details
pull-kubernetes-local-e2e Skipped
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details

@wenjiaswe wenjiaswe referenced this pull request Nov 8, 2018

Closed

REQUEST: New membership for @wenjiaswe #227

6 of 6 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment