Allow uneven etcd zones #6641
Conversation
Hi @adammw. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
I'm not sure this is a good idea. If I'm not mistaken, etcd operates via quorum. In the case of 3 etcd members in 2 AZs, if the AZ hosting 2 instances goes out, then you've lost quorum. I believe the intent of the kops HA design is that if an AZ is lost, the cluster doesn't go out. Operating in 2 AZs is not a very HA design. Perhaps an argument which ignores the check would be a better solution.
Etcd can operate stand-alone, which it would if the AZ with two instances fails. Etcd is also capable of running with two nodes, just that the consensus protocol wouldn't work and thus performance would degrade badly. Personally I see this as quite a good approach to get a more HA etcd without being required to use more AZs.
/lgtm It's a small change, so I'm unsure if it's really still suitable for 1.12, but definitely a thing for 1.13.
So the problem I was trying to solve here was to create an HA cluster in us-west-1, which only has two AZs available. However, even after solving this issue, I ran into other issues where having multiple etcd members in the same instance group, or having multiple instance groups in the same zone, just doesn't work.
@chrisz100 https://coreos.com/etcd/docs/latest/op-guide/failures.html according to this, in the scenario of 1 AZ going down that has 2 instances, this would be considered a majority failure and cause the cluster to cease performing writes. I personally would not consider this to be an ideal HA scenario. I completely appreciate that in some cases (as in the author's case) it may be the best you have to work with. But I don't feel kops should lower its bar for a good HA design without the user willingly agreeing to it. And it also sounds as if there should be additional work which goes into this PR to make it work correctly in those scenarios to begin with.
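To make the quorum argument concrete, here is a minimal, hypothetical Go sketch of the arithmetic being discussed (the zone names and helper are made up for illustration, not kops code): with 3 members split 2/1 across two AZs, losing the AZ that holds 2 members leaves 1 survivor, which is below the quorum of 2, so writes stop.

```go
package main

import "fmt"

// quorum returns the minimum number of members etcd needs in order to
// keep accepting writes: a strict majority of the cluster.
func quorum(members int) int {
	return members/2 + 1
}

func main() {
	// Hypothetical layout for a 3-member cluster spread over two AZs.
	membersPerAZ := map[string]int{"us-west-1a": 2, "us-west-1b": 1}

	total := 0
	for _, n := range membersPerAZ {
		total += n
	}

	for az, n := range membersPerAZ {
		survivors := total - n
		fmt.Printf("lose %s: %d of %d members left, quorum is %d, writes ok: %v\n",
			az, survivors, total, quorum(total), survivors >= quorum(total))
	}
	// Output (map order may vary):
	// lose us-west-1a: 1 of 3 members left, quorum is 2, writes ok: false
	// lose us-west-1b: 2 of 3 members left, quorum is 2, writes ok: true
}
```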
Do you have more information about these issues? Before this is sorted out:
@@ -157,10 +157,11 @@ func (c *populateClusterSpec) run(clientset simple.Clientset) error {
	//if clusterSubnets[zone] == nil {
	//	return fmt.Errorf("EtcdMembers for %q is configured in zone %q, but that is not configured at the k8s-cluster level", etcd.Name, m.Zone)
	//}
	etcdNames[m.Name] = m
Good fix here - we weren't actually checking for duplicate names!
	etcdInstanceGroups[instanceGroupName] = m
}

-	if (len(etcdInstanceGroups) % 2) == 0 {
+	if (len(etcdNames) % 2) == 0 {
We could also look at the size of the etcd.Members map, but this is fine.
I would classify this as a bug fix, because the code wasn't checking correctly for duplicate names (for example). However, yes, if you're putting 2 of 3 nodes in the same AZ, it's arguably not much more reliable than running non-HA. I'm happy to go either way - on the one hand the previous checks were wrong, on the other hand it was helpfully wrong in that it stopped some configurations that probably weren't ideal. And yes, it should work with multiple etcd members in the same group / AZ (though it gets confusing because they can mount each other's volumes). It sounds like we need a test for that case though...
/lgtm
/approve Thanks @adammw 👍
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: adammw, KashifSaadat. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Cherry pick of #6641 onto release-1.12
Previously, trying to use multiple etcd members across two AZs would result in the "there should be an odd number of master-zones" error, even if there was an odd number of etcd members.
This change fixes the etcdNames check by actually storing the values in the map, and uses that to check that the number of etcd members is an odd number, even if the zones are reused (there is an existing warning for that).
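As a rough sketch of what the corrected check amounts to (simplified, hypothetical types and function names; not the actual populateClusterSpec code): record every member name in a map so duplicates are caught, then require the total number of distinct members to be odd, regardless of how many zones they span.

```go
package main

import "fmt"

// EtcdMember is a simplified stand-in for the kops etcd member spec.
type EtcdMember struct {
	Name string
	Zone string
}

// validateEtcdMembers rejects duplicate member names and requires an odd
// member count, mirroring the intent of the fixed check.
func validateEtcdMembers(members []EtcdMember) error {
	etcdNames := map[string]EtcdMember{}
	for _, m := range members {
		if _, dup := etcdNames[m.Name]; dup {
			return fmt.Errorf("duplicate etcd member name %q", m.Name)
		}
		etcdNames[m.Name] = m
	}
	if len(etcdNames)%2 == 0 {
		return fmt.Errorf("there should be an odd number of etcd members, got %d", len(etcdNames))
	}
	return nil
}

func main() {
	// Three members spread over only two zones pass the count check;
	// reusing a zone is allowed (kops warns about that separately).
	members := []EtcdMember{
		{Name: "a", Zone: "us-west-1a"},
		{Name: "b", Zone: "us-west-1a"},
		{Name: "c", Zone: "us-west-1b"},
	}
	fmt.Println(validateEtcdMembers(members)) // <nil>
}
```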