
Managed HA etcd cluster #332

Merged
mumoshu merged 3 commits into master from ha-etcd on Mar 1, 2017

Conversation

mumoshu
Contributor

@mumoshu mumoshu commented Feb 20, 2017

This is a WIP pull request to achieve a "Managed HA etcd cluster": each etcd node's private IP is resolved via a public EC2 hostname, which is stabilized by a pool of EBS-and-EIP pairs reserved for etcd nodes.

After this change, EC2 instances backing "virtual" etcd nodes are managed by an ASG.

Supported use-cases:

  • Automatic recovery from temporary Etcd node failures
    • Even if all the nodes went down, the cluster recovers eventually as long as the EBS volumes aren't corrupted
  • Rolling-update of the instance type for etcd nodes without downtime
    • = scaling Etcd nodes out, not by modifying the ASG directly but indirectly via CloudFormation stack updates
  • Other use-cases implied by the fact that the nodes are managed by ASGs
  • You can choose "eip" or "eni" for etcd node (= etcd member) identity via the etcd.memberIdentityProvider key in cluster.yaml
    • "eip", which is the default setting, is recommended
    • If you want, you can choose "eni" instead
    • If you choose "eni" and your region has fewer than 3 AZs, setting etcd.internalDomainName to something other than the default is HIGHLY RECOMMENDED, to prepare for disaster recovery
    • As an advanced option, a DNS other than the Amazon DNS can be used (when memberIdentityProvider is "eni", internalDomainName is set, manageRecordSets is false, and every EC2 instance uses a custom DNS capable of resolving FQDNs under internalDomainName)

Unsupported use-cases:

  • Automatic recovery from permanent failure of more than (N-1)/2 Etcd nodes.
    • Requires etcd backups and automatic determination, via ETCD_INITIAL_CLUSTER_STATE, of whether a new etcd cluster should be created or not
  • Scaling-in of Etcd nodes
    • Just remains untested because it isn't my primary focus in this area. Contributions are welcome

Relevant issues to be (partly) resolved via this PR:

The general idea is to make etcd nodes "virtual" by retaining the state and the identity of an etcd node in a pair of an EBS volume and an EIP (or an ENI), respectively.
This way, we can recover/recreate/rolling-update the EC2 instances backing etcd nodes without other moving parts like external apps, ASG lifecycle hooks, SQS queues, SNS topics, etc.
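As a rough sketch of that claim-a-pair flow (not the exact userdata in this PR; the device name and tag names follow the diff discussed below, the rest are assumptions):

#!/bin/bash -e
# Identify this instance and its AZ via the EC2 metadata service.
instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
az=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
region=${az%?}  # strip the trailing AZ letter to get the region

# Pick one "free" (status=available) etcd EBS volume in this AZ.
vol=$(aws ec2 describe-volumes --region "$region" \
  --filters "Name=tag:kube-aws:owner:role,Values=etcd" \
            "Name=status,Values=available" \
            "Name=availability-zone,Values=$az" \
  --output json | jq -r '.Volumes[0]')
vol_id=$(echo "$vol" | jq -r '.VolumeId')
eip_alloc_id=$(echo "$vol" | jq -r '.Tags[] | select(.Key == "kube-aws:etcd:eip-alloc-id").Value')

# Attaching the volume acts as the lock: only one instance can win the attachment.
aws ec2 attach-volume --region "$region" --volume-id "$vol_id" \
  --instance-id "$instance_id" --device /dev/xvdf

# The winner then takes the paired EIP, which gives the node its stable identity.
aws ec2 associate-address --region "$region" --allocation-id "$eip_alloc_id" \
  --instance-id "$instance_id"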

Unlike well-known etcd HA solutions like crewjam/etcd-aws and MonsantoCo/etcd-aws-cluster, this is intended to be a less flexible but simpler alternative, or the basis for introducing similar solutions to those.

Implementation notes

General rules

  • If you rely on Route 53 record sets, don't modify the ones initially created by CloudFormation
    • Doing so breaks CloudFormation stack deletion, because CloudFormation has no way to know about the modified record sets and therefore can't cleanly remove them.
  • To prepare for disaster recovery of a single-AZ etcd cluster (possible when the user relies on an AWS region with 2 or fewer AZs), use Route 53 record sets or EIPs to retain network identities across AZs
    • ENIs and EBS volumes can't be moved to another AZ
    • An EBS volume's data can, however, be transferred via a snapshot (see the sketch below)
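A minimal sketch of that snapshot-based transfer (region, AZ, and volume ID below are hypothetical):

# Snapshot the surviving etcd volume, then recreate it in another AZ of the same region.
snap_id=$(aws ec2 create-snapshot --region ap-southeast-2 \
  --volume-id vol-0123456789abcdef0 --description "etcd0 data for DR" \
  --query SnapshotId --output text)
aws ec2 wait snapshot-completed --region ap-southeast-2 --snapshot-ids "$snap_id"
aws ec2 create-volume --region ap-southeast-2 --availability-zone ap-southeast-2b \
  --snapshot-id "$snap_id" --volume-type gp2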

Examples of experimented but not employed strategies

  • Static private IPs via a pool of ENIs dynamically assigned to EC2 instances under control of a single ASG
    • ENIs can't move across AZs. What happens when you have 2 ENIs in one AZ and 1 ENI in another, and the former AZ goes down? Nothing, until that AZ comes back up! That isn't the degree of H/A I wish to have at all!
  • Dynamic private IPs via stable hostnames using a pool of EIP&EBS pairs, single ASG
    • EBS is required in order to achieve "locking" of the pair associated with an etcd instance
      • First, identify a "free" pair by filtering available EBS volumes and try to attach one to the EC2 instance
      • Successful attachment of an EBS volume means that the paired EIP can also be associated with the instance without race conditions
    • EBS volumes can't move across AZs either. What happens when you have 2 pairs in AZ 1 and 1 pair in AZ 2, and the AZ with 2 pairs goes down? The options you can take are (1) manually alter the surviving AZ to have 3 etcd nodes and then manually elect a new leader, or (2) recreate the etcd cluster within the surviving AZ by pointing etcd.subnets[] at it in cluster.yaml and running kube-aws update, then SSH into one of the nodes and restore the etcd state from a backup. Neither is automatic.

TODOs

  • Move the userdata for etcd nodes to S3 (to work around the cfn limit of 16KB on userdata size, like we've done for worker and controller nodes)
  • Make EBS volumes be created on stack creation but not attached yet
  • Make EIPs be created as part of the control-plane stack but not associated yet
  • Associate a pair of an EBS volume and an EIP before starting the etcd process
  • Make each etcd node be managed under a dedicated ASG
    • Just change the EC2::Instance resources in the cfn stack templates to a corresponding pair of a launch configuration and an ASG
  • Make each etcd ASG depend on the next etcd ASG
    • So that we can achieve rolling-update of etcd ASGs hence etcd nodes
  • Trigger cfn-signal for the etcd ASG managing an etcd EC2 instance
    • Inject the new env var KUBE_AWS_ETCD_INDEX=<$etcdIndex> and KUBE_AWS_STACK_NAME from the stack template via embedded EC2 userdata
    • cfn-signal -e 0 --region {{.Region}} --resource {{.Etcd.LogicalName}}${{.EtcdIndexEnvVarName}} --stack ${{.StackNameEnvVarName}}
  • Tweak the ExecStartPre entries of the cfn-signal.service unit if necessary (see the sketch after this list)
    • cfn-signal.service for etcd nodes is set to Wants=etcd2.service and After=etcd2.service
    • Are these enough to ensure that etcd2.service is up and running?
      • When a rolling update is in progress, don't we need to wait until a newly recreated etcd member, whose data (persisted in the EBS volume that had been attached to the previously terminated instance replaced by the newly created one) may be outdated, catches up with the latest data from the running etcd cluster?
    • I'll leave this for further improvement(s)
  • Security/Fail-proof: Prevent attaching/associating wrong EBS volumes and EIPs
  • Various clean-ups
  • Fix tests
  • Pass E2e tests
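Regarding the cfn-signal item above, a minimal sketch of such a health gate (the resource naming and the region variable are assumptions, not what this PR generates; as discussed later in this thread, blocking on cluster-health has a chicken-and-egg problem during initial bootstrap, so this is illustrative only):

#!/bin/bash -e
# Wait until the local etcd2 member reports a healthy cluster before signaling,
# so that CloudFormation only proceeds to replace the next etcd node afterwards.
source /etc/environment  # provides COREOS_PRIVATE_IPV4 on Container Linux
until etcdctl --peers "http://${COREOS_PRIVATE_IPV4}:2379" cluster-health; do
  echo "etcd cluster not healthy yet; retrying..."
  sleep 10
done

# KUBE_AWS_ETCD_INDEX and KUBE_AWS_STACK_NAME are the env vars injected via userdata;
# "Etcd${KUBE_AWS_ETCD_INDEX}" is a hypothetical rendering of the ASG's logical name.
cfn-signal -e 0 --region "${KUBE_AWS_REGION:-us-west-2}" \
  --resource "Etcd${KUBE_AWS_ETCD_INDEX}" \
  --stack "${KUBE_AWS_STACK_NAME}"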

Non-TODOs(for now)

  • Graceful termination of etcd nodes
    • like kube-node-drainer.service for worker nodes
    • to elect a new leader when the terminating node was the former leader

@mumoshu mumoshu added this to the v0.9.5-rc.1 milestone Feb 20, 2017
@mumoshu
Contributor Author

mumoshu commented Feb 20, 2017

Just realized that what I'm trying to achieve is similar in concept to simlodon mentioned in kubernetes/kops#772 😃

@mumoshu
Contributor Author

mumoshu commented Feb 20, 2017

Ah, I had somehow been forgetting the fact that an EBS volume can't move across AZs either 😢

@mumoshu
Contributor Author

mumoshu commented Feb 20, 2017

I'm going to create a dedicated ASG for each etcd instance.
Each etcd ASG would depend on the next ASG so that we can hopefully do rolling-updates of ASGs.

@mumoshu
Contributor Author

mumoshu commented Feb 21, 2017

I've verified the current implementation by triggering a rolling update of the instance type for etcd nodes while running the k8s conformance test.
The conformance test passed without any failures, so I'd say no visible downtime was observed 😉

@mumoshu
Contributor Author

mumoshu commented Feb 21, 2017

My biggest concern, left untouched yet, is now:

When a rolling update is in progress, don't we need to wait until a newly recreated etcd member, whose data (persisted in the EBS volume that had been attached to the previously terminated instance replaced by the newly created instance) may be outdated, catches up with the latest data from the running etcd cluster?

@mumoshu
Contributor Author

mumoshu commented Feb 21, 2017

Hi @pieterlange @camilb, could I have your comments/requests regarding the TODOs and Non-TODOs in the description of this PR, if any? 😃

@mumoshu
Contributor Author

mumoshu commented Feb 21, 2017

Hi @redbaron, I've implemented my POC to make the etcd cluster a bit more H/A.
Since I believe you're experienced in this area, would you mind leaving your comments/requests/etc. regarding the TODOs and Non-TODOs in the description of this PR? 😃

@mumoshu
Contributor Author

mumoshu commented Feb 21, 2017

Updated the supported use-cases in the description:

  • Automatic recovery from temporary Etcd node failures
    • Even if all the nodes went down, the cluster recovers eventually as long as the EBS volumes aren't corrupted


aws ec2 describe-volumes \
--region {{.Region}} \
--filters Name=tag:aws:cloudformation:stack-name,Values=$stack_name Name=tag:kube-aws:owner:role,Values=etcd Name=status,Values=available \
Contributor

you can add Name=availability-zone,Values=$az here and drop the jq filter later
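i.e. something like this, with the suggested filter appended (a sketch of the change, reusing the command from the diff above):

aws ec2 describe-volumes \
--region {{.Region}} \
--filters Name=tag:aws:cloudformation:stack-name,Values=$stack_name Name=tag:kube-aws:owner:role,Values=etcd Name=status,Values=available Name=availability-zone,Values=$az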

Contributor Author

Good catch 👍 Thanks

echo "no etcd volume available in availability zone $az"
fi

vol_id=$(echo "$vol" | jq -r ".VolumeId")
Contributor
@redbaron redbaron Feb 21, 2017

This will break if more than 1 volume is found. It shouldn't happen normally, though, but better to guard against it. (Missed that you do it already.)

Contributor Author
@mumoshu mumoshu Feb 21, 2017

Thanks, but I believe I've guarded it properly at line 189 with the final [0] in the jq expression?

Contributor Author

Just read your updated comment now 😉

@redbaron
Contributor

If you have an ASG per AZ, then why not tag the ASG with the volume ID and ENI ID to use, instead of going with a lengthy and error-prone self-discovery process?

@@ -284,6 +284,10 @@ func (c *Cluster) SetDefaults() {
c.Etcd.Subnets = c.PublicSubnets()
}
}

if c.Etcd.InternalDomainName == "" {
c.Etcd.InternalDomainName = fmt.Sprintf("%s.internal", c.ClusterName)
Contributor

not .compute.internal?

Contributor Author

This is now dead code, which had been used for the ENI + Route 53 record set way of implementing the network identity for etcd nodes.
I can revive it if you really need it rather than the current EIP-based way.

Contributor Author

I revived this to implement memberIdentityProvider: ENI 😃

@redbaron
Contributor

is EIP the only way to go? it will be a no-go for us :(

@redbaron
Contributor

instead of EIP you can attach ENI in the same way as you do EBS

vol=$(echo "$describe_volume_result" | jq -r ".Volumes[0]")
fi
vol_id=$(echo "$vol" | jq -r ".VolumeId")
eip_alloc_id=$(echo "$vol" | jq -r ".Tags[] | select(.Key == \"kube-aws:etcd:eip-alloc-id\").Value")
Contributor

I like how you tied data (EBS) with network identity (EIP) by tagging EBS.

# Doing so under an EC2 instance under an auto-scaling group would achieve automatic recovery from etcd node failures.

# TODO: Dynamically attach an EBS volume to /dev/xvdf before var-lib-etcd2.mount happens.
# Probably we cant achieve it here but in the "bootstrap" cloud-config embedded in stack-template.json
Contributor Author

These TODOs in the comments are unnecessary; they're already addressed.

#!/bin/bash -vxe

instance_id=$(curl http://169.254.169.254/latest/meta-data/instance-id)
private_ip=$(curl http://169.254.169.254/latest/meta-data/local-ipv4)
Contributor Author

private_ip is probably an unused shell variable

echo $assumed_hostname > /var/run/coreos/assumed_hostname
echo $eip_alloc_id > /var/run/coreos/eip_alloc_id

- path: /opt/bin/assume-etcd-hostname-with-private-ip
Contributor

if you go with ENI, then IP addresses will be fixed and Route53 records can be created once and for all by CF

Contributor Author
@mumoshu mumoshu Feb 21, 2017

Certainly, but it makes disaster recovery a bit more difficult (not impossible, just slower) as described in #332 (comment)

"Name=key,Values=aws:cloudformation:stack-name" \
--output json \
| jq -r ".Tags[].Value"
)
Contributor Author

nit: stack_name should be taken from the env var KUBE_AWS_STACK_NAME injected via the embedded userdata in stack-template.json because it is a more portable way to get the stack name. More concretely, the aws:cloudformation:stack-name and other tags won't be populated when the EC2 instance is created via a spot fleet.

@redbaron
Contributor

Automatic recovery from temporary Etcd node failures
Even if all the nodes went down, the cluster recovers eventually as long as the EBS volumes aren't corrupted

Is it so? AFAIK if all nodes go down, then quorum is lost and it requires the following disaster-recovery process: https://github.com/coreos/etcd/blob/master/Documentation/v2/admin_guide.md#disaster-recovery

@mumoshu
Contributor Author

mumoshu commented Feb 21, 2017

instead of EIP you can attach ENI in the same way as you do EBS

I'm almost OK with ENI instead of EIP, but the ENI way isn't as complete as the EIP way.
I just wanted to go with the more complete option.

My concern is the fact that an ENI can't move between AZs.
It implies that, for a user like me who is in an AWS region with only 2 AZs available and hence must rely on a single-AZ etcd cluster, we would have to recreate all the ENIs (with different private IPs), and therefore all the etcd members and the cluster itself, in another live AZ when that single AZ goes down.

Using EIPs instead allows us to retain and reassign those IPs, and hence the etcd members and the etcd cluster, while recreating all the backing ASGs and EC2 instances for etcd in another AZ.

// BTW, in both cases, we need EBS snapshots to recreate EBS volumes with equivalent data in another AZ

@mumoshu
Contributor Author

mumoshu commented Feb 21, 2017

Is it so? AFAIK if all nodes go down, then quorum is lost and it requires following disaster-recovery process: https://github.com/coreos/etcd/blob/master/Documentation/v2/admin_guide.md#disaster-recovery

Yes, because we retain the data in EBS volumes.
There's a brief downtime (approx. 5 min) while all the etcd nodes are terminated and then recreated, but the cluster starts functioning again after that.

@redbaron
Contributor

Yes, because we retain the data in EBS volumes.
There's a brief downtime (approx. 5 min) while all the etcd nodes are terminated and then recreated, but the cluster starts functioning again after that.

Probably that works only for a clean shutdown; the etcd2 docs say that if quorum is lost then no auto-recovery is possible:

However, in extreme circumstances, a cluster might permanently lose enough members such that quorum is irrevocably lost. For example, if a three-node cluster suffered two simultaneous and unrecoverable machine failures, it would be normally impossible for the cluster to restore quorum and continue functioning.

@mumoshu
Contributor Author

mumoshu commented Feb 21, 2017

@redbaron AFAIK, the "unrecoverable machine failures" in this case means corrupted EBS volumes.
As long as the data in the EBS volumes isn't corrupted, it doesn't count as one of the "extreme circumstances"; it is a recoverable failure.

@mumoshu
Contributor Author

mumoshu commented Feb 21, 2017

is EIP the only way to go? it will be a no-go for us :(

Would you mind sharing why EIP isn't the way to go for you?
You may already know, but the EIPs are used just to stabilize hostnames, which are eventually resolved to the "private IPs" of the etcd nodes (by the AWS DNS)
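For example (hypothetical names, assuming the VPC's Amazon-provided DNS with DNS hostnames enabled), resolving an EIP's public DNS name from inside the VPC returns the instance's private IP:

$ dig +short ec2-203-0-113-10.ap-northeast-1.compute.amazonaws.com
10.0.3.42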

@redbaron
Contributor

@redbaron AFAIK, the unrecoverable machine failures in this case means corrupted EBS volumes.
As long as the data in the EBS volumes isn't corrupted, it doesn't count as one of the "extreme circumstances"; it is a recoverable failure.

I can't see how it can work if the AZ with the last available leader is lost AND quorum is lost too. So in your 2-AZ case, if the AZ with 2 nodes goes down and one of them was the leader when it happened, then even if you restore the EBS volumes into the healthy AZ, etcd shouldn't come up; otherwise it would allow silent data loss without explicit permission from the operator, which I doubt very much.

@@ -157,7 +157,7 @@ func (c *Cluster) Assets() (cfnstack.Assets, error) {

return cfnstack.NewAssetsBuilder(c.StackName(), c.StackConfig.S3URI).
Add("userdata-controller", c.UserDataController).
Add("userdata-worker", c.UserDataWorker).
Contributor

Is there a mistake here?

Contributor Author

No!
Although it is required only by a node pool stack, we had been unnecessarily uploading userdata-worker here for the control-plane stack, too.

"cloudconfig" : "{{.UserDataEtcd}}"
}
}
},
"Resources": {
"{{.Controller.LogicalName}}": {
"Type": "AWS::AutoScaling::AutoScalingGroup",
Contributor

👍 for etcd in asg


+1 for etcd in ASG also.

@mumoshu
Contributor Author

mumoshu commented Feb 21, 2017

I can't see how it can work if the AZ with the last available leader is lost AND quorum is lost too. So in your 2-AZ case, if the AZ with 2 nodes goes down and one of them was the leader when it happened, then even if you restore the EBS volumes into the healthy AZ, etcd shouldn't come up; otherwise it would allow silent data loss without explicit permission from the operator, which I doubt very much.

Thanks!
That's why I'm no longer going to deploy a 2-AZ etcd cluster.
As long as your cluster is distributed across an odd number of AZs, I believe my comment at #332 (comment) applies.

@mumoshu
Contributor Author

mumoshu commented Mar 1, 2017

Rebased.

@mumoshu
Contributor Author

mumoshu commented Mar 1, 2017

E2E tests passed.

@mumoshu mumoshu merged commit 54eab73 into kubernetes-retired:master Mar 1, 2017
@mumoshu mumoshu deleted the ha-etcd branch March 1, 2017 06:54
@redbaron
Contributor

redbaron commented Mar 1, 2017

@mumoshu , this is incredible, thank you

@cknowles
Contributor

cknowles commented Mar 13, 2017

@mumoshu I've just managed to upgrade the etcd instances to a larger type as a live upgrade, thanks to you! 🎉

It seems there was a slight pause in etcd responding in a 3-node cluster when the state was:

  • first new node was up, old node terminated
  • second new node was running but possibly not quite fully linked into cluster yet, old node terminated
  • third new node was not up, old node still running

The pause was circa 20 seconds, and various processes including kubectl and the dashboard became momentarily unresponsive. I just wanted to check if anyone has seen anything similar before I try to diagnose more? Each of the wait signals was passing after around 5 minutes, so it looks like this was etcd-related somehow.

@mumoshu
Contributor Author

mumoshu commented Mar 13, 2017

@c-knowles I greatly appreciate this kind of feedback! Thanks.
I'm still looking for a better way to reduce possible downtime like that.

AFAIK, each etcd2 member (an etcd2 process inside a rkt pod) doesn't wait on startup until it becomes connected and ready to serve requests, and there's no way to know that the member is actually ready.

For example, running etcdctl --peers <first etcd member's advertised peer url> cluster-health would block until enough of the remaining etcd members are up to meet quorum (2 for your cluster). This incomplete solution hits a chicken-and-egg problem and breaks the wait signals. That's why the current implementation doesn't wait for an etcd2 member to be ready, which would be needed to avoid downtime completely.

For me, the downtime was less than 1 sec when I first tried, but I suspect the result varies from time to time, hence your case.

@cknowles
Contributor

cknowles commented Mar 13, 2017

@mumoshu Yeah, it probably varies a little bit. Is there an issue to track this as a known issue? I had a look but didn't find one, unless it's part of etcd v2. If not, we should make one for other users who come across this.

@mumoshu
Contributor Author

mumoshu commented Mar 13, 2017

@c-knowles There's no github issue I'm aware of 😢
Btw, etcd3 seems to signal systemd for readiness when its systemd unit is set to Type=notify.
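For reference, a rough sketch of what that could look like on Container Linux (assuming the etcd-member.service unit used for etcd3; the stock unit may already set this, so treat it as illustrative only):

# Drop-in so systemd considers the unit started only once etcd3 signals readiness.
sudo mkdir -p /etc/systemd/system/etcd-member.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/etcd-member.service.d/10-notify.conf
[Service]
Type=notify
EOF
sudo systemctl daemon-reload
sudo systemctl restart etcd-member.service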

@redbaron
Contributor

Can't we draw dependencies between ASGs? Then CF will roll them one by one, which would allow quorum to be maintained at all times.

@cknowles
Contributor

I've separated the issue out as above so we can track any progress we make on this.

@mumoshu
Contributor Author

mumoshu commented Mar 13, 2017

Hi @redbaron!
It is already implemented that way. ASGs (and hence the etcd nodes managed by them) do get replaced one by one, according to the dependencies among ASGs combined with wait signals. Therefore my guess is that we're proceeding to the next node too early because of insufficient wait signals (#411), which would result in a temporary loss of quorum in an extreme case (i.e. the rolling update happened faster than I'd expected).

@mumoshu
Contributor Author

mumoshu commented Mar 13, 2017

@c-knowles Thanks for the good writeup!

@trinitronx
Contributor

@mumoshu:

Hello, I've just tested out the new kube-aws v0.9.5-rc.6 yesterday and I was unable to get it to work due to an error: The following resource(s) failed to create: [Controlplane].

After diving further into this, I found that the reason it was failing was really that we had reached our EIP limit. The real error was: The maximum number of addresses has been reached.

So I have a question about this seemingly new kube-aws memberIdentityProvider option for Etcd2. We are already using our limit of EIPs that AWS gives us, and this requirement is new to us. Why is this now necessary instead of using an SRV record for the Etcd2 nodes?

It seems to me that SRV record discovery would be cleaner than having to use EIPs for the Etcd2 nodes, and more dynamic, as it would not require hardcoding IPs into a static config file on the Etcd2 nodes. The current approach forces the state of the ETCD_INITIAL_CLUSTER variable into files stored on the EBS volume. Having to manage EIPs and EBS volumes in case of failure or scaling seems a bit backwards & would easily hit the default 5-EIP limit that AWS has with a larger Etcd2 cluster.

My thought was that SRV discovery could allow the CloudFormation template from kube-aws to then just manage a single Route53 SRV record instead based on the Etcd2 AutoScalingGroup. The current implementation of writing a bunch of hardcoded IPs to /var/run/coreos/etcd-environment seems like a bit of an unscalable hack IMHO (no offense intended).

Has this option for Etcd2 been explored yet, or are there any pitfalls / pros / cons involved with this type of configuration?
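For reference, the SRV-based discovery being suggested looks roughly like this with etcd2 (domain and member names are hypothetical; it requires _etcd-server._tcp SRV records, and each member still needs a stable, resolvable name to advertise):

# The SRV records the members would be discovered from:
dig +short SRV _etcd-server._tcp.mycluster.internal

# Each member then boots with DNS discovery instead of a hardcoded ETCD_INITIAL_CLUSTER:
etcd2 --name etcd0 \
  --discovery-srv mycluster.internal \
  --initial-advertise-peer-urls http://etcd0.mycluster.internal:2380 \
  --advertise-client-urls http://etcd0.mycluster.internal:2379 \
  --listen-peer-urls http://0.0.0.0:2380 \
  --listen-client-urls http://0.0.0.0:2379 \
  --initial-cluster-state new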

@trinitronx
Contributor

@mumoshu: Also perhaps related: would waiting for the PR from #417 change the picture at all, by choosing to avoid Etcd2 for a new cluster and simply going with Kubernetes v1.6 and Etcd3?

After reading a bit of the surrounding information regarding disaster recovery, this choice seems to be a pivotal moment that would seal our cluster's fate & ease of future maintenance in the event of a disaster. Any thoughts or recommendations here?

@billyteves

@mumoshu regarding the etcd cluster, is the data replicated? If one server goes down, is the data still intact on the other servers? Can you also suggest a recommended etcd storage size?

@jsravn

jsravn commented Apr 5, 2017

A bit late to the party, but we've built something similar to the Monsanto solution at https://github.com/sky-uk/etcd-bootstrap. It handles node replacement/cluster expansion & shrinking/new clusters, and we've been using it in prod for a while now. It's self-contained, just a Docker wrapper around etcd that queries the local ASG for the other instances in the group. If you use the apiserver on the same node, it's easy to run this and have the apiserver hit localhost, with an ELB on the ASG to load balance to the apiservers.

It's not entirely clear to me what the benefit of managing EBS/ENI separately is - why not just rebuild the node, including EBS/ENI? Is that in case the entire cluster dies?

@Vince-Cercury

With ENI as the memberIdentityProvider and 3 private subnets in 3 different AZs, how many etcd instances (per AZ?) should we run in order to:

  • keep cluster running if one AZ goes down?
  • keep cluster running if two AZ go down?

What is the minimum number of instances we need if we are OK with etcd downtime (until the ASG re-creates the instances and re-attaches the EBS and ENI)?

Alternatively, if you could point me to docs so I can do the maths, I'd be happy to read anything that helps me understand how kube-aws solved the problem.

@redbaron
Contributor

@Vincemd, etcd needs to maintain quorum at all times to keep the cluster running. Quorum is floor(N/2)+1, where N is the number of etcd nodes; e.g. a 3-node cluster has a quorum of 2 and tolerates one node failure, and a 5-node cluster tolerates two.

Therefore, if you run 1 node per AZ, you'll continue to maintain quorum in case one AZ goes down.

To tolerate a two-AZ failure, you'd need to span your cluster across 5 AZs, 2 of which would probably have to be in another region. I don't have information on how happy an etcd cluster will be when it sees a significant latency increase for certain members of the cluster.

@Vince-Cercury

Thanks @redbaron. So I will run 3 nodes, one per AZ we have available in the Sydney region. This will allow 1 AZ failure.

In case 2 nodes from 2 different AZs go down at the same time, would the ASG + ENI + EBS solution from @mumoshu allow the etcd cluster to recover automatically, with some downtime? Assuming the ASGs are able to create EC2 instances again in the same AZs that were affected (since an ENI cannot be moved from one AZ to another) and re-attach the EBS and ENI fine.
-> I'm just trying to put the case for ENI forward. I somewhat understand the idea, but need to explain it better to my peers.

We had that situation in Sydney recently when 1 AZ was down and another one was affected temporarily. It's also not impossible to see 2 instances fail at the same time in 2 AZs. Rare but not impossible.

If, for some reason, ENI+EBS does not help/work, would manual intervention allow recovery of the cluster by cleaning up and allowing a new leader to be elected? I think we are fine with downtime in the case of 2 nodes being down, as it's very unlikely. The apps will still run; Kubernetes just won't be able to manage the pods until etcd is fixed, I assume.

@redbaron
Contributor

There were some bugs in automatic recovery, which hopefully have been ironed out, but in theory yes, it recovers once the AZ is back.

Why do you push for ENI and not for the default EIP? EIP should allow you to restore quorum without waiting for the AZ to become available.

@Vince-Cercury

Everything has to be private (private subnets only and no EIP allowed).

@redbaron
Contributor

The trick is that even when an EIP is used, it resolves to a private IP address and therefore can be used inside private subnets.

@Vince-Cercury

I got an AWS error when I tried it, though, because there was no Internet Gateway in my subnet, since the subnet is private and I'm not allowed to use an IGW.

@redbaron
Contributor

Are you using the Amazon DNS?

kylehodgetts pushed a commit to HotelsDotCom/kube-aws that referenced this pull request Mar 27, 2018