[WIP] ETCD operator poc #67

mjudeikis · 2018-07-10T09:43:24Z

Just a PoC to show how we could use the etcd operator instead of a static deployment.

TODO:

Add PVC config
Add restore and backup operators to the underlay
Clean up certificates
Test upgrades
Draft HA and non-HA configurations (do we need both for sure?)
Update config to correspond to [1]
Add quotas and limits for it
Investigate RH etcd image usage

ETCD_ADVERTISE_CLIENT_URLS=https://master-etcd:2379
ETCD_ELECTION_TIMEOUT=2500
ETCD_HEARTBEAT_INTERVAL=500
ETCD_QUOTA_BACKEND_BYTES=4294967296   
ETCD_TRUSTED_CA_FILE=/etc/etcd/ca.crt

mjudeikis · 2018-07-10T10:25:05Z

cc: @pweil-, @jim-minter, @Kargakis

pweil- · 2018-07-10T11:36:27Z

👍 Let's discuss this when everyone is back. I think it buys a lot for us (backup procedures!). My main concern is image versioning. In the EtcdCluster type it notes:

// Repository is the name of the repository that hosts
// etcd container images. It should be direct clone of the repository in official
// release:
//   https://github.com/coreos/etcd/releases
// That means, it should have exact same tags and the same meaning for the tags.
//
// By default, it is `quay.io/coreos/etcd`.

I'm not sure whether a repo layout that doesn't match upstream could cause issues, or whether we're OK using this with custom images.
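To make the versioning concern concrete, here is a minimal sketch of how a repository and a version could combine into an image reference. It assumes the operator builds the reference as `<repository>:v<version>`, which should be verified against the operator source. `imageRef` and the `registry.example.com/custom/etcd` repository are hypothetical names used only for illustration.

```go
package main

import "fmt"

// imageRef is a hypothetical helper, not operator API. It assumes the image
// reference has the form "<repository>:v<version>".
func imageRef(repository, version string) string {
	return fmt.Sprintf("%s:v%s", repository, version)
}

func main() {
	// Default repository named in the EtcdCluster type comment.
	fmt.Println(imageRef("quay.io/coreos/etcd", "3.2.13"))
	// Hypothetical custom repository: it must publish the same tag.
	fmt.Println(imageRef("registry.example.com/custom/etcd", "3.2.13"))
}
```

If that assumption holds, a custom repository only works when it publishes tags with the same names and meaning as the upstream releases.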

0xmichalis · 2018-07-11T10:13:32Z

aks/etcd-operator.yaml

@@ -0,0 +1,109 @@
+# concatenation of


Why is this part of the aks code? I expected we would deploy the operator as part of the helm templates.

The question is whether we will run one operator per client namespace or a single shared operator (as with the ingress controller now). It depends on how HCP shapes up in the end and how much access we will have.

I think operator per client is what we want. Other operators are going to be installed via helm charts/addons and will be per client, so I don't think there is any reason to do something different for etcd. The only thing I expect we will get for free from the aks cluster is that the etcdcluster CRD is already created (or is it created by the operator?).

The CRD is created by the operator itself. I need to confirm this, but it sounds reasonable. I just followed the ingress controller and tiller model.

I think operator per client is what we want.

agree

0xmichalis · 2018-07-11T10:14:12Z

pkg/config/generate.go

@@ -131,15 +132,28 @@ func Generate(m *api.Manifest, c *Config) (err error) {
 	}{
 		// Generate etcd certs


Hrm, isn't the operator handling these?

Nope, if we want TLS we need to generate the certificates ourselves:
https://github.com/coreos/etcd-operator/blob/master/doc/user/cluster_tls.md

0xmichalis · 2018-07-11T10:16:38Z

upgrade.sh

@@ -12,7 +12,7 @@ KUBECONFIG=aks/admin.kubeconfig helm upgrade $RESOURCEGROUP pkg/helm/chart -f _d

 # TODO: when sync runs as an HCP pod (i.e. not in development), hopefully should
 # be able to use helm upgrade --wait here
-for d in master-etcd master-api master-controllers; do
+for d in master-api master-controllers; do


OK, it seems we need to think about etcd upgrades separately. Follow-up issue for adding an etcd-upgrade.sh script? I expect in prod we are going to separate those anyway.

We can use a helm resource to manage the CRD. We just need a different checking method to validate that it was updated OK.
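One possible shape for that check, as a rough sketch only: instead of waiting on a deployment rollout, compare the status reported for the etcd cluster against the version and size we asked for. The `clusterStatus` struct and its field names are hypothetical stand-ins, not the operator's real types, and the `"Running"` phase value is an assumption.

```go
package main

import "fmt"

// clusterStatus is a hypothetical, trimmed-down view of the status an etcd
// cluster resource might report. Field names are assumptions.
type clusterStatus struct {
	Phase          string
	Size           int
	ReadyMembers   []string
	CurrentVersion string
}

// upgraded reports whether the cluster has converged on wantVersion with
// wantSize members, all of them ready.
func upgraded(s clusterStatus, wantVersion string, wantSize int) bool {
	return s.Phase == "Running" &&
		s.CurrentVersion == wantVersion &&
		s.Size == wantSize &&
		len(s.ReadyMembers) == wantSize
}

func main() {
	midUpgrade := clusterStatus{
		Phase:          "Running",
		Size:           3,
		ReadyMembers:   []string{"a", "b"},
		CurrentVersion: "3.2.13",
	}
	fmt.Println(upgraded(midUpgrade, "3.2.13", 3))
}
```

An upgrade script could poll such a check with a timeout in place of the per-deployment loop that etcd was removed from.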

mjudeikis · 2018-07-11T13:26:18Z

I suggest postponing any work on this until we get some input from the etcd team (@pweil- sent an email already) and @jim-minter comes back so we can discuss it.

Current design outlines:

One operator per client compartment.
Helm installs and lifecycles the operator, and we have a separate chart (?) to manage the CRD for etcd management itself.
We need confirmation of custom image usage.

mjudeikis · 2018-07-13T10:35:34Z

aks/etcd-operator.yaml

@@ -0,0 +1,17 @@
+# TODO: Move me to helm chart when we can sort ordering:


There are ordering issues with the CRD and Helm.
In addition, if you delete/recreate the CRD, the operator will kill etcd. You can delete the operator, but you can't delete the CRD. If we put it in the customer's helm chart, there is a chance it gets wiped out :)
Maybe we could maintain things like this with a root-level chart for all HCP?

mjudeikis · 2018-07-13T10:37:27Z

create.sh

@@ -50,7 +50,7 @@ fi
 # TODO: if the user interrupts the process here, the AAD application will leak.

 cat >_data/manifest.yaml <<EOF
-name: openshift
+name: $RESOURCEGROUP


The etcd operator uses certificates with master-etcd.namespace.svc names to communicate. Those names need to be in the certificates, so we need to know the namespace name up front. Could we use the manifest name field as the namespace?

Hrm, they assume the operator runs on a different namespace than etcd?

This shouldn't even be the case. I assume that because the operator can be global or local, it uses the same code base for both, and we end up with full DNS names.

etcdserver: not healthy for reconfigure, rejecting member add {ID:e6a985970d419f37 RaftAttributes:{PeerURLs:[https://master-etcd-pkgtnptbmp.master-etcd.mjudeikis-hcp.svc:2380]} Attributes:{Name: ClientURLs:[]}}
2018-07-13 10:47:54.104837 W | etcdserver: timed out waiting for read index response
2018-07-13 10:48:01.390910 I | etcdserver/membership: added member 48e34665e30d3732 [https://master-etcd-m9glrkvqm2.master-etcd.mjudeikis-hcp.svc:2380] to cluster 6b80ae4a9ed28414

It advertises the full FQDN:

- command:
  - /usr/local/bin/etcd
  - --data-dir=/var/etcd/data
  - --name=master-etcd-5lswnbnvvb
  - --initial-advertise-peer-urls=https://master-etcd-5lswnbnvvb.master-etcd.mjudeikis-hcp.svc:2380
  - --listen-peer-urls=https://0.0.0.0:2380
  - --listen-client-urls=https://0.0.0.0:2379
  - --advertise-client-urls=https://master-etcd-5lswnbnvvb.master-etcd.mjudeikis-hcp.svc:2379
  - --initial-cluster=master-etcd-5lswnbnvvb=https://master-etcd-5lswnbnvvb.master-etcd.mjudeikis-hcp.svc:2380

The advertised peer URL host comes from:

func (m *Member) Addr() string {
	return fmt.Sprintf("%s.%s.%s.svc", m.Name, clusterNameFromMemberName(m.Name), m.Namespace)
}

We may need to loop in someone from the etcd team. I would expect both etcd peers and the operator could assume they can run on the same namespace.
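For reference, a minimal sketch of why the namespace has to be known when the certificates are generated. It reuses the format string from the `Addr()` snippet quoted in this thread. `clusterName` is a hypothetical stand-in for `clusterNameFromMemberName` and assumes member names look like `<cluster>-<random suffix>`. `peerSAN` is also hypothetical and only shows the kind of wildcard name a peer certificate would need to cover.

```go
package main

import (
	"fmt"
	"strings"
)

// clusterName is a hypothetical stand-in for clusterNameFromMemberName. It
// assumes the member name is "<cluster>-<random suffix>".
func clusterName(member string) string {
	i := strings.LastIndex(member, "-")
	if i < 0 {
		return member
	}
	return member[:i]
}

// peerAddr builds the member host name with the same format string as the
// quoted Addr() method: <member>.<cluster>.<namespace>.svc
func peerAddr(member, namespace string) string {
	return fmt.Sprintf("%s.%s.%s.svc", member, clusterName(member), namespace)
}

// peerSAN returns a wildcard DNS name that covers every member of the
// cluster in the given namespace.
func peerSAN(cluster, namespace string) string {
	return fmt.Sprintf("*.%s.%s.svc", cluster, namespace)
}

func main() {
	fmt.Println(peerAddr("master-etcd-5lswnbnvvb", "mjudeikis-hcp"))
	fmt.Println(peerSAN("master-etcd", "mjudeikis-hcp"))
}
```

Because the namespace is baked into every name, a certificate generated for one namespace cannot be reused in another, which is why the manifest needs to carry it.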

mjudeikis · 2018-07-13T15:14:40Z

All issues are tracked in #75.

mjudeikis · 2018-07-16T07:26:41Z

pkg/helm/chart/templates/Deployment.apps/etcd-operator.yaml

+      serviceAccountName: etcd-operator
+      initContainers:
+      - name: nanny
+        image: docker.io/mangirdas/azure-nanny:latest


PoC for azure-nanny. This container takes in parameters, creates the storage account and container, and updates the secrets.

mjudeikis mentioned this pull request Jul 10, 2018

add master-data pvc for etcd #65

Merged

0xmichalis reviewed Jul 11, 2018

mjudeikis commented Jul 13, 2018

mjudeikis force-pushed the etcd-operator branch from 228bd36 to 156d8ca Compare July 15, 2018 21:28

mjudeikis added 2 commits July 16, 2018 08:24

poc for etcd operator

265fee7

Add poc for etcd backup operator nanny

726695b

mjudeikis force-pushed the etcd-operator branch from 156d8ca to 726695b Compare July 16, 2018 07:25

mjudeikis commented Jul 16, 2018

mjudeikis closed this Jul 19, 2018

mjudeikis deleted the etcd-operator branch April 26, 2019 09:24
