Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] ETCD operator poc #67

Closed
wants to merge 2 commits into from
Closed

Conversation

mjudeikis
Copy link
Contributor

@mjudeikis mjudeikis commented Jul 10, 2018

Just a poc to show how we could use etcd operator instead of static deployment.

TODO:

  • Add pvc config
  • add restore, backup operators to the underlay
  • clean certificates
  • test upgrades
  • draft ha and non ha configuration (do we need it for sure)
  • update config to corespond to [1]
  • add quotas and limits for it.
  • investigate rh etcd image usage
ETCD_ADVERTISE_CLIENT_URLS=https://master-etcd:2379
ETCD_ELECTION_TIMEOUT=2500
ETCD_HEARTBEAT_INTERVAL=500
ETCD_QUOTA_BACKEND_BYTES=4294967296   
ETCD_TRUSTED_CA_FILE=/etc/etcd/ca.crt

@mjudeikis
Copy link
Contributor Author

cc: @pweil- , @jim-minter, @Kargakis

@pweil-
Copy link
Contributor

pweil- commented Jul 10, 2018

👍 Let's discuss this when everyone is back. I think it buys a lot for us (backup procedures!). My main concern is image versioning. In the EtcdCluster type it notes:

// Repository is the name of the repository that hosts
	// etcd container images. It should be direct clone of the repository in official
	// release:
	//   https://github.com/coreos/etcd/releases
	// That means, it should have exact same tags and the same meaning for the tags.
	//
	// By default, it is `quay.io/coreos/etcd`.

I'm not sure if there are implications that if the repo layout doesn't match that there could be issues or if we're ok using this for custom images.

@@ -0,0 +1,109 @@
# concatenation of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this part of the aks code? I expected we would deploy the operator as part of the helm templates.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question is will we be running one operator per client namespace or one operators (same as ingress controller now). Its question on how HCP will shape in the end and how much access we will have.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think operator per client is what we want. Other operators are going to be installed via helm charts/addons and will be per client so I don't think there is any reason to do something different for etcd. The only thing I expect we will get for free from the aks cluster is the etcdcluster CRD is already created (or is it created by the operator?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CRD is created by operator itself. Need to confirm this, but sound reasonable. I just followed the ingress controller and tiller model.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think operator per client is what we want.

agree

@@ -131,15 +132,28 @@ func Generate(m *api.Manifest, c *Config) (err error) {
}{
// Generate etcd certs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm, isn't the operator handling these?

Copy link
Contributor Author

@mjudeikis mjudeikis Jul 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -12,7 +12,7 @@ KUBECONFIG=aks/admin.kubeconfig helm upgrade $RESOURCEGROUP pkg/helm/chart -f _d

# TODO: when sync runs as an HCP pod (i.e. not in development), hopefully should
# be able to use helm upgrade --wait here
for d in master-etcd master-api master-controllers; do
for d in master-api master-controllers; do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, it seems we need to think about etcd upgrades separately. Follow-up issue for adding an etcd-upgrade.sh script? I expect in prod we are going to separate those anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can helm resource to manage CRD. We just need different checking method to validate if it was updated ok.

@mjudeikis
Copy link
Contributor Author

suggest postponing any work on this until we will get some input from etcd team (@pweil- sent an email already) and @jim-minter comes back so we can discuss it.

Current design outlines:

  • One operator per client compartment.
  • Helm installs and lifecycles operator and we have a separate chart (?) to manage CRD for etcd management itself.
  • need confirmation of custom image usage.

@@ -0,0 +1,17 @@
# TODO: Move me to helm chart when we can sort ordering:
Copy link
Contributor Author

@mjudeikis mjudeikis Jul 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is ordering issues with CRD and HELM.
In addition, if you delete/recreate CRD operators will kill etcd. You can delete operator, but you cant delete crd. If you put in customers helm where is a change it get wiped out :)
Maybe we could maintain things like this with ROOT level chart for all HCP?

@@ -50,7 +50,7 @@ fi
# TODO: if the user interrupts the process here, the AAD application will leak.

cat >_data/manifest.yaml <<EOF
name: openshift
name: $RESOURCEGROUP
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Operator etcd uses certificates with master-etcd.namespace.svc to communicate. Those names need to be in the certificates. We need to know name for those. Could we use manifest name field for a namespace?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm, they assume the operator runs on a different namespace than etcd?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should not be even a case. I assume because operator can be global or local it uses same code base. And we end with full DNS names.

 etcdserver: not healthy for reconfigure, rejecting member add {ID:e6a985970d419f37 RaftAttributes:{PeerURLs:[https://master-etcd-pkgtnptbmp.master-etcd.mjudeikis-hcp.svc:2380]} Attributes:{Name: ClientURLs:[]}}
2018-07-13 10:47:54.104837 W | etcdserver: timed out waiting for read index response
2018-07-13 10:48:01.390910 I | etcdserver/membership: added member 48e34665e30d3732 [https://master-etcd-m9glrkvqm2.master-etcd.mjudeikis-hcp.svc:2380] to cluster 6b80ae4a9ed28414

it advertises as full fqdn:

  - command:
    - /usr/local/bin/etcd
    - --data-dir=/var/etcd/data
    - --name=master-etcd-5lswnbnvvb
    - --initial-advertise-peer-urls=https://master-etcd-5lswnbnvvb.master-etcd.mjudeikis-hcp.svc:2380
    - --listen-peer-urls=https://0.0.0.0:2380
    - --listen-client-urls=https://0.0.0.0:2379
    - --advertise-client-urls=https://master-etcd-5lswnbnvvb.master-etcd.mjudeikis-hcp.svc:2379
    - --initial-cluster=master-etcd-5lswnbnvvb=https://master-etcd-5lswnbnvvb.master-etcd.mjudeikis-hcp.svc:2380

Where advertise peer url is:

func (m *Member) Addr() string {
	return fmt.Sprintf("%s.%s.%s.svc", m.Name, clusterNameFromMemberName(m.Name), m.Namespace)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need to loop in someone from the etcd team. I would expect both etcd peers and the operator could assume they can run on the same namespace.

@mjudeikis
Copy link
Contributor Author

mjudeikis commented Jul 13, 2018

#75 all issues tracker

serviceAccountName: etcd-operator
initContainers:
- name: nanny
image: docker.io/mangirdas/azure-nanny:latest
Copy link
Contributor Author

@mjudeikis mjudeikis Jul 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

POC for azure-nanny. This container takes in parameters and create storageaccount, container, updates secrets.

@mjudeikis mjudeikis closed this Jul 19, 2018
@mjudeikis mjudeikis deleted the etcd-operator branch April 26, 2019 09:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants