[WIP] ETCD operator poc #67
Conversation
cc: @pweil-, @jim-minter, @Kargakis
👍 Let's discuss this when everyone is back. I think it buys a lot for us (backup procedures!). My main concern is image versioning. In the EtcdCluster type it notes:
I'm not sure whether there could be issues if the repo layout doesn't match, or if we're OK using this for custom images.
aks/etcd-operator.yaml
Outdated
@@ -0,0 +1,109 @@
# concatenation of
Why is this part of the aks code? I expected we would deploy the operator as part of the helm templates.
The question is whether we will be running one operator per client namespace or one global operator (same as the ingress controller now). It depends on how HCP will shape up in the end and how much access we will have.
I think operator per client is what we want. Other operators are going to be installed via helm charts/addons and will be per client, so I don't think there is any reason to do something different for etcd. The only thing I expect we will get for free from the aks cluster is that the etcdcluster CRD is already created (or is it created by the operator?).
The CRD is created by the operator itself. Need to confirm this, but it sounds reasonable. I just followed the ingress controller and tiller model.
I think operator per client is what we want.
agree
@@ -131,15 +132,28 @@ func Generate(m *api.Manifest, c *Config) (err error) {
}{
// Generate etcd certs
Hrm, isn't the operator handling these?
Nope, if you want TLS we need to generate the certs ourselves:
https://github.com/coreos/etcd-operator/blob/master/doc/user/cluster_tls.md
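For reference, per the linked doc, static TLS is enabled by pointing the EtcdCluster spec at pre-generated secrets; a sketch with illustrative secret names:

```yaml
# Sketch of an EtcdCluster with static TLS, following the etcd-operator
# cluster_tls doc; the secret names are examples and must contain the
# certs we generate ourselves.
apiVersion: etcd.database.coreos.com/v1beta2
kind: EtcdCluster
metadata:
  name: master-etcd
spec:
  size: 3
  TLS:
    static:
      member:
        peerSecret: etcd-peer-tls      # peer-to-peer certs
        serverSecret: etcd-server-tls  # client-facing server certs
      operatorSecret: etcd-client-tls  # client certs the operator uses
```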
@@ -12,7 +12,7 @@ KUBECONFIG=aks/admin.kubeconfig helm upgrade $RESOURCEGROUP pkg/helm/chart -f _d
# TODO: when sync runs as an HCP pod (i.e. not in development), hopefully should
# be able to use helm upgrade --wait here
for d in master-etcd master-api master-controllers; do
for d in master-api master-controllers; do
OK, it seems we need to think about etcd upgrades separately. Follow-up issue for adding an etcd-upgrade.sh script? I expect in prod we are going to separate those anyway.
We can use a helm resource to manage the CRD. We just need a different checking method to validate that it was updated OK.
I suggest postponing any work on this until we get some input from the etcd team (@pweil- sent an email already) and @jim-minter comes back so we can discuss it. Current design outline:
@@ -0,0 +1,17 @@
# TODO: Move me to helm chart when we can sort ordering: |
There are ordering issues with CRDs and Helm.
In addition, if you delete/recreate the CRD, the operator will kill etcd. You can delete the operator, but you can't delete the CRD. If you put it in the customer's helm chart there is a chance it gets wiped out :)
Maybe we could maintain things like this with a ROOT-level chart for all of HCP?
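One Helm-native option worth noting (assuming a Helm 2 release recent enough to support it) is the crd-install hook, which makes Tiller create the CRD before rendering the rest of the chart; a hypothetical fragment using the etcd-operator's CRD names:

```yaml
# Hypothetical chart template; assumes a Helm 2 version with the
# crd-install hook. The hook creates the CRD before other chart
# resources, addressing the ordering issue (but not the deletion issue).
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: etcdclusters.etcd.database.coreos.com
  annotations:
    "helm.sh/hook": crd-install
spec:
  group: etcd.database.coreos.com
  version: v1beta2
  scope: Namespaced
  names:
    kind: EtcdCluster
    plural: etcdclusters
    shortNames:
    - etcd
```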
@@ -50,7 +50,7 @@ fi
# TODO: if the user interrupts the process here, the AAD application will leak.
cat >_data/manifest.yaml <<EOF
name: openshift
name: $RESOURCEGROUP
The etcd operator uses names of the form master-etcd.namespace.svc
to communicate, and those names need to be in the certificates, so we need to know them up front. Could we use the manifest name field for the namespace?
Hrm, do they assume the operator runs in a different namespace than etcd?
This shouldn't even be a case. I assume that because the operator can be global or local it uses the same code base, and we end up with full DNS names:
etcdserver: not healthy for reconfigure, rejecting member add {ID:e6a985970d419f37 RaftAttributes:{PeerURLs:[https://master-etcd-pkgtnptbmp.master-etcd.mjudeikis-hcp.svc:2380]} Attributes:{Name: ClientURLs:[]}}
2018-07-13 10:47:54.104837 W | etcdserver: timed out waiting for read index response
2018-07-13 10:48:01.390910 I | etcdserver/membership: added member 48e34665e30d3732 [https://master-etcd-m9glrkvqm2.master-etcd.mjudeikis-hcp.svc:2380] to cluster 6b80ae4a9ed28414
It advertises the full FQDN:
- command:
- /usr/local/bin/etcd
- --data-dir=/var/etcd/data
- --name=master-etcd-5lswnbnvvb
- --initial-advertise-peer-urls=https://master-etcd-5lswnbnvvb.master-etcd.mjudeikis-hcp.svc:2380
- --listen-peer-urls=https://0.0.0.0:2380
- --listen-client-urls=https://0.0.0.0:2379
- --advertise-client-urls=https://master-etcd-5lswnbnvvb.master-etcd.mjudeikis-hcp.svc:2379
- --initial-cluster=master-etcd-5lswnbnvvb=https://master-etcd-5lswnbnvvb.master-etcd.mjudeikis-hcp.svc:2380
Where the advertised peer URL is built by:
func (m *Member) Addr() string {
return fmt.Sprintf("%s.%s.%s.svc", m.Name, clusterNameFromMemberName(m.Name), m.Namespace)
}
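To make the naming concrete, here is a small self-contained sketch that reproduces the address construction from Addr() above; clusterNameFromMemberName is my reconstruction (the operator derives the cluster name by dropping the random suffix after the last "-"), so treat it as illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// clusterNameFromMemberName mirrors the etcd-operator helper: a member
// name like "master-etcd-5lswnbnvvb" is the cluster name ("master-etcd")
// plus a random suffix. Reconstructed for illustration.
func clusterNameFromMemberName(memberName string) string {
	i := strings.LastIndex(memberName, "-")
	return memberName[:i]
}

// addr reproduces Member.Addr() for a given member name and namespace.
func addr(memberName, namespace string) string {
	return fmt.Sprintf("%s.%s.%s.svc", memberName,
		clusterNameFromMemberName(memberName), namespace)
}

func main() {
	// Matches the peer URL host seen in the logs above, which is why the
	// certs need SANs covering *.master-etcd.<namespace>.svc.
	fmt.Println(addr("master-etcd-5lswnbnvvb", "mjudeikis-hcp"))
	// → master-etcd-5lswnbnvvb.master-etcd.mjudeikis-hcp.svc
}
```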
We may need to loop in someone from the etcd team. I would expect both the etcd peers and the operator could assume they run in the same namespace.
All issues are tracked in #75.
serviceAccountName: etcd-operator
initContainers:
- name: nanny
  image: docker.io/mangirdas/azure-nanny:latest
POC for azure-nanny. This container takes in parameters, creates a storage account and container, and updates secrets.
Just a poc to show how we could use etcd operator instead of static deployment.
TODO: