
etcd operator storage, crd and certificate issues #75

Closed
mjudeikis opened this issue Jul 13, 2018 · 5 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@mjudeikis
Contributor

The plan was to use etcd-operator. Here is where we are struggling.

First, we need to know the namespace when generating certificates. This is because of
https://github.com/coreos/etcd-operator/blob/70d3bd74960dc7127870a393affffbe1df94728e/pkg/util/etcdutil/member.go#L38-L40
The result is that etcd advertises itself as name.namespace.svc, and the certificates need to cover that name.
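To make this concrete, here is a sketch of an EtcdCluster with static TLS. The field names follow etcd-operator's v1beta2 cluster TLS docs; the cluster, namespace, and secret names are placeholders I made up for illustration:

```yaml
# Hypothetical example - all names are placeholders.
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
  name: example-etcd
  namespace: customer-ns        # must be known before certs are generated
spec:
  size: 3
  TLS:
    static:
      member:
        # The server cert in this secret has to include SANs like
        #   *.example-etcd.customer-ns.svc
        #   example-etcd-client.customer-ns.svc
        # which is why the namespace must be known up front.
        peerSecret: etcd-peer-tls
        serverSecret: etcd-server-tls
      operatorSecret: etcd-client-tls
```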

Second (and a slightly bigger one) is storage.
The etcd-operator docs and examples online are misleading in places, so we rely on the code.

  1. The etcd pods themselves do not have any persistence.
    https://github.com/coreos/etcd-operator/blob/master/pkg/apis/etcd/v1beta2/cluster.go#L137
    Upstream issue:
    Persistent/Durable etcd cluster coreos/etcd-operator#1323

The idea is we run in memory and back up constantly. In a DR situation, if a single pod is alive, the operator will recover the cluster. If all pods restart, recovery is done with etcd-restore-operator, restoring from backup.
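The restore path would look roughly like the sketch below. The ABS fields here are my assumption, mirrored from the documented S3 restore spec, and may differ from the actual etcd-restore-operator API; all names are placeholders:

```yaml
# Hypothetical sketch - ABS fields assumed by analogy with S3 restore.
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdRestore"
metadata:
  # per the restore design, this must match the EtcdCluster to recreate
  name: example-etcd
spec:
  etcdCluster:
    name: example-etcd
  backupStorageType: ABS
  abs:
    path: backups/example-etcd.backup   # <container>/<blob>
    absSecret: abs-credentials
```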

For this we need the etcd-backup and etcd-restore operators.
The backup operator supports two backup methods (Azure ABS and AWS S3): https://github.com/coreos/etcd-operator/blob/master/pkg/apis/etcd/v1beta2/backup_types.go#L19-L28

Configuration is what causes an issue: we need a secret with the storage account name and key.
https://github.com/coreos/etcd-operator/blob/master/doc/design/abs_backup.md
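A backup CR per the abs_backup design doc would look something like this sketch; field names may have drifted since the design was written, and the endpoint, path, and secret name are illustrative:

```yaml
# Sketch based on the abs_backup design doc - treat as illustrative.
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdBackup"
metadata:
  name: example-etcd-backup
spec:
  etcdEndpoints: ["https://example-etcd-client:2379"]
  storageType: ABS
  abs:
    path: backups/example-etcd.backup   # <container>/<blob>
    absSecret: abs-credentials          # secret holding account name + key
```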

This means the prerequisites are:

  1. Storage account created.
  2. Key available during creation of the secret.
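The secret in question would be shaped roughly like this; the `storage-account`/`storage-key` key names follow the abs_backup design doc, and the values are placeholders:

```yaml
# Illustrative only - values are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: abs-credentials
type: Opaque
stringData:
  storage-account: mystorageaccount     # Azure storage account name
  storage-key: "<storage account key>"  # account key, injected from backend
```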

We don't want to create a storage account during ARM deployment, as it is not a client-facing configuration artifact. We could use one storage account with multiple containers per customer, and inject it from the backend.

The last issue is helm ordering for CRDs:
helm/helm#2994
TL;DR: when helm creates a CRD, it takes some time for the cluster to accept it, so creating CRD resources immediately afterwards fails because the CRD is not yet available.

In addition, we don't want to manage global CRDs for all users from the user configuration side. If a CRD is deleted, all etcd clusters are deleted too. It looks like we need to manage them outside azure-helm as part of HCP management.
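For reference, the cluster-scoped object we would be lifecycling outside azure-helm is just the CRD itself; a minimal sketch (apiextensions v1beta1, as current at the time). Deleting this object is exactly what cascades into deleting every EtcdCluster:

```yaml
# Minimal sketch of the etcd-operator CRD managed out-of-band.
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: etcdclusters.etcd.database.coreos.com
spec:
  group: etcd.database.coreos.com
  version: v1beta2
  scope: Namespaced
  names:
    kind: EtcdCluster
    listKind: EtcdClusterList
    plural: etcdclusters
    shortNames: ["etcd"]
```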

cc: @jim-minter @Kargakis @pweil-

@mjudeikis mjudeikis mentioned this issue Jul 13, 2018
@0xmichalis
Contributor

0xmichalis commented Jul 13, 2018

> The result is that etcd advertises itself with name.namespace.svc and we need to have this in the certificates.

How about opening a PR in etcd-operator repo to make this configurable?

> Second (and a little bit bigger on) is storage.

How about an init container in the etcd operator deployment, to ensure that both the azure container and storage account are created before etcd comes up? And move backup operator to run as a second container in the etcd operator deployment?

> In addition, we dont want to manage global CRD's for all users from the user configuration side. If CRD is deleted - all etcd cluster are deleted too. It would look like we need to manage them outside azure-helm as part of HCP management.

I like to think of the CRD as a global default in every underlay cluster. Our etcd operators shouldn't need cluster-wide access in order to create/delete the CRD.

@mjudeikis
Contributor Author

> How about an init container in the etcd operator deployment, to ensure that both the azure container and storage account are created before etcd comes up? And move backup operator to run as a second container in the etcd operator deployment?

We could do something like this: run all 3 operators in one pod. One question is whether it is acceptable for the etcd controller to provision storage for itself and create a secret for it (I think yes). Another is which credentials we should use to provision all these "backup storage accounts".

> I like to think of the CRD as a global default in every underlay cluster. Our etcd operators shouldn't need cluster-wide access in order to create/delete the CRD.

Agreed. We just need a nice way to manage and lifecycle them too.

@pweil-
Contributor

pweil- commented Jul 16, 2018 via email

@mjudeikis
Contributor Author

@pweil- It can be used, but we need to agree on how. A sync call would be good. Check the google doc in the email with a small review. Is there any chance I can get access to the azure repo too?

@0xmichalis
Contributor

/kind feature

@openshift-ci-robot openshift-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 27, 2018