
Migrate from Kubernetes External Secrets to ~External Secrets Operator~ CSI Driver #24869

Open
chaodaiG opened this issue Jan 13, 2022 · 29 comments
Assignees
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. sig/testing Categorizes an issue or PR as relevant to SIG Testing.

Comments

@chaodaiG
Contributor

What would you like to be added:

Why is this needed:

As announced in external-secrets/kubernetes-external-secrets#864, Kubernetes External Secrets is now in maintenance mode, and the new recommendation is to migrate to External Secrets Operator.

There is no announced plan to shut down Kubernetes External Secrets, so we might be fine for a while, until it either becomes incompatible with upcoming Kubernetes versions or newer features and bug fixes are only available in External Secrets Operator.

@chaodaiG chaodaiG added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 13, 2022
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jan 13, 2022
@chaodaiG
Contributor Author

/sig testing

@k8s-ci-robot k8s-ci-robot added sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 13, 2022
@howardjohn
Contributor

Any thoughts on SealedSecret as an alternative? Seems more gitops friendly

@chaodaiG
Contributor Author

Any thoughts on SealedSecret as an alternative? Seems more gitops friendly

I can see that https://github.com/bitnami-labs/sealed-secrets is similar to KES (Kubernetes External Secrets) in that it generates Kubernetes Secrets from a more secure custom resource, but that is not the only purpose of KES.

KES was originally introduced to solve these problems:

  • Kubernetes Secrets were manually applied to the cluster by kubectl apply from dev machine(s)
  • A secret could be lost if someone accidentally updated or deleted its value, or if the cluster itself was accidentally deleted

KES syncs secrets from major secret manager providers into the Kubernetes cluster, so recovering a Kubernetes Secret is as simple as re-applying the ExternalSecret CR to the cluster, for example:

apiVersion: kubernetes-client.io/v1
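# The remaining fields, sketched here assuming the GCP Secret Manager backend of
# kubernetes-external-secrets (all names and the project ID are placeholders, not
# the real test-infra configuration):
kind: ExternalSecret
metadata:
  name: my-external-secret
spec:
  backendType: gcpSecretsManager
  projectId: my-gcp-project        # placeholder GCP project
  data:
    - key: my-gsm-secret           # Secret Manager secret to sync (placeholder)
      name: token                  # key in the generated Kubernetes Secret
      version: latest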

In short, SealedSecret is probably not the best replacement for KES

@howardjohn
Contributor

howardjohn commented Feb 22, 2022 via email

@chaodaiG
Contributor Author

Agree with you that both need a manual operation, either kubectl apply or gcloud secrets create. The GitOps side is pretty similar though: one is a SealedSecret CR and the other is an ExternalSecret CR, and both can live in git. However, SealedSecret cannot solve the problem of a user accidentally modifying the secret at the source (the last-applied configuration from k8s somewhat helps in this case, but cannot recover from kubectl delete SealedSecret, or from the cluster being accidentally deleted). Using KES reduces this risk because GCP Secret Manager version-controls secrets, so:

  • if someone accidentally changed the value in GCP, the secret values can still be recovered
  • if the cluster was accidentally deleted, secrets can still be recovered by applying the git source controlled KES CR

@howardjohn
Contributor

I feel like you could say the same about SealedSecret though...

  • if someone accidentally changed the value in ~GCP~ K8s, the secret values can still be recovered (from git)
  • if the cluster was accidentally deleted, secrets can still be recovered by applying the git source controlled ~KES~ SealedSecret CR

Except for "cluster deleted" I guess you would need to keep the sealed secret keys (to decrypt if cluster is deleted) somewhere, so at some point you need to bootstrap...

Anyhow I have no strong agenda either way, just wanted to throw the idea out there

@chaodaiG
Contributor Author

Thank you @howardjohn, this is a really great discussion!

I think I had misunderstood SealedSecret to a certain extent; with your explanation it's a bit clearer now. So SealedSecret (sketched below):

  • Stores secrets in encrypted form in a SealedSecret CR, which can live as "plain text" in git
  • The private key for decrypting the secrets is only available in the k8s cluster
  • A secret can only be created by a user running kubeseal, which uses the public key from the k8s cluster
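For context, a rough sketch of what such a SealedSecret CR committed to git might look like (the name, namespace, and ciphertext are placeholders):

apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: my-secret            # placeholder name
  namespace: my-namespace    # placeholder namespace
spec:
  encryptedData:
    token: AgB3Vk9n...       # ciphertext produced by kubeseal; only the in-cluster controller can decrypt it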

This sounds pretty good, and other than the cluster-deletion scenario it seems pretty reliable. One thing that isn't super clear from the documentation: when a user runs kubeseal <mysecret.json >mysealedsecret.json per https://github.com/bitnami-labs/sealed-secrets#usage, kubeseal needs to fetch the public key from the cluster. Do you happen to know whether that is true, @howardjohn?

@howardjohn
Contributor

@chaodaiG I think you can fetch the pubkey once and store it in git. Then the dev experience to add or update a secret would be kubeseal mysecret --cert pubkey.crt > mysealedsecret.json. Then a postsubmit job kubectl applys it to the cluster; the dev never needs access to the cluster.

But one concern would be that the docs say the key expires after 30d... so that may not work. I don't have much practical experience with sealed secrets, so I'm not 100% sure.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 23, 2022
@chaodaiG
Contributor Author

chaodaiG commented Jun 6, 2022

https://kubernetes.slack.com/archives/C09QZ4DQB/p1654433983124889 is one of the reasons why this should be prioritized. TL;DR: syncing build cluster tokens into prow is now a crucial piece of prow working with build clusters; KES flakiness would break this and cause prow to stop working with the build cluster.

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 6, 2022
chaodaiG added a commit to chaodaiG/test-infra that referenced this issue Jun 6, 2022
Prow now authenticates with build clusters using tokens that are valid for 2 days. The token is refreshed by a prow job (https://prow.k8s.io/?type=periodic&job=ci-test-infra-gencred-refresh-kubeconfig) and stored in GCP Secret Manager; KES is responsible for syncing the secrets into prow. KES has been observed to be flaky from time to time, generally more than 10 days after the KES pods started running. See kubernetes#24869 (comment)

This is a temporary solution aimed at mitigating the issue of long-running KES pods
liurupeng pushed a commit to liurupeng/test-infra that referenced this issue Jun 7, 2022
kaalams pushed a commit to kaalams/test-infra that referenced this issue Jul 14, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 4, 2022
@chaodaiG
Contributor Author

chaodaiG commented Sep 4, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 4, 2022
@chaodaiG
Contributor Author

chaodaiG commented Nov 8, 2022

The build cluster token failed-to-sync issue happened again (https://kubernetes.slack.com/archives/C7J9RP96G/p1667877096344719); this is not good.

/assign

@dims
Member

dims commented Nov 8, 2022

uh-oh. thanks @chaodaiG

@jimangel
Member

jimangel commented Nov 9, 2022

I don't want to derail / delay efforts going on in #27932, but has something like https://secrets-store-csi-driver.sigs.k8s.io/ been considered? We could use that with the GCP provider today and there's support for AWS, Azure, and Vault providers if we need to change.

Using the CSI driver + Google Secrets Manager (provider) would allow us to leverage Workload Identity for IAM secret access. I believe we'd also have better insight into access / auditing.

I know GCP costs are a concern; the pricing page indicates it would be $0.06 per secret per location, $0.03 per 10,000 access operations, and $0.05 per rotation. I don't think the costs would be astronomical, but it would be worth a closer inspection if we decided to pivot towards this solution.

I'm happy to demo / help move forward if we want to go that direction; however, I understand the urgency and value the progress already made.

Edit: It looks like External Secrets Operator also allows us to use Secret Manager + WID if we'd like: https://external-secrets.io/v0.6.1/provider/google-secrets-manager/. I think it would come down to whichever solution is easier to maintain and more active (future-proof-ish?).
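For comparison, a rough sketch of an External Secrets Operator ExternalSecret for that setup (all names are placeholders, and it assumes a SecretStore named gcp-store already configured for the gcpsm provider with Workload Identity):

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: my-external-secret        # placeholder name
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: SecretStore
    name: gcp-store               # assumed SecretStore using the gcpsm provider
  target:
    name: my-k8s-secret           # Kubernetes Secret that ESO will create
  data:
    - secretKey: token            # key in the created Secret
      remoteRef:
        key: my-gsm-secret        # Secret Manager secret name (placeholder)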

@chaodaiG
Contributor Author

chaodaiG commented Nov 9, 2022

Hi @jimangel, it's not a derail at all. IIRC the CSI driver for GCP was in a very early release cycle when we decided to adopt Kubernetes External Secrets. The proposal to transition from Kubernetes External Secrets to External Secrets Operator was pretty much a lazy action based on the recommendation from Kubernetes External Secrets.

In terms of cost, we don't have that many secrets or that many access operations, so I wouldn't be too worried about it.

I would be glad to take another look at the CSI driver for GCP since it's ready now. I'll do a quick evaluation from an operational and maintenance perspective, and will get your thoughts if any questions come up.

@chaodaiG
Contributor Author

Had an extensive and wonderful offline discussion with @jimangel, and here is what we agreed on:

  • External Secrets Operator works as a central proxy service. It uses a dedicated k8s cluster SA that is Workload Identity (WI) bound to a GCP SA; this GCP SA is given GCP Secret Manager permissions for all secrets used in the k8s cluster. These secrets are synced one way into the k8s cluster, and all pods in the cluster are allowed to use any of these secrets as long as they are in the same namespace.
  • The CSI driver works by using the authentication of the pods that need to mount the secrets. For GCP this means the Workload Identity bound cluster SA on the pod is used to authenticate with GCP Secret Manager (see the sketch after this list).
  • Technically speaking the CSI driver is more secure than External Secrets Operator, as a prowjob pod will only be able to use secrets that its SA is allowed to access (we don't yet have fine-grained separation of different teams using different SAs, so this is more of a future-proofing benefit).
  • Beyond security boundaries, one benefit of the CSI driver is that it avoids granting a GCP SA from the Prow service cluster GCP Secret Manager permissions in other projects; as a result, migrating or recovering would be much easier (no IAM changes required in users' projects).
  • One "downside" is that, in terms of authentication, it only supports Workload Identity for GCP, so jobs that are not using Workload Identity will not be able to use this feature.
  • The other "downside/WAI" is that a pod will fail to start when a secret is not available. This is expected for a prowjob, and IMO is even better than failing due to stale secrets that were synced 7 days ago. For Prow services we will need to make sure that all the kubeconfig secrets are stored in the GCP project where Prow lives, to avoid the case where a user-provided kubeconfig secret being deleted in GCP causes Prow downtime.
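A rough sketch of what the CSI driver approach looks like, following the secrets-store-csi-driver-provider-gcp docs (all names, projects, and secret paths below are placeholders, not the real Prow configuration):

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: my-secrets                # placeholder name
spec:
  provider: gcp
  parameters:
    secrets: |
      - resourceName: "projects/my-gcp-project/secrets/my-secret/versions/latest"
        path: "token"             # file name under the mount path

and the pod (or prowjob) mounts it via the CSI volume, authenticating with its own Workload Identity bound SA:

  volumes:
    - name: secrets
      csi:
        driver: secrets-store.csi.k8s.io
        readOnly: true
        volumeAttributes:
          secretProviderClass: my-secrets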

With all that being said, I'm convinced that the CSI driver is better suited for our use case. Kudos to @jimangel, thank you so much for the discussion; I feel I've learned a lot!

@BenTheElder @spiffxp @dims @ameukam @cjwagner, WDYT?

@jimangel
Member

Awesome write up @chaodaiG! Agreed, it was fun chatting.

One "downside", is that In terms of authentication it only supports workload identity for GCP, so jobs that are not using workload identity will not be able to use this feature

There are alternatives for authentication outlined here: https://github.com/GoogleCloudPlatform/secrets-store-csi-driver-provider-gcp/blob/main/docs/authentication.md but the general consensus is to use WI if at all possible.

@dims
Member

dims commented Nov 10, 2022

@chaodaiG @jimangel Nice! +100

@cjwagner
Member

Sounds like a nice improvement to me!

@ameukam
Member

ameukam commented Nov 11, 2022

@chaodaiG @jimangel Nice idea!

Let's try it.

@ameukam
Member

ameukam commented Nov 11, 2022

@jimangel So if the secret is mounted as a volume in the pod, how is this isolated from the other pods running on the same node?

@jimangel
Member

jimangel commented Nov 11, 2022

So if the secret is mounted as a volume in the pod, how is this isolated from the other pods running on the same node?

I believe the threat model is the same as before (or more secure). Access today is segmented by namespace (k8s "built-in" secrets). With the CSI driver, access is only permitted when all conditions are met:

  1. A namespace-scoped SecretProviderClass (CRD) defining access exists (this directs the mount to the appropriate GCP project / secret).
  2. GCP IAM bindings exist for a Service Account / Workload Identity in GCP to access the specific secret resource(s).

NOTE: Any workload/job/pod in a shared k8s namespace could use the same service account to access the permitted secret(s)/SecretProviderClass. That should be no different from any pod in the same namespace accessing the same secret today.

As far as what access pods on the same node have (isolation)... if any pod/actor can mount/escape a pod to access node-level storage layers, you are as screwed as you'd be if you were using Kubernetes "built-in" secrets. 😅

Let me know if I misunderstood what you're asking @ameukam!

Edit: There are a couple "security considerations" called out in the repo itself.

@jimangel
Member

Hey all! Checking in here, what would be the next steps @chaodaiG? Should we try a small-scale demo or is there somewhere to test?

@chaodaiG
Contributor Author

@cjwagner could you please take a look?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 30, 2023
@BenTheElder BenTheElder added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 1, 2023
@michelle192837

This comment was marked as off-topic.

@michelle192837 michelle192837 added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Sep 12, 2023
@michelle192837
Contributor

This is open for contribution if anyone's willing to do so. (We do keep seeing infrequent errors or flakes that require KES to be restarted, so while it's not urgent it'd be helpful!).

@ameukam
Member

ameukam commented Oct 3, 2023

@michelle192837 Assuming this needs to be deployed on a Google-owned GKE cluster, one action would be to create an SA with Workload Identity so we can use it to retrieve the secrets from Secret Manager. I think this can only be done by EngProd.

@cjwagner cjwagner changed the title from "Migrate from Kubernetes External Secrets to External Secrets Operator" to "Migrate from Kubernetes External Secrets to ~External Secrets Operator~ CSI Driver" Oct 30, 2023