
Multi Tenancy for Persistent Volumes #47326

Open
krmayankk opened this issue Jun 12, 2017 · 44 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/storage Categorizes an issue or PR as relevant to SIG Storage. wg/multitenancy Categorizes an issue or PR as relevant to WG Multitenancy.

Comments

@krmayankk

What keywords did you search in Kubernetes issues before filing this one? tenancy

This issue is intended to start a discussion on how to do multi-tenancy in Kubernetes, for Persistent Volumes in particular.

My team is currently trying to enable stateful apps for our internal customers. One requirement that keeps coming up is how to isolate the PVs of one internal customer from the PVs of another internal customer.

I see the following isolation mechanisms:

  • A PV, once bound to a PVC (in namespace A), cannot be bound to another PVC (in namespace B) unless it is first unbound, so the binding is exclusive.
  • When using StorageClasses, a PV of a given class can only be bound to a PVC of the same class, i.e. a PVC of class A can only be bound to a PV of class A. This prevents a PV allocated to one customer from accidentally being allocated to another customer, assuming each customer gets a separate StorageClass (a minimal sketch follows this list).
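A minimal sketch of the per-customer StorageClass model; the tenant-a names are hypothetical, the RBD provisioner is just an assumed backend, and its parameters are omitted.

```yaml
# Hypothetical per-tenant class; one such class would exist per customer.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: tenant-a
provisioner: kubernetes.io/rbd   # assumption: RBD-backed storage; parameters omitted
---
# A claim in tenant A's namespace requests only tenant A's class,
# so it can only bind to PVs of that class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
  namespace: tenant-a
spec:
  storageClassName: tenant-a
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
```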

While the above isolation is good, it's not enough (as I understand it). In a multi-tenant environment we want mechanisms which can guarantee that a volume allocated to one customer can never be accidentally allocated, mounted, or accessed by another customer.

When using Kubernetes, what should we recommend to our customers?

A few more questions and considerations:

  • Why are Persistent Volumes not namespaced?
  • Is one (or more) StorageClass per customer a good multi-tenancy model?
  • Why does AttachDisk on volumes always happen on the node using kubelet? I ask because for some networked storage, e.g. EBS and RBD, this can be done from the master nodes. Doing this from kubelet means exposing more permissions to users than is necessary.

Some form of RBAC might be good, but currently it works cluster-wide or per namespace. Since volumes are considered cluster-wide, we can only do cluster-wide RBAC; but the volumes belong to different tenants, so we want something else (and we need to define that something else).

Some related discussion on multi-tenancy here: #40403

@k8s-github-robot

@krmayankk There are no sig labels on this issue. Please add a sig label by:
(1) mentioning a sig: @kubernetes/sig-<team-name>-misc
(2) specifying the label manually: /sig <label>

Note: method (1) will trigger a notification to the team. You can find the team list here and label list here

@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jun 12, 2017
@krmayankk
Author

/sig storage

@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Jun 12, 2017
@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jun 12, 2017
@krmayankk
Author

Some answers from the mailing list:
@davidopp says

Regarding the concern of reusing a PV from one tenant to another, can you just use the "delete" reclaim policy? That means the underlying volume will not be reused (of course you can't really guarantee anything about what happens under the covers for network storage when it tells you it has been deleted). For PV types that do recycling, you could use a custom recycler to ensure the data is really deleted.

If your set of tenants is very static, I guess you could have one StorageClass per tenant and only use the "recycle" reclaim policy (which seems to be what you're advocating). But this seems pretty inefficient from a utilization standpoint, as you'd end up accumulating the max number of PVs used by each tenant.

What kind of volume type are you interested in?

@davidopp Could you elaborate on the inefficient utilization when you mention the following? "If your set of tenants is very static, I guess you could have one StorageClass per tenant and only use the 'recycle' reclaim policy (which seems to be what you're advocating). But this seems pretty inefficient from a utilization standpoint, as you'd end up accumulating the max number of PVs used by each tenant."
We are interested in RBD to start with; we will be doing EBS volumes as well.

@krmayankk
Author

@mikedanese says
I don't think this is possible today, but if we supported disk encryption (e.g. LUKS) on a subset of storage drivers, a "recycle" to a tenant that didn't have access to the secret to unlock the disk is essentially a cryptographic erase. This would allow a single tenant to, best case, recycle a disk without consequence if the disk is reclaimed by another tenant, and without the inefficiencies of maintaining multiple storage pools.

@mikedanese Yes, encryption looks like a reasonable solution as long as we can keep a separate encryption key per tenant, and only the tenant has access to its own encryption keys. EBS volumes are the only ones supporting encryption, so for RBD we are out of luck. Is there a general pattern for doing encryption with external provisioning? What about incorporating tenancy into the in-tree plugins?
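As a sketch of the per-tenant encryption idea for the one in-tree plugin that supports it: the kubernetes.io/aws-ebs provisioner accepts encrypted and kmsKeyId parameters, so each tenant's StorageClass could point at that tenant's own KMS key. The class name and key ARN below are placeholders.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: tenant-a-encrypted          # hypothetical per-tenant class
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  encrypted: "true"
  # Assumption: one KMS key per tenant; the ARN is a placeholder.
  kmsKeyId: arn:aws:kms:us-east-1:111122223333:key/tenant-a-key-id
```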

@krmayankk
Author

@smarterclayton says
For hard tenancy we use the delete policy extensively in combination with dynamic provisioning. The goal is simply to ensure we never put a new volume into the pool without it coming from a clean disk. Reclaim is going to be a best effort, but it's also more work to use and isn't as flexible as your own provisioner.

@smarterclayton Doesn't OpenShift have use cases for keeping the volumes around after the pods are gone? Does OpenShift do any kind of multi-tenancy on top?

@krmayankk
Author

@thockin says: Hard multi-tenancy (whatever that means) is not currently a feature of Kubernetes in a first-class way. There are many things you can do to approximate it, but it is not pervasive or consistently designed.

Re why PV is not namespaced, before provisioning was available, PVs were a cluster-owned thing. Now with provisioning, we could maybe simplify, but we still need to be compatible, so there's that challenge.

@thockin thanks. I'm not advocating that PVs should or should not be namespaced, just trying to understand the rationale. One thought was that if PVs were dynamically provisioned in the customer's namespace, that might further limit access to them.

@krmayankk
Author

@jeffvance "... guarantee that a volume allocated to one customer can never be accidentally allocated/mounted/accessed by another customer"

I don't see "delete" as the answer here since this deletes the actual data in the volume. What if the customer has legacy data or important data that lives beyond the pod(s) accessing it? Perhaps an "exclusive" accessMode might help? FSGroup IDs can control access and there's been talk about adding ACLs to PVs. Or, maybe I've misunderstood the question?

@jeffvance your understanding of my question is correct. Could you link the issues which talk about ACL'ing PVs?

Overall I want a multi-tenant model where:
-- it is not possible for one tenant to accidentally mount a volume created by another tenant
-- it is more secure, so that an attack or compromise has a limited surface area and limited access to volumes

In the absence of encryption, a Kubernetes-native way of ACL'ing PVs that doesn't rely on the underlying storage implementation would be great.

@jeffvance
Contributor

@krmayankk I don't think volume ACLs are a formal issue yet. I've cc'd @childsb and @erinboyd since I think they may have some preliminary notes on this topic.

@wongma7
Contributor

wongma7 commented Jun 12, 2017

If we assume each customer gets a StorageClass, you can use ResourceQuotas (per namespace) to prevent certain namespaces from requesting that storage, i.e. set <storageclass>.storageclass.storage.k8s.io/persistentvolumeclaims to 0 for everybody but customer X.

You can also set the claimRef namespace field in PVs to restrict each one to binding only with PVCs from a certain namespace.

(Not very elegant solutions, IMO/admittedly :) )
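A minimal sketch of both suggestions; the tenant names, PV name, and RBD details are hypothetical, and the quota key assumes the tenant's class is named tenant-a.

```yaml
# In every namespace except tenant A's, forbid claims against tenant A's class.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: deny-tenant-a-storage
  namespace: tenant-b
spec:
  hard:
    tenant-a.storageclass.storage.k8s.io/persistentvolumeclaims: "0"
---
# Pre-reserve a PV for a specific claim in tenant A's namespace via claimRef.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-tenant-a-0001
spec:
  capacity:
    storage: 10Gi
  accessModes: ["ReadWriteOnce"]
  storageClassName: tenant-a
  claimRef:
    namespace: tenant-a       # only a claim from this namespace ...
    name: data                # ... with this name can bind here
  rbd:                        # assumption: RBD backend; keyring/secret details elided
    monitors: ["10.0.0.1:6789"]
    image: tenant-a-img-0001
```

Note that claimRef pins the PV to one specific claim (namespace plus name), which is stricter than a per-namespace restriction but gives the same isolation.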

@msau42
Member

msau42 commented Jun 12, 2017

Regarding your question about AttachDisk, it does occur on the master node by default, and not kubelet.

@krmayankk
Author

@msau42 Specifically for EBS and RBD, where does AttachDisk happen: on the master or on the node? I see code in both the kubelet and the controller. How do we know where this is happening? Is this configurable for each volume type?

@msau42
Member

msau42 commented Jun 13, 2017

By default, the attach/detach controller runs in the master. It will invoke the plugin's attach routine through the operation executor. There is a kubelet option to enable attach/detach controller in kubelet, but it is off by default. It is not on a per-volume basis.

@mikedanese mikedanese added the sig/auth Categorizes an issue or PR as relevant to SIG Auth. label Jun 13, 2017
@mikedanese
Member

mikedanese commented Jun 13, 2017

Yes, encryption looks like a reasonable solution as long as we can keep a separate encryption key per tenant, and only the tenant has access to its own encryption keys. EBS volumes are the only ones supporting encryption, so for RBD we are out of luck. Is there a general pattern for doing encryption with external provisioning? What about incorporating tenancy into the in-tree plugins?

Any volume plugin that is backed by a block device can probably support LUKS. @kubernetes/sig-storage-feature-requests have we ever discussed LUKS encryption layers for block device volumes?

@rootfs
Contributor

rootfs commented Jun 13, 2017

Why does AttachDisk on volumes always happen on the node using kubelet? I ask because for some networked storage, e.g. EBS and RBD, this can be done from the master nodes. Doing this from kubelet means exposing more permissions to users than is necessary.

AttachDisk happens on Kubernetes master for cloud block storage (EBS, PD, Cinder, Azure). Kubelet doesn't have to have privileged credentials.

@krmayankk
Author

Interesting, @rootfs @msau42. So for RBD, why do we need the user secret in the user's namespace (or is my understanding wrong)? Put another way: for RBD we need two secrets, admin and user. My understanding is that the user secret must be in the same namespace as the PVC and is used for AttachDisk. If AttachDisk is happening on the master, we should allow the user secret to be in any namespace, or accept a namespace field for it.

@rootfs
Contributor

rootfs commented Jun 13, 2017

Interesting, @rootfs @msau42. So for RBD, why do we need the user secret in the user's namespace (or is my understanding wrong)? Put another way: for RBD we need two secrets, admin and user. My understanding is that the user secret must be in the same namespace as the PVC and is used for AttachDisk. If AttachDisk is happening on the master, we should allow the user secret to be in any namespace, or accept a namespace field for it.

rbd doesn't support third-party attach, so rbd map has to happen on the kubelet.

The rbd admin and user keyrings are for different purposes: the admin keyring is for rbd image provisioning (admin privilege), while the user keyring is for rbd map (non-admin privilege). Pods that use an rbd image don't have admin keyrings.
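For reference, a sketch of how the two keyrings show up in a dynamically provisioned RBD StorageClass; the monitor address, pool, and secret names are hypothetical. The admin secret is only used by the provisioner to create and delete images, while the user secret (looked up in the PVC's namespace) is what the kubelet uses for rbd map.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rbd-tenant-a                  # hypothetical per-tenant class
provisioner: kubernetes.io/rbd
parameters:
  monitors: 10.0.0.1:6789
  pool: tenant-a-pool                 # assumption: one Ceph pool per tenant
  adminId: admin
  adminSecretName: ceph-admin-secret  # admin keyring: image provisioning only
  adminSecretNamespace: kube-system
  userId: tenant-a
  userSecretName: ceph-user-secret    # user keyring: used on the node for rbd map
```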

@krmayankk
Author

@rootfs @msau42 is there a configuration which controls where AttachDisk happens, or is it just check-based, i.e. at pod creation time, if AttachDisk has not already been called, it will be called? So for EBS it would never be called on the node, since attach has already happened on the master?

@msau42
Member

msau42 commented Jun 13, 2017

The attach operation is only performed by the attach/detach controller. And that controller is only enabled to run in the master node by default. There is a kubelet option to turn on attach/detach on the node, but it's going to be removed.

@jsafrane
Member

Overall I want a multi-tenant model where:
-- it is not possible for one tenant to accidentally mount a volume created by another tenant

That's already implemented. If a PV gets bound to a PVC, the PV can never get bound to another PVC. Only pods in the same namespace as the PVC can use it.

When the PVC is deleted, the PV becomes Released. Based on the PV's persistentVolumeReclaimPolicy, the PV is either deleted or recycled (the data on the PV is discarded in both cases), or it remains Released forever so nobody can bind to it. Only an admin can manually access data on the PV or forcefully bind the PV to another PVC.

@ddysher
Contributor

ddysher commented Jun 14, 2017

I think one key point from what @krmayankk describes is the identity of a PV (e.g. a PV created by a tenant, a PV of an internal customer). After binding, the PV takes on the identity of the PVC, but the binding process itself doesn't seem to take that identity into consideration. From what I know, it looks at the selector, StorageClass, access modes, etc. As of now, it seems the best way is to mimic identity information using the selector and StorageClass.

@krmayankk why do you want to reserve a PV for a tenant?

@krmayankk
Author

@jsafrane @ddysher in the case of dynamic provisioning, the reclaim policy is always Delete. But enterprises might want to keep the PVs around and not delete them immediately, for safety reasons. So we can't make use of the Released phase to prevent the binding.

The identity of the PV is important because under no circumstances do we want the binding process to accidentally bind a PV of one customer to a PVC of a different customer, and there is nothing that prevents it. Agreed that while a PVC is bound to a PV it cannot bind to another PV, but if we accidentally end up with some unbound PVs from customer A (due to a bug or whatever), we would want them not to get bound to customer B's PVCs. The only way I can think of to prevent that today is to assign per-customer storage classes.

@msau42
Member

msau42 commented Jun 21, 2017

@krmayankk once the PV is deleted, the PV object no longer exists, so you can't accidentally bind to it. You would have to recreate the PV, either statically or through dynamic provisioning.

Is what you really want the ability to specify a Retain policy for dynamically provisioned PVs, so that you have a chance to clean up the data before it gets put back into the provisioning pool?

@krmayankk
Author

@msau42 Since dynamic provisioning doesn't support setting the reclaim policy and always deletes, what we are doing is not deleting the PVCs when deleting StatefulSets. That way we explicitly GC the PVCs, and hence the PVs, at a later time. While the PVC (and hence the PV) is waiting for GC, I am worried that the PVC could get unbound due to bugs, and the PV would then become available for binding by other tenants. Do you think this is possible? If the PVC/PV somehow get unbound for a dynamically provisioned PV, will the phase of the PV be Released or Available?
Yes, the ability to specify a Retain policy would really give us a chance to clean up the data, although some tenants may not want that; they would want new PVs, not recycled ones.

@ddysher
Contributor

ddysher commented Jun 22, 2017

If a PV is bound to a PVC, then there are two pieces of information about the two-way binding:

  • In PVC, spec.volumeName tells which PV the PVC is bound to
  • In PV, spec.claimRef tells which PVC is bound with the PV

pvc.spec.volumeName cannot be edited once set. pv.spec.claimRef can be removed entirely, and if so, pv.status.phase will become Available. However, since pvc.spec.volumeName is non-empty and points to the PV, the pv_controller will try to bind the PV and PVC again.

My suspicion is that in between, if another PVC is also pending, it's possible that the other PVC will bind to the PV, and the original PVC will just stay as is (with the pv_controller retrying the bind and failing). If my understanding is correct, I think the likelihood that the PV gets bound by another tenant is pretty low, if not impossible.
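To make the two pointers concrete, here is roughly what a bound pair looks like; the names and uid are made up, and the PV's capacity and volume source are omitted.

```yaml
# The claim records which PV it is bound to ...
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
  namespace: tenant-a
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
  volumeName: pv-tenant-a-0001   # set by the controller; cannot be edited once set
status:
  phase: Bound
---
# ... and the PV records which claim it is bound to.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-tenant-a-0001
spec:                            # capacity, volume source, etc. omitted
  claimRef:
    namespace: tenant-a
    name: data
    uid: 11111111-2222-3333-4444-555555555555   # guards against a recreated claim with the same name
status:
  phase: Bound
```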

If the PVC/PV somehow get unbound for a dynamically provisioned PV, will the phase of the PV be Released or Available?

@msau42
Member

msau42 commented Jun 22, 2017

Assuming no Recycle policy, once the PV is unbound it either goes to Released (and stays there with the Retain policy) or the PV object is deleted entirely. The same PV object cannot go back to the Available state, so the same PV object is never reused; it's always a new PV object.

Now, whether or not the data on the underlying backing volume gets cleaned up before being put back into the storage pool (used for dynamic provisioning) by the storage provider is a different story, and it depends on each provider. For example, for GCE PD, when the disk gets deleted, it is guaranteed that the content is cleaned up in the underlying volume before it can be reused for a new disk. For local storage, the provided external provisioner will clean up the data when the PV is released. For other volume plugins, that may not be the case, and I believe that is why @krmayankk wants the Retain policy, to be able to manually clean up the data on the volume.

I still think that if we had the ability to set the reclaim policy to Retain on dynamically provisioned volumes, it would address your concern about cleaning up volumes before they are used again by other tenants.

@jeffvance
Contributor

jeffvance commented Jun 22, 2017

See also issue 38192 Allow configuration of reclaim policy in StorageClass
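In later Kubernetes releases, StorageClass gained a reclaimPolicy field along the lines of that issue, which gives the Retain-for-dynamically-provisioned-PVs behaviour discussed above. A sketch, with a hypothetical class name:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: tenant-a-retain          # hypothetical
provisioner: kubernetes.io/rbd   # assumption: any provisioner works here
reclaimPolicy: Retain            # dynamically provisioned PVs stay Released (not deleted) after their claim goes away
```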

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 10, 2018
@mikedanese mikedanese removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 20, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 21, 2018
@redbaron
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 21, 2018
@CalvinHartwell
Contributor

@krmayankk sorry to respond to this necro post, but did you come to any conclusion on this? I assume you can create a LimitRange that is really small for namespaces which should be restricted from using certain StorageClasses, correct?

Have CSI plugins in k8s 1.10 and the new storage improvements in 1.11 sorted out any issues?

https://kubernetes.io/docs/tasks/administer-cluster/limit-storage-consumption/

Restricting storage access with Ceph is quite easy; I'm not sure about NetApp using Trident or other mechanisms right now.

Thanks!
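A sketch of the LimitRange approach from the linked docs page; the namespace name and sizes are arbitrary. Note that a LimitRange only bounds the size of individual claims; to block a particular StorageClass outright, the per-class ResourceQuota mentioned earlier in the thread is closer to what you want.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: storagelimits
  namespace: restricted-tenant   # hypothetical namespace
spec:
  limits:
  - type: PersistentVolumeClaim
    max:
      storage: 2Gi               # no single PVC in this namespace may request more than 2Gi
    min:
      storage: 1Gi
```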

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 2, 2018
@krmayankk
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 2, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 31, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 30, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@krmayankk
Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 21, 2019
@krmayankk
Author

/reopen

@k8s-ci-robot
Contributor

@krmayankk: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Oct 21, 2019
@krmayankk
Author

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Oct 21, 2019
@krmayankk krmayankk added this to Enterprise Readiness in Technical Debt Research Oct 21, 2019
@tallclair tallclair added wg/multitenancy Categorizes an issue or PR as relevant to WG Multitenancy. and removed sig/auth Categorizes an issue or PR as relevant to SIG Auth. labels Oct 30, 2019