Investigate alternatives for out-of-tree PersistentVolume labelling #4
/milestone v1.14
/help
/remove-help
@liggitt @dims @msau42 I'm currently working on building a reference implementation of a mutating admission webhook that should replace the PVL controller (that previously used Initializers). I'm stuck on a few things and some clarification would help.
Thanks!
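For context, here is a minimal sketch of the kind of mutating webhook handler described above. This is not the reference implementation; the handler path, the `labelsForVolume` helper, and the hard-coded zone label are placeholders, and the TLS/cert plumbing a real webhook needs is omitted.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"

	admissionv1beta1 "k8s.io/api/admission/v1beta1"
	corev1 "k8s.io/api/core/v1"
)

// labelsForVolume stands in for the cloud-provider lookup (e.g. zone/region
// of the backing disk); the hard-coded value is obviously a placeholder.
func labelsForVolume(pv *corev1.PersistentVolume) map[string]string {
	return map[string]string{"failure-domain.beta.kubernetes.io/zone": "us-central1-a"}
}

// mutatePV decodes the AdmissionReview, computes labels for the PV, and
// returns a JSON patch that adds them. Merging with existing labels and
// proper error responses are glossed over to keep the sketch short.
func mutatePV(w http.ResponseWriter, r *http.Request) {
	var review admissionv1beta1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	var pv corev1.PersistentVolume
	if err := json.Unmarshal(review.Request.Object.Raw, &pv); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	patch := []map[string]interface{}{
		{"op": "add", "path": "/metadata/labels", "value": labelsForVolume(&pv)},
	}
	patchBytes, err := json.Marshal(patch)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	patchType := admissionv1beta1.PatchTypeJSONPatch
	review.Response = &admissionv1beta1.AdmissionResponse{
		UID:       review.Request.UID,
		Allowed:   true,
		Patch:     patchBytes,
		PatchType: &patchType,
	}
	json.NewEncoder(w).Encode(&review)
}

func main() {
	http.HandleFunc("/mutate", mutatePV)
	// A real webhook must serve TLS with a cert the API server trusts;
	// plain HTTP here only keeps the sketch short.
	log.Fatal(http.ListenAndServe(":8443", nil))
}
```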
cc @mbohlool ^
Long term, PVs should only need PV.NodeAffinity to schedule pods to appropriate nodes, and this will be populated through dynamic provisioning without requiring the admission controller. As far as I'm aware, the aws, gce, and azure cloud providers have had this logic since 1.12. So the admission controller for PV labeling only serves two purposes now:
We also have to consider whether we need to add additional logic for PVs created manually for CSI volume types. CSI drivers define their own topology keys that are different from the Kubernetes labels (because CSI drivers are CO-agnostic). As it stands today, PVs backed by CSI drivers that are manually created by users do not have any auto-labeling.
So this might be a reasonable expectation for out-of-tree cloud providers. The only reason we are investigating an out-of-tree PV labelling mechanism is to support existing legacy behavior of labelling PVs that the current in-tree providers rely on. But if we are temporarily supporting this legacy behaviour via the in-tree PVL admission controller and the long term goal is to only require PV.NodeAffinity, then maybe there's no need for out-of-tree PV labelling at all and we should be steering new providers towards the end goal rather than also having them deal with deprecating/removing labels in the future? Maybe we can add a controller to the CCM that updates PV.NodeAffinity only? The only concern with that is the volume can bind to a pod before the NodeAffinity rule is applied (something Initializers solved for us). Maybe this possible race condition is acceptable if we're only talking about manually created PVs?
To clarify, PVs created by CSI drivers don't have auto labeling support but they set their own keys as part of PV.NodeAffinity? Are these keys common across all CSI drivers?
Yeah I think we want to add PV.NodeAffinity at admission time, or require users to set it (could break backwards compatibility). PVs that CSI drivers dynamically provision will automatically get PV.NodeAffinity. However, the keys they use are vendor-specific, not the common Kubernetes topology labels. For example, the topology key that the GCE PD CSI driver uses is "topology.gke.io/zone". The primary reasons for this were:
But this was also decided with the thought that Kubernetes wouldn't make zones/regions first class labels and all topology would be provider/user-defined. Maybe we can reconsider the CSI decision now.
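To make the difference concrete, here is roughly the shape of node affinity involved, using the Go structs from k8s.io/api/core/v1. The zone values are made up; the point is only that the key a CSI driver sets (e.g. "topology.gke.io/zone") is not the common Kubernetes zone label key.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// zonalAffinity builds the PV.NodeAffinity a zonal volume ends up with; only
// the label key differs between the in-tree and CSI cases discussed above.
func zonalAffinity(key, zone string) *corev1.VolumeNodeAffinity {
	return &corev1.VolumeNodeAffinity{
		Required: &corev1.NodeSelector{
			NodeSelectorTerms: []corev1.NodeSelectorTerm{{
				MatchExpressions: []corev1.NodeSelectorRequirement{{
					Key:      key,
					Operator: corev1.NodeSelectorOpIn,
					Values:   []string{zone},
				}},
			}},
		},
	}
}

func main() {
	// Common Kubernetes zone label key (what in-tree zonal volumes use).
	fmt.Printf("%+v\n", zonalAffinity("failure-domain.beta.kubernetes.io/zone", "us-central1-a"))
	// Vendor-specific key set by a CSI driver at provisioning time.
	fmt.Printf("%+v\n", zonalAffinity("topology.gke.io/zone", "us-central1-a"))
}
```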
Oh sorry, I forgot to clarify: users that manually create PVs backed by CSI drivers do not automatically get labels nor PV.NodeAffinity. So that behavior is already different from in-tree volumes.
Thanks for clarifying. For CSI driver support, it seems like we need to have a longer discussion. For v1.14 though, we do need to figure out what needs to happen with the out-of-tree PV labelling admission controller. Seems like the options are:
For the v1.14 time frame, I'm personally in favor of 1) and making sure we follow through with the long term goal of removing zone labelling support entirely and only using PV.NodeAffinity.
For out-of-tree volume types, even the out-of-tree PV labeler didn't work, right? Because it needs to query each cloud provider for the zone, and the only out-of-tree types that we implemented were the same as the in-tree types.
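(For reference, the hook being queried is the cloud provider's PVLabeler interface, reproduced roughly from memory of k8s.io/cloud-provider; treat the exact signature as approximate.)

```go
package cloudprovider

import (
	"context"

	v1 "k8s.io/api/core/v1"
)

// PVLabeler is the per-provider hook the PV labeling controller calls.
// Nothing in this signature restricts which PV types get passed in, which is
// why the labeler ends up invoking it for every PV it picks up.
type PVLabeler interface {
	GetLabelsForVolume(ctx context.Context, pv *v1.PersistentVolume) (map[string]string, error)
}
```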
The out-of-tree PV labeler was a bit naive and applied the PV labels to every PV with the initializer flag set (maybe this is a bug?), so there was nothing stopping an out-of-tree provider from implementing
I'm not familiar with the original out-of-tree labeler design. What determined what type of PV gets the initializer flag? i.e., how do you distinguish between an NFS PV and a GCE PV?
Same here 😆. But from what I'm reading, if you configure your cluster correctly (i.e.

The original PR kubernetes/kubernetes#44680 actually did check whether the PV was of the correct type (supported AWS EBS / GCE PD), but then a follow-up PR kubernetes/kubernetes#52169 replaced that check with a call to

The more I dig into the history of this controller, the more hesitant I am to try to find a replacement for v1.14, because it seems like:
How do we feel about getting rid of the out-of-tree PVL controller for v1.14, backporting volumes to the in-tree PVL admission controller for the time being (where it makes sense) until we have a better solution in place for v1.15 that also accounts for:
cc @cheftako
Hm yes, looking at the gce implementation at least, this function will segfault if it was passed any PV type other than gce pd... I'm ok with getting rid of the controller for 1.14 considering that it's alpha and has various issues.
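To illustrate the failure mode (this is a simplified sketch, not the actual gce code): the provider-specific labeling path reads the GCE PD source of the PV without checking that it is set, so any other volume type hits a nil pointer.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// pdNameForVolume mimics the unchecked assumption: it reads the GCE PD source
// directly. For an NFS (or any non-GCE-PD) PV, pv.Spec.GCEPersistentDisk is
// nil and the field access panics with a nil pointer dereference.
func pdNameForVolume(pv *v1.PersistentVolume) string {
	return pv.Spec.GCEPersistentDisk.PDName
}

func main() {
	nfsPV := &v1.PersistentVolume{
		Spec: v1.PersistentVolumeSpec{
			PersistentVolumeSource: v1.PersistentVolumeSource{
				NFS: &v1.NFSVolumeSource{Server: "10.0.0.1", Path: "/exports"},
			},
		},
	}
	fmt.Println(pdNameForVolume(nfsPV)) // panics: nil pointer dereference
}
```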
Deleted the PV controller for v1.14, will work with SIG storage for next steps on this for v1.15.
/milestone v1.15
note: update issue to only scope investigation or KEP
/close
PVs in the long term won't be using labels for topology anyways, so I don't think this is needed anymore. ref kubernetes/kubernetes#72139
@andrewsykim: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
PVs do not need labels for scheduling; however, it is unclear whether we still want an admission controller that automatically generates PV.NodeAffinity for manually created PVs, or whether we should require that users who manually create their PVs also manually fill in PV.NodeAffinity. The current expectation is that the admission controller does this.
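A rough sketch of what that admission-time step amounts to: translating the topology labels on a manually created PV into the equivalent PV.NodeAffinity. The helper name and the fixed key list are illustrative only, not the actual admission controller code.

```go
package pvtopology

import (
	v1 "k8s.io/api/core/v1"
)

// nodeAffinityFromLabels converts zone/region labels into the equivalent
// PV.NodeAffinity. Real zone labels can encode multiple zones, which this
// sketch ignores; it exists only to show the translation being discussed.
func nodeAffinityFromLabels(labels map[string]string) *v1.VolumeNodeAffinity {
	var requirements []v1.NodeSelectorRequirement
	for _, key := range []string{
		"failure-domain.beta.kubernetes.io/zone",
		"failure-domain.beta.kubernetes.io/region",
	} {
		if value, ok := labels[key]; ok {
			requirements = append(requirements, v1.NodeSelectorRequirement{
				Key:      key,
				Operator: v1.NodeSelectorOpIn,
				Values:   []string{value},
			})
		}
	}
	if len(requirements) == 0 {
		return nil
	}
	return &v1.VolumeNodeAffinity{
		Required: &v1.NodeSelector{
			NodeSelectorTerms: []v1.NodeSelectorTerm{{MatchExpressions: requirements}},
		},
	}
}
```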
When we removed (alpha) Initializer support, we essentially made the PVL (persistent volume labelling) controller unusable. We should either update the controller or add another implementation that uses a mutating admission webhook instead.
If we use a mutating admission webhook, we should delete the PVL controller we have in-tree.
ref: kubernetes/kubernetes#73319