
Investigate alternatives for out-of-tree PersistentVolume labelling #4

Closed
andrewsykim opened this issue Feb 12, 2019 · 21 comments

@andrewsykim
Member

When we removed (alpha) Initializer support, we essentially made the PVL (persistent volume labelling) controller unusable. We should either update the controller or add another implementation that uses a mutating admission webhook instead.

If we use a mutating admission webhook, we should delete the PVL controller we have in-tree.

ref: kubernetes/kubernetes#73319
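
For context, a rough sketch of what the mutating-webhook approach could look like: an HTTPS handler that receives an AdmissionReview for PersistentVolume CREATE and returns a JSONPatch adding the provider's zone labels. This is only a sketch assuming the admission/v1beta1 API; labelsForPV is a hypothetical stand-in for the provider's GetLabelsForVolume call, and the cert paths and zone value are placeholders.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"

	admissionv1beta1 "k8s.io/api/admission/v1beta1"
	corev1 "k8s.io/api/core/v1"
)

// labelsForPV is a hypothetical stand-in for a provider-specific lookup
// (e.g. the cloud provider's GetLabelsForVolume); the value is hard-coded
// purely for illustration.
func labelsForPV(pv *corev1.PersistentVolume) map[string]string {
	return map[string]string{"failure-domain.beta.kubernetes.io/zone": "zone-a"}
}

// mutatePV decodes an AdmissionReview for a PersistentVolume and responds
// with a JSONPatch that sets the zone labels. A real implementation would
// merge with any existing labels instead of replacing /metadata/labels.
func mutatePV(w http.ResponseWriter, r *http.Request) {
	var review admissionv1beta1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	var pv corev1.PersistentVolume
	if err := json.Unmarshal(review.Request.Object.Raw, &pv); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	patch, _ := json.Marshal([]map[string]interface{}{
		{"op": "add", "path": "/metadata/labels", "value": labelsForPV(&pv)},
	})

	pt := admissionv1beta1.PatchTypeJSONPatch
	review.Response = &admissionv1beta1.AdmissionResponse{
		UID:       review.Request.UID,
		Allowed:   true,
		Patch:     patch,
		PatchType: &pt,
	}
	json.NewEncoder(w).Encode(&review)
}

func main() {
	http.HandleFunc("/mutate-pv", mutatePV)
	// Admission webhooks must be served over TLS; cert paths are placeholders.
	log.Fatal(http.ListenAndServeTLS(":8443", "tls.crt", "tls.key", nil))
}
```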

@andrewsykim
Member Author

/milestone v1.14

@andrewsykim
Member Author

/help

@k8s-ci-robot added the help wanted label Feb 12, 2019
@andrewsykim added this to the v1.14 milestone Feb 12, 2019
@andrewsykim
Member Author

/remove-help
/assign

@k8s-ci-robot removed the help wanted label Feb 20, 2019
@andrewsykim
Member Author

@liggitt @dims @msau42 I'm currently working on building a reference implementation of a mutating admission webhook that should replace the PVL controller (that previously used Initializers). I'm stuck on a few things and some clarification would help.

  • Dealing with certs and configuring the MutatingWebhookConfiguration requires a lot of work. Is there a library that can make this configuration a bit easier and more dynamic? We could have the server create the MutatingWebhookConfiguration itself to ease some of the configuration pain (a sketch of what that might look like follows this list), but I couldn't think of anything beyond that.
  • Is embedding the mutating webhook as part of the cloud-controller-manager something we can consider? If this is a common behaviour we need to support for each provider, we could add a mutating webhook server in the CCM, but this seems awkward and runs against the idea that these webhooks should run separately from the core control plane (please correct me if I'm wrong).
  • Is there anything being worked on for GA admission webhooks that would improve the above?
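
On the first bullet: a very rough sketch, assuming we go with the now-GA admissionregistration.k8s.io/v1 API and a recent client-go, of the webhook server registering its own MutatingWebhookConfiguration at startup. The pvl-webhook name, kube-system service, and /mutate-pv path are placeholders, and the caBundle wiring is left out.

```go
package main

import (
	"context"
	"log"

	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// registerWebhook creates a MutatingWebhookConfiguration that points PV CREATE
// requests at a pvl-webhook Service; caBundle would come from however the
// serving certificate is provisioned.
func registerWebhook(clientset kubernetes.Interface, caBundle []byte) error {
	failurePolicy := admissionregistrationv1.Ignore
	sideEffects := admissionregistrationv1.SideEffectClassNone
	path := "/mutate-pv"

	cfg := &admissionregistrationv1.MutatingWebhookConfiguration{
		ObjectMeta: metav1.ObjectMeta{Name: "pvl-webhook"},
		Webhooks: []admissionregistrationv1.MutatingWebhook{{
			Name: "pvl.example.com", // placeholder webhook name
			ClientConfig: admissionregistrationv1.WebhookClientConfig{
				Service: &admissionregistrationv1.ServiceReference{
					Namespace: "kube-system",
					Name:      "pvl-webhook",
					Path:      &path,
				},
				CABundle: caBundle,
			},
			Rules: []admissionregistrationv1.RuleWithOperations{{
				Operations: []admissionregistrationv1.OperationType{admissionregistrationv1.Create},
				Rule: admissionregistrationv1.Rule{
					APIGroups:   []string{""},
					APIVersions: []string{"v1"},
					Resources:   []string{"persistentvolumes"},
				},
			}},
			FailurePolicy:           &failurePolicy,
			SideEffects:             &sideEffects,
			AdmissionReviewVersions: []string{"v1beta1"}, // matches the handler sketch above
		}},
	}

	_, err := clientset.AdmissionregistrationV1().MutatingWebhookConfigurations().
		Create(context.TODO(), cfg, metav1.CreateOptions{})
	return err
}

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	if err := registerWebhook(kubernetes.NewForConfigOrDie(config), nil /* caBundle */); err != nil {
		log.Fatal(err)
	}
}
```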

Thanks!

@liggitt
Member

liggitt commented Feb 21, 2019

cc @mbohlool ^

@msau42
Member

msau42 commented Feb 21, 2019

Long term, PVs should only need PV.NodeAffinity to schedule pods to appropriate nodes, and this will be populated through dynamic provisioning without requiring the admission controller. As far as I'm aware, at least the aws, gce, and azure cloud providers have had this logic since 1.12.

So the admission controller for PV labeling only serves two purposes now:

  • Adds PV.NodeAffinity to PVs that were manually created by users (instead of via dynamic provisioning). Can we require that users manually set PV.NodeAffinity if they want their volumes to be scheduled correctly in multi-zone environments? Maybe not; this could be considered backwards-compatibility breaking since it was working fine before with the zone labels. (A sketch of the PV.NodeAffinity in question is at the end of this comment.)
  • Adds zone labels, which are purely informational now. It is planned for the VolumeZone predicate to be deprecated and removed, and replaced by the VolumeBinding predicate, which uses PV.NodeAffinity, not PV labels.

We also have to consider whether we need to add additional logic for PVs created manually for CSI volume types. CSI drivers define their own topology keys that are different from the Kubernetes labels (because CSI drivers are CO-agnostic). As it stands today, PVs backed by CSI drivers that are manually created by users do not get any auto-labeling.
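
To make the PV.NodeAffinity piece concrete, here is a minimal sketch (using the core/v1 Go types) of what a manually created PV would need to carry for the VolumeBinding predicate to place pods correctly. The PV name, label key, and zone value are illustrative only.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// examplePV builds a PV whose NodeAffinity restricts it to a single zone,
// which is what either dynamic provisioning, an admission controller, or the
// user creating the PV by hand would have to populate.
func examplePV() *corev1.PersistentVolume {
	return &corev1.PersistentVolume{
		ObjectMeta: metav1.ObjectMeta{Name: "manually-created-pv"},
		Spec: corev1.PersistentVolumeSpec{
			NodeAffinity: &corev1.VolumeNodeAffinity{
				Required: &corev1.NodeSelector{
					NodeSelectorTerms: []corev1.NodeSelectorTerm{{
						MatchExpressions: []corev1.NodeSelectorRequirement{{
							Key:      "failure-domain.beta.kubernetes.io/zone",
							Operator: corev1.NodeSelectorOpIn,
							Values:   []string{"us-east-1a"},
						}},
					}},
				},
			},
		},
	}
}

func main() {
	fmt.Printf("%+v\n", examplePV().Spec.NodeAffinity)
}
```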

@andrewsykim
Member Author

Can we require that users manually set PV.NodeAffinity if they want their volumes to be scheduled correctly in multi-zone environments?

So this might be a reasonable expectation for out-of-tree cloud providers. The only reason we are investigating an out-of-tree PV labelling mechanism is to support the existing legacy behaviour of labelling PVs that the current in-tree providers rely on. But if we are temporarily supporting this legacy behaviour via the in-tree PVL admission controller, and the long-term goal is to only require PV.NodeAffinity, then maybe there's no need for out-of-tree PV labelling at all, and we should be steering new providers towards the end goal rather than also having them deal with deprecating/removing labels in the future?

Maybe we can add a controller to the CCM that updates PV.NodeAffinity only? The only concern with that is the volume can bind to a pod before the NodeAffinity rule is applied (something Initializers solved for us). Maybe this possible race condition is acceptable if we're only talking about manually created PVs?

We also have to consider whether we need to add additional logic for PVs created manually for CSI volume types. CSI drivers define their own topology keys that are different from the Kubernetes labels (because CSI drivers are CO-agnostic). As it stands today, PVs backed by CSI drivers that are manually created by users do not get any auto-labeling.

To clarify, PVs created by CSI drivers don't have auto labeling support but they set their own keys as part of PV.NodeAffinity? Are these keys common across all CSI drivers?

@msau42
Member

msau42 commented Feb 22, 2019

Yeah I think we want to add PV.NodeAffinity at admission time, or require the users to set it (could break backwards compatibility).

PVs that CSI drivers dynamically provision will automatically get PV.NodeAffinity. However the keys they use are vendor-specific, and not the common Kubernetes topology labels. For example, the topology key that GCE PD CSI driver uses is "topology.gke.io/zone". The primary reasons for this were:

  • CSI drivers are supposed to be CO-agnostic so should not use K8s specific names/labels
  • Multiple CSI drivers from the same organization may want to use the same topology keys. For example, GCP has two CSI drivers and both have the same notion of a zone.

But this was also decided with the thought that Kubernetes wouldn't make zones/regions first class labels and all topology would be provider/user-defined. Maybe we can reconsider the CSI decision now.
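
For illustration, the same zone constraint expressed with the vendor-specific CSI topology key mentioned above versus the Kubernetes zone label; the zone value is made up.

```go
package example

import corev1 "k8s.io/api/core/v1"

// The same zone requirement, once with the vendor-specific topology key a CSI
// driver (here: GCE PD) writes into PV.NodeAffinity, and once with the legacy
// Kubernetes zone label the in-tree admission controller sets.
var (
	csiRequirement = corev1.NodeSelectorRequirement{
		Key:      "topology.gke.io/zone",
		Operator: corev1.NodeSelectorOpIn,
		Values:   []string{"us-central1-a"},
	}
	inTreeRequirement = corev1.NodeSelectorRequirement{
		Key:      "failure-domain.beta.kubernetes.io/zone",
		Operator: corev1.NodeSelectorOpIn,
		Values:   []string{"us-central1-a"},
	}
)
```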

@msau42
Member

msau42 commented Feb 22, 2019

Oh sorry, I forgot to clarify: users that manually create PVs backed by CSI drivers do not automatically get labels or PV.NodeAffinity. So that behavior is already different from in-tree volumes.

@andrewsykim
Member Author

Thanks for clarifying.

For CSI driver support, it seems like we need to have a longer discussion about this. For v1.14 though, we do need to figure out what needs to happen for the out-of-tree PV labelling admission controller.

Seems like the options are:

  1. Remove the out-of-tree PV labeler controller entirely and backport zone labelling support to the in-tree PVL admission controller where possible. For volumes not in-tree, we would require users to manually set PV.NodeAffinity (would this be breaking backwards compatibility if that volume never had PV labeling support in the first place?). This would assume that longer-term we want to remove the zone label functionality entirely as you mentioned.
  2. Build support for a mutating admission webhook that users can deploy for clusters using an out-of-tree provider. My concern with this is that users won't actually go through the setup steps of configuring the admission webhook. My assumption here is based on the fact that very few users used the existing PV labeler controller because it required enabling an alpha feature and setting up the InitializerConfiguration for PVs.
  3. Revert the PR that removes Initializer support and re-enable the PV labeler controller until we have a better story for this. I don't think we want to go in this direction unless absolutely necessary. From what I can tell, very few users were relying on this behaviour and we've accounted for those users already by backporting volume support into the in-tree PV admission controller (Cinder & vSphere volumes).

For the v1.14 time frame, I'm personally in favor of 1) and making sure we follow through with the long term goal of removing zone labelling support entirely and only using PV.NodeAffinity as you mentioned.

@msau42
Member

msau42 commented Feb 22, 2019

Remove the out-of-tree PV labeler controller entirely and backport zone labelling support to the in-tree PVL admission controller where possible. For volumes not in-tree, we would require users to manually set PV.NodeAffinity (would this be breaking backwards compatibility if that volume never had PV labeling support in the first place?). This would assume that longer-term we want to remove the zone label functionality entirely as you mentioned.

For volumes out of tree, even the out-of-tree PV labeler didn't work for them, right? Because it needs to query each cloud provider for the zone, and the only out-of-tree types that we implemented were the same as the in-tree types.

@andrewsykim
Member Author

andrewsykim commented Feb 23, 2019

Because it needs to query each cloud provider for the zone, and the only out-of-tree types that we implemented were the same as the in-tree types.

The out-of-tree PV labeler was a bit naive and applied the PV labels to every PV with the initializer flag set (maybe this is a bug?), so there was nothing stopping an out-of-tree provider from implementing GetLabelsForVolume and applying those labels to all PVs in the cluster. This is likely one of the reasons that it wasn't adopted much in the first place and maybe another reason we should remove it.

@msau42
Member

msau42 commented Feb 23, 2019

I'm not familiar with the original out-of-tree labeler design. What determined what type of PV gets the initializer flag? I.e., how do you distinguish between an nfs PV and a gce PV?

@andrewsykim
Member Author

andrewsykim commented Feb 25, 2019

I'm not familiar with the original out-of-tree labeler design. What determined what type of PV gets the initializer flag? I.e., how do you distinguish between an nfs PV and a gce PV?

Same here 😆. But from what I'm reading, if you configure your cluster correctly (i.e. InitializerConfiguration is set for PVs and the Initializer admission controller is enabled), then all PVs in your cluster, regardless of type, would be processed by the PV labeler controller.

The original PR kubernetes/kubernetes#44680 actually did check whether the PV was of the correct type (supported AWS EBS / GCE PD), but then a follow-up PR kubernetes/kubernetes#52169 replaced that check with a call to cloudprovider.GetLabelsForVolume. I think the intention there was to have GetLabelsForVolume check that the volume was the correct type before labelling it, but I don't see any providers doing that today (except Cinder). (cc @rrati @dims).
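
For reference, a hedged sketch of the kind of type check being described, against the GetLabelsForVolume signature from the cloud-provider PVLabeler interface. The provider type and the zone value are hypothetical; a real provider would look the zone up via its cloud API.

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
)

// fakeProvider is a hypothetical provider used only to illustrate the check.
type fakeProvider struct{}

// GetLabelsForVolume matches the cloudprovider.PVLabeler method. The point for
// this discussion is the type check: only label volumes this provider actually
// owns, instead of labelling every PV handed to it.
func (p *fakeProvider) GetLabelsForVolume(ctx context.Context, pv *corev1.PersistentVolume) (map[string]string, error) {
	// Skip volume types this provider does not manage (nfs, hostPath, other
	// clouds' volumes, ...); returning nil means "no labels to add".
	if pv.Spec.AWSElasticBlockStore == nil {
		return nil, nil
	}
	// Hypothetical zone lookup; a real provider would derive this from the
	// volume ID via its cloud API.
	zone := "us-east-1a"
	return map[string]string{"failure-domain.beta.kubernetes.io/zone": zone}, nil
}
```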

The more I dig into the history of this controller, the more hesitant I am to try to find a replacement for v1.14, because it seems like:

  • it was not used - because InitializerConfiguration was required which also required users to enable an alpha feature. I also asked maintainers of various out-of-tree providers.
  • it did not have feature parity with in-tree admission - I think users would have reported issues if they noticed their PVs were being labelled incorrectly (or maybe we want all PVs regardless of type to have zone labels?)

How do we feel about getting rid of the out-of-tree PVL controller for v1.14, backporting volumes to the in-tree PVL admission controller for the time being (where it makes sense) until we have a better solution in place for v1.15 that also accounts for:

  • the long-term plan to only use PV.NodeAffinity for volumes (for scheduling at least)
  • use of the new GA labels for zone/region (topology.kubernetes.io/{zone,region}; see the mapping sketched below)
  • Kubernetes topology support for CSI drivers
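
The zone/region label rename in the second bullet, spelled out; the keys below are the legacy beta labels and their GA replacements (the values a provider would set stay the same, only the keys change).

```go
package example

// Legacy (beta) zone/region label keys and their GA replacements. A migration
// would set the topology.kubernetes.io keys going forward and treat the
// failure-domain.beta.kubernetes.io keys as deprecated aliases.
var legacyToGALabels = map[string]string{
	"failure-domain.beta.kubernetes.io/zone":   "topology.kubernetes.io/zone",
	"failure-domain.beta.kubernetes.io/region": "topology.kubernetes.io/region",
}
```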

@andrewsykim
Member Author

cc @cheftako

@msau42
Member

msau42 commented Feb 26, 2019

Hm yes, looking at the gce implementation at least, this function will segfault if it was passed any PV type other than gce pd...

I'm ok with getting rid of the controller for 1.14 considering that it's alpha and has various issues.

@andrewsykim
Member Author

Deleted the PV labeler controller for v1.14; will work with SIG Storage on next steps for v1.15.

@andrewsykim
Member Author

/milestone v1.15
/priority important-longterm

note: update issue to only scope investigation or KEP

@k8s-ci-robot added the priority/important-longterm label Mar 20, 2019
@andrewsykim modified the milestones: v1.14, v1.15 Mar 20, 2019
@andrewsykim changed the title from "Consider alternatives for out-of-tree PersistentVolume labelling" to "Investigate alternatives for out-of-tree PersistentVolume labelling" Mar 20, 2019
@andrewsykim added P1 Priority 1 and removed priority/important-longterm labels Mar 28, 2019
@andrewsykim removed this from the v1.15 milestone Jun 12, 2019
@andrewsykim added this to the v1.16 milestone Jun 12, 2019
@andrewsykim added P2 Priority 2 and removed P1 Priority 1 labels Jul 10, 2019
@andrewsykim modified the milestones: v1.16, Next Jul 10, 2019
@andrewsykim
Member Author

/close

PVs in the long term won't be using labels for topology anyway, so I don't think this is needed anymore. ref kubernetes/kubernetes#72139

@k8s-ci-robot
Contributor

@andrewsykim: Closing this issue.

In response to this:

/close

PVs in the long term won't be using labels for topology anyway, so I don't think this is needed anymore. ref kubernetes/kubernetes#72139

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@msau42
Member

msau42 commented Jul 10, 2019

PVs do not need labels for scheduling. However, it is unclear whether we still want an admission controller that automatically generates PV.NodeAffinity for manually created PVs, or whether we should require users who manually create their PVs to also fill in PV.NodeAffinity themselves. The current expectation is that the admission controller does this.
