Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add StatefulSet Volume Expansion Kep #660

Conversation

SidakM-zz
Copy link

@SidakM-zz SidakM-zz commented Dec 21, 2018

Add Kep for StatefulSet Volume Expansion.
Wrote this up quickly after the discussion here:
kubernetes/kubernetes#71384

Feature tracking issue: #661

Let me know if something is missing or if this is the right way to merge this 🤔

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 21, 2018
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/pm size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 21, 2018
@SidakM-zz
Copy link
Author

/assign @janetkuo


### Risks and Mitigations

Under the `ExpandPersistentVolumes` feature, pods referencing a volume must be restarted for file system expansion to occur after the `FileSystemResizePending` condition is True on the persistent volume claim. If the StatefulSet is configured with the `RollingUpdate` update strategy, then the StatefulSet controller would need to wait for the `FileSystemResizePending` condition to be satisfied on each persistent volume claim referenced by the pod it is updating before restarting the pod. This could noticeably increase the update time of a StatefulSet with many replicas.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can check if the ExpandInUsePersistentVolumes feature gate enabled, and restart the pods immediately without waiting controller side resizing finished if it is enabled.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed we can. That would just translate into not running the following check if that feature flag in enabled in the current implementation: https://github.com/kubernetes/kubernetes/pull/71384/files#diff-988743e0eb9313bcfa8fd43ac3a0977fR229

@janetkuo janetkuo added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Dec 21, 2018
@janetkuo
Copy link
Member

@kubernetes/sig-apps-feature-requests @kubernetes/sig-storage-api-reviews

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Dec 21, 2018

While updating a pod, the StatefulSet controller will update a referenced persistent volume claim object if its storage request in the associated `volumeClaimTemplate` has been increased.

The pod will be restarted after the `FileSystemResizePending` condition is True on all updated persistent volume claims it references.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "The StatefulSet controller will delete and recreate the pod after..."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if ExpandInUsePersistentVolumes feature is enabled, we don't need to kill the pod.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking into this right now and to not kill the pod we would have to recognize that only the the volumeClaimTemplate storage requests have been modified and ExpandInUsePersistentVolumes is enabled. This would be achieved by comparing the current and updated version of the statefulsets which are formed by applying the patches in the updateRevision and currentRevision to a representation of the statefulset (in this case it's just named set).

The issue is that the representation of the statefulset these patches are applied to seems to be the updated version of the StatefulSet. This is problematic whilst forming the current representation of the statefulset. For example, if the updated representation of the set has two labels {"label1": "value1", "label2": "value2"} while the current version only has 1 label {"label1": "value1"}. Applying the currentRevision patch yields a set with labels {"label1": "value1", "label2": "value2"} which is not the current version of the StatefulSet. This issue is problematic as it makes comparisons between the currentSet and updateSet unreliable to determine if termination of pods is required(as we are trying to achieve above).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can start with destroying the Pod and recreating it. It's a trade off in that workloads that need to rebuild a large in memory working set from persistent data would greatly benefit from in place resizing. However, all other operations that operate on a StatefulSet Pod currently require that the Pod be destroyed and recreated. If we want to support in place mutation later we can do so.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, that's also how its currently implemented in the PR so I will leave that as is. I'll make changes to this proposal to reflect that.

While updating a pod, the StatefulSet controller will update a referenced persistent volume claim object if its storage request in the associated `volumeClaimTemplate` has been increased.

The pod will be restarted after the `FileSystemResizePending` condition is True on all updated persistent volume claims it references.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VolumeClaimTemplate should be recorded in the StatefulSet's ControllerRevision object as well for rollback.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being done in implementation forgot to record it here 🤦‍♂️


The apiserver will allow for increases to storage requests in the `volumeClaimTemplates` component of a StatefulSet.

During the StatefulSet update process, the StatefulSet controller will detect an update to a `volumeClaimTemplate` by comparing the updated and current revision of the StatefulSet.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change should be detected by looking at the StatefulSet's ControllerRevision objects.

Copy link
Author

@SidakM-zz SidakM-zz Dec 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The updateSet is formed by applying the updateRevision (ControllerRevision) to the current set. That should be adequate right?

Under the `ExpandPersistentVolumes` feature, pods referencing a volume must be restarted for file system expansion to occur after the `FileSystemResizePending` condition is True on the persistent volume claim. If the StatefulSet is configured with the `RollingUpdate` update strategy, then the StatefulSet controller would need to wait for the `FileSystemResizePending` condition to be satisfied on each persistent volume claim referenced by the pod it is updating before restarting the pod. This could noticeably increase the update time of a StatefulSet with many replicas.

A potential mitigation would be the eventual adoption of the `ExpandInUsePersistentVolumes` alpha feature in Kubernetes v1.11 which enables file system expansion on volumes being used by a pod.

Copy link
Member

@janetkuo janetkuo Dec 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's optional, but you can include the name of the feature gate of this new feature in this proposal, e.g. StatefulSetVolumeExpansion (just a candidate, feel free to propose other names).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StatefulSetVolumeExpansion sound good to me :)

@msau42
Copy link
Member

msau42 commented Dec 26, 2018

/assign @gnufied
who is looking at potential changes to the online volume resizing feature


While updating a pod, the StatefulSet controller will update a referenced persistent volume claim object if its storage request in the associated `volumeClaimTemplate` has been increased.

The Statefulset controller will delete and recreate the pod after the `FileSystemResizePending` condition is True on all updated persistent volume claims it references. However, if the `ExpandInUsePersistentVolumes` feature is enabled, then deleting the pod is unnecessary to complete file system expansion on updated persistent volumes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I like this bit. It kinda circumvents the fact that currently volume resizing is offline by default and it kinda implements a redundant mechanism that will become unnecessary when online resizing graduates to beta.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least for alpha feature I propose skipping this mechanism because once we implement this - user will come to rely on it and it will be harder to go back on it. We should also recognize the fact that certain volume types can not be resized when in-use (like AzureDisk for example). Currently this information is surfaced to the user via PVC events if a PVC can't be resized when used in a pod.

Copy link
Author

@SidakM-zz SidakM-zz Jan 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I like this bit. It kinda circumvents the fact that currently volume resizing is offline by default and it kinda implements a redundant mechanism that will become unnecessary when online resizing graduates to beta.

Not sure what mechanism you are referring to here. Is it the mechanism of detecting the Alpha feature being enabled and not killing the pod? Or are you suggesting that we don't kill the pod in any case since online resizing will eventually graduate to beta.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I am suggesting we skip killing the pods in stateful set controller since online resizing will eventually graduate to beta (most likely in 1.14 release itself).

Copy link
Author

@SidakM-zz SidakM-zz Jan 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh ok makes sense. I was just confused because it is killing the pod by default and a mechanism would need to be implemented to ensure that the pod is not killed at the end of the update loop*.

We should also recognize the fact that certain volume types can not be resized when in-use (like AzureDisk for example). Currently this information is surfaced to the user via PVC events if a PVC can't be resized when used in a pod.

So do we want to make an exception in this case to kill the pod from the controller or just update and let the event surface on the PVCs(and possibly add an additional indicator) ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to start with always restarting the Pod after the volume increase request is fulfilled. (As below) the workloads controllers don't to in place updates for any other container or Pod modifications today. We can (maybe should) consider optimizing to detect and implement in place updates at a later point in time.


The Statefulset controller will delete and recreate the pod after the `FileSystemResizePending` condition is True on all updated persistent volume claims it references. However, if the `ExpandInUsePersistentVolumes` feature is enabled, then deleting the pod is unnecessary to complete file system expansion on updated persistent volumes.

The functionality provided by this enhancement will be gated by the `StatefulSetVolumeExpansion` feature gate.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to think about the patch mechanism used to apply controller revisions and the implications to rollback (both of StatefulSets and Kubernetes API Servers) when this feature is enabled. Also, you to think about modifications to validation that are necessary to allow mutation of the VolumeClaimsTemplate and how that effects backward compatibility during upgrade. It may be necessary to relax the constraint in the API server one release prior to adding the expansion feature into StatefulSet to enable upgrade in multi-API Server clusters and to allow for safe downgrades/rollbacks.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the implementation PR, the apiserver only allows increases to storageRequests if the feature is enabled (relaxing validation). Mentioned it here in this proposal.

I don't think there should be issues with backwards compatibility for the api, unless a client has (possibly inadvertently) built functionality that relied upon receiving an error response while modifying storageRequests in volumeClaimTemplates before this feature is added.

In regards to patches, for the implementation PR I just added the volumeClaimTemplates to the patch. However, rollbacks seem to be a problem since only volume expansion is supported. So rolling back to a previous revision after expanding a volume would imply the client is attempting to shrink volumes (which is unsupported). Currently the apiserver will err if the client attempts such a rollback and relay to the client that storage requests can only be increased. This can be an issue if the client has attempted a volume expansion, which for some reason failed on all PVCs, and then attempts to rollback the change. In all other cases of rolling back volume expansion changes (partial or full successes) I'm not sure if a rollback should be allowed. On the other hand, this can be problematic if the client attempts to rollback some other change but is prevented from using that functionality as they had also expanded a volumeClaimTemplate in the same or any following revision.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For PVC expansion feature, we also check if expansion is explicitly enabled in SC via an admission controller. Will this be possible to do that here for Stateful sets? It looks like - if feature gate is enabled volume expansion will generally be available for all volume types which is problematic (since not all volume types support expansion).

The other tricky bit this design does not address is - how it will handle expansion of volumes that only can be expanded offline. For PVC themselves, it is not that big of deal. But for stateful sets - user must take the Pod offline before volume that the pod was using can be expanded. So in effect - will this mean scaling entire stateful set to 0?

Copy link
Author

@SidakM-zz SidakM-zz Jan 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other tricky bit this design does not address is - how it will handle expansion of volumes that only can be expanded offline. For PVC themselves, it is not that big of deal. But for stateful sets - user must take the Pod offline before volume that the pod was using can be expanded. So in effect - will this mean scaling entire stateful set to 0?

As it says in the design The Statefulset controller will delete and recreate the pod after the FileSystemResizePending condition is True on all updated persistent volume claims it references.. Thus it will wait until that condition is reached before deleting the pod. This is how I have it implemented in the PR. This way the pod will continue to be available until all updated PVCs that must be expanded offline have the FileSystemResizePending condition (and be deleted and recreated after that).

Let me know if this addresses that concern.

For PVC expansion feature, we also check if expansion is explicitly enabled in SC via an admission controller. Will this be possible to do that here for Stateful sets? It looks like - if feature gate is enabled volume expansion will generally be available for all volume types which is problematic (since not all volume types support expansion).

I did see the PersistentVolumeClaimResize admission controller when initially implementing this. I wasn't sure if it would be necessary to build a similar(perhaps redundant) admission controller for a statefulset as when the statefulset controller eventually attempts to update the pvc it would catch the error sent by the admission controller and the update to the statefulset would not succeed. On the other hand building a specific admission controller for this would ensure that such an update to a statefulset is never validated.

Copy link
Member

@gnufied gnufied Jan 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if this addresses that concern.

Hmm not entirely. Thing is for some volume types(Azure Disk for example), the volume can not be expanded via control-plane call if volume is being used. So literally no expansion can happen to volume while it is in use. The user must first detach the volume from the node before he can call Azure API call that expands the volume. So in this case - there will not be a FileSystemResizePending added to PVC because control-plane call itself has not succeeded (and will not succeed until volume is taken offline).

I wasn't sure if it would be necessary to build a similar(perhaps redundant) admission controller for a statefulset as when the statefulset controller eventually attempts to update the pvc it would catch the error sent by the admission controller and the update to the statefulset would not succeed.

I think, this is already too late. api-server has at this point "accepted" the API request that it can not fulfill. Will stateful set controller not retry PVC update if first update has failed? The idea behind rejecting the request as early as possible is - controllers shouldn't have to keep retrying it. It may be worth expanding existing admission controller to reject stateful set updates in similar fashion.

Copy link
Author

@SidakM-zz SidakM-zz Jan 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thing is for some volume types(Azure Disk for example), the volume can not be expanded via control-plane call if volume is being used. So literally no expansion can happen to volume while it is in use.

Oh ok. I read through this PR and I think I get what you mean now. I see two issues with trying to get it to work with such volumes types.

  1. For each pvc template in the statefulset it would now be necessary to query the api for its storage class to find the provisioner. Then it would be possible to determine how to apply the update. This is doable.

  2. The statefulset controller is structured so that all updates to associated resources and the pod spec occur before the pod it deleted and recreated (as this makes sense in almost every context when updating a statefulset). However, to support the volume types above we would be required to delete the pod before its associated PVCs can be updated. This changes/breaks the flow of how the statefulset is currently updated even though it should be possible to implement.

So in effect - will this mean scaling entire stateful set to 0?

Well since there is a 1:1 map from PVC to PV monotonic updates should still be possible. (delete pod, update pvc request, wait until condition reached, restart pod).

@kow3ns how do you think this should this be addressed


A potential mitigation would be the eventual adoption of the `ExpandInUsePersistentVolumes` feature, which enables file system expansion on in-use volumes thus, making it unecessary to delete and recreate any refrencing pods during volume expansion.

## Graduation Criteria
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want to add a bit more (as suggested above) about we plan to roll the feature out and how we plan to test it.


Move to Alpha after initial implementation and approvals.

Move to Beta once feature has been in Alpha for a set duration with no issues being reported.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also like a notional outline of GA graduation criteria. Another concern that should be addressed is that, as the modification effects a GA API, enabling by default at Beta will effectively release it globally. This should at least be called out.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is a notional outline? 😅 Is it an outline which attempts to identify possible scenarios and the suggested approach to handling them?

@kow3ns
Copy link
Member

kow3ns commented Jan 17, 2019

I don't see any API changes proposed so I don't think this needs API approvers. I'd like to discuss this at SIG Apps. Generally, I think its something that is necessary to do in order to support the volume resizing feature implemented at that storage layer, and I think we should (pending a few changes) merge it as accepted and work on getting it to implementable.
@mattfarina @prydonius

Copy link
Member

@justaugustus justaugustus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove any references to NEXT_KEP_NUMBER and rename the KEP to just be the draft date and KEP title.
KEP numbers will be obsolete once #703 merges.

@gnufied
Copy link
Member

gnufied commented Jan 23, 2019

@kow3ns since this relies on relaxing an API validation, it was decided to run this by api reviewers as well.

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 22, 2020
@ayanamist
Copy link

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 23, 2020
@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 21, 2020
@k8s-ci-robot
Copy link
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Aug 21, 2020
@justaugustus justaugustus requested review from justaugustus and removed request for justaugustus August 22, 2020 17:05
@fejta-bot
Copy link

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 25, 2020
@fejta-bot
Copy link

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Copy link
Contributor

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@travisghansen
Copy link

/reopen

@k8s-ci-robot
Copy link
Contributor

@travisghansen: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Bessonov
Copy link

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 17, 2021
@cprivitere
Copy link
Member

So...where is this at?

@m-yosefpor
Copy link

So...where is this at?

See here #2842

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-review Categorizes an issue or PR as actively needing an API review. approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.