[BUG] FileSystemResizePending event is not emitted on PVC resizing #2749

Open
almorgv opened this issue Jun 30, 2021 · 13 comments

@almorgv

almorgv commented Jun 30, 2021

Describe the bug
FileSystemResizePending event is not emitted on PVC resizing.

We use strimzi-kafka-operator, which handles volume resizing and waits for the FileSystemResizePending condition to restart pods:

The Cluster Operator automatically changes the requested volume size in the PVCs and waits until a restart of the pod is required. Once the condition of the PVC is set to FileSystemResizePending (read the original blog post for more information about the different states the PVC can be in during the resizing), Strimzi automatically restarts the pod using this PVC.

But once the PVC's requested size is modified, its status changes to Resizing.

PVC events:

Normal 	Resizing 	External resizer is resizing volume pvc-68e713ee-2a1d-488a-a146-2dbbbe8b24ab 	a few seconds ago
Warning 	VolumeResizeFailed 	resize volume pvc-68e713ee-2a1d-488a-a146-2dbbbe8b24ab failed: rpc error: code = FailedPrecondition desc = Invalid volume state attached for expansion 	a few seconds ago
Warning 	ExternalExpanding 	Ignoring the PVC: didn't find a plugin capable of expanding the volume; waiting for an external controller to process this PVC. 	3 minutes ago

To Reproduce
Steps to reproduce the behavior:

  1. Create a PVC and volume of any size
  2. Resize the PVC by modifying the requested size
  3. See the PVC status Resizing and no FileSystemResizePending in the events (a client-go sketch of these steps follows below)
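
For reference, a minimal client-go sketch of these steps. The PVC name ("data-test-0") and namespace ("default") are hypothetical; any Longhorn-backed PVC with an attached pod will do.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	pvcs := kubernetes.NewForConfigOrDie(cfg).CoreV1().PersistentVolumeClaims("default")

	// Step 2: bump the requested size on an existing, attached PVC.
	pvc, err := pvcs.Get(context.TODO(), "data-test-0", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	pvc.Spec.Resources.Requests[corev1.ResourceStorage] = resource.MustParse("20Gi")
	if _, err := pvcs.Update(context.TODO(), pvc, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}

	// Step 3: inspect the PVC conditions; only Resizing shows up, and the
	// FileSystemResizePending condition never appears.
	pvc, err = pvcs.Get(context.TODO(), "data-test-0", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	for _, cond := range pvc.Status.Conditions {
		fmt.Printf("condition: %v=%v message=%q\n", cond.Type, cond.Status, cond.Message)
	}
}
```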

Expected behavior
FileSystemResizePending status before Resizing

Environment:

  • Longhorn version: 1.1.1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: Rancher v1.20.6
    • Number of management node in the cluster: 3
    • Number of worker node in the cluster: 6
  • Node config
    • OS type and version: Ubuntu 20.04.2
    • CPU per node: 4
    • Memory per node: 16
    • Disk type(e.g. SSD/NVMe): SSD
    • Network bandwidth between the nodes: 10Gi
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): VMWare
  • Number of Longhorn volumes in the cluster: 20
@c3y1huang c3y1huang added this to New in Community Issue Review via automation Jun 30, 2021
@c3y1huang c3y1huang moved this from New to Team review required in Community Issue Review Jul 1, 2021
@c3y1huang
Contributor

c3y1huang commented Jul 1, 2021

Expected behavior
FileSystemResizePending status before Resizing

To clarify the expectation, AFAIK there are 3 phases for PVC expansion in order: Normal -> Resizing -> FileSystemResizePending.

Based on how K8s implements support for the external resize controller, Longhorn has intentionally stopped before the FileSystemResizePending phase here because currently Longhorn only supports offline expansion.

We use strimzi-kafka-operator, which handles volume resizing and waits for the FileSystemResizePending condition to restart pods:

Therefore, I am not sure if this case is still valid even if the FileSystemResizePending condition is there, because Longhorn requires the volume to be in the detached state to prevent the frontend expansion from being interfered with by unexpected data R/W. Thus, there shouldn't be a pod waiting to be restarted...
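
For illustration, a hedged sketch (not Longhorn's actual code) of how a CSI driver advertises offline-only expansion through GetPluginCapabilities; with only the OFFLINE capability, expansion is expected to happen while the volume is detached:

```go
package main

import (
	"context"
	"fmt"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
)

type identityServer struct{}

// GetPluginCapabilities reports volume expansion support restricted to
// volumes that are not currently attached (OFFLINE).
func (s *identityServer) GetPluginCapabilities(ctx context.Context, req *csi.GetPluginCapabilitiesRequest) (*csi.GetPluginCapabilitiesResponse, error) {
	return &csi.GetPluginCapabilitiesResponse{
		Capabilities: []*csi.PluginCapability{
			{
				Type: &csi.PluginCapability_VolumeExpansion_{
					VolumeExpansion: &csi.PluginCapability_VolumeExpansion{
						Type: csi.PluginCapability_VolumeExpansion_OFFLINE,
					},
				},
			},
		},
	}, nil
}

func main() {
	resp, _ := (&identityServer{}).GetPluginCapabilities(context.Background(), &csi.GetPluginCapabilitiesRequest{})
	fmt.Println(resp.Capabilities)
}
```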

@c3y1huang c3y1huang moved this from Team review required to Pending user response in Community Issue Review Jul 1, 2021
@almorgv
Author

almorgv commented Jul 1, 2021

AFAIK there are 3 phases for PVC expansion in order: Normal -> Resizing -> FileSystemResizePending.

I did not know that Resizing comes before FileSystemResizePending.

Therefore, I am not sure if this case is still valid even if the FileSystemResizePending condition is there, because Longhorn requires the volume to be in the detached state to prevent the frontend expansion from being interfered with by unexpected data R/W. Thus, there shouldn't be a pod waiting to be restarted...

So you mean that there is no way for Longhorn to proceed to the FileSystemResizePending state while the pod is running, right? And it seems like simple pod deletion does not trigger resizing because the volume does not have enough time to detach before the pod is restarted.

@c3y1huang
Contributor

c3y1huang commented Jul 1, 2021

So you mean that there is no way for Longhorn to proceed to the FileSystemResizePending state while the pod is running, right? And it seems like simple pod deletion does not trigger resizing because the volume does not have enough time to detach before the pod is restarted.

Longhorn does not support volume expansion while the pod is running.

@almorgv
Author

almorgv commented Jul 1, 2021

Longhorn does not support volume expansion while the pod is running.

But can Longhorn wait for the volume to detach in the FileSystemResizePending state? Is it mandatory to wait in Resizing?

@c3y1huang
Contributor

c3y1huang commented Jul 1, 2021

But can Longhorn wait for the volume to detach in the FileSystemResizePending state? Is it mandatory to wait in Resizing?

The volume needs to be in the detached state before the Longhorn controller server calls for volume expansion, and this is when the Resizing condition will be added to the PVC, indicating that resizing has started. And because the volume is not detached, the PVC event then shows VolumeResizeFailed. So in your case, the expansion would not succeed and would not reach the FileSystemResizePending condition even if Longhorn supported it.
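
To illustrate the flow described above, here is a hedged sketch (not Longhorn's actual implementation) of a CSI ControllerExpandVolume handler that rejects expansion while the volume is attached, producing a FailedPrecondition error like the "Invalid volume state attached for expansion" event in the report. The volumeState map is a hypothetical stand-in for Longhorn's volume state:

```go
package main

import (
	"context"
	"fmt"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type controllerServer struct {
	volumeState map[string]string // volume ID -> "attached" or "detached"
}

func (s *controllerServer) ControllerExpandVolume(ctx context.Context, req *csi.ControllerExpandVolumeRequest) (*csi.ControllerExpandVolumeResponse, error) {
	state, ok := s.volumeState[req.GetVolumeId()]
	if !ok {
		return nil, status.Errorf(codes.NotFound, "volume %s not found", req.GetVolumeId())
	}
	// Offline expansion only: while the volume is attached the call fails,
	// so the PVC keeps the Resizing condition and the external-resizer
	// retries until the volume is detached.
	if state != "detached" {
		return nil, status.Errorf(codes.FailedPrecondition,
			"Invalid volume state %v for expansion", state)
	}
	// ...trigger the actual volume expansion here...
	return &csi.ControllerExpandVolumeResponse{
		CapacityBytes:         req.GetCapacityRange().GetRequiredBytes(),
		NodeExpansionRequired: false, // no node-side step, so no FileSystemResizePending
	}, nil
}

func main() {
	s := &controllerServer{volumeState: map[string]string{"pvc-68e713ee": "attached"}}
	_, err := s.ControllerExpandVolume(context.Background(), &csi.ControllerExpandVolumeRequest{
		VolumeId:      "pvc-68e713ee",
		CapacityRange: &csi.CapacityRange{RequiredBytes: 20 << 30},
	})
	fmt.Println(err) // rpc error: code = FailedPrecondition desc = Invalid volume state attached for expansion
}
```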

@almorgv
Author

almorgv commented Jul 1, 2021

And this is when the Resizing condition will be added to the PVC, indicating that resizing has started

If I got it right, you mean that the Resizing condition should be set only when the volume is detached and the actual resizing has started? But in my case the PVC goes into Resizing and gets stuck in this state while the pod is running and the volume is attached.

Try this

  1. Pod is up and running and the volume is attached
  2. Modify PVC and change requested size
  3. PVC is in Resizing state

@c3y1huang
Contributor

c3y1huang commented Jul 1, 2021

If I got it right, you mean that the Resizing condition should be set only when the volume is detached and the actual resizing has started?

AFAIK it will be added when the ControllerExpandVolume RPC is called.

But in my case the PVC goes into Resizing and gets stuck in this state while the pod is running and the volume is attached.

I believe that is expected according to the KEP.

If ControllerExpandVolume call fails:

Then PVC will retain Resizing condition and will have appropriate events added to the PVC.
Controller will retry resizing operation with exponential backoff, assuming it corrects itself.

Try this

  1. Pod is up and running and the volume is attached
  2. Modify PVC and change requested size
  3. PVC is in Resizing state

You can try to detach the volume and wait for a while (depending on the exponential backoff time); the condition should get removed.
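
A hedged, illustrative sketch of the retry behaviour the KEP describes, using the wait package from k8s.io/apimachinery. This is not the external-resizer's code; the volumeAttached flag is a stand-in for the Longhorn volume state:

```go
package main

import (
	"errors"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// volumeAttached simulates the Longhorn volume state; flip it to false to
// simulate detaching the volume so the expansion can succeed.
var volumeAttached = true

func tryControllerExpand() error {
	if volumeAttached {
		return errors.New("Invalid volume state attached for expansion")
	}
	return nil
}

func main() {
	backoff := wait.Backoff{Duration: time.Second, Factor: 2.0, Steps: 5}
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		if expandErr := tryControllerExpand(); expandErr != nil {
			fmt.Println("expansion failed, PVC keeps the Resizing condition:", expandErr)
			return false, nil // not done; retry after the next backoff interval
		}
		return true, nil // success; the Resizing condition can be cleared
	})
	fmt.Println("result after retries:", err) // wait.ErrWaitTimeout if the volume never detaches
}
```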

@almorgv
Author

almorgv commented Jul 1, 2021

I believe that is expected according to the KEP.

I see, so there is nothing that could be done in Longhorn to help operators that handle PVC expansion and wait for FileSystemResizePending, because Longhorn supports offline expansion only?

You can try to detach the volume and wait for a while (depending on the exponential backoff time); the condition should get removed.

Yeah, you're right.

@c3y1huang
Contributor

I see, so there is nothing that could be done in Longhorn to help operators that handle PVC expansion and wait for FileSystemResizePending, because Longhorn supports offline expansion only?

AFAIK yes, and online volume expansion is in planning.

@almorgv
Author

almorgv commented Jul 2, 2021

Can you please take a look at a comment in strimzi/strimzi-kafka-operator#5234 (comment)?
Is there any status that might indicate that action is needed to resize the PVC, other than Resizing, which is a bit ambiguous?

@c3y1huang
Contributor

c3y1huang commented Jul 5, 2021

Can you please take a look at a comment in strimzi/strimzi-kafka-operator#5234 (comment)?
Is there any status that might indicate that action is needed to resize the PVC, other than Resizing, which is a bit ambiguous?

ATM, Longhorn does not support auto-detection of whether a volume needs to be resized. You can create a feature request for this.

@c3y1huang c3y1huang self-assigned this Jul 6, 2021
@joshimoo
Contributor

joshimoo commented Jul 8, 2021

@almorgv @c3y1huang FSResizePending is only set on the PVC if the CSI driver implements NodeExpandVolume.
https://github.com/gnufied/external-resizer/blob/38e5ebf959a7018a28adad81483a716cfcf58b59/pkg/csi/client.go

We currently don't use the CSI driver for filesystem expansion; we are planning to transition to that though, since it's required for the implementation of PV encryption #1859.
Filesystem expansion via CSI driver issue: #2794

The other issue is that we use an old resizer sidecar which does not check for volumes in use and always tries to call ControllerExpandVolume. We are planning on updating this as part of #2388.
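
Roughly, and hedging since this is not the external-resizer's actual code, the decision looks like this: the FileSystemResizePending condition is only added when the driver reports that node-side (NodeExpandVolume) filesystem expansion is still required after the controller-side expansion:

```go
package main

import "fmt"

// controllerExpandResult is a hypothetical summary of a successful
// ControllerExpandVolume call.
type controllerExpandResult struct {
	NewCapacityBytes      int64
	NodeExpansionRequired bool
}

// conditionAfterControllerExpand returns the PVC condition a resizer would
// set once controller-side expansion succeeds.
func conditionAfterControllerExpand(res controllerExpandResult) string {
	if res.NodeExpansionRequired {
		// Kubelet still has to call NodeExpandVolume to grow the filesystem.
		return "FileSystemResizePending"
	}
	// No node-side step (as with Longhorn today, which handles filesystem
	// expansion outside the CSI driver), so the condition never appears.
	return "" // resize is considered finished
}

func main() {
	fmt.Println(conditionAfterControllerExpand(controllerExpandResult{NodeExpansionRequired: false}))
}
```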

@WolfspiritM

I'm having the same problem right now. I wouldn't even mind restarting the pods manually after updating the PVC, but in the current state it's really hard to expand volumes for StatefulSets without tearing down the whole STS.

Can Longhorn maybe check whether the volume is supposed to be resized before attaching it again, and do the resizing instead? Currently, when the PVC is changed and the pod is stopped, the STS controller recreates it so quickly that a resize isn't possible. I have to delete the STS (with cascade=orphan), then delete the pod, wait 4-5 minutes for it to be resized, and then create the STS again, then continue with the other pods of the STS the same way.

I'd expect it to be in a pending state and when I restart the pod it will wait until the resize happened.
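
For reference, the cascade=orphan part of the workaround described above, expressed with client-go. The StatefulSet name "my-sts", pod "my-sts-0", and namespace "default" are hypothetical:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Equivalent of `kubectl delete sts my-sts --cascade=orphan`: the pods
	// and PVCs are left in place while the controller is removed.
	orphan := metav1.DeletePropagationOrphan
	if err := client.AppsV1().StatefulSets("default").Delete(
		context.TODO(), "my-sts", metav1.DeleteOptions{PropagationPolicy: &orphan}); err != nil {
		panic(err)
	}

	// Now delete one pod at a time and wait for Longhorn to detach and
	// expand its volume before recreating the StatefulSet.
	if err := client.CoreV1().Pods("default").Delete(
		context.TODO(), "my-sts-0", metav1.DeleteOptions{}); err != nil {
		panic(err)
	}
}
```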
