
Better handling of Lustre volumes that go to unrecoverable Failed status #239

Open
kanor1306 opened this issue Jan 28, 2022 · 10 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@kanor1306

kanor1306 commented Jan 28, 2022

Is your feature request related to a problem?/Why is this needed
From time to time I get Lustre volumes in "Failed" state, usually with the following message:
[Screenshot: error message on the failed file system]
which I assume is due to a lack of capacity on the AWS side (as I am not reaching my quota limits).

The issue comes with the PVC: it stays in the Pending state forever, while the volume it represents is in a state that is not recoverable. See:
[Screenshot: the PVC stuck in Pending]

And the file system in the AWS console:
[Screenshot: the file system in Failed state in the AWS console]

The PVC keeps looping, waiting for the volume to be created, while the FSx volume has simply failed, and that will not change.
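For reference, the unrecoverable state is visible through the FSx API. Below is a minimal sketch of my own (using aws-sdk-go v1, with a placeholder file system ID) of the kind of check involved: once Lifecycle reports FAILED, FailureDetails carries a message like the one in the screenshot and the file system never becomes AVAILABLE on its own.

```go
// Sketch only: inspect the lifecycle state of an FSx for Lustre file system.
// Assumes aws-sdk-go v1 and default credential/region resolution.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/fsx"
)

func main() {
	sess := session.Must(session.NewSession())
	client := fsx.New(sess)

	// "fs-0123456789abcdef0" is a placeholder file system ID.
	out, err := client.DescribeFileSystems(&fsx.DescribeFileSystemsInput{
		FileSystemIds: []*string{aws.String("fs-0123456789abcdef0")},
	})
	if err != nil {
		log.Fatal(err)
	}

	for _, fs := range out.FileSystems {
		fmt.Printf("file system %s lifecycle: %s\n",
			aws.StringValue(fs.FileSystemId), aws.StringValue(fs.Lifecycle))
		// When Lifecycle is "FAILED", FailureDetails usually explains why
		// (e.g. insufficient capacity); the state does not recover on its own.
		if aws.StringValue(fs.Lifecycle) == "FAILED" && fs.FailureDetails != nil {
			fmt.Printf("failure details: %s\n", aws.StringValue(fs.FailureDetails.Message))
		}
	}
}
```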

/feature

Describe the solution you'd like in detail
My proposed solution does not solve the issue directly, but at least it allows you to manage the problem yourself. I would like the PVC to go to a different state that makes it clear it is in an unrecoverable situation. That way you could handle the "Failed" situation from within your own software just by checking the status of the PVC, without using the AWS API.
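As a stop-gap, the closest thing I have found to "checking the status of the PVC" today is to look at the warning events attached to the Pending claim, since the PVC phase alone never changes. A rough sketch with client-go follows; the namespace and claim name are placeholders, and what the warning events actually say depends on the provisioner.

```go
// Sketch only: infer a provisioning failure from the events attached to a
// Pending PVC, since the claim itself never leaves the Pending phase.
package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	ns, name := "default", "fsx-claim" // placeholders
	pvc, err := cs.CoreV1().PersistentVolumeClaims(ns).Get(context.TODO(), name, metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}

	if pvc.Status.Phase == corev1.ClaimPending {
		// Inspect the events recorded against the PVC for provisioning warnings.
		events, err := cs.CoreV1().Events(ns).List(context.TODO(), metav1.ListOptions{
			FieldSelector: fmt.Sprintf("involvedObject.kind=PersistentVolumeClaim,involvedObject.name=%s", name),
		})
		if err != nil {
			log.Fatal(err)
		}
		for _, e := range events.Items {
			if e.Type == corev1.EventTypeWarning {
				fmt.Printf("%s: %s\n", e.Reason, e.Message)
			}
		}
	}
}
```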

Describe alternatives you've considered

  • Ideally, when the Lustre volume fails to be created in this manner, I would like the driver to retry the creation of the volume, though I don't know if the AWS SDK allows for this (see the sketch after this list for one possible shape).
  • Another alternative would be for the driver to create a new volume when the creation fails, but I can imagine that this could lead to issues in cases where the new one also goes to "Failed", but for different reasons.
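To make the first alternative above concrete, here is a purely hypothetical sketch (not the driver's actual code; the function and its names are my own illustration) of how a provisioning wait loop could treat a FAILED file system as terminal, deleting it so that a retried CreateVolume can start over instead of waiting forever:

```go
// Sketch only: one possible way a provisioning loop could handle a FAILED
// FSx for Lustre file system. All names here are illustrative assumptions.
package sketch

import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/fsx"
	"github.com/aws/aws-sdk-go/service/fsx/fsxiface"
)

// checkLifecycle returns nil when the file system is AVAILABLE, deletes it and
// returns an error when it is FAILED, and reports any other state as pending.
func checkLifecycle(ctx context.Context, client fsxiface.FSxAPI, fsID string) error {
	out, err := client.DescribeFileSystemsWithContext(ctx, &fsx.DescribeFileSystemsInput{
		FileSystemIds: []*string{aws.String(fsID)},
	})
	if err != nil {
		return err
	}
	if len(out.FileSystems) == 0 {
		return fmt.Errorf("file system %s not found", fsID)
	}
	fs := out.FileSystems[0]

	switch aws.StringValue(fs.Lifecycle) {
	case "AVAILABLE":
		return nil
	case "FAILED":
		// Unrecoverable: record the reason, then clean up so a later retry
		// can provision a fresh file system instead of waiting forever.
		msg := ""
		if fs.FailureDetails != nil {
			msg = aws.StringValue(fs.FailureDetails.Message)
		}
		if _, derr := client.DeleteFileSystemWithContext(ctx, &fsx.DeleteFileSystemInput{
			FileSystemId: aws.String(fsID),
		}); derr != nil {
			return derr
		}
		return fmt.Errorf("file system %s failed and was deleted: %s", fsID, msg)
	default:
		return fmt.Errorf("file system %s is still %s", fsID, aws.StringValue(fs.Lifecycle))
	}
}
```

Whether deleting and retrying is actually safe probably depends on why the file system failed, which is the concern raised in the second alternative.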

Additional context
Another effect of the current situation is that when the PVC is removed, it leaves the Lustre volume behind in AWS, so you need to clean it up manually.

Also note that if you delete the Failed volume, the driver will create a new one and the PVC becomes healthy again (provided the new one does not also go to the Failed state).

Edit: add the Additional context

@kanor1306 kanor1306 changed the title Mark PVC as failed when Lustre volumes goes to unrecoverable Failed status Mark PVC as failed when Lustre volume goes to unrecoverable Failed status Jan 28, 2022
@kanor1306 kanor1306 changed the title Mark PVC as failed when Lustre volume goes to unrecoverable Failed status Better handling of Lustre volume that go to unrecoverable Failed status Feb 16, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 17, 2022
@kanor1306
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 20, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 18, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 17, 2022
@kanor1306
Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 20, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 19, 2022
@jacobwolfaws
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 22, 2022
@kanor1306
Author

@jacobwolfaws I haven't seen this happen for a long while, though I'm not sure whether that is due to changes on the AWS side or because I simply haven't run into capacity issues. Feel free to close it, as it also seems that no one else is reporting it.

@jacobwolfaws
Contributor

@kanor1306 thanks for updating this! I'm going to leave this thread open, because I do agree we need to improve our capacity-shortage messaging.

@jacobwolfaws
Contributor

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jan 12, 2023