
Better handling of Lustre volumes that go to unrecoverable Failed status #239

Open
kanor1306 opened this issue Jan 28, 2022 · 10 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@kanor1306

kanor1306 commented Jan 28, 2022

Is your feature request related to a problem?/Why is this needed
From time to time I get Lustre volumes in "Failed" state, usually with the following message:
[Screenshot: error message on the failed file system]
which I assume is due to a lack of capacity on the AWS side (as I am not reaching my quota limits).

The issue comes with the PVC: it stays in the Pending state forever, while the volume it represents is in a state that is not recoverable. See:
[Screenshot: the PVC stuck in Pending]

And the file system in the AWS console:
[Screenshot: the file system in Failed state in the AWS console]

The PVC keeps looping, waiting for the volume to be created, while the FSx volume has simply failed, and that will not change.
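For reference, the unrecoverable state is visible through the FSx API. Below is a minimal sketch of my own (using aws-sdk-go v1, with a placeholder file system ID) of the kind of check involved: once Lifecycle reports FAILED, FailureDetails carries a message like the one in the screenshot and the file system never becomes AVAILABLE on its own.

```go
// Sketch only: inspect the lifecycle state of an FSx for Lustre file system.
// Assumes aws-sdk-go v1 and default credential/region resolution.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/fsx"
)

func main() {
	sess := session.Must(session.NewSession())
	client := fsx.New(sess)

	// "fs-0123456789abcdef0" is a placeholder file system ID.
	out, err := client.DescribeFileSystems(&fsx.DescribeFileSystemsInput{
		FileSystemIds: []*string{aws.String("fs-0123456789abcdef0")},
	})
	if err != nil {
		log.Fatal(err)
	}

	for _, fs := range out.FileSystems {
		fmt.Printf("file system %s lifecycle: %s\n",
			aws.StringValue(fs.FileSystemId), aws.StringValue(fs.Lifecycle))
		// When Lifecycle is "FAILED", FailureDetails usually explains why
		// (e.g. insufficient capacity); the state does not recover on its own.
		if aws.StringValue(fs.Lifecycle) == "FAILED" && fs.FailureDetails != nil {
			fmt.Printf("failure details: %s\n", aws.StringValue(fs.FailureDetails.Message))
		}
	}
}
```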

/feature

Describe the solution you'd like in detail
My proposed solution does not solve the issue directly, but at least it allows you to manage the problem yourself. I would like the PVC to go to a different state that makes it clear it is in an unrecoverable situation. That way you could handle the "Failed" situation from within your own software just by checking the status of the PVC, without using the AWS API.
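As a stop-gap, the closest thing I have found to "checking the status of the PVC" today is to look at the warning events attached to the Pending claim, since the PVC phase alone never changes. A rough sketch with client-go follows; the namespace and claim name are placeholders, and what the warning events actually say depends on the provisioner.

```go
// Sketch only: infer a provisioning failure from the events attached to a
// Pending PVC, since the claim itself never leaves the Pending phase.
package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	ns, name := "default", "fsx-claim" // placeholders
	pvc, err := cs.CoreV1().PersistentVolumeClaims(ns).Get(context.TODO(), name, metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}

	if pvc.Status.Phase == corev1.ClaimPending {
		// Inspect the events recorded against the PVC for provisioning warnings.
		events, err := cs.CoreV1().Events(ns).List(context.TODO(), metav1.ListOptions{
			FieldSelector: fmt.Sprintf("involvedObject.kind=PersistentVolumeClaim,involvedObject.name=%s", name),
		})
		if err != nil {
			log.Fatal(err)
		}
		for _, e := range events.Items {
			if e.Type == corev1.EventTypeWarning {
				fmt.Printf("%s: %s\n", e.Reason, e.Message)
			}
		}
	}
}
```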

Describe alternatives you've considered

  • Ideally, when the Lustre volume fails to be created in this manner, I would like the driver to retry the creation of the volume, though I don't know if the AWS SDK allows for this (see the sketch after this list for one possible shape).
  • Another alternative would be for the driver to create a new volume when the creation fails, but I can imagine that this could lead to issues in cases where the new one also goes to "Failed", but for different reasons.
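To make the first alternative above concrete, here is a purely hypothetical sketch (not the driver's actual code; the function and its names are my own illustration) of how a provisioning wait loop could treat a FAILED file system as terminal, deleting it so that a retried CreateVolume can start over instead of waiting forever:

```go
// Sketch only: one possible way a provisioning loop could handle a FAILED
// FSx for Lustre file system. All names here are illustrative assumptions.
package sketch

import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/fsx"
	"github.com/aws/aws-sdk-go/service/fsx/fsxiface"
)

// checkLifecycle returns nil when the file system is AVAILABLE, deletes it and
// returns an error when it is FAILED, and reports any other state as pending.
func checkLifecycle(ctx context.Context, client fsxiface.FSxAPI, fsID string) error {
	out, err := client.DescribeFileSystemsWithContext(ctx, &fsx.DescribeFileSystemsInput{
		FileSystemIds: []*string{aws.String(fsID)},
	})
	if err != nil {
		return err
	}
	if len(out.FileSystems) == 0 {
		return fmt.Errorf("file system %s not found", fsID)
	}
	fs := out.FileSystems[0]

	switch aws.StringValue(fs.Lifecycle) {
	case "AVAILABLE":
		return nil
	case "FAILED":
		// Unrecoverable: record the reason, then clean up so a later retry
		// can provision a fresh file system instead of waiting forever.
		msg := ""
		if fs.FailureDetails != nil {
			msg = aws.StringValue(fs.FailureDetails.Message)
		}
		if _, derr := client.DeleteFileSystemWithContext(ctx, &fsx.DeleteFileSystemInput{
			FileSystemId: aws.String(fsID),
		}); derr != nil {
			return derr
		}
		return fmt.Errorf("file system %s failed and was deleted: %s", fsID, msg)
	default:
		return fmt.Errorf("file system %s is still %s", fsID, aws.StringValue(fs.Lifecycle))
	}
}
```

Whether deleting and retrying is actually safe probably depends on why the file system failed, which is the concern raised in the second alternative.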

Additional context
Another effect of the current situation is that when the PVC is removed, it leaves the Lustre volume behind in AWS, so you need to clean it up manually.

Also note that if you delete the Failed volume, the driver will create a new one and the PVC becomes healthy again (provided the new one does not also go to the Failed state).

Edit: add the Additional context

@kanor1306 kanor1306 changed the title Mark PVC as failed when Lustre volumes goes to unrecoverable Failed status Mark PVC as failed when Lustre volume goes to unrecoverable Failed status Jan 28, 2022
@kanor1306 kanor1306 changed the title Mark PVC as failed when Lustre volume goes to unrecoverable Failed status Better handling of Lustre volume that go to unrecoverable Failed status Feb 16, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 17, 2022
@kanor1306
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 20, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 18, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 17, 2022
@kanor1306
Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 20, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 19, 2022
@jacobwolfaws
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 22, 2022
@kanor1306
Author

@jacobwolfaws I haven't seen this happen for a long while, though I'm not sure whether that is due to changes on the AWS side or because I simply haven't run into capacity issues. Feel free to close it, as it also seems that no one else is reporting it.

@jacobwolfaws
Contributor

@kanor1306 thanks for updating this! I'm going to leave this thread open, because I do agree we need to improve our capacity-shortage messaging.

@jacobwolfaws
Contributor

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jan 12, 2023