Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

[IMPROVEMENT] Disable Automatically Delete Workload Pod when The Volume Is Detached Unexpectedly for RWX volumes #5017

Closed
derekbit opened this issue Dec 7, 2022 · 13 comments
Assignees
Labels
area/resilience System or volume resilience area/setting Global setting or volume setting area/stability System or volume stability area/volume-rwx Volume RWX related kind/improvement Request for improvement of existing function priority/0 Must be fixed in this release (managed by PO)
Milestone

Comments

@derekbit
Copy link
Member

derekbit commented Dec 7, 2022

Is your improvement request related to a feature? Please describe (馃憤 if you like this request)

Longhorn has a feature Automatically Delete Workload Pod when Volume is Detached Unexpectedly, and it is enabled by default. For a RWX volume, when the volume is unexpectedly detached, it will be recreated on another node. Additionally, the NFS sever already has a dedicated recovery backend. Thus, there is no need to delete the workload pods using the volume, and IO should continue after the new volume is started.

Describe the solution you'd like

A clear and concise description of what you want to happen.

Describe alternatives you've considered

A clear and concise description of any alternative solutions or features you've considered.

Additional context

Add any other context or screenshots about the feature request here.

@derekbit derekbit added area/volume-rwx Volume RWX related kind/improvement Request for improvement of existing function labels Dec 7, 2022
@innobead
Copy link
Member

innobead commented Dec 7, 2022

So this is specific for RWX only. I feel we need to include it in 1.4.0? WDYT?

@derekbit
Copy link
Member Author

derekbit commented Dec 7, 2022

So this is specific for RWX only. I feel we need to include it in 1.4.0? WDYT?

Yes. Should be in v1.4.0
Waiting for e2e https://ci.longhorn.io/job/private/job/longhorn-tests-regression/2651/

@innobead innobead added the priority/0 Must be fixed in this release (managed by PO) label Dec 7, 2022
@innobead innobead added this to the v1.4.0 milestone Dec 7, 2022
@innobead innobead added the area/setting Global setting or volume setting label Dec 7, 2022
@innobead
Copy link
Member

innobead commented Dec 7, 2022

If we don't have this improvement, what users will encounter is just the workload getting restarted to reattach to the RWX volume from the stretch w/o the previous client session (maintained in our newly built recovery backend), but there will be no data loss problem because the current nfs mode is changed to hard mode, correct?

@derekbit
Copy link
Member Author

derekbit commented Dec 7, 2022

If we don't have this patch, the pod (client side) will be stopped and then recreated and attached to the volume. The disconnection can still introduce the data loss issue. hard mode only works when the client is still running without stopping.

I'm thinking if we need to have a setting for Automatically Delete Workload Pod when The Volume Is Detached Unexpectedly for RWX volumes. Not sure if there is any use case needing the function.

@innobead
Copy link
Member

innobead commented Dec 7, 2022

I see. Also, I know you would like to have a setting for keeping the original function. It's quite weird.

But I do feel this setting should originally be for RWO, because the volume/engine is next to the workload.

@shuo-wu @joshimoo WDYT?

@derekbit
Copy link
Member Author

derekbit commented Dec 7, 2022

Agree. Then, we can keep the current PR and see if there is any feedback from users. WDYT?

@innobead
Copy link
Member

innobead commented Dec 7, 2022

Sounds good. BTW, could you check the commit which introduced this setting to see the context of the feature? See if this was actually a concern for RWO or also RWX included.

@shuo-wu
Copy link
Contributor

shuo-wu commented Dec 7, 2022

That setting is mainly for RWO volumes. I don't think we need a separate setting for RWX volumes after the improvement.
We can see if there are any community users/customers requesting it in the future.

@derekbit
Copy link
Member Author

derekbit commented Dec 7, 2022

Sounds good. BTW, could you check the comment which introduced this setting to see the context of the feature? See if this was actually a concern for RWO or also RWX included.

Yes, I've checked the PR, but it doesn't mention any concern for RWO or RWX volume.
But the test cases are only for RWO volume.
Ref: #1719

@innobead
Copy link
Member

innobead commented Dec 7, 2022

Then let's do this improvement ;)

@longhorn-io-github-bot
Copy link

longhorn-io-github-bot commented Dec 8, 2022

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

The steps are same as test_rwx.py::test_rwx_delete_share_manager_pod updated in longhorn/longhorn-tests#1221.

The workload can be a DaemonSet, Deployment or Deployment.

longhorn/longhorn-manager#1589

  • Which areas/issues this PR might have potential impacts on?
    Area: RWX volume
    Issues

@derekbit
Copy link
Member Author

derekbit commented Dec 8, 2022

e2e afte updating test_rwx.py::test_rwx_delete_share_manager_pod updated

https://ci.longhorn.io/job/private/job/longhorn-tests-regression/2657/

@chriscchien
Copy link
Contributor

Verified in longhorn master 13ac2a with test steps
Result Pass

  1. After delete share-manager pod then wait share-manager pod work again, write extra data into statesfulset pod mount point, those new added data can be seen in path /export/'<pv_name> in share-manager pod.
  2. Test case passed in master-head pipeline.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/resilience System or volume resilience area/setting Global setting or volume setting area/stability System or volume stability area/volume-rwx Volume RWX related kind/improvement Request for improvement of existing function priority/0 Must be fixed in this release (managed by PO)
Projects
None yet
Development

No branches or pull requests

5 participants