[IMPROVEMENT] Recreate instance manager pod for v2 volume when spdk_tgt is dead #7551

Closed
derekbit opened this issue Jan 4, 2024 · 4 comments
Labels
  • area/spdk: SPDK upstream/downstream
  • area/v2-data-engine: v2 data engine (SPDK)
  • component/longhorn-instance-manager: Longhorn instance manager (interface between control and data plane)
  • kind/improvement: Request for improvement of existing function
  • priority/0: Must be fixed in this release (managed by PO)
  • require/auto-e2e-test: Require adding/updating auto e2e test cases if they can be automated
  • require/backport: Require backport. Only used when the specific versions to backport have not been defined.
  • require/doc: Require updating the longhorn.io documentation

derekbit (Member) commented Jan 4, 2024

Is your improvement request related to a feature? Please describe (👍 if you like this request)

spdk_tgt can die unexpectedly, for example because of a segfault like this one:

[ 6720.296447] reactor_0[11215]: segfault at 10 ip 00000000004fb2da sp 00007ffe5c445cb0 error 4 in spdk_tgt[400000+3d5000]
[ 6720.296466] Code: 48 8b b7 f8 02 00 00 48 8b 40 18 48 8b 18 48 85 f6 74 17 48 8b 97 00 03 00 00 e8 51 ff ff ff 48 c7 85 f8 02 00 00 00 00 00 00 <8b> 43 10 3b 43 14 0f 83 8a 00 00 00 83 c0 01 89 43 10 48 8b 03 48
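
For readers who want to unpack the kernel line above, here is a minimal sketch of a decoder. It assumes the x86 page-fault error-code bits (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode) and the usual `segfault at ... in object[base+size]` layout. The function name `decode_segfault` is hypothetical and not part of Longhorn or SPDK.

```python
import re

# x86 page-fault error-code bits as printed in the kernel "segfault at" message.
PF_PROT = 0x1   # 0: page not present, 1: protection violation
PF_WRITE = 0x2  # 0: read access, 1: write access
PF_USER = 0x4   # 0: kernel mode, 1: user mode

SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>[^\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Hypothetical helper: split a kernel segfault line into its fields."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a kernel segfault line")
    err = int(m["err"], 16)
    return {
        "thread": m["comm"],
        "pid": int(m["pid"]),
        "fault_addr": int(m["addr"], 16),
        "object": m["obj"],
        # Offset of the faulting instruction from the start of the mapping.
        "ip_offset": int(m["ip"], 16) - int(m["base"], 16),
        "access": "write" if err & PF_WRITE else "read",
        "mode": "user" if err & PF_USER else "kernel",
        "cause": "protection violation" if err & PF_PROT else "page not present",
    }


if __name__ == "__main__":
    sample = ("[ 6720.296447] reactor_0[11215]: segfault at 10 ip 00000000004fb2da "
              "sp 00007ffe5c445cb0 error 4 in spdk_tgt[400000+3d5000]")
    for key, value in decode_segfault(sample).items():
        print(key, hex(value) if key in ("fault_addr", "ip_offset") else value)
```

Read this way, the line describes a user-mode read of an unmapped page at address 0x10 from the reactor_0 thread. That pattern is typical of reading a field at offset 0x10 through a NULL pointer, and the marked bytes `<8b> 43 10` in the Code line decode to a load from `rbx + 0x10`, which fits that reading. The root cause still needs investigation.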

The instance manager pod should detect this and terminate itself so that it can be recreated.
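
To illustrate the kind of detection meant here, below is a minimal sketch of a process-liveness check that scans `/proc` for a process whose command name is `spdk_tgt`. This is an assumption made for illustration only, not the check implemented in the linked PRs. The names `process_alive` and `liveness_exit_code` are hypothetical.

```python
import os


def process_alive(name, proc_root="/proc"):
    """Return True if some process under proc_root has the given command name."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are PIDs
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() == name:
                    return True
        except OSError:
            continue  # the process exited while we were scanning
    return False


def liveness_exit_code(name="spdk_tgt", proc_root="/proc"):
    """Exit-code convention of an exec-style probe: 0 is healthy, non-zero is not."""
    return 0 if process_alive(name, proc_root) else 1


if __name__ == "__main__":
    raise SystemExit(liveness_exit_code())
```

A check like this only tells you that a process with that name exists. It does not distinguish a hung or zombie spdk_tgt from a healthy one, so a real probe would more likely test that the target still answers requests.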


derekbit added the component/longhorn-instance-manager, require/auto-e2e-test, require/doc, kind/improvement, area/v2-data-engine, require/backport, and area/spdk labels Jan 4, 2024
derekbit self-assigned this Jan 4, 2024
innobead added this to the v1.7.0 milestone Jan 4, 2024
innobead added the priority/0 label Jan 4, 2024
innobead (Member) commented Jan 4, 2024

cc @DamiaSan

derekbit (Member, Author) commented Jan 5, 2024

The segfault in spdk_tgt itself needs further investigation and @DamiaSan's help.

The PRs I submitted improve resilience when spdk_tgt dies.

longhorn-io-github-bot commented Jan 5, 2024

Pre Ready-For-Testing Checklist

  • Where are the reproduce steps/test steps documented?
    The reproduce steps/test steps are:
  • Fresh installation
    1. Enable v1-data-engine and v2-data-engine
    2. Check that the instance-manager pods for the v1 and v2 data engines work as expected
    3. Open a shell in one of the instance-manager pods for the v2 data engine, then run killall -9 spdk_tgt
    4. Wait for a while. The instance-manager pod should be recreated, and other pods should not be impacted.
  • Upgrade
    1. Install Longhorn v1.5.3
    2. Create some v1 volumes
    3. Upgrade Longhorn to master-head
    4. Enable v1-data-engine and v2-data-engine
    5. Check that the old and new instance-manager pods work as expected
    6. Open a shell in one of the instance-manager pods for the v2 data engine, then run killall -9 spdk_tgt
    7. Wait for a while. The instance-manager pod should be recreated, and other pods should not be impacted.
    8. Detach the v1 volumes
    9. The old instance-manager pods should be deleted
    10. Attach the v1 volumes; they should work

  • Does the PR include the explanation for the fix or the feature?
    The liveness probe of the instance-manager pods is updated.

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PRs are at:
    longhorn/longhorn-manager#2428
    longhorn/longhorn-spdk-engine#87
    longhorn/longhorn-instance-manager#356

  • Which areas/issues might this PR have potential impacts on?
    Area: instance manager pod liveness probe; instance manager for the v2 data engine.
    Issues

chriscchien (Contributor) commented

Verified as passing on Longhorn master (longhorn-manager 325252) using the test steps.

After killing spdk_tgt in the v2 volume instance-manager pod, the pod was recreated, and once it was ready, all volumes worked well (tested on a freshly installed v1.6.0-dev and on an upgrade from v1.5.3).
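
The "pod recreated, other pods not impacted" check can be sketched as a comparison of two snapshots taken before and after the kill. This sketch assumes each snapshot maps a pod name to its Kubernetes `metadata.uid`, and that a recreated pod gets a new UID even if it keeps its name. The function `diff_pods` and the pod names in the usage example are hypothetical.

```python
def diff_pods(before, after):
    """Compare two {pod_name: uid} snapshots and classify each pod."""
    return {
        # Same name, different UID: the pod object was deleted and created again.
        "recreated": sorted(n for n in before if n in after and before[n] != after[n]),
        # Same name, same UID: the pod was not impacted.
        "unchanged": sorted(n for n in before if after.get(n) == before[n]),
        "removed": sorted(n for n in before if n not in after),
        "added": sorted(n for n in after if n not in before),
    }


if __name__ == "__main__":
    # Hypothetical pod names and UIDs, for illustration only.
    before = {"instance-manager-v1": "uid-a", "instance-manager-v2": "uid-b"}
    after = {"instance-manager-v1": "uid-a", "instance-manager-v2": "uid-c"}
    print(diff_pods(before, after))
```

In the scenario described in this issue, only the v2 instance-manager pod would be expected under "recreated" (or under "removed" plus "added" if the new pod gets a different name), with everything else under "unchanged".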
