[IMPROVEMENT] Potentially reduce the two minute iSCSI timeout for v1 volumes #8382

Open
ejweber opened this issue Apr 17, 2024 · 1 comment
Labels
kind/improvement: Request for improvement of existing function
require/backport: Require backport. Only used when the specific versions to backport have not been defined.
require/doc: Require updating the longhorn.io documentation
require/manual-test-plan: Require adding/updating manual test cases if they can't be automated

Comments

@ejweber
Contributor

ejweber commented Apr 17, 2024

Is your improvement request related to a feature? Please describe (👍 if you like this request)

While investigating #2187, we did a deep dive into the behavior of v1 volumes when the instance-manager process group is abruptly killed: https://github.com/longhorn/longhorn/wiki/Freezing-File-Systems-With-dmsetup-suspend-Versus-fsfreeze.

During the investigation, we noticed that all I/O was blocked (it could neither complete successfully nor return an error) until about two minutes after the crash. The relevant dmesg logs look like:

[Wed Apr 17 15:26:51 2024]  connection4:0: detected conn error (1020)
[Wed Apr 17 15:28:54 2024]  session4: session recovery timed out after 120 secs
[Wed Apr 17 15:28:54 2024] sd 5:0:0:1: rejecting I/O to offline device
[Wed Apr 17 15:28:54 2024] I/O error, dev sdc, sector 19395608 op 0x1:(WRITE) flags 0x104000 phys_seg 320 prio class 2
[Wed Apr 17 15:28:54 2024] I/O error, dev sdc, sector 19390488 op 0x1:(WRITE) flags 0x104000 phys_seg 320 prio class 2
[Wed Apr 17 15:28:54 2024] I/O error, dev sdc, sector 19393048 op 0x1:(WRITE) flags 0x104000 phys_seg 320 prio class 2
[Wed Apr 17 15:28:54 2024] I/O error, dev sdc, sector 19385368 op 0x1:(WRITE) flags 0x104000 phys_seg 320 prio class 2
[Wed Apr 17 15:28:54 2024] I/O error, dev sdc, sector 19387928 op 0x1:(WRITE) flags 0x104000 phys_seg 320 prio class 2
[Wed Apr 17 15:28:54 2024] I/O error, dev sdc, sector 19382808 op 0x1:(WRITE) flags 0x104000 phys_seg 320 prio class 2
[Wed Apr 17 15:28:54 2024] I/O error, dev sdc, sector 10485849 op 0x1:(WRITE) flags 0x9800 phys_seg 8 prio class 2
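
As a rough check of the delay visible in the log above, the following minimal sketch parses the two relevant dmesg lines and computes how long I/O stayed blocked. The helper names (`parse`, `blocked_seconds`) are hypothetical and chosen only for illustration; the timestamp format is assumed to be the human-readable one shown in the excerpt.

```python
import re
from datetime import datetime

# Two lines copied verbatim from the dmesg excerpt above.
LOG = """\
[Wed Apr 17 15:26:51 2024]  connection4:0: detected conn error (1020)
[Wed Apr 17 15:28:54 2024]  session4: session recovery timed out after 120 secs
"""

# Matches "[<timestamp>] <message>" as printed in the excerpt.
STAMP = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<msg>.*)$")


def parse(line):
    """Return (timestamp, message) for a dmesg line, or None if it does not match."""
    m = STAMP.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S %Y")
    return ts, m.group("msg")


def blocked_seconds(log):
    """Seconds between the first connection error and the recovery timeout."""
    start = end = None
    for line in log.splitlines():
        parsed = parse(line)
        if parsed is None:
            continue
        ts, msg = parsed
        if start is None and "detected conn error" in msg:
            start = ts
        elif start is not None and "session recovery timed out" in msg:
            end = ts
            break
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


print(blocked_seconds(LOG))
```

For the excerpt above this comes to roughly two minutes, in line with the 120 second recovery timeout reported by the kernel.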

Since iSCSI traffic is all local to a node, it is unlikely there is a timeout for any reason OTHER than a tgtd crash, so waiting two minutes does not seem necessary.

Describe the solution you'd like

Reduce the iSCSI timeout if it is practical to do so.

Describe alternatives you've considered

If it is not practical to reduce the iSCSI timeout, we can leave it as it is.

Additional context

There are various online sources discussing ways to change iSCSI and/or SCSI timeouts. We need to do a bit of investigation to determine which timeout and which method are correct for this use case.
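
As a starting point for that investigation, here is a minimal sketch that lists the recovery timeout currently in effect for each iSCSI session on a node. It assumes the kernel exposes a `recovery_tmo` attribute under `/sys/class/iscsi_session/session*/`, and that this is the timeout behind the "session recovery timed out after 120 secs" message; neither assumption has been confirmed in this issue. The function name `read_recovery_timeouts` is hypothetical.

```python
from pathlib import Path

# Assumption: iSCSI session attributes are exposed under this sysfs directory.
SYSFS_ROOT = Path("/sys/class/iscsi_session")


def read_recovery_timeouts(root=SYSFS_ROOT):
    """Return {session_name: recovery_tmo_seconds} for sessions found under root."""
    root = Path(root)
    timeouts = {}
    if not root.is_dir():
        return timeouts
    for session_dir in sorted(root.glob("session*")):
        attr = session_dir / "recovery_tmo"
        try:
            timeouts[session_dir.name] = int(attr.read_text().strip())
        except (OSError, ValueError):
            # Attribute missing, unreadable, or not an integer: skip this session.
            continue
    return timeouts


if __name__ == "__main__":
    for name, secs in read_recovery_timeouts().items():
        print(f"{name}: recovery_tmo={secs}s")
```

In open-iscsi, the related configuration setting appears to be `node.session.timeo.replacement_timeout`, whose default of 120 seconds would match the log. Whether lowering it is safe and sufficient for v1 volumes is exactly what still needs to be determined.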

ejweber added the require/doc, require/manual-test-plan, kind/improvement, and require/backport labels on Apr 17, 2024
ejweber added this to the v1.7.0 milestone on Apr 17, 2024
ejweber self-assigned this on Apr 17, 2024
@derekbit
Member

derekbit commented Apr 18, 2024

Might be related to http://github.com/longhorn/longhorn/issues/6339
