[IMPROVEMENT] Potentially reduce the two minute iSCSI timeout for v1 volumes #8382

ejweber · 2024-04-17T19:03:12Z

Is your improvement request related to a feature? Please describe (👍 if you like this request)

While investigating #2187, we did a deep dive into the behavior of v1 volumes when the instance-manager process group is abruptly killed: https://github.com/longhorn/longhorn/wiki/Freezing-File-Systems-With-dmsetup-suspend-Versus-fsfreeze.

During the investigation, we noticed that all I/O was blocked (it could neither complete successfully nor return an error) until two minutes after the crash. The relevant dmesg logs look like this:

[Wed Apr 17 15:26:51 2024]  connection4:0: detected conn error (1020)
[Wed Apr 17 15:28:54 2024]  session4: session recovery timed out after 120 secs
[Wed Apr 17 15:28:54 2024] sd 5:0:0:1: rejecting I/O to offline device
[Wed Apr 17 15:28:54 2024] I/O error, dev sdc, sector 19395608 op 0x1:(WRITE) flags 0x104000 phys_seg 320 prio class 2
[Wed Apr 17 15:28:54 2024] I/O error, dev sdc, sector 19390488 op 0x1:(WRITE) flags 0x104000 phys_seg 320 prio class 2
[Wed Apr 17 15:28:54 2024] I/O error, dev sdc, sector 19393048 op 0x1:(WRITE) flags 0x104000 phys_seg 320 prio class 2
[Wed Apr 17 15:28:54 2024] I/O error, dev sdc, sector 19385368 op 0x1:(WRITE) flags 0x104000 phys_seg 320 prio class 2
[Wed Apr 17 15:28:54 2024] I/O error, dev sdc, sector 19387928 op 0x1:(WRITE) flags 0x104000 phys_seg 320 prio class 2
[Wed Apr 17 15:28:54 2024] I/O error, dev sdc, sector 19382808 op 0x1:(WRITE) flags 0x104000 phys_seg 320 prio class 2
[Wed Apr 17 15:28:54 2024] I/O error, dev sdc, sector 10485849 op 0x1:(WRITE) flags 0x9800 phys_seg 8 prio class 2

Since all iSCSI traffic is local to a node, a timeout is unlikely to have any cause OTHER than a tgtd crash, so waiting two minutes does not seem necessary.
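To make the length of the stall concrete, here is a minimal sketch that computes the gap between the connection error and the point where I/O starts failing, using two lines copied from the dmesg excerpt above. The helper names (`parse_line`, `stall_seconds`) are hypothetical and only for illustration; the timestamp format is assumed to be the human-readable one shown in the excerpt.

```python
import re
from datetime import datetime

# Two lines copied verbatim from the dmesg excerpt in this issue.
DMESG_EXCERPT = """\
[Wed Apr 17 15:26:51 2024]  connection4:0: detected conn error (1020)
[Wed Apr 17 15:28:54 2024]  session4: session recovery timed out after 120 secs
"""

# Matches "[<timestamp>] <message>" as printed in the excerpt.
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<msg>.*)$")


def parse_line(line):
    """Return (timestamp, message) for one log line, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    ts = datetime.strptime(match.group("ts"), "%a %b %d %H:%M:%S %Y")
    return ts, match.group("msg")


def stall_seconds(log_text):
    """Seconds between the first conn error and the following recovery timeout."""
    start = None
    for line in log_text.splitlines():
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, msg = parsed
        if start is None and "detected conn error" in msg:
            start = ts
        elif start is not None and "session recovery timed out" in msg:
            return (ts - start).total_seconds()
    return None  # one of the two events was not found


print(stall_seconds(DMESG_EXCERPT))
```

For the excerpt above, the gap is a little over two minutes, which lines up with the 120-second session recovery timeout reported in the log.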

Describe the solution you'd like

Reduce the iSCSI timeout if it is practical to do so.

Describe alternatives you've considered

If it is not practical to reduce the iSCSI timeout, we can keep it as it is.

Additional context

There are various online sources discussing ways to change iSCSI and/or SCSI timeouts. We need to do a bit of investigation to determine which timeout and which method are correct for this use case.
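As a starting point for that investigation, the following is a minimal read-only sketch. It assumes (this is not confirmed by the issue) that the relevant knob is the open-iscsi session recovery timeout, which the kernel exposes per session as `recovery_tmo` under `/sys/class/iscsi_session/` and which open-iscsi configures as `node.session.timeo.replacement_timeout`. The function name `read_recovery_timeouts` is hypothetical.

```python
from pathlib import Path

# Assumption: iSCSI sessions appear here, one directory per session,
# each with a "recovery_tmo" attribute holding the timeout in seconds.
SYSFS_ISCSI_SESSIONS = Path("/sys/class/iscsi_session")


def read_recovery_timeouts(base=SYSFS_ISCSI_SESSIONS):
    """Map session name (e.g. "session4") to its recovery timeout in seconds.

    Sessions whose attribute is missing or unreadable are skipped.
    """
    timeouts = {}
    if not base.is_dir():
        return timeouts  # no iSCSI sessions on this host, or sysfs not mounted
    for session_dir in sorted(base.glob("session*")):
        attr = session_dir / "recovery_tmo"
        try:
            timeouts[session_dir.name] = int(attr.read_text().strip())
        except (OSError, ValueError):
            continue
    return timeouts
```

Running `read_recovery_timeouts()` on a node with attached v1 volumes would show whether each session currently reports the 120-second value seen in the dmesg output. The sketch only reads the value; whether and how to change it is the open question here.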


derekbit · 2024-04-18T00:02:26Z

Might be related to http://github.com/longhorn/longhorn/issues/6339

ejweber added this to the v1.7.0 milestone Apr 17, 2024

ejweber self-assigned this Apr 17, 2024
