Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Server reboot stucked due to unkillable mhvtl (SCSI Target: 15) #108

Closed
smajl87 opened this issue Sep 21, 2022 · 4 comments
Closed

Server reboot stucked due to unkillable mhvtl (SCSI Target: 15) #108

smajl87 opened this issue Sep 21, 2022 · 4 comments

Comments

@smajl87
Copy link

smajl87 commented Sep 21, 2022

Hi,

after configuring 2 libraries and multiple drives I rebooted server (after installing IBM lin_tape). All library/tape properly detected by OS. However since next reboot it gets stucked every time,w aiting for vtltape&vtllibrary processes to terminate (unless I do reboot -f now).
I tried to bring up all vtltape/vtllibrary services up one by one, and then stop them one by one while watching logs (debug enabled in mhvtl.conf)

root@sp02:~# systemctl start vtllibrary@10
root@sp02:~# systemctl start vtltape@11
root@sp02:~# systemctl start vtltape@12
root@sp02:~# systemctl start vtltape@13
root@sp02:~# systemctl start vtltape@14
root@sp02:~# systemctl start vtltape@15
root@sp02:~# systemctl start vtltape@16
root@sp02:~# systemctl start vtllibrary@30
root@sp02:~# systemctl start vtltape@31
root@sp02:~# systemctl start vtltape@32
root@sp02:~# systemctl start vtltape@33
root@sp02:~# systemctl start vtltape@34
root@sp02:~# systemctl start vtltape@35
root@sp02:~# systemctl start vtltape@36
root@sp02:~# systemctl start vtltape@37
root@sp02:~# systemctl start vtltape@38
root@sp02:~# systemctl start vtltape@39
root@sp02:~# systemctl start vtltape@40
root@sp02:~# systemctl stop vtltape@40
root@sp02:~# systemctl stop vtltape@39
root@sp02:~# systemctl stop vtltape@38

I saw tapes polling in journalctl -f until this occured:

Sep 21 10:03:39 sp02 vtltape[2419]: ioctl(VX_TAPE_POLL_STATUS) returned: 0, interval: 184005
Sep 21 10:03:39 sp02 vtltape[2443]: ioctl(VX_TAPE_POLL_STATUS) returned: 0, interval: 179605
Sep 21 10:03:39 sp02 vtltape[2479]: ioctl(VX_TAPE_POLL_STATUS) returned: 0, interval: 170405
Sep 21 10:03:39 sp02 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
Sep 21 10:03:39 sp02 kernel: #PF: supervisor read access in kernel mode
Sep 21 10:03:39 sp02 kernel: #PF: error_code(0x0000) - not-present page
Sep 21 10:03:39 sp02 kernel: PGD 0 P4D 0
Sep 21 10:03:39 sp02 kernel: Oops: 0000 [#1] SMP PTI
Sep 21 10:03:39 sp02 kernel: CPU: 1 PID: 2495 Comm: vtltape Tainted: G        W  OE     5.4.0-125-generic #141-Ubuntu
Sep 21 10:03:39 sp02 kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
Sep 21 10:03:39 sp02 kernel: RIP: 0010:scsi_remove_device+0x10/0x40
Sep 21 10:03:39 sp02 kernel: Code: ef e8 44 db 39 00 e9 71 ff ff ff 48 89 df e8 87 b6 ff ff eb 87 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 41 55 49 89 fd>
Sep 21 10:03:39 sp02 kernel: RSP: 0018:ffffa3ce41163df8 EFLAGS: 00010246
Sep 21 10:03:39 sp02 kernel: RAX: 0000000000000026 RBX: 0000000000000000 RCX: 0000000000000000
Sep 21 10:03:39 sp02 kernel: RDX: ffff8ebb948d8410 RSI: 00007ffc59e044b0 RDI: 0000000000000000
Sep 21 10:03:39 sp02 kernel: RBP: ffffa3ce41163e08 R08: dead000000000100 R09: 0000000000000000
Sep 21 10:03:39 sp02 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8ebb948d8410
Sep 21 10:03:39 sp02 kernel: R13: 0000000000000000 R14: ffff8ebb948d8410 R15: dead000000000122
Sep 21 10:03:39 sp02 kernel: FS:  00007fc2642de180(0000) GS:ffff8ebbaba80000(0000) knlGS:0000000000000000
Sep 21 10:03:39 sp02 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 21 10:03:39 sp02 kernel: CR2: 0000000000000000 CR3: 0000000519e40000 CR4: 00000000000006e0
Sep 21 10:03:39 sp02 kernel: Call Trace:
Sep 21 10:03:39 sp02 kernel:  mhvtl_c_ioctl+0x250/0x4e0 [mhvtl]
Sep 21 10:03:39 sp02 kernel:  do_vfs_ioctl+0x407/0x670
Sep 21 10:03:39 sp02 kernel:  ? copy_compat_msqid_to_user+0x130/0x130
Sep 21 10:03:39 sp02 kernel:  ksys_ioctl+0x67/0x90
Sep 21 10:03:39 sp02 kernel:  __x64_sys_ioctl+0x1a/0x20
Sep 21 10:03:39 sp02 kernel:  do_syscall_64+0x57/0x190
Sep 21 10:03:39 sp02 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Sep 21 10:03:39 sp02 kernel: RIP: 0033:0x7fc2644163ab
Sep 21 10:03:39 sp02 kernel: Code: 0f 1e fa 48 8b 05 e5 7a 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00>
Sep 21 10:03:39 sp02 kernel: RSP: 002b:00007ffc59e04468 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Sep 21 10:03:39 sp02 kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc2644163ab
Sep 21 10:03:39 sp02 kernel: RDX: 00007ffc59e044a4 RSI: 0000000000000205 RDI: 0000000000000003
Sep 21 10:03:39 sp02 kernel: RBP: 0000000000000080 R08: 0000000000000000 R09: 0000000000000000
Sep 21 10:03:39 sp02 kernel: R10: 00007fc2643adcf0 R11: 0000000000000246 R12: 00007ffc59e04530
Sep 21 10:03:39 sp02 kernel: R13: 000056287256f5c0 R14: 000056287256f5c0 R15: 000000000002a7b0
Sep 21 10:03:39 sp02 kernel: Modules linked in: st ch mhvtl(OE) lin_tape(OE) pfo(OE) nls_iso8859_1 xfs dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua >
Sep 21 10:03:39 sp02 kernel: CR2: 0000000000000000
Sep 21 10:03:39 sp02 kernel: ---[ end trace 3b78984384adab82 ]---
Sep 21 10:03:39 sp02 kernel: RIP: 0010:scsi_remove_device+0x10/0x40
Sep 21 10:03:39 sp02 kernel: Code: ef e8 44 db 39 00 e9 71 ff ff ff 48 89 df e8 87 b6 ff ff eb 87 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 41 55 49 89 fd>
Sep 21 10:03:39 sp02 kernel: RSP: 0018:ffffa3ce41163df8 EFLAGS: 00010246
Sep 21 10:03:39 sp02 kernel: RAX: 0000000000000026 RBX: 0000000000000000 RCX: 0000000000000000
Sep 21 10:03:39 sp02 kernel: RDX: ffff8ebb948d8410 RSI: 00007ffc59e044b0 RDI: 0000000000000000
Sep 21 10:03:39 sp02 kernel: RBP: ffffa3ce41163e08 R08: dead000000000100 R09: 0000000000000000
Sep 21 10:03:39 sp02 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8ebb948d8410
Sep 21 10:03:39 sp02 kernel: R13: 0000000000000000 R14: ffff8ebb948d8410 R15: dead000000000122
Sep 21 10:03:39 sp02 kernel: FS:  00007fc2642de180(0000) GS:ffff8ebbaba80000(0000) knlGS:0000000000000000
Sep 21 10:03:39 sp02 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 21 10:03:39 sp02 kernel: CR2: 0000000000000000 CR3: 0000000519e40000 CR4: 00000000000006e0

Not able to kill any of vtl* processes:

root@sp02:~# ps -ef|grep [v]tl
root        2133       1  0 10:02 ?        00:00:00 /usr/bin/vtllibrary -F -q10 -v1 -d
root        2160       1  0 10:02 ?        00:00:00 /usr/bin/vtltape -F -q11 -v1 -d
root        2256       1  0 10:02 ?        00:00:00 /usr/bin/vtltape -F -q12 -v1 -d
root        2283       1  0 10:02 ?        00:00:00 /usr/bin/vtltape -F -q13 -v1 -d
root        2293       1  0 10:02 ?        00:00:00 /usr/bin/vtltape -F -q14 -v1 -d
root        2301       1  0 10:02 ?        00:00:00 /usr/bin/vtltape -F -q15 -v1 -d
root        2309       1  0 10:02 ?        00:00:00 /usr/bin/vtltape -F -q16 -v1 -d
root        2390       1  0 10:02 ?        00:00:00 /usr/bin/vtllibrary -F -q30 -v1 -d
root        2419       1  0 10:02 ?        00:00:00 /usr/bin/vtltape -F -q31 -v1 -d
root        2443       1  0 10:02 ?        00:00:00 /usr/bin/vtltape -F -q32 -v1 -d
root        2451       1  0 10:02 ?        00:00:00 /usr/bin/vtltape -F -q33 -v1 -d
root        2463       1  0 10:03 ?        00:00:00 /usr/bin/vtltape -F -q34 -v1 -d
root        2471       1  0 10:03 ?        00:00:00 /usr/bin/vtltape -F -q35 -v1 -d
root        2479       1  0 10:03 ?        00:00:00 /usr/bin/vtltape -F -q36 -v1 -d
root        2487       1  0 10:03 ?        00:00:00 /usr/bin/vtltape -F -q37 -v1 -d
root@sp02:~# kill -9 2487
root@sp02:~# ps -ef|grep [q]37
root        2487       1  0 10:03 ?        00:00:00 /usr/bin/vtltape -F -q37 -v1 -d

I tried to stop vtltape from 11..40 and also from 40..11 so identify the one with ID 38 to cause issues.
The offending tape drive is defined as follows:

Drive: 38 CHANNEL: 00 TARGET: 15 LUN: 00
 Library ID: 30 Slot: 264
 Vendor identification: IBM
 Product identification: ULT3580-TD6
#Product revision level: 2160
 Unit serial number: 00078AC648
 NAA: 30:22:33:44:ab:00:15:00
 Compression: factor 1 enabled 1
 Compression type: lzo
 Backoff: 400
# fifo: /var/tmp/mhvtl

After I changed target from 15 to 16, problem is gone

Drive: 38 CHANNEL: 00 TARGET: 16 LUN: 00

Current (working) state:

root@sp02:~# lsscsi -g
[2:0:0:0]    disk    ATA      QEMU HARDDISK    2.5+  /dev/sda   /dev/sg0
[3:0:0:0]    disk    ATA      QEMU HARDDISK    2.5+  /dev/sdb   /dev/sg1
[4:0:0:0]    disk    ATA      QEMU HARDDISK    2.5+  /dev/sdc   /dev/sg2
[5:0:0:0]    disk    ATA      QEMU HARDDISK    2.5+  /dev/sdd   /dev/sg3
[6:0:0:0]    disk    ATA      QEMU HARDDISK    2.5+  /dev/sde   /dev/sg4
[7:0:0:0]    disk    ATA      QEMU HARDDISK    2.5+  /dev/sdf   /dev/sg5
[8:0:0:0]    disk    ATA      QEMU HARDDISK    2.5+  /dev/sdg   /dev/sg6
[9:0:0:0]    disk    ATA      QEMU HARDDISK    2.5+  /dev/sdh   /dev/sg7
[10:0:0:0]   disk    ATA      QEMU HARDDISK    2.5+  /dev/sdi   /dev/sg8
[11:0:0:0]   disk    ATA      QEMU HARDDISK    2.5+  /dev/sdj   /dev/sg9
[12:0:0:0]   disk    ATA      QEMU HARDDISK    2.5+  /dev/sdk   /dev/sg10
[14:0:0:0]   mediumx IBM      3573-TL          0107  -          /dev/sg11
[14:0:1:0]   tape    IBM      ULT3580-TD8      0107  -          /dev/sg21
[14:0:2:0]   tape    IBM      ULT3580-TD8      0107  -          /dev/sg13
[14:0:3:0]   tape    IBM      ULT3580-TD8      0107  -          /dev/sg20
[14:0:4:0]   tape    IBM      ULT3580-TD8      0107  -          /dev/sg17
[14:0:5:0]   tape    IBM      ULT3580-TD8      0107  -          /dev/sg19
[14:0:6:0]   tape    IBM      ULT3580-TD8      0107  -          /dev/sg23
[14:0:7:0]   mediumx IBM      03584L53         0107  -          /dev/sg28
[14:0:8:0]   tape    IBM      ULT3580-TD6      0107  -          /dev/sg12
[14:0:9:0]   tape    IBM      ULT3580-TD6      0107  -          /dev/sg26
[14:0:10:0]  tape    IBM      ULT3580-TD6      0107  -          /dev/sg16
[14:0:11:0]  tape    IBM      ULT3580-TD6      0107  -          /dev/sg27
[14:0:12:0]  tape    IBM      ULT3580-TD6      0107  -          /dev/sg15
[14:0:13:0]  tape    IBM      ULT3580-TD6      0107  -          /dev/sg14
[14:0:14:0]  tape    IBM      ULT3580-TD6      0107  -          /dev/sg18
[14:0:16:0]  tape    IBM      ULT3580-TD6      0107  -          /dev/sg22
[14:0:17:0]  tape    IBM      ULT3580-TD6      0107  -          /dev/sg25
[14:0:18:0]  tape    IBM      ULT3580-TD6      0107  -          /dev/sg24

Q: What's wrong with TARGET:15?

QEMU VM running Ubuntu 20.04.5 LTS

@markh794
Copy link
Owner

Thanks for the bug report.

I'm on holidays without access to a computer for next 3 weeks.

Looks like the kernel module oops, so when user space daemons call ioctl(), it never returns. daemons won't receive kill signal until the syscall() returns - hence are unkillable

I'll look into it more when back from leave.

Any chance for 'uname -r' as well

@smajl87
Copy link
Author

smajl87 commented Sep 22, 2022

root@sp02:~# uname -r
5.4.0-125-generic
root@sp02:~# uname -a
Linux sp02 5.4.0-125-generic #141-Ubuntu SMP Wed Aug 10 13:42:03 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

@markh794
Copy link
Owner

markh794 commented Oct 2, 2022

A look over the kernel module src (kernel/mhvtl.c) line 331 show SCSI ID 15 assigned to the mhvtl hba.

https://github.com/markh794/mhvtl/blob/master/kernel/mhvtl.c

.this_id = 15

Try commenting out this line (leave unassigned) or set to a large value such as 4096 and recompile the kernel module.

I'll try it out when back in front of a computer (in the next couple of weeks)

markh794 added a commit that referenced this issue Oct 11, 2022
Results in oops and hung user space daemon if a target
is spun up using same SCSI ID as initiator.

#108

Reported-by: https://github.com/smajl87
Signed-off-by: Mark Harvey <markh794@gmail.com>
@markh794
Copy link
Owner

Commit 4f36ea7 will fix this.
Please re-open the bug report if you encounter similar issues with this patch applied.
Thanks, Mark

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants