Kernel panic with Smartmontools 6.6 on ESXi 6.7 when trying to enable smart on rpool (vdisk) #960

Closed
guenther-alka opened this Issue Aug 27, 2018 · 8 comments

guenther-alka commented Aug 27, 2018

OmniOS should not crash on this command

Environment:
ESXi 6.7
OmniOS 151026 b6848f4455 July 2018 on an ESXi vdisk on SATA controller 0
Smartmontools 6.6

Kernel panic on the following command (rpool is on c1t0d0s0):
/usr/sbin/smartctl -d sat,12 -T permissive --smart=on /dev/rdsk/c1t0d0s0

same command on OpenIndiana 2018.04
root@oi201804:~# uname -a
SunOS oi201804 5.11 illumos-acab0a4f50 i86pc i386 i86pc
root@oi201804:~# cat /etc/release
OpenIndiana Hipster 2018.04 (powered by illumos)
OpenIndiana Project, part of The Illumos Foundation (C) 2010-2018
Use is subject to license terms.
Assembled 27 April 2018

results in

Smartctl open device: /dev/rdsk/c1t0d0s0 [SAT] failed: No such device
root@oi201804:~# /usr/sbin/smartctl -d sat,12 -T permissive --smart=on /dev/rdsk/c4t0d0s0
smartctl 6.6 2017-11-05 r4594 [i386-pc-solaris2.11] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

Read Device Identity failed: scsi error unsupported scsi opcode

SMART support is: Ambiguous - ATA IDENTIFY DEVICE words 82-83 don't show if SMART supported.
SMART support is: Ambiguous - ATA IDENTIFY DEVICE words 85-87 don't show if SMART is enabled.
A mandatory SMART

Crashdump at http://openzfs.hfg-gmuend.de/tmp/vmdump.8

@citrus-it citrus-it added the bug label Aug 27, 2018

@citrus-it citrus-it self-assigned this Aug 27, 2018

Member

citrus-it commented Aug 27, 2018

 > $C
fffffe0005ce6a30 ahci_dump_commands+0x77(fffffe04e7dda740, 0, 80000000)
fffffe0005ce6ad0 ahci_intr_fatal_error+0x2d6(fffffe04e7dda740, fffffe04e7de3940, 0, 40000000)
fffffe0005ce6b40 ahci_port_intr+0x229(fffffe04e7dda740, fffffe04e7de3940, 0)
fffffe0005ce6b80 ahci_intr+0xb8(fffffe04e7dda740, 0)
fffffe0005ce6bf0 apix_dispatch_pending_autovect+0x101(5)
fffffe0005ce6c20 apix_dispatch_pending_hardint+0x34(0, 0)
fffffe00070364b0 switch_sp_and_call+0x13()

> ::status
debugging crash dump vmcore.8 (64-bit) from napp-it-026
operating system: 5.11 omnios-r151026-b6848f4455 (i86pc)
image uuid: ff50b83b-96a2-ca7f-a98b-fa7c47b82476
panic message:
BAD TRAP: type=e (#pf Page fault) rp=fffffe0005ce6820 addr=48 occurred in module
 "ahci" due to a NULL pointer dereference
dump content: kernel pages only
  (curproc requested, but a kernel thread panicked)

citrus-it commented Aug 27, 2018

static void
ahci_dump_commands(ahci_ctl_t *ahci_ctlp, uint8_t port,
    uint32_t slot_tags)
{
        ahci_port_t *ahci_portp;
        int tmp_slot;
        sata_pkt_t *spkt;
        sata_cmd_t cmd;

        ahci_portp = ahci_ctlp->ahcictl_ports[port];
        ASSERT(ahci_portp != NULL);

        while (slot_tags) {
                tmp_slot = ddi_ffs(slot_tags) - 1;
                if (tmp_slot == -1) {
                        break;
                }

                spkt = ahci_portp->ahciport_slot_pkts[tmp_slot];
                ASSERT(spkt != NULL);
                cmd = spkt->satapkt_cmd;

> $C ! head -1
fffffe0005ce6a30 ahci_dump_commands+0x77(fffffe04e7dda740, 0, 80000000)

> fffffe04e7dda740::print -t ahci_ctl_t ahcictl_ports[0]
ahci_port_t *ahcictl_ports[0] = 0xfffffe04e7de3940
> ::regs ! grep rbx
%rbx = 0x000000000000001f                 %r10 = 0x0000000000000001
> 0xfffffe04e7de3940::print -t ahci_port_t ahciport_slot_pkts[0x1f]
sata_pkt_t *ahciport_slot_pkts[0x1f] = 0

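The crash occurs because ahci_ctlp->ahcictl_ports[port=0]->ahciport_slot_pkts[tmp_slot=0x1f] is NULL. 0x1f is indeed the bit set in slot_tags (0x80000000), which means that slot_tags references an empty slot. Since the panic is a page fault rather than an assertion failure, the ASSERT was evidently compiled out (non-DEBUG kernel), and the NULL spkt is then dereferenced by cmd = spkt->satapkt_cmd.
In fact, all of the ahciport_slot_pkts entries are NULL: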

> 0xfffffe04e7de3940::print -t ahci_port_t ahciport_slot_pkts
sata_pkt_t *[32] ahciport_slot_pkts = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
>

None of this code has been updated recently, so it is strange that OpenIndiana does not crash here.
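
To make the slot arithmetic above concrete, here is a minimal user-space sketch. It is not the driver code and not the eventual fix: demo_pkt_t, demo_ffs and demo_count_live_slots are hypothetical names made up for illustration, and demo_ffs only mimics the 1-based convention that the excerpt relies on for ddi_ffs. It shows that a slot_tags value of 0x80000000 maps to index 0x1f, and how a walk over the tag bits can skip empty slots instead of dereferencing them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NUM_SLOTS 32

/* Hypothetical stand-in for sata_pkt_t; it exists only so the sketch compiles. */
typedef struct demo_pkt {
    int cmd;
} demo_pkt_t;

/*
 * 1-based index of the lowest set bit, or 0 if no bit is set.
 * This mirrors the convention the excerpt assumes for ddi_ffs().
 */
int
demo_ffs(uint32_t mask)
{
    int pos = 1;

    if (mask == 0)
        return (0);
    while ((mask & 1u) == 0) {
        mask >>= 1;
        pos++;
    }
    return (pos);
}

/*
 * Walk every bit set in slot_tags. Slots that hold a packet are counted
 * as live; slots whose pointer is NULL are counted in *empty and skipped
 * rather than dereferenced.
 */
int
demo_count_live_slots(demo_pkt_t **slots, uint32_t slot_tags, int *empty)
{
    int live = 0;

    assert(slots != NULL && empty != NULL);
    *empty = 0;

    while (slot_tags != 0) {
        int slot = demo_ffs(slot_tags) - 1;

        if (slots[slot] == NULL)
            (*empty)++;         /* the case seen in the crash dump */
        else
            live++;

        slot_tags &= ~((uint32_t)1 << slot);    /* clear the handled bit */
    }
    return (live);
}
```

With a 32-entry array of NULL pointers and slot_tags set to 0x80000000, demo_ffs returns 32 (index 0x1f) and demo_count_live_slots reports zero live slots and one empty slot, which matches what the dump shows for port 0.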

citrus-it commented Aug 27, 2018

Looking at the sata ring buffer:

[2018 Aug 27 17:49:10:765:741:663] ahci0: ahci_intr_fatal_error: port 0 task_file_status = 0x441
[2018 Aug 27 17:49:10:765:742:666] ahci0: ahci_intr_fatal_error: spkt 0x0 is being processed when fatal error occurred for port 0

spkt (assigned via spkt = ahci_portp->ahciport_slot_pkts[failed_slot];) is already 0x0 when the fatal error is detected. The failed slot number is read from the HBA.
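
As a side note on the task_file_status value in the log above, the following sketch splits it into its two bytes. It assumes the usual AHCI task file data layout (ATA status register in bits 7:0, ATA error register in bits 15:8) and the standard ATA bit meanings; the demo_* names are hypothetical and are not taken from the ahci driver. Reading 0x441 as "device ready, error, command aborted" is an interpretation of the bits, not something stated in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* ATA status register bits (assumed to be the low byte of the value). */
#define DEMO_STS_ERR    0x01u
#define DEMO_STS_DRDY   0x40u
#define DEMO_STS_BSY    0x80u

/* ATA error register bit (assumed to be the second byte of the value). */
#define DEMO_ERR_ABRT   0x04u

/* Extract the status byte, bits 7:0. */
uint8_t
demo_tfd_status(uint32_t tfd)
{
    return ((uint8_t)(tfd & 0xffu));
}

/* Extract the error byte, bits 15:8. */
uint8_t
demo_tfd_error(uint32_t tfd)
{
    return ((uint8_t)((tfd >> 8) & 0xffu));
}

/* Format a short human-readable summary into buf; returns snprintf's result. */
int
demo_tfd_describe(uint32_t tfd, char *buf, size_t len)
{
    unsigned int sts = demo_tfd_status(tfd);
    unsigned int err = demo_tfd_error(tfd);

    assert(buf != NULL && len > 0);
    return (snprintf(buf, len, "status=0x%02x%s%s%s error=0x%02x%s",
        sts,
        (sts & DEMO_STS_BSY) ? " BSY" : "",
        (sts & DEMO_STS_DRDY) ? " DRDY" : "",
        (sts & DEMO_STS_ERR) ? " ERR" : "",
        err,
        (err & DEMO_ERR_ABRT) ? " ABRT" : ""));
}
```

For the logged value 0x441 this yields a status byte of 0x41 (DRDY and ERR set) and an error byte of 0x04 (ABRT), which would be consistent with the virtual disk rejecting the SMART command.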

@citrus-it citrus-it changed the title from Kernelpanic with Smartmontools 6.6 on ESXi 6.7 when trying to enable smart on rpool (vdisk) to Kernel panic with Smartmontools 6.6 on ESXi 6.7 when trying to enable smart on rpool (vdisk) Aug 27, 2018

citrus-it commented Aug 27, 2018

Raised issue against upstream illumos at https://www.illumos.org/issues/9772

citrus-it commented Aug 27, 2018

Can you please try this hot-fix and report back?

% pfexec pkg apply-hot-fix --be-name=ahci_9772 https://downloads.omniosce.org/pkg/r151026/ahci_9772.p5p
% pfexec init 6

guenther-alka commented Aug 28, 2018

Thanks a lot
The hot-fix works, no kernel panic now!

citrus-it commented Aug 28, 2018

Great. We’ll get this upstreamed to illumos.

citrus-it commented Sep 18, 2018

This will be fixed in the r151026u release next week (and in r151022bt the week after).

@citrus-it citrus-it closed this Sep 18, 2018
