nvme smart-log starts returning meaningless data #1729

dankamongmen · 2022-11-12T08:37:55Z

This might be a kernel bug, or a hardware failure. I'm just hoping to start discussion and troubleshooting here.

I've got 5 NVMe devices in my workstation:

[schwarzgerat](0) $ sudo nvme list
Node                  Generic               SN                   Model                                    Namespace Usage                      Format           FW Rev  
--------------------- --------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme4n1          /dev/ng4n1            1908E1805012         WDS100T3X0C-00SJG0                       1           1.00  TB /   1.00  TB    512   B +  0 B   102000WD
/dev/nvme3n1          /dev/ng3n1            1908E1801188         WDS100T3X0C-00SJG0                       1           1.00  TB /   1.00  TB    512   B +  0 B   102000WD
/dev/nvme2n1          /dev/ng2n1            PHBT729201SR016D     INTEL MEMPEK1W016GA                      1          14.40  GB /  14.40  GB    512   B +  0 B   K3110310
/dev/nvme1n1          /dev/ng1n1            S59CNM0R605099L      Samsung SSD 970 EVO Plus 2TB             1           1.42  TB /   2.00  TB    512   B +  0 B   2B2QEXM7
/dev/nvme0n1          /dev/ng0n1            S59CNM0R736612Y      Samsung SSD 970 EVO Plus 2TB             1           1.42  TB /   2.00  TB    512   B +  0 B   2B2QEXM7
[schwarzgerat](0) $

Three are in motherboard (Gigabyte Aorus XTreme TRX40) M.2 slots, and two are in a PCIe 4.0x16 card using PCIe 4x4x4x4 bifurcation:

[schwarzgerat](0) $ lsscsi -t | grep nvm
[N:0:4:1]    disk    pcie 0x144d:0xa801                         /dev/nvme0n1
[N:1:4:1]    disk    pcie 0x144d:0xa801                         /dev/nvme1n1
[N:2:0:1]    disk    pcie 0x8086:0x3806                         /dev/nvme2n1
[N:3:8215:1] disk    pcie 0x15b7:0x5002                         /dev/nvme3n1
[N:4:8215:1] disk    pcie 0x15b7:0x5002                         /dev/nvme4n1
[schwarzgerat](0) $

I've got a shell script querying nvme smart-log for each of the devices approximately every 15s to get temperature information. This was running fine for several weeks. Then, a few days ago, the Samsung 970 2TB devices began returning a temperature of -273 (0.15 above absolute zero), and all 0s otherwise (I assume the -273 to be a bias, and thus that's also returning a 0):

[schwarzgerat](0) $ sudo nvme smart-log /dev/nvme1
Smart Log for NVME device:nvme1 namespace-id:ffffffff
critical_warning			: 0
temperature				: -273°C (0 Kelvin)
available_spare				: 0%
available_spare_threshold		: 0%
percentage_used				: 0%
endurance group critical warning summary: 0
data_units_read				: 0
data_units_written			: 0
host_read_commands			: 0
host_write_commands			: 0
controller_busy_time			: 0
power_cycles				: 0
power_on_hours				: 0
unsafe_shutdowns			: 0
media_errors				: 0
num_err_log_entries			: 0
Warning Temperature Time		: 0
Critical Composite Temperature Time	: 0
Thermal Management T1 Trans Count	: 0
Thermal Management T2 Trans Count	: 0
Thermal Management T1 Total Time	: 0
Thermal Management T2 Total Time	: 0
[schwarzgerat](0) $

given that my workstation is in fact not careening silently through the unfathomable voids between superclusters of galaxies, nor immersed in a laser-pumped rubidium well, nor actively violating the laws of thermodynamics (so far as i know), this temperature seems low.

this return has persisted ever since, including over a reboot. the other three devices are fine.

how can i assist in debugging this problem? thanks!

The text was updated successfully, but these errors were encountered:

dankamongmen · 2022-11-12T08:41:22Z

oh and since i didn't really make this clear: the devices are otherwise working fine. i've got a zfs mirror atop them, and the world is good. they even seem to be reporting a sensible temperature via some other means, as i'm displaying one in my Growlight application:

it looks like i'm getting that through libatasmart?

keithbusch · 2022-11-12T09:04:09Z

The log reports temperature in degrees Kelvin, but the output converts it to Celsius, so you're getting an all 0's log. Either the controller did that, or perhaps it didn't transfer anything at all and the tool is showing an uninitiated buffer.

Either way, I think this is more likely a Samsung firmware bug

dankamongmen · 2022-11-12T10:20:09Z

Either way, I think this is more likely a Samsung firmware bug

that is my suspicion, as well. the thing that surprises me is that both devices failed this way at the same time.

i'll look into the nvme kernel module's implementation of NVME_IOCTL_ADMIN_CMD. i'd like to leave this bug here to track the effort and serve as a beacon for other people experiencing similar problems.

dankamongmen · 2022-11-12T10:22:21Z

also, it might or might not be a good idea to add code to check for a 0 temperature, and interpret it as a failed request? i think that logic would belong in the kernel, though. just putting the idea out there.

dankamongmen · 2022-11-12T14:35:49Z

in a surprising development, smartctl shows a reasonable temperature (44C) despite appearing to acquire this information via NVME_IOCTL_ADMIN_CMD:

ioctl(3, NVME_IOCTL_ADMIN_CMD, 0x7fff902d4290) = 0
write(1, "SMART overall-health self-assess"..., 57SMART overall-health self-assessment test result: PASSED
) = 57
write(1, "\n", 1
)                       = 1
write(1, "SMART/Health Information (NVMe L"..., 41SMART/Health Information (NVMe Log 0x02)
) = 41
write(1, "Critical Warning:               "..., 41Critical Warning:                   0x00
) = 41
write(1, "Temperature:                    "..., 47Temperature:                        44 Celsius
) = 47
write(1, "Available Spare:                "..., 41Available Spare:                    100%
) = 41
write(1, "Available Spare Threshold:      "..., 40Available Spare Threshold:          10%
) = 40
write(1, "Percentage Used:                "..., 39Percentage Used:                    0%
) = 39
write(1, "Data Units Read:                "..., 57Data Units Read:                    30,401,375 [15.5 TB]
) = 57
write(1, "Data Units Written:             "..., 57Data Units Written:                 38,855,085 [19.8 TB]
) = 57
write(1, "Host Read Commands:             "..., 48Host Read Commands:                 154,838,138
) = 48
write(1, "Host Write Commands:            "..., 48Host Write Commands:                337,074,955
) = 48
write(1, "Controller Busy Time:           "..., 40Controller Busy Time:               836
) = 40
write(1, "Power Cycles:                   "..., 40Power Cycles:                       982
) = 40
write(1, "Power On Hours:                 "..., 42Power On Hours:                     1,345
) = 42
write(1, "Unsafe Shutdowns:               "..., 40Unsafe Shutdowns:                   598
) = 40
write(1, "Media and Data Integrity Errors:"..., 38Media and Data Integrity Errors:    0
) = 38
write(1, "Error Information Log Entries:  "..., 40Error Information Log Entries:      917
) = 40
write(1, "Warning  Comp. Temperature Time:"..., 38Warning  Comp. Temperature Time:    0
) = 38
write(1, "Critical Comp. Temperature Time:"..., 38Critical Comp. Temperature Time:    0
) = 38
write(1, "Temperature Sensor 1:           "..., 47Temperature Sensor 1:               44 Celsius
) = 47
write(1, "Temperature Sensor 2:           "..., 47Temperature Sensor 2:               46 Celsius

this is making me think the problem actually is in nvme, since they're using the same ioctl.

dankamongmen · 2022-11-12T14:38:09Z

i'm gonna look at the two implementations and see if there's a PR i can't prep for you...

keithbusch · 2022-11-12T21:38:37Z

Didn't you mention that this was also working with nvme for several iterations? Perhaps this is some memory address alignment issue, though the driver always enforces what the spec requires.

dankamongmen · 2022-11-12T22:21:37Z

it was working with nvme for some time. i don't know whether the failure correlated with an nvme upgrade, but i can check that.

dankamongmen · 2022-11-12T22:25:12Z

2022-11-07 1125 is when we start seeing the failure. well, hot damn:

Aptitude 0.8.13: log report    
Mon, Nov  7 2022 11:17:50 -0500   
...
[UPGRADE] nvme-cli:amd64 2.1.2-2 -> 2.2.1-1
...

dankamongmen · 2022-11-12T22:29:07Z

Didn't you mention that this was also working with nvme for several iterations? Perhaps this is some memory address alignment issue, though the driver always enforces what the spec requires.

I would expect this to generate EADDR on the kernel side? your ioctl() is succeeding (well, 0 is being returned).

keithbusch · 2022-11-12T22:41:31Z

If the kernel gets an unaligned address here, it bounces it through an aligned one, so no need to return an error. I was just wondering if the device was secretly relying on some nonstandard alignment that happened to work in some cases.

But since you identified an earlier version of that was successful, it should be easier to investigate, or maybe bisect.

dankamongmen · 2022-11-13T00:18:17Z

yep, i'll get to this shortly

dankamongmen · 2022-11-13T21:13:05Z

notes to self:

nvme:

nvme_show_smart_log in common code nvme-print.c dumps this data
temperature is a kelvin u16: __u16 temperature = smart->temperature[1] << 8 | smart->temperature[0];
kelvin 0 (absolute zero) is celsius -273.15
smart_log is filled in in nvme.c's call to nvme_cli_get_log_smart(), which calls through macro do_admin_op() to either nvme_get_log_smart() or nvme_mi_admin_get_log_smart() based off d->type. either way, we cross the libnvme border.

smartmontools:

print_smart_log() in nvmeprint.cpp gets an nvme_smart_log filled in by nvme_read_smart_log(): int k = sg_get_unaligned_le16(smart_log.temperature);
smartctl does not link against libnvme, instead implementing its own nvme_pass_through(), nvme_read_log_page_1(), and nvme_read_smart_log()

so let's see whether they're getting the same bytes out. if so, it's a matter of translation/presentation in nvme. if not, it's an issue in libnvme. given the breakage corresponded to a new nvme, i'm guessing the former, but given that all smart data is 0, maybe it's the latter.

dankamongmen · 2022-11-13T21:22:13Z

i see 0s for both smart->temperature[0] and smart->temperature[1] at the top of nvme_show_smart_log(). so the problem is maybe down in libnvme's nvme_get_log_smart().

dankamongmen · 2022-11-13T21:24:24Z

nvme_get_log_smart() -> nvme_get_nsid_log() -> nvme_get_log_page() ->

~~suspicious initialization here: .args_size = sizeof(args), where args is a struct nvme_get_log_args *args. do we not want (*args)?~~ args is the struct being initialized, not a problem

dankamongmen · 2022-11-13T21:33:56Z

looking through libnvme, Daniel Wagner's changes in early november to nvme_get_log_page() seem suspicious by proximity...looking into it

dankamongmen · 2022-11-13T21:35:14Z

argh ninja install wrote to my root instead of /usr/local, erp. oh well uninstall appears to work!

dankamongmen · 2022-11-13T21:36:52Z

building libnvme 061c89c84c9f9a68dfd00d4e00218041facb3fef does not fix the problem.

dankamongmen · 2022-11-13T21:39:52Z

nvme-cli commit a633aef with libnvme commit a001e59b7751fcb2076231642e2e4d077afb4fad successfully retrieve the data; this is a definite regression, and not a hardware failure (though the hardware might be doing something weird). i will now zap this with a blessed +3 wand of bisection.

dankamongmen · 2022-11-13T21:43:24Z

rebuilding with libnvme HEAD (7e9f5a2ab4f6b56200aa4a66ea2b080d2ecea52b) works just fine. the problem is in nvme-cli.

i have eliminated the hand-built libnvme from /usr/local, and reinstalled debian unstable's libnvme1 and libnvme-dev. the specified nvme revision doesn't build, but looks like it can be easily fixed up. bisection ought now reveal the cultprit.

dankamongmen · 2022-11-13T21:52:38Z

[schwarzgerat](0) $ git bisect good
cc73f65fc0162d3624e0b5afebbd0eb006967f98 is the first bad commit
commit cc73f65fc0162d3624e0b5afebbd0eb006967f98
Author: Jeremy Kerr <jk@codeconstruct.com.au>
Date:   Wed Jul 20 12:28:40 2022 +0800

    nvme: Add wrappers for Get Log page helpers
    
    Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>

 nvme-wrap.c | 190 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 nvme-wrap.h |  69 ++++++++++++++++++++++
 nvme.c      |  97 +++++++++++++++----------------
 3 files changed, 306 insertions(+), 50 deletions(-)
[schwarzgerat](0) $

dankamongmen · 2022-11-13T21:53:03Z

@jk-ozlabs you might wanna take a look at this

dankamongmen · 2022-11-13T22:36:37Z

PR incoming

dankamongmen · 2022-11-13T22:47:27Z

Alright, this PR flips the bool handed to nvme_get_smart_log() back to ~~true~~ false, and fixes the issue for me. I've tested, and all five of my devices work as expected. I haven't looked into the deeper semantics and history of this bool, and it may well be that this is not the appropriate fix. I'll leave that in y'alls' capable hands; this ought be enough to remedy any further issue. Feel free to merge my PR or ask me to test one of y'alls'.

keithbusch · 2022-11-14T01:55:34Z

Just want to double check, you wrote your change sets a bool to true, but the patch sets it to false. I am assuming the code matches your intent.

dankamongmen · 2022-11-14T02:04:31Z

Just want to double check, you wrote your change sets a bool to true, but the patch sets it to false. I am assuming the code matches your intent.

aye

Changing this to true resulted in my Western Digital 970 EVO Pro NVMe SSDs returning all zeros for smartlog information. Reverts change made in cc73f65 ("nvme: Add wrappers for Get Log page helpers") Resolves linux-nvme#1729. Signed-off-by: nick black <dankamongmen@gmail.com>

dankamongmen mentioned this issue Nov 13, 2022

nvme: call nvme_get_smart_log() with false #1731

Merged

igaw closed this as completed in #1731 Nov 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

nvme smart-log starts returning meaningless data #1729

nvme smart-log starts returning meaningless data #1729

dankamongmen commented Nov 12, 2022

dankamongmen commented Nov 12, 2022

keithbusch commented Nov 12, 2022

dankamongmen commented Nov 12, 2022

dankamongmen commented Nov 12, 2022

dankamongmen commented Nov 12, 2022

dankamongmen commented Nov 12, 2022

keithbusch commented Nov 12, 2022 •

edited

Loading

dankamongmen commented Nov 12, 2022

dankamongmen commented Nov 12, 2022 •

edited

Loading

dankamongmen commented Nov 12, 2022

keithbusch commented Nov 12, 2022

dankamongmen commented Nov 13, 2022 via email

dankamongmen commented Nov 13, 2022

dankamongmen commented Nov 13, 2022

dankamongmen commented Nov 13, 2022 •

edited

Loading

dankamongmen commented Nov 13, 2022

dankamongmen commented Nov 13, 2022

dankamongmen commented Nov 13, 2022

dankamongmen commented Nov 13, 2022

dankamongmen commented Nov 13, 2022

dankamongmen commented Nov 13, 2022

dankamongmen commented Nov 13, 2022

dankamongmen commented Nov 13, 2022

dankamongmen commented Nov 13, 2022 •

edited

Loading

keithbusch commented Nov 14, 2022

dankamongmen commented Nov 14, 2022

nvme smart-log starts returning meaningless data #1729

nvme smart-log starts returning meaningless data #1729

Comments

dankamongmen commented Nov 12, 2022

dankamongmen commented Nov 12, 2022

keithbusch commented Nov 12, 2022

dankamongmen commented Nov 12, 2022

dankamongmen commented Nov 12, 2022

dankamongmen commented Nov 12, 2022

dankamongmen commented Nov 12, 2022

keithbusch commented Nov 12, 2022 • edited Loading

dankamongmen commented Nov 12, 2022

dankamongmen commented Nov 12, 2022 • edited Loading

dankamongmen commented Nov 12, 2022

keithbusch commented Nov 12, 2022

dankamongmen commented Nov 13, 2022 via email

dankamongmen commented Nov 13, 2022

dankamongmen commented Nov 13, 2022

dankamongmen commented Nov 13, 2022 • edited Loading

dankamongmen commented Nov 13, 2022

dankamongmen commented Nov 13, 2022

dankamongmen commented Nov 13, 2022

dankamongmen commented Nov 13, 2022

dankamongmen commented Nov 13, 2022

dankamongmen commented Nov 13, 2022

dankamongmen commented Nov 13, 2022

dankamongmen commented Nov 13, 2022

dankamongmen commented Nov 13, 2022 • edited Loading

keithbusch commented Nov 14, 2022

dankamongmen commented Nov 14, 2022

keithbusch commented Nov 12, 2022 •

edited

Loading

dankamongmen commented Nov 12, 2022 •

edited

Loading

dankamongmen commented Nov 13, 2022 •

edited

Loading

dankamongmen commented Nov 13, 2022 •

edited

Loading