Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nvme smart-log starts returning meaningless data #1729

Closed
dankamongmen opened this issue Nov 12, 2022 · 26 comments · Fixed by #1731
Closed

nvme smart-log starts returning meaningless data #1729

dankamongmen opened this issue Nov 12, 2022 · 26 comments · Fixed by #1731

Comments

@dankamongmen
Copy link
Contributor

This might be a kernel bug, or a hardware failure. I'm just hoping to start discussion and troubleshooting here.

I've got 5 NVMe devices in my workstation:

[schwarzgerat](0) $ sudo nvme list
Node                  Generic               SN                   Model                                    Namespace Usage                      Format           FW Rev  
--------------------- --------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme4n1          /dev/ng4n1            1908E1805012         WDS100T3X0C-00SJG0                       1           1.00  TB /   1.00  TB    512   B +  0 B   102000WD
/dev/nvme3n1          /dev/ng3n1            1908E1801188         WDS100T3X0C-00SJG0                       1           1.00  TB /   1.00  TB    512   B +  0 B   102000WD
/dev/nvme2n1          /dev/ng2n1            PHBT729201SR016D     INTEL MEMPEK1W016GA                      1          14.40  GB /  14.40  GB    512   B +  0 B   K3110310
/dev/nvme1n1          /dev/ng1n1            S59CNM0R605099L      Samsung SSD 970 EVO Plus 2TB             1           1.42  TB /   2.00  TB    512   B +  0 B   2B2QEXM7
/dev/nvme0n1          /dev/ng0n1            S59CNM0R736612Y      Samsung SSD 970 EVO Plus 2TB             1           1.42  TB /   2.00  TB    512   B +  0 B   2B2QEXM7
[schwarzgerat](0) $ 

Three are in motherboard (Gigabyte Aorus XTreme TRX40) M.2 slots, and two are in a PCIe 4.0x16 card using PCIe 4x4x4x4 bifurcation:

[schwarzgerat](0) $ lsscsi -t | grep nvm
[N:0:4:1]    disk    pcie 0x144d:0xa801                         /dev/nvme0n1
[N:1:4:1]    disk    pcie 0x144d:0xa801                         /dev/nvme1n1
[N:2:0:1]    disk    pcie 0x8086:0x3806                         /dev/nvme2n1
[N:3:8215:1] disk    pcie 0x15b7:0x5002                         /dev/nvme3n1
[N:4:8215:1] disk    pcie 0x15b7:0x5002                         /dev/nvme4n1
[schwarzgerat](0) $ 

I've got a shell script querying nvme smart-log for each of the devices approximately every 15s to get temperature information. This was running fine for several weeks. Then, a few days ago, the Samsung 970 2TB devices began returning a temperature of -273 (0.15 above absolute zero), and all 0s otherwise (I assume the -273 to be a bias, and thus that's also returning a 0):

[schwarzgerat](0) $ sudo nvme smart-log /dev/nvme1
Smart Log for NVME device:nvme1 namespace-id:ffffffff
critical_warning			: 0
temperature				: -273°C (0 Kelvin)
available_spare				: 0%
available_spare_threshold		: 0%
percentage_used				: 0%
endurance group critical warning summary: 0
data_units_read				: 0
data_units_written			: 0
host_read_commands			: 0
host_write_commands			: 0
controller_busy_time			: 0
power_cycles				: 0
power_on_hours				: 0
unsafe_shutdowns			: 0
media_errors				: 0
num_err_log_entries			: 0
Warning Temperature Time		: 0
Critical Composite Temperature Time	: 0
Thermal Management T1 Trans Count	: 0
Thermal Management T2 Trans Count	: 0
Thermal Management T1 Total Time	: 0
Thermal Management T2 Total Time	: 0
[schwarzgerat](0) $ 

given that my workstation is in fact not careening silently through the unfathomable voids between superclusters of galaxies, nor immersed in a laser-pumped rubidium well, nor actively violating the laws of thermodynamics (so far as i know), this temperature seems low.

this return has persisted ever since, including over a reboot. the other three devices are fine.

how can i assist in debugging this problem? thanks!

@dankamongmen
Copy link
Contributor Author

oh and since i didn't really make this clear: the devices are otherwise working fine. i've got a zfs mirror atop them, and the world is good. they even seem to be reporting a sensible temperature via some other means, as i'm displaying one in my Growlight application:

2022-11-12-033944_1453x1410_scrot

it looks like i'm getting that through libatasmart?

@keithbusch
Copy link
Contributor

The log reports temperature in degrees Kelvin, but the output converts it to Celsius, so you're getting an all 0's log. Either the controller did that, or perhaps it didn't transfer anything at all and the tool is showing an uninitiated buffer.

Either way, I think this is more likely a Samsung firmware bug

@dankamongmen
Copy link
Contributor Author

Either way, I think this is more likely a Samsung firmware bug

that is my suspicion, as well. the thing that surprises me is that both devices failed this way at the same time.

i'll look into the nvme kernel module's implementation of NVME_IOCTL_ADMIN_CMD. i'd like to leave this bug here to track the effort and serve as a beacon for other people experiencing similar problems.

@dankamongmen
Copy link
Contributor Author

also, it might or might not be a good idea to add code to check for a 0 temperature, and interpret it as a failed request? i think that logic would belong in the kernel, though. just putting the idea out there.

@dankamongmen
Copy link
Contributor Author

in a surprising development, smartctl shows a reasonable temperature (44C) despite appearing to acquire this information via NVME_IOCTL_ADMIN_CMD:

ioctl(3, NVME_IOCTL_ADMIN_CMD, 0x7fff902d4290) = 0
write(1, "SMART overall-health self-assess"..., 57SMART overall-health self-assessment test result: PASSED
) = 57
write(1, "\n", 1
)                       = 1
write(1, "SMART/Health Information (NVMe L"..., 41SMART/Health Information (NVMe Log 0x02)
) = 41
write(1, "Critical Warning:               "..., 41Critical Warning:                   0x00
) = 41
write(1, "Temperature:                    "..., 47Temperature:                        44 Celsius
) = 47
write(1, "Available Spare:                "..., 41Available Spare:                    100%
) = 41
write(1, "Available Spare Threshold:      "..., 40Available Spare Threshold:          10%
) = 40
write(1, "Percentage Used:                "..., 39Percentage Used:                    0%
) = 39
write(1, "Data Units Read:                "..., 57Data Units Read:                    30,401,375 [15.5 TB]
) = 57
write(1, "Data Units Written:             "..., 57Data Units Written:                 38,855,085 [19.8 TB]
) = 57
write(1, "Host Read Commands:             "..., 48Host Read Commands:                 154,838,138
) = 48
write(1, "Host Write Commands:            "..., 48Host Write Commands:                337,074,955
) = 48
write(1, "Controller Busy Time:           "..., 40Controller Busy Time:               836
) = 40
write(1, "Power Cycles:                   "..., 40Power Cycles:                       982
) = 40
write(1, "Power On Hours:                 "..., 42Power On Hours:                     1,345
) = 42
write(1, "Unsafe Shutdowns:               "..., 40Unsafe Shutdowns:                   598
) = 40
write(1, "Media and Data Integrity Errors:"..., 38Media and Data Integrity Errors:    0
) = 38
write(1, "Error Information Log Entries:  "..., 40Error Information Log Entries:      917
) = 40
write(1, "Warning  Comp. Temperature Time:"..., 38Warning  Comp. Temperature Time:    0
) = 38
write(1, "Critical Comp. Temperature Time:"..., 38Critical Comp. Temperature Time:    0
) = 38
write(1, "Temperature Sensor 1:           "..., 47Temperature Sensor 1:               44 Celsius
) = 47
write(1, "Temperature Sensor 2:           "..., 47Temperature Sensor 2:               46 Celsius

this is making me think the problem actually is in nvme, since they're using the same ioctl.

@dankamongmen
Copy link
Contributor Author

i'm gonna look at the two implementations and see if there's a PR i can't prep for you...

@keithbusch
Copy link
Contributor

keithbusch commented Nov 12, 2022

Didn't you mention that this was also working with nvme for several iterations? Perhaps this is some memory address alignment issue, though the driver always enforces what the spec requires.

@dankamongmen
Copy link
Contributor Author

it was working with nvme for some time. i don't know whether the failure correlated with an nvme upgrade, but i can check that.

2022-11-12_17-21

@dankamongmen
Copy link
Contributor Author

dankamongmen commented Nov 12, 2022

2022-11-07 1125 is when we start seeing the failure. well, hot damn:

Aptitude 0.8.13: log report    
Mon, Nov  7 2022 11:17:50 -0500   
...
[UPGRADE] nvme-cli:amd64 2.1.2-2 -> 2.2.1-1
...

@dankamongmen
Copy link
Contributor Author

Didn't you mention that this was also working with nvme for several iterations? Perhaps this is some memory address alignment issue, though the driver always enforces what the spec requires.

I would expect this to generate EADDR on the kernel side? your ioctl() is succeeding (well, 0 is being returned).

@keithbusch
Copy link
Contributor

If the kernel gets an unaligned address here, it bounces it through an aligned one, so no need to return an error. I was just wondering if the device was secretly relying on some nonstandard alignment that happened to work in some cases.

But since you identified an earlier version of that was successful, it should be easier to investigate, or maybe bisect.

@dankamongmen
Copy link
Contributor Author

dankamongmen commented Nov 13, 2022 via email

@dankamongmen
Copy link
Contributor Author

notes to self:

nvme:

  • nvme_show_smart_log in common code nvme-print.c dumps this data
  • temperature is a kelvin u16: __u16 temperature = smart->temperature[1] << 8 | smart->temperature[0];
  • kelvin 0 (absolute zero) is celsius -273.15
  • smart_log is filled in in nvme.c's call to nvme_cli_get_log_smart(), which calls through macro do_admin_op() to either nvme_get_log_smart() or nvme_mi_admin_get_log_smart() based off d->type. either way, we cross the libnvme border.

smartmontools:

  • print_smart_log() in nvmeprint.cpp gets an nvme_smart_log filled in by nvme_read_smart_log(): int k = sg_get_unaligned_le16(smart_log.temperature);
  • smartctl does not link against libnvme, instead implementing its own nvme_pass_through(), nvme_read_log_page_1(), and nvme_read_smart_log()

so let's see whether they're getting the same bytes out. if so, it's a matter of translation/presentation in nvme. if not, it's an issue in libnvme. given the breakage corresponded to a new nvme, i'm guessing the former, but given that all smart data is 0, maybe it's the latter.

@dankamongmen
Copy link
Contributor Author

i see 0s for both smart->temperature[0] and smart->temperature[1] at the top of nvme_show_smart_log(). so the problem is maybe down in libnvme's nvme_get_log_smart().

@dankamongmen
Copy link
Contributor Author

dankamongmen commented Nov 13, 2022

nvme_get_log_smart() -> nvme_get_nsid_log() -> nvme_get_log_page() ->

suspicious initialization here: .args_size = sizeof(args), where args is a struct nvme_get_log_args *args. do we not want (*args)? args is the struct being initialized, not a problem

@dankamongmen
Copy link
Contributor Author

looking through libnvme, Daniel Wagner's changes in early november to nvme_get_log_page() seem suspicious by proximity...looking into it

@dankamongmen
Copy link
Contributor Author

argh ninja install wrote to my root instead of /usr/local, erp. oh well uninstall appears to work!

@dankamongmen
Copy link
Contributor Author

building libnvme 061c89c84c9f9a68dfd00d4e00218041facb3fef does not fix the problem.

@dankamongmen
Copy link
Contributor Author

nvme-cli commit a633aef with libnvme commit a001e59b7751fcb2076231642e2e4d077afb4fad successfully retrieve the data; this is a definite regression, and not a hardware failure (though the hardware might be doing something weird). i will now zap this with a blessed +3 wand of bisection.

@dankamongmen
Copy link
Contributor Author

rebuilding with libnvme HEAD (7e9f5a2ab4f6b56200aa4a66ea2b080d2ecea52b) works just fine. the problem is in nvme-cli.

i have eliminated the hand-built libnvme from /usr/local, and reinstalled debian unstable's libnvme1 and libnvme-dev. the specified nvme revision doesn't build, but looks like it can be easily fixed up. bisection ought now reveal the cultprit.

@dankamongmen
Copy link
Contributor Author

[schwarzgerat](0) $ git bisect good
cc73f65fc0162d3624e0b5afebbd0eb006967f98 is the first bad commit
commit cc73f65fc0162d3624e0b5afebbd0eb006967f98
Author: Jeremy Kerr <jk@codeconstruct.com.au>
Date:   Wed Jul 20 12:28:40 2022 +0800

    nvme: Add wrappers for Get Log page helpers
    
    Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>

 nvme-wrap.c | 190 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 nvme-wrap.h |  69 ++++++++++++++++++++++
 nvme.c      |  97 +++++++++++++++----------------
 3 files changed, 306 insertions(+), 50 deletions(-)
[schwarzgerat](0) $ 

@dankamongmen
Copy link
Contributor Author

@jk-ozlabs you might wanna take a look at this

@dankamongmen
Copy link
Contributor Author

PR incoming

@dankamongmen
Copy link
Contributor Author

dankamongmen commented Nov 13, 2022

Alright, this PR flips the bool handed to nvme_get_smart_log() back to true false, and fixes the issue for me. I've tested, and all five of my devices work as expected. I haven't looked into the deeper semantics and history of this bool, and it may well be that this is not the appropriate fix. I'll leave that in y'alls' capable hands; this ought be enough to remedy any further issue. Feel free to merge my PR or ask me to test one of y'alls'.

@keithbusch
Copy link
Contributor

Just want to double check, you wrote your change sets a bool to true, but the patch sets it to false. I am assuming the code matches your intent.

@dankamongmen
Copy link
Contributor Author

Just want to double check, you wrote your change sets a bool to true, but the patch sets it to false. I am assuming the code matches your intent.

aye

igaw pushed a commit to dankamongmen/nvme-cli that referenced this issue Nov 16, 2022
Changing this to true resulted in my Western Digital 970 EVO Pro NVMe
SSDs returning all zeros for smartlog information. Reverts change made
in cc73f65 ("nvme: Add wrappers for Get Log page helpers")

Resolves linux-nvme#1729.

Signed-off-by: nick black <dankamongmen@gmail.com>
lgdacunh pushed a commit to solidigm/nvme-cli that referenced this issue Nov 10, 2023
Changing this to true resulted in my Western Digital 970 EVO Pro NVMe
SSDs returning all zeros for smartlog information. Reverts change made
in cc73f65 ("nvme: Add wrappers for Get Log page helpers")

Resolves linux-nvme#1729.

Signed-off-by: nick black <dankamongmen@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants