nvme smart-log starts returning meaningless data #1729
Comments
oh and since i didn't really make this clear: the devices are otherwise working fine. i've got a zfs mirror atop them, and the world is good. they even seem to be reporting a sensible temperature via some other means, as i'm displaying one in my Growlight application: it looks like i'm getting that through libatasmart?
The log reports temperature in Kelvin, but the output converts it to Celsius, so you're getting an all-zeros log. Either the controller did that, or perhaps it didn't transfer anything at all and the tool is showing an uninitialized buffer. Either way, I think this is more likely a Samsung firmware bug.
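To make the Kelvin-to-Celsius point concrete, here is a minimal sketch of how a fully zeroed log page turns into the reported -273. It assumes the standard SMART / Health Information log layout (512 bytes, composite temperature in bytes 1-2, little-endian, in Kelvin) and an integer offset of 273, which matches the value in the report. The function name is illustrative and is not taken from nvme-cli.

```python
import struct

SMART_LOG_LEN = 512  # SMART / Health Information log page size in bytes


def composite_temp_celsius(log: bytes) -> int:
    """Decode the composite temperature from a raw SMART log page."""
    # Bytes 1-2 hold the composite temperature in Kelvin, little-endian.
    (kelvin,) = struct.unpack_from("<H", log, 1)
    # Integer offset assumed here because it matches the -273 in the report.
    return kelvin - 273


zeroed = bytes(SMART_LOG_LEN)          # a buffer the device never filled in
print(composite_temp_celsius(zeroed))  # -273
```

So a reading of -273 alongside all-zero counters is consistent with the buffer never having been written, rather than with a real sensor value.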
that is my suspicion, as well. the thing that surprises me is that both devices failed this way at the same time. i'll look into the nvme kernel module's implementation of
also, it might or might not be a good idea to add code to check for a 0 temperature, and interpret it as a failed request? i think that logic would belong in the kernel, though. just putting the idea out there.
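The zero-temperature check floated above could look roughly like the sketch below. It is written in user-space Python purely for illustration; nothing here is claimed to exist in the kernel or in nvme-cli, and the helper name `checked_temp_celsius` is hypothetical. It assumes the same log layout as the NVMe SMART page (composite temperature in Kelvin at bytes 1-2, little-endian).

```python
from typing import Optional


def checked_temp_celsius(log: bytes) -> Optional[int]:
    """Return the composite temperature in Celsius, or None if implausible.

    Hypothetical helper: a reading of 0 K is treated as a failed request
    instead of being reported as -273.
    """
    if len(log) < 3:
        return None  # too short to contain the temperature field
    kelvin = int.from_bytes(log[1:3], "little")
    if kelvin == 0:
        return None  # 0 K is not a physically plausible drive temperature
    return kelvin - 273


print(checked_temp_celsius(bytes(512)))  # None
```

Returning `None` rather than raising keeps the choice of how to report the failure with the caller, which is one possible design among several.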
in a surprising development,
this is making me think the problem actually is in
i'm gonna look at the two implementations and see if there's a PR i can prep for you...
Didn't you mention that this was also working with
2022-11-07 1125 is when we start seeing the failure. well, hot damn:
I would expect this to generate
If the kernel gets an unaligned address here, it bounces it through an aligned one, so no need to return an error. I was just wondering if the device was secretly relying on some nonstandard alignment that happened to work in some cases. But since you identified an earlier version that was successful, it should be easier to investigate, or maybe bisect.
yep, i'll get to this shortly
notes to self: nvme:
smartmontools:
so let's see whether they're getting the same bytes out. if so, it's a matter of translation/presentation in nvme. if not, it's an issue in libnvme. given the breakage corresponded to a new nvme, i'm guessing the former, but given that all smart data is 0, maybe it's the latter.
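The comparison described above can be sketched as a byte-level diff of the two raw log pages. This is only an illustration under stated assumptions: how each tool's raw 512-byte buffer is captured is not shown, and the function names and verdict strings are hypothetical.

```python
from typing import List


def differing_offsets(a: bytes, b: bytes) -> List[int]:
    """Return the byte offsets at which two equally sized buffers differ."""
    if len(a) != len(b):
        raise ValueError("buffers must be the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def diagnose(nvme_raw: bytes, smartctl_raw: bytes) -> str:
    """Apply the reasoning above to two captured SMART log pages."""
    if differing_offsets(nvme_raw, smartctl_raw):
        # Different bytes: suspect the retrieval path rather than decoding.
        return "retrieval"
    # Same bytes: any difference in output comes from translation/presentation.
    return "presentation"


good = bytes([0, 0x36, 0x01]) + bytes(509)  # 0x0136 = 310 K at bytes 1-2
print(diagnose(bytes(512), good))           # retrieval
```

If both buffers come back identical and all zero, the same check reports "presentation", so an all-zero result on both sides needs a separate look at whether either tool actually received data.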
i see 0s for both
looking through libnvme, Daniel Wagner's changes in early november to
argh
building libnvme 061c89c84c9f9a68dfd00d4e00218041facb3fef does not fix the problem.
nvme-cli commit a633aef with libnvme commit a001e59b7751fcb2076231642e2e4d077afb4fad successfully retrieve the data; this is a definite regression, and not a hardware failure (though the hardware might be doing something weird). i will now zap this with a blessed +3 wand of bisection.
rebuilding with libnvme HEAD (7e9f5a2ab4f6b56200aa4a66ea2b080d2ecea52b) works just fine. the problem is in nvme-cli. i have eliminated the hand-built libnvme from /usr/local, and reinstalled debian unstable's
@jk-ozlabs you might wanna take a look at this
PR incoming
Alright, this PR flips the bool handed to
Just want to double check, you wrote your change sets a bool to true, but the patch sets it to false. I am assuming the code matches your intent.
aye
Changing this to true resulted in my Western Digital 970 EVO Pro NVMe SSDs returning all zeros for smartlog information. Reverts change made in cc73f65 ("nvme: Add wrappers for Get Log page helpers") Resolves linux-nvme#1729. Signed-off-by: nick black <dankamongmen@gmail.com>
This might be a kernel bug, or a hardware failure. I'm just hoping to start discussion and troubleshooting here.
I've got 5 NVMe devices in my workstation:
Three are in motherboard (Gigabyte Aorus XTreme TRX40) M.2 slots, and two are in a PCIe 4.0x16 card using PCIe 4x4x4x4 bifurcation:
I've got a shell script querying
nvme smart-log
for each of the devices approximately every 15s to get temperature information. This was running fine for several weeks. Then, a few days ago, the Samsung 970 2TB devices began returning a temperature of -273 °C (0.15 K above absolute zero), and all 0s otherwise (I assume the -273 to be a bias, and thus that's also returning a 0):
given that my workstation is in fact not careening silently through the unfathomable voids between superclusters of galaxies, nor immersed in a laser-pumped rubidium well, nor actively violating the laws of thermodynamics (so far as i know), this temperature seems low.
this return has persisted ever since, including over a reboot. the other three devices are fine.
how can i assist in debugging this problem? thanks!