
Disk R/W Data (node_disk_read_bytes_total) shows incorrect values for NVMe formatted with 4KiB sector size, or HDD with 4K native #2310

Closed
jmhands opened this issue Mar 7, 2022 · 8 comments · Fixed by #2311

Comments

@jmhands

jmhands commented Mar 7, 2022

Host operating system: output of uname -a

Linux msiz590 5.13.0-30-generic #33-Ubuntu SMP Fri Feb 4 17:03:31 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 1.3.1 (branch: HEAD, revision: a2321e7)

node_exporter command line flags

default

Are you running node_exporter in Docker?

no

What did you do that produced an error?

node_disk_read_bytes_total does not work correctly for disks with a 4KiB sector size, whether NVMe SSD or 4K-native SATA HDD. It overestimates the bytes read by 8x (since it is converting sectors to bytes read).

What did you expect to see?

iostat, dstat, and /proc/diskstats all show the correct amount of data written.

This is the correct data written:

$ iostat
Linux 5.13.0-30-generic (msiz590)       03/07/2022      _x86_64_        (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          27.98    0.00    2.60    3.57    0.00   65.85

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
dm-0             47.04       102.03       234.63      3358.76    2683668    6171252   88342868
loop0             0.00         0.03         0.00         0.00        682          0          0
loop1             0.02         0.23         0.00         0.00       6077          0          0
loop2             0.00         0.08         0.00         0.00       2163          0          0
loop3             0.01         0.17         0.00         0.00       4457          0          0
loop4             0.00         0.08         0.00         0.00       2129          0          0
loop5             0.15         5.08         0.00         0.00     133706          0          0
loop6             0.05         0.84         0.00         0.00      22154          0          0
loop7             0.07         2.17         0.00         0.00      57039          0          0
nvme0n1        1241.20     49994.22     54355.51     65160.73 1314960084 1429671740 1713873244
nvme1n1         945.44     55943.57     60888.21    109731.13 1471441298 1601496248 2886174544
nvme2n1        1261.86     50596.13     54423.33    192913.83 1330791548 1431455452 5074065904
nvme3n1        1249.30     50162.56     54363.83     59417.52 1319387660 1429890424 1562813784
nvme4n1         298.98     50338.30     54385.89    126362.38 1324009988 1430470680 3323613588
nvme5n1         473.56     50220.49     54364.51     68881.78 1320911404 1429908263 1811745016
sda              12.45       103.18       233.67      3409.88    2713791    6146121   89687508

What did you see instead?

node_disk_read_bytes_total is reporting 10808419123200 bytes, which throws off the Grafana node exporter dashboard for Disk R/W Data.

(Screenshot 2022-03-06 232521: Grafana node exporter dashboard, Disk R/W Data panel)

This is very easy to reproduce. Take any modern NVMe drive and do:

sudo apt install nvme-cli

Find the identify-namespace output for an LBA format with a metadata size of 0 and a data size of 4096 bytes:

sudo nvme id-ns /dev/nvme0n1 -H
LBA Format 2 : Metadata Size: 0 bytes - Data Size: 4096 bytes - Relative Performance: 0x2 Good (in use)

Then format the drive to change the sector size (this also wipes all data; it is a cryptographic erase on most NVMe drives):

sudo nvme format /dev/nvme0n1 -l 2
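
For a quick cross-check, here is a minimal sketch (the device name is just an example) that prints the read bytes straight from /proc/diskstats using the 512-byte sector unit that iostat uses, for comparison against node_disk_read_bytes_total:

// Hypothetical cross-check, not part of node_exporter: prints read bytes for a
// device computed from /proc/diskstats with the kernel's 512-byte sector unit.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	const dev = "nvme0n1" // assumed device name; adjust to the drive under test

	f, err := os.Open("/proc/diskstats")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		// Field layout: major minor name reads-completed reads-merged sectors-read ...
		if len(fields) < 7 || fields[2] != dev {
			continue
		}
		sectors, err := strconv.ParseUint(fields[5], 10, 64)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: %d bytes read (sectors read * 512)\n", dev, sectors*512)
	}
}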

@SuperQ
Member

SuperQ commented Mar 7, 2022

Can you share the results of /sys/block/X/queue/logical_block_size for these devices? The node_exporter attempts to detect the block size via this method.

For example on my laptop:

$ sudo nvme id-ns /dev/nvme0n1 -H | grep 'Data Size:'
LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512 bytes - Relative Performance: 0 Best (in use)

$ cat /sys/block/nvme0n1/queue/logical_block_size 
512

@jmhands
Author

jmhands commented Mar 7, 2022

Output of nvme list -v:

Device       NSID     Usage                      Format           Controllers
------------ -------- -------------------------- ---------------- ----------------
nvme0n1      1        400.09  GB / 400.09  GB    512   B +  0 B   nvme0
nvme1n1      1          1.60  TB /   1.60  TB    512   B +  0 B   nvme1
nvme2n1      1          3.84  TB /   3.84  TB      4 KiB +  0 B   nvme2
nvme3n1      1          1.60  TB /   1.60  TB      4 KiB +  0 B   nvme3
nvme4n1      1          2.05  TB /   2.05  TB      4 KiB +  0 B   nvme4
nvme5n1      1        500.11  GB / 500.11  GB    512   B +  0 B   nvme5

And the block size for these devices:

$ cat /sys/block/nvme*n1/queue/logical_block_size
512
512
4096
4096
4096
512

I was running a benchmark across lots of different drive models, and I noticed that Grafana's disk bandwidth was accurate for some drives and not others, which is how I found this.
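
For the three drives reporting a 4096-byte logical block size, the 8x factor is just the ratio of the two sector sizes. A minimal sketch with a made-up counter value:

// Worked example (counter value is made up): v1.3.1 multiplies the kernel's
// 512-byte sector counters by the 4096-byte logical block size, so every byte
// figure for these drives comes out 4096/512 = 8x too large.
package main

import "fmt"

func main() {
	const (
		kernelSectorSize = 512  // unit of the sectors fields in /proc/diskstats
		logicalBlockSize = 4096 // /sys/block/nvme2n1/queue/logical_block_size
	)
	var sectorsRead uint64 = 2_661_583_096 // hypothetical counter

	actual := sectorsRead * kernelSectorSize
	reported := sectorsRead * logicalBlockSize // what v1.3.1 effectively computes
	fmt.Printf("actual:   %d bytes\n", actual)
	fmt.Printf("reported: %d bytes (%dx)\n", reported, reported/actual)
}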

@ventifus
Contributor

ventifus commented Mar 7, 2022

I wonder if I got this wrong when I refactored diskstats_linux.go. Currently it takes IOStats.ReadSectors and multiplies it by the sector size BlockQueueStats.LogicalBlockSize. Previous versions of node_exporter used a hard-coded 512-byte sector size.

float64(stats.ReadSectors) * diskSectorSize,

@SuperQ
Member

SuperQ commented Mar 7, 2022

@ventifus Good question. I wonder how iostat accounts for different sector sizes.

@ventifus
Contributor

ventifus commented Mar 7, 2022

I found the answer in https://www.kernel.org/doc/Documentation/block/stat.txt

read sectors, write sectors, discard_sectors
============================================

These values count the number of sectors read from, written to, or
discarded from this block device.  The "sectors" in question are the
standard UNIX 512-byte sectors, not any device- or filesystem-specific
block size.  The counters are incremented when the I/O completes.

I was incorrect in assuming that ReadSectors / WriteSectors were in hardware sector units. I'll prepare a PR to use 512-byte "sectors" in all cases.
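
Something along these lines is the direction (illustrative names and values, not the exact PR): convert the /proc/diskstats sector counters with a fixed 512-byte sector size, ignoring the device's logical block size.

// Minimal sketch of the fix direction (illustrative types and values, not the
// exact node_exporter change): always use the 512-byte UNIX sector size for
// the byte conversion, regardless of the device's logical block size.
package main

import "fmt"

// unixSectorSize is the unit of the sectors fields in /proc/diskstats for
// every block device, regardless of its hardware or logical sector size.
const unixSectorSize = 512.0

type ioStats struct {
	ReadSectors  uint64
	WriteSectors uint64
}

func main() {
	stats := ioStats{ReadSectors: 2_661_583_096, WriteSectors: 2_862_910_904} // hypothetical counters
	fmt.Printf("node_disk_read_bytes_total:    %.0f\n", float64(stats.ReadSectors)*unixSectorSize)
	fmt.Printf("node_disk_written_bytes_total: %.0f\n", float64(stats.WriteSectors)*unixSectorSize)
}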

@SuperQ
Member

SuperQ commented Mar 8, 2022

Yea, that decision fits with the "don't break userspace" philosophy. The kernel interface would not change out from under any version of iostat.


@brian-brazil
Contributor

This is a regression from #2141

tbg added a commit to tbg/cockroach that referenced this issue Jul 7, 2022
v1.3.1, the most up-to-date released version, has a bug that inflates
the bytes written by ~8x for NVMe drives (which in particular includes
the default drives for our GCE roachprod machines). Fundamentally this
is caused by the fact that these devices use a 4K sector size whereas
the kernel will always report based on a 512B sector size.

This took us a while to figure out, and to avoid repeating this exercise
periodically, downgrade node_exporter to 1.2.2, which pre-dates a
refactor that introduces the regression.

See: prometheus/node_exporter#2310

Release note: None
craig bot pushed a commit to cockroachdb/cockroach that referenced this issue Jul 7, 2022
83014: ui: add internal app filter to active statements and transactions pages r=ericharmeling a=ericharmeling

This PR adds a single internal-app filter option to the Active Statements and Active Transactions pages. Active
statements and transactions run by internal apps are no longer displayed by default.

See commit message for release note.


https://user-images.githubusercontent.com/27286675/174156635-39d8649a-df91-4550-adb5-b3c167d54ed5.mov



Fixes #81072.

83707: roachtest: run workload from the tenant node r=knz a=stevendanna

The secure URL refers to paths on disk on the clusters in the
node. Since we only create the tenant-scoped certs on the tenant node,
we need to run workload from that node.

Fixes #82266
Depends on #83703

Release note: None

84003: storage: close pebble iter gracefully when NewPebbleSSTIterator fails r=erikgrinaker a=msbutler

Currently, if `pebble.NewExternalIter` sets pebbleIterator.inuse to true but
then fails, the subsequent `pebbleIterator.destroy()` will panic unnecessarily,
since the caller of `pebble.NewExternalIter` is not actually using the iter.
This bug causes TestBackupRestoreChecksum to flake in #83984.

To fix, this patch uses pebble.Close() to gracefully close the pebbleIterator
if `pebble.NewExternalIter` fails.

Release Note: None

84039: prometheus: use older node_exporter r=nicktrav a=tbg

v1.3.1, the most up-to-date released version, has a bug that inflates
the bytes written by ~8x for NVMe drives (which in particular includes
the default drives for our GCE roachprod machines). Fundamentally this
is caused by the fact that these devices use a 4K sector size whereas
the kernel will always report based on a 512B sector size.

This took us a while to figure out, and to avoid repeating this exercise
periodically, downgrade node_exporter to 1.2.2, which pre-dates a
refactor that introduces the regression.

See: prometheus/node_exporter#2310

Release note: None


Co-authored-by: Eric Harmeling <eric.harmeling@cockroachlabs.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
Co-authored-by: Michael Butler <butler@cockroachlabs.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>