
Disk R/W Data (node_disk_read_bytes_total) shows incorrect values for NVMe formatted with 4KiB sector size, or HDD with 4K native #2310

Closed
jmhands opened this issue Mar 7, 2022 · 8 comments · Fixed by #2311

Comments

@jmhands

jmhands commented Mar 7, 2022

Host operating system: output of uname -a

Linux msiz590 5.13.0-30-generic #33-Ubuntu SMP Fri Feb 4 17:03:31 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 1.3.1 (branch: HEAD, revision: a2321e7)

node_exporter command line flags

default

Are you running node_exporter in Docker?

no

What did you do that produced an error?

node_disk_read_bytes_total does not work correctly for disks with a 4KiB sector size, whether NVMe SSD or 4K-native SATA HDD. It overestimates the bytes read by 8x (since it is converting sectors to bytes read).

What did you expect to see?

iostat, dstat, and /proc/diskstats all show the correct amount of data written.

This is the correct data written:

$ iostat
Linux 5.13.0-30-generic (msiz590)       03/07/2022      _x86_64_        (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          27.98    0.00    2.60    3.57    0.00   65.85

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
dm-0             47.04       102.03       234.63      3358.76    2683668    6171252   88342868
loop0             0.00         0.03         0.00         0.00        682          0          0
loop1             0.02         0.23         0.00         0.00       6077          0          0
loop2             0.00         0.08         0.00         0.00       2163          0          0
loop3             0.01         0.17         0.00         0.00       4457          0          0
loop4             0.00         0.08         0.00         0.00       2129          0          0
loop5             0.15         5.08         0.00         0.00     133706          0          0
loop6             0.05         0.84         0.00         0.00      22154          0          0
loop7             0.07         2.17         0.00         0.00      57039          0          0
nvme0n1        1241.20     49994.22     54355.51     65160.73 1314960084 1429671740 1713873244
nvme1n1         945.44     55943.57     60888.21    109731.13 1471441298 1601496248 2886174544
nvme2n1        1261.86     50596.13     54423.33    192913.83 1330791548 1431455452 5074065904
nvme3n1        1249.30     50162.56     54363.83     59417.52 1319387660 1429890424 1562813784
nvme4n1         298.98     50338.30     54385.89    126362.38 1324009988 1430470680 3323613588
nvme5n1         473.56     50220.49     54364.51     68881.78 1320911404 1429908263 1811745016
sda              12.45       103.18       233.67      3409.88    2713791    6146121   89687508

What did you see instead?

node_disk_read_bytes_total is reporting 10808419123200 bytes, which throws off the Grafana node exporter dashboard for Disk R/W Data.

(Screenshot 2022-03-06 232521: Grafana node exporter dashboard, Disk R/W Data panel)

This is very easy to reproduce. Take any modern NVMe drive and do:

sudo apt install nvme-cli

Find the identify-namespace output for an LBA format with a metadata size of 0 and a data size of 4096 bytes:

sudo nvme id-ns /dev/nvme0n1 -H
LBA Format 2 : Metadata Size: 0 bytes - Data Size: 4096 bytes - Relative Performance: 0x2 Good (in use)

Then format the drive to change the sector size (this also wipes all data; it is a cryptographic erase on most NVMe drives):

sudo nvme format /dev/nvme0n1 -l 2
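
For a quick cross-check, here is a minimal sketch (the device name is just an example) that prints the read bytes straight from /proc/diskstats using the 512-byte sector unit that iostat uses, for comparison against node_disk_read_bytes_total:

// Hypothetical cross-check, not part of node_exporter: prints read bytes for a
// device computed from /proc/diskstats with the kernel's 512-byte sector unit.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	const dev = "nvme0n1" // assumed device name; adjust to the drive under test

	f, err := os.Open("/proc/diskstats")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		// Field layout: major minor name reads-completed reads-merged sectors-read ...
		if len(fields) < 7 || fields[2] != dev {
			continue
		}
		sectors, err := strconv.ParseUint(fields[5], 10, 64)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: %d bytes read (sectors read * 512)\n", dev, sectors*512)
	}
}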

@SuperQ
Member

SuperQ commented Mar 7, 2022

Can you share the results of /sys/block/X/queue/logical_block_size for these devices? The node_exporter attempts to detect the block size via this method.

For example on my laptop:

$ sudo nvme id-ns /dev/nvme0n1 -H | grep 'Data Size:'
LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512 bytes - Relative Performance: 0 Best (in use)

$ cat /sys/block/nvme0n1/queue/logical_block_size 
512

@jmhands
Author

jmhands commented Mar 7, 2022

Output of nvme list -v:

Device       NSID     Usage                      Format           Controllers
------------ -------- -------------------------- ---------------- ----------------
nvme0n1      1        400.09  GB / 400.09  GB    512   B +  0 B   nvme0
nvme1n1      1          1.60  TB /   1.60  TB    512   B +  0 B   nvme1
nvme2n1      1          3.84  TB /   3.84  TB      4 KiB +  0 B   nvme2
nvme3n1      1          1.60  TB /   1.60  TB      4 KiB +  0 B   nvme3
nvme4n1      1          2.05  TB /   2.05  TB      4 KiB +  0 B   nvme4
nvme5n1      1        500.11  GB / 500.11  GB    512   B +  0 B   nvme5

And the block size for these devices:

$ cat /sys/block/nvme*n1/queue/logical_block_size
512
512
4096
4096
4096
512

I was running a benchmark across lots of different drive models, and I noticed that Grafana's disk bandwidth was accurate for some drives and not others, which is how I found this.
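
For the three drives reporting a 4096-byte logical block size, the 8x factor is just the ratio of the two sector sizes. A minimal sketch with a made-up counter value:

// Worked example (counter value is made up): v1.3.1 multiplies the kernel's
// 512-byte sector counters by the 4096-byte logical block size, so every byte
// figure for these drives comes out 4096/512 = 8x too large.
package main

import "fmt"

func main() {
	const (
		kernelSectorSize = 512  // unit of the sectors fields in /proc/diskstats
		logicalBlockSize = 4096 // /sys/block/nvme2n1/queue/logical_block_size
	)
	var sectorsRead uint64 = 2_661_583_096 // hypothetical counter

	actual := sectorsRead * kernelSectorSize
	reported := sectorsRead * logicalBlockSize // what v1.3.1 effectively computes
	fmt.Printf("actual:   %d bytes\n", actual)
	fmt.Printf("reported: %d bytes (%dx)\n", reported, reported/actual)
}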

@ventifus
Contributor

ventifus commented Mar 7, 2022

I wonder if I got this wrong when I refactored diskstats_linux.go. Currently it takes IOStats.ReadSectors and multiplies it by the sector size BlockQueueStats.LogicalBlockSize. Previous versions of node_exporter used a hard-coded 512-byte sector size.

float64(stats.ReadSectors) * diskSectorSize,

@SuperQ
Member

SuperQ commented Mar 7, 2022

@ventifus Good question. I wonder how iostat accounts for different sector sizes.

@ventifus
Contributor

ventifus commented Mar 7, 2022

I found the answer in https://www.kernel.org/doc/Documentation/block/stat.txt

read sectors, write sectors, discard_sectors
============================================

These values count the number of sectors read from, written to, or
discarded from this block device.  The "sectors" in question are the
standard UNIX 512-byte sectors, not any device- or filesystem-specific
block size.  The counters are incremented when the I/O completes.

I was incorrect in assuming that ReadSectors / WriteSectors were in hardware sector units. I'll prepare a PR to use 512-byte "sectors" in all cases.
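
Something along these lines is the direction (illustrative names and values, not the exact PR): convert the /proc/diskstats sector counters with a fixed 512-byte sector size, ignoring the device's logical block size.

// Minimal sketch of the fix direction (illustrative types and values, not the
// exact node_exporter change): always use the 512-byte UNIX sector size for
// the byte conversion, regardless of the device's logical block size.
package main

import "fmt"

// unixSectorSize is the unit of the sectors fields in /proc/diskstats for
// every block device, regardless of its hardware or logical sector size.
const unixSectorSize = 512.0

type ioStats struct {
	ReadSectors  uint64
	WriteSectors uint64
}

func main() {
	stats := ioStats{ReadSectors: 2_661_583_096, WriteSectors: 2_862_910_904} // hypothetical counters
	fmt.Printf("node_disk_read_bytes_total:    %.0f\n", float64(stats.ReadSectors)*unixSectorSize)
	fmt.Printf("node_disk_written_bytes_total: %.0f\n", float64(stats.WriteSectors)*unixSectorSize)
}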

@SuperQ
Member

SuperQ commented Mar 8, 2022

Yea, that decision fits with the "don't break userspace" philosophy. The kernel interface would not change out from under any version of iostat.


@brian-brazil
Contributor

This is a regression from #2141

tbg added a commit to tbg/cockroach that referenced this issue Jul 7, 2022
v1.3.1, the most up-to-date released version, has a bug that inflates
the bytes written by ~8x for NVMe drives (which in particular includes
the default drives for our GCE roachprod machines). Fundamentally this
is caused by the fact that these devices use a 4K sector size whereas
the kernel will always report based on a 512B sector size.

This took us a while to figure out, and to avoid repeating this exercise
periodically, downgrade node_exporter to 1.2.2, which pre-dates a
refactor that introduces the regression.

See: prometheus/node_exporter#2310

Release note: None
craig bot pushed a commit to cockroachdb/cockroach that referenced this issue Jul 7, 2022
83014: ui: add internal app filter to active statements and transactions pages r=ericharmeling a=ericharmeling

This PR adds a single internal-app filter option to the Active Statements and Active Transactions pages. Active
statements and transactions run by internal apps are no longer displayed by default.

See commit message for release note.


https://user-images.githubusercontent.com/27286675/174156635-39d8649a-df91-4550-adb5-b3c167d54ed5.mov



Fixes #81072.

83707: roachtest: run workload from the tenant node r=knz a=stevendanna

The secure URL refers to paths on disk on the clusters in the
node. Since we only create the tenant-scoped certs on the tenant node,
we need to run workload from that node.

Fixes #82266
Depends on #83703

Release note: None

84003: storage: close pebble iter gracefully when NewPebbleSSTIterator fails r=erikgrinaker a=msbutler

Currently, if `pebble.NewExternalIter` sets pebbleIterator.inuse to true but
then fails, the subsequent `pebbleIterator.destroy()` will panic unnecessarily,
since the caller of `pebble.NewExternalIter` is not actually using the iter.
This bug causes TestBackupRestoreChecksum to flake in #83984.

To fix, this patch uses pebble.Close() to gracefully close the pebbleIterator
if `pebble.NewExternalIter` fails.

Release Note: None

84039: prometheus: use older node_exporter r=nicktrav a=tbg

v1.3.1, the most up-to-date released version, has a bug that inflates
the bytes written by ~8x for NVMe drives (which in particular includes
the default drives for our GCE roachprod machines). Fundamentally this
is caused by the fact that these devices use a 4K sector size whereas
the kernel will always report based on a 512B sector size.

This took us a while to figure out, and to avoid repeating this exercise
periodically, downgrade node_exporter to 1.2.2, which pre-dates a
refactor that introduces the regression.

See: prometheus/node_exporter#2310

Release note: None


Co-authored-by: Eric Harmeling <eric.harmeling@cockroachlabs.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
Co-authored-by: Michael Butler <butler@cockroachlabs.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>