
Huge performance drop (30%~60%) after upgrading to 0.7.9 from 0.6.5.11 #7834

Closed
pruiz opened this issue Aug 27, 2018 · 61 comments

Labels
Status: Stale No recent activity for issue Type: Performance Performance improvement or performance problem

Comments

@pruiz

pruiz commented Aug 27, 2018

System information

Type Version/Name
Distribution Name CentOS
Distribution Version 7.5
Linux Kernel 3.10.0-862.9.1.el7.x86_64
Architecture x86_64
ZFS Version 0.6.5.11 => 0.7.9
SPL Version 0.6.5.11 => 0.7.9

Describe the problem you're observing

I've found a huge performance drop between zfs 0.6.5.11 and 0.7.9 with the following system/setup:

  • System Board: SuperMicro X8DTS
  • CPU: 2x Intel X5687 (8c) @3.60GHz
  • RAM: 32GB DDR3
  • Controller: LSI SAS2 2008 (IT firmware)
  • Disks: 12 x HGST HUSMR3280ASS201 (800GB, SAS/SSD)

On this system I've created the following RAID10 zpool:

  pool: DATA
 state: ONLINE
  scan: none requested
config:

        NAME                              STATE     READ WRITE CKSUM
        DATA                              ONLINE       0     0     0
          mirror-0                        ONLINE       0     0     0
            wwn-0x5000cca09f004a1c-part1  ONLINE       0     0     0
            wwn-0x5000cca09f004c14-part1  ONLINE       0     0     0
          mirror-1                        ONLINE       0     0     0
            wwn-0x5000cca09f00500c-part1  ONLINE       0     0     0
            wwn-0x5000cca09f005214-part1  ONLINE       0     0     0
          mirror-2                        ONLINE       0     0     0
            wwn-0x5000cca09f0052a8-part1  ONLINE       0     0     0
            wwn-0x5000cca09f005318-part1  ONLINE       0     0     0
          mirror-3                        ONLINE       0     0     0
            wwn-0x5000cca09f005700-part1  ONLINE       0     0     0
            wwn-0x5000cca09f005960-part1  ONLINE       0     0     0
          mirror-4                        ONLINE       0     0     0
            wwn-0x5000cca09f006178-part1  ONLINE       0     0     0
            wwn-0x5000cca09f00640c-part1  ONLINE       0     0     0
          mirror-5                        ONLINE       0     0     0
            wwn-0x5000cca09f00642c-part1  ONLINE       0     0     0
            wwn-0x5000cca09f006530-part1  ONLINE       0     0     0

And the following datasets:

NAME           USED  AVAIL  REFER  MOUNTPOINT
DATA          2.05G  4.22T    96K  none
DATA/db-data  1.01G  4.22T  1.01G  legacy

All of it created using the following commands:

zpool create -o ashift=12   DATA \
  mirror wwn-0x5000cca09f004a1c-part1 wwn-0x5000cca09f004c14-part1 \
  mirror wwn-0x5000cca09f00500c-part1 wwn-0x5000cca09f005214-part1 \
  mirror wwn-0x5000cca09f0052a8-part1 wwn-0x5000cca09f005318-part1 \
  mirror wwn-0x5000cca09f005700-part1 wwn-0x5000cca09f005960-part1 \
  mirror wwn-0x5000cca09f006178-part1 wwn-0x5000cca09f00640c-part1  \
  mirror wwn-0x5000cca09f00642c-part1 wwn-0x5000cca09f006530-part1

zfs set compression=lz4 DATA
zfs set mountpoint=none DATA
zfs create -o mountpoint=/mnt/db-data -o xattr=sa -o acltype=off -o atime=off -o relatime=off -o logbias=throughput -o recordsize=16K -o compression=lz4 DATA/db-data
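
A quick way to double-check the resulting layout (a verification sketch; these are standard zpool/zfs property queries, output will of course vary):

zdb -C DATA | grep ashift
zpool get ashift DATA
zfs get recordsize,compression,logbias,atime,xattr DATA/db-data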

While benchmarking (using fio, among other tools) against the DATA/db-data dataset, I've found quite a large performance difference between versions 0.6.5.11 and 0.7.9 of zfs/spl, as can be seen below.

  • Performance results using v0.6.5.11:
FILESIZE BS IODEPTH THREADS WMODE IOP/s (R) IOP/s (W) BW (R) BW (W)
1G 4K 1 1 SYNC 6526 2775 25.5MB 10.8MB
1G 8K 1 1 SYNC 6494 2783 50.7MB 21.7MB
1G 256K 1 1 SYNC 2267 974 567MB 244MB
1G 1M 1 1 SYNC 754 320 755MB 321MB
1G 4K 16 16 SYNC 21300 9151 83.2MB 35.7MB
1G 8K 16 16 SYNC 21500 9243 168MB 72.2MB
1G 256K 16 16 SYNC 3819 1638 955MB 410MB
1G 1M 16 16 SYNC 1303 560 1304MB 560MB
1G 16K 128 128 SYNC 126000 53900 1965MB 842MB
16G 4K 1 1 NOSYNC 17600 7566 68.9MB 29.6MB
16G 8K 1 1 NOSYNC 18300 7819 143MB 61.1MB
16G 256K 1 1 NOSYNC 3773 1623 936MB 403MB
16G 1M 1 1 NOSYNC 1178 502 1179MB 502MB
16G 4K 16 16 NOSYNC 125000 53400 487MB 209MB
16G 8K 16 16 NOSYNC 114000 48700 888MB 381MB
16G 256K 16 16 NOSYNC 8480 3637 2120MB 909MB
16G 1M 16 16 NOSYNC 2189 939 2189MB 940MB
  • Performance results using v0.7.9:
FILESIZE BS IODEPTH THREADS WMODE IOP/s (R) IOP/s (W) BW (R) BW (W)
1G 4K 1 1 SYNC 4236 1821 16.5MB 7.2MB
1G 8K 1 1 SYNC 4137 1764 32.3MB 16.8MB
1G 256K 1 1 SYNC 1413 578 353MB 145MB
1G 1M 1 1 SYNC 471 179 471MB 179MB
1G 4k 16 16 SYNC 18200 7791 70.9MB 30.4MB
1G 16k 16 16 SYNC 16000 7257 265MB 113MB
1G 256k 16 16 SYNC 3476 1498 869MB 375MB
1G 1M 16 16 SYNC 1095 467 1096MB 468MB
1G 16k 128 128 SYNC 43600 18600 681MB 291MB
16G 4K 1 1 SYNC 1842 801 7.4MB 3.2MB
16G 8K 1 1 SYNC 1838 791 14.5MB 6.4MB
16G 256K 1 1 SYNC 783 337 196MB 84.4MB
16G 1M 1 1 SYNC 255 108 255MB 109MB
1G 4K 1 1 NOSYNC 34200 14200 134MB 57.5MB
1G 8K 1 1 NOSYNC 33200 14200 259MB 111MB
1G 16K 1 1 NOSYNC 34300 14700 536MB 229MB
1G 32K 1 1 NOSYNC 17700 7612 552MB 238MB
1G 64K 1 1 NOSYNC 10100 4519 629MB 282MB
1G 128K 1 1 NOSYNC 5824 2383 728MB 298MB
1G 256K 1 1 NOSYNC 2847 1221 718MB 305MB
1G 1M 1 1 NOSYNC 756 322 756MB 322MB
1G 4k 16 16 NOSYNC 89800 38500 351MB 150MB
1G 8k 16 16 NOSYNC 88100 37800 688MB 295MB
1G 16k 16 16 NOSYNC 84300 36100 1317MB 565MB
1G 32k 16 16 NOSYNC 42900 18400 1342MB 576MB
1G 64k 16 16 NOSYNC 20900 8964 1305MB 560MB
16G 4K 1 1 NOSYNC 3033 1280 12.2MB 5.1MB
16G 8K 1 1 NOSYNC 2996 1292 23.8MB 10.3MB
16G 16K 1 1 NOSYNC 3976 1707 62.1MB 26.7MB
16G 32K 1 1 NOSYNC 3128 1342 97.8MB 41MB
16G 64K 1 1 NOSYNC 2462 1068 154MB 66.8MB
16G 128K 1 1 NOSYNC 1673 719 209MB 89.9MB
16G 256K 1 1 NOSYNC 750 322 186MB 80.1MB
16G 1M 1 1 NOSYNC 308 134 309MB 135MB
16G 4k 16 16 NOSYNC 24200 10400 94.4MB 40.5MB
16G 8k 16 16 NOSYNC 24400 10400 190MB 81.5MB
16G 256k 16 16 NOSYNC 3844 1654 889MB 404MB
16G 1M 16 16 NOSYNC 1017 433 1017MB 434MB

As can be seen, performance drops in both IOPS and bandwidth across all use cases. Examples:

  • IOPs intensive workload (~30% difference):
    ** 0.6.5.11 => 4k,1,1,SYNC => 6526/2775 - 25.5MB/10.8MB
    ** 0.7.9 => 4k,1,1,SYNC => 4236/1821 - 16.5MB/7.2MB

  • BW intensive workload (~60% difference):
    ** 0.6.5.11 => 256k,16,16,NOSYNC => 8480/3637 - 2120MB/909MB
    ** 0.7.9 => 256k,16,16,NOSYNC => 3844/1654 - 889MB/404MB

Tests have been performed using the following commands, using average from 3 repetitions on each case.

rm -f kk
echo 3 > /proc/sys/vm/drop_caches
sleep 30
fio --filename=kk \
  -name=test --group_reporting --fallocate=none --ioengine=libaio \
  --rw=randrw --rwmixread=70 --refill_buffers --norandommap --randrepeat=0 --runtime=60 \
  --iodepth=$IODEPTH --numjobs=$THREADS \
  --direct=0 --sync=$WMODE --size=$FILESIZE --bs=$BS --time_based
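
For reference, a hypothetical wrapper for driving the matrix above (not my actual script; it assumes SYNC maps to fio's --sync=1 and NOSYNC to --sync=0):

FILESIZE=1G
for WMODE in 1 0; do
  for BS in 4k 8k 256k 1M; do
    for JOBS in "1 1" "16 16"; do
      set -- $JOBS; IODEPTH=$1; THREADS=$2
      rm -f kk; echo 3 > /proc/sys/vm/drop_caches; sleep 30
      fio --filename=kk --name=test --group_reporting --fallocate=none --ioengine=libaio \
          --rw=randrw --rwmixread=70 --refill_buffers --norandommap --randrepeat=0 --runtime=60 \
          --iodepth=$IODEPTH --numjobs=$THREADS \
          --direct=0 --sync=$WMODE --size=$FILESIZE --bs=$BS --time_based
    done
  done
done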

NOTEs:

  • The zpool and datasets have been re-created from scratch after swapping zfs versions.
  • When upgrading/downgrading zfs/spl kernel modules, userland utilities were upgraded/downgraded too.
  • I've tested using whole-disks instead of partitions, with same results.
  • Tuning zfs module parameters on v0.7.9 (like zfs_vdev_*, etc.) makes no observable difference.
  • I've tried 0.7.9 with kernel 4.4.152 (from elrepo), but results are even a little bit worse (~5% slower) than 0.7.9 with redhat's stock kernel.
@pruiz pruiz changed the title Huge performance drop after upgrading to 0.7.9 from 0.6.5.11 Huge performance drop (30%~60%) after upgrading to 0.7.9 from 0.6.5.11 Aug 27, 2018
@DeHackEd
Contributor

The use of scatter/gather lists for the ARC rather than chopping up vmalloc()'d blocks does incur a performance hit, but this seems a bit much...

@pruiz
Author

pruiz commented Aug 27, 2018

@DeHackEd is there any module param or compile-time define I can set in order to disable s/g on 0.7.9 and redo benchmarks?

@pruiz
Author

pruiz commented Aug 27, 2018

PS: I forgot to add that I've tried 0.7.9 with kernel 4.4.152 (from elrepo), but results are even a little bit worse (~5% slower) than 0.7.9 with Red Hat's stock kernel.

@loli10K loli10K added the Type: Performance Performance improvement or performance problem label Aug 27, 2018
@behlendorf
Contributor

@pruiz you can set zfs_abd_scatter_enabled=0 to force ZFS to use the 0.6.5 allocation strategy and not use scatter/gather lists. You could also try setting zfs_compressed_arc_enabled=0 to disable keeping data compressed in the ARC. Both of these options will increase ZFS's memory footprint and cpu usage, but may improve performance for your test workload. I'd be interested to see your results.

We've also done some work in the master branch to improve performance. If you're comfortable building ZFS from source, it would be interesting to see how the master branch compares on your hardware.
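
For anyone wanting to try those, the tunables can be flipped through sysfs at runtime or made persistent with a modprobe option (a sketch; reloading the modules, or at least re-creating the test file, gives the cleanest comparison):

echo 0 > /sys/module/zfs/parameters/zfs_abd_scatter_enabled
echo 0 > /sys/module/zfs/parameters/zfs_compressed_arc_enabled

# persistent form, applied at module load time
echo "options zfs zfs_abd_scatter_enabled=0 zfs_compressed_arc_enabled=0" > /etc/modprobe.d/zfs.conf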

@pruiz
Author

pruiz commented Aug 27, 2018

Hi @behlendorf,

Here are some preliminary results with 0.7.9 + zfs_abd_scatter_enabled=0 (same zpool & data set settings as previously):

FILESIZE BS IODEPTH THREADS WMODE IOP/s (R) IOP/s (W) BW (R) BW (W)
1G 4K 1 1 SYNC 5582 2399 21.8MB 9.6MB
1G 8K 1 1 SYNC 5155 2209 40.3MB 17.3MB
1G 256K 1 1 SYNC 2197 948 549MB 237MB
1G 1M 1 1 SYNC 687 294 687MB 295MB
1G 4k 16 16 SYNC 22300 9526 86.9MB 37.2MB
1G 8k 16 16 SYNC 22200 9513 174MB 74.3MB
1G 256k 16 16 SYNC 3842 1631 961MB 408MB
1G 1M 16 16 SYNC 1248 528 1248MB 528MB
16G 4K 1 1 NOSYNC 3349 1427 13.1MB 5.7MB
16G 8K 1 1 NOSYNC 3360 1452 26.3MB 11.3MB
16G 256K 1 1 NOSYNC 1674 726 419MB 182MB
16G 1M 1 1 NOSYNC 499 214 499MB 214MB
16G 4K 16 16 NOSYNC 25600 10900 99.8MB 42.7MB
16G 8K 16 16 NOSYNC 25400 10900 199MB 85.1MB
16G 256K 16 16 NOSYNC 4347 1859 1087MB 465MB
16G 1M 16 16 NOSYNC 1183 505 1184MB 506MB
  • Compared to other tests:
  1. IOPs intensive workload:
    ** 0.6.5.11 => 4k,1,1,SYNC => 6526/2775 - 25.5MB/10.8MB
    ** 0.7.9 => 4k,1,1,SYNC => 4236/1821 - 16.5MB/7.2MB
    ** 0.7.9+scatter=0 => 4k,1,1,SYNC => 5582/2399 - 21.8MB/9.6MB

Results: ~15% increase from plain 0.7.9, still lagging behind 0.6.5.11 (by another ~15%)

  2. BW intensive workload:
    ** 0.6.5.11 => 256k,16,16,NOSYNC => 8480/3637 - 2120MB/909MB
    ** 0.7.9 => 256k,16,16,NOSYNC => 3844/1654 - 889MB/404MB
    ** 0.7.9+scatter=0 => 256k,16,16,NOSYNC => 4347/1859 - 1087MB/465MB

Results: ~20% increase from plain 0.7.9, still lagging behind 0.6.5.11 (by ~50%)

@pruiz
Author

pruiz commented Aug 27, 2018

And here are some preliminary results with 0.7.9 + zfs_compressed_arc_enabled=0 (same zpool & data set settings as in my original testing):

FILESIZE BS IODEPTH THREADS WMODE IOP/s (R) IOP/s (W) BW (R) BW (W)
1G 4K 1 1 SYNC 4940 2122 19.3MB 8.4MB
1G 8K 1 1 SYNC 4801 2059 37.5MB 16.1MB
1G 256K 1 1 SYNC 1969 837 492MB 209MB
1G 1M 1 1 SYNC 588 256 588MB 257MB
1G 4K 16 16 SYNC 21700 9299 84.9MB 36.3MB
1G 8K 16 16 SYNC 21500 9238 168MB 72.2MB
1G 256K 16 16 SYNC 3592 1545 898MB 386MB
1G 1M 16 16 SYNC 1086 465 1086MB 466MB
16G 4k 1 1 NOSYNC 3222 1387 12.6MB 5.5MB
16G 8k 1 1 NOSYNC 3233 1381 25.3MB 10.8MB
16G 256k 1 1 NOSYNC 1524 653 381MB 163MB
16G 1M 1 1 NOSYNC 442 192 443MB 192MB
16G 4k 16 16 NOSYNC 23900 10200 93.4MB 40MB
16G 8k 16 16 NOSYNC 23500 10100 184MB 78.7MB
16G 256k 16 16 NOSYNC 3826 1637 957MB 409MB
16G 1M 16 16 NOSYNC 1023 441 1023MB 442MB
  • Compared to other tests:
  1. IOPs intensive workload:
    ** 0.6.5.11 => 4k,1,1,SYNC => 6526/2775 - 25.5MB/10.8MB
    ** 0.7.9 => 4k,1,1,SYNC => 4236/1821 - 16.5MB/7.2MB
    ** 0.7.9+scatter=0 => 4k,1,1,SYNC => 5582/2399 - 21.8MB/9.6MB
    ** 0.7.9+comp_arc=0 => 4k,1,1,SYNC => 4940/2122 - 19.3MB/8.4MB

Results: ~10% increase from plain 0.7.9, still lagging behind 0.6.5.11

  2. BW intensive workload:
    ** 0.6.5.11 => 256k,16,16,NOSYNC => 8480/3637 - 2120MB/909MB
    ** 0.7.9 => 256k,16,16,NOSYNC => 3844/1654 - 889MB/404MB
    ** 0.7.9+scatter=0 => 256k,16,16,NOSYNC => 4347/1859 - 1087MB/465MB
    ** 0.7.9+comp_arc=0 => 256k,16,16,NOSYNC => 3826/1637 - 957MB/409MB

Results: roughly on par with plain 0.7.9 (and ~10% below 0.7.9+scatter=0).

@pruiz
Author

pruiz commented Aug 28, 2018

And results from 0.7.9 with zfs_abd_scatter_enabled=0 + zfs_compressed_arc_enabled=0:

FILESIZE BS IODEPTH THREADS WMODE IOP/s (R) IOP/s (W) BW (R) BW (W)
1G 4K 1 1 SYNC 5277 2271 20.6MB 9MB
1G 8K 1 1 SYNC 5224 2226 40.8MB 17.4MB
1G 256K 1 1 SYNC 2171 935 543MB 234MB
1G 1M 1 1 SYNC 672 289 673MB 290MB
1G 4K 16 16 SYNC 22500 9659 88MB 37.7MB
1G 8K 16 16 SYNC 22400 9599 175MB 74MB
1G 256K 16 16 SYNC 3813 1631 953MB 408MB
1G 1M 16 16 SYNC 1232 530 1232MB 531MB
16G 4k 1 1 NOSYNC 3366 1442 13.2MB 5.7MB
16G 8k 1 1 NOSYNC 3346 1440 26.1MB 11.2MB
16G 256k 1 1 NOSYNC 1748 751 437MB 188MB
16G 1M 1 1 NOSYNC 507 217 508MB 218MB
16G 4k 16 16 NOSYNC 25700 11000 101MB 43.1MB
16G 8k 16 16 NOSYNC 25800 11000 202MB 86.3MB
16G 256k 16 16 NOSYNC 4514 1930 1129MB 483MB
16G 1M 16 16 NOSYNC 1205 513 1206MB 514MB
  • Compared to other tests:
  1. IOPs intensive workload:
    ** 0.6.5.11 => 4k,1,1,SYNC => 6526/2775 - 25.5MB/10.8MB
    ** 0.7.9 => 4k,1,1,SYNC => 4236/1821 - 16.5MB/7.2MB
    ** 0.7.9+scatter=0 => 4k,1,1,SYNC => 5582/2399 - 21.8MB/9.6MB
    ** 0.7.9+comp_arc=0 => 4k,1,1,SYNC => 4940/2122 - 19.3MB/8.4MB
    ** 0.7.9+scatter=0+comp_arc=0 => 4k,1,1,SYNC => 5277/2271 - 20.6MB/9MB

Results: nearly the same performance as with scatter=0 alone (still well above plain 0.7.9)...

  2. BW intensive workload:
    ** 0.6.5.11 => 256k,16,16,NOSYNC => 8480/3637 - 2120MB/909MB
    ** 0.7.9 => 256k,16,16,NOSYNC => 3844/1654 - 889MB/404MB
    ** 0.7.9+scatter=0 => 256k,16,16,NOSYNC => 4347/1859 - 1087MB/465MB
    ** 0.7.9+comp_arc=0 => 256k,16,16,NOSYNC => 3826/1637 - 957MB/409MB
    ** 0.7.9+scatter=0+comp_arc=0 => 256k,16,16,NOSYNC => 4514/1930 - 1129MB/483MB

Results: ~15% increase over plain 0.7.9.

@pruiz
Author

pruiz commented Aug 28, 2018

I'll try master tomorrow and report here..

@pruiz
Author

pruiz commented Aug 28, 2018

Well, I've built zfs from master (v0.7.0-1533_g47ab01a), and initial testing does not look promising :(

FILESIZE BS IODEPTH THREADS WMODE IOP/s (R) IOP/s (W) BW (R) BW (W)
1G 4K 1 1 SYNC 4426 1909 17.3MB 7.6MB
1G 8K 1 1 SYNC 4348 1869 33MB 14.6MB
1G 256K 1 1 SYNC 1840 794 460MB 199MB
1G 1M 1 1 SYNC 597 259 597MB 260MB
..

@GregorKopka
Contributor

Possibly the slowdown with the 0.7.x version is somewhere in the code path taken because of logbias=throughput on the dataset? Asking as I'm running with logbias=latency, and I vaguely remember benchmarking 0.7 as faster than the 0.6 series I upgraded a certain system from a while ago...

@pruiz
Author

pruiz commented Aug 28, 2018 via email

@behlendorf
Contributor

One thing I didn't originally notice from your first post is that recordsize=16k is set on the dataset. That's definitely a less common configuration and a potential reason why you're seeing a performance regression while others have reported an overall improvement. Regardless, we'll need to find the bottleneck. Thank you for bringing it to our attention and posting your performance results.

@pruiz
Author

pruiz commented Aug 31, 2018

@behlendorf yeah, our intended use in this case is for a db server, so 8k or 16k should be the optimal recordsize.. probably not as common as bigger recordsizes, as you stated.

Anyway, I would be more than happy to test other configurations/options if you guys need it.

@matveevandrey

matveevandrey commented Sep 6, 2018

@pruiz, great job!
Would you mind performing your tests with recordsize=4k (since you have ashift=12, the physical block size is 4k as well)? I've also noticed a performance drop after upgrading our NFS server (used as Proxmox shared storage) from 0.6.5 to 0.7.

@pruiz pruiz closed this as completed Sep 6, 2018
@pruiz pruiz reopened this Sep 6, 2018
@pruiz
Author

pruiz commented Sep 6, 2018

Tests with recordsize=4k, logbias=throughput (fio, using randrw 70/30, as usual, with both 1G & 16G test files):

  • Using v0.6.5.11, with 1G test file:
FILESIZE BS IODEPTH THREADS WMODE IOP/s (R) IOP/s (W) BW (R) BW (W)
1G 4K 1 1 SYNC 6498 2786 25.4MB 10.9MB
1G 8K 1 1 SYNC 7243 3100 56.6MB 24.2MB
1G 256K 1 1 SYNC 1349 581 337MB 145MB
1G 1M 1 1 SYNC 372 160 372MB 160MB
1G 4k 16 16 SYNC 25900 11100 101MB 43.4MB
1G 8k 16 16 SYNC 22000 9827 179MB 76.8MB
1G 256k 16 16 SYNC 1927 823 482MB 206MB
1G 1M 16 16 SYNC 483 211 484MB 211MB
1G 4k 1 1 NOSYNC 75300 32200 294MB 126MB
1G 8k 1 1 NOSYNC 49200 21100 384MB 165MB
1G 256k 1 1 NOSYNC 2444 1040 611MB 260MB
1G 1M 1 1 NOSYNC 618 263 618MB 264MB
1G 4k 16 16 NOSYNC 211000 90400 824MB 353MB
1G 8k 16 16 NOSYNC 115000 49300 898MB 385MB
1G 256k 16 16 NOSYNC 4193 1800 1048MB 450MB
1G 1M 16 16 NOSYNC 1065 459 1066MB 460MB
  • Using v0.6.5.11, with 16G test file:
FILESIZE BS IODEPTH THREADS WMODE IOP/s (R) IOP/s (W) BW (R) BW (W)
16G 4K 1 1 SYNC 4678 2003 18.3MB 8MB
16G 1M 1 1 SYNC 287 125 288MB 125MB
16G 4K 16 16 SYNC 23100 9881 90.1MB 38.6MB
16G 1M 16 16 SYNC 462 200 463MB 200MB
16G 4K 1 1 NOSYNC 11800 5083 46.3MB 19.9MB
16G 1M 1 1 NOSYNC 359 154 359MB 155MB
16G 4K 16 16 NOSYNC 99900 42800 390MB 167MB
16G 1M 16 16 NOSYNC 874 373 874MB 374MB
  • Using master (v0.7-1533_g47ab01a), with 1G test file:
FILESIZE BS IODEPTH THREADS WMODE IOP/s (R) IOP/s (W) BW (R) BW (W)
1G 4K 1 1 SYNC 4942 2125 19.3MB 8.5MB
1G 8K 1 1 SYNC 4841 2072 37.8MB 16.2MB
1G 256K 1 1 SYNC 1240 533 310MB 133MB
1G 1M 1 1 SYNC 377 162 378MB 163MB
1G 4K 16 16 SYNC 42500 18200 166MB 71.1MB
1G 8K 16 16 SYNC 32600 13000 255MB 109MB
1G 256K 16 16 SYNC 1666 707 417MB 177MB
1G 1M 16 16 SYNC 438 188 438MB 188MB
1G 4K 1 1 NOSYNC 57500 24600 224MB 96.2MB
1G 8K 1 1 NOSYNC 36600 15700 286MB 122MB
1G 256K 1 1 NOSYNC 1754 754 499MB 189MB
1G 1M 1 1 NOSYNC 460 201 461MB 201MB
1G 4K 16 16 NOSYNC 130000 55800 508MB 218MB
1G 8K 16 16 NOSYNC 64200 27500 501MB 215MB
1G 256K 16 16 NOSYNC 2251 970 563MB 243MB
1G 1M 16 16 NOSYNC 583 248 584MB 249MB
  • Using master (v0.7-1533_g47ab01a), with 16G test file:
FILESIZE BS IODEPTH THREADS WMODE IOP/s (R) IOP/s (W) BW (R) BW (W)
16G 4K 1 1 SYNC 3612 1535 14.1MB 6.1MB
16G 1M 1 1 SYNC 288 122 288MB 123MB
16G 4K 16 16 SYNC 19500 8364 76.1MB 32.7MB
16G 1M 16 16 SYNC 389 168 390MB 169MB
16G 4K 1 1 NOSYNC 12900 5547 50.6MB 21.7MB
16G 1M 1 1 NOSYNC 339 145 340MB 146MB
16G 4K 16 16 NOSYNC 36500 15600 143MB 61.1MB
16G 1M 16 16 NOSYNC 510 219 511MB 220MB
  • Results summary:

    1. Baseline 4k IOPs (SYNC)
      -> v0.6.5.11 - 1G/4k/1/1 => 6498 / 2786 (25.4MB / 10.9MB)
      -> v0.7-master - 1G/4k/1/1 => 4942 / 2125 (19.3MB / 8.5MB)
      => v0.6.5 wins this case by ~20%.

    2. Baseline 4k IOPs (NOSYNC)
      -> v0.6.5.11 - 1G/4k/1/1 => 75300 / 32200 (294MB / 126MB)
      -> v0.7-master - 1G/4k/1/1 => 57500 / 24600 (224MB / 96.2MB)
      => v0.6.5 wins again..

    3. Highest IOPs (SYNC)
      -> v0.6.5.11 - 1G/4k/16/16 => 25900 / 11100 (101MB / 43.4MB)
      -> v0.7-master - 1G/4k/16/16 => 42500 / 18200 (166MB / 71.1MB)
      => Winner is v0.7-master by an impressive 50%+

    4. Highest IOPs (NOSYNC)
      -> v0.6.5.11 - 1G/4k/16/16 => 211000 / 90400 (824MB / 353MB)
      -> v0.7-master - 1G/4k/16/16 => 130000 / 55800 (508MB / 218MB)
      => In this case v0.6.5 wins by far, nearly double.

    5. Highest Throughput (SYNC)
      -> v0.6.5.11 - 1G/1M/16/16 => 483 / 211 (484MB / 211MB)
      -> v0.7-master - 1G/1M/16/16 => 438 / 188 (438MB / 188MB)
      => I would call this a tie.

    6. Highest Throughput (NOSYNC)
      -> v0.6.5.11 - 1G/1M/16/16 => 1065 / 459 (1066MB / 460MB)
      -> v0.7-master - 1G/1M/16/16 => 583 / 248 (584MB / 249MB)
      => Another clear win for v0.6.5.

NOTEs:

  • I've verified with zdb that ashift is 12 as intended.
  • Testing against 0.7-master was done with zfs_abd_scatter_enabled=1 & zfs_compressed_arc_enabled=1.
  • The high concurrency/iodepth gains on 4k/sync with v0.7-master are impressive.. however, it looks like we have a huge drop in the opposite use cases (no-sync tests).

I will try to add test results against v0.7.9 if I find some spare time tonight, as I would love to know whether those 4k/SYNC IOPS results of v0.7-master are reproducible with v0.7.9 too.

@tonynguien
Contributor

Using the zfs-tests performance regression suite, I'm seeing a similar regression for cached reads, random reads, and random writes. I'll start bisecting commits between the 0.6.5.11 and 0.7.0 tags.

@olavgg

olavgg commented Sep 15, 2018

I would like to add some details here. I have used ZFS on FreeBSD for almost 10 years, and it has always had decent performance. But I have a newer build with only SSDs and an Optane 900p as SLOG, and the sync write performance is really bad. I've compared with different Linux distributions and other filesystems.

The tool I use to test sync write performance is pg_test_fsync

Here is the performance on my FreeBSD server with 3 raidz vdevs of 5 x 5400RPM spinners (15 disks total) and a 32GB Optane.

$ pg_test_fsync -f /tank/rot/testfile
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                                   n/a
        fdatasync                          7134.022 ops/sec     140 usecs/op
        fsync                              7138.345 ops/sec     140 usecs/op
        fsync_writethrough                              n/a
        open_sync                          7436.686 ops/sec     134 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                                   n/a
        fdatasync                          5139.483 ops/sec     195 usecs/op
        fsync                              4403.700 ops/sec     227 usecs/op
        fsync_writethrough                              n/a
        open_sync                          2606.494 ops/sec     384 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write
open_sync sizes.)
         1 * 16kB open_sync write          5082.113 ops/sec     197 usecs/op
         2 *  8kB open_sync writes         3707.069 ops/sec     270 usecs/op
         4 *  4kB open_sync writes         2144.459 ops/sec     466 usecs/op
         8 *  2kB open_sync writes         1271.302 ops/sec     787 usecs/op
        16 *  1kB open_sync writes          636.725 ops/sec    1571 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different
descriptor.)
        write, fsync, close                5989.971 ops/sec     167 usecs/op
        write, close, fsync                5913.696 ops/sec     169 usecs/op

Non-sync'ed 8kB writes:
        write                             72071.214 ops/sec      14 usecs/op

With 6 x striped 800GB enterprise-class SSDs, an Optane 900p as SLOG, and ZFS on Linux 0.8.0-rc1 on Ubuntu 18.04:

$ sudo /usr/lib/postgresql/10/bin/pg_test_fsync -f /tank/rot/testfile  
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                      2574,871 ops/sec     388 usecs/op
        fdatasync                          2265,568 ops/sec     441 usecs/op
        fsync                              2242,302 ops/sec     446 usecs/op
        fsync_writethrough                              n/a
        open_sync                          2510,196 ops/sec     398 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                      1301,706 ops/sec     768 usecs/op
        fdatasync                          2101,979 ops/sec     476 usecs/op
        fsync                              2082,698 ops/sec     480 usecs/op
        fsync_writethrough                              n/a
        open_sync                          1441,130 ops/sec     694 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write
open_sync sizes.)
         1 * 16kB open_sync write          2421,870 ops/sec     413 usecs/op
         2 *  8kB open_sync writes         1286,643 ops/sec     777 usecs/op
         4 *  4kB open_sync writes          674,385 ops/sec    1483 usecs/op
         8 *  2kB open_sync writes          352,586 ops/sec    2836 usecs/op
        16 *  1kB open_sync writes          179,682 ops/sec    5565 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different
descriptor.)
        write, fsync, close                2469,133 ops/sec     405 usecs/op
        write, close, fsync                2522,016 ops/sec     397 usecs/op

Non-sync'ed 8kB writes:
        write                            113709,613 ops/sec       9 usecs/op

For comparison, exact same hardware, default settings, ZoL 0.7.9, benchmarked with pg_test_fsync:

Ubuntu 18.04 2200 iops
Debian 9 2000 iops
CentOS 7 8000 iops

FreeBSD 11.2 16000 iops

Ubuntu 18.04 + XFS 34000 iops
Ubuntu 18.04 + EXT4 32000 iops
Ubuntu 18.04 + BcacheFS 14000 iops 

If there is anything I can help with, please ask. I now know how to build from source 8-)

@GregorKopka
Contributor

/tank/rot/testfile

What are the settings on the zfs filesystem (default of recordsize=128k would explain quite a lot)?
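
(Something like the following would show the relevant properties; the dataset name is inferred from the test path and may differ:)

zfs get recordsize,logbias,sync,compression tank/rot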

@olavgg

olavgg commented Sep 16, 2018

In my case it is the default, which is a dynamic record size, meaning that when pg_test_fsync writes, ZFS will write 8kB blocks.

@GregorKopka
Contributor

GregorKopka commented Sep 16, 2018

ZFS will maintain files in a filesystem in $recordsize sized blocks on-disk.
You tested performance of read-modify-write cycles with 128k on-disk blocks by partially rewriting 8k chunks, which naturally isn't great.

In case you want zfs to write 8k on-disk blocks:
Set recordsize of the filesystem to that value, recreate the file, test again.
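
Concretely, something along these lines (dataset name inferred from the test path; recordsize only applies to newly written blocks, hence recreating the file):

zfs set recordsize=8k tank/rot
rm /tank/rot/testfile
pg_test_fsync -f /tank/rot/testfile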

@olavgg

olavgg commented Sep 16, 2018

The recordsize is dynamic; it writes 8kB even if the recordsize is higher.

As this is easy to test, I can confirm that I get the exact same numbers. Also, iostat says the SLOG, which is an Optane 900p, is writing around 20-30MB/s.

@gmelikov
Member

IIRC recordsize is dynamic only for files smaller than the recordsize, or when compression is enabled.

@richardelling
Contributor

Are you sure you're not being impacted by the write throttle? It can be tuned and
the default tuning is a bit of a guess.

https://github.com/zfsonlinux/zfs/wiki/ZFS-Transaction-Delay
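
A quick way to see whether the throttle is actually engaging, and the main knobs involved (paths are the standard ZoL ones; the pool name is a placeholder):

cat /proc/spl/kstat/zfs/<poolname>/txgs
grep . /sys/module/zfs/parameters/zfs_dirty_data_max \
       /sys/module/zfs/parameters/zfs_delay_min_dirty_percent \
       /sys/module/zfs/parameters/zfs_delay_scale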

@rlaager
Member

rlaager commented Sep 17, 2018

Are you able to bisect this at all, even just using released versions? As a starting point, is 0.7.0 good like 0.6.5.11 or bad like 0.7.9?
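
A bisect between the release tags would look roughly like this (a sketch only; in the 0.7 era SPL is a separate repository, so a matching spl build is needed at each step):

git clone https://github.com/zfsonlinux/zfs.git && cd zfs
git bisect start zfs-0.7.9 zfs-0.6.5.11    # bad tag, then good tag
# at each step: build, install, reload the modules, re-run one fio case, then:
git bisect good    # or: git bisect bad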

@h1z1

h1z1 commented Sep 17, 2018

Tuning zfs module parameters on v0.7.9 (like zfs_vdev_*, etc.) makes no observable difference.

Are you running stock settings or what was tweaked here?

grep . /sys/module/zfs/parameters/*

would help too if you can. From what I can tell above, 0.7 may have lower bandwidth and io/s, but it has quite a bit lower latency.

@dweeezil
Contributor

dweeezil commented Sep 17, 2018

One commit that comes to mind for anyone able to bisect is 1ce23dc. It will [EDIT] increase [/EDIT] latency for single-threaded synchronous workloads such as pg_test_fsync but should help multi-threaded workloads as can be simulated with fio. See https://goo.gl/jBUiH5 for the author's performance testing on this commit.

@tonynguien
Contributor

So the change in zio_notify_parent(), which replaced zio_execute() with zio_taskq_dispatch(), introduced the performance regression. I reverted that change and got performance similar to the pre-throttle code.

In master, https://github.com/zfsonlinux/zfs/pull/7736/commits reduced taskq context switching and thus solved the above issue.

@pruiz Would you be able to test with master or 0.8 code to verify?

@tonynguien
Contributor

tonynguien commented Oct 2, 2018

Additionally, I noticed two things:

  1. Random write and sequential read numbers from the pre-write-throttle code are still higher than the numbers with the 7736 change.

Pre write throttle numbers:

delphix@ZoL-ubuntu-4: grep iop random_writes.ksh.fio**
random_writes.ksh.fio.sync.8k-ios.128-threads.1-filesystems:  write: io=21403MB, bw=182629KB/s, iops=22828, runt=120005msec
random_writes.ksh.fio.sync.8k-ios.1-threads.1-filesystems:  write: io=4094.5MB, bw=34939KB/s, iops=4367, runt=120001msec
random_writes.ksh.fio.sync.8k-ios.32-threads.1-filesystems:  write: io=16115MB, bw=137498KB/s, iops=17187, runt=120011msec
delphix@ZoL-ubuntu-4:
delphix@ZoL-ubuntu-4: grep iop sequential_reads.ksh.fio*
sequential_reads.ksh.fio.sync.128k-ios.128-threads.1-filesystems:  read : io=114173MB, bw=973311KB/s, iops=7603, runt=120119msec
sequential_reads.ksh.fio.sync.128k-ios.16-threads.1-filesystems:  read : io=219190MB, bw=1826.6MB/s, iops=14612, runt=120001msec
sequential_reads.ksh.fio.sync.128k-ios.1-threads.1-filesystems:  read : io=64290MB, bw=548606KB/s, iops=4285, runt=120001msec
sequential_reads.ksh.fio.sync.128k-ios.64-threads.1-filesystems:  read : io=114187MB, bw=974304KB/s, iops=7611, runt=120011msec
sequential_reads.ksh.fio.sync.128k-ios.8-threads.1-filesystems:  read : io=270624MB, bw=2255.2MB/s, iops=18041, runt=120001msec
sequential_reads.ksh.fio.sync.1m-ios.128-threads.1-filesystems:  read : io=114034MB, bw=970752KB/s, iops=948, runt=120289msec
sequential_reads.ksh.fio.sync.1m-ios.16-threads.1-filesystems:  read : io=213920MB, bw=1782.6MB/s, iops=1782, runt=120006msec
sequential_reads.ksh.fio.sync.1m-ios.1-threads.1-filesystems:  read : io=61403MB, bw=523968KB/s, iops=511, runt=120001msec
sequential_reads.ksh.fio.sync.1m-ios.64-threads.1-filesystems:  read : io=115057MB, bw=981288KB/s, iops=958, runt=120065msec
sequential_reads.ksh.fio.sync.1m-ios.8-threads.1-filesystems:  read : io=263363MB, bw=2194.7MB/s, iops=2194, runt=120003msec
delphix@ZoL-ubuntu-4:

7736 numbers

delphix@ZoL-ubuntu-4: grep iop random_writes.ksh.fio**
random_writes.ksh.fio.sync.8k-ios.128-threads.1-filesystems:  write: io=17093MB, bw=145773KB/s, iops=18221, runt=120069msec
random_writes.ksh.fio.sync.8k-ios.1-threads.1-filesystems:  write: io=2680.9MB, bw=22876KB/s, iops=2859, runt=120001msec
random_writes.ksh.fio.sync.8k-ios.32-threads.1-filesystems:  write: io=12982MB, bw=110772KB/s, iops=13846, runt=120006msec
delphix@ZoL-ubuntu-4:
delphix@ZoL-ubuntu-4:
delphix@ZoL-ubuntu-4: grep iop sequential_reads.ksh.fio*
sequential_reads.ksh.fio.sync.128k-ios.128-threads.1-filesystems:  read : io=97437MB, bw=831078KB/s, iops=6492, runt=120055msec
sequential_reads.ksh.fio.sync.128k-ios.16-threads.1-filesystems:  read : io=158444MB, bw=1320.4MB/s, iops=10562, runt=120003msec
sequential_reads.ksh.fio.sync.128k-ios.1-threads.1-filesystems:  read : io=55948MB, bw=477418KB/s, iops=3729, runt=120001msec
sequential_reads.ksh.fio.sync.128k-ios.64-threads.1-filesystems:  read : io=92176MB, bw=786437KB/s, iops=6144, runt=120020msec
sequential_reads.ksh.fio.sync.128k-ios.8-threads.1-filesystems:  read : io=228701MB, bw=1905.9MB/s, iops=15246, runt=120001msec
sequential_reads.ksh.fio.sync.1m-ios.128-threads.1-filesystems:  read : io=98438MB, bw=838174KB/s, iops=818, runt=120262msec
sequential_reads.ksh.fio.sync.1m-ios.16-threads.1-filesystems:  read : io=155480MB, bw=1295.7MB/s, iops=1295, runt=120005msec
sequential_reads.ksh.fio.sync.1m-ios.1-threads.1-filesystems:  read : io=53686MB, bw=458117KB/s, iops=447, runt=120001msec
sequential_reads.ksh.fio.sync.1m-ios.64-threads.1-filesystems:  read : io=93378MB, bw=796341KB/s, iops=777, runt=120073msec
sequential_reads.ksh.fio.sync.1m-ios.8-threads.1-filesystems:  read : io=219143MB, bw=1826.2MB/s, iops=1826, runt=120003msec
delphix@ZoL-ubuntu-4:
  2. Cached read performance also dropped somewhere between 0.6.5 and the write throttle commit.

So we may still have some regressions. I'm looking at #2 now. Does it make sense to open new issue(s)?

@dweeezil
Contributor

dweeezil commented Oct 2, 2018

I'd also like to mention that disabling dynamic taskqs (spl_taskq_thread_dynamic=0) can decrease latency and improve performance, especially in single-threaded benchmark scenarios.
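
(In the 0.7.x split-module layout that is an SPL option; e.g., to make it persistent it takes effect the next time the modules are loaded:)

echo "options spl spl_taskq_thread_dynamic=0" > /etc/modprobe.d/spl.conf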

@tonynguien
Contributor

I'd also like to mention that disabling dynamic taskqs (spl_taskq_thread_dynamic=0) can decrease latency and improve performance, especially in single-threaded benchmark scenarios.

Thanks!

@hedongzhang
Contributor

hedongzhang commented Oct 16, 2018

I used fio to test the performance of a zfs-0.7.11 zvol; the write amplification is more than 6x, which seriously affects zvol performance.

Type Version/Name
Distribution Name redhat-7.4
Distribution Version 7.4
Linux Kernel 3.10.0-693.el7.x86_64
Architecture x86_64
ZFS Version 0.7.11
SPL Version 0.7.11
Hardware 3 x SSD(370G)
  • 8K zvol randwrite

[image: fio results for the 8K zvol randwrite test]

@hedongzhang
Contributor

@kpande I don't quite understand what you mean. Can you elaborate more?

@janetcampbell

You are handling all ZIL writes via indirect sync (logbias=throughput). This will trash your ability to aggregate read I/O over time due to data/metadata fragmentation, and will even greatly reduce your ability to aggregate between one data block and another. Any outstanding async write in the same sync domain may suffer as well.

I understand the desire for throughput but here it is coming at the expense of the pool data at large. In the real world, you would seldom set up a dataset like this unless read performance was totally unimportant. If you will read from a block at least once, it's worth doing direct sync.

If you test with logbias=latency, you need to either add a SLOG or increase zfs_immediate_write_sz.

I'd recommend doing a ZFS send while you watch zpool iostat -r. With 16k indirect writes you should have some absolutely amazing unaggregatable fragmentation.
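
For anyone wanting to try the logbias=latency route on the original pool, roughly (the device path is a placeholder; zfs_immediate_write_sz is in bytes, default 32768):

zfs set logbias=latency DATA/db-data
zpool add DATA log /dev/disk/by-id/<fast-ssd-or-nvme>   # optional dedicated SLOG
echo 131072 > /sys/module/zfs/parameters/zfs_immediate_write_sz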

@janetcampbell

Another note - it looks like you are suffering reads even on full block writes. This should help greatly with that:

#8590

@pauful

pauful commented Apr 23, 2019

We have encountered a very similar issue, in the form of a significant performance drop between zfs 0.6.5.9 and 0.7.11. We are able to overcome the issue by setting zfs_abd_scatter_enabled=0 & zfs_compressed_arc_enabled=0.

We are using Debian Stretch (version 9.8) and linux kernel 4.9.0-8-amd64.

Our recordsize is 128K and I don't think we would be able to decrease it.

@richardelling
Contributor

I too have seen cases where ABD scatter/gather isn't as performant. So I can
believe it makes a difference for some workloads, but don't have a generic
guideline for when to use it and when to not use it. Experiment results appreciated.

I don't believe disabling compressed ARC will make much difference. Perhaps on
small memory machines? Can you toggle that and report results. This will be more
important soon as there is a proposal to force compressed ARC on. #7896

@pauful

pauful commented Apr 26, 2019

I tried enabling zfs_compressed_arc_enabled and zfs_abd_scatter_enabled in separate tests.
Enabling compressed ARC causes the bigger performance drop of the two. Enabling ABD scatter/gather also decreases performance, but not as noticeably as enabling compressed ARC.
Our machine has more than 250G of memory. Let me know if I can help by providing any other information.

@jwittlincohen
Contributor

@pauful What compression algorithm are you using on your datasets? lz4 is very fast to decompress but gzip would certainly cause issues.

@pauful

pauful commented May 9, 2019

@jwittlincohen lz4 is the compression option used in our pools.

@matveevandrey

@pruiz Have you tried current master? Some performance-oriented commits have been applied since.

@pruiz
Author

pruiz commented Jun 21, 2019 via email

@interduo

interduo commented Jul 17, 2019

Has anybody done performance tests on 0.8.2 and would like to share?
I didn't find anything interesting on Google.

@interduo

interduo commented Oct 9, 2019

@pruiz could you do the same test on 0.8.2 with the hardware you mentioned earlier?

@pruiz
Author

pruiz commented Oct 9, 2019 via email

@interduo

interduo commented Feb 18, 2020

@pruiz could you do the same test on 0.8.3 with the hardware you mentioned earlier?
You did good work with that earlier.

@stevecs

stevecs commented Oct 12, 2020

Just a lurker on this bug, as I saw a similar drop on systems here back in 2018 doing the same transition from 0.6.5 to 0.7.x: performance decreased to about 1/5-1/6th when running v0.7, so I had to fall back. I've been testing newer versions against the same array on and off over the last two years, but still within the 0.7 line and with no luck. This last week I tried 0.8.3 and performance is back, comparable with 0.6.5. This is on one of my larger dev/qa systems; I'll watch it closely for the next month before upgrading the other systems. So 0.8.3 looks promising. Just wanted to bump this and see if pruiz could validate whether it also alleviates his original problem.

@interduo

@stevecs try 0.8.5; this version gives more I/Os.

@stevecs

stevecs commented Oct 12, 2020

@interduo I'll see if I can get another window, but it will probably be a couple of weeks. I did a quick look at the commit deltas between 0.8.3 and 0.8.5 but didn't see much to catch my eye for I/O improvements (though I did spot a couple of other commits that were interesting). Can you give me a hint as to what commits you think may be relevant?

@interduo

I just jumped from 0.8.2 to 0.8.5 and got a nice surprise on the I/O graphs. I didn't look at the commits.

@stale

stale bot commented Oct 12, 2021

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Oct 12, 2021
@stale stale bot closed this as completed Jan 12, 2022