
Random reads by 9 z_wr_iss threads crippling sequential write performance #7594

Closed
Atemu opened this issue Jun 1, 2018 · 14 comments
Labels
Status: Stale (No recent activity for issue) · Type: Performance (Performance improvement or performance problem)

Comments

@Atemu

Atemu commented Jun 1, 2018

System information

Type Version/Name
Distribution Name Archlinux
Distribution Version -
Linux Kernel 4.16.12-1
Architecture x86_64
ZFS Version zfs-dkms 0.7.9-1
SPL Version spl-dkms 0.7.9-1

Describe the problem you're observing

I see heavy random read activity from 9 z_wr_iss threads while writing data to the zpool.
Each of those threads reads at 100-300 kB/s, which results in 100% disk utilization in iostat -x and cripples the sequential write performance of the writing application.

After ~30 s they stop reading and I get very good write performance, until they start reading again a couple of seconds later.

I am not running into single- or multithreaded CPU bottlenecks and RAM is not limiting anything either (as far as I can tell).

Describe how to reproduce the problem

  1. Write data to the zpool (you could dd from /dev/urandom or copy any non-empty file)
  2. Open iotop and sort by read speed
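
For reference, a minimal reproduction along these lines might look like the sketch below (the /data mountpoint, file size, and block size are only placeholders):

    # write a few GiB of incompressible data to the pool (path and size are arbitrary)
    dd if=/dev/urandom of=/data/ddtest bs=1M count=4096 status=progress

    # in a second terminal, watch per-thread I/O; -o hides idle threads,
    # and the left/right arrow keys change the sort column to reads
    iotop -o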

Include any warning/errors/backtraces from the system logs

There are no errors, only unexpected/unwanted behavior; I'll put stats here instead:

zpool status -D

  pool: data
 state: ONLINE
  scan: scrub paused since Fri Jun  1 19:31:44 2018
	scrub started on   Fri Jun  1 19:07:25 2018
	31.9G scanned out of 1.71T
	0B repaired, 1.82% done
config:

	NAME                                                STATE     READ WRITE CKSUM
	data                                                ONLINE       0     0     0
	  mirror-0                                          ONLINE       0     0     0
	    ata-WDC_WD10EAVS-00D7B0_WD-WCAU41704170-part1   ONLINE       0     0     0
	    ata-WDC_WD10EADS-00L5B1_WD-WCAU48695497-part1   ONLINE       0     0     0
	  mirror-1                                          ONLINE       0     0     0
	    ata-ST1000DM003-9YN162_S1D3CGPZ-part1           ONLINE       0     0     0
	    ata-WDC_WD10EZEX-08WN4A0_WD-WCC6Y3AELSRD-part1  ONLINE       0     0     0
	cache
	  ata-OCZ-TRION100_65UB512QKMCX-part4               ONLINE       0     0     0

errors: No known data errors

 dedup: DDT entries 14037164, size 599B on disk, 193B in core

bucket              allocated                       referenced          
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    10.2M   1.23T   1.03T   1.03T    10.2M   1.23T   1.03T   1.03T
     2    2.96M    376G    352G    352G    6.19M    785G    730G    730G
     4     197K   23.0G   20.7G   20.8G     967K    113G    103G    103G
     8    25.5K   2.99G   2.59G   2.59G     229K   26.7G   23.1G   23.1G
    16    1.89K    213M    141M    142M    41.0K   4.52G   3.01G   3.02G
    32      269   19.3M   9.76M   10.1M    13.2K    982M    504M    521M
    64      139   6.21M   2.34M   2.57M    11.9K    607M    251M    270M
   128       15    904K    160K    184K    2.55K    150M   35.3M   39.5M
   256        7    266K   17.5K     32K    2.82K   86.5M   6.34M   12.7M
   512        2      3K      3K      8K    1.67K   2.50M   2.50M   6.66M
    1K        2      3K      3K      8K    2.69K   4.03M   4.03M   10.7M
    2K        1   1.50K   1.50K      4K    2.29K   3.44M   3.44M   9.18M
    4K        1    128K      4K      4K    5.29K    677M   21.1M   21.1M
 Total    13.4M   1.63T   1.40T   1.40T    17.6M   2.14T   1.87T   1.87T

cat /proc/spl/kstat/zfs/arcstats

13 1 0x01 96 26112 1347083257 29980445681062
name                            type data
hits                            4    143616607
misses                          4    6675033
demand_data_hits                4    3193904
demand_data_misses              4    2128
demand_metadata_hits            4    139639371
demand_metadata_misses          4    3114153
prefetch_data_hits              4    41195
prefetch_data_misses            4    3448900
prefetch_metadata_hits          4    742137
prefetch_metadata_misses        4    109852
mru_hits                        4    10503032
mru_ghost_hits                  4    1058430
mfu_hits                        4    132483843
mfu_ghost_hits                  4    322015
deleted                         4    5994126
mutex_miss                      4    1215
access_skip                     4    0
evict_skip                      4    11774445
evict_not_enough                4    288284
evict_l2_cached                 4    257589443584
evict_l2_eligible               4    302754582528
evict_l2_ineligible             4    43114084352
evict_l2_skip                   4    11156
hash_elements                   4    499557
hash_elements_max               4    975057
hash_collisions                 4    3571836
hash_chains                     4    50859
hash_chain_max                  4    6
p                               4    667666820
c                               4    705057456
c_min                           4    516320256
c_max                           4    12884901888
size                            4    636434888
compressed_size                 4    457182720
uncompressed_size               4    473488896
overhead_size                   4    82210304
hdr_size                        4    40433848
data_size                       4    129216512
metadata_size                   4    410176512
dbuf_size                       4    5559856
dnode_size                      4    10672896
bonus_size                      4    4091200
anon_size                       4    87064576
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    85756928
mru_evictable_data              4    35782656
mru_evictable_metadata          4    8192
mru_ghost_size                  4    552566784
mru_ghost_evictable_data        4    531759104
mru_ghost_evictable_metadata    4    20807680
mfu_size                        4    366571520
mfu_evictable_data              4    0
mfu_evictable_metadata          4    351748096
mfu_ghost_size                  4    86577152
mfu_ghost_evictable_data        4    0
mfu_ghost_evictable_metadata    4    86577152
l2_hits                         4    1423981
l2_misses                       4    5239248
l2_feeds                        4    53233
l2_rw_clash                     4    0
l2_read_bytes                   4    5890176512
l2_write_bytes                  4    255218547712
l2_writes_sent                  4    41744
l2_writes_done                  4    41744
l2_writes_error                 4    0
l2_writes_lock_retry            4    17
l2_evict_lock_retry             4    4
l2_evict_reading                4    0
l2_evict_l1cached               4    326939
l2_free_on_write                4    27479
l2_abort_lowmem                 4    5
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    24279220224
l2_asize                        4    24237309952
l2_hdr_size                     4    36284064
memory_throttle_count           4    0
memory_direct_count             4    74
memory_indirect_count           4    408748
memory_all_bytes                4    16522248192
memory_free_bytes               4    9222279168
memory_available_bytes          3    8964120576
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    115
arc_meta_used                   4    507218376
arc_meta_limit                  4    9663676416
arc_dnode_limit                 4    966367641
arc_meta_max                    4    4876518032
arc_meta_min                    4    16777216
sync_wait_for_async             4    300029
demand_hit_predictive_prefetch  4    3154731
arc_need_free                   4    0
arc_sys_free                    4    258160128

/r/zfs thread about my issue

@Atemu
Author

Atemu commented Jun 5, 2018

Bump

Does anyone know what z_wr_iss is supposed to do and what could cause it to do random reads?

@abrasive
Contributor

Are you running zed (the ZFS event daemon)?

@Atemu
Author

Atemu commented Jun 19, 2018

It should be running by default, so yes

@abrasive
Contributor

abrasive commented Jun 19, 2018 via email

@Atemu
Author

Atemu commented Jun 19, 2018

autoexpand=off on my pool

@abrasive
Contributor

I can't help you then, sorry! ^_^

@thewacokid

thewacokid commented Jun 25, 2018

Layman's guess - you have dedupe enabled, and you only have 16 GB of RAM. ~2.8 GB is taken up by your DDT (assuming it's all in memory). Writes trigger reads on dedup'd pools if the entire DDT is not in RAM.
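
As a rough cross-check against the zpool status -D output above (back-of-the-envelope arithmetic only, ignoring ARC header overhead):

    # 14,037,164 DDT entries x 193 B per entry in core
    echo $(( 14037164 * 193 )) bytes   # 2709172652 bytes, i.e. ~2.5 GiB to hold the whole DDT in RAM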

@Atemu
Author

Atemu commented Jun 26, 2018

That's what I thought as well, but the ARC is almost never full, and limiting it to metadata only doesn't help either (it just gets even smaller).

By the way, is it intended that the ARC empties itself automatically when it isn't even near its limit?

@richardelling
Contributor

The z_wr_iss taskq handles (wr)ites (iss)ued. In the ZIO pipeline, writes can cause reads, but this will be difficult to predict. For example, if the current metaslab is full, the next metaslab's space map is needed and might not be in memory, causing a space map load. Also, CPU cycles can be consumed by the compression, checksum generation, and dedup stages (if enabled).

In general, you will get more responses on the mailing list, because GitHub is a poor tool for conversations.

@Atemu
Author

Atemu commented Jun 28, 2018

Thanks a bunch for your response, that was very insightful!

Do you know of a way to force the space maps to be cached all the time?
I've got plenty of unused space in my ARC and L2ARC.

@richardelling
Contributor

Yes, the zfs-module-parameters(8) tunable that controls unloading of space maps is metaslab_debug_unload.
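
For anyone following along, a sketch of setting that tunable at runtime (plus the related metaslab_debug_load mentioned in the next comment; settings persist only until reboot unless also put in /etc/modprobe.d/zfs.conf):

    # keep space maps in memory instead of unloading them when a metaslab goes idle
    echo 1 > /sys/module/zfs/parameters/metaslab_debug_unload
    # optionally also load all space maps when the pool is imported
    echo 1 > /sys/module/zfs/parameters/metaslab_debug_load

    # persistent variant, e.g. in /etc/modprobe.d/zfs.conf:
    #   options zfs metaslab_debug_unload=1 metaslab_debug_load=1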

@Atemu
Author

Atemu commented Jun 28, 2018

I set metaslab_debug_unload and metaslab_debug_load to 1, and now whenever I import a zpool it takes a while to read ~200 MiB into the ARC, but unfortunately the 9 z_wr_iss threads still do their random reads when writing.

bunder2015 added the Type: Performance (Performance improvement or performance problem) label Jan 31, 2019
@janetcampbell

janetcampbell commented Apr 6, 2019

This is old, but I'd like to offer a few suggestions that may help.

Step 1 in these situations is almost always to use zpool iostat -r to your advantage. Find out what is being read: is it pool data, or is it metadata? What is your typical recordsize? Are the reads showing up as sync or async?
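
A sketch of that kind of observation (pool name and interval are placeholders):

    # per-vdev request-size histograms, refreshed every 5 s while the write load runs
    zpool iostat -r data 5

    # pending/active counts per I/O class, useful for telling sync from async reads
    zpool iostat -q data 5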

If it's pool data (reads are the same size as your recordsize), then you may be seeing RMW reads during TxG commit, as it brings in the missing pieces of each block that need to be written. These can get nasty and dominate the workload in some cases as we found out on Solaris with ZFS in the early days. Generally, RMW for sync writes will create sync reads, while RMW for async writes will create async reads. If they look like they may be RMW reads, you can try:

Decreasing your recordsize
Making TxG commit less frequent
Limiting the speed of TxG commit
Deprioritizing the reads

Decreasing the recordsize has a number of effects and may or may not be a practical decision. You need to look at your write distribution and try to figure out whether it's worth doing. ZFS does so much better with large records that I consider going below about 32K a last resort, except for small special-purpose cases.

If you delay TxG commit, you may be able to get all the pieces of a block before you need to write it. This is usually a matter of growing the txg timeout, dirty_data_sync, and possibly dirty_data_max. This is most useful with applications that drip out small writes and periodically do a "checkpoint" with a lot of big sequential writes. MongoDB is one example.
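
On the 0.7.x release in this report, that usually means touching tunables along these lines (the values are only illustrative, not recommendations):

    # commit a TxG at most every 15 s instead of the default 5 s
    echo 15 > /sys/module/zfs/parameters/zfs_txg_timeout
    # let more dirty data accumulate before a commit is forced (default 64 MiB)
    echo $(( 256 * 1024 * 1024 )) > /sys/module/zfs/parameters/zfs_dirty_data_sync
    # and, if RAM allows, raise the overall dirty data cap
    echo $(( 4 * 1024 * 1024 * 1024 )) > /sys/module/zfs/parameters/zfs_dirty_data_max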

The TxG commit, and RMW reads that issue from it, can be limited with zfs_sync_taskq_batch_pct. The default (75%) is greatly oversized for most modern systems. Try decreasing it sharply (to 5-10%) and see if the commit and the reads get more orderly and stress the system less. This would be my first step. You can usually go quite low before the speed of the TxG commit itself is impacted.
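
For example (only a sketch; the 10% figure follows the suggestion above, and the new value may not take effect for an already-imported pool until it is re-imported):

    # shrink the sync taskq from its default of 75% of CPUs
    echo 10 > /sys/module/zfs/parameters/zfs_sync_taskq_batch_pct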

Finally, you can deprioritize RMW reads at the vdev level. If they're sync reads (RMW for earlier sync writes), there are a lot of ways to do this, but they tend to rely on the fact that sync reads are feast or famine: either they're not a problem, or they bury the box. Try decreasing the sync read minimum active to 1-2, and you will move the other I/O classes much higher in the queueing order.
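
One way to express that with the vdev queue tunables (illustrative; the default minimum is 10 on 0.7.x):

    # give sync reads fewer guaranteed slots per vdev so other I/O classes are scheduled sooner
    echo 2 > /sys/module/zfs/parameters/zfs_vdev_sync_read_min_active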

I'm working on an article to clarify this further. Hope this helps.

@stale

stale bot commented Aug 24, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale bot added the Status: Stale (No recent activity for issue) label Aug 24, 2020
stale bot closed this as completed Nov 25, 2020