
Random reads by 9 z_wr_iss threads crippling sequential write performance #7594

Closed
Atemu opened this issue Jun 1, 2018 · 14 comments
Labels
Status: Stale (No recent activity for issue) · Type: Performance (Performance improvement or performance problem)

Comments

@Atemu

Atemu commented Jun 1, 2018

System information

Type Version/Name
Distribution Name Archlinux
Distribution Version -
Linux Kernel 4.16.12-1
Architecture x86_64
ZFS Version zfs-dkms 0.7.9-1
SPL Version spl-dkms 0.7.9-1

Describe the problem you're observing

I see heavy random read activity from 9 z_wr_iss threads while writing data to the zpool.
Each of those threads reads at 100-300 kB/s, which results in 100% disk utilization in iostat -x and cripples the sequential write performance of the writing application.

After ~30 s they stop reading and I get very good write performance, until they start reading again a couple of seconds later.

I am not running into single- or multithreaded CPU bottlenecks and RAM is not limiting anything either (as far as I can tell).

Describe how to reproduce the problem

  1. Write data to the zpool (you could dd from /dev/urandom or copy any non-empty file)
  2. Open iotop and sort by read speed
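
For reference, a minimal reproduction along these lines might look like the sketch below (the /data mountpoint, file size, and block size are only placeholders):

    # write a few GiB of incompressible data to the pool (path and size are arbitrary)
    dd if=/dev/urandom of=/data/ddtest bs=1M count=4096 status=progress

    # in a second terminal, watch per-thread I/O; -o hides idle threads,
    # and the left/right arrow keys change the sort column to reads
    iotop -o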

Include any warning/errors/backtraces from the system logs

There are no errors, only unexpected/unwanted behavior; I'll put stats here instead:

zpool status -D

  pool: data
 state: ONLINE
  scan: scrub paused since Fri Jun  1 19:31:44 2018
	scrub started on   Fri Jun  1 19:07:25 2018
	31.9G scanned out of 1.71T
	0B repaired, 1.82% done
config:

	NAME                                                STATE     READ WRITE CKSUM
	data                                                ONLINE       0     0     0
	  mirror-0                                          ONLINE       0     0     0
	    ata-WDC_WD10EAVS-00D7B0_WD-WCAU41704170-part1   ONLINE       0     0     0
	    ata-WDC_WD10EADS-00L5B1_WD-WCAU48695497-part1   ONLINE       0     0     0
	  mirror-1                                          ONLINE       0     0     0
	    ata-ST1000DM003-9YN162_S1D3CGPZ-part1           ONLINE       0     0     0
	    ata-WDC_WD10EZEX-08WN4A0_WD-WCC6Y3AELSRD-part1  ONLINE       0     0     0
	cache
	  ata-OCZ-TRION100_65UB512QKMCX-part4               ONLINE       0     0     0

errors: No known data errors

 dedup: DDT entries 14037164, size 599B on disk, 193B in core

bucket              allocated                       referenced          
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    10.2M   1.23T   1.03T   1.03T    10.2M   1.23T   1.03T   1.03T
     2    2.96M    376G    352G    352G    6.19M    785G    730G    730G
     4     197K   23.0G   20.7G   20.8G     967K    113G    103G    103G
     8    25.5K   2.99G   2.59G   2.59G     229K   26.7G   23.1G   23.1G
    16    1.89K    213M    141M    142M    41.0K   4.52G   3.01G   3.02G
    32      269   19.3M   9.76M   10.1M    13.2K    982M    504M    521M
    64      139   6.21M   2.34M   2.57M    11.9K    607M    251M    270M
   128       15    904K    160K    184K    2.55K    150M   35.3M   39.5M
   256        7    266K   17.5K     32K    2.82K   86.5M   6.34M   12.7M
   512        2      3K      3K      8K    1.67K   2.50M   2.50M   6.66M
    1K        2      3K      3K      8K    2.69K   4.03M   4.03M   10.7M
    2K        1   1.50K   1.50K      4K    2.29K   3.44M   3.44M   9.18M
    4K        1    128K      4K      4K    5.29K    677M   21.1M   21.1M
 Total    13.4M   1.63T   1.40T   1.40T    17.6M   2.14T   1.87T   1.87T

cat /proc/spl/kstat/zfs/arcstats

13 1 0x01 96 26112 1347083257 29980445681062
name                            type data
hits                            4    143616607
misses                          4    6675033
demand_data_hits                4    3193904
demand_data_misses              4    2128
demand_metadata_hits            4    139639371
demand_metadata_misses          4    3114153
prefetch_data_hits              4    41195
prefetch_data_misses            4    3448900
prefetch_metadata_hits          4    742137
prefetch_metadata_misses        4    109852
mru_hits                        4    10503032
mru_ghost_hits                  4    1058430
mfu_hits                        4    132483843
mfu_ghost_hits                  4    322015
deleted                         4    5994126
mutex_miss                      4    1215
access_skip                     4    0
evict_skip                      4    11774445
evict_not_enough                4    288284
evict_l2_cached                 4    257589443584
evict_l2_eligible               4    302754582528
evict_l2_ineligible             4    43114084352
evict_l2_skip                   4    11156
hash_elements                   4    499557
hash_elements_max               4    975057
hash_collisions                 4    3571836
hash_chains                     4    50859
hash_chain_max                  4    6
p                               4    667666820
c                               4    705057456
c_min                           4    516320256
c_max                           4    12884901888
size                            4    636434888
compressed_size                 4    457182720
uncompressed_size               4    473488896
overhead_size                   4    82210304
hdr_size                        4    40433848
data_size                       4    129216512
metadata_size                   4    410176512
dbuf_size                       4    5559856
dnode_size                      4    10672896
bonus_size                      4    4091200
anon_size                       4    87064576
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    85756928
mru_evictable_data              4    35782656
mru_evictable_metadata          4    8192
mru_ghost_size                  4    552566784
mru_ghost_evictable_data        4    531759104
mru_ghost_evictable_metadata    4    20807680
mfu_size                        4    366571520
mfu_evictable_data              4    0
mfu_evictable_metadata          4    351748096
mfu_ghost_size                  4    86577152
mfu_ghost_evictable_data        4    0
mfu_ghost_evictable_metadata    4    86577152
l2_hits                         4    1423981
l2_misses                       4    5239248
l2_feeds                        4    53233
l2_rw_clash                     4    0
l2_read_bytes                   4    5890176512
l2_write_bytes                  4    255218547712
l2_writes_sent                  4    41744
l2_writes_done                  4    41744
l2_writes_error                 4    0
l2_writes_lock_retry            4    17
l2_evict_lock_retry             4    4
l2_evict_reading                4    0
l2_evict_l1cached               4    326939
l2_free_on_write                4    27479
l2_abort_lowmem                 4    5
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    24279220224
l2_asize                        4    24237309952
l2_hdr_size                     4    36284064
memory_throttle_count           4    0
memory_direct_count             4    74
memory_indirect_count           4    408748
memory_all_bytes                4    16522248192
memory_free_bytes               4    9222279168
memory_available_bytes          3    8964120576
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    115
arc_meta_used                   4    507218376
arc_meta_limit                  4    9663676416
arc_dnode_limit                 4    966367641
arc_meta_max                    4    4876518032
arc_meta_min                    4    16777216
sync_wait_for_async             4    300029
demand_hit_predictive_prefetch  4    3154731
arc_need_free                   4    0
arc_sys_free                    4    258160128

/r/zfs thread about my issue

@Atemu
Author

Atemu commented Jun 5, 2018

Bump

Does anyone know what z_wr_iss is supposed to do and what could cause it to do random reads?

@abrasive
Contributor

Are you running zed (the ZFS event daemon)?

@Atemu
Author

Atemu commented Jun 19, 2018

It should be running by default, so yes

@abrasive
Contributor

abrasive commented Jun 19, 2018 via email

@Atemu
Author

Atemu commented Jun 19, 2018

autoexpand=off on my pool

@abrasive
Contributor

I can't help you then, sorry! ^_^

@thewacokid

thewacokid commented Jun 25, 2018

Layman's guess - you have dedupe enabled, and you only have 16 GB of RAM. ~2.8 GB is taken up by your DDT (assuming it's all in memory). Writes trigger reads on dedup'd pools if the entire DDT is not in RAM.
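
As a rough cross-check against the zpool status -D output above (back-of-the-envelope arithmetic only, ignoring ARC header overhead):

    # 14,037,164 DDT entries x 193 B per entry in core
    echo $(( 14037164 * 193 )) bytes   # 2709172652 bytes, i.e. ~2.5 GiB to hold the whole DDT in RAM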

@Atemu
Author

Atemu commented Jun 26, 2018

That's what I thought as well, but the ARC is almost never full, and limiting it to metadata only doesn't help either (it just gets even smaller).

By the way, is it intended that the ARC empties itself automatically when it isn't even near its limit?

@richardelling
Contributor

The z_wr_iss taskq handles (wr)ites (iss)ued. In the ZIO pipeline, writes can cause reads, but this will be difficult to predict. For example, if the current metaslab is full, the next metaslab's space map is needed and might not be in memory, causing a space map load. Also, CPU cycles can be consumed by the compression, checksum generation, and dedup stages (if enabled).

In general, you will get more responses on the mailing list, because GitHub is a poor tool for conversations.

@Atemu
Author

Atemu commented Jun 28, 2018

Thanks a bunch for your response, that was very insightful!

Do you know of a way to force the space maps to be cached all the time?
I've got plenty of unused space in my ARC and L2ARC.

@richardelling
Contributor

Yes, the zfs-module-parameters(8) tunable that controls unloading of space maps is metaslab_debug_unload.
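
For anyone following along, a sketch of setting that tunable at runtime (plus the related metaslab_debug_load mentioned in the next comment; settings persist only until reboot unless also put in /etc/modprobe.d/zfs.conf):

    # keep space maps in memory instead of unloading them when a metaslab goes idle
    echo 1 > /sys/module/zfs/parameters/metaslab_debug_unload
    # optionally also load all space maps when the pool is imported
    echo 1 > /sys/module/zfs/parameters/metaslab_debug_load

    # persistent variant, e.g. in /etc/modprobe.d/zfs.conf:
    #   options zfs metaslab_debug_unload=1 metaslab_debug_load=1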

@Atemu
Author

Atemu commented Jun 28, 2018

I set metaslab_debug_unload and metaslab_debug_load to 1, and now whenever I import a zpool it takes a while to read ~200 MiB into the ARC, but unfortunately the 9 z_wr_iss threads still do their random reads when writing.

bunder2015 added the Type: Performance (Performance improvement or performance problem) label Jan 31, 2019
@janetcampbell

janetcampbell commented Apr 6, 2019

This is old, but I'd like to offer a few suggestions that may help.

Step 1 in these situations is almost always to use zpool iostat -r to your advantage. Find out what is being read: is it pool data, or is it metadata? What is your typical recordsize? Are the reads showing up as sync or async?
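
A sketch of that kind of observation (pool name and interval are placeholders):

    # per-vdev request-size histograms, refreshed every 5 s while the write load runs
    zpool iostat -r data 5

    # pending/active counts per I/O class, useful for telling sync from async reads
    zpool iostat -q data 5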

If it's pool data (reads are the same size as your recordsize), then you may be seeing RMW reads during TxG commit, as it brings in the missing pieces of each block that need to be written. These can get nasty and dominate the workload in some cases as we found out on Solaris with ZFS in the early days. Generally, RMW for sync writes will create sync reads, while RMW for async writes will create async reads. If they look like they may be RMW reads, you can try:

Decreasing your recordsize
Making TxG commit less frequent
Limiting the speed of TxG commit
Deprioritizing the reads

Decreasing the recordsize has a number of effects and may or may not be a practical decision. You need to look at your write distribution and try to figure out whether it's worth doing. ZFS does so much better with large records that I consider going below about 32K a last resort, except for small special-purpose cases.

If you delay TxG commit, you may be able to get all the pieces of a block before you need to write it. This is usually a matter of growing the txg timeout, dirty_data_sync, and possibly dirty_data_max. This is most useful with applications that drip out small writes and periodically do a "checkpoint" with a lot of big sequential writes. MongoDB is one example.
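
On the 0.7.x release in this report, that usually means touching tunables along these lines (the values are only illustrative, not recommendations):

    # commit a TxG at most every 15 s instead of the default 5 s
    echo 15 > /sys/module/zfs/parameters/zfs_txg_timeout
    # let more dirty data accumulate before a commit is forced (default 64 MiB)
    echo $(( 256 * 1024 * 1024 )) > /sys/module/zfs/parameters/zfs_dirty_data_sync
    # and, if RAM allows, raise the overall dirty data cap
    echo $(( 4 * 1024 * 1024 * 1024 )) > /sys/module/zfs/parameters/zfs_dirty_data_max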

The TxG commit, and RMW reads that issue from it, can be limited with zfs_sync_taskq_batch_pct. The default (75%) is greatly oversized for most modern systems. Try decreasing it sharply (to 5-10%) and see if the commit and the reads get more orderly and stress the system less. This would be my first step. You can usually go quite low before the speed of the TxG commit itself is impacted.
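
For example (only a sketch; the 10% figure follows the suggestion above, and the new value may not take effect for an already-imported pool until it is re-imported):

    # shrink the sync taskq from its default of 75% of CPUs
    echo 10 > /sys/module/zfs/parameters/zfs_sync_taskq_batch_pct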

Finally, you can deprioritize RMW reads at the vdev level. If they're sync reads (RMW for earlier sync writes), there are a lot of ways to do this, but they tend to rely on the fact that sync reads are feast or famine: either they're not a problem, or they bury the box. Try decreasing the sync read minimum active to 1-2, and you will move the other I/O classes much higher in the queueing order.
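
One way to express that with the vdev queue tunables (illustrative; the default minimum is 10 on 0.7.x):

    # give sync reads fewer guaranteed slots per vdev so other I/O classes are scheduled sooner
    echo 2 > /sys/module/zfs/parameters/zfs_vdev_sync_read_min_active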

I'm working on an article to clarify this further. Hope this helps.

@stale

stale bot commented Aug 24, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale bot added the Status: Stale (No recent activity for issue) label Aug 24, 2020
stale bot closed this as completed Nov 25, 2020