How to stop RMW reads at write time (long) #8590

Open
janetcampbell opened this issue Apr 7, 2019 · 29 comments
Labels
Bot: Not Stale (Override for the stale bot), Type: Performance (Performance improvement or performance problem)

Comments

@janetcampbell

janetcampbell commented Apr 7, 2019

First of all I'd like to say thanks for being such an active community; I've had great success building on your work. Skip to the bottom if you want the punchline. This really is an issue, just with a lot of backstory. My apologies if all of this is already well known to you.

I've run into some misconceptions on RMW with ZFS and Linux, and it's made me realize that some things about a healthy ZFS system are not well known. I came from the Solaris world and was a long time ZFS user there as well. To review:

Between TxG commits, ZFS does NOT issue an RMW read I/O to deal with a direct write.

Blocks are not otherwise rewritten at that point in time. Direct sync writes incur no RMW and no compression, because they are stored in ZIL blocks (apart from the usual ZIO zero-tail trimming). They are stored as-is, variable length, not padded out to the full recordsize.

With a SLOG and logbias=latency, neither sync writes nor the async writes that commit alongside them require an RMW read to be issued immediately. Any necessary RMW reads are deferred until TxG commit, which gives ZFS the chance to assemble the full block from the accumulated pieces without reading at all. Small async writes that end up filling a block require no RMW at any point.
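
For concreteness, a minimal sketch of the configuration that claim assumes (the pool, dataset, and device names are examples, not anything from this issue):

# attach a dedicated log device and keep sync writes on the ZIL/SLOG path (names are examples)
zpool add tank log /dev/nvme0n1
zfs set logbias=latency tank/mydata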

But I kept hearing people complain about immediate RMW on partial record writes and I/O storms from mismatched ZVOL sizes, and realized what I'd almost forgotten - I had to deal with that on ZVOLs at first, where it was fairly easy to fix once I realized what was going on. I broke my back right after and forgot to pass the news on.

The RMW reads that accompany writes are not coming from ZFS. They are largely a product of a misunderstanding with the kernel pager.

If you look at them with blktrace on a ZVOL, you realize that they are coming in from outside of ZFS, and that the read comes before the write. They're usually full recordsize. So, what's happening?

https://elixir.bootlin.com/linux/v4.20.17/source/fs/buffer.c#L1668

If you look through buffer.c, you can see where it tries to match data in its pages up to the i_size of the superblock inode. Especially relevant are:

__block_write_full_page
nobh_writepage

You can see how, when the page size is smaller than i_size, the kernel pulls in i_size worth of data, does the RMW itself, and then hands ZFS back a full block. We want to stop this. It's especially damaging because it pollutes the ARC and stresses the ZIL far more than necessary, making effective caching difficult.
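
One way to see whether the RMW is being initiated above ZFS is to trace the zvol device itself while generating small buffered writes. This is only a sketch; the device path and mount point are the example names from the recipe below. Reads of roughly i_size arriving on the zvol ahead of the writes point at the kernel pager; a pure stream of 4k writes points back at ZFS.

# terminal 1: trace I/O entering the zvol from above ZFS (device path is an example)
blktrace -d /dev/zvol/tank/xfs -o - | blkparse -i -

# terminal 2: generate small buffered writes through the mounted filesystem
dd if=/dev/urandom of=/somewhere/testfile bs=4k count=1024 conv=fsync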

With ZVOLs, you can do:

logbias=latency
primarycache=all
ideally add a SLOG
zfs create -V 10g -o volblocksize=128k tank/xfs
zfs create -V 100m -o volblocksize=4k tank/xfsjournal

mkfs.xfs -s size=4096 -d sw=1,su=131072 -m crc=0 -l logdev=/dev/zvol/tank/xfsjournal /dev/zvol/tank/xfs
mount -o largeio,discard,noatime,logbsize=256k,logbufs=8,logdev=/dev/zvol/tank/xfsjournal /dev/zvol/tank/xfs /somewhere

This sets the filesystem superblock i_size to 4k, eliminating the RMW reads from the kernel. It advertises 128k to applications as the preferred I/O size and aligns at 128k. With the RMW reads gone, we can stretch the TxG commit out and make it more efficient.
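
To confirm what the filesystem ended up with, a quick check (a sketch; /somewhere is the example mount point above):

# verify the 4k block size and the 128k stripe geometry (mount point is the example above)
xfs_info /somewhere      # expect bsize=4096 and sunit=32 blks (32 * 4k = 128k)
stat -f /somewhere       # "Block size: 4096"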

Separating the XFS journal from the data lets XFS slam the journal ZVOL as hard as it wants with cache flushes without hurting anything, so async data can build up in memory. Async writes stay perfectly quiet between TxG commits - I'd forgotten how much this helps. You can use much larger blocksizes with minimal impact and go to longer TxG commit intervals, as you could in Solaris.

You can learn a lot about a healthy TxG commit by slowing it way down. Setting zfs_sync_taskq_batch_pct=1, running zpool iostat, and getting some popcorn gives you a great view into the commit and what it does (see the sketch after the list below).

RMW reads from direct sync writes turn into sync reads.
RMW reads from async writes turn into async reads.
If you avoid indirect sync, all RMW happens at this point in time.
As the reads die down, the async writers power up.
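
A minimal sketch of that exercise (the module parameter path assumes ZFS on Linux; the setting is for observation only, not for production):

# slow the commit pipeline down so its phases are easy to watch (observation only)
echo 1 > /sys/module/zfs/parameters/zfs_sync_taskq_batch_pct

# watch the per-request-size histograms while the commit runs
zpool iostat -r tank 1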

I suspect the ZPL could be greatly improved by similarly matching i_size to eliminate RMW reads. ZVOLs worked so well for us after this fix that we haven't done anything with the ZPL. Performance is excellent and actually beats plain XFS in our DB benchmarks. We rate-limit what few RMW reads we have and they're about as dangerous as fluffy kittens. Average application write size: 32k; average blocksize: 128k; average TxG commit write size: >1M. And 256k blocksizes work just as well.

I hope this helps! Better late than never?

Janet

@richardelling
Contributor

FWIW, it has been a long time since I’ve seen cases where logbias=throughput helps. It is about time to gather the notes and measurements to add more tuning tips to the wiki.

@janetcampbell
Author

janetcampbell commented Apr 7, 2019

Oh, I agree. logbias=throughput is horrible for subsequent read performance - it can double the IOPS needed for every random read, and sometimes it's even worse.

With RMW at write time gone, we run large block sizes, long TxG commit intervals, and either SLOGs or high zfs_immediate_write_size. A lot like big Solaris database servers, actually. Mongo actually got its best read performance with a 256K volblocksize in my testing.

@janetcampbell
Author

This should also resolve #361

@zviratko

zviratko commented Apr 7, 2019

Thanks for an extensive writeup!

But how can we mitigate this when ZVOLs are used by 3rd-party VMs (or by something other than XFS)? Does that even apply? In my case there's DRBD sitting in between the ZVOL and the VM to complicate things even more, and all the writes are synchronous (AFAIK). The guest OS sees a different blocksize on the block device than volblocksize, so that should help things, right?
We got the best results with logbias=throughput, but RMW is still pretty bad whenever I benchmark something. We tried a SLOG and it wasn't much better (it mostly just helps the benchmarks to some degree). Our pools use SSDs, and I feel like for some reason that makes some problems worse (and NVMe was even worse than that - ZFS just hammered it with minimal effective I/O).

@janetcampbell
Author

janetcampbell commented Apr 7, 2019

You probably got the best results with logbias=throughput because you were being slammed with RMW reads on the ZVOL. To narrow down your problem, get a SLOG if possible, set a large ZVOL volblocksize (like 128k), and increase zfs_txg_timeout as well as zfs_dirty_data_sync (to about half of zfs_dirty_data_max). Set logbias=latency. Make sure primarycache=all!

Then send small writes while watching zpool iostat -r. You should see no RMW between txg commits if things are working properly. If you do see RMW reads, find out if they are present at all layers (VM, DRBD, ZVOL) to see which point is confused about the minimum size it must use. blktrace on the ZVOL device can be very helpful.

I believe that I could get the same results by setting one of the block sizes ("physical block size"?) on the ZVOL device node to 4096 using blktool or something similar. This may help. The proper change is probably to change how the block device parameters are set up within ZFS.
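
For reference, the sizes a zvol currently advertises to the kernel can be inspected like this (read-only checks; zd0 is an example device name):

# block sizes the kernel sees for a zvol (zd0 is an example device)
cat /sys/block/zd0/queue/logical_block_size
cat /sys/block/zd0/queue/physical_block_size
blockdev --getss --getpbsz /dev/zd0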

We use ZFS on top of RBD devices with great success. With a bit of tuning it can push 10Gbps across 100ms of WAN doing incremental ZFS sends, and that's a beautiful thing.

@zviratko

zviratko commented Apr 8, 2019

I tried a SLOG in the past with little effect. I see no reason to use one, as the SSDs are all pretty fast with capacitor-backed cache, so flushes are fast; that's why I run with logbias=throughput (and I've seen better performance with it, with no regression in latency benchmarks). If there's no way to avoid RMW without a SLOG and logbias=latency, that sounds like at least a performance problem, more likely a bug, though I understand the history of why it might work better with them...

@shodanshok
Contributor

shodanshok commented Apr 8, 2019

@janetcampbell what you write is surely true, but only if the to-be-modified blocks are already in the ARC / L2ARC. If they are not present in cache, any write to such blocks will cause an RMW cycle (with synchronous reads). See my comments on #8472

@ahrens
Member

ahrens commented Apr 9, 2019

Between TxG commits, ZFS does NOT issue an RMW read I/O to deal with a direct write.

@janetcampbell Looking at the code, it seems that for partial-block writes, zfs_write() will call dmu_write_uio_dbuf() -> dmu_write_uio_dnode() -> dmu_buf_will_dirty() -> dmu_buf_will_dirty_impl() -> dbuf_read(). Where have I misunderstood?

Note, this code path does look different for zvols, where zvol_request() does:

			if (zvol_request_sync || need_sync ||
			    taskq_dispatch(zvol_taskq, zvol_write, zvr,
			    TQ_SLEEP) == TASKQID_INVALID)
				zvol_write(zvr);

If we don't need_sync, then the entire request is dispatched to a taskq. zvol_write() looks like it will synchronously read the block when presented with a sub-block write, via dmu_write_uio_dnode(), just like zfs_write().

@ahrens added the Type: Performance label Apr 9, 2019
@shodanshok
Contributor

@ahrens this is consistent with my findings here: #8472 (comment)

In short, if the to-be-written partial block is not already in the ARC, writing it will bring it into the ARC via a synchronous read.

@janetcampbell
Author

janetcampbell commented Apr 9, 2019

I'm working on a longer post which will fully illustrate the issue - I promise it will be worth it, so bear with me a bit. In short, when you trace writes coming into ZFS that transit buffer.c (page writes), you will find that they are i_size bytes long - that is why any direct sync ZIL writes that result from them show up at i_size, whether they come from a single 4K write or a 256K write, and regardless of the volblocksize or recordsize. You can see this clear as day by adding a SLOG or raising zfs/zvol_immediate_write_size, setting sync=always, and running zpool iostat -r while you generate page writes. It was especially noticeable with XFS, which normally has a maximum i_size of 32K for any volblocksize higher than that, and it's easy to set with mkfs.

The Linux kernel used to just flush things out 4K at a time regardless of i_size. I suspect this changed somewhere in the 2014 time range but I'm not 100% sure.

zfs_log_write and zvol_log_write receive the size of the write that zfs_write and zvol_write receive (edit: it can be broken up for large writes; I'm talking about sub-128k for the moment). They don't use the recordsize/volblocksize as the write size, and they certainly don't get i_size from anywhere. That I/O is already coming in those chunks by the time it reaches zfs_write and zvol_write.

When i_size != pagesize, you also end up with read/write cycles coming into ZFS, which prevents it from efficiently aggregating I/O. Every write from the pager is interspersed with a sync read from it. When i_size == pagesize, the queue can be plugged while the writes buffer and then unplugged.

The DMU has been unfairly accused in all this. Solaris does not work this way, either.

@shodanshok
Contributor

@janetcampbell Sure, the Linux pagecache is flushed in 4k blocks which are later re-aggregated, and that causes RMW on the zvol side.

However, in comment #8472 (comment) I tried a plain, non-sync 4k write against a simple file, which caused noticeable read amplification. That very read amplification goes away if the block is already in the ARC. Maybe I am missing something; can you explain my findings? Thanks.

@janetcampbell
Author

janetcampbell commented Apr 9, 2019

Sure.

When that write is trying to be flushed out, buffer.c looks at your superblock i_size and sees 128k (you can check this with "stat"). It looks at the 4k that it has to write, performs a sync read of 128k, merges in a 4k page, and then writes 128k. What it should do, and what it does on the vast majority of things that look like disks to Linux, is to just write 4k. It used to do this universally, and I think that change by Linux was the cause of a significant performance degradation in the past involving increased RMW reads. They've been the norm for long enough for people to think they were innate to ZoL.
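
For the "check this with stat" step, the relevant field is the preferred I/O size reported for the file (a sketch; the path is just an example):

# "IO Block:" is st_blksize, which for a ZFS file typically reflects the dataset recordsize (128k here)
stat /tank/test/somefile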

When it performs a read and the block is in the ARC, ZFS can fulfill the sync read without additional disk accesses. But, it's still happening between the pager and ZFS! Look at blktrace if you do it on a ZVOL and you will see it.

If you set sync=always on your dataset and add a SLOG (so all writes will be direct sync), you will find that you first get an i_size direct sync ZIL write to the SLOG, followed by a recordsize/volblocksize (or smaller, with compression) write to the main pool vdev during TxG commit. How is zfs_log_write/zvol_log_write getting an i_size write? It gets the size from the incoming write request. The DMU is not involved in the size passed to *_log_write.

This is all really visible on ZVOLs where you can set i_size easily and you can blktrace to see the RMW coming in (or not, if i_size == pagesize), and the code in zvol.c is so simple it's unambiguous.
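
A sketch of that experiment on the example zvol-backed XFS from the top of the thread (dataset and mount names are the example ones, and a SLOG is assumed so the sync writes stay direct):

# make every write a direct sync ZIL write so its size shows up in zpool iostat -r
zfs set sync=always tank/xfs
zfs set logbias=latency tank/xfs

# small buffered writes through the mounted filesystem
dd if=/dev/urandom of=/somewhere/testfile bs=4k count=1024 conv=fsync

# in another terminal: compare the sync_write request sizes against i_size and volblocksize
zpool iostat -r tank 1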

Edit: If you test this, make sure not to use /dev/zero - that ZIO zero tail trimming can surprise you!

Second edit: it's actually even worse than that. I don't think it aggregates the pages it's writing; it slams you with reads and writes over and over, per page, even within the same i_size block. The reads can be answered out of cache, and the writes just stack up in memory if they're async, but it's still startling. This is why pager writes with sync=always on a large-block volume are so bad.

A full story of chasing these reads with examples will be forthcoming. Basically the flow as I see it is like:

for each pagesize block going into buffer.c for delivery:
    if i_size != pagesize:
        buffer.c issues an i_size read to ZFS
        ZFS reads recordsize from the ARC (if cached) or disk, returns an i_size read to buffer.c
        buffer.c merges the pagesize block into the i_size read
    buffer.c issues an i_size write
    ZFS makes an i_size ZIL block write (if sync=always)
    ZFS incorporates the i_size block into the DMU

Which works immensely better when they match.

@janetcampbell
Author

janetcampbell commented Apr 10, 2019

@janetcampbell Looking at the code, it seems that for partial-block writes, zfs_write() will call dmu_write_uio_dbuf() -> dmu_write_uio_dnode() -> dmu_buf_will_dirty() -> dmu_buf_will_dirty_impl() -> dbuf_read(). Where have I misunderstood?

@ahrens this is consistent with my findings here: #8472 (comment)

@ahrens @shodanshok

dmu_write_uio_dnode() will only dirty a dmu buf and fill it with a read if an individual dmu buf (dbuf block) will be partially filled. dmu_write_uio_dnode() calls dmu_buf_hold_array_by_dnode() to do a dbuf_hold() on the dbuf and returns the relevant blocks in an array, then iterates over them. If the individual dbuf blocks are fully filled, there will be no call to dmu_buf_will_dirty() and no RMW read, even if the write is smaller than recordsize/volblocksize.

For example: if you make a 4K write on a 128K record, but your dnodesize is legacy (512B) or anything 4K or below (assuming an aligned write), you will not partially fill any dbuf blocks, and you will not incur RMW reads here. dmu_write_uio_dnode() does not base its decision to dirty-and-read a dbuf block on whether the recordsize has received a full write; it looks at whether the dbuf block has received a full write. This is entirely deliberate.

This is one reason why large-granularity dbufs ("large dnodes") are so destructive to small write performance, once you get the RMW reads from the pager out of the way. It's why Solaris deliberately still uses a very small dnodesize (512B) across the board unless you override it, and does not try to match it with volblocksize on ZVOLs. Nothing that accepts writes from the pager should have a dnodesize > 4K, as a general rule.

I'll be posting another issue to fix the large dnodesize dependency for ZVOLs but I wanted to get this (pager-based RMW reads) out of the way first. The dnodesize issue isn't really obvious until the RMW reads from the pager are gone. If you have a ZVOL and set dnodesize=4K or below, you totally eliminate dmu_buf_will_dirty() reads from dmu_write_uio_dnode(), but you must have a SLOG to avoid indirect sync.

Hope this helps! Please let me know if anything is unclear.

@ahrens
Member

ahrens commented Apr 10, 2019

@janetcampbell I think you are confusing block size (aka dbuf size, determined by the recordsize property, default 128K) and dnode size (default 512B, determined by the dnodesize property), which are two different things. The dnode size has no impact on the write code path. The extra dnode space is used to store "system attributes" (aka SA's), such as extended attributes if the xattr property has been changed to sa.

dmu_write_uio_dnode() will only dirty a dmu buf and fill it with a read if an individual dmu buf (dbuf block) will be partially filled.

That's right. For example, when doing a 4K write to a 128K block, the DMU will do a read-modify-write. The write size is determined by the size argument to the write(2) syscall. The block size is determined by the recordsize property, which defaults to 128K.

if you make a 4K write on a 128K record, but your dnodesize is legacy (512B) or anything 4K or below (assuming an aligned write), you will not partially fill any dbuf blocks

The dnodesize is not relevant here. If you make a 4K write on a 128K block, you will partially fill the 128K block, because 4K is less than 128K.

large-granularity dbufs ("large dnodes") are so destructive to small write performance

Doing small writes to big blocks (i.e. write size < recordsize) causes the DMU to do a read-modify-write, which causes poor performance. The dbuf size is unrelated to "large dnodes".
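
To make the distinction concrete, a minimal sketch (illustrative file name; default 128K recordsize assumed; the file already exists and is not cached). The first overwrite dirties only whole 128K blocks and needs no read; the second dirties 4K of each 128K block and forces the DMU to read the rest, which shows up as recordsize-sized reads in zpool iostat -r.

# whole-record overwrites: write size == recordsize, no read-modify-write needed
dd if=/dev/urandom of=/tank/test/file bs=128k count=32 conv=notrunc,nocreat

# sub-record overwrites: 4k into 128k records, each dirtied block is read in first
dd if=/dev/urandom of=/tank/test/file bs=4k count=1024 conv=notrunc,nocreat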

@janetcampbell
Author

I'm sorry for the confusion - it's been a long day and I've been looking at many source code versions. I'm going to focus on the pager RMW for the moment, because that issue must be solved before the rest becomes apparent, and it's extremely obvious when you see it. I think the presence of those pager reads obscured the introduction of other changes within the DMU that increased RMW.

To my knowledge, the DMU is not supposed to be doing RMW over the entire domain of the recordsize at the point in time that a write comes in. In Solaris, you will see it chopping things up into 512-byte buffers and then filling or dirtying within those if necessary, and this has been true since at least 2009.

Sorry for the confused comment.

@shodanshok
Contributor

shodanshok commented May 5, 2019

@janetcampbell I had some more time to think about the problem, and to do some more tests. In short, using a fully updated CentOS 7.6 amd64 system (kernel 3.10.0-957.5.1.el7.x86_64), I see no evidence of r/m/w carried out by the pagecache. From the example below:

# create and overwrite a small zvol
[root@nas ~]# zfs create tank/vol1 -V 1G -b 128k
[root@nas ~]# dd if=/dev/urandom of=/dev/zvol/tank/vol1 bs=1M
dd: error writing ‘/dev/zvol/tank/vol1’: No space left on device
1025+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 22.6181 s, 47.5 MB/s

# drop caches and write 16 MB (4k blocks * 4096 io) via pagecache
[root@nas ~]# sync; echo 3 > /proc/sys/vm/drop_caches
[root@nas ~]#
[root@nas ~]#
[root@nas ~]# dd if=/dev/urandom of=/dev/zvol/tank/vol1 bs=4k count=4k
4096+0 records in
4096+0 records out
16777216 bytes (17 MB) copied, 1.13477 s, 14.8 MB/s

# on another terminal, observe how reads and writes are issued
[root@nas ~]# iostat -x -k 1 /dev/zd0 /dev/sda /dev/sdb
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00   12.00   38.50    0.00   49.50

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00   11.00  306.00  1284.00 35232.00   230.38     1.75    5.20    5.91    5.17   1.68  53.20
sda               0.00     0.00   98.00  309.00 12364.00 35232.00   233.89     1.43    3.25    1.83    3.71   1.51  61.60
zd0               0.00     0.00    0.00 4096.00     0.00 16384.00     8.00     6.14    1.50    0.00    1.50   0.05  19.60

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    2.01   32.66    0.00   64.82

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00   31.00    6.00   420.00    16.00    23.57     0.75   23.03   15.90   59.83  10.78  39.90
sda               0.00     0.00   45.00    6.00  1736.00    16.00    68.71     0.55   12.82    7.13   55.50  10.45  53.30
zd0               0.00     0.00  261.00    0.00  1044.00     0.00     8.00     0.00    0.00    0.00    0.00   0.00   0.10

You can see that almost all reads are issued by ZFS against the physical disks, rather than by the pagecache against the zvol. The only reads issued "above" the zvol layer account for just 1MB, with blktrace showing them at the end of the zvol device (probably a byproduct of how Linux block devices are opened/closed).

I will try a more recent kernel and update these findings.
Thanks.

@shodanshok
Contributor

shodanshok commented May 6, 2019

@janetcampbell I just tried on a fully updated Fedora 29 with kernel 5.0.10 and I see the same behavior: any read amplification seems due to the DMU rather than the kernel pagecache. Running the same commands as above yields the following iostat -x -k 1 output:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    7.04   24.62    0.00   68.34

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sda            166.00  291.00  16524.00  34352.00     0.00     1.00   0.00   0.34    5.56    2.15   1.36    99.54   118.05   0.89  40.70
dm-0            35.00    0.00    140.00      0.00     0.00     0.00   0.00   0.00   11.57    0.00   0.41     4.00     0.00   0.20   0.70
dm-1             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
zd0              0.00 4096.00      0.00  16384.00     0.00     0.00   0.00   0.00    0.00    1.67   6.83     0.00     4.00   0.04  15.50

Any reads seem to be issued by the zvol (zd0) against the physical device (sda).

@stale

stale bot commented Aug 24, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale bot added the Status: Stale label Aug 24, 2020
stale bot closed this as completed Nov 25, 2020
@janetcampbell
Author

My apologies for how long it took me to get back to this; I was badly disabled shortly after opening this issue.

@janetcampbell I had some more time to think about the problem, and to do some more tests. In short, using a fully updated CentOS 7.6 amd64 system (kernel 3.10.0-957.5.1.el7.x86_64), I see no evidence of r/m/w carried out by the pagecache. From the example below:

# create and overwrite a small zvol
[root@nas ~]# zfs create tank/vol1 -V 1G -b 128k
[root@nas ~]# dd if=/dev/urandom of=/dev/zvol/tank/vol1 bs=1M

You're not seeing activity from the page cache because this is not a mounted filesystem. Reads and writes to the raw block device don't go through the path I'm concerned with. Pageouts to a mounted filesystem do; you can see the kernel-initiated RMW when a filesystem with an i_size larger than the OS page size is mounted. Forcing XFS to use a 4k i_size sidesteps this problem. This allows you to use largeio and a stripe unit size that matches the volblocksize.

I will be posting updated information to this thread as I get a testbed back together. I need to retest with current versions and should be able to support this with blktrace on the ZVOL dev.

Apologies again for not being able to follow up on this earlier, but better late than never.

Thanks,

Janet

@shodanshok
Contributor

My apologies for how long it took me to get back to this; I was badly disabled shortly after opening this issue.

@janetcampbell I'm sincerely sorry - I hope you are better now.

You're not seeing activity from the page cache because this is not a mounted filesystem. Reads and writes to the raw block device don't go through the path I'm concerned with. Pageouts to a mounted filesystem do; you can see the kernel-initiated RMW when a filesystem with an i_size larger than the OS page size is mounted. Forcing XFS to use a 4k i_size sidesteps this problem. This allows you to use largeio and a stripe unit size that matches the volblocksize.

A mounted filesystem can surely command its own r/m/w - especially with a sector size mismatch. However, the tests I did in the past were done on a) a simple ZFS dataset or b) a "raw" zvol device. In both cases a very significant read amplification was observed when overwriting with smaller-than-recordsize blocks.

But three years have passed - let me do a new test on a new Rocky Linux 8 + ZFS 2.0.7 machine:

# create a new dataset
[root@localhost ~]# zfs create tank/test
[root@localhost ~]# zfs set recordsize=1M tank/test

# populate it with a test file
[root@localhost ~]# dd if=/dev/urandom of=/tank/test/random.img bs=1M count=32

# export/import the pool to avoid any caching
[root@localhost ~]# zpool export tank; zpool import tank

# overwrite the previously written file
[root@localhost ~]# dd if=/dev/urandom of=/tank/test/random.img bs=4k count=1024 conv=notrunc,nocreat
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.127304 s, 32.9 MB/s

# on another terminal, run iostat (headers trimmed for clarity)
Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb               5.00      4100.00         0.00       4100          0
sdb               0.00         0.00         0.00          0          0
sdb              52.00         0.00      4416.00          0       4416

As you can see, 4 MB were read before anything was written to the disk. In this specific case of a VM running on an NVMe disk, performance remains acceptable. However, doing the same on an HDD would lead to vastly lower throughput due to the read amplification caused by the r/m/w.

To the best of my understanding, this performance issue was going to be addressed by the "async DMU" patches - which were retired due to data corruption.

Am I missing something?
Thanks.

@janetcampbell
Author

@janetcampbell I'm sincerely sorry - I hope you are better now.

Thank you. A broken spine in many places but I'm made of tough stuff :)

Am I missing something? Thanks.

I believe so. You should be able to confirm by looking at "zpool iostat -r" while performing this test. Postponing TxG commit (by increasing the txg sync timeout and the dirty data variables) can make things clearer and allow you to separate the I/O that happens at write time from the I/O that happens at TxG commit time.

If at write time you see sync reads of the size of the record, this is likely the VM RMW I mentioned earlier.

Under normal operation, writes will be stored in memory until TxG commit is reached. At that point, the sync taskq threads will initiate async reads with the recordsize if and only if the block has not been fully filled in. Varying the sync taskq allows you to control the degree of read concurrency in the TxG commit RMW process.

This is how things have operated on Solaris since the early days, and how OpenZFS operated on Linux in the earlier 0.6 days. Two unrelated changes happened that appear to have caused the pathology: i_size as provided by ZFS was altered, and the Linux kernel added RMW for cases where i_size != pagesize.

This behavior (no RMW at write time when properly configured) is relied on extensively in the Solaris/Oracle world. One can have an Oracle redo log on a dataset with a large recordsize, but in the presence of a SLOG and logbias=latency, no RMW is done at write time. At TxG commit time, the writes are efficiently packaged into large records and incur a minimum of RMW reads.

The XFS setup I mentioned at the top of this thread remediated the problem successfully in the past, and you can see an elimination of sync read RMW at write time along with the initiation of fewer async reads at TxG commit time. I'm getting my testbed set up again and will post once I have some examples.

Thanks,

Janet

@shodanshok
Contributor

Thank you. A broken spine in many places but I'm made of tough stuff :)

Happy to know that you feel better!

I believe so. You should be able to confirm by looking at "zpool iostat -r" while performing this test. Postponing TxG commit (by increasing the txg sync timeout and the dirty data variables) can make things clearer and allow you to separate the I/O that happens at write time from the I/O that happens at TxG commit time.

If at write time you see sync reads of the size of the record, this is likely the VM RMW I mentioned earlier.

OK, let's see what is happening here... I increased zfs_txg_timeout to 300s to avoid "surprise syncs", while zfs_dirty_data_max is at ~800M, so I left it unchanged.

# this is the same file and dataset of the previous test (1M recordsize, compression=off)
[root@localhost parameters]# zpool export tank; zpool import tank
[root@localhost parameters]# dd if=/dev/urandom of=/tank/test/test.img bs=4k count=4096 conv=notrunc,nocreat
4096+0 records in
4096+0 records out
16777216 bytes (17 MB, 16 MiB) copied, 0.5638 s, 29.8 MB/s

# 32 MB are read *before* anything is written
root@localhost parameters]# dstat -d -D sdb
--dsk/sdb--
 read  writ
   0     0
   0     0
  32M    0
   0     0
   0     0

# many 1M records are read before anything is written
[root@localhost parameters]# zpool iostat -r 1
tank          sync_read    sync_write    async_read    async_write      scrub         trim
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0      0      0      0      0      0      0
1K              0      0      0      0      0      0      0      0      0      0      0      0
2K              0      0      0      0      0      0      0      0      0      0      0      0
4K              2      0      0      0      0      0      0      0      0      0      0      0
8K              0      0      0      0      0      0      0      0      0      0      0      0
16K             0      0      0      0      0      0      0      0      0      0      0      0
32K             0      0      0      0      0      0      0      0      0      0      0      0
64K             0      0      0      0      0      0      0      0      0      0      0      0
128K            0      0      0      0      0      0      0      0      0      0      0      0
256K            0      0      0      0      0      0      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      0      0      0      0      0
1M             29      0      0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0      0      0
----------------------------------------------------------------------------------------------

As no filesystem other than ZFS is used, I fail to see how the pagecache/VM is involved (it should be a 100% "ARC affair"). Moreover, nothing is written at that point. It really seems (to me) to be an r/m/w commanded by the ARC itself (i.e. it needs to read an entire record before modifying and writing it).

It would be great if one could avoid r/m/w for partial record writes when entire records are about to be modified, but I fail to see how this is possible in light of these results.

I noticed you referenced a SLOG device multiple times: are your findings valid for sync writes only (i.e. when they are flushed from the SLOG to the main pool)?

Thanks.

@janetcampbell
Author

janetcampbell commented Feb 1, 2022

It really seems (to me) to be an r/m/w commanded by the ARC itself

Humor me for just one more moment: blktrace the zvol device itself (not the pool device) while you do this test again. I expect you will see, as I have, reads going into the zvol, followed by writes going into the zvol. You would expect to see a stream of 4k writes and no reads if there was no RMW initiated by the kernel.

@shodanshok
Contributor

Humor me for just one more moment: blktrace the zvol device itself (not the pool device) while you do this test again. I expect you will see, as I have, reads going into the zvol, followed by writes going into the zvol. You would expect to see a stream of 4k writes and no reads if there was no RMW initiated by the kernel.

Sure, these are the results:

# increase sync timeout
[root@localhost parameters]# echo 300 > zfs_txg_timeout

# empty zvol
[root@localhost ~]# zfs create tank/vol1 -V 1G -o volblocksize=1M
[root@localhost ~]# dd if=/root/random.img of=/dev/zvol/tank/vol1 bs=4k count=1024
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.0354642 s, 118 MB/s

# blktrace and blkparse
[root@localhost ~]# blktrace -d /dev/zvol/tank/vol1 -o - | blkparse -i -
CPU0 (vol1.trace):
 Reads Queued:           0,        0KiB  Writes Queued:       1,024,    4,096KiB
 Read Dispatches:        0,        0KiB  Write Dispatches:        0,        0KiB
 Reads Requeued:         0               Writes Requeued:         0
 Reads Completed:        0,        0KiB  Writes Completed:    1,024,    4,096KiB
 Read Merges:            0,        0KiB  Write Merges:            0,        0KiB
 Read depth:             0               Write depth:             0
 IO unplugs:             0               Timer unplugs:           0

# full zvol
[root@localhost ~]# dd if=/dev/urandom of=/dev/zvol/tank/vol1 bs=1M status=progress
1055916032 bytes (1.1 GB, 1007 MiB) copied, 33 s, 32.0 MB/s
dd: error writing '/dev/zvol/tank/vol1': No space left on device
1025+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 36.326 s, 29.6 MB/s

# export and import zpool (to release ARC)
[root@localhost ~]# dd if=/root/random.img of=/dev/zvol/tank/vol1 bs=4k count=1024
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.00390314 s, 1.1 GB/s

[root@localhost ~]# blktrace -d /dev/zvol/tank/vol1 -o - | blkparse -i -
CPU0 (230,0):
 Reads Queued:           0,        0KiB  Writes Queued:       1,024,    4,096KiB
 Read Dispatches:        0,        0KiB  Write Dispatches:        0,        0KiB
 Reads Requeued:         0               Writes Requeued:         0
 Reads Completed:        0,        0KiB  Writes Completed:    1,024,    4,096KiB
 Read Merges:            0,        0KiB  Write Merges:            0,        0KiB
 Read depth:             0               Write depth:             0
 IO unplugs:             0               Timer unplugs:           0

I can't see any read on the zvol (both when empty and full). Is that the expected behavior?
Thanks.

@janetcampbell
Author

I can't see any read on the zvol (both when empty and full). Is that the expected behavior?

It’s not what I’ve seen when provoking RMW through different means, but the file I/O path that induces this issue is somewhat different.

Thanks for bearing with me - I’m just now getting my dev environment back together after a lengthy hiatus. I should have some data to post on this next week - I should have waited to spin this thread back up until then, but this was a priority issue for me when I came back to tech.

My next post on this thread will have much more hard data on this issue.

ty

@behlendorf removed the Status: Stale label Feb 4, 2022
@behlendorf reopened this Feb 4, 2022
@DemiMarie

@janetcampbell I am glad you are doing better! I hope there was no permanent damage.

@stale

stale bot commented Mar 18, 2023

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale bot added the Status: Stale label Mar 18, 2023
@DemiMarie

I see no reason to believe this has been fixed.


stale bot commented Mar 17, 2024

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale bot added the Status: Stale label Mar 17, 2024
@ahrens added the Bot: Not Stale label and removed the Status: Stale label Mar 17, 2024