severe performance regression on virtual disk migration for qcow2 on ZFS with ZFS 2.1.5 #14594

Open
devZer0 opened this issue Mar 7, 2023 · 15 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@devZer0

devZer0 commented Mar 7, 2023

Distribution Name | Proxmox PVE
Distribution Version | 7.3
Kernel Version | 5.15.39-2-pve and later
Architecture | x86_64
OpenZFS Version | 2.1.5

Describe the problem you're observing

On recent Proxmox releases, which ship with ZFS 2.1.5 as part of the kernel package, there is a significant slowdown when moving an empty qcow2 virtual disk file from an HDD-based pool to any other pool.

It seems this issue is related to atime updates.

The problem goes away when setting atime=off, or atime=on together with relatime=on.
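
For reference, a minimal sketch of the workaround, assuming the affected VM images live on the hddpool dataset shown in the iostat output below (adjust the dataset name as needed):

zfs get atime,relatime hddpool     # check the current settings
zfs set atime=off hddpool          # workaround 1: disable atime entirely
# or keep atime but rate-limit the updates:
zfs set atime=on hddpool
zfs set relatime=on hddpool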

Describe how to reproduce the problem

Update an older Proxmox installation running the 5.15.39-1-pve kernel to a recent Proxmox version (pve-no-subscription or pvetest repo), then move a qcow2 virtual disk from an HDD-based pool to another pool (e.g. via `qm disk move`).

Include any warning/errors/backtraces from the system logs

See the discussion/analysis in this thread:

https://forum.proxmox.com/threads/weird-disk-write-i-o-pattern-on-source-disks-when-moving-virtual-disk.123639/post-538583

start of thread:
https://forum.proxmox.com/threads/weird-disk-write-i-o-pattern-on-source-disks-when-moving-virtual-disk.123639/

Not sure what change in ZFS could cause this behaviour, maybe #13338? @rincebrain?

5.15.39-1-pve with zfs 2.1.4:
time qm disk move 100 scsi0  nvmepool
real    0m19.974s

zpool iostat -r hddpool:
 
hddpool       sync_read    sync_write    async_read    async_write      scrub         trim
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512            35      0      0      0      1      0    316      0     10      0      0      0
1K            307      0    144      0     54      0    329      5     41      0      0      0
2K             33      0      0      0      0      0    252     56      2     12      0      0
4K              9      0      0      0      0      2    335     77      0     12      0      0
8K            213      0     12      0      2      5      0    153     12      7      0      0
16K             0      0      0      0      0      5      0     71      0     20      0      0
32K             0      0      0      0      0      2      0      6      0     15      0      0
64K            16      0     32      0      0      0      0      0      0      2      0      0
128K            6      0      0      0      0      0      0      0      0      5      0      0
256K            0      0      0      0      0      0      0      0      0      3      0      0
512K            0      0      0      0      0      0      0      0      0      3      0      0
1M              0      0      0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0      0      0
----------------------------------------------------------------------------------------------
5.15.39-2-pve with zfs 2.1.5:
time qm disk move 100 scsi0  nvmepool
real    1m10.214s

zpool iostat -r hddpool:

hddpool       sync_read    sync_write    async_read    async_write      scrub         trim
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512            33      0      0      0      3      0  13.2K      0      7      0      0      0
1K            324      0  5.94K      0     61      0  15.9K      0    171      1      0      0
2K             47      0      0      0      0      5  10.0K  2.62K     11      9      0      0
4K             16      0      0      0      0      3  22.9K  2.94K      8      8      0      0
8K            216      0     12      0      2      4     20  3.73K      1     13      0      0
16K             2      0      0      0      0      7    238  3.04K      0      8      0      0
32K             0      0      0      0      0      0      0    111      0      8      0      0
64K            16      0     32      0      0      0      0     87      0      0      0      0
128K            1      0      0      0      0      0    217      7      0      2      0      0
256K            0      0      0      0      0      0      0     44      0      2      0      0
512K            0      0      0      0      0      0      0      0      0      4      0      0
1M              0      0      0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0      0      0
----------------------------------------------------------------------------------------------
5.15.39-2-pve with zfs 2.1.5 and atime=on/relatime=on:
time qm disk move 100 scsi0  nvmepool
real    0m3.365s


# zpool iostat -r hddpool

hddpool       sync_read    sync_write    async_read    async_write      scrub         trim
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512            38      0      0      0      0      0    169      0      5      0      0      0
1K             77      0     80      0     64      0    198      7    113      3      0      0
2K             17      0      0      0      0      3    133     13      4     20      0      0
4K             11      0      0      0      0      7    171     46      4      5      0      0
8K            216      0     12      0      2      5      0     52      5      6      0      0
16K             1      0      0      0      0      5      0     33      0      5      0      0
32K             1      0      0      0      0      1      0      0      0     11      0      0
64K            16      0     32      0      0      0      0      0      0      4      0      0
128K            1      0      0      0      0      0      0      0      0      3      0      0
256K            0      0      0      0      0      0      0      0      0      4      0      0
512K            0      0      0      0      0      0      0      0      0      4      0      0
1M              0      0      0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0      0      0
----------------------------------------------------------------------------------------------
@devZer0 devZer0 added the Type: Defect Incorrect behavior (e.g. crash, hang) label Mar 7, 2023
@devZer0
Author

devZer0 commented Mar 7, 2023

With the Linux 6.1 kernel and ZFS 2.1.9 it seems to be even slower than with 2.1.5 (1m27s vs. 3.18s with relatime=on).

@rincebrain
Contributor

Pathology in common with #14512 maybe?

@rincebrain
Contributor

(Replying here because I don't think it would be reasonable to comment on that bug, saying this one seems unrelated, after I already linked it here.)

It might be unrelated; you'd have to see where you're burning your time. But my speculation went something like "disk images often contain large sparse areas" => "this is a known edge case where trying to manipulate sparse areas on things being regularly updated can cause problems, maybe it's causing problems here too".

You could try flipping the tunable zfs_dmu_offset_next_sync to 0 to see if the behavior improves. It's not dangerous; it just makes things behave as though there are no holes whenever a forced TXG sync would have been required to get a correct view of the holes, so you might end up reading/writing a bunch of zeroes instead of jumping over them if something was using SEEK_HOLE/SEEK_DATA.
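
For anyone who wants to test this, a sketch of how the tunable is usually flipped at runtime on Linux (the modprobe.d file is just the conventional place to persist it):

echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync   # takes effect immediately
cat /sys/module/zfs/parameters/zfs_dmu_offset_next_sync        # verify
# to persist across reboots, e.g. in /etc/modprobe.d/zfs.conf:
# options zfs zfs_dmu_offset_next_sync=0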

@devZer0
Author

devZer0 commented Mar 8, 2023

I did strace the qemu-img process, but it did not reveal anything usable besides the fact that I can see a lot of lseek calls, with every seek apparently causing one or more atime updates.
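
For reference, a sketch of how that lseek pattern can be captured; the VM id and image paths are placeholders:

strace -f -e trace=lseek -o /tmp/qemu-img.strace \
    qemu-img convert -p -f qcow2 -O qcow2 \
        /hddpool/images/100/vm-100-disk-0.qcow2 /nvmepool/images/100/vm-100-disk-0.qcow2
grep -c 'SEEK_DATA\|SEEK_HOLE' /tmp/qemu-img.strace    # count of hole-probing seeks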

this is a known edge case where trying to manipulate sparse areas on things being regularly updated can cause problems,

Did you notice that this is not related to WRITE access in any way, but only READ access? It also happens when the virtual machine is powered off, so there is no process writing to the file itself.

Apparently, simply reading the metadata-preallocated qcow2 file causes a massive amount of atime updates, and how massive it is also seems to depend on the layout of the qcow2 file.

When the file is moved back and forth, the problem is gone; apparently "qemu-img convert" does not preserve the layout that "qemu-img create" set up initially.

I found this issue after I copied a virtual machine from an older cluster to a newer one: moving that file with the Proxmox GUI (i.e. qemu-img) from HDD to SSD was then slower than the copy via scp over the network.
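
A sketch of how the two layouts can be compared (file names and the size are examples; `ls -ls` shows the blocks actually allocated on ZFS, `qemu-img map` the allocation as qemu sees it):

qemu-img create -f qcow2 -o preallocation=metadata orig.qcow2 32G
qemu-img convert -f qcow2 -O qcow2 orig.qcow2 converted.qcow2
ls -ls orig.qcow2 converted.qcow2               # allocated blocks vs. apparent size
qemu-img map --output=json orig.qcow2 | head
qemu-img map --output=json converted.qcow2 | head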

@rincebrain
Contributor

Hypothetically, it could be something like, you do something to access the VM image while it's idle (reading, not writing, just to be entirely clear), it dirties the file because of the needed atime update, and consequently you end up with the aforementioned feature triggering on SEEK_HOLE/SEEK_DATA and forcing a txg sync because it notices the file is dirty, nevermind in what way.
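
A minimal sketch of that access pattern, assuming a dataset mounted at /hddpool/test with atime=on (paths and the file size are examples): create a sparse file, dirty its dnode with a read-only access that updates atime, then run a tool that probes holes via SEEK_DATA/SEEK_HOLE.

dd if=/dev/zero of=/hddpool/test/sparse.raw bs=1M seek=10240 count=0   # 10 GiB, fully sparse
cat /hddpool/test/sparse.raw > /dev/null       # read-only, but dirties the dnode via the atime update
time qemu-img convert -f raw -O raw /hddpool/test/sparse.raw /nvmepool/test/sparse.raw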

@devZer0
Author

devZer0 commented Mar 8, 2023

You could try flipping the tunable zfs_dmu_offset_next_sync to 0 to see if the behavior improves

It does! It restores performance to pre-2.1.5 behaviour! (20s instead of 1m20s)

@devZer0
Author

devZer0 commented Mar 8, 2023

Hypothetically, it could be something like, you do something to access the VM image while it's idle (reading, not writing, just to be entirely clear), it dirties the file because of the needed atime update, and consequently you end up with the aforementioned feature triggering on SEEK_HOLE/SEEK_DATA and forcing a txg sync because it notices the file is dirty, nevermind in what way.

After looking at bb8526e, I think it's exactly as you describe!

I have no clue how this can be resolved in a sane way.

Is there a way to check whether dirtying the file was "just" an atime update (which is not worth forcing a txg sync)?

Reading through https://man7.org/linux/man-pages/man2/open.2.html, I think O_NOATIME is not an option for open() in qemu-img.
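
One way to check from the outside whether qemu-img ever asks for O_NOATIME is to trace its open calls (a sketch; the image path is a placeholder):

strace -f -e trace=open,openat -o /tmp/open.trace qemu-img info /hddpool/images/100/vm-100-disk-0.qcow2
grep -c O_NOATIME /tmp/open.trace    # 0 means it never requests O_NOATIME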

@devZer0
Author

devZer0 commented Mar 8, 2023

@ryao
Contributor

ryao commented Mar 8, 2023

9f69435 is the commit in 2.1.x that changed our behavior. It was first in 2.1.5.

@behlendorf
Contributor

There's actually an open PR to optimize this, #13368. There's just some outstanding feedback which needs to be addressed before it can be merged.

@rincebrain
Contributor

9f69435 is the commit in 2.1.x that changed our behavior. It was first in 2.1.5.

It goes back further and more wind-y than that, I think.

First, there was #11900, where you could get inconsistent hole data if you looked too fast, as I recall. Oopsie. So the logic was modified, though the tunable predates that, and we got #12724. But that had a flaw, so we got #12745, and #12746, because the argument was that, without that, we effectively wouldn't report holes in most cases, and the functionality was likely to bitrot and break strangely one day.

And now here we are: the data is no longer inconsistent, but the performance is pathological.

I'm curious to see whether #13368 will mean we can avoid this penalty in most cases in practice, or not.

@devZer0
Author

devZer0 commented Mar 14, 2023

Apparently the patch from Matthew Ahrens has been approved: #13368 (review)

behlendorf pushed a commit that referenced this issue Mar 14, 2023
`lseek(SEEK_DATA | SEEK_HOLE)` are only accurate when the on-disk blocks
reflect all writes, i.e. when there are no dirty data blocks.  To ensure
this, if the target dnode is dirty, they wait for the open txg to be
synced, so we can call them "stabilizing operations".  If they cause
txg_wait_synced often, it can be detrimental to performance.

Typically, a group of files are all modified, and then SEEK_DATA/HOLE
are performed on them.  In this case, the first SEEK does a
txg_wait_synced(), and subsequent SEEKs don't need to wait, so
performance is good.

However, if a workload involves an interleaved metadata modification,
the subsequent SEEK may do a txg_wait_synced() unnecessarily.  For
example, if we do a `read()` syscall to each file before we do its SEEK.
This applies even with `relatime=on`, when the `read()` is the first
read after the last write.  The txg_wait_synced() is unnecessary because
the SEEK operations only care that the structure of the tree of indirect
and data blocks is up to date on disk.  They don't care about metadata
like the contents of the bonus or spill blocks.  (They also don't care
if an existing data block is modified, but this would be more involved
to filter out.)

This commit changes the behavior of SEEK_DATA/HOLE operations such that
they do not call txg_wait_synced() if there is only a pending change to
the bonus or spill block.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by:  Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #13368 
Issue #14594 
Issue #14512 
Issue #14009
behlendorf pushed a commit to behlendorf/zfs that referenced this issue Mar 14, 2023, with the same commit message as above.
behlendorf pushed a commit that referenced this issue Mar 15, 2023, with the same commit message as above.
lundman pushed four commits to openzfsonwindows/openzfs that referenced this issue Mar 17, 2023, with the same commit message as above.
@devZer0
Author

devZer0 commented Mar 18, 2023

Just for reference: #14641

@ryao
Contributor

ryao commented Mar 18, 2023

Apparently the patch from Matthew Ahrens has been approved: #13368 (review)

It has been merged. Is the problem still here?

@ssergiienko

Given that the fix (#13368) was reverted in #14761, the question still seems relevant: is the degradation still there?
Or did it go away with some other optimizations? Or is there another ticket that is going to solve this problem?
