
single-chunk scatter ABDs can be treated as linear #8580

Merged: 1 commit, Jun 11, 2019

Conversation

ahrens
Member

@ahrens ahrens commented Apr 4, 2019

Motivation and Context

Scatter ABDs are allocated from a number of pages. In contrast to
linear ABDs, these pages are disjoint in the kernel's virtual address
space, so they can't be accessed as a contiguous buffer. Therefore
routines that need a linear buffer (e.g. abd_borrow_buf() and friends)
must allocate a separate linear buffer (with zio_buf_alloc()) and copy
the contents of the pages to/from that linear buffer. This can have a
measurable performance overhead on some workloads.
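The cost described above can be illustrated with a small user-space model. This is a sketch only, not ZFS code: the toy_* names and the struct layout are invented for illustration, and kernel page mappings are stood in for by ordinary heap/stack buffers.

```c
/*
 * User-space sketch of why borrowing a linear view of a scatter ABD
 * costs an allocation and a copy.  Illustrative model, not ZFS code;
 * all toy_* names are invented.
 */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

typedef struct toy_abd {
	int is_linear;		/* one contiguous buffer? */
	size_t size;
	char *linear_buf;	/* valid when is_linear */
	char **pages;		/* disjoint pages when !is_linear */
	size_t npages;
} toy_abd_t;

/*
 * Analogue of abd_borrow_buf_copy(): a linear ABD can hand out its
 * buffer directly, but a scatter ABD must allocate a bounce buffer
 * and copy every page into it.
 */
char *
toy_borrow_buf_copy(const toy_abd_t *abd)
{
	if (abd->is_linear)
		return (abd->linear_buf);	/* zero cost */

	char *buf = malloc(abd->size);		/* extra allocation */
	for (size_t i = 0; i < abd->npages; i++) {
		size_t off = i * TOY_PAGE_SIZE;
		size_t n = abd->size - off;
		if (n > TOY_PAGE_SIZE)
			n = TOY_PAGE_SIZE;
		memcpy(buf + off, abd->pages[i], n);	/* extra copy */
	}
	return (buf);
}

/* Analogue of abd_return_buf_copy(): the scatter path frees the bounce buffer. */
void
toy_return_buf(const toy_abd_t *abd, char *buf)
{
	if (!abd->is_linear)
		free(buf);
}
```

The scatter branch here corresponds to the allocation and memcpy() that show up in the flame graphs below.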

87c25d5 ("abd_alloc should use scatter for >1K allocations") increased
the use of scatter ABDs, specifically switching 1.5K through 4K
(inclusive) buffers from linear to scatter. For workloads that access
blocks whose compressed sizes are in this range, that commit introduced
an additional copy into the read code path. For example, performance of
the sequential_reads_arc_cached tests in the test suite dropped by
around 5% (these tests read 8K-logical blocks, compressed to 3K, which
are cached in the ARC).

Description

This commit treats single-chunk scattered buffers as linear buffers,
because they are contiguous in the kernel's virtual address space.

All single-page (4K) ABDs can be represented this way. Some multi-page
ABDs can also be represented this way, if we were able to allocate a
single "chunk" (a higher-order "page", which represents a power-of-2
series of physically-contiguous pages). This is often the case for
2-page (8K) ABDs.

Representing a single-entry scatter ABD as a linear ABD has the
performance advantage of avoiding the copy (and allocation) in
abd_borrow_buf_copy() / abd_return_buf_copy(). A performance increase
of around 5% has been observed for ARC-cached reads (of small blocks
which can take advantage of this), fixing the regression introduced by
87c25d5.

Note that this optimization is only possible because all physical memory
is always mapped into the kernel's address space. This is not the case
for HIGHMEM pages, so the optimization cannot be made on 32-bit
systems.
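The idea can be sketched in a small user-space model (an illustrative assumption-laden sketch, not the actual ZFS implementation; the toy_* names are invented, and toy_is_single_chunk() only mimics the role of the real abd_is_linear_page() predicate):

```c
/*
 * User-space sketch of the single-chunk optimization.  Illustrative
 * model only, not ZFS code; all toy_* names are invented.  A scatter
 * buffer backed by one physically-contiguous chunk is already
 * contiguous in the (modeled) address space, so it can be handed out
 * like a linear buffer, skipping the bounce-buffer path entirely.
 */
#include <assert.h>
#include <stddef.h>

typedef struct toy_scatter {
	size_t nchunks;		/* contiguous chunks backing the buffer */
	char *chunk[4];		/* start of each chunk's mapping */
	size_t size;
} toy_scatter_t;

/* Mirrors the idea behind abd_is_linear_page(): one chunk == linear. */
int
toy_is_single_chunk(const toy_scatter_t *s)
{
	return (s->nchunks == 1);
}

/*
 * Borrow a contiguous view.  With one chunk there is no allocation
 * and no copy; multiple chunks would still need the copying path
 * (elided here, modeled as returning NULL).
 */
char *
toy_borrow(toy_scatter_t *s)
{
	if (toy_is_single_chunk(s))
		return (s->chunk[0]);	/* handed out directly */
	return (NULL);			/* multi-chunk: allocate-and-copy path */
}
```

In this model, a single 8K chunk (two physically-contiguous pages) takes the direct path, matching the 2-page case described above.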

How Has This Been Tested?

ZFS performance test suite.

The time spent in the extra memcpy() can be seen in CPU flame graphs. Note that although >5% of CPU is in memcpy(), the total time in zio_decompress_data() increases by <5%, and overall performance (MB/s) is decreased by around 5%. This is because some of the time spent in memcpy() is spent bringing the compressed data into cache, which improves performance of lz4_decompress_zfs(). When the cache is not "pre-warmed" by the memcpy(), lz4_decompress_zfs() takes a little longer.

Flame graphs while running sequential_reads_arc_cached.ksh.fio.sync.128k-ios.64-threads.1-filesystems:

Before 87c25d5: (flame graph image)

After 87c25d5: (flame graph image)

With this commit: (flame graph image)

Thanks to @tonynguien for identifying the performance regression, running the performance tests, and generating the above flame graphs.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)


@ahrens added the "Status: Code Review Needed" and "Type: Performance" labels on Apr 4, 2019
@behlendorf
Contributor

FYI, the following IMPLY was hit by several of the test builders.

http://build.zfsonlinux.org/builders/Ubuntu%2018.04%20x86_64%20%28TEST%29/builds/3353/steps/shell_9/logs/console

[ 8514.850326] (abd_is_linear(zio->io_abd)) implies (abd_is_linear(data))
[ 8514.850328] PANIC at zio.c:338:zio_push_transform()
[ 8514.859069] Showing stack for process 30068
[ 8514.859072] CPU: 0 PID: 30068 Comm: cat Tainted: P           OE    4.15.0-1007-aws #7-Ubuntu
[ 8514.859074] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[ 8514.859075] Call Trace:
[ 8514.859083]  dump_stack+0x63/0x8b
[ 8514.859095]  spl_dumpstack+0x29/0x30 [spl]
[ 8514.859099]  spl_panic+0xc8/0x110 [spl]
[ 8514.859193]  zio_push_transform+0xcf/0xe0 [zfs]
[ 8514.859303]  zio_read_bp_init+0x232/0x2a0 [zfs]
[ 8514.859356]  zio_nowait+0x12a/0x2d0 [zfs]
[ 8514.859389]  arc_read+0xa85/0x16a0 [zfs]
[ 8514.859456]  dbuf_read_impl+0x56f/0xa90 [zfs]
[ 8514.859542]  dbuf_read+0xf7/0x730 [zfs]
[ 8514.859613]  dbuf_hold_impl_arg+0x9b8/0xad0 [zfs]
[ 8514.859645]  dbuf_hold_impl+0x23/0x40 [zfs]
[ 8514.859677]  dbuf_hold_level+0x32/0x60 [zfs]
[ 8514.859709]  dbuf_hold+0x16/0x20 [zfs]
[ 8514.859754]  dmu_buf_hold_array_by_dnode+0xf2/0x5a0 [zfs]
[ 8514.859799]  dmu_read_uio_dnode+0x4b/0x130 [zfs]
[ 8514.859834]  dmu_read_uio_dbuf+0x49/0x70 [zfs]
[ 8514.859890]  zfs_read+0x10e/0x300 [zfs]
[ 8514.859944]  zpl_read_common_iovec+0x8d/0xe0 [zfs]
[ 8514.859997]  zpl_iter_read_common+0x82/0xb0 [zfs]
[ 8514.860049]  zpl_iter_read+0x53/0x80 [zfs]
[ 8514.860052]  new_sync_read+0xe4/0x130
[ 8514.860054]  __vfs_read+0x29/0x40
[ 8514.860056]  vfs_read+0x8e/0x130
[ 8514.860057]  SyS_read+0x55/0xc0
[ 8514.860060]  do_syscall_64+0x73/0x130
[ 8514.860063]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

@tuxoko
Contributor

tuxoko commented Apr 6, 2019

I think that IMPLY might not be needed anymore.
Everything below the ZIO stack seems to be type-agnostic now.
It used to be that the ZIL buffer needed to be linear because the checksum code assumed it, but that's no longer the case.

However, if we want to keep it, it should be changed to:

	IMPLY(abd_is_linear(zio->io_abd) && !abd_is_linear_page(zio->io_abd),
	    abd_is_linear(data))

@ahrens
Member Author

ahrens commented Apr 6, 2019

@behlendorf @tuxoko Thanks for pointing that out. I agree that nothing actually cares about that now, so we can remove the assertion. I looked for other callers of abd_is_linear() outside of abd.c and found another assertion that I think we should remove, in arc_alloc_compressed_buf():

	if (!arc_buf_is_shared(buf)) {
		/*
		 * To ensure that the hdr has the correct data in it if we call
		 * arc_untransform() on this buf before it's been written to
		 * disk, it's easiest if we just set up sharing between the
		 * buf and the hdr.
		 */
		ASSERT(!abd_is_linear(hdr->b_l1hdr.b_pabd));
		arc_hdr_free_abd(hdr, B_FALSE);
		arc_share_buf(hdr, buf);
	}

I think that the b_pabd could be linear_page, in which case we won't share it. The assertion isn't needed - we can still free b_pabd and then point it to the buf's (linear) abd.

I also checked all calls to abd_to_buf() and they seemed safe (e.g. they weren't assuming that io_abd was linear).

@behlendorf
Contributor

This PR appears to introduce a crash for the older 2.6.32 CentOS 6 kernel. Unfortunately, the CI fails in such a way that we're not able to get the console logs from the failure.

@behlendorf added the "Status: Revision Needed" label and removed the "Status: Post 0.8.0" and "Status: Code Review Needed" labels on May 23, 2019
@behlendorf
Contributor

@ahrens would you mind rebasing this? I'll do a little investigation to see what's going on for CentOS 6.

@ahrens
Member Author

ahrens commented Jun 7, 2019

@behlendorf Done. Thanks a lot for taking a look at the CentOS 6 issue!

@ahrens
Member Author

ahrens commented Jun 7, 2019

In the Ubuntu 18.04 test (http://build.zfsonlinux.org/builders/Ubuntu%2018.04%20x86_64%20%28TEST%29/builds/3904/steps/shell_9/logs/summary), zpool_trim_partial is failing. It looks like the logic that's supposed to wait for the first trim to complete didn't work right:

05:29:05.22 SUCCESS: set_tunable64 zfs_trim_metaslab_skip 0
05:29:05.30 SUCCESS: zpool sync
05:29:08.87 SUCCESS: test 3387244544 -gt 1073741824
05:29:08.90 cannot trim '/mnt/testdir/largefile': currently trimming
05:29:08.90 ERROR: zpool trim testpool exited 255

In the Debian 9 test, mmp_on_uberblocks is failing, the log (http://build.zfsonlinux.org/builders/Debian%209%20x86_64%20%28TEST%29/builds/3853/steps/shell_9/logs/log) shows:

06:51:26.31 ERROR: mmp_seq did not increase by 7; before 2 after 7

I don't see how these errors are related to the changes in this PR.

@behlendorf
Contributor

I don't see how these errors are related to the changes in this PR.

Right, both of these failures appear on the list of tests we still see occasionally fail and are unrelated to this change. Regardless, I've submitted those builds again.

@behlendorf added the "Status: Code Review Needed" label and removed the "Status: Revision Needed" label on Jun 7, 2019
module/zfs/abd.c — review thread (outdated, resolved)
Scatter ABD's are allocated from a number of pages.  In contrast to
linear ABD's, these pages are disjoint in the kernel's virtual address
space, so they can't be accessed as a contiguous buffer.  Therefore
routines that need a linear buffer (e.g. abd_borrow_buf() and friends)
must allocate a separate linear buffer (with zio_buf_alloc()), and copy
the contents of the pages to/from the linear buffer.  This can have a
measurable performance overhead on some workloads.

openzfs@87c25d5
("abd_alloc should use scatter for >1K allocations") increased the use
of scatter ABD's, specifically switching 1.5K through 4K (inclusive)
buffers from linear to scatter.  For workloads that access blocks whose
compressed sizes are in this range, that commit introduced an additional
copy into the read code path.  For example, the
sequential_reads_arc_cached tests in the test suite were reduced by
around 5% (this is doing reads of 8K-logical blocks, compressed to 3K,
which are cached in the ARC).

This commit treats single-chunk scattered buffers as linear buffers,
because they are contiguous in the kernel's virtual address space.

All single-page (4K) ABD's can be represented this way.  Some multi-page
ABD's can also be represented this way, if we were able to allocate a
single "chunk" (higher-order "page" which represents a power-of-2 series
of physically-contiguous pages).  This is often the case for 2-page (8K)
ABD's.

Representing a single-entry scatter ABD as a linear ABD has the
performance advantage of avoiding the copy (and allocation) in
abd_borrow_buf_copy / abd_return_buf_copy.  A performance increase of
around 5% has been observed for ARC-cached reads (of small blocks which
can take advantage of this), fixing the regression introduced by
87c25d5.

Note that this optimization is only possible because all physical memory
is always mapped into the kernel's address space.  This is not the case
for HIGHMEM pages, so the optimization can not be made on 32-bit
systems.

Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
@behlendorf removed the "Status: Code Review Needed" label and added the "Status: Accepted" label on Jun 10, 2019
@codecov

codecov bot commented Jun 11, 2019

Codecov Report

Merging #8580 into master will increase coverage by 0.13%.
The diff coverage is 97.29%.


@@            Coverage Diff             @@
##           master    #8580      +/-   ##
==========================================
+ Coverage   78.61%   78.74%   +0.13%     
==========================================
  Files         382      382              
  Lines      117781   117798      +17     
==========================================
+ Hits        92592    92761     +169     
+ Misses      25189    25037     -152
Flag      Coverage  Δ
#kernel   79.33%    <100%>    (-0.01%) ⬇️
#user     67.31%    <70.83%>  (+0.35%) ⬆️


@behlendorf behlendorf merged commit 5662fd5 into openzfs:master Jun 11, 2019
@ahrens ahrens deleted the abd branch June 11, 2019 16:15
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Dec 24, 2019
tonyhutter pushed a commit that referenced this pull request Jan 23, 2020
allanjude pushed a commit to KlaraSystems/zfs that referenced this pull request Apr 28, 2020