Improve logging of 128KB writes. #9409

Merged: 1 commit merged into openzfs:master on Nov 11, 2019

Conversation

@amotin (Member) commented Oct 4, 2019

Motivation and Context

Before my ZIL space optimization a few years ago, 128KB writes were logged
as two 64KB+ records in two 128KB log blocks. After that change they became
~127KB+/1KB+ in two 128KB log blocks, to free space in the second block
for another record. Unfortunately, in the case of 128KB-only writes, when the
space in the second block remained unused, that change increased write latency
by unbalancing checksum computation and write times between parallel threads.
It also didn't help SLOG space efficiency in that case.

Description

This change introduces a new 68KB log block size, used both for writes below
67KB and for 128KB-sharp writes. Writes of 68-127KB still use one 128KB block
to avoid increasing processing overhead. Writes above 131KB still use full
128KB blocks, since the possible saving there is small. Mixed loads will
likely also fall back to the previous 128KB blocks, since the code uses the
maximum of the last 16 requested block sizes.
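
To make the selection concrete, here is a minimal, self-contained sketch of the bucket logic this change tunes. The table mirrors the last four buckets of the patch (see the diff excerpt further down); the smaller buckets are omitted and the names (buckets, PREV_BLKS, pick_log_blksz) are simplified stand-ins for their zil.c counterparts, so treat this as an illustration rather than the actual code.

#include <stdint.h>
#include <stdio.h>

#define KB        1024ULL
#define PREV_BLKS 16    /* history depth, like ZIL_PREV_BLKS in zil.c */

static const struct {
    uint64_t limit;     /* use this bucket while the needed bytes <= limit */
    uint64_t blksz;     /* log block size to allocate */
} buckets[] = {
    { 64 * KB + 4 * KB,  64 * KB + 4 * KB },  /* small/64KB writes  -> 68KB block */
    { 128 * KB,          128 * KB },          /* < 128KB writes     -> 128KB block */
    { 128 * KB + 4 * KB, 64 * KB + 4 * KB },  /* 128KB-sharp writes -> 68KB block */
    { UINT64_MAX,        128 * KB },          /* > 128KB writes     -> 128KB block */
};

static uint64_t prev_blks[PREV_BLKS];
static int prev_rotor;

/* Pick a log block size for "need" bytes of pending log records. */
static uint64_t
pick_log_blksz(uint64_t need)
{
    uint64_t blksz;
    int i;

    for (i = 0; need > buckets[i].limit; i++)
        continue;
    blksz = buckets[i].blksz;

    /*
     * Remember the pick and return the maximum of the last PREV_BLKS picks,
     * so mixed workloads fall back to the larger 128KB block size.
     */
    prev_blks[prev_rotor] = blksz;
    prev_rotor = (prev_rotor + 1) % PREV_BLKS;
    for (i = 0; i < PREV_BLKS; i++) {
        if (prev_blks[i] > blksz)
            blksz = prev_blks[i];
    }
    return (blksz);
}

int
main(void)
{
    /* A stream of 128KB-sharp writes (~132KB with headers) now gets 68KB blocks. */
    printf("%llu\n", (unsigned long long)pick_log_blksz(131500));
    return (0);
}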

How Has This Been Tested?

It was tested on FreeBSD's native ZFS. Synchronous 128KB writes to different numbers of ZVOLs were generated with FIO, which measured write throughput. Statistics of SLOG write sizes were collected with DTrace.
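
The exact FIO and DTrace invocations are not included in the PR; a rough sketch of an equivalent setup might look like the following, where the zvol path, the SLOG device name (nvd0), and the job parameters are assumptions for illustration only.

# Hypothetical FIO job: synchronous 128KB writes to a zvol; run one such job per zvol under test.
fio --name=sync128k --filename=/dev/zvol/tank/vol0 --ioengine=psync \
    --rw=write --bs=128k --sync=1 --time_based --runtime=60

# Hypothetical DTrace one-liner: size distribution of I/Os issued to the assumed SLOG device.
dtrace -n 'io:::start /args[1]->dev_name == "nvd0"/ { @ = quantize(args[0]->b_bcount); }'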

Types of changes

  • Performance enhancement (non-breaking change which improves efficiency)

Checklist:

Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
@ahrens (Member) commented Oct 4, 2019

Did you consider making a 129K ZIL block (when the large blocks pool feature is enabled)?

@amotin (Member, Author) commented Oct 4, 2019

Did you consider making a 129K ZIL block (when the large blocks pool feature is enabled)?

Yes, I thought about it, but I see several uses of SPA_OLD_MAXBLOCKSIZE in zil.c, so I worry it would break compatibility. I suppose it would require an additional pool feature.

@ahrens (Member) commented Oct 4, 2019

That's probably true. I'm not opposed to this PR, but I would guess you could get a bigger win if there was only 1 log block.

@codecov (bot) commented Oct 4, 2019

Codecov Report

Merging #9409 into master will decrease coverage by 0.13%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #9409      +/-   ##
==========================================
- Coverage   79.09%   78.96%   -0.14%     
==========================================
  Files         410      410              
  Lines      122498   122498              
==========================================
- Hits        96895    96725     -170     
- Misses      25603    25773     +170
Flag       Coverage          Δ
#kernel    79.72% <100%>     (-0.03%) ⬇️
#user      66.42% <100%>     (-0.25%) ⬇️


{ 65536 + 4096, 65536 + 4096 }, /* 64KB writes */
{ 131072, 131072 }, /* < 128KB writes */
{ 131072 +4096, 65536 + 4096 }, /* 128KB writes */
{ UINT64_MAX, SPA_OLD_MAXBLOCKSIZE}, /* > 128KB writes */
A contributor commented on this diff:

We did quite a bit of experimenting here a few years ago. In general, fewer buckets is better, but the previous default buckets also aren't representative of today's workloads.
Given the large size of disks today, especially for slogs, there doesn't seem to be any real space saving to be gained from smaller buckets. Thus the only benefit of small buckets is for mixed-workload datasets in raidz pools with no slog.

@amotin (Member, Author) replied:

Using NVDIMMs for SLOG, we are quite space-constrained at this point. Maybe new Apache Pass will change that, but so far we are not there.

@amotin (Member, Author) commented Oct 7, 2019

I would guess you could get a bigger win if there was only 1 log block.

It may be so for some call overhead, but I am not sure about latency. Plus, on FreeBSD I/Os are still chunked at 128KB, so a 132KB block would turn into 128KB+4KB I/Os, and the win may be smaller.

I've made a patch to remove the SPA_OLD_MAXBLOCKSIZE limitation in zil_parse(), which is the only ZIL block size limitation I have seen so far, but I am not sure how those large blocks could be used efficiently. Always allocating 132KB means incurring the above chunking overhead even when it is not needed. Using 132KB only for ~128KB blocks limits the scope of this change, and I am not sure the win would be sufficient. Tricks like allocating 132KB but trying to limit writes to 128KB sound weird.

@ahrens How do you see large ZIL blocks being used?

@behlendorf (Contributor) left a comment:

I don't have any objections to this change. @ahrens @richardelling is there anything else you'd like to see?

@behlendorf behlendorf added this to FreeBSD Features in OpenZFS 2.0 Nov 1, 2019
@behlendorf (Contributor) commented:

Unless someone has an objection to this change I'll go ahead and merge it next week.

@behlendorf behlendorf added the "Status: Accepted (Ready to integrate: reviewed, tested)" label and removed the "Status: Code Review Needed (Ready for review and testing)" label on Nov 9, 2019
@amotin (Member, Author) commented Nov 11, 2019

I've discussed it with @ahrens during the developer summit. He had no objections.

@behlendorf behlendorf merged commit f15d6a5 into openzfs:master Nov 11, 2019
OpenZFS 2.0 automation moved this from FreeBSD Features to Done Nov 11, 2019
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Dec 26, 2019
Before my ZIL space optimization few years ago 128KB writes were logged
as two 64KB+ records in two 128KB log blocks.  After that change it
became ~127KB+/1KB+ in two 128KB log blocks to free space in the second
block for another record.  Unfortunately in case of 128KB only writes,
when space in the second block remained unused, that change increased
write latency by unbalancing checksum computation and write times
between parallel threads.  It also didn't help with SLOG space
efficiency in that case.

This change introduces new 68KB log block size, used for both writes
below 67KB and 128KB-sharp writes.  Writes of 68-127KB are still using
one 128KB block to not increase processing overhead.  Writes above
131KB are still using full 128KB blocks, since possible saving there
is small.  Mixed loads will likely also fall back to previous 128KB,
since code uses maximum of the last 16 requested block sizes.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
Closes openzfs#9409
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Dec 27, 2019 (same commit message and sign-offs as above)
tonyhutter pushed a commit that referenced this pull request Jan 23, 2020 (same commit message and sign-offs as above)
@amotin amotin deleted the log_128k branch August 24, 2021 20:17
Labels: Status: Accepted (Ready to integrate: reviewed, tested)
Projects: OpenZFS 2.0 (Done)
Linked issues: none
Participants: 4