Improve logging of 128KB writes. #9409
Conversation
Before my ZIL space optimization a few years ago, 128KB writes were logged as two 64KB+ records in two 128KB log blocks. After that change they became ~127KB+/~1KB+ in two 128KB log blocks, to free space in the second block for another record. Unfortunately, in the case of 128KB-only writes, when the space in the second block remained unused, that change increased write latency by unbalancing checksum computation and write times between parallel threads. It also didn't help SLOG space efficiency in that case. This change introduces a new 68KB log block size, used both for writes below 67KB and for 128KB-sharp writes. Writes of 68-127KB still use one 128KB block, to avoid increasing processing overhead. Writes above 131KB still use full 128KB blocks, since the possible saving there is small. Mixed loads will likely also fall back to the previous 128KB blocks, since the code uses the maximum of the last 16 requested block sizes. Signed-off-by: Alexander Motin <mav@FreeBSD.org>
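To make the selection logic easier to follow, here is a minimal, self-contained sketch of the idea described above. It is an illustration only, not the actual zil.c code: the names (`buckets`, `pick_blksz`, `next_log_block_size`, `ZIL_PREV_BLKS`) are made up for this sketch and the smaller buckets are omitted; only the size constants mirror the change.

```c
/*
 * Simplified sketch of the described block-size selection; NOT the real
 * OpenZFS code. Bucket limits include ~4KB of headroom for headers.
 */
#include <stdint.h>
#include <stdio.h>

#define ZIL_PREV_BLKS 16	/* history depth used for smoothing */

static const struct {
	uint64_t limit;		/* largest request this bucket serves */
	uint64_t blksz;		/* log block size allocated for it */
} buckets[] = {
	{ 65536 + 4096,		65536 + 4096 },	/* <= ~64KB writes -> 68KB block */
	{ 131072,		131072 },	/* < 128KB writes  -> 128KB block */
	{ 131072 + 4096,	65536 + 4096 },	/* 128KB-sharp     -> 68KB blocks */
	{ UINT64_MAX,		131072 },	/* > 128KB writes  -> 128KB blocks */
};

/* Map one request to a bucket's block size. */
static uint64_t
pick_blksz(uint64_t need)
{
	int i = 0;

	while (need > buckets[i].limit)
		i++;
	return (buckets[i].blksz);
}

/*
 * Smooth over the last ZIL_PREV_BLKS picks: mixed loads take the maximum
 * of the recent block sizes, so they fall back to 128KB blocks.
 */
static uint64_t prev[ZIL_PREV_BLKS];
static int prev_idx;

static uint64_t
next_log_block_size(uint64_t need)
{
	uint64_t blksz = pick_blksz(need);

	prev[prev_idx] = blksz;
	prev_idx = (prev_idx + 1) % ZIL_PREV_BLKS;

	for (int i = 0; i < ZIL_PREV_BLKS; i++) {
		if (prev[i] > blksz)
			blksz = prev[i];
	}
	return (blksz);
}

int
main(void)
{
	/* A pure stream of 128KB-sharp writes settles on 68KB blocks... */
	for (int i = 0; i < 20; i++)
		(void) next_log_block_size(131072 + 4096);
	printf("128KB-only load: %llu\n",
	    (unsigned long long)next_log_block_size(131072 + 4096));

	/* ...while a mixed load falls back to 128KB blocks. */
	printf("mixed load:      %llu\n",
	    (unsigned long long)next_log_block_size(131072));
	return (0);
}
```

Running the sketch shows a steady 128KB-only load settling on 68KB (69632-byte) log blocks, while the mixed request immediately returns to 128KB, which is the behavior described above.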
Did you consider making a 129K ZIL block (when the large blocks pool feature is enabled)?

Yes, I thought about it, but I see several uses of SPA_OLD_MAXBLOCKSIZE in zil.c, so I worry it would break compatibility. I suppose it would require an additional feature.
That's probably true. I'm not opposed to this PR, but I would guess you could get a bigger win if there was only 1 log block. |
Codecov Report
```
@@            Coverage Diff             @@
##           master    #9409      +/-   ##
==========================================
- Coverage   79.09%   78.96%   -0.14%
==========================================
  Files         410      410
  Lines      122498   122498
==========================================
- Hits        96895    96725     -170
- Misses      25603    25773     +170
```
Continue to review full report at Codecov.
```c
	{ 65536 + 4096,		65536 + 4096 },		/* 64KB writes */
	{ 131072,		131072 },		/* < 128KB writes */
	{ 131072 + 4096,	65536 + 4096 },		/* 128KB writes */
	{ UINT64_MAX,		SPA_OLD_MAXBLOCKSIZE },	/* > 128KB writes */
```
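For readers skimming the diff: the first column is the largest record size a bucket serves (payload plus roughly 4KB of headroom for headers) and the second is the log block size allocated for it. The `{ 131072 + 4096, 65536 + 4096 }` row is the new case added here: a 128KB-sharp write is split across two 68KB (64KB + 4KB) log blocks rather than spilling from one 128KB block into a second, mostly empty one.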
We did quite a bit of experimenting here a few years ago. In general, fewer buckets is better. But also the previous default buckets aren't representative of today's workloads.
Given the large size of disks today, especially for slogs, there doesn't seem to be any real space savings to be gained by having smaller buckets. Thus the only benefit for small buckets is if you have mixed workload datasets in raidz pools with no slog.
Using NVDIMMs for SLOG, we are quite space-constrained at this point. Maybe the new Apache Pass will change that, but so far we are not there.

It may be so for some call overhead, but I am not sure about latency. Plus, on FreeBSD I/Os are still chunked at 128KB, so 132KB at the end would turn into 128KB+4KB I/Os, so the win may be smaller. I've made a patch to remove the SPA_OLD_MAXBLOCKSIZE limitation in zil_parse(), which is the only ZIL block size limitation I've seen so far, but I am not sure how those large blocks could be used efficiently. Always allocating 132KB means incurring the chunking overhead above even when it is not needed. Using 132KB only for ~128KB blocks limits the scope of this change, and I am not sure the win would be sufficient. Tricks like allocating 132KB but trying to limit writes to 128KB sound weird. @ahrens how do you see large ZIL blocks being used?
I don't have any objections to this change. @ahrens @richardelling is there anything else you'd like to see?
Unless someone has an objection to this change, I'll go ahead and merge it next week.

I discussed it with @ahrens during the developer summit. He had no objections.
Motivation and Context
Before my ZIL space optimization a few years ago, 128KB writes were logged
as two 64KB+ records in two 128KB log blocks. After that change they became
~127KB+/~1KB+ in two 128KB log blocks, to free space in the second block
for another record. Unfortunately, in the case of 128KB-only writes, when the
space in the second block remained unused, that change increased write latency
by unbalancing checksum computation and write times between parallel threads.
It also didn't help SLOG space efficiency in that case.
Description
This change introduces a new 68KB log block size, used both for writes below
67KB and for 128KB-sharp writes. Writes of 68-127KB still use one 128KB
block, to avoid increasing processing overhead. Writes above 131KB still
use full 128KB blocks, since the possible saving there is small. Mixed loads
will likely also fall back to the previous 128KB blocks, since the code uses
the maximum of the last 16 requested block sizes.
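As a rough back-of-the-envelope illustration (header and padding sizes approximated), a stream of 128KB-sharp synchronous writes is now logged as two roughly equal ~64KB records in two 68KB blocks, about 136KB of log space per write, with checksum and write work split evenly between the two blocks. Previously the split was ~127KB/~1KB across two 128KB blocks, which left that work unbalanced and, with nothing else queued to fill the second block, most of its space idle.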
How Has This Been Tested?
It was tested on FreeBSD's native ZFS. Synchronous 128KB writes to different numbers of ZVOLs were generated with FIO, which measured write throughput. Statistics of SLOG write sizes were collected with DTrace.
Types of changes
Checklist:
Signed-off-by.