8338967: Improve performance for MemorySegment::fill #20712
minborg wants to merge 15 commits into openjdk:master from
Conversation
👋 Welcome back pminborg! A progress list of the required criteria for merging this PR into
|
@minborg This change now passes all automated pre-integration checks.
ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.
After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.
At the time when this comment was updated there had been 9 new commits pushed to the
Please see this link for an up-to-date comparison between the source branch of this pull request and the
➡️ To integrate this PR with the above commit message to the
Webrevs
// Use the old switch statement syntax to improve startup time
switch ((int) length) {
    case 0 : checkReadOnly(false); checkValidState(); break; // Explicit tests
    case 1 : set(JAVA_BYTE, 0, value); break;
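To make the trade-off being reviewed here concrete, below is a minimal, self-contained sketch of such a switch "rake" operating on a plain byte[]. The class and method names are invented, and Arrays.fill stands in for the native bulk path the PR actually delegates to; it is an illustration, not the PR's code.

```java
import java.util.Arrays;

// Hypothetical standalone analogue of the small-size switch discussed above.
final class SmallFill {

    // Handle lengths 0..7 with an explicit fall-through switch;
    // delegate anything larger to the bulk operation.
    static void fill(byte[] a, byte value) {
        switch (a.length) {
            case 7: a[6] = value; // fall through, writing one byte per case
            case 6: a[5] = value;
            case 5: a[4] = value;
            case 4: a[3] = value;
            case 3: a[2] = value;
            case 2: a[1] = value;
            case 1: a[0] = value;
            case 0: break;
            default: Arrays.fill(a, value); // stand-in for the native path
        }
    }
}
```

For a constant small length, the JIT can fold the switch down to a handful of stores; for unpredictable lengths, the switch itself becomes the branch-misprediction hazard discussed in this thread.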
Beware of using a switch: if this code is too big to be inlined (or we're unlucky), it will suffer branch mispredictions when the different "small fills" are unstable/unpredictable.
A test that feeds different fill sizes on each iteration, while counting branch misses, will reveal whether the improvement is worthwhile even in such cases.
It is true that this is a compromise where we give up inline space and code-cache space, and introduce added complexity, in exchange for the prospect of better small-size performance. Depending on the workload, this may or may not pay off. In the (presumably common) case where we allocate/fill small segments of constant sizes, this is likely a win. Writing a dynamic performance test sounds like a good idea.
Here is a benchmark that fills segments of various random sizes:
import java.lang.foreign.MemorySegment;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.stream.IntStream;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 5, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@State(Scope.Thread)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(value = 3)
public class TestFill {

    private static final int SIZE = 16;
    private static final int[] INDICES = new Random(42).ints(0, 8)
            .limit(SIZE)
            .toArray();

    private MemorySegment[] segments;

    @Setup
    public void setup() {
        segments = IntStream.of(INDICES)
                .mapToObj(i -> MemorySegment.ofArray(new byte[i]))
                .toArray(MemorySegment[]::new);
    }

    @Benchmark
    public void heap_segment_fill() {
        for (int i = 0; i < SIZE; i++) {
            segments[i].fill((byte) 0);
        }
    }
}
This produces the following on my Mac M1:
Benchmark                   Mode  Cnt   Score   Error  Units
TestFill.heap_segment_fill  avgt   30  59.054 ± 3.723  ns/op
On average, a fill takes 59/16 ≈ 3.7 ns (including loop overhead).
A test with the same size for every benchmark looks like this on my machine:
Benchmark                   (ELEM_SIZE)  Mode  Cnt  Score   Error  Units
TestFill.heap_segment_fill            0  avgt   30  1.112 ± 0.027  ns/op
TestFill.heap_segment_fill            1  avgt   30  1.602 ± 0.060  ns/op
TestFill.heap_segment_fill            2  avgt   30  1.583 ± 0.004  ns/op
TestFill.heap_segment_fill            3  avgt   30  1.909 ± 0.055  ns/op
TestFill.heap_segment_fill            4  avgt   30  1.605 ± 0.059  ns/op
TestFill.heap_segment_fill            5  avgt   30  1.900 ± 0.064  ns/op
TestFill.heap_segment_fill            6  avgt   30  1.891 ± 0.038  ns/op
TestFill.heap_segment_fill            7  avgt   30  2.237 ± 0.091  ns/op
As discussed offline, can't we use a stable array of functions or something like that which can be populated lazily? That way you can access the function you want in a single array access, and we could put all these helper methods somewhere else.
Unfortunately, a stable array of functions/MethodHandles didn't work from a performance perspective.
Here is a benchmark that fills segments of various random sizes:
Without proper branch-miss perf counters, it is difficult to say whether it is actually upsetting the Apple M-series branch predictor...
For my Ryzen, this is the test that upsets branch prediction (which is fairly good on AMD). Clearly, not inlining fill here is a trick to keep MemorySegment::fill inlined while still making the branch-predictor targets "stable" for our purposes, but it translates into a more costly call dispatch - meaning that, depending on the CPU's cost of a branch mispredict (and of flushing the pipeline), it could still be fine (as these numbers show):
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.CompilerControl;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;
import java.lang.foreign.MemorySegment;
import java.util.Random;
import java.util.concurrent.TimeUnit;
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 5, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@State(Scope.Thread)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(value = 3)
public class TestFill {

    @Param({"false", "true"})
    private boolean shuffle;

    @Param({"1024", "128000"})
    private int samples;

    private MemorySegment[] segments;
    private byte[] segmentSequence;

    @Setup
    public void setup() {
        segments = new MemorySegment[8];
        // still allocates 8 different arrays
        for (int i = 0; i < 8; i++) {
            // we always pay most of the cost here, for fun
            byte[] a = shuffle ? new byte[i + 1] : new byte[8];
            segments[i] = MemorySegment.ofArray(a);
        }
        segmentSequence = new byte[samples];
        var rnd = new Random(42);
        for (int i = 0; i < samples; i++) {
            // if shuffle == false, we always fall into the "worst" case of populating 8 bytes
            segmentSequence[i] = (byte) rnd.nextInt(0, 8);
        }
    }

    @Benchmark
    public void heap_segment_fill() {
        var segments = this.segments;
        for (int nextIndex : segmentSequence) {
            fill(segments[nextIndex]);
        }
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public void fill(MemorySegment segment) {
        segment.fill((byte) 0);
    }
}
With
# JMH version: 1.34
# VM version: JDK 21, Java HotSpot(TM) 64-Bit Server VM, 21+35-LTS-2513
I got the results below. This means that, even though this path is not that optimized on JDK 21, the benchmark still upsets the branch predictor enough to hurt badly, as the perf counters show:
Benchmark (samples) (shuffle) Mode Cnt Score Error Units
TestFill.heap_segment_fill 1024 false avgt 30 10296.595 ± 19.694 ns/op
TestFill.heap_segment_fill:CPI 1024 false avgt 3 0.200 ± 0.006 clks/insn
TestFill.heap_segment_fill:IPC 1024 false avgt 3 5.006 ± 0.152 insns/clk
TestFill.heap_segment_fill:L1-dcache-load-misses 1024 false avgt 3 7.839 ± 35.541 #/op
TestFill.heap_segment_fill:L1-dcache-loads 1024 false avgt 3 90908.364 ± 19714.476 #/op
TestFill.heap_segment_fill:L1-icache-load-misses 1024 false avgt 3 0.458 ± 1.347 #/op
TestFill.heap_segment_fill:L1-icache-loads 1024 false avgt 3 70.000 ± 287.459 #/op
TestFill.heap_segment_fill:branch-misses 1024 false avgt 3 8.666 ± 10.013 #/op
TestFill.heap_segment_fill:branches 1024 false avgt 3 49674.054 ± 9931.580 #/op
TestFill.heap_segment_fill:cycles 1024 false avgt 3 46501.496 ± 8694.782 #/op
TestFill.heap_segment_fill:dTLB-load-misses 1024 false avgt 3 0.186 ± 0.549 #/op
TestFill.heap_segment_fill:dTLB-loads 1024 false avgt 3 1.426 ± 4.003 #/op
TestFill.heap_segment_fill:iTLB-load-misses 1024 false avgt 3 0.126 ± 0.405 #/op
TestFill.heap_segment_fill:iTLB-loads 1024 false avgt 3 0.249 ± 0.869 #/op
TestFill.heap_segment_fill:instructions 1024 false avgt 3 232778.290 ± 47179.208 #/op
TestFill.heap_segment_fill:stalled-cycles-frontend 1024 false avgt 3 257.566 ± 778.186 #/op
TestFill.heap_segment_fill 1024 true avgt 30 11003.331 ± 70.467 ns/op
TestFill.heap_segment_fill:CPI 1024 true avgt 3 0.208 ± 0.047 clks/insn
TestFill.heap_segment_fill:IPC 1024 true avgt 3 4.813 ± 1.077 insns/clk
TestFill.heap_segment_fill:L1-dcache-load-misses 1024 true avgt 3 8.734 ± 1.782 #/op
TestFill.heap_segment_fill:L1-dcache-loads 1024 true avgt 3 94231.271 ± 4742.906 #/op
TestFill.heap_segment_fill:L1-icache-load-misses 1024 true avgt 3 0.506 ± 2.508 #/op
TestFill.heap_segment_fill:L1-icache-loads 1024 true avgt 3 83.470 ± 216.408 #/op
TestFill.heap_segment_fill:branch-misses 1024 true avgt 3 8.894 ± 8.807 #/op
TestFill.heap_segment_fill:branches 1024 true avgt 3 50686.259 ± 404.635 #/op
TestFill.heap_segment_fill:cycles 1024 true avgt 3 49969.876 ± 11319.276 #/op
TestFill.heap_segment_fill:dTLB-load-misses 1024 true avgt 3 0.187 ± 0.655 #/op
TestFill.heap_segment_fill:dTLB-loads 1024 true avgt 3 1.587 ± 3.060 #/op
TestFill.heap_segment_fill:iTLB-load-misses 1024 true avgt 3 0.123 ± 0.660 #/op
TestFill.heap_segment_fill:iTLB-loads 1024 true avgt 3 0.293 ± 1.287 #/op
TestFill.heap_segment_fill:instructions 1024 true avgt 3 240463.595 ± 976.383 #/op
TestFill.heap_segment_fill:stalled-cycles-frontend 1024 true avgt 3 255.006 ± 988.846 #/op
TestFill.heap_segment_fill 128000 false avgt 30 1259362.873 ± 5934.195 ns/op
TestFill.heap_segment_fill:CPI 128000 false avgt 3 0.201 ± 0.025 clks/insn
TestFill.heap_segment_fill:IPC 128000 false avgt 3 4.982 ± 0.626 insns/clk
TestFill.heap_segment_fill:L1-dcache-load-misses 128000 false avgt 3 2872.859 ± 7141.312 #/op
TestFill.heap_segment_fill:L1-dcache-loads 128000 false avgt 3 10657359.179 ± 1907105.367 #/op
TestFill.heap_segment_fill:L1-icache-load-misses 128000 false avgt 3 60.908 ± 97.434 #/op
TestFill.heap_segment_fill:L1-icache-loads 128000 false avgt 3 8853.079 ± 8185.081 #/op
TestFill.heap_segment_fill:branch-misses 128000 false avgt 3 881.014 ± 3001.249 #/op
TestFill.heap_segment_fill:branches 128000 false avgt 3 6252293.868 ± 150888.746 #/op
TestFill.heap_segment_fill:cycles 128000 false avgt 3 5728074.407 ± 820865.748 #/op
TestFill.heap_segment_fill:dTLB-load-misses 128000 false avgt 3 24.925 ± 164.673 #/op
TestFill.heap_segment_fill:dTLB-loads 128000 false avgt 3 249.671 ± 987.855 #/op
TestFill.heap_segment_fill:iTLB-load-misses 128000 false avgt 3 14.258 ± 47.128 #/op
TestFill.heap_segment_fill:iTLB-loads 128000 false avgt 3 34.156 ± 248.858 #/op
TestFill.heap_segment_fill:instructions 128000 false avgt 3 28538131.024 ± 526036.510 #/op
TestFill.heap_segment_fill:stalled-cycles-frontend 128000 false avgt 3 27932.797 ± 27039.568 #/op
TestFill.heap_segment_fill 128000 true avgt 30 1857275.169 ± 4604.437 ns/op
TestFill.heap_segment_fill:CPI 128000 true avgt 3 0.288 ± 0.009 clks/insn
TestFill.heap_segment_fill:IPC 128000 true avgt 3 3.472 ± 0.109 insns/clk
TestFill.heap_segment_fill:L1-dcache-load-misses 128000 true avgt 3 3433.246 ± 15336.162 #/op
TestFill.heap_segment_fill:L1-dcache-loads 128000 true avgt 3 12940291.898 ± 4889405.663 #/op
TestFill.heap_segment_fill:L1-icache-load-misses 128000 true avgt 3 73.450 ± 231.916 #/op
TestFill.heap_segment_fill:L1-icache-loads 128000 true avgt 3 13483.446 ± 42337.545 #/op
TestFill.heap_segment_fill:branch-misses 128000 true avgt 3 86493.970 ± 8740.093 #/op
TestFill.heap_segment_fill:branches 128000 true avgt 3 6320125.417 ± 998773.918 #/op
TestFill.heap_segment_fill:cycles 128000 true avgt 3 8406053.515 ± 1319703.106 #/op
TestFill.heap_segment_fill:dTLB-load-misses 128000 true avgt 3 34.833 ± 105.768 #/op
TestFill.heap_segment_fill:dTLB-loads 128000 true avgt 3 307.842 ± 754.292 #/op
TestFill.heap_segment_fill:iTLB-load-misses 128000 true avgt 3 23.104 ± 51.968 #/op
TestFill.heap_segment_fill:iTLB-loads 128000 true avgt 3 55.073 ± 241.755 #/op
TestFill.heap_segment_fill:instructions 128000 true avgt 3 29183047.682 ± 4280293.555 #/op
TestFill.heap_segment_fill:stalled-cycles-frontend 128000 true avgt 3 707884.732 ± 176201.245 #/op
And -prof perfasm correctly shows, for samples = 128000 and shuffle = true:
....[Hottest Region 1]..............................................................................
libjvm.so, Unsafe_SetMemory0 (82 bytes)
These are likely the branches at https://github.com/openjdk/jdk21/blob/890adb6410dab4606a4f26a942aed02fb2f55387/src/hotspot/share/utilities/copy.cpp#L216-L244
@Benchmark
public void buffer_fill() {
    // Hopefully, the creation of the intermediate array will be optimized away.
This maybe won't... why not make the byte array a static final?
The size of the array varies with the length of the segment. It seems that escape analysis works here, though.
I think the cost of transitioning to native code is what gives us this opportunity. There is always a fairly constant performance hit for transitioning to native code; once native, the actual filling performance is way better. The cut-off size appears to be close to 8 bytes on many platforms and hence, this PR handles 0 <= length < 7 specifically.
How fast do we need to be here, given we are measuring a few nanoseconds per operation? If the goal is not to regress from, say, explicitly filling a small segment or a comparable array (e.g., < 8 bytes), then maybe a loop suffices and the code stays simple?
Fair question. I have another version (called "patch bits" below) that is based on bit logic (first doing int ops, then short, and lastly byte, similar to
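For context, here is a minimal sketch of what such a bit-logic ("patch bits") approach could look like on a plain byte[]. The class, method name, and the VarHandle-based writes are my own invention for illustration, not the PR's actual code; the idea is that the bits of the length directly select which power-of-two writes to perform, with no switch:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Hypothetical branch-light fill for lengths 0..7: one int write, one short
// write, and one byte write, each gated by a bit of the length.
final class PatchBitsFill {

    private static final VarHandle INT =
            MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);
    private static final VarHandle SHORT =
            MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.LITTLE_ENDIAN);

    static void fillSmall(byte[] a, byte value) { // assumes a.length < 8
        int len = a.length;
        int u = value & 0xFF;
        int pattern = u | u << 8 | u << 16 | u << 24; // byte replicated 4x
        int offset = 0;
        if ((len & 4) != 0) { INT.set(a, offset, pattern); offset += 4; }
        if ((len & 2) != 0) { SHORT.set(a, offset, (short) pattern); offset += 2; }
        if ((len & 1) != 0) { a[offset] = value; }
    }
}
```

The three ifs here are still branches, but each depends on a single length bit rather than on the whole length value, which is the trade-off the "patch bits" variant explores.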
The goal here is to be "competitive" with array bulk operations (given that arrays have bounds checks as well) across the whole segment-size spectrum. It's fine to lose something in order to get more maintainable code.
import sun.nio.ch.DirectBuffer;

import static java.lang.foreign.ValueLayout.JAVA_BYTE;
import static java.lang.foreign.ValueLayout.*;
Please use direct imports
mcimadamore left a comment
Added some nit comments - overall, the code looks very clean, and it's nice to see these improvements... now onto copy :-)
We have tried a loop, but sadly the performance is not great when the number of iterations is small. This is due to the fact that long loops are split into two loops, an outer and an inner, where the inner loop works up to
This is being discussed with @rwestrel
@franz1981 Here is what I get if I run your performance test on my M1 Mac (unfortunately, no perf data):
    return this;
}

final long u = Byte.toUnsignedLong(value);
final long longValue = u << 56 | u << 48 | u << 40 | u << 32 | u << 24 | u << 16 | u << 8 | u;
this can be u * 0xFFFFFFFFFFFFL if value != 0 and just 0L if not: not sure if fast(er), need to measure.
Most of the time filling is happy with 0 since zeroing is the most common case
this can be u * 0xFFFFFFFFFFFFL if value != 0 and just 0L if not: not sure if fast(er), need to measure.
Most of the time filling is happy with 0 since zeroing is the most common case
It's a clever trick. However, I was looking at similar tricks and found that the time spent here is irrelevant (e.g. I tried to always force 0 as the value, and couldn't see any difference).
If I run:
@Benchmark
public long shift() {
    return ELEM_SIZE << 56 | ELEM_SIZE << 48 | ELEM_SIZE << 40 | ELEM_SIZE << 32 | ELEM_SIZE << 24 | ELEM_SIZE << 16 | ELEM_SIZE << 8 | ELEM_SIZE;
}

@Benchmark
public long mul() {
    return ELEM_SIZE * 0xFFFF_FFFF_FFFFL;
}
Then I get:
Benchmark       (ELEM_SIZE)  Mode  Cnt  Score   Error  Units
TestFill.mul             31  avgt   30  0.586 ± 0.045  ns/op
TestFill.shift           31  avgt   30  0.938 ± 0.017  ns/op
On my M1 machine.
I found similar small improvements to be had (I wrote about them offline) when replacing the bitwise-based tests (e.g. foo & 4 != 0) with a more explicit check such as remainingBytes >= 4. It seems that bitwise operations are not as well optimized (or perhaps the assembly instructions for them are overall more convoluted - I haven't checked).
I've tried
final long longValue = Byte.toUnsignedLong(value) * 0x0101010101010101L;
But it had the same performance as explicit bit shifting on M1.
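The two formulations can at least be checked for equivalence. This small sketch (class name invented) confirms that multiplying the unsigned byte by 0x0101010101010101L replicates it across all eight byte lanes, matching the shift cascade quoted above:

```java
// Demonstrates that the multiplication trick and the shift cascade produce
// the same 64-bit fill pattern for every byte value.
public class ReplicationDemo {

    static long byShift(byte value) {
        final long u = Byte.toUnsignedLong(value);
        return u << 56 | u << 48 | u << 40 | u << 32 | u << 24 | u << 16 | u << 8 | u;
    }

    static long byMul(byte value) {
        // 0x0101010101010101L has a 1 in each byte lane, so the product
        // places a copy of the byte in every lane.
        return Byte.toUnsignedLong(value) * 0x0101010101010101L;
    }
}
```

Whether the JIT emits meaningfully different code for the two is exactly the question the shift/mul benchmark above tries to answer.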
@minborg ELEM_SIZE is a @Param field, right? Just to be 100% sure of it...
yes it is. But I think the reason we are not seeing any difference is that in the fill benchmarks, we are always using a value of zero. We should keep this trick in mind for later...
Thanks @minborg for running it. So it seems that 128K, despite the additional call (due to not inlining something), makes flushing the M1's pipeline a severe affair. Now the interesting question: is there really some other non-branchy way to handle this? Is it worth it? TLDR:
That said, overall this is a big improvement over what exists now - so, great work regardless!
If I understand correctly, this benchmark attempts to call
Exactly - it has been designed to show the case where the conditions materialize (because they are taken) and are executed in an unpredictable sequence.
Good point: relative to the baseline, no, because the new version improves regardless, even when it incurs high branch misses.
@franz1981 It would be interesting to compare the overhead you measured in your shuffle benchmark with
My feeling is that the intrinsic we have under the hood must be doing some similar branching to fix up the tail of the loop. In a way, what you are measuring is the worst possible case: a method that works on segments of different sizes, but whose sizes are so small that they don't benefit much from loop optimizations. Because of that, the cost of branching dominates everything, and I think it's unavoidable to have some kind of jitter for small sizes (e.g. even if we could write a single loop using
On the other hand, the main point of this PR is to avoid the intrinsics for segments smaller than a certain size, as jumping into the intrinsics seems to have a fixed cost that doesn't make it worth it for such small segments. (A similar situation arises for
It is a good analysis; effectively, even fill will likely have to handle head/tail remainder bytes - and this will eventually lead, more or less, to some branchy code: it can be a tight loop, a series of ifs with byte-per-byte writes (7 ifs), or the approach taken in this PR. One quick question: reading https://bugs.openjdk.org/browse/JDK-8139457, it appears to me that via some unsafe mechanism we could avoid being branchy;
I'm no intrinsics expert, but if I had to guess, I'd say that the intrinsics we have do not specialize for small sizes. Also, the use of vector instructions typically comes with additional alignment constraints - meaning that we need a pre-loop (and sometimes a post-loop). This logic, while faster for bigger sizes, has some drawbacks for smaller ones.
This looks good. Again, the goal of this PR is not to squeeze out every nanosecond but, rather, to achieve a performance model that is "sensible" - whereby, if you work on smallish segments, you get more or less the same degree of performance you get when operating on small arrays. These changes seem to achieve that goal.
Fully agree, and yep - this looks pretty good already. Well done @minborg, and thanks for it! I cannot wait to remove my ugly workaround for this in Netty :P
/integrate
Going to push as commit 7a418fc.
Your commit was automatically rebased without conflicts.



The performance of MemorySegment::fill can be improved by replacing the checkAccess() method call with a call to checkReadOnly() instead (as the bounds of the segment itself do not need to be checked). Also, smaller segments can be handled directly by Java code rather than transitioning to native code.

Here is how the MemorySegment::fill performance is improved by this PR: operations involving 8 or more bytes are delegated to native code, whereas smaller segments are handled via a switch rake.

It should be noted that Arena::allocate uses MemorySegment::fill. Hence, this PR will also have a positive effect on memory allocation performance.

Progress
Issue
Reviewers
Reviewing
Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/20712/head:pull/20712
$ git checkout pull/20712

Update a local copy of the PR:
$ git checkout pull/20712
$ git pull https://git.openjdk.org/jdk.git pull/20712/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 20712

View PR using the GUI difftool:
$ git pr show -t 20712

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/20712.diff
Webrev
Link to Webrev Comment