
8338967: Improve performance for MemorySegment::fill #20712

Closed
minborg wants to merge 15 commits into openjdk:master from minborg:fill-performance

Conversation

@minborg
Contributor

@minborg minborg commented Aug 26, 2024

The performance of MemorySegment::fill can be improved by replacing the checkAccess() method call with a call to checkReadOnly() instead, as the bounds of the segment itself do not need to be checked.

Also, smaller segments can be handled directly by Java code rather than transitioning to native code.

Here is how the MemorySegment::fill performance is improved by this PR:

[chart: MemorySegment::fill throughput before and after this PR]

Operations involving 8 or more bytes are delegated to native code, whereas smaller segments are handled via a switch statement.

It should be noted that Arena::allocate uses MemorySegment::fill. Hence, this PR will also have a positive effect on memory allocation performance.
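To make the shape of the change concrete, here is a minimal sketch of the dispatch described above, modeled on a plain byte[] (the class, method names, and the use of Arrays.fill as a stand-in for the native path are illustrative assumptions, not the actual JDK code; the 8-byte threshold is the cutoff discussed in this PR):

```java
import java.util.Arrays;

public class FillDispatch {
    // Below this size, a native transition costs more than filling in Java
    // (the cutoff observed in this PR's benchmarks).
    static final int NATIVE_THRESHOLD = 8;

    static void fill(byte[] seg, byte value) {
        if (seg.length < NATIVE_THRESHOLD) {
            fillSmall(seg, value);    // pure-Java path, no native transition
        } else {
            Arrays.fill(seg, value);  // stands in for the intrinsified bulk path
        }
    }

    // Old-style switch with fall-through: length n executes exactly n stores.
    static void fillSmall(byte[] seg, byte value) {
        switch (seg.length) {
            case 7: seg[6] = value;
            case 6: seg[5] = value;
            case 5: seg[4] = value;
            case 4: seg[3] = value;
            case 3: seg[2] = value;
            case 2: seg[1] = value;
            case 1: seg[0] = value;
            case 0: break;
        }
    }
}
```

The fall-through switch trades a little code size for a branch-free-per-store small path; whether that pays off under unpredictable sizes is exactly what the review discussion below probes.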


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8338967: Improve performance for MemorySegment::fill (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/20712/head:pull/20712
$ git checkout pull/20712

Update a local copy of the PR:
$ git checkout pull/20712
$ git pull https://git.openjdk.org/jdk.git pull/20712/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 20712

View PR using the GUI difftool:
$ git pr show -t 20712

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/20712.diff

Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Aug 26, 2024

👋 Welcome back pminborg! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Aug 26, 2024

@minborg This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8338967: Improve performance for MemorySegment::fill

Reviewed-by: mcimadamore, psandoz

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 9 new commits pushed to the master branch:

  • 633fad8: 8326615: C1/C2 don't handle allocation failure properly during initialization (RuntimeStub::new_runtime_stub fatal crash)
  • 6f3e3fd: 8339411: [PPC64] cmpxchgw/h/b doesn't handle external Label
  • ed422ed: 8338817: Wrong indent in API docs for java.lang.management.ManagementFactory
  • 288fa60: 8338891: HotSpotDiagnosticsMXBean missing @since tag
  • dc4fd89: 8339359: RISC-V: Use auipc explicitly in far_jump and far_call macro assembler routines
  • 3a88fd4: 8334724: C2: remove PhaseIdealLoop::cast_incr_before_loop()
  • 62dad3a: 8339351: Remove duplicate line in FileMapHeader::print
  • 0e6bb51: 8339063: [aarch64] Skip verify_sve_vector_length after native calls if SVE supports 128 bits VL only
  • b1163bc: 8256211: assert fired in java/net/httpclient/DependentPromiseActionsTest (infrequent)

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the rfr Pull request is ready for review label Aug 26, 2024
@openjdk

openjdk bot commented Aug 26, 2024

@minborg The following label will be automatically applied to this pull request:

  • core-libs

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the core-libs core-libs-dev@openjdk.org label Aug 26, 2024

@minborg
Contributor Author

minborg commented Aug 26, 2024

Here is what it looks like for Windows x64:

[chart: fill benchmark results on Windows x64]

@franz1981
Contributor

Hi @minborg, as #16760 has shown, handling small chunks in Java can be beneficial, but AFAIK Unsafe::setMemory has already been intrinsified in #18555, so I'm surprised this should provide much benefit now.

// Use the old switch statement syntax to improve startup time
switch ((int) length) {
case 0 : checkReadOnly(false); checkValidState(); break; // Explicit tests
case 1 : set(JAVA_BYTE, 0, value); break;
Contributor

@franz1981 franz1981 Aug 26, 2024


Beware of using a switch: if this code is too big to be inlined (or we're unlucky), it will die from branch mispredicts when the different "small fill" sizes are unstable/unpredictable.
Having a test which feeds different fill sizes on each iteration, while counting branch misses, will reveal whether the improvement is worthwhile even in such cases.

Contributor Author

@minborg minborg Aug 26, 2024


It is true that this is a compromise where we give up inline space and code-cache space, and introduce added complexity, against the prospect of better small-size performance. Depending on the workload, this may or may not pay off. In the (presumably common) case where we allocate/fill small segments of constant sizes, this is likely a win. Writing a dynamic performance test sounds like a good idea.

Contributor Author

@minborg minborg Aug 26, 2024


Here is a benchmark that fills segments of various random sizes:


import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

import java.lang.foreign.MemorySegment;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.stream.IntStream;

@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 5, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@State(Scope.Thread)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(value = 3)
public class TestFill {

    private static final int SIZE = 16;
    private static final int[] INDICES = new Random(42).ints(0, 8)
            .limit(SIZE)
            .toArray();


    private MemorySegment[] segments;

    @Setup
    public void setup() {
        segments = IntStream.of(INDICES)
                .mapToObj(i -> MemorySegment.ofArray(new byte[i]))
                .toArray(MemorySegment[]::new);
    }

    @Benchmark
    public void heap_segment_fill() {
        for (int i = 0; i < SIZE; i++) {
            segments[i].fill((byte) 0);
        }
    }

}

This produces the following on my Mac M1:

Benchmark                   Mode  Cnt   Score   Error  Units
TestFill.heap_segment_fill  avgt   30  59.054 ± 3.723  ns/op

On average, an operation takes 59/16 ≈ 3.7 ns (including looping).

A test with the same size for every benchmark looks like this on my machine:

Benchmark                   (ELEM_SIZE)  Mode  Cnt  Score   Error  Units
TestFill.heap_segment_fill            0  avgt   30  1.112 ± 0.027  ns/op
TestFill.heap_segment_fill            1  avgt   30  1.602 ± 0.060  ns/op
TestFill.heap_segment_fill            2  avgt   30  1.583 ± 0.004  ns/op
TestFill.heap_segment_fill            3  avgt   30  1.909 ± 0.055  ns/op
TestFill.heap_segment_fill            4  avgt   30  1.605 ± 0.059  ns/op
TestFill.heap_segment_fill            5  avgt   30  1.900 ± 0.064  ns/op
TestFill.heap_segment_fill            6  avgt   30  1.891 ± 0.038  ns/op
TestFill.heap_segment_fill            7  avgt   30  2.237 ± 0.091  ns/op

Contributor


As discussed offline, can't we use a stable array of functions or something like that which can be populated lazily? That way you can access the function you want in a single array access, and we could put all these helper methods somewhere else.
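The handler-array idea could be sketched roughly like this (all names here are illustrative assumptions; in the JDK the array would additionally be annotated @Stable so C2 can constant-fold the loaded handler):

```java
public class FillTable {
    @FunctionalInterface
    interface SmallFill { void apply(byte[] seg, byte value); }

    // One slot per small size (0..7); populated lazily on first use,
    // so dispatch is a single array access afterwards.
    static final SmallFill[] FILLS = new SmallFill[8];

    static SmallFill fillFor(int size) {
        SmallFill f = FILLS[size];
        if (f == null) {
            FILLS[size] = f = (seg, value) -> {
                // per-size fill routine; a real implementation could
                // specialize each size instead of looping
                for (int i = 0; i < size; i++) seg[i] = value;
            };
        }
        return f;
    }
}
```

As the reply below notes, this shape did not win in practice: the indirect call through the handler is itself hard for the JIT to devirtualize when sizes vary.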

Contributor Author


Unfortunately, a stable array of functions/MethodHandles didn't work from a performance perspective.

Contributor

@franz1981 franz1981 Aug 27, 2024


Here is a benchmark that fills segments of various random sizes:

Without proper branch-miss perf counters, it is difficult to say whether it is actually messing up the Apple M-series branch predictor...

For my Ryzen, this is the test that messes with branch prediction (which is fairly good on AMD); clearly, not inlining the fill wrapper is a trick to keep MemorySegment::fill inlined while still keeping the branch predictor targets "stable" for our purposes, but it translates into a more costly call dispatch. Depending on the CPU cost of a branch mispredict (and of nuking the pipeline), it could still be fine, as these numbers show:

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.CompilerControl;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

import java.lang.foreign.MemorySegment;
import java.util.Random;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 5, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@State(Scope.Thread)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(value = 3)
public class TestFill {

    @Param({"false", "true"})
    private boolean shuffle;
    private MemorySegment[] segments;
    @Param({ "1024", "128000"})
    private int samples;
    private byte[] segmentSequence;

    @Setup
    public void setup() {
        segments = new MemorySegment[8];
        // still allocates 8 different arrays
        for (int i = 0; i < 8; i++) {
            // we always pay the most of the cost here, for fun
            byte[] a = shuffle? new byte[i + 1] : new byte[8];
            segments[i] = MemorySegment.ofArray(a);
        }
        segmentSequence = new byte[samples];
        var rnd = new Random(42);
        for(int i = 0; i < samples; i++) {
            // if shuffle == false always fall into the "worst" case of populating 8 bytes
            segmentSequence[i] = (byte) rnd.nextInt(0, 8);
        }
    }

    @Benchmark
    public void heap_segment_fill() {
        var segments = this.segments;
        for (int nextIndex : segmentSequence) {
            fill(segments[nextIndex]);
        }
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public void fill(MemorySegment segment) {
        segment.fill((byte) 0);
    }

}

With

# JMH version: 1.34
# VM version: JDK 21, Java HotSpot(TM) 64-Bit Server VM, 21+35-LTS-2513

I got:

This means that, despite fill not being that optimized on JDK 21, this benchmark still messes enough with the branch predictor to hurt badly, as the perf counters show:

Benchmark                                           (samples)  (shuffle)  Mode  Cnt         Score         Error      Units
TestFill.heap_segment_fill                               1024      false  avgt   30     10296.595 ±      19.694      ns/op
TestFill.heap_segment_fill:CPI                           1024      false  avgt    3         0.200 ±       0.006  clks/insn
TestFill.heap_segment_fill:IPC                           1024      false  avgt    3         5.006 ±       0.152  insns/clk
TestFill.heap_segment_fill:L1-dcache-load-misses         1024      false  avgt    3         7.839 ±      35.541       #/op
TestFill.heap_segment_fill:L1-dcache-loads               1024      false  avgt    3     90908.364 ±   19714.476       #/op
TestFill.heap_segment_fill:L1-icache-load-misses         1024      false  avgt    3         0.458 ±       1.347       #/op
TestFill.heap_segment_fill:L1-icache-loads               1024      false  avgt    3        70.000 ±     287.459       #/op
TestFill.heap_segment_fill:branch-misses                 1024      false  avgt    3         8.666 ±      10.013       #/op
TestFill.heap_segment_fill:branches                      1024      false  avgt    3     49674.054 ±    9931.580       #/op
TestFill.heap_segment_fill:cycles                        1024      false  avgt    3     46501.496 ±    8694.782       #/op
TestFill.heap_segment_fill:dTLB-load-misses              1024      false  avgt    3         0.186 ±       0.549       #/op
TestFill.heap_segment_fill:dTLB-loads                    1024      false  avgt    3         1.426 ±       4.003       #/op
TestFill.heap_segment_fill:iTLB-load-misses              1024      false  avgt    3         0.126 ±       0.405       #/op
TestFill.heap_segment_fill:iTLB-loads                    1024      false  avgt    3         0.249 ±       0.869       #/op
TestFill.heap_segment_fill:instructions                  1024      false  avgt    3    232778.290 ±   47179.208       #/op
TestFill.heap_segment_fill:stalled-cycles-frontend       1024      false  avgt    3       257.566 ±     778.186       #/op
TestFill.heap_segment_fill                               1024       true  avgt   30     11003.331 ±      70.467      ns/op
TestFill.heap_segment_fill:CPI                           1024       true  avgt    3         0.208 ±       0.047  clks/insn
TestFill.heap_segment_fill:IPC                           1024       true  avgt    3         4.813 ±       1.077  insns/clk
TestFill.heap_segment_fill:L1-dcache-load-misses         1024       true  avgt    3         8.734 ±       1.782       #/op
TestFill.heap_segment_fill:L1-dcache-loads               1024       true  avgt    3     94231.271 ±    4742.906       #/op
TestFill.heap_segment_fill:L1-icache-load-misses         1024       true  avgt    3         0.506 ±       2.508       #/op
TestFill.heap_segment_fill:L1-icache-loads               1024       true  avgt    3        83.470 ±     216.408       #/op
TestFill.heap_segment_fill:branch-misses                 1024       true  avgt    3         8.894 ±       8.807       #/op
TestFill.heap_segment_fill:branches                      1024       true  avgt    3     50686.259 ±     404.635       #/op
TestFill.heap_segment_fill:cycles                        1024       true  avgt    3     49969.876 ±   11319.276       #/op
TestFill.heap_segment_fill:dTLB-load-misses              1024       true  avgt    3         0.187 ±       0.655       #/op
TestFill.heap_segment_fill:dTLB-loads                    1024       true  avgt    3         1.587 ±       3.060       #/op
TestFill.heap_segment_fill:iTLB-load-misses              1024       true  avgt    3         0.123 ±       0.660       #/op
TestFill.heap_segment_fill:iTLB-loads                    1024       true  avgt    3         0.293 ±       1.287       #/op
TestFill.heap_segment_fill:instructions                  1024       true  avgt    3    240463.595 ±     976.383       #/op
TestFill.heap_segment_fill:stalled-cycles-frontend       1024       true  avgt    3       255.006 ±     988.846       #/op
TestFill.heap_segment_fill                             128000      false  avgt   30   1259362.873 ±    5934.195      ns/op
TestFill.heap_segment_fill:CPI                         128000      false  avgt    3         0.201 ±       0.025  clks/insn
TestFill.heap_segment_fill:IPC                         128000      false  avgt    3         4.982 ±       0.626  insns/clk
TestFill.heap_segment_fill:L1-dcache-load-misses       128000      false  avgt    3      2872.859 ±    7141.312       #/op
TestFill.heap_segment_fill:L1-dcache-loads             128000      false  avgt    3  10657359.179 ± 1907105.367       #/op
TestFill.heap_segment_fill:L1-icache-load-misses       128000      false  avgt    3        60.908 ±      97.434       #/op
TestFill.heap_segment_fill:L1-icache-loads             128000      false  avgt    3      8853.079 ±    8185.081       #/op
TestFill.heap_segment_fill:branch-misses               128000      false  avgt    3       881.014 ±    3001.249       #/op
TestFill.heap_segment_fill:branches                    128000      false  avgt    3   6252293.868 ±  150888.746       #/op
TestFill.heap_segment_fill:cycles                      128000      false  avgt    3   5728074.407 ±  820865.748       #/op
TestFill.heap_segment_fill:dTLB-load-misses            128000      false  avgt    3        24.925 ±     164.673       #/op
TestFill.heap_segment_fill:dTLB-loads                  128000      false  avgt    3       249.671 ±     987.855       #/op
TestFill.heap_segment_fill:iTLB-load-misses            128000      false  avgt    3        14.258 ±      47.128       #/op
TestFill.heap_segment_fill:iTLB-loads                  128000      false  avgt    3        34.156 ±     248.858       #/op
TestFill.heap_segment_fill:instructions                128000      false  avgt    3  28538131.024 ±  526036.510       #/op
TestFill.heap_segment_fill:stalled-cycles-frontend     128000      false  avgt    3     27932.797 ±   27039.568       #/op
TestFill.heap_segment_fill                             128000       true  avgt   30   1857275.169 ±    4604.437      ns/op
TestFill.heap_segment_fill:CPI                         128000       true  avgt    3         0.288 ±       0.009  clks/insn
TestFill.heap_segment_fill:IPC                         128000       true  avgt    3         3.472 ±       0.109  insns/clk
TestFill.heap_segment_fill:L1-dcache-load-misses       128000       true  avgt    3      3433.246 ±   15336.162       #/op
TestFill.heap_segment_fill:L1-dcache-loads             128000       true  avgt    3  12940291.898 ± 4889405.663       #/op
TestFill.heap_segment_fill:L1-icache-load-misses       128000       true  avgt    3        73.450 ±     231.916       #/op
TestFill.heap_segment_fill:L1-icache-loads             128000       true  avgt    3     13483.446 ±   42337.545       #/op
TestFill.heap_segment_fill:branch-misses               128000       true  avgt    3     86493.970 ±    8740.093       #/op
TestFill.heap_segment_fill:branches                    128000       true  avgt    3   6320125.417 ±  998773.918       #/op
TestFill.heap_segment_fill:cycles                      128000       true  avgt    3   8406053.515 ± 1319703.106       #/op
TestFill.heap_segment_fill:dTLB-load-misses            128000       true  avgt    3        34.833 ±     105.768       #/op
TestFill.heap_segment_fill:dTLB-loads                  128000       true  avgt    3       307.842 ±     754.292       #/op
TestFill.heap_segment_fill:iTLB-load-misses            128000       true  avgt    3        23.104 ±      51.968       #/op
TestFill.heap_segment_fill:iTLB-loads                  128000       true  avgt    3        55.073 ±     241.755       #/op
TestFill.heap_segment_fill:instructions                128000       true  avgt    3  29183047.682 ± 4280293.555       #/op
TestFill.heap_segment_fill:stalled-cycles-frontend     128000       true  avgt    3    707884.732 ±  176201.245       #/op

And -prof perfasm correctly shows, for samples = 128000 and shuffle = true:

....[Hottest Region 1]..............................................................................
libjvm.so, Unsafe_SetMemory0 (82 bytes) 

Which are likely the branches at https://github.com/openjdk/jdk21/blob/890adb6410dab4606a4f26a942aed02fb2f55387/src/hotspot/share/utilities/copy.cpp#L216-L244


@Benchmark
public void buffer_fill() {
// Hopefully, the creation of the intermediate array will be optimized away.
Contributor


This maybe won't... why not make the byte array a static final?

Contributor Author


The size of the array varies with the length of the segment. It seems that escape analysis works here, though.

@minborg
Contributor Author

minborg commented Aug 26, 2024

Hi @minborg, as #16760 has shown, handling small chunks in Java can be beneficial, but AFAIK Unsafe::setMemory has already been intrinsified in #18555, so I'm surprised this should provide much benefit now.

I think the cost of transitioning to native code is what gives us this opportunity: there is always a fairly constant performance hit when transitioning to native code. Once native, the actual filling performance is much better. The cut-off size appears to be close to 8 bytes on many platforms, and hence this PR handles 0 <= length < 8 specifically.

@PaulSandoz
Member

How fast do we need to be here given we are measuring in a few nanoseconds per operation?

What if the goal is not to regress from, say, explicitly filling a small segment or a comparable array (e.g., < 8 bytes)? Then maybe a loop suffices and the code stays simple?

@minborg
Contributor Author

minborg commented Aug 28, 2024

How fast do we need to be here given we are measuring in a few nanoseconds per operation?

What if the goal is not to regress from, say, explicitly filling a small segment or a comparable array (e.g., < 8 bytes)? Then maybe a loop suffices and the code stays simple?

Fair question. I have another version (called "patch bits" below) that is based on bit logic (first doing int ops, then short and lastly byte, similar to ArraySupport::vectorizedMismatch). This has slightly worse performance but is more scalable and perhaps simpler.

[chart: fill performance of the "patch bits" variant vs. the switch-based patch and baseline]
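A sketch of what such a bit-logic variant could look like for the 0-7 byte tail, modeled on a byte[] via VarHandle views (illustrative only, with assumed names; the real patch works on MemorySegment internals). The idea is that each set bit of the length selects one write of the corresponding width, so any length 0-7 needs at most three stores and no data-dependent loop:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class PatchBitsFill {
    static final VarHandle INTS =
            MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.nativeOrder());
    static final VarHandle SHORTS =
            MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.nativeOrder());

    // Fills 0..7 bytes: widest units first (int, then short, then byte),
    // chosen by the bits of the length.
    static void fillSmall(byte[] seg, byte value) {
        int len = seg.length;                        // assumed 0..7 here
        int intValue = (value & 0xFF) * 0x01010101;  // replicate byte into an int
        int offset = 0;
        if ((len & 4) != 0) { INTS.set(seg, offset, intValue);           offset += 4; }
        if ((len & 2) != 0) { SHORTS.set(seg, offset, (short) intValue); offset += 2; }
        if ((len & 1) != 0) { seg[offset] = value; }
    }
}
```

This matches the later review note about `foo & 4 != 0`-style tests: the same structure can also be written with explicit `remainingBytes >= 4` comparisons, which reportedly compiled slightly better.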

@mcimadamore
Contributor

How fast do we need to be here given we are measuring in a few nanoseconds per operation?

The goal here is to be "competitive" with array bulk operations (given that arrays have bounds checks as well) across the whole segment size spectrum. It's fine to lose something in order to get more maintainable code.

import sun.nio.ch.DirectBuffer;

import static java.lang.foreign.ValueLayout.JAVA_BYTE;
import static java.lang.foreign.ValueLayout.*;
Contributor


Please use direct imports

Contributor

@mcimadamore mcimadamore left a comment


Added some nit comments. Overall, the code looks very clean, and it's nice to see these improvements... now onto copy :-)

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Aug 28, 2024
@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Aug 28, 2024
@openjdk openjdk bot added the ready Pull request is ready to be integrated label Aug 28, 2024
@mcimadamore
Contributor

then maybe a loop suffices and the code is simple?

We have tried a loop, but sadly the performance is not great when the number of iterations is small. This is due to the fact that long loops are split into two loops, an outer and an inner, where the inner loop works on up to Integer.MAX_VALUE elements. While this strategy pays off for bigger iteration counts, it doesn't for very small ones. The C2 optimizations currently have a cutoff point, but it's possible that this point is set too high.

This is being discussed with @rwestrel
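Conceptually, the splitting described above can be pictured like this (a hand-written model for illustration, not actual C2 output; on a byte[] the long index is only a stand-in for a 64-bit trip count):

```java
public class LongLoopSplit {
    // What the source loop looks like: one loop with a long trip count.
    static void fillDirect(byte[] dst, byte v, long length) {
        for (long i = 0; i < length; i++) {
            dst[(int) i] = v;
        }
    }

    // Roughly what the long loop is turned into: an outer long loop over
    // chunks, plus an inner int-counted loop that can be unrolled and
    // vectorized. For tiny lengths, the outer-loop bookkeeping is pure
    // overhead, which is the regression discussed above.
    static void fillSplit(byte[] dst, byte v, long length) {
        for (long base = 0; base < length; ) {
            int chunk = (int) Math.min(length - base, Integer.MAX_VALUE);
            for (int i = 0; i < chunk; i++) {
                dst[(int) (base + i)] = v;
            }
            base += chunk;
        }
    }
}
```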

@minborg
Contributor Author

minborg commented Aug 30, 2024

@franz1981 Here is what I get if I run your performance test on my M1 Mac (unfortunately no perf counter data):

Base
Benchmark                         (samples)  (shuffle)  Mode  Cnt        Score        Error  Units
TestBranchFill.heap_segment_fill       1024      false  avgt   30    58597.625 ±   1871.313  ns/op
TestBranchFill.heap_segment_fill       1024       true  avgt   30    64309.859 ±   1164.360  ns/op
TestBranchFill.heap_segment_fill     128000      false  avgt   30  7136796.445 ± 152120.060  ns/op
TestBranchFill.heap_segment_fill     128000       true  avgt   30  7908474.120 ±  49184.950  ns/op


Patch
Benchmark                         (samples)  (shuffle)  Mode  Cnt        Score       Error  Units
TestBranchFill.heap_segment_fill       1024      false  avgt   30     3695.815 ±    24.615  ns/op
TestBranchFill.heap_segment_fill       1024       true  avgt   30     3938.582 ±   124.510  ns/op
TestBranchFill.heap_segment_fill     128000      false  avgt   30   420845.301 ±  1605.080  ns/op
TestBranchFill.heap_segment_fill     128000       true  avgt   30  1778362.506 ± 39250.756  ns/op

return this;
}
final long u = Byte.toUnsignedLong(value);
final long longValue = u << 56 | u << 48 | u << 40 | u << 32 | u << 24 | u << 16 | u << 8 | u;
Contributor


this can be u * 0xFFFFFFFFFFFFL if value != 0 and just 0L if not: not sure if fast(er), need to measure.

Most of the time filling is happy with 0 since zeroing is the most common case

Contributor


this can be u * 0xFFFFFFFFFFFFL if value != 0 and just 0L if not: not sure if fast(er), need to measure.

Most of the time filling is happy with 0 since zeroing is the most common case

It's a clever trick. However, I was looking at similar tricks and found that the time spent here is irrelevant (e.g. I tried to always force 0 as the value, and couldn't see any difference).

Contributor Author


If I run:

    @Benchmark
    public long shift() {
        return ELEM_SIZE << 56 | ELEM_SIZE << 48 | ELEM_SIZE << 40 | ELEM_SIZE << 32 | ELEM_SIZE << 24 | ELEM_SIZE << 16 | ELEM_SIZE << 8 | ELEM_SIZE;
    }

    @Benchmark
    public long mul() {
        return ELEM_SIZE * 0xFFFF_FFFF_FFFFL;
    }

Then I get:

Benchmark       (ELEM_SIZE)  Mode  Cnt  Score   Error  Units
TestFill.mul             31  avgt   30  0.586 ± 0.045  ns/op
TestFill.shift           31  avgt   30  0.938 ± 0.017  ns/op

On my M1 machine.

Contributor


I found similar small improvements to be had (I wrote about them offline) when replacing the bitwise-based tests (e.g. foo & 4 != 0) with a more explicit check for remainingBytes >= 4. Seems like bitwise operations are not as optimized (or perhaps the assembly instructions for them are overall more convoluted - I haven't checked).

Contributor Author


I've tried

final long longValue = Byte.toUnsignedLong(value) * 0x0101010101010101L;

But it had the same performance as explicit bit shifting on M1.
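The two byte-replication formulas being compared here are arithmetically equivalent, which is easy to check exhaustively (a quick sanity sketch; class and method names are my own):

```java
public class ByteReplicate {
    // Eight explicit shifts, as in the patch under review.
    static long viaShifts(byte value) {
        long u = Byte.toUnsignedLong(value);
        return u << 56 | u << 48 | u << 40 | u << 32
             | u << 24 | u << 16 | u << 8  | u;
    }

    // 0x0101010101010101L has a 1 in each byte lane, so a single multiply
    // copies the byte into all eight lanes.
    static long viaMultiply(byte value) {
        return Byte.toUnsignedLong(value) * 0x0101010101010101L;
    }
}
```

Since both compile to a handful of cheap ALU instructions, it is plausible that the difference vanishes inside the full fill path, as observed.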

Contributor

@franz1981 franz1981 Sep 3, 2024


@minborg the ELEM_SIZE is a Param field right? Just to be 100% sure of it...

Contributor Author


Yes, it is. But I think the reason we are not seeing any difference is that in the fill benchmarks we are always using a value of zero. We should keep this trick in mind for later...

@franz1981
Contributor

franz1981 commented Aug 30, 2024

Thanks @minborg for running it. So it seems that at 128K samples, despite the additional call (due to not inlining something), nuking the M1's pipeline is a severe affair:

Patch
Benchmark                         (samples)  (shuffle)  Mode  Cnt        Score       Error  Units
TestBranchFill.heap_segment_fill     128000      false  avgt   30   420845.301 ±  1605.080  ns/op
TestBranchFill.heap_segment_fill     128000       true  avgt   30  1778362.506 ± 39250.756  ns/op <----- HERE!

Now the interesting thing... is there really some other non-branchy way to handle this? Is it worth it?
I sadly don't have many answers on this, since when I did something similar in Netty at netty/netty#13693 I decided on a more drastic approach, see https://github.com/netty/netty/pull/13693/files#diff-49ee8d7612d5ecfcc27b46c38a801ad32ebdb169f7d79f1577313a1de70b0fbbR639-R649

TLDR:

  • modern x86 CPUs are decent at filling unaligned data, but non-x86 ones are not so great: that led me to handle alignment; for off-heap memory, clearly
  • use 2 loops to reduce the branches (amortize them, let's say) and hope the 1-7 byte tail works decently with unrolling and placing bytes singly - basically leveraging Unsafe for both
  • the cutoff value is much higher (i.e. 64 bytes) because of the suboptimal pre-JDK 21 memset implementation (now fixed in main by @asgibbons)
  • the array/heap case was already handled by Arrays::fill

That said, overall this is a big improvement over what exists now, so great work regardless!
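The "2 loops" idea from the list above can be sketched on a byte[] (a hedged approximation: it uses a VarHandle view instead of Unsafe, and all names are my own; Netty's actual code additionally handles alignment and off-heap memory):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class TwoLoopFill {
    static final VarHandle LONGS =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.nativeOrder());

    static void fill(byte[] a, byte value) {
        // Replicate the byte into all eight lanes of a long.
        long packed = (value & 0xFFL) * 0x0101010101010101L;
        int i = 0;
        // Main loop: one branch per 8 bytes written, amortizing the checks.
        for (; i + 8 <= a.length; i += 8) {
            LONGS.set(a, i, packed);
        }
        // Tail loop: the remaining 0-7 bytes, placed singly.
        for (; i < a.length; i++) {
            a[i] = value;
        }
    }
}
```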

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Aug 30, 2024
@mcimadamore
Contributor

mcimadamore commented Aug 30, 2024

Thanks @minborg for running it. So it seems that at 128K samples, despite the additional call (due to not inlining something), nuking the M1's pipeline is a severe affair:

If I understand correctly, this benchmark calls fill with different segment sizes (in a loop) - correct? It's understandable that, in this case, we can't optimize as well, because we have different branches which get taken or not in a less predictable fashion. The important question (for this PR) is: does the work proposed here cause a regression in the case you have in mind? E.g. is the setMemory intrinsic better than the branchy logic we have here?

@franz1981
Contributor

franz1981 commented Aug 30, 2024

in this case, we can't optimize as well, because we have different branches which get taken or not in a less predictable fashion.

Exactly - it has been designed to show the case where the conditions materialize (because they are taken) and are exercised in a non-predictable sequence.
I couldn't find an easier way to make it happen reliably other than making the root method not inlined, with the limitation that the actual call of this root method is now part of the cost - but that should be orders of magnitude less important than a pipeline nuke.
I could have made the method C2-compiled in the setup with all the branches taken, and "probably" I could have dropped the DONT_INLINE on the root method - but I'm not sure.

The important question is (for this PR): does the work proposed here cause a regression in the case you have in mind? E.g. is the setMemory intrinsics better than the branchy logic we have here?

Good point: relative to the baseline, no, because the new version improves regardless, even when it incurs high branch misses.

@mcimadamore
Copy link
Contributor

Thanks @minborg for running it; it seems that at 128K, despite the additional call (due to something not being inlined), nuking the M1's pipeline is a severe affair:

If I understand correctly, this benchmark attempts to call fill with different segment sizes (in a loop) - correct? It's understandable that, in this case, we can't optimize as well, because we have different branches which get taken or not in a less predictable fashion. The important question is (for this PR): does the work proposed here cause a regression in the case you have in mind? E.g. is the setMemory intrinsic better than the branchy logic we have here?

@franz1981 It would be interesting to compare the overhead you measured in your shuffle benchmark with Array::fill - which is, I think, where we'd like to get at.

@mcimadamore
Copy link
Contributor

Good point: relative to the baseline, no, because the new version improves regardless, even when it incurs high branch misses.

My feeling is that the intrinsic we have under the hood must be doing some similar branching to fix up the tail of the loop. In a way, what you are measuring is the worst possible case: a method that works on segments of different sizes, but whose sizes are so small that they don't benefit much from loop optimizations. Because of that, the cost of branching dominates everything. I think some kind of jitter for small sizes is unavoidable (e.g. even if we could write a single loop over bytes and C2 auto-vectorized all of it, there would still be a loop tail that fills in the contents for some remainder size - and that logic is going to be branchy).

On the other hand, the main point of this PR is to avoid the intrinsics for segments smaller than a certain size, as jumping into the intrinsics seems to have some fixed cost that doesn't make it worthwhile for such small segments. (A similar situation arises for copy and mismatch - to an even greater extent.)
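The "switch rake" mentioned in the PR description handles such small segments without a loop by falling through one byte store per case. The sketch below mirrors that shape for a plain `byte[]` (illustrative only, not the PR's exact code; `len` is assumed to be in [0, 7]):

```java
class SwitchRake {
    static void fillSmall(byte[] a, int off, int len, byte v) {
        switch (len) {
            case 7: a[off + 6] = v; // fall through: each case writes one byte
            case 6: a[off + 5] = v;
            case 5: a[off + 4] = v;
            case 4: a[off + 3] = v;
            case 3: a[off + 2] = v;
            case 2: a[off + 1] = v;
            case 1: a[off] = v;
            default: // len == 0: nothing to write
        }
    }
}
```

The deliberate fall-through turns the length check into a single computed jump followed by straight-line stores, which is cheap even when the length varies.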

@franz1981
Copy link
Contributor

franz1981 commented Aug 30, 2024

That's a good analysis; indeed, even fill will likely have to handle head/tail remainder bytes, and this eventually leads, one way or another, to some branchy code: it can be a tight loop, a series of ifs with byte-per-byte writes (7 ifs), or the approach taken in this PR.
All of these strategies are better than what we have now, probably because the existing intrinsic still makes some poor decisions, but I haven't dug into the perfasm output yet to see what it does wrong; maybe it is something that could be fixed in the intrinsic itself?
That said, the three approaches I mentioned would be interesting to check against both predictable and unpredictable workloads; I see pros and cons in all of them, TBH, although just as an academic exercise.

One quick question: reading https://bugs.openjdk.org/browse/JDK-8139457, it appears that via some unsafe mechanism we could avoid being branchy.
If a single byte[] still needs to be 8-byte (or 16-byte?) aligned, could we just use a long and write past the end of the array? Is that a safe assumption?
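A bounds-safe cousin of the "write past the end" idea is to let two wide stores overlap inside the region instead of spilling beyond it. The sketch below (illustrative, not JDK code) fills any length in [8, 16] branch-free with two possibly-overlapping long stores, never touching memory outside [off, off + len):

```java
import java.nio.ByteBuffer;

class OverlapFill {
    // precondition: 8 <= len <= 16 (the two stores then cover the range)
    static void fill8to16(ByteBuffer buf, int off, int len, byte v) {
        long pattern = (v & 0xFFL) * 0x0101010101010101L; // replicate byte 8x
        buf.putLong(off, pattern);            // covers [off, off + 8)
        buf.putLong(off + len - 8, pattern);  // covers the tail, may overlap
    }
}
```

The byte order of the stores doesn't matter here because the replicated pattern is the same in either endianness.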

@minborg
Copy link
Contributor Author

minborg commented Sep 2, 2024

Here is the performance improvement for Linux aarch64. Looks like a significant improvement.

image

@mcimadamore
Copy link
Contributor

All of these strategies are better than what we have now, probably because the existing intrinsic still makes some poor decisions, but I haven't dug into the perfasm output yet to see what it does wrong; maybe it is something that could be fixed in the intrinsic itself?

I'm no intrinsics expert, but if I had to guess I'd say that the intrinsics we have do not specialize for small sizes. Also, the use of vector instructions typically comes with additional alignment constraints - meaning that we need a pre-loop (and sometimes a post-loop). This logic, while faster for bigger sizes, has some drawbacks for smaller sizes.
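The pre-loop / main-loop / post-loop shape described here can be sketched as follows, with 8-byte long stores standing in for vector stores (buffer offsets, not real addresses, are what gets aligned in this purely illustrative version):

```java
import java.nio.ByteBuffer;

class AlignedFill {
    static void fill(ByteBuffer buf, int off, int len, byte v) {
        int end = off + len;
        int i = off;
        // pre-loop: byte stores until the offset is 8-byte aligned
        while (i < end && (i & 7) != 0) buf.put(i++, v);
        // main loop: wide stores (the part that pays off for big sizes)
        long pattern = (v & 0xFFL) * 0x0101010101010101L;
        for (; i + 8 <= end; i += 8) buf.putLong(i, pattern);
        // post-loop: remaining tail bytes
        while (i < end) buf.put(i++, v);
    }
}
```

For a small fill, most of the time is spent in the pre- and post-loops, which is exactly the drawback being described.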

@mcimadamore
Copy link
Contributor

mcimadamore commented Sep 2, 2024

Here is the performance improvement for Linux aarch64. Looks like a significant improvement.

This looks good. Again, the goal of this PR is not to squeeze every nanosecond out - but, rather, to achieve a performance model that is "sensible" - whereby, if you work on smallish segments you get more or less the same degree of performance you get when operating on small arrays. These changes seem to achieve that goal.

@franz1981
Copy link
Contributor

This looks good. Again, the goal of this PR is not to squeeze every nanosecond out - but, rather, to achieve a performance model that is "sensible"

Fully agree, and yep: this looks pretty good already. Well done @minborg, thanks for it! I can't wait to remove my ugly workaround for this in Netty :P

@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Sep 3, 2024
@openjdk openjdk bot added the ready Pull request is ready to be integrated label Sep 3, 2024
@minborg
Copy link
Contributor Author

minborg commented Sep 3, 2024

/integrate

@openjdk
Copy link

openjdk bot commented Sep 3, 2024

Going to push as commit 7a418fc.
Since your change was applied there have been 9 commits pushed to the master branch:

  • 633fad8: 8326615: C1/C2 don't handle allocation failure properly during initialization (RuntimeStub::new_runtime_stub fatal crash)
  • 6f3e3fd: 8339411: [PPC64] cmpxchgw/h/b doesn't handle external Label
  • ed422ed: 8338817: Wrong indent in API docs for java.lang.management.ManagementFactory
  • 288fa60: 8338891: HotSpotDiagnosticsMXBean missing @since tag
  • dc4fd89: 8339359: RISC-V: Use auipc explicitly in far_jump and far_call macro assembler routines
  • 3a88fd4: 8334724: C2: remove PhaseIdealLoop::cast_incr_before_loop()
  • 62dad3a: 8339351: Remove duplicate line in FileMapHeader::print
  • 0e6bb51: 8339063: [aarch64] Skip verify_sve_vector_length after native calls if SVE supports 128 bits VL only
  • b1163bc: 8256211: assert fired in java/net/httpclient/DependentPromiseActionsTest (infrequent)

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Sep 3, 2024
@openjdk openjdk bot closed this Sep 3, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Sep 3, 2024
@openjdk
Copy link

openjdk bot commented Sep 3, 2024

@minborg Pushed as commit 7a418fc.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.


Labels

core-libs (core-libs-dev@openjdk.org), integrated (Pull request has been integrated)
