
8268743: Require a better way for copying data between MemorySegments and on-heap arrays #560

Conversation


@mcimadamore mcimadamore commented Jun 15, 2021

After some investigation, it seems that escape analysis is defeated in cases where a new heap segment is created fresh just before performing a bulk copy.

This is caused by the fact that, on segment creation, we perform this test:

static int defaultAccessModes(long size) {
    return (enableSmallSegments && size < Integer.MAX_VALUE) ?
            SMALL : 0;
}

This ensures that segments whose size fits in an int do not incur the penalties associated with the missing bound-check optimizations for long-indexed loops.

Unfortunately, this check introduces control flow, and control flow defeats escape analysis optimizations.

For segment wrappers around byte arrays we can work around this by removing the check (all byte segments are small by definition, since there is a 1-1 mapping between logical elements and physical bytes). For other segment kinds we cannot do much.
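A minimal sketch of the byte-array workaround described above (illustrative only; the actual change lives in the HeapMemorySegment implementation, and the method name below is hypothetical):

// A byte[] holds at most ~Integer.MAX_VALUE elements and each logical element is
// exactly one byte, so a byte-array segment always fits in an int. The SMALL flag
// can therefore be applied unconditionally, removing the size branch that defeats
// escape analysis on the freshly created wrapper.
static int defaultAccessModesForByteArray(long size) {
    return SMALL;
}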

While it would be possible, in principle, to resort to more complex bound checks for heap segments, we believe the way forward is to eliminate the need for "small" segments, which will be possible once the PR below is completed:

openjdk/jdk#2045


Progress

  • Change must not contain extraneous whitespace
  • Change must be properly reviewed

Issue

  • JDK-8268743: Require a better way for copying data between MemorySegments and on-heap arrays

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/panama-foreign pull/560/head:pull/560
$ git checkout pull/560

Update a local copy of the PR:
$ git checkout pull/560
$ git pull https://git.openjdk.java.net/panama-foreign pull/560/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 560

View PR using the GUI difftool:
$ git pr show -t 560

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/panama-foreign/pull/560.diff


@bridgekeeper bridgekeeper bot commented Jun 15, 2021

👋 Welcome back mcimadamore! A progress list of the required criteria for merging this PR into foreign-memaccess+abi will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.


@openjdk openjdk bot added the rfr label Jun 15, 2021

@mlbridge mlbridge bot commented Jun 15, 2021

Webrevs



@uschindler uschindler commented Jun 15, 2021

But we have no solution for larger sizes like float[] or long[] arrays?



@mcimadamore mcimadamore commented Jun 15, 2021

But we have no solution for larger sizes like float[] or long[] arrays?

Not for now - the real solution is to fix the performance woes which make the implementation do all these workarounds for longs vs. ints. That said, I'd be curious, once this is integrated, if you could do a quick validation and re-run that benchmark where you were seeing lots of byte segment wrappers in the heap, and make sure at least that behaves as expected.



@uschindler uschindler commented Jun 15, 2021

But we have no solution for larger sizes like float[] or long[] arrays?

Not for now - the real solution is to fix the performance woes which make the implementation do all these workarounds for longs vs. ints. That said, I'd be curious, once this is integrated, if you could do a quick validation and re-run that benchmark where you were seeing lots of byte segment wrappers in the heap, and make sure at least that behaves as expected.

For sure.

@rmuir and I already discussed this. From what we understand: readLong(long[], offset, count) in our case never reads more than 64 longs, because of the limitations of the PFOR algorithm inside Lucene. So we would remove the specialization in MemorySegmentIndexInput and the superclass would use a loop of readLong() instead. This is also in line with your hint about the copyMemory liveness-check overhead. The readFloat variant we have reads a maximum of 1024 float dimensions for our vector support; I will do some quick investigations later, but I am inclined to remove that specialization, too. In the future we will use FloatVector from the Vector API, and that should possibly be wrapped over the memory segment (see also apache/lucene#18 for some quick investigations). For the longs and the PFOR algorithm we may also use a vectorized approach in the future, so readFloat() and readLongs() will go away and be replaced by methods that return a FloatVector/LongVector view on the memory segment (using a ByteBuffer view, or directly once Panama and the Vector API can share types).
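A rough sketch of the loop-based fallback mentioned above, assuming the superclass already exposes a scalar readLong() (names are illustrative, not Lucene's actual code):

// Hypothetical superclass fallback: no bulk-copy specialization, just a scalar loop.
// Each readLong() goes through the usual memory-segment accessor.
public void readLongs(long[] dst, int offset, int length) throws IOException {
    for (int i = 0; i < length; i++) {
        dst[offset + i] = readLong();
    }
}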

The biggest problem on our side is our implementation of readBytes(byte[], offset, count) which is fixed here.


@Benchmark
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void segment_small_copy_fresh() {
    segment.asSlice(srcOffset, nbytes).copyFrom(MemorySegment.ofArray(bytes).asSlice(targetOffset, nbytes));
}

@uschindler uschindler Jun 15, 2021


In our code we do the copy the other way round. Maybe add another benchmark for the memory segment -> heap array direction, too.
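A possible shape for that reverse benchmark, mirroring the fields used in the hunk above (sketch only; the method name is hypothetical):

@Benchmark
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void segment_small_copy_fresh_reverse() {
    // fresh heap wrapper as the destination, mapped segment slice as the source
    MemorySegment.ofArray(bytes).asSlice(targetOffset, nbytes)
            .copyFrom(segment.asSlice(srcOffset, nbytes));
}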



@mcimadamore mcimadamore commented Jun 15, 2021

I'm no longer too sure this will make an actual difference. While JMH shows something, digging deeper reveals that in most iterations there is absolutely no allocation:

# Run progress: 0.00% complete, ETA 00:00:22
# Fork: 1 of 3
WARNING: Using incubator modules: jdk.incubator.foreign
# Warmup Iteration   1: 8.009 ns/op
# Warmup Iteration   2: 6.775 ns/op
# Warmup Iteration   3: 3.475 ns/op
# Warmup Iteration   4: 6.647 ns/op
# Warmup Iteration   5: 6.924 ns/op
Iteration   1: 7.389 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

Iteration   2: 6.593 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

Iteration   3: 6.472 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

Iteration   4: 7.103 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

Iteration   5: 6.287 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

Iteration   6: 6.976 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

Iteration   7: 6.503 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

Iteration   8: 7.057 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

Iteration   9: 6.554 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

Iteration  10: 6.654 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts


# Run progress: 33.33% complete, ETA 00:00:30
# Fork: 2 of 3
WARNING: Using incubator modules: jdk.incubator.foreign
# Warmup Iteration   1: 7.538 ns/op
# Warmup Iteration   2: 7.192 ns/op
# Warmup Iteration   3: 7.222 ns/op
# Warmup Iteration   4: 7.712 ns/op
# Warmup Iteration   5: 6.882 ns/op
Iteration   1: 7.016 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

Iteration   2: 7.009 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

Iteration   3: 6.958 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

Iteration   4: 7.210 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

Iteration   5: 7.860 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

Iteration   6: 7.700 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

Iteration   7: 3.655 ns/op   // <------------------------------------------------------------------------------------
                 ·gc.alloc.rate:                   21.009 MB/sec
                 ·gc.alloc.rate.norm:              0.161 B/op
                 ·gc.churn.G1_Eden_Space:          23.975 MB/sec
                 ·gc.churn.G1_Eden_Space.norm:     0.184 B/op
                 ·gc.churn.G1_Survivor_Space:      2.673 MB/sec
                 ·gc.churn.G1_Survivor_Space.norm: 0.021 B/op
                 ·gc.count:                        1.000 counts
                 ·gc.time:                         2.000 ms

Iteration   8: 7.049 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

Iteration   9: 6.910 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

Iteration  10: 7.325 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts


# Run progress: 66.67% complete, ETA 00:00:15
# Fork: 3 of 3
WARNING: Using incubator modules: jdk.incubator.foreign
# Warmup Iteration   1: 7.798 ns/op
# Warmup Iteration   2: 7.576 ns/op
# Warmup Iteration   3: 7.088 ns/op
# Warmup Iteration   4: 7.099 ns/op
# Warmup Iteration   5: 7.232 ns/op
Iteration   1: 7.368 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

Iteration   2: 7.193 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

Iteration   3: 6.731 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

Iteration   4: 7.656 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

Iteration   5: 7.626 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

Iteration   6: 7.232 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

Iteration   7: 7.276 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

Iteration   8: 7.521 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

Iteration   9: 6.777 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

Iteration  10: 7.298 ns/op
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

(This benchmark tests the copy in the same direction as Lucene does it.) Only one iteration has non-zero allocation - presumably some kind of race between C2 and GC. But for the most part there's already no allocation here... at least according to JMH. We need a benchmark which reproduces the issue more precisely before attempting any fix, I think.



@mcimadamore mcimadamore commented Jun 15, 2021

Adding a static method which avoids all the slicing doesn't seem to help performance-wise. It is still slower than an Unsafe call (20-25%). If that would help narrow things down, we could try adding such a static method at least in the Panama repo and maybe we could test that with Lucene? At least we would have removed allocations out of the equation (but I suspect that allocation is not where the performance slowdown is coming from). @uschindler let me know what would be the easiest for you to try.



@mcimadamore mcimadamore commented Jun 15, 2021

Adding a static method which avoids all the slicing doesn't seem to help performance-wise. It is still slower than an Unsafe call (20-25%). If that would help narrow things down, we could try adding such a static method at least in the Panama repo and maybe we could test that with Lucene? At least we would have removed allocations out of the equation (but I suspect that allocation is not where the performance slowdown is coming from). @uschindler let me know what would be the easiest for you to try.

Actually, it's even worse if the benchmark is tweaked to use non-static segments/buffers:

Benchmark                                Mode  Cnt  Score   Error  Units
TestSmallCopy.buffer_small_copy          avgt   30  5.163 ± 0.146  ns/op
TestSmallCopy.segment_small_copy_slice   avgt   30  7.176 ± 0.082  ns/op
TestSmallCopy.segment_small_copy_static  avgt   30  8.091 ± 0.155  ns/op
TestSmallCopy.unsafe_small_copy          avgt   30  4.778 ± 0.052  ns/op

The benchmark code can be found here:

https://github.com/mcimadamore/panama-foreign/blob/small_copy_benchmark/test/micro/org/openjdk/bench/jdk/incubator/foreign/TestSmallCopy.java

Despite the absence of allocation in this benchmark (which can be verified by running it with -prof gc), the performance of bulk copy is not great. Slicing is actually faster than using a static method, presumably because the slice instance gets scalarized (while the segment stashed in the field does not), an effect we have seen other times.

I can't, off the top of my head, explain the difference, especially compared with the buffer API - so I filed this:

https://bugs.openjdk.java.net/browse/JDK-8268822



@uschindler uschindler commented Jun 16, 2021

Hi,

we have clearly seen the allocation, but let me bring another factor into play: our segments are in a shared ResourceScope.

When comparing the benchmarks (which contain a lot more code than just the segment copy, because we were testing query performance) there was one thing I noticed: with our patch that used the slicing/copyFrom variant, it was obvious even from the outside that something was wrong: each iteration of the whole Lucene benchmark was about 20% slower in total runtime (from startup to the end of each iteration), combined with garbage collection activity (also visible in JFR). So something in our case was triggering the allocations. The difference in runtime between the unpatched and patched Lucene benchmark was worse with tiered compilation disabled (for historical reasons the Lucene benchmark runs with tiered off and -Xbatch on). See the outputs on apache/lucene#177

From your latest benchmarks it looks like, in addition to the allocation we have seen, there is some other slowdown when doing the segment memory copy. From my understanding this mostly affects small segments, because the overhead of the liveness check outweighs the copy itself. But isn't the liveness check also done for var handle access (although optimized away after some time)? So rewriting to a loop would not bring much, as the liveness check needs to be done at least at the start of the loop once optimizations kick in. What is the difference in the liveness check (on the shared memory segment) between copyMemory and a single var handle call?

If that would help narrow things down, we could try adding such a static method at least in the Panama repo and maybe we could test that with Lucene? At least we would have removed allocations out of the equation (but I suspect that allocation is not where the performance slowdown is coming from). @uschindler let me know what would be the easiest for you to try.

Sure, I can run the benchmarks with a Linux build of Panama! I can compile my own on this machine (not today or tomorrow), but if you have some tar.gz file with a prebuilt binary available it would be great.



@mlbridge mlbridge bot commented Jun 16, 2021

Mailing list message from Maurizio Cimadamore on panama-dev:

On 16/06/2021 08:49, Uwe Schindler wrote:

On Tue, 15 Jun 2021 11:56:26 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:

After some investigation, it seems that escape analysis is defeated in cases where a new heap segment is created fresh just before performing a bulk copy.

This is caused by the fact that, on segment creation, we perform this test:

static int defaultAccessModes(long size) {
return (enableSmallSegments && size < Integer.MAX_VALUE) ?
SMALL : 0;
}

To make sure that segments whose size fits in an `int` do not incur in penalties associated with lack of optimizations over long loop bound check optimizations.

Unfortunately, this logic is control flow logic, and control flow disables escape analysis optimizations.

For segment wrappers around byte arrays we can workaround by removing the check (all byte segments are small by definition, since there's a 1-1 mapping between logical elements and physical bytes). For other segment kinds we cannot do much.

While it would be possible, in principle, to resort to more complex bound checks for heap segments, we believe the way forward is to eliminate the need for "small" segments, which will be possible once the PR below is completed:

https://github.com/openjdk/jdk/pull/2045
Hi,

we have clearly seen the allocation, but maybe bring some other part into the game: our segments are in a shared ResourceScope.

Shared scope shouldn't affect allocation profile. And slicing with
shared scopes doesn't cost extra, only closing shared scope costs extra.
So I think we can rule this out.

When comparing the benchmarks (which contain a lot of more code than just the segment copy, because we were testing query performance) there was one thing which I have seen: With our patch that used the slicing/copyFrom variant, it was also obvious from the outside that something was wrong: each iteration of the whole Lucene benchmark was like 20 % slower in total runtime (from startup till end of each iteration) in combination with garbage collection activity (which was also seen by JFR). So something in our case was triggering the allocations. The difference in runtime between unpatched and patched lucene benchmark was worse with tiered compilation disabled (for historical reasons the Lucene benchmark runs with tiered off and -Xbatch on). See outputs on https://github.com/apache/lucene/pull/177

From your latest benchmarks it looks like in addition to the allocation we have seen, there is some other slowdown when doing the segment memorycopy. From my understanding this mostly affects small segments, because the overhead of liveness check overweighs. But isn't the liveness check not also done for varhandle access (although optimized away after some time), so rewriting to a loop would not bring much - as the lifeness check needs to be done at least at start of loop after optimizations kicking in? What is the difference in the lifeness check (on the shared memory segment) between copyMemory and a single varhandle call?

So, in part this is known: the extra checks (liveness and bounds) prior
to memory access cost inevitably more than a plain, unchecked unsafe
access. So, if you do a _single_ var handle access, and you benchmark it
against a _single_ Unsafe access, it is slower - we have benchmark for
this stuff. In reality (1) there's not much we can do (we can't compete
with _all belts off_ mode) and (2) the use case of a single "hot" memory
access is not very realistic. Typically you will access memory in loops.
And in these cases the memory API can take advantage of C2
optimizations, checks are hoisted, and no extra price is paid in the
end. This is the same deal the ByteBuffer API does.

The issue with memory copy is that memory copy _is_ typically used in a
standalone fashion, e.g. outside of a loop. I'm pretty sure that if I
repeat the benchmark by wrapping it in a loop and do like 100
iterations, the difference with unsafe will disappear. But that's _not_
how users will use `copyFrom`.

What remains to be seen is as to whether this extra cost (even compared
with other APIs like ByteBuffer) is something which looks like a missing
C2 optimization, or something not optimal on our implementation. We plan
to investigate more and try to narrow this down.

If that would help narrow things down, we could try adding such a static method at least in the Panama repo and maybe we could test that with Lucene? At least we would have removed allocations out of the equation (but I suspect that allocation is not where the performance slowdown is coming from). @uschindler let me know what would be the easiest for you to try.
Sure, I can run the benchmarks with a Linux build of Panama! I can compile my own on this machine (not today or tomorrow), but if you have some tar.gz file with a prebuilt binary available it would be great.

I believe the thing that would make the most sense would be for you to
build the bits in this branch:

https://github.com/mcimadamore/panama-foreign/tree/small_copy_benchmark/

This is like Panama repo, but it has a new static method to do the copy,
in MemorySegment.

If using the static method instead of the slicing-based copyFrom removes
all the allocation AND it makes benchmark return back to where they
were, we can consider doing something like this. The benchmark I have in
that branch doesn't seem to suggest that the static method is any faster
than the other one, but it is always hard to generalize microbenchmarks
results - for instance, it might be that your copy code is "cold" and is
never touched by C2, or maybe it is touched only by one of the first
compiler tiers.

So, let's rule out allocation and GC cost first - if you use the static
method, you should no longer slice and dice before a copy. This should
at least remove the allocation you see: let's see if that's enough to
bring Lucene performance back to where it was.

Maurizio



@mcimadamore mcimadamore commented Jun 16, 2021

Minor breakthrough. I explored further the assumption that the copy code in Lucene is not hot enough for C2 to kick in, and started to play with JMH's @CompilerControl(EXCLUDE) directive. Performance dips significantly (~10x), but what's striking is that I finally see lots of allocations when using slicing and no allocations when using the static copy method:

Benchmark                                                                Mode  Cnt    Score    Error   Units
TestSmallCopy.segment_small_copy_slice                                   avgt   30  157.831 ±  4.524   ns/op
TestSmallCopy.segment_small_copy_slice:·gc.alloc.rate                    avgt   30  241.978 ±  7.438  MB/sec
TestSmallCopy.segment_small_copy_slice:·gc.alloc.rate.norm               avgt   30   80.007 ±  0.002    B/op
TestSmallCopy.segment_small_copy_slice:·gc.churn.G1_Eden_Space           avgt   30  246.372 ± 74.872  MB/sec
TestSmallCopy.segment_small_copy_slice:·gc.churn.G1_Eden_Space.norm      avgt   30   81.763 ± 24.998    B/op
TestSmallCopy.segment_small_copy_slice:·gc.churn.G1_Survivor_Space       avgt   30    0.001 ±  0.002  MB/sec
TestSmallCopy.segment_small_copy_slice:·gc.churn.G1_Survivor_Space.norm  avgt   30   ≈ 10⁻⁴             B/op
TestSmallCopy.segment_small_copy_slice:·gc.count                         avgt   30   25.000           counts
TestSmallCopy.segment_small_copy_slice:·gc.time                          avgt   30   23.000               ms

vs:

Benchmark                                                    Mode  Cnt    Score    Error   Units
TestSmallCopy.segment_small_copy_static                      avgt   30  100.394 ±  7.448   ns/op
TestSmallCopy.segment_small_copy_static:·gc.alloc.rate       avgt   30   ≈ 10⁻⁴           MB/sec
TestSmallCopy.segment_small_copy_static:·gc.alloc.rate.norm  avgt   30   ≈ 10⁻⁴             B/op
TestSmallCopy.segment_small_copy_static:·gc.count            avgt   30      ≈ 0           counts

This might well explain what @uschindler is seeing. If the code is not hot, EA won't even run, which would explain the allocation. So, if this is the case, we're far less concerned with the cost of single liveness checks, but we'd be more worried about the cost of introducing that many allocations.
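For reference, a rough sketch of how that exclusion experiment can be expressed in JMH (method-level excerpt; the field names mirror the earlier hunk, and the actual benchmark in TestSmallCopy may be structured differently):

// Excluding the method from JIT compilation simulates "cold" copy code: escape
// analysis is a C2 optimization, so without C2 the fresh slice wrappers created
// by the slicing-based copy are really allocated on the heap.
@Benchmark
@CompilerControl(CompilerControl.Mode.EXCLUDE)
public void segment_small_copy_slice() {
    segment.asSlice(srcOffset, nbytes)
           .copyFrom(MemorySegment.ofArray(bytes).asSlice(targetOffset, nbytes));
}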



@uschindler uschindler commented Jun 16, 2021

Hi Maurizio,
it's not possible today, because I have a talk at a conference later, but tomorrow I can certainly test the repo you provided.

To call the new static MemorySegment.copy() method, I still need to wrap the byte[] array, so I should merge in THIS pull request? From looking at your repository it does not seem to have the HeapMemorySegment SMALL change.

Maybe you can also add the small change from this PR to your benchmark branch.

About the liveness check: In my unsafe copy implementation, I called segment.address().toRawLongAddress(), so it was also doing the liveness checks. This had no effect on the performance. The only difference is the additional fencing after the whole copyMemory operation in ScopedMemoryAccess.

The issue with memory copy is that memory copy is typically used in a
standalone fashion, e.g. outside of a loop. I'm pretty sure that if I
repeat the benchmark by wrapping it in a loop and do like 100
iterations, the difference with unsafe will disappear. But that's not
how users will use copyFrom.

In most cases this is true. Actually we sometimes have a loop, but very seldom: we only use mmap when reading index files, not when writing new index files, which are just streamed to a classical OutputStream. Sometimes we have to copy data to an OutputStream that was previously read from an mmapped InputStream (this happens when merging Lucene index segments). In that case the code calls IndexInput.readBytes(...) and passes the byte array to the OutputStream in the usual, well-known copy loop. But that's of course not the common case (especially not in our benchmark, which does not build new indexes; it just executes queries).



@uschindler uschindler commented Jun 16, 2021

This might well explain what @uschindler is seeing. If the code is not hot, EA won't even run, which would explain the allocation. So, if this is the case, we're far less concerned with the cost of single liveness checks, but we'd be more worried about the cost of introducing that many allocations.

So would the new MemoryCopy class from the other pull request help, because it has @ForceInline?



@mcimadamore mcimadamore commented Jun 16, 2021

To call the new static MemorySegment.copy() method, I still need to wrap the byte[] array, so I should merge in THIS pull request? From looking at your repository it does not seem to have the HeapMemorySegment SMALL change.

Leave this pull request alone, and just work off the branch I pointed out, which contains the static method. The static method just takes two segments, two offsets (in bytes) and a length; it's like System.arraycopy.
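A sketch of the call shape being described, assuming a static method with an arraycopy-like signature (the exact name and parameter order in that branch may differ):

// Hypothetical call: copy nbytes from the mapped segment into a fresh heap
// wrapper, with no intermediate slices created by the caller.
MemorySegment heapDst = MemorySegment.ofArray(bytes);
MemorySegment.copy(segment, srcOffset, heapDst, targetOffset, nbytes);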



@mcimadamore mcimadamore commented Jun 16, 2021

So would the new MemoryCopy class from the other pull request help, because it has @ForceInline?

Possibly, although our philosophy behind adding static helpers is that they should be just "helpers", not a way to get more performant code. That is, you get the same performance with a VarHandle call as you get with MemoryAccess::get. But if the basic copyFrom does not always offer the expected performance model, then I'd claim it's a bug, in that the API exposes the "wrong" primitive (because the use cases for scalar memory access are very different from those for bulk memory access, for the reasons mentioned above). But let's first verify that this is indeed what's causing the degradation for Lucene. You mentioned that the copy methods were called "millions of times", which makes it unlikely for them to be completely "cold" and unoptimized.



@uschindler uschindler commented Jun 16, 2021

To call the new static MemorySegment.copy() method, I still need to wrap the byte[] array, so I should merge in THIS pull request? From looking at your repository it does not seem to have the HeapMemorySegment SMALL change.

Leave this pull request alone, and just work off the branch I pointed out, which contains the static method. The static method just takes two segments, two offsets (in bytes) and a length; it's like System.arraycopy.

I will just check both branches separately. First this pull request, if it makes the allocations go away (which looks like it won't). Then the other one.

Give me a few days, will work on this after the conference.

But let's first verify that this is indeed what's causing the degradation for Lucene. You mentioned that the copy methods were called "millions of times", which makes it unlikely for them to be completely "cold" and unoptimized.

That was meant symbolically. I'm not sure how often it is called in reality, but from the number of allocations on the heap, it seems often. We have found similar issues in the past, too: sometimes HotSpot refuses to optimize a method for unknown reasons; it seems to have to do with how it is called. Sometimes the complexity of the code inside Lucene is dramatic. The internals look very "C-like": huge methods with complex spaghetti-like logic, randomly calling methods on MemorySegmentIndexInput (previously ByteBufferIndexInput). Most of the logic uses an "iterator/enum" approach: the search thread gets new document hits by consuming some enum, which itself calls IndexInput.



@mcimadamore mcimadamore commented Jun 16, 2021

I will just check both branches separately. First this pull request, if it makes the allocations go away (which looks like it won't). Then the other one.

You might also want to try the one in:
#555

As you said, the fact that the copy is wrapped in a @ForceInline method might help. Also, I realized (and that's probably what you were referring to) that the single static copy method would still require wrapping the array in a segment.



@uschindler uschindler commented Jun 21, 2021

Hi @mcimadamore,
this pull request does not help at all. The garbage is still created. I compiled a JDK from exactly this branch and ran the benchmark against it:

thetaphi@serv1:~/panama-foreign$ git remote -v
mcimadamore     https://github.com/mcimadamore/panama-foreign.git (fetch)
mcimadamore     https://github.com/mcimadamore/panama-foreign.git (push)
origin  https://github.com/openjdk/panama-foreign.git (fetch)
origin  https://github.com/openjdk/panama-foreign.git (push)
thetaphi@serv1:~/panama-foreign$ git status
On branch copy-bench-lucene
Your branch is up to date with 'mcimadamore/copy-bench-lucene'.

nothing to commit, working tree clean
thetaphi@serv1:~/panama-foreign$

The JFR output looks identical. Again, total runtime is worse with -Xbatch and -XX:-TieredCompilation.

I will now try #555 as an alternative. To me it looks like HotSpot does not try to optimize the methods at all, maybe because of the complex structure of our code.



@uschindler uschindler commented Jun 21, 2021

I think this one can be closed.



@mcimadamore mcimadamore commented Jun 21, 2021

Closing as it does not address the original problem.

