@franz1981

@franz1981 franz1981 commented Nov 21, 2023

The motivation behind this PR is explained at netty/netty#13692 and netty/netty#13693. Because Netty is "just" a Java library, it couldn't solve the described performance issue at its root, i.e. the performance of zeroing a few bytes is cannibalized by crossing the JNI bridge and by the chain of method calls.

Obviously this issue is magnified by 2 facts:

  1. microbenchmarks keep the L1 cache hot, which makes even a call indirection too costly
  2. Netty was unsafely exposing users to a zeroing API

I see that the original code (related to Copy::fill_*) hasn't changed in a long time, as no one has noticed this performance issue, but looking at which code paths involve zeroing via memset clarifies why: previously, there was no "hot path" using it.
In recent JDK releases, things have changed due to the introduction of https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/foreign/MemorySegment.html#fill(byte) , exposing OpenJDK users to the poor performance of this method for small writes.

There are other Java frameworks in the High-Frequency-Trading land which have exposed users to this API and tried to improve things from Java, e.g. https://github.com/real-logic/agrona/blob/cfdda89a367f5543e13168ab55361afa1cbef4a5/agrona/src/main/java/org/agrona/AbstractMutableDirectBuffer.java#L77-L120

I've prepared these changes to explore the different options I'm currently capable of (I'm not an OpenJDK developer) and I'm open to learning more interesting ways to solve it, a few of which I explore below.

First, a summary of what I've tried here:

  • 5fc0e3c: makes Copy::fill_to_memory_atomic inlinable and replaces UNSAFE_ENTRY with a specialized UNSAFE_LEAF, to avoid trying to resolve JNI handles (but I didn't perform some of the checks described in
    // This function is a leaf since if the source and destination are both in native memory
    // the copy may potentially be very large, and we don't want to disable GC if we can avoid it.
    // If either source or destination (or both) are on the heap, the function will enter VM using
    // JVM_ENTRY_FROM_LEAF
    )
  • 7861d8d: this is the commit I don't like, because of my current inability to generate a proper fill via a stub/macro and to intrinsify the Unsafe call.
    I'm leveraging some existing Java/C2 code paths (e.g. intrinsify_fill, although I still get bounds checks :"( ) to avoid any JNI calls. It also uses a hardcoded cutoff value, which cannot survive in a real commit.
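
The "Java side" idea from 7861d8d can be sketched roughly as follows. This is not the PR's code: it uses a direct ByteBuffer instead of Unsafe, and the class SmallZeroSketch, the methods zero/zeroBulk and the CUTOFF value of 64 are all hypothetical names and numbers chosen for illustration. The bulk path is a plain-loop stand-in for the native setMemory call.

```java
import java.nio.ByteBuffer;

final class SmallZeroSketch {
    // Hypothetical cutoff; the right value is likely architecture dependent.
    static final int CUTOFF = 64;

    /** Zeroes length bytes of buf starting at the absolute index offset. */
    static void zero(ByteBuffer buf, int offset, int length) {
        if (offset < 0 || length < 0 || offset > buf.limit() - length) {
            throw new IndexOutOfBoundsException();
        }
        if (length > CUTOFF) {
            zeroBulk(buf, offset, length);
            return;
        }
        int i = offset;
        int end = offset + length;
        // Widest stores first, then progressively narrower ones for the tail.
        for (; end - i >= Long.BYTES; i += Long.BYTES) {
            buf.putLong(i, 0L);
        }
        if (end - i >= Integer.BYTES) {
            buf.putInt(i, 0);
            i += Integer.BYTES;
        }
        if (end - i >= Short.BYTES) {
            buf.putShort(i, (short) 0);
            i += Short.BYTES;
        }
        if (end - i >= 1) {
            buf.put(i, (byte) 0);
        }
    }

    // Stand-in for the bulk path (the native setMemory call in the real code).
    static void zeroBulk(ByteBuffer buf, int offset, int length) {
        for (int i = offset; i < offset + length; i++) {
            buf.put(i, (byte) 0);
        }
    }

    public static void main(String[] args) {
        ByteBuffer b = ByteBuffer.allocateDirect(128);
        for (int i = 0; i < 128; i++) {
            b.put(i, (byte) 0x7f);
        }
        zero(b, 3, 7);
        // Bytes 3..9 are now zero, their neighbours are untouched.
        System.out.println(b.get(2) + " " + b.get(3) + " " + b.get(9) + " " + b.get(10));
    }
}
```

Since the value written is zero, the buffer's byte order does not matter here; a non-zero fill would need the byte replicated into the wider pattern first.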

The numbers below are only indicative of the kind of performance improvement I would like to achieve; they do not include the heap use case (which I still have to implement in the JMH benchmark), nor do they try to "shuffle" the JIT decisions with the additional compiled branches.

baseline (25f9af9):

Benchmark                           (aligned)  (size)  Mode  Cnt  Score   Error  Units
MemorySegmentZeroUnsafe.panama           true       7  avgt   10  8.257 ± 0.006  ns/op
MemorySegmentZeroUnsafe.unsafe           true       7  avgt   10  7.894 ± 0.012  ns/op
MemorySegmentZeroUnsafe.panama           true      64  avgt   10  8.626 ± 0.073  ns/op
MemorySegmentZeroUnsafe.unsafe           true      64  avgt   10  8.041 ± 0.026  ns/op

Faster Unsafe::setMemory0

Benchmark                           (aligned)  (size)  Mode  Cnt  Score   Error  Units
MemorySegmentZeroUnsafe.panama           true       7  avgt   10  6.245 ± 0.004  ns/op
MemorySegmentZeroUnsafe.unsafe           true       7  avgt   10  5.837 ± 0.117  ns/op
MemorySegmentZeroUnsafe.panama           true      64  avgt   10  6.246 ± 0.007  ns/op
MemorySegmentZeroUnsafe.unsafe           true      64  avgt   10  5.798 ± 0.005  ns/op

Improving Java side for fill pattern and save JNI leafs

Benchmark                           (aligned)  (size)  Mode  Cnt  Score   Error  Units
MemorySegmentZeroUnsafe.panama           true       7  avgt   10  4.348 ± 0.053  ns/op
MemorySegmentZeroUnsafe.unsafe           true       7  avgt   10  3.546 ± 0.010  ns/op
MemorySegmentZeroUnsafe.panama           true      64  avgt   10  4.017 ± 0.002  ns/op
MemorySegmentZeroUnsafe.unsafe           true      64  avgt   10  2.757 ± 0.015  ns/op

Given that relying on the JIT to do its job means assuming that C2 always kicks in, I suppose we also care about the performance when just C1 is used. This led me to set -XX:+TieredCompilation -XX:TieredStopAtLevel=1 in the jvmArgs of the benchmark, getting:

Improving Java side for fill pattern and save JNI leafs

Benchmark                       (aligned)  (size)  Mode  Cnt   Score   Error  Units
MemorySegmentZeroUnsafe.panama       true       7  avgt   10  13.231 ± 0.480  ns/op
MemorySegmentZeroUnsafe.unsafe       true       7  avgt   10  11.118 ± 0.013  ns/op

while, with Faster Unsafe::setMemory0

Benchmark                       (aligned)  (size)  Mode  Cnt   Score   Error  Units
MemorySegmentZeroUnsafe.panama       true       7  avgt   10  20.093 ± 4.528  ns/op
MemorySegmentZeroUnsafe.unsafe       true       7  avgt   10   9.575 ± 0.002  ns/op

And clearly, the panama version suffers from the MemorySegment abstraction around Unsafe, but the unsafe version is slightly regressed in the "Java optimized" version, something that will likely be more visible with smaller sizes.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Integration blockers

 ⚠️ The commit message does not reference any issue. To add an issue reference to this PR, edit the title to be of the format issue number: message. (failed with updated jcheck configuration in pull request)
 ⚠️ Too few reviewers with at least role reviewer found (have 0, need at least 1) (failed with updated jcheck configuration in pull request)
 ⚠️ Whitespace errors (failed with updated jcheck configuration in pull request)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16760/head:pull/16760
$ git checkout pull/16760

Update a local copy of the PR:
$ git checkout pull/16760
$ git pull https://git.openjdk.org/jdk.git pull/16760/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 16760

View PR using the GUI difftool:
$ git pr show -t 16760

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16760.diff

@bridgekeeper bridgekeeper bot added the oca Needs verification of OCA signatory status label Nov 21, 2023
@bridgekeeper

bridgekeeper bot commented Nov 21, 2023

Hi @franz1981, welcome to this OpenJDK project and thanks for contributing!

We do not recognize you as Contributor and need to ensure you have signed the Oracle Contributor Agreement (OCA). If you have not signed the OCA, please follow the instructions. Please fill in your GitHub username in the "Username" field of the application. Once you have signed the OCA, please let us know by writing /signed in a comment in this pull request.

If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please use "Add GitHub user franz1981" as summary for the issue.

If you are contributing this work on behalf of your employer and your employer has signed the OCA, please let us know by writing /covered in a comment in this pull request.

@openjdk

openjdk bot commented Nov 21, 2023

@franz1981 The following labels will be automatically applied to this pull request:

  • core-libs
  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added hotspot hotspot-dev@openjdk.org core-libs core-libs-dev@openjdk.org labels Nov 21, 2023
@merykitty
Member

merykitty commented Nov 22, 2023

If the goal is to set just a few bytes, then I don't think this has gone far enough. An intrinsic would definitely be better.

Another approach is to implement this in pure Java, I think for large buffers the autovectoriser is pretty good and for small ones #16245 will help tremendously.

And clearly, the panama version suffers from the MemorySegment abstraction around Unsafe, but the unsafe version is slightly regressed in the "Java optimized" version, something that will likely be more visible with smaller sizes.

Does 1.5ns matter that much? I believe if every nanosecond counts then that particular piece of code will surely be C2-compiled.

@franz1981
Author

franz1981 commented Nov 22, 2023

If the goal is to set just a few bytes, then I don't think this has gone far enough. An intrinsic would definitely be better.

Fully agree: I've looked at the produced ASM (which I can report, but it is platform specific) and I can say that "Improving Java side for fill pattern and save JNI leafs" is very compact and isn't far from what an optimal intrinsic would produce (for writes of <= 64 bytes). This means that the numbers obtained (although limited by being a microbenchmark...) are a good upper bound for what we could achieve, with the additional benefit that a proper intrinsic could have more precise cutoff value(s), if any.
What I dislike about the intrinsic, instead, is that we would still have a refreshed Unsafe story, and it would benefit "only" the code paths that explicitly rely on it (which are very limited, although important!).

Another approach is to implement this in pure Java, I think for large buffers the autovectoriser is pretty good and for small ones #16245 will help tremendously.

This is something I think is valuable: getting great performance out of a common mechanism is a good way to improve things elsewhere too, especially if something doesn't prove to be as good as expected. The sole "concern" is that "safe autovectorization" still performs hoisted bounds checks, which we already perform in the Unsafe caller; at this point I can just piggyback on the inner ones (which won't make it Unsafe anymore :P).
Another concern is that, AFAIK, the slow path usually chosen for x86_64 autovectorization intrinsics is byte-by-byte (similarly to String::indexOf despite it being an intrinsic, as reported at netty/netty#13534 (comment)), so it still won't be good enough for writes that are small but not tiny (e.g. [8, 128] bytes). I have to double check this last statement, anyway.

If I go down that route (maybe in a separate branch, to compare), can we basically drop the existing native Unsafe implementation, so that I just perform a byte-by-byte copy, without any "trick" to perform batch writes using bigger strides?

Does 1.5ns matter that much? I believe if every nanosecond counts then that particular piece of code will surely be C2-compiled.

It is a fair point, although I reported it for the record: at least now I know that it is a small number, but a difference exists.

@merykitty
Member

Another concern is that, AFAIK, the slow path usually chosen for x86_64 autovectorization intrinsics is byte-by-byte (as per String::indexOf, as reported at netty/netty#13534 (comment)), so it still won't be good enough for writes that are small but not tiny (e.g. [8, 128] bytes). I have to double check this last statement, anyway.

I believe improvements are being made to the post-loop vectoriser, which will transform the post-loop into a masked vector load/store instead of a loop. You can follow its development in #14581. To nitpick, String::indexOf is an intrinsic, and its implementation is hand-written assembly. Currently, though, the autovectoriser would do the same, which is to run a scalar loop when the number of elements falls below the vector size.

@franz1981
Author

franz1981 commented Nov 22, 2023

Thanks @merykitty for pointing out the other ongoing work on this.
I have a few questions about what already exists that fits the fill API:

  1. void MacroAssembler::generate_fill(BasicType t, bool aligned,
    seems to perform a post-loop which doesn't just do byte-by-byte stores, except for the last 4 bytes (I'm thinking about the [8, 128] sized cases) - am I reading it wrong?
  2. but sadly, off-heap is not accounted in : is it correct?
  3. given that zeroing is probably the most common fill case, wouldn't it be good to reuse
    void MacroAssembler::clear_mem(Register base, Register cnt, Register tmp, XMMRegister xtmp,
    and make it possible to work with non-heap data as well? Does that make (any) sense?

For 3: I know that having specialized paths isn't ideal, but if zeroing is really the most common fill case, why spend cycles analyzing (& generating) a fill pattern when the JIT can detect it and reuse the same code used for zeroing the memory of allocated instances?

@franz1981
Author

/covered

@bridgekeeper bridgekeeper bot added the oca-verify Needs verification of OCA signatory status label Nov 28, 2023
@bridgekeeper

bridgekeeper bot commented Nov 28, 2023

Thank you! Please allow for a few business days to verify that your employer has signed the OCA. Also, please note that pull requests that are pending an OCA check will not usually be evaluated, so your patience is appreciated!

@bridgekeeper bridgekeeper bot removed oca Needs verification of OCA signatory status oca-verify Needs verification of OCA signatory status labels Nov 28, 2023
@JornVernee
Member

Hey, I've had a look. I like where this is going, but I think the approach of dropping the thread state transition from the code in unsafe.cpp is problematic in the context of #16792, where we need all memory access that goes through ScopedMemoryAccess to happen in the VM or Java thread state to prevent use-after-free.

I think if we want to avoid the thread state transitions (both from Java -> native and native -> VM), then the best option is to intrinsify. This should be relatively straightforward to do by copying what we already do for Unsafe::copyMemory. We just need a stub to call for the fill. Note that runtime calls use the C calling convention, so I think it may be possible to just call Copy::fill_to_memory_atomic for the implementation of the intrinsic (through an extern "C" wrapper function). There's also a StubRoutines_jbyte_fill stub that is being used for loop transformations that we may be able to use (not sure if it's compatible).

but sadly, off-heap is not accounted in

MacroAssembler::generate_fill is used for the _jbyte_fill stub as well. I think it supports off-heap as well, as the destination is just an address. For the switch statement that you link to, we care about the T_BYTE case (i.e. the value we want to fill with is a byte).

Either way, I think having a Java-based version for smaller byte sizes is an interesting direction to explore, since it avoids the indirect call and makes everything more visible to the JIT.

@franz1981
Author

franz1981 commented Dec 1, 2023

Thanks @JornVernee for the pointers! I like your proposal and I can give it a shot in a separate branch too; I really don't know how much of the Java changes to keep to address the "small writes" part of the problem, really....

I could also do it this way:

  • send a Java-only PR: but what cutoff value to choose? That's arch-dependent I believe, although 64/100/128 seem good candidates, at least for linux
  • send a separate PR which addresses the case where we go into Unsafe land; but it is yet to be seen if there's any gain for us there, because once we move beyond 2 cache lines the callq cost starts to fade away and NOT leveraging memset could cause regressions. That said, I really don't understand why the existing Copy::fill behaviour avoids memset when addresses|size are 8/4/2 bytes aligned (I have to do some research myself on how memset is implemented, I believe)

@JornVernee
Member

but what cutoff value to choose? That's arch-dependent I believe, although 64/100/128 seem good candidates, at least for linux

I think it also depends on what kind of improvements we see for the intrinsification. If that's fast enough, we probably want a lower cutoff for the Java implementation.

I think we should intrinsify in C2 first, and then look at providing Java versions for small sizes, since whether that's worth it depends on whether it's faster than just doing the native call. It might also be beneficial for copyMemory and copySwapMemory to have a java-based version.

I suggest splitting the work into 2 PRs: 1. that intrinsifies setMemory, and 2. that has the Java implementations.

send a separate PR which addresses the case where we go into Unsafe land; but it is yet to be seen if there's any gain for us there, because once we move beyond 2 cache lines the callq cost starts to fade away and NOT leveraging memset could cause regressions. That said, I really don't understand why the existing Copy::fill behaviour avoids memset when addresses|size are 8/4/2 bytes aligned (I have to do some research myself on how memset is implemented, I believe)

That code is very old, maybe memset wasn't as good back in the day. It might be worth it to re-evaluate. It could also be that memset does some wacky stuff that is not safe to do when writing to Java arrays (I don't know off the top of my head).

@openjdk

openjdk bot commented Dec 7, 2023

@franz1981 this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout faster_set_mem
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Dec 7, 2023
@bridgekeeper

bridgekeeper bot commented Feb 1, 2024

@franz1981 This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@franz1981
Author

Keep it alive

@openjdk

openjdk bot commented Mar 13, 2024

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@asgibbons
Contributor

Hi. I've been asked to look into creating an intrinsic to help with this performance issue. I believe I'm a bit out of my depth in understanding the "unsafe" portion of this PR. I have a lot of experience with creation of intrinsics and I'd like to more fully understand the issue. Could someone please provide more information that would allow me to create such an intrinsic, including the "alignment" concern? Thanks.

@JornVernee
Member

JornVernee commented Mar 27, 2024

@asgibbons Take a look at LibraryCallKit::inline_unsafe_copyMemory in library_call.cpp. We essentially need to do the same thing for Unsafe::setMemory. Initially we don't need a special stub for setMemory. We can just have an extern "C" function that calls Copy::fill_to_memory_atomic like the unsafe implementation does (it takes care of alignment already).
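
As a rough illustration of what "taking care of alignment" can mean for a fill, here is a minimal sketch that picks the widest store width from the alignment of address|size (the 8/4/2 byte cases mentioned earlier in the thread) and replicates the fill byte into a wider pattern. It is only a model of the idea, not the HotSpot implementation; the class FillWidthSketch and the methods storeWidth and replicate are hypothetical names.

```java
final class FillWidthSketch {
    /** Widest store width in bytes that both the address and the size are a multiple of. */
    static int storeWidth(long address, long size) {
        long bits = address | size;
        if ((bits & 7) == 0) {
            return Long.BYTES;
        }
        if ((bits & 3) == 0) {
            return Integer.BYTES;
        }
        if ((bits & 1) == 0) {
            return Short.BYTES;
        }
        return 1; // unaligned: fall back to single-byte stores
    }

    /** Replicates one byte into every byte of a long, so one wide store writes 8 copies. */
    static long replicate(byte value) {
        return (value & 0xFFL) * 0x0101010101010101L;
    }

    public static void main(String[] args) {
        System.out.println(storeWidth(0x1000, 64)); // 8-byte aligned address and size
        System.out.println(storeWidth(0x1004, 12)); // only 4-byte aligned
        System.out.println(storeWidth(0x1000, 7));  // odd size, byte stores
        System.out.println(Long.toHexString(replicate((byte) 0xAB)));
    }
}
```

Using address|size means one check covers both the start address and the length: any low bit set in either one forces a narrower store.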

@franz1981
Author

franz1981 commented Mar 27, 2024

Thanks both... I have been swallowed by other priorities in my daily job, but I can resume my work next week or, with luck, by the end of this one, on Friday.
Splitting the work in 2 and providing a Java-only improvement first will clearly make the job easier for me, although I would like to learn how to write a C2 intrinsic. I am open to any suggestion and will gladly provide info to whoever wants to help, e.g. @asgibbons (thanks!)

@asgibbons
Contributor

I'm not sure this link will work, but please take a look at this initial cut. setMemory.

I'll try running the benchmark (MemorySegmentZeroUnsafe.java), however my initial run on top-of-tree showed ~15ns/op, so I'm not sure if it's the correct one. Can you please verify?

Benchmark                           (aligned)  (size)  Mode  Cnt   Score   Error  Units
MemorySegmentZeroUnsafe.panama           true       7  avgt   10  14.942 ± 0.017  ns/op
MemorySegmentZeroUnsafe.panama           true      64  avgt   10  15.386 ± 0.017  ns/op
MemorySegmentZeroUnsafe.panama          false       7  avgt   10  15.150 ± 0.011  ns/op
MemorySegmentZeroUnsafe.panama          false      64  avgt   10  15.156 ± 0.018  ns/op
MemorySegmentZeroUnsafe.unsafe           true       7  avgt   10  14.725 ± 0.016  ns/op
MemorySegmentZeroUnsafe.unsafe           true      64  avgt   10  14.939 ± 0.017  ns/op
MemorySegmentZeroUnsafe.unsafe          false       7  avgt   10  14.727 ± 0.017  ns/op
MemorySegmentZeroUnsafe.unsafe          false      64  avgt   10  14.728 ± 0.017  ns/op

Thanks.

@franz1981
Author

franz1981 commented Mar 27, 2024

Hi @asgibbons, did you try it against the changes I made in this PR? Can you print the assembly to check what's produced when running the unsafe version?

The numbers look strange

@asgibbons
Contributor

@franz1981 Things have apparently changed a bit since you proposed these changes (see here). In unsafe.cpp there is no longer any notion of a scoped LEAF and I don't really know how to implement what you were trying to do. I made Copy::fill_to_memory_atomic a static inline, which had no noticeable effect. For Unsafe.java, it always seems to be called with o equal to null, so there's no effect with that change either.

Could you please outline the steps you used to create the benchmark numbers from above? As I said, the baseline numbers don't seem to match what you found without your changes. Every invocation of the benchmark for me gives ~15ns/op, both with and without your changes.

@franz1981
Author

franz1981 commented Mar 27, 2024

both with and without your changes.

Sorry, I didn't understand that you were already using my branch... which CPU arch are you using?

I will re-run it and report which steps I performed, to help reproduce it

@asgibbons
Contributor

I think you misunderstand. With baseline I cannot reproduce your numbers. No changes by me (or you), just pure baseline. It would help to see the exact command line you're using to invoke the benchmark. I'm using:

java -jar ./build/linux-x86_64-server-release/images/test/micro/benchmarks.jar MemorySegmentZeroUnsafe -f 1

Or am I misunderstanding?

@franz1981
Author

franz1981 commented Mar 27, 2024

Yep, I have likely both set CPU affinity and used --localalloc via numactl.

The baseline numbers are not that different actually, just my CPU is faster (it's an AMD Ryzen 7950 with turbo boost disabled, the performance governor and the tuned network-latency profile on)

That said, the numbers seem pretty much in line with what I obtained for the baseline, just different because of the different CPUs used.

And for both 7 and 64 bytes, you indeed have a "base" overhead which doesn't depend on the number of bytes to zero.

@asgibbons
Contributor

I just pushed another set of code, which I think works with a debugged intrinsic. Look here if you'd like to review it.

Numbers got slightly worse. I'll be profiling soon.

@JornVernee
Member

I suggest setting up a test with -XX:+LogCompilation and seeing if the method using setMemory is actually being compiled successfully. Since you're dropping 2 thread state transitions with the intrinsic, you should see a pretty big difference on the ~15 ns time per call when it kicks in.

@JornVernee
Member

Looking at the patch, you have an issue with the type you're using for the runtime call (passing long, but the type you use is int). That will mess up the register allocator and bail out compilation.

Need to use:

diff --git a/src/hotspot/share/opto/library_call.cpp b/src/hotspot/share/opto/library_call.cpp
index 6acae8ac2c5..a386a6ab94b 100644
--- a/src/hotspot/share/opto/library_call.cpp
+++ b/src/hotspot/share/opto/library_call.cpp
@@ -4990,7 +4991,7 @@ bool LibraryCallKit::inline_unsafe_setMemory() {
                     StubRoutines::unsafe_setmemory(),
                     "unsafe_setmemory",
                     dst_type,
-                    dst_addr, size, byte);
+                    dst_addr, size XTOP, byte);

   store_to_memory(control(), doing_unsafe_access_addr, intcon(0), doing_unsafe_access_bt, Compile::AliasIdxRaw, MemNode::unordered);

diff --git a/src/hotspot/share/opto/runtime.cpp b/src/hotspot/share/opto/runtime.cpp
index ec0b6655d0a..3b97c096237 100644
--- a/src/hotspot/share/opto/runtime.cpp
+++ b/src/hotspot/share/opto/runtime.cpp
@@ -774,12 +774,13 @@ const TypeFunc* OptoRuntime::void_void_Type() {

 const TypeFunc* OptoRuntime::make_setmemory_Type() {
   // create input type (domain)
-  int num_args      = 3;
+  int num_args      = 4;
   int argcnt = num_args;
   const Type** fields = TypeTuple::fields(argcnt);
   int argp = TypeFunc::Parms;
   fields[argp++] = TypePtr::NOTNULL;    // dest
-  fields[argp++] = TypeInt::INT;        // size
+  fields[argp++] = TypeLong::LONG;      // size
+  fields[argp++] = Type::HALF;          // size
   fields[argp++] = TypeInt::INT;        // bytevalue
   assert(argp == TypeFunc::Parms+argcnt, "correct decoding");
   const TypeTuple* domain = TypeTuple::make(TypeFunc::Parms+argcnt, fields);

@asgibbons
Contributor

@JornVernee Holy cannoli! I guess I have a lot to learn about the whole compilation stuff. I made this change and now I get around 3ns per op! Thank you. I'm running tests on it now.

@franz1981 Do you think you could file a JBS on this? Looking at what I'm currently seeing, there is indeed value in intrinsifying.

@franz1981
Author

franz1981 commented Mar 29, 2024

I sadly have no super powers to file a JBS issue; this one was likely my very first ambitious and unfinished contribution ever (or the second one?) :/

I would be curious to see how the intrinsic looks compared to plain single long/int/short/byte stores - that's what we do on array allocations for zeroing, iirc (in the past it was using rep stos for "long" lengths, no idea nowadays, maybe AVX?)

Holy cannoli

That's something!:)

@JornVernee
Member

@asgibbons
Contributor

I just submitted a PR for this here

@asgibbons
Contributor

@JornVernee I'm not sure I'm doing the right thing in my intrinsic with respect to atomicity and/or unsafe boundary marks. Is a 64-byte write aligned to a 64-byte boundary considered atomic? I believe 8-byte writes are guaranteed atomic within the HW, but I'm not sure about 32- or 16-byte writes. I assume if they're in a cache line it should be atomic, but was wondering if you had further insight.

I also see some code UnsafeCopyMemoryMark that does stuff I don't understand. Is this necessary for setMemory as well?

Thanks.

@asgibbons
Contributor

@JornVernee I believe 8 byte writes are the only atomic writes. Do you have any insight into UnsafeCopyMemoryMark? I seem to be getting a failure in testStringLargerThanMaxInt():

test TestStringEncodingJumbo.testStringLargerThanMaxInt(): failure
java.lang.AssertionError: Expected IllegalArgumentException to be thrown, but nothing was thrown

@JornVernee
Member

UnsafeCopyMemoryMark is used to indicate continuation PCs for when e.g. a page fault occurs. See how it's used in the signal handlers for instance.

I don't see how that would relate to a failure in TestStringEncodingJumbo.testStringLargerThanMaxInt though. That seems more of a case of setMemory not working as intended. Perhaps the fill size is being truncated to 32 bits somewhere?
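
To make the truncation hypothesis concrete, here is a small sketch of what happens when a 64-bit size is narrowed to 32 bits somewhere along the call chain. It only shows the arithmetic and is not a diagnosis of the actual failure; the class SizeTruncationSketch and the method asInt are hypothetical.

```java
final class SizeTruncationSketch {
    /** Mimics a callee that treats the size as a 32-bit int although the caller passes a long. */
    static int asInt(long size) {
        return (int) size; // keeps only the low 32 bits
    }

    public static void main(String[] args) {
        long jumbo = (1L << 32) + 16;               // larger than Integer.MAX_VALUE
        System.out.println(asInt(jumbo));            // prints 16: most of the fill is skipped
        System.out.println(asInt(Integer.MAX_VALUE + 1L)); // prints -2147483648
    }
}
```

A fill that silently covers only the low 32 bits of the requested size would leave most of a jumbo buffer untouched, which is the kind of effect a test using sizes beyond Integer.MAX_VALUE could expose.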

@asgibbons
Contributor

asgibbons commented Apr 11, 2024

To update all on this thread, I now have a PR ready for review here. This PR makes an intrinsic stub for Unsafe::setMemory and performs 3x-5x faster than the non-intrinsified version. Some things I noted:

  • Moving the Copy::fill_to_memory_atomic function into the header file and adding the inline decorator causes failure in InternalErrorTest.java for the linux 32-bit platform. I spent 3 days trying to determine exactly why, but couldn't find the reason. I moved it back into the .cpp file.
  • I added supporting code in all the os_* files for all platforms, so adding a stub for each should be easy.
  • Adding the Java code in setMemory for small sizes only performed better than the intrinsic for 1 and 2 byte fills. The intrinsic performed at least as well or better for all other sizes.
  • I added unsafe marks within the intrinsic for proper handling of SIGBUS errors during a fill.

This has all been tested with at least tier-1.

@franz1981
Author

I can close this thanks to the work done by @asgibbons and others!

@franz1981 franz1981 closed this Apr 13, 2024
@asgibbons
Contributor

@franz1981 Now that this change has been integrated, can you please run your benchmark (not the JMH - already done) to see if you get improvement? It would be nice to know. Thanks!


Labels

core-libs core-libs-dev@openjdk.org hotspot hotspot-dev@openjdk.org merge-conflict Pull request has merge conflict with target branch

4 participants