Draft: Improve Unsafe::setMemory for small writes #16760
Hi @franz1981, welcome to this OpenJDK project and thanks for contributing! We do not recognize you as a Contributor and need to ensure you have signed the Oracle Contributor Agreement (OCA). If you have not signed the OCA, please follow the instructions and fill in your GitHub username in the "Username" field of the application. Once you have signed the OCA, please let us know with a comment. If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact; please use "Add GitHub user franz1981" as the summary for the issue. If you are contributing this work on behalf of your employer and your employer has signed the OCA, please let us know with a comment.
@franz1981 The following labels will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.
If it is to set just a few bytes then I don't think this has gone far enough; an intrinsic would definitely be better. Another approach is to implement this in pure Java: I think for large buffers the autovectorizer is pretty good, and for small ones #16245 will help tremendously.
Does 1.5 ns matter that much? I believe that if every nanosecond counts, then that particular piece of code will surely be C2-compiled.
Fully agree: I've looked at the produced ASM (which I can report, but it is platform specific) and I can say that the "Improving Java side for fill pattern and save JNI leafs" variant is very compact and isn't far from what an optimal intrinsic would produce (for <= 64 byte writes), meaning that the numbers obtained (although limited by being a microbench...) are a good upper bound for what we could achieve, with the additional benefit for a proper intrinsic of having more precise cutoff value(s), if any.
This is something I think is valuable: getting great performance out of a common mechanism is exactly a good way to improve things elsewhere, especially if something won't prove to be as good as expected. The sole "concern" here is that "safe autovectorization" still performs hoisted bound checks, which we perform the same way. If I go down that route (maybe in a separate branch, to compare), we could basically drop the existing native Unsafe path and just perform a byte-by-byte copy, without any "trick" to perform batch writes using bigger strides?
It is a fair point, although I've reported it for the record: at least now I know that it is a small number, but a difference exists.
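For reference, the "batch writes using bigger strides" idea discussed above can be sketched on the heap in plain Java (class and method names here are mine, not from the PR): broadcast the fill byte into a long, store it in 8-byte strides via a byte-array view VarHandle, and finish the tail byte-by-byte. This is roughly the shape an autovectorizer or an intrinsic would aim for:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class SmallFill {
    // View a byte[] as longs; plain get/set are supported even when misaligned.
    private static final VarHandle LONGS =
        MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.nativeOrder());

    // Fill len bytes of dst starting at off with value, using 8-byte strides.
    static void fill(byte[] dst, int off, int len, byte value) {
        long pattern = (value & 0xFF) * 0x0101010101010101L; // broadcast byte into a long
        int i = 0;
        for (; i + 8 <= len; i += 8) {
            LONGS.set(dst, off + i, pattern);
        }
        for (; i < len; i++) { // byte-by-byte tail, at most 7 iterations
            dst[off + i] = value;
        }
    }

    public static void main(String[] args) {
        byte[] buf = new byte[32];
        fill(buf, 3, 13, (byte) 0xAB); // unaligned start, non-multiple-of-8 length
        for (int i = 3; i < 16; i++) {
            if (buf[i] != (byte) 0xAB) throw new AssertionError("byte " + i);
        }
        System.out.println("ok");
    }
}
```

Where the cutoff between such a Java loop and the native/intrinsic path should sit is exactly the open question in this thread, and would need benchmarking.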
I believe improvements are being made to the post-loop vectorizer, which will transform the post-loop into a masked vector load/store instead of a loop. You can follow its development in #14581. For the nitpicking,
Thanks @merykitty for pointing out the other ongoing work on this.
For 3: I know that having specialized paths isn't ideal, but if zeroing is really the most common fill case, why spend cycles analyzing (and generating) a fill pattern when the JIT can detect and reuse the same one used for zeroing memory on allocated instances?
/covered
Thank you! Please allow for a few business days to verify that your employer has signed the OCA. Also, please note that pull requests that are pending an OCA check will not usually be evaluated, so your patience is appreciated!
Hey, I've had a look. I like where this is going, but I don't think the approach of dropping the thread state transition from the native code is right. If we want to avoid the thread state transitions (both from Java -> native and native -> VM), then the best option is to intrinsify. This should be relatively straightforward to do by copying what we already do for existing intrinsics.
Either way, I think having a Java-based version for smaller byte sizes is an interesting direction to explore, since it avoids the indirect call and makes everything fully visible to the JIT.
Thanks @JornVernee for the pointers! I like your proposal and I can give it a shot in a separate branch too; I really don't know how much of the Java-side changes to keep to address the "small writes" part of the problem. I could do it this way too:
I think it also depends on what kind of improvement we see from the intrinsification. If that's fast enough, we probably want a lower cutoff for the Java implementation. I think we should intrinsify in C2 first, and then look at providing Java versions for small sizes, since whether that's worth it depends on whether it's faster than just doing the native call. It might also be beneficial for copyMemory and copySwapMemory to have a Java-based version. I suggest splitting the work into 2 PRs: 1. one that intrinsifies setMemory, and 2. one that has the Java implementations.
That code is very old, maybe
@franz1981 this pull request can not be integrated into master due to a merge conflict. To resolve it, run the following commands in your local repository:
git checkout faster_set_mem
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
@franz1981 This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
Keep it alive
❗ This change is not yet ready to be integrated.
Hi. I've been asked to look into creating an intrinsic to help with this performance issue. I believe I'm a bit out of my depth in understanding the "unsafe" portion of this PR. I have a lot of experience with the creation of intrinsics and I'd like to understand the issue more fully. Could someone please provide more information that would allow me to create such an intrinsic, including the "alignment" concern? Thanks.
@asgibbons Take a look at
Thanks both... I have been swallowed by other priorities in the day job, but I can resume my work next week, or luckily by the end of this one, on Friday.
I'm not sure this link will work, but please take a look at this initial cut at setMemory. I'll try running the benchmark (MemorySegmentZeroUnsafe.java); however, my initial run on top-of-tree showed ~15 ns/op, so I'm not sure it's the correct one. Can you please verify?
Thanks.
Hi @asgibbons, did you try it against the changes I made in this PR? Can you print the assembly to make sure of what's produced by running the unsafe version? The numbers look strange.
@franz1981 Things have apparently changed a bit since you proposed these changes (see here). Could you please outline the steps you used to create the benchmark numbers from above? As I said, the baseline numbers don't seem to match what you found without your changes: every invocation of the benchmark for me gives ~15 ns/op, both with and without your changes.
Sorry, I didn't understand that you were already using my branch.. which CPU arch are you using? I will re-run it, reporting which steps I have performed, to help reproduce it.
I think you misunderstand. With baseline I cannot reproduce your numbers. No changes by me (or you), just pure baseline. It would help to see the exact command line you're using to invoke the benchmark. I'm using:
Or am I misunderstanding?
Yep, I have likely set both the affinity and --localalloc via numactl. The baseline numbers are not that different actually, my CPU is just faster (it's an AMD Ryzen 7950 with turbo boost disabled, the performance governor, and the tuned network-latency profile on). That said, the numbers seem pretty much in line with what I have obtained for the baseline, just different because of the different CPUs used. And for both 7/64 bytes, you indeed have a "base" overhead which doesn't depend on the number of bytes to zero.
I just pushed another set of code, which I think works, with a debugged intrinsic. Look here if you'd like to review it. Numbers got slightly worse; I'll be profiling soon.
I suggest setting up a test with
Looking at the patch, you have an issue with the type you're using for the runtime call (passing a long, but the type you use is int). That will mess up the register allocator and bail out of compilation. You need to use:
@JornVernee Holy cannoli! I guess I have a lot to learn about the whole compilation stuff. I made this change and now I get around 3 ns per op! Thank you. I'm running tests on it now. @franz1981 Do you think you could file a JBS issue on this? Looking at what I'm currently seeing, there is indeed value in intrinsifying.
I sadly have no superpowers to file JBS issues; this one was likely my very first ambitious and unfinished contribution ever (or the second one?) :/ I would be curious to see how the intrinsic looks compared to a plain single long/int/short/byte write - that's what we do on array allocations for zeroing, iirc (in the past it was using rep stos for "long" lengths; no idea nowadays, maybe AVX?)
That's something! :)
I've filed: https://bugs.openjdk.org/browse/JDK-8329331
I just submitted a PR for this here
@JornVernee I'm not sure I'm doing the right thing in my intrinsic with respect to atomicity and/or unsafe boundary marks. Is a 64-byte write aligned to a 64-byte boundary considered atomic? I believe 8-byte writes are guaranteed atomic by the HW, but I'm not sure about 32- or 16-byte writes. I assume that if they're within a cache line they should be atomic, but I was wondering if you had further insight. I also see some code. Thanks.
@JornVernee I believe 8-byte writes are the only atomic writes. Do you have any insight into
I don't see how that would relate to a failure in
To update all on this thread: I now have a PR ready for review here. This PR makes an intrinsic stub for
This has all been tested with at least tier 1.
I can close this, thanks to the work done by @asgibbons and others!
@franz1981 Now that this change has been integrated, can you please run your benchmark (not the JMH - that's already done) to see if you get an improvement? It would be nice to know. Thanks!
The motivation behind this PR is explained at netty/netty#13692 and netty/netty#13693 which, due to the limits of Netty being "just" a Java library, couldn't solve the described performance issue at its root, i.e. the performance of zeroing a few bytes is cannibalized by the JNI bridge crossing and the chain of method calls.
Obviously this issue is magnified by two facts: the call indirection is too costly, and the original code (related to Copy::fill_*) hasn't changed in a long time, as nobody has noticed this performance issue; looking at which code paths involve zeroing via memset clarifies why: previously, there was no "hot path" using it.
In recent JDK releases, things have changed due to the introduction of https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/foreign/MemorySegment.html#fill(byte), exposing OpenJDK users to the poor performance of this method for small writes.
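To make that hot path concrete, here is a minimal, self-contained use of MemorySegment.fill(byte) (this needs JDK 22+, or 21 with preview features enabled; the class name is mine). Every call like this one bottoms out in Unsafe::setMemory, which is where the JNI transition cost discussed in this PR is paid:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class FillHotPath {
    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            // A tiny off-heap segment: the small-write case this PR is about.
            MemorySegment seg = arena.allocate(7);
            // fill(byte) ends up in Unsafe::setMemory; for 7 bytes the fixed
            // per-call overhead dominates the actual memory writes.
            seg.fill((byte) 0x5A);
            for (long i = 0; i < seg.byteSize(); i++) {
                if (seg.get(ValueLayout.JAVA_BYTE, i) != (byte) 0x5A)
                    throw new AssertionError("byte " + i + " not filled");
            }
            System.out.println("filled " + seg.byteSize() + " bytes");
        }
    }
}
```

The MemorySegmentZeroUnsafe benchmark mentioned later in the thread measures essentially this call for small sizes.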
There are other Java frameworks in the High-Frequency-Trading land which have exposed users to this API and tried to improve things from the Java side, e.g. https://github.com/real-logic/agrona/blob/cfdda89a367f5543e13168ab55361afa1cbef4a5/agrona/src/main/java/org/agrona/AbstractMutableDirectBuffer.java#L77-L120
I've prepared these changes to explore the different options I'm currently capable of (I'm not an OpenJDK developer) and I'm open to learning more interesting ways to solve it, some of which I explore below.
First, a summary of what I've tried here:
- making Copy::fill_to_memory_atomic inlinable
- replacing UNSAFE_ENTRY with UNSAFE_LEAF, specializing it to save trying to resolve JNI handles (but I didn't perform some of the checks in src/hotspot/share/prims/unsafe.cpp, lines 393 to 396 at e055fae)
- leveraging some existing Java/C2 code paths (e.g. intrinsify_fill - although still getting bound checks :"( ) to avoid any JNI calls, using a hardcoded cutoff value, which cannot survive in a real commit
The numbers below are just indicative of the type of performance improvement I would like to achieve, not including the heap use case (which I still have to implement in the JMH benchmark) nor trying to "shuffle" the JIT decisions with the additional compiled branches.
baseline (25f9af9):
Faster Unsafe::setMemory0:
Improving Java side for fill pattern and save JNI leafs:
Given that relying on the JIT to do its job means assuming C2 always kicks in, I suppose we care about getting a figure of performance when just C1 is used, which led me to set -XX:+TieredCompilation -XX:TieredStopAtLevel=1 in the jvmArgs of the benchmark, getting:
Improving Java side for fill pattern and save JNI leafs:
while, with Faster Unsafe::setMemory0:
Clearly, the panama version suffers from the MemorySegment abstraction around Unsafe, but the unsafe version is slightly regressed in the "Java optimized" version, a thing that will likely be more visible with smaller sizes.

Progress
Integration blockers
The pull request title should be of the format "issue number: message". (failed with the updated jcheck configuration in this pull request)

Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16760/head:pull/16760
$ git checkout pull/16760
Update a local copy of the PR:
$ git checkout pull/16760
$ git pull https://git.openjdk.org/jdk.git pull/16760/head

Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 16760
View PR using the GUI difftool:
$ git pr show -t 16760

Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16760.diff