Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a benchmark for strlen using Foreign Linker API #454

Closed

Conversation

@mcimadamore
Copy link
Collaborator

@mcimadamore mcimadamore commented Feb 17, 2021

I've been spending some time looking into this issue:

https://bugs.openjdk.java.net/browse/JDK-8261828

And, to understand better the problem, I put together an hopefully comprehensive benchmark of the strlen function; it turns out that the strlen call itself is fast, and it's the conversion from Java to native string where the benchmark spends most of its time.

While playing with the benchmark, I came up with alternative ways to do this conversion which greatly speed up the benchmark results, even surpassing (at least on my machine) what's possible with JNI. Jorn and I think that, for future references, it would be a good idea to include this benchmark in our suite.

For the curious, there are many factors which make the default CLinker::toCString go slower than expected:

  • allocating a fresh segment on each iteration is expensive, because it takes two native call (malloc, memset), plus a bunch of CAS to reserve memory in the Java runtime
  • freeing the segment on each iteration is equally expensive - one native call (free), plus again, some CAS to unreserve memory
  • bulk copy is fast, but again requires a native call
  • all in all, we need 4 native calls per iteration (malloc, memset, copy, free) each adding cost when it comes to state transitions

In other words, the advantage of JNI here is that (i) the level of safety provided by JNI is lower (e.g. the runtime doesn't need to track e.g. allocated memory, which segments do); also (ii) when we call the JNI-ified strlen function, the malloc, free, copy happen when we're in native code already - which means less state transitions are required.

Note that we can completely eliminate (i) basically by creating restricted segments using CLinker::allocateMemoryRestricted (which does a plain malloc). We can also eliminate (ii) by creating trivial function descriptors for the calls to malloc/free/strlen, thereby removing cost associated with state transitions there. Both routes are tested in the benchmark (note that they both requires some willingness to embrace restricted methods). I have put together a variant which shows how NativeScope can be used to speed allocation up (which works really well for small strings, and is not restricted).

What are the lessons learned for plain CLinker::toCString ?

  • While the logic is generally fast, all state transitions and unsafe calls are killing performance in such a tight scenario; perhaps worth considering intrinsifying Unsafe::allocateMemory/copyMemory/setMemory/freeMemory.
  • The way to go, performance-wise is not to rely on the default malloc-based allocation. This is where Panama has a big edge over JNI, whose allocation logic is fixed. Proposals such as the one described in [1] will make passing custom allocators to CLinker::toCString easier, so that clients can decide which allocation strategy best fits their use case.

[1] - https://inside.java/2021/01/25/memory-access-pulling-all-the-threads/


Progress

  • Change must not contain extraneous whitespace
  • Change must be properly reviewed

Reviewers

Download

$ git fetch https://git.openjdk.java.net/panama-foreign pull/454/head:pull/454
$ git checkout pull/454

@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Feb 17, 2021

👋 Welcome back mcimadamore! A progress list of the required criteria for merging this PR into foreign-memaccess+abi will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

Loading

@mcimadamore
Copy link
Collaborator Author

@mcimadamore mcimadamore commented Feb 17, 2021

For the records, benchmark results on my machine look as below:

Benchmark                                (size)  Mode  Cnt    Score   Error  Units
StrLenTest.jni_strlen                         5  avgt   30   47.952 ? 0.899  ns/op
StrLenTest.jni_strlen                        20  avgt   30   61.918 ? 2.668  ns/op
StrLenTest.jni_strlen                       100  avgt   30  135.449 ? 1.454  ns/op
StrLenTest.panama_strlen                      5  avgt   30  115.883 ? 2.194  ns/op
StrLenTest.panama_strlen                     20  avgt   30  114.238 ? 1.896  ns/op
StrLenTest.panama_strlen                    100  avgt   30  133.208 ? 2.056  ns/op
StrLenTest.panama_strlen_scope                5  avgt   30   34.467 ? 0.596  ns/op
StrLenTest.panama_strlen_scope               20  avgt   30   50.872 ? 2.357  ns/op
StrLenTest.panama_strlen_scope              100  avgt   29   89.604 ? 3.675  ns/op
StrLenTest.panama_strlen_unsafe               5  avgt   30   52.222 ? 3.626  ns/op
StrLenTest.panama_strlen_unsafe              20  avgt   30   55.937 ? 2.125  ns/op
StrLenTest.panama_strlen_unsafe             100  avgt   30   67.084 ? 0.671  ns/op
StrLenTest.panama_strlen_unsafe_trivial       5  avgt   30   29.762 ? 0.443  ns/op
StrLenTest.panama_strlen_unsafe_trivial      20  avgt   30   36.156 ? 0.369  ns/op
StrLenTest.panama_strlen_unsafe_trivial     100  avgt   30   59.830 ? 4.222  ns/op

Loading

@openjdk openjdk bot added the rfr label Feb 17, 2021
@mlbridge
Copy link

@mlbridge mlbridge bot commented Feb 17, 2021

Webrevs

Loading

Copy link
Member

@JornVernee JornVernee left a comment

Looks good, besides a warning in the native code (see inline comment).

Numbers on my Windows machine look like this:

Benchmark                                (size)  Mode  Cnt    Score     Error  Units
StrLenTest.jni_strlen                         5  avgt   30  158.912 �  11.886  ns/op
StrLenTest.jni_strlen                        20  avgt   30  206.928 �   3.980  ns/op
StrLenTest.jni_strlen                       100  avgt   30  405.025 �  10.837  ns/op
StrLenTest.panama_strlen                      5  avgt   30  211.031 �   3.851  ns/op
StrLenTest.panama_strlen                     20  avgt   30  234.639 �  27.133  ns/op
StrLenTest.panama_strlen                    100  avgt   30  235.827 �   4.062  ns/op
StrLenTest.panama_strlen_scope                5  avgt   30   65.311 �   4.032  ns/op
StrLenTest.panama_strlen_scope               20  avgt   30   96.457 �   6.595  ns/op
StrLenTest.panama_strlen_scope              100  avgt   24  157.490 �  24.683  ns/op
StrLenTest.panama_strlen_unsafe               5  avgt   30  135.766 �   1.252  ns/op
StrLenTest.panama_strlen_unsafe              20  avgt   30  154.713 �   4.510  ns/op
StrLenTest.panama_strlen_unsafe             100  avgt   30  152.058 �   2.003  ns/op
StrLenTest.panama_strlen_unsafe_trivial       5  avgt   30  119.331 �   6.429  ns/op
StrLenTest.panama_strlen_unsafe_trivial      20  avgt   30  127.749 �   1.932  ns/op
StrLenTest.panama_strlen_unsafe_trivial     100  avgt   30  129.092 �   0.967  ns/op

(timings seem a bit jittery)

Loading


JNIEXPORT jint JNICALL Java_org_openjdk_bench_jdk_incubator_foreign_StrLenTest_strlen(JNIEnv *const env, const jclass cls, const jstring text) {
const char *str = (*env)->GetStringUTFChars(env, text, NULL);
int len = strlen(str);
Copy link
Member

@JornVernee JornVernee Feb 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm seeing a warning here: conversion from 'size_t' to 'int', possible loss of data which makes the compilation fail. An explicit cast to int fixes that.

Loading

@openjdk
Copy link

@openjdk openjdk bot commented Feb 17, 2021

@mcimadamore This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

Add a benchmark for strlen using Foreign Linker API

Reviewed-by: jvernee

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the foreign-memaccess+abi branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the foreign-memaccess+abi branch, type /integrate in a new comment.

Loading

@openjdk openjdk bot added the ready label Feb 17, 2021
@mcimadamore
Copy link
Collaborator Author

@mcimadamore mcimadamore commented Feb 17, 2021

/integrate

Loading

@openjdk openjdk bot closed this Feb 17, 2021
@openjdk openjdk bot added integrated and removed ready rfr labels Feb 17, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Feb 17, 2021

@mcimadamore Pushed as commit 2991213.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Loading

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
2 participants