-
Notifications
You must be signed in to change notification settings - Fork 6.1k
8318364: Add an FFM-based implementation of harfbuzz OpenType layout #15476
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
👋 Welcome back prr! A progress list of the required criteria for merging this PR into |
@prrace did you check how this change affects the performance, especially startup? I have experimented with Panama for littlecms: https://bugs.openjdk.org/browse/JDK-8313344 and found that the biggest issue is a cold start, 8 ms vs 100ms. An example of the report: https://jmh.morethan.io/?gists=4df0f27789cc4b0ca91fc5b2d677fe39,900b547e073cc1567971f46bfea151db |
VarIntLayout.withName("var2") | ||
).withName("hb_glyph_info_t"); | ||
|
||
private static VarHandle getVarHandle(MemoryLayout layout, String name) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method could take a SequenceLayout
instead of a MemoryLayout
as it only works for SeequenceLayouts.
private static MethodHandle jdk_hb_shape_handle; | ||
|
||
private static FunctionDescriptor get_nominal_glyph_fd; | ||
private static MethodHandle get_nominal_glyph_mh; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Declaring all these final
would improve performance. For example
private static final FunctionDescription GET_NOMINAL_GLYPH_MH;
private static MethodHandle dispose_face_handle; | ||
private static MethodHandle jdk_hb_shape_handle; | ||
|
||
private static FunctionDescriptor get_nominal_glyph_fd; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the *_fd
variables could be converted into local variables.
* int8_t i8[4]; | ||
* }; | ||
*/ | ||
private static final UnionLayout VarIntLayout = MemoryLayout.unionLayout( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was a bit confused by the naming. Suggest VarIntLayout
-> VAR_INT_LAYOUT
|
||
public class HBShaper { | ||
|
||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice with the original C struct as a comment.
return 0; | ||
} | ||
byte[] data = font2D.getTableBytes(tag); | ||
if (data == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No setting data_ptr_out to NULL here?
* so it will be freed by the caller using native free - when it is | ||
* done with it. | ||
*/ | ||
MemorySegment data_ptr = data_ptr_out.reinterpret(ADDRESS.byteSize()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest using .asSlice()
here as it is an unrestricted and safer method.
this.font2D = font; | ||
} | ||
|
||
private synchronized MemorySegment getFace() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are already synchronized via the faceMap
.
|
||
Font2D font2D = scopedFont2D.get(); | ||
int glyphID = font2D.charToGlyph(unicode); | ||
MemorySegment glyphIDPtr = glyph.reinterpret(4); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a general comment, it is better to use slicing rather than reinterpret.
private static final ScopedValue<Font2D> scopedFont2D = ScopedValue.newInstance(); | ||
private static final ScopedValue<FontStrike> scopedFontStrike = ScopedValue.newInstance(); | ||
private static final ScopedValue<GVData> scopedGVData = ScopedValue.newInstance(); | ||
private static final ScopedValue<Point2D.Float> scopedStartPt = ScopedValue.newInstance(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using only one ScopedValue and storing a record(Font2D font2D, FontStrike fontStrike, GVData gvData, Point2D.float point2d) {}
of the various objects will provide much better performance.
* shaping can locate the correct instances of these to query or update. | ||
* The alternative of creating bound method handles is far too slow. | ||
*/ | ||
ScopedValue.where(scopedFont2D, font2D) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a static ConcurrentHashMap<Long, Record>
would provide better performance. We could clean up the key when the value is used. Use the Thread.threadId() as the key.
Hmm. I didn't notice this comment until today. No emails for drafts ? But yes, I had already measured startup + warmup and noticed that's an issue. |
No! That is really interesting proposal and discussion! BTW this PR is not in the draft state. |
@prrace This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration! |
float startY, | ||
int flags, | ||
int slot, | ||
hb_font_get_nominal_glyph_func_t nominal_fn, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't be necessary to pass all the functions here. Note that the upcalls are now effectively static (due to the use of scope values). It would be more efficient to create the array of functions once and for all in Java code (either directly, by creating a memory segment and storing all the function pointers in there, or indirectly, by calling the _hb_jdk_get_font_funcs
native function). But, we don't need to create a function array each time we call the shape
function (as all the functions in the array are going to be the same after all). If you do that, you can replace all the _fn
parameters in here with a single function pointer array parameter.
float startX = (float)startPt.getX(); | ||
float startY = (float)startPt.getY(); | ||
|
||
MemorySegment matrix = arena.allocateArray(JAVA_FLOAT, mat.length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be an overload of allocateArray
which takes a Java array directly and then copies it off-heap after allocation. In Java 22 this method is called allocateFrom
and is much more optimized (as it avoids zeroing of memory). But, even in 21, the call to copy seems redundant - you can just use the correct overload of SegmentAllocator::allocateArray
|
||
for (int i=0; i<glyphCount; i++) { | ||
int storei = i + initialCount; | ||
int cluster = (int)clusterHandle.get(glyphInfoArr, i) - offset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the var handle calls in this loop are not exact - e.g. they use a int
offset instead of a long
one. Because of this, the memory access cannot be fully optimized. Adding a cast to long
on all offsets coordinates yields a significant performance boost.
To avoid issues like these, it is recommended to set up the var handle using the VarHandle::withInvokeExactBehavior
method, which will cause an exception to be thrown in case there's a type mismatch (similar to MethodHandle::invokeExact
).
).withName("hb_glyph_info_t"); | ||
|
||
private static VarHandle getVarHandle(StructLayout struct, String name) { | ||
VarHandle h = struct.varHandle(PathElement.groupElement(name)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that, strictly speaking, these combiners are not required. If you just call MemoryLayout::varHandle
, you get back a var handle that takes a MemorySegment
and a long
offset. So, you can, if you want, adapt the var handle so that the offset parameter becomes something else. But you could also just leave the var handle as is. Then, in the loop that is doing the access, you can do this:
for (int i = 0 ; i < limit ; i++) {
x_offsetHandle.get(segment, PositionLayout.scale(0, i));
y_offsetHandle.get(segment, PositionLayout.scale(0, i));
...
}
That is, use the offset hole to your advantage, to pass the struct base offset (obtained by scaling the enclosing struct layout by the value of the loop induction variable).
(That said, I understand that working with logical indices is a common operation, and that this is made a bit harder by the recent API changes. We should consider, as @JornVernee mentioned, adding back a more general MemoryLayout::arrayElementVarHandle
which will give you the var handle you need - with coordinates MemorySegment
and long
- a logical index, not an offset).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A PR which adds MemoryLayout::arrayElementVarHandle
can be found here:
#16272
With this, you can call the new method in order to create the var handle. The returned var handle will accept two long coordinate - the first is a base offset (as usual), the second in a logical index (what you need). The PR also adds plenty of narrative text describing how access to variable-length arrays should be performed using layouts (and also shows cases where the offset parameter is used in a non-trivial fashion).
@@ -148,6 +148,7 @@ | |||
// module declaration be annotated with jdk.internal.javac.ParticipatesInPreview | |||
exports jdk.internal.javac to | |||
java.compiler, | |||
java.desktop, // for ScopedValue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The indentation looks odd..?
Since we plan to import it into jdk22, do you have some performance data to share? any positive or negative effects of this migration? |
There's three phases - (1) startup, (2) warmup and (3) warmed up performance. JNI has minimal startup / warmup cost, getting to warmed up performance right away. There's a cost to the first time some code in JDK initialises the core FFM. So we have somewhere around a fixed 125ms startup cost for the FFM case - as measured on my Mac, And there is some potential for that code to get faster some day The FFM path then needs to be warmed. Once warmed up, FFM is always as fast or faster than JNI. 20% faster is typical as JNIlayoutCnt=1 total=3ms <<< JNI very fast to start up FFMlayoutCnt=1 total=186ms << // FFM slow to start up, includes 75ms core FFM, 35-40 varhandles + no JIT yet |
That looks unfortunate. I guess if we will start to use ffm in other places we can easily spend of 1 second budget on startup=(
It looks strange that 16000 calls are not enough to warmup, and the difference is so large. |
Yes, this case is sufficiently uncommon, that it is OK, and is a decent way to help us track improvements to FFM.
I am not a C2 expert, (not even an amateur), I just assume that it takes a lot of calls to be fully optimized. |
@JornVernee this looks suspicious and seems unrelated to the cold startup issues we discussed before. |
I suspect the benchmark might be measuring the java.lang.foreign code needing to be loaded as part of the benchmark. While for JNI, the initialization of all the JNI machinery is included in the startup of the application. Was the running time of the entire application/process measured? Or only from the start of the Secondly, we have not spent a lot of time optimizing the startup performance of FFM yet. There are things we could do such as pre-generating classes during jlink-time, similar to what we do for java.lang.invoke/lambda implementation classes. |
Yes, that's correct, it includes all the startup costs in that number. |
C2/fully optimized compilation kicks in after 10 000 calls, and is asynchronous by default (i.e. the rest of the application keeps running). So, 12,000 sounds relatively normal to me. |
@prrace This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 37 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
|
||
hb_buffer_destroy (buffer); | ||
hb_font_destroy(hbfont); | ||
if (features != NULL) free(features); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guess coding style warrants braces { and next statement in separate line...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
||
public class LayoutCompatTest { | ||
|
||
static String jni = "jni.txt"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems test is failing without fix with Exception in jtreg
java.io.FileNotFoundException: jni.txt (The system cannot find the file specified)
Also in standalone mode. I was expecting it will fail with RuntimeException "files differ byte offset"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why it matters what this test does in a JDK without the fix, although logically, since the new system property isn't known, both cases would end up using JNI, and I'd expect the test to pass. I am not sure why you say it should fail.
And I can't reproduce your first problem, I ran this test in jtreg on an unmodified JDK22 and it passed, as I expected, for the reason given above.
/integrate |
Going to push as commit f69e665.
Your commit was automatically rebased without conflicts. |
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/15476/head:pull/15476
$ git checkout pull/15476
Update a local copy of the PR:
$ git checkout pull/15476
$ git pull https://git.openjdk.org/jdk.git pull/15476/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 15476
View PR using the GUI difftool:
$ git pr show -t 15476
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/15476.diff
Webrev
Link to Webrev Comment