-
Notifications
You must be signed in to change notification settings - Fork 600
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize harfbuzz big integer conversions #1398
Conversation
The idea is to start a discussion on how we may organize platform specific code. I'm still collecting data and doing experiments, I'm planning to upload all the results here: |
@behdad I'm planning next to validate the approach on android devices and will be reporting the results pretty soon. |
Reports are appreciated indeed. Thanks. Note that it's only 16bit read that matters. That's the only hot spot. |
src/hb-machinery.hh
Outdated
} | ||
inline operator Type (void) const | ||
{ | ||
#if defined(__x86_64) || defined(__ARM_NEON__) || defined(__ARM_NEON) | ||
return bswap_16(*(uint16_t *)this); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spaces around parantheses, like this: bswap_16 (*(uint16_t *) this)
src/hb-machinery.hh
Outdated
@@ -686,13 +689,22 @@ struct BEInt<Type, 2> | |||
typedef Type type; | |||
inline void set (Type V) | |||
{ | |||
#if defined(__x86_64) || defined(__ARM_NEON__) || defined(__ARM_NEON) | |||
uint16_t *ptr = (uint16_t *)v; | |||
*ptr = bswap_16(V); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Space before (
and after )
.
src/hb-machinery.hh
Outdated
} | ||
inline operator Type (void) const | ||
{ | ||
#if defined(__x86_64) || defined(__ARM_NEON__) || defined(__ARM_NEON) | ||
return bswap_16(*(uint16_t *)this); | ||
#else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We typically prefer to add return to the #ifdef
'd code and instead of using #else
, leave the defualt iplementation always compiled, just unreached.
src/hb-machinery.hh
Outdated
@@ -35,6 +35,9 @@ | |||
#include "hb-iter.hh" | |||
#include "hb-vector.hh" | |||
|
|||
#if defined(__x86_64) || defined(__ARM_NEON__) || defined(__ARM_NEON) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is unaligned access safe on NEON?
This needs a comment, why we only do it for some archs. Also, since the condition is duplicated multiple times, we should just define a symbol like HB_USE_BSWAP and condition on that. Like:
#ifndef HB_USE_BSWAP
# if ...
# define HB_USE_BSWAP 1
# else
# define HB_USE_BSWAP 0
#endif
later...
#if HB_USE_BSWAP
return ...
#endif
@behdad thanks for the feedback, I will upload an updated patch soon addressing the change requests. |
Codecov Report
@@ Coverage Diff @@
## master #1398 +/- ##
=========================================
- Coverage 76.3% 76.3% -0.01%
=========================================
Files 134 134
Lines 16408 16406 -2
=========================================
- Hits 12520 12518 -2
Misses 3888 3888
Continue to review full report at Codecov.
|
src/hb-machinery.hh
Outdated
@@ -34,7 +34,6 @@ | |||
|
|||
#include "hb-iter.hh" | |||
#include "hb-vector.hh" | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put back?
src/hb-machinery.hh
Outdated
@@ -691,6 +691,9 @@ struct BEInt<Type, 2> | |||
} | |||
inline operator Type (void) const | |||
{ | |||
#if defined(__GNUC__) || defined(__clang__) | |||
return __builtin_bswap16 (*(uint16_t *) this); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this version better. BUT, this still has the non-aligned access problem, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested on 3 Android devices with a 32bits build of Chrome (i.e. -march=armv7):
a) Nexus 4: ARMv7, released on 2012.
b) Nexus tablet 7, first generation: ARMv7, released on 2012.
c) Google Pixel 1 (cellphone): ARMv8, released on 2017.
And 3 ARM Linux boards (64bits builds):
a) Rock64: ARMv8 little cores (A53).
b) Marvel Machiatobin: ARMv8 big cores (A72).
c) Pi M4: big.LITTLE (2 A72, 4 A53).
The case of unaligned access on ARM is tricky. Unlike a vector load (e.g. 64bit or 128bits chunk), a REV16 instruction will operate in a scalar (16bits in this case) that will fit in a 32bits register.
For reference:
https://godbolt.org/z/Qk1UL6
In such scenario, it should work fine.
Finally, if the user is building HB for really older architectures (e.g. armv6), the compiler will know it and can adjust the _builtin_bswap16 for something that will be safe (i.e. similar to the portable C++ code).
Are you aware of people using HB in older than ARMv7 processors?
Finally, I made sure that all tests (i.e. make test) are running fine both on x86 and ARMv8@Linux.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Speaking of it, I recall you mentioned that the structs got be aligned by 1. May I ask the reason for it?
/me learning the way around the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Speaking of it, I recall you mentioned that the structs got be aligned by 1. May I ask the reason for it?
Because this structs are directly placed on mmap()ed font data. While many places in the font format it says 16-bit aligned or 32-bit aligned, there's no guarantee. Structs reference each other using byte offsets. It would not be acceptable to reject unaligned integers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the compiler will know it and can adjust the _builtin_bswap16 for something that will be safe (i.e. similar to the portable C++ code).
Portable C++ code does not allow unaligned access automatically...
However, this page: https://www.alfonsobeato.net/arm/how-to-access-safely-unaligned-data/ suggests that if we wrap uint16_t in a packed struct then compiler makes sure access is allowed. I'm happy to go with that. Then we essntially ifdef out our current BEInt<16> and swap this one. Or even better, just cast it to a packed struct and access the member? I'm happy with such solution.
Also, I was hoping that you have some benchmark numbers from ARM / Android.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification and the feedback, I really appreciate it.
\o/
I can try the suggested approach and report the results back.
Considering that we are approaching Thanksgiving, I will probably be able to resume this patch next week.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Numbers for A72 (big core): around 6% perf gain.
a ) Vanilla:
root@lcds-mb12:~/harfbuzz/release# ./run.sh
Performance counter stats for './hb-shape Roboto-Regular.ttf --text-file=thelittleprince.txt --font-funcs=ot --output-format= --output-file=/dev/null --num-iterations=10':
885.767680 task-clock (msec) # 0.999 CPUs utilized
10 context-switches # 0.011 K/sec
0 cpu-migrations # 0.000 K/sec
202 page-faults # 0.228 K/sec
1,771,496,170 cycles # 2.000 GHz
3,506,331,383 instructions # 1.98 insn per cycle
branches
7,388,064 branch-misses
0.886525697 seconds time elapsed
0.886454000 seconds user
0.000000000 seconds sys
root@lcds-mb12:~/harfbuzz/release# ./run.sh
Performance counter stats for './hb-shape Roboto-Regular.ttf --text-file=thelittleprince.txt --font-funcs=ot --output-format= --output-file=/dev/null --num-iterations=10':
885.495400 task-clock (msec) # 0.999 CPUs utilized
2 context-switches # 0.002 K/sec
0 cpu-migrations # 0.000 K/sec
199 page-faults # 0.225 K/sec
1,770,967,624 cycles # 2.000 GHz
3,506,436,015 instructions # 1.98 insn per cycle
branches
7,492,861 branch-misses
0.886090838 seconds time elapsed
0.886095000 seconds user
0.000000000 seconds sys
b) Patched:
root@lcds-mb12:~/harfbuzz/insane# ./run.sh
Performance counter stats for './hb-shape Roboto-Regular.ttf --text-file=thelittleprince.txt --font-funcs=ot --output-format= --output-file=/dev/null --num-iterations=10':
838.583520 task-clock (msec) # 0.999 CPUs utilized
18 context-switches # 0.021 K/sec
0 cpu-migrations # 0.000 K/sec
197 page-faults # 0.235 K/sec
1,677,106,507 cycles # 2.000 GHz
3,321,580,240 instructions # 1.98 insn per cycle
branches
6,404,630 branch-misses
0.839294213 seconds time elapsed
0.839174000 seconds user
0.000000000 seconds sys
root@lcds-mb12:~/harfbuzz/insane# ./run.sh
Performance counter stats for './hb-shape Roboto-Regular.ttf --text-file=thelittleprince.txt --font-funcs=ot --output-format= --output-file=/dev/null --num-iterations=10':
843.651680 task-clock (msec) # 0.999 CPUs utilized
3 context-switches # 0.004 K/sec
0 cpu-migrations # 0.000 K/sec
197 page-faults # 0.234 K/sec
1,687,269,415 cycles # 2.000 GHz
3,322,224,355 instructions # 1.97 insn per cycle
branches
6,522,149 branch-misses
0.844274633 seconds time elapsed
0.840283000 seconds user
0.004001000 seconds sys
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Numbers for A53(little core): 11% perf gain.
a) Vanilla:
adenilson@rock64:~/harfbuzz/release$ ./run.sh
Performance counter stats for './hb-shape Roboto-Regular.ttf --text-file=thelittleprince.txt --font-funcs=ot --output-format= --output-file=/dev/null --num-iterations=100':
10587.091629 task-clock (msec) # 0.996 CPUs utilized
37 context-switches # 0.003 K/sec
0 cpu-migrations # 0.000 K/sec
199 page-faults # 0.019 K/sec
13,666,898,364 cycles # 1.291 GHz
11,898,985,328 instructions # 0.87 insn per cycle
1,043,804,690 branches # 98.592 M/sec
127,238,754 branch-misses # 12.19% of all branches
10.625467130 seconds time elapsed
adenilson@rock64:~/harfbuzz/release$ ./run.sh
Performance counter stats for './hb-shape Roboto-Regular.ttf --text-file=thelittleprince.txt --font-funcs=ot --output-format= --output-file=/dev/null --num-iterations=100':
10656.032876 task-clock (msec) # 1.000 CPUs utilized
22 context-switches # 0.002 K/sec
0 cpu-migrations # 0.000 K/sec
195 page-faults # 0.018 K/sec
13,670,689,293 cycles # 1.283 GHz
11,909,799,250 instructions # 0.87 insn per cycle
1,045,676,315 branches # 98.130 M/sec
127,320,823 branch-misses # 12.18% of all branches
10.658215420 seconds time elapsed
b) Patched:
adenilson@rock64:~/harfbuzz/insane$ ./run.sh
Performance counter stats for './hb-shape Roboto-Regular.ttf --text-file=thelittleprince.txt --font-funcs=ot --output-format= --output-file=/dev/null --num-iterations=100':
9623.224586 task-clock (msec) # 1.000 CPUs utilized
24 context-switches # 0.002 K/sec
0 cpu-migrations # 0.000 K/sec
195 page-faults # 0.020 K/sec
12,324,671,746 cycles # 1.281 GHz
10,502,054,557 instructions # 0.85 insn per cycle
1,021,269,428 branches # 106.125 M/sec
129,355,571 branch-misses # 12.67% of all branches
9.627012948 seconds time elapsed
adenilson@rock64:~/harfbuzz/insane$ ./run.sh
Performance counter stats for './hb-shape Roboto-Regular.ttf --text-file=thelittleprince.txt --font-funcs=ot --output-format= --output-file=/dev/null --num-iterations=100':
9483.454130 task-clock (msec) # 1.000 CPUs utilized
23 context-switches # 0.002 K/sec
0 cpu-migrations # 0.000 K/sec
196 page-faults # 0.021 K/sec
12,290,291,346 cycles # 1.296 GHz
10,501,462,804 instructions # 0.85 insn per cycle
1,021,195,129 branches # 107.682 M/sec
129,261,068 branch-misses # 12.66% of all branches
9.485277320 seconds time elapsed
It seems that 3 bots have failed. Do I have to make the patch pass a build with -Wcast-align? |
Let's try with the packed struct and see if that helps? |
So, using |
The x86 version works, but then again, it does with current code as well: Ie. both generate a I'm actually not sure why you see a speedup on x86. On arm, looks like the compiler is playing it safe as well. |
If I compile for armv7 however, both packed struct and our current code generate a rev16 CPU op. So, again, not sure why you see any speedup. |
@behdad thanks for the feedback, I'm back from the Thanksgiving holidays and planning to resume this. Just to clarify, when you mentioned that the current code will generate the REV16 instruction, which compiler did you use? I've tried g++ 6.3, 7.3, 8.2 and clang 8.0 and couldn't observe the generation of the correct instruction without the use of bultins. |
I was just using the
Took me a while to reproduce: The trick is, the target line MUST say "armv7", not just "arm7", and mid section be "linux", not "none" or other. -target armv7-linux-gnueabihf -O3 -fno-inline Looks like someone wrote optimizations for that particular target string :))? |
So how you wanna to proceed with this? Should I upload an updated patch with attribute(packed)? It seems to also make gcc happy for ARM 32bits: https://godbolt.org/z/svzYpd |
|
Are you factoring all the involved variables: compiler (gcc, clang), architecture (armv7, armv8), build target (32bits x 64bits)? Also it is important to factor that the clang available on Compiler Explorer is the latest (i.e. clang trunk). Unless I'm mistaken, this is an example where gcc will generate the wrong code (not using REV16, but instead using UBFIZ) with the vanilla code (just enable the line BUILTIN): I think it makes sense to add the extra code (just 2 lines) to couch the compiler to do the right thing, given the performance benefits (https://bugs.chromium.org/p/chromium/issues/detail?id=907244#c2), seems like an easy 6% to 11% performance gain on ARM as also on Intel. Waiting for the compiler to be fixed can take a long time from my experience (even escalating the issue to ARM's clang team). And I'm not even talking about gcc, that certainly can take a very lonnnng time to get fixed. |
I'm fine adding the two extra lines. Just observing that the performance benefits you saw do not reproduce with |
Profiling showed that type conversions were adding considerable cycles in time spent doing text shaping. The idea is to optimize it using native processor instructions to help Blink layout performance. Doing further investigation revelead that compilers may not use the proper instruction on ARM 32bits builds (i.e. REV16). One way to insure that the generated ASM was ideal for both gcc/clang was using __builtin_bswap16. Added bonus is that we no longer need to test for CPU architecture.
It seems now the failing bots (i.e. alignment warning) are green. |
By the way, it is pretty impressive the infrastructure (i.e. bots, coverage, etc) in Harfbuzz. |
That's all @ebraminio's with help from others. It can be quite a PITA but yeah catches real stuff... |
Profiling showed that type conversions were adding considerable cycles in time
spent doing text shaping.
The idea is to optimize it using native processor instructions to help Blink
layout performance.
NOT for commit: still needs further testing on ARM (i.e. unaligned loads) and handling other OS-es (only tested on Linux).