Optimize harfbuzz big integer conversions #1398

Adenilson · 2018-11-20T22:51:15Z

Profiling showed that type conversions were adding considerable cycles in time
spent doing text shaping.

The idea is to optimize it using native processor instructions to help Blink
layout performance.

NOT for commit: still needs further testing on ARM (i.e. unaligned loads) and handling other OS-es (only tested on Linux).

Adenilson · 2018-11-20T22:53:22Z

The idea is to start a discussion on how we may organize platform specific code.

I'm still collecting data and doing experiments, I'm planning to upload all the results here:
https://bugs.chromium.org/p/chromium/issues/detail?id=907244

Adenilson · 2018-11-20T22:57:28Z

@behdad I'm planning next to validate the approach on android devices and will be reporting the results pretty soon.

behdad · 2018-11-20T22:59:15Z

Reports are appreciated indeed. Thanks.

Note that it's only 16bit read that matters. That's the only hot spot.

behdad · 2018-11-21T17:25:15Z

src/hb-machinery.hh

  }
  inline operator Type (void) const
  {
+#if defined(__x86_64) || defined(__ARM_NEON__) || defined(__ARM_NEON)
+    return bswap_16(*(uint16_t *)this);


Spaces around parantheses, like this: bswap_16 (*(uint16_t *) this)

behdad · 2018-11-21T17:25:39Z

src/hb-machinery.hh

@@ -686,13 +689,22 @@ struct BEInt<Type, 2>
  typedef Type type;
  inline void set (Type V)
  {
+#if defined(__x86_64) || defined(__ARM_NEON__) || defined(__ARM_NEON)
+    uint16_t *ptr = (uint16_t *)v;
+    *ptr = bswap_16(V);


Space before ( and after ).

behdad · 2018-11-21T17:26:23Z

src/hb-machinery.hh

  }
  inline operator Type (void) const
  {
+#if defined(__x86_64) || defined(__ARM_NEON__) || defined(__ARM_NEON)
+    return bswap_16(*(uint16_t *)this);
+#else


We typically prefer to add return to the #ifdef'd code and instead of using #else, leave the defualt iplementation always compiled, just unreached.

behdad · 2018-11-21T17:29:13Z

src/hb-machinery.hh

@@ -35,6 +35,9 @@
 #include "hb-iter.hh"
 #include "hb-vector.hh"

+#if defined(__x86_64) || defined(__ARM_NEON__) || defined(__ARM_NEON)


Is unaligned access safe on NEON?

This needs a comment, why we only do it for some archs. Also, since the condition is duplicated multiple times, we should just define a symbol like HB_USE_BSWAP and condition on that. Like:

#ifndef HB_USE_BSWAP # if ... # define HB_USE_BSWAP 1 # else # define HB_USE_BSWAP 0 #endif later... #if HB_USE_BSWAP return ... #endif

Adenilson · 2018-11-21T19:38:45Z

@behdad thanks for the feedback, I will upload an updated patch soon addressing the change requests.

codecov · 2018-11-21T21:17:30Z

Codecov Report

Merging #1398 into master will decrease coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff            @@
##           master   #1398      +/-   ##
=========================================
- Coverage    76.3%   76.3%   -0.01%     
=========================================
  Files         134     134              
  Lines       16408   16406       -2     
=========================================
- Hits        12520   12518       -2     
  Misses       3888    3888

Impacted Files	Coverage Δ
src/hb-machinery.hh	`94.49% <ø> (-0.02%)`	⬇️
src/hb-buffer.cc	`80.12% <0%> (-0.04%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1042d9f...4e2a03b. Read the comment docs.

behdad · 2018-11-21T21:23:23Z

src/hb-machinery.hh

@@ -34,7 +34,6 @@

 #include "hb-iter.hh"
 #include "hb-vector.hh"
-


behdad · 2018-11-21T21:24:33Z

src/hb-machinery.hh

@@ -691,6 +691,9 @@ struct BEInt<Type, 2>
  }
  inline operator Type (void) const
  {
+#if defined(__GNUC__) || defined(__clang__)
+    return __builtin_bswap16 (*(uint16_t *) this);


I like this version better. BUT, this still has the non-aligned access problem, no?

I tested on 3 Android devices with a 32bits build of Chrome (i.e. -march=armv7):
a) Nexus 4: ARMv7, released on 2012.
b) Nexus tablet 7, first generation: ARMv7, released on 2012.
c) Google Pixel 1 (cellphone): ARMv8, released on 2017.

And 3 ARM Linux boards (64bits builds):
a) Rock64: ARMv8 little cores (A53).
b) Marvel Machiatobin: ARMv8 big cores (A72).
c) Pi M4: big.LITTLE (2 A72, 4 A53).

The case of unaligned access on ARM is tricky. Unlike a vector load (e.g. 64bit or 128bits chunk), a REV16 instruction will operate in a scalar (16bits in this case) that will fit in a 32bits register.

For reference:
https://godbolt.org/z/Qk1UL6

In such scenario, it should work fine.

Finally, if the user is building HB for really older architectures (e.g. armv6), the compiler will know it and can adjust the _builtin_bswap16 for something that will be safe (i.e. similar to the portable C++ code).

Are you aware of people using HB in older than ARMv7 processors?

Finally, I made sure that all tests (i.e. make test) are running fine both on x86 and ARMv8@Linux.

Speaking of it, I recall you mentioned that the structs got be aligned by 1. May I ask the reason for it?
/me learning the way around the code.

Speaking of it, I recall you mentioned that the structs got be aligned by 1. May I ask the reason for it?

Because this structs are directly placed on mmap()ed font data. While many places in the font format it says 16-bit aligned or 32-bit aligned, there's no guarantee. Structs reference each other using byte offsets. It would not be acceptable to reject unaligned integers.

the compiler will know it and can adjust the _builtin_bswap16 for something that will be safe (i.e. similar to the portable C++ code).

Portable C++ code does not allow unaligned access automatically...

However, this page: https://www.alfonsobeato.net/arm/how-to-access-safely-unaligned-data/ suggests that if we wrap uint16_t in a packed struct then compiler makes sure access is allowed. I'm happy to go with that. Then we essntially ifdef out our current BEInt<16> and swap this one. Or even better, just cast it to a packed struct and access the member? I'm happy with such solution.

Also, I was hoping that you have some benchmark numbers from ARM / Android.

Thanks for the clarification and the feedback, I really appreciate it.
\o/

I can try the suggested approach and report the results back.

Considering that we are approaching Thanksgiving, I will probably be able to resume this patch next week.

Numbers for A72 (big core): around 6% perf gain.
a ) Vanilla:
root@lcds-mb12:~/harfbuzz/release# ./run.sh

Performance counter stats for './hb-shape Roboto-Regular.ttf --text-file=thelittleprince.txt --font-funcs=ot --output-format= --output-file=/dev/null --num-iterations=10':

885.767680 task-clock (msec) # 0.999 CPUs utilized 10 context-switches # 0.011 K/sec 0 cpu-migrations # 0.000 K/sec 202 page-faults # 0.228 K/sec 1,771,496,170 cycles # 2.000 GHz 3,506,331,383 instructions # 1.98 insn per cycle

branches
7,388,064 branch-misses

0.886525697 seconds time elapsed 0.886454000 seconds user 0.000000000 seconds sys

root@lcds-mb12:~/harfbuzz/release# ./run.sh

Performance counter stats for './hb-shape Roboto-Regular.ttf --text-file=thelittleprince.txt --font-funcs=ot --output-format= --output-file=/dev/null --num-iterations=10':

885.495400 task-clock (msec) # 0.999 CPUs utilized 2 context-switches # 0.002 K/sec 0 cpu-migrations # 0.000 K/sec 199 page-faults # 0.225 K/sec 1,770,967,624 cycles # 2.000 GHz 3,506,436,015 instructions # 1.98 insn per cycle

branches
7,492,861 branch-misses

0.886090838 seconds time elapsed 0.886095000 seconds user 0.000000000 seconds sys

b) Patched:
root@lcds-mb12:~/harfbuzz/insane# ./run.sh

Performance counter stats for './hb-shape Roboto-Regular.ttf --text-file=thelittleprince.txt --font-funcs=ot --output-format= --output-file=/dev/null --num-iterations=10':

838.583520 task-clock (msec) # 0.999 CPUs utilized 18 context-switches # 0.021 K/sec 0 cpu-migrations # 0.000 K/sec 197 page-faults # 0.235 K/sec 1,677,106,507 cycles # 2.000 GHz 3,321,580,240 instructions # 1.98 insn per cycle

branches
6,404,630 branch-misses

0.839294213 seconds time elapsed 0.839174000 seconds user 0.000000000 seconds sys

root@lcds-mb12:~/harfbuzz/insane# ./run.sh

Performance counter stats for './hb-shape Roboto-Regular.ttf --text-file=thelittleprince.txt --font-funcs=ot --output-format= --output-file=/dev/null --num-iterations=10':

843.651680 task-clock (msec) # 0.999 CPUs utilized 3 context-switches # 0.004 K/sec 0 cpu-migrations # 0.000 K/sec 197 page-faults # 0.234 K/sec 1,687,269,415 cycles # 2.000 GHz 3,322,224,355 instructions # 1.97 insn per cycle

branches
6,522,149 branch-misses

0.844274633 seconds time elapsed 0.840283000 seconds user 0.004001000 seconds sys

Numbers for A53(little core): 11% perf gain.

a) Vanilla:
adenilson@rock64:~/harfbuzz/release$ ./run.sh

Performance counter stats for './hb-shape Roboto-Regular.ttf --text-file=thelittleprince.txt --font-funcs=ot --output-format= --output-file=/dev/null --num-iterations=100':

10587.091629 task-clock (msec) # 0.996 CPUs utilized 37 context-switches # 0.003 K/sec 0 cpu-migrations # 0.000 K/sec 199 page-faults # 0.019 K/sec 13,666,898,364 cycles # 1.291 GHz 11,898,985,328 instructions # 0.87 insn per cycle 1,043,804,690 branches # 98.592 M/sec 127,238,754 branch-misses # 12.19% of all branches 10.625467130 seconds time elapsed

adenilson@rock64:~/harfbuzz/release$ ./run.sh

Performance counter stats for './hb-shape Roboto-Regular.ttf --text-file=thelittleprince.txt --font-funcs=ot --output-format= --output-file=/dev/null --num-iterations=100':

10656.032876 task-clock (msec) # 1.000 CPUs utilized 22 context-switches # 0.002 K/sec 0 cpu-migrations # 0.000 K/sec 195 page-faults # 0.018 K/sec 13,670,689,293 cycles # 1.283 GHz 11,909,799,250 instructions # 0.87 insn per cycle 1,045,676,315 branches # 98.130 M/sec 127,320,823 branch-misses # 12.18% of all branches 10.658215420 seconds time elapsed

b) Patched:
adenilson@rock64:~/harfbuzz/insane$ ./run.sh

Performance counter stats for './hb-shape Roboto-Regular.ttf --text-file=thelittleprince.txt --font-funcs=ot --output-format= --output-file=/dev/null --num-iterations=100':

9623.224586 task-clock (msec) # 1.000 CPUs utilized 24 context-switches # 0.002 K/sec 0 cpu-migrations # 0.000 K/sec 195 page-faults # 0.020 K/sec 12,324,671,746 cycles # 1.281 GHz 10,502,054,557 instructions # 0.85 insn per cycle 1,021,269,428 branches # 106.125 M/sec 129,355,571 branch-misses # 12.67% of all branches 9.627012948 seconds time elapsed

adenilson@rock64:~/harfbuzz/insane$ ./run.sh

Performance counter stats for './hb-shape Roboto-Regular.ttf --text-file=thelittleprince.txt --font-funcs=ot --output-format= --output-file=/dev/null --num-iterations=100':

9483.454130 task-clock (msec) # 1.000 CPUs utilized 23 context-switches # 0.002 K/sec 0 cpu-migrations # 0.000 K/sec 196 page-faults # 0.021 K/sec 12,290,291,346 cycles # 1.296 GHz 10,501,462,804 instructions # 0.85 insn per cycle 1,021,195,129 branches # 107.682 M/sec 129,261,068 branch-misses # 12.66% of all branches 9.485277320 seconds time elapsed

Adenilson · 2018-11-21T23:14:43Z

It seems that 3 bots have failed.

Do I have to make the patch pass a build with -Wcast-align?

behdad · 2018-11-21T23:32:25Z

Do I have to make the patch pass a build with -Wcast-align?

Let's try with the packed struct and see if that helps?

behdad · 2018-11-22T03:46:07Z

So, using struct __attribute__((packed)) { uint16_t v; }; is portable and I'm happy to take that in. However, this example shows that at least some compilers fallback to generating exact same code as our current code generates as soon as packed is involved: https://godbolt.org/z/tzYyoH

behdad · 2018-11-22T03:49:29Z

The x86 version works, but then again, it does with current code as well:
https://godbolt.org/z/QOgrMM
https://godbolt.org/z/Dy3KDn

Ie. both generate a rol operation to swap bytes.

I'm actually not sure why you see a speedup on x86. On arm, looks like the compiler is playing it safe as well.

behdad · 2018-11-22T04:12:53Z

So, using struct __attribute__((packed)) { uint16_t v; }; is portable and I'm happy to take that in. However, this example shows that at least some compilers fallback to generating exact same code as our current code generates as soon as packed is involved: https://godbolt.org/z/tzYyoH

If I compile for armv7 however, both packed struct and our current code generate a rev16 CPU op. So, again, not sure why you see any speedup.

Adenilson · 2018-11-27T19:44:18Z

@behdad thanks for the feedback, I'm back from the Thanksgiving holidays and planning to resume this.

Just to clarify, when you mentioned that the current code will generate the REV16 instruction, which compiler did you use?

I've tried g++ 6.3, 7.3, 8.2 and clang 8.0 and couldn't observe the generation of the correct instruction without the use of bultins.

behdad · 2018-11-27T19:56:49Z

Just to clarify, when you mentioned that the current code will generate the REV16 instruction, which compiler did you use?

I was just using the

@behdad thanks for the feedback, I'm back from the Thanksgiving holidays and planning to resume this.

Just to clarify, when you mentioned that the current code will generate the REV16 instruction, which compiler did you use?

I've tried g++ 6.3, 7.3, 8.2 and clang 8.0 and couldn't observe the generation of the correct instruction without the use of bultins.

Took me a while to reproduce:
https://godbolt.org/z/79iOir

The trick is, the target line MUST say "armv7", not just "arm7", and mid section be "linux", not "none" or other. -target armv7-linux-gnueabihf -O3 -fno-inline

Looks like someone wrote optimizations for that particular target string :))?

Adenilson · 2018-11-27T20:47:47Z

So how you wanna to proceed with this? Should I upload an updated patch with attribute(packed)?

It seems to also make gcc happy for ARM 32bits: https://godbolt.org/z/svzYpd

behdad · 2018-11-27T20:49:50Z

So how you wanna to proceed with this? Should I upload an updated patch with attribute(packed)?

__attribute__((packed)) seems to be the only legal way to spoon-feed the compiler. But evidence suggests that the code generated is identical to our current code. So, I don't see any reason to change that. Perhaps file bug with clang re overly-host-sensitive optimization?

Adenilson · 2018-11-27T21:13:42Z

Are you factoring all the involved variables: compiler (gcc, clang), architecture (armv7, armv8), build target (32bits x 64bits)?

Also it is important to factor that the clang available on Compiler Explorer is the latest (i.e. clang trunk).

Unless I'm mistaken, this is an example where gcc will generate the wrong code (not using REV16, but instead using UBFIZ) with the vanilla code (just enable the line BUILTIN):
https://godbolt.org/z/Y3rNFd

I think it makes sense to add the extra code (just 2 lines) to couch the compiler to do the right thing, given the performance benefits (https://bugs.chromium.org/p/chromium/issues/detail?id=907244#c2), seems like an easy 6% to 11% performance gain on ARM as also on Intel.

Waiting for the compiler to be fixed can take a long time from my experience (even escalating the issue to ARM's clang team). And I'm not even talking about gcc, that certainly can take a very lonnnng time to get fixed.

behdad · 2018-11-27T21:18:25Z

I'm fine adding the two extra lines. Just observing that the performance benefits you saw do not reproduce with __attribute__ necessarily.

Profiling showed that type conversions were adding considerable cycles in time spent doing text shaping. The idea is to optimize it using native processor instructions to help Blink layout performance. Doing further investigation revelead that compilers may not use the proper instruction on ARM 32bits builds (i.e. REV16). One way to insure that the generated ASM was ideal for both gcc/clang was using __builtin_bswap16. Added bonus is that we no longer need to test for CPU architecture.

Adenilson · 2018-11-27T22:26:03Z

It seems now the failing bots (i.e. alignment warning) are green.

Adenilson · 2018-11-27T22:32:58Z

By the way, it is pretty impressive the infrastructure (i.e. bots, coverage, etc) in Harfbuzz.
:-)

behdad · 2018-11-27T22:35:53Z

That's all @ebraminio's with help from others. It can be quite a PITA but yeah catches real stuff...

behdad reviewed Nov 21, 2018

View reviewed changes

src/hb-machinery.hh Outdated

@@ -34,7 +34,6 @@

#include "hb-iter.hh"

#include "hb-vector.hh"

Copy link

Member

behdad Nov 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put back?

Adenilson force-pushed the bigInt01 branch from 64ec6c3 to 9b2e38c Compare November 21, 2018 21:23

behdad reviewed Nov 21, 2018

View reviewed changes

Adenilson force-pushed the bigInt01 branch from 9b2e38c to 4a719a7 Compare November 27, 2018 21:43

Comment

4e2a03b

behdad merged commit 987f418 into harfbuzz:master Nov 27, 2018

ebraminio mentioned this pull request Aug 18, 2019

Add initial shaping performance test and profiler #1911

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize harfbuzz big integer conversions #1398

Optimize harfbuzz big integer conversions #1398

Adenilson commented Nov 20, 2018 •

edited

Loading

Adenilson commented Nov 20, 2018

Adenilson commented Nov 20, 2018

behdad commented Nov 20, 2018

behdad Nov 21, 2018

behdad Nov 21, 2018

behdad Nov 21, 2018

behdad Nov 21, 2018

Adenilson commented Nov 21, 2018 •

edited

Loading

codecov bot commented Nov 21, 2018 •

edited

Loading

behdad Nov 21, 2018

behdad Nov 21, 2018

Adenilson Nov 21, 2018 •

edited

Loading

Adenilson Nov 21, 2018

behdad Nov 21, 2018

behdad Nov 21, 2018

Adenilson Nov 21, 2018

Adenilson Nov 21, 2018

Adenilson Nov 21, 2018

Adenilson commented Nov 21, 2018

behdad commented Nov 21, 2018

behdad commented Nov 22, 2018

behdad commented Nov 22, 2018

behdad commented Nov 22, 2018

Adenilson commented Nov 27, 2018

behdad commented Nov 27, 2018

Adenilson commented Nov 27, 2018 •

edited

Loading

behdad commented Nov 27, 2018

Adenilson commented Nov 27, 2018

behdad commented Nov 27, 2018

Adenilson commented Nov 27, 2018

Adenilson commented Nov 27, 2018

behdad commented Nov 27, 2018

		@@ -34,7 +34,6 @@

		#include "hb-iter.hh"
		#include "hb-vector.hh"

Optimize harfbuzz big integer conversions #1398

Optimize harfbuzz big integer conversions #1398

Conversation

Adenilson commented Nov 20, 2018 • edited Loading

Adenilson commented Nov 20, 2018

Adenilson commented Nov 20, 2018

behdad commented Nov 20, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Adenilson commented Nov 21, 2018 • edited Loading

codecov bot commented Nov 21, 2018 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Adenilson Nov 21, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Adenilson commented Nov 21, 2018

behdad commented Nov 21, 2018

behdad commented Nov 22, 2018

behdad commented Nov 22, 2018

behdad commented Nov 22, 2018

Adenilson commented Nov 27, 2018

behdad commented Nov 27, 2018

Adenilson commented Nov 27, 2018 • edited Loading

behdad commented Nov 27, 2018

Adenilson commented Nov 27, 2018

behdad commented Nov 27, 2018

Adenilson commented Nov 27, 2018

Adenilson commented Nov 27, 2018

behdad commented Nov 27, 2018

Adenilson commented Nov 20, 2018 •

edited

Loading

Adenilson commented Nov 21, 2018 •

edited

Loading

codecov bot commented Nov 21, 2018 •

edited

Loading

Adenilson Nov 21, 2018 •

edited

Loading

Adenilson commented Nov 27, 2018 •

edited

Loading