Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize harfbuzz big integer conversions #1398

Merged
merged 2 commits into from
Nov 27, 2018
Merged

Conversation

Adenilson
Copy link
Contributor

@Adenilson Adenilson commented Nov 20, 2018

Profiling showed that type conversions were adding considerable cycles in time
spent doing text shaping.

The idea is to optimize it using native processor instructions to help Blink
layout performance.

NOT for commit: still needs further testing on ARM (i.e. unaligned loads) and handling other OS-es (only tested on Linux).

@Adenilson
Copy link
Contributor Author

The idea is to start a discussion on how we may organize platform specific code.

I'm still collecting data and doing experiments, I'm planning to upload all the results here:
https://bugs.chromium.org/p/chromium/issues/detail?id=907244

@Adenilson
Copy link
Contributor Author

@behdad I'm planning next to validate the approach on android devices and will be reporting the results pretty soon.

@behdad
Copy link
Member

behdad commented Nov 20, 2018

Reports are appreciated indeed. Thanks.

Note that it's only 16bit read that matters. That's the only hot spot.

}
inline operator Type (void) const
{
#if defined(__x86_64) || defined(__ARM_NEON__) || defined(__ARM_NEON)
return bswap_16(*(uint16_t *)this);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spaces around parantheses, like this: bswap_16 (*(uint16_t *) this)

@@ -686,13 +689,22 @@ struct BEInt<Type, 2>
typedef Type type;
inline void set (Type V)
{
#if defined(__x86_64) || defined(__ARM_NEON__) || defined(__ARM_NEON)
uint16_t *ptr = (uint16_t *)v;
*ptr = bswap_16(V);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Space before ( and after ).

}
inline operator Type (void) const
{
#if defined(__x86_64) || defined(__ARM_NEON__) || defined(__ARM_NEON)
return bswap_16(*(uint16_t *)this);
#else
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We typically prefer to add return to the #ifdef'd code and instead of using #else, leave the defualt iplementation always compiled, just unreached.

@@ -35,6 +35,9 @@
#include "hb-iter.hh"
#include "hb-vector.hh"

#if defined(__x86_64) || defined(__ARM_NEON__) || defined(__ARM_NEON)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is unaligned access safe on NEON?

This needs a comment, why we only do it for some archs. Also, since the condition is duplicated multiple times, we should just define a symbol like HB_USE_BSWAP and condition on that. Like:

#ifndef HB_USE_BSWAP
#  if ...
#    define HB_USE_BSWAP 1
#  else
#    define HB_USE_BSWAP 0
#endif

later...

#if HB_USE_BSWAP
  return ...
#endif

@Adenilson
Copy link
Contributor Author

Adenilson commented Nov 21, 2018

@behdad thanks for the feedback, I will upload an updated patch soon addressing the change requests.

@codecov
Copy link

codecov bot commented Nov 21, 2018

Codecov Report

Merging #1398 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #1398      +/-   ##
=========================================
- Coverage    76.3%   76.3%   -0.01%     
=========================================
  Files         134     134              
  Lines       16408   16406       -2     
=========================================
- Hits        12520   12518       -2     
  Misses       3888    3888
Impacted Files Coverage Δ
src/hb-machinery.hh 94.49% <ø> (-0.02%) ⬇️
src/hb-buffer.cc 80.12% <0%> (-0.04%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1042d9f...4e2a03b. Read the comment docs.

@@ -34,7 +34,6 @@

#include "hb-iter.hh"
#include "hb-vector.hh"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put back?

@@ -691,6 +691,9 @@ struct BEInt<Type, 2>
}
inline operator Type (void) const
{
#if defined(__GNUC__) || defined(__clang__)
return __builtin_bswap16 (*(uint16_t *) this);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this version better. BUT, this still has the non-aligned access problem, no?

Copy link
Contributor Author

@Adenilson Adenilson Nov 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested on 3 Android devices with a 32bits build of Chrome (i.e. -march=armv7):
a) Nexus 4: ARMv7, released on 2012.
b) Nexus tablet 7, first generation: ARMv7, released on 2012.
c) Google Pixel 1 (cellphone): ARMv8, released on 2017.

And 3 ARM Linux boards (64bits builds):
a) Rock64: ARMv8 little cores (A53).
b) Marvel Machiatobin: ARMv8 big cores (A72).
c) Pi M4: big.LITTLE (2 A72, 4 A53).

The case of unaligned access on ARM is tricky. Unlike a vector load (e.g. 64bit or 128bits chunk), a REV16 instruction will operate in a scalar (16bits in this case) that will fit in a 32bits register.

For reference:
https://godbolt.org/z/Qk1UL6

In such scenario, it should work fine.

Finally, if the user is building HB for really older architectures (e.g. armv6), the compiler will know it and can adjust the _builtin_bswap16 for something that will be safe (i.e. similar to the portable C++ code).

Are you aware of people using HB in older than ARMv7 processors?

Finally, I made sure that all tests (i.e. make test) are running fine both on x86 and ARMv8@Linux.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Speaking of it, I recall you mentioned that the structs got be aligned by 1. May I ask the reason for it?
/me learning the way around the code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Speaking of it, I recall you mentioned that the structs got be aligned by 1. May I ask the reason for it?

Because this structs are directly placed on mmap()ed font data. While many places in the font format it says 16-bit aligned or 32-bit aligned, there's no guarantee. Structs reference each other using byte offsets. It would not be acceptable to reject unaligned integers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the compiler will know it and can adjust the _builtin_bswap16 for something that will be safe (i.e. similar to the portable C++ code).

Portable C++ code does not allow unaligned access automatically...

However, this page: https://www.alfonsobeato.net/arm/how-to-access-safely-unaligned-data/ suggests that if we wrap uint16_t in a packed struct then compiler makes sure access is allowed. I'm happy to go with that. Then we essntially ifdef out our current BEInt<16> and swap this one. Or even better, just cast it to a packed struct and access the member? I'm happy with such solution.

Also, I was hoping that you have some benchmark numbers from ARM / Android.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification and the feedback, I really appreciate it.
\o/

I can try the suggested approach and report the results back.

Considering that we are approaching Thanksgiving, I will probably be able to resume this patch next week.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Numbers for A72 (big core): around 6% perf gain.
a ) Vanilla:
root@lcds-mb12:~/harfbuzz/release# ./run.sh

Performance counter stats for './hb-shape Roboto-Regular.ttf --text-file=thelittleprince.txt --font-funcs=ot --output-format= --output-file=/dev/null --num-iterations=10':

    885.767680      task-clock (msec)         #    0.999 CPUs utilized          
            10      context-switches          #    0.011 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
           202      page-faults               #    0.228 K/sec                  
 1,771,496,170      cycles                    #    2.000 GHz                    
 3,506,331,383      instructions              #    1.98  insn per cycle         

branches
7,388,064 branch-misses

   0.886525697 seconds time elapsed

   0.886454000 seconds user
   0.000000000 seconds sys

root@lcds-mb12:~/harfbuzz/release# ./run.sh

Performance counter stats for './hb-shape Roboto-Regular.ttf --text-file=thelittleprince.txt --font-funcs=ot --output-format= --output-file=/dev/null --num-iterations=10':

    885.495400      task-clock (msec)         #    0.999 CPUs utilized          
             2      context-switches          #    0.002 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
           199      page-faults               #    0.225 K/sec                  
 1,770,967,624      cycles                    #    2.000 GHz                    
 3,506,436,015      instructions              #    1.98  insn per cycle         

branches
7,492,861 branch-misses

   0.886090838 seconds time elapsed

   0.886095000 seconds user
   0.000000000 seconds sys

b) Patched:
root@lcds-mb12:~/harfbuzz/insane# ./run.sh

Performance counter stats for './hb-shape Roboto-Regular.ttf --text-file=thelittleprince.txt --font-funcs=ot --output-format= --output-file=/dev/null --num-iterations=10':

    838.583520      task-clock (msec)         #    0.999 CPUs utilized          
            18      context-switches          #    0.021 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
           197      page-faults               #    0.235 K/sec                  
 1,677,106,507      cycles                    #    2.000 GHz                    
 3,321,580,240      instructions              #    1.98  insn per cycle         

branches
6,404,630 branch-misses

   0.839294213 seconds time elapsed

   0.839174000 seconds user
   0.000000000 seconds sys

root@lcds-mb12:~/harfbuzz/insane# ./run.sh

Performance counter stats for './hb-shape Roboto-Regular.ttf --text-file=thelittleprince.txt --font-funcs=ot --output-format= --output-file=/dev/null --num-iterations=10':

    843.651680      task-clock (msec)         #    0.999 CPUs utilized          
             3      context-switches          #    0.004 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
           197      page-faults               #    0.234 K/sec                  
 1,687,269,415      cycles                    #    2.000 GHz                    
 3,322,224,355      instructions              #    1.97  insn per cycle         

branches
6,522,149 branch-misses

   0.844274633 seconds time elapsed

   0.840283000 seconds user
   0.004001000 seconds sys

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Numbers for A53(little core): 11% perf gain.

a) Vanilla:
adenilson@rock64:~/harfbuzz/release$ ./run.sh

Performance counter stats for './hb-shape Roboto-Regular.ttf --text-file=thelittleprince.txt --font-funcs=ot --output-format= --output-file=/dev/null --num-iterations=100':

  10587.091629      task-clock (msec)         #    0.996 CPUs utilized          
            37      context-switches          #    0.003 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
           199      page-faults               #    0.019 K/sec                  
13,666,898,364      cycles                    #    1.291 GHz                    
11,898,985,328      instructions              #    0.87  insn per cycle         
 1,043,804,690      branches                  #   98.592 M/sec                  
   127,238,754      branch-misses             #   12.19% of all branches        

  10.625467130 seconds time elapsed

adenilson@rock64:~/harfbuzz/release$ ./run.sh

Performance counter stats for './hb-shape Roboto-Regular.ttf --text-file=thelittleprince.txt --font-funcs=ot --output-format= --output-file=/dev/null --num-iterations=100':

  10656.032876      task-clock (msec)         #    1.000 CPUs utilized          
            22      context-switches          #    0.002 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
           195      page-faults               #    0.018 K/sec                  
13,670,689,293      cycles                    #    1.283 GHz                    
11,909,799,250      instructions              #    0.87  insn per cycle         
 1,045,676,315      branches                  #   98.130 M/sec                  
   127,320,823      branch-misses             #   12.18% of all branches        

  10.658215420 seconds time elapsed

b) Patched:
adenilson@rock64:~/harfbuzz/insane$ ./run.sh

Performance counter stats for './hb-shape Roboto-Regular.ttf --text-file=thelittleprince.txt --font-funcs=ot --output-format= --output-file=/dev/null --num-iterations=100':

   9623.224586      task-clock (msec)         #    1.000 CPUs utilized          
            24      context-switches          #    0.002 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
           195      page-faults               #    0.020 K/sec                  
12,324,671,746      cycles                    #    1.281 GHz                    
10,502,054,557      instructions              #    0.85  insn per cycle         
 1,021,269,428      branches                  #  106.125 M/sec                  
   129,355,571      branch-misses             #   12.67% of all branches        

   9.627012948 seconds time elapsed

adenilson@rock64:~/harfbuzz/insane$ ./run.sh

Performance counter stats for './hb-shape Roboto-Regular.ttf --text-file=thelittleprince.txt --font-funcs=ot --output-format= --output-file=/dev/null --num-iterations=100':

   9483.454130      task-clock (msec)         #    1.000 CPUs utilized          
            23      context-switches          #    0.002 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
           196      page-faults               #    0.021 K/sec                  
12,290,291,346      cycles                    #    1.296 GHz                    
10,501,462,804      instructions              #    0.85  insn per cycle         
 1,021,195,129      branches                  #  107.682 M/sec                  
   129,261,068      branch-misses             #   12.66% of all branches        

   9.485277320 seconds time elapsed

@Adenilson
Copy link
Contributor Author

It seems that 3 bots have failed.

Do I have to make the patch pass a build with -Wcast-align?

@behdad
Copy link
Member

behdad commented Nov 21, 2018

Do I have to make the patch pass a build with -Wcast-align?

Let's try with the packed struct and see if that helps?

@behdad
Copy link
Member

behdad commented Nov 22, 2018

So, using struct __attribute__((packed)) { uint16_t v; }; is portable and I'm happy to take that in. However, this example shows that at least some compilers fallback to generating exact same code as our current code generates as soon as packed is involved: https://godbolt.org/z/tzYyoH

@behdad
Copy link
Member

behdad commented Nov 22, 2018

The x86 version works, but then again, it does with current code as well:
https://godbolt.org/z/QOgrMM
https://godbolt.org/z/Dy3KDn

Ie. both generate a rol operation to swap bytes.

I'm actually not sure why you see a speedup on x86. On arm, looks like the compiler is playing it safe as well.

@behdad
Copy link
Member

behdad commented Nov 22, 2018

So, using struct __attribute__((packed)) { uint16_t v; }; is portable and I'm happy to take that in. However, this example shows that at least some compilers fallback to generating exact same code as our current code generates as soon as packed is involved: https://godbolt.org/z/tzYyoH

If I compile for armv7 however, both packed struct and our current code generate a rev16 CPU op. So, again, not sure why you see any speedup.

@Adenilson
Copy link
Contributor Author

@behdad thanks for the feedback, I'm back from the Thanksgiving holidays and planning to resume this.

Just to clarify, when you mentioned that the current code will generate the REV16 instruction, which compiler did you use?

I've tried g++ 6.3, 7.3, 8.2 and clang 8.0 and couldn't observe the generation of the correct instruction without the use of bultins.

@behdad
Copy link
Member

behdad commented Nov 27, 2018

Just to clarify, when you mentioned that the current code will generate the REV16 instruction, which compiler did you use?

I was just using the

@behdad thanks for the feedback, I'm back from the Thanksgiving holidays and planning to resume this.

Just to clarify, when you mentioned that the current code will generate the REV16 instruction, which compiler did you use?

I've tried g++ 6.3, 7.3, 8.2 and clang 8.0 and couldn't observe the generation of the correct instruction without the use of bultins.

Took me a while to reproduce:
https://godbolt.org/z/79iOir

The trick is, the target line MUST say "armv7", not just "arm7", and mid section be "linux", not "none" or other. -target armv7-linux-gnueabihf -O3 -fno-inline

Looks like someone wrote optimizations for that particular target string :))?

@Adenilson
Copy link
Contributor Author

Adenilson commented Nov 27, 2018

So how you wanna to proceed with this? Should I upload an updated patch with attribute(packed)?

It seems to also make gcc happy for ARM 32bits: https://godbolt.org/z/svzYpd

@behdad
Copy link
Member

behdad commented Nov 27, 2018

So how you wanna to proceed with this? Should I upload an updated patch with attribute(packed)?

__attribute__((packed)) seems to be the only legal way to spoon-feed the compiler. But evidence suggests that the code generated is identical to our current code. So, I don't see any reason to change that. Perhaps file bug with clang re overly-host-sensitive optimization?

@Adenilson
Copy link
Contributor Author

Are you factoring all the involved variables: compiler (gcc, clang), architecture (armv7, armv8), build target (32bits x 64bits)?

Also it is important to factor that the clang available on Compiler Explorer is the latest (i.e. clang trunk).

Unless I'm mistaken, this is an example where gcc will generate the wrong code (not using REV16, but instead using UBFIZ) with the vanilla code (just enable the line BUILTIN):
https://godbolt.org/z/Y3rNFd

I think it makes sense to add the extra code (just 2 lines) to couch the compiler to do the right thing, given the performance benefits (https://bugs.chromium.org/p/chromium/issues/detail?id=907244#c2), seems like an easy 6% to 11% performance gain on ARM as also on Intel.

Waiting for the compiler to be fixed can take a long time from my experience (even escalating the issue to ARM's clang team). And I'm not even talking about gcc, that certainly can take a very lonnnng time to get fixed.

@behdad
Copy link
Member

behdad commented Nov 27, 2018

I'm fine adding the two extra lines. Just observing that the performance benefits you saw do not reproduce with __attribute__ necessarily.

Profiling showed that type conversions were adding considerable cycles in time
spent doing text shaping.

The idea is to optimize it using native processor instructions to help Blink
layout performance.

Doing further investigation revelead that compilers may not use the
proper instruction on ARM 32bits builds (i.e. REV16).

One way to insure that the generated ASM was ideal for both gcc/clang
was using __builtin_bswap16.

Added bonus is that we no longer need to test for CPU architecture.
@Adenilson
Copy link
Contributor Author

It seems now the failing bots (i.e. alignment warning) are green.

@Adenilson
Copy link
Contributor Author

By the way, it is pretty impressive the infrastructure (i.e. bots, coverage, etc) in Harfbuzz.
:-)

@behdad
Copy link
Member

behdad commented Nov 27, 2018

That's all @ebraminio's with help from others. It can be quite a PITA but yeah catches real stuff...

@behdad behdad merged commit 987f418 into harfbuzz:master Nov 27, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants