Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[perf] hb_ot_tags_from_language() too slow #3591

Closed
behdad opened this issue May 17, 2022 · 19 comments
Closed

[perf] hb_ot_tags_from_language() too slow #3591

behdad opened this issue May 17, 2022 · 19 comments
Assignees

Comments

@behdad
Copy link
Member

behdad commented May 17, 2022

[This was flagged by @matthiasclasen in https://gitlab.gnome.org/GNOME/gtk/-/issues/3334 https://gitlab.gnome.org/GNOME/gtk/uploads/a6da390ef2fd85d37dd594d88c4229ff/image.png]

The new perf/benchmark-ot shows this:

----------------------------------------------------------------------------------------------
Benchmark                                                    Time             CPU   Iterations
----------------------------------------------------------------------------------------------
BM_hb_ot_tags_from_script_and_language/COMMON zh_CN        172 ns          171 ns      4083155
BM_hb_ot_tags_from_script_and_language/COMMON en_US        120 ns          119 ns      5849947
BM_hb_ot_tags_from_script_and_language/LATIN en_US         113 ns          112 ns      5840326
BM_hb_ot_tags_from_script_and_language/COMMON none        4.66 ns         4.64 ns    151396224
BM_hb_ot_tags_from_script_and_language/LATIN none         4.66 ns         4.64 ns    149019593

Half of the time is spent in hb_ot_tags_from_complex_language(); in there, most of the time is spent in subtag_matches cases outside the switch. Ideas to speed that up:

  1. Make subtag_matches use strchr for - instead of strstr as main loop. Might or might not have any effect.
  2. All the subtag_matches outside the switch match long strings (>= 6 or so). As such, check the tag for such length before going into any of them.

I implementing the second.

After that, bulk of the time I suppose is spent in binary-searching the language table. I suggest we split the language table in 2-letter and 3-letter tags, to speed-up the vast majority of cases that are 2-letter.

@matthiasclasen
Copy link
Collaborator

behdad added a commit that referenced this issue May 17, 2022
Part of #3591

2. All the subtag_matches outside the switch match long strings (>= 6 or so).
   As such, check the tag for such length before going into any of them.

benchmark-ot, before:

----------------------------------------------------------------------------------------------
Benchmark                                                    Time             CPU   Iterations
----------------------------------------------------------------------------------------------
BM_hb_ot_tags_from_script_and_language/COMMON zh_CN        172 ns          171 ns      4083155
BM_hb_ot_tags_from_script_and_language/COMMON en_US        120 ns          119 ns      5849947
BM_hb_ot_tags_from_script_and_language/LATIN en_US         113 ns          112 ns      5840326
BM_hb_ot_tags_from_script_and_language/COMMON none        4.66 ns         4.64 ns    151396224
BM_hb_ot_tags_from_script_and_language/LATIN none         4.66 ns         4.64 ns    149019593

After:

----------------------------------------------------------------------------------------------
Benchmark                                                    Time             CPU   Iterations
----------------------------------------------------------------------------------------------
BM_hb_ot_tags_from_script_and_language/COMMON zh_CN        112 ns          112 ns      6357763
BM_hb_ot_tags_from_script_and_language/COMMON en_US       60.5 ns         60.3 ns     11475091
BM_hb_ot_tags_from_script_and_language/LATIN en_US        54.9 ns         54.8 ns     12575690
BM_hb_ot_tags_from_script_and_language/COMMON none        4.61 ns         4.59 ns    152388450
BM_hb_ot_tags_from_script_and_language/LATIN none         4.66 ns         4.64 ns    151497600
@behdad
Copy link
Member Author

behdad commented May 17, 2022

2. All the subtag_matches outside the switch match long strings (>= 6 or so). As such, check the tag for such length before going into any of them.

I did this now. New numbers:

----------------------------------------------------------------------------------------------
Benchmark                                                    Time             CPU   Iterations
----------------------------------------------------------------------------------------------
BM_hb_ot_tags_from_script_and_language/COMMON zh_CN        112 ns          112 ns      6357763
BM_hb_ot_tags_from_script_and_language/COMMON en_US       60.5 ns         60.3 ns     11475091
BM_hb_ot_tags_from_script_and_language/LATIN en_US        54.9 ns         54.8 ns     12575690
BM_hb_ot_tags_from_script_and_language/COMMON none        4.61 ns         4.59 ns    152388450
BM_hb_ot_tags_from_script_and_language/LATIN none         4.66 ns         4.64 ns    151497600

@dscorbett
Copy link
Collaborator

I’ll start working on splitting the ot_languages table.

behdad added a commit that referenced this issue May 17, 2022
Part of #3591

"After that, bulk of the time I suppose is spent in binary-searching the
language table. I suggest we split the language table in 2-letter and
3-letter tags, to speed-up the vast majority of cases that are
2-letter."

benchmark-ot, before:

----------------------------------------------------------------------------------------------
Benchmark                                                    Time             CPU   Iterations
----------------------------------------------------------------------------------------------
BM_hb_ot_tags_from_script_and_language/COMMON zh_CN        112 ns          111 ns      6286271
BM_hb_ot_tags_from_script_and_language/COMMON en_US       60.6 ns         60.4 ns     11671176
BM_hb_ot_tags_from_script_and_language/LATIN en_US        61.3 ns         61.1 ns     11442645
BM_hb_ot_tags_from_script_and_language/COMMON none        4.75 ns         4.74 ns    146997235
BM_hb_ot_tags_from_script_and_language/LATIN none         4.65 ns         4.64 ns    150938747

After:

----------------------------------------------------------------------------------------------
Benchmark                                                    Time             CPU   Iterations
----------------------------------------------------------------------------------------------
BM_hb_ot_tags_from_script_and_language/COMMON zh_CN       89.5 ns         89.2 ns      7747649
BM_hb_ot_tags_from_script_and_language/COMMON en_US       38.5 ns         38.4 ns     18199432
BM_hb_ot_tags_from_script_and_language/LATIN en_US        39.0 ns         38.9 ns     18049238
BM_hb_ot_tags_from_script_and_language/COMMON none        4.53 ns         4.52 ns    154895110
BM_hb_ot_tags_from_script_and_language/LATIN none         4.54 ns         4.52 ns    154762105
@behdad
Copy link
Member Author

behdad commented May 17, 2022

I’ll start working on splitting the ot_languages table.

Done.

@behdad
Copy link
Member Author

behdad commented May 17, 2022

I’ll start working on splitting the ot_languages table.

Done.

After:

----------------------------------------------------------------------------------------------
Benchmark                                                    Time             CPU   Iterations
----------------------------------------------------------------------------------------------
BM_hb_ot_tags_from_script_and_language/COMMON zh_CN       89.5 ns         89.2 ns      7747649
BM_hb_ot_tags_from_script_and_language/COMMON en_US       38.5 ns         38.4 ns     18199432
BM_hb_ot_tags_from_script_and_language/LATIN en_US        39.0 ns         38.9 ns     18049238
BM_hb_ot_tags_from_script_and_language/COMMON none        4.53 ns         4.52 ns    154895110
BM_hb_ot_tags_from_script_and_language/LATIN none         4.54 ns         4.52 ns    154762105

@behdad
Copy link
Member Author

behdad commented May 17, 2022

Now perhaps looking into zh case would be wise.

behdad added a commit that referenced this issue May 17, 2022
Part of #3591

Before:
------------------------------------------------------------------------------------------------
Benchmark                                                      Time             CPU   Iterations
------------------------------------------------------------------------------------------------
BM_hb_ot_tags_from_script_and_language/COMMON abcd_XY       8.67 ns         8.64 ns     80324382
BM_hb_ot_tags_from_script_and_language/COMMON zh_CN         91.2 ns         90.9 ns      7674131
BM_hb_ot_tags_from_script_and_language/COMMON en_US         41.1 ns         41.0 ns     17174093
BM_hb_ot_tags_from_script_and_language/LATIN en_US          41.3 ns         41.2 ns     17000876
BM_hb_ot_tags_from_script_and_language/COMMON none          4.56 ns         4.55 ns    153914130
BM_hb_ot_tags_from_script_and_language/LATIN none           4.53 ns         4.52 ns    153830303

After:
------------------------------------------------------------------------------------------------
Benchmark                                                      Time             CPU   Iterations
------------------------------------------------------------------------------------------------
BM_hb_ot_tags_from_script_and_language/COMMON abcd_XY       8.24 ns         8.21 ns     84078465
BM_hb_ot_tags_from_script_and_language/COMMON zh_CN         77.5 ns         77.2 ns      9059230
BM_hb_ot_tags_from_script_and_language/COMMON en_US         38.8 ns         38.7 ns     17790692
BM_hb_ot_tags_from_script_and_language/LATIN en_US          37.6 ns         37.5 ns     18648293
BM_hb_ot_tags_from_script_and_language/COMMON none          4.50 ns         4.49 ns    155573267
BM_hb_ot_tags_from_script_and_language/LATIN none           4.49 ns         4.47 ns    156456653
behdad added a commit that referenced this issue May 17, 2022
Now that we know both strings are of equal len of 2 or 3, optimize.

Part of #3591

Before:
------------------------------------------------------------------------------------------------
Benchmark                                                      Time             CPU   Iterations
------------------------------------------------------------------------------------------------
BM_hb_ot_tags_from_script_and_language/COMMON abcd_XY       8.50 ns         8.47 ns     81221549
BM_hb_ot_tags_from_script_and_language/COMMON zh_CN         79.6 ns         79.3 ns      8785804
BM_hb_ot_tags_from_script_and_language/COMMON en_US         40.0 ns         39.9 ns     17462768
BM_hb_ot_tags_from_script_and_language/LATIN en_US          39.2 ns         39.1 ns     17886793
BM_hb_ot_tags_from_script_and_language/COMMON none          4.31 ns         4.30 ns    162805417
BM_hb_ot_tags_from_script_and_language/LATIN none           4.32 ns         4.31 ns    162656688

After:
------------------------------------------------------------------------------------------------
Benchmark                                                      Time             CPU   Iterations
------------------------------------------------------------------------------------------------
BM_hb_ot_tags_from_script_and_language/COMMON abcd_XY       8.27 ns         8.24 ns     81868701
BM_hb_ot_tags_from_script_and_language/COMMON zh_CN         56.1 ns         56.0 ns     12353284
BM_hb_ot_tags_from_script_and_language/COMMON en_US         24.3 ns         24.2 ns     28955030
BM_hb_ot_tags_from_script_and_language/LATIN en_US          24.5 ns         24.4 ns     28664868
BM_hb_ot_tags_from_script_and_language/COMMON none          4.35 ns         4.34 ns    161190014
BM_hb_ot_tags_from_script_and_language/LATIN none           4.36 ns         4.34 ns    161319000
@behdad
Copy link
Member Author

behdad commented May 17, 2022

Now down to:

------------------------------------------------------------------------------------------------
Benchmark                                                      Time             CPU   Iterations
------------------------------------------------------------------------------------------------
BM_hb_ot_tags_from_script_and_language/COMMON abcd_XY       8.27 ns         8.24 ns     81868701
BM_hb_ot_tags_from_script_and_language/COMMON zh_CN         56.1 ns         56.0 ns     12353284
BM_hb_ot_tags_from_script_and_language/COMMON en_US         24.3 ns         24.2 ns     28955030
BM_hb_ot_tags_from_script_and_language/LATIN en_US          24.5 ns         24.4 ns     28664868
BM_hb_ot_tags_from_script_and_language/COMMON none          4.35 ns         4.34 ns    161190014
BM_hb_ot_tags_from_script_and_language/LATIN none           4.36 ns         4.34 ns    161319000

@behdad
Copy link
Member Author

behdad commented May 17, 2022

Calling this fixed for now.

@behdad behdad closed this as completed May 17, 2022
@behdad
Copy link
Member Author

behdad commented May 17, 2022

Seems like the zh case is just the worst-case of the bsearch where en gets rather lucky. If I disable the bsearch, both zh and en become the same as abcd at 8ns.

@behdad
Copy link
Member Author

behdad commented May 17, 2022

So we're bsearch-bound now, and the cmp of it is fully optimized now.

A way to vastly speed up the case for 2-letter languages is to use a sparse array of 26x26 letters. Currently there's maximum three tags per language, so 9 bytes x 26 x 26 is required for that table, ~6kb, vs around 1.6kb currently.

behdad added a commit that referenced this issue May 17, 2022
Using an integer tag to bsearch, instead of string.

Part of: #3591

Before:
------------------------------------------------------------------------------------------------
Benchmark                                                      Time             CPU   Iterations
------------------------------------------------------------------------------------------------
BM_hb_ot_tags_from_script_and_language/COMMON abcd_XY       8.11 ns         8.08 ns     87067795
BM_hb_ot_tags_from_script_and_language/COMMON zh_CN         53.6 ns         53.5 ns     13042418
BM_hb_ot_tags_from_script_and_language/COMMON en_US         24.2 ns         24.1 ns     29052731
BM_hb_ot_tags_from_script_and_language/LATIN en_US          24.4 ns         24.3 ns     28736769
BM_hb_ot_tags_from_script_and_language/COMMON none          4.43 ns         4.41 ns    160370413
BM_hb_ot_tags_from_script_and_language/LATIN none           4.35 ns         4.34 ns    160578191

After:
------------------------------------------------------------------------------------------------
Benchmark                                                      Time             CPU   Iterations
------------------------------------------------------------------------------------------------
BM_hb_ot_tags_from_script_and_language/COMMON abcd_XY       7.97 ns         7.95 ns     85208363
BM_hb_ot_tags_from_script_and_language/COMMON zh_CN         41.7 ns         41.6 ns     16945817
BM_hb_ot_tags_from_script_and_language/COMMON en_US         16.1 ns         16.0 ns     43613523
BM_hb_ot_tags_from_script_and_language/LATIN en_US          16.5 ns         16.4 ns     42568107
BM_hb_ot_tags_from_script_and_language/COMMON none          4.30 ns         4.29 ns    164055469
BM_hb_ot_tags_from_script_and_language/LATIN none           4.29 ns         4.27 ns    163793591
@behdad
Copy link
Member Author

behdad commented May 17, 2022

I made the bsearch use integer instead of string. Now:

------------------------------------------------------------------------------------------------
Benchmark                                                      Time             CPU   Iterations
------------------------------------------------------------------------------------------------
BM_hb_ot_tags_from_script_and_language/COMMON abcd_XY       7.97 ns         7.95 ns     85208363
BM_hb_ot_tags_from_script_and_language/COMMON zh_CN         41.7 ns         41.6 ns     16945817
BM_hb_ot_tags_from_script_and_language/COMMON en_US         16.1 ns         16.0 ns     43613523
BM_hb_ot_tags_from_script_and_language/LATIN en_US          16.5 ns         16.4 ns     42568107
BM_hb_ot_tags_from_script_and_language/COMMON none          4.30 ns         4.29 ns    164055469
BM_hb_ot_tags_from_script_and_language/LATIN none           4.29 ns         4.27 ns    163793591

behdad added a commit that referenced this issue May 17, 2022
Part of #3591

Humm. Looks like not all of the fat is bsearch overhead now. I cached
the last bsearch result, but most of the time is still there. I'm
baffled.

Before:
------------------------------------------------------------------------------------------------
Benchmark                                                      Time             CPU   Iterations
------------------------------------------------------------------------------------------------
BM_hb_ot_tags_from_script_and_language/COMMON abcd_XY       8.08 ns         8.05 ns     84500482
BM_hb_ot_tags_from_script_and_language/COMMON zh_CN         42.2 ns         42.1 ns     16722006
BM_hb_ot_tags_from_script_and_language/COMMON en_US         16.1 ns         16.0 ns     43461527
BM_hb_ot_tags_from_script_and_language/LATIN en_US          16.5 ns         16.5 ns     42448505
BM_hb_ot_tags_from_script_and_language/COMMON none          4.34 ns         4.33 ns    161290530
BM_hb_ot_tags_from_script_and_language/LATIN none           4.34 ns         4.33 ns    162339799

After:
------------------------------------------------------------------------------------------------
Benchmark                                                      Time             CPU   Iterations
------------------------------------------------------------------------------------------------
BM_hb_ot_tags_from_script_and_language/COMMON abcd_XY       8.13 ns         8.11 ns     80438134
BM_hb_ot_tags_from_script_and_language/COMMON zh_CN         40.0 ns         39.9 ns     17487939
BM_hb_ot_tags_from_script_and_language/COMMON en_US         12.7 ns         12.7 ns     55124394
BM_hb_ot_tags_from_script_and_language/LATIN en_US          13.1 ns         13.0 ns     53660125
BM_hb_ot_tags_from_script_and_language/COMMON none          4.61 ns         4.60 ns    151394104
BM_hb_ot_tags_from_script_and_language/LATIN none           4.70 ns         4.68 ns    150402847
@behdad
Copy link
Member Author

behdad commented May 17, 2022

Humm. Looks like not all of the fat is bsearch overhead now. I cached the last bsearch result, but most of the time is still there. I'm baffled.

After:

------------------------------------------------------------------------------------------------
Benchmark                                                      Time             CPU   Iterations
------------------------------------------------------------------------------------------------
BM_hb_ot_tags_from_script_and_language/COMMON abcd_XY       8.13 ns         8.11 ns     80438134
BM_hb_ot_tags_from_script_and_language/COMMON zh_CN         40.0 ns         39.9 ns     17487939
BM_hb_ot_tags_from_script_and_language/COMMON en_US         12.7 ns         12.7 ns     55124394
BM_hb_ot_tags_from_script_and_language/LATIN en_US          13.1 ns         13.0 ns     53660125
BM_hb_ot_tags_from_script_and_language/COMMON none          4.61 ns         4.60 ns    151394104
BM_hb_ot_tags_from_script_and_language/LATIN none           4.70 ns         4.68 ns    150402847

@behdad
Copy link
Member Author

behdad commented May 17, 2022

My benchmark had a bug. Now it looks like not much time is spent in bsearch anymore...

@behdad
Copy link
Member Author

behdad commented May 17, 2022

These are the current numbers:

------------------------------------------------------------------------------------------------
Benchmark                                                      Time             CPU   Iterations
------------------------------------------------------------------------------------------------
BM_hb_ot_tags_from_script_and_language/COMMON abcd_XY       78.0 ns         77.7 ns      8917912
BM_hb_ot_tags_from_script_and_language/COMMON zh_CN         44.9 ns         44.8 ns     15475318
BM_hb_ot_tags_from_script_and_language/COMMON en_US         17.6 ns         17.5 ns     39812340
BM_hb_ot_tags_from_script_and_language/LATIN en_US          18.2 ns         18.1 ns     38356204
BM_hb_ot_tags_from_script_and_language/COMMON none          4.76 ns         4.74 ns    148746131
BM_hb_ot_tags_from_script_and_language/LATIN none           4.73 ns         4.71 ns    148421349

behdad added a commit that referenced this issue May 17, 2022
Part of #3591

Ouch!

These are the current numbers:

------------------------------------------------------------------------------------------------
Benchmark                                                      Time             CPU   Iterations
------------------------------------------------------------------------------------------------
BM_hb_ot_tags_from_script_and_language/COMMON abcd_XY       78.0 ns         77.7 ns      8917912
BM_hb_ot_tags_from_script_and_language/COMMON zh_CN         44.9 ns         44.8 ns     15475318
BM_hb_ot_tags_from_script_and_language/COMMON en_US         17.6 ns         17.5 ns     39812340
BM_hb_ot_tags_from_script_and_language/LATIN en_US          18.2 ns         18.1 ns     38356204
BM_hb_ot_tags_from_script_and_language/COMMON none          4.76 ns         4.74 ns    148746131
BM_hb_ot_tags_from_script_and_language/LATIN none           4.73 ns         4.71 ns    148421349
@behdad
Copy link
Member Author

behdad commented May 17, 2022

Bulk of the abcd_XY and zh_CN time are back in hb_ot_tags_from_complex_language() again. If I disable that, all non-none times come down to 15ns.

@behdad
Copy link
Member Author

behdad commented May 17, 2022

Next step would be to form a if trie in the generated code for the complex cases instead of the switch-if block that is there right now...

@behdad
Copy link
Member Author

behdad commented May 17, 2022

Again, I'm happy to call this fixed at this point. :)

behdad added a commit that referenced this issue May 17, 2022
behdad added a commit that referenced this issue May 17, 2022
Part of #3591

Comparing before to after
Benchmark                                                               Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------------
BM_hb_ot_tags_from_script_and_language/COMMON abcd_XY                -0.3371         -0.3371            71            47            71            47
behdad added a commit that referenced this issue May 18, 2022
Part of #3591

Still 'zh-trad' is the slowest case.

--------------------------------------------------------------------------------------------------
Benchmark                                                        Time             CPU   Iterations
--------------------------------------------------------------------------------------------------
BM_hb_ot_tags_from_script_and_language/COMMON zh_trad          136 ns          136 ns      5107838
BM_hb_ot_tags_from_script_and_language/COMMON ab_abcd          115 ns          115 ns      6103104
BM_hb_ot_tags_from_script_and_language/COMMON ab_abc          25.4 ns         25.3 ns     27674482
BM_hb_ot_tags_from_script_and_language/COMMON abcdef_XY       20.2 ns         20.1 ns     34795719
BM_hb_ot_tags_from_script_and_language/COMMON abcd_XY         19.4 ns         19.3 ns     36390401
BM_hb_ot_tags_from_script_and_language/COMMON cxy_CN          33.5 ns         33.4 ns     20998939
BM_hb_ot_tags_from_script_and_language/COMMON exy_CN          25.1 ns         25.0 ns     27705832
BM_hb_ot_tags_from_script_and_language/COMMON zh_CN           34.2 ns         34.1 ns     20564356
BM_hb_ot_tags_from_script_and_language/COMMON en_US           15.5 ns         15.5 ns     45032204
BM_hb_ot_tags_from_script_and_language/LATIN en_US            15.9 ns         15.8 ns     44412379
BM_hb_ot_tags_from_script_and_language/COMMON none            4.72 ns         4.71 ns    149101665
BM_hb_ot_tags_from_script_and_language/LATIN none             4.72 ns         4.70 ns    149254498
behdad added a commit that referenced this issue May 18, 2022
@behdad behdad reopened this May 18, 2022
@behdad
Copy link
Member Author

behdad commented May 18, 2022

Current numbers:

--------------------------------------------------------------------------------------------------
Benchmark                                                        Time             CPU   Iterations
--------------------------------------------------------------------------------------------------
BM_hb_ot_tags_from_script_and_language/COMMON zh_abcd          133 ns          133 ns      4845544
BM_hb_ot_tags_from_script_and_language/COMMON zh_hans          108 ns          107 ns      6490844
BM_hb_ot_tags_from_script_and_language/COMMON ab_abcd          109 ns          109 ns      6447121
BM_hb_ot_tags_from_script_and_language/COMMON ab_abc          23.5 ns         23.5 ns     29466504
BM_hb_ot_tags_from_script_and_language/COMMON abcdef_XY       20.2 ns         20.1 ns     35021717
BM_hb_ot_tags_from_script_and_language/COMMON abcd_XY         19.6 ns         19.6 ns     35789281
BM_hb_ot_tags_from_script_and_language/COMMON cxy_CN          32.8 ns         32.7 ns     21369410
BM_hb_ot_tags_from_script_and_language/COMMON exy_CN          25.2 ns         25.1 ns     27450162
BM_hb_ot_tags_from_script_and_language/COMMON zh_CN           33.1 ns         33.0 ns     20987357
BM_hb_ot_tags_from_script_and_language/COMMON en_US           15.1 ns         15.0 ns     46329117
BM_hb_ot_tags_from_script_and_language/LATIN en_US            15.5 ns         15.5 ns     45204539
BM_hb_ot_tags_from_script_and_language/COMMON none            4.82 ns         4.81 ns    145076116
BM_hb_ot_tags_from_script_and_language/LATIN none             4.82 ns         4.81 ns    146123477

Adding a second-level switch will be the next step.

@behdad
Copy link
Member Author

behdad commented May 18, 2022

Although, the ones above 100ns are those going into the non-switch if case.

@behdad behdad closed this as completed May 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants