Enable the use of [SU]Int32Size and EnumSize templates for AArch64 #11102

avieira-arm · 2022-11-30T22:09:03Z

Hi,

When benchmarking proto_benchmark from fleetbench on an AArch64 target we found that clang is able to vectorize these functions and they offer better performance than the scalar alternative.

I ran //src/google/protobuf:arena_unittest on aarch64-none-linux-gnu. Should I run any other tests? Also, protobuf used to have its own set of benchmarks, but I can't find them when I query all targets with bazel. Let me know if you'd like me to run anything else; I couldn't find instructions for the full test run.

ckennelly · 2022-12-01T16:15:02Z

The benchmarks are now located in https://github.com/google/fleetbench

avieira-arm · 2023-01-06T17:02:29Z

Rebased this as I thought that might be the cause for the 'Mergeable 1/1 Fail(s): LABEL', but that still seems to be there. Any idea what that's trying to tell me?

And @ckennelly, thanks. I used proto_benchmark off that repo to benchmark protobuf and it's how I found out we weren't using this path for Arm. The reason I asked about benchmarks was because I remembered the protobuf-repo ones were slightly different (I think they had smaller messages).

fowles · 2023-01-06T17:05:24Z

The Mergeable failure is a safeguard that prevents Google engineers from accidentally merging the PR without doing internal testing.

danlark1 · 2023-01-08T16:45:42Z

For fleetbench we see 1-1.5% improvement on G2A instances

name                 old cpu/op  new cpu/op  delta
BM_Protogen_Arena    19.1ms ± 2%  18.9ms ± 3%  -1.48%  (p=0.000 n=36+45)
BM_Protogen_NoArena  20.6ms ± 1%  20.4ms ± 3%  -1.18%  (p=0.000 n=30+44)

For int32 size microbenchmarks we have something like 2x speed up

BM_RepeatedFieldSize/1        2.67ns ± 0%  5.39ns ± 0%  +101.67%      (p=0.000 n=495+552)
BM_RepeatedFieldSize/8        10.6ns ± 0%   6.5ns ± 1%   -38.57%      (p=0.000 n=515+581)
BM_RepeatedFieldSize/64       63.8ns ± 0%  30.0ns ± 0%   -52.98%      (p=0.000 n=550+438)
BM_RepeatedFieldSize/512       490ns ± 0%   216ns ± 0%   -55.84%      (p=0.000 n=563+547)
BM_RepeatedFieldSize/1k        981ns ± 0%   430ns ± 0%   -56.15%      (p=0.000 n=545+506)

Godbolt looks great. We managed to get a combination of cmhi and sub with usra when needed for >>31. This is compared to clz + usra in the previous version. https://gcc.godbolt.org/z/W9bGd81KM
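To make the codegen comparison concrete, here is a minimal sketch of the two approaches being contrasted: a per-element size computed from the bit width (the kind of code that lowers to clz), and a batch loop that only uses threshold comparisons and a shift by 31, which a compiler can turn into vector compares and accumulates. The function names `Int32SizeScalar` and `Int32SizeBatch` are illustrative assumptions, not the identifiers used in wire_format_lite.cc, and the log2 loop stands in for a hardware count-leading-zeros instruction.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-element helper: varint length of a sign-extended int32.
// Negative values are sign-extended to 64 bits and always take 10 bytes.
inline std::size_t Int32SizeScalar(std::int32_t value) {
  if (value < 0) return 10;
  std::uint32_t v = static_cast<std::uint32_t>(value) | 1u;
  // floor(log2(v)) with a portable loop; production code would use clz.
  std::uint32_t log2 = 0;
  while (v >>= 1) ++log2;
  // Maps bit widths 1..7 -> 1 byte, 8..14 -> 2 bytes, and so on.
  return (log2 * 9 + 73) / 64;
}

// Hypothetical batch helper: branch-free threshold comparisons only, so the
// loop body is a good candidate for auto-vectorization.
inline std::size_t Int32SizeBatch(const std::int32_t* data, int n) {
  std::uint32_t sum = static_cast<std::uint32_t>(n);  // one byte per element
  std::uint32_t msb_sum = 0;                          // count of negatives
  for (int i = 0; i < n; ++i) {
    const std::uint32_t x = static_cast<std::uint32_t>(data[i]);
    msb_sum += x >> 31;       // 1 if the value is negative
    sum += x > 0x7Fu;         // needs a 2nd byte
    sum += x > 0x3FFFu;       // needs a 3rd byte
    sum += x > 0x1FFFFFu;     // needs a 4th byte
    sum += x > 0xFFFFFFFu;    // needs a 5th byte
  }
  // Negative values already counted 5 bytes above; sign extension adds 5 more.
  return sum + msb_sum * 5;
}
```

Both helpers should agree on any input; the batch form simply trades the bit-width computation for four compares per element, which is what makes the single-element case slower and the longer arrays faster in the numbers above.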

LGTM from the arm code generation side. Thanks a lot!

src/google/protobuf/wire_format_lite.cc

avieira-arm · 2023-01-09T13:53:29Z

> For int32 size microbenchmarks we have something like 2x speed up
> BM_RepeatedFieldSize/1        2.67ns ± 0%  5.39ns ± 0%  +101.67%      (p=0.000 n=495+552)

How do you run the BM_RepeatedFieldSize Benchmarks?

danlark1 · 2023-01-09T15:19:56Z

> > For int32 size microbenchmarks we have something like 2x speed up
> > BM_RepeatedFieldSize/1        2.67ns ± 0%  5.39ns ± 0%  +101.67%      (p=0.000 n=495+552)
>
> How do you run the BM_RepeatedFieldSize Benchmarks?

They are internal to us as we did not want to expose gbench at the time, I guess. Not sure about the current state, we probably should expose them.

The benchmark is simple: it just runs Int32Size on a random repeated field. It lives in protobuf/wire_format_unittest.cc.
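Since the internal benchmark is not public, the following is only a rough, self-contained sketch of what "run the size routine on a random repeated field" could look like. It uses `std::chrono` rather than Google Benchmark, and every name in it (`TotalInt32Size`, `MakeRandomField`, `NanosPerCall`) is a hypothetical stand-in, not the code in wire_format_unittest.cc.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <random>
#include <vector>

// Hypothetical stand-in for the routine under test: total varint size of
// sign-extended int32 values.
std::size_t TotalInt32Size(const std::vector<std::int32_t>& values) {
  std::size_t sum = values.size();  // one byte per element
  for (std::int32_t v : values) {
    const std::uint32_t x = static_cast<std::uint32_t>(v);
    sum += (x > 0x7Fu) + (x > 0x3FFFu) + (x > 0x1FFFFFu) + (x > 0xFFFFFFFu);
    sum += (x >> 31) * 5;  // negatives are sign-extended to 10 bytes
  }
  return sum;
}

// Builds a "random repeated field" of n values from a fixed seed so that
// runs are reproducible.
std::vector<std::int32_t> MakeRandomField(std::size_t n, std::uint32_t seed) {
  std::mt19937 rng(seed);
  std::uniform_int_distribution<std::int32_t> dist(
      std::numeric_limits<std::int32_t>::min(),
      std::numeric_limits<std::int32_t>::max());
  std::vector<std::int32_t> field(n);
  for (auto& value : field) value = dist(rng);
  return field;
}

// Average wall-clock nanoseconds per call over `iterations` calls.
// The volatile sink keeps the result alive; a real harness would use a
// stronger optimization barrier, so treat the numbers as indicative only.
double NanosPerCall(std::size_t n, int iterations) {
  const std::vector<std::int32_t> field = MakeRandomField(n, 42);
  volatile std::size_t sink = 0;
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i) sink = TotalInt32Size(field);
  const auto end = std::chrono::steady_clock::now();
  (void)sink;
  return std::chrono::duration<double, std::nano>(end - start).count() /
         iterations;
}
```

Calling `NanosPerCall` with sizes such as 1, 8, 64, 512 and 1024 mirrors the argument sweep in the table above, although absolute timings from this sketch are not comparable to the reported ones.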

haberman · 2023-02-16T21:43:06Z

@avieira-arm could you please rebase on main? Sorry for the trouble, you've caught us in the middle of a migration to GitHub Actions.

avieira-arm · 2023-02-20T14:39:27Z

I've rebased it but I've not been able to rebuild proto_benchmark due to some bazel stuff, so I'm hoping the CI can check whether it builds fine here.

While I have your attention, maybe you can help me resolve the issue I have with building the proto benchmark in fleetbench.

I build it overriding the com_google_protobuf, com_google_absl and com_google_tcmalloc repositories with local ones (that are unchanged, other than this protobuf patch). I get an error complaining the C++ version is too old, and when I check the build commands with --verbose_failures I see '-std=c++0x' is being passed. I can't find this option being added in any of the local repositories, so I have to assume it is being added by some bazel rule that is being downloaded. I'm not super experienced with bazel and I've only done a minimal level of looking into this for now. In the past I hacked up the config files in abseil to get me past the errors, but the projects now actually use C++14 features, so making these changes is no longer viable.

avieira-arm · 2023-02-23T10:37:33Z

Looks like none of the tests are running because of some missing 'secrets'. I suspect this is an internal thing too?

avieira-arm · 2023-03-09T11:08:28Z

I keep getting emails about workflows failing to run. I don't think there's much I can do about that though. Can someone please confirm?

It would be nice to get this merged or dropped if you don't want it, but the benchmarks seem to indicate that it would be a desirable change :)

avieira-arm · 2023-04-03T15:33:01Z

Rebased again.

fowles · 2023-04-25T18:07:06Z

Sorry to be a pain, but you are synced to a bad point and we need another rebase.


avieira-arm · 2023-05-15T18:09:12Z

Rebased again. Let me know if you still want this.

danlark1 · 2023-05-15T18:25:56Z

We definitely want this, this should be a very safe change. LGTM from me

@fowles

fowles · 2023-05-17T14:23:44Z

Sorry for all the delays on this one!

anandolee assigned deannagarcia Dec 14, 2022

deannagarcia assigned sbenzaquen and unassigned deannagarcia Dec 14, 2022

avieira-arm force-pushed the main branch from b6bd05c to cf77f0e Compare January 6, 2023 16:55

avieira-arm requested a review from a team as a code owner January 6, 2023 16:55

avieira-arm requested review from mcy and removed request for a team January 6, 2023 16:55

fowles added kokoro:run c++ labels Jan 6, 2023

protobuf-kokoro removed the kokoro:run label Jan 6, 2023

danlark1 reviewed Jan 8, 2023


avieira-arm force-pushed the main branch from cf77f0e to 97c674a Compare January 9, 2023 13:55

sbenzaquen approved these changes Jan 9, 2023


sbenzaquen approved these changes Feb 7, 2023


avieira-arm force-pushed the main branch from 97c674a to f3a3c39 Compare February 20, 2023 14:30

zhangskz removed the release notes: no label Feb 23, 2023

avieira-arm force-pushed the main branch from f3a3c39 to 538368c Compare April 3, 2023 15:32

fowles added the "🅰️ safe for tests" label Apr 25, 2023

github-actions bot removed the "🅰️ safe for tests" label Apr 25, 2023

avieira-arm force-pushed the main branch from 538368c to 6c6df5b Compare April 26, 2023 08:29

Enable the use of [SU]Int32Size and EnumSize templates for AArch64

5552410


avieira-arm force-pushed the main branch from 6c6df5b to 5552410 Compare May 15, 2023 18:08

sbenzaquen removed the request for review from mcy May 15, 2023 21:52

fowles added the "🅰️ safe for tests" label May 15, 2023

github-actions bot removed the "🅰️ safe for tests" label May 15, 2023

fowles approved these changes May 16, 2023


fowles added the "platform related" label May 16, 2023

copybara-service bot closed this in e285d3e May 17, 2023
