
Change LittleEndian loads/stores to use memcpy #150

Merged
merged 1 commit into google:main from the betterunalignedloads branch on Jan 12, 2023

Conversation

davemgreen (Contributor)

The existing code uses a series of 8-bit loads combined with shifts and ORs to
emulate an (unaligned) load of a larger type. These are expected to become
single loads in the compiler, producing optimal assembly. While this does
happen, it happens very late in the compiler, meaning that throughout most of
the pipeline the sequence is treated (and cost-modelled) as multiple loads,
shifts, and ORs. This can cause the compiler to make poor decisions (such as
not unrolling loops that should be unrolled), or to break up the pattern
before it is turned into a single load.

For example, the loops in CompressFragment do not get unrolled as expected
because their modelled cost exceeds clang's unroll threshold.

Instead, this patch uses a more conventional method of loading unaligned
data: a direct memcpy, which the compiler can deal with much more
straightforwardly, modelling it as a single unaligned load from the start.
The old code is left as-is for big-endian systems.

This improves the performance of the BM_ZFlat benchmarks by up to 10-15% on
an Arm Neoverse N1.

Change-Id: I986f845ebd0a0806d052d2be3e4dbcbee91713d7
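
For readers unfamiliar with the idiom, here is a minimal sketch of the two patterns the description contrasts. The function names `LoadU32ShiftOr` and `LoadU32Memcpy` are illustrative only, not snappy's actual helpers:

```c++
#include <cstdint>
#include <cstring>

// Old pattern: assemble a 32-bit value from four 8-bit loads combined with
// shifts and ORs. The compiler only folds this into a single load very late,
// so most of the pipeline cost-models it as four loads plus arithmetic.
inline uint32_t LoadU32ShiftOr(const void* p) {
  const uint8_t* b = static_cast<const uint8_t*>(p);
  return static_cast<uint32_t>(b[0]) |
         (static_cast<uint32_t>(b[1]) << 8) |
         (static_cast<uint32_t>(b[2]) << 16) |
         (static_cast<uint32_t>(b[3]) << 24);
}

// New pattern: memcpy into a local variable. The compiler models this as a
// single unaligned load from the start, and on little-endian targets the
// bytes are already in the right order, so no swap is needed.
inline uint32_t LoadU32Memcpy(const void* p) {
  uint32_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}
```

On a little-endian target both functions return the same value; on a big-endian target the shift/OR form (or a memcpy followed by a byte swap) is still required, which is why the patch leaves the big-endian path unchanged.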
@danlark1

LGTM from me

@pwnall self-requested a review on January 11, 2023 at 21:09
@pwnall (Member) left a comment

Thank you for the optimizations and the clear explanation!

I'll get this through the internal repository. This PR will be automatically merged when the process completes.

@JunHe77 (Contributor) commented Jan 12, 2023

Thank you for reviewing this, @danlark1, @pwnall! 😄

@pwnall merged commit 30326e5 into google:main on Jan 12, 2023
@davemgreen (Contributor, Author)

Thanks!

@davemgreen deleted the betterunalignedloads branch on January 12, 2023 at 17:43