
Change LittleEndian loads/stores to use memcpy #150

Merged
merged 1 commit into google:main from the betterunalignedloads branch on Jan 12, 2023

Conversation

davemgreen (Contributor)

The existing code uses a series of 8-bit loads combined with shifts and ORs to
emulate an (unaligned) load of a larger type. These are expected to become
single loads in the compiler, producing optimal assembly. While this does
happen, it happens very late in the compiler, meaning that throughout most of
the pipeline the sequence is treated (and cost-modelled) as multiple loads,
shifts, and ORs. This can cause the compiler to make poor decisions (such as
not unrolling loops that should be unrolled), or to break up the pattern
before it is turned into a single load.

For example, the loops in CompressFragment do not get unrolled as expected
because their modelled cost exceeds clang's unroll threshold.

Instead, this patch uses a more conventional method of loading unaligned
data: a direct memcpy, which the compiler can deal with much more
straightforwardly, modelling it as a single unaligned load from the start.
The old code is left as-is for big-endian systems.

This improves the performance of the BM_ZFlat benchmarks by up to 10-15% on
an Arm Neoverse N1.

Change-Id: I986f845ebd0a0806d052d2be3e4dbcbee91713d7
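
For readers unfamiliar with the idiom, here is a minimal sketch of the two patterns the description contrasts. The function names `LoadU32ShiftOr` and `LoadU32Memcpy` are illustrative only, not snappy's actual helpers:

```c++
#include <cstdint>
#include <cstring>

// Old pattern: assemble a 32-bit value from four 8-bit loads combined with
// shifts and ORs. The compiler only folds this into a single load very late,
// so most of the pipeline cost-models it as four loads plus arithmetic.
inline uint32_t LoadU32ShiftOr(const void* p) {
  const uint8_t* b = static_cast<const uint8_t*>(p);
  return static_cast<uint32_t>(b[0]) |
         (static_cast<uint32_t>(b[1]) << 8) |
         (static_cast<uint32_t>(b[2]) << 16) |
         (static_cast<uint32_t>(b[3]) << 24);
}

// New pattern: memcpy into a local variable. The compiler models this as a
// single unaligned load from the start, and on little-endian targets the
// bytes are already in the right order, so no swap is needed.
inline uint32_t LoadU32Memcpy(const void* p) {
  uint32_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}
```

On a little-endian target both functions return the same value; on a big-endian target the shift/OR form (or a memcpy followed by a byte swap) is still required, which is why the patch leaves the big-endian path unchanged.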
@danlark1

LGTM from me

@pwnall self-requested a review on January 11, 2023 at 21:09
@pwnall (Member) left a comment

Thank you for the optimizations and the clear explanation!

I'll get this through the internal repository. This PR will be automatically merged when the process completes.

@JunHe77 (Contributor) commented Jan 12, 2023

Thank you for reviewing this, @danlark1, @pwnall! 😄

@pwnall merged commit 30326e5 into google:main on Jan 12, 2023
@davemgreen (Contributor, Author)

Thanks!

@davemgreen deleted the betterunalignedloads branch on January 12, 2023 at 17:43