Use movi NEON instruction to zero out registers #203
Closed
Currently `dup` is used to zero out NEON registers in the packing and AArch64 kernel code. According to the Cortex-A72 optimization guide (the Cortex-A72 is the core used in the Raspberry Pi 4), `dup` has an execution latency of 8 cycles and a throughput of 1 when copying from a general-purpose register to a NEON register.

This PR changes the code to use `movi`, which has a latency of 3 cycles and a throughput of 2. This is also what LLVM uses for zeroing out registers, but please let me know if I am missing something here.

I briefly benchmarked this code on a Pixel phone but didn't see any measurable difference, which I think is expected: on the A76 architecture used there, `dup` only has a latency of 3 cycles, so this PR won't have a large effect anyway.
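
For illustration, here is a minimal sketch of the two zeroing idioms being compared. It is not the actual diff from this PR: the helper names `zero_vector_dup` and `zero_vector_movi` are hypothetical, and the use of C inline assembly (rather than the hand-written kernel assembly) is an assumption made so the snippet is self-contained. On non-AArch64 targets it falls back to `memset` so it still compiles.

```c
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: zero a 128-bit NEON register with `dup`,
 * copying from the general-purpose zero register, then store it. */
static void zero_vector_dup(uint8_t out[16]) {
#if defined(__aarch64__)
  uint8x16_t v;
  __asm__ volatile("dup %0.4s, wzr" : "=w"(v));
  vst1q_u8(out, v);
#else
  /* Fallback for non-AArch64 builds (assumption, for portability only). */
  memset(out, 0, 16);
#endif
}

/* Hypothetical helper: zero a 128-bit NEON register with `movi`,
 * which uses an immediate and does not read a general-purpose register. */
static void zero_vector_movi(uint8_t out[16]) {
#if defined(__aarch64__)
  uint8x16_t v;
  __asm__ volatile("movi %0.16b, #0" : "=w"(v));
  vst1q_u8(out, v);
#else
  /* Fallback for non-AArch64 builds (assumption, for portability only). */
  memset(out, 0, 16);
#endif
}
```

Both helpers produce the same result (sixteen zero bytes); any difference is in latency and throughput on a given core, which would need to be measured on the target hardware.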