Use movi NEON instruction to zero out registers #203

Closed


lgeiger
Contributor

@lgeiger lgeiger commented Sep 14, 2020

Currently dup is used to zero out NEON registers in the packing and AArch64 kernel code. According to the optimization guide for the Cortex-A72 (the core used in the Raspberry Pi 4), dup has an execution latency of 8 cycles and a throughput of 1 when copying from a general-purpose register to a NEON register.

This PR changes the code to use movi, which has a latency of 3 cycles and a throughput of 2. LLVM also uses movi for zeroing out registers, but please let me know if I am missing something here.
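To make the change concrete, here is a minimal sketch of the two ways to zero a NEON register. It is not the actual diff: the helper names `ZeroWithDup` and `ZeroWithMovi`, the choice of `v0`, and the store to a 16-byte buffer are assumptions made for illustration. On non-AArch64 targets the sketch falls back to `memset` so that it still builds.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helpers for illustration only; not taken from the PR's diff.
// Each one zeroes NEON register v0 and stores its 16 bytes to `out`.

// Old approach: broadcast the zero general-purpose register (wzr) into all
// four 32-bit lanes. This is a GPR-to-NEON transfer.
inline void ZeroWithDup(std::uint8_t* out) {
#if defined(__aarch64__)
  asm volatile(
      "dup v0.4s, wzr\n"
      "st1 {v0.16b}, [%[out]]\n"
      :
      : [out] "r"(out)
      : "v0", "memory");
#else
  std::memset(out, 0, 16);  // Portable fallback so the sketch builds off-target.
#endif
}

// New approach: materialize the immediate 0 directly in the vector register,
// with no general-purpose register involved.
inline void ZeroWithMovi(std::uint8_t* out) {
#if defined(__aarch64__)
  asm volatile(
      "movi v0.4s, #0\n"
      "st1 {v0.16b}, [%[out]]\n"
      :
      : [out] "r"(out)
      : "v0", "memory");
#else
  std::memset(out, 0, 16);  // Portable fallback so the sketch builds off-target.
#endif
}
```

Both helpers leave the same 16 zero bytes in `out`; the only difference is which instruction produces the zero, which is where the latency figures above come in.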

I briefly benchmarked this code on a Pixel phone but didn't see any measurable difference. I think this is expected: on the A76 architecture used there, dup has a latency of only 3 cycles, so this PR won't have a large effect anyway.

@google-cla google-cla bot added the cla: yes label Sep 14, 2020
@bjacob
Contributor

bjacob commented Oct 16, 2020

Thanks!

@lgeiger lgeiger deleted the movi-to-zero-neon-register branch October 16, 2020 00:37