Use movi NEON instruction to zero out registers #203

lgeiger · 2020-09-14T21:58:05Z

Currently dup is used to zero out NEON registers in the packing and AArch64 kernel code. According to the optimization guide for the Cortex-A72, the core used in the Raspberry Pi 4, dup has an execution latency of 8 cycles and a throughput of 1 when copying from a general-purpose register to a NEON register.

This PR changes the code to use movi, which has a latency of 3 cycles and a throughput of 2. LLVM also uses movi for zeroing out registers, but please let me know if I am missing something here.
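To make the change concrete, here is a minimal sketch of the two zeroing idioms side by side. It is not taken from the PR diff: the helper name `zeroing_idioms_agree`, the choice of registers v0 and v1, and the GCC/Clang-style inline assembly are all assumptions for illustration. On non-AArch64 targets the sketch falls back to `memset` so that it still compiles.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of the PR): zeroes one 128-bit NEON register
// with each idiom, stores both to memory, and checks that the results match.
inline bool zeroing_idioms_agree() {
  std::uint32_t via_dup[4];
  std::uint32_t via_movi[4];
  // Fill with a non-zero pattern so a skipped store would be detected.
  std::memset(via_dup, 0xFF, sizeof(via_dup));
  std::memset(via_movi, 0xFF, sizeof(via_movi));

#if defined(__aarch64__)
  asm volatile(
      // Old idiom: broadcast the general-purpose zero register into v0.
      "dup v0.4s, wzr\n"
      "st1 {v0.4s}, [%[dup_out]]\n"
      // New idiom: materialize an immediate zero directly in v1.
      "movi v1.4s, #0\n"
      "st1 {v1.4s}, [%[movi_out]]\n"
      :
      : [dup_out] "r"(via_dup), [movi_out] "r"(via_movi)
      : "v0", "v1", "memory");
#else
  // Assumption: on other targets there is no NEON, so just zero the buffers.
  std::memset(via_dup, 0, sizeof(via_dup));
  std::memset(via_movi, 0, sizeof(via_movi));
#endif

  // Both buffers must now hold only zero bits.
  const std::uint32_t zeros[4] = {0, 0, 0, 0};
  return std::memcmp(via_dup, zeros, sizeof(zeros)) == 0 &&
         std::memcmp(via_movi, zeros, sizeof(zeros)) == 0;
}
```

Both idioms leave the register with identical contents, so the swap should only affect timing, not results. The sketch checks equivalence only; it does not measure latency or throughput.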

I briefly benchmarked this code on a Pixel phone but didn't see any measurable difference. I think that is expected: on the A76 architecture it uses, dup only has a latency of 3 cycles, so this PR won't have a large effect there anyway.

bjacob · 2020-10-16T00:19:09Z

Thanks!

Use movi NEON instruction to zero out registers

106c13e

google-cla bot added the cla: yes label Sep 14, 2020

bjacob added the ready to pull label Oct 16, 2020

copybara-service bot closed this in 034c0e2 Oct 16, 2020

lgeiger deleted the movi-to-zero-neon-register branch October 16, 2020 00:37
