Optimized Armv8 methods #2

Open
jakobnissen opened this issue Oct 8, 2021 · 0 comments

jakobnissen commented Oct 8, 2021

The new-ish Apple computers use Arm CPUs and don't have x86 instructions. Make sure to write a fast implementation for those.
Edit: A few notes:

  • Instead of the vpshufb instruction, use tbl. For vec_generic, tbl can take two LUTs directly in one instruction, which makes the Arm code significantly simpler than the x86 version. The entire function could be:
(1 << (x >>> 5)) & tbl(x & 0b00011111, lut1, lut2)
  • There is no equivalent to pmovmskb. Perhaps the fastest option is to cast the vector to uint128 and count leading zeros directly (for trailing zeros, use rbit + clz; LLVM probably already does this for you, but check).
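
To make the two notes above concrete, here is a rough scalar model in Python. It is not NEON code and not the planned implementation. The helper names (build_luts, tbl, match_mask, rbit128, clz128, first_match) are made up for illustration. The LUT layout is an assumption inferred from the expression above: entry i of the 32-byte table has bit k set when byte (k << 5) | i is in the set. The uint128 cast is assumed to be little-endian, so byte 0 of the vector lands in the lowest bits.

```python
def build_luts(byteset):
    """Pack a set of byte values into two 16-byte lookup tables.

    Assumed layout: entry i of the combined 32-byte table has bit k set
    when the byte (k << 5) | i is in the set (32 entries x 8 bits = 256).
    """
    table = [0] * 32
    for b in byteset:
        table[b & 0b00011111] |= 1 << (b >> 5)
    return bytes(table[:16]), bytes(table[16:])


def tbl(indices, lut1, lut2):
    """Scalar model of a two-register table lookup: an index in 0..31
    selects from lut1 followed by lut2, any other index yields 0."""
    table = lut1 + lut2
    return bytes(table[i] if i < 32 else 0 for i in indices)


def match_mask(chunk, lut1, lut2):
    """Per byte: (1 << (x >>> 5)) & tbl(x & 0b00011111, lut1, lut2).
    A nonzero output byte means the input byte is in the set."""
    looked_up = tbl(bytes(x & 0b00011111 for x in chunk), lut1, lut2)
    return bytes((1 << (x >> 5)) & t for x, t in zip(chunk, looked_up))


def rbit128(n):
    """Reverse the bit order of a 128-bit unsigned integer."""
    return int(format(n, "0128b")[::-1], 2)


def clz128(n):
    """Count leading zeros of a 128-bit unsigned integer."""
    return 128 - n.bit_length()


def first_match(mask):
    """Index of the first nonzero byte in a 16-byte mask, or None.

    Models the pmovmskb replacement: view the vector as a little-endian
    uint128 and count trailing zeros, computed here as rbit followed by clz.
    """
    n = int.from_bytes(mask, "little")
    if n == 0:
        return None
    return clz128(rbit128(n)) // 8
```

For example, with luts built from the set {",", "!"}, match_mask(b"hello, world!   ", lut1, lut2) is nonzero at positions 5 and 12, and first_match on that mask returns 5. Dividing the bit count by 8 works because each matching byte contributes at least one set bit somewhere in its own 8-bit lane.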