Optimized Armv8 methods #2

Open

jakobnissen opened this issue Oct 8, 2021 · 0 comments

Owner

jakobnissen commented Oct 8, 2021 (edited)

The newer Apple computers use Arm processors and have no x86 instructions. Make sure to write a fast implementation for those.
Edit: A few notes:

Instead of the vpshufb instruction, use tbl. For vec_generic, tbl can take two LUTs directly in one instruction, which makes the code significantly simpler than the x86 version. The entire function could be:

(1 << (x >>> 5)) & tbl(x & 0b00011111, lut1, lut2)
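To make the bit layout behind that expression concrete, here is a minimal scalar sketch in Python. It is not the ScanByte implementation: `build_luts`, `tbl` and `is_member` are hypothetical helper names, and the table layout (low 5 bits of a byte select one of 32 table bytes, high 3 bits select a bit within it) is an assumption inferred from the expression above. The `tbl` function only imitates a two-register lookup for a single byte lane.

```python
def build_luts(byteset):
    """Pack a set of byte values into two 16-byte tables (32 bytes = 256 bits).

    Assumed layout: the low 5 bits of a byte pick the table entry,
    the high 3 bits pick the bit inside that entry.
    """
    table = bytearray(32)
    for b in byteset:
        table[b & 0b00011111] |= 1 << (b >> 5)
    return bytes(table[:16]), bytes(table[16:])


def tbl(index, lut1, lut2):
    """Scalar stand-in for a two-register tbl lookup on one byte lane."""
    table = lut1 + lut2
    # Out-of-range indices yield 0; with a 5-bit index this never triggers.
    return table[index] if index < len(table) else 0


def is_member(x, lut1, lut2):
    """Scalar version of (1 << (x >>> 5)) & tbl(x & 0b00011111, lut1, lut2)."""
    return ((1 << (x >> 5)) & tbl(x & 0b00011111, lut1, lut2)) != 0


# Hypothetical byte set, chosen only for illustration.
byteset = {0, 10, 65, 200, 255}
lut1, lut2 = build_luts(byteset)
members = [x for x in range(256) if is_member(x, lut1, lut2)]
```

In a vectorized version the same expression would run on all 16 lanes at once, with the shift done per lane.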

There is no equivalent to pmovmskb. Perhaps the fastest approach is to reinterpret the vector as a uint128 and count leading zeros directly. For trailing zeros, do rbit + clz. LLVM probably already does this for you, but check.
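The idea can be sketched with a scalar Python model. It assumes the comparison result is a 16-byte vector whose lanes are 0x00 or 0xFF and that lane 0 is the least significant byte of the 128-bit integer (little-endian), so the first matching lane is found with a trailing-zero count. All function names are hypothetical, and `rbit128` and `clz128` only mimic the bit-reverse and count-leading-zeros operations at 128-bit width.

```python
def clz128(v):
    """Count leading zeros of a 128-bit unsigned value (128 for v == 0)."""
    return 128 - v.bit_length()


def rbit128(v):
    """Reverse the bit order of a 128-bit unsigned value."""
    return int(format(v, "0128b")[::-1], 2)


def ctz128(v):
    """Count trailing zeros as rbit followed by clz."""
    return clz128(rbit128(v))


def first_match(mask_bytes):
    """Index of the first 0xFF lane in a 16-byte mask, or None if no lane is set.

    Assumes lane 0 is the least significant byte (little-endian).
    """
    v = int.from_bytes(mask_bytes, "little")
    if v == 0:
        return None
    # Each lane is 8 bits wide, so divide the bit position by 8.
    return ctz128(v) // 8


# Lanes 5 and 9 matched; the first match is lane 5.
mask = bytes(0xFF if i in (5, 9) else 0x00 for i in range(16))
idx = first_match(mask)
```

Whether the compiler really lowers a 128-bit trailing-zero count to rbit + clz on Armv8 is worth confirming in the generated code, as the note above says.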


jakobnissen mentioned this issue

Use SIMD code generator BioJulia/FASTX.jl#42

Closed

jakobnissen mentioned this issue

SIMD Capacity not detected on M1 #5

Closed

jakobnissen mentioned this issue

SIMD capacity not detected by ScanByte, using scalar fallback #7

Closed
