Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Neon SIMD #6

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Implement Neon SIMD #6

wants to merge 1 commit into from

Conversation

CryZe
Copy link
Contributor

@CryZe CryZe commented Oct 13, 2021

This ports the WASM algorithm over to Aarch64 / ARM Neon. Rust itself isn't quite ready yet, but this should start compiling soon.

Blocked on: rust-lang/stdarch#1230

@CryZe CryZe force-pushed the neon-simd branch 2 times, most recently from fccf53d to 8733f3c Compare October 13, 2021 16:54
@pthariensflame
Copy link

I think those transmutes should probably be vld1qs?

@CryZe
Copy link
Contributor Author

CryZe commented Oct 14, 2021

They totally could be, but ptr::read_unaligned also should work just fine across the board (and is platform independent, so that makes the code easier to port to a different platform). Also this issue where vld* instructions are supposedly broken concerns me: rust-lang/stdarch#1227

@hkratz
Copy link

hkratz commented Oct 14, 2021

Actually only the interleaved loads are affected (vld[234]*), using vld1* should be fine as those use ptr::read_unaligned under the hood.

@mcountryman mcountryman marked this pull request as ready for review February 14, 2022 11:11
@mcountryman mcountryman reopened this Feb 14, 2022
@kaffeemonster
Copy link

Maybe you want to take a look at my adler32 SIMD implementation for NEON from way back then:
https://github.com/kaffeemonster/zlib/blob/adler32_vec/arm/adler32.c

It also contains some tricks which can be used in other SIMD (x86), mainly vector_chop, a modulo-approx. you can do on most SIMD instruction set. This way you can leave the data in the vector register and don't have to do a full reduction, copy to normal register set, do a modulo, and back to vector register. Also since you then have often 4 or more single sums, your inner loop count is not limited to NMAX (5552).

@peterdk
Copy link

peterdk commented Jan 6, 2023

Is there any progress or status on this? I need adler32 to be fast on arm64. Is this PR not deemed ok? Any plans to finish the support for NEON? I would love to contribute, but no SIMD experience.

@mcountryman
Copy link
Owner

Is there any progress or status on this? I need adler32 to be fast on arm64. Is this PR not deemed ok? Any plans to finish the support for NEON? I would love to contribute, but no SIMD experience.

I'll try running the build again when I have time. Last a checked rust support for some of the operations we need was lacking although this was a while ago

@mcountryman
Copy link
Owner

bump

@mcountryman mcountryman reopened this Jan 21, 2023
@CryZe
Copy link
Contributor Author

CryZe commented Jan 21, 2023

I'll quickly remove all the nightly conditions that shouldn't apply anymore to aarch64.

This ports the WASM algorithm over to Aarch64 / ARM Neon.
@CryZe
Copy link
Contributor Author

CryZe commented Jan 21, 2023

Seems like vdotq_u32 is not stable yet somehow.

@CryZe
Copy link
Contributor Author

CryZe commented Jan 21, 2023

Actually the error seems incorrect. The function is stable, but requires dotprod in addition to neon.

@mcountryman
Copy link
Owner

Actually the error seems incorrect. The function is stable, but requires dotprod in addition to neon.

It's been a bit since I looked at this code, but if I remember correct this would require adding an additional #cfg clause to allow compilation on non-dotprod,neon targets.

@CryZe
Copy link
Contributor Author

CryZe commented Jan 21, 2023

Mmh, turns out that the dotprod feature is indeed still unstable:
https://i.imgur.com/B2bAj6k.png
https://github.com/rust-lang/stdarch/blob/master/crates/stdarch-gen/src/main.rs#L1590

@peterdk
Copy link

peterdk commented Jan 21, 2023

Would this dotprod be supported on Android ARM64 devices? Or is it a very obscure instruction? Would it impact performance much if a device doesn't have it? (cq can we still use NEON for the rest?)

@mcountryman
Copy link
Owner

Would this dotprod be supported on Android ARM64 devices? Or is it a very obscure instruction? Would it impact performance much if a device doesn't have it? (cq can we still use NEON for the rest?)

At a glance @CryZe comment indicates we could get it working when compiling with nightly. The alternative is to port the c implementations mentioned above and hope that the intrinsics are stable.

@peterdk
Copy link

peterdk commented Jan 25, 2023

I am willing to offer a bounty of 50 Euro to get this working on ARM64 Android. One for the implementer (@CryZe ?) and one for the owner of this crate.

@peterdk
Copy link

peterdk commented Mar 30, 2023

I now resorted to using libdeflater-sys that wraps the libdeflate C++ library, and provides SIMD accelerated adler32 besides deflate and crc32. It has a dotproduct check and implementations for both with and without dot prod, while using NEON intrinsics. It did speed up my app on Android with 33% compared to using nightly simd-adler32. For now I leave it at that, maybe I find time and motivation in the future to extract that adler32 part from the library into a rust one, to make it somewhat more safe. But it's SIMD, so I doubt much improvements can be done in that regard.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants