-
Notifications
You must be signed in to change notification settings - Fork 558
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ARM Neon and scalar implementations of SIMD functions #359
Conversation
The previous code implicitly caused a load; change it so the load intrinsic is explicitly invoked, as the others are. (This in fact makes no difference to the generated code.)
Many Intel intrinsics have a corresponding Neon equivalent. Other cases are more interesting: * Neon's vmaxvq directly selects the maximum entry in a vector, so can be used to implement both the __max_16/__max_8 macros and the _mm_movemask_epi8 early loop exit. Introduce additional helper macros alongside __max_16/__max_8 so that the early loop exit can similarly be implemented differently on the two platforms. * Full-width shifts can be done via vextq. This is defined close to the ksw_u8()/ksw_i16() functions (rather than in neon_sse.h) as it implicitly uses one of their local variables. * ksw_i16() uses saturating *signed* 16-bit operations apart from _mm_subs_epu16; presumably the data is effectively still signed but we wish to keep it non-negative. The ARM intrinsics are more careful about type checking, so this requires an extra U16() helper macro.
This has been tested on x86-64 macOS and on aarch64 Debian GNU/Linux 11 (bullseye) (on a Raspberry Pi). The scalar version has been tested on both of those platforms. Thanks to @nh13 for assistance with test data and testing. |
Make the native SSE2 code conditional on __SSE2__, which is defined by GCC/Clang/etc on x86-64 by default and on i386 with -msse2 etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @jmarshall !
I've successfully verified your changes on Linux openEuler 20.03 SP3. CPU: Kunpeng-920 aarch64
Awesome work, @jmarshall ! |
This has now been tested on a big-endian host (namely s390x), where (for this PR's test set of 10,000 reads) it produces the same output as the PR on other hosts and current bwa master on x86-64. Note that bwa's index files (in particular, the .sa and .bwt files) are not portable between endiannesses, so |
@lh3 Is there a chance this PR to be reviewed and merged soon ? |
Thank you so much for this, @jmarshall. I got a M1 macbook yesterday. I also need the arm support now. |
Do you know if anyone is considering enabling SVE/SVE2 for these functions? |
Apply https://patch-diff.githubusercontent.com/raw/lh3/bwa/pull/359.patch from lh3/bwa#359 Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>
Building BWA has long been limited to x86-64 due to the unconditional usage of SSE2 SIMD intrinsics in ksw.c's
ksw_u8()
andksw_i16()
functions. This PR adds Neon implementations of these functions, allowing compilation on Aarch64 machines, and also implementations using ordinary scalar C code that at least provide a working (albeit inefficient) build on other (little endian, at least) platforms.This is done without duplicating the code in ksw.c: in most cases, by defining SSE2 intrinsics as
static inline
functions calling Neon intrinsics where there is a direct equivalent; in other cases, by analysing what the SSE2 code is doing semantically and using a non-directly-equivalent Neon intrinsic that has the same effect in the context of the code. (Similarly with the scalar implementation using a small inline C function for each SSE2 intrinsic.)There are already BWA PRs #283 and #344, which port the SIMD code to other platforms using SIMDe and sse2neon respectively. The approach used in this PR is superior for several reasons:
It adds ~200 lines of code, rather than a ~800,000-line submodule (SIMDe) or an external ~9000-line header file (sse2neon).
The sse2neon approach adds support for ARM platforms only; this approach (and the SIMDe approach) add support for other platforms too — little endian platforms anyway; the code may need further porting to other endiannesses.
(This is mostly of interest to distribution packagers who wish to have at least minimal support for other platforms; I am unaware of substantial communities of even potential BWA users on platforms other than x86 and ARM.)
This approach produces a more efficient ARM implementation than naive use of a direct SSE2 intrinsic emulation layer:
The
__max_16
/__max_8
macros return the maximum entry in their vector argument.SSE2 does not implement this operation directly, so they are coded in terms of other SSE2 intrinsics. The
_mm_srli_si128
operation used does not have a direct equivalent in Neon; sse2neon implements it using a sequence of three Neon intrinsics involving memory I/O. Implemented thus,__max_8
requires 3*(1+3)+1 = 13 Neon operations.However by understanding the semantics of the code (instead of simply translating individual operations), the entire
__max_8
computation can be done using a single Neonvmaxvq_s16
intrinsic (involving only registers).Similarly sse2neon implements
__mm_movemask_epi8
using a sequence of six Neon intrinsics. However analysing the code semantics (seeallzero_16
andallzero_0f_8
below) shows that an identical effect can be achieved using a singlevmaxvq_u8
/vmaxvq_u16
Neon intrinsic.