
Port to aarch64 #1562

Merged
merged 11 commits into mdtraj:master on Jun 14, 2020

Conversation

rmcgibbo
Member

It would be nice to get mdtraj compiling on aarch64 (armv8).
It's really not that hard, but there are a lot of SSE intrinsics that are quite esoteric and need to be ported.

@kyleabeauchamp
Contributor

FYI got a merge conflict already

@rmcgibbo
Member Author

rmcgibbo/mdtraj must have been quite a bit behind mdtraj/mdtraj

@rmcgibbo
Member Author

Well, that's enough for today. And it cost all of $0.75 to rent an AWS instance to develop on.

@rmcgibbo mentioned this pull request Jun 14, 2020
@rmcgibbo
Member Author

Attached is the log of the test results on aarch64. Everything passes; the only failure is in one of the IPython notebooks, because I didn't install sklearn.

log.txt

@rmcgibbo merged commit b28df2c into mdtraj:master Jun 14, 2020
@rmcgibbo deleted the aarch64 branch June 14, 2020 22:14
@tonigi
Contributor

tonigi commented Jun 15, 2020

Ouch, looks like we worked on the same problem within hours of each other. I made an attempt using a cross-architecture emulation library, SIMDe, which should compile down to whatever optimized instructions are available, but it still needs debugging.

@rmcgibbo
Member Author

@tonigi: Cool!

For aarch64, I did notice a few things while doing the port.

  1. There's an opportunity to use FMAs. The Intel intrinsics in theobald_rmsd_sse.h don't use FMA -- it's possible that those instructions are only available on CPUs that support AVX, which probably wasn't common when that code was written. I did use FMAs in the aarch64 code, though.
  2. NEON has very convenient de-interleaved loads that eliminate the need for a bunch of shuffles. I don't think x64 has these. If you emulate the x64 intrinsics on aarch64 I don't think it'd be smart enough to use the direct de-interleaved loads.
  3. The geometry code is no problem, because it already uses a fvec4 wrapper class around the intrinsics, so you can just pull in https://github.com/openmm/openmm/blob/master/openmmapi/include/openmm/internal/vectorize_ppc.h if you want to handle ppc. It's only the RMSD code that either needs to be ported directly or refactored.

@nemequ

nemequ commented Jun 30, 2020

Hi all, I'm the lead developer for SIMDe.

> SIMDe, which should compile to whatever optimized directives are available, but needs debugging.

For what it's worth, the most common problem is aliasing and/or alignment issues where people use something like reinterpret_cast<__m128*>(ptr). x86 compilers are designed to allow this (in the case of GCC and clang, there is a __may_alias__ attribute on the x86 vector types to tell the compiler it can't perform certain optimizations), but if you're targeting other architectures there is a good chance the compiler will use optimizations which break your code.

The solution here is to use functions like _mm_loadu_ps (or, if your data is aligned to 16-byte boundaries, _mm_load_ps) and not cast to __m128*. The x86 load functions actually take float const* arguments, so the cast is unnecessary, and clang tends to have problems when you do cast to __m128*. The functions in SIMDe take a void*, so if you want to avoid this problem without triggering any warnings about implicit conversions you can use simde_mm_loadu_ps instead of _mm_loadu_ps.

> There's an opportunity to use FMAs. The Intel intrinsics in theobald_rmsd_sse.h don't use FMA -- it's possible that those instructions only are available for CPUs that support AVX, which probably wasn't common when that code was written. I used FMAs in the aarch64 code though.

FMA on x86 is actually a separate extension from AVX or AVX2. IDK what types you're using, but there is an _mm_madd_epi16 in SSE2. For floating point it's not available until FMA. That said, if you're using SIMDe you can just use the FMA functions… if you're targeting FMA the compiler will use the _mm_fma* functions, otherwise it will fall back on _mm_mul_* + _mm_add_*. And, of course, on platforms like ARM and POWER you should get real FMA functions (I actually wrote a patch for ARM over the weekend, it's not in SIMDe quite yet but it should be within a day or two).

> NEON has very convenient de-interleaved loads that eliminate the need for a bunch of shuffles. I don't think x64 has these. If you emulate the x64 intrinsics on aarch64 I don't think it'd be smart enough to use the direct de-interleaved loads.

It doesn't :(.

We're actually working on implementing NEON functions using SSE as well. We haven't implemented these yet (we have vuzp1/vuzp2, but not vld2), but if it would help you we can prioritize those. That way you could actually just call the NEON functions for your loads even on x86; they would just be a vld2 on NEON, and on x86 it would be a load and shuffle.

With SIMDe there isn't really anything preventing you from mixing functions from different extensions, and it can actually help, since a more specific function sometimes lets us do better on another architecture (POWER, WASM, etc.).

@tonigi
Contributor

tonigi commented Jun 30, 2020

Thanks @nemequ. In fact I made another attempt at using SIMDe with @rmcgibbo's clean ARM implementation: https://github.com/giorginolab/mdtraj/tree/simde-attempt-2 . It does not work yet (missing types such as float32x4x3_t), but I wanted to leave a pointer to it.
