
Port to aarch64 #1562

Merged
merged 11 commits into mdtraj:master on Jun 14, 2020

Conversation

rmcgibbo
Member

It would be nice to get mdtraj compiling on aarch64 (armv8).
It's really not that hard, but there are a lot of SSE intrinsics that are quite esoteric and need to be ported.

@kyleabeauchamp
Contributor

FYI got a merge conflict already

@rmcgibbo
Member Author

rmcgibbo/mdtraj must have been quite a bit behind mdtraj/mdtraj

@rmcgibbo
Member Author

Well, that's enough for today. And it cost all of $0.75 to rent an AWS instance to develop on.

@rmcgibbo mentioned this pull request Jun 14, 2020
@rmcgibbo
Member Author

Attached is the log of the test results on aarch64. Everything passes; the only failure is in one of the IPython notebooks, because I didn't install sklearn.

log.txt

@rmcgibbo merged commit b28df2c into mdtraj:master Jun 14, 2020
@rmcgibbo deleted the aarch64 branch June 14, 2020 22:14
@tonigi
Contributor

tonigi commented Jun 15, 2020

Ouch, looks like we worked on the same problem within hours of each other. I made an attempt using a cross-architecture emulation library, SIMDe, which should compile down to whatever optimized instructions are available, but it still needs debugging.

@rmcgibbo
Member Author

@tonigi: Cool!

For aarch64, I did notice a few things while doing the port.

  1. There's an opportunity to use FMAs. The Intel intrinsics in theobald_rmsd_sse.h don't use FMA -- it's possible that those instructions are only available on CPUs that support AVX, which probably wasn't common when that code was written. I did use FMAs in the aarch64 code, though.
  2. NEON has very convenient de-interleaved loads that eliminate the need for a bunch of shuffles. I don't think x64 has these. If you emulate the x64 intrinsics on aarch64 I don't think it'd be smart enough to use the direct de-interleaved loads.
  3. The geometry code is no problem, because it already uses a fvec4 wrapper class around the intrinsics, so you can just pull in https://github.com/openmm/openmm/blob/master/openmmapi/include/openmm/internal/vectorize_ppc.h if you want to handle ppc. It's only the RMSD code that either needs to be ported directly or refactored.

@nemequ

nemequ commented Jun 30, 2020

Hi all, I'm the lead developer for SIMDe.

> SIMDe, which should compile to whatever optimized directives are available, but needs debugging.

For what it's worth, the most common problem is aliasing and/or alignment issues where people use something like reinterpret_cast<__m128*>(ptr). x86 compilers are designed to allow this (in the case of GCC and clang, there is a __may_alias__ attribute on the x86 vector types to tell the compiler it can't perform certain optimizations), but if you're targeting other architectures there is a good chance the compiler will use optimizations which break your code.

The solution here is to use functions like _mm_loadu_ps (or, if your data is aligned to 16-byte boundaries, _mm_load_ps) and not cast to __m128*. The x86 load functions actually take float const* arguments, so the cast is unnecessary, and clang tends to have problems when you do cast to __m128*. The functions in SIMDe take a void*, so if you want to avoid this problem without triggering any warnings about implicit conversions you can use simde_mm_loadu_ps instead of _mm_loadu_ps.

> There's an opportunity to use FMAs. The Intel intrinsics in theobald_rmsd_sse.h don't use FMA -- it's possible that those instructions only are available for CPUs that support AVX, which probably wasn't common when that code was written. I used FMAs in the aarch64 code though.

FMA on x86 is actually a separate extension from AVX or AVX2. IDK what types you're using, but there is an _mm_madd_epi16 in SSE2. For floating point it's not available until FMA. That said, if you're using SIMDe you can just use the FMA functions… if you're targeting FMA the compiler will use the _mm_fma* functions, otherwise it will fall back on _mm_mul_* + _mm_add_*. And, of course, on platforms like ARM and POWER you should get real FMA functions (I actually wrote a patch for ARM over the weekend, it's not in SIMDe quite yet but it should be within a day or two).

> NEON has very convenient de-interleaved loads that eliminate the need for a bunch of shuffles. I don't think x64 has these. If you emulate the x64 intrinsics on aarch64 I don't think it'd be smart enough to use the direct de-interleaved loads.

It doesn't :(.

We're actually working on implementing NEON functions using SSE as well. We haven't implemented these yet (we have vuzp1/vuzp2, but not vld2), but if it would help you we can prioritize those. That way you could actually just call the NEON functions for your loads even on x86; they would just be a vld2 on NEON, and on x86 it would be a load and shuffle.

With SIMDe there isn't really anything preventing you from mixing functions from different extensions, and it can actually help, since a more specific function sometimes lets us do better on another architecture (POWER, WASM, etc.).

@tonigi
Contributor

tonigi commented Jun 30, 2020

Thanks @nemequ. In fact I made another attempt at using SIMDe with @rmcgibbo's clean ARM implementation: https://github.com/giorginolab/mdtraj/tree/simde-attempt-2 . It does not work yet (missing types such as float32x4x3_t), but I wanted to leave a pointer to it.
