Support for ARM64? #59

Closed
philres opened this issue Jun 5, 2018 · 19 comments

Comments

@philres

philres commented Jun 5, 2018

Hi Jeff,

I was wondering if ARM64 support in parasail would be possible?

I have been using parasail for several different things (mostly through the python API) and have been very happy with it. Unfortunately, I now have to make my scripts run on ARM64 as well. I tried compiling it but it (probably not surprisingly) failed (see below).

Do you think it would be possible to make it compile and run on ARM64? As a first step, having a non-SSE/AVX version of the algorithms running would already be great. In the long run, I was wondering if using something like https://github.com/jratcliff63367/sse2neon or https://github.com/nemequ/simde/ would allow running the SSE versions as well?

Thanks,
Philipp

prescheneder@tegra-ubuntu:~/parasail/parasail-2.1.4/build$ cmake ..
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Try OpenMP C flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Try OpenMP CXX flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Found OpenMP: -fopenmp
-- Try SSE2 C flag = [ ]
-- Performing Test HAVE_SSE2
-- Performing Test HAVE_SSE2 - Failed
-- Try SSE2 C flag = [-march=core2]
-- Performing Test HAVE_SSE2
-- Performing Test HAVE_SSE2 - Failed
-- Try SSE2 C flag = [-msse2]
-- Performing Test HAVE_SSE2
-- Performing Test HAVE_SSE2 - Failed
-- Could NOT find SSE2 (missing:  SSE2_C_FLAGS)
-- Try SSE41 C flag = [ ]
-- Performing Test HAVE_SSE41
-- Performing Test HAVE_SSE41 - Failed
-- Try SSE41 C flag = [-march=corei7]
-- Performing Test HAVE_SSE41
-- Performing Test HAVE_SSE41 - Failed
-- Try SSE41 C flag = [-msse4]
-- Performing Test HAVE_SSE41
-- Performing Test HAVE_SSE41 - Failed
-- Try SSE41 C flag = [-msse4.1]
-- Performing Test HAVE_SSE41
-- Performing Test HAVE_SSE41 - Failed
-- Could NOT find SSE41 (missing:  SSE41_C_FLAGS)
-- Try AVX2 C flag = [ ]
-- Performing Test HAVE_AVX2
-- Performing Test HAVE_AVX2 - Failed
-- Try AVX2 C flag = [-march=core-avx2]
-- Performing Test HAVE_AVX2
-- Performing Test HAVE_AVX2 - Failed
-- Try AVX2 C flag = [-mavx2]
-- Performing Test HAVE_AVX2
-- Performing Test HAVE_AVX2 - Failed
-- Could NOT find AVX2 (missing:  AVX2_C_FLAGS)
-- Performing Test HAVE_XGETBV
-- Performing Test HAVE_XGETBV - Failed
-- Try AltiVec C flag = [ ]
-- Performing Test HAVE_ALTIVEC
-- Performing Test HAVE_ALTIVEC - Failed
-- Try AltiVec C flag = [-maltivec]
-- Performing Test HAVE_ALTIVEC
-- Performing Test HAVE_ALTIVEC - Failed
-- Try AltiVec C flag = [-faltivec]
-- Performing Test HAVE_ALTIVEC
-- Performing Test HAVE_ALTIVEC - Failed
-- Could NOT find ALTIVEC (missing:  ALTIVEC_C_FLAGS)
-- Found ZLIB: /usr/lib/aarch64-linux-gnu/libz.so (found version "1.2.8")
-- Check if the CPU is POWER
-- Check if the CPU is POWER - FALSE
-- Looking for sys/types.h
-- Looking for sys/types.h - found
-- Looking for stdint.h
-- Looking for stdint.h - found
-- Looking for stddef.h
-- Looking for stddef.h - found
-- Check size of int
-- Check size of int - done
-- Check size of int8_t
-- Check size of int8_t - done
-- Check size of int16_t
-- Check size of int16_t - done
-- Check size of int32_t
-- Check size of int32_t - done
-- Check size of int64_t
-- Check size of int64_t - done
-- Check size of uint8_t
-- Check size of uint8_t - done
-- Check size of uint16_t
-- Check size of uint16_t - done
-- Check size of uint32_t
-- Check size of uint32_t - done
-- Check size of uint64_t
-- Check size of uint64_t - done
-- Check if the system is big endian
-- Searching 16 bit integer
-- Check size of unsigned short
-- Check size of unsigned short - done
-- Using unsigned short
-- Check if the system is big endian - little endian
-- Performing Test HAVE_RESTRICT
-- Performing Test HAVE_RESTRICT - Success
-- Performing Test HAVE_INLINE_NATIVE
-- Performing Test HAVE_INLINE_NATIVE - Success
-- Looking for _aligned_malloc
-- Looking for _aligned_malloc - not found
-- Looking for posix_memalign
-- Looking for posix_memalign - found
-- Looking for aligned_alloc
-- Looking for aligned_alloc - found
-- Looking for memalign
-- Looking for memalign - found
-- Looking for clock_gettime
-- Looking for clock_gettime - found
-- Looking for clock_gettime in rt
-- Looking for clock_gettime in rt - found
-- Looking for clock_gettime in pthread
-- Looking for clock_gettime in pthread - found
-- Looking for sqrt
-- Looking for sqrt - not found
-- Looking for sqrt in m
-- Looking for sqrt in m - found
-- Configuring done
-- Generating done
-- Build files have been written to: /home/prescheneder/parasail/parasail-2.1.4/build
prescheneder@tegra-ubuntu:~/parasail/parasail-2.1.4/build$ make
Scanning dependencies of target parasail_novec_trace
[  1%] Building C object CMakeFiles/parasail_novec_trace.dir/src/nw_trace.c.o
[  2%] Building C object CMakeFiles/parasail_novec_trace.dir/src/sg_trace.c.o
[  3%] Building C object CMakeFiles/parasail_novec_trace.dir/src/sw_trace.c.o
[  4%] Building C object CMakeFiles/parasail_novec_trace.dir/src/nw_trace_scan.c.o
[  5%] Building C object CMakeFiles/parasail_novec_trace.dir/src/sg_trace_scan.c.o
[  6%] Building C object CMakeFiles/parasail_novec_trace.dir/src/sw_trace_scan.c.o
[  6%] Built target parasail_novec_trace
Scanning dependencies of target parasail_sse41
[  7%] Building C object CMakeFiles/parasail_sse41.dir/cmake/sse41_dummy.c.o
[  7%] Built target parasail_sse41
Scanning dependencies of target parasail_sse2_trace
[  8%] Building C object CMakeFiles/parasail_sse2_trace.dir/cmake/sse2_dummy.c.o
[  8%] Built target parasail_sse2_trace
Scanning dependencies of target parasail_avx2_table
[  9%] Building C object CMakeFiles/parasail_avx2_table.dir/cmake/avx2_dummy.c.o
[  9%] Built target parasail_avx2_table
Scanning dependencies of target parasail_novec_rowcol
[ 10%] Building C object CMakeFiles/parasail_novec_rowcol.dir/src/nw.c.o
[ 11%] Building C object CMakeFiles/parasail_novec_rowcol.dir/src/sg.c.o
[ 12%] Building C object CMakeFiles/parasail_novec_rowcol.dir/src/sw.c.o
[ 13%] Building C object CMakeFiles/parasail_novec_rowcol.dir/src/nw_banded.c.o
[ 14%] Building C object CMakeFiles/parasail_novec_rowcol.dir/src/nw_scan.c.o
[ 15%] Building C object CMakeFiles/parasail_novec_rowcol.dir/src/sg_scan.c.o
[ 16%] Building C object CMakeFiles/parasail_novec_rowcol.dir/src/sw_scan.c.o
[ 17%] Building C object CMakeFiles/parasail_novec_rowcol.dir/src/nw_stats.c.o
[ 18%] Building C object CMakeFiles/parasail_novec_rowcol.dir/src/sg_stats.c.o
[ 19%] Building C object CMakeFiles/parasail_novec_rowcol.dir/src/sw_stats.c.o
[ 20%] Building C object CMakeFiles/parasail_novec_rowcol.dir/src/nw_stats_scan.c.o
[ 21%] Building C object CMakeFiles/parasail_novec_rowcol.dir/src/sg_stats_scan.c.o
[ 21%] Building C object CMakeFiles/parasail_novec_rowcol.dir/src/sw_stats_scan.c.o
[ 21%] Built target parasail_novec_rowcol
Scanning dependencies of target parasail_sse2_table
[ 22%] Building C object CMakeFiles/parasail_sse2_table.dir/cmake/sse2_dummy.c.o
[ 22%] Built target parasail_sse2_table
Scanning dependencies of target parasail_sse41_table
[ 23%] Building C object CMakeFiles/parasail_sse41_table.dir/cmake/sse41_dummy.c.o
[ 23%] Built target parasail_sse41_table
Scanning dependencies of target parasail_sse2_rowcol
[ 24%] Building C object CMakeFiles/parasail_sse2_rowcol.dir/cmake/sse2_dummy.c.o
[ 24%] Built target parasail_sse2_rowcol
Scanning dependencies of target parasail_sse2
[ 25%] Building C object CMakeFiles/parasail_sse2.dir/cmake/sse2_dummy.c.o
[ 25%] Built target parasail_sse2
Scanning dependencies of target parasail_sse41_trace
[ 25%] Building C object CMakeFiles/parasail_sse41_trace.dir/cmake/sse41_dummy.c.o
[ 25%] Built target parasail_sse41_trace
Scanning dependencies of target parasail_altivec
[ 26%] Building C object CMakeFiles/parasail_altivec.dir/cmake/altivec_dummy.c.o
[ 26%] Built target parasail_altivec
Scanning dependencies of target parasail_avx2_trace
[ 27%] Building C object CMakeFiles/parasail_avx2_trace.dir/cmake/avx2_dummy.c.o
[ 27%] Built target parasail_avx2_trace
Scanning dependencies of target parasail_sse41_rowcol
[ 28%] Building C object CMakeFiles/parasail_sse41_rowcol.dir/cmake/sse41_dummy.c.o
[ 28%] Built target parasail_sse41_rowcol
Scanning dependencies of target parasail_avx2
[ 29%] Building C object CMakeFiles/parasail_avx2.dir/cmake/avx2_dummy.c.o
[ 29%] Built target parasail_avx2
Scanning dependencies of target parasail_avx2_rowcol
[ 30%] Building C object CMakeFiles/parasail_avx2_rowcol.dir/cmake/avx2_dummy.c.o
[ 30%] Built target parasail_avx2_rowcol
Scanning dependencies of target parasail_altivec_rowcol
[ 31%] Building C object CMakeFiles/parasail_altivec_rowcol.dir/cmake/altivec_dummy.c.o
[ 31%] Built target parasail_altivec_rowcol
Scanning dependencies of target parasail_altivec_trace
[ 32%] Building C object CMakeFiles/parasail_altivec_trace.dir/cmake/altivec_dummy.c.o
[ 32%] Built target parasail_altivec_trace
Scanning dependencies of target parasail_core
[ 33%] Building C object CMakeFiles/parasail_core.dir/src/cigar.c.o
[ 33%] Building C object CMakeFiles/parasail_core.dir/src/function_lookup.c.o
[ 34%] Building C object CMakeFiles/parasail_core.dir/src/io.c.o
[ 35%] Building C object CMakeFiles/parasail_core.dir/src/isastubs.c.o
[ 36%] Building C object CMakeFiles/parasail_core.dir/src/matrix_lookup.c.o
[ 37%] Building C object CMakeFiles/parasail_core.dir/src/memory.c.o
[ 38%] Building C object CMakeFiles/parasail_core.dir/src/parser.c.o
[ 39%] Building C object CMakeFiles/parasail_core.dir/src/pssw.c.o
[ 40%] Building C object CMakeFiles/parasail_core.dir/src/time.c.o
/home/prescheneder/parasail/parasail-2.1.4/src/time.c:24:0: warning: "_POSIX_C_SOURCE" redefined
 #define _POSIX_C_SOURCE 199309L
 ^
In file included from /usr/include/stdio.h:27:0,
                 from /home/prescheneder/parasail/parasail-2.1.4/parasail.h:11,
                 from /home/prescheneder/parasail/parasail-2.1.4/src/time.c:12:
/usr/include/features.h:228:0: note: this is the location of the previous definition
 # define _POSIX_C_SOURCE 200809L
 ^
[ 41%] Building C object CMakeFiles/parasail_core.dir/src/nw_dispatch.c.o
[ 42%] Building C object CMakeFiles/parasail_core.dir/src/sg_dispatch.c.o
[ 43%] Building C object CMakeFiles/parasail_core.dir/src/sw_dispatch.c.o
[ 44%] Building C object CMakeFiles/parasail_core.dir/src/dispatch_profile.c.o
[ 45%] Building C object CMakeFiles/parasail_core.dir/src/satcheck.c.o
[ 46%] Building C object CMakeFiles/parasail_core.dir/src/striped_unwind.c.o
[ 46%] Building C object CMakeFiles/parasail_core.dir/src/traceback.c.o
[ 47%] Building C object CMakeFiles/parasail_core.dir/src/cpuid.c.o
/home/prescheneder/parasail/parasail-2.1.4/src/cpuid.c: In function ‘run_cpuid’:
/home/prescheneder/parasail/parasail-2.1.4/src/cpuid.c:31:5: error: impossible constraint in ‘asm’
     __asm__ ( "cpuid" : "+b" (ebx),
     ^
CMakeFiles/parasail_core.dir/build.make:446: recipe for target 'CMakeFiles/parasail_core.dir/src/cpuid.c.o' failed
make[2]: *** [CMakeFiles/parasail_core.dir/src/cpuid.c.o] Error 1
CMakeFiles/Makefile2:718: recipe for target 'CMakeFiles/parasail_core.dir/all' failed
make[1]: *** [CMakeFiles/parasail_core.dir/all] Error 2
Makefile:149: recipe for target 'all' failed
make: *** [all] Error 2
prescheneder@tegra-ubuntu:~/parasail/parasail-2.1.4/build$
@jeffdaily
Owner

jeffdaily commented Jun 5, 2018

I just pushed hotfix/2.1.5 that at least fixes the compilation issue you reported. As for arm simd neon support, I will explore the options you suggested. Thanks for that. I'm going to keep this issue open to track simd progress.

edit
I only updated the autotools build. I'll get CMake next.
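
For readers hitting the same `impossible constraint in 'asm'` error: it comes from x86 `cpuid` inline assembly being compiled for AArch64, which has no such instruction. The following is a minimal sketch of the usual guard, assuming GCC or Clang. It is not necessarily how hotfix/2.1.5 solves it, and the function name `cpu_has_sse2` is made up for illustration.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header, only available on x86 */
#endif

/* Returns 1 if the CPU reports SSE2, 0 otherwise.
 * On non-x86 targets (e.g. AArch64) no cpuid code is compiled at all. */
static int cpu_has_sse2(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return 0;                     /* CPUID leaf 1 not supported */
    }
    return (int)((edx >> 26) & 1u);   /* CPUID.1:EDX bit 26 is SSE2 */
#else
    return 0;                         /* no cpuid instruction on this target */
#endif
}
```

Keeping the architecture test inside the function means callers need no `#if` of their own; on ARM64 the check simply reports that SSE2 is unavailable.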

@jeffdaily
Owner

I updated the CMakeLists.txt to test for arm. Please test it.

@jeffdaily
Owner

@philres Would it be possible for you to send me an email? I wanted to take part of this discussion offline but I can't find an email address for you. Thanks.

@philres
Author

philres commented Jun 6, 2018

Hi Jeff!

Sorry for the delay and thank you very much for the quick fix. It compiles perfectly fine now. Haven't had time to run alignments yet, but I'll do that next.

Oh yeah, didn't realise that my mail address is not visible to others on github.
Will contact you via mail.

@philres
Author

philres commented Jun 8, 2018

Ok I can confirm that hotfix/2.1.5 works on ARM64. It compiles, parasail_aligner works when using a non-SSE alignment function and parasail-python works as well with the non-SSE functions.

Thx a lot!

@philres
Author

philres commented Jun 8, 2018

Sorry, one more question which might be a bit outside the scope of this issue.

Is there a way with parasail-python to check whether a particular function is available (e.g. sg_striped)?

@jeffdaily
Owner

jeffdaily commented Jun 8, 2018 via email

@philres
Author

philres commented Jun 11, 2018

Thanks Jeff, can_use_sse2 worked perfectly!
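
The emailed reply is not preserved in this thread. The general pattern behind a check like can_use_sse2 is to test the capability once and fall back to the non-vectorized routine when it is missing. Below is a rough sketch of that pattern in C. All names here (`align_fn`, `align_scalar`, `align_simd`, `pick_aligner`) are hypothetical stand-ins and not parasail's API, and the "alignment" is just a count of matching positions to keep the example short.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical signature shared by every implementation. */
typedef int (*align_fn)(const char *a, const char *b);

/* Stand-in for a non-vectorized routine: counts matching positions. */
static int align_scalar(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t n = la < lb ? la : lb;
    int score = 0;
    for (size_t i = 0; i < n; ++i) {
        score += (a[i] == b[i]);
    }
    return score;
}

/* Stand-in for a vectorized routine; it forwards to the scalar code
 * so the sketch stays portable. */
static int align_simd(const char *a, const char *b)
{
    return align_scalar(a, b);
}

/* Pick an implementation from a capability flag obtained elsewhere. */
static align_fn pick_aligner(int have_simd)
{
    return have_simd ? align_simd : align_scalar;
}
```

A caller would evaluate the capability check once at startup, store the returned pointer, and use it for every alignment afterwards.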

@jeffdaily
Owner

Hi @philres, I was wondering what you were using for your ARM system and CPU? I'm looking into using QEMU to emulate ARM64 so I can try to port the vectorized code.

@jeffdaily
Owner

Hi @philres . I used https://github.com/nemequ/simde instead of https://github.com/jratcliff63367/sse2neon. Please try the branch feature/aarch64. I used Ubuntu packages to cross compile for arm and then the qemu package to verify the implementations (see the new section in the README.md in the above branch). In my limited testing, only the 64-bit neon solutions for sg_diag and sw_diag are failing.

I am intensely curious whether you experience any speed up using the arm neon vectorized functions. Also, please verify on your own if you're getting correct results since my testing has been limited.

@philres
Author

philres commented Jun 28, 2018

Hi Jeff,

once more sorry for the delay. I only had time to do a couple of quick test runs, and it seems that the vectorised algorithms are slower than the ones without vectorisation on our ARM machines. Is that possible? In any case, I'll have to check carefully whether I did something wrong.

Concerning the results, I haven't done a proper evaluation, but the results seem to be the same as on AMD64.

@philres
Author

philres commented Jun 29, 2018

Ok, so I tried it again and it seems that the vectorised functions are slower than the non-vectorised ones on ARM at the moment.

When running my application that is using parasail-python (uses sg and sg_stats) on my Mac I get the following speed up (it is a rough comparison but I ran all of them a couple of times to make sure that the times are consistent between runs):
parasail_sg_stat + parasail_sg: 32 sec
sg_stats_striped_32 + sg_striped_32: 10 sec
sg_stats_diag_32 + sg_diag_32: 23 sec
sg_stats_scan_32 + sg_scan_32: 11 sec

On ARM64 I get:
parasail_sg_stat + parasail_sg: 17 sec
sg_stats_striped_32 + sg_striped_32: 42 sec
sg_stats_diag_32 + sg_diag_32: 212 sec
sg_stats_scan_32 + sg_scan_32: 78 sec

I maybe should add that in this application I mostly align short sequences (~200 bp) and that most of the runtime is spent on sg and not on sg_stats.

One thing that is quite surprising is how fast the non-vectorised versions are on ARM64 compared to my MacBook (Pro 2017 with 2.5 GHz Intel Core i7).

@jeffdaily
Owner

Thanks for the update. Quick background: I used the simde project (https://github.com/nemequ/simde) to implement the NEON routines. simde provides an SSE function signature but uses NEON to implement it internally. If there isn't a direct equivalent function, it falls back to a for loop over each vector element, annotated with OpenMP vectorization directives.

In short, it could be that we're missing a step when compiling for NEON that is using the slower, non-vector path within the 'vector' operations. Now that I have access to hardware (thanks!) I will experiment.
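
To make the two paths concrete, here is a small sketch of the idea in C: one function with a NEON body when the compiler targets NEON and a plain element loop otherwise. It is an illustration of the approach described above, not code from simde or parasail; the name `add4_i32` is invented, and the operation mirrors what SSE2's `_mm_add_epi32` does on four 32-bit lanes.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Adds four 32-bit lanes with wrap-around on overflow. */
static void add4_i32(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
#if defined(__ARM_NEON)
    /* Native path: a single NEON add on a 128-bit register. */
    vst1q_s32(out, vaddq_s32(vld1q_s32(a), vld1q_s32(b)));
#else
    /* Portable fallback: one lane at a time. If this path is taken
     * on ARM, the "vector" function is really scalar code. */
    for (int i = 0; i < 4; ++i) {
        out[i] = (int32_t)((uint32_t)a[i] + (uint32_t)b[i]);
    }
#endif
}
```

If `__ARM_NEON` is not defined at compile time, or an operation has no NEON branch at all, the loop is what runs, which would be consistent with the slowdown reported above.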

@philres
Author

philres commented Jul 2, 2018

Oh yes, good call. That's very possible. I vaguely remember having to set a compiler option to enable neon the last time I did it.

@jeffdaily
Owner

As my first time with a raspberry pi, I hit a few hurdles. For some reason in my Big Box of Extra Cables, I didn't have an HDMI. Arrived yesterday. Then, I didn't understand the power requirements so I needed to order a 3A power supply for the card. I managed to boot it with a 1.5A supply, but got lots of incorrect voltage warnings. I'll get back to this next week when the power supply arrives.

I've been looking at the simde project more closely now. At first I had just set up the parasail scaffolding to use this simde project's intrinsic functions and made sure the appropriate compiler flags were set for neon. Now I'm looking at the code and it seems some important SSE2 and SSE4.1 functionality isn't using neon intrinsic functions natively. Every time an intrinsic doesn't exist in the simde project, it reverts to a for loop over the vector. That could be our source of slowdown. But I'll know more once I can adequately test.

@jeffdaily
Owner

@philres Please try the latest tip of the aarch64 branch. There were a number of functions that the simde project did not have optimized for neon. I added all of the missing functions and I'm now getting faster performance with the neon functions than without. I'd be curious to see what you get if you rerun your performance tests from earlier. Please let me know. Thanks.
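
The thread does not list which functions were missing, so the following is only an illustration of what filling such a gap can look like. It picks SSE4.1's `_mm_blendv_epi8` as an example of an operation without a one-to-one NEON instruction and maps it onto NEON's bitwise select. The function name `blendv16_i8` is invented, and this is not the code from the aarch64 branch.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Per byte: take b where the mask byte has its top bit set, else a.
 * These are the semantics of SSE4.1 _mm_blendv_epi8. */
static void blendv16_i8(const int8_t a[16], const int8_t b[16],
                        const int8_t mask[16], int8_t out[16])
{
#if defined(__ARM_NEON)
    /* An arithmetic shift right by 7 spreads the sign bit across the
     * byte, giving an all-ones or all-zeros selector per lane. */
    uint8x16_t sel = vreinterpretq_u8_s8(vshrq_n_s8(vld1q_s8(mask), 7));
    /* vbslq picks bits from its second argument where sel is 1. */
    vst1q_s8(out, vbslq_s8(sel, vld1q_s8(b), vld1q_s8(a)));
#else
    /* Element loop of the kind a generic fallback would use. */
    for (int i = 0; i < 16; ++i) {
        out[i] = (mask[i] < 0) ? b[i] : a[i];
    }
#endif
}
```

The NEON branch replaces sixteen compare-and-select iterations with a shift and a select, which is the kind of difference that decides whether the vectorized routines beat the scalar ones.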

@jeffdaily
Owner

I went ahead and merged the feature/aarch64 branch into develop and master and tagged a new release v2.2. Let me know if you have any problems with the new NEON functions.

@philres
Author

philres commented Jul 11, 2018

Hi Jeff, sorry again for the late reply. I tried your latest changes and it looks very good, but there are two minor things:

  1. configure && make fail for me with the following error:
prescheneder@tegra-ubuntu:~/parasail/parasail$ make -j6
make  all-am
make[1]: Entering directory '/home/prescheneder/parasail/parasail'
  CC       src/libparasail_neon_la-nw_diag_neon_128_8.lo
  CC       src/libparasail_neon_la-sw_striped_neon_128_64.lo
  CC       src/libparasail_neon_la-sg_striped_neon_128_32.lo
  CC       src/libparasail_neon_la-sw_striped_neon_128_32.lo
  CC       src/libparasail_neon_la-nw_striped_neon_128_32.lo
  CC       src/libparasail_neon_la-nw_striped_neon_128_16.lo
/mnt/HD/tmp/ccJCPNb0.s: Assembler messages:
/mnt/HD/tmp/ccJCPNb0.s: Error: unaligned opcodes detected in executable segment
Makefile:11484: recipe for target 'src/libparasail_neon_la-nw_diag_neon_128_8.lo' failed
make[1]: *** [src/libparasail_neon_la-nw_diag_neon_128_8.lo] Error 1
make[1]: *** Waiting for unfinished jobs....
make[1]: Leaving directory '/home/prescheneder/parasail/parasail'
Makefile:3177: recipe for target 'all' failed
make: *** [all] Error 2
  2. cmake && make works, but when I first tried running parasail it was surprisingly slow. Then I figured out that -O2 is not passed when building with cmake and changed it to:
cmake -DNEON_C_FLAGS="-O2" ..

With this the results look very nice! I get a speedup of more than 2x, from 26 sec to ~12 sec, when using sg_striped_32 compared to sg.

Thank you very much for all your work!

Philipp

@jeffdaily
Owner

I ran into the unaligned opcode error only when I was cross-compiling from Ubuntu and not when using the gcc in raspbian on the pi itself. Which version of gcc are you using? The work-around seems to be to not use -g debug symbols when compiling. ./configure CFLAGS=-O2 should work (it defaults to -g -O2 otherwise -- that's an autoconf default).
