Support for ARM64? #59
Comments
I just pushed hotfix/2.1.5, which at least fixes the compilation issue you reported. As for ARM NEON SIMD support, I will explore the options you suggested. Thanks for that. I'm going to keep this issue open to track SIMD progress. |
I updated the CMakeLists.txt to test for arm. Please test it. |
@philres Would it be possible for you to send me an email? I wanted to take part of this discussion offline but I can't find an email address for you. Thanks. |
Hi Jeff! Sorry for the delay and thank you very much for the quick fix. It compiles perfectly fine now. I haven't had time to run alignments yet, but I'll do that next. Oh yeah, I didn't realise that my mail address is not visible to others on GitHub. |
Ok I can confirm that hotfix/2.1.5 works on ARM64. It compiles, parasail_aligner works when using a non-SSE alignment function and parasail-python works as well with the non-SSE functions. Thx a lot! |
Sorry, one more question, which might be a bit outside the scope of this issue. Is there a way with parasail-python to check whether a particular function is available (e.g. sg_striped)? |
Good question. I know I only added python bindings for the cpu dispatching functions because I made the assumption that my python users wouldn’t want to call the ISA-specific functions directly.
Do you want those functions exposed?
If so, it’s an advanced feature (read: it can fail badly) if you call an ISA-specific function and it is only a stubbed-out function with an assert.
Interestingly, I did add functions like can_use_sse2 to the bindings, just so you can check which ISA would get dispatched to if you were curious. So if that function returned true, that means you could use striped safely.
On Jun 8, 2018, at 8:18 AM, Philipp Rescheneder wrote:
Sorry, one more question, which might be a bit outside the scope of this issue.
Is there a way with parasail-python to check whether a particular function is available (e.g. sg_striped)?
|
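A minimal sketch of the check described in the reply above, assuming parasail-python exposes can_use_sse2 as a module-level function (as the comment says) and assuming sg_striped_16 and sg as the names of a vectorized and a dispatching function. The helper pick_aligner is hypothetical, and it takes the module as a parameter so it can be tried with a stand-in object:

```python
from types import SimpleNamespace


def pick_aligner(mod, vectorized="sg_striped_16", fallback="sg"):
    """Return the vectorized function only if the module reports SSE2 support.

    `mod` is expected to look like the parasail module; the attribute names
    used here are assumptions, not a statement of the library's API.
    """
    can_use = getattr(mod, "can_use_sse2", None)
    fn = getattr(mod, vectorized, None)
    # The function may exist but be a stub that asserts, so existence
    # alone is not enough: the ISA check must also pass.
    if callable(can_use) and can_use() and callable(fn):
        return fn
    return getattr(mod, fallback)


# Stand-in for a build where the SSE2 path cannot be used.
fake = SimpleNamespace(
    can_use_sse2=lambda: False,
    sg_striped_16=lambda *a: "vectorized",
    sg=lambda *a: "dispatching",
)
chosen = pick_aligner(fake)
print(chosen())  # prints "dispatching"
```

With the real library installed, the call would presumably be pick_aligner(parasail) after import parasail.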
Thanks Jeff, |
Hi @philres, I was wondering what you were using for your ARM system and CPU? I'm looking into using QEMU to emulate ARM64 so I can try to port the vectorized code. |
Hi @philres . I used https://github.com/nemequ/simde instead of https://github.com/jratcliff63367/sse2neon. Please try the branch I am intensely curious whether you experience any speed up using the arm neon vectorized functions. Also, please verify on your own if you're getting correct results since my testing has been limited. |
Hi Jeff, once more sorry for the delay. I only had time to do a couple of quick test runs, and it seems that the vectorised algorithms are slower than the ones without vectorisation on our ARM machines. Is that possible? In any case, I'll have to check carefully whether I did something wrong. Concerning the results, I haven't done a proper evaluation, but the results seem to be the same as on AMD64. |
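One way to approach the "proper evaluation" mentioned above is to run the same sequence pairs through a reference function and the function under test, then collect every disagreement. A minimal sketch; compare_scores and the two toy scoring functions are hypothetical, and with parasail-python the two callables would wrap whichever non-vectorized and vectorized calls are being compared:

```python
def compare_scores(reference, candidate, pairs):
    """Return (query, target, ref_score, cand_score) for every mismatch."""
    mismatches = []
    for query, target in pairs:
        ref = reference(query, target)
        cand = candidate(query, target)
        if ref != cand:
            mismatches.append((query, target, ref, cand))
    return mismatches


# Toy stand-ins: count matching positions, computed two different ways.
def ref_score(q, t):
    return sum(1 for a, b in zip(q, t) if a == b)


def cand_score(q, t):
    return len([i for i in range(min(len(q), len(t))) if q[i] == t[i]])


pairs = [("ACGT", "ACGA"), ("GATTACA", "GATTTCA"), ("", "A")]
print(compare_scores(ref_score, cand_score, pairs))  # prints []
```

An empty list means the two implementations agreed on every pair; any entry shows exactly which input to look at.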
OK, so I tried it again, and it seems that the vectorised functions are slower than the non-vectorised ones on ARM at the moment. When running my application that uses parasail-python (it uses sg and sg_stats) on my Mac, I get the following speed-up (it is a rough comparison, but I ran all of them a couple of times to make sure that the times are consistent between runs): On ARM64 I get: I should maybe add that in this application I mostly align short sequences (~200 bp) and that most of the runtime is spent on sg and not on sg_stats. One thing that is quite surprising is how fast the non-vectorised versions are on ARM64 compared to my MacBook (Pro 2017 with 2.5 GHz Intel Core i7). |
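For a rough comparison like the one described above, repeating each measurement and looking at the spread helps confirm that the times are consistent between runs. A minimal sketch using only the standard library; time_runs and speedup are hypothetical helper names, and the lambda is a toy workload standing in for a batch of alignments:

```python
import time


def time_runs(fn, repeats=5):
    """Call fn() `repeats` times; return (best, worst) wall-clock seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return min(durations), max(durations)


def speedup(baseline_seconds, candidate_seconds):
    """A ratio above 1 means the candidate is faster than the baseline."""
    return baseline_seconds / candidate_seconds


# Toy workload; replace with a call that runs the alignments being timed.
best, worst = time_runs(lambda: sum(range(100_000)))
print(f"best {best:.4f}s, worst {worst:.4f}s")
```

If best and worst are far apart, the comparison between the vectorised and non-vectorised runs is probably too noisy to trust.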
Thanks for the update. Quick background: I used the simde project (https://github.com/nemequ/simde) to implement the NEON routines. simde provides an SSE function signature but uses NEON to implement it internally. If there isn't a direct equivalent function, it falls back to a for loop over each vector element, annotated with OpenMP vectorization directives. In short, it could be that we're missing a step when compiling for NEON, so that the 'vector' operations are taking the slower, non-vector path. Now that I have access to hardware (thanks!) I will experiment. |
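To make the "slower, non-vector path" concrete, here is a conceptual model in Python of what a per-element fallback does for a saturating signed 8-bit add (the operation SSE2 provides as _mm_adds_epi8). This is not simde's code; adds_epi8_fallback is a hypothetical name and plain lists stand in for 128-bit vectors:

```python
INT8_MIN, INT8_MAX = -128, 127


def adds_epi8_fallback(a, b):
    """Saturating signed 8-bit add, computed one lane at a time."""
    if len(a) != 16 or len(b) != 16:
        raise ValueError("a 128-bit vector holds 16 8-bit lanes")
    result = []
    for x, y in zip(a, b):
        # Clamp each lane's sum to the int8 range instead of wrapping.
        result.append(max(INT8_MIN, min(INT8_MAX, x + y)))
    return result


a = [100] * 8 + [-100] * 8
b = [50] * 8 + [-50] * 8
print(adds_epi8_fallback(a, b))
```

A native SIMD instruction handles all 16 lanes in one operation, whereas the loop performs 16 separate additions and clamps unless the compiler manages to vectorize it.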
Oh yes, good call. That's very possible. I vaguely remember having to set a compiler option to enable neon the last time I did it. |
This being my first time with a Raspberry Pi, I hit a few hurdles. For some reason, in my Big Box of Extra Cables, I didn't have an HDMI cable. It arrived yesterday. Then, I didn't understand the power requirements, so I needed to order a 3A power supply for the card. I managed to boot it with a 1.5A supply, but got lots of incorrect voltage warnings. I'll get back to this next week when the power supply arrives. I've been looking at the simde project more closely now. At first I had just set up the parasail scaffolding to use the simde project's intrinsic functions and made sure the appropriate compiler flags were set for NEON. Now I'm looking at the code, and it seems some important SSE2 and SSE4.1 functionality isn't using NEON intrinsic functions natively. Every time an intrinsic doesn't exist in the simde project, it reverts to a for loop over the vector. That could be our source of slowdown. But I'll know more once I can adequately test. |
@philres Please try the latest tip of the aarch64 branch. There were a number of functions that the simde project did not have optimized for neon. I added all of the missing functions and I'm now getting faster performance with the neon functions than without. I'd be curious to see what you get if you rerun your performance tests from earlier. Please let me know. Thanks. |
I went ahead and merged the feature/aarch64 branch into develop and master and tagged a new release v2.2. Let me know if you have any problems with the new NEON functions. |
Hi Jeff, sorry again for the late reply. I tried your latest changes and it looks very good, but there are a couple of minor things:
prescheneder@tegra-ubuntu:~/parasail/parasail$ make -j6
make all-am
make[1]: Entering directory '/home/prescheneder/parasail/parasail'
CC src/libparasail_neon_la-nw_diag_neon_128_8.lo
CC src/libparasail_neon_la-sw_striped_neon_128_64.lo
CC src/libparasail_neon_la-sg_striped_neon_128_32.lo
CC src/libparasail_neon_la-sw_striped_neon_128_32.lo
CC src/libparasail_neon_la-nw_striped_neon_128_32.lo
CC src/libparasail_neon_la-nw_striped_neon_128_16.lo
/mnt/HD/tmp/ccJCPNb0.s: Assembler messages:
/mnt/HD/tmp/ccJCPNb0.s: Error: unaligned opcodes detected in executable segment
Makefile:11484: recipe for target 'src/libparasail_neon_la-nw_diag_neon_128_8.lo' failed
make[1]: *** [src/libparasail_neon_la-nw_diag_neon_128_8.lo] Error 1
make[1]: *** Waiting for unfinished jobs....
make[1]: Leaving directory '/home/prescheneder/parasail/parasail'
Makefile:3177: recipe for target 'all' failed
make: *** [all] Error 2
with this the results look very nice! I get a speed up of >2 from 26 sec to ~12 sec when using Thank you very much for all your work! Philipp |
I ran into the unaligned opcode error only when I was cross-compiling from Ubuntu and not when using the gcc in raspbian on the pi itself. Which version of gcc are you using? The work-around seems to be to not use |
Hi Jeff,
I was wondering if ARM64 support in parasail would be possible?
I have been using parasail for several different things (mostly through the python API) and have been very happy with it. Unfortunately, I now have to make my scripts run on ARM64 as well. I tried compiling it but it (probably not surprisingly) failed (see below).
Do you think it would be possible to make it compile and run on ARM64? As a first step, having a non-SSE/AVX version of the algorithms running would be great already. In the long run, I was wondering if using something like https://github.com/jratcliff63367/sse2neon or https://github.com/nemequ/simde/ would allow running the SSE versions as well?
Thanks,
Philipp