[arm64] use a better translation for move_mask #140
base: master
Benchmark results from `$ python3 ./tests/test_ext.py` (the Speedup column matches the ratio of the GB/s figures):

| Benchmark | No changes | With the patch | Speedup |
|---|---|---|---|
| bitshuffle 64 | 4.94 s/GB, 0.20 GB/s | 1.53 s/GB, 0.65 GB/s | 3.25x |
| bitunshuffle 64 | 5.09 s/GB, 0.20 GB/s | 1.53 s/GB, 0.65 GB/s | 3.25x |
| compress 64 | 5.26 s/GB, 0.19 GB/s | 1.80 s/GB, 0.55 GB/s | 2.89x |
| compress zstd 64 | 8.02 s/GB, 0.12 GB/s | 4.80 s/GB, 0.21 GB/s | 1.75x |
| decompress 64 | 5.72 s/GB, 0.17 GB/s | 2.21 s/GB, 0.45 GB/s | 2.64x |
| decompress zstd 64 | 5.71 s/GB, 0.18 GB/s | 2.18 s/GB, 0.46 GB/s | 2.55x |
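For readers unfamiliar with the operation being translated: SSE2's `_mm_movemask_epi8` gathers the most significant bit of each of 16 bytes into a 16-bit integer, and AArch64 NEON has no single instruction for it. Below is a minimal sketch of one common NEON translation (move each MSB down to bit 0, shift it into its bit position, then horizontally add each 8-byte half with `vaddv_u8`). It is not necessarily the code in this patch; the names `movemask_ref`, `movemask_neon` and `movemask` are hypothetical, the NEON path assumes an AArch64 target, and other targets use only the portable scalar reference.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical portable reference: bit i of the result is the most
 * significant bit of byte i (the semantics of SSE2 _mm_movemask_epi8). */
static int movemask_ref(const uint8_t in[16]) {
    int mask = 0;
    for (int i = 0; i < 16; i++) {
        mask |= ((in[i] >> 7) & 1) << i;
    }
    return mask;
}

#if defined(__aarch64__)
/* Hypothetical NEON translation; AArch64 only because of vaddv_u8. */
static int movemask_neon(uint8x16_t input) {
    /* Logical shift right by 7: each lane becomes 0 or 1 (its old MSB). */
    uint8x16_t msbs = vshrq_n_u8(input, 7);
    /* Shift lane i left by (i % 8) so each half holds distinct bit positions. */
    static const int8_t shift_amounts[16] = {0, 1, 2, 3, 4, 5, 6, 7,
                                             0, 1, 2, 3, 4, 5, 6, 7};
    uint8x16_t positioned = vshlq_u8(msbs, vld1q_s8(shift_amounts));
    /* Bits within each half are disjoint, so a horizontal add acts as an OR. */
    int lo = vaddv_u8(vget_low_u8(positioned));
    int hi = vaddv_u8(vget_high_u8(positioned));
    return lo | (hi << 8);
}
#endif

/* Returns the 16-bit mask; on arm64 it also asserts that the NEON path
 * agrees with the scalar reference. */
static int movemask(const uint8_t in[16]) {
    int ref = movemask_ref(in);
#if defined(__aarch64__)
    assert(movemask_neon(vld1q_u8(in)) == ref);
#endif
    return ref;
}
```

Because `movemask` cross-checks the two paths on arm64, a helper like this can also serve as a quick sanity test when comparing builds at different optimization levels.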
It seems there is a problem when applying this patch.
It looks like compiling with `-O0` gives the correct result.
Hello @stdpain, thanks a lot for the testcase.
Could you please report which gcc version you are using?
I was able to reproduce the bug with gcc10 on AL2.
I would suggest using gcc7 from AL2 until I fix the issue in gcc10.
gcc-9 is the last version of gcc that works.
I opened a bug against gcc-10: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109519
Patch from Andrew Pinski <pinskia@gcc.gnu.org>.
The compiler does not have a bug.
It works when compiled with
Patch d29228f also works.
@sebpop It seems we could have a better implementation of `neonmovemask_bulk`: https://gist.github.com/geofflangdale/99393863c8cae3e83195a5e592e7dc82
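The linked gist concerns computing a 64-bit mask from four 16-byte vectors at once. Below is a rough sketch of that kind of bulk approach using the well-known pairwise-add technique: turn each byte into 0x00 or 0xFF from its MSB, weight each lane by its bit value, then collapse with three rounds of `vpaddq_u8`. It is an illustration and may differ from the gist's code; the names `movemask64_ref`, `movemask64_neon`, `movemask64` and `USE_NEON_SKETCH` are hypothetical, and the NEON path assumes little-endian AArch64.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical switch: enable the NEON sketch only on little-endian AArch64. */
#if defined(__aarch64__) && defined(__BYTE_ORDER__) && \
    __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define USE_NEON_SKETCH 1
#include <arm_neon.h>
#else
#define USE_NEON_SKETCH 0
#endif

/* Portable reference: bit i of the result is the MSB of in[i]. */
static uint64_t movemask64_ref(const uint8_t in[64]) {
    uint64_t mask = 0;
    for (int i = 0; i < 64; i++) {
        mask |= (uint64_t)(in[i] >> 7) << i;
    }
    return mask;
}

#if USE_NEON_SKETCH
static uint64_t movemask64_neon(const uint8_t in[64]) {
    /* Bit weight of each lane within its 8-byte group. */
    static const uint8_t weight_arr[16] = {1, 2, 4, 8, 16, 32, 64, 128,
                                           1, 2, 4, 8, 16, 32, 64, 128};
    const uint8x16_t weights = vld1q_u8(weight_arr);
    uint8x16_t t[4];
    for (int k = 0; k < 4; k++) {
        /* Arithmetic shift right by 7: 0xFF if the MSB is set, else 0x00. */
        int8x16_t v = vreinterpretq_s8_u8(vld1q_u8(in + 16 * k));
        uint8x16_t full = vreinterpretq_u8_s8(vshrq_n_s8(v, 7));
        t[k] = vandq_u8(full, weights);
    }
    /* Each pairwise add halves the lanes per group; weights are disjoint,
     * so sums never overflow a byte. After three rounds, byte j of the low
     * 64 bits holds the mask for input bytes 8*j .. 8*j+7. */
    uint8x16_t s0 = vpaddq_u8(t[0], t[1]);
    uint8x16_t s1 = vpaddq_u8(t[2], t[3]);
    s0 = vpaddq_u8(s0, s1);
    s0 = vpaddq_u8(s0, s0);
    return vgetq_lane_u64(vreinterpretq_u64_u8(s0), 0);
}
#endif

/* Returns the 64-bit mask, cross-checking the NEON sketch when enabled. */
static uint64_t movemask64(const uint8_t in[64]) {
    uint64_t ref = movemask64_ref(in);
#if USE_NEON_SKETCH
    assert(movemask64_neon(in) == ref);
#endif
    return ref;
}
```

The likely appeal of batching is that the final reductions are shared across all 64 bytes instead of being repeated for each 16-byte vector.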
ignore
Ping patch.