Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[arm64] use a better translation for move_mask #140

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

sebpop
Copy link

@sebpop sebpop commented Apr 7, 2023

No changes | With the patch | Speedup
$ python3 ./tests/test_ext.py | |
.bitshuffle 64 : 4.94 s/GB, 0.20 GB/s | 1.53 s/GB, 0.65 GB/s | 3.25x
.bitunshuffle 64 : 5.09 s/GB, 0.20 GB/s | 1.53 s/GB, 0.65 GB/s | 3.25x
.compress 64 : 5.26 s/GB, 0.19 GB/s | 1.80 s/GB, 0.55 GB/s | 2.89x
.compress zstd 64 : 8.02 s/GB, 0.12 GB/s | 4.80 s/GB, 0.21 GB/s | 1.75x
.decompress 64 : 5.72 s/GB, 0.17 GB/s | 2.21 s/GB, 0.45 GB/s | 2.64x
.decompress zstd 64 : 5.71 s/GB, 0.18 GB/s | 2.18 s/GB, 0.46 GB/s | 2.55x

No changes                                   | With the patch       | Speedup
$ python3 ./tests/test_ext.py                |                      |
.bitshuffle 64       :  4.94 s/GB, 0.20 GB/s | 1.53 s/GB, 0.65 GB/s | 3.25x
.bitunshuffle 64     :  5.09 s/GB, 0.20 GB/s | 1.53 s/GB, 0.65 GB/s | 3.25x
.compress 64         :  5.26 s/GB, 0.19 GB/s | 1.80 s/GB, 0.55 GB/s | 2.89x
.compress zstd  64   :  8.02 s/GB, 0.12 GB/s | 4.80 s/GB, 0.21 GB/s | 1.75x
.decompress 64       :  5.72 s/GB, 0.17 GB/s | 2.21 s/GB, 0.45 GB/s | 2.64x
.decompress zstd 64  :  5.71 s/GB, 0.18 GB/s | 2.18 s/GB, 0.46 GB/s | 2.55x
@stdpain
Copy link

stdpain commented Apr 15, 2023

It seems it have some problem when apply this patch.
here was a reproduce case: https://github.com/stdpain/bitshuffle/tree/reproduce/reproduce

@stdpain
Copy link

stdpain commented Apr 15, 2023

It looks like if you compile with -O0 you will get the correct result.
Hope it can help you

@sebpop
Copy link
Author

sebpop commented Apr 15, 2023

Hello @stdpain, thanks a lot for the testcase.
I have tried to reproduce the error, however I am seeing the exact same output when I compile with and without my patch, on a Graviton3 c7g machine that is running Debian.
I also have tried both gcc-9.3.0 and clang-11.0.1 and both compilers produce the same output.

$ clang --version
Debian clang version 11.0.1-2
Target: aarch64-unknown-linux-gnu

$ gcc --version
gcc (Debian 9.3.0-22) 9.3.0

Could you please report which gcc version you are using?
I will try to see if I can see the error with your compiler version.

@sebpop
Copy link
Author

sebpop commented Apr 15, 2023

I was able to reproduce the bug with gcc10 on AL2.
The output is incorrect:

$ make
cd .. && gcc10-gcc src/bitshuffle_core.c src/bitshuffle.c src/iochain.c lz4/lz4.c -O3 -march=native -c -std=c99 -I lz4/ && ar rs libbitshuffle.a ./*.o && cd src
gcc10-g++ test.cpp -I ../src ../libbitshuffle.a
spop-arm64 - 03:35:04:~/bitshuffle/reproduce$ ./a.out 
16384
0
0
0
0
39
45
51
57
[...]

I would suggest using gcc7 from AL2 until I fix the issue in gcc10.

@sebpop
Copy link
Author

sebpop commented Apr 15, 2023

gcc-9 is the last version of gcc that works.
gcc-10 and all versions of gcc after 10 are broken.
gcc-trunk as of today is still broken.
I will fix the bug in gcc-trunk and do the back-ports to all active branches.

@sebpop
Copy link
Author

sebpop commented Apr 15, 2023

I opened a bug against gcc-10: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109519

Patch from Andrew Pinski <pinskia@gcc.gnu.org>.
@sebpop
Copy link
Author

sebpop commented Apr 15, 2023

The compiler does not have a bug.
I added d29228f to this pull request to fix the issue in the code of bitshuffle, following the recommendations from Andrew Pinski.

@stdpain
Copy link

stdpain commented Apr 16, 2023

It works when compile with -fno-strict-aliasing

@stdpain
Copy link

stdpain commented Apr 16, 2023

patch d29228f also works.

@stdpain
Copy link

stdpain commented May 6, 2023

@sebpop It seems we could have a better implements for neonmovemask_bulk https://gist.github.com/geofflangdale/99393863c8cae3e83195a5e592e7dc82

uint8x16_t t0 = vbslq_u8(vdupq_n_u8(0x55), p0, p1); // 01010101...
uint8x16_t t1 = vbslq_u8(vdupq_n_u8(0x55), p2, p3); // 23232323...
uint8x16_t combined = vbslq_u8(vdupq_n_u8(0x33), t0, t1); // 01230123...
int8x8_t sum = vshrn_n_s16(vreinterpretq_s16_u8(combined), 4);
return vget_lane_u64(vreinterpret_u64_s8(sum), 0);

@stdpain
Copy link

stdpain commented May 9, 2023

@sebpop It seems we could have a better implements for neonmovemask_bulk https://gist.github.com/geofflangdale/99393863c8cae3e83195a5e592e7dc82
uint8x16_t t0 = vbslq_u8(vdupq_n_u8(0x55), p0, p1); // 01010101...
uint8x16_t t1 = vbslq_u8(vdupq_n_u8(0x55), p2, p3); // 23232323...
uint8x16_t combined = vbslq_u8(vdupq_n_u8(0x33), t0, t1); // 01230123...
int8x8_t sum = vshrn_n_s16(vreinterpretq_s16_u8(combined), 4);
return vget_lane_u64(vreinterpret_u64_s8(sum), 0);

ignore

@sebpop
Copy link
Author

sebpop commented Sep 18, 2023

Ping patch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants