AVX #3

Open
markNZed opened this issue Mar 21, 2017 · 34 comments

@markNZed

Hi,

I found your work while searching for a bit-matrix transpose using SIMD. It seems very close to what we need. AVX is becoming more popular, and I was wondering whether that function needs to be modified to leverage AVX instructions?

@mischasan
Owner

mischasan commented Mar 21, 2017 via email

@markNZed
Author

If I understood you correctly, you have moved some of the code base to AVX2 but are not planning on publishing that source code. However, you may make an AVX version of the bmx procedure available to the public. Is that right?

"We" is me and a dev who I've asked to help me because he has some SIMD experience. We have been using boost.simd to do some benchmarking.

The GPL could be a problem as I want to develop a commercial application (for engineering). There is no problem sharing changes that we might make to the bmx proc but the GPL would require releasing all the code it is linked with and that is problematic.

The app would run on industry server farms, so managing different SIMD implementations/generations is an issue. We were thinking of using gcc intrinsics for this. One idea would be to map bmx to intrinsics. Did you not want to use intrinsics?
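To illustrate the "different SIMD generations" concern, here is a minimal sketch of run-time detection using the GCC/clang x86 builtins `__builtin_cpu_init` and `__builtin_cpu_supports`. The enum `simd_level` and the functions `detect_simd_level` and `simd_level_name` are hypothetical names invented for this example. They are not part of the repo, and a real application would map each level to its own transpose kernel.

```c
#include <assert.h>

/* Hypothetical levels for illustration; not taken from the repo. */
typedef enum { SIMD_BASELINE = 0, SIMD_AVX2 = 1, SIMD_AVX512 = 2 } simd_level;

/* Report the widest instruction set the running CPU supports.
 * Uses GCC/clang builtins that exist only on x86 targets; on any
 * other compiler or architecture this falls back to the baseline. */
static simd_level detect_simd_level(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f"))
        return SIMD_AVX512;
    if (__builtin_cpu_supports("avx2"))
        return SIMD_AVX2;
#endif
    return SIMD_BASELINE; /* e.g. SSE2 on x86-64 */
}

/* Human-readable name, handy for logging which path was chosen. */
static const char *simd_level_name(simd_level lvl)
{
    assert(lvl <= SIMD_AVX512);
    switch (lvl) {
    case SIMD_AVX512: return "avx512f";
    case SIMD_AVX2:   return "avx2";
    default:          return "baseline";
    }
}
```

Calling `detect_simd_level()` once at start-up and storing a function pointer to the matching kernel is one common way to ship a single binary to a mixed server farm.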

@mischasan
Owner

mischasan commented Mar 22, 2017 via email

@markNZed
Author

Only targeting x86 at this stage.

I tried compiling on Ubuntu 16.04:

cc -g -MMD -fPIC -pthread -fdiagnostics-show-option -fno-strict-aliasing -fstack-protector --param ssp-buffer-size=4 -Wall -Werror -Wextra -Wcast-align -Wcast-qual -Wformat=2 -Wformat-security -Wmissing-prototypes -Wnested-externs -Wpointer-arith -Wshadow -Wstrict-prototypes -Wunused -Wwrite-strings -Wno-attributes -Wno-cast-qual -Wno-error -Wno-unknown-pragmas -Wno-unused-parameter -O3 -I/usr/local/include -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I. -c -o sseutil.o sseutil.c
sseutil.c:1:18: fatal error: plat.h: No such file or directory

Is that file missing from the repo?

@mischasan
Owner

mischasan commented Mar 22, 2017 via email

@mischasan
Owner

mischasan commented Mar 22, 2017 via email

@markNZed
Author

msutil.h and sock.h are also missing.

I ran make and then make test, which gives:

make test
cc   -pthread  -L/usr/local/lib        ssebmx_t.o libsse.a tap.o bitmat.o     -lstdc++  -lm    -o ssebmx_t
bitmat.o: In function `bitmat_trans':
/home/propacov/shared/proto/i8051-07/src/tests/primitives/sse2/bitmat.c:80: undefined reference to `ssebmx'
/home/propacov/shared/proto/i8051-07/src/tests/primitives/sse2/bitmat.c:80: undefined reference to `ssebmx_m'
collect2: error: ld returned 1 exit status
<builtin>: recipe for target 'ssebmx_t' failed
make: *** [ssebmx_t] Error 1

@mischasan
Owner

mischasan commented Mar 22, 2017 via email

@markNZed
Author

No problem, it is worth the effort if we can use the code. I resolved the missing files (downloaded the 3 headers from your utils package), but then ran into the link error reported in my previous message. Can you get the bmx test running? The GNUMakefile and rules are new to me, so it is not easy to quickly understand where the issue is. Thanks.

@mischasan
Owner

mischasan commented Mar 23, 2017 via email

@mischasan
Owner

mischasan commented Mar 23, 2017 via email

@mischasan
Owner

mischasan commented Mar 26, 2017 via email

@markNZed
Author

Hi,

I don't see updates to the repo. Are you using attachments with these messages? I don't think I can access those.

I'm in France. I imagine the users of our software will have AVX512 boxes, but I don't have a server farm. I plan to do testing on cloud infrastructure, e.g. AWS.

@markNZed
Author

With 256-bit or 512-bit registers, does the optimal size of the bit matrix for transposition change?

@mischasan
Owner

mischasan commented Mar 26, 2017 via email

@mischasan
Owner

mischasan commented Mar 26, 2017 via email

@markNZed
Author

If you like, you could upload to https://expirebox.com/. It is very simple: no login, and it provides a link to the file (which gets deleted after 48 hours).

@markNZed
Author

For bmx, does AVX provide improved instructions, or is the only benefit larger registers?

@mischasan
Owner

mischasan commented Mar 26, 2017 via email

@mischasan
Owner

mischasan commented Mar 26, 2017 via email

@markNZed
Author

I have a hard time understanding why CPUs don't provide native support for a bitwise transpose; it seems such a fundamental building block. Do you see why that hasn't happened?

The zip ran fine on my machine. I only tried ssebmx_t (I'm using an Intel Core i5 on my laptop). Thanks!

Have you tried benchmarking clang against gcc? I was surprised to see how much better clang-3.8 was than gcc-6.2 on some auto-vectorization test cases; it seemed to make better use of the ymm/xmm registers.

Yeah, lucky to be in France. So much comes down to luck...

@mischasan
Owner

mischasan commented Mar 26, 2017 via email

@markNZed
Author

No, a New Zealander, very lucky there too!

@markNZed
Author

This is a bit of a diverging thread, but I hesitate to create new issues for questions. The bmx is 16x8, and we are targeting a size of 256 x W (where W is typically less than 512). Are there changes to the algorithm that could match up with the initial row count of 256 and improve performance? Or is it best to just break that up into 16x8 chunks? Thanks.
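For reference, the sketch below shows one well-known way to transpose a bit matrix in 16x8 chunks with SSE2: gather one byte from each of 16 rows, then use `_mm_movemask_epi8` to peel off one bit column at a time. It is an illustration under stated assumptions, not the repo's actual `ssebmx` source. The names `bit_transpose_sse2` and `get_bit` are hypothetical, bits are assumed to be stored LSB-first within each byte, and `nrows` must be a multiple of 16 and `ncols` a multiple of 8.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Read bit (r, c) of a row-major bit matrix with ncols columns,
 * assuming LSB-first bit order within each byte. */
static int get_bit(const uint8_t *m, int ncols, int r, int c)
{
    return (m[r * (ncols / 8) + c / 8] >> (c % 8)) & 1;
}

/* Transpose an nrows x ncols bit matrix into an ncols x nrows one,
 * processing 16 rows x 8 columns per inner step. */
static void bit_transpose_sse2(const uint8_t *inp, uint8_t *out,
                               int nrows, int ncols)
{
    assert(nrows % 16 == 0 && ncols % 8 == 0);
    const int in_stride = ncols / 8;  /* bytes per input row */
    const int out_stride = nrows / 8; /* bytes per output row */

    for (int rr = 0; rr < nrows; rr += 16) {
        for (int cc = 0; cc < ncols; cc += 8) {
            /* Gather the same byte column from 16 consecutive rows. */
            uint8_t col[16];
            for (int i = 0; i < 16; ++i)
                col[i] = inp[(rr + i) * in_stride + cc / 8];
            __m128i v = _mm_loadu_si128((const __m128i *)col);

            /* movemask collects the top bit of each byte, i.e. column
             * cc+k for all 16 rows; shifting left exposes the next bit. */
            for (int k = 7; k >= 0; --k) {
                unsigned mask = (unsigned)_mm_movemask_epi8(v);
                uint8_t *dst = out + (cc + k) * out_stride + rr / 8;
                dst[0] = (uint8_t)(mask & 0xFF); /* rows rr .. rr+7 */
                dst[1] = (uint8_t)(mask >> 8);   /* rows rr+8 .. rr+15 */
                v = _mm_slli_epi64(v, 1);
            }
        }
    }
}
```

With this layout, a 256 x W input is simply 16 passes of the outer loop, so breaking it into 16x8 chunks needs no special casing.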

@mischasan
Owner

mischasan commented Mar 31, 2017 via email

@markNZed
Author

Nice idea with INP and OUT. I would hope that the hardware could prefetch, but in any case memory will be the bottleneck. It is premature to optimise now. It will be late next week before I can do profiling, and the current bmx may be plenty.

The application analyses decompressed trace files from digital circuit simulation. One dimension of the matrix is time/cycles and the other dimension is inputs. The matrix can be quite big (e.g. GBs).

@mischasan
Owner

mischasan commented Mar 31, 2017 via email

@markNZed
Author

markNZed commented Apr 1, 2017

Could __builtin_prefetch be a big help with that? If the gather/scatter works on a block that fits in L1...

I should probably mention that we are looking to transpose blocks (kBs), not the entire matrix (potentially GBs), so the scatter can be limited.
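As a rough sketch of the prefetch idea, the helper below gathers one byte column from 16 consecutive rows while issuing `__builtin_prefetch` (a GCC/clang builtin) for the same column in the following 16 rows. The function name `gather16_prefetch` and its parameters are hypothetical. Whether the hint helps at all is an assumption that would need profiling, since hardware prefetchers often handle regular strides on their own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Gather byte `col_byte` from rows row .. row+15 of a row-major matrix
 * with `stride` bytes per row, hinting the next 16 rows into cache. */
static void gather16_prefetch(const uint8_t *inp, size_t stride,
                              size_t nrows, size_t row, size_t col_byte,
                              uint8_t dst[16])
{
    assert(row + 16 <= nrows && col_byte < stride);
    for (size_t i = 0; i < 16; ++i) {
        size_t next = row + 16 + i;
        if (next < nrows) {
            /* rw = 0 (read), locality = 3 (keep in all cache levels). */
            __builtin_prefetch(inp + next * stride + col_byte, 0, 3);
        }
        dst[i] = inp[(row + i) * stride + col_byte];
    }
}
```

The prefetch is only a hint and never changes the gathered values, so it can be removed or tuned after measurement without touching correctness.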

@mischasan
Owner

mischasan commented Apr 1, 2017 via email

@mischasan
Owner

mischasan commented Apr 4, 2017 via email

@markNZed
Author

markNZed commented Apr 5, 2017

Hi, thanks! Can you please upload it to GitHub or https://expirebox.com?

@mischasan
Owner

mischasan commented Apr 5, 2017 via email

@markNZed
Author

Hi, we ran some benchmarking and got slightly better results with code based on http://stackoverflow.com/questions/41778362/how-to-efficiently-transpose-a-2d-bit-matrix targeting a 64x64 matrix, which was surprising: 940.423 MB/s vs 747.659 MB/s, and AVX2 was actually slower at 400.961 MB/s. Thanks for your support!
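For readers unfamiliar with the 64x64 approach, here is a minimal sketch of the classic recursive block-swap transpose on 64-bit words, using no SIMD intrinsics. It is not necessarily the code from the linked question or the code that was benchmarked. The name `transpose64x64` and the convention that bit c of word r holds element (r, c) are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* In-place transpose of a 64x64 bit matrix stored as 64 words,
 * where bit c of a[r] is element (r, c).
 * Each pass swaps the off-diagonal j x j sub-blocks of every
 * 2j x 2j block, for j = 32, 16, 8, 4, 2, 1. */
static void transpose64x64(uint64_t a[64])
{
    assert(a != 0);
    uint64_t m = 0x00000000FFFFFFFFull; /* low j bits of each 2j-bit group */
    for (int j = 32; j != 0; j >>= 1, m ^= m << j) {
        /* Visit every row index k whose bit j is clear. */
        for (int k = 0; k < 64; k = ((k | j) + 1) & ~j) {
            /* Exchange the high columns of row k with the
             * low columns of row k + j. */
            uint64_t t = ((a[k] >> j) ^ a[k | j]) & m;
            a[k | j] ^= t;
            a[k] ^= t << j;
        }
    }
}
```

The whole matrix is 512 bytes and fits in L1, and the loop does six passes of 32 swaps each, which may partly explain why a plain 64-bit version can compete with wider SIMD kernels that spend time on gathers.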

@mischasan
Owner

mischasan commented Apr 13, 2017 via email
