AVX #3
Sure. I've been moving my code to AVX2 (not the bmx proc) proprietarily. I
won't make an upgraded bmx proprietary.
But who is "we"?
On 21 March 2017 at 09:25, markNZed wrote:
Hi,
I found your work searching for a bit-matrix transpose using SIMD. Seems
very close to what we need. AVX is becoming more popular and I was
wondering if that function needs to be modified to leverage AVX
instructions ?
|
If I understood you, you have moved some of the code base to AVX2 but are not planning on publishing that source code. But you may make an AVX version of the bmx procedure available to the public. Is that right ?
"We" is me and a dev who I've asked to help me because he has some SIMD experience. We have been using boost.simd to do some benchmarking.
The GPL could be a problem as I want to develop a commercial application (for engineering). There is no problem sharing changes that we might make to the bmx proc, but the GPL would require releasing all the code it is linked with, and that is problematic.
The app would run on industry server farms, so managing different SIMD implementations/generations is an issue. We were thinking of using gcc intrinsics for this. One idea would be to map bmx to intrinsics. You did not want to use intrinsics ? |
That's correct.
The reason I first post everything under the GPL is curiosity about who is using it, and for what kind of applications.
If LGPL works for you, that's fine with me.
I've switched my own praxis to testing cpuid on the fly, and using alternate code paths for SSE2 and AVX2.
If you compile with gcc, you may note that some versions do not support SSE2 at all when you compile for 32-bit processors.
The code uses the gcc intrinsics either way. I haven't seen any other vector op sets (AMD 3DNow!, ARM NEON) worth supporting.
Do you (or the dev) have any perspective on that?
|
Only targeting x86 at this stage. I tried compiling on Ubuntu 16.04:
cc -g -MMD -fPIC -pthread -fdiagnostics-show-option -fno-strict-aliasing -fstack-protector --param ssp-buffer-size=4 -Wall -Werror -Wextra -Wcast-align -Wcast-qual -Wformat=2 -Wformat-security -Wmissing-prototypes -Wnested-externs -Wpointer-arith -Wshadow -Wstrict-prototypes -Wunused -Wwrite-strings -Wno-attributes -Wno-cast-qual -Wno-error -Wno-unknown-pragmas -Wno-unused-parameter -O3 -I/usr/local/include -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I. -c -o sseutil.o sseutil.c
sseutil.c:1:18: fatal error: plat.h: No such file or directory
Is that file missing from the repo ? |
Oh carp. Yes. Sigh. Here: this is faster than my updating github (srsly)
|
<lame>The file is in my util/ repo as well </lame>
|
Also missing msutil.h and sock.h.
I ran make then make test which gives:
make test
cc -pthread -L/usr/local/lib ssebmx_t.o libsse.a tap.o bitmat.o -lstdc++ -lm -o ssebmx_t
bitmat.o: In function `bitmat_trans':
/home/propacov/shared/proto/i8051-07/src/tests/primitives/sse2/bitmat.c:80: undefined reference to `ssebmx'
/home/propacov/shared/proto/i8051-07/src/tests/primitives/sse2/bitmat.c:80: undefined reference to `ssebmx_m'
collect2: error: ld returned 1 exit status
<builtin>: recipe for target 'ssebmx_t' failed
make: *** [ssebmx_t] Error 1
|
My apologies for leaving it in that state. If you pull my util repo, it has all the files required.
I'm currently in an odd position, having to recover my git remote state/switch interfaces.
I had not really expected anyone to use that project in a while.
|
No problem, it is worth the effort if we can use the code. I resolved the missing files (downloaded the 3 headers from your utils package) but ran into the compile error reported in my previous message. Can you get the bmx test running ? The GNUMakefile and rules are new to me, so it is not easy to quickly see where the issue is. Thanks. |
Sure. I'm going to be in the air for most of today. Pardon, but what tz are
you in? And does your server farm include AVX512 boxes?
|
This is what I can do off my notebook. Passes ssebmx unit tests on my side.
|
And here's an update with AVX2 (mm256) support.
And I'm happy to convert to Apache license if you'll satisfy my curiosity
--- if that can be worded in a way that doesn't impinge on any competitive
secret.
|
Hi, I don't see updates to the repo; are you using attachments with these messages ? I don't think I can access those. I'm in France. I imagine the users of our software will have AVX512 boxes, but I don't have a server farm; I plan to do testing on cloud infrastructure, e.g. AWS. |
With 256- or 512-bit registers, does the optimal size of the bit matrix for transposition change ? |
Yes they were zip attachments.
When I get back I'll update github (need ssh key/ cert)
|
No change: it uses 256-bit ops for as much as fits, and falls through to 128-bit for what doesn't.
|
If you like you could upload to https://expirebox.com/. It is very simple: no login, and it provides a link to the file (which gets deleted after 48 hrs). |
For bmx, does AVX provide improved instructions or is the only benefit larger registers ? |
Sure https://expirebox.com/download/791aa29d46fa7dda158d8b6f52893ea3.html
The cpuid check broke on one other older pc I had access to last night.
Other than that, ssebmx_t.pass speaks for itself.
Lucky you, in France. Paris, Menton and St Remy de Provence are some of my
favourite places to be.
|
No improved instructions for this particular app ... and the core op
(movemask) is not implemented for AVX512.
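For reference, the movemask-centred core being described looks roughly like this (a sketch of the technique, not the repo's exact ssebmx): `_mm_movemask_epi8` extracts bit 7 of all 16 bytes at once, so shifting left and re-reading peels the matrix off one column at a time.

```c
#include <emmintrin.h>
#include <stdint.h>

/* SSE2 sketch: transpose a 16x8 bit matrix (one byte per row) into
   8 rows of 16 bits each. Illustrative, not the repo's ssebmx. */
static void bmx16x8(const uint8_t in[16], uint16_t out[8])
{
    __m128i x = _mm_loadu_si128((const __m128i *)(const void *)in);
    for (int i = 7; i >= 0; --i) {
        /* bit 7 of each byte = column i of every row, as one 16-bit word */
        out[i] = (uint16_t)_mm_movemask_epi8(x);
        x = _mm_slli_epi64(x, 1);
    }
}
```

AVX-512 replaced movemask-style results with mask registers, which is why the op has no direct 512-bit counterpart.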
|
I have a hard time understanding why CPUs don't provide native support for a bitwise transpose; it seems such a fundamental building block. Do you see why that hasn't happened ? The zip ran fine on my machine; I only tried ssebmx_t (I'm using an Intel Core i5 on my laptop). Thanks! Have you tried benchmarking clang against gcc ? I was surprised to see how much better clang-3.8 was than gcc-6.2 on some auto-vectorization test cases; it seemed to make better use of the ymm/xmm registers. Yeah, lucky to be in France, so much comes down to luck... |
Haha I am stealing time to type, let alone perftest.
This ssebmx doesn't use multiple registers. I expect no better than what
gcc 4.4 does unrolling trivial loops.
It could be modified to use multiple registers to make better use of cache
lines. That's not through auto-vectorization, though.
AVX (opinion) is part of Intel's war with AMD --- that's why SSE3+ and AVX+ are such a messy, non-orthogonal arch.
AMD lost, so now Intel has gone back and improved REP MOVSB et al., which is what most people needed.
If I were re-implementing APL :-) I'd think about AVX2 more. It *might* also help on table-driven charset conversion.
I stuck to SSE2 because it was pretty much guaranteed everywhere.
Well, have fun. My home is Vancouver (Canada); it's good even if not France (or Germany).
Are you French?
|
No, a New Zealander; a lot of luck there too! |
This is a bit of a diverging thread, but I hesitate to create new issues for questions. The bmx is 16x8, and I am wondering: if we are targeting a size of 256 x W (where W is typically less than 512), are there changes to the algorithm that could match up with the initial row count of 256 and improve performance ? Or is it best to just break that up into 16x8 chunks ? Thanks. |
Short answer: it doesn't help SSE2, and probably won't help AVX2.
I did some SSE2-only timing a couple of years ago, aiming at using the same input cache line (64 bytes) immediately in the "gather" (INP) loops.
There was a factor of 1.5...2 improvement for the [8x16] becoming [8 x 64], but it only applied for up to [8 x 512] arrays (a special case; someone was interested in that). At that point, fetch from RAM (not cache) became the limiting factor.
That second loop [8 x ...] is slower than the first one [16 x ...].
I have not tried perftesting anything else discussed. A quick small test of changing INP() and OUT() to use induction variables, and so avoid IMUL, suggests it's a quick win.
I'm occupied by a large customer; I will be happy to rethink this in two weeks.
You haven't mentioned what the application is for this (even in general terms); I assume then that you won't.
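The induction-variable idea amounts to replacing a per-access row*stride multiply with a pointer that is bumped by the stride each iteration. A schematic using made-up helper names, not the actual INP()/OUT() macros:

```c
#include <stddef.h>
#include <stdint.h>

/* Index form: the address computation hides an IMUL per access. */
static void gather_mul(uint8_t *dst, const uint8_t *src, size_t stride, int rows)
{
    for (int r = 0; r < rows; ++r)
        dst[r] = src[(size_t)r * stride];
}

/* Induction form: the same walk costs one pointer add per iteration. */
static void gather_ind(uint8_t *dst, const uint8_t *src, size_t stride, int rows)
{
    const uint8_t *p = src;
    for (int r = 0; r < rows; ++r, p += stride)
        dst[r] = *p;
}
```

Modern compilers often do this strength reduction themselves, so the "quick win" would need measuring, as the comment says.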
|
Nice idea with INP and OUT. I would hope that the hardware could prefetch, but in any case memory will be the bottleneck. It is premature to optimise now. It will be late next week before I can do profiling, and the current bmx may well be enough. The application is analysing decompressed trace files from digital circuit simulation. One dimension of the matrix is time/cycles and the other dimension is inputs. The matrix can be quite big (e.g. GBs). |
Thanks; and that's all I wanted to know. Best of luck to you (folks) on
that.
Cache-line caching does a lot. For transpose, the access pattern is too
hard for prefetch to spot; and if you widen the contiguous access on the
gather (INP) side, you create sparser action on the scatter side.
I'll switch to induction indexes for INP and OUT as soon as I get a chance
to exhale.
|
Could __builtin_prefetch be a big help with that ? If the gather/scatter work on a block that fits in L1... I should probably mention that we are looking to transpose blocks (kBs), not the entire matrix (potentially GBs), so the scatter can be limited. |
Unfortunately not. I tested prefetch heavily for a version of memcpy using SSE2; it is a minor improvement when there is a single output target cache line, and bmx does scatter output. Always happy to be proven wrong.
|
Okay, here's the final cut (from my side). It has no IMULs. It uses AVX2 if that is defined at compile time. A run-time test for CPUID is cheap; I'm afraid I have to move on and won't be doing that.
To complete that previous comment about prefetch: it has a limited use for prefetching _target_ memory, prior to updating bytes in a new cache line. Some CPUs appear to have a limited queue for prefetches; if you do it too often, performance starts to degrade below having no prefetch at all.
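The target-memory prefetch being described might look like the following sketch: hinting the next destination cache line on the scatter side before it is written. This is an illustration of `__builtin_prefetch`, not code from the project, and whether it helps depends on the CPU's prefetch queue, as noted.

```c
#include <stddef.h>
#include <stdint.h>

/* Scatter bytes down a strided column, prefetching the *next*
   destination line ahead of the write. Illustrative only. */
static void scatter_col(uint8_t *dst, size_t stride, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (i + 1 < n)
            /* args: address, rw=1 (write), locality=0 (low) */
            __builtin_prefetch(dst + (i + 1) * stride, 1, 0);
        dst[i * stride] = src[i];
    }
}
```

Issuing one hint per write like this is exactly the "too often" case warned about above; a real experiment would space the hints out per cache line.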
|
Hi, thanks! Can you please upload it to github or https://expirebox.com |
Hi, we ran some benchmarking and got slightly better results with code based on http://stackoverflow.com/questions/41778362/how-to-efficiently-transpose-a-2d-bit-matrix targeting a 64x64 matrix. It was surprising: 940.423 MB/s vs 747.659 MB/s, and AVX2 was actually slower at 400.961 MB/s. Thanks for your support! |
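The linked Stack Overflow approach is essentially the word-parallel 64x64 transpose from Hacker's Delight: recursively XOR-swap opposite quadrants with shifts. A sketch (assuming the usual convention that bit 63 of word r is column 0):

```c
#include <stdint.h>

/* In-place 64x64 bit transpose, Hacker's Delight style: at each step,
   swap the top-right and bottom-left j x j quadrants via masked XOR. */
static void transpose64(uint64_t a[64])
{
    uint64_t m = 0x00000000FFFFFFFFULL;
    for (int j = 32; j != 0; j >>= 1, m ^= m << j) {
        for (int k = 0; k < 64; k = (k + j + 1) & ~j) {
            uint64_t t = (a[k] ^ (a[k + j] >> j)) & m;
            a[k]     ^= t;
            a[k + j] ^= t << j;
        }
    }
}
```

No SIMD at all, which fits the benchmark result: 64 plain 64-bit registers' worth of shift/XOR work, friendly to any x86-64 part.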
Terrific! Non-hardware-specific is always preferable. Good luck with your application of it.
|