AVX #3
Sure. I've been moving my code to AVX2 (not the bmx proc) proprietarily. I
won't make an upgraded bmx proprietary.
But who is "we"?
On 21 March 2017 at 09:25, markNZed wrote:
Hi,
I found your work searching for a bit-matrix transpose using SIMD. Seems
very close to what we need. AVX is becoming more popular and I was
wondering if that function needs to be modified to leverage AVX
instructions ?
|
If I understood you, you have moved some of the code base to AVX2 but are not planning on publishing that source code. But you may make an AVX version of the bmx procedure available to the public. Is that right ?
"We" is me and a dev who I've asked to help me because he has some SIMD experience. We have been using boost.simd to do some benchmarking.
The GPL could be a problem as I want to develop a commercial application (for engineering). There is no problem sharing changes that we might make to the bmx proc, but the GPL would require releasing all the code it is linked with, and that is problematic.
The app would run on industry server farms, so managing different SIMD implementations/generations is an issue. We were thinking of using gcc intrinsics for this. One idea would be to map bmx to intrinsics. You did not want to use intrinsics ? |
That's correct.
The reason I first post everything under the GPL is curiosity about who is using it, and for what kind of applications.
If LGPL works for you, that's fine with me.
I've switched my own praxis to testing cpuid on the fly, and using alternate code paths for SSE2 and AVX2.
If you compile with gcc, you may note that some versions do not support SSE2 at all when you compile for 32-bit processors.
The code uses the gcc intrinsics either way. I haven't seen any other vector op sets (AMD 3DNow!, ARM NEON) worth supporting.
Do you (or the dev) have any perspective on that?
|
Only targeting x86 at this stage. I tried compiling on Ubuntu 16.04:
cc -g -MMD -fPIC -pthread -fdiagnostics-show-option -fno-strict-aliasing -fstack-protector --param ssp-buffer-size=4 -Wall -Werror -Wextra -Wcast-align -Wcast-qual -Wformat=2 -Wformat-security -Wmissing-prototypes -Wnested-externs -Wpointer-arith -Wshadow -Wstrict-prototypes -Wunused -Wwrite-strings -Wno-attributes -Wno-cast-qual -Wno-error -Wno-unknown-pragmas -Wno-unused-parameter -O3 -I/usr/local/include -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I. -c -o sseutil.o sseutil.c
sseutil.c:1:18: fatal error: plat.h: No such file or directory
Is that file missing from the repo ? |
Oh carp. Yes. Sigh. Here: this is faster than my updating github (srsly)
|
<lame>The file is in my util/ repo as well </lame>
|
Also missing msutil.h and sock.h.
I ran make then make test which gives:
make test
cc -pthread -L/usr/local/lib ssebmx_t.o libsse.a tap.o bitmat.o -lstdc++ -lm -o ssebmx_t
bitmat.o: In function `bitmat_trans':
/home/propacov/shared/proto/i8051-07/src/tests/primitives/sse2/bitmat.c:80: undefined reference to `ssebmx'
/home/propacov/shared/proto/i8051-07/src/tests/primitives/sse2/bitmat.c:80: undefined reference to `ssebmx_m'
collect2: error: ld returned 1 exit status
<builtin>: recipe for target 'ssebmx_t' failed
make: *** [ssebmx_t] Error 1
|
My apologies for leaving it in that state. If you pull my util repo, it has all the files required.
I'm currently in an odd position, having to recover my git remote state/switch interfaces.
I had not really expected anyone to use that project in a while.
|
No problem, it is worth the effort if we can use the code. I resolved the missing files (downloaded the 3 headers from your utils package) but ran into the compile error reported in my previous message. Can you get the bmx test running ? The GNUMakefile and rules are new to me, so it is not easy to quickly see where the issue is. Thanks. |
Sure. I'm going to be in the air for most of today. Pardon, but what tz are
you in? And does your server farm include AVX512 boxes?
|
This is what I can do off my notebook. Passes ssebmx unit tests on my side.
|
And here's an update with AVX2 (mm256) support.
And I'm happy to convert to Apache license if you'll satisfy my curiosity
--- if that can be worded in a way that doesn't impinge on any competitive
secret.
|
Hi, I don't see updates to the repo; are you using attachments with these messages ? I don't think I can access those. I'm in France. I imagine the users of our software will have AVX512 boxes, but I don't have a server farm; I plan to do testing on cloud infrastructure, e.g. AWS. |
With 256- or 512-bit registers, does the optimal size of the bit matrix for transposition change ? |
Yes they were zip attachments.
When I get back I'll update github (need ssh key/ cert)
|
No change: it uses 256-bit ops for as much as fits, and falls through to 128-bit for what doesn't.
|
If you like you could upload to https://expirebox.com/. It is very simple: no login, and it provides a link to the file (which gets deleted after 48 hrs). |
For bmx, does AVX provide improved instructions or is the only benefit larger registers ? |
Sure https://expirebox.com/download/791aa29d46fa7dda158d8b6f52893ea3.html
The cpuid check broke on one other older pc I had access to last night.
Other than that, ssebmx_t.pass speaks for itself.
Lucky you, in France. Paris, Menton and St Remy de Provence are some of my
favourite places to be.
|
No improved instructions for this particular app ... and the core op
(movemask) is not implemented for AVX512.
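For reference, the movemask-centred core being described looks roughly like this (a sketch of the technique, not the repo's exact ssebmx): `_mm_movemask_epi8` extracts bit 7 of all 16 bytes at once, so shifting left and re-reading peels the matrix off one column at a time.

```c
#include <emmintrin.h>
#include <stdint.h>

/* SSE2 sketch: transpose a 16x8 bit matrix (one byte per row) into
   8 rows of 16 bits each. Illustrative, not the repo's ssebmx. */
static void bmx16x8(const uint8_t in[16], uint16_t out[8])
{
    __m128i x = _mm_loadu_si128((const __m128i *)(const void *)in);
    for (int i = 7; i >= 0; --i) {
        /* bit 7 of each byte = column i of every row, as one 16-bit word */
        out[i] = (uint16_t)_mm_movemask_epi8(x);
        x = _mm_slli_epi64(x, 1);
    }
}
```

AVX-512 replaced movemask-style results with mask registers, which is why the op has no direct 512-bit counterpart.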
|
I have a hard time understanding why CPUs don't provide native support for a bitwise transpose; it seems such a fundamental building block. Do you see why that hasn't happened ? The zip ran fine on my machine; I only tried ssebmx_t (I'm using an Intel Core i5 on my laptop). Thanks! Have you tried benchmarking clang against gcc ? I was surprised to see how much better clang-3.8 was than gcc-6.2 on some auto-vectorization test cases; it seemed to make better use of the ymm/xmm registers. Yeah, lucky to be in France, so much comes down to luck... |
Haha I am stealing time to type, let alone perftest.
This ssebmx doesn't use multiple registers. I expect no better than what
gcc 4.4 does unrolling trivial loops.
It could be modified to use multiple registers to make better use of cache
lines. That's not through auto-vectorization, though.
AVX (opinion) is part of Intel's war with AMD --- that's why SSE3+ and AVX+ are such a messy, non-orthogonal arch.
AMD lost, so now Intel has gone back and improved REP MOVSB et al., which is what most people needed.
If I were re-implementing APL :-) I'd think about AVX2 more. It *might* also help on table-driven charset conversion.
I stuck to SSE2 because it was pretty much guaranteed everywhere.
Well, have fun. My home is Vancouver (Canada); it's good even if not France (or Germany).
Are you French?
|
No, a New Zealander; a lot of luck there too! |
This is a bit of a diverging thread, but I hesitate to create new issues for questions. The bmx is 16x8, and I am wondering: if we are targeting a size of 256 x W (where W is typically less than 512), are there changes to the algorithm that could match up with the initial row count of 256 and improve performance ? Or is it best to just break that up into 16x8 chunks ? Thanks. |
Short answer: it doesn't help SSE2, and probably won't help AVX2.
I did some SSE2-only timing a couple of years ago, aiming at using the same input cache line (64 bytes) immediately in the "gather" (INP) loops.
There was a factor of 1.5...2 improvement for the [8x16] becoming [8 x 64], but it only applied for up to [8 x 512] arrays (a special case; someone was interested in that). At that point, fetch from RAM (not cache) became the limiting factor.
That second loop [8 x ...] is slower than the first one [16 x ...].
I have not tried perftesting anything else discussed. A quick small test of changing INP() and OUT() to use induction variables, and so avoid IMUL, suggests it's a quick win.
I'm occupied by a large customer; I will be happy to rethink this in two weeks.
You haven't mentioned what the application is for this (even in general terms); I assume then that you won't.
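The induction-variable idea amounts to replacing a per-access row*stride multiply with a pointer that is bumped by the stride each iteration. A schematic using made-up helper names, not the actual INP()/OUT() macros:

```c
#include <stddef.h>
#include <stdint.h>

/* Index form: the address computation hides an IMUL per access. */
static void gather_mul(uint8_t *dst, const uint8_t *src, size_t stride, int rows)
{
    for (int r = 0; r < rows; ++r)
        dst[r] = src[(size_t)r * stride];
}

/* Induction form: the same walk costs one pointer add per iteration. */
static void gather_ind(uint8_t *dst, const uint8_t *src, size_t stride, int rows)
{
    const uint8_t *p = src;
    for (int r = 0; r < rows; ++r, p += stride)
        dst[r] = *p;
}
```

Modern compilers often do this strength reduction themselves, so the "quick win" would need measuring, as the comment says.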
|
Nice idea with INP and OUT. I would hope that the hardware could prefetch, but in any case memory will be the bottleneck. It is premature to optimise now. It will be late next week before I can do profiling, and the current bmx may well be enough. The application is analysing decompressed trace files from digital circuit simulation. One dimension of the matrix is time/cycles and the other dimension is inputs. The matrix can be quite big (e.g. GBs). |
Thanks; and that's all I wanted to know. Best of luck to you (folks) on
that.
Cache-line caching does a lot. For transpose, the access pattern is too
hard for prefetch to spot; and if you widen the contiguous access on the
gather (INP) side, you create sparser action on the scatter side.
I'll switch to induction indexes for INP and OUT as soon as I get a chance
to exhale.
|
Could __builtin_prefetch be a big help with that ? If the gather/scatter work on a block that fits in L1... I should probably mention that we are looking to transpose blocks (kBs), not the entire matrix (potentially GBs), so the scatter can be limited. |
Unfortunately not. I tested prefetch heavily for a version of memcpy using SSE2; it is a minor improvement when there is a single output target cache line, and bmx does scatter output. Always happy to be proven wrong.
|
Okay, here's the final cut (from my side). It has no IMULs. It uses AVX2 if that is defined at compile time. A run-time test for CPUID is cheap; I'm afraid I have to move on and won't be doing that.
To complete that previous comment about prefetch: it has a limited use for prefetching _target_ memory, prior to updating bytes in a new cache line. Some CPUs appear to have a limited queue for prefetches; if you do it too often, performance starts to degrade below having no prefetch at all.
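The target-memory prefetch being described might look like the following sketch: hinting the next destination cache line on the scatter side before it is written. This is an illustration of `__builtin_prefetch`, not code from the project, and whether it helps depends on the CPU's prefetch queue, as noted.

```c
#include <stddef.h>
#include <stdint.h>

/* Scatter bytes down a strided column, prefetching the *next*
   destination line ahead of the write. Illustrative only. */
static void scatter_col(uint8_t *dst, size_t stride, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (i + 1 < n)
            /* args: address, rw=1 (write), locality=0 (low) */
            __builtin_prefetch(dst + (i + 1) * stride, 1, 0);
        dst[i * stride] = src[i];
    }
}
```

Issuing one hint per write like this is exactly the "too often" case warned about above; a real experiment would space the hints out per cache line.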
|
Hi, thanks! Can you please upload it to github or https://expirebox.com |
Hi, we ran some benchmarking and got slightly better results with code based on http://stackoverflow.com/questions/41778362/how-to-efficiently-transpose-a-2d-bit-matrix targeting a 64x64 matrix. It was surprising: 940.423 MB/s vs 747.659 MB/s, and AVX2 was actually slower at 400.961 MB/s. Thanks for your support! |
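The linked Stack Overflow approach is essentially the word-parallel 64x64 transpose from Hacker's Delight: recursively XOR-swap opposite quadrants with shifts. A sketch (assuming the usual convention that bit 63 of word r is column 0):

```c
#include <stdint.h>

/* In-place 64x64 bit transpose, Hacker's Delight style: at each step,
   swap the top-right and bottom-left j x j quadrants via masked XOR. */
static void transpose64(uint64_t a[64])
{
    uint64_t m = 0x00000000FFFFFFFFULL;
    for (int j = 32; j != 0; j >>= 1, m ^= m << j) {
        for (int k = 0; k < 64; k = (k + j + 1) & ~j) {
            uint64_t t = (a[k] ^ (a[k + j] >> j)) & m;
            a[k]     ^= t;
            a[k + j] ^= t << j;
        }
    }
}
```

No SIMD at all, which fits the benchmark result: 64 plain 64-bit registers' worth of shift/XOR work, friendly to any x86-64 part.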
Terrific! Non-hardware-specific is always preferable. Good luck with your application of it.
|