
add AVX2 support for bitslice (TODO BS_LOAD_DEINTERLEAVE_8) #4

Closed
wants to merge 14 commits

Conversation


@mkaluza mkaluza commented Jan 16, 2018

benchmarks on core i5-6400 @ 2.7GHz (Mbits/s)

sse2	1200	1220
ssse3	1340	1360
avx2	1640	1680

Warning: Using AVX makes Intel CPUs lower their clock, which can have a
negative impact on the performance of other, non-AVX code, so be sure to
benchmark your real-world setup instead of blindly using it.
@mkaluza
Author

mkaluza commented Jan 16, 2018

It passes test/testbitslice, but unfortunately I don't understand what BS_LOAD_DEINTERLEAVE_8 does, so it was hard for me to extend it to avx2.

It's interesting though that despite using 2x wider registers there's only a 25-30% speedup. The missing BS_LOAD_DEINTERLEAVE_8 probably has something to do with it, but judging by the difference between sse2 and ssse3, it shouldn't be that much of a hit. Any ideas?

And curiously the CPU doesn't lower its clock when running avx2 code, so maybe the clock penalty only applies to avx-512...

#ifndef DVBCSA_AVX_H_
#define DVBCSA_AVX_H_

#include <xmmintrin.h>
Owner

Only <immintrin.h> is needed

#define BS_EMPTY()

/* block cipher 2-word load with byte-deinterleaving */
/* FIXME no idea about what it does, so its hard to modify...
Owner

Looks like shuffle and unpack work on 128 bit only.

Author
@mkaluza mkaluza Jan 16, 2018

hmmm, there are 256 bit versions of unpack and shuffle:
_mm256_shuffle_epi8: https://software.intel.com/en-us/node/524017
_mm256_unpacklo_epi64: https://software.intel.com/en-us/node/524002
_mm256_unpackhi_epi64: https://software.intel.com/en-us/node/524001

The thing is I don't know what those numbers mean:
a = _mm_shuffle_epi8(a, _mm_set_epi8(15,13,11,9,7,5,3,1,14,12,10,8,6,4,2,0));
and how to extend them to 32 bytes.
Tried this, but it was just a guess, because I don't know what it means/does:
a = _mm256_shuffle_epi8(a, _mm256_set_epi8(31,27,23,19,15,11,7,3, 30,26,22,18,14,10,6,2, 29,25,21,17,13,9,5,1, 28,24,20,16,12,8,4,0)); \

However testbitslice fails for tests 2 and 3 - see commit message 3e5396c

At least it was worth a try...
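For reference, a minimal standalone sketch (not from the PR) of what the mask means: _mm_shuffle_epi8 uses each mask byte as a source index, so result byte i becomes a[mask[i]]. Since _mm_set_epi8 lists its arguments from byte 15 down to byte 0, the mask above reads 0,2,4,...,14,1,3,...,15 from byte 0 upward - even source bytes land in the low half, odd bytes in the high half:

/* compile with -mssse3; prints 00 02 04 ... 0e 01 03 05 ... 0f */
#include <stdio.h>
#include <stdint.h>
#include <tmmintrin.h>

int main(void)
{
    uint8_t buf[16];
    for (int i = 0; i < 16; i++)
        buf[i] = (uint8_t)i;
    __m128i a = _mm_loadu_si128((const __m128i *)buf);
    /* deinterleave: even bytes -> low 8 bytes, odd bytes -> high 8 bytes */
    a = _mm_shuffle_epi8(a, _mm_set_epi8(15,13,11,9,7,5,3,1,
                                         14,12,10,8,6,4,2,0));
    _mm_storeu_si128((__m128i *)buf, a);
    for (int i = 0; i < 16; i++)
        printf("%02x ", buf[i]);
    printf("\n");
    return 0;
}

Note that _mm256_shuffle_epi8 applies the same operation to each 128-bit lane independently, which is why a straight 32-byte extension of the mask cannot produce a full 256-bit deinterleave on its own (as the commit message further below confirms).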

@glenvt18
Owner

glenvt18 commented Jan 16, 2018

@mkaluza Thanks:) You also have to add the avx include file to libdvbcsa_la_SOURCES in src/Makefile.am.

The BS_LOAD_DEINTERLEAVE_8 macro loads two bs_word vectors and deinterleaves even and odd bytes. It helps to optimize the sbox lookup a bit, using one 16-bit lookup instead of two 8-bit ones. There is a nice ARM NEON instruction for that. It looks like there is no efficient way of doing it with AVX. So, just remove this macro.

As for performance, the heaviest part is block cipher sbox (a simple table lookup) which can't benefit from SIMD. So, performance growth is not linear.

How about adding AVX-512? I suggest using a single include file dvbcsa_bs_avx.h and #ifdef ... The only difference is 256->512 (you can copy and replace in the editor) and BS_BATCH_BYTES = 64.

EDIT. On deinterleaving on ARM:
https://community.arm.com/processors/b/blog/posts/coding-for-neon---part-1-load-and-stores
libdvbcsa uses the VLD2.8 instruction.
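As an illustration (a hedged sketch, not the library's actual macro), the NEON load-with-deinterleave is a single intrinsic:

#include <arm_neon.h>

/* vld2q_u8 maps to VLD2.8: one instruction loads 32 bytes and splits
   them, even-indexed bytes into val[0], odd-indexed bytes into val[1]. */
static inline void load_deinterleave_8(const uint8_t *p,
                                       uint8x16_t *even, uint8x16_t *odd)
{
    uint8x16x2_t v = vld2q_u8(p);
    *even = v.val[0];
    *odd  = v.val[1];
}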

EDIT. I usually run the benchmark in series like seq 20 | xargs -n1 test/benchbitslice to 'warm up' the CPU.

Marcin Kaluza added 2 commits January 16, 2018 19:47
* DVBCSA test *
 - Generating batch with 256 packets, size range: 184-184, pattern: 0xa5
 - Encrypting each packet using dvbcsa_encrypt()
 - Decrypting batch using _bitslice_ dvbcsa_bs_decrypt()
 - Checking results...
 - Decrypting each packet using dvbcsa_decrypt()
 - Encrypting batch using _bitslice_ dvbcsa_bs_encrypt()
 - Checking results...
 - Ok !
 - Generating batch with 256 packets, size range: 100-184 (random), pattern: 0xff (random)
 - Encrypting each packet using dvbcsa_encrypt()
 - Decrypting batch using _bitslice_ dvbcsa_bs_decrypt()
 - Checking results...
 - #1 Failed !
- failed - 183 bytes
0x0e, 0xe7, 0x32, 0xaf, 0xb4, 0xb9, 0xde, 0x50, 0x67, 0x48, 0x57, 0xa3, 0x3c, 0x1e, 0x6f, 0x70,
0xf5, 0x93, 0xa0, 0xd4, 0xb2, 0x46, 0x46, 0x13, 0xf2, 0x61, 0x08, 0x81, 0x39, 0xaa, 0xde, 0x80,
0x0d, 0x40, 0x90, 0xa0, 0x5d, 0x6e, 0x40, 0xf6, 0x4f, 0x9f, 0x9a, 0x5e, 0xdf, 0xd1, 0x74, 0xaf,
0xa7, 0x64, 0xc9, 0xab, 0x19, 0x24, 0x3a, 0x17, 0x9a, 0x20, 0x10, 0x06, 0x0b, 0x38, 0x93, 0x0c,
0x27, 0xec, 0x7f, 0x8f, 0x04, 0xe9, 0xf4, 0xe4, 0x14, 0xfb, 0x22, 0xfd, 0xbf, 0xc7, 0xf1, 0x7e,
0x97, 0x65, 0x43, 0xb9, 0xb4, 0x36, 0x7e, 0xff, 0xac, 0xe0, 0x75, 0xd9, 0x0b, 0xc4, 0x8b, 0xc0,
0x6f, 0xb7, 0x15, 0x0e, 0xa0, 0x54, 0x48, 0x14, 0xcb, 0x56, 0x4e, 0x68, 0x48, 0x24, 0x4b, 0xaf,
0x1a, 0x57, 0xcb, 0x68, 0xf6, 0xce, 0x38, 0x83, 0x91, 0x2a, 0x3f, 0xa7, 0x80, 0x34, 0x5f, 0x6d,
0xb7, 0x62, 0x7e, 0x63, 0x07, 0x5c, 0xc6, 0xb3, 0x52, 0x18, 0xd2, 0xeb, 0x93, 0xd6, 0x0e, 0x9a,
0x12, 0x4b, 0x32, 0xdc, 0xe2, 0xd8, 0xa9, 0x28, 0xb0, 0x5c, 0xac, 0x0c, 0xdd, 0x66, 0x5f, 0x3d,
0x1c, 0x01, 0xd0, 0x6d, 0xda, 0x98, 0x0d, 0x23, 0xcb, 0xa9, 0xf0, 0x39, 0xf1, 0xf0, 0xcc, 0x63,
0x73, 0x73, 0x73, 0x73, 0x73, 0x73, 0x73,
 - Generating batch with 256 packets, size range: 0-184, pattern: 0xff (random)
 - Encrypting each packet using dvbcsa_encrypt()
 - Decrypting batch using _bitslice_ dvbcsa_bs_decrypt()
 - Checking results...
 - #9 Failed !
- failed - 9 bytes
0x47, 0x10, 0xe8, 0x78, 0x82, 0x2e, 0x46, 0xe9, 0xd2,
=======================
2 out of 3 tests FAILED.
@glenvt18
Owner

@mkaluza Could you try other compiler optimization settings (or run ./configure with --enable-debug)?

@glenvt18
Owner

@mkaluza Apart from test failures, do you see the performance improvement?

@mkaluza
Author

mkaluza commented Jan 16, 2018

There seems to be little to no speedup with the macro... I didn't do any stats, but looking at the numbers it seems to be ~100mbps, so it's similar to the sse2 vs ssse3 difference... So probably not worth it if it potentially causes errors.

outputs from seq 20 | xargs -n1 test/benchbitslice | grep "packets proceded"
bench_with_macro.txt
bench_without_macro.txt

What other optimization options would you like me to test?
With --enable-debug perf dropped to around 200mbps. No extra info was printed anywhere, and testbitslice also failed tests 2 and 3.

Thanks for the perf explanation. I knew there were two ciphers, but didn't know that only one gets the SIMD speedup.

But going along this way: if we see diminishing returns here, as doubling the word width gave only a 25% speed increase (meaning the stream part is already well optimized and the block part takes up most of the time), then going from avx2 to avx512 will probably give an even smaller speedup. And when you add this (https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/) to the mix, it might not be worth the effort, as it might as well produce lower performance than AVX2...

Besides, I have no AVX512-capable cpu to test it.
Besides #2: full cable bandwidth in dvb-c is 97 * 50mbps = 4850mbps. Considering that one $200 i5 cpu can do more even without avx (5200 vs 6400mbps with avx), we're probably already beyond what anyone will ever need :)

@glenvt18
Owner

glenvt18 commented Jan 17, 2018

@mkaluza Could you run test/testbsops with BS_LOAD_DEINTERLEAVE_8 enabled? It has a dedicated test (I should have mentioned it in the first place, sorry). Just run make check. We might have a bug here. You can also try adding -fno-strict-aliasing.

What other optimization options would you like me to test?

Try --enable-alt-sbox. It also depends on the compiler. Sometimes I see 5-10% regressions. The block cipher's kernel is simple but very sensitive to the quality of the generated code.

I knew there are two ciphers but didn't know that only one gets the SIMD speedup.

There are DVBCSA_DISABLE_STREAM and DVBCSA_DISABLE_BLOCK in dvbcsa_bs_algo.c. You can run benchmarks of each cipher separately. The block cipher uses SIMD everywhere except sbox lookup and permutation - https://en.wikipedia.org/wiki/Common_Scrambling_Algorithm#Block_cipher.

I have no AVX512 capable cpu to test it

We can ask @nto.

BTW, what about power consumption, AVX vs SSSE3?

@nto

nto commented Jan 17, 2018

Here are the results I got, including AVX-512:

Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz

Mode Decrypt Encrypt
uint32 694.4 693.7
uint64 1112.4 1134.7
MMX 791.6 786.5
SSE2 1673.6 1706.7
SSSE3 1914.4 1924.4
AVX2 2253.1 2279.7
AVX-512 2111.7 2281.1

AVX-512 is a bit slower than AVX2 once the CPU is warm. It's only slightly faster on the first few runs.

@glenvt18
Owner

@nto Nice work:) Thank you very much.
Did make check pass with AVX-512? Did you enable BS_LOAD_DEINTERLEAVE_8 with AVX2? Just out of curiosity, could you set DVBCSA_DISABLE_BLOCK and benchmark the stream cipher alone to see how good AVX2/512 are?

@nto

nto commented Jan 17, 2018

Did make check pass with AVX-512?

Yes.

Did you enable BS_LOAD_DEINTERLEAVE_8 with AVX2?

I tried a few variations of the macro, but I couldn't get any of them to pass make check.

Just out of curiosity, could you set DVBCSA_DISABLE_BLOCK and benchmark the stream cipher alone to see how good AVX2/512 are?

Mode Decrypt Encrypt
uint32 1794.2 1794.8
uint64 3508.1 3493.2
MMX 2745.1 2746.2
SSE2 6650.6 6654.1
SSSE3 6632.6 6634.9
AVX2 12785.3 12818.4
AVX-512 14737.9 14918.5

@glenvt18
Owner

@nto Thanks. Looks like the block cipher sbox/permute is the bottleneck (as expected). I know nothing about AVX-512 and can't comment on AVX2 vs AVX-512 figures. Other figures are nice.

@kierank

kierank commented Jan 17, 2018

I haven't looked in detail at the s-boxes, but have you considered using AVX2 gather for them?

@mkaluza
Author

mkaluza commented Jan 17, 2018

I did some batch testing with -march=native -mtune=native, -fno-strict-aliasing both with and without PGO. Selected best result out of 10 runs of benchbitslice. CPU core i5-6400 @ 2.7GHz, no turbo, gcc version 6.3.0 20161221 (release) (PLD-Linux)

It appears that march=native improves sse2 by 11%, ssse3 by 4%. Apart from that there are no measurable gains (but no slowdowns either).

Later I'll repeat those for each cipher separately.

Summary for uint32

==> 2018_01_17_10_57_alt_sbox_uint32 <==

  • 520192 packets proceded, 496.7 Mbits/s

==> 2018_01_17_11_15_no_flags_uint32 <==

  • 520192 packets proceded, 496.7 Mbits/s

==> 2018_01_17_11_18_march_native_uint32 <==

  • 520192 packets proceded, 500.5 Mbits/s

==> 2018_01_17_11_20_march_native_no_strict_aliasing_uint32 <==

  • 520192 packets proceded, 504.8 Mbits/s

==> 2018_01_17_11_27_no_flags_prof_uint32 <==

  • 520192 packets proceded, 497.0 Mbits/s

==> 2018_01_17_11_30_march_native_prof_uint32 <==

  • 520192 packets proceded, 501.5 Mbits/s

==> 2018_01_17_11_32_march_native_no_strict_aliasing_prof_uint32 <==

  • 520192 packets proceded, 500.8 Mbits/s

Summary for uint64

==> 2018_01_17_10_57_alt_sbox_uint64 <==

  • 520192 packets proceded, 776.8 Mbits/s

==> 2018_01_17_11_15_no_flags_uint64 <==

  • 520192 packets proceded, 776.4 Mbits/s

==> 2018_01_17_11_18_march_native_uint64 <==

  • 520192 packets proceded, 780.8 Mbits/s

==> 2018_01_17_11_20_march_native_no_strict_aliasing_uint64 <==

  • 520192 packets proceded, 777.8 Mbits/s

==> 2018_01_17_11_27_no_flags_prof_uint64 <==

  • 520192 packets proceded, 768.0 Mbits/s

==> 2018_01_17_11_30_march_native_prof_uint64 <==

  • 520192 packets proceded, 783.1 Mbits/s

==> 2018_01_17_11_32_march_native_no_strict_aliasing_prof_uint64 <==

  • 520192 packets proceded, 780.8 Mbits/s

Summary for mmx

==> 2018_01_17_10_57_alt_sbox_mmx <==

  • 520192 packets proceded, 646.2 Mbits/s

==> 2018_01_17_11_15_no_flags_mmx <==

  • 520192 packets proceded, 643.6 Mbits/s

==> 2018_01_17_11_18_march_native_mmx <==

  • 520192 packets proceded, 640.9 Mbits/s

==> 2018_01_17_11_20_march_native_no_strict_aliasing_mmx <==

  • 520192 packets proceded, 639.0 Mbits/s

==> 2018_01_17_11_27_no_flags_prof_mmx <==

  • 520192 packets proceded, 642.4 Mbits/s

==> 2018_01_17_11_30_march_native_prof_mmx <==

  • 520192 packets proceded, 638.6 Mbits/s

==> 2018_01_17_11_32_march_native_no_strict_aliasing_prof_mmx <==

  • 520192 packets proceded, 640.1 Mbits/s

Summary for sse2

==> 2018_01_17_10_57_alt_sbox_sse2 <==

  • 520192 packets proceded, 1225.3 Mbits/s

==> 2018_01_17_11_15_no_flags_sse2 <==

  • 520192 packets proceded, 1217.5 Mbits/s

==> 2018_01_17_11_18_march_native_sse2 <==

  • 520192 packets proceded, 1368.2 Mbits/s

==> 2018_01_17_11_20_march_native_no_strict_aliasing_sse2 <==

  • 520192 packets proceded, 1344.9 Mbits/s

==> 2018_01_17_11_27_no_flags_prof_sse2 <==

  • 520192 packets proceded, 1223.6 Mbits/s

==> 2018_01_17_11_30_march_native_prof_sse2 <==

  • 520192 packets proceded, 1358.3 Mbits/s

==> 2018_01_17_11_32_march_native_no_strict_aliasing_prof_sse2 <==

  • 520192 packets proceded, 1364.1 Mbits/s

Summary for ssse3

==> 2018_01_17_10_57_alt_sbox_ssse3 <==

  • 520192 packets proceded, 1396.5 Mbits/s

==> 2018_01_17_11_15_no_flags_ssse3 <==

  • 520192 packets proceded, 1397.0 Mbits/s

==> 2018_01_17_11_18_march_native_ssse3 <==

  • 520192 packets proceded, 1463.0 Mbits/s

==> 2018_01_17_11_20_march_native_no_strict_aliasing_ssse3 <==

  • 520192 packets proceded, 1453.5 Mbits/s

==> 2018_01_17_11_27_no_flags_prof_ssse3 <==

  • 520192 packets proceded, 1398.1 Mbits/s

==> 2018_01_17_11_30_march_native_prof_ssse3 <==

  • 520192 packets proceded, 1460.9 Mbits/s

==> 2018_01_17_11_32_march_native_no_strict_aliasing_prof_ssse3 <==

  • 520192 packets proceded, 1460.1 Mbits/s

Summary for avx2

==> 2018_01_17_10_57_alt_sbox_avx2 <==

  • 520192 packets proceded, 1691.0 Mbits/s

==> 2018_01_17_11_15_no_flags_avx2 <==

  • 520192 packets proceded, 1676.8 Mbits/s

==> 2018_01_17_11_18_march_native_avx2 <==

  • 520192 packets proceded, 1690.5 Mbits/s

==> 2018_01_17_11_20_march_native_no_strict_aliasing_avx2 <==

  • 520192 packets proceded, 1684.3 Mbits/s

==> 2018_01_17_11_27_no_flags_prof_avx2 <==

  • 520192 packets proceded, 1687.6 Mbits/s

==> 2018_01_17_11_30_march_native_prof_avx2 <==

  • 520192 packets proceded, 1695.5 Mbits/s

==> 2018_01_17_11_32_march_native_no_strict_aliasing_prof_avx2 <==

  • 520192 packets proceded, 1701.0 Mbits/s

Marcin Kaluza added 3 commits January 17, 2018 13:41
It still doesn't work, but its one step closer
Expected
000102030405060708090a0b0c0d0e0f000102030405060708090a0b0c0d0e0f
Got
0001020304050607000102030405060708090a0b0c0d0e0f08090a0b0c0d0e0f

It turns out _mm256_shuffle_epi8 works only within 128bit slices, so
after shuffle we get
a = AE1 AO1 AE2 AO2
b = BE1 BO1 BE2 BO2
and after unpack
lo = AE1 BE1 AE2 BE2
hi = AO1 BO1 AO2 BO2
and we need
lo = AE1 AE2 BE1 BE2
hi = AO1 AO2 BO1 BO2
It works, but it's slow; probably AVX2 scatter/gather will work here
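For reference, the usual cross-lane fix looks like this (a sketch, not necessarily what this commit did): one _mm256_permute4x64_epi64 per register reorders the 64-bit chunks after the per-lane shuffle and unpack. It costs an extra port-5 shuffle op per register, which is consistent with "works, but slow":

/* lo currently holds the 64-bit chunks (AE1, BE1, AE2, BE2); we want
   (AE1, AE2, BE1, BE2), i.e. quadword order 0,2,1,3 -> imm8 0xd8. */
lo = _mm256_permute4x64_epi64(lo, _MM_SHUFFLE(3,1,2,0));
hi = _mm256_permute4x64_epi64(hi, _MM_SHUFFLE(3,1,2,0));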
@glenvt18
Owner

@mkaluza Thanks for the benchmarks. A couple of notes. I suggested -fno-strict-aliasing to detect possible aliasing optimization bugs, just in case. Target-specific optimization produces code that can be faster or slower depending on the target and compiler version. The code generated for non-native targets may be a bit faster.

About BS_LOAD_DEINTERLEAVE_8: "one 16 bit lookup instead of two 8 bit" is complete bullshit. I was a bit sleepy, sorry for that;). How it works: we need both the sbox and permutation outputs stored in 2 bs_word registers. We fetch both the sbox and the (precomputed) permutation outputs in one lookup and write a 16-bit word to a memory buffer. Then we have to load two bs_words from the buffer and deinterleave - store odd bytes in one bs_word and even bytes in another. It only makes sense if deinterleaving can be done efficiently (as one or two instructions). Otherwise the permutation is done by a series of shifts and logic operations (see BLOCK_PERMUTE_LOGIC). I wonder if the permutation can be done using the new AVX2 instructions (thanks to @kierank). I'm looking into it...
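In code, the idea reads roughly like this (a hedged sketch with made-up names, not the library's actual implementation):

#include <stdint.h>

/* Hypothetical 512-byte table: sbox output in the low byte, the
   precomputed permutation of that output in the high byte. */
extern const uint16_t sbox_perm_tab[256];

static void sbox_and_perm(const uint8_t *src, uint16_t *buf, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        buf[i] = sbox_perm_tab[src[i]];   /* one lookup, two results */
    /* BS_LOAD_DEINTERLEAVE_8 then loads bs_words from buf and splits
       even bytes (sbox) from odd bytes (permutation): one vld2 on NEON,
       but several shuffles/unpacks on SSE/AVX. */
}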

@mkaluza
Author

mkaluza commented Jan 17, 2018

@glenvt18 I fixed it :) but it's still 3% slower, so I'm just leaving it disabled - maybe it'll be useful for avx512 (@nto, care to try? :) )

I looked at what @kierank suggested - avx2 gather is for 32- and 64-bit values only, but it can be used to load byte values when scale=1. I haven't tested it yet, but looking at the BLOCK_SBOX macro I'd do something along these lines:

  • gather load src
  • AND with 0xff to generate indexes
  • gather load dvbcsa_block_sbox using those indexes
  • tricky part: get the lsb of each dword out of there:
    • either shuffle to get them in two pieces and then unpack or
    • use unpack_lo/hi_epi8, but I don't know how it works yet

that would give 8 bytes in 4 ops (see the sketch below), but I don't know about perf.

I tried enabling auto-vectorization, but it gave no speedup, so it remains to write it and test it :)
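A minimal sketch of the gather idea (the helper name is made up; dvbcsa_block_sbox is the table mentioned above):

#include <stdint.h>
#include <immintrin.h>

extern const uint8_t dvbcsa_block_sbox[256];

/* idx holds 8 x u32 indexes, each already masked to < 256. scale=1
   gives byte addressing; each 32-bit lane then contains the wanted
   table byte plus three garbage bytes, which the AND clears. Note the
   32-bit loads can read up to 3 bytes past entry 255, so a real
   implementation needs the table padded (or the indexes bounded). */
static inline __m256i sbox_gather8(__m256i idx)
{
    __m256i v = _mm256_i32gather_epi32((const int *)dvbcsa_block_sbox,
                                       idx, 1);
    return _mm256_and_si256(v, _mm256_set1_epi32(0xff));
}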

@mkaluza
Author

mkaluza commented Jan 17, 2018

@glenvt18 I've written an avx version of BLOCK_SBOX :) perf went up to 2000mbps ;) @kierank - your intuition was good :)

@nto there are 3 versions of this function which perform very similarly on my cpu. Can you test them on yours? Just change the number here: https://github.com/mkaluza/libdvbcsa/blob/14f6e30b6a1c44faa431b1e77119ad884d4a674e/src/dvbcsa_bs_block.c#L176 and rebuild

And while I can understand that v1 and v2 perform similarly, I don't understand why v3 isn't any faster, as there are fewer operations in it. Any ideas, guys?

@glenvt18 - as for power consumption, I have no reliable way to measure it right now (although at some point we could, as we already did power meter tests - but since those are production machines, we can only do it very late at night). But judging by the temperature from sensors, the cpu didn't break a sweat, so I wouldn't worry about it too much right now.

@mkaluza
Author

mkaluza commented Jan 17, 2018

@glenvt18 ok, so BS_LOAD_DEINTERLEAVE_8 is fixed; it gives a 4% speedup. Now all that's left is to do to BLOCK_SBOX_PERMUTE what I did with BLOCK_SBOX.

And actually, as I look at dvbcsa_bs_block_decrypt_register, I think that BLOCK_SBOX_PERMUTE and BS_LOAD_DEINTERLEAVE_8 could be combined into one, but that's another step, I think.
And - that funny code at the end of dvbcsa_bs_block_decrypt/encrypt_register also seems like a good candidate for SIMD. But let's leave something for tomorrow ;)

@glenvt18
Owner

@mkaluza Cool:) Congratulations!
You don't need deinterleaving with the avx2 sbox. As for the performance, did you look at the generated code? The GNU compiler is usually very good at reordering instructions to reduce dependencies, but sometimes it fails.

@mkaluza
Author

mkaluza commented Jan 17, 2018

I know I don't, but I will when I do the avx version of sbox permute - with that and a working deinterleave we'll be able to ditch BLOCK_PERMUTE_LOGIC, which is 16 ops. And there's that code at the end as well... But so far so good :) Especially considering that I have absolutely zero experience with vector programming :P

I didn't look at the code. In fact I dislike AT&T assembly syntax and I somehow can't get used to it :/ I grew up on Intel syntax, and that huge amount of % signs reminds me only of the percent nightmare of dos/windows batch files :P you know... that bad touch stays with you for life :P But maybe the time will come when I see something there besides %%%%% %%% % % % :)

Good night :)

@kierank

kierank commented Jan 17, 2018

I dislike at&t syntax as well. Tomorrow my colleague @JDarnley will look at your code.

@glenvt18
Owner

glenvt18 commented Jan 18, 2018

@JDarnley

Also there was some talk of AT&T syntax. Is there some inline assembly somewhere in the library?

No. We talked about the generated assembly (dvbcsa_bs_block.s). The quality of the code generated for x86_64 is usually very high (at least for this project), though I wouldn't say that about ARM. Nevertheless, I don't think it's worth bothering with hand-coded assembly.

Your review of the current SIMD implementation of the block cipher (dvbcsa_bs_block.c) will be very much appreciated. Any suggestions are welcome. Perhaps you can find some stalls which can be eliminated.

Some details of the implementation:
Each RN byte on the diagram is represented by a huge (256 bytes for avx2) byteslice register. The register consists of 8 bs_word words (8 * 256 bit for avx2):

r[8 * 0 + g]  for g = [0, 8)  - R0 byte on the diagram  - byte 0 of packets [0, 256)
r[8 * 1 + g]  for g = [0, 8)  - R1 byte on the diagram  - byte 1 of packets [0, 256)
...
r[8 * 7 + g]  for g = [0, 8)  - R7 byte on the diagram  - byte 7 of packets [0, 256)

The key is represented the same way but has all bytes equal to each other (for now).

The shift is implemented by moving the processing window across a larger buffer (r += 8, r -= 8: 256 bytes - one byteslice 'byte').

dvbcsa_block_sbox is a 256-byte lookup table for sbox.
dvbcsa_block_sbox_perm is a 512-byte lookup table for sbox and permutation.

@mkaluza
Author

mkaluza commented Jan 18, 2018

@JDarnley Hi! Thanks for being here ;) Well - AVX has its quirks, probably like everything else, but what fun would it be if everything was just sweet and rosy :P

Yes I did - BS_LOAD_DEINTERLEAVE_8 works since 4943282, and mkaluza@600e7b4 makes it finally run faster than "generic" sbox code.

Yes I do - I have some vision of sbox_permute_deinterleave_avx (mkaluza@dc1b0e0), but I probably got the order of permutations wrong, so it currently produces incorrect data. But it was only a first approach, very late at night, so I expect things will improve.

AT&T assembly was only mentioned in the context of looking at compiler output, and that it's ugly with all those "%" signs everywhere :P No hand-coded stuff was done yet.

I have a question then. Here 14f6e30 I have three versions of the function, which in plain C does this:

dvbcsa_u8_aliasing_t *src = (dvbcsa_u8_aliasing_t *)in_buf;
dvbcsa_u8_aliasing_t *dst = (dvbcsa_u8_aliasing_t *)out_buf;
for (j = 0; j < BS_BATCH_SIZE; j++) dst[j] = dvbcsa_block_sbox[src[j]];

A simple lookup table. And while v1 and v2 are very similar - both use SHL/SHR and AND - the third one uses shuffle: just one op instead of two. I'd expect the last one to be faster, but it's not, even by a bit. Can this be optimized somehow?

Maybe it's as @glenvt18 says - gather latency is high and there's nothing we can do about it, since we have nothing better to do there. But... @glenvt18 - as I'm looking at it... since sbox_avx is also a 0..7 loop, I think that the two loops here https://github.com/mkaluza/libdvbcsa/blob/avx2/src/dvbcsa_bs_block.c#L234 and here https://github.com/mkaluza/libdvbcsa/blob/avx2/src/dvbcsa_bs_block.c#L245 and BLOCK_SBOX could be merged into one, as they all process scratch variables. That could eliminate some loads/stores to scratch and give the cpu something to do between gather calls. I'll try it later. I wonder if gcc could do it by itself...

@glenvt18 Yes, make check works (now; I fixed the v1 and v2 loops this morning) for all three sbox_avx functions. And sbox_permute_deinterleave_avx is in progress :)

@glenvt18
Owner

glenvt18 commented Jan 18, 2018

@mkaluza

A simple lookup table. And while v1 and v2 are very similar - both using SHL/R and AND, while the third one uses shuffle - just one op instead of two. I'd expect the last one to be faster, but it's not, even by a bit. Can this be optimized somehow?

shuffle is slower (at least with ssse3). That's why deinterleaving performs much better on ARM. And there might be a small stall due to the dependencies.
EDIT. According to the Intel docs, if the code is perfectly organized, a Skylake CPU can process 3 logic instructions per cycle, but only one shuffle/unpack.

Does the combined version perform much better than BLOCK_SBOX_PERMUTE (with gather) + BS_LOAD_DEINTERLEAVE_8? (You don't have to pass make check: just run the benchmark to estimate the benefit.)

That could eliminate some loads/stores to scratch and give cpu something to do between each gather call.

You'll very likely get a huge performance drop if you get rid of the scratch buffers. These buffers are essential for reducing dependencies. Also, this will very likely ruin performance on other targets. I'll never sacrifice ARM and SSE performance for the sake of a small AVX2 increase.

@JDarnley

I see you have already encountered some of the "joys" of working with AVX2 instructions.

I'm a bit disappointed too. The csa algorithm makes heavy use of a simple table lookup. Both AVX2 and NEON have instructions for doing that. And, in the context of this algorithm, they both suck. NEON's sucks even more. The good news is that the currently achieved performance is still very good and acceptable for most applications.

@glenvt18
Owner

@mkaluza

That could eliminate some loads/stores to scratch and give cpu something to do between each gather call

Unfortunately, some dependencies are built into the algorithm flow. Otherwise it would be easy to process each step separately. Making it hard to implement efficiently in SW was definitely one of the algorithm's design goals.

@mkaluza
Author

mkaluza commented Jan 18, 2018

@glenvt18 don't worry, I don't intend to break anything or to force anything on anyone - I'm just curious. The combined version is about 5% faster, so it's not much, but it's measurable.
Thanks for the info about latencies - I'll keep that in mind.

As for the authors' intentions, I'm not sure - the algorithm was invented quite a long time ago, and I suspect it wasn't made SW-unfriendly on purpose, but rather was made HW-friendly with no attention to software implementation (which probably wasn't usable anyway at that time). But that's just a guess.

@JDarnley

Okay, I see I misunderstood the discussion around AT&T syntax. FYI: objdump, gdb, and perf can be made to show disassembled code in Intel syntax, usually with -M intel on the command line.
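Typical invocations (paths illustrative):

objdump -d -M intel src/.libs/libdvbcsa.so | less
perf annotate -M intel
(gdb uses a setting instead: set disassembly-flavor intel)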

@glenvt18 I will look (I am looking) over the block file, and thanks for the link - even though I'm not sure it helps much, perhaps I can marry it with your explanation.

You are quite right about the shuffle instruction. It and some others, like unpack, can cause a bottleneck due to the specific nature of the processor. This can cause a shuffle to be slower than some other methods, but usually it is "good enough".

@mkaluza That brings me to those three alternative functions you highlighted: block_sbox_avx{1,2,3}. They are all extremely linear, as in one line must wait for the previous to finish. The compiler might reorder things to use all available registers and allow some instructions to be executed simultaneously, but in the shuffle case: the por must wait for the second pshufb, which must wait for the gather, which must wait for the first pshufb. This function would basically be waiting on the p5 port while the shuffle gets executed. Even the gather has a micro-op on p5 (I did have to look that up). This means I am not surprised that this variant is slower, and I doubt there is any way to improve it.

I will post some very useful references that you may not be aware of. To start with, there is Agner Fog's instruction tables document [1], which has very detailed timing information for instructions on just about every microarch from Intel and AMD. Next are Intel's own Software Development Manuals [2], of which I usually use the Instruction Set Reference [3][4]. Finally, with the coming of AVX-512, Intel finally published an extremely verbose timing reference of their own for the new processors (Skylake-X, the new Xeons, etc.), which can be found at [5].

[1] http://agner.org/optimize/
[2] https://software.intel.com/en-us/articles/intel-sdm
[3] https://software.intel.com/sites/default/files/managed/a4/60/325383-sdm-vol-2abcd.pdf
[4] https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf
[5] https://software.intel.com/sites/default/files/managed/ad/dc/Intel-Xeon-Scalable-Processor-throughput-latency.pdf

P.S. I will have to pull the latest commits I've seen popping up while writing this.

@mkaluza
Author

mkaluza commented Jan 18, 2018

@JDarnley thank you for this analysis - that's a level of knowledge I haven't reached yet. The references will probably make a good, or at least long, read :) I'll try to make use of this knowledge.

As for those commits, I fixed sbox_permute_deinterleave_avx. It now passes testbitslice and, after another change, is 9% faster than separate sbox + permute. Of course it breaks testbsops, but that's because with it the BS_LOAD_DEINTERLEAVE_8 macro is a no-op.

@mkaluza
Author

mkaluza commented Jan 18, 2018

@glenvt18 I did merge that inner loop, effectively getting rid of scratch, and not only was there no performance hit, but I got a +2.5% speedup (measured with both ciphers enabled). Of course it's not worth messing the code up, but it's good to know. I can post that code somewhere if you want.

Updating benchmark table from above:

cipher speed [mbps]
block sse2 1780
block ssse3 1950
block plain avx2 2000
block deinterleave avx2 2100
block sbox avx2 2300
block sbox_permute_deinterleave_avx 2650
block sbox_permute_deinterleave_avx + merged loop 2740
stream sse2/ssse3 5560
stream avx2 9670

@glenvt18
Owner

@mkaluza Could you give figures for the whole cipher (block+stream)?

@mkaluza
Author

mkaluza commented Jan 18, 2018

@glenvt18 here you are

version encrypt decrypt
sse2 1338 1361
ssse3 1470 1460
plain avx2 (603a5c9) 1700 1650
deinterleave avx2 (20ba124) 1770 1655
sbox avx2 (func2) (14f6e30) 1975 1985
sbox_permute_deinterleave_avx (9ffa1e0) 2147 2160
sbox_permute_deinterleave_avx + merged loop 2217 2201

@glenvt18
Owner

@mkaluza Thanks. Well, finally, it's 1.5x faster than ssse3. Quite impressive. I would stay with sbox_permute_deinterleave_avx (current HEAD). As for the merged loop, I guess you escaped the penalty because you're not filling byte buffers anymore and are using SIMD registers instead (less dependency).

The code requires some cleanup before merging. As an option, I could do it myself and commit on your behalf.

What about AVX-512? Should it be supported?

@nto

nto commented Jan 18, 2018

Here is a new benchmark (best of 10 runs):

Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz

Mode Decrypt Encrypt
uint32 695.2 695.8
uint64 1113.5 1137.4
mmx 813.2 810.0
sse2 1677.1 1710.0
ssse3 1918.2 1928.0
avx2-1 2715.8 2860.5
avx2-2 2719.8 2868.6
avx2-3 2722.0 2882.8

Looks like block_sbox_avx3 is the fastest on this cpu.

@mkaluza
Author

mkaluza commented Jan 19, 2018

I know - this PR was meant for intrinsics only and after I started playing with SIMD it suddenly became my devel branch :P I'll clean it up.

Besides, I'm not finished :) Apart from a few small ideas about sbox_permute_deinterleave_avx, I'd also like to take a look at the transpose functions, because there are places with loops still loading uint32 or uint64, so maybe AVX could help there a bit.

And I have one a bit crazy idea to test - or maybe not so crazy. Why not do the lookup by two bytes? This would make the array have 65k 2-byte elements, so it's still manageable, and it should cut the number of load operations at least in half, which might make up for the increased memory latency (the array wouldn't fit the cache anymore). Anyway, I'll try and report back :)
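A sketch of that idea (hypothetical names; the perf numbers later in the thread show why it loses):

#include <stdint.h>

/* 65536 x 2-byte entries = 128 KiB, far beyond the 32 KiB L1d, so the
   halved load count gets paid back in cache misses. */
extern const uint16_t sbox2_tab[65536];

static inline uint16_t sbox2(uint8_t a, uint8_t b)
{
    return sbox2_tab[((unsigned)a << 8) | b];   /* two lookups, one load */
}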

As for AVX-512, I'm not sure, but I'd say 'no' for now. Simply enabling avx-512 intrinsics would probably do more harm than good, because only the stream part would benefit, and it already takes less than 20% of the time, so you'd get maybe 10%; and because using avx-512 makes the cpu reduce its clock (by more than 10%), it would affect the performance of the block part. So unless someone ports sbox_permute_deinterleave_avx to avx-512 and tests it, I wouldn't add it now. Besides, why should we have all the fun ;) Let someone else get some too :)

What puzzles me, however, is that the block part is so immune to word width extension. I can't see any dependency on BS_BATCH_SIZE or BS_BATCH_BYTES anywhere that would suggest it would run longer for wider words, and yet it seems to do so...

@nto thanks for those benchmarks. One thing puzzles me - are you sure you're really running at 3GHz? That is - without turbo? Because your cpu is Skylake, as well as mine, and the clock is only 11% higher, while the results are 40%+ higher, which seems unlikely. What kind of memory setup do you have there? (speed/CAS latency/number of channels populated)

@glenvt18
Owner

glenvt18 commented Jan 19, 2018

@mkaluza

Besides, I'm not finished :) Apart from a few small ideas about sbox_permute_deinterleave_avx, I'd also like to take a look at the transpose functions, because there are places with loops still loading uint32 or uint64, so maybe AVX could help there a bit.

The algorithm works with 64-bit blocks. Transpose matrices are Nx64 or 2x32. So, don't bother. SIMD can be used while implementing the matrix transpose operations (BS_SWAPXX_LE). NEON has a dedicated instruction for that (see dvbcsa_bs_neon.h). If AVX2 doesn't have such an efficient solution, don't bother either, because the default macros are fast enough. This part (loading and transpose) has been heavily optimized (see git log).

Why not do lookup by two bytes?

Well. You can try. But the lookup table is accessed randomly and you have about 40k of working buffers. Cache misses are very expensive. BTW, in a real application you may encounter performance drops due to cache misses.
EDIT. Just try it and measure the performance. You don't have to make it pass the tests.

What puzzles me, however, is that the block part is so immune to word width extension

See #4 (comment)

Besides, I'd very much like to merge the changes and close this PR. Any other improvements can be done later. The same goes for AVX-512.

@mkaluza mkaluza closed this Jan 19, 2018
@mkaluza
Author

mkaluza commented Jan 19, 2018

You were right about cache misses. Although the number of instructions decreased by 25%, L1 misses increased from 9.7% to 99%+, performance dropped to ~1600mbps, and everything landed in LLC.

Byte-addressed:

 Performance counter stats for './test/benchbitslice':

        589.204895      task-clock:u (msec)       #    0.962 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
                75      page-faults:u             #    0.127 K/sec
        1562092036      cycles:u                  #    2.651 GHz                      (61.95%)
        2241142468      instructions:u            #    1.43  insn per cycle           (69.99%)
          77740487      branches:u                #  131.941 M/sec                    (70.02%)
            314219      branch-misses:u           #    0.40% of all branches          (70.05%)
         683645086      L1-dcache-loads:u         # 1160.284 M/sec                    (69.12%)
          66344234      L1-dcache-load-misses:u   #    9.70% of all L1-dcache hits    (69.16%)
            623164      LLC-loads:u               #    1.058 M/sec                    (68.63%)
              1590      LLC-load-misses:u         #    0.51% of all LL-cache hits     (68.69%)

Word addressed:

 Performance counter stats for './test/benchbitslice':

        759.579766      task-clock:u (msec)       #    0.995 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
                80      page-faults:u             #    0.105 K/sec
        2029772935      cycles:u                  #    2.672 GHz                      (60.69%)
        1799763725      instructions:u            #    0.89  insn per cycle           (68.55%)
         104716633      branches:u                #  137.861 M/sec                    (68.55%)
            291284      branch-misses:u           #    0.28% of all branches          (68.90%)
         587536729      L1-dcache-loads:u         #  773.502 M/sec                    (69.85%)
         584942106      L1-dcache-load-misses:u   #   99.56% of all L1-dcache hits    (69.97%)
         164812983      LLC-loads:u               #  216.979 M/sec                    (70.09%)
              9376      LLC-load-misses:u         #    0.01% of all LL-cache hits     (70.13%)

But it was worth a try :)

@nto

nto commented Jan 19, 2018

One thing puzzles me - are you sure you're really running at 3GHz? That is - without turbo? Because your cpu is Skylake, as well as mine, and the clock is only 11% higher, while the results are 40%+ higher, which seems unlikely.

You are probably right, but it's hard to tell because the intel pstate driver is not running on this server, and /proc/cpuinfo reports 3000.000 MHz for all CPUs.

I ran another benchmark on an older CPU, with frequency scaling and turbo boost disabled:

Intel(R) Xeon(R) CPU E3-1230 v5 @ 3.40GHz

Mode Decrypt Encrypt
uint32 637.7 638.0
uint64 1023.7 1042.7
mmx 746.0 744.0
sse2 1526.0 1567.3
ssse3 1757.6 1764.1
avx2-1 2479.4 2630.4
avx2-2 2481.8 2627.6
avx2-3 2488.9 2633.4

What kind of memory setup do you have there? (speed/CAS latency/number of channels populated)

$ lshw -short -C memory
H/W path            Device           Class          Description
===============================================================
/0/1                                 memory         64KiB BIOS
/0/400/700                           memory         1152KiB L1 cache
/0/400/701                           memory         18MiB L2 cache
/0/400/702                           memory         24MiB L3 cache
/0/401/703                           memory         1152KiB L1 cache
/0/401/704                           memory         18MiB L2 cache
/0/401/705                           memory         24MiB L3 cache
/0/1000                              memory         32GiB System Memory
/0/1000/0                            memory         16GiB DIMM DDR4 Synchronous Registered (Buffered) 2666 MHz (0.4 ns)
...
/0/1000/c                            memory         16GiB DIMM DDR4 Synchronous Registered (Buffered) 2666 MHz (0.4 ns)
...
/0/100/1f.2                          memory         Memory controller

@JDarnley

@mkaluza I feel I should point out that the CPU @nto has been testing with is not Intel's old 6th generation Skylake like your i5-6400 but rather a Skylake-SP, the newest iteration that is in the Skylake-X desktop processors and the Xeon Platinum/Gold/Silver/Bronze server processors. I am not familiar with the changes between them particularly on the turbo and power/heat throttling they apply. The E3-1230 v5 is an original Skylake though.

@mkaluza I also want to commend you on commit 9ffa1e0. Nice work.

@mkaluza
Author

mkaluza commented Jan 20, 2018

Thank you @JDarnley :) Also following your explanation I switched from shuffle to AND + SHIFT and got 2% more ;)
I know the CPU is a bit newer, but it's still more or less the same uarch, and it would be hard to explain a 30-40% single-thread speedup :)

@glenvt18 by the way, I found this http://software.intel.com/en-us/articles/intel-software-development-emulator/, which allowed me to port (and test) the avx512 version (with sbox_permute): mkaluza@24d4781. @nto can you feed it to your Gold behemoth and give us some numbers? :) That could more or less close the issue of AVX512 for now.

@nto

nto commented Jan 23, 2018

Skylake X cpus won't have the intel_pstate driver until Linux 4.16 (see this pull request), so I had to disable turbo boost in the BIOS, and the numbers are indeed lower.

Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz

Mode Decrypt Encrypt
uint32 564.0 564.0
uint64 904.9 922.2
mmx 659.8 658.0
sse2 1361.4 1388.0
ssse3 1555.2 1561.0
avx2-1 2407.9 2467.6
avx2-2 2413.4 2473.0
avx2-3 2413.5 2473.2
avx512 2903.1 2896.5

@mkaluza : nice work! the AVX-512 version is actually faster than AVX2 with this version (and make check is OK).

@mkaluza
Author

mkaluza commented Jan 23, 2018

@nto thank you ;) awesome ;) benchmarks made without turbo are more repeatable and comparable (my 2.7GHz Skylake does 2200 on AVX2, so it's in line), and I can include them in a commit message.

I expected around 20% and it's exactly 20% :) Always something, but considering the fact that the word width is doubled, it looks a bit disappointing in comparison to ssse3->avx2. But the block cipher part doesn't scale well, and if you account for the lower clocks with AVX-512, then 20% is not bad.

Now we can debate whether AVX-512 should be supported or not, but I have mixed feelings about it - because of the clock penalty, using AVX-512 just a bit doesn't make sense, but on the other hand I'm for freedom of choice... there should probably be some warning in the README about this, and that's it.
Another thing is that it'll probably never be used, considering the price of AVX-512-capable cpus and the insane speed you could achieve with it, but it's a geeky thing to have :)

(EDIT: for comparison, the classic implementation scaled to 3GHz does 68/75 dec/enc :)

Think about it - your cpu, even without HT, should do around 50gbit/s!! and (at least for sse3) HT does improve things considerably - that's an insane amount of traffic :) Cable has around 4.7gbit/s, so less than 10% of that :) 50gbps is over 800 satellite transponders of 62mbps each!!! :) HotBird has only around 100 ;) so your beast could probably encrypt most (if not all) of the satellite traffic on Earth (I think all, considering that some channels are FTA and not all transponders are 62mbps). That's insane :)

Although it's a bit surprising that make check worked on avx-512, because it shouldn't - testbsops should fail, testbitslice should work - but whatever :)

@glenvt18
Owner

moved to #5 and merged 967ad8e

@nto

nto commented Jan 23, 2018

Although it's a bit surprising that make check worked on avx-512, because it shouldn't - testbsops should fail, testbitslice should work - but whatever :)

I patched testbsops to disable the BS_LOAD_DEINTERLEAVE_8 test of course ;)

@JDarnley

I should point out that in this internal test AVX-512 might be quicker just because it spends enough of its time in ZMM registers (I mean code and instructions that operate on them). The gain from using more, wider registers can outweigh the clock speed reduction. The benchmark doesn't need to switch to other code, unlike some small(ish) DSP functions like those in FFmpeg and Upipe, where the clock speed hit will impact the rest of the program's performance.

If someone does want to add AVX-512 then I do suggest noting the possible performance issues in a document somewhere, like the README.

"In the Current Year of 2018 using ZMM registers of the AVX-512 feature set may not be faster for your application. It is advised that you test the performance of all your software before committing to using AVX-512."

Anyway congratulations to all in this thread and in PR no. 5, especially you @mkaluza.

@glenvt18
Owner

Those benchmark figures might be deceptive. Unfortunately, you are unlikely to get such results in a real-world application, due to many factors. This is true for all --enable-xxxx options on all supported targets. Usually, if the benchmark runs faster with a particular option, applications run faster too. But there might be exceptions.

If AVX-512 works and performs better (at least with benchbitslice), I don't mind merging it. I think users of this library are smart enough to test the performance of their applications.

@mkaluza
Author

mkaluza commented Jan 24, 2018

@glenvt18 I know, but it was just cool to see :) Besides I doubt anyone needs so much bandwidth here.

@JDarnley I like such things, although I lack in-depth knowledge of how the CPU works, and after looking at the number of volumes you posted links to, I'm afraid it'll stay this way unless I go full-time into HPC so that I have nothing else to do except reading and coding :P

One thing bothers me though: in the stream cipher results posted by @nto here, with each doubling of register width the speed almost doubles as well - up until AVX-512, where there is only about a 20% gain. Any ideas about it? That's too much of a drop to blame it all on lowered clock speed. The only thing that comes to my mind is that we don't fit some cache anymore, as 256*188 = 47k of data, and it is 94k for AVX-512.

@nto, could you run perf stat -d ./test/benchbitslice for avx2 (whichever func) and avx512 and post it here? Because if we're losing L1 cache anyway, we might as well try that bigger lookup table I tried before.

@nto

nto commented Jan 26, 2018

$ perf stat -d test/benchbitslice-avx2-2 
* DVBCSA bench *
 - Generating batch with 256 randomly sized packets

 - decrypting 4096 TS packets
 - decrypting 8192 TS packets
 - decrypting 16384 TS packets
 - decrypting 32768 TS packets
 - decrypting 65536 TS packets
 - decrypting 131072 TS packets
 - decrypting 262144 TS packets
 - 520192 packets proceded, 2407.5 Mbits/s

 - encrypting 4096 TS packets
 - encrypting 8192 TS packets
 - encrypting 16384 TS packets
 - encrypting 32768 TS packets
 - encrypting 65536 TS packets
 - encrypting 131072 TS packets
 - encrypting 262144 TS packets
 - 520192 packets proceded, 2463.8 Mbits/s
* Done *

 Performance counter stats for 'test/benchbitslice-avx2-2':

        628.671765      task-clock (msec)         #    0.998 CPUs utilized          
                 1      context-switches          #    0.002 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                80      page-faults               #    0.127 K/sec                  
     1,884,306,796      cycles                    #    2.997 GHz                      (49.14%)
     3,637,874,864      instructions              #    1.93  insn per cycle           (61.86%)
        65,491,937      branches                  #  104.175 M/sec                    (62.26%)
           276,598      branch-misses             #    0.42% of all branches          (63.31%)
     1,028,144,912      L1-dcache-loads           # 1635.424 M/sec                    (63.54%)
        69,768,782      L1-dcache-load-misses     #    6.79% of all L1-dcache hits    (63.55%)
             1,098      LLC-loads                 #    0.002 M/sec                    (50.20%)
               115      LLC-load-misses           #   10.47% of all LL-cache hits     (49.56%)

       0.629975843 seconds time elapsed
$ perf stat -d test/benchbitslice-avx512
* DVBCSA bench *
 - Generating batch with 512 randomly sized packets

 - decrypting 4096 TS packets
 - decrypting 8192 TS packets
 - decrypting 16384 TS packets
 - decrypting 32768 TS packets
 - decrypting 65536 TS packets
 - decrypting 131072 TS packets
 - decrypting 262144 TS packets
 - 520192 packets proceded, 2897.3 Mbits/s

 - encrypting 4096 TS packets
 - encrypting 8192 TS packets
 - encrypting 16384 TS packets
 - encrypting 32768 TS packets
 - encrypting 65536 TS packets
 - encrypting 131072 TS packets
 - encrypting 262144 TS packets
 - 520192 packets proceded, 2894.9 Mbits/s
* Done *

 Performance counter stats for 'test/benchbitslice-avx512':

        528.706801      task-clock (msec)         #    0.998 CPUs utilized          
                 4      context-switches          #    0.008 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               108      page-faults               #    0.204 K/sec                  
     1,573,578,473      cycles                    #    2.976 GHz                      (49.00%)
     1,936,741,776      instructions              #    1.23  insn per cycle           (62.89%)
        57,831,226      branches                  #  109.382 M/sec                    (63.18%)
           424,316      branch-misses             #    0.73% of all branches          (63.46%)
       556,561,314      L1-dcache-loads           # 1052.684 M/sec                    (63.71%)
       134,228,967      L1-dcache-load-misses     #   24.12% of all L1-dcache hits    (63.44%)
             2,320      LLC-loads                 #    0.004 M/sec                    (49.83%)
               473      LLC-load-misses           #   20.39% of all LL-cache hits     (49.07%)

       0.529718574 seconds time elapsed

@mkaluza
Author

mkaluza commented Jan 26, 2018

Thank you :) Indeed there's an increase in L1 misses that could explain it, but that's just a guess, as I know too little about this stuff. I had a look at Intel's optimization manuals and well... right now they might as well be in Chinese (and I'm sure they weren't :) and it wouldn't make much difference :P

But it's fun, so I'll probably bother you again sometime in the future to run some benchmarks if I come up with something; for now I'll prepare a PR for AVX512 with what I have.

Besides, I can see that Intel Cannon Lake might support AVX512 in consumer CPUs, so maybe I'll have a CPU of my own at some point.

Thank you once again for help :)
