- Fastest transpose/shuffle
  - (2019.11) All TurboTranspose functions are now available on 64-bit ARMv8, including NEON SIMD.
  - Byte/Nibble transpose/shuffle for improving compression of binary data (e.g. floating-point data)
  - Scalar/SIMD Transpose/Shuffle 8,16,32,64,... bits
  - Dynamic CPU detection and JIT scalar/sse/avx2 switching
  - 100% C (C++ headers), usage as simple as memcpy
- Byte Transpose
  - Fastest byte transpose
  - (2019.11) 2D, 3D, 4D transpose
- Nibble Transpose
  - nearly as fast as byte transpose
  - more efficient, up to 10 times faster than Bitshuffle
  - better compression (w/ lz77) and 10 times faster than SPDP, one of the best floating-point compressors
  - can compress/decompress (w/ lz77) better and faster than other domain-specific floating-point compressors
- Scalar and SIMD Transform
  - Delta encoding for sorted lists
  - Zigzag encoding for unsorted lists
  - Xor encoding
  - lossy floating-point compression with user-defined error
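The delta, zigzag and xor transforms named in the list above are simple enough to sketch in scalar C. This is a minimal illustration for 32-bit values, not the library's SIMD code; every function name below is made up for this example and is not part of the TurboTranspose API.

```c
#include <stddef.h>
#include <stdint.h>

/* Zigzag maps signed to unsigned so that small magnitudes get small codes:
   0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ... */
uint32_t zigzag_enc32(int32_t v) {
    uint32_t u = (uint32_t)v;
    return (u << 1) ^ (0u - (u >> 31));      /* all-ones mask if v < 0 */
}

int32_t zigzag_dec32(uint32_t u) {
    return (int32_t)((u >> 1) ^ (0u - (u & 1u)));
}

/* Delta: store the difference to the previous value (first value vs. 0).
   Unsigned arithmetic wraps modulo 2^32, so decoding is always exact. */
void delta_enc32(const uint32_t *in, size_t n, uint32_t *out) {
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) { out[i] = in[i] - prev; prev = in[i]; }
}

void delta_dec32(const uint32_t *in, size_t n, uint32_t *out) {
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) { prev += in[i]; out[i] = prev; }
}

/* Zigzag of the delta: keeps negative differences of unsorted input small. */
void zigzag_delta_enc32(const uint32_t *in, size_t n, uint32_t *out) {
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = zigzag_enc32((int32_t)(in[i] - prev));
        prev = in[i];
    }
}

void zigzag_delta_dec32(const uint32_t *in, size_t n, uint32_t *out) {
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        prev += (uint32_t)zigzag_dec32(in[i]);
        out[i] = prev;
    }
}

/* Xor with the previous value: bits shared by neighbours become zero. */
void xor_enc32(const uint32_t *in, size_t n, uint32_t *out) {
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) { out[i] = in[i] ^ prev; prev = in[i]; }
}

void xor_dec32(const uint32_t *in, size_t n, uint32_t *out) {
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) { prev ^= in[i]; out[i] = prev; }
}
```

For the sorted input 3, 7, 7, 20, 1000 the delta output is 3, 4, 0, 13, 980. The transformed values have mostly zero high bytes, which is what a following byte or nibble transpose can exploit.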
- Benchmark Intel CPU: Skylake i7-6700 3.4GHz gcc 9.2 single thread
- Benchmark ARM: ARMv8 A73-ODROID-N2 1.8GHz
BOLD = pareto frontier.
E:Encode, D:Decode
./tpbench -s# file -B16K (# = 8,4,2)
E cycles/byte | D cycles/byte | Transpose 64 bits AVX2 |
---|---|---|
.199 | .134 | TurboTranspose Byte |
.326 | .201 | Blosc byteshuffle |
.394 | .260 | TurboTranspose Nibble |
.848 | .478 | Bitshuffle 8 |
E cycles/byte | D cycles/byte | Transpose 32 bits AVX2 |
---|---|---|
.121 | .102 | TurboTranspose Byte |
.451 | .139 | Blosc byteshuffle |
.345 | .229 | TurboTranspose Nibble |
.773 | .476 | Bitshuffle |
E cycles/byte | D cycles/byte | Transpose 16 bits AVX2 |
---|---|---|
.095 | .071 | TurboTranspose Byte |
.640 | .108 | Blosc byteshuffle |
.329 | .198 | TurboTranspose Nibble |
.758 | 1.177 | Bitshuffle 2 |
.067 | .067 | memcpy |
E MB/s | D MB/s | 16 bits ARM 2019.11 |
---|---|---|
8192 | 16384 | TurboTranspose Byte |
8192 | 8192 | blosc byteshuffle |
1638 | 2341 | TurboTranspose Nibble |
356 | 287 | blosc bitshuffle |
16384 | 16384 | memcpy |
E MB/s | D MB/s | 32 bits ARM 2019.11 |
---|---|---|
8192 | 8192 | TurboTranspose Byte |
8192 | 8192 | blosc byteshuffle |
1820 | 2341 | TurboTranspose Nibble |
372 | 252 | blosc bitshuffle |
E MB/s | D MB/s | 64 bits ARM 2019.11 |
---|---|---|
4096 | 8192 | TurboTranspose Byte |
5461 | 5461 | blosc byteshuffle |
1490 | 1490 | TurboTranspose Nibble |
372 | 260 | blosc bitshuffle |
MB/s: 1,000,000 bytes/second
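The tables use two units, cycles/byte and MB/s. As a rough aid for reading them side by side, one can be converted to the other by dividing the clock rate by the cycles/byte figure. The sketch below assumes the core runs at its nominal clock (for example 3.4 GHz for the i7-6700 listed above), which ignores turbo and frequency scaling; the function names are illustrative only.

```c
#include <assert.h>

/* 1 MB/s = 1,000,000 bytes/second, as defined above. */
double cycles_per_byte_to_mbps(double cycles_per_byte, double clock_hz) {
    assert(cycles_per_byte > 0.0);
    return clock_hz / cycles_per_byte / 1e6;
}

double mbps_to_cycles_per_byte(double mbps, double clock_hz) {
    assert(mbps > 0.0);
    return clock_hz / (mbps * 1e6);
}
```

Under that assumption, 0.199 cycles/byte at 3.4 GHz is about 17,085 MB/s. The -B16K runs and the whole-file runs are separate measurements, so converted figures are not expected to match across tables.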
./tpbench -s# file (# = 8,4,2)
E MB/s | D MB/s | Transpose 16 bits AVX2 2019.11 |
---|---|---|
9208 | 9795 | TurboTranspose Byte |
8382 | 7689 | Blosc byteshuffle |
9377 | 9584 | TurboTranspose Nibble |
2750 | 2530 | Blosc bitshuffle |
13725 | 13900 | memcpy |
E MB/s | D MB/s | Transpose 32 bits AVX2 2019.11 |
---|---|---|
9718 | 9713 | TurboTranspose Byte |
9181 | 9030 | Blosc byteshuffle |
8750 | 9472 | TurboTranspose Nibble |
2767 | 2942 | Blosc bitshuffle 4 |
E MB/s | D MB/s | Transpose 64 bits AVX2 2019.11 |
---|---|---|
8998 | 9573 | TurboTranspose Byte |
8721 | 8586 | Blosc byteshuffle 2 |
8252 | 9222 | TurboTranspose Nibble |
2711 | 2053 | Blosc bitshuffle 2 |
E MB/s | D MB/s | 16 bits ARM 2019.11 |
---|---|---|
872 | 3998 | TurboTranspose Byte |
678 | 3852 | blosc byteshuffle |
1365 | 2195 | TurboTranspose Nibble |
357 | 280 | blosc bitshuffle |
3921 | 3913 | memcpy |
E MB/s | D MB/s | 32 bits ARM 2019.11 |
---|---|---|
1828 | 3768 | TurboTranspose Byte |
1769 | 3713 | blosc byteshuffle |
1456 | 2299 | TurboTranspose Nibble |
374 | 243 | blosc bitshuffle |
E MB/s | D MB/s | 64 bits ARM 2019.11 |
---|---|---|
1793 | 3572 | TurboTranspose Byte |
1784 | 3544 | blosc byteshuffle |
1176 | 1267 | TurboTranspose Nibble |
331 | 203 | blosc bitshuffle |
Download IcApp, a new benchmark for TurboPFor+TurboTranspose,
for testing almost all integer and floating-point file types.
Note: the lossy compression benchmark is available with icapp only.
C size | ratio % | C MB/s | D MB/s | Name AVX2 |
---|---|---|---|---|
11,348,554 | 18.1 | 2276 | 4425 | TurboTranspose Nibble+lz |
22,489,691 | 35.8 | 1670 | 3881 | TurboTranspose Byte+lz |
43,471,376 | 69.2 | 348 | 402 | SPDP |
44,626,407 | 71.0 | 1065 | 2101 | bitshuffle+lz |
62,865,612 | 100.0 | 13300 | 13300 | memcpy |
./tpbench -s4 -z *.sp
File | File size | lz % | Tp8lz | Tp4lz | BSlz | spdp1 | spdp9 | Tp4lzt | eTp4lzt |
---|---|---|---|---|---|---|---|---|---|
msg_bt | 133194716 | 94.3 | 70.4 | 66.4 | 73.9 | 70.0 | 67.4 | 54.7 | 32.4 |
msg_lu | 97059484 | 100.4 | 77.1 | 70.4 | 75.4 | 76.8 | 74.0 | 61.0 | 42.2 |
msg_sppm | 139497932 | 11.7 | 11.6 | 12.6 | 15.4 | 14.4 | 13.7 | 9.0 | 5.6 |
msg_sp | 145052928 | 100.3 | 68.8 | 63.7 | 68.1 | 67.9 | 65.3 | 52.6 | 24.9 |
msg_sweep3d | 62865612 | 98.7 | 35.8 | 18.1 | 71.0 | 69.6 | 13.7 | 9.8 | 3.8 |
num_brain | 70920000 | 100.4 | 76.5 | 71.1 | 77.4 | 79.1 | 73.9 | 63.4 | 32.6 |
num_comet | 53673984 | 92.4 | 79.0 | 77.6 | 82.1 | 84.5 | 84.6 | 70.1 | 41.7 |
num_control | 79752372 | 99.4 | 89.5 | 90.7 | 88.1 | 98.3 | 98.5 | 81.4 | 51.2 |
num_plasma | 17544800 | 100.4 | 0.7 | 0.7 | 75.5 | 30.7 | 2.9 | 0.3 | 0.2 |
obs_error | 31080408 | 89.2 | 73.1 | 70.0 | 76.9 | 78.3 | 49.4 | 20.5 | 12.2 |
obs_info | 9465264 | 93.6 | 70.2 | 61.9 | 72.9 | 62.4 | 43.8 | 27.3 | 15.1 |
obs_spitzer | 99090432 | 98.3 | 90.4 | 95.6 | 93.6 | 100.1 | 100.7 | 80.2 | 52.3 |
obs_temp | 19967136 | 100.4 | 89.5 | 92.4 | 91.0 | 99.4 | 100.1 | 84.0 | 55.8 |
Tp8=Byte transpose, Tp4=Nibble transpose, lz = lz4
eTp4lzt = lossy compression with lzturbo and allowed error = 0.0001 (1e-4)
Slow but best compression: SPDP9 and lzt = lzturbo,39
- Scientific IEEE 754 64-Bit Double-Precision Floating-Point Datasets
./tpbench -s8 -z *.trace
File | File size | lz % | Tp8lz | Tp4lz | BSlz | spdp1 | spdp9 | Tp4lzt | eTp4lzt |
---|---|---|---|---|---|---|---|---|---|
msg_bt | 266389432 | 94.5 | 77.2 | 76.5 | 81.6 | 77.9 | 75.4 | 69.9 | 16.0 |
msg_lu | 194118968 | 100.4 | 82.7 | 81.0 | 83.7 | 83.3 | 79.6 | 75.5 | 21.0 |
msg_sppm | 278995864 | 18.9 | 14.5 | 14.9 | 19.5 | 21.5 | 19.8 | 11.2 | 2.8 |
msg_sp | 290105856 | 100.4 | 79.2 | 77.5 | 80.2 | 78.8 | 77.1 | 71.3 | 12.4 |
msg_sweep3d | 125731224 | 98.7 | 50.7 | 36.7 | 80.4 | 76.2 | 33.2 | 27.3 | 1.9 |
num_brain | 141840000 | 100.4 | 82.6 | 81.1 | 84.5 | 87.8 | 83.3 | 77.0 | 16.3 |
num_comet | 107347968 | 92.8 | 83.3 | 78.8 | 76.3 | 86.5 | 86.0 | 69.8 | 21.2 |
num_control | 159504744 | 99.6 | 92.2 | 90.9 | 89.4 | 97.6 | 98.9 | 85.5 | 25.8 |
num_plasma | 35089600 | 75.2 | 0.7 | 0.7 | 84.5 | 77.3 | 3.0 | 0.3 | 0.1 |
obs_error | 62160816 | 78.7 | 81.0 | 77.5 | 84.4 | 87.9 | 62.3 | 23.4 | 6.3 |
obs_info | 18930528 | 92.3 | 75.4 | 70.6 | 82.4 | 81.7 | 51.2 | 33.1 | 7.7 |
obs_spitzer | 198180864 | 95.4 | 93.2 | 93.7 | 86.4 | 100.1 | 102.4 | 78.0 | 26.9 |
obs_temp | 39934272 | 100.4 | 93.1 | 93.8 | 91.7 | 98.0 | 97.4 | 88.2 | 28.8 |
eTp4lzt = lossy compression with lzturbo and allowed error = 0.0001 (1e-4)
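This text does not describe how the allowed error is enforced. One common way to bound the error of floating-point data before a transpose is to clear low mantissa bits, which leaves runs of zero bytes and nibbles for lz77 to pick up. The sketch below assumes the bound is a relative error on normal, finite doubles; both function names are hypothetical, and this is not necessarily the method behind the eTp4lzt column.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Largest number k of low mantissa bits (out of 52) that can be cleared
   while keeping the relative error of a normal double at or below
   max_rel_err: clearing k bits changes the value by less than 2^(k-52). */
unsigned droppable_bits64(double max_rel_err) {
    assert(max_rel_err > 0.0);
    unsigned k = 52;
    double bound = 1.0;                       /* equals 2^(k-52) */
    while (k > 0 && bound > max_rel_err) {
        bound *= 0.5;
        k--;
    }
    return k;
}

/* Clear the lowest `bits` mantissa bits (truncates toward zero).
   Assumption: x is a normal, finite value; NaN, infinity and denormals
   would need separate handling and are not covered by this sketch. */
double truncate_mantissa64(double x, unsigned bits) {
    assert(bits <= 52);
    uint64_t u;
    memcpy(&u, &x, sizeof u);                 /* well-defined type pun */
    u &= ~((UINT64_C(1) << bits) - 1);
    memcpy(&x, &u, sizeof x);
    return x;
}
```

With an allowed relative error of 0.0001 this clears 38 of the 52 mantissa bits, so the four lowest bytes of every double become zero and most of a fifth.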
git clone https://github.com/powturbo/TurboTranspose.git
cd TurboTranspose
make
or
make AVX2=1
nmake /f makefile.vs
or
nmake AVX2=1 /f makefile.vs
- benchmark with other libraries
download or clone bitshuffle or blosc and type
make AVX2=1 BLOSC=1
or
make AVX2=1 BITSHUFFLE=1
- benchmark "transpose" functions
./tpbench [-s#] [-z] file
s# = element size #=2,4,8,16,... (default 4)
-z = only lz77 compression benchmark (bitshuffle package mandatory)
Byte transpose:
void tpenc( unsigned char *in, unsigned n, unsigned char *out, unsigned esize);
void tpdec( unsigned char *in, unsigned n, unsigned char *out, unsigned esize);
in : input buffer
n : number of bytes
out : output buffer
esize : element size in bytes (2,4,8,...)
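To make the data layout concrete, here is a scalar reference sketch with the same argument order as tpenc/tpdec. It is not the library's implementation: the ref_ names are made up, and it assumes n is a multiple of esize (how the library treats a trailing partial element is not stated here).

```c
#include <assert.h>

/* Byte transpose: gather byte 0 of every element, then byte 1, and so on.
   in/out must not overlap; n is the buffer length in bytes. */
void ref_tpenc(const unsigned char *in, unsigned n, unsigned char *out, unsigned esize) {
    assert(esize > 0 && n % esize == 0);
    unsigned count = n / esize;               /* number of elements */
    for (unsigned i = 0; i < count; i++)
        for (unsigned b = 0; b < esize; b++)
            out[b * count + i] = in[i * esize + b];
}

/* Inverse: scatter each byte plane back into its element position. */
void ref_tpdec(const unsigned char *in, unsigned n, unsigned char *out, unsigned esize) {
    assert(esize > 0 && n % esize == 0);
    unsigned count = n / esize;
    for (unsigned i = 0; i < count; i++)
        for (unsigned b = 0; b < esize; b++)
            out[i * esize + b] = in[b * count + i];
}
```

With esize = 4, the twelve input bytes 1..12 come out as 1,5,9, 2,6,10, 3,7,11, 4,8,12. Bytes that play the same role in each element (for floating-point data, typically sign/exponent bytes) end up adjacent, which tends to help a following lz77 stage.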
Nibble transpose:
void tp4enc( unsigned char *in, unsigned n, unsigned char *out, unsigned esize);
void tp4dec( unsigned char *in, unsigned n, unsigned char *out, unsigned esize);
in : input buffer
n : number of bytes
out : output buffer
esize : element size in bytes (2,4,8,...)
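A nibble transpose goes one step further and groups 4-bit halves of each byte. The sketch below shows one possible layout: for each byte position, a plane of low nibbles followed by a plane of high nibbles, packed two per byte. The layout actually produced by tp4enc is not documented in this text and may differ. The ref_ names are made up, and n is assumed to be a multiple of 2*esize so that every plane packs evenly.

```c
#include <assert.h>

/* Illustrative nibble transpose; in/out must not overlap. */
void ref_tp4enc(const unsigned char *in, unsigned n, unsigned char *out, unsigned esize) {
    assert(esize > 0 && n % (2 * esize) == 0);
    unsigned count = n / esize;               /* number of elements (even) */
    unsigned plane = count / 2;               /* bytes per nibble plane */
    for (unsigned b = 0; b < esize; b++) {
        unsigned char *lo = out + 2 * b * plane;
        unsigned char *hi = lo + plane;
        for (unsigned i = 0; i < count; i += 2) {
            unsigned char x = in[i * esize + b];
            unsigned char y = in[(i + 1) * esize + b];
            lo[i / 2] = (unsigned char)((x & 0x0F) | ((y & 0x0F) << 4));
            hi[i / 2] = (unsigned char)((x >> 4) | (y & 0xF0));
        }
    }
}

/* Inverse of ref_tp4enc. */
void ref_tp4dec(const unsigned char *in, unsigned n, unsigned char *out, unsigned esize) {
    assert(esize > 0 && n % (2 * esize) == 0);
    unsigned count = n / esize;
    unsigned plane = count / 2;
    for (unsigned b = 0; b < esize; b++) {
        const unsigned char *lo = in + 2 * b * plane;
        const unsigned char *hi = lo + plane;
        for (unsigned i = 0; i < count; i += 2) {
            unsigned char l = lo[i / 2], h = hi[i / 2];
            out[i * esize + b]       = (unsigned char)((l & 0x0F) | ((h & 0x0F) << 4));
            out[(i + 1) * esize + b] = (unsigned char)((l >> 4) | (h & 0xF0));
        }
    }
}
```

Splitting at nibble granularity can pay off when only part of a byte is stable across elements, for instance when the high four bits of a byte repeat while the low four vary.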
- Linux: GNU GCC (>=4.6)
- Linux: Clang (>=3.2)
- Windows: MinGW-w64 makefile
- Windows: Visual C++ (>=VS2008) - makefile.vs (for nmake)
- Windows: Visual Studio project file - vs/vs2017 - Thanks to PavelP
- Linux ARM: 64 bits aarch64 ARMv8: gcc (>=6.3)
- Linux ARM: 64 bits aarch64 ARMv8: clang
- All TurboTranspose functions are thread safe
- BS - Bitshuffle: Filter for improving compression of typed binary data.
  - A compression scheme for radio data in high performance computing
- Blosc: A blocking, shuffling and loss-less compression
- SPDP is a compression/decompression algorithm for binary IEEE 754 32/64 bits floating-point data
  - SPDP - An Automatically Synthesized Lossless Compression Algorithm for Floating-Point Data + DCC 2018
- FPC: A High-Speed Compressor for Double-Precision Floating-Point Data
Last update: 25 Oct 2019