SSE matrix 4x4 multiplication & SciMark 2 benchmark
Based on:
Usage
make TEST=(matrix,scimark2_c) O=(2,3) TYPE=(0,1,2) CC=(gcc-4.6.2,clang-3.0,cc) ITER=100000000
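For example, assuming TYPE selects the matrix implementation variant (0 = plain C, 1 = naive SSE, 2 = optimized SSE; this mapping is inferred from the three result columns below, not documented here), a single run could look like:

    make TEST=matrix O=3 TYPE=2 CC=gcc-4.6.2 ITER=100000000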
SSE matrix results (100M iterations)
Tested on:
- Mac OS X 10.7.3
- Intel Core i5 2400 2.5 GHz
Results (SciMark composite in Mflops, higher is better; the three Matrix columns are run times, lower is better):

version                    -Ox  arch        SciMark comp.  plain C  naive SSE  opt SSE
ICC [1] 12.1.3              3   corei7-avx        1734.40     0.96       2.87     1.00
ICC 12.1.3                  3   core2             1741.45     1.04       3.20     1.05
GCC [2] 4.7                 3   corei7-avx        1666.37     0.73       2.08     0.54
GCC 4.7                     3   core2             1355.56     0.78       2.32     0.58
GCC 4.6.2                   3   corei7-avx        1641.45     0.88       1.97     0.54
GCC 4.6.2                   3   core2             1357.67     0.74       2.08     0.58
clang [3] 3.1/318.0.58      3   corei7-avx        1406.52     3.03       3.19     1.03
clang 3.1/318.0.58          3   core2             1406.03     3.03       3.39     1.03
GCC/LLVM [4] 4.2.1/2336.1   3   core2                   -     2.56       4.93    CRASH
GCC [5] 4.2.1               3   core2                   -     2.70       4.41     1.63
[1] ICC 12.1.3 binaries provided by the Intel® C++ Composer XE 2011 for Mac OS X package.
[2] GCC 4.7 and GCC 4.6.2 were built from GNU sources.
[3] clang 3.1 binaries provided by the Xcode 4.3 package.
[4] GCC/LLVM 4.2.1 binaries provided by the Xcode 4.2 package.
[5] GCC 4.2.1 binaries provided by the Xcode 3.2.6 package.
SSE matrix conclusions
The SSE-optimized matrix multiplication uses intrinsics and is hand-tuned, so it does not rely heavily on the compiler's optimizer. Its performance is still sensitive to register allocation, though, especially to the number of available SSE (XMM) registers.
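The optimized version follows the classic broadcast-and-accumulate pattern that also shows up in the assembly listings below. A minimal sketch of that technique, assuming column-major 4x4 float matrices (the function name and layout are illustrative, not the benchmark's actual source):

    #include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_mul_ps, ... */

    /* out = a * b for 4x4 column-major float matrices.
       All three pointers must be 16-byte aligned. */
    static void mat4_mul_sse(float *out, const float *a, const float *b)
    {
        __m128 a0 = _mm_load_ps(a + 0);   /* column 0 of a */
        __m128 a1 = _mm_load_ps(a + 4);   /* column 1 of a */
        __m128 a2 = _mm_load_ps(a + 8);   /* column 2 of a */
        __m128 a3 = _mm_load_ps(a + 12);  /* column 3 of a */

        for (int i = 0; i < 4; i++) {
            /* Broadcast each scalar of b's column i and accumulate:
               the same mulps/addps chain seen in the listings below. */
            __m128 col = _mm_mul_ps(a0, _mm_set1_ps(b[4 * i + 0]));
            col = _mm_add_ps(col, _mm_mul_ps(a1, _mm_set1_ps(b[4 * i + 1])));
            col = _mm_add_ps(col, _mm_mul_ps(a2, _mm_set1_ps(b[4 * i + 2])));
            col = _mm_add_ps(col, _mm_mul_ps(a3, _mm_set1_ps(b[4 * i + 3])));
            _mm_store_ps(out + 4 * i, col);
        }
    }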
GCC seems to provide the best performance
Both GCC 4.7 and 4.6.2 deliver the best numbers. Moreover, the pure C implementation is only about 50% slower under GCC than the optimized SSE matrix multiplication, whereas it is roughly 100% slower under ICC and 300% slower under Clang.
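For comparison, the plain C variant is essentially a straightforward triple loop. A rough sketch of that shape (again illustrative, not the exact benchmark source), simple enough for GCC and ICC to auto-vectorize at -O3:

    /* out = a * b for 4x4 column-major float matrices, scalar C. */
    static void mat4_mul_plain(float *out, const float *a, const float *b)
    {
        for (int i = 0; i < 4; i++) {          /* column of b and out */
            for (int j = 0; j < 4; j++) {      /* row of out */
                float sum = 0.0f;
                for (int k = 0; k < 4; k++)
                    sum += a[4 * k + j] * b[4 * i + k];
                out[4 * i + j] = sum;
            }
        }
    }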
Why ICC is slower than GCC on the SSE-optimized example
GCC performs the whole loop using only XMM (SSE) registers, since the entire calculation fits in registers and can be done in place. ICC instead moves values back and forth between the stack and the XMM registers, albeit with SSE4/AVX instructions.
Why Clang is the worst compared to GCC and ICC
Clang does not even use SSE4/AVX instructions; it uses the older movss, which moves a single scalar SSE value, so it ends up even slower than ICC. Clang also optimizes the pure C version poorly, whereas GCC and ICC can vectorize it using AVX.
Clang 3.1 optimized SSE example loop:
LBB0_5: ## Parent Loop BB0_4 Depth=1
## => This Inner Loop Header: Depth=2
movss -96(%rbp,%rcx,4), %xmm4
pshufd $0, %xmm4, %xmm4 ## xmm4 = xmm4[0,0,0,0]
mulps %xmm3, %xmm4
movss -92(%rbp,%rcx,4), %xmm5
pshufd $0, %xmm5, %xmm5 ## xmm5 = xmm5[0,0,0,0]
mulps %xmm0, %xmm5
addps %xmm4, %xmm5
movss -88(%rbp,%rcx,4), %xmm4
pshufd $0, %xmm4, %xmm4 ## xmm4 = xmm4[0,0,0,0]
mulps %xmm1, %xmm4
addps %xmm5, %xmm4
movss -84(%rbp,%rcx,4), %xmm5
pshufd $0, %xmm5, %xmm5 ## xmm5 = xmm5[0,0,0,0]
mulps %xmm2, %xmm5
addps %xmm4, %xmm5
movaps %xmm5, -192(%rbp,%rcx,4)
leaq 4(%rcx), %rcx
cmpl $16, %ecx
jl LBB0_5
GCC 4.6.2 loop (note that the loop body touches no memory; the results are stored to the stack only after the loop exits):
L6:
movaps %xmm2, %xmm9
L3:
movaps %xmm1, %xmm8
movaps %xmm9, %xmm4
movaps %xmm0, %xmm2
mulps %xmm3, %xmm8
movaps %xmm6, %xmm5
addq $1, %rdx
mulps %xmm3, %xmm4
cmpq %rax, %rdx
mulps %xmm10, %xmm2
mulps %xmm3, %xmm5
movaps %xmm8, %xmm7
addps %xmm4, %xmm7
addps %xmm4, %xmm1
movaps %xmm0, %xmm4
movaps %xmm8, %xmm0
mulps %xmm3, %xmm4
addps %xmm9, %xmm0
addps %xmm7, %xmm2
addps %xmm4, %xmm1
addps %xmm4, %xmm0
addps %xmm7, %xmm4
addps %xmm5, %xmm2
addps %xmm5, %xmm1
addps %xmm5, %xmm0
addps %xmm4, %xmm6
jl L6
movaps %xmm2, (%rsp)
movaps %xmm1, 16(%rsp)
movaps %xmm0, 32(%rsp)
movaps %xmm6, 48(%rsp)
SciMark 2 results
ICC outperforms GCC by about 4% (1741.45 vs. 1666.37), and GCC outperforms Clang by about 18% (1666.37 vs. 1406.52). SciMark uses a pure C core without any intrinsics, which suggests that Intel's compiler has the best generic optimizer.
./scimark2_c-o3-t2-icc-corei7-avx
Using 2.00 seconds min time per kenel.
Composite Score: 1725.89
FFT Mflops: 1263.81 (N=1024)
SOR Mflops: 1193.42 (100 x 100)
MonteCarlo: Mflops: 970.75
Sparse matmult Mflops: 1461.61 (N=1000, nz=5000)
LU Mflops: 3739.88 (M=100, N=100)
./scimark2_c-o3-t2-icc-core2
Using 2.00 seconds min time per kenel.
Composite Score: 1741.45
FFT Mflops: 1213.43 (N=1024)
SOR Mflops: 1195.41 (100 x 100)
MonteCarlo: Mflops: 958.60
Sparse matmult Mflops: 1515.78 (N=1000, nz=5000)
LU Mflops: 3824.03 (M=100, N=100)
./scimark2_c-o3-t2-gcc-4.7-corei7-avx
Using 2.00 seconds min time per kenel.
Composite Score: 1666.37
FFT Mflops: 1446.79 (N=1024)
SOR Mflops: 1018.09 (100 x 100)
MonteCarlo: Mflops: 604.54
Sparse matmult Mflops: 1774.87 (N=1000, nz=5000)
LU Mflops: 3487.58 (M=100, N=100)
./scimark2_c-o3-t2-gcc-4.7-core2
Using 2.00 seconds min time per kenel.
Composite Score: 1355.56
FFT Mflops: 1395.37 (N=1024)
SOR Mflops: 1022.19 (100 x 100)
MonteCarlo: Mflops: 591.04
Sparse matmult Mflops: 1768.99 (N=1000, nz=5000)
LU Mflops: 2000.24 (M=100, N=100)
./scimark2_c-o3-t2-gcc-4.6.2-corei7-avx
Using 2.00 seconds min time per kenel.
Composite Score: 1641.45
FFT Mflops: 1437.98 (N=1024)
SOR Mflops: 1010.89 (100 x 100)
MonteCarlo: Mflops: 803.15
Sparse matmult Mflops: 1601.78 (N=1000, nz=5000)
LU Mflops: 3353.45 (M=100, N=100)
./scimark2_c-o3-t2-gcc-4.6.2-core2
Using 2.00 seconds min time per kenel.
Composite Score: 1357.67
FFT Mflops: 1385.32 (N=1024)
SOR Mflops: 1025.30 (100 x 100)
MonteCarlo: Mflops: 802.10
Sparse matmult Mflops: 1600.90 (N=1000, nz=5000)
LU Mflops: 1974.74 (M=100, N=100)
./scimark2_c-o3-t2-clang-core2
Using 2.00 seconds min time per kenel.
Composite Score: 1406.03
FFT Mflops: 1050.31 (N=1024)
SOR Mflops: 1417.75 (100 x 100)
MonteCarlo: Mflops: 562.68
Sparse matmult Mflops: 1644.04 (N=1000, nz=5000)
LU Mflops: 2355.38 (M=100, N=100)
./scimark2_c-o3-t2-clang-corei7-avx
Using 2.00 seconds min time per kenel.
Composite Score: 1406.52
FFT Mflops: 1051.59 (N=1024)
SOR Mflops: 1421.25 (100 x 100)
MonteCarlo: Mflops: 559.95
Sparse matmult Mflops: 1638.98 (N=1000, nz=5000)
LU Mflops: 2360.82 (M=100, N=100)