SSE matrix 4x4 multiplication & SciMark 2 benchmark

Based on:

Usage

make TEST=(matrix,scimark2_c) O=(2,3) TYPE=(0,1,2) CC=(gcc-4.6.2,clang-3.0,cc) ITER=100000000

SSE matrix results (100M iterations)

Tested on:

  • Mac OS X 10.7.3
  • Intel Core i5 2400 2.5 GHz

Results:

             version  opt/-Ox   arch        SciMark comp. Matrix plain  naive SSE    opt SSE
     ICC [1] 12.1.3         3   corei7-avx        1734.40         0.96       2.87       1.00
     ICC     12.1.3         3   core2             1741.45         1.04       3.20       1.05
     GCC [2] 4.7            3   corei7-avx        1666.37         0.73       2.08       0.54
     GCC     4.7            3   core2             1355.56         0.78       2.32       0.58
     GCC     4.6.2          3   corei7-avx        1641.45         0.88       1.97       0.54
     GCC     4.6.2          3   core2             1357.67         0.74       2.08       0.58
   clang [3] 3.1/318.0.58   3   corei7-avx        1406.52         3.03       3.19       1.03
   clang     3.1/318.0.58   3   core2             1406.03         3.03       3.39       1.03
GCC/LLVM [4] 4.2.1/2336.1   3   core2                             2.56       4.93     **CRASH**
     GCC [5] 4.2.1          3   core2                             2.70       4.41       1.63

[1] ICC 12.1.3 binaries provided by the Intel® C++ Composer XE 2011 for Mac OS X package.
[2] GCC 4.7 & GCC 4.6.2 were built from GNU sources.
[3] clang 3.1 binaries provided by the Xcode 4.3 package.
[4] GCC/LLVM 4.2.1 binaries provided by the Xcode 4.2 package.
[5] GCC 4.2.1 binaries provided by the Xcode 3.2.6 package.

SSE matrix conclusions

SSE-optimized matrix multiplication using intrinsics is hand optimized, so it does not rely heavily on the compiler's optimizer. Its performance is still sensitive to the number of available registers, though, especially SSE (XMM) registers.
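
For reference, a minimal sketch of what such an intrinsics-based 4x4 multiply typically looks like is shown below (assuming row-major float[16] matrices with 16-byte alignment; the actual code in this repository may differ in detail):

#include <xmmintrin.h>

/* Broadcast each element of a row of A, multiply by the rows of B and
 * accumulate; the whole product for one row stays in XMM registers. */
static void mat4_mul_sse(float out[16], const float a[16], const float b[16])
{
	__m128 b0 = _mm_load_ps(&b[0]);
	__m128 b1 = _mm_load_ps(&b[4]);
	__m128 b2 = _mm_load_ps(&b[8]);
	__m128 b3 = _mm_load_ps(&b[12]);
	for (int i = 0; i < 4; i++) {
		__m128 r = _mm_mul_ps(_mm_set1_ps(a[4*i + 0]), b0);
		r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[4*i + 1]), b1));
		r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[4*i + 2]), b2));
		r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[4*i + 3]), b3));
		_mm_store_ps(&out[4*i], r);
	}
}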

GCC seems to provide the best performance

Both GCC 4.7 and 4.6.2 provide the best performance. Under GCC the pure C implementation is only about 50% slower than the optimized SSE matrix multiplication, whereas it is about 100% slower under ICC and 300% slower under Clang.
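
The "plain" pure C version referred to above is an ordinary scalar triple loop; a minimal sketch (again assuming row-major float[16] matrices, as in the SSE variant) is:

static void mat4_mul_plain(float out[16], const float a[16], const float b[16])
{
	/* Straightforward scalar triple loop; closing the gap to the SSE
	 * version is left entirely to the compiler's auto-vectorizer. */
	for (int i = 0; i < 4; i++)
		for (int j = 0; j < 4; j++) {
			float s = 0.0f;
			for (int k = 0; k < 4; k++)
				s += a[4*i + k] * b[4*k + j];
			out[4*i + j] = s;
		}
}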

Why ICC is worse than GCC for the SSE-optimized example

GCC can keep the whole loop of operations in XMM (SSE) registers as long as all calculations can be done in place. ICC instead moves values back and forth between the stack and XMM registers using SSE4/AVX instructions.

Why Clang is worst in comparison to GCC & ICC

Clang does not even use SSE4/AVX instructions, but the older movss, which moves only a single value into an SSE register, so it is even slower than ICC. Clang also optimizes the pure C version poorly, whereas GCC and ICC can vectorize it using AVX.

Clang 3.1 optimized SSE example loop:

LBB0_5:                                 ##   Parent Loop BB0_4 Depth=1
                                        ## =>  This Inner Loop Header: Depth=2
	movss	-96(%rbp,%rcx,4), %xmm4
	pshufd	$0, %xmm4, %xmm4        ## xmm4 = xmm4[0,0,0,0]
	mulps	%xmm3, %xmm4
	movss	-92(%rbp,%rcx,4), %xmm5
	pshufd	$0, %xmm5, %xmm5        ## xmm5 = xmm5[0,0,0,0]
	mulps	%xmm0, %xmm5
	addps	%xmm4, %xmm5
	movss	-88(%rbp,%rcx,4), %xmm4
	pshufd	$0, %xmm4, %xmm4        ## xmm4 = xmm4[0,0,0,0]
	mulps	%xmm1, %xmm4
	addps	%xmm5, %xmm4
	movss	-84(%rbp,%rcx,4), %xmm5
	pshufd	$0, %xmm5, %xmm5        ## xmm5 = xmm5[0,0,0,0]
	mulps	%xmm2, %xmm5
	addps	%xmm4, %xmm5
	movaps	%xmm5, -192(%rbp,%rcx,4)
	leaq	4(%rcx), %rcx
	cmpl	$16, %ecx
	jl	LBB0_5

GCC 4.6.2 loop:

L6:
	movaps	%xmm2, %xmm9
L3:
	movaps	%xmm1, %xmm8
	movaps	%xmm9, %xmm4
	movaps	%xmm0, %xmm2
	mulps	%xmm3, %xmm8
	movaps	%xmm6, %xmm5
	addq	$1, %rdx
	mulps	%xmm3, %xmm4
	cmpq	%rax, %rdx
	mulps	%xmm10, %xmm2
	mulps	%xmm3, %xmm5
	movaps	%xmm8, %xmm7
	addps	%xmm4, %xmm7
	addps	%xmm4, %xmm1
	movaps	%xmm0, %xmm4
	movaps	%xmm8, %xmm0
	mulps	%xmm3, %xmm4
	addps	%xmm9, %xmm0
	addps	%xmm7, %xmm2
	addps	%xmm4, %xmm1
	addps	%xmm4, %xmm0
	addps	%xmm7, %xmm4
	addps	%xmm5, %xmm2
	addps	%xmm5, %xmm1
	addps	%xmm5, %xmm0
	addps	%xmm4, %xmm6
	jl	L6
	movaps	%xmm2, (%rsp)
	movaps	%xmm1, 16(%rsp)
	movaps	%xmm0, 32(%rsp)
	movaps	%xmm6, 48(%rsp)

SciMark 2 results

ICC outperforms GCC by 4%, and GCC outperforms Clang by 18%. SciMark uses a pure C core without any intrinsics, which suggests that Intel's compiler has the best generic optimizer.

./scimark2_c-o3-t2-icc-corei7-avx

Using       2.00 seconds min time per kenel.
Composite Score:         1725.89
FFT             Mflops:  1263.81    (N=1024)
SOR             Mflops:  1193.42    (100 x 100)
MonteCarlo:     Mflops:   970.75
Sparse matmult  Mflops:  1461.61    (N=1000, nz=5000)
LU              Mflops:  3739.88    (M=100, N=100)

./scimark2_c-o3-t2-icc-core2

Using       2.00 seconds min time per kenel.
Composite Score:         1741.45
FFT             Mflops:  1213.43    (N=1024)
SOR             Mflops:  1195.41    (100 x 100)
MonteCarlo:     Mflops:   958.60
Sparse matmult  Mflops:  1515.78    (N=1000, nz=5000)
LU              Mflops:  3824.03    (M=100, N=100)

./scimark2_c-o3-t2-gcc-4.7-corei7-avx

Using       2.00 seconds min time per kenel.
Composite Score:         1666.37
FFT             Mflops:  1446.79    (N=1024)
SOR             Mflops:  1018.09    (100 x 100)
MonteCarlo:     Mflops:   604.54
Sparse matmult  Mflops:  1774.87    (N=1000, nz=5000)
LU              Mflops:  3487.58    (M=100, N=100)

./scimark2_c-o3-t2-gcc-4.7-core2

Using       2.00 seconds min time per kenel.
Composite Score:         1355.56
FFT             Mflops:  1395.37    (N=1024)
SOR             Mflops:  1022.19    (100 x 100)
MonteCarlo:     Mflops:   591.04
Sparse matmult  Mflops:  1768.99    (N=1000, nz=5000)
LU              Mflops:  2000.24    (M=100, N=100)

./scimark2_c-o3-t2-gcc-4.6.2-corei7-avx

Using       2.00 seconds min time per kenel.
Composite Score:         1641.45
FFT             Mflops:  1437.98    (N=1024)
SOR             Mflops:  1010.89    (100 x 100)
MonteCarlo:     Mflops:   803.15
Sparse matmult  Mflops:  1601.78    (N=1000, nz=5000)
LU              Mflops:  3353.45    (M=100, N=100)

./scimark2_c-o3-t2-gcc-4.6.2-core2

Using       2.00 seconds min time per kenel.
Composite Score:         1357.67
FFT             Mflops:  1385.32    (N=1024)
SOR             Mflops:  1025.30    (100 x 100)
MonteCarlo:     Mflops:   802.10
Sparse matmult  Mflops:  1600.90    (N=1000, nz=5000)
LU              Mflops:  1974.74    (M=100, N=100)

./scimark2_c-o3-t2-clang-core2

Using       2.00 seconds min time per kenel.
Composite Score:         1406.03
FFT             Mflops:  1050.31    (N=1024)
SOR             Mflops:  1417.75    (100 x 100)
MonteCarlo:     Mflops:   562.68
Sparse matmult  Mflops:  1644.04    (N=1000, nz=5000)
LU              Mflops:  2355.38    (M=100, N=100)

./scimark2_c-o3-t2-clang-corei7-avx

Using       2.00 seconds min time per kenel.
Composite Score:         1406.52
FFT             Mflops:  1051.59    (N=1024)
SOR             Mflops:  1421.25    (100 x 100)
MonteCarlo:     Mflops:   559.95
Sparse matmult  Mflops:  1638.98    (N=1000, nz=5000)
LU              Mflops:  2360.82    (M=100, N=100)
