
Pure nim gemm #66

Merged
merged 7 commits into from
Dec 26, 2019
Conversation


@mratsim mratsim commented Dec 26, 2019

This implements a Weave-based state-of-the-art GEMM.

Unfortunately, it seems to be plagued by the same woes as reductions:

$  ./weave_gemm 
Warmup: 1506 ms, result 660 (displayed to avoid compiler optimizing warmup away)

Backend:                        Weave (Pure Nim)
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Weave implementation
Collected 10 samples in 262 ms
Average time: 25.200 ms
Stddev  time: 3.225 ms
Min     time: 18.000 ms
Max     time: 29.000 ms
Perf:         561.737 GFLOP/s
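For reference, the header figures follow from the standard GEMM operation count: an M×N×K single-precision matrix multiply performs 2·M·N·K FLOPs and reads (M·K + K·N) 4-byte inputs, which is where the 480 FLOP/byte arithmetic intensity comes from. A quick sanity check of the Weave numbers above (illustrative only, not part of the benchmark harness):

```python
# Sanity-check the benchmark header figures for a 1920^3 float32 GEMM.
M = N = K = 1920

flops = 2 * M * N * K              # a multiply-accumulate counts as 2 FLOPs
input_bytes = (M * K + K * N) * 4  # the two float32 input matrices
intensity = flops / input_bytes    # FLOPs per byte read

print(f"Operations: {flops / 1e6:.3f} millions")  # 14155.776 millions
print(f"Bytes: {input_bytes / 1e6:.3f} MB")       # 29.491 MB
print(f"Intensity: {intensity:.1f} FLOP/byte")    # 480.0 FLOP/byte

avg_time_s = 25.200e-3             # Weave average time from the run above
print(f"Perf: {flops / avg_time_s / 1e9:.3f} GFLOP/s")  # 561.737 GFLOP/s
```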

$  ./openblas_gemm 
Warmup: 1502 ms, result 660 (displayed to avoid compiler optimizing warmup away)

Backend:                        OpenBLAS
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

OpenBLAS benchmark
Collected 10 samples in 122 ms
Average time: 11.100 ms
Stddev  time: 3.143 ms
Min     time: 10.000 ms
Max     time: 20.000 ms
Perf:         1275.295 GFLOP/s

$  ./mkl_gemm 
Warmup: 1502 ms, result 660 (displayed to avoid compiler optimizing warmup away)

Backend:                        Intel MKL
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Intel MKL benchmark
Collected 10 samples in 127 ms
Average time: 11.300 ms
Stddev  time: 13.598 ms
Min     time: 7.000 ms
Max     time: 50.000 ms
Perf:         1252.724 GFLOP/s

$  ./laser_omp_gemm 
Warmup: 1503 ms, result 660 (displayed to avoid compiler optimizing warmup away)

Backend:                        Laser (Pure Nim) + OpenMP
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Laser production implementation
Collected 10 samples in 87 ms
Average time: 7.400 ms
Stddev  time: 4.477 ms
Min     time: 5.000 ms
Max     time: 20.000 ms
Perf:         1912.943 GFLOP/s

Note that the last three use OpenMP, which is a bit unstable: mratsim/laser#40 (comment)

In Laser's own benchmarks, both OpenBLAS and MKL can hit 1.5~1.8 TFLOP/s on my machine, though that may require taking all CPUs out of their power-saving profile.
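For context, the theoretical-peak lines in the headers are consistent with an AVX-512 CPU issuing two 16-wide single-precision FMAs per cycle; the clock and core count below are assumptions chosen to reproduce those numbers, not values printed by the benchmark:

```python
# Reproduce the theoretical-peak figures, assuming (hypothetically) an
# AVX-512 CPU with two FMA units: 16 float32 lanes x 2 FLOPs per FMA
# x 2 units = 64 FLOPs/cycle, at an assumed 3.5 GHz sustained clock
# and an assumed 18 cores.
flops_per_cycle = 16 * 2 * 2   # lanes x FMA x FMA units (assumption)
clock_hz = 3.5e9               # assumed sustained frequency
cores = 18                     # assumed core count

peak_single = flops_per_cycle * clock_hz / 1e9   # GFLOP/s, one core
peak_multi = peak_single * cores                 # GFLOP/s, all cores
print(f"single-core: {peak_single:.3f} GFLOP/s")  # 224.000
print(f"multi:       {peak_multi:.3f} GFLOP/s")   # 4032.000
```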
