
Pure nim gemm #66

Merged
merged 7 commits into from
Dec 26, 2019
Conversation


@mratsim mratsim commented Dec 26, 2019

This implements a Weave-based state-of-the-art GEMM.

Unfortunately, it seems to be plagued by the same woes as reductions:

$  ./weave_gemm 
Warmup: 1506 ms, result 660 (displayed to avoid compiler optimizing warmup away)

Backend:                        Weave (Pure Nim)
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Weave implementation
Collected 10 samples in 262 ms
Average time: 25.200 ms
Stddev  time: 3.225 ms
Min     time: 18.000 ms
Max     time: 29.000 ms
Perf:         561.737 GFLOP/s
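For reference, the header figures follow from the standard GEMM operation count: an M×N×K single-precision matrix multiply performs 2·M·N·K FLOPs and reads (M·K + K·N) 4-byte inputs, which is where the 480 FLOP/byte arithmetic intensity comes from. A quick sanity check of the Weave numbers above (illustrative only, not part of the benchmark harness):

```python
# Sanity-check the benchmark header figures for a 1920^3 float32 GEMM.
M = N = K = 1920

flops = 2 * M * N * K              # a multiply-accumulate counts as 2 FLOPs
input_bytes = (M * K + K * N) * 4  # the two float32 input matrices
intensity = flops / input_bytes    # FLOPs per byte read

print(f"Operations: {flops / 1e6:.3f} millions")  # 14155.776 millions
print(f"Bytes: {input_bytes / 1e6:.3f} MB")       # 29.491 MB
print(f"Intensity: {intensity:.1f} FLOP/byte")    # 480.0 FLOP/byte

avg_time_s = 25.200e-3             # Weave average time from the run above
print(f"Perf: {flops / avg_time_s / 1e9:.3f} GFLOP/s")  # 561.737 GFLOP/s
```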

$  ./openblas_gemm 
Warmup: 1502 ms, result 660 (displayed to avoid compiler optimizing warmup away)

Backend:                        OpenBLAS
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

OpenBLAS benchmark
Collected 10 samples in 122 ms
Average time: 11.100 ms
Stddev  time: 3.143 ms
Min     time: 10.000 ms
Max     time: 20.000 ms
Perf:         1275.295 GFLOP/s

$  ./mkl_gemm 
Warmup: 1502 ms, result 660 (displayed to avoid compiler optimizing warmup away)

Backend:                        Intel MKL
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Intel MKL benchmark
Collected 10 samples in 127 ms
Average time: 11.300 ms
Stddev  time: 13.598 ms
Min     time: 7.000 ms
Max     time: 50.000 ms
Perf:         1252.724 GFLOP/s

$  ./laser_omp_gemm 
Warmup: 1503 ms, result 660 (displayed to avoid compiler optimizing warmup away)

Backend:                        Laser (Pure Nim) + OpenMP
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Laser production implementation
Collected 10 samples in 87 ms
Average time: 7.400 ms
Stddev  time: 4.477 ms
Min     time: 5.000 ms
Max     time: 20.000 ms
Perf:         1912.943 GFLOP/s

Note that the last three use OpenMP, which is a bit unstable: mratsim/laser#40 (comment)

In Laser's own benchmarks, both OpenBLAS and MKL can hit 1.5~1.8 TFLOP/s on my machine, though that may require taking all CPUs out of their power-saving profile.
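For context, the theoretical-peak lines in the headers are consistent with an AVX-512 CPU issuing two 16-wide single-precision FMAs per cycle; the clock and core count below are assumptions chosen to reproduce those numbers, not values printed by the benchmark:

```python
# Reproduce the theoretical-peak figures, assuming (hypothetically) an
# AVX-512 CPU with two FMA units: 16 float32 lanes x 2 FLOPs per FMA
# x 2 units = 64 FLOPs/cycle, at an assumed 3.5 GHz sustained clock
# and an assumed 18 cores.
flops_per_cycle = 16 * 2 * 2   # lanes x FMA x FMA units (assumption)
clock_hz = 3.5e9               # assumed sustained frequency
cores = 18                     # assumed core count

peak_single = flops_per_cycle * clock_hz / 1e9   # GFLOP/s, one core
peak_multi = peak_single * cores                 # GFLOP/s, all cores
print(f"single-core: {peak_single:.3f} GFLOP/s")  # 224.000
print(f"multi:       {peak_multi:.3f} GFLOP/s")   # 4032.000
```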
