
[GEMM] Significant performance regression (divided by 5) #32

Closed
mratsim opened this issue Jul 14, 2019 · 2 comments

Labels
bug Something isn't working

mratsim (Owner) commented Jul 14, 2019

Since #28, which fixed #27, another strange regression appeared, dividing performance by 5:

From a March 23 build:

$  ./build/gemm_f32_omp

A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

Reference loop
Collected 10 samples in 10.421 seconds
Average time: 1041.539 ms
Stddev  time: 3.983 ms
Min     time: 1035.329 ms
Max     time: 1047.674 ms
Perf:         13.591 GFLOP/s

OpenBLAS benchmark
Collected 10 samples in 0.091 seconds
Average time: 8.438 ms
Stddev  time: 6.319 ms
Min     time: 6.240 ms
Max     time: 26.393 ms
Perf:         1677.596 GFLOP/s

Laser production implementation
Collected 10 samples in 0.087 seconds
Average time: 8.035 ms
Stddev  time: 4.186 ms
Min     time: 6.517 ms
Max     time: 19.913 ms
Perf:         1761.855 GFLOP/s

PyTorch Glow: libjit matmul implementation (with AVX+FMA)
Collected 10 samples in 1.900 seconds
Average time: 189.987 ms
Stddev  time: 2.893 ms
Min     time: 188.794 ms
Max     time: 198.044 ms
Perf:         74.509 GFLOP/s

MKL-DNN reference GEMM benchmark
Collected 10 samples in 0.368 seconds
Average time: 36.043 ms
Stddev  time: 5.048 ms
Min     time: 34.275 ms
Max     time: 50.364 ms
Perf:         392.748 GFLOP/s

MKL-DNN JIT AVX benchmark
Collected 10 samples in 0.105 seconds
Average time: 9.758 ms
Stddev  time: 5.933 ms
Min     time: 7.715 ms
Max     time: 26.624 ms
Perf:         1450.731 GFLOP/s

MKL-DNN JIT AVX512 benchmark
Collected 10 samples in 0.088 seconds
Average time: 8.154 ms
Stddev  time: 10.128 ms
Min     time: 4.733 ms
Max     time: 36.938 ms
Perf:         1736.020 GFLOP/s
Mean Relative Error compared to vendor BLAS: 3.045843413929106e-06
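As a sanity check on the header figures above (a minimal sketch, not part of the benchmark harness), the operation count, byte count, and arithmetic intensity follow directly from the 1920×1920 f32 problem shape:

```python
# Reproduce the benchmark header figures for a square f32 GEMM.
M = N = K = 1920
flops = 2 * M * N * K              # one multiply + one add per (i, j, k)
bytes_read = (M * K + K * N) * 4   # the two f32 input matrices, 4 bytes/element
intensity = flops / bytes_read

print(f"Operations: {flops / 1e6} millions")  # 14155.776
print(f"Bytes:      {bytes_read / 1e6} MB")   # 29.4912
print(f"Intensity:  {intensity} FLOP/byte")   # 480.0
```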

From a recent rebuild:

$  ./build/gemm_omp_f32

A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

Laser production implementation
Collected 10 samples in 0.555 seconds
Average time: 54.917 ms
Stddev  time: 5.027 ms
Min     time: 53.250 ms
Max     time: 69.218 ms
Perf:         257.765 GFLOP/s
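The Perf line is just the operation count divided by the average time, so the slowdown between the two Laser runs can be quantified directly (a quick check, not output from the harness):

```python
flops = 2 * 1920**3       # 14.155776e9 operations for this GEMM
old_avg_s = 8.035e-3      # Laser average time, March 23 build
new_avg_s = 54.917e-3     # Laser average time, recent rebuild

old_gflops = flops / old_avg_s / 1e9   # ~1761.8 GFLOP/s
new_gflops = flops / new_avg_s / 1e9   # ~257.8 GFLOP/s
print(new_avg_s / old_avg_s)           # ~6.8x slowdown
```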
mratsim (Owner, Author) commented Jul 14, 2019

The issue is in Nim upstream: rebuilding Laser's current master with the Nim OpenMP commit (nim-lang/Nim@2564961) brings back full performance.

mratsim (Owner, Author) commented Jul 14, 2019

After bisecting, the cause is the split of -d:release into -d:release and -d:danger (nim-lang/Nim#11385).
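For context on the fix: before nim-lang/Nim#11385, -d:release alone disabled runtime checks (bounds, overflow, etc.); after the split, those checks stay enabled unless -d:danger is also passed. A plausible way to restore the old behaviour when compiling the benchmark would be (the source path here is illustrative, not taken from this thread):

```shell
# After nim-lang/Nim#11385, -d:danger is what removes runtime checks;
# -d:release alone now keeps them on, which costs heavily in hot GEMM loops.
nim c -d:danger --out:build/gemm_f32_omp benchmarks/gemm/gemm_bench_float32.nim
```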

mratsim closed this as completed Jul 14, 2019