
Provide a BLAS-free option #422

Open

mratsim opened this issue Mar 5, 2020 · 5 comments


mratsim commented Mar 5, 2020

BLAS installation has been a recurrent problem for users and CI:

As explained on IRC, replacing BLAS while providing the same performance is possible via the Laser backend already integrated in Arraymancer.

The steps are:

  • Integers are already using Laser; we can change the constraint to not Complex:

    proc fallbackMM_C_eq_aAB_p_bC*[T: SomeInteger](
           alpha: T, a, b: Tensor[T],
           beta: T, c: var Tensor[T]) {.inline.} =
      # Matrix: C = alpha A matmul B + beta C
      let
        M = a.shape[0]
        K = a.shape[1] # = b.shape[0]
        N = b.shape[1]

      # Legacy fallback
      # gemm_nn_fallback(M, N, K,
      #                  alpha,
      #                  a.data, a.offset,
      #                  a.strides[0], a.strides[1],
      #                  b.data, b.offset,
      #                  b.strides[0], b.strides[1],
      #                  beta,
      #                  c.data, c.offset,
      #                  c.strides[0], c.strides[1])

      # Laser backend
      gemm_strided(M, N, K,
                   alpha,
                   a.get_offset_ptr(),
                   a.strides[0], a.strides[1],
                   b.get_offset_ptr(),
                   b.strides[0], b.strides[1],
                   beta,
                   c.get_offset_ptr(),
                   c.strides[0], c.strides[1])

    proc blasMM_C_eq_aAB_p_bC*[T: SomeFloat|Complex[float32]|Complex[float64]](
           alpha: T, a, b: Tensor[T],
           beta: T, c: var Tensor[T]) =
      # Matrix: C = alpha A matmul B + beta C
  • The "blasfree" flag should imply no_lapack:
    • when not defined(no_lapack):
      # THe ml module also does not export everything is LAPACK is not available
      import ./linear_algebra/linear_algebra
      export linear_algebra
    • when not defined(no_lapack):
      import ./dimensionality_reduction/pca,
      ./clustering/kmeans
      export pca,
      kmeans
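
A minimal sketch of what the resulting compile-time switch could look like, assuming the fallback's constraint has been relaxed as described in the first step; the blasfree define and the matmul_backend name are only illustrative, not the final API:

proc matmul_backend[T: SomeFloat](
       alpha: T, a, b: Tensor[T],
       beta: T, c: var Tensor[T]) =
  # Matrix: C = alpha A matmul B + beta C
  when defined(blasfree):
    # Pure-Nim path: Laser's gemm_strided via the (relaxed) fallback proc
    fallbackMM_C_eq_aAB_p_bC(alpha, a, b, beta, c)
  else:
    # Default path: BLAS gemm
    blasMM_C_eq_aAB_p_bC(alpha, a, b, beta, c)
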

The features that would not be available without BLAS would be:

  • Linear algebra:
    • Symmetric matrices eigenvalues decomposition
    • QR decomposition
    • LU decomposition
    • Singular Value Decomposition (SVD)
    • Randomized SVD
    • Least Squares Solver
  • Machine Learning
    • Principal Component Analysis (PCA)

elcritch commented Apr 7, 2020

Would it be possible to include a lightweight pure C BLAS like https://github.com/michael-lehn/ulmBLAS as a fallback? I'm just learning about Nim, but I understand it should be possible to create a Nimble package for something like ulmBLAS and have it configurable as an optional dependency?

Edit: Also, I'm not sure how good Nim is at link-time optimization, but for embedded devices it would be nice to have unused functions automatically stripped.


mratsim commented Apr 7, 2020

Nim strips all unused functions; it's actually quite tricky to keep them.
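
As a small illustration of that point (file and proc names are made up; this assumes a shared-library build, e.g. nim c --app:lib keepalive.nim):

# keepalive.nim -- illustrative only
proc cAddOne(x: cint): cint {.exportc: "c_add_one", cdecl, dynlib.} =
  # exportc/dynlib force code generation and symbol export,
  # even though no Nim code ever calls this proc
  x + 1

proc neverCalled(): int =
  # no callers and no export pragma: removed by dead-code elimination
  42
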

ulmBLAS is missing several LAPACK routines; here is what I need for the features I mentioned in the post:

  • laswp: Pivot a matrix A with P: A = P * A
  • orgqr: Extract the orthonormal matrix Q from the Householder reflectors coming from geqrf
  • ormqr: Multiply the Householder reflectors from geqrf with any matrix, without extracting Q (optional)
  • syevr: Extract eigenvalues from symmetric matrices
  • geqrf: QR decomposition; produces Q as raw Householder reflectors that require orgqr or ormqr for further use (a small illustrative sketch follows this list)
  • gesdd: Singular Value Decomposition by Divide-and-Conquer
  • getrf: Pivoted LU decomposition
  • gelsd: Least Squares via recursive Divide-and-Conquer
  • gesv: Solve AX = B for a general matrix A
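
To make the geqrf/orgqr split concrete, here is a small, self-contained and deliberately unoptimized Householder QR sketch on plain seq matrices; it is not Arraymancer or LAPACK code, but phase 1 plays the role of geqrf and phase 2 the role of orgqr:

import std/math

type Matrix = seq[seq[float]]

proc identity(n: int): Matrix =
  result = newSeq[seq[float]](n)
  for i in 0 ..< n:
    result[i] = newSeq[float](n)
    result[i][i] = 1.0

# Phase 1 (conceptually what geqrf does): reduce A to upper-triangular R
# and return the raw Householder reflectors, one vector per eliminated column.
proc householderReflect(a: Matrix): tuple[r: Matrix, vs: seq[seq[float]]] =
  let n = a.len
  var r = a
  var vs: seq[seq[float]]
  for k in 0 ..< n - 1:
    var normx = 0.0
    for i in k ..< n: normx += r[i][k] * r[i][k]
    normx = sqrt(normx)
    if normx == 0.0: continue
    let alpha = if r[k][k] > 0: -normx else: normx
    var v = newSeq[float](n)
    for i in k ..< n: v[i] = r[i][k]
    v[k] -= alpha
    var vn2 = 0.0
    for i in k ..< n: vn2 += v[i] * v[i]
    if vn2 == 0.0: continue
    # Apply H = I - 2 v v^T / (v^T v) from the left to R
    for j in 0 ..< n:
      var dot = 0.0
      for i in k ..< n: dot += v[i] * r[i][j]
      for i in k ..< n: r[i][j] -= 2.0 * dot / vn2 * v[i]
    vs.add v
  result = (r: r, vs: vs)

# Phase 2 (conceptually what orgqr does): materialize Q = H_1 H_2 ... H_k
proc materializeQ(vs: seq[seq[float]], n: int): Matrix =
  result = identity(n)
  for v in vs:
    var vn2 = 0.0
    for x in v: vn2 += x * x
    for i in 0 ..< n:
      var dot = 0.0
      for j in 0 ..< n: dot += result[i][j] * v[j]
      for j in 0 ..< n: result[i][j] -= 2.0 * dot / vn2 * v[j]

LAPACK keeps these two phases separate precisely so that ormqr can apply the reflectors to another matrix directly (e.g. for least squares) without ever forming Q.
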

Regarding ulmBLAS in general: in terms of speed it will lag severely behind OpenBLAS/MKL/BLIS/Laser, as its microkernel is SSE-only and does not use AVX or AVX-512.

Furthermore, if your concern is embedded, ulmBLAS (or OpenBLAS) needs to compile in all its assembly to produce a DLL, while a pure Nim solution can do dead-code elimination if you never use QR or LU decomposition, for example.


elcritch commented Apr 8, 2020

Ok, ulmBLAS doesn't seem like it'd be ideal. I tried compiling it on Linux and it breaks.

What is the primary goal for this issue? Is it to write naive-Nim implementations, with or without optimized ASM kernels? It seems that maybe the CLAPACK C libraries could be auto-converted to Nim. Then they could be improved over time.

Also, if there is an attempt at pure Nim implementations (that aren't ASM-optimized), I wouldn't mind porting a few algorithms, as I'm currently having to one-off many in C (or in a high-level language using C matrices) and it's neither enjoyable nor complete. Hence why Nim/Arraymancer seems so interesting!

P.S. Referring to #430, would this future work always support a compile-time-only CPU target? Using LLVM / JIT'ing isn't generally feasible for embedded devices.


mratsim commented Apr 8, 2020

What is the primary goal for this issue? Is it to write naive-Nim implementations, with or without optimized ASM kernels? It seems that maybe the CLAPACK C libraries could be auto-converted to Nim. Then they could be improved over time.

Users on Windows in particular have reported issues installing BLAS. Even I couldn't get LAPACK working on Windows CI: #212.

Also, once Arraymancer was integrated into the Nim test suite as an important package, BLAS caused issues due to differing distro packaging: SciNim/nimblas#3

And lastly, the packaging can just be plain wrong; Arch Linux's has been giving wrong results for the past 6 months: https://bugs.archlinux.org/task/63054

Regarding performance, you don't need assembly to reach BLAS performance, but you do need to know your CPU and memory hierarchy. I have already implemented the main reason to use BLAS, matrix multiplication, in pure Nim and can reach OpenBLAS speed on large matrices (small matrices are still significantly slower than OpenBLAS/MKL, but those are much easier to optimize: you need to deactivate some of the preprocessing used for optimized large-matrix multiplication).

See implementation in Laser using OpenMP:

For performance, you can capture 80% of the speed by doing 1D/2D/3D tiling + vectorization.
There is a detailed example of how to do 1D and 2D tiling in Laser, applied to transposition:
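
For intuition, here is a rough sketch of the 2D-tiling idea applied to transposition (this is not the Laser code; the proc name and the 64-element tile size are just for illustration):

proc transpose2dTiled(dst, src: ptr UncheckedArray[float32], M, N: int) =
  ## Writes dst[j, i] = src[i, j] for a row-major M x N matrix.
  const blck = 64    # tile edge; chosen so one src tile + one dst tile stay in cache
  for ib in countup(0, M - 1, blck):
    for jb in countup(0, N - 1, blck):
      # 2D tiling: work on one blck x blck tile at a time so that the
      # strided accesses to dst stay within a cache-resident block
      for i in ib ..< min(ib + blck, M):
        for j in jb ..< min(jb + blck, N):
          dst[j*M + i] = src[i*N + j]

Vectorizing the inner loop and parallelizing the outer tile loops then come on top of this pattern.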

In short:

My plan, detailed here: #430 (comment), is to allow 2 different kinds of people to work on compute kernels:

  • The researcher/implementer only has to get the algorithm right
  • The optimizer/performance engineer only has to tune the performance to the architecture
    (I've given an example on convolution in the Nim forum: https://forum.nim-lang.org/t/6164#38122)
  • Furthermore, we will likely have different optimizations for CPU and GPU targets for the same algorithm
  • And I want compiler-generated backpropagation kernels (though I'm not sure yet how to allow a performance engineer to optimize them)
proc convolve(A, F: Function): Function =
  # Generator for the convolution function
  var i, j, ii, jj: Domain
  var convH, convW: Function

  # Definition of the result function
  # Filter F = [1, 2, 1]
  # So the 2D application is:
  # [[1, 2, 1],
  #  [2, 4, 2],
  #  [1, 2, 1]]
  #
  convH[i, j] = A[i-1, j] * F[0] + A[i, j] * F[1] + A[i+1, j] * F[2]
  convW[i, j] = convH[i, j-1] * F[0] + convH[i, j] * F[1] + convH[i, j+1] * F[2]

  # Optimization of the definition - completely optional
  convH.compute_at(convW, j) # Compute the required region of convH just before convW needs it (i.e. merged)
  # convH.compute_root()     # Alternatively do the height convolve before the width convolve (i.e. 2 steps)

  convW.tile(i, ii, 64)      # Split i into tiles of size 64, with ii as the tile iterator
  convW.parallel(i)          # Iterating on tiles is done in parallel
  convW.tile(j, jj, 64)      # Split j into tiles of size 64 as well, with jj as the tile iterator
  convW.vectorize(jj, 8)     # Iterating on jj is done with size-8 vectors if possible (i.e. AVX on x86)

  return convW

generate convolve:
  proc foobar(a: Tensor[float32], filter: array[3, float32]): Tensor[float32]

P.S. referring to #430 would this future work always support a compile-time only CPU target? Using LLVM / JIT'ing isn't generally feasible for embedded devices.

Yes. I don't want LLVM to be a requirement. It's huge.


elcritch commented Apr 8, 2020

Just a quick response: that all sounds great! I'm impressed by how much detail you've already got. While I enjoy writing in Julia, the whole LLVM dependency is just a pain for embedded. Rust is a pain to cross-compile, and seems to have problems with arrays on the stack(?). Given all that, Nim's new ARC/move semantics really makes Nim/Arraymancer very promising for embedded usage (in addition to server side).

My plan detailed here: #430 (comment) is to allow 2 different kinds of people to work on compute kernels:

The researcher/implementer only has to get the algorithm right
The optimizer/performance engineer only has to tune the performance to the architecture

That seems reasonable. Glad you summarized it, as I didn't quite understand that from the linked comment. In my company's use case, simple machine learning techniques turn out to be pretty handy. So if there's a simple path to implementing pure Nim versions using just the correct algorithm first, one could cover most of the basic machine learning techniques pretty quickly. Then others could refine/tune the performance. That's how I'm understanding your previous comment. I'll look into your example some more! It is fun playing with vectorization/tiling...
