
Provide a BLAS-free option #422

Open

mratsim opened this issue Mar 5, 2020 · 5 comments


mratsim commented Mar 5, 2020

BLAS installation has been a recurrent problem for users and CI:

As explained on IRC, replacing BLAS while providing the same performance is possible via the Laser backend already integrated in Arraymancer.

The steps are:

  • Integers are already using Laser; we can change the constraint to not Complex:

    proc fallbackMM_C_eq_aAB_p_bC*[T: SomeInteger](
           alpha: T, a, b: Tensor[T],
           beta: T, c: var Tensor[T]) {.inline.} =
      # Matrix: C = alpha A matmul B + beta C
      let
        M = a.shape[0]
        K = a.shape[1] # = b.shape[0]
        N = b.shape[1]

      # Legacy fallback
      # gemm_nn_fallback(M, N, K,
      #                  alpha,
      #                  a.data, a.offset,
      #                  a.strides[0], a.strides[1],
      #                  b.data, b.offset,
      #                  b.strides[0], b.strides[1],
      #                  beta,
      #                  c.data, c.offset,
      #                  c.strides[0], c.strides[1])

      # Laser backend
      gemm_strided(M, N, K,
                   alpha,
                   a.get_offset_ptr(),
                   a.strides[0], a.strides[1],
                   b.get_offset_ptr(),
                   b.strides[0], b.strides[1],
                   beta,
                   c.get_offset_ptr(),
                   c.strides[0], c.strides[1])

    proc blasMM_C_eq_aAB_p_bC*[T: SomeFloat|Complex[float32]|Complex[float64]](
           alpha: T, a, b: Tensor[T],
           beta: T, c: var Tensor[T]) =
      # Matrix: C = alpha A matmul B + beta C
  • The "blasfree" flag should imply no_lapack:
    • when not defined(no_lapack):
      # THe ml module also does not export everything is LAPACK is not available
      import ./linear_algebra/linear_algebra
      export linear_algebra
    • when not defined(no_lapack):
      import ./dimensionality_reduction/pca,
      ./clustering/kmeans
      export pca,
      kmeans
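
A minimal sketch of what the resulting compile-time switch could look like, assuming the fallback's constraint has been relaxed as described in the first step; the blasfree define and the matmul_backend name are only illustrative, not the final API:

proc matmul_backend[T: SomeFloat](
       alpha: T, a, b: Tensor[T],
       beta: T, c: var Tensor[T]) =
  # Matrix: C = alpha A matmul B + beta C
  when defined(blasfree):
    # Pure-Nim path: Laser's gemm_strided via the (relaxed) fallback proc
    fallbackMM_C_eq_aAB_p_bC(alpha, a, b, beta, c)
  else:
    # Default path: BLAS gemm
    blasMM_C_eq_aAB_p_bC(alpha, a, b, beta, c)
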

The features that would not be available without BLAS would be:

  • Linear algebra:
    • Symmetric matrices eigenvalues decomposition
    • QR decomposition
    • LU decomposition
    • Singular Value Decomposition (SVD)
    • Randomized SVD
    • Least Squares Solver
  • Machine Learning
    • Principal Component Analysis (PCA)

elcritch commented Apr 7, 2020

Would it be possible to include a lightweight pure C BLAS like https://github.com/michael-lehn/ulmBLAS as a fallback? I'm just learning about Nim, but I understand it should be possible to create a Nimble package for something like ulmBLAS and have it configurable as an optional dependency?

Edit: Also, I'm not sure how good Nim is at link-time optimization, but for embedded devices it would be nice to have unused functions automatically stripped.


mratsim commented Apr 7, 2020

Nim strips all unused functions; it's actually quite tricky to keep them.
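
As a small illustration of that point (file and proc names are made up; this assumes a shared-library build, e.g. nim c --app:lib keepalive.nim):

# keepalive.nim -- illustrative only
proc cAddOne(x: cint): cint {.exportc: "c_add_one", cdecl, dynlib.} =
  # exportc/dynlib force code generation and symbol export,
  # even though no Nim code ever calls this proc
  x + 1

proc neverCalled(): int =
  # no callers and no export pragma: removed by dead-code elimination
  42
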

ulmBLAS is missing several LAPACK routines; here is what I need for the features I mentioned in the post:

  • laswp: Pivot a matrix A with P: A = P * A
  • orgqr: Extract the orthonormal matrix Q from the Householder reflectors coming from geqrf
  • ormqr: Multiply the Householder reflectors from geqrf with any matrix, without extracting Q (optional)
  • syevr: Extract eigenvalues from symmetric matrices
  • geqrf: QR decomposition; produces Q as raw Householder reflectors that require orgqr or ormqr for further use (a small illustrative sketch follows this list)
  • gesdd: Singular Value Decomposition by Divide-and-Conquer
  • getrf: Pivoted LU decomposition
  • gelsd: Least Squares via recursive Divide-and-Conquer
  • gesv: Solve AX = B for a general matrix A
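
To make the geqrf/orgqr split concrete, here is a small, self-contained and deliberately unoptimized Householder QR sketch on plain seq matrices; it is not Arraymancer or LAPACK code, but phase 1 plays the role of geqrf and phase 2 the role of orgqr:

import std/math

type Matrix = seq[seq[float]]

proc identity(n: int): Matrix =
  result = newSeq[seq[float]](n)
  for i in 0 ..< n:
    result[i] = newSeq[float](n)
    result[i][i] = 1.0

# Phase 1 (conceptually what geqrf does): reduce A to upper-triangular R
# and return the raw Householder reflectors, one vector per eliminated column.
proc householderReflect(a: Matrix): tuple[r: Matrix, vs: seq[seq[float]]] =
  let n = a.len
  var r = a
  var vs: seq[seq[float]]
  for k in 0 ..< n - 1:
    var normx = 0.0
    for i in k ..< n: normx += r[i][k] * r[i][k]
    normx = sqrt(normx)
    if normx == 0.0: continue
    let alpha = if r[k][k] > 0: -normx else: normx
    var v = newSeq[float](n)
    for i in k ..< n: v[i] = r[i][k]
    v[k] -= alpha
    var vn2 = 0.0
    for i in k ..< n: vn2 += v[i] * v[i]
    if vn2 == 0.0: continue
    # Apply H = I - 2 v v^T / (v^T v) from the left to R
    for j in 0 ..< n:
      var dot = 0.0
      for i in k ..< n: dot += v[i] * r[i][j]
      for i in k ..< n: r[i][j] -= 2.0 * dot / vn2 * v[i]
    vs.add v
  result = (r: r, vs: vs)

# Phase 2 (conceptually what orgqr does): materialize Q = H_1 H_2 ... H_k
proc materializeQ(vs: seq[seq[float]], n: int): Matrix =
  result = identity(n)
  for v in vs:
    var vn2 = 0.0
    for x in v: vn2 += x * x
    for i in 0 ..< n:
      var dot = 0.0
      for j in 0 ..< n: dot += result[i][j] * v[j]
      for j in 0 ..< n: result[i][j] -= 2.0 * dot / vn2 * v[j]

LAPACK keeps these two phases separate precisely so that ormqr can apply the reflectors to another matrix directly (e.g. for least squares) without ever forming Q.
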

Regarding ulmBLAS in general: in terms of speed it will lag severely behind OpenBLAS/MKL/BLIS/Laser, as its microkernel is SSE-only and does not use AVX or AVX-512.

Furthermore, if your concern is embedded, ulmBLAS (or OpenBLAS) needs to compile in all its assembly to produce a DLL, while a pure Nim solution can do dead-code elimination if you never use QR or LU decomposition, for example.


elcritch commented Apr 8, 2020

Ok, ulmBLAS doesn't seem like it'd be ideal. I tried compiling it on Linux and it breaks.

What is the primary goal for this issue? Is it to write naive-Nim implementations, with or without optimized ASM kernels? It seems that maybe the CLAPACK C libraries could be auto-converted to Nim. Then they could be improved over time.

Also, if there is an attempt at pure Nim implementations (that aren't ASM-optimized), I wouldn't mind porting a few algorithms, as I'm currently having to one-off many in C (or in a high-level language using C matrices) and it's neither enjoyable nor complete. Hence why Nim/Arraymancer seems so interesting!

P.S. Referring to #430, would this future work always support a compile-time-only CPU target? Using LLVM / JIT'ing isn't generally feasible for embedded devices.


mratsim commented Apr 8, 2020

What is the primary goal for this issue? Is it to write naive-Nim implementations, with or without optimized ASM kernels? It seems that maybe the CLAPACK C libraries could be auto-converted to Nim. Then they could be improved over time.

Users on Windows in particular have reported issues installing BLAS. Even I couldn't get LAPACK working on Windows CI: #212.

Also, once Arraymancer was integrated into the Nim test suite as an important package, BLAS caused issues due to differing distro packaging: SciNim/nimblas#3

And lastly, the packaging can just be plain wrong; Arch Linux's has been giving wrong results for the past 6 months: https://bugs.archlinux.org/task/63054

Regarding performance, you don't need assembly to reach BLAS performance, but you do need to know your CPU and memory hierarchy. I have already implemented the main reason to use BLAS, matrix multiplication, in pure Nim and can reach OpenBLAS speed on large matrices (small matrices are still significantly slower than OpenBLAS/MKL, but those are much easier to optimize: you need to deactivate some of the preprocessing used for optimized large-matrix multiplication).

See implementation in Laser using OpenMP:

For performance, you can capture 80% of the speed by doing 1D/2D/3D tiling + vectorization.
There is a detailed example of how to do 1D and 2D tiling in Laser, applied to transposition:
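
For intuition, here is a rough sketch of the 2D-tiling idea applied to transposition (this is not the Laser code; the proc name and the 64-element tile size are just for illustration):

proc transpose2dTiled(dst, src: ptr UncheckedArray[float32], M, N: int) =
  ## Writes dst[j, i] = src[i, j] for a row-major M x N matrix.
  const blck = 64    # tile edge; chosen so one src tile + one dst tile stay in cache
  for ib in countup(0, M - 1, blck):
    for jb in countup(0, N - 1, blck):
      # 2D tiling: work on one blck x blck tile at a time so that the
      # strided accesses to dst stay within a cache-resident block
      for i in ib ..< min(ib + blck, M):
        for j in jb ..< min(jb + blck, N):
          dst[j*M + i] = src[i*N + j]

Vectorizing the inner loop and parallelizing the outer tile loops then come on top of this pattern.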

In short:

My plan, detailed here: #430 (comment), is to allow 2 different kinds of people to work on compute kernels:

  • The researcher/implementer only has to get the algorithm right
  • The optimizer/performance engineer only has to tune the performance to the architecture
    (I've given an example on convolution in the Nim forum: https://forum.nim-lang.org/t/6164#38122)
  • Furthermore, we will likely have different optimizations for CPU and GPU targets for the same algorithm
  • And I want compiler-generated backpropagation kernels (though I'm not sure yet how to allow a performance engineer to optimize them)
proc convolve(A, F: Function): Function =
  # Generator for the convolution function
  var i, j, ii, jj: Domain
  var convH, convW: Function

  # Definition of the result function
  # Filter F = [1, 2, 1]
  # So the 2D application is:
  # [[1, 2, 1],
  #  [2, 4, 2],
  #  [1, 2, 1]]
  #
  convH[i, j] = A[i-1, j] * F[0] + A[i, j] * F[1] + A[i+1, j] * F[2]
  convW[i, j] = convH[i, j-1] * F[0] + convH[i, j] * F[1] + convH[i, j+1] * F[2]

  # Optimization of the definition - completely optional
  convH.compute_at(convW, j) # Compute the required region of convH just before convW needs it (i.e. merged)
  # convH.compute_root()     # Alternatively do the height convolve before the width convolve (i.e. 2 steps)

  convW.tile(i, ii, 64)      # Split i into tiles of size 64, with ii as the tile iterator
  convW.parallel(i)          # Iterating on tiles is done in parallel
  convW.tile(j, jj, 64)      # Split j into tiles of size 64 as well, with jj as the tile iterator
  convW.vectorize(jj, 8)     # Iterating on jj is done with size-8 vectors if possible (i.e. AVX on x86)

  return convW

generate convolve:
  proc foobar(a: Tensor[float32], filter: array[3, float32]): Tensor[float32]

P.S. referring to #430 would this future work always support a compile-time only CPU target? Using LLVM / JIT'ing isn't generally feasible for embedded devices.

Yes. I don't want LLVM to be a requirement. It's huge.


elcritch commented Apr 8, 2020

Just a quick response: that all sounds great! I'm impressed by how much detail you've already got. While I enjoy writing in Julia, the whole LLVM dependency is just a pain for embedded. Rust is a pain to cross-compile, and seems to have problems with arrays on the stack(?). Given all that, Nim's new ARC/move semantics really makes Nim/Arraymancer very promising for embedded usage (in addition to server side).

My plan detailed here: #430 (comment) is to allow 2 different kinds of people to work on compute kernels:

The researcher/implementer only has to get the algorithm right
The optimizer/performance engineer only has to tune the performance to the architecture

That seems reasonable. Glad you summarized it, as I didn't quite understand that from the linked comment. In my company's use case, simple machine learning techniques turn out to be pretty handy. So if there's a simple path to implementing pure Nim versions using just the correct algorithm first, one could cover most of the basic machine learning techniques pretty quickly. Then others could refine/tune the performance. That's how I'm understanding your previous comment. I'll look into your example some more! It is fun playing with vectorization/tiling...
