Provide a BLAS-free option #422
Would it be possible to include a lightweight pure-C BLAS like https://github.com/michael-lehn/ulmBLAS as a fallback? I'm just learning about Nim, but I understand it should be possible to create a Nimble package for something like ulmBLAS and make it configurable as an optional dependency. Edit: I'm also not sure how good Nim is at link-time optimization, but for embedded devices it would be nice to have unused functions automatically stripped.
Nim strips all unused functions; it's actually quite tricky to keep them. ulmBLAS is missing several LAPACK routines. Here is what I need for the features I mentioned in the post:
Regarding ulmBLAS in general: in terms of speed it will be severely lacking compared to OpenBLAS/MKL/BLIS/Laser, as its microkernel is SSE-only and does not use AVX or AVX-512. Furthermore, if your concern is embedded, ulmBLAS (or OpenBLAS) needs to compile in all of its assembly to produce a DLL, while a pure Nim solution can do dead-code elimination if you never use QR or LU decomposition, for example.
OK, what is the primary goal for this issue? Is it to write naive Nim implementations, with or without optimized ASM kernels? It seems the CLAPACK C libraries could perhaps be auto-converted to Nim and then improved over time. Also, if there is an attempt at pure Nim implementations (that aren't ASM-optimized), I wouldn't mind porting a few algorithms, as I'm currently having to write many one-offs in C (or in a high-level language using C matrices) and it's neither enjoyable nor complete. Hence why Nim/Arraymancer seems so interesting! P.S. Referring to #430, would this future work always support a compile-time-only CPU target? Using LLVM / JIT'ing isn't generally feasible for embedded devices.
Users on Windows in particular have reported issues installing BLAS. Even I couldn't get LAPACK working on Windows CI: #212. Also, when integrating Arraymancer into the Nim test suite as an important package, BLAS caused issues due to differing distro packaging: SciNim/nimblas#3. And lastly, the packaging can just be plain wrong and has been giving wrong results for the past 6 months: https://bugs.archlinux.org/task/63054

Regarding performance, you don't need assembly to reach BLAS performance, but you do need to know your CPU and memory. I have already implemented the main reason to use BLAS, matrix multiplication, in pure Nim and can reach OpenBLAS speed on large matrices (small matrices are still significantly slower than OpenBLAS/MKL, but those are much easier to optimize: you need to deactivate some of the preprocessing done for optimized large matrix multiplication). See the implementation in Laser using OpenMP:
For performance, you can capture 80% of the speed by doing 1D/2D/3D tiling + vectorization.
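To illustrate the tiling + vectorization point (this is not the Laser kernel; the proc name, the block size of 64, and the whole layout below are made up for the sketch), here is a minimal pure-Nim matrix multiplication tiled over `i` and `j`, where the innermost loop walks contiguous memory so the C backend can auto-vectorize it:

```nim
const tileSize = 64   # illustrative block size; real kernels derive it from cache sizes

proc tiledMatMul(a, b: seq[float32], m, k, n: int): seq[float32] =
  ## C[m x n] = A[m x k] * B[k x n], row-major, tiled over i and j.
  result = newSeq[float32](m * n)
  for i0 in countup(0, m - 1, tileSize):
    for j0 in countup(0, n - 1, tileSize):
      for l in 0 ..< k:
        for i in i0 ..< min(i0 + tileSize, m):
          let aval = a[i * k + l]
          # The innermost loop is contiguous in memory, so the C compiler
          # can auto-vectorize it (e.g. AVX on x86 with -d:danger / -O3).
          for j in j0 ..< min(j0 + tileSize, n):
            result[i * n + j] += aval * b[l * n + j]

when isMainModule:
  let a = @[1'f32, 2'f32, 3'f32, 4'f32]   # 2 x 2
  let b = @[5'f32, 6'f32, 7'f32, 8'f32]   # 2 x 2
  echo tiledMatMul(a, b, 2, 2, 2)          # @[19.0, 22.0, 43.0, 50.0]
```

The actual Laser implementation adds more on top of this (OpenMP parallelism and the preprocessing mentioned above), which is where the remaining performance comes from.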
In short:
My plan, detailed here: #430 (comment), is to allow 2 different kinds of people to work on compute kernels:
```nim
proc convolve(A, Filter: Function): Function =
  # Generator for the convolution function
  var i, j, ii, jj: Domain
  var convH, convW: Function

  # Definition of the result function
  # Filter = [1, 2, 1]
  # So the 2D application is:
  #   [[1, 2, 1],
  #    [2, 4, 2],
  #    [1, 2, 1]]
  convH[i, j] = A[i-1, j] * Filter[0] + A[i, j] * Filter[1] + A[i+1, j] * Filter[2]
  convW[i, j] = convH[i, j-1] * Filter[0] + convH[i, j] * Filter[1] + convH[i, j+1] * Filter[2]

  # Optimization of the definition - completely optional
  convH.compute_at(convW, j)  # Compute the required region of convH before convW starts (i.e. merged)
  # convH.compute_root()      # Alternatively, do the height convolve before the width convolve (i.e. 2 steps)
  convW.tile(i, ii, 64)       # Split i into tiles of size 64, with ii as the tile iterator
  convW.parallel(i)           # Iterating on tiles is done in parallel
  convW.tile(j, jj, 64)       # Split j into tiles of size 64 as well, with jj as the tile iterator
  convW.vectorize(jj, 8)      # Iterating on jj uses size-8 vectors when possible (i.e. AVX on x86)
  return convW

generate convolve:
  proc foobar(a: Tensor[float32], filter: array[3, float32]): Tensor[float32]
```
Yes. I don't want LLVM to be a requirement. It's huge.
Just a quick response: that all sounds great! I'm impressed by how much detail you've already got. While I enjoy writing in Julia, the whole LLVM dependency is just a pain for embedded. Rust is a pain to cross-compile, and seems to have problems with arrays on the stack(?). Given all that, Nim's new ARC/move semantics really make Nim/Arraymancer very promising for embedded usage (in addition to server side).
That seems reasonable. Glad you summarized it, as I didn't quite understand that from the linked comment. In my company's use case, simple machine learning techniques turn out to be pretty handy. So if there's a simple path to implementing pure Nim versions using just the correct algorithm first, one could cover most of the basic machine learning techniques pretty quickly, and then others could refine/tune the performance. That's how I'm understanding your previous comment. I'll look into your example some more! It is fun playing with vectorization/tiling...
BLAS installation has been a recurrent problem for users and CI:
As explained on IRC, replacing BLAS while providing the same performance is possible via the Laser backend already integrated in Arraymancer.
The steps involve these code locations (a sketch of a possible compile-time switch follows the list):
- Arraymancer/src/tensor/private/p_operator_blas_l2l3.nim, lines 112 to 146 at 8acfd68 (the `not Complex` path)
- Arraymancer/src/arraymancer.nim, lines 41 to 44 at 8acfd68
- Arraymancer/src/ml/ml.nim, lines 21 to 25 at 8acfd68
The features that would not be available without BLAS would be: