
2023-12-31 - Longstanding missing features #616

Open
mratsim opened this issue Dec 31, 2023 · 25 comments

@mratsim
Owner

mratsim commented Dec 31, 2023

Arraymancer has become a key piece of the Nim ecosystem. Unfortunately I do not have the time to develop it further, for several reasons:

  • Family: the birth of a family member means the death of hobby time.
  • Competing hobby: I've been focusing on cryptography for the last couple of years, I feel Nim also has a unique niche there, and I'm even accelerating Rust libraries with a Nim backend.
  • Pace of development: the deep learning community was moving rapidly in 2012-2018; today it moves very, very fast and is hard to compete with. Not to say it's impossible, but you need better infrastructure to catch up.

Furthermore, Nim v2 has since introduced interesting new features, like built-in memory management that works with multithreading, and views, which are quite relevant to Arraymancer.

Let's go over the longstanding missing features needed to improve Arraymancer, covering both the tensor library and the neural network library.

Tensor backend (~NumPy, ~SciPy)

  1. Mutable operations on slices
  2. Nested parallelism
  3. Doc generation
  4. Versioning / releases
  5. Slow transcendental functions (exp, log, sin, cos, tan, tanh, ...)
  6. Windows: BLAS and Lapack deployment woes
  7. MacOS: OpenMP woes
  8. MacOS: Tensor cores usage

Also: the need for untyped Tensors.

Neural network backend (~PyTorch)

  1. Nvidia Cuda
  2. Implementation woes: CPU forward, CPU backward, GPU forward, GPU backward, all optimized
  3. Ergonomic serialization and deserialization of models
  4. Slowness of reduction operations like sigmoid or softmax the more cores you have
@mratsim
Owner Author

mratsim commented Dec 31, 2023

Tensor backend - 1. Mutable operations on slices

References:

Currently slicing returns immutable tensors, and you need an extra var assignment to mutate the original tensor, which is quite burdensome.

I don't remember what the errors were; IIRC the tests gave wrong results and it was painful to debug.
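
To make the ergonomic gap concrete, a minimal sketch (assuming current Arraymancer slicing and clone APIs; the commented line shows what mutable slice views would enable, it is not current behaviour):

import arraymancer

var t = zeros[float32](4, 4)

# Today: materialize the slice, mutate the copy, then write it back.
var row = t[1, _].clone()
row[0, 2] = 42'f32
t[1, _] = row

# What mutable slices would allow: mutating through the view directly,
# without the copy-and-writeback dance above (illustrative, not current API):
#   apply_inline(t[1, _]): x + 1

echo t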

@mratsim
Owner Author

mratsim commented Dec 31, 2023

Tensor backend - 2. Nested parallelism

We currently cannot benefit from nested parallelism in Arraymancer: due to OpenMP limitations, a parallel function calling another parallel function will cause oversubscription. For example, the Frobenius inner product:

proc frobenius_inner_prod*[T](a, b: Tensor[T]): T =
  sum:
    map2_inline(a, b):
      x * y
which might also be a source of the woes from "Slowness of reduction operations like sigmoid or softmax the more cores you have".

The solution is to switch to Weave (or Constantine's threadpool) as the backend.
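
For illustration, a hedged sketch of how a work-stealing runtime composes where OpenMP oversubscribes (assuming the weave package; the recursive reduction below is illustrative, not Arraymancer's map2_inline/sum implementation):

import weave

proc partialDot(a, b: ptr UncheckedArray[float64]; lo, hi: int): float64 =
  ## Recursive divide-and-conquer dot product: nested spawns only create
  ## tasks, they never spawn extra threads, so nesting composes.
  if hi - lo <= 1024:
    for i in lo ..< hi:
      result += a[i] * b[i]
  else:
    let mid = (lo + hi) div 2
    let left = spawn partialDot(a, b, lo, mid)   # child task, may be stolen
    let right = partialDot(a, b, mid, hi)
    result = sync(left) + right

when isMainModule:
  init(Weave)
  var x = newSeq[float64](100_000)
  var y = newSeq[float64](100_000)
  for i in 0 ..< x.len:
    x[i] = 1.0
    y[i] = 2.0
  let pa = cast[ptr UncheckedArray[float64]](x[0].addr)
  let pb = cast[ptr UncheckedArray[float64]](y[0].addr)
  echo partialDot(pa, pb, 0, x.len)   # 200000.0
  exit(Weave)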

Related:

@mratsim
Owner Author

mratsim commented Dec 31, 2023

Tensor backend - 3. Doc generation

Doc generation is a pain point; see the wishlist in #488 (comment) and haxscramper/haxdoc#1.

Though having the docs auto-generated in CI is awesome: #556

@mratsim
Owner Author

mratsim commented Dec 31, 2023

Tensor backend - 4. Versioning / releases

Reserved

@mratsim
Owner Author

mratsim commented Dec 31, 2023

Tensor backend - 5. Slow transcendental functions (exp, log, sin, cos, tan, tanh, ...)

Background: #265 (comment)

While working on Laser I've looked into many areas and found huge slowness in the exp and log functions from the C standard library <math.h>, which notably bottlenecks sigmoid and softmax.

Benchmark.

On my machine with all optimisations, <math.h> can achieve about 160 million exponentials per second per thread, while the fastest implementation can achieve 1.3 billion (so roughly 8x faster).

PyTorch is not using the fastest implementation, but theirs can still achieve 900 million exponentials per second: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/avx_mathfun.h

I.e. I can't trust anyone :/.

There is a tradeoff between accuracy, which is very much needed for some applications (e.g. weather forecasting, where errors compound exponentially), and speed for others, typically deep learning, where 4-bit quantization is actually fine.

See the 8-way accuracy/speed comparison of AVX2 exponentiation implementations: https://github.com/mratsim/laser/blob/master/benchmarks/vector_math/bench_exp_avx2.nim#L282-L470

This, nested parallelism, or both are a significant bottleneck for the Shakespeare char-rnn demo.
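
For reference, a rough single-threaded sketch to reproduce the <math.h> baseline on a given machine (plain std/math exp, which lowers to the C library call; compile flags such as -d:danger change the numbers a lot):

import std/[math, times]

proc benchExp(n: int) =
  var acc = 0.0
  var x = 0.0
  let start = cpuTime()
  for i in 0 ..< n:
    acc += exp(x)
    x += 1e-6          # vary the argument to avoid constant folding
  let elapsed = cpuTime() - start
  echo "exp/s: ", n.float / elapsed, " (checksum: ", acc, ")"

when isMainModule:
  benchExp(10_000_000)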

@mratsim
Owner Author

mratsim commented Dec 31, 2023

Tensor backend - 6, 7, 8 OS-specific woes

On Windows, BLAS and Lapack are a pain to deploy. Even in CI I did not figure out how to install Lapack:

# TODO: install openblas or lapack
# - os: windows
# cpu: amd64
# - os: windows
# cpu: i386
#512 #465 #422

See similar woes here: https://forum.nim-lang.org/t/10812

On MacOS, Apple Clang doesn't support OpenMP, so using it requires installing LLVM via Homebrew.

The solution to both would be to implement a pure Nim BLAS, backed by a pure Nim multithreading runtime like Weave.
The hardest primitive, matrix multiplication, has already been implemented as a benchmark for Weave with decent performance: even after 3 years of progress on OpenBLAS and Intel MKL, it still achieves 80%~90% of their performance, https://github.com/mratsim/weave/tree/b6255af/benchmarks/matmul_gemm_blas

Lastly, on MacOS, which has unified memory, it's costless to use tensor cores for matrix multiplication.
Currently, as we use Apple Accelerate, we're covered, but if we use our own BLAS like Weave's, we'll have to go through the Apple AMX instructions: https://github.com/corsix/amx
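
To make the pure Nim BLAS idea concrete, here is the naive row-major GEMM a pure Nim implementation would start from (a baseline sketch only; Weave's benchmark adds packing, register tiling and parallelism on top of this to reach the 80%~90% figure above):

proc naiveGemm*(m, n, k: int; a, b: seq[float64]): seq[float64] =
  ## C[m x n] = A[m x k] * B[k x n], row-major, no tiling, no threading.
  result = newSeq[float64](m * n)
  for i in 0 ..< m:
    for p in 0 ..< k:
      let aip = a[i*k + p]
      for j in 0 ..< n:              # innermost loop is contiguous in B and C
        result[i*n + j] += aip * b[p*n + j]

when isMainModule:
  let a = @[1.0, 2.0, 3.0, 4.0]      # 2x2
  let b = @[5.0, 6.0, 7.0, 8.0]      # 2x2
  echo naiveGemm(2, 2, 2, a, b)      # @[19.0, 22.0, 43.0, 50.0]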

@mratsim
Owner Author

mratsim commented Dec 31, 2023

The need for untyped tensors

When I started Arraymancer, I was at first really excited about the possibility of encoding all size information in the type system, to have a robust library where mismatches are compile-time errors.

I quickly backpedaled due to ergonomic issues; see this May 2017 discussion (time flies): andreaferretti/linear-algebra#5 (comment)

The main reason is that it prevents creating a sequence of closures, e.g. Matrix[2, 3] -> Matrix[3, 3] and Matrix[3, 3] -> Matrix[3, 3].

Other problems surfaced later that confirmed not putting dimensions in the type system was better ergonomics-wise, for example squeeze, which deletes singleton dimensions from a tensor.

But even putting just the element type in the type is also problematic ergonomically:

  • serialization/deserialization needs the type as input
  • libraries that build upon tensors for arbitrary inputs, like dataframes, need to handle that themselves (https://github.com/SciNim/Datamancer/blob/f35119080428919ad51658c4bea958fbd2d54d39/src/datamancer/column.nim#L9-L35)
    type
      ColKind* = enum
        colNone, colFloat, colInt, colBool, colString, colObject, colConstant, colGeneric
      Column* = ref object
        len*: int
        case kind*: ColKind
        of colFloat: fCol*: Tensor[float]
        of colInt: iCol*: Tensor[int]
        of colBool: bCol*: Tensor[bool]
        of colString: sCol*: Tensor[string]
        of colObject: oCol*: Tensor[Value]
        of colConstant: cCol*: Value
        of colGeneric: discard # added when overwriting `Column` type
        #  gKiloGram: Tensor[KiloGram]
        #  gKiloGram²: Tensor[KiloGram²]
        #  # or
        #  case gKind: colGenericKind
        #  of kKiloGram: gKiloGram: Tensor[KiloGram]
        #  of kKiloGram²: gKiloGram²: Tensor[KiloGram²]
        of colNone: discard
    
      ColumnLike* = concept x
        x.len is int
        x.kind is ColKind
    
      BuiltInTypes* = float | char | int | bool | string | Value
      SupportedTypes* = SomeNumber | BuiltInTypes

The scientific computing world doesn't work with static types in serialization: when deserializing CSV, JSON, ONNX or TFRecords, you are expected to read the format from the file.
Similarly, Arraymancer NN layers/models would need to be completely type-erased for proper loading.

See also https://forum.nim-lang.org/t/10223#67808

Now, it's probable that some libraries depend on Arraymancer's support for static T, for example for dual numbers.

So a way forward to allow ergonomic serialization/deserialization (#163, #417) from common scientific formats would be to start a new library with a restricted set of types in mind (all the types used in NumPy / PyTorch). The internal Arraymancer typed primitives can be reused, however.
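
As a hedged sketch of what that new library's core type could look like (illustrative names, not an existing API): a runtime-tagged variant over the few dtypes needed for NumPy/PyTorch interop, in the spirit of Datamancer's Column, where deserializers pick the branch from the file's metadata instead of a compile-time type parameter:

import arraymancer

type
  DynKind = enum dkFloat32, dkFloat64, dkInt32, dkInt64
  DynTensor = object
    case kind: DynKind
    of dkFloat32: f32: Tensor[float32]
    of dkFloat64: f64: Tensor[float64]
    of dkInt32:   i32: Tensor[int32]
    of dkInt64:   i64: Tensor[int64]

proc rank(t: DynTensor): int =
  ## Example of an operation dispatched at runtime over the stored dtype.
  case t.kind
  of dkFloat32: t.f32.rank
  of dkFloat64: t.f64.rank
  of dkInt32:   t.i32.rank
  of dkInt64:   t.i64.rank

when isMainModule:
  let t = DynTensor(kind: dkFloat32, f32: zeros[float32](2, 3))
  echo t.rank   # 2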

@mratsim
Owner Author

mratsim commented Dec 31, 2023

Reserved

@mratsim
Owner Author

mratsim commented Dec 31, 2023

Neural network backend - 1. Nvidia Cuda

Cuda support has been broken since Nim v1.2:

The culprit is: nim-lang/Nim#16936

There are 3 ways to fix this:

  1. Create a nim.cfg config for NVCC, either in Nim (https://github.com/nim-lang/Nim/blob/devel/config/nim.cfg) or in Arraymancer. The main issue is that it switches the compiler to nvcc for everything. Then --passC flags need special handling: they all need to be prefixed as -Xcompiler -myflag, so it becomes hard to compose with other libraries.
    Note: we can directly link to the dll/.so without needing the header files from here
    cincludes:"/opt/cuda/include"
  2. Use a JIT like LLVM; the scaffolding code is available in Constantine:
  3. Use NVRTC. NVRTC (the Nvidia Runtime Compilation library) is something I actually missed when implementing Arraymancer; it was introduced in Cuda 7 in 2015, and I started Arraymancer with Cuda 7, so there were no tutorials. NVRTC allows on-the-fly compilation of Cuda kernels, similar to how OpenCL works today in Arraymancer. The code can be provided as in-memory .cu files, i.e. we keep all the current interpolated C++ but remove the emit and nvcc. A hedged sketch follows below.
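
A hedged sketch of option 3 with hand-written, illustrative FFI declarations against the documented NVRTC C API (not an existing Arraymancer or nimcuda binding; the library name/path is an assumption and varies per platform):

const nvrtcLib = "libnvrtc.so"   # assumption: Linux; .dll / .dylib elsewhere

type
  NvrtcProgram = pointer
  NvrtcResult = cint             # 0 == NVRTC_SUCCESS

proc nvrtcCreateProgram(prog: ptr NvrtcProgram; src, name: cstring;
                        numHeaders: cint;
                        headers, includeNames: ptr cstring): NvrtcResult
  {.importc, cdecl, dynlib: nvrtcLib.}
proc nvrtcCompileProgram(prog: NvrtcProgram; numOptions: cint;
                         options: ptr cstring): NvrtcResult
  {.importc, cdecl, dynlib: nvrtcLib.}
proc nvrtcGetPTXSize(prog: NvrtcProgram; size: ptr csize_t): NvrtcResult
  {.importc, cdecl, dynlib: nvrtcLib.}
proc nvrtcGetPTX(prog: NvrtcProgram; ptx: cstring): NvrtcResult
  {.importc, cdecl, dynlib: nvrtcLib.}

const kernelSrc = """
extern "C" __global__ void axpy(float a, float* x, float* y, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];
}
"""

when isMainModule:
  var prog: NvrtcProgram
  doAssert nvrtcCreateProgram(prog.addr, kernelSrc, "axpy.cu", 0, nil, nil) == 0
  doAssert nvrtcCompileProgram(prog, 0, nil) == 0
  var size: csize_t
  doAssert nvrtcGetPTXSize(prog, size.addr) == 0
  var ptx = newString(size.int)
  doAssert nvrtcGetPTX(prog, ptx.cstring) == 0
  echo "compiled ", ptx.len, " bytes of PTX"   # then load it via the Cuda driver API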

@mratsim
Owner Author

mratsim commented Dec 31, 2023

Neural Network backend - 2. Implementation woes: CPU forward, CPU backward, GPU forward, GPU backward, all optimized

Implementing neural network layers is extremely time-consuming; you need to:

  1. Implement the forward pass
  2. Make it fast
  3. Implement the backward pass
  4. Make it fast
  5. Rinse and repeat on every accelerator/GPU backend

This requires several kinds of expertise in a team: linear algebra and high-performance computing. And both require a significant investment of time to achieve.

The only way to compete is by creating a compiler: only the forward pass would need to be implemented, and the rest is automatically derived.
The idea is very well captured in https://people.csail.mit.edu/tzumao/gradient_halide/ and has been discussed extensively in #347 .

This led to starting experiments on a deep learning compiler called Lux, in Laser:

Then I identified that we need a multithreading runtime to power this compiler, which led to https://github.com/mratsim/weave/

Then I got pulled by other projects.

Fortunately, there is another project with the same idea, https://github.com/can-lehmann/exprgrad, which went as far as generating OpenCL code.
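
As a toy illustration of the "write the forward pass once, derive the gradient automatically" idea (forward-mode dual numbers for brevity; Lux and exprgrad actually target reverse mode and code generation):

import std/math

type Dual = object
  val, der: float64      # value and derivative w.r.t. the chosen input

proc dual(x: float64, seed = 0.0): Dual = Dual(val: x, der: seed)
proc `+`(a, b: Dual): Dual = Dual(val: a.val + b.val, der: a.der + b.der)
proc `*`(a, b: Dual): Dual =
  Dual(val: a.val * b.val, der: a.der * b.val + a.val * b.der)
proc tanh(a: Dual): Dual =
  Dual(val: tanh(a.val), der: a.der * (1.0 - tanh(a.val)^2))

# The "forward pass" is written once; its derivative comes along for free.
proc neuron(w, x, b: Dual): Dual = tanh(w*x + b)

when isMainModule:
  # Seed w's derivative with 1 to get df/dw.
  let y = neuron(dual(0.5, seed = 1.0), dual(2.0), dual(0.1))
  echo y.val, " ", y.der   # f and df/dw at w = 0.5, x = 2.0, b = 0.1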

@mratsim
Owner Author

mratsim commented Dec 31, 2023

Neural network backend - 3. Ergonomic serialization and deserialization of models

This is a very important problem at the moment:

References:

The issue is that models/layers are statically typed, which means you need the end-user to properly type everything.
This is a significant ergonomic limitation when other frameworks can just load a .onnx file and be done.
It also prevents Arraymancer from being used as a (pre-compiled) library, as you would need recompilation.

The solution is to type-erase all layers and models, via inherited ref objects for example.
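
A hedged sketch of that direction (illustrative layer names and fields, not the planned Arraymancer design): layers share an inherited ref-object base, so a model becomes a runtime seq[Layer] that could be built from a file:

import arraymancer

type
  Layer = ref object of RootObj
  Dense = ref object of Layer
    weight: Tensor[float32]
  Relu = ref object of Layer

method forward(l: Layer, x: Tensor[float32]): Tensor[float32] {.base.} =
  raise newException(CatchableError, "forward not implemented for base Layer")

method forward(l: Dense, x: Tensor[float32]): Tensor[float32] =
  x * l.weight            # 2D `*` is matrix multiplication

method forward(l: Relu, x: Tensor[float32]): Tensor[float32] =
  x.relu

when isMainModule:
  # The model is now runtime data; a deserializer could build it from .onnx.
  let model: seq[Layer] = @[
    Layer(Dense(weight: ones[float32](4, 8))),
    Layer(Relu())
  ]
  var h = ones[float32](2, 4)   # batch of 2 samples, 4 features
  for layer in model:
    h = layer.forward(h)
  echo h.shape                  # [2, 8]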

@mratsim
Owner Author

mratsim commented Dec 31, 2023

Neural network backend - 4. Slowness of reduction operations like sigmoid or softmax the more cores you have

Another key issue that causes scaling problems: reduction operations are pervasive in deep learning, since any loss function is a reduction.

See

Weave and Constantine's threadpool take this slowness into account; see the benchmarks:

In summary, we need to allow nested parallelism and use loop tiling / loop blocking as well.
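
A minimal sketch of the loop-blocking idea for a sum reduction (plain serial Nim; in practice each block becomes a task for Weave or the threadpool, and the partial sums are combined in a tree):

proc blockedSum(x: openArray[float64], blockSize = 1024): float64 =
  ## Accumulate independent per-block partial sums instead of one long
  ## dependent chain of additions; blocks are the natural unit of parallelism.
  var partials: seq[float64]
  var i = 0
  while i < x.len:
    let hi = min(i + blockSize, x.len)
    var acc = 0.0
    for j in i ..< hi:
      acc += x[j]
    partials.add acc
    i = hi
  for p in partials:     # final combine; a parallel version uses a tree reduction
    result += p

when isMainModule:
  var data = newSeq[float64](100_000)
  for k in 0 ..< data.len:
    data[k] = 1.0
  echo blockedSum(data)  # 100000.0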

@mratsim
Owner Author

mratsim commented Dec 31, 2023

Reserved

@mratsim
Owner Author

mratsim commented Dec 31, 2023

Summary

In summary, here is how I see making further progress on Arraymancer, tensor libraries and deep learning in Nim.

  1. Create a compiler for tensor arithmetic which supports CPU, OpenCL, Nvidia backends.
  2. Use Weave or Constantine's threadpool instead of OpenMP for the CPU parallel runtime.
  3. Implement a pure Nim BLAS & Lapack replacement. This may or may not use the compiler.
  4. Port the Arraymancer primitives to that compiler, threadpool and BLAS/Lapack.
  5. Create a new library with type-erased tensors and (de)serializable models, focused on scientific computing interop (.csv, .onnx, .tfrecords) and deep learning.
  6. Deprecate the deep learning part of Arraymancer, keeping only Numpy/Scipy/Tensor, and redirect deep learning needs to that new library.

@Niminem
Contributor

Niminem commented Jan 1, 2024

@mratsim Appreciate all the work you've done for the community man, not just for this library. Some time in Q2 I'll be able to return to a hobby project and will look into improving Arraymancer.

I wonder if we can trim the need for various backends by leveraging WebGPU. It has first-class support for compute shaders, it's nearly as fast as Vulkan, it's portable to EVERY device imaginable, and it can be run natively (it isn't just for the browser). I think that reduces a lot of the workload. For the CPU backend, 100% we'll want to use something like Weave and a pure Nim BLAS implementation (how hard can that be? :) )

@AngelEzquerra
Contributor

For the tensor part of the library I also think that there are a few important features that are still missing. For example, we don’t support most of the tensor manipulation features (e.g. add, insert, roll…) and also some important signal processing algorithms (FFT, filtering and convolution).

@hlclemson

For the tensor part of the library I also think that there are a few important features that are still missing. For example, we don’t support most of the tensor manipulation features (e.g. add, insert, roll…) and also some important signal processing algorithms (FFT, filtering and convolution).

Is it possible to use FFTW as a backend for the FFT implementation? I want to contribute to this project but I am not sure if the DFT algorithm needs to be written from scratch or whether an existing library can be used.

@mratsim
Owner Author

mratsim commented Jan 19, 2024

Is it possible to use FFTW as a backend for the FFT implementation? I want to contribute to this project but I am not sure if the DFT algorithm needs to be written from scratch or whether an existing library can be used.

FFTW cannot be distributed with Arraymancer due to its GPL license, but it's available as a separate package: https://github.com/SciNim/nimfftw3

What can be distributed with Arraymancer is https://github.com/scinim/impulse

also some important signal processing algorithms (FFT, filtering and convolution).

@AngelEzquerra, see https://github.com/scinim/impulse for a low-level implementation.

@AngelEzquerra
Contributor

I didn't know about impulse. I can have a look, but what about @arnetheduck's nim-fftr (https://github.com/arnetheduck/nim-fftr)? It's MIT licensed and according to his benchmarks it's pretty fast. I've played a bit with it (and even made a couple of PRs to it) and it is pretty easy to use...

@mratsim
Owner Author

mratsim commented Jan 22, 2024

The gap is large at the moment. And for large FFTs you really need multithreading.

fftr

   1.423 ms    1.461 ms  ±0.017  x1000 pow2 - 65536
 244.258 ms  246.235 ms  ±1.584    x21 prime - 746497
  47.279 ms   47.748 ms  ±0.308   x105 prime-power - 160801
   1.915 ms    1.959 ms  ±0.023  x1000 mult-of-power-of-2 - 20736
   4.171 ms    4.251 ms  ±0.037  x1000 small-comp-large-prime -
  11.567 ms   11.741 ms  ±0.067   x426 small-comp - 100000

fftw

   0.285 ms    0.471 ms  ±0.348  x1000 pow2 - 65536
  55.563 ms   63.118 ms  ±8.796    x79 prime - 746497
   5.052 ms    5.345 ms  ±0.471   x904 prime-power - 160801
   0.114 ms    0.137 ms  ±0.033  x1000 mult-of-power-of-2 - 20736
   0.711 ms    0.754 ms  ±0.025  x1000 small-comp-large-prime - 30270
   0.529 ms    0.560 ms  ±0.033  x1000 small-comp - 100000

@arnetheduck

For prime powers, fftr uses Bluestein, which is quite inefficient, i.e. it's an algorithm implementation away from being on par with fftw (if anything, https://github.com/ejmahler/RustFFT is one of the fastest ones, beating fftw too). Re multithreading, that's something you'd set up outside of the core FFT algorithm, I suspect.

@mratsim
Owner Author

mratsim commented Jan 22, 2024

re multithreading, that's something you'd set up outside of the core fft algorithm I suspect

For load balancing, recursive divide-and-conquer algorithms are best because they ensure that there are plenty of tasks, so no CPU is starved. This is very easy with Cooley-Tukey expressed as function calls: https://github.com/mratsim/constantine/blob/bc5faaaef8e270b6e4913a704e1132f96bfe7349/research/kzg/fft_fr.nim#L91-L150

i.e. you replace all fft_internal calls with spawn fft_internal

func simpleFT[F](
       output: var View[F],
       vals: View[F],
       rootsOfUnity: View[F]
     ) =
  # FFT is a recursive algorithm
  # This is the base-case using a O(n²) algorithm

  let L = output.len
  var last {.noInit.}, v {.noInit.}: F

  for i in 0 ..< L:
    last.prod(vals[0], rootsOfUnity[0])
    for j in 1 ..< L:
      v.prod(vals[j], rootsOfUnity[(i*j) mod L])
      last += v
    output[i] = last

func fft_internal[F](
       output: var View[F],
       vals: View[F],
       rootsOfUnity: View[F]
     ) =
  if output.len <= 4:
    simpleFT(output, vals, rootsOfUnity)
    return

  # Recursive Divide-and-Conquer
  let (evenVals, oddVals) = vals.splitAlternate()
  var (outLeft, outRight) = output.splitMiddle()
  let halfROI = rootsOfUnity.skipHalf()

  fft_internal(outLeft, evenVals, halfROI)
  fft_internal(outRight, oddVals, halfROI)

  let half = outLeft.len
  var y_times_root{.noinit.}: F

  for i in 0 ..< half:
    # FFT Butterfly
    y_times_root   .prod(output[i+half], rootsOfUnity[i])
    output[i+half] .diff(output[i], y_times_root)
    output[i]      += y_times_root

func fft*[F](
       desc: FFTDescriptor[F],
       output: var openarray[F],
       vals: openarray[F]): FFT_Status =
  if vals.len > desc.maxWidth:
    return FFTS_TooManyValues
  if not vals.len.uint64.isPowerOf2_vartime():
    return FFTS_SizeNotPowerOfTwo

  let rootz = desc.expandedRootsOfUnity
                  .toView()
                  .slice(0, desc.maxWidth-1, desc.maxWidth div vals.len)

  var voutput = output.toView()
  fft_internal(voutput, vals.toView(), rootz)
  return FFTS_Success

If using the for-loop approach, like Wikipedia's second pseudocode:

algorithm iterative-fft is
    input: Array a of n complex values where n is a power of 2.
    output: Array A the DFT of a.

    bit-reverse-copy(a, A)
    n ← a.length
    for s = 1 to log(n) do
        m ← 2^s
        ω_m ← exp(−2πi/m)
        for k = 0 to n-1 by m do
            ω ← 1
            for j = 0 to m/2 – 1 do
                t ← ω A[k + j + m/2]
                u ← A[k + j]
                A[k + j] ← u + t
                A[k + j + m/2] ← u – t
                ω ← ω ω_m

    return A

We can use parallel-for loops.

But an interface/concept for either spawning tasks or parallelizing loops is needed, and it then needs to be used for effective parallelization of the FFT.
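
As a hedged sketch of such an interface (illustrative names, not an existing Weave or Constantine API): the FFT is written once against a minimal "spawner", and the actual runtime, Weave, Constantine's threadpool, or a serial fallback, plugs in behind it:

type
  SpawnProc = proc (task: proc ())
    ## Minimal task interface a runtime must provide; a real one also
    ## needs a way to wait for completion (omitted for brevity).

proc serialSpawner(task: proc ()) =
  ## Fallback runtime: run the task immediately, without parallelism.
  task()

proc fftStep(depth: int, spawner: SpawnProc) =
  ## Stand-in for fft_internal: the recursion hands one half to the runtime.
  if depth == 0:
    return
  spawner(proc () = fftStep(depth - 1, spawner))   # "even" half, maybe parallel
  fftStep(depth - 1, spawner)                      # "odd" half, on this thread

when isMainModule:
  fftStep(3, serialSpawner)
  echo "done"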

@AngelEzquerra
Contributor

The gap is large at the moment. And for large FFTs you really need multithreading.

fftr

   1.423 ms    1.461 ms  ±0.017  x1000 pow2 - 65536
 244.258 ms  246.235 ms  ±1.584    x21 prime - 746497
  47.279 ms   47.748 ms  ±0.308   x105 prime-power - 160801
   1.915 ms    1.959 ms  ±0.023  x1000 mult-of-power-of-2 - 20736
   4.171 ms    4.251 ms  ±0.037  x1000 small-comp-large-prime -
  11.567 ms   11.741 ms  ±0.067   x426 small-comp - 100000

fftw

   0.285 ms    0.471 ms  ±0.348  x1000 pow2 - 65536
  55.563 ms   63.118 ms  ±8.796    x79 prime - 746497
   5.052 ms    5.345 ms  ±0.471   x904 prime-power - 160801
   0.114 ms    0.137 ms  ±0.033  x1000 mult-of-power-of-2 - 20736
   0.711 ms    0.754 ms  ±0.025  x1000 small-comp-large-prime - 30270
   0.529 ms    0.560 ms  ±0.033  x1000 small-comp - 100000

Do you have similar measurements for impulse?

@mratsim
Owner Author

mratsim commented Jan 23, 2024

Not at the moment; it would be interesting to make some. impulse currently uses the same backend as numpy/scipy.

@AngelEzquerra
Contributor

@mratsim, I tried to install impulse via nimble but it failed. Assuming we used it to implement the FFT in arraymancer, would you expect to embed it in arraymancer or to install it as a dependency? If the latter, I assume it would require impulse to be a proper nimble package, right?
