This repository has been archived by the owner on Mar 12, 2021. It is now read-only.

Sum function is slow #679

Closed
clintonTE opened this issue Apr 11, 2020 · 8 comments

@clintonTE

Describe the bug
Summing a CuArray is very slow: slower than allocating a new vector of ones and taking a dot product, and much slower than taking a dot product with a pre-allocated vector of ones.

To Reproduce
The Minimal Working Example (MWE) for this bug:

using Flux, BenchmarkTools, LinearAlgebra
import CuArrays

CuArrays.allowscalar(false)

function mwesum(N)
  # Baseline: sum on the CPU
  cpuv = rand(Float32, N)
  print("\nStandard cpu sum: ")
  @btime sum($cpuv)

  # Sum the same data on the GPU
  cuv = CuArrays.cu(cpuv)
  print("Standard cuda sum: ")
  @btime sum($cuv)

  # Workaround: dot product with a freshly allocated vector of ones
  print("Summing using allocating a vector of 1s and dot: ")
  onesum(v) = dot(v, CuArrays.ones(N))
  @btime $onesum($cuv)

  # Workaround: dot product with a pre-allocated vector of ones
  cuvones = CuArrays.ones(N)
  print("Summing just using dot and a pre-allocated vector of 1s: ")
  onesum2(v) = dot(v, cuvones)
  @btime $onesum2($cuv)
end

sleep(0.5)
mwesum(10^8)

Output:

Standard cpu sum:   20.484 ms (0 allocations: 0 bytes)
Standard cuda sum:   32.141 ms (186 allocations: 6.55 KiB)
Summing using allocating a vector of 1s and dot:   14.257 ms (18 allocations: 352 bytes)
Summing just using dot and a pre-allocated vector of 1s:   2.582 ms (3 allocations: 48 bytes)

Expected behavior
Summing on the GPU should be at least as fast as the dot-product workarounds.

Build log

   Building WebIO ── `C:\Users\Clinton\.julia\packages\WebIO\2mZPb\deps\build.log`
   Building NNlib ── `C:\Users\Clinton\.julia\packages\NNlib\FAI3o\deps\build.log`
   Building Libtask  `C:\Users\Clinton\.julia\packages\Libtask\GQPaW\deps\build.log`
   Building ZipFile  `C:\Users\Clinton\.julia\packages\ZipFile\DW0Qr\deps\build.log`
   Building CMake ── `C:\Users\Clinton\.julia\packages\CMake\ULbyn\deps\build.log`
   Building NLopt ── `C:\Users\Clinton\.julia\packages\NLopt\eqN9a\deps\build.log`

Environment details
Details on Julia:

Julia Version 1.4.0
Commit b8e9a9ecc6 (2020-03-21 16:36 UTC)       
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
  WORD_SIZE: 64    
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = "C:\Users\Clinton\AppData\Local\atom\app-1.45.0\atom.exe"  -a
  JULIA_NUM_THREADS = 6
Julia packages:
  [c52e3926] Atom v0.12.10
  [6e4b80f9] BenchmarkTools v0.5.0
  [336ed68f] CSV v0.6.0
  [3895d2a7] CUDAapi v4.0.0
  [be33ccc6] CUDAnative v3.0.4
  [324d7699] CategoricalArrays v0.7.7
  [5ba52731] CodecLz4 v0.4.0
  [944b1d66] CodecZlib v0.7.0
  [6b39b394] CodecZstd v0.7.0
  [3a865a2d] CuArrays v2.0.1
  [a93c6f00] DataFrames v0.20.2
  [864edb3b] DataStructures v0.17.11
  [31c24e10] Distributions v0.23.2
  [2167859e] Finometrics v0.1.0 #master (https://github.com/clintonTE/Finometrics)
  [587475ba] Flux v0.10.4 #master (https://github.com/FluxML/Flux.jl.git)
  [59287772] Formatting v0.4.1
  [38e38edf] GLM v1.3.9
  [e5e0dc1b] Juno v0.8.1
  [442fdcdd] Measures v0.3.1
  [76087f3c] NLopt v0.5.1
  [37e2e3b7] ReverseDiff v1.1.0
  [295af30f] Revise v2.6.0
  [90137ffa] StaticArrays v0.12.1
  [2913bbd2] StatsBase v0.32.2
  [9f7883ad] Tracker v0.2.6
  [fce5fe82] Turing v0.10.1
  [112f6efa] VegaLite v2.0.1
  [fdbf4ff8] XLSX v0.6.1
  [e88e6eb3] Zygote v0.4.13
  [ade2ca70] Dates
  [8ba89e20] Distributed
  [37e2e46d] LinearAlgebra
  [9a3f8284] Random
  [9e88b42a] Serialization 

CUDA: toolkit and driver version
Toolkit: 10.2
Driver: 441.66

@clintonTE clintonTE added the bug label Apr 11, 2020
@maleadt
Member

maleadt commented Apr 13, 2020

Please try again with latest master:

Standard cpu sum:   20.186 ms (0 allocations: 0 bytes)
Standard cuda sum:   1.527 ms (358 allocations: 11.69 KiB)
Summing using allocating a vector of 1s and dot:   3.392 ms (18 allocations: 352 bytes)
Summing just using dot and a pre-allocated vector of 1s:   2.050 ms (3 allocations: 48 bytes)

It's not surprising that dot, which dispatches to highly optimized CUBLAS kernels in this case, performs well. Our sum kernel ultimately has to deal with quite a lot of flexibility (e.g. an optional mapping function, reduction dimensions, arbitrary element types, etc.).
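Purely as an illustrative sketch (not part of the benchmarks above): the generic `sum` entry point has to handle calls like the following on the same array, whereas `dot` maps onto a single specialized CUBLAS routine.

```julia
using CuArrays, LinearAlgebra

x = CuArrays.cu(rand(Float32, 10^6))

sum(x)                            # plain reduction over the whole array
sum(abs2, x)                      # with an element-wise mapping function
sum(reshape(x, 1000, :); dims=1)  # reduction along a chosen dimension
dot(x, x)                         # by contrast, this hits one CUBLAS kernel
```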

@maleadt maleadt added performance and removed bug labels Apr 13, 2020
@clintonTE
Author

Makes sense regarding the dot product. Master doesn't work for me (though the release version still works). Here is the output:

Standard cpu sum:   20.667 ms (0 allocations: 0 bytes)
Standard cuda sum: ERROR: LoadError: Not implemented
Stacktrace:
 [1] error(::String) at .\error.jl:33
 [2] mapreducedim!(::Function, ::Function, ::CuArrays.CuArray{Float32,1,Nothing}, ::CuArrays.CuArray{Float32,1,Nothing}, 
::Float32) at C:\Users\Clinton\.julia\packages\GPUArrays\QDGmr\src\host\mapreduce.jl:5
 [3] mapreduce(::Function, ::Function, ::CuArrays.CuArray{Float32,1,Nothing}; dims::Function, init::Nothing) at C:\Users\Clinton\.julia\packages\GPUArrays\QDGmr\src\host\mapreduce.jl:38
 [4] mapreduce at C:\Users\Clinton\.julia\packages\GPUArrays\QDGmr\src\host\mapreduce.jl:24 [inlined]
 [5] _sum(::Function, ::CuArrays.CuArray{Float32,1,Nothing}, ::Colon) at .\reducedim.jl:657
 [6] _sum(::CuArrays.CuArray{Float32,1,Nothing}, ::Colon) at .\reducedim.jl:656
 [7] #sum#583 at .\reducedim.jl:652 [inlined]
 [8] sum at .\reducedim.jl:652 [inlined]
 [9] ##core#552(::CuArrays.CuArray{Float32,1,Nothing}) at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:371
 [10] ##sample#553(::BenchmarkTools.Parameters) at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:377
 [11] _run(::BenchmarkTools.Benchmark{Symbol("##benchmark#551")}, ::BenchmarkTools.Parameters; verbose::Bool, pad::String, kwargs::Base.Iterators.Pairs{Symbol,Integer,NTuple{4,Symbol},NamedTuple{(:samples, :evals, :gctrial, :gcsample),Tuple{Int64,Int64,Bool,Bool}}}) at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:405
 [12] (::Base.var"#inner#2"{Base.Iterators.Pairs{Symbol,Integer,NTuple{5,Symbol},NamedTuple{(:verbose, :samples, :evals, 
:gctrial, :gcsample),Tuple{Bool,Int64,Int64,Bool,Bool}}},typeof(BenchmarkTools._run),Tuple{BenchmarkTools.Benchmark{Symbol("##benchmark#551")},BenchmarkTools.Parameters}})() at .\essentials.jl:715
 [13] #invokelatest#1 at .\essentials.jl:716 [inlined]
 [14] #run_result#37 at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:32 [inlined]
 [15] run(::BenchmarkTools.Benchmark{Symbol("##benchmark#551")}, ::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, kwargs::Base.Iterators.Pairs{Symbol,Integer,NTuple{5,Symbol},NamedTuple{(:verbose, :samples, 
:evals, :gctrial, :gcsample),Tuple{Bool,Int64,Int64,Bool,Bool}}}) at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:94
 [16] #warmup#45 at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:141 [inlined]
 [17] warmup(::BenchmarkTools.Benchmark{Symbol("##benchmark#551")}) at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:141
 [18] macro expansion at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:481 [inlined]
 [19] mwesum(::Int64) at C:\Users\Clinton\Dropbox\Projects\Capacity\mwe.jl:383
 [20] top-level scope at C:\Users\Clinton\Dropbox\Projects\Capacity\mwe.jl:396
in expression starting at C:\Users\Clinton\Dropbox\Projects\Capacity\mwe.jl:396


@maleadt
Member

maleadt commented Apr 13, 2020

You need to upgrade both GPUArrays and CuArrays to master.
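For example, one way to track master for both packages from the Julia REPL (assuming the default registry; adapt to your own environment):

```julia
using Pkg
Pkg.add(PackageSpec(name="GPUArrays", rev="master"))
Pkg.add(PackageSpec(name="CuArrays", rev="master"))
```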

@clintonTE
Author

Perfect, that worked. Thanks for having a look!

Standard cpu sum:   20.455 ms (0 allocations: 0 bytes)
Standard cuda sum:   1.999 ms (338 allocations: 11.31 KiB)
Summing using allocation and dot:   14.213 ms (18 allocations: 352 bytes)
Summing just using dot and a pre-allocated vector of 1s:   2.587 ms (3 allocations: 48 bytes)

@tverho

tverho commented Jul 6, 2020

With smaller arrays, the CUDA performance is rather abysmal. For N = 256^2:

Standard cpu sum:   4.131 μs (0 allocations: 0 bytes)
Standard cuda sum:   148.908 μs (356 allocations: 11.66 KiB)
Summing using allocating a vector of 1s and dot:   16.415 μs (14 allocations: 272 bytes)
Summing just using dot and a pre-allocated vector of 1s:   12.562 μs (3 allocations: 48 bytes)

@clintonTE
Author

It's actually a bit worse than this: I should have included CUDA.@sync in the benchmarks. With synchronization, the last two results are at least in the same ballpark as the sum function. Here is an updated MWE.

using Revise, LinearAlgebra, CUDA, BenchmarkTools

CUDA.allowscalar(false)

function mwesum(N)
  # Baseline: sum on the CPU
  cpuv = rand(Float32, N)
  print("\nStandard cpu sum: ")
  @btime CUDA.@sync sum($cpuv)

  # Sum the same data on the GPU, synchronizing before timing
  cuv = CUDA.cu(cpuv)
  print("Standard cuda sum: ")
  @btime CUDA.@sync sum($cuv)

  # Workaround: dot product with a freshly allocated vector of ones
  print("Summing using allocating a vector of 1s and dot: ")
  onesum(v) = dot(v, CUDA.ones(N))
  @btime CUDA.@sync $onesum($cuv)

  # Workaround: dot product with a pre-allocated vector of ones
  cuvones = CUDA.ones(N)
  print("Summing just using dot and a pre-allocated vector of 1s: ")
  onesum2(v) = dot(v, cuvones)
  @btime CUDA.@sync $onesum2($cuv)
end

sleep(0.5)
mwesum(256^2)

This gives me:

Standard cpu sum:   4.471 μs (0 allocations: 0 bytes)
Standard cuda sum:   255.800 μs (366 allocations: 11.83 KiB)
Summing using allocating a vector of 1s and dot:   108.499 μs (24 allocations: 448 bytes)
Summing just using dot and a pre-allocated vector of 1s:   106.599 μs (13 allocations: 224 bytes)

If the issue is worth reopening, perhaps it should be moved to CUDA.jl?

@maleadt
Member

maleadt commented Jul 6, 2020

Sure, feel free to open an issue about the performance on small arrays. Do know that the kernel launch overhead is already several microseconds, let alone transferring the memory, so it's never going to be fast. And it's not possible to fall back to a CPU-based implementation here; that should be done at a higher level.
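As a sketch of what such a higher-level fallback might look like (the `smartsum` name and the cutoff are hypothetical, not something CuArrays provides):

```julia
using CUDA

# Hypothetical illustration only: a caller-side fallback that copies small
# arrays to the host and sums there, while keeping large arrays on the GPU.
const SMALL_ARRAY_THRESHOLD = 2^16  # made-up cutoff; would need tuning

smartsum(x::AbstractArray) = sum(x)
smartsum(x::CuArray) =
    length(x) <= SMALL_ARRAY_THRESHOLD ? sum(Array(x)) : sum(x)
```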

@tverho

tverho commented Jul 7, 2020

What do you mean by transferring the memory? Normally the array is in the GPU memory to begin with if the code uses CuArrays.

Falling back to CPU summing is not very fast either, because that does involve a memory transfer. For small arrays it's faster than the CUDA sum, but much slower than summing an array that is already on the CPU. I added a case for this to the example:

using LinearAlgebra, CUDA, BenchmarkTools

CUDA.allowscalar(false)

function mwesum(N)
  # Baseline: sum on the CPU
  cpuv = rand(Float32, N)
  print("\nStandard cpu sum: ")
  @btime CUDA.@sync sum($cpuv)

  # Sum the same data on the GPU
  cuv = CUDA.cu(cpuv)
  print("Standard cuda sum: ")
  @btime CUDA.@sync sum($cuv)

  # Fallback: copy back to the CPU, then sum there
  print("Transfer to CPU and sum: ")
  @btime CUDA.@sync sum(collect($cuv))

  # Workaround: dot product with a freshly allocated vector of ones
  print("Summing using allocating a vector of 1s and dot: ")
  onesum(v) = dot(v, CUDA.ones(N))
  @btime CUDA.@sync $onesum($cuv)

  # Workaround: dot product with a pre-allocated vector of ones
  cuvones = CUDA.ones(N)
  print("Summing just using dot and a pre-allocated vector of 1s: ")
  onesum2(v) = dot(v, cuvones)
  @btime CUDA.@sync $onesum2($cuv)
end

sleep(0.5)
mwesum(256^2)

Output:

Standard cpu sum:   10.966 μs (10 allocations: 176 bytes)
Standard cuda sum:   151.740 μs (366 allocations: 11.83 KiB)
Transfer to CPU and sum:   62.251 μs (14 allocations: 256.28 KiB)
Summing using allocating a vector of 1s and dot:   24.281 μs (25 allocations: 464 bytes)
Summing just using dot and a pre-allocated vector of 1s:   20.161 μs (14 allocations: 240 bytes)
