This repository has been archived by the owner on Mar 12, 2021. It is now read-only.

Sum function is slow #679

Closed
clintonTE opened this issue Apr 11, 2020 · 8 comments

@clintonTE

Describe the bug
Summing a CuArray is very slow: slower than allocating a new vector of ones and taking a dot product, and much slower than taking a dot product with a pre-allocated vector of ones.

To Reproduce
The Minimal Working Example (MWE) for this bug:

using Flux, BenchmarkTools, LinearAlgebra
import CuArrays

CuArrays.allowscalar(false)

function mwesum(N)
  # Baseline: sum on the CPU
  cpuv = rand(Float32, N)
  print("\nStandard cpu sum: ")
  @btime sum($cpuv)

  # Sum the same data on the GPU
  cuv = CuArrays.cu(cpuv)
  print("Standard cuda sum: ")
  @btime sum($cuv)

  # Workaround: dot product with a freshly allocated vector of ones
  print("Summing using allocating a vector of 1s and dot: ")
  onesum(v) = dot(v, CuArrays.ones(N))
  @btime $onesum($cuv)

  # Workaround: dot product with a pre-allocated vector of ones
  cuvones = CuArrays.ones(N)
  print("Summing just using dot and a pre-allocated vector of 1s: ")
  onesum2(v) = dot(v, cuvones)
  @btime $onesum2($cuv)
end

sleep(0.5)
mwesum(10^8)

Output:

Standard cpu sum:   20.484 ms (0 allocations: 0 bytes)
Standard cuda sum:   32.141 ms (186 allocations: 6.55 KiB)
Summing using allocating a vector of 1s and dot:   14.257 ms (18 allocations: 352 bytes)
Summing just using dot and a pre-allocated vector of 1s:   2.582 ms (3 allocations: 48 bytes)

Expected behavior
Summing on the GPU should be at least as fast as the dot-product workarounds.

Build log

   Building WebIO ── `C:\Users\Clinton\.julia\packages\WebIO\2mZPb\deps\build.log`
   Building NNlib ── `C:\Users\Clinton\.julia\packages\NNlib\FAI3o\deps\build.log`
   Building Libtask  `C:\Users\Clinton\.julia\packages\Libtask\GQPaW\deps\build.log`
   Building ZipFile  `C:\Users\Clinton\.julia\packages\ZipFile\DW0Qr\deps\build.log`
   Building CMake ── `C:\Users\Clinton\.julia\packages\CMake\ULbyn\deps\build.log`
   Building NLopt ── `C:\Users\Clinton\.julia\packages\NLopt\eqN9a\deps\build.log`

Environment details
Details on Julia:

Julia Version 1.4.0
Commit b8e9a9ecc6 (2020-03-21 16:36 UTC)       
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
  WORD_SIZE: 64    
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = "C:\Users\Clinton\AppData\Local\atom\app-1.45.0\atom.exe"  -a
  JULIA_NUM_THREADS = 6
Julia packages:
  [c52e3926] Atom v0.12.10
  [6e4b80f9] BenchmarkTools v0.5.0
  [336ed68f] CSV v0.6.0
  [3895d2a7] CUDAapi v4.0.0
  [be33ccc6] CUDAnative v3.0.4
  [324d7699] CategoricalArrays v0.7.7
  [5ba52731] CodecLz4 v0.4.0
  [944b1d66] CodecZlib v0.7.0
  [6b39b394] CodecZstd v0.7.0
  [3a865a2d] CuArrays v2.0.1
  [a93c6f00] DataFrames v0.20.2
  [864edb3b] DataStructures v0.17.11
  [31c24e10] Distributions v0.23.2
  [2167859e] Finometrics v0.1.0 #master (https://github.com/clintonTE/Finometrics)
  [587475ba] Flux v0.10.4 #master (https://github.com/FluxML/Flux.jl.git)
  [59287772] Formatting v0.4.1
  [38e38edf] GLM v1.3.9
  [e5e0dc1b] Juno v0.8.1
  [442fdcdd] Measures v0.3.1
  [76087f3c] NLopt v0.5.1
  [37e2e3b7] ReverseDiff v1.1.0
  [295af30f] Revise v2.6.0
  [90137ffa] StaticArrays v0.12.1
  [2913bbd2] StatsBase v0.32.2
  [9f7883ad] Tracker v0.2.6
  [fce5fe82] Turing v0.10.1
  [112f6efa] VegaLite v2.0.1
  [fdbf4ff8] XLSX v0.6.1
  [e88e6eb3] Zygote v0.4.13
  [ade2ca70] Dates
  [8ba89e20] Distributed
  [37e2e46d] LinearAlgebra
  [9a3f8284] Random
  [9e88b42a] Serialization 

CUDA: toolkit and driver version
Toolkit: 10.2
Driver: 441.66

@clintonTE clintonTE added the bug label Apr 11, 2020
@maleadt
Member

maleadt commented Apr 13, 2020

Please try again with latest master:

Standard cpu sum:   20.186 ms (0 allocations: 0 bytes)
Standard cuda sum:   1.527 ms (358 allocations: 11.69 KiB)
Summing using allocating a vector of 1s and dot:   3.392 ms (18 allocations: 352 bytes)
Summing just using dot and a pre-allocated vector of 1s:   2.050 ms (3 allocations: 48 bytes)

It's not surprising that dot, which dispatches to highly optimized CUBLAS kernels in this case, performs well. Our sum kernel ultimately has to deal with quite a lot of flexibility (e.g. an optional mapping function, reduction dimensions, arbitrary element types, etc.).
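Purely as an illustrative sketch (not part of the benchmarks above): the generic `sum` entry point has to handle calls like the following on the same array, whereas `dot` maps onto a single specialized CUBLAS routine.

```julia
using CuArrays, LinearAlgebra

x = CuArrays.cu(rand(Float32, 10^6))

sum(x)                            # plain reduction over the whole array
sum(abs2, x)                      # with an element-wise mapping function
sum(reshape(x, 1000, :); dims=1)  # reduction along a chosen dimension
dot(x, x)                         # by contrast, this hits one CUBLAS kernel
```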

@maleadt maleadt added performance and removed bug labels Apr 13, 2020
@clintonTE
Author

Makes sense regarding the dot product. Master doesn't work for me (though the release version still works). Here is the output:

Standard cpu sum:   20.667 ms (0 allocations: 0 bytes)
Standard cuda sum: ERROR: LoadError: Not implemented
Stacktrace:
 [1] error(::String) at .\error.jl:33
 [2] mapreducedim!(::Function, ::Function, ::CuArrays.CuArray{Float32,1,Nothing}, ::CuArrays.CuArray{Float32,1,Nothing}, 
::Float32) at C:\Users\Clinton\.julia\packages\GPUArrays\QDGmr\src\host\mapreduce.jl:5
 [3] mapreduce(::Function, ::Function, ::CuArrays.CuArray{Float32,1,Nothing}; dims::Function, init::Nothing) at C:\Users\Clinton\.julia\packages\GPUArrays\QDGmr\src\host\mapreduce.jl:38
 [4] mapreduce at C:\Users\Clinton\.julia\packages\GPUArrays\QDGmr\src\host\mapreduce.jl:24 [inlined]
 [5] _sum(::Function, ::CuArrays.CuArray{Float32,1,Nothing}, ::Colon) at .\reducedim.jl:657
 [6] _sum(::CuArrays.CuArray{Float32,1,Nothing}, ::Colon) at .\reducedim.jl:656
 [7] #sum#583 at .\reducedim.jl:652 [inlined]
 [8] sum at .\reducedim.jl:652 [inlined]
 [9] ##core#552(::CuArrays.CuArray{Float32,1,Nothing}) at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:371
 [10] ##sample#553(::BenchmarkTools.Parameters) at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:377
 [11] _run(::BenchmarkTools.Benchmark{Symbol("##benchmark#551")}, ::BenchmarkTools.Parameters; verbose::Bool, pad::String, kwargs::Base.Iterators.Pairs{Symbol,Integer,NTuple{4,Symbol},NamedTuple{(:samples, :evals, :gctrial, :gcsample),Tuple{Int64,Int64,Bool,Bool}}}) at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:405
 [12] (::Base.var"#inner#2"{Base.Iterators.Pairs{Symbol,Integer,NTuple{5,Symbol},NamedTuple{(:verbose, :samples, :evals, 
:gctrial, :gcsample),Tuple{Bool,Int64,Int64,Bool,Bool}}},typeof(BenchmarkTools._run),Tuple{BenchmarkTools.Benchmark{Symbol("##benchmark#551")},BenchmarkTools.Parameters}})() at .\essentials.jl:715
 [13] #invokelatest#1 at .\essentials.jl:716 [inlined]
 [14] #run_result#37 at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:32 [inlined]
 [15] run(::BenchmarkTools.Benchmark{Symbol("##benchmark#551")}, ::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, kwargs::Base.Iterators.Pairs{Symbol,Integer,NTuple{5,Symbol},NamedTuple{(:verbose, :samples, 
:evals, :gctrial, :gcsample),Tuple{Bool,Int64,Int64,Bool,Bool}}}) at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:94
 [16] #warmup#45 at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:141 [inlined]
 [17] warmup(::BenchmarkTools.Benchmark{Symbol("##benchmark#551")}) at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:141
 [18] macro expansion at C:\Users\Clinton\.julia\packages\BenchmarkTools\eCEpo\src\execution.jl:481 [inlined]
 [19] mwesum(::Int64) at C:\Users\Clinton\Dropbox\Projects\Capacity\mwe.jl:383
 [20] top-level scope at C:\Users\Clinton\Dropbox\Projects\Capacity\mwe.jl:396
in expression starting at C:\Users\Clinton\Dropbox\Projects\Capacity\mwe.jl:396


@maleadt
Member

maleadt commented Apr 13, 2020

You need to upgrade both GPUArrays and CuArrays to master.
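For example, one way to track master for both packages from the Julia REPL (assuming the default registry; adapt to your own environment):

```julia
using Pkg
Pkg.add(PackageSpec(name="GPUArrays", rev="master"))
Pkg.add(PackageSpec(name="CuArrays", rev="master"))
```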

@clintonTE
Author

Perfect, that worked. Thanks for having a look!

Standard cpu sum:   20.455 ms (0 allocations: 0 bytes)
Standard cuda sum:   1.999 ms (338 allocations: 11.31 KiB)
Summing using allocation and dot:   14.213 ms (18 allocations: 352 bytes)
Summing just using dot and a pre-allocated vector of 1s:   2.587 ms (3 allocations: 48 bytes)

@tverho

tverho commented Jul 6, 2020

With smaller arrays, the CUDA performance is rather abysmal. For N = 256^2:

Standard cpu sum:   4.131 μs (0 allocations: 0 bytes)
Standard cuda sum:   148.908 μs (356 allocations: 11.66 KiB)
Summing using allocating a vector of 1s and dot:   16.415 μs (14 allocations: 272 bytes)
Summing just using dot and a pre-allocated vector of 1s:   12.562 μs (3 allocations: 48 bytes)

@clintonTE
Author

It's actually a bit worse than this: I should have included CUDA.@sync in the benchmarks. With synchronization, the last two results are at least in the same ballpark as the sum function. Here is an updated MWE.

using Revise, LinearAlgebra, CUDA, BenchmarkTools

CUDA.allowscalar(false)

function mwesum(N)
  # Baseline: sum on the CPU
  cpuv = rand(Float32, N)
  print("\nStandard cpu sum: ")
  @btime CUDA.@sync sum($cpuv)

  # Sum the same data on the GPU, synchronizing before timing
  cuv = CUDA.cu(cpuv)
  print("Standard cuda sum: ")
  @btime CUDA.@sync sum($cuv)

  # Workaround: dot product with a freshly allocated vector of ones
  print("Summing using allocating a vector of 1s and dot: ")
  onesum(v) = dot(v, CUDA.ones(N))
  @btime CUDA.@sync $onesum($cuv)

  # Workaround: dot product with a pre-allocated vector of ones
  cuvones = CUDA.ones(N)
  print("Summing just using dot and a pre-allocated vector of 1s: ")
  onesum2(v) = dot(v, cuvones)
  @btime CUDA.@sync $onesum2($cuv)
end

sleep(0.5)
mwesum(256^2)

This gives me:

Standard cpu sum:   4.471 μs (0 allocations: 0 bytes)
Standard cuda sum:   255.800 μs (366 allocations: 11.83 KiB)
Summing using allocating a vector of 1s and dot:   108.499 μs (24 allocations: 448 bytes)
Summing just using dot and a pre-allocated vector of 1s:   106.599 μs (13 allocations: 224 bytes)

If the issue is worth reopening, perhaps it should be moved to CUDA.jl?

@maleadt
Member

maleadt commented Jul 6, 2020

Sure, feel free to open an issue about the performance on small arrays. Do know that the kernel launch overhead is already several microseconds, let alone transferring the memory, so it's never going to be fast. And it's not possible to fall back to a CPU-based implementation here; that should be done at a higher level.
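As a sketch of what such a higher-level fallback might look like (the `smartsum` name and the cutoff are hypothetical, not something CuArrays provides):

```julia
using CUDA

# Hypothetical illustration only: a caller-side fallback that copies small
# arrays to the host and sums there, while keeping large arrays on the GPU.
const SMALL_ARRAY_THRESHOLD = 2^16  # made-up cutoff; would need tuning

smartsum(x::AbstractArray) = sum(x)
smartsum(x::CuArray) =
    length(x) <= SMALL_ARRAY_THRESHOLD ? sum(Array(x)) : sum(x)
```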

@tverho

tverho commented Jul 7, 2020

What do you mean by transferring the memory? Normally the array is in the GPU memory to begin with if the code uses CuArrays.

Falling back to CPU summing is not very fast either, because that does involve a memory transfer. For small arrays it's faster than the CUDA sum, but much slower than summing an array that is already on the CPU. I added a case for this to the example:

using LinearAlgebra, CUDA, BenchmarkTools

CUDA.allowscalar(false)

function mwesum(N)
  # Baseline: sum on the CPU
  cpuv = rand(Float32, N)
  print("\nStandard cpu sum: ")
  @btime CUDA.@sync sum($cpuv)

  # Sum the same data on the GPU
  cuv = CUDA.cu(cpuv)
  print("Standard cuda sum: ")
  @btime CUDA.@sync sum($cuv)

  # Fallback: copy back to the CPU, then sum there
  print("Transfer to CPU and sum: ")
  @btime CUDA.@sync sum(collect($cuv))

  # Workaround: dot product with a freshly allocated vector of ones
  print("Summing using allocating a vector of 1s and dot: ")
  onesum(v) = dot(v, CUDA.ones(N))
  @btime CUDA.@sync $onesum($cuv)

  # Workaround: dot product with a pre-allocated vector of ones
  cuvones = CUDA.ones(N)
  print("Summing just using dot and a pre-allocated vector of 1s: ")
  onesum2(v) = dot(v, cuvones)
  @btime CUDA.@sync $onesum2($cuv)
end

sleep(0.5)
mwesum(256^2)

Output:

Standard cpu sum:   10.966 μs (10 allocations: 176 bytes)
Standard cuda sum:   151.740 μs (366 allocations: 11.83 KiB)
Transfer to CPU and sum:   62.251 μs (14 allocations: 256.28 KiB)
Summing using allocating a vector of 1s and dot:   24.281 μs (25 allocations: 464 bytes)
Summing just using dot and a pre-allocated vector of 1s:   20.161 μs (14 allocations: 240 bytes)
