In [1]:
using Revise
using CUDA
using BenchmarkTools
A = CUDA.rand(Float32, 1024, 1024);
B = CUDA.rand(Float32, 1024, 1024);

┌ Info: Precompiling Revise [295af30f-e4ad-537b-8983-00126c2a3abe]
└ @ Base loading.jl:1423


### NVIDIA Tools Extensions

To measure the total time it takes to execute this operation, you can select the region on the timeline with the mouse. A better approach is to annotate the operation in the source code, using NVIDIA's Tools Extensions (NVTX) library. The annotation is then picked up by Nsight Systems and added to the timeline:

In [2]:
CUDA.@profile NVTX.@range "mul!" CUDA.@sync A * B;

┌ Info: Running under Nsight Systems, CUDA.@profile will automatically start the profiler
└ @ CUDA.Profile /home/tim/Julia/pkg/CUDA/lib/cudadrv/profile.jl:49

waiting for capture range to start the collection
command ignored
┌ Info: Profiling has finished, open the report listed above with `nsight-sys`
└ @ CUDA.Profile /home/tim/Julia/pkg/CUDA/lib/cudadrv/profile.jl:91


![image.png](attachment:image.png)

Note how our NVTX range nicely includes the time it took to queue the operation, as well as the execution on the GPU. This only works because the NVTX range includes a synchronization (`CUDA.@sync`); without it, the range would end as soon as the operation was queued.

The initial API call here is suspiciously slow though. This is a common occurrence, so it is recommended to profile any short-running application twice:

In [3]:
CUDA.@profile begin
    NVTX.@range "mul! 1" CUDA.@sync A * B
    NVTX.@range "mul! 2" CUDA.@sync A * B
end;


waiting for capture range to start the collection
command ignored
┌ Info: Profiling has finished, open the report listed above with `nsight-sys`
└ @ CUDA.Profile /home/tim/Julia/pkg/CUDA/lib/cudadrv/profile.jl:91


![image.png](attachment:image.png)

That's better, and much closer to our earlier benchmark results. It is still slightly slower, because some overhead is to be expected when running under the profiler.

NVTX can also be used to add markers to the source code:

In [4]:
CUDA.@profile begin
    NVTX.@range "mul! 1" CUDA.@sync A * B
    NVTX.@range "mul! 2" CUDA.@sync A * B
    NVTX.mark("done")
end;


waiting for capture range to start the collection
command ignored
┌ Info: Profiling has finished, open the report listed above with `nsight-sys`
└ @ CUDA.Profile /home/tim/Julia/pkg/CUDA/lib/cudadrv/profile.jl:91


![image.png](attachment:image.png)

For details on the kernel's execution, expand the `GPU` part of the timeline and hover over the kernel in question:

![image.png](attachment:image.png)

## Case study: application optimization

In [3]:
NVTX.@range function rmse(A::AbstractMatrix, B::AbstractMatrix)
    E = A - B
    SQE = E .^ 2
    MSE = sum(SQE) / length(SQE)
    return sqrt(MSE)
end

rmse (generic function with 1 method)

In [29]:
CUDA.allowscalar(false)
N = 16
A = CUDA.rand(1024, 1024, N)
B = CUDA.rand(1024, 1024, N)

NVTX.@range function doit(f)
    rmses = Vector{Float64}(undef, N)
    for i in 1:N
        rmses[i] = f(A[:, :, i], B[:, :, i])
    end
    rmses
end

b = @benchmark doit(rmse)
#CUDA.@profile doit(rmse)
b

BenchmarkTools.Trial: 537 samples with 1 evaluation.
 Range (min … max):  2.921 ms … 277.111 ms  ┊ GC (min … max): 0.00% … 1.21%
 Time  (median):     3.553 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.338 ms ±  38.805 ms  ┊ GC (mean ± σ):  0.72% ± 0.17%

Note that no `CUDA.@sync` is needed here, because the memcpy that brings each result back to the host is synchronizing.

![image.png](attachment:image.png)
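One thing to look for in `doit` is the indexing: `A[:, :, i]` allocates a new array and copies the slice into it, whereas `view(A, :, :, i)` only wraps the parent array. The following is a minimal CPU-only sketch of that difference using plain `Array`s (the names `X`, `S` and `V` are made up for illustration); the assumption is that `CuArray` follows the same copy/view semantics, which is consistent with the speed-up measured in the next cell.

```julia
# CPU-only sketch with plain Arrays; the names X, S, V are illustrative.
X = rand(Float32, 4, 4, 2)

S = X[:, :, 1]        # slicing allocates a new 4×4 array and copies the data
V = view(X, :, :, 1)  # a view wraps the parent array without copying

X[1, 1, 1] = -1f0     # mutate the parent (rand never produces a negative value)

copied = S[1, 1] != -1f0   # the slice kept its own, older value
shared = V[1, 1] == -1f0   # the view sees the update to the parent
println((copied, shared))
```

On the GPU each such copy shows up as extra work per iteration, so replacing slices by views is usually the first thing to try.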

In [30]:
NVTX.@range function doit2(f)
    rmses = Vector{Float64}(undef, N)
    for i in 1:N
        rmses[i] = f(view(A, :, :, i), view(B, :, :, i))
    end
    rmses
end

b = @benchmark doit2(rmse)
#CUDA.@profile doit2(rmse)
b

BenchmarkTools.Trial: 1052 samples with 1 evaluation.
 Range (min … max):  1.836 ms … 284.380 ms  ┊ GC (min … max): 0.00% … 1.41%
 Time  (median):     2.019 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.750 ms ±  26.407 ms  ┊ GC (mean ± σ):  0.80% ± 0.14%

In [31]:
NVTX.@range function rmse2(A::AbstractMatrix, B::AbstractMatrix)
    SQE = (A - B) .^ 2
    MSE = sum(SQE) / length(A)
    return sqrt(MSE)
end
b = @benchmark doit2(rmse2)
#CUDA.@profile doit2(rmse2)
b

BenchmarkTools.Trial: 1010 samples with 1 evaluation.
 Range (min … max):  1.843 ms … 278.905 ms  ┊ GC (min … max): 0.00% … 1.49%
 Time  (median):     2.030 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.948 ms ±  27.220 ms  ┊ GC (mean ± σ):  0.83% ± 0.15%

![image.png](attachment:image.png)
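The versions so far still materialize the element-wise difference and the squared errors as temporary arrays before summing them. The same quantity can be computed in a single pass with a multi-argument `mapreduce`. Below is a small CPU-only sketch of the equivalence (the inputs `A_cpu` and `B_cpu` are made up for illustration; it says nothing about GPU performance by itself):

```julia
# CPU-only sketch; A_cpu and B_cpu are illustrative inputs.
A_cpu = Float32[1 2; 3 4]
B_cpu = Float32[0 2; 1 1]

# Unfused: A_cpu - B_cpu and the squared errors are stored as temporary arrays.
unfused = sum((A_cpu - B_cpu) .^ 2)

# Fused: the function is applied to each pair of elements and the results are
# combined with +, without writing out the squared errors explicitly.
fused = mapreduce((a, b) -> (a - b)^2, +, A_cpu, B_cpu)

println((unfused, fused))
println(sqrt(fused / length(A_cpu)))  # the RMSE of the two small matrices
```

Both expressions sum the squared differences 1, 0, 4 and 9, so they agree exactly for this input.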

In [32]:
NVTX.@range function rmse3(A::AbstractMatrix, B::AbstractMatrix)
    @assert size(A) == size(B)
    MSE = mapreduce((a,b)->(a - b) ^ 2, +, A, B) / length(A)
    return sqrt(MSE)
end
b = @benchmark doit2(rmse3)
#CUDA.@profile doit2(rmse3)
b

BenchmarkTools.Trial: 5235 samples with 1 evaluation.
 Range (min … max):  802.001 μs … 48.730 ms  ┊ GC (min … max): 0.00% … 27.06%
 Time  (median):     829.831 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   952.662 μs ±  1.705 ms  ┊ GC (mean ± σ):  1.78% ±  0.98%

In [43]:
NVTX.@range function rmse4(A::AbstractMatrix, B::AbstractMatrix)
    @assert size(A) == size(B)
    MSE = mapreduce((a,b)->(a - b) ^ 2, +, A, B; dims=(1,2)) ./ length(A)
    return sqrt.(MSE)
end

NVTX.@range function doit3(f)
    rmses = Vector(undef, N)
    for i in 1:N
        rmses[i] = f(view(A, :, :, i), view(B, :, :, i))
    end
    map(rmses) do rmse
        Array(rmse)[]
    end
end

#b = @benchmark doit3(rmse4)
#CUDA.@profile doit3(rmse4)
#b
doit3(rmse4)

typeof(R) = CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}
size(R) = (1, 1)
typeof(A) = Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, var"#76#77", Tuple{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}}
size(A) = (1024, 1024)
typeof(R) = CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}
size(R) = (1, 1, 1)
typeof(A) = CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}
size(A) = (1, 1, 48)

16-element Vector{Float32}:
 0.40819794
 0.40824947
 0.40832016
 0.40772977
 0.40824547
 0.40791473
 0.40810397
 0.4083595
 0.408447
 0.40815404
 0.4085505
 0.40849838
 0.40851662
 0.40797493
 0.40841326
 0.4080057

![image.png](attachment:image.png)
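The difference between `rmse3` and `rmse4` is the `dims` keyword: with it, the reduction returns a 1×1 array instead of a scalar. On the GPU this presumably keeps each result in device memory, so the loop does not have to wait for a copy to the host on every iteration; `doit3` collects the values only at the end with `Array(rmse)[]`. A CPU-only sketch of the two return types (the inputs `X` and `Y` are made up for illustration):

```julia
# CPU-only sketch; X and Y are illustrative inputs.
X = Float32[1 2; 3 4]
Y = Float32[0 2; 1 1]

sqerr(a, b) = (a - b)^2

s = mapreduce(sqerr, +, X, Y)               # full reduction: a Float32 scalar
r = mapreduce(sqerr, +, X, Y; dims=(1, 2))  # reduction over both dims: a 1×1 Matrix

rmse_array = sqrt.(r ./ length(X))          # still a 1×1 array, as in rmse4
println((typeof(s), size(r), rmse_array[]))
```

Indexing with `[]` extracts the single element, which is what `doit3` does after converting each result to an `Array`.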

In [49]:
NVTX.@range function rmse5(A::AbstractMatrix, B::AbstractMatrix, C::AbstractArray)
    SQE = Broadcast.broadcasted(A, B) do a,b
        (a - b) ^ 2
    end
    SQE = Broadcast.instantiate(SQE)
    MSE = Base.mapreducedim!(identity, +, C, SQE)
    C .= sqrt.(C ./ length(SQE))
    return
end

NVTX.@range function doit4(f)
    rmses = CuVector{Float64}(undef, N)
    for i in 1:N
        f(view(A, :, :, i), view(B, :, :, i), reshape(view(rmses, i), 1, 1))
    end
    Array(rmses)
end

b = @benchmark doit4(rmse5)
#CUDA.@profile doit4(rmse5)
b

BenchmarkTools.Trial: 2258 samples with 1 evaluation.
 Range (min … max):  2.074 ms … 50.510 ms  ┊ GC (min … max): 0.00% … 30.64%
 Time  (median):     2.130 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.213 ms ±  1.437 ms  ┊ GC (mean ± σ):  0.57% ±  0.85%

Why is this slow? No idea. It is apparently not caused by the `Broadcasted` object, so that can go (just use a dense array). Somehow there are masses of device-to-device (D2D) copies :/