In [1]:
using Pkg
Pkg.activate(@__DIR__)

  Activating environment at `~/Julia/doc/cscs_gpu_course/Project.toml`


In [65]:
using Revise
using CUDA
using BenchmarkTools

# Kernel analysis and optimization

Once your application has been optimized, it's time to look at individual kernels. BenchmarkTools.jl and NSight Systems are still good tools for a first estimate of a kernel's execution time. For more insight, you can use the CUDA APIs, or use NSight Compute to dive deeply into a kernel's execution properties.

## Case study: Batched RMSE

As a case study, let's look at the batched RMSE calculation from the previous notebook again. Implementing such an operation in a single step is difficult, so let's start with a simplified version on a smaller input:

In [66]:
A = CUDA.rand(10,10)
B = CUDA.rand(10,10)
sqrt(sum((A-B).^2)/length(A))

0.39697844f0

In [67]:
function rmse_kernel(C, A, B)  
    i = threadIdx().x

    # initialize the memory
    if i == 1
        C[1] = 0
    end
    sync_threads()
    
    # accumulate this thread's squared error (one element per thread, no loop yet)
    a = A[i]
    b = B[i]
    CUDA.@atomic C[1] += (a-b)^2
    sync_threads()
    
    # finalize the computation
    if i == 1
        C[1] = sqrt(C[1] / length(A))
    end
    return
end;

In [68]:
C = CUDA.similar(A, 1)
@cuda threads=length(A) rmse_kernel(C,A,B)
C

1-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 0.39697844

Now we need to extend this implementation to:
- cover multiple batches
- support inputs that don't fit in a single block

As covered before, a grid-stride loop is an easy way to make a single block process multiple items. Alternatively, we could launch multiple blocks, but that complicates synchronization and communication between blocks that need to process the same batch.
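To make the strided access pattern concrete, here is a minimal CPU-only sketch that simulates which indices each thread of a single block would visit. The values of `nthreads` and `len` are made up for illustration; they stand in for `blockDim().x` and `length(A)`.

```julia
# CPU-only simulation of a strided loop within one block.
nthreads = 4          # stands in for blockDim().x (illustrative value)
len = 10              # stands in for length(A) (illustrative value)

# indices visited by each simulated thread, mirroring
# `for i in threadIdx().x:blockDim().x:length(A)`
visited = [collect(t:nthreads:len) for t in 1:nthreads]

for (t, idxs) in enumerate(visited)
    println("thread ", t, " handles ", idxs)
end

# every index is handled exactly once across all threads
@assert sort(reduce(vcat, visited)) == collect(1:len)
```

Neighbouring threads touch neighbouring elements in every iteration, which is what keeps global memory accesses coalesced on the GPU.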

In [69]:
A = CUDA.rand(1024,1024)
B = CUDA.rand(1024,1024)
sqrt(sum((A-B).^2)/length(A))

0.408413f0

In [70]:
function rmse_kernel(C, A, B)  
    # initialize the memory
    if threadIdx().x == 1
        C[1] = 0
    end
    sync_threads()
    
    # grid-stride loop to process each batch in a block
    for i in threadIdx().x:blockDim().x:length(A)
        a = A[i]
        b = B[i]
        CUDA.@atomic C[1] += (a-b)^2
    end    
    sync_threads()
    
    # finalize the computation
    if threadIdx().x == 1
        C[1] = sqrt(C[1] / length(A))
    end
    return
end;

In [71]:
C = CUDA.similar(A, 1)
@cuda threads=256 rmse_kernel(C,A,B)
C

1-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 0.40829936

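Notice the slight differences between individual invocations: the order in which the `@atomic` operations execute is not guaranteed, and floating-point addition is not associative, so the rounding of the accumulated sum depends on that order.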
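The effect of summation order can be reproduced on the CPU without any GPU involved. The following sketch uses hand-picked `Float32` values (an assumption chosen to exaggerate the effect) and adds them in two different orders:

```julia
# Float32 addition is not associative: the grouping changes the result.
x = Float32[1f8, 1f0, -1f8]

left  = (x[1] + x[2]) + x[3]   # 1f0 is rounded away when added to 1f8 first
right = (x[1] + x[3]) + x[2]   # cancelling the large values first keeps the 1f0

println("left = ", left, ", right = ", right)
@assert left != right
```

With random inputs in `[0, 1)` the discrepancy is much smaller, but it is the same mechanism that makes the atomics-based kernel return slightly different values from run to run.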

Finally, let's extend this to multiple batches:

In [72]:
N = 16
A = CUDA.rand(1024,1024,N)
B = CUDA.rand(1024,1024,N)
C = CUDA.similar(A, N)
sqrt.(sum((A.-B).^2; dims=1:(ndims(A)-1))./prod(size(A)[1:end-1]))

1×1×16 CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}:
[:, :, 1] =
 0.4085672

[:, :, 2] =
 0.40834135

[:, :, 3] =
 0.40830332

;;; … 

[:, :, 14] =
 0.408083

[:, :, 15] =
 0.40812522

[:, :, 16] =
 0.40820798

A common pattern for dealing with multiple independent datasets or batches within a single kernel (i.e. without launching a separate kernel for each batch) is to compute and pass separate cartesian indices to the kernel, and make sure those map onto hardware indices the way we want. For example, here we have N-dimensional inputs whose last index represents the batch, so we can pass two separate cartesian indices:
- one representing the 'main' iteration space, where the last index doesn't count
- one representing the batches, having the same dimensionality, but with only the last index set

As we want each RMSE calculation between arrays from a single batch to happen within a single block (again, to simplify communication and synchronization), we should index the main cartesian indices object using a thread index, while using a block index for the batch indices. Within the kernel, we can then merge these two objects using the `max` operator to get a usable index. For more information on this technique, refer to the following blog post: https://julialang.org/blog/2016/02/iteration/.
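The merging step can be tried out on the CPU. The sketch below uses a small integer array as a stand-in for the GPU inputs (the shape 2×3×4 is an arbitrary choice) and checks that `max` of a 'main' index and a 'batch' index addresses the expected element:

```julia
# Small stand-in array: 4 batches of 2×3 elements.
A = reshape(collect(1:24), 2, 3, 4)

# Same construction as in the text: collapse either the batch dimension
# or all other dimensions to a single index.
Rmain  = CartesianIndices(ntuple(i -> i == ndims(A) ? Base.OneTo(1) : axes(A)[i], ndims(A)))
Rbatch = CartesianIndices(ntuple(i -> i != ndims(A) ? Base.OneTo(1) : axes(A)[i], ndims(A)))

Imain  = Rmain[5]            # fifth element of the main iteration space
Ibatch = Rbatch[3]           # third batch
I = max(Imain, Ibatch)       # element-wise maximum merges both indices
println(Imain, " + ", Ibatch, " -> ", I)

# the merged index addresses the fifth element of the third batch
@assert A[I] == A[:, :, 3][5]

# iterating the main space for every batch reproduces a per-batch reduction
sums = [sum(A[max(Im, Ib)] for Im in Rmain) for Ib in Rbatch]
@assert vec(sums) == vec(sum(A; dims=(1, 2)))
```

This works because every collapsed dimension holds the index 1, so the element-wise maximum always picks the 'real' index from whichever object carries it.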

In [73]:
Rmain = ntuple(i->i == ndims(A) ? Base.OneTo(1) : axes(A)[i], ndims(A)) |> CartesianIndices

1024×1024×1 CartesianIndices{3, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}, Base.OneTo{Int64}}}:
[:, :, 1] =
 CartesianIndex(1, 1, 1)     …  CartesianIndex(1, 1024, 1)
 CartesianIndex(2, 1, 1)        CartesianIndex(2, 1024, 1)
 CartesianIndex(3, 1, 1)        CartesianIndex(3, 1024, 1)
 CartesianIndex(4, 1, 1)        CartesianIndex(4, 1024, 1)
 CartesianIndex(5, 1, 1)        CartesianIndex(5, 1024, 1)
 CartesianIndex(6, 1, 1)     …  CartesianIndex(6, 1024, 1)
 CartesianIndex(7, 1, 1)        CartesianIndex(7, 1024, 1)
 CartesianIndex(8, 1, 1)        CartesianIndex(8, 1024, 1)
 CartesianIndex(9, 1, 1)        CartesianIndex(9, 1024, 1)
 CartesianIndex(10, 1, 1)       CartesianIndex(10, 1024, 1)
 CartesianIndex(11, 1, 1)    …  CartesianIndex(11, 1024, 1)
 CartesianIndex(12, 1, 1)       CartesianIndex(12, 1024, 1)
 CartesianIndex(13, 1, 1)       CartesianIndex(13, 1024, 1)
 ⋮                           ⋱  
 CartesianIndex(1013, 1, 1)     CartesianIndex(1013, 1024, 1)
 CartesianIndex(1014, 1, 

In [74]:
Rbatch = ntuple(i->i != ndims(A) ? Base.OneTo(1) : axes(A)[i], ndims(A)) |> CartesianIndices

1×1×16 CartesianIndices{3, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}, Base.OneTo{Int64}}}:
[:, :, 1] =
 CartesianIndex(1, 1, 1)

[:, :, 2] =
 CartesianIndex(1, 1, 2)

[:, :, 3] =
 CartesianIndex(1, 1, 3)

;;; … 

[:, :, 14] =
 CartesianIndex(1, 1, 14)

[:, :, 15] =
 CartesianIndex(1, 1, 15)

[:, :, 16] =
 CartesianIndex(1, 1, 16)

In [75]:
function rmse_kernel(C, A, B, Rmain, Rbatch)
    batch = blockIdx().x
    Ibatch = Rbatch[batch]
    
    # initialize the memory
    if threadIdx().x == 1
        C[batch] = 0
    end
    sync_threads()
    
    # grid-stride loop to process each batch in a block
    for i in threadIdx().x:blockDim().x:length(Rmain)
        Imain = Rmain[i]
        I = max(Imain, Ibatch)
        a = A[I]
        b = B[I]
        CUDA.@atomic C[batch] += (a-b)^2
    end    
    sync_threads()
    
    # finalize the computation
    if threadIdx().x == 1
        C[batch] = sqrt(C[batch] / length(Rmain))
    end
    return
end;

In [76]:
function rmse(C, A, B)
    Rmain = ntuple(i->i == ndims(A) ? Base.OneTo(1) : axes(A)[i], ndims(A)) |> CartesianIndices
    Rbatch = ntuple(i->i != ndims(A) ? Base.OneTo(1) : axes(A)[i], ndims(A)) |> CartesianIndices
    @cuda threads=256 blocks=length(Rbatch) rmse_kernel(C, A, B, Rmain, Rbatch)
    return
end;

In [77]:
b = @benchmark CUDA.@sync rmse(C, A, B)
display(C)
b

16-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 0.4084522
 0.40822744
 0.40819687
 0.40825972
 0.40788049
 0.40766144
 0.4084545
 0.4081247
 0.40773115
 0.408521
 0.40833315
 0.40828344
 0.4080161
 0.40795988
 0.40800813
 0.40809527

BenchmarkTools.Trial: 249 samples with 1 evaluation.
 Range (min … max):  19.581 ms …  30.245 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.947 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.074 ms ± 954.269 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

Now that we have a fully functional implementation, let's benchmark in order to compare to the last notebook's implementation using array operations: 

In [78]:
@benchmark CUDA.@sync rmse(C, A, B)

BenchmarkTools.Trial: 250 samples with 1 evaluation.
 Range (min … max):  19.672 ms …  21.817 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.941 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.038 ms ± 313.213 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

How slow is that?! To understand what's going on here, we'll need to dive deeper into this kernel.

TODO: NSight Compute

We could try aggregating in shared memory instead, but ultimately this loop is slow because every iteration performs an atomic operation. There are several techniques to avoid this:

- perform the reduction in parallel
- use a shared memory buffer so that threads can communicate without intrinsics
- use warp-level intrinsics for even finer-grained communication

For more details on these techniques, refer to the following blog: https://developer.nvidia.com/blog/faster-parallel-reductions-kepler/
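As a rough illustration of the first technique, here is a sequential CPU simulation of a tree-shaped (pairwise) reduction. It is only a sketch: `tree_reduce` is a hypothetical helper, the buffer stands in for shared memory, and the input length is assumed to be a power of two.

```julia
# Sequential simulation of a tree (pairwise) reduction.
# `tree_reduce` is a hypothetical helper, not part of CUDA.jl.
function tree_reduce(op, vals::AbstractVector)
    @assert ispow2(length(vals))   # simplifying assumption of this sketch
    buf = copy(vals)               # stands in for a shared-memory buffer
    stride = length(buf) ÷ 2
    while stride >= 1
        # On a GPU, these `stride` updates would run in parallel,
        # followed by a barrier such as sync_threads().
        for t in 1:stride
            buf[t] = op(buf[t], buf[t + stride])
        end
        stride ÷= 2
    end
    return buf[1]
end

vals = collect(1:8)
@assert tree_reduce(+, vals) == sum(vals)
```

For 8 values this takes 3 parallel steps, whereas accumulating through a single atomic location serializes all 8 updates.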

CUDA.jl uses these techniques in its `mapreducedim!` implementation, and although the individual building blocks of that reduction aren't a stable API, we can still use them to avoid having to implement our own parallel reduction:

In [79]:
function rmse_kernel_opt(C::AbstractArray{T}, A, B, Rmain, Rbatch) where T
    batch = blockIdx().x
    Ibatch = Rbatch[batch]
    
    # grid-stride loop to process each batch in a block
    val = zero(T)
    for i in threadIdx().x:blockDim().x:length(Rmain)
        Imain = Rmain[i]
        I = max(Imain, Ibatch)
        a = A[I]
        b = B[I]
        val += CUDA.reduce_block(+, (a-b)^2, zero(T), #=shuffle=# Val(true))
    end    
    sync_threads()
    
    # finalize the computation
    if threadIdx().x == 1
        C[batch] = sqrt(val / length(Rmain))
    end
    return
end;

In [80]:
function rmse_opt(C, A, B)
    Rmain = ntuple(i->i == ndims(A) ? Base.OneTo(1) : axes(A)[i], ndims(A)) |> CartesianIndices
    Rbatch = ntuple(i->i != ndims(A) ? Base.OneTo(1) : axes(A)[i], ndims(A)) |> CartesianIndices
    @cuda threads=256 blocks=length(Rbatch) rmse_kernel_opt(C, A, B, Rmain, Rbatch)
    return
end;

In [81]:
b = @benchmark CUDA.@sync rmse_opt(C, A, B)
display(C)
b

16-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 0.40856707
 0.40834144
 0.40830347
 0.40837765
 0.40799418
 0.4077742
 0.40856826
 0.40823975
 0.40784803
 0.40863705
 0.408447
 0.4083946
 0.4081305
 0.408083
 0.40812513
 0.40820822

BenchmarkTools.Trial: 1224 samples with 1 evaluation.
 Range (min … max):  3.973 ms …   5.691 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.040 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.081 ms ± 154.114 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

Much better! But still slower than CUDA.jl's reduction. One reason is the launch configuration: we're hard-coding the number of threads to 256, for maximal compatibility with older GPUs, and only launching as many blocks as there are batches. Is that good enough?

In [82]:
kernel = @cuda launch=false rmse_kernel_opt(C, A, B, CartesianIndices(axes(A)), CartesianIndices(axes(A)))
config = launch_configuration(kernel.fun)

(blocks = 48, threads = 512)

The occupancy API tells us that we should use:
- 512 threads, to maximize occupancy within a SM
- at least 48 blocks, to maximize use of the GPU

Launching more threads is easy enough with a grid-stride loop, but launching more blocks needs a change to the index calculation. The reduction also needs to be adapted, since a single batch may now be processed by multiple blocks. Lacking an efficient `reduce_grid`, the easiest solution is to re-introduce atomics. Those should not hurt performance, since they are only issued by one thread per block, and only once per kernel.

To split the hardware block index into an index that additionally covers the main iteration space and a batch number, I chose to use the multi-dimensional nature of these hardware indices: `blockIdx().x` for the main iteration space, and `blockIdx().y` for the batch. Alternatively, it's possible to use `fldmod1` to decompose a single index based on the length of the iterator.
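For reference, the `fldmod1` alternative could look roughly like the CPU sketch below. It assumes a hypothetical one-dimensional launch with `blocks_x * N` blocks, where `blocks_x` is the number of blocks cooperating on one batch; neither the 1D launch nor these variable names are used by the kernel in this notebook.

```julia
blocks_x = 48   # blocks per batch (the value suggested by the occupancy API above)
N = 16          # number of batches

# Decompose a 1-based linear block index into (batch, block-within-batch).
for b in (1, 48, 49, blocks_x * N)
    batch, bx = fldmod1(b, blocks_x)
    println("block ", b, " -> batch ", batch, ", block-in-batch ", bx)
end

# every linear block index maps to a unique, valid pair
pairs = [fldmod1(b, blocks_x) for b in 1:(blocks_x * N)]
@assert allunique(pairs)
@assert all(p -> 1 <= p[1] <= N && 1 <= p[2] <= blocks_x, pairs)
```

The 2D launch used in this notebook avoids the integer division that this decomposition requires inside the kernel.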

In [83]:
function rmse_kernel_opt(C::AbstractArray{T}, A, B, Rmain, Rbatch) where T
    batch = blockIdx().y
    Ibatch = Rbatch[batch]
    
    # grid-stride loop to process each batch in a block
    val = zero(T)
    i0 = threadIdx().x + (blockIdx().x-1)*blockDim().x
    for i in i0:(blockDim().x*gridDim().x):length(Rmain)
        Imain = Rmain[i]
        I = max(Imain, Ibatch)
        a = A[I]
        b = B[I]
        val += CUDA.reduce_block(+, (a-b)^2, zero(T), #=shuffle=# Val(true))
    end    
    sync_threads()
    
    # finalize the computation
    if threadIdx().x == 1
        CUDA.@atomic C[batch] += val
    end
    return
end;

In [84]:
function rmse_opt(C, A, B)
    Rmain = ntuple(i->i == ndims(A) ? Base.OneTo(1) : axes(A)[i], ndims(A)) |> CartesianIndices
    Rbatch = ntuple(i->i != ndims(A) ? Base.OneTo(1) : axes(A)[i], ndims(A)) |> CartesianIndices
    
    fill!(C, 0)
    @cuda threads=512 blocks=(48,length(Rbatch)) rmse_kernel_opt(C, A, B, Rmain, Rbatch)
    C .= sqrt.(C ./ length(Rmain))
    return
end;

In [85]:
b = @benchmark CUDA.@sync rmse_opt(C, A, B)
display(C)
b

16-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 0.40856716
 0.40834135
 0.40830332
 0.40837765
 0.4079943
 0.40777388
 0.40856865
 0.40823978
 0.40784812
 0.40863714
 0.40844733
 0.40839452
 0.40813056
 0.408083
 0.40812522
 0.408208

BenchmarkTools.Trial: 5977 samples with 1 evaluation.
 Range (min … max):  818.630 μs …  10.566 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     828.739 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   833.407 μs ± 133.889 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

Much better. Let's actually use the launch configuration API at run time to make sure we use the best possible configuration for the GPU at hand:

In [86]:
function rmse_opt(C, A, B)
    Rmain = ntuple(i->i == ndims(A) ? Base.OneTo(1) : axes(A)[i], ndims(A)) |> CartesianIndices
    Rbatch = ntuple(i->i != ndims(A) ? Base.OneTo(1) : axes(A)[i], ndims(A)) |> CartesianIndices
    
    kernel = @cuda launch=false rmse_kernel_opt(C, A, B, Rmain, Rbatch)
    config = launch_configuration(kernel.fun)
    blocks_x = min(config.blocks, length(Rmain))
    threads = min(config.threads, cld(length(Rmain), blocks_x))
    blocks = (blocks_x, length(Rbatch))
    
    fill!(C, 0)
    kernel(C, A, B, Rmain, Rbatch; threads, blocks)
    C .= sqrt.(C ./ length(Rmain))
    return
end;
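The clamping arithmetic in this function can be checked without a GPU. The sketch below isolates it in a pure function; `launch_shape` is a hypothetical helper name, and the suggested values 48 and 512 are the ones reported by the occupancy API earlier in this notebook.

```julia
# Pure-CPU mirror of the clamping logic; `launch_shape` is a hypothetical name.
function launch_shape(suggested_blocks, suggested_threads, len, nbatches)
    # never launch more blocks along x than there are elements
    blocks_x = min(suggested_blocks, len)
    # never launch more threads per block than needed to cover the elements
    threads = min(suggested_threads, cld(len, blocks_x))
    return (threads = threads, blocks = (blocks_x, nbatches))
end

big   = launch_shape(48, 512, 1024 * 1024, 16)   # the inputs used in this notebook
small = launch_shape(48, 512, 100, 16)           # a much smaller main iteration space

println(big)
println(small)

# in both cases, one pass of all threads along x covers at most `len` plus some slack
@assert small.threads * small.blocks[1] >= 100
```

For the large input the suggestion is used unchanged, while for the small input the number of threads is reduced so that most of them still have an element to process.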

In [87]:
b = @benchmark CUDA.@sync rmse_opt(C, A, B)
display(C)
b

16-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 0.40856713
 0.40834135
 0.40830332
 0.40837765
 0.40799427
 0.4077739
 0.40856865
 0.4082398
 0.40784806
 0.40863717
 0.40844738
 0.4083945
 0.4081306
 0.408083
 0.40812522
 0.408208

BenchmarkTools.Trial: 5957 samples with 1 evaluation.
 Range (min … max):  821.379 μs …  1.937 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     833.659 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   836.315 μs ± 41.939 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

Now we are getting close to the performance we'd expect. For the remaining optimizations, we will need to have a look at the generated code. It's recommended to start out with the LLVM code, which is the most readable:

In [88]:
@device_code_llvm debuginfo=:none rmse_opt(C,A,B)

; PTX CompilerJob of kernel rmse_kernel_opt(CuDeviceVector{Float32, 1}, CuDeviceArray{Float32, 3, 1}, CuDeviceArray{Float32, 3, 1}, CartesianIndices{3, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}, Base.OneTo{Int64}}}, CartesianIndices{3, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}, Base.OneTo{Int64}}}) for sm_75
define ptx_kernel void @_Z27julia_rmse_kernel_opt_1346513CuDeviceArrayI7Float32Li1ELi1EES_IS0_Li3ELi1EES_IS0_Li3ELi1EE16CartesianIndicesILi3E5TupleI5OneToI5Int64ES3_IS4_ES3_IS4_EEES1_ILi3ES2_IS3_IS4_ES3_IS4_ES3_IS4_EEE([1 x i64] %state, { i8 addrspace(1)*, i64, [1 x i64] } %0, { i8 addrspace(1)*, i64, [3 x i64] } %1, { i8 addrspa

; PTX CompilerJob of kernel broadcast_kernel(CUDA.CuKernelContext, CuDeviceVector{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(sqrt), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Nothing, typeof(/), Tuple{Base.Broadcast.Extruded{CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}, Int64}}}}, Int64) for sm_75
define ptx_kernel void @_Z28julia_broadcast_kernel_1364715CuKernelContext13CuDeviceArrayI7Float32Li1ELi1EE11BroadcastedIv5TupleI5OneToI5Int64EE5_sqrtS3_IS2_I12CuArrayStyleILi1EEv2__S3_I8ExtrudedIS0_IS1_Li1ELi1EES3_I4BoolES3_IS5_EES5_EEEES5_([1 x i64] %state, { i8 addrspace(1)*, i64, [1 x i64] } %0, { [1 x { { { { i8 addrspace(1)*, i64

Generally, for GPU kernels we want straight-line code, minimizing the number of branches, while also reducing the amount of data that needs to be kept in registers. You can easily inspect the latter using the compiled kernel object:

In [89]:
kernel = @cuda launch=false rmse_kernel_opt(C, A, B, CartesianIndices(axes(A)), CartesianIndices(axes(A)))
CUDA.registers(kernel)

98

It's important to minimize the number of registers, because it determines how many threads can be launched as part of a single block:

In [90]:
CUDA.maxthreads(kernel)

512

Note that this is the same number as reported by the occupancy API.

If we now look closer at the generated code, we can quickly see a bunch of branches because of bounds checking. Typically, the error-reporting calls on those branches also require data to be prepared and put into registers, so eliminating bounds checks should both remove unneeded branches and reduce register pressure:

In [91]:
function rmse_kernel_opt(C::AbstractArray{T}, A, B, Rmain, Rbatch) where T
    batch = blockIdx().y
    Ibatch = @inbounds Rbatch[batch]
    
    # grid-stride loop to process each batch in a block
    val = zero(T)
    i0 = threadIdx().x + (blockIdx().x-1)*blockDim().x
    @inbounds for i in i0:(blockDim().x*gridDim().x):length(Rmain)
        Imain = Rmain[i]
        I = max(Imain, Ibatch)
        a = A[I]
        b = B[I]
        val += CUDA.reduce_block(+, (a-b)^2, zero(T), #=shuffle=# Val(true))
    end    
    sync_threads()
    
    # finalize the computation
    if threadIdx().x == 1
        @inbounds CUDA.@atomic C[batch] += val
    end
    return
end;

In [92]:
b = @benchmark CUDA.@sync rmse_opt(C, A, B)
display(C)
b

16-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 0.40864596
 0.40841642
 0.4083755
 0.40845305
 0.4080715
 0.40784878
 0.4086396
 0.4083154
 0.40791878
 0.40870374
 0.40852144
 0.40846443
 0.40819678
 0.40815797
 0.40820137
 0.40828073

BenchmarkTools.Trial: 6734 samples with 1 evaluation.
 Range (min … max):  687.711 μs … 50.028 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     702.802 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   739.422 μs ±  1.115 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [93]:
kernel = @cuda launch=false rmse_kernel_opt(C, A, B, CartesianIndices(axes(A)), CartesianIndices(axes(A)))
CUDA.registers(kernel)

90

In [94]:
CUDA.maxthreads(kernel)

640

Another common source of register pressure is the use of 64-bit integers where only 32 bits are required. For example, the hardware's indices are 32-bit integers, but Julia's integer literals here are `Int64`s, which causes expressions like `blockIdx().x-1` to be promoted to 64-bit integers. We can avoid this by using `Int32` literals:

In [95]:
function rmse_kernel_opt(C::AbstractArray{T}, A, B, Rmain, Rbatch) where T
    batch = blockIdx().y
    Ibatch = @inbounds Rbatch[batch]
    
    # grid-stride loop to process each batch in a block
    val = zero(T)
    i0 = threadIdx().x + (blockIdx().x-Int32(1))*blockDim().x
    @inbounds for i in i0:(blockDim().x*gridDim().x):length(Rmain)
        Imain = Rmain[i]
        I = max(Imain, Ibatch)
        a = A[I]
        b = B[I]
        val += CUDA.reduce_block(+, (a-b)^2, zero(T), #=shuffle=# Val(true))
    end    
    sync_threads()
    
    # finalize the computation
    if threadIdx().x == Int32(1)
        @inbounds CUDA.@atomic C[batch] += val
    end
    return
end

rmse_kernel_opt (generic function with 1 method)

In [96]:
kernel = @cuda launch=false rmse_kernel_opt(C, A, B, CartesianIndices(axes(A)), CartesianIndices(axes(A)))
CUDA.registers(kernel)

88
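The integer promotion behind this change is ordinary Julia behaviour and can be observed on the CPU. In this small sketch, `idx` is just an `Int32` value standing in for a 32-bit hardware index:

```julia
idx = Int32(5)   # stands in for a 32-bit hardware index

# Mixing with a plain literal promotes to the native Int (Int64 on 64-bit systems).
@assert typeof(idx - 1) == Int

# Using an Int32 literal, or `one(idx)`, keeps the arithmetic in 32 bits.
@assert typeof(idx - Int32(1)) == Int32
@assert typeof(idx - one(idx)) == Int32

println(typeof(idx - 1), " vs ", typeof(idx - Int32(1)))
```

Writing `one(idx)` instead of `Int32(1)` is a common way to stay in whatever integer type the index already has.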

If we look even closer at the generated code, there is another exception being thrown from the `div` function in Base:

In [97]:
@device_code_llvm rmse_opt(C,A,B)

; PTX CompilerJob of kernel rmse_kernel_opt(CuDeviceVector{Float32, 1}, CuDeviceArray{Float32, 3, 1}, CuDeviceArray{Float32, 3, 1}, CartesianIndices{3, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}, Base.OneTo{Int64}}}, CartesianIndices{3, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}, Base.OneTo{Int64}}}) for sm_75
;  @ In[95]:1 within `rmse_kernel_opt`
define ptx_kernel void @_Z27julia_rmse_kernel_opt_1395213CuDeviceArrayI7Float32Li1ELi1EES_IS0_Li3ELi1EES_IS0_Li3ELi1EE16CartesianIndicesILi3E5TupleI5OneToI5Int64ES3_IS4_ES3_IS4_EEES1_ILi3ES2_IS3_IS4_ES3_IS4_ES3_IS4_EEE([1 x i64] %state, { i8 addrspace(1)*, i64, [1 x i64] } %0, { i8 addrspace(1)*, i64, [3 x i64]

; PTX CompilerJob of kernel broadcast_kernel(CUDA.CuKernelContext, CuDeviceVector{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(sqrt), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Nothing, typeof(/), Tuple{Base.Broadcast.Extruded{CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}, Int64}}}}, Int64) for sm_75
;  @ /home/tim/Julia/depot/packages/GPUArrays/3sW6s/src/host/broadcast.jl:56 within `broadcast_kernel`
define ptx_kernel void @_Z28julia_broadcast_kernel_1413415CuKernelContext13CuDeviceArrayI7Float32Li1ELi1EE11BroadcastedIv5TupleI5OneToI5Int64EE5_sqrtS3_IS2_I12CuArrayStyleILi1EEv2__S3_I8ExtrudedIS0_IS1_Li1ELi1EES3_I4BoolES3_IS5_EES5_EEEES5_([1 x i64] %state, { i8 addrspace(1)*, i64, [1 x i64] } %0

We cannot simply avoid this exception using an `@inbounds`-like macro, or by using differently-typed indices. However, if we look at the definition of `div` we can see that the exception is thrown when the divisor is 0. Here, the divisor is the size of the abstract array being indexed, and we can tell LLVM it can never be 0 using the `assume` intrinsic:

In [98]:
using LLVM, LLVM.Interop
function rmse_kernel_opt(C::AbstractArray{T}, A, B, Rmain, Rbatch) where T
    batch = blockIdx().y
    assume.(size(Rbatch) .> 0)
    Ibatch = @inbounds Rbatch[batch]
    
    # grid-stride loop to process each batch in a block
    val = zero(T)
    i0 = threadIdx().x + (blockIdx().x-Int32(1))*blockDim().x
    assume.(size(Rmain) .> 0)
    @inbounds for i in i0:(blockDim().x*gridDim().x):length(Rmain)
        Imain = Rmain[i]
        I = max(Imain, Ibatch)
        a = A[I]
        b = B[I]
        val += CUDA.reduce_block(+, (a-b)^2, zero(T), #=shuffle=# Val(true))
    end    
    sync_threads()
    
    # finalize the computation
    if threadIdx().x == Int32(1)
        @inbounds CUDA.@atomic C[batch] += val
    end
    return
end;

In [99]:
kernel = @cuda launch=false rmse_kernel_opt(C, A, B, CartesianIndices(axes(A)), CartesianIndices(axes(A)))
CUDA.registers(kernel)

82
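To see where the division comes from, the following CPU sketch converts a linear index into a cartesian one by hand and shows that integer division by zero throws. The dimensions are arbitrary, `throws_divide_error` is a hypothetical helper, and the manual conversion is only an illustration of the idea, not the exact code path in Base.

```julia
# Integer division by zero throws a DivideError, which is the
# exception path the compiler has to keep around in the kernel.
function throws_divide_error(n, d)
    try
        div(n, d)
        return false
    catch err
        return err isa DivideError
    end
end

@assert throws_divide_error(7, 0)
@assert !throws_divide_error(7, 2)

# Converting a linear index to a cartesian one divides by the array size.
dims = (4, 3)
lin = 6
j, i = fldmod1(lin, dims[1])    # column j and row i for a column-major layout
@assert CartesianIndex(i, j) == CartesianIndices(dims)[lin]
println("linear index ", lin, " -> ", CartesianIndex(i, j))
```

Since the divisor is a dimension size, promising the compiler that every size is positive lets it drop the zero check and the associated error path.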

In [100]:
@device_code_ptx rmse_opt(C,A,B)

// PTX CompilerJob of kernel rmse_kernel_opt(CuDeviceVector{Float32, 1}, CuDeviceArray{Float32, 3, 1}, CuDeviceArray{Float32, 3, 1}, CartesianIndices{3, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}, Base.OneTo{Int64}}}, CartesianIndices{3, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}, Base.OneTo{Int64}}}) for sm_75

//
// Generated by LLVM NVPTX Back-End
//

.version 6.3
.target sm_75
.address_size 64

	// .globl	_Z27julia_rmse_kernel_opt_1434913CuDeviceArrayI7Float32Li1ELi1EES_IS0_Li3ELi1EES_IS0_Li3ELi1EE16CartesianIndicesILi3E5TupleI5OneToI5Int64ES3_IS4_ES3_IS4_EEES1_ILi3ES2_IS3_IS4_ES3_IS4_ES3_IS4_EEE // -- Begin function _Z27julia_rmse_kernel_opt_1434913CuDeviceArrayI7Float32Li1ELi1EES_IS0_Li3ELi1EES_IS0_Li3ELi1EE16CartesianIndicesILi3E5TupleI5OneToI5Int64ES3_IS4_ES3_IS4_EEES1_ILi3ES2_IS3_IS4_ES3_IS4_ES3_IS4_EEE
.extern

// PTX CompilerJob of kernel broadcast_kernel(CUDA.CuKernelContext, CuDeviceVector{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(sqrt), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Nothing, typeof(/), Tuple{Base.Broadcast.Extruded{CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}, Int64}}}}, Int64) for sm_75

//
// Generated by LLVM NVPTX Back-End
//

.version 6.3
.target sm_75
.address_size 64

	// .globl	_Z28julia_broadcast_kernel_1454315CuKernelContext13CuDeviceArrayI7Float32Li1ELi1EE11BroadcastedIv5TupleI5OneToI5Int64EE5_sqrtS3_IS2_I12CuArrayStyleILi1EEv2__S3_I8ExtrudedIS0_IS1_Li1ELi1EES3_I4BoolES3_IS5_EES5_EEEES5_ // -- Begin function _Z28julia_broadcast_kernel_1454315CuKernelContext13CuDeviceArrayI7Float32Li1ELi1EE11BroadcastedIv5TupleI5OneToI5Int64EE5_sqrtS3_IS2_I12CuArrayStyleILi1E

In [101]:
@device_code_sass rmse_opt(C,A,B)

// PTX CompilerJob of kernel rmse_kernel_opt(CuDeviceVector{Float32, 1}, CuDeviceArray{Float32, 3, 1}, CuDeviceArray{Float32, 3, 1}, CartesianIndices{3, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}, Base.OneTo{Int64}}}, CartesianIndices{3, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}, Base.OneTo{Int64}}}) for sm_75

	.headerflags	@"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM75 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM75)"
	.elftype	@"ET_EXEC"


//--------------------- .text._Z27julia_rmse_kernel_opt_1466213CuDeviceArrayI7Float32Li1ELi1EES_IS0_Li3ELi1EES_IS0_Li3ELi1EE16CartesianIndicesILi3E5TupleI5OneToI5Int64ES3_IS4_ES3_IS4_EEES1_ILi3ES2_IS3_IS4_ES3_IS4_ES3_IS4_EEE --------------------------
	.section	.text._Z27julia_rmse_kernel_opt_1466213CuDeviceArrayI7Float32Li1ELi1EES_IS0_Li3ELi1EES_IS0_Li3ELi1EE16CartesianIndicesILi3E5TupleI5OneToI5Int64ES3_IS4_ES3_IS4_EEES1_ILi3ES2_IS3_IS4_ES3_IS4_ES3_IS4_EEE,"ax",@progbits
	.sectionflags	@"SHF_BARRIERS=1"
	.sectioninfo	@"SHI_REGISTERS=82"
	.align	1

        IMAD.MOV.U32 R21, RZ, RZ, c[0x0][0x1d4] ;
        CALL.REL.NOINC `($_Z27julia_rmse_kernel_opt_1466213CuDeviceArrayI7Float32Li1ELi1EES_IS0_Li3ELi1EES_IS0_Li3ELi1EE16CartesianIndicesILi3E5TupleI5OneToI5Int64ES3_IS4_ES3_IS4_EEES1_ILi3ES2_IS3_IS4_ES3_IS4_ES3_IS4_EEE$__cuda_sm20_div_s64) ;
        IMAD.MOV.U32 R2, RZ, RZ, R6 ;
        MOV R3, R7 ;

.L_12:
        BSYNC B0 ;

.L_10:
; Location ./int.jl:87
        ULDC.64 UR4, c[0x0][0x1d0] ;
; Location ./int.jl:284
        LOP3.LUT R0, R3, c[0x0][0x1dc], RZ, 0xfc, !PT ;
; Location ./int.jl:87
        UIADD3 UR4, UP0, URZ, -UR4, URZ ;
        BMOV.32.CLEAR RZ, B0 ;
        BSSY B0, `(.L_13) ;
; Location ./int.jl:284
        LOP3.LUT P0, RZ, R0, 0xffffffff, RZ, 0xc0, !PT ;
; Location ./int.jl:87
        UIADD3.X UR5, URZ, ~UR5, URZ, UP0, !UPT ;
        IMAD R7, R3, UR4, RZ ;
        IMAD.WIDE.U32 R4, R2, UR4, R28 ;
        IMAD R7, R2, UR5, R7 ;
        IADD3 R11, P1, R4, -0x1, RZ ;
        IMAD.IADD R10, R5, 0x1, R7 ;
        IADD3.X 

        CALL.REL.NOINC `($_Z27julia_rmse_kernel_opt_1466213CuDeviceArrayI7Float32Li1ELi1EES_IS0_Li3ELi1EES_IS0_Li3ELi1EE16CartesianIndicesILi3E5TupleI5OneToI5Int64ES3_IS4_ES3_IS4_EEES1_ILi3ES2_IS3_IS4_ES3_IS4_ES3_IS4_EEE$__cuda_sm70_shflsync_down) ;
        FADD R5, R5, R4 ;
        MOV R0, 0x8 ;
        IMAD.MOV.U32 R4, RZ, RZ, 0x1f ;
        MOV R2, 0x1510 ;
        IMAD.MOV.U32 R3, RZ, RZ, -0x1 ;
        CALL.REL.NOINC `($_Z27julia_rmse_kernel_opt_1466213CuDeviceArrayI7Float32Li1ELi1EES_IS0_Li3ELi1EES_IS0_Li3ELi1EE16CartesianIndicesILi3E5TupleI5OneToI5Int64ES3_IS4_ES3_IS4_EEES1_ILi3ES2_IS3_IS4_ES3_IS4_ES3_IS4_EEE$__cuda_sm70_shflsync_down) ;
        FADD R5, R5, R4 ;
        MOV R2, 0x1570 ;
        IMAD.MOV.U32 R0, RZ, RZ, 0x10 ;
        IMAD.MOV.U32 R4, RZ, RZ, 0x1f ;
        IMAD.MOV.U32 R3, RZ, RZ, -0x1 ;
        CALL.REL.NOINC `($_Z27julia_rmse_kernel_opt_1466213CuDeviceArrayI7Float32Li1ELi1EES_IS0_Li3ELi1EES_IS0_Li3ELi1EE16CartesianIndicesILi3E5TupleI5OneToI5Int64ES3_IS4_ES3_I

        IMAD R6, R9, R8, RZ ;
        IMAD.HI.U32 R11, P3, R9, R11, R12 ;
        IADD3 R13, P5, RZ, -R0, RZ ;
        SEL R13, R13, R0, !P2 ;
        IMAD.HI.U32 R0, R9, R8, RZ ;
        IADD3 R11, P4, R6, R11, RZ ;
        IMAD.X R6, RZ, RZ, ~R3, P5 ;
        IMAD.HI.U32 R8, R11, R13, RZ ;
        IADD3.X R0, R0, R9, RZ, P1, !PT ;
        SEL R6, R6, R3, !P2 ;
        IMAD.MOV.U32 R9, RZ, RZ, RZ ;
        IADD3.X R3, RZ, RZ, R0, P4, P3 ;
        IMAD.WIDE.U32 R8, R11, R6, R8 ;
        IMAD.HI.U32 R0, R3, R6, RZ ;
        IMAD.HI.U32 R8, P1, R3, R13, R8 ;
        IMAD R9, R3, R6, RZ ;
        IMAD.X R0, RZ, RZ, R0, P1 ;
        IADD3 R3, P3, R9, R8, RZ ;
        IMAD.WIDE.U32 R8, R3, R4, RZ ;
        IMAD.X R11, RZ, RZ, R0, P3 ;
        IMAD R3, R3, R5, R9 ;
        IADD3 R13, P3, -R8, R13, RZ ;
        IMAD R3, R11, R4.reuse, R3 ;
        ISETP.GE.U32.AND P1, PT, R13, R4, PT ;
        IMAD.X R3, R6, 0x1, ~R3, P3 ;
        IADD3 R9, P3, R13, -R4, RZ ;
        ISETP.GE.U32.AND.EX P1, P

        RET.REL.NODEC R4 `(_Z27julia_rmse_kernel_opt_1466213CuDeviceArrayI7Float32Li1ELi1EES_IS0_Li3ELi1EES_IS0_Li3ELi1EE16CartesianIndicesILi3E5TupleI5OneToI5Int64ES3_IS4_ES3_IS4_EEES1_ILi3ES2_IS3_IS4_ES3_IS4_ES3_IS4_EEE) ;
        .type           $_Z27julia_rmse_kernel_opt_1466213CuDeviceArrayI7Float32Li1ELi1EES_IS0_Li3ELi1EES_IS0_Li3ELi1EE16CartesianIndicesILi3E5TupleI5OneToI5Int64ES3_IS4_ES3_IS4_EEES1_ILi3ES2_IS3_IS4_ES3_IS4_ES3_IS4_EEE$gpu_signal_exception,@function
        .size           $_Z27julia_rmse_kernel_opt_1466213CuDeviceArrayI7Float32Li1ELi1EES_IS0_Li3ELi1EES_IS0_Li3ELi1EE16CartesianIndicesILi3E5TupleI5OneToI5Int64ES3_IS4_ES3_IS4_EEES1_ILi3ES2_IS3_IS4_ES3_IS4_ES3_IS4_EEE$gpu_signal_exception,($_Z27julia_rmse_kernel_opt_1466213CuDeviceArrayI7Float32Li1ELi1EES_IS0_Li3ELi1EES_IS0_Li3ELi1EE16CartesianIndicesILi3E5TupleI5OneToI5Int64ES3_IS4_ES3_IS4_EEES1_ILi3ES2_IS3_IS4_ES3_IS4_ES3_IS4_EEE$julia_fldmod1_14750 - $_Z27julia_rmse_kernel_opt_1466213CuDeviceArrayI7Float32Li1ELi1E

        ISETP.GT.U32.AND P1, PT, R0, -0x1, PT ;
        ISETP.GT.AND.EX P1, PT, R3, -0x1, PT, P1 ;
; Location ./range.jl:301
    @P1 BRA `(.L_31) ;
; Location ./int.jl:287
        LOP3.LUT P1, RZ, R3, 0xffffffff, RZ, 0xc0, !PT ;
; Location /home/tim/Julia/pkg/CUDA/src/device/intrinsics/math.jl:208
        IABS R21, R61 ;
; Location ./int.jl:287
    @P1 BRA `(.L_32) ;
        I2F R3, R21 ;
        IMAD.MOV R7, RZ, RZ, -R21 ;
        ISETP.NE.U32.AND P2, PT, R21, RZ, PT ;
        MUFU.RCP R3, R3 ;
        IADD3 R4, R3, 0xffffffe, RZ ;
        F2I.FTZ.U32.TRUNC.NTZ R5, R4 ;
        IMAD.MOV.U32 R4, RZ, RZ, RZ ;
        IMAD R7, R7, R5, RZ ;
        IMAD.HI.U32 R5, R5, R7, R4 ;
        IMAD.HI.U32 R5, R5, R0, RZ ;
        IMAD.MOV R5, RZ, RZ, -R5 ;
        IMAD R0, R21, R5, R0 ;
        IMAD.MOV.U32 R5, RZ, RZ, RZ ;
        ISETP.GE.U32.AND P1, PT, R0, R21, PT ;
    @P1 IMAD.IADD R0, R0, 0x1, -R21 ;
        ISETP.GE.U32.AND P1, PT, R0, R21, PT ;
    @P1 IADD3 R0, -R21, R0, RZ ;
   @!P2 LOP

        FFMA R16, R13, R14, 1 ;
        FFMA R16, R13, R16, R13 ;
        FFMA R13, R18, R16, RZ ;
        FFMA R15, R14, R13, R18 ;
        FFMA R15, R16, R15, R13 ;
        FFMA R14, R14, R15, R18 ;
        FFMA R14, R16, R14, R15 ;
        BRA `(.L_6) ;

.L_5:
        MOV R14, 0x3f0 ;
        CALL.REL.NOINC `($_Z28julia_broadcast_kernel_1485615CuKernelContext13CuDeviceArrayI7Float32Li1ELi1EE11BroadcastedIv5TupleI5OneToI5Int64EE5_sqrtS3_IS2_I12CuArrayStyleILi1EEv2__S3_I8ExtrudedIS0_IS1_Li1ELi1EES3_I4BoolES3_IS5_EES5_EEEES5_$__cuda_sm3x_div_rn_noftz_f32_slowpath) ;
        IMAD.MOV.U32 R14, RZ, RZ, R18 ;

.L_6:
        BSYNC B0 ;

.L_4:
; Location /home/tim/Julia/pkg/CUDA/src/device/intrinsics/math.jl:218
        FSETP.GEU.AND P1, PT, |R14|, 1.175494350822287508e-38, PT ;
; Location /home/tim/Julia/depot/packages/LLVM/wnejv/src/interop/base.jl:45
        IMAD.MOV.U32 R13, RZ, RZ, R11 ;
        ULDC.64 UR4, c[0x0][0x168] ;
   @!P1 FMUL R14, R14, 16777216 ;
; Location /home/tim/Julia/pk

        BRA `(.L_21) ;

.L_16:
        LOP3.LUT R15, R16, 0x80000000, R15, 0x48, !PT ;
        LOP3.LUT R18, R15, 0x7f800000, RZ, 0xfc, !PT ;
        BRA `(.L_21) ;

.L_15:
        LOP3.LUT R18, R16, 0x80000000, R15, 0x48, !PT ;
        BRA `(.L_21) ;

.L_14:
        IMAD.MOV.U32 R18, RZ, RZ, 0x7fffffff ;
        BRA `(.L_21) ;

.L_13:
        FADD.FTZ R18, R18, R0 ;

.L_21:
        BSYNC B1 ;

.L_11:
        IMAD.MOV.U32 R15, RZ, RZ, 0x0 ;
        RET.REL.NODEC R14 `(_Z28julia_broadcast_kernel_1485615CuKernelContext13CuDeviceArrayI7Float32Li1ELi1EE11BroadcastedIv5TupleI5OneToI5Int64EE5_sqrtS3_IS2_I12CuArrayStyleILi1EEv2__S3_I8ExtrudedIS0_IS1_Li1ELi1EES3_I4BoolES3_IS5_EES5_EEEES5_) ;

.L_22:
        BRA `(.L_22);

.L_55:


In [102]:
b = @benchmark CUDA.@sync rmse_opt(C, A, B)
display(C)
b

16-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 0.408646
 0.40841642
 0.4083755
 0.40845305
 0.4080715
 0.40784878
 0.4086396
 0.40831542
 0.40791884
 0.40870374
 0.40852144
 0.40846443
 0.40819678
 0.40815797
 0.4082014
 0.40828073
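
All sixteen values above sit close to 0.408. That is what one would expect for independent inputs drawn uniformly from [0, 1): the expected squared difference is 1/12 + 1/12 = 1/6, and sqrt(1/6) ≈ 0.4082. A CPU reference makes this kind of sanity check explicit. The sketch below is an illustration only: the helper `rmse_reference`, the array sizes, and the assumption that batches lie along the last dimension are hypothetical and not taken from the notebook.

```julia
using Random

# CPU reference: one RMSE value per batch.
# Assumption: batches are stored along the last (third) dimension.
function rmse_reference(A::AbstractArray{T,3}, B::AbstractArray{T,3}) where {T}
    size(A) == size(B) || throw(DimensionMismatch("A and B must have the same size"))
    n = size(A, 1) * size(A, 2)   # number of elements in one batch
    return [sqrt(sum(abs2, view(A, :, :, k) .- view(B, :, :, k)) / n)
            for k in axes(A, 3)]
end

Random.seed!(0)
A_h = rand(Float32, 64, 64, 16)   # hypothetical sizes, not the notebook's
B_h = rand(Float32, 64, 64, 16)
ref = rmse_reference(A_h, B_h)

# For independent uniform inputs on [0, 1), E[(a - b)^2] = 1/6
expected = sqrt(1 / 6)
```

With device arrays at hand, and if the batch layout matches the assumption above, a comparison along the lines of `Array(C) ≈ rmse_reference(Array(A), Array(B))` would check the kernel output against this reference.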

BenchmarkTools.Trial: 7138 samples with 1 evaluation.
 Range (min … max):  682.561 μs …  5.596 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     694.527 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   697.309 μs ± 68.943 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                     ▂▅█▆▅▅▃▂
  [39m▁[39m▁[39m▁[39m▁[39m▂