# Profiling 

It is often important to profile our GPU program, for example to check for coalesced memory access, race conditions, or memory-management problems. For this task we can use NVIDIA's `nvprof` tool. On a Unix system we run:

```bash
$ nvprof --profile-from-start off /path/to/julia
```

Here `/path/to/julia` is the path to the Julia binary. Because of `--profile-from-start off`, the profiler does not start collecting immediately; instead we start it from Julia by putting the `CUDA.@profile` macro in front of the call we want to profile:

```julia
CUDA.@profile kernel_name(x_d, y_d, r_d)
```

However, `nvprof` is no longer used for GPUs with compute capabilities newer than 7.0; there we need `nsys` (Nsight Systems) instead. To start Julia under `nsys` we run:

```bash
nsys launch --trace=cuda julia
```

Then we run our code as before, adding the `CUDA.@profile` macro in front of the kernel launch:

```julia 
julia> using CUDA

julia> a = CUDA.rand(1024,1024,1024);

julia> sin.(a);

julia> CUDA.@profile sin.(a);
```

Then we open Nsight Systems using `nsys-cli` and open the report. If we want the old `nvprof`-style output instead, we can run:

```bash
nsys nvprof --profile-from-start off /path/to/julia --trace=cuda
```

Then we run the same REPL session as above, or the kernel we want to profile, again with the `CUDA.@profile` macro in front of the launch.

## Profiling a reduce

Here we paste both versions of our reduction kernel, the atomic one and the shared-memory one:

```julia
using BenchmarkTools, CUDA, Test

#######################
# REDUCE GRID ATOMIC
#######################
function reduce_grid_atomic(op, a, b)
    # each block reduces 2*blockDim().x consecutive elements of a, in place
    num_elements = blockDim().x*2
    thread = threadIdx().x
    block = blockIdx().x

    # stride (distance) between the two elements each thread combines within a block
    stride_threads = 1
    # offset of this block's chunk inside a
    stride_blocks = (block - 1)*num_elements

    # while there are still elements to reduce
    while stride_threads < num_elements
        # barrier: wait until every thread has finished the previous step
        sync_threads()
        # index (relative to the block's chunk) that this thread reduces into
        index = 2*stride_threads*(thread - 1) + 1
        # check that index lies in this block's chunk and index + stride_threads is in bounds of a
        @inbounds if index ≤ num_elements && index + stride_threads + stride_blocks ≤ length(a)
            # debug: CUDA.@cuprintln("thread $thread: reducing a[$(index + stride_blocks)] and a[$(index + stride_threads + stride_blocks)]")
            a[stride_blocks + index] = op(a[index + stride_blocks], a[index + stride_threads + stride_blocks])
        end
        stride_threads *= 2
    end
    # atomically combine the first entry of each block (the block's partial result) into b
    if thread == 1
        CUDA.@atomic b[] = op(b[], a[stride_blocks + 1])
    end
    return nothing
end

# define test inputs
c_a = 1:16
d_a = CuArray(1:16)
d_b = CuArray([0])
# launch kernel
@cuda(
    threads = 2,
    blocks = 4,
    reduce_grid_atomic(+, d_a, d_b)
)
# test the result
@test CUDA.@allowscalar d_b[] == sum(c_a)
```


```julia
#######################
# REDUCE SHARED MEMORY
#######################
function reduce_grid_shared(op, a::AbstractArray{T}, b) where {T}
    # each block reduces 2*blockDim().x consecutive elements of a
    num_elements = blockDim().x*2
    thread = threadIdx().x
    block = blockIdx().x
    # stride (distance) between the two elements each thread combines within a block
    stride_threads = 1
    # offset of this block's chunk inside a
    stride_blocks = (block - 1)*num_elements

    # shared memory buffer for this block's chunk of a (room for 2048 elements, i.e. up to 1024 threads)
    shared = @cuStaticSharedMem(T, (2048,))
    # each thread copies two elements of the chunk into shared memory
    @inbounds shared[thread] = a[thread + stride_blocks]
    @inbounds shared[thread + blockDim().x] = a[thread + stride_blocks + blockDim().x]

    # while there are still elements to reduce
    while stride_threads < num_elements
        # barrier: wait until every thread has finished the previous step
        sync_threads()
        # index (relative to the block's chunk) that this thread reduces into
        index = 2*stride_threads*(thread - 1) + 1
        # check that index lies in this block's chunk and index + stride_threads is in bounds of a
        @inbounds if index ≤ num_elements && index + stride_threads + stride_blocks ≤ length(a)
            shared[index] = op(shared[index], shared[index + stride_threads])
        end
        stride_threads *= 2
    end
    # atomically combine the block's result, held in the first shared entry, into b
    if thread == 1
        CUDA.@atomic b[] = op(b[], shared[1])
    end
    return nothing
end
```

reduce_grid_shared (generic function with 1 method)
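
To see which elements each thread combines, it can help to replay the in-block loop on the CPU. The following is a minimal sketch and not part of the original kernels: `emulate_block_reduce` is a hypothetical helper that runs the same index arithmetic for a single block, with an ordinary `for` loop standing in for the parallel threads.

```julia
# CPU emulation of the strided tree reduction performed by ONE block.
# `emulate_block_reduce` is a hypothetical helper, used only for illustration.
function emulate_block_reduce(op, a::AbstractVector, threads::Int)
    buf = collect(a)              # work on a copy, like the shared-memory buffer
    num_elements = 2 * threads    # each block handles 2*threads elements
    stride = 1
    while stride < num_elements
        # on the GPU all threads run this step in parallel, separated by sync_threads()
        for thread in 1:threads
            index = 2 * stride * (thread - 1) + 1
            if index ≤ num_elements && index + stride ≤ length(buf)
                buf[index] = op(buf[index], buf[index + stride])
            end
        end
        stride *= 2
    end
    return buf[1]                 # the block's result ends up in the first slot
end

emulate_block_reduce(+, 1:8, 4)               # 36
emulate_block_reduce(max, [3, 9, 2, 7], 2)    # 9
```

With 4 threads and 8 elements the strides are 1, 2 and 4, and the value left in the first slot is the one the kernels then add to `b` atomically.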

### Testing both reduce implementations

```julia
# define test inputs
c_a = 1:16
d_a = CuArray(1:16)
d_b = CuArray([0])
# launch atomic kernel
@cuda(
    threads = 4,
    blocks = 2,
    reduce_grid_atomic(+, d_a, d_b)
)
# test the result
@test CUDA.@allowscalar d_b[] == sum(c_a)
# re-define test inputs (the atomic kernel reduced d_a in place)
CUDA.unsafe_free!(d_b)
CUDA.unsafe_free!(d_a)
d_a = CuArray(1:16)
d_b = CuArray([0])

# launch shared-memory kernel
@cuda(
    threads = 4,
    blocks = 2,
    reduce_grid_shared(+, d_a, d_b)
)
@test CUDA.@allowscalar d_b[] == sum(c_a)
```

Test Passed
  Expression: CUDA.@allowscalar d_b[] == sum(c_a)

Next we define two functions: `my_reduce()`, which calls both reduction kernels, and `main()`, which profiles them:

```julia
function my_reduce(op::Function, a::AbstractArray{T}) where {T}
    # launch atomic reduction on a copy, because this kernel reduces its input in place
    a_atomic = copy(a)
    # accumulator; it starts at zero, the neutral element for +
    b_atomic = CUDA.zeros(T, 1)

    # compile the kernel without launching it, to query a launch configuration
    kernel_atomic = @cuda(
        launch=false,
        reduce_grid_atomic(op, a_atomic, b_atomic)
    )

    config = launch_configuration(kernel_atomic.fun)
    threads_config = min(config.threads, length(a))  # suggested thread count (not used below)
    # fixed block size; each block reduces 2*threads elements
    threads = 1024
    blocks = cld(length(a_atomic), threads*2)

    @cuda(
        threads=threads,
        blocks=blocks,
        reduce_grid_atomic(op, a_atomic, b_atomic)
    )
    # launch shared memory reduction
    b_shared = CUDA.zeros(T, 1)

    kernel_shared = @cuda(
        launch=false,
        reduce_grid_shared(op, a, b_shared)
    )

    config = launch_configuration(kernel_shared.fun)
    threads_config = min(config.threads, length(a))  # suggested thread count (not used below)
    threads = 1024
    blocks = cld(length(a), threads*2)

    @cuda(
        threads=threads,
        blocks=blocks,
        reduce_grid_shared(op, a, b_shared)
    )
    # test outputs
    @assert b_shared ≈ b_atomic

    CUDA.@allowscalar b_atomic[]
end
```

my_reduce (generic function with 1 method)
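
The grid size in `my_reduce` follows from the fact that each block reduces `2*threads` elements. As a small CPU-only check of that arithmetic (the name `blocks_for` is a hypothetical helper, introduced here only for illustration):

```julia
# Each block of `threads` threads reduces 2*threads elements,
# so covering n elements needs ceil(n / (2*threads)) blocks.
blocks_for(n, threads) = cld(n, 2 * threads)

blocks_for(1024 * 1024, 1024)   # 512 blocks for a 1024×1024 input
blocks_for(16, 4)               # 2 blocks, as in the 16-element test
blocks_for(17, 4)               # 3 blocks: the last one is only partly filled
```

Using `cld` (ceiling division) rather than `÷` makes sure that a trailing, partly filled chunk still gets a block of its own.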

In `main()` we wrap the calls in the `CUDA.@profile` macro and annotate them with `NVTX.@range`, which adds named ranges to the profiler trace. We also run the function twice inside the profiled region, because the first call sometimes incurs extra overhead.

```julia
"""
    main()

Launch the reduction kernels once as a correctness check, then profile them.
"""
function main()
    N = 1024
    c_a = rand(N,N)
    d_a = CuArray(c_a)
    @test my_reduce(+, d_a) ≈ sum(c_a)

    # profile it
    CUDA.@profile begin
        NVTX.@range "my_reduce" my_reduce(+, d_a)
        NVTX.@range "my_reduce" my_reduce(+, d_a)
    end
end
# execute the function
main()
```
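
The first-call overhead mentioned above is not specific to the GPU, since Julia compiles a function the first time it is called. Below is a minimal CPU-only sketch of that effect, using a hypothetical `toy_reduce` stand-in rather than the GPU kernels; the exact timings depend on the machine and the Julia version, so none are shown.

```julia
# Hypothetical CPU-only stand-in for a reduction; not one of the GPU kernels above.
toy_reduce(op, a) = reduce(op, a)

a = rand(1024, 1024)

# The first call typically includes just-in-time compilation of toy_reduce.
t_first = @elapsed toy_reduce(+, a)
# The second call reuses the compiled code and measures mostly the work itself.
t_second = @elapsed toy_reduce(+, a)

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

This is why, when reading a profile, it is usually the second range that reflects the steady-state cost.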