# Profiling 

It is often important to profile a GPU program, for example to check for coalesced memory access, race conditions, and memory-management problems. For this task we can use NVIDIA's `nvprof` tool. On a Unix system we run:

```bash
$ nvprof --profile-from-start off /path/to/julia
```

Here `/path/to/julia` is the path to the Julia binary. Note that with `--profile-from-start off` the profiler does not start collecting immediately; instead we start it from Julia by putting the `CUDA.@profile` macro in front of the code we want to profile:

```julia
CUDA.@profile kernel_name(x_d, y_d, r_d)
```

However, `nvprof` is no longer used for GPUs with compute capabilities newer than 7.0; instead we need `nsys` (Nsight Systems). To start Julia under `nsys` we run:

```bash
nsys launch --trace=cuda julia
```
Then we can run the code below, adding the `CUDA.@profile` macro before we launch the kernel:
```julia 
julia> using CUDA

julia> a = CUDA.rand(1024,1024,1024);

julia> sin.(a);

julia> CUDA.@profile sin.(a);
```

Then we open Nsight Systems using `nsys-cli` and open the report. If we want the old `nvprof` output instead, we can run:

```bash
nsys nvprof --profile-from-start off /path/to/julia --trace=cuda
```

Then we can run the previous code, or the kernel we want to profile, adding the `CUDA.@profile` macro before we launch the kernel:
```julia 
julia> using CUDA

julia> a = CUDA.rand(1024,1024,1024);

julia> sin.(a);

julia> CUDA.@profile sin.(a);
```

## Profiling a reduce

Below are both of our reduction kernels, the atomic version and the shared-memory version:

```julia
using BenchmarkTools, CUDA, Test

#######################
# REDUCE GRID ATOMIC
#######################
function reduce_grid_atomic(op, a, b)
    # each block reduces twice as many elements as it has threads
    num_elements = blockDim().x * 2
    thread = threadIdx().x
    block = blockIdx().x

    # stride (distance) between the two elements each thread combines inside a block
    stride_threads = 1
    # offset of this block's first element in a
    stride_blocks = (block - 1) * num_elements

    # while there are still elements to reduce
    while stride_threads < num_elements
        # barrier: wait until every thread has finished the previous step
        sync_threads()
        # index (within the block) that this thread reduces into
        index = 2 * stride_threads * (thread - 1) + 1
        # check that index is inside the block and index + stride_threads is in bounds of a
        @inbounds if index ≤ num_elements && index + stride_threads + stride_blocks ≤ length(a)
            # CUDA.@cuprintln ("thread $thread: a[$index] + a[$(index + stride_blocks)] = $(a[index] + a[index + stride_blocks])")
            a[stride_blocks + index] = op(a[index + stride_blocks], a[index + stride_threads + stride_blocks])
        end
        stride_threads *= 2
    end
    # atomically combine the first entry of each block (the block result) into b
    if thread == 1
        CUDA.@atomic b[] = op(b[], a[stride_blocks + 1])
    end
    return nothing
end

# define test inputs
c_a = 1:16
d_a = CuArray(1:16)
d_b = CuArray([0])
# launch kernel
@cuda(
    threads = 2,
    blocks = 4,
    reduce_grid_atomic(+, d_a, d_b)
)
# test the result
@test CUDA.@allowscalar d_b[] == sum(c_a)


#######################
# REDUCE SHARED MEMORY
#######################
function reduce_grid_shared(op, a::AbstractArray{T}, b) where {T}
    # each block reduces twice as many elements as it has threads
    num_elements = blockDim().x * 2
    thread = threadIdx().x
    block = blockIdx().x
    # stride (distance) between the two elements each thread combines inside a block
    stride_threads = 1
    # offset of this block's first element in a
    stride_blocks = (block - 1) * num_elements

    # shared memory buffer for this block's elements of a
    # (2048 entries, so at most 1024 threads per block)
    shared = @cuStaticSharedMem(T, (2048,))
    # these loads are not bounds-checked: length(a) is assumed to be a multiple of num_elements
    @inbounds shared[thread] = a[thread + stride_blocks]
    @inbounds shared[thread + blockDim().x] = a[thread + stride_blocks + blockDim().x]

    # while there are still elements to reduce
    while stride_threads < num_elements
        # barrier: wait until every thread has finished the previous step
        sync_threads()
        # index (within the block) that this thread reduces into
        index = 2 * stride_threads * (thread - 1) + 1
        # check that index is inside the block and index + stride_threads is in bounds of a
        @inbounds if index ≤ num_elements && index + stride_threads + stride_blocks ≤ length(a)
            shared[index] = op(shared[index], shared[index + stride_threads])
        end
        stride_threads *= 2
    end
    # atomically combine the block result, held in shared[1], into b
    if thread == 1
        CUDA.@atomic b[] = op(b[], shared[1])
    end
    return nothing
end
```
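
The index arithmetic shared by both kernels can be checked without a GPU. The following is a minimal serial sketch, not the kernels themselves: `reduce_model` is a hypothetical helper, the loop over `thread` stands in for the threads of one block between two `sync_threads()` calls, and the plain accumulator `total` stands in for the atomic update of `b[]` (it assumes `zero` is the identity of `op`, as it is for `+`).

```julia
# Serial CPU model of the per-block tree reduction (hypothetical helper, for illustration only).
# `threads_per_block` plays the role of blockDim().x.
function reduce_model(op, a::AbstractVector, threads_per_block::Int)
    a = collect(a)                          # work on a copy, like the kernel's global buffer
    num_elements = 2 * threads_per_block    # elements handled by one block
    blocks = cld(length(a), num_elements)
    total = zero(eltype(a))                 # stands in for b[]; assumes zero is the identity of op
    for block in 1:blocks
        stride_blocks = (block - 1) * num_elements
        stride_threads = 1
        while stride_threads < num_elements
            # one pass over the threads models one step between two sync_threads() calls;
            # within a step no thread reads a slot that another thread writes
            for thread in 1:threads_per_block
                index = 2 * stride_threads * (thread - 1) + 1
                if index ≤ num_elements && index + stride_threads + stride_blocks ≤ length(a)
                    a[stride_blocks + index] = op(a[stride_blocks + index],
                                                  a[stride_blocks + index + stride_threads])
                end
            end
            stride_threads *= 2
        end
        # "thread 1" of every block folds the block result into the accumulator
        total = op(total, a[stride_blocks + 1])
    end
    return total
end

println(reduce_model(+, 1:16, 2) == sum(1:16))   # 4 blocks of 4 elements each
println(reduce_model(+, 1:16, 4) == sum(1:16))   # 2 blocks of 8 elements each
```

With 2 threads per block, block 1 first forms `a[1] + a[2]` and `a[3] + a[4]`, then combines the two partial sums into `a[1]`, which is the value "thread 1" adds to the accumulator.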

### Testing both reduce implementations

```julia
# define test inputs
c_a = 1:16
d_a = CuArray(1:16)
d_b = CuArray([0])
# launch atomic kernel
@cuda(
    threads = 4,
    blocks = 2,
    reduce_grid_atomic(+, d_a, d_b)
)
# test the result
@test CUDA.@allowscalar d_b[] == sum(c_a)
# re-define test inputs
CUDA.unsafe_free!(d_b)
CUDA.unsafe_free!(d_a)
d_a = CuArray(1:16)
d_b = CuArray([0])

# launch shared-memory kernel
@cuda(
    threads = 4,
    blocks = 2,
    reduce_grid_shared(+, d_a, d_b)
)
@test CUDA.@allowscalar d_b[] == sum(c_a)
```

```
Test Passed
  Expression: #= In[3]:25 =# CUDA.@allowscalar d_b[] == sum(c_a)
```

Next we define two functions: `my_reduce()`, which calls both reduction kernels, and `main()`, which profiles them:

```julia
function my_reduce(op::Function, a::AbstractArray{T}) where {T}
    # launch the atomic reduction on a copy, because that kernel overwrites its input
    a_atomic = copy(a)
    b_atomic = CUDA.zeros(T, 1)

    # compile without launching, to query the occupancy-based launch configuration
    kernel_atomic = @cuda(
        launch=false,
        reduce_grid_atomic(op, a_atomic, b_atomic)
    )

    config = launch_configuration(kernel_atomic.fun)
    threads_config = min(config.threads, length(a))  # suggestion only, not used below
    # fixed block size: each block reduces 2*threads = 2048 elements
    threads = 1024
    blocks = cld(length(a_atomic), threads*2)

    @cuda(
        threads=threads,
        blocks=blocks,
        reduce_grid_atomic(op, a_atomic, b_atomic)
    )
    # launch the shared-memory reduction
    b_shared = CUDA.zeros(T, 1)

    kernel_shared = @cuda(
        launch=false,
        reduce_grid_shared(op, a, b_shared)
    )

    config = launch_configuration(kernel_shared.fun)
    threads_config = min(config.threads, length(a))  # suggestion only, not used below
    # 1024 threads fill the 2048-entry shared buffer exactly
    threads = 1024
    blocks = cld(length(a), threads*2)

    @cuda(
        threads=threads,
        blocks=blocks,
        reduce_grid_shared(op, a, b_shared)
    )
    # check that both kernels agree
    @assert b_shared ≈ b_atomic

    CUDA.@allowscalar b_atomic[]
end
```
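
The launch geometry in `my_reduce` is plain integer arithmetic, so it can be checked on the CPU. The sketch below uses a hypothetical helper, `launch_geometry`, and assumes the 1024×1024×10 array that `main()` builds further down. It only illustrates how `cld` picks the block count; it is not part of the kernels.

```julia
# Hypothetical helper: how many blocks my_reduce launches for n elements,
# given that each block of `threads` threads reduces 2*threads elements.
function launch_geometry(n::Integer; threads::Integer = 1024)
    per_block = 2 * threads
    blocks = cld(n, per_block)              # ceiling division
    return (blocks = blocks,
            covered = blocks * per_block,   # elements the grid can address
            exact = n % per_block == 0)     # true when no block is partially filled
end

println(launch_geometry(1024 * 1024 * 10))  # the array size used in main()
println(launch_geometry(16; threads = 4))   # the small test above: 2 blocks of 8 elements
println(launch_geometry(5000))              # a size that is not a multiple of 2048
```

When `exact` is `false`, the last block addresses elements past the end of the array. The atomic kernel guards every access with its `length(a)` check, while the shared-memory kernel's initial loads are unguarded, so sizes that are multiples of 2048 are the safe choice there.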

We just add the `CUDA.@profile` macro and `NVTX.@range`, which annotates the profiler trace with named ranges. We also execute the function twice, because the first call sometimes incurs extra overhead.

```julia
"""
    main()

Function to launch the kernels and profile them.
"""
function main()
    N = 1024
    c_a = rand(N, N, 10)
    d_a = CuArray(c_a)
    @test my_reduce(+, d_a) ≈ sum(c_a)

    # profile it
    CUDA.@profile begin
        NVTX.@range "my_reduce" my_reduce(+, d_a)
        NVTX.@range "my_reduce" my_reduce(+, d_a)
    end
end
# execute the function
main()
```
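
The reason for calling the function more than once can be illustrated in plain Julia, without a GPU. This is only an analogy: `work` is a hypothetical stand-in for `my_reduce`, and the first-call cost shown here is Julia's own just-in-time compilation, which is one source of first-call overhead. The timings vary from machine to machine, so no numbers are shown.

```julia
# Hypothetical stand-in for my_reduce: any function that has not been called yet.
work(x) = sum(abs2, x)

x = rand(10_000)

# At top level, the first call typically includes compiling `work` for Vector{Float64}.
t_first = @elapsed work(x)
# The second call reuses the compiled code, so it measures only the run time.
t_second = @elapsed work(x)

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

Profiling two back-to-back calls, as `main()` does, makes it easy to see in the trace whether the first range is inflated by such one-time costs.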

# NVIDIA Nsight Systems

Then we open a bash terminal and execute:

```bash
nsys launch --trace=cuda julia
```

A Julia REPL then opens. We just `include` our script there, and the report is generated automatically.

<img src=nsys.png>

# NVIDIA Nsight Compute

We open a bash terminal and execute:

```bash
nv-nsight-cu-cli julia --trace=cuda  
```

When we close the app we obtain a lot of information; however, for the moment it is not possible to attach

# Nvprof

We open a bash terminal and execute:

```bash
nsys nvprof --profile-from-start off julia --trace=cuda 
``` 

When we close the app, the following summary is printed:

```
NVTX Range Statistics:

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)   Style     Range  
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  -------  ---------
    100.0        6,822,643          2  3,411,321.5  3,411,321.5  3,322,959  3,499,684    124,963.4  PushPop  my_reduce

[4/7] Executing 'cudaapisum' stats report

CUDA API Statistics:

 Time (%)  Total Time (ns)  Num Calls  Avg (ns)   Med (ns)  Min (ns)  Max (ns)   StdDev (ns)               Name              
 --------  ---------------  ---------  ---------  --------  --------  ---------  -----------  -------------------------------
     94.8      118,558,438        729  162,631.6  29,134.0     1,295  2,203,714    313,534.2  cuModuleUnload                 
      4.5        5,648,389         12  470,699.1  27,428.0     4,843  2,714,353  1,046,947.5  cudaMemcpyAsync                
      0.4          549,085          8   68,635.6   2,271.0     1,174    270,168    123,392.8  cuMemAllocAsync                
      0.1           71,170         12    5,930.8   4,949.0     3,000     11,329      2,263.3  cudaLaunchKernel               
      0.1           63,748         10    6,374.8   5,455.5     3,259     12,462      2,856.8  cuLaunchKernel                 
      0.0           50,039          2   25,019.5  25,019.5    13,172     36,867     16,754.9  cuMemcpyDtoDAsync_v2           
      0.0           44,053         13    3,388.7   1,121.0       804     14,373      4,606.7  cuMemFreeAsync                 
      0.0           26,476          2   13,238.0  13,238.0    12,010     14,466      1,736.7  cuMemcpyDtoHAsync_v2           
      0.0           18,916          6    3,152.7   2,340.0     1,510      7,240      2,185.8  cudaStreamSynchronize          
      0.0           15,445          1   15,445.0  15,445.0    15,445     15,445          0.0  cuStreamDestroy_v2             
      0.0           13,534          6    2,255.7   1,534.0       835      6,266      2,012.4  cudaEventQuery                 
      0.0           11,679          2    5,839.5   5,839.5     5,740      5,939        140.7  cuCtxSynchronize               
      0.0           10,035          6    1,672.5   1,376.0     1,007      3,405        900.4  cudaEventRecord                
      0.0            9,558         12      796.5     653.0       466      1,662        424.2  cudaStreamGetCaptureInfo_v10010
      0.0            2,465          2    1,232.5   1,232.5     1,087      1,378        205.8  cuStreamSynchronize            

[5/7] Executing 'gpukernsum' stats report

CUDA Kernel Statistics:

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)                                                  Name                                                
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  ----------------------------------------------------------------------------------------------------
     55.5        2,620,058          2  1,310,029.0  1,310,029.0  1,309,965  1,310,093         90.5  julia_reduce_grid_shared_3799(__, CuDeviceArray<Float64, (int)3, (int)1>, CuDeviceArray<Float64, (i…
     39.8        1,877,861          2    938,930.5    938,930.5    931,122    946,739     11,042.9  julia_reduce_grid_atomic_3457(__, CuDeviceArray<Float64, (int)3, (int)1>, CuDeviceArray<Float64, (i…
      4.4          208,700         12     17,391.7     18,847.5      9,216     19,968      3,765.3  void nrm2_kernel<double, double, double, (int)0, (int)0, (int)128>(cublasNrm2Params<T1, T3>)        
      0.2           10,815          4      2,703.8      2,703.5      2,688      2,720         18.2  julia__5_2412(CuKernelContext, CuDeviceArray<Float64, (int)1, (int)1>, Float64)                     
      0.1            6,367          2      3,183.5      3,183.5      3,136      3,231         67.2  julia_broadcast_kernel_3894(CuKernelContext, CuDeviceArray<Float64, (int)1, (int)1>, Broadcasted<Cu…

[6/7] Executing 'gpumemtimesum' stats report

CUDA Memory Operation Statistics (by time):

 Time (%)  Total Time (ns)  Count  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)      Operation     
 --------  ---------------  -----  ---------  ---------  --------  --------  -----------  ------------------
     98.6        1,070,065      2  535,032.5  535,032.5   534,360   535,705        951.1  [CUDA memcpy DtoD]
      0.9            9,473      8    1,184.1    1,120.5     1,088     1,408        130.2  [CUDA memcpy DtoH]
      0.5            5,536      6      922.7      912.0       896       960         31.5  [CUDA memcpy HtoD]

[7/7] Executing 'gpumemsizesum' stats report

CUDA Memory Operation Statistics (by size):

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)      Operation     
 ----------  -----  --------  --------  --------  --------  -----------  ------------------
    167.772      2    83.886    83.886    83.886    83.886        0.000  [CUDA memcpy DtoD]
      0.000      6     0.000     0.000     0.000     0.000        0.000  [CUDA memcpy HtoD]
      0.000      8     0.000     0.000     0.000     0.000        0.000  [CUDA memcpy DtoH]
```
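
To put these numbers in perspective, the averages can be turned into rough throughput figures. The sketch below copies three averages from the report above and divides the size of the input array (1024×1024×10 `Float64` values, the 83.886 MB shown for each DtoD copy) by each time. The helper `throughput_gbs` is hypothetical, and the figures are only a first-order estimate, since they count the input bytes once and ignore writes.

```julia
# Size of the array reduced in main(): matches the 83.886 MB per DtoD copy in the report.
bytes_per_pass = 1024 * 1024 * 10 * sizeof(Float64)

# Average times in nanoseconds, copied from the report above.
avg_ns = (
    memcpy_dtod   = 535_032.5,     # copy(a) inside my_reduce
    reduce_atomic = 938_930.5,     # julia_reduce_grid_atomic
    reduce_shared = 1_310_029.0,   # julia_reduce_grid_shared
)

# Bytes per nanosecond is numerically equal to GB/s (1e9 bytes per 1e9 ns).
throughput_gbs(bytes, ns) = bytes / ns

for (name, ns) in pairs(avg_ns)
    println(rpad(string(name), 14), round(throughput_gbs(bytes_per_pass, ns); digits = 1), " GB/s")
end

# GPU time per my_reduce call (copy + both kernels) relative to the NVTX range average.
gpu_ns = sum(values(avg_ns))
println("GPU share of one my_reduce range: ", round(gpu_ns / 3_411_321.5; digits = 2))
```

By this estimate the device-to-device copy moves data at roughly 157 GB/s, while the atomic and shared-memory kernels consume their input at roughly 89 GB/s and 64 GB/s. In this run the shared-memory kernel is therefore the slower of the two, by a factor of about 1.4.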